Description of Normalization Hard fork for discussion and review #208

Closed
opened 2018-10-03 19:23:43 +02:00 by kaykurokawa · 6 comments
kaykurokawa commented 2018-10-03 19:23:43 +02:00 (Migrated from github.com)

At the as yet unspecified normalization hardfork date H, we will implement a normalization/encoding scheme in the claimtrie. Ideally this will happen at the same time as the segwit soft fork that comes with the planned upstream merge, but this has not been finalized.

Currently (and ever since lbrycrd launched), the claimtrie itself is not encoding aware. This means that each node on the claimtrie where you can make claims is just a byte in a byte string. So far the Daemon/App layer has been enforcing lowercase-only ASCII encoding (there was, however, a period of time when we did not enforce casing, and there are some claims made outside of the daemon/app with uppercase letters).

After the hardfork, the claimtrie will be encoding/normalization aware. The encoding that will be used is UTF-8, and the normalization scheme is to apply NFD normalization and then lowercase the letters using the en_US locale. A discussion of this scheme can be found in issue #204 (https://github.com/lbryio/lbrycrd/issues/204). Note that each node on the claimtrie will still be a byte (and not an individual Unicode code point); the encoding/normalization of claim entries is enforced prior to claimtrie insertion.
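As a rough illustration of the scheme described above, here is a minimal Python sketch. It is not the lbrycrd implementation: `normalize_name` is a hypothetical helper, and Python's locale-independent `str.lower()` is only an approximation of lowercasing under the en_US locale.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Apply NFD normalization, then lowercase.

    Assumption: str.lower() stands in for en_US-locale lowercasing.
    """
    return unicodedata.normalize("NFD", name).lower()

composed = "Caf\u00e9"      # 'é' as a single code point
decomposed = "CAFE\u0301"   # 'E' followed by a combining acute accent

# Both spellings normalize to the same string, so they would compete.
assert normalize_name(composed) == normalize_name(decomposed)

# The trie still stores bytes: the UTF-8 encoding of the normalized name.
print(normalize_name(composed).encode("utf-8"))
```

The final line shows the byte string that would be walked node by node in the trie, since nodes remain single bytes rather than code points.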

Any already existing claims will be interpreted as if they were already encoded in UTF-8. Since UTF-8 is backwards compatible with ASCII, claims made through the Daemon/App layer should be properly preserved.

After normalization, there will be a process of conflict resolution (or name collapse) where, for example, separate winning claims on "Dog" and "dog" will both become claims on the same name. Downstream apps should take care, since a claim that was previously accessible as the winning claim may no longer be accessible (except through its claim ID). Downstream apps will also need to make sure that methods that access name claims through their temporal ordering properly merge the collapsed names (e.g., before the hardfork there were lbry://dog#1 and lbry://Dog#1, but afterwards there will be lbry://dog#1 and lbry://dog#2).
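To make the name collapse concrete for downstream apps, here is a small Python sketch. It assumes that temporal ordering can be modelled by the block height at which a claim was accepted; the tuple layout, the claim IDs, and the `collapse` helper are all hypothetical.

```python
import unicodedata

def normalize_name(name: str) -> str:
    # NFD, then lowercase (approximation of the en_US lowercasing).
    return unicodedata.normalize("NFD", name).lower()

def collapse(claims):
    """Group (original_name, claim_id, height) tuples by normalized name.

    Within each group, claims are ordered by height (assumed stand-in for
    temporal ordering). The sort is stable, so ties keep input order.
    """
    merged = {}
    for original_name, claim_id, height in claims:
        merged.setdefault(normalize_name(original_name), []).append(
            (height, claim_id, original_name)
        )
    return {name: sorted(group, key=lambda c: c[0]) for name, group in merged.items()}

claims = [("Dog", "aaa", 100), ("dog", "bbb", 50), ("cat", "ccc", 70)]
merged = collapse(claims)

# Before the fork both "Dog" and "dog" had a #1; afterwards they share one list.
for position, (height, claim_id, original) in enumerate(merged["dog"], start=1):
    print(f"lbry://dog#{position} -> claim {claim_id} (published as {original!r})")
```

In this toy data the older claim keeps position 1 and the claim originally made on "Dog" moves to position 2, which is the kind of renumbering downstream apps would need to handle.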

Claims that are invalid UTF-8 bytes (this can happen because UTF-8 is a variable-width encoding) will be kept as they are and enter the claimtrie unaltered (this is possible because nodes on the claimtrie are still just bytes). There has been discussion of another option, rejecting all claims on invalid UTF-8 bytes as invalid claims, but this option has not been implemented.
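The pass-through behaviour for invalid UTF-8 can be sketched as follows. This is an illustration in Python rather than the node's actual code; `normalize_claim_bytes` is a hypothetical name, and Python's strict UTF-8 decoder is assumed to be a reasonable proxy for "valid UTF-8".

```python
import unicodedata

def normalize_claim_bytes(raw: bytes) -> bytes:
    """Return the bytes that would be inserted into the trie.

    Valid UTF-8 is NFD-normalized and lowercased; anything else is
    returned unaltered, so it only matches those exact bytes.
    """
    try:
        text = raw.decode("utf-8")  # strict decoding raises on invalid input
    except UnicodeDecodeError:
        return raw
    return unicodedata.normalize("NFD", text).lower().encode("utf-8")

print(normalize_claim_bytes(b"Dog"))       # valid: normalized
print(normalize_claim_bytes(b"Dog\xff"))   # invalid: kept byte for byte
print(normalize_claim_bytes(b"\xc3"))      # truncated 2-byte sequence: kept
```

Note that an invalid name keeps its original casing, because no lowercasing is attempted once decoding fails.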

BrannonKing commented 2018-10-11 00:09:40 +02:00 (Migrated from github.com)

According to the ICU docs, "unpaired surrogates are replaced with U+FFFD" (aka � ). No error is thrown. I was thinking I would check for that character and pass the original bytes through if it is present. To improve performance, we could do the check only if the number of incoming bytes differs from the number of bytes after the conversion (but then you run the risk of � showing up in your URL).
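A sketch of the check described above, using Python's `errors="replace"` decoding as a stand-in for the ICU conversion (an assumption; the two may not substitute U+FFFD in exactly the same places). `normalize_or_passthrough` is a hypothetical name.

```python
import unicodedata

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

def normalize_or_passthrough(raw: bytes) -> bytes:
    """Convert leniently, then fall back to the original bytes if the
    conversion had to insert (or the input already contained) U+FFFD."""
    text = raw.decode("utf-8", errors="replace")  # never raises
    if REPLACEMENT in text:
        return raw
    return unicodedata.normalize("NFD", text).lower().encode("utf-8")

print(normalize_or_passthrough(b"Dog"))              # normalized
print(normalize_or_passthrough(b"Dog\xff"))          # passed through
print(normalize_or_passthrough(b"A\xef\xbf\xbd"))    # passed through as well
```

One consequence of checking for the character itself: input that legitimately contains a well-formed U+FFFD, as in the last call, is also passed through without normalization.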

BrannonKing commented 2018-10-11 00:49:27 +02:00 (Migrated from github.com)

I can see where boost::locale does the conversion (in uconv.hpp), but not the implementation of the icu method:

icu_std_converter<char_type> cvt(encoding_);
icu::UnicodeString str=cvt.icu(begin,end);
BrannonKing commented 2018-11-29 13:03:25 +01:00 (Migrated from github.com)

Bringing in the requirements as posted internally:

Features (after fork):

  1. For determining the winning claim at a given URI, we use a case-insensitive and Unicode-normalized comparison. This ensures that names differing only in equivalent accent-mark encodings compete with each other, and that names differing only in case compete with each other.
  2. UTF-8 input is supported. Non-conformant input is also supported. Thus malformed or partial UTF-8 will still get recorded. (And those exact bytes will have to be used to look it up.)
  3. The exact bytes that came in with a claim are preserved. You can see those exact bytes in the original_name field that is returned by these RPC methods: getclaimsintrie, getvalueforname, getclaimsforname, getclaimbyid. We hope these will be rendered as the correct address; we want the consumers to see the exact casing chosen by the publisher.
  4. The name field on the UPDATE op can be used to change casing. The SUPPORT op is no longer case sensitive (but relies heavily on the claimId).
  5. The name field returned in the RPC methods in item 3 above contains the normalized, case-changed name.
  6. The name input field for these RPC methods is not case sensitive: getvalueforname, getclaimsforname, getnameproof.
  7. There is a new RPC command, checknormalization, that returns the normalized, lower-cased form of the input.
  8. Update: after discussions with @grin, the getclaimsintrie results will have a name field added inside the claim object, which will then contain the original name. The outer name field will become normalizedName. This is a breaking change and will happen with the release, not fork height.
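The relationship between the `name` and `original_name` fields in items 3, 5 and 6 can be illustrated with a toy in-memory index. This is not the RPC implementation: `ToyClaimIndex`, its methods, and the record layout are hypothetical, and only the `name`, `original_name` and `claimId` field names come from the list above.

```python
import unicodedata

def normalize(raw: bytes) -> bytes:
    """NFD + lowercase for valid UTF-8; invalid input is returned as-is."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw
    return unicodedata.normalize("NFD", text).lower().encode("utf-8")

class ToyClaimIndex:
    """Hypothetical stand-in for the claimtrie lookup behind the RPCs."""

    def __init__(self):
        self._by_name = {}

    def add(self, claim_id: str, original_name: bytes) -> dict:
        record = {
            "claimId": claim_id,
            "name": normalize(original_name),   # normalized, case-changed
            "original_name": original_name,     # exact bytes from the publisher
        }
        self._by_name.setdefault(record["name"], []).append(record)
        return record

    def claims_for_name(self, query: bytes) -> list:
        # The query is normalized too, so lookups are not case sensitive.
        return self._by_name.get(normalize(query), [])

index = ToyClaimIndex()
index.add("abc", b"Dog")
index.add("def", b"Dog\xff")  # malformed UTF-8

print(index.claims_for_name(b"DOG"))       # finds the claim published as b"Dog"
print(index.claims_for_name(b"dog\xff"))   # empty: exact bytes are required
print(index.claims_for_name(b"Dog\xff"))   # finds the malformed-name claim
```

The second lookup comes back empty because a malformed name is never normalized, matching item 2: those exact bytes have to be used to look it up.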
BrannonKing commented 2018-11-29 13:07:14 +01:00 (Migrated from github.com)

@tiger5226 , the RPC changes will necessitate an update to https://github.com/lbryio/chainquery. You will need an additional DB column to store the normalized name, and that name may be needed for various joins. There is no issue at present to track that.

skhameneh commented 2018-12-04 19:48:58 +01:00 (Migrated from github.com)

Will the entire UTF-8 character set be available for claims?
Additionally, will @ or # be available? We currently treat these as reserved characters on Spee.ch

BrannonKing commented 2018-12-04 20:45:47 +01:00 (Migrated from github.com)

@skhameneh , the present design does not restrict any data. All possible byte combinations will continue to be available. @ and # are currently allowed in claimnames at any location, and that will be true after the fork. Any valid UTF-8 will be normalized and lower-cased. Anything that is not valid UTF-8 will compete only with those exact bytes.

Reference: LBRYCommunity/lbrycrd#208