Description of Normalization Hard fork for discussion and review #208
Reference: LBRYCommunity/lbrycrd#208
At the as yet unspecified normalization hardfork date H, we will implement a normalization/encoding scheme in the claimtrie. Ideally this will happen at the same time as the segwit soft fork that comes with the planned upstream merge, but this has not been finalized.
Currently (and ever since lbrycrd launched), the claimtrie itself is not encoding aware. This means that each node on the claimtrie where you can make claims is just a byte in a byte string. So far the Daemon/App layer has been enforcing lower-case-only ASCII encoding (there was a period of time when we did not enforce casing, however, and there are some claims made outside of the daemon/app with upper-case letters).
After the hardfork, the claimtrie will be encoding/normalization aware. The encoding will be UTF-8, and the normalization scheme is to apply NFD normalization and then lower-case the letters using the en_US locale. A discussion of this scheme can be found in issue #204 (https://github.com/lbryio/lbrycrd/issues/204). Note that each node on the claimtrie will still be a byte (not an individual Unicode code point); the encoding/normalization of claim entries is enforced prior to claimtrie insertion.
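As a rough illustration of the scheme described above, the following Python sketch decomposes a name with NFD and then lower-cases it. The helper names `normalize_name` and `trie_key` are hypothetical, and Python's `str.lower()` is locale-independent, so it is only a stand-in for the en_US lower-casing that lbrycrd would perform; results may differ for some characters.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Hypothetical helper: NFD-normalize a name, then lower-case it.

    Assumption: str.lower() approximates the en_US lower-casing step.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.lower()


def trie_key(name: str) -> bytes:
    """Hypothetical helper: the bytes that would be walked in the trie.

    The trie still stores one byte per node, so the normalized name is
    re-encoded as UTF-8 before insertion.
    """
    return normalize_name(name).encode("utf-8")


if __name__ == "__main__":
    # "É" as one code point and "E" + combining acute map to the same key.
    print(trie_key("\u00c9") == trie_key("E\u0301"))
    print(trie_key("Dog"))
```

The precomposed and decomposed spellings of the same letter end up on the same byte path, which is the point of normalizing before insertion.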
Any already existing claims will be interpreted as if they were already encoded in UTF-8. Since UTF-8 is backwards compatible with ASCII, claims made through the Daemon/App layer should be properly preserved.
After normalization, there will be a process of conflict resolution (or name collapse) where, for example, separate winning claims on "Dog" and "dog" will both become claims on the same name. Downstream apps should take care, since claims that were previously accessible as the winning claim may no longer be accessible (except through their claim ID). Downstream apps will also need to make sure that methods that access name claims through their temporal ordering properly merge the collapsed names (i.e., before the hardfork there was lbry://dog#1 and lbry://Dog#1, but now there will be lbry://dog#1 and lbry://dog#2).
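The name collapse can be sketched from a downstream app's point of view as follows. Everything here is illustrative: the `Claim` fields, the claim IDs, and the block heights are made up, and ordering merged claims by block height is an assumption about how temporal ordering is derived, not a statement of lbrycrd's actual rules.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Claim(NamedTuple):
    name: str      # name as originally claimed
    claim_id: str  # made-up identifier for illustration
    height: int    # assumed block height at which the claim was accepted


def normalize_name(name: str) -> str:
    # Same approximation as the scheme above: NFD, then lower-case.
    return unicodedata.normalize("NFD", name).lower()


def collapse(claims: List[Claim]) -> Dict[str, List[Claim]]:
    """Group claims by normalized name, oldest first (assumed ordering)."""
    merged: Dict[str, List[Claim]] = defaultdict(list)
    for claim in claims:
        merged[normalize_name(claim.name)].append(claim)
    for group in merged.values():
        group.sort(key=lambda c: c.height)
    return dict(merged)


def ordinal_urls(collapsed: Dict[str, List[Claim]]) -> Dict[str, str]:
    """Map each claim ID to its post-fork lbry://name#n style URL."""
    urls = {}
    for name, group in collapsed.items():
        for position, claim in enumerate(group, start=1):
            urls[claim.claim_id] = f"lbry://{name}#{position}"
    return urls


if __name__ == "__main__":
    claims = [Claim("Dog", "bbb", 150), Claim("dog", "aaa", 100)]
    print(ordinal_urls(collapse(claims)))
```

With these sample heights, the claim made on "dog" keeps position 1 and the claim made on "Dog" moves to position 2 under the merged name.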
Claims whose names are not valid UTF-8 (this can happen because UTF-8 is a variable-width encoding, so not every byte sequence is valid) will be kept as they are and enter the claimtrie unaltered (this is possible because nodes on the claimtrie are still just bytes). Another option under discussion is to reject all claims on invalid UTF-8 bytes as invalid claims; this option has not been implemented.
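A minimal sketch of that fallback, assuming a strict UTF-8 decode is used as the validity test: valid names are normalized, and anything that fails to decode is returned byte-for-byte. The function name `claim_trie_key` is hypothetical, and Python's `str.lower()` again stands in for the en_US lower-casing.

```python
import unicodedata


def claim_trie_key(raw: bytes) -> bytes:
    """Hypothetical helper: bytes under which a claim name would be stored."""
    try:
        text = raw.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        # Invalid UTF-8 is kept unaltered and only competes with
        # claims on exactly the same bytes.
        return raw
    return unicodedata.normalize("NFD", text).lower().encode("utf-8")


if __name__ == "__main__":
    print(claim_trie_key(b"Dog"))      # valid, gets normalized
    print(claim_trie_key(b"Dog\xff"))  # invalid, passed through as-is
```

Note that in this sketch an invalid name such as `b"Dog\xff"` keeps its upper-case letter, because no part of an invalid byte string is normalized.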
According to the ICU docs, "unpaired surrogates are replaced with U+FFFD" (aka �). No error is thrown. I was thinking I would check for that character and pass the original bytes through if it is present. To improve performance, we could do the check only if the number of incoming bytes differs from the number of bytes after the conversion (but then you run the risk of � showing up in your URL).
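The two checks mentioned in this comment can be compared with a small sketch. It uses Python's `errors="replace"` decoder as a stand-in for the ICU conversion, which is an assumption: ICU may substitute U+FFFD at different granularity. The function names are hypothetical.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the replacement character


def has_replacement(raw: bytes) -> bool:
    """Check 1: look for U+FFFD in the converted text."""
    return REPLACEMENT in raw.decode("utf-8", errors="replace")


def length_changed(raw: bytes) -> bool:
    """Check 2: compare byte counts before and after conversion."""
    converted = raw.decode("utf-8", errors="replace").encode("utf-8")
    return len(converted) != len(raw)


if __name__ == "__main__":
    samples = [b"dog", b"\xff", "\ufffd".encode("utf-8")]
    for raw in samples:
        print(raw, has_replacement(raw), length_changed(raw))
```

In this sketch the two checks disagree on a name that already contains a validly encoded U+FFFD: the character check flags it, while the length check lets it through.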
I can see where boost::locale does the conversion (in uconv.hpp), but not the implementation of the icu method:
Bringing in the requirements as posted internally:
Features (after fork):
@tiger5226 , the RPC changes will necessitate an update to https://github.com/lbryio/chainquery. You will need an additional DB column to store the normalized name, and that name may be needed for various joins. There is no issue at present to track that.
Will the entire UTF-8 character set be available for claims?
Additionally, will @ or # be available? We currently treat these as reserved characters on Spee.ch
@skhameneh , the present design does not restrict any data. All possible byte combinations will continue to be available. @ and # are currently allowed in claim names at any location, and that will be true after the fork. Any valid UTF-8 will be normalized and lower-cased. Anything that is not valid UTF-8 will compete only with those exact bytes.