Discuss character normalization scheme #204
This is an issue opened to discuss the character normalization scheme in https://github.com/lbryio/lbrycrd/pull/159. Character normalization is not an area where the blockchain team has particular expertise, so this should be opened up for discussion with everyone else. The details of the normalization should not affect the PR outside of the specific normalization function and its unit tests.
The current scheme proposed in #159 is to implement NFD normalization (see the links below for more information):
http://unicode.org/reports/tr15/
http://unicode.org/faq/normalization.html
and then lower-case the letters using the en_US locale.
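For concreteness, here is a minimal sketch of what that pipeline could look like with Boost.Locale (which wraps ICU). The function name is illustrative, not the actual code from #159:

```cpp
#include <boost/locale.hpp>
#include <string>

// Sketch of the proposed scheme: NFD-normalize, then lower-case with en_US.
std::string normalizeClaimName(const std::string& name)
{
    static const std::locale loc = boost::locale::generator()("en_US.UTF-8");
    const std::string nfd = boost::locale::normalize(name, boost::locale::norm_nfd, loc);
    return boost::locale::to_lower(nfd, loc);
}
```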
I suspect that this scheme will have some problems handling other languages. In particular, note that internationalized domain names use a custom scheme to deal with various locale needs ( https://en.wikipedia.org/wiki/Internationalized_domain_name - I wonder if there is some module we can import for this? ).
Note that any bad scheme we implement here will be extremely troublesome to undo, since we would have to dive into the details of various languages and manually fix the normalization. It is important that we get it right the first time.
I haven't done a deep dive on this yet, but chiming in to say that we should avoid any US-centric solution.
A few thoughts: lbrycrd is already US-centric in that its CLI only shows English help, and its RPC commands use English field names. It also returns JSON, which, according to that standard, requires US-style periods in floating-point numbers. JSON also has invariant date-format requirements.
You can see a chart of the methods that use the locale here: http://userguide.icu-project.org/services. Unfortunately, it's hard to determine exactly which ICU methods we're using because of the Boost wrapper. Generally speaking, though, those formatting methods aren't the features we're looking for.
In Naut's code, we see that he first calls `normalize` and then `to_lower`. The former call should put it into a locale-neutral form before the case is lowered. I don't think it's actually using en_US to do the case conversion. I think this is the best we can do without taking a locale in on claim requests. As I understand it (and mentioned above), the `normalize` method is the equivalent of "make it non-locale-specific".
From my understanding, NFD normalization is not locale-specific, but the "lower" function is.
Unicode is Unicode, so picking a normalization form (like NFD) is language-agnostic. It's correct that the lower-case algorithm we're using is locale-specific. Boost provides the locale-aware version as well (via ICU), but I opted not to use that, because it will not work where that locale is unavailable. What this means is that the chosen lower-case algorithm may not actually lower-case words represented in Unicode that cannot be lower-cased without locale awareness, and I think that's a choice we live with for consistency.
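To make the locale dependence concrete, the classic example is Turkish casing. A sketch, assuming an ICU backend and that both locales are available on the system:

```cpp
#include <boost/locale.hpp>
#include <iostream>

int main()
{
    boost::locale::generator gen;
    const std::locale en = gen("en_US.UTF-8");
    const std::locale tr = gen("tr_TR.UTF-8");

    // With ICU, the same byte string lower-cases differently per locale:
    std::cout << boost::locale::to_lower("I", en) << "\n"; // "i" (U+0069)
    std::cout << boost::locale::to_lower("I", tr) << "\n"; // "ı" (U+0131, dotless i)
    return 0;
}
```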
Agreed, but it's a topic that I have more experience in than many. Doesn't mean I'm an expert.
I don't agree, since the Unicode normalization will not be an issue. Knowing that there may be international casing issues is something I think we can live with. In the end, we are still making a best effort to normalize the code points consistently, which is far better than what we have today (which is just about guaranteed to break or be inconsistent across different languages/locales).
This was an interesting read:
http://unicode.org/faq/casemap_charprop.html
One consequence of our current trie structure is that partial Unicode names will no longer make sense; there's actually data loss on visualization. Hence, our RPC method `getclaimtrie` will be even more useless than it already is.

Late comment on this, but I find rejecting claims for names with unreadable/unrenderable characters to be perfectly acceptable.
There may be other reasons this is a bad idea, in which case I defer to the blockchain team's judgement. But from a UX perspective, I think it's completely unobjectionable to reject unrenderable claim names.
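To illustrate the partial-name problem: NFD splits a precomposed character into several code points, so a trie-node prefix can end in the middle of what a user sees as one character. A sketch, reusing the `loc` setup from the earlier snippet:

```cpp
// "é" as the single precomposed code point U+00E9 (2 bytes of UTF-8).
const std::string nfc = "\xC3\xA9";
const std::string nfd = boost::locale::normalize(nfc, boost::locale::norm_nfd, loc);
// nfd is now "e" followed by U+0301 COMBINING ACUTE ACCENT (3 bytes).
// A trie prefix ending after the "e" but before the accent is a partial
// character: it cannot be rendered faithfully on its own, which is the
// data loss getclaimtrie would expose.
```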
@lbrynaut says that we should look into whether lowering the case can throw an error, i.e. there may be cases where it may not know what to do.
I think this can be closed. Reopen it for further discussion. As it stands, we'll go with the current decisions:

- We will use NFD with the en_US locale.
- If the incoming UTF-8 data is invalid, we will use the data, but it will not be normalized or lower-cased, and it will compete only against those exact bytes (a sketch follows this list). We assume that current apps and clients will display partial characters (aka � symbols) and that they will disallow entry of them.
- We will eliminate the archaic `getclaimtrie` RPC method.
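A sketch of the second decision, the invalid-UTF-8 fallback. `IsValidUtf8` and `NormalizeClaimName` are hypothetical names, and the validator is a hand-rolled stand-in for whatever check the implementation actually uses:

```cpp
#include <boost/locale.hpp>
#include <cstdint>
#include <string>

// Hypothetical helper: true iff `s` is well-formed UTF-8 (no overlong
// encodings, surrogates, or code points above U+10FFFF).
bool IsValidUtf8(const std::string& s)
{
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = s[i];
        size_t len;
        uint32_t cp;
        if (c < 0x80)              { len = 1; cp = c; }
        else if ((c >> 5) == 0x6)  { len = 2; cp = c & 0x1F; }
        else if ((c >> 4) == 0xE)  { len = 3; cp = c & 0x0F; }
        else if ((c >> 3) == 0x1E) { len = 4; cp = c & 0x07; }
        else return false; // stray continuation or invalid lead byte
        if (i + len > s.size()) return false;
        for (size_t j = 1; j < len; ++j) {
            const unsigned char cc = s[i + j];
            if ((cc >> 6) != 0x2) return false; // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }
        if ((len == 2 && cp < 0x80) || (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) ||     // overlong encoding
            (cp >= 0xD800 && cp <= 0xDFFF) || // surrogate
            cp > 0x10FFFF)                    // out of range
            return false;
        i += len;
    }
    return true;
}

// Decision 2: invalid UTF-8 is left untouched and competes on exact bytes;
// valid UTF-8 gets NFD plus en_US lower-casing as above.
std::string NormalizeClaimName(const std::string& name)
{
    if (!IsValidUtf8(name))
        return name;
    static const std::locale loc = boost::locale::generator()("en_US.UTF-8");
    return boost::locale::to_lower(
        boost::locale::normalize(name, boost::locale::norm_nfd, loc), loc);
}
```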