Normalization UX level discussion

kaykurokawa commented

2018-11-05 21:39:24 +01:00

(Migrated from github.com)

Discuss here what should happen at the UX level for normalization. A lot of this will be done at the layers above lbrycrd, but some of the implementations described below will need some help from lbrycrd. I think the important thing to consider here is what an unnormalized claim name is. Is it just completely invalid? Or does it contain some information that we want to preserve?

a) After the normalization hard fork, what should happen when users attempt to make a claim with an unnormalized string (e.g. a user wants to make the claim DOG, but it will be normalized to dog)? Should we reject such claim attempts and tell them the correct normalized form? Should we allow them to make the claim, and then perhaps afterwards tell them what the proper normalized form is?

b) If a claim name is normalized into a different string, do we need to preserve the original unnormalized string for the user somehow? (E.g. a user makes a claim as DOG, we normalize it to dog, but we still store somewhere that the user wanted to make the claim as DOG.) Do other users need to see the original unnormalized string as well (make it accessible via the blockchain, instead of personal storage)?

c) After the normalization hard fork, what should happen when users attempt to search for an unnormalized string? Do we autocorrect to the normalized form? Or do we warn the user that it is an invalid string and give them the normalized form?

I might be missing some other UX considerations. Please list them if needed.


kaykurokawa commented

2018-11-05 21:51:39 +01:00

(Migrated from github.com)

Personal opinions:

It seems like for a), it would be better to reject claim attempts instead of auto-normalizing, since users may not be aware that normalization exists.

For b), I feel that it is not necessary to preserve the unnormalized string. Maybe the only downside is that users might be confused if they had claims before the normalization hard fork and see that their claims changed.

For c), I think this one does not make much of a difference, but it seems like autocorrect would work fine.


BrannonKing commented

2018-11-05 22:04:02 +01:00

(Migrated from github.com)

My vote:

a) Making a claim with non-normalized characters will be unnoticeable to the user.

b) yes, the original string is always preserved. When running getClaimById you would always expect to see the original bytes. No user should ever care/know that we changed the case on DOG for our internal structure. RPC methods like getclaimtrie return original names/bytes for all claims.

c) searching for an unnormalized string will return the unnormalized string. We normalized it when we put it into the trie, we normalized the search text before we used it on the trie, and we pulled the original name bytes on the winning claim before returning.


kauffj commented

2018-11-05 22:13:38 +01:00

(Migrated from github.com)

My answers:

A) I think it is probably safest to reject these at the blockchain level but I'm somewhat ambivalent. Relatedly, normalization should be happening at other levels and I'm not sure this is filed. @BrannonKing if we do not have an epic for changes at other levels related to normalization, can you start one?

B) At the blockchain level, storing the non-normalized name would waste space. I am not concerned with losing historical names at this time.

C) Ambivalent. I would be okay with either auto-normalizing search terms or erroring but providing the correct one.

A and C could potentially be options, with the default being to normalize.

If we're doing any auto normalization, it would be good (necessary?) for the blockchain layer to expose methods that allow me to directly call normalization functions as well.


eukreign commented

2018-11-05 22:59:17 +01:00

(Migrated from github.com)

I may be missing some background on this but from a superficial point of view:

a) Magic bad. Fail fast. Reject.

b) I don't really understand this one. If a transaction is submitted with a non-normalized claim name in it, how would lbrycrd "normalize" the claim name in the transaction without modifying and thus invalidating the transaction?

c) Magic bad. Fail fast. Reject. Returning the normalized form in the error message would be nice though. From a UX perspective I see it working like this: user types stuff, app submits search to server, server responds with error and the correct normalized string, app updates the input box with the correct normalized form and then subsequently submits that normalized search string to server. Or app implements normalization itself (less appealing).
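
The reject-and-suggest flow described in c) could be sketched roughly as follows. This is only an illustration: `normalize`, `NotNormalizedError`, `strict_search`, and `app_search` are hypothetical names, the index is a plain dict, and the rule used (NFD decomposition plus case folding) is an assumption, since the thread does not say which normalization lbrycrd applies.

```python
import unicodedata


def normalize(name: str) -> str:
    # Assumed rule for illustration: NFD decomposition plus case folding.
    return unicodedata.normalize("NFD", name).casefold()


class NotNormalizedError(ValueError):
    """Raised by the server when the query is not in normalized form."""

    def __init__(self, given: str, suggestion: str):
        super().__init__(f"{given!r} is not normalized; try {suggestion!r}")
        self.suggestion = suggestion


def strict_search(query: str, index: dict) -> list:
    # Server side: fail fast, but report the normalized form in the error.
    normalized = normalize(query)
    if query != normalized:
        raise NotNormalizedError(query, normalized)
    return index.get(query, [])


def app_search(user_input: str, index: dict) -> tuple:
    # Client side: on rejection, adopt the suggested form and retry once.
    # Returns the text to show in the input box and the search results.
    try:
        return user_input, strict_search(user_input, index)
    except NotNormalizedError as err:
        return err.suggestion, strict_search(err.suggestion, index)
```

With an index such as `{"dog": ["claim-1"]}`, calling `app_search("DOG", index)` fails on the first attempt and succeeds on the retry with `dog`, so the input box ends up showing the normalized form.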


BrannonKing commented

2018-11-05 23:30:02 +01:00

(Migrated from github.com)

I don't think we should change the users' data: what they put into the system should be what people see. That's what they paid for. Normalization and case indifference are there to allow people to wholly own their brand, with other niceties to help people locate items without having to be perfectly specific. If it doesn't look right they can try it again; let's put that responsibility on the user. I've been thinking about this problem since @kaykurokawa 's comment about how to avoid messing up SI units and acronyms. The only way to do it (at least at the lbrycrd level) is to return for display exactly what the user sent.

👍 1

kauffj commented

2018-11-05 23:45:51 +01:00

(Migrated from github.com)

I'm more ambivalent on this than my original comment now that I better understand the issue.

I had never even considered continuing to display the names the way the user entered them. I had simply assumed we'd be dropping the old formatting.

For what it's worth, there is precedent for explicitly disallowing this: the current domain system. But that doesn't mean we can't do better.

Here are some pros/cons of each approach.

### Keeping User Formatting

- Gives user full/complete precision
- Handles some corner cases better like acronyms or units

### Dropping User Formatting

- Saves space
- Doesn't force user to think or know about formatting
- Simpler UI/UX (more below)
- Prevents weird naming or naming designed to be noisy (tYpInG lIkE tHIs)

In thinking about this issue, please consider UI/UX all the way down to the user-interaction and browser level.

Currently domains are all lower case and this makes the choices in how to handle URLs at the browser level quite simple. If names can be mixed case but are only searched as lower-case, it may introduce some weird UX. For example, at what point do we normalize what the user has typed? Browsers replace any domain name with all lower-case as soon as I hit enter - what would a LBRY browser do if names are resolved as lower-case but can have upper-case when resolved?


BrannonKing commented

2018-11-06 00:12:52 +01:00

(Migrated from github.com)

First, an example to consider: 3Blue1Brown. On his domain he uses the lowercase version of that, but his channel names keep the casing. I don't know if he would be offended if we lower-cased his brand, but I can bet he would prefer that we don't.

Second, I can't think of a way that dropping the user formatting would save us space. We store the name for every claim in the DB either way.

Concerning the UX, consider this example:
A. Four claims are owned: BROWN at 1 LBC, Brown at 3 LBC, broWN at 2 LBC, and LeRoy_Brown.
B. Searching for brown, with any case variation, should return all four.
C. Opening lbry://BROWN (with any case variation) would immediately switch to lbry://Brown (the current node winner).
D. Opening lbry://brown$3 would immediately switch to lbry://BROWN$3.

Point being: the user didn't have to think about the casing.
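
The BROWN example can be worked through in a few lines. This is a sketch under stated assumptions rather than how lbrycrd does it: `normalize` uses NFD plus case folding (the thread does not name the rule), `$n` is read as "the n-th highest bid among the competitors", and the amount for LeRoy_Brown is invented because the example gives none.

```python
import unicodedata


def normalize(name: str) -> str:
    # Assumed rule for illustration: NFD decomposition plus case folding.
    return unicodedata.normalize("NFD", name).casefold()


# (original name, amount in LBC). The LeRoy_Brown amount is hypothetical.
CLAIMS = [("BROWN", 1), ("Brown", 3), ("broWN", 2), ("LeRoy_Brown", 1)]


def search(query: str, claims: list) -> list:
    # Substring match on normalized forms; results keep the original names.
    q = normalize(query)
    return [name for name, _ in claims if q in normalize(name)]


def competitors(name: str, claims: list) -> list:
    # Claims whose normalized name equals the query, highest amount first.
    key = normalize(name)
    same = [c for c in claims if normalize(c[0]) == key]
    return sorted(same, key=lambda c: c[1], reverse=True)


def resolve(name: str, claims: list, rank: int = 1) -> str:
    # rank=1 is the current winner; rank=n mimics lbry://name$n.
    return competitors(name, claims)[rank - 1][0]
```

Here `resolve("BROWN", CLAIMS)` gives `Brown` (step C), `resolve("brown", CLAIMS, rank=3)` gives `BROWN` (step D), and `search("bRoWn", CLAIMS)` returns all four names (step B).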

Example 2:
A. One user owns Amélie (aka, "Ame\u0301lie") and a second user owns Amélie (aka, "Am\u00e9lie").
B. Searching for either one shows both, since the search would be normalized on its way in.
C. Opening one or the other would switch to the current winner on that node with no obvious way to know that another very similar endpoint exists.
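
To make the second example concrete, here is a small check using Python's standard `unicodedata` module. It compares the decomposed spelling (e followed by U+0301) with the precomposed one (U+00E9). The helper name `canonically_equal` is made up for this sketch, and NFD is used only because some canonical form has to be picked; the thread does not say which form lbrycrd uses.

```python
import unicodedata

decomposed = "Ame\u0301lie"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "Am\u00e9lie"   # single code point LATIN SMALL LETTER E WITH ACUTE


def canonically_equal(a: str, b: str) -> bool:
    # Two strings are canonically equivalent if their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The raw strings differ (7 vs 6 code points) even though they render alike.
raw_equal = decomposed == precomposed
same_after_normalizing = canonically_equal(decomposed, precomposed)
```

`raw_equal` is `False` while `same_after_normalizing` is `True`, which is why both claims land on the same node once names are normalized for comparison.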


👍 1

bvbfan commented

2018-11-06 09:19:10 +01:00

(Migrated from github.com)

To me, users may be frustrated by having their words lower-cased. Most sites and applications have simple rules for valid names, and those rules do not hurt user acceptance.
a) Reject. We should not be guessing or correcting the user's input.
b) No. Simple rules, clean relationship.
c) Warn the user that we can show only the normalized form of their query.


kauffj commented

2018-11-06 17:11:07 +01:00

(Migrated from github.com)

OK, I'm fairly persuaded by the case to keep user formatting.

Will it be possible for claim updates to change the formatting?


BrannonKing commented

2018-11-06 21:30:28 +01:00

(Migrated from github.com)

@lyoshenka suggested that we don't change the structures for the output of the RPC calls; instead, we can add an `original_name` field to the claims. I like this; it keeps backwards compatibility on the RPC calls.

BrannonKing commented

2018-11-06 21:31:28 +01:00

(Migrated from github.com)

@kauffj , yes it would be possible for the claim updates to change the formatting. I like that plan!

lyoshenka commented

2018-11-07 11:53:12 +01:00

(Migrated from github.com)

Some premises:

- In general magic is bad, in the sense that we want clear and consistent rules for users to understand.
- Agreed with Brannon that we should not be changing user data. However, there are two users: the claim publisher and the content consumer. I think we have to change data for at least one of them. Sounds like we're leaning towards doing it to the consumer (when they search for a non-normalized string) and I think that's best.
- Agreed that lbrycrd should expose an RPC method to normalize a string, so that users can tell in advance what the result of normalization would be. It should also store its test data for normalization in JSON, so anyone who reimplements our claimtrie can ensure they got normalization right.
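
As a rough idea of what JSON test data for normalization might look like and how a reimplementation could check itself against it, consider the sketch below. The vector format (`input`/`expected` keys), the two sample vectors, and the NFD-plus-case-folding rule are all assumptions made for this example; none of them are taken from lbrycrd.

```python
import json
import unicodedata

# Hypothetical test vectors; a real file would ship with the reference node.
TEST_VECTORS_JSON = """
[
  {"input": "DOG", "expected": "dog"},
  {"input": "Am\\u00e9lie", "expected": "ame\\u0301lie"}
]
"""


def normalize(name: str) -> str:
    # Assumed rule for illustration: NFD decomposition plus case folding.
    return unicodedata.normalize("NFD", name).casefold()


def failing_vectors(vectors_json: str, normalize_fn) -> list:
    # Return every vector where the candidate implementation disagrees.
    vectors = json.loads(vectors_json)
    return [v for v in vectors if normalize_fn(v["input"]) != v["expected"]]
```

`failing_vectors(TEST_VECTORS_JSON, normalize)` is empty, while a naive implementation such as `str.lower` fails the second vector because it never decomposes the accented character.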

With that in mind, I'm gonna answer these in reverse because my reasoning flows that way.

c) For a search, the search string should be normalized on the way in and search should be done against normalized names. Each search result should contain both `name` (the original) and the `normalized_name` fields. Then upstream apps can do what is right for them.

One thing to consider here is that by returning both, we're forcing upstream app devs to understand what normalization is and to make a decision about which field they need. I'd like to avoid this (simple is better) but I'm not sure we can.

b) We're talking about storing in memory, right? Yes, store both. Is there a concern about the amount of data that needs to be in memory? In most cases the names will be the same (I assume), so we're not increasing memory consumption by that much. If I'm misunderstanding the tradeoff, please explain.

a) Allow non-normalized claims. I don't think this will be confusing to users. When they search for their intended claim name before making the claim, they will see all the claims that they are competing with. So when they make their claim, they will not be surprised about the result.


kaykurokawa commented

2018-11-07 18:14:08 +01:00

(Migrated from github.com)

@grin for b) the trade-off is mostly just a matter of whether the unnormalized version of the string is preserved and means anything or not for the users.

If the unnormalized string means something (it is preserved and shown to the user), then it allows the user to have better expressiveness. If it means nothing, then I think there is better simplicity in how the naming works. So I think there is no right answer here, just whether we want to allow better expressiveness and functionality at the cost of simplicity and comprehensibility.


lyoshenka commented

2018-11-08 14:15:13 +01:00

(Migrated from github.com)

In general I would prefer only having one version of the string returned (either the original or the normalized version) because that's simpler and we should not be asking people to make decisions when we can make the decision for them. However I don't think we can do that here. The Unicode normalization FAQ (http://www.unicode.org/faq/normalization.html) says:

> Q: Why should my program normalize strings?
> A: Programs should always compare canonical-equivalent Unicode strings as equal.

I take that to mean that normalization is just for comparison. For other purposes (such as displaying the name), we should be using the original form. Since most upstream applications will want to do both (e.g. sorting claims by name requires comparisons), we have to either return both or ask all upstream apps to implement normalization themselves if they want to compare names.

One mistake I've been making is thinking of names as needing to be normalized "in general". I now realize that's not the right way to look at it. Normalization is just for comparison. For everything else (displaying, storing, etc), we use the original.
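
The "normalize only for comparison" idea might look like this in an upstream app: normalized forms are used as comparison and sort keys, while the original strings are what gets displayed. The function names and the NFD-plus-case-folding rule are assumptions for the sketch.

```python
import unicodedata


def normalize(name: str) -> str:
    # Assumed rule for illustration: NFD decomposition plus case folding.
    return unicodedata.normalize("NFD", name).casefold()


def same_name(a: str, b: str) -> bool:
    # Comparison always goes through the normalized form.
    return normalize(a) == normalize(b)


def sort_for_display(names: list) -> list:
    # Sort by the normalized key, but keep the originals for display.
    return sorted(names, key=normalize)


names = ["apple", "Banana", "cherry"]
plain_order = sorted(names)                # code point order puts "Banana" first
display_order = sort_for_display(names)    # normalized order keeps "apple" first
```

The originals are never rewritten; only the key used for ordering and equality changes.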

Does this make sense? It also leads to the following answers to the original questions:

> a) After normalization hard fork, what should happen when users attempt to make a claim with an unnormalized string?

This is fine. Nothing changes here.

> b) If a claim name is normalized into a different string, do we need to preserve the original unnormalized string for the user somehow?

Yes. Claim names are stored in their original form and returned to the user that way. We should also return the normalized form for convenience, unless we expect upstream users to implement normalization themselves.

> c) After normalization hard fork, what should happen when user attempt to search for an unnormalized string?

Search involves comparisons. Anytime a comparison is made, the normalized form of the string should be used. You can search for any string you want, and it will be normalized before comparing.


BrannonKing commented

2018-11-08 18:35:10 +01:00

(Migrated from github.com)

Data returned in the RPC commands typically looks like this:

{
  "name": "blah",
  "claims": [
    { "claimId": "xxx", ... },
    { "claimId": "yyy", ... }
  ]
}

The name of the node is the normalized name, and it's not "per claim". We can continue to return that name in addition to adding one more field inside the claim structures that have the original name.
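
A sketch of that shape, with the extra per-claim field added, might look like the following. The field name `original_name` comes from the suggestion earlier in this thread; the builder function, the claim IDs, and the sample names are placeholders invented for the example.

```python
import json


def node_response(normalized_name: str, claims: list) -> dict:
    # The node-level "name" stays as it is today (the normalized name);
    # each claim gains an "original_name" holding what the publisher sent.
    return {
        "name": normalized_name,
        "claims": [
            {"claimId": claim_id, "original_name": original}
            for claim_id, original in claims
        ],
    }


# Placeholder claim IDs and names, purely for illustration.
response = node_response("blah", [("xxx", "Blah"), ("yyy", "BLAH")])
as_json = json.dumps(response, indent=2)
```

Existing consumers that only read `name` and `claimId` would keep working, which is the backwards-compatibility point made above.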


kauffj commented

2018-11-08 18:44:57 +01:00

(Migrated from github.com)

> One mistake I've been making is thinking of names as needing to be normalized "in general".

I was making the same mistake and have updated my thinking.


BrannonKing commented

2018-11-09 18:17:00 +01:00

(Migrated from github.com)

I've been working on this. I need to understand one more part of this: when removing items from the activation and expiration queues we compare the name and the outpoint. Do we really need to compare the name in that situation?

lyoshenka commented

2018-11-09 19:24:37 +01:00

(Migrated from github.com)

Addendum: I said above that we should be using the original name for "everything else (displaying, storing, etc)". I think that's wrong in the case of storing names in the claimtrie, because the location of the claim in the trie depends on the name and claims in the same location compete for the same name. So we should be using the normalized name as the path in the claimtrie.
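
A toy trie shows why the path has to be the normalized name: claims that differ only in case or Unicode form end up on the same node and therefore compete, while each claim still carries its original name. This is a character-level sketch with hypothetical class names and an assumed NFD-plus-case-folding rule, not the byte-level structure lbrycrd uses.

```python
import unicodedata


def normalize(name: str) -> str:
    # Assumed rule for illustration: NFD decomposition plus case folding.
    return unicodedata.normalize("NFD", name).casefold()


class TrieNode:
    def __init__(self):
        self.children = {}
        self.claims = []  # original names of the claims competing here


class ClaimTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, original_name: str) -> None:
        # Walk the trie along the normalized name, store the original.
        node = self.root
        for ch in normalize(original_name):
            node = node.children.setdefault(ch, TrieNode())
        node.claims.append(original_name)

    def competitors(self, name: str) -> list:
        # Look up by normalized path; return originals in insertion order.
        node = self.root
        for ch in normalize(name):
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.claims)


trie = ClaimTrie()
for claim_name in ["DOG", "dog", "Dog", "dogs"]:
    trie.insert(claim_name)
```

After these inserts, `trie.competitors("dOg")` returns `DOG`, `dog`, and `Dog`, while `dogs` sits on its own node one level deeper.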

lyoshenka commented

2018-11-09 19:30:50 +01:00

(Migrated from github.com)

@BrannonKing I think a field called `name` should always contain the original name. If we are returning the normalized name, we should call it `normalized_name` to indicate that it's not the name they claimed. We can also use `original_name` for the original if we want to return both and be very clear about it.

For the RPC command you gave above, I'd recommend using `normalized_name` for the top-level name and `original_name` inside each claim object. Dropping `name` would be a BC break, but I think it's clearer and more consistent that way.

I'm open to something better than `normalized_name`, but it can't simply be `name`.


lyoshenka commented

2018-11-09 19:33:41 +01:00

(Migrated from github.com)

Or, an alternative change to the above would be to not return any names at the top level, and to simply return a list of claim objects, each of which has `name` and `normalized_name` as fields. This also lets us be more consistent: everywhere that a claim is returned by the API, each claim always has those two fields.
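
To compare the two shapes, the helper below converts a node-style response (a top-level `normalized_name` plus claims carrying `original_name`) into the flat list proposed here. The function name and sample data are hypothetical; only the field names come from this thread.

```python
def flatten(node: dict) -> list:
    # Every claim object carries both fields, so no top-level name is needed.
    return [
        {
            "claimId": claim["claimId"],
            "name": claim["original_name"],
            "normalized_name": node["normalized_name"],
        }
        for claim in node["claims"]
    ]


# Placeholder data for illustration only.
node = {
    "normalized_name": "blah",
    "claims": [
        {"claimId": "xxx", "original_name": "Blah"},
        {"claimId": "yyy", "original_name": "BLAH"},
    ],
}
flat = flatten(node)
```

The flat form repeats the normalized name on every claim, which costs a little redundancy but means a claim looks the same wherever the API returns one.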


BrannonKing commented

2018-11-29 12:41:58 +01:00

(Migrated from github.com)

Back to the original post with conclusions:

> After normalization hard fork, what should happen when users attempt to make a claim with an unnormalized string...

We decided to use normalization for internal comparisons and competition. No normalization is required on the data before it is sent to lbrycrd.

> If a claim name is normalized into a different string, do we need to preserve the original unnormalized string for the user somehow? ...

Yes. We decided to preserve the original in the claimId table and to return that value in various RPC calls. We anticipate that those displaying the claim will use the returned original name.

> After normalization hard fork, what should happen when user attempt to search for an unnormalized string? Do we autocorrect to the normalized form? ...

Yes, we normalize all inputs and return all competitors that match that.


Normalization UX level discussion #234