LBRYCommunity/lbry-desktop

Support proper i18n pluralization #3440

Open

opened 2020-01-03 22:31:07 +01:00 by kauffj · 8 comments

kauffj commented

2020-01-03 22:31:07 +01:00

(Migrated from github.com)

We should use `__n` for numeric messages to support unlimited, language-specific messages.

More: https://www.gnu.org/software/gettext/manual/html_node/Plural-forms.html

marko-lorentz commented

2020-06-11 14:52:57 +02:00

(Migrated from github.com)

I have noticed that the number of places where `if` statements are used in the code to pick the correct plural form is increasing. Having an `if` in every piece of source code is a mess that makes it less readable and introduces new errors. It also assumes that the developer understands the plural rules of all languages, as he must add the correct `if` blocks (for Russian this would be 4, for Arabic up to 6).
All of this should go into a central function that does the desired replacements automatically. In the source code, only a single method call should remain.

I would recommend adopting the ICU recommendations for plural- and gender-oriented message formatting as far as possible.

- The ICU format for JSON is (AFAIK partially) supported by Transifex, so there will be tool support for the translators.
- In the source code, [react-intl](https://github.com/formatjs/react-intl) or [Globalize](https://github.com/globalizejs/globalize) might be used to format the messages. These can be hidden behind a facade method like `__()`.

More information can be found [here](https://medium.com/i18n-and-l10n-resources-for-developers/the-missing-guide-to-the-icu-message-format-d7f8efc50bab).
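The idea of a central function can be sketched with the built-in `Intl.PluralRules` API, which already knows the plural categories of each language. This is only an illustration under assumptions: the helper names `pluralCategory` and `pluralCategoryCount` are hypothetical and not part of lbry-desktop, and the results depend on the ICU data shipped with the JavaScript runtime.

```javascript
// Hypothetical helpers; not part of the lbry-desktop code base.

// Returns the CLDR plural category ('zero', 'one', 'two', 'few', 'many'
// or 'other') that the given locale uses for the number n.
function pluralCategory(locale, n) {
  return new Intl.PluralRules(locale).select(n);
}

// Returns how many cardinal plural categories the locale distinguishes.
function pluralCategoryCount(locale) {
  return new Intl.PluralRules(locale).resolvedOptions().pluralCategories.length;
}

// Number of categories for English, Russian and Arabic.
console.log(['en', 'ru', 'ar'].map(pluralCategoryCount));

// Categories Russian picks for a few sample numbers.
console.log([1, 2, 5, 1.5].map((n) => pluralCategory('ru', n)));
```

With a helper like this, calling code never needs its own `if` blocks; it only passes the number, and the locale data decides which form applies.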

Suisse00 commented

2020-06-12 20:15:50 +02:00

(Migrated from github.com)

WARNING: I have no real experience with i18n, and the languages I work with have only one plural form.

On the code side, it could be a single method with ONE specific translation key that, behind the scenes, changes according to the number.

E.g. `__n('You have %n messages', n)`

TL;DR: Behind the scenes it will look for the key `'You have %n messages'` plus the suffix `[${n}]`.
I added a kind of `[n]` at the end just to make such strings recognizable; it could also append only the number.

If n is 1, it will look for `'You have %n messages[1]'`; if it is 3, `'You have %n messages[3]'`, ...

Downsides:

- Programmers will need to flag potential plural keys (either by adding them to a special file that generates all translation keys for them, or by creating X translation keys themselves, which is not a good idea).
- Either programmers add some i18n logic per language to select up to N translation keys (e.g. French and English have only 2 plural forms; >2 is the same as 2), or translators will need to copy the string into the other translation keys (uh :( ).
- Ideally, translators shouldn't have to copy or even see plural forms that don't exist (like 3, 4, 5, 6, which are the same as 2 for French/English).
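The suffix lookup described in this comment could look roughly like the sketch below. The translation table, the `maxForms` parameter and the fallback order (exact `[n]` key, then the highest known form, then the untranslated key) are assumptions made for illustration; the comment itself does not specify them.

```javascript
// Hypothetical translation table for a language with two forms.
const translations = {
  'You have %n messages[1]': 'You have %n message',
  'You have %n messages[2]': 'You have %n messages',
};

// Sketch of the proposed __n: look for "key[n]" first, then reuse the
// highest form (assumed to be maxForms) for larger counts, and finally
// fall back to the key itself when nothing matches.
function __n(key, n, table = translations, maxForms = 2) {
  const candidates = [`${key}[${n}]`, `${key}[${Math.min(n, maxForms)}]`];
  const found = candidates.find((candidate) => candidate in table);
  const template = found ? table[found] : key;
  return template.replace('%n', String(n));
}

console.log(__n('You have %n messages', 1)); // "You have 1 message"
console.log(__n('You have %n messages', 3)); // "You have 3 messages"
```

The clamp to `maxForms` is exactly the per-language logic listed under the downsides: it works for French or English but would be wrong for languages whose forms do not follow the count linearly.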

marko-lorentz commented

2020-06-12 20:43:39 +02:00

(Migrated from github.com)

@Suisse00 The downsides you mention are almost completely covered by the ICU model:

- just 1 line for the _programmer_ (as with all translation tokens with replacements, he must know what he is doing, of course, i.e. he must know the types of all replacement parameters)
- all decision-making code is hidden completely (i.e. knowledge of how many plural forms a language has is not needed for writing code)
- the real number of plural forms per language is only visible to the _translator of the specific language_, so the English guy will never know that there are languages with more than 2 plural forms in other parts of the world (see the Transifex docs [here](https://docs.transifex.com/formats/introduction#plurals) and [here](https://docs.transifex.com/formats/json#plurals-support))

Would it make sense for me to provide an example in source code?

kauffj commented

2020-06-12 21:53:33 +02:00

(Migrated from github.com)

@marko-lorentz thanks for your message. I get the impression you know more about `i18n` than anyone at @lbryio currently 😁, but we're always happy to learn from those that know more than us.

1. Can you clarify what the syntax is when there is no variable in the message? How is `foo != 1 ? __('Plural foos') : __('One foo')` reduced to one string?
2. To reduce to one string in code, I assume this would also require us to begin putting some English strings into Transifex. Is that correct? Right now we rely on the English message passing through for English strings, and do not maintain or access a translations file.
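One possible reading of question 1, sketched under assumptions: the count is still passed to the call even though it never appears in the text, and an English entry keyed by the string that stays in code supplies both forms. The table layout and this `__n` signature are hypothetical, not an existing lbry-desktop API.

```javascript
// Hypothetical English entry: the key is the string that remains in code,
// the value holds one text per plural category.
const en = {
  'Plural foos': { one: 'One foo', other: 'Plural foos' },
};

// Picks the form for n; the number selects the text but is not printed.
function __n(key, n, locale = 'en', table = en) {
  const forms = table[key];
  if (!forms) return key; // pass-through when no entry exists
  const category = new Intl.PluralRules(locale).select(n);
  return forms[category] ?? forms.other ?? key;
}

// Replaces: foo != 1 ? __('Plural foos') : __('One foo')
console.log(__n('Plural foos', 1)); // "One foo"
console.log(__n('Plural foos', 2)); // "Plural foos"
```

The pass-through branch keeps the current behavior for strings without an entry, while the `en` table illustrates why question 2 arises: the singular text has to live somewhere outside the code.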

kauffj commented

2020-06-12 21:54:26 +02:00

(Migrated from github.com)

@seanyesmunt please read/follow this thread

👍 1

infinite-persistence commented

2020-06-13 07:31:05 +02:00

(Migrated from github.com)

For number 2, doing that would also solve the "can't reuse the same string in different locations for my language" problem that a few have reported (https://github.com/lbryio/lbry-desktop/pull/4340#issuecomment-641476349). I believe that if more translators verify the context of the strings, we'll get more reports of this issue.

marko-lorentz commented

2020-06-13 08:59:48 +02:00

(Migrated from github.com)

_Sidenote concerning the use of entire strings in code:_
The "have a complete string in the code" approach has some charm: the software works out of the box, even without any translations, and all replacement parameters are listed explicitly and in context, with (hopefully) good names, in an example phrase that helps the translators understand them.
But it also has some well-known disadvantages: fixing typos will ping all translators if Transifex is aware of the change, and you will end up with duplicates or near-duplicates that translators cannot easily keep "similar".
My conclusion at the moment: I don't dare to propose a solution for this, as you can do it one way or the other. It must make sense to those who work with the code and test it on a daily basis. It might be worth trying both: a set of constants that can be reused, as well as plain strings. Deciding when to offer a publicly visible constant like 'TERM_FOLLOWER' instead of a private plain string might be challenging.

(Translators should at least have a short look at the actual places where their texts are used. And it feels strange that I wasn't able to find good support for hyperlinks (e.g. to the correct place in the lbry.tv UI or to the source code) and screenshots _in Transifex_, something that is quite common in other translation support systems. Maybe I'm just too stupid to find it.)

marko-lorentz commented

2020-06-13 10:40:48 +02:00

(Migrated from github.com)

I will try to answer the questions raised by @kauffj:

- yes: using an ICU-based translation model means that you also need to add a dedicated "English translation", as the plural forms vanish from the code
  - Transifex proposes uploading the developer strings as language "en" and having (developers or) translators create a translation for language "en_US", which will then be used by the production system (see [explanation here](https://docs.transifex.com/localization-tips-workflows/non-english-as-a-source-language#“developer”-english-source:-you-need-to-translate-from-non-perfect-english-into-proper-english-and-then-into-french-))
  - the translators are able to view both (the "en" and the "en_US" strings) to derive the correct translation by switching the Transifex translation view to "show a second source language"
  - the strings that remain in the source code should be those with the most abstract plural, like "I have %n% cars", in order to show the intention of the author (this form is called `other` in the ICU model)
- the content of the translation file will look different, similar to this simple example:

```
{
  "You have %count% files": "{count, plural, one {You have {count} file.} other {You have {count} files.}}"
}
```

- the translators will ideally not be in touch with this special format, as Transifex should ask them to enter the `one` form and the `other` form separately if the target language requires these 2 forms (this will need a try, I think; never trust the docs of a tool) (the Transifex docs also don't mention how the `zero` form is supported, and it would really be a mess if it is not, as you need "You don't have a car." instead of "You have 0 cars.")
- in the source code, the representation depends on the library you use and how you wrap it; raw example for the "Format JS/Intl MessageFormat" lib without a wrapper:

```
import IntlMessageFormat from 'intl-messageformat';

// Compile the ICU plural message once for the en-US locale.
const enNumPhotos = new IntlMessageFormat(
  '{count, plural, one {You have {count} file.} other {You have {count} files.}}',
  'en-US'
);

// Format the message for a concrete count and print the result.
console.log(enNumPhotos.format({ count: 1000 }));
```

I'm not a JS developer at all (I usually do Java backend stuff), but I'm sure this can be wrapped with a generic vararg function or other fancy stuff. Whether it is `__n(...)` or whatever might be a decision to make after reading through the docs of the lib, as depending on the lib there might be more to discover, e.g. nested replacements.
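To make the selection step concrete without depending on any library, here is a deliberately minimal formatter for messages of the shape `{arg, plural, selector {text} ...}`. It is a sketch, not a replacement for a real ICU implementation: `formatPlural` and `carMessage` are hypothetical names, nesting, escaping and `#` are not handled, and the `=0` exact-value selector (which ICU MessageFormat defines) is shown only as one way to get the "You don't have a car." wording. Whether Transifex exposes `=0` to translators would still need to be tried.

```javascript
// Minimal formatter for a single top-level ICU plural argument.
// Assumptions: no nested plural/select, no escaping, no '#' shorthand.
function formatPlural(message, values, locale) {
  const trimmed = message.trim();
  const header = trimmed.match(/^\{\s*(\w+)\s*,\s*plural\s*,/);
  if (!header) throw new Error('Not a plural message: ' + message);
  const n = values[header[1]];
  const body = trimmed.slice(header[0].length, -1); // strip header and last '}'

  // Collect "selector {text}" pairs, tracking brace depth so that
  // placeholders such as {count} inside a form stay intact.
  const forms = {};
  let i = 0;
  while (i < body.length) {
    const open = body.indexOf('{', i);
    if (open === -1) break;
    const selector = body.slice(i, open).trim();
    let depth = 1;
    let j = open + 1;
    while (j < body.length && depth > 0) {
      if (body[j] === '{') depth++;
      else if (body[j] === '}') depth--;
      j++;
    }
    forms[selector] = body.slice(open + 1, j - 1);
    i = j;
  }

  // Exact-value selectors (=0, =1, ...) win over plural categories.
  const category = new Intl.PluralRules(locale).select(n);
  const text = forms[`=${n}`] ?? forms[category] ?? forms.other;
  return text.replace(/\{(\w+)\}/g, (match, name) =>
    name in values ? String(values[name]) : match
  );
}

const carMessage =
  "{count, plural, =0 {You don't have a car.} one {You have {count} car.} other {You have {count} cars.}}";

console.log(formatPlural(carMessage, { count: 0 }, 'en')); // "You don't have a car."
console.log(formatPlural(carMessage, { count: 1 }, 'en')); // "You have 1 car."
console.log(formatPlural(carMessage, { count: 5 }, 'en')); // "You have 5 cars."
```

A wrapper such as `__n(key, values)` could look up the message for `key` in the active translation file and pass it to a formatter like this one, or to a full library, so that calling code never sees the plural rules.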