I’ve tried to fix some translation issues, including some English text used in more than one place in different contexts. I made and committed changes to some local files, then spotted the JSON file, which needed the same modification. I introduced the same change, attempted to compile, and got confused because the build system added the same variable twice.
How is this supposed to be used, exactly, from the translator’s side?
How do the translations actually work from the application logic perspective?
Thank you so much for looking over the translations. Here’s some docs on the translation setup:
So I’d seen this article after my initial struggles, but it’s quite interactive - using some web service and then posting to the forum. In the end someone will just patch the code, so I’m curious how to do it myself properly and raise a pull request. Is there a way to do that? If a lot of small changes is an issue, maybe a dedicated translation branch on GitHub would do the job - it could be merged from time to time, or when some milestone like translating a whole single language is reached.
Sure, I’ll push a translation branch and you can pull request against the .json file. It’s also fine at this scale to just pull request to the main branch.
I’m not tied to the current setup, so if you have a better workflow I’ll happily implement it. The only thing that I absolutely need is the ability to rebuild from the en-us base translation and fill in defaults for the other languages. It was such a hassle to update n header files every time I added a new phrase that I found myself not actually using it consistently.
One potential option:
- Force translated text retrieval to go through an accessor function (instead of directly accessing the global `t` pointer).
- In that accessor function, if the current language’s text is `NULL`, return the text from the default language (`en-us`).
- If that’s still `NULL`, return the poop emoji.
- In `json2h.py`, provide a fixed size to the arrays in the generated `.h` files, and do not store `en-us` duplicates … store `NULL`.
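A tiny Python model of the proposed fallback chain (hypothetical, for illustration only; the real accessor would be written in C, and the table names below are invented):

```python
FALLBACK = "\U0001F4A9"  # poop emoji, the proposed "string is missing" marker

def get_t(tables, current_lang, key):
    """Return the translated string for `key`, falling back to en-us,
    then to the placeholder, mirroring the proposed C accessor."""
    text = tables.get(current_lang, {}).get(key)
    if text is None:
        text = tables.get("en-us", {}).get(key)
    if text is None:
        text = FALLBACK
    return text

# Example tables; None models a NULL slot in the generated C arrays.
tables = {
    "en-us": {"T_ON": "ON", "T_OFF": "OFF"},
    "eo-001": {"T_ON": "Ŝaltita", "T_OFF": None},
}
```

With these tables, `get_t(tables, "eo-001", "T_OFF")` falls back to the en-us `"OFF"`, and a key missing from both tables returns the placeholder.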
I really like this, especially the emoji, or whatever the ASCII art equivalent is.
This is now Issue #112, and I’ve assigned myself to it, as I already had much of the change prototyped.
PR 115 implements this. It uses the poop emoji, which works fine in TeraTerm.
I chose to name the accessor function `GET_T()`. If you could look it over, that would be appreciated. I will probably merge this tomorrow, unless something is found.
How do the source language files get updated? Sorry, I have not looked through the code, so this may be off the wall.
I wonder whether the initial processor could update the source language files with the US English text for entries that do not exist in the alternative languages. That way you can pass the file to a native speaker, who can edit just the new US English content.
All language files would then have a complete set of entries.
Not off the wall at all. It’s not self-evident, although I tried to document it more in this PR as I had to learn it. Here’s my summary:
- New translation-capable strings are simply added to `translation/en-us.h`. Simply create an unused enumeration name, and add it.
- Run `python ./json2h.py` from within the `translation/` directory. This will parse the `translation/en-us.h` header file, and update all the other files.
More about `json2h.py`

It goes through four main steps:

- Read the `translation/en-us.h` file, extracting KVPs (key/value pairs) for each string in that file.
- Generate / overwrite `translation/en-us.json` based on the extracted KVPs.
- Generate `translation/base.h` from the template file `translation/base.ht`, adding each of the keys (from the KVPs) to the enumeration type.
- Finally, for each other `translation/*.json` file (except `en-us.json`), generate a corresponding `translation/*.h` file. This loop is based on the extracted KVPs:
  - If the localized `.json` doesn’t have that key, then the value (the en-us string) is added to the localized `.h` file.
  - Else, if the localized `.json` string is identical to the en-us string, it’s added to the localized `.h` file (note: this may be set to `NULL` in the near future).
  - Else, if the localized `.json` lists a different string for that key, that localized string is added to the localized `.h` file.
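As a rough sketch (not the actual `json2h.py` code), the per-key decision in that final loop might look like:

```python
def build_localized_entries(en_us, localized):
    """Decide which string lands in the localized .h file for each en-us key.

    `en_us` and `localized` are dicts as loaded from the respective .json
    files. Simplified sketch of the three cases described above.
    """
    entries = {}
    for key, en_text in en_us.items():
        if key not in localized:
            entries[key] = en_text           # missing: fall back to the en-us string
        elif localized[key] == en_text:
            entries[key] = en_text           # identical (may be stored as NULL later)
        else:
            entries[key] = localized[key]    # translated: use the localized string
    return entries
```

For example, a localized file containing only `{"T_ON": "已开启", "T_OFF": "OFF"}` would produce the translated `T_ON`, the identical `T_OFF`, and the en-us default for every other key.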
How to add an entirely new translation / dialect
- Edit `translation/en-us.h` to add a string for the new dialect.
  The enumeration name MUST begin with `T_CONFIG_LANGUAGE_`.
  For example, to add Esperanto as a new option, it might look like:

  ```c
  [T_CONFIG_LANGUAGE_ESPERANTO] = "Lingvo - esperanto (Mondo)",
  ```

- Review `[T_CONFIG_LANGUAGE]` in `en-us.h`.
  Review the new dialect’s word for “language”.
  If it is significantly different from the existing options, modify the string to include the new translation’s word for “language”.
  For example, if the word for `language` in Esperanto was `XYZZY`, one might change:

  ```c
  [T_CONFIG_LANGUAGE] = "Language / Jezik / Lingua / 语言"
  ```

  to:

  ```c
  [T_CONFIG_LANGUAGE] = "Language / Jezik / Lingua / 语言 / XYZZY"
  ```

  If the word for `language` in Esperanto was instead `Lingo`, then it’s probably fine to leave the existing string due to its similarity with `Lingua` … it’s close enough to be understood.

- Create a new `.json` file in the `translation/` directory, with the base filename corresponding to the IETF language tag for the new dialect.
  - Recommend starting with a copy of `translation/en-us.json`. For example, if adding Esperanto:

    ```
    cp en-us.json eo-001.json
    ```

- Translate or remove the strings in that new `.json` file.
  - There is no requirement to translate all strings.
  - If a string is not translated, simply exclude that entry from the JSON file entirely.
  - Missing entries, or entries storing `null` (the special JSON value), will automatically load the most recent version of the en-US string.

  Made-up words for the example Esperanto translation:

  ```json
  {
      "T_CONFIG_LANGUAGE_ESPERANTO": null,
      "T_CONFIG_PIN": "PyN"
  }
  ```

- Edit `translation/base.ht` and add the new dialect to the enum `T_language_t`.
  - This will become part of the newly generated `base.h` in the next step.
  - The name should be `language_idx_` followed by the IETF language tag, replacing dashes with underscores.

  Continuing this example of adding Esperanto:

  ```c
  language_idx_eo_001, // Esperanto
  ```

- Run `python ./json2h.py` while in the `translation/` directory.
  - This parses `translation/en-us.h` to extract key/value pairs for each string.
  - Generates `translation/en-us.json` (for use as a template for new languages).
  - Generates `translation/base.h` from `translation/base.ht`, automatically creating the enumerations for all the string IDs.
  - Generates a `translation/XX-YY.h` file for each `translation/XX-YY.json` file in the directory (excluding `en-us.json`, of course).

- Manually update the table near the top of `translation/base.c` to include the new `T_language_t` that was added to `base.ht` above, as well as the name of the table in the new dialect’s generated header.
  Continuing the example of adding Esperanto, the added line might be:

  ```c
  [language_idx_eo_001] = &eo_001, // eo-001 aka Esperanto
  ```
OK! It is now possible to differentiate, in a translation JSON file, between entries which are translated (differ from the en-us string), entries which have not been translated yet (no entry in the JSON at all), and entries which have been reviewed and explicitly chosen to revert to en-us (set to the JSON keyword `null`).
See PR #115, in particular the end of the `translation/json2h.py` file, for a commented-out modification that should (in the long run) make translators’ lives easier by listing the “new” strings that need review.
What @lawrence said sounds like how it currently works.
The Python script fills in defaults from English prior to compiling.
It also rebuilds the file of array key defines from the `en_us.h` file, so that developers only have to add items, not update two places every time.
There is some basic documentation on it, and on how to add languages, in the docs. I bookmarked @henrygab’s post and will add the detailed overview the next time I’m doc-ing (with your permission).
There is also a browser app for loading and updating translations from the github repo. It’s simple but I’m proud of its simplicity.
Thank you for putting those changes through, I will check on them.
Wow, 115 looks great. I really like the translation tables.
Almost … there was no distinction in the current method between the following situations (parentheses indicate the new method of differentiation):

- Needs review by translator (e.g., a new string added to `en-us.h`, but the dialect’s `XX-YY.json` is entirely missing the entry)
- String already reviewed by translator, who decided not to translate it / to use the en-us version including updates (e.g., `XX-YY.json` lists the `T_...` enumeration set to the JSON keyword `null`)
- String already reviewed by translator, and the translation happens to match the current en-us text. Updates to the en-us version should NOT be used.
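A hypothetical Python sketch of how the three states above could be distinguished mechanically, given a localized `.json` loaded into a dict:

```python
def review_state(key, localized):
    """Classify a translation key against a localized .json dict.

    Hypothetical helper, mirroring the three situations listed above;
    `localized` is the dict produced by json.load() on XX-YY.json.
    """
    if key not in localized:
        return "needs review"            # entry entirely missing from XX-YY.json
    if localized[key] is None:
        return "reviewed, use en-us"     # explicit JSON null: follow en-us updates
    return "reviewed, translated"        # string present (even if it matches en-us)
```

Note that the third state covers both genuinely translated strings and translations that happen to match the current en-us text.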
Permission implicit with the PR, but explicitly given also.
That’s outstanding. Does the browser app need to be updated (with PR 115) to differentiate between the JSON file having an explicit `null`, vs. having no current translation?
While PR 115 improves things a bit, there’s more that could be done, at the cost of a slightly more complex `json2h.py`.
There is no validation of format string parameters.
If the en-us string is used as a format string (with embedded `%` format codes), the firmware will prepare and use that number and order of parameters. The compiler cannot validate that the parameters match the format string in this situation. Therefore, it’s important to at least validate that, if the en-us string is a format string, the translated string has the same order / types of specifiers.
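One way such a check could work (a sketch, not what `json2h.py` actually does) is to extract the printf-style specifiers from both strings and compare them in order:

```python
import re

# Rough printf specifier pattern: flags, width, precision, length modifier, conversion.
_SPEC_RE = re.compile(r"%[-+ #0]*\d*(?:\.\d+)?(?:hh|h|ll|l|j|z|t|L)?[diouxXeEfFgGaAcspn%]")

def format_specifiers(s):
    """Extract printf-style conversion specifiers, ignoring literal %%."""
    return [m for m in _SPEC_RE.findall(s) if m != "%%"]

def formats_match(en_text, translated_text):
    """True if both strings consume the same specifiers, in the same order."""
    return format_specifiers(en_text) == format_specifiers(translated_text)
```

This catches a translation that drops, reorders, or retypes a specifier, though a real implementation would also want to handle `*` width/precision and emit a useful diagnostic.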
There is no tracking of the en-us string that a given translation was based on.
Thus, if a string was previously translated, but the en-us string then changes, there is no automatic way to detect the need to re-review the translated string.
If each JSON entry were an object, it could store the en-us string that was the basis of the decision (as well as the decision itself, as noted above). This would allow detecting and flagging which strings need to be re-reviewed by translators.
e.g., instead of:

```json
{
    "T_ON": "已开启",
    "T_OFF": "已关闭",
    "T_GND": null,
    ....
}
```

the format could be:

```json
{
    "T_ON": { "t": "已开启", "original": "ON" },
    "T_OFF": { "t": "已关闭", "original": "OFF" },
    "T_GND": { "t": null, "original": "GND" },
    ....
}
```
As you can see, this is still a very simple format, while also being easily extensible.
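With the `original` field stored, flagging stale translations becomes a one-line comparison; a hypothetical sketch:

```python
def needs_re_review(entry, current_en_us):
    """True if the en-us text changed since this translation was last reviewed.

    `entry` is one per-key object in the extended JSON format above,
    e.g. {"t": "已开启", "original": "ON"}.  Hypothetical helper.
    """
    return entry["original"] != current_en_us
```

A script could run this over every key of every localized file against the current `en-us.json` and print the keys whose translations were based on out-of-date source text.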
Many strings are not translated.
Enough said. Is it a goal to translate all terminal UI strings?
Let me know if any of these are of interest, and I’ll open the corresponding github issue to track it.
Yes, it probably should be, once the dust settles. It is the primary way people have created and updated the translations so far.
Your three other points are all valid. I support updates to the JSON, it makes the translation process easier to have the actual string instead of just the tags.
In terms of translating every string - it’s just how far down the rabbit hole you want to go. I generally try to make sure most system wide things are using the translation system, things that won’t change a lot and have global impact. I don’t bother with, especially, little apps because they can evolve so fast that they have a kind of Frankenstein combo of translated and not.
Major improvements are now in PR #190.
It already detected an interesting edge case in the Italian translation … most folks would not realize that `"% A"` would require a double (float) argument, for example … but the automated checks did.
This also detects and tracks the format string specifiers in all translated strings, including keeping track of what it used to be. EN-US adding or removing a format specifier? For now, that’s an error … since it needs careful consideration to avoid breaking translations / crashing when using mismatched translation.
Thus, changes to existing EN-US strings are also (temporarily) prevented. New strings are fine. This prevents changing an EN-US string that has already been localized, which (for now) would invalidate the translation already done … in a way that we cannot easily discover.
Next step is to modify the translation JSON files, so they include additional fields. Example:
```json
{
    "T_EXIT": { "translation": "Esci", "based_on_en_us": "exit" },
    "T_GND": { "translation": null, "based_on_en_us": "GND", "comments": "Universal term" }
}
```
Directly including the original term will avoid all the terrible hacks I came up with to try to guess what version / commit / etc. a translated string was based on. With the above, it’s trivial to determine if a translated string may need to be updated … and allows changing EN-US strings again.
So much more coming down the pipe.
Quick question - I’m doing some work and I want to change the English text of some existing items (not adding or removing items).
The Python script gives me an error that `en-us.h` does not match `en-us.json`. I don’t want to mess things up too badly, so what’s the best way to handle this case?
Maybe create a new language xx-xx and use that?