I’ve tried to fix some translation issues, including some English text used in more than one place in different contexts. I made and committed changes to some local files, then spotted the JSON file, which needed the same modification. I introduced the same change, attempted to compile, and got confused because the build system added the same variable twice.
How is this supposed to be used, exactly, from the translator’s side?
How do the translations actually work from the application logic perspective?
Thank you so much for looking over the translations. Here’s some docs on the translation setup:
So I’d seen this article after my initial struggles, but it’s quite interactive - using some web service and then posting to the forum. In the end someone will just patch the code, so I’m curious how to do it myself properly and raise a pull request. Is there a way to do that? If a lot of small changes is an issue, maybe a dedicated translation branch on GitHub would do the job - it could be merged from time to time, or when some milestone like translating a whole single language is reached.
Sure, I’ll push a translation branch and you can pull request against the .json file. It’s also fine at this scale to just pull request to the main branch.
I’m not tied to the current setup, so if you have a better workflow I’ll happily implement it. The only thing that I absolutely need is the ability to rebuild from the en-us base translation and fill in defaults for the other languages. It was such a hassle to update n header files every time I added a new phrase that I found myself not actually using it consistently.
One potential option:
- Force translated text retrieval to go through an accessor function (instead of directly accessing the global `t` pointer).
- In that accessor function, if the current language’s text is `NULL`, return the text from the default language (`en-us`).
- If that’s still `NULL`, return the poop emoji.
- In `json2h.py`, provide a fixed size to the arrays in the generated `.h` files, and do not store `en-us` duplicates … store `NULL`.
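A tiny Python model of the proposed fallback chain (hypothetical, for illustration only; the real accessor would be written in C, and the table names below are invented):

```python
FALLBACK = "\U0001F4A9"  # poop emoji, the proposed "string is missing" marker

def get_t(tables, current_lang, key):
    """Return the translated string for `key`, falling back to en-us,
    then to the placeholder, mirroring the proposed C accessor."""
    text = tables.get(current_lang, {}).get(key)
    if text is None:
        text = tables.get("en-us", {}).get(key)
    if text is None:
        text = FALLBACK
    return text

# Example tables; None models a NULL slot in the generated C arrays.
tables = {
    "en-us": {"T_ON": "ON", "T_OFF": "OFF"},
    "eo-001": {"T_ON": "Ŝaltita", "T_OFF": None},
}
```

With these tables, `get_t(tables, "eo-001", "T_OFF")` falls back to the en-us `"OFF"`, and a key missing from both tables returns the placeholder.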
I really like this, especially the emoji, or whatever the ASCII art equivalent is.
This is now Issue #112, and I’ve assigned myself to it, as I already had much of the change prototyped.
PR 115 implements this. It uses the poop emoji, which works fine in TeraTerm.
I chose to name the accessor function `GET_T()`. If you could look it over, that would be appreciated. I will probably merge this tomorrow, unless something is found.
How do the source language files get updated? Sorry, I have not looked through the code, so this may be off the wall.
I wonder whether the initial processor could update the source language files with the US English text for entries that do not exist in the alternative languages. That way you can pass the file to a native speaker, who can edit just the new US English content.
All language files would then have a complete set of entries.
Not off the wall at all. It’s not self-evident, although I tried to document it more in this PR as I had to learn it. Here’s my summary:
- New translation-capable strings are simply added to `translation/en-us.h`. Simply create an unused enumeration name, and add it.
- Run `python ./json2h.py` from within the `translation/` directory. This will parse the `translation/en-us.h` header file, and update all the other files.
More about `json2h.py`

It goes through four main steps:

- Read the `translation/en-us.h` file, extracting KVPs (key/value pairs) for each string in that file.
- Generate / overwrite `translation/en-us.json` based on the extracted KVPs.
- Generate `translation/base.h` from the template file `translation/base.ht`, adding each of the keys (from the KVPs) to the enumeration type.
- Finally, for each other `translation/*.json` file (except `en-us.json`), generate a corresponding `translation/*.h` file. This loop is based on the extracted KVPs:
  - If the localized `.json` doesn’t have that key, then the value (the en-us string) is added to the localized `.h` file.
  - Else, if the localized `.json` string is identical to the en-us string, it’s added to the localized `.h` file (note: this may be set to `NULL` in the near future).
  - Else, if the localized `.json` lists a different string for that key, that localized string is added to the localized `.h` file.
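As a rough sketch (not the actual `json2h.py` code), the per-key decision in that final loop might look like:

```python
def build_localized_entries(en_us, localized):
    """Decide which string lands in the localized .h file for each en-us key.

    `en_us` and `localized` are dicts as loaded from the respective .json
    files. Simplified sketch of the three cases described above.
    """
    entries = {}
    for key, en_text in en_us.items():
        if key not in localized:
            entries[key] = en_text           # missing: fall back to the en-us string
        elif localized[key] == en_text:
            entries[key] = en_text           # identical (may be stored as NULL later)
        else:
            entries[key] = localized[key]    # translated: use the localized string
    return entries
```

For example, a localized file containing only `{"T_ON": "已开启", "T_OFF": "OFF"}` would produce the translated `T_ON`, the identical `T_OFF`, and the en-us default for every other key.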
How to add an entirely new translation / dialect
- Edit `translation/en-us.h` to add a string for the new dialect.
  The enumeration name MUST begin with `T_CONFIG_LANGUAGE_`.
  For example, to add Esperanto as a new option, it might look like:

  ```c
  [T_CONFIG_LANGUAGE_ESPERANTO] = "Lingvo - esperanto (Mondo)",
  ```

- Review `[T_CONFIG_LANGUAGE]` in `en-us.h`.
  Review the new dialect’s word for “language”.
  If it is significantly different from the existing options, modify the string to include the new translation’s word for “language”.
  For example, if the word for `language` in Esperanto was `XYZZY`, one might change:

  ```c
  [T_CONFIG_LANGUAGE] = "Language / Jezik / Lingua / 语言"
  ```

  to:

  ```c
  [T_CONFIG_LANGUAGE] = "Language / Jezik / Lingua / 语言 / XYZZY"
  ```

  If the word for `language` in Esperanto was instead `Lingo`, then it’s probably fine to leave the existing string due to its similarity with `Lingua` … it’s close enough to be understood.

- Create a new `.json` file in the `translation/` directory, with the base filename corresponding to the IETF language tag for the new dialect.
  - Recommend starting with a copy of `translation/en-us.json`. For example, if adding Esperanto:

    ```
    cp en-us.json eo-001.json
    ```

- Translate or remove the strings in that new `.json` file.
  - There is no requirement to translate all strings.
  - If a string is not translated, simply exclude that entry from the JSON file entirely.
  - Missing entries, or entries storing `null` (the special JSON value), will automatically load the most recent version of the en-US string.

  Made-up words for the example Esperanto translation:

  ```json
  {
      "T_CONFIG_LANGUAGE_ESPERANTO": null,
      "T_CONFIG_PIN": "PyN"
  }
  ```

- Edit `translation/base.ht` and add the new dialect to the enum `T_language_t`.
  - This will become part of the newly generated `base.h` in the next step.
  - The name should be `language_idx_` followed by the IETF language tag, replacing dashes with underscores.

  Continuing this example of adding Esperanto:

  ```c
  language_idx_eo_001, // Esperanto
  ```

- Run `python ./json2h.py` while in the `translation/` directory.
  - This parses `translation/en-us.h` to extract key/value pairs for each string.
  - Generates `translation/en-us.json` (for use as a template for new languages).
  - Generates `translation/base.h` from `translation/base.ht`, automatically creating the enumerations for all the string IDs.
  - Generates a `translation/XX-YY.h` file for each `translation/XX-YY.json` file in the directory (excluding `en-us.json`, of course).

- Manually update the table near the top of `translation/base.c` to include the new `T_language_t` that was added to `base.ht` above, as well as the name of the table in the new dialect’s generated header.
  Continuing the example of adding Esperanto, the added line might be:

  ```c
  [language_idx_eo_001] = &eo_001, // eo-001 aka Esperanto
  ```
OK! It is now possible to differentiate, in a translation JSON file, between entries which are translated (differ from the en-us string), entries which have not been translated yet (no entry in the JSON at all), and entries which have been reviewed and explicitly chosen to revert to en-us (set to the JSON keyword `null`).
See PR #115, in particular the end of the `translation/json2h.py` file, for a commented-out modification that should (in the long run) make translators’ lives easier by listing the “new” strings that need review.
What @lawrence said sounds like how it currently works.
The Python script fills in defaults from English prior to compiling.
It also rebuilds the file of array key defines from the `en_us.h` file, so that developers only have to add items, not update two places every time.
There is some basic documentation on it, and on how to add languages, in the docs. I bookmarked @henrygab’s post and will add the detailed overview the next time I’m doc-ing (with your permission).
There is also a browser app for loading and updating translations from the github repo. It’s simple but I’m proud of its simplicity.
Thank you for putting those changes through, I will check on them.
Wow, 115 looks great. I really like the translation tables.
Almost … there was no distinction in the current method between the following situations (parentheses indicate the new method of differentiation):

- Needs review by translator (e.g., a new string added to `en-us.h`, but the dialect’s `XX-YY.json` is entirely missing the entry)
- String already reviewed by translator, who decided not to translate it / to use the en-us version including updates (e.g., `XX-YY.json` lists the `T_...` enumeration set to the JSON keyword `null`)
- String already reviewed by translator, and the translation happens to match the current en-us text. Updates to the en-us version should NOT be used.
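A hypothetical Python sketch of how the three states above could be distinguished mechanically, given a localized `.json` loaded into a dict:

```python
def review_state(key, localized):
    """Classify a translation key against a localized .json dict.

    Hypothetical helper, mirroring the three situations listed above;
    `localized` is the dict produced by json.load() on XX-YY.json.
    """
    if key not in localized:
        return "needs review"            # entry entirely missing from XX-YY.json
    if localized[key] is None:
        return "reviewed, use en-us"     # explicit JSON null: follow en-us updates
    return "reviewed, translated"        # string present (even if it matches en-us)
```

Note that the third state covers both genuinely translated strings and translations that happen to match the current en-us text.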
Permission implicit with the PR, but explicitly given also.
That’s outstanding. Does the browser app need to be updated (with PR 115) to differentiate between the JSON file having an explicit `null`, vs. having no current translation?
While PR 115 improves things a bit, there’s more that could be done, at the cost of a slightly more complex `json2h.py`.
There is no validation of format string parameters.
If the en-us string is used as a format string (with embedded `%` format codes), the firmware will prepare and use that number and order of parameters. The compiler cannot validate that the parameters match the format string in this situation. Therefore, it’s important to at least validate that, if the en-us string is a format string, the translated string has the same order / types of specifiers.
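One way such a check could work (a sketch, not what `json2h.py` actually does) is to extract the printf-style specifiers from both strings and compare them in order:

```python
import re

# Rough printf specifier pattern: flags, width, precision, length modifier, conversion.
_SPEC_RE = re.compile(r"%[-+ #0]*\d*(?:\.\d+)?(?:hh|h|ll|l|j|z|t|L)?[diouxXeEfFgGaAcspn%]")

def format_specifiers(s):
    """Extract printf-style conversion specifiers, ignoring literal %%."""
    return [m for m in _SPEC_RE.findall(s) if m != "%%"]

def formats_match(en_text, translated_text):
    """True if both strings consume the same specifiers, in the same order."""
    return format_specifiers(en_text) == format_specifiers(translated_text)
```

This catches a translation that drops, reorders, or retypes a specifier, though a real implementation would also want to handle `*` width/precision and emit a useful diagnostic.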
There is no tracking of the en-us string that a given translation was based on.
Thus, if a string was previously translated, but the en-us string then changes, there is no automatic way to detect the need to re-review the translated string.
If each JSON entry were an object, it could store the en-us string that was the basis of the decision (as well as the decision itself, as noted above). This would allow detecting and flagging which strings need to be re-reviewed by translators.
e.g., instead of:

```json
{
    "T_ON": "已开启",
    "T_OFF": "已关闭",
    "T_GND": null,
    ....
}
```

the format could be:

```json
{
    "T_ON": { "t": "已开启", "original": "ON" },
    "T_OFF": { "t": "已关闭", "original": "OFF" },
    "T_GND": { "t": null, "original": "GND" },
    ....
}
```
As you can see, this is still a very simple format, while also being easily extensible.
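With the `original` field stored, flagging stale translations becomes a one-line comparison; a hypothetical sketch:

```python
def needs_re_review(entry, current_en_us):
    """True if the en-us text changed since this translation was last reviewed.

    `entry` is one per-key object in the extended JSON format above,
    e.g. {"t": "已开启", "original": "ON"}.  Hypothetical helper.
    """
    return entry["original"] != current_en_us
```

A script could run this over every key of every localized file against the current `en-us.json` and print the keys whose translations were based on out-of-date source text.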
Many strings are not translated.
Enough said. Is it a goal to translate all terminal UI strings?
Let me know if any of these are of interest, and I’ll open the corresponding github issue to track it.
Yes, it probably should be, once the dust settles. It is the primary way people have created and updated the translations so far.
Your three other points are all valid. I support updates to the JSON, it makes the translation process easier to have the actual string instead of just the tags.
In terms of translating every string - it’s just how far down the rabbit hole you want to go. I generally try to make sure most system wide things are using the translation system, things that won’t change a lot and have global impact. I don’t bother with, especially, little apps because they can evolve so fast that they have a kind of Frankenstein combo of translated and not.
Major improvements are now in PR #190.
It already detected an interesting edge case in the Italian translation … most folks would not realize that `"% A"` would require a double (float) argument, for example … but the automated checks did.
This also detects and tracks the format string specifiers in all translated strings, including keeping track of what it used to be. EN-US adding or removing a format specifier? For now, that’s an error … since it needs careful consideration to avoid breaking translations / crashing when using mismatched translation.
Thus, changes to existing EN-US strings are also (temporarily) prevented. New strings are fine. This prevents changing an EN-US string that has already been localized, which (for now) would invalidate the translation already done … in a way that we cannot easily discover.
Next step is to modify the translation JSON files, so they include additional fields. Example:
```json
{
    "T_EXIT": { "translation": "Esci", "based_on_en_us": "exit" },
    "T_GND": { "translation": null, "based_on_en_us": "GND", "comments": "Universal term" }
}
```
Directly including the original term will avoid all the terrible hacks I came up with to try to guess what version / commit / etc. a translated string was based on. With the above, it’s trivial to determine if a translated string may need to be updated … and allows changing EN-US strings again.
So much more coming down the pipe.
Quick question - I’m doing some work and I want to change the English text of some existing items (not adding or removing items).
The Python script gives me an error that `en-us.h` does not match `en-us.json`. I don’t want to mess things up too badly, so what’s the best way to handle this case?
Maybe create a new language xx-xx and use that?