R Can Use Your Help: Translating R Messages

If you use R you may have wondered if there are ways you can contribute to making R better. An important feature of R that encourages its use around the world is the support for localization. This enables R’s messages, warnings and errors, as well as menu labels in the Windows and Mac OS GUIs, to be shown in the user’s local language.

Localization relies on translations that are contributed and maintained by volunteer translation teams. We recently ran a series of Collaboration Campfires, where we explored what motivates people to contribute translations, the current status of translations in R and how people can get involved. In this post we share the insights from those sessions.

Why get involved in translation?

An obvious reason for helping to translate R messages is that it makes R more accessible to non-English speakers, especially in communities where a working knowledge of English is uncommon. If you are a package developer, it is useful to learn about the translation infrastructure, because the same infrastructure can be used to add translations to your own package(s). Indeed, the same infrastructure is used by other open source projects. Contributing translations to the R project is a good starting point for learning about and contributing to the development of R more widely. As with any open source contribution, there is the benefit of building your knowledge and your network as you interact with other developers. It can also be a nice addition to your CV/resume!

Current status of translations in R

The translations of R messages are stored in PO files, a plain-text file format for use with the GNU gettext software. The source code for each package in base R has a po directory which contains the PO files. There are up to three PO files for each package: one for messages contained in the R code, one for messages contained in the C code, and for the base package only, one for text displayed in the Windows GUI.

Therefore we can explore the current status of R by extracting the PO files from the 14 packages in base R (some translation teams also provide translations for the Recommended packages, Mac OS GUI, and the Windows installer, which we don’t consider here). For each message, we can determine whether a translation is available for a particular language and if so, whether it is up-to-date or if the message has changed since it was translated, i.e. the translated message is “fuzzy”.

Stacked bar chart of translation status in base and default packages (12 July 2022). Languages are ordered by number of correctly translated messages plus the number of fuzzy translations, out of a possible 5862. French: 5760 + 48; Italian: 5764 + 40; Russian: 5693 + 74; Lithuanian: 5677 + 86; German: 5371 + 226; Polish: 5370 + 226; Traditional Chinese: 5368 + 220; Japanese: 5005 + 361; Simplified Chinese: 4789 + 528; Korean: 4487 + 199; Brazilian Portuguese: 2835 + 1209; Norwegian Nynorsk: 1866 + 539; Turkish: 1846 + 532; Danish: 1778 + 471; Spanish: 1381 + 540; Persian: 266 + 9; British English: 11 + 18, English: 1 + 0.

We can see that there are a few languages with near-complete, correct translations: French, Italian, Russian, and Lithuanian. Then there is a group of languages with slightly lower coverage and a higher proportion of fuzzy messages: German, Polish, Chinese (Traditional), Japanese, Chinese (Simplified) and Korean. There is a third group with only about a third of the messages translated: Norwegian (Nynorsk), Turkish, Danish and Spanish. Only the GUI messages have been translated into Persian. The standard English messages have been translated into British English for a few cases in the base and grDevices packages. Finally there is one message in standard English that adds information about the locale to the startup message in R.

The metadata in the PO files includes both the date that the English messages were last updated and the date the translation was last updated. The plot below represents the last translation date as a lag time in years from the last message update, for each PO file.

Scatterplot of lag in translation (years) for each PO file by language (12 July 2022). Languages are ordered by decreasing mean lag. French: mostly zero with a few lags of 1 or 2 years. Lithuanian: all zero (less than a year). Russian: mostly zero with spread from -1 to 2. Traditional Chinese and Korean: scatter spread from 0 to -3. Polish: only a couple of data points at -3. German: scatter spread from 0 to -10. Spanish: only a few data points: 0, -9, -9. Japanese: bulk of scatter from -5 to -8. Danish: scatter from -6 to -9. Persian: only a few data points at -9. Brazilian Portuguese: spread from -7 to 10. Norwegian Nyorsk and Turkish: main scatter at -10. Simplified Chinese: main scatter from -13 to -16. British English: -13, -16, -16. English: single PO file with lag of -13.

The last translation date was missing for some of the PO files, in particular, no dates were available for Italian. However, the plot shows a clear correspondence to the previous plot - the languages with higher translation coverage have been updated closer to when the English messages were last updated. The languages with poor coverage have not been updated for at least 5 years prior to the time the English messages were last updated. For Chinese (Simplified) most of the files have not been updated for at least 10 years.

For the languages with lower coverage, we can explore the choices translation teams have made regarding which messages to prioritize for translation. The plot below compares the coverage by package for the languages with lower coverage:

Translation status for low coverage languages (12 July 2022). Stacked bars show the number of correctly translated messages plus the number of fuzzy messages. 
base (2489 messages): Norwegian Nyorsk: 1764 + 519; Brazilian Portuguese: 1702 + 559; Spanish: 1285 + 515; Turkish: 1391 + 392; Danish: 1226 + 157. 
compiler (38 messages): Brazilian Portuguese and Danish: 29 + 3.
graphics (283 messages): Brazilian Portuguese: 217 + 45; Norwegian Nyorsk and Turkish: 102 + 20; Spanish: 96 + 25; Danish: 12 + 5.
grDevice (336 messages): Danish: 183 + 83; Brazilian Portuguese: 106 + 59.
grid (21 messages): Brazilian Portuguese: 44 + 133; Danish: 12 + 76. 
methods (436 messages): Brazilian Portuguese: 51 + 229; Danish: 26 + 28. 
parallel (59 messages): Danish: 32 + 13. 
splines (26 messages): Brazilian Portuguese and Danish: 16 + 5.
stats (1003 messages): Brazilian Portuguese: 603 + 142; Turkish: 341 + 116; Danish: 100 + 48. stats4 (8 messages): Brazilian Portuguese, Turkish and Danish: 3 + 1. 
tcltk (40 messages): Brazilian Portuguese and Danish: 37 + 1. 
tools (460 messages): Danish: 61 + 20; Brazilian Portuguese: 21 + 14; Turkish: 2 + 0. 
utils (472 messages): Danish: 41 + 31; Brazilian Portuguese: 5 + 18; Turkish: 7 + 3.

Norwegian and Spanish translations are only available for the base and graphics packages. Turkish translations cover a few more packages, including about half of the messages in the stats and stats4 packages. Brazilian Portuguese and Danish translations are available for all packages (there are no messages to translate in the datasets package), but for several packages the proportion of translated messages is very low (less than a quarter).

How can you help?

Clearly there is a lot of scope for people to contribute new translations of messages, or to update translations that are no longer correct. The first step is to learn more about how translations are added to R packages. We recommend starting with the Translating R to your Language tutorial from useR! 2021 - you can watch the video and/or read the slides.

Once you have a basic understanding of the process, find the contact person for the language(s) you can contribute to on the Translation Teams page. If your language is not there, or the team requires a new maintainer, post a message on the #core-translations channel of the community-run R Contributors Slack or on the R-Devel mailing list to offer your help.

The maintainer or help channels should be able to tell you how to contribute for a specific language. Some translation teams maintain translations on GitHub, e.g. Italian and French. The Hungarian team is trialing a Weblate server that allows translations to be contributed via a browser. Several contributors involved in translation are active on the #core-translations channel of the R Contributors Slack, so that is a good place to ask any questions on how to get started or deal with any issues you encounter as you start contributing.

Further Resources

There are number of works in progress that will provide additional support in future:


Thanks to the participants of the Collaboration Campfire “Explore R’s Process for Localization (Translation)” that provided ideas and draft visualizations for this blog post: Shimelis Abebe Tegegn, Iman Al Hasani, Michael Blanks, Michael Chirico, Toby Dylan Hocking, Pawan Jangra, Ella Kaye, Piyush Kumar, Beatriz Milz, Kozo Nishida, Lucy Njoki Njuki, Riva Quiroga, Marcel Ramos, and Ben Ubah.