Improvements in handling bytes encoding



In R, a string can be declared to be in bytes encoding. According to ?Encoding, it must be a non-ASCII string which should be manipulated as bytes and never converted to a character encoding (e.g. Latin 1, UTF-8). This text summarizes recent improvements in how R handles bytes encoded strings and provides some thoughts about what they should and shouldn’t be used for today.

Character vector, string and encoding

Particularly for readers not familiar with R, it may be useful to highlight how strings are supported in the language. A character vector is a vector of strings. Like any vector, it may have length zero or more, and its elements may be NA. The string type is not visible at the R level, but a single string is represented using a character vector of length one, with that string as the element.

A string literal, such as "hello", is a character vector of length one:

> x <- "hello"
> length(x)
[1] 1
> x[1]
[1] "hello"

Similarly, there is no type in R to hold a single character. One may extract a single character using e.g. a substring function, but such a character would be represented as a string (so a character vector of length one with that single-character string as its element):

> substring(x, 1, 1)
[1] "h"

Strings are immutable: they cannot be constructed incrementally, e.g. by filling in individual bytes or characters as in C. Creating a string is potentially an expensive operation: strings are cached/interned and some of their properties are examined and recorded. Currently, it is checked and recorded whether the string is ASCII.

Encoding information is attached to the string, so one character vector may contain strings in different encodings. Supported encodings are currently “UTF-8”, “latin1”, “bytes” and “native/unknown” (more about which comes later).
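
For illustration, the declared encoding of each string in a character vector can be inspected with Encoding() (here in a UTF-8 session):

> x <- c("hello", "caf\u00e9")
> Encoding(x)
[1] "unknown" "UTF-8"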

Functions accepting character vectors handle strings according to their encoding. E.g. substring counts in bytes for bytes encoded strings, but in characters for character strings (“UTF-8”, “latin1” and “native/unknown”). Not all functions support bytes encoded strings, e.g. nchar(,type="chars") is a runtime error, because a bytes encoded string has no characters.
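
For illustration (in a UTF-8 session; the Encoding()<- assignment is only used here to construct a bytes encoded string, and the error message is elided):

> u <- "caf\u00e9"            # 4 characters, 5 bytes in UTF-8
> substring(u, 4, 4)          # counted in characters
[1] "é"
> b <- u
> Encoding(b) <- "bytes"      # the same bytes, now treated as "bytes"
> substring(b, 4, 5)          # counted in bytes
[1] "\xc3\xa9"
> nchar(b, type = "bytes")
[1] 5
> nchar(b, type = "chars")    # runtime error
Error: ...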

Functions have to deal with the situation when different strings are in different encodings. Individual functions differ in how they do it, but often character strings are converted to a single character encoding, usually UTF-8, and when that happens, any newly created result strings are also in UTF-8. The user doesn’t have to worry as long as the strings are valid, because they can always then be represented in UTF-8.

This is more complicated with bytes encoded strings, which cannot be converted to a character encoding. Some functions, such as gsub or match, switch to a different mode if any of the input strings is bytes encoded. In this mode, they ignore encodings of strings and treat all of them as byte sequences. As discussed later, this only makes sense in certain situations.

Bytes encoded strings are not byte arrays

From the above, it is clear that bytes encoded strings are not like byte arrays in Java or Python or char arrays in C, because one cannot refer to the individual bytes in them. Also, one cannot modify individual bytes using the [] operator.

There are additional differences. The zero byte is reserved and cannot be included in any string. Also, every bytes encoded string must contain at least one byte of value greater than 127, because the string must be non-ASCII. ASCII strings are always encoded as “native/unknown” (and while encoding flags can sometimes be manipulated, this rule cannot be violated). It will become clearer later that this is due to identity/comparison of ASCII strings.
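
For example, an attempt to mark an ASCII string as “bytes” is simply ignored:

> x <- "hello"
> Encoding(x) <- "bytes"
> Encoding(x)                 # the flag is not changed for an ASCII string
[1] "unknown"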

So, bytes encoded strings are not usable to represent binary data. Instead, there are raw vectors in R for that. Elements of a raw vector are arbitrary bytes (including zero) and can be indexed and mutated at R level using []. They don’t work like strings, aren’t printed as strings and aren’t supported by string functions.
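
A small sketch of working with raw vectors (in a UTF-8 session; the invalid byte in the resulting string is printed escaped):

> r <- as.raw(c(0x68, 0x00, 0xff))   # arbitrary bytes, including zero
> r
[1] 68 00 ff
> r[2] <- as.raw(0x69)               # individual bytes can be modified using []
> r
[1] 68 69 ff
> rawToChar(r)                       # conversion to a string (would fail with a zero byte)
[1] "hi\xff"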

Encoding-agnostic string operations

Particularly in the past, when there were only single-byte encodings, it made sense to think of encoding-agnostic string operations. Not only because the input encoding sometimes wasn’t reliably known, but also because old code that was not encoding aware, or not aware of newer encodings, could possibly be re-used. Also, there were many different encodings in use.

When all strings are in the same (stateless) single-byte encoding, one can concatenate them without knowing the encoding, and one can do search/replace. If they are all supersets of ASCII (all encodings supported by R are), one can even parse a language that is all-ASCII, including trivial parts such as splitting lines and columns.

People sometimes had input files in a truly unknown encoding (the provider of the file didn’t say). And as long as most of the bytes/characters were ASCII, many things could be done at the byte level, ignoring encodings.

A concrete example that still exists in today’s R is the package DESCRIPTION file. The file may be in different encodings, but the encoding is defined in a field named Encoding: inside that file. The file can even have records in different encodings, each with its own Encoding: field. Parsing such a file in R requires some encoding-agnostic operations: one doesn’t know the encoding in advance of reading the file.

With multi-byte encodings, things are much more complicated and encoding-agnostic operations no longer really make sense. Still, UTF-8 allows some of them, to the point that it is supported in DESCRIPTION files. UTF-8 is ASCII safe: a multi-byte character is encoded using only non-ASCII bytes, so all ASCII bytes represent ASCII characters. Also, in UTF-8, searching can be based on bytes: the byte representation of a multi-byte character doesn’t include the byte representation of another character. Still, Debian Control Files (DCF), which DESCRIPTION files are based on, currently do not allow defining the encoding inside them; today they are required to be in UTF-8. It would make sense to eventually move to UTF-8 in DESCRIPTION files as well.

Even with UTF-8, though, some basic encoding-agnostic operations are not possible, as characters may be represented by multiple bytes. Other multi-byte and particularly stateful encodings make encoding-agnostic operations on the byte stream impossible.

The current trend seems to be that files must be in a defined known encoding (known without parsing text of the file), and often this encoding is known implicitly as it is required to be UTF-8.

Still, to support old-style files, such as the current DESCRIPTION (or e.g. old LaTeX), byte-based encoding-agnostic operations are needed, and bytes encoded strings are the right tool for that in R.

“unknown” encoding is not suitable for encoding-agnostic operations

R has an encoding referred to as “unknown” (see e.g. ?Encoding).

In most parts of R today, strings in this encoding are expected to be valid strings in the native encoding of the R session, and this is why I used “unknown/native” elsewhere in this text. Any encoding conversion (typically to UTF-8) relies on this. If it doesn’t hold, there is an error, a warning, substitution of invalid bytes, etc., depending on the operation.
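
For example (a sketch assuming a UTF-8 session; the exact error message depends on the operation and the R version):

> x <- "a\xffb"               # contains a byte invalid in UTF-8, flagged "unknown"
> Encoding(x)
[1] "unknown"
> nchar(x)                    # the string is assumed to be valid in the native encoding
Error in nchar(x) : invalid multibyte string, element 1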

Such string conversions may happen at almost any time internally, without direct control of the user, so using “unknown/native” strings to perform encoding-agnostic operations is brittle and error prone. It is still sometimes possible, as string validity is currently not checked at creation, but it is not impossible that this will be turned into an error in the future, as invalid strings are often simply created by user error.

Bytes encoded strings, instead, are safe against accidental conversion as by design/definition they cannot be converted to a character string.

For completeness, it should be said that some parts of R allow for a certain uncertainty in strings with “unknown/native” encoding. They are meant to be valid strings in the native encoding, but the idea is that this is not fully trusted unless it is confirmed by an explicit declaration from the user (some functions allow marking such strings UTF-8 or Latin 1) or by a successful conversion to a different encoding. Still, whenever the string encoding is actually needed, it is expected to be the native encoding, and if it is not, there is an error, warning, substitution, transliteration, etc.

In the past, and until recently on Windows, the native encoding was often single-byte (often Latin 1), so conversions did not detect invalid bytes as often as now, and the results were often acceptable for the reasons described above. Now, when the native encoding is mostly UTF-8, where many byte values cannot be the lead byte, conversions more often detect invalid bytes in old single-byte encoded files.

Particularly in the past, another bit of uncertainty was what actually was the native encoding and, even today, finding out is platform specific. So, strings were assumed to be in the native encoding, but it was sometimes unknown what that encoding actually was.

Finally, while it is discouraged, the R session encoding can be changed at runtime. This makes the existing strings in “native/unknown” encoding invalid, or in other words, it is then no longer known which strings are in which encoding.

I think that all these sources of uncertainty are becoming of less concern today and that the “unknown” encoding should be understood as “native”, and all strings marked with that encoding should be valid in it. The R session encoding should never be changed at runtime. On recent Windows, it should never be changed at all (it should be UTF-8, because that is the build-time choice for the system encoding of R on Windows). Definitely, the “unknown” encoding should not be used for encoding-agnostic operations: we have the bytes encoding for that.

Limitations of “bytes” encoding implementation

It turned out that the existing support for the “bytes” encoding had several limitations, which have recently been fixed.

First, it wasn’t possible to read lines from a text file (such as DESCRIPTION) as strings in “bytes” encoding. One would normally read using readLines without specifying an encoding, and then mark as “bytes” using Encoding(x) <- "bytes", but that approach uses invalid strings, because for a short while the strings are marked as “native/unknown”. This has been improved and now one can use readLines(,encoding="bytes") to read lines from the file as “bytes”. Indeed, this assumes that line separators have that meaning (which must be the case for encoding-agnostic operations).
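
A sketch of both approaches (“DESCRIPTION” here stands for any file whose encoding is not known before reading it):

## previous approach: the strings are briefly flagged "unknown/native"
x <- readLines("DESCRIPTION")
Encoding(x) <- "bytes"

## improved approach: the lines are read directly as "bytes"
x <- readLines("DESCRIPTION", encoding = "bytes")
Encoding(x)    # non-ASCII lines are "bytes", ASCII-only lines remain "unknown"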

Then, there was a problem with the regexp operations gsub, sub and strsplit. These operations sometimes create new strings, by substitution or splitting, and the question is what encoding these strings should have. When any of the inputs is encoded as bytes, these operations “use bytes” (work at the byte level). But, for historical reasons, they used to return these new strings as “unknown/native”.

Hence, by processing an input line from, say, a DESCRIPTION file, represented as a bytes encoded string, one could get an invalid “native/unknown” string, which could then be corrupted by accidental conversion to some other encoding. One would have to always change the encoding of the result of every single regexp operation to “bytes”, but that is inconvenient and sometimes cannot easily be done by the user, e.g. when calling a function that isn’t doing it (e.g. trimws, which may apply two regexp operations in sequence).

These functions were changed to mark the newly created strings as bytes when at least one of the inputs is marked as bytes. It should be said that while the regexp functions allow mixed-encoding use, only a small subset of that makes any sense. Either all inputs are in a character encoding (so convertible to UTF-8), and then the results will also be in a character encoding. Or all inputs are bytes encoded or ASCII, and then the results will also be bytes encoded or ASCII. Mixing bytes encoded and other non-ASCII strings doesn’t make sense.
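
A sketch of the corrected behavior (the Encoding()<- assignment is only used to construct a bytes encoded input; before the change, the result below was flagged “unknown/native”):

x <- "Author\xff: some value"    # a hypothetical field with a non-ASCII byte
Encoding(x) <- "bytes"
y <- sub(":.*", "", x)           # sub works at the byte level for bytes encoded input
Encoding(y)                      # "bytes": the newly created string keeps the marking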

useBytes=TRUE in regexp operations and type instability

Now, a natural question is whether we shouldn’t also do this whenever useBytes=TRUE: that is, whether the newly created strings, or possibly all returned strings, should be marked as bytes.

This has been tried in R-devel but reverted for further analysis as it broke too much existing code. I first wanted to mark only the newly created strings as bytes (because we haven’t changed the old ones, so why forget their encoding). This would conceptually make sense, but it broke this pattern in user code:

xx <- gsub(<something_strange>, "", x, useBytes = TRUE)
stopifnot(identical(xx, yy))

The pattern removes “something strange” from an input text in a character encoding. With the attempted change, when a replacement happens, the result element is bytes encoded (before the change it was “unknown/native”). When no replacement happens, the element keeps the original character encoding of x. However, a bytes encoded string is never treated as identical to a string in a character encoding. So, the change introduced type instability (character vs bytes encoding) where there wasn’t one before, and tests started failing. I tried to fix this by making all strings returned by the function bytes encoded, but while “stable”, it broke even more code, because it ended up passing bytes encoded strings to string functions that did not (and some could not) support them.
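
The identity rule itself can be illustrated as follows (both strings below contain the same bytes):

> a <- "caf\xc3\xa9"
> Encoding(a) <- "bytes"       # the UTF-8 bytes of "café", flagged "bytes"
> b <- "caf\u00e9"             # the same bytes, declared UTF-8
> identical(a, b)
[1] FALSE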

Earlier, I wrote that using a mixture of bytes and character encoded non-ASCII strings on input doesn’t make sense. useBytes = TRUE with inputs in multiple different character encodings doesn’t make sense either, for the same reasons (simply, the bytes in different inputs mean different things). But useBytes = TRUE has historically been used, as in this pattern, to achieve some level of robustness against invalid UTF-8 input strings. This works with a subset of regular expressions on UTF-8 inputs with some invalid bytes.

Being able to process UTF-8 with invalid bytes is a useful feature, e.g. when processing textual logs from multiple parallel processes without proper synchronization: multi-byte characters may not be written atomically. While PCRE2 today has better support for invalid bytes in UTF-8 strings, R doesn’t yet provide access to it. Indeed, for some applications, one could simply substitute invalid bytes using iconv and get a valid UTF-8 string to process.

It should be noted here that the “bytes” encoding (and also the character encodings) already has another type instability with respect to ASCII. If an operation on a bytes encoded string, say, extracts some parts of the string or otherwise processes them, the result may be bytes encoded (when it has at least one non-ASCII byte) or “native/unknown” (when it is ASCII). substring is a trivial example. Hence, results of string operations should already be treated with some type instability in mind.
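
For example (again, Encoding()<- is only used to construct the bytes encoded input):

> x <- "abc\xff"                  # three ASCII bytes and one non-ASCII byte
> Encoding(x) <- "bytes"
> Encoding(substring(x, 1, 3))    # the extracted part is pure ASCII
[1] "unknown"
> Encoding(substring(x, 1, 4))    # the extracted part is non-ASCII
[1] "bytes"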

It would seem the pattern above could be handled by (WARNING: this doesn’t work, see below):

xx <- gsub(<something_strange>, "", x, useBytes = TRUE)
xx <- iconv(xx)
stopifnot(identical(xx, yy))

which would re-flag bytes encoded elements of xx as “unknown/native” and convert elements in a character encoding to “unknown/native” as well. But, this has two problems. The first is that some of the input characters may not be representable in the “unknown/native” encoding (on old systems where UTF-8 is not the native encoding). That could be solved by using xx <- iconv(xx, to="UTF-8").

But there is another problem: iconv(,from="") historically ignores the encoding flag of the input string and always converts from the “unknown/native” encoding, so it misinterprets strings in other encodings.

This behavior of iconv has been changed. Now, the encoding flag of the input string takes precedence if it is UTF-8 or Latin 1. This is a change to the documented behavior, but in principle it could only break code that used to depend on using invalid strings. Checking all of CRAN and Bioconductor packages revealed that only one package started failing after the change, and that was actually a good thing because the package had an error; it worked by accident with the old behavior.

I believe that when considering using useBytes = TRUE, it should primarily be decided whether invalid inputs need to be supported at all; in many applications they probably don’t, but in some they do. If they do, one should, I think, first consider whether substituting invalid bytes using iconv(,sub=), to get valid UTF-8 input, would be acceptable. If so, that is the simplest, most defensive and future-compatible option for accepting invalid strings.
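
A sketch of that option, assuming the inputs are meant to be UTF-8 text possibly containing some invalid bytes (with sub = "byte", each invalid byte is replaced by its hex code in angle brackets; sub = "" would remove such bytes):

x <- "valid \xff text"                          # hypothetical input with one invalid byte
xx <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
# xx is now "valid <ff> text", a valid UTF-8 string
grepl("text$", xx)                              # safe to use with regular expressions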

Only if that is not acceptable and useBytes = TRUE with regular expressions is to be used should the code handle the type instability of getting results in either the “bytes” or the “native/unknown” encoding, as discussed above. The documentation of the regexp operations has been updated to make it explicit that in some cases it is unspecified whether the results will be “bytes” or “unknown/native” encoded (before, it was unspecified only indirectly). Code should be made robust against possible changes within this range (which may be the result not only of cleanups, but also of performance optimizations or refactorings to support new features). Once R gets safer regexp support for handling invalid UTF-8 inputs, such code may have to be updated anyway.

I would not consider using useBytes = TRUE in regexp operations for any other reason, because of not only the type instability, but also the limitations on the regular expressions that may be used. In the past, this has been done for performance, but the performance of regexp operations was improved recently for this reason (see this blog post). It also used to be done when the support for handling UTF-8 strings in R was limited, but that should no longer be a valid reason.

Summary

The bytes encoding in R is a somewhat unusual feature, which is suitable for encoding-agnostic operations at the byte level.

It allows such operations to be performed safely. Unsafe alternatives used in the past included using invalid strings in the “unknown/native” encoding, sometimes together with changing the R session locale, but these lead to wrong results (due to accidental transliteration and substitution) or to warnings and errors. The unsafe alternatives are also only possible because R tolerates the creation of invalid strings, which in turn hides errors in user code and packages that could otherwise be detected by checking string validity at string creation time.

Recent improvements in R made it easier to use the bytes encoding for encoding-agnostic operations at the byte level, when they are needed. This text, however, also argues that encoding-agnostic operations should not be much needed in the future, when encodings are properly supported and known (and ideally/mostly UTF-8).

Providing safe alternatives to the unsafe operations now done with the “native/unknown” encoding, in the form of the bytes encoding improvements, better support for regular expressions on invalid UTF-8 inputs and regular expression speedups, should allow encoding bugs which now cause incorrect results or errors to be detected better, and should also allow the encoding support in R to be simplified in the future. Since UTF-8 is now the native encoding also on recent Windows, it should eventually be possible to have UTF-8 as the only character encoding supported in R.