Exploring UTF-8 Character Encoding Problems with R

In my work, I’ve come across a lot of character encoding problems. They usually show up as odd character groupings that always seem to start with â. I wanted to see whether I could reproduce the problem in R and investigate it.

UTF-8 seems pretty simple because the ASCII characters, those between 0 and 0x7f, are represented as-is in UTF-8. That seems fine, but a lot of old character sets like Windows 1252 also use a single byte for characters above 0x7f, all the way up to 0xff.

The problem is that UTF-8 encodes characters above 0x7f as multi-byte sequences. For example, the Unicode character 0x20ac (a value that fits in two bytes), which is the Euro symbol, is represented as three bytes in UTF-8: 0xe2, 0x82, and 0xac.
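A quick way to see this in base R, as a minimal sketch: intToUtf8 builds a character from its code point and charToRaw shows the bytes that make up its UTF-8 form. The variable names are only for illustration.

```r
# Build the Euro symbol from its code point and look at its UTF-8 bytes.
euro <- intToUtf8(0x20ac)
euro_bytes <- charToRaw(euro)
print(euro_bytes)    # [1] e2 82 ac

# An ASCII character, by contrast, is stored as-is in a single byte.
ascii_bytes <- charToRaw("A")
print(ascii_bytes)   # [1] 41
```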

This encoding causes some issues where files appear to have gibberish characters. Stuff like this:

  • â€˜ instead of an open single quote (‘)
  • â€œ instead of an open double quote (“)

I was hoping to find an easy way to reproduce the problem so I could study it. Fortunately, R made that easy.

I used the str_conv function from the stringr package, intending to convert a single open quote to the Windows 1252 character set. In Windows 1252 the single open quote is 0x91. In Unicode, the single open quote is 0x2018.

library(stringr)
char <- str_conv("‘", "windows-1252")

So what’s in char? It’s: â€˜

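What is char in hex? Looking at the raw bytes (for example with charToRaw(char)) gives:

[1] c3 a2 e2 82 ac cb 9c

For reference, the session locale (as reported by, for example, Sys.getlocale()) was:

[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"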

How did R go from a single open quote character to c3 a2 e2 82 ac cb 9c? What is c3 a2 e2 82 ac cb 9c anyway?

Digression on How UTF-8 Encodes Values above 0x7f

UTF-8 encodes characters above 127 (0x7f) using a somewhat complex system of prefix bits. To represent Unicode characters from 0x80 to 0x7ff, the character is encoded as two bytes like the following:

110xxxxx 10xxxxxx

On the other hand, for the Unicode characters 0x800 to 0xffff the character is encoded as three bytes as follows:

1110xxxx 10xxxxxx 10xxxxxx

Where the x’s represent the binary bits of the to-be-encoded character.
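The two patterns above can be sketched as a small encoder in base R. This is only an illustration under stated assumptions: utf8_encode is a hypothetical helper name, it handles code points up to 0xffff only, and it does not reject the surrogate range. Integer division and remainder by 64 stand in for shifting and masking six bits at a time.

```r
# Hypothetical helper: encode one code point (0 to 0xffff) as UTF-8 bytes.
utf8_encode <- function(cp) {
  if (cp <= 0x7f) {
    bytes <- cp                               # 0xxxxxxx
  } else if (cp <= 0x7ff) {
    bytes <- c(0xc0 + cp %/% 64,              # 110xxxxx
               0x80 + cp %% 64)               # 10xxxxxx
  } else if (cp <= 0xffff) {
    bytes <- c(0xe0 + cp %/% 4096,            # 1110xxxx
               0x80 + (cp %/% 64) %% 64,      # 10xxxxxx
               0x80 + cp %% 64)               # 10xxxxxx
  } else {
    stop("code points above 0xffff are not handled in this sketch")
  }
  as.raw(bytes)
}

print(utf8_encode(0xe2))     # [1] c3 a2
print(utf8_encode(0x20ac))   # [1] e2 82 ac
```

The results can be compared against R's own encoder with charToRaw(intToUtf8(cp)).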

For example, the first two bytes above, c3 a2, are the UTF-8 encoding of the Unicode character e2, which is â. To decode them, remove the leading “110” from the binary representation of c3 and the leading “10” from the binary representation of a2, then mash the remaining bits together.

In other words, c3 is 11000011, so removing the “110” from the beginning gives “00011” as the first bits. For a2, which is 10100010, remove the beginning “10” and get “100010” in binary. Mash them together and get “000 1110 0010” in binary. This is e2.

So e2 is represented in UTF-8 as c3 a2. What’s e2 in Unicode? Latin Small Letter A with circumflex.
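The same bit manipulation can be followed literally in base R by working on strings of 0s and 1s. This is a rough sketch rather than a practical decoder; byte_to_bits is a hypothetical helper introduced only for this example.

```r
# Hypothetical helper: render one byte (0-255) as an 8-character bit string.
byte_to_bits <- function(byte) {
  bits <- as.integer(intToBits(as.integer(byte)))[1:8]  # least significant bit first
  paste(rev(bits), collapse = "")
}

first  <- byte_to_bits(0xc3)   # "11000011"
second <- byte_to_bits(0xa2)   # "10100010"

# Drop the "110" prefix (3 bits) and the "10" prefix (2 bits), then join the rest.
payload <- paste0(substring(first, 4), substring(second, 3))   # "00011100010"

# Read the joined bits back as a number.
code_point <- strtoi(payload, base = 2)
print(sprintf("%x", code_point))   # [1] "e2"
```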

The second UTF-8 encoded character of char is e2 82 ac. Representing that in binary is

11100010 10000010 10101100

So removing all of the prefix bits (1110, 10, and 10), we get

0010, 000010, and 101100, which join to 0010 0000 1010 1100, or 20ac in hex. 20ac is the code point for the Euro symbol.

The last character is cb 9c, which is the UTF-8 encoding of 0x02dc for small tilde.

Or, working it out, we have 11001011 10011100. Removing the 110 prefix from the first byte and the 10 prefix from the second gives:

010 1101 1100 or 0x02dc
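As a cross-check on the hand decoding, base R can decode all seven bytes at once. A minimal sketch: utf8ToInt treats its input as UTF-8 and returns the code points, and sprintf formats them as hex.

```r
# The seven bytes that were found in char.
bytes <- as.raw(c(0xc3, 0xa2, 0xe2, 0x82, 0xac, 0xcb, 0x9c))

# Decode the bytes as UTF-8 and list the resulting code points in hex.
code_points <- utf8ToInt(rawToChar(bytes))
hex <- sprintf("%04x", code_points)
print(hex)   # [1] "00e2" "20ac" "02dc"
```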

So the decoded code points of the three characters are 0xe2, 0x20ac, and 0x02dc. Representing those characters in the Windows 1252 character set gives, instead, the bytes 0xe2, 0x80, and 0x98. That is, a with circumflex, the Euro symbol, and the small tilde in Windows 1252. But e2 80 98 is also the UTF-8 encoding of the Unicode character 0x2018, the open single quote.
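This coincidence can be checked with base R's iconv. The sketch below assumes the platform's iconv accepts the name "windows-1252" (some builds may only know it as "CP1252").

```r
# The three characters: a with circumflex, Euro symbol, small tilde.
chars <- intToUtf8(c(0xe2, 0x20ac, 0x2dc))

# Their single-byte representation in Windows 1252.
cp1252_bytes <- charToRaw(iconv(chars, from = "UTF-8", to = "windows-1252"))
print(cp1252_bytes)   # [1] e2 80 98

# Reading those same three bytes as UTF-8 yields a single code point.
reread <- utf8ToInt(rawToChar(cp1252_bytes))
print(sprintf("%04x", reread))   # [1] "2018"
```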

Here are the steps to reverse the process in R, which helps to understand what happened in the first place:

library(stringr)
library(stringi)
char <- str_conv("‘", "windows-1252")
charToRaw(char)
[1] c3 a2 e2 82 ac cb 9c
stri_conv(char, from="utf-8", to="windows-1252")
[1] "\\xe2\\x80\\x98"
stri_conv(stri_conv(char, from="utf-8", to="windows-1252"),
     from="utf-8", to="windows-1252")
[1] "\\x91"
stri_conv(stri_conv(stri_conv(char, from="utf-8", to="windows-1252"),
     from="utf-8", to="windows-1252"),
     from="windows-1252", to="utf-8")
[1] "‘"

Narrating what the above R code does:

  1. Convert from UTF-8 to Windows-1252. That is, 0xc3a2 0xe282ac 0xcb9c becomes 0xe2 0x80 0x98.

  2. Convert the result from #1 from UTF-8 to Windows-1252 again. That is, 0xe2 0x80 0x98 becomes 0x91 (the Windows 1252 open single quote).

  3. Convert the result from #2 from Windows-1252 to UTF-8. That is, the byte 0x91 becomes the Unicode character 0x2018, stored in UTF-8 as e2 80 98.
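Since steps 2 and 3 cancel each other out at the byte level, the repair might also be sketched in base R as a single conversion followed by re-marking the result as UTF-8. This is an assumption-laden sketch, not a general fix: fix_mojibake is a hypothetical name, the platform's iconv must accept "windows-1252", and text whose original UTF-8 contained one of the five bytes that Windows 1252 leaves undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d) will not round-trip.

```r
# Hypothetical helper: undo one round of "UTF-8 read as Windows 1252".
fix_mojibake <- function(x) {
  # Map each garbled character back to the single byte it came from.
  bytes <- iconv(x, from = "UTF-8", to = "windows-1252")
  # Those bytes are the original UTF-8, so just mark them as such.
  Encoding(bytes) <- "UTF-8"
  bytes
}

garbled <- rawToChar(as.raw(c(0xc3, 0xa2, 0xe2, 0x82, 0xac, 0xcb, 0x9c)))
fixed <- fix_mojibake(garbled)
print(charToRaw(fixed))   # [1] e2 80 98
```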

And to understand what happened in the first place, the steps need to be reversed:

The code char <- str_conv("‘", "windows-1252") took the Unicode character 2018 (open single quote), which is stored in UTF-8 as e2 80 98, interpreted those three bytes as three Windows-1252 characters, converted them to the Unicode code points e2, 20ac, and 2dc, and encoded those as UTF-8, which is c3 a2 e2 82 ac cb 9c. So it did exactly what I asked even if it was somewhat unexpected.
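The same sequence of events can be traced step by step in base R, without stringr. A minimal sketch, again assuming the platform's iconv knows the name "windows-1252":

```r
# Step 1: the open single quote as stored in UTF-8.
quote_bytes <- charToRaw(intToUtf8(0x2018))
print(quote_bytes)   # [1] e2 80 98

# Step 2: treat each of those bytes as a Windows 1252 character
# and convert the result to UTF-8.
misread <- iconv(rawToChar(quote_bytes), from = "windows-1252", to = "UTF-8")

# Step 3: the code points that came out, and their UTF-8 bytes.
misread_points <- utf8ToInt(misread)
misread_bytes <- charToRaw(misread)
print(sprintf("%04x", misread_points))   # [1] "00e2" "20ac" "02dc"
print(misread_bytes)                     # [1] c3 a2 e2 82 ac cb 9c
```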
