How to get from theyâ€™re to they’re

In last week’s article, you learned a short process that solves most encoding problems. But there’s one encoding problem that’s much harder to solve.

I know you’ve seen it. (Or maybe youâ€™ve seen it?) It’s when a curly quote turns into â€™, or an em-dash turns into â€”. It’ll make you think you’ve gone crazy. It should just work!

You could create a giant table, so you could find bad characters and replace them with good ones:

[{broken: "â€“",    fixed: "–"},
 {broken: "â€”",    fixed: "—"},
 {broken: "â€˜",    fixed: "‘"},
 {broken: "â€™",    fixed: "’"},
 {broken: "â€œ",    fixed: "“"},
 {broken: "â€",     fixed: "”"}, ...]

But there’s an easier, more reliable way to fix those broken characters.

Why does good typography always break?

Last week, you learned that an encoding is just a way to turn groups of meaningless bytes into displayable characters. Not every character can be represented in a single byte, because there are more than 256 possible characters. So some characters, like the curly quote ’, are represented with more than one byte:

irb(main):001:0> "they’re".bytes
=> [116, 104, 101, 121, 226, 128, 153, 114, 101]

Even though the string only has 7 characters, they’re represented by 9 bytes!

When you focus on just the curly quote:

irb(main):002:0> "’".bytes
=> [226, 128, 153]

You’ll see it uses 3 bytes. And our messed-up string, theyâ€™re, has three characters where it should have just one. That seems like more than a coincidence, right?

It seems like those three bytes should be read as UTF-8, where they’d represent a curly quote. Instead, each byte is showing up as a different character. So, which encoding would represent [226, 128, 153] as â€™? If you look at a few tables of popular encodings, you’ll see it’s Windows-1252.
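If you’d rather not dig through encoding tables by hand, you can let Ruby do the comparison. This is only a rough sketch: the list of candidate encodings is my own arbitrary pick, so swap in whichever ones you suspect.

```ruby
# The three bytes UTF-8 uses for a right curly quote.
quote_bytes = [226, 128, 153]

# Candidate encodings to try. This list is an assumption, not a complete
# survey; add any other encoding names you want to compare.
candidates = ["ISO-8859-1", "Windows-1251", "Windows-1252"]

# For each candidate, pretend the bytes were written in that encoding,
# then convert to UTF-8 so the result can be printed and compared.
readings = candidates.to_h do |name|
  raw = quote_bytes.pack("C*").force_encoding(name)
  [name, raw.encode("UTF-8")]
end

readings.each { |name, text| puts "#{name}: #{text.inspect}" }
```

Whichever line matches the garbage you’re seeing tells you which encoding your data was misread as.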

You can check this in irb:

irb(main):003:0> "they’re".force_encoding("Windows-1252").encode("UTF-8")
=> "theyâ€™re"

(We need that last .encode("UTF-8") to display the string in the console.)

Yep! That’s the problem. But it gets worse.

The data is supposed to be UTF-8, but it’s being misread as Windows-1252. And you’ll probably save that misread data to a database or a file as UTF-8. Ruby will helpfully convert it to UTF-8 for you, so you’ll end up with:

irb(main):004:0> "they’re".force_encoding("Windows-1252").encode("UTF-8")
=> "theyâ€™re"
irb(main):005:0> "they’re".force_encoding("Windows-1252").encode("UTF-8").bytes
=> [116, 104, 101, 121, 195, 162, 226, 130, 172, 226, 132, 162, 114, 101]

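Your string has now been mangled in two steps: its bytes were misread as Windows-1252, and then the misread characters were saved as perfectly valid UTF-8. Those broken characters now look like they’re supposed to be there. And if you didn’t know how it happened, it’d be almost impossible to untangle it.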

How do you fix it?

How do you get things back to normal? Let’s think about the problem backwards:

You have a UTF-8 string (theyâ€™re),
converted from a Windows-1252 string (theyâ€™re),
whose bytes should have been read as UTF-8 (they’re).

To fix it, you just have to follow those backwards steps. Use encode to convert the UTF-8 string back into a Windows-1252 string. Then, use force_encoding to force that mis-encoded Windows-1252 string to be read as UTF-8:

irb(main):006:0> "theyâ€™re".encode("Windows-1252").force_encoding("UTF-8")
=> "they’re"

Fixed!
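If you need this in more than one place, you could wrap the two steps in a small method. The name fix_mojibake is made up for this sketch, and returning nil when the string can’t be converted is just one possible design choice.

```ruby
# Hypothetical helper: undo one round of "UTF-8 bytes misread as Windows-1252".
# Returns nil if the string contains characters Windows-1252 cannot represent.
def fix_mojibake(broken)
  # Step 1: convert the characters back into the original raw bytes.
  raw = broken.encode("Windows-1252")
  # Step 2: relabel those bytes as the UTF-8 they were all along.
  raw.force_encoding("UTF-8")
rescue Encoding::UndefinedConversionError
  nil
end

# Build a broken string the same way the bug does, then repair it.
broken = "they\u2019re".dup.force_encoding("Windows-1252").encode("UTF-8")
puts fix_mojibake(broken)
```

The rescue covers strings with characters that don’t exist in Windows-1252 at all (Japanese text, say), where encode would otherwise raise.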

There’s one small problem…

Unfortunately, you probably found this problem because a bunch of files or database records had badly encoded data in them. And not every file or record is necessarily badly encoded – you might have a mix of good and bad data, especially if that data came from the people visiting your site.

If that’s the case, you can’t blindly run that code on every string:

irb(main):007:0> "theyâ€™re".encode("Windows-1252").force_encoding("UTF-8")
=> "they’re"
irb(main):008:0> "they’re".encode("Windows-1252").force_encoding("UTF-8")
=> "they\x92re"

If you run it on good data, you’ll just turn it into bad data. So what can you do?

You can use a heuristic: only change strings that have one of the bad characters in them, like â. This works well if a character like â won’t ever appear in a valid string.
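Here’s a minimal sketch of that heuristic. The method name and the marker pattern are assumptions for illustration. I’ve also added one extra check that isn’t part of the heuristic described above: if the “repaired” string isn’t valid UTF-8, the original is kept.

```ruby
# Marker that suggests a string was misread. Assumption: "â" (U+00E2) never
# appears in your legitimate data. Extend the pattern if you see other debris.
SUSPICIOUS = /\u00E2/

# Hypothetical helper: only touch strings that look broken.
def repair_if_suspicious(text)
  return text unless text.match?(SUSPICIOUS)

  candidate = text.encode("Windows-1252").force_encoding("UTF-8")
  # Extra safety net (my addition): a string that was fine to begin with
  # usually turns into invalid UTF-8, so keep the original in that case.
  candidate.valid_encoding? ? candidate : text
rescue Encoding::UndefinedConversionError
  # Contains characters outside Windows-1252, so it isn't this bug.
  text
end

broken = "they\u2019re".dup.force_encoding("Windows-1252").encode("UTF-8")
puts repair_if_suspicious(broken)
puts repair_if_suspicious("ch\u00E2teau")
```

The second call shows why the extra check helps: a word that legitimately contains â trips the marker, but the round trip produces invalid UTF-8, so the string is left alone.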

The last time I fixed this kind of bug, though, I wanted to play it safe. I used another useful tool to help: my eyes.

Whenever I found a badly encoded string, I printed it out, along with its replacement:

Changing title with ID 6 from "Theyâ€™re over there!" to "They’re over there!"

That way, I could spot-check the small number of strings that changed, and make sure they didn’t break any further.
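A dry run along those lines might look something like this. The records array is a stand-in I made up; in a real app you’d loop over your own models or files, and you’d only save the changes once the printed list looks right.

```ruby
# Hypothetical records; a real app would load these from its database.
records = [
  { id: 5, title: "Nothing wrong here" },
  { id: 6, title: "They\u00E2\u20AC\u2122re over there!" }
]

# Collect proposed changes without saving anything, printing each one
# so a human can review it first.
changes = records.filter_map do |record|
  old_title = record[:title]
  next unless old_title.include?("\u00E2") # same "â" heuristic as before

  new_title = old_title.encode("Windows-1252").force_encoding("UTF-8")
  next unless new_title.valid_encoding?

  puts "Changing title with ID #{record[:id]} from #{old_title.inspect} to #{new_title.inspect}"
  { id: record[:id], from: old_title, to: new_title }
end
```

Because nothing is written back, you can run it as often as you like while you tune the heuristic.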

I think I have a headache

Like I said last week, keeping different interpretations of the same data straight in your head is hard! But if you’re confused, exploring in an irb console will help. So try it out! Open one up, and see if you can go back and forth between — and â€”, or “ and â€œ.

Practicing complicated ideas like these is the fastest way to feel confident when you need them. And in the free sample chapter of Practicing Rails, you’ll learn the best techniques and processes to do just that.
