Joel covered this[1] topic over 20 years ago (!!) and we still regularly see "senior" programmers who just casually think of text as a string and strings as text, and that's all there is to it. I still regularly see websites full of ????? and U+FFFD and apostrophes becoming â€™ everywhere.
1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
Related:
What Every Programmer Absolutely, Positively Needs To Know About Encodings (2011) - https://news.ycombinator.com/item?id=30384223 - Feb 2022 (58 comments)
What programmers need to know about encodings and charsets (2011) - https://news.ycombinator.com/item?id=24162499 - Aug 2020 (22 comments)
What to know about encodings and character sets - https://news.ycombinator.com/item?id=9788253 - June 2015 (30 comments)
What Every Programmer Needs To Know About Encodings And Character Sets - https://news.ycombinator.com/item?id=4771987 - Nov 2012 (5 comments)
> Everybody is aware of this at some level, but somehow this knowledge seems to suddenly disappear in a discussion about text, so let's get it out first: A computer cannot store "letters", "numbers", "pictures" or anything else. The only thing it can store and work with are bits.
This is wrong and it goes downhill from there. I don't want to take the time and effort to fisk it, but it's full of errors like mistaking characters for codepoints and saying things like "In other words, ASCII maps 1:1 unto UTF-8" -- a bizarre and wrong way to say what he said in the previous sentence: "All characters available in the ASCII encoding only take up a single byte in UTF-8 and they're the exact same bytes as are used in ASCII".
The best things are those that get out of the way.
Full title:
"What every programmer absolutely, positively needs to know about encodings and character sets to work with text"
> Because Unicode is not an encoding.
> Overall, Unicode is yet another encoding scheme.
?
That's just somewhat sloppy.
Unicode is not an encoding of text to bits. It is an encoding of text to numbers. The various UTFs are then different encodings of text to bits, depending on how those numbers are serialized into bits.
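To make that distinction concrete, here's a quick Python sketch: the code point is just a number, and each UTF is a different way of turning that number into bytes.

    ch = "\u00e9"                  # é
    print(hex(ord(ch)))            # 0xe9 -- the code point, just a number
    print(ch.encode("utf-8"))      # b'\xc3\xa9'
    print(ch.encode("utf-16-be"))  # b'\x00\xe9'
    print(ch.encode("utf-32-be"))  # b'\x00\x00\x00\xe9'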
Though technically Unicode isn't even quite that. For example "é" can be encoded as U+00E9 or as U+0065,U+0301. Going the other way, "水", U+6C34, is drawn differently in simplified Chinese, Japanese, and traditional Chinese. Unicode calls this "language-sensitive glyph variation".
Which means that the correspondence between text and Unicode code points is many-to-many in both directions. And then those code points can show up in bits and bytes in multiple ways on top of that.
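For the "é" case above, a minimal illustration with Python's standard unicodedata module: both forms display as the same character, but they're different code point sequences until you normalize.

    import unicodedata

    precomposed = "\u00e9"   # é as a single code point
    decomposed = "e\u0301"   # 'e' followed by a combining acute accent
    print(precomposed == decomposed)                                # False -- different code points
    print(len(precomposed), len(decomposed))                        # 1 2
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True -- same text once normalized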
Yeah, author seems to have made a mistake there.
> Unicode is a large table mapping characters to numbers and the different UTF encodings specify how these numbers are encoded as bits. Overall, Unicode is yet another encoding scheme.
I would guess this represents a confusion between the narrow, abstract definition of Unicode and the way it's casually used as an umbrella term that includes things like the Transformation Formats.
The author doesn't understand what a character is, despite the Unicode standard making it very clear that character != codepoint.
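A quick way to see the difference in Python: len() counts code points, not user-perceived characters (grapheme clusters), so a single "character" on screen can be several code points. (Counting graphemes properly takes a dedicated segmentation library; this sketch only shows the mismatch.)

    accent = "e\u0301"             # one character to a reader, two code points
    flag = "\U0001F1E8\U0001F1E6"  # 🇨🇦 -- one flag, built from two regional indicator code points
    print(len(accent), len(flag))  # 2 2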
> Text is either encoded in UTF-8 or it's not. If it's not, it's encoded in ASCII, ISO-8859-1, UTF-16 or some other encoding.
Nitpicking but if it's encoded in ASCII, it's by definition a validly encoded UTF-8 file.
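Easy to check: since every ASCII byte is the same byte in UTF-8, ASCII-encoded data decodes cleanly as UTF-8.

    data = "hello, world".encode("ascii")
    print(data == "hello, world".encode("utf-8"))  # True -- identical bytes
    print(data.decode("utf-8"))                    # decodes without error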
Bitmaps. Anything outside of ASCII should be a bitmap.
This is the encodings equivalent of the “there should just be one timezone” take.
How would that work? How many bytes per character? How would different fonts work?
Sorry, misplaced humor.