Joel covered this[1] topic over 20 years ago (!!) and we still regularly see "senior" programmers who just casually think of text as a string and strings as text, and that's all there is to it. I still regularly see websites full of ????? and U+FFFD and apostrophes becoming â€™ everywhere.
1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
Related:
What Every Programmer Absolutely, Positively Needs To Know About Encodings (2011) - https://news.ycombinator.com/item?id=30384223 - Feb 2022 (58 comments)
What programmers need to know about encodings and charsets (2011) - https://news.ycombinator.com/item?id=24162499 - Aug 2020 (22 comments)
What to know about encodings and character sets - https://news.ycombinator.com/item?id=9788253 - June 2015 (30 comments)
What Every Programmer Needs To Know About Encodings And Character Sets - https://news.ycombinator.com/item?id=4771987 - Nov 2012 (5 comments)
> Everybody is aware of this at some level, but somehow this knowledge seems to suddenly disappear in a discussion about text, so let's get it out first: A computer cannot store "letters", "numbers", "pictures" or anything else. The only thing it can store and work with are bits.
This is wrong and it goes downhill from there. I don't want to take the time and effort to fisk it, but it's full of errors like mistaking characters for codepoints and saying things like "In other words, ASCII maps 1:1 unto UTF-8" -- a bizarre and wrong way to say what he said in the previous sentence: "All characters available in the ASCII encoding only take up a single byte in UTF-8 and they're the exact same bytes as are used in ASCII".
The best things are those that get out of the way.
Full title:
"What every programmer absolutely, positively needs to know about encodings and character sets to work with text"
> Because Unicode is not an encoding.
> Overall, Unicode is yet another encoding scheme.
?
That's just somewhat sloppy.
Unicode is not an encoding of text to bits. It is an encoding of text to numbers. The various UTFs are then different encodings of text to bits, depending on how those numbers are serialized into bits.
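To make that distinction concrete, here's a quick Python sketch: the code point is just a number, and each UTF is a different way of turning that number into bytes.

    ch = "\u00e9"                  # é
    print(hex(ord(ch)))            # 0xe9 -- the code point, just a number
    print(ch.encode("utf-8"))      # b'\xc3\xa9'
    print(ch.encode("utf-16-be"))  # b'\x00\xe9'
    print(ch.encode("utf-32-be"))  # b'\x00\x00\x00\xe9'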
Though technically Unicode isn't even quite that. For example "é" can be encoded as U+00E9 or as U+0065,U+0301. Going the other way, "水", U+6C34, is drawn differently in simplified Chinese, Japanese, and traditional Chinese. Unicode calls this "language-sensitive glyph variation".
Which means that the correspondence between text and Unicode code points is many-to-many in both directions. And then those code points can show up in bits and bytes in multiple ways on top of that.
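For the "é" case above, a minimal illustration with Python's standard unicodedata module: both forms display as the same character, but they're different code point sequences until you normalize.

    import unicodedata

    precomposed = "\u00e9"   # é as a single code point
    decomposed = "e\u0301"   # 'e' followed by a combining acute accent
    print(precomposed == decomposed)                                # False -- different code points
    print(len(precomposed), len(decomposed))                        # 1 2
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True -- same text once normalized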
Yeah, author seems to have made a mistake there.
> Unicode is a large table mapping characters to numbers and the different UTF encodings specify how these numbers are encoded as bits. Overall, Unicode is yet another encoding scheme.
I would guess this represents a confusion between the narrow, abstract definition of Unicode and the way it's casually used as an umbrella term that includes things like the Transformation Formats.
The author doesn't understand what a character is, despite the Unicode standard making it very clear that character != codepoint.
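A quick way to see the difference in Python: len() counts code points, not user-perceived characters (grapheme clusters), so a single "character" on screen can be several code points. (Counting graphemes properly takes a dedicated segmentation library; this sketch only shows the mismatch.)

    accent = "e\u0301"             # one character to a reader, two code points
    flag = "\U0001F1E8\U0001F1E6"  # 🇨🇦 -- one flag, built from two regional indicator code points
    print(len(accent), len(flag))  # 2 2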
> Text is either encoded in UTF-8 or it's not. If it's not, it's encoded in ASCII, ISO-8859-1, UTF-16 or some other encoding.
Nitpicking but if it's encoded in ASCII, it's by definition a validly encoded UTF-8 file.
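Easy to check: since every ASCII byte is the same byte in UTF-8, ASCII-encoded data decodes cleanly as UTF-8.

    data = "hello, world".encode("ascii")
    print(data == "hello, world".encode("utf-8"))  # True -- identical bytes
    print(data.decode("utf-8"))                    # decodes without error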
Bitmaps. Anything outside of ASCII should be a bitmap.
This is the encodings equivalent of the “there should just be one timezone” take.
How would that work? How many bytes per character? How would different fonts work?
Sorry, misplaced humor.