Sunday, December 10, 2006

'bush hid the facts'!

If you:

1. Load Notepad in Windows (in my case XP Pro)

2. Type "bush hid the facts" (all in lowercase, no quotes)

3. Save this file under a name of your choice

4. Re-open the file

you will not see the text that you typed. Instead you will see a bunch of squares (or, as I later found out, Chinese characters, if you have the Chinese fonts installed, which I don't).

Most people think it's a Windows Notepad easter egg (I thought so myself, to be honest), but in fact, it isn't. It's just a lousy Notepad bug. Let me explain...

I was curious about the cause of this phenomenon myself, and I found out that this text is not the only one to cause problems. There are other strings that cause Notepad to screw up, including "this app can break", another version of the bug that generated a lot of buzz. I've personally tested a series of strings that have the same effect, including "this api can break", "this cat can split", "jane can not dance", "text wit hou tcaps" and even "abcd efg hij klmno" and "xxxx xxx xxx xxxxx". What do these phrases have in common? Each is made up of four lowercase words of four, three, three and five letters. So, by induction, any lowercase "4-3-3-5" string should work.
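The pattern is easy to check mechanically. The sketch below is only an illustration: the helper names `word_lengths` and `space_offsets` are made up for this post, and the script checks the shape of the strings, not Notepad's behaviour. It also shows that every phrase is 18 characters long, an even number, with the spaces at offsets 4, 8 and 12.

```python
# Strings reported in this post as triggering the bug.
PHRASES = [
    "bush hid the facts",
    "this app can break",
    "this api can break",
    "this cat can split",
    "jane can not dance",
    "text wit hou tcaps",
    "abcd efg hij klmno",
    "xxxx xxx xxx xxxxx",
]


def word_lengths(text):
    """Return the length of each space-separated word."""
    return [len(word) for word in text.split(" ")]


def space_offsets(text):
    """Return the offsets at which spaces occur."""
    return [i for i, ch in enumerate(text) if ch == " "]


for phrase in PHRASES:
    # Prints the phrase, its total length, word lengths and space offsets.
    print(phrase, len(phrase), word_lengths(phrase), space_offsets(phrase))
```

Every line of output should show the same word lengths and the same space offsets, which is what "4-3-3-5" means here.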

Now, let's get to why this happens. First of all, Notepad writes the file just fine; it just can't read it back correctly. As proof, try opening your saved file, the one that Notepad screws up, in another text editor. I used EditPlus and the text turned out to be OK. So why does Notepad get it wrong? Well, it's a Windows thing. Notepad uses a Windows function, IsTextUnicode, to figure out whether a text file is Unicode or not. And that function, my friends, is the one that screws it up, because the way it checks is best described as "guessing". Here it guesses that the file is Unicode, and not ASCII, as it actually is.
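You can confirm that the saved data is intact without Notepad at all. The following is a minimal sketch, assuming that saving ASCII-only text as ANSI produces one byte per character with no marker in front; `save_and_read_back` is a hypothetical helper that stands in for "save in Notepad, open in another editor".

```python
import os
import tempfile

TEXT = "bush hid the facts"


def save_and_read_back(text):
    """Write text as plain ASCII, then return the raw bytes found on disk."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(text.encode("ascii"))
        with open(path, "rb") as handle:
            return handle.read()
    finally:
        os.remove(path)


raw = save_and_read_back(TEXT)
print(len(raw))                         # one byte per character
print(raw.decode("ascii") == TEXT)      # reading as ASCII recovers the text
print(raw.decode("utf-16-le") == TEXT)  # reading the same bytes as Unicode does not
```

The bytes on disk never change; only the encoding chosen when reading them differs.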

Now, two different but similar explanations can be given.

The first is that Notepad does not read the bytes of the file one per character, as ASCII requires, but two per character, as 16-bit Unicode does, and that messes it up. Here's the example:

Take "bush hid the facts". The hex codes for the string (you can see them with any hex editor) are:

62 75 73 68 20 68 69 64 20 74 68 65 20 66 61 63 74 73

Arrange the codes in pairs, low byte first, to make up 16-bit Unicode characters and you get:

7562 6873 6820 6469 7420 6568 6620 6361 7374

Look up each of these codes in a Unicode chart and you'll see that every one of them falls in the CJK ideograph range (U+4E00 to U+9FFF), so each one represents a Chinese "letter".

So the cause of this whole thing is the coincidence that the 18 ASCII characters can also be read as 9 Unicode characters. And, of course, Windows' inability to determine the right encoding of the file.
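The pairing can be reproduced with a few lines of Python. This is just a sketch of the arithmetic, not of what Notepad does internally; `to_utf16le_units` is a hypothetical helper name, and it assumes the pairs are read low byte first (little-endian), which is how Windows stores 16-bit Unicode text.

```python
TEXT = "bush hid the facts"


def to_utf16le_units(data):
    """Pair the bytes low byte first and return the 16-bit values."""
    if len(data) % 2:
        raise ValueError("need an even number of bytes")
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]


raw = TEXT.encode("ascii")
units = to_utf16le_units(raw)

# The 18 single-byte codes, then the 9 two-byte codes built from them.
print(" ".join(f"{b:02x}" for b in raw))
print(" ".join(f"{u:04x}" for u in units))

# Every value lies in the CJK Unified Ideographs block.
print(all(0x4E00 <= u <= 0x9FFF for u in units))

# Python's own UTF-16 little-endian decoder agrees with the manual pairing.
print("".join(chr(u) for u in units) == raw.decode("utf-16-le"))
```

An odd number of bytes cannot be split into pairs at all, which is one reason the total length of the string matters.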

The second explanation is slightly different, but the basics are the same: the difference between ASCII and Unicode. It's just a matter of Notepad defaults. You see, when you save the file, the "Encoding" drop-down defaults to ANSI. So, by default, Notepad saves as ANSI. But if you do a File -> Open, the default Encoding is set to Unicode. That's exactly what happens when you double-click a saved file. Notepad knows the path, but not the Encoding. So it uses the default Unicode encoding, which spits out the Chinese characters as explained above.

And that's about it. No easter eggs, no conspiracies, no Bush interventions. Just plain old Microsoft.
