ASCII...?
ASCII ("ass-key") stands for "American Standard Code for Information Interchange", and is a system where letters, digits and punctuation characters are assigned numbers, so that computers don't get confused. For many computer people, text and ASCII are the same thing, but it would be fairer to say that the one is recorded using the other. That is, text is usually stored as ASCII. In the same way that we humans interpret two diagonal lines joined at the top with a horizontal line between them as "A", a computer interprets the number 65 as "A" (when reading a text file). A computer sees letters as numbers because:
- Most computers have no eyes, and those that do have extremely poor vision.
- It's easier to send the number 65 down a wire than a picture of the letter A.
- Computers think of EVERYTHING as numbers.
When I say the computer sees the number 65 as "A", the philosophical among you might be wondering, "If it can't see the letter 'A' as A, then how can it see the number 65 as A?" The answer is that the decimal number 65 is exactly equivalent to the binary number 01000001, and binary is the basic level at which all modern computers think. So the computer is actually seeing 01000001 as A.
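If you have Python handy, you can check the letter-to-number-to-binary chain for yourself. This is only a small sketch, and the variable names are made up for the illustration:

```python
# Each character has a number, and each number has a binary form.
letter = "A"
code = ord(letter)            # character -> number
bits = format(code, "08b")    # number -> 8-digit binary string

print(letter, code, bits)     # A 65 01000001
print(chr(65))                # number -> character
print(int("01000001", 2))     # binary string -> number
```

ord() and chr() are Python's built-in way of going between a character and its number, and for plain English text those numbers are the ASCII ones.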
But once again, if the computer can't see the letter 'A' as A, how does it see the binary number 01000001 as A? Well, there are a couple of ways. The binary number 01000001 may be seen as a set of transistors with the second and eighth transistors switched on and the rest switched off. So even as you look at this letter A, if you are reading this on a computer screen, somewhere deep in the memory of your computer is a set of 8 transistors, with 2 on and 6 off, corresponding to that exact character. If 'A' is the first letter in a text file, then the sequence of microscopic magnetic spots on your hard drive which define that file begins with the sequence "off-on-off-off-off-off-off-on".
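The on/off picture can be sketched in a few lines of Python. The function name switch_pattern is invented for this example, and it assumes the text is plain ASCII:

```python
def switch_pattern(text):
    """Return the on/off pattern for each byte of ASCII text."""
    patterns = []
    for byte in text.encode("ascii"):
        bits = format(byte, "08b")   # e.g. 65 -> "01000001"
        patterns.append("-".join("on" if b == "1" else "off" for b in bits))
    return patterns

print(switch_pattern("A")[0])
# off-on-off-off-off-off-off-on
```

Feed it a longer string and you get one eight-switch pattern per character, in the order the characters appear.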
In trying to understand how a computer "sees" things, it might be useful to imagine losing all five of your senses, retaining only the ability to feel someone tapping the back of your hand. The only way you could receive information from or about the world around you would be in a form such as Morse code. This is basically how a computer interacts with the world. Once a set of protocols is established, such as what pattern of taps constitutes an A, you can learn any piece of information which can be translated into tapping. If you and the person tapping your hand don't agree on what pattern constitutes an A, then it will be impossible for you to communicate. Hence the need for an accepted standard, such as ASCII, which ensures that when text information is stored (as ones and zeros) on a computer, it can be reliably retrieved again.
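To see why the agreement matters, here is a toy version of the tapping game in Python. Everything in it (the tables, the function names, the "?" for an unknown pattern) is hypothetical and only there to illustrate the point:

```python
# Hypothetical tap tables: "." is a short tap, "-" a long one.
SENDER_TABLE = {"A": ".-", "B": "-...", "C": "-.-."}

def make_decoder(table):
    """Invert an encoding table so tap patterns map back to letters."""
    return {taps: letter for letter, taps in table.items()}

def send(message, table):
    """Turn each letter of the message into its tap pattern."""
    return [table[letter] for letter in message]

def receive(taps, decoder):
    """Turn tap patterns back into letters; unknown patterns become '?'."""
    return "".join(decoder.get(t, "?") for t in taps)

taps = send("CAB", SENDER_TABLE)

# A receiver using the same table gets the message back.
print(receive(taps, make_decoder(SENDER_TABLE)))   # CAB

# A receiver who learned a different table gets garbage.
OTHER_TABLE = {"A": "-...", "B": ".-", "C": "..."}
print(receive(taps, make_decoder(OTHER_TABLE)))    # ?BA
```

The taps themselves are identical in both cases. Only the table used to read them differs, which is exactly the problem a shared standard solves.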
Special Characters
Text files contain "special" characters as well as the characters that you can see. A space for instance is considered a character (represented in a file by the decimal number 32). There are also other special (invisible) characters representing:
- Tab (decimal #9)
- Carriage-Return (#13) - when present, this character is usually followed by a line-feed, and comes from the old days when a printer (or electric typewriter) had to be told when to return to the beginning of the line (the far left column).
- Line-Feed (#10) - this is usually used to indicate the end of a line, with or without a preceding carriage return. In reality several combinations are used, and this in itself is a highly annoying (and completely unnecessary) issue that text-handling software must take into account. It is called a line-feed because it began life as a printer instruction to feed the paper through by a single line.
- Null (#0) - another evolutionary hangover, which causes more trouble than it's worth these days. This character is not actually considered part of a text, and is often treated as marking the end of it. (Strictly speaking, ASCII has no End-of-File character as such; DOS text files used #26, Ctrl-Z, for that job.) Once upon a time it was required to signify to simple devices (and software) that an input was finished. It went on to serve a similar purpose in the "C" programming language, signifying the end of a string. What this generally means is that very little software can cope with Null characters in text.
- Form Feed (#12) - another printer instruction, which was used to tell printers to feed a full page through. Basically this can be considered a "Page Break" instruction.
- Backspace (#8) - a particularly amusing one these days... the idea of storing a backspace within a text file makes almost no sense. Almost, but still, if it were supported, it might be useful for creating compound characters. On a whim, I tried supporting it in my own Text Editor, and found that it could be used to underline characters (normally impossible in a text file) by following a character with a backspace and then the underscore character "_". Generally it should not be used ;)
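The numbers in the list above are easy to confirm in Python, which has an escape sequence for each of these characters. The sketch below also uses the standard ctypes module as a stand-in for C, to show how a C-style string gets cut short at the first Null. The SPECIAL table and the sample strings are made up for the illustration:

```python
import ctypes

# Name -> the character itself, written with Python's escape sequences.
SPECIAL = {
    "Null": "\0",
    "Backspace": "\b",
    "Tab": "\t",
    "Line-Feed": "\n",
    "Form Feed": "\f",
    "Carriage-Return": "\r",
    "Space": " ",
}

for name, char in SPECIAL.items():
    print(f"{name:16} #{ord(char)}")

# repr() makes the invisible characters in a string visible.
print(repr("one\ttwo\r\n"))

# A C-style string stops at the first Null byte, so the rest is "lost".
buffer = ctypes.create_string_buffer(b"abc\x00def")
print(buffer.value)   # what C string functions would see
print(buffer.raw)     # what is really sitting in memory
```

Python strings themselves are quite happy to hold a Null character. It is the C convention of "a string ends at the first zero byte" that causes the trouble.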
There are other codes even less used, such as vertical tab (not even sure how that one worked) and the glorious BEL character (#7) which had nothing at all to do with printing, but simply caused the computer to make a beep sound. I remember using it in BASIC to augment my fantastic program which asked you what your name was and how old you were, and then astounded you by somehow telling you what your name was and exactly how old you were! It went something like:
PRINT "Hello "; name$; ", you are "; age%; " years old!"; CHR$(7)
which would cause the computer to print "Hello robocop, you are 23343 years old!" and go *beep* at the same time.
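For the curious, roughly the same trick still works today. Here is a sketch in Python, where the greeting function and the BEL name are just for this example:

```python
BEL = chr(7)   # the same character as the escape sequence "\a"

def greeting(name, age):
    """Build the message, with a BEL on the end to make the beep."""
    return f"Hello {name}, you are {age} years old!{BEL}"

# Whether you actually hear anything depends on your terminal settings.
print(greeting("robocop", 23343))
```

Many modern terminals flash the window or stay silent instead of beeping, so don't be disappointed if nothing happens.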
Line Ending Characters
Text files created in a DOS/Windows environment tend to use a carriage return followed by a line feed (abbreviated as CR-LF). The Unix/Linux standard is to use a plain line feed (LF), a more sensible approach. Text files from the classic Mac OS use plain carriage returns (CR). The term "Computer Standards" is almost an oxymoron sometimes. If you read much about computers, you will often find mention of "competing standards". Much of the programming experience can be described in these mildly ironic terms.
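This is the sort of thing text-handling software ends up doing to cope with all three conventions. A minimal sketch in Python, working on raw bytes, with function names that are invented for the example:

```python
def count_line_endings(data: bytes) -> dict:
    """Count CR-LF pairs, lone LFs and lone CRs in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CR-LF": crlf,
        "LF": data.count(b"\n") - crlf,   # LFs not part of a CR-LF pair
        "CR": data.count(b"\r") - crlf,   # CRs not part of a CR-LF pair
    }

def to_lf(data: bytes) -> bytes:
    """Convert every line ending to a plain LF."""
    # CR-LF must be handled first, or each pair would become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

mixed = b"dos line\r\nunix line\nold mac line\rlast"
print(count_line_endings(mixed))
print(to_lf(mixed).split(b"\n"))
```

The order of the two replace calls is the one design decision here: swap them and every DOS line ending turns into a blank line.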
Addendum
To indicate just how annoying old standards can be, there are two "modes" available for opening a file using the standard C libraries.
- Binary mode - the simplest one, interpreting each byte of the file as a character, including null bytes (all zeros). When talking about files, any file that is not primarily text is considered "binary". I open all files (including text files) in binary mode, because I see no reason not to.
- Text mode - almost laughable, because on DOS/Windows it can treat a marker byte (Ctrl-Z, #26) as the end of the file, even when it isn't, and interprets CR-LF pairs as LF, thus actually screwing up the byte-to-character correspondence. (On Unix/Linux the two modes behave the same.) There is no way to count the characters in a text-mode file without actually reading the whole thing (whereas the byte length is known in advance).
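The byte-versus-character mismatch is easy to demonstrate. The sketch below uses Python rather than C, so treat it as an analogy: Python's text mode with newline=None translates CR-LF to LF on every platform, which is similar in spirit to (but not the same thing as) what the C library does on Windows. The file contents are made up:

```python
import os
import tempfile

raw_bytes = b"one\r\ntwo\r\nthree"

# Write the bytes to a temporary file exactly as given.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw_bytes)

# Binary mode: one byte in the file is one byte in memory.
with open(path, "rb") as f:
    as_binary = f.read()

# Text mode with newline translation: each CR-LF comes back as "\n".
with open(path, "r", encoding="ascii", newline=None) as f:
    as_text = f.read()

# File size and binary length agree; the text-mode length does not.
print(os.path.getsize(path), len(as_binary), len(as_text))
os.remove(path)
```

The file size is known before a single byte is read, but the text-mode character count only becomes known once the whole file has been read and translated.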
Saturday, October 19, 2002