latest technology: Linux Runs on Text: Understanding & Handling Text

Text plays a central role in the Linux operating system. Take better control of your system with a firm understanding of what text is and how best to handle, format and convert it.

This month, as another part of the series about using text on Linux systems, we’ll introduce “plain text” and how you can restructure it. We’ll see how to identify text from different systems (Unix, DOS, Mac) and to convert text between systems. The article ends with some examples, and there’ll be lots more next month.

If you’re used to clicking on files to view and edit them, you’ll probably find some new tools and concepts here. Gurus, please have a look at the main example and be sure it’s familiar.

The fundamental concept is the role of the newline (line feed) character. Reformatting text is basically a matter of juggling newlines. Let’s dig in.

What’s Text?

As I wrote in last month’s column, Linux runs on text. Text comes in a lot of flavors. What we’ll cover this month is plain text: a stream of characters that you can output directly to a terminal window using a utility like cat(1). The text doesn’t have special formatting codes like “start boldface” or “24 pixels high” that are only understood by certain operating systems or applications. Plain text doesn’t require a word processing program like OpenOffice.org Writer to interpret instructions buried before and between the actual text.

Let’s make a test text file. We’ll use it to demonstrate a lot of things about text. Although the example is a bit tedious, the techniques will be useful later, when you need to know what’s in a text file. We’ll make this text file on a Windows system, and make a similar file later under Linux.

On Microsoft Windows systems, each line of a plain text file ends with a carriage return (CR) character followed by a newline (LF, line feed) character. Let’s use a “Command prompt” window to copy text typed from the keyboard (con:) into a Windows-format file named win.txt. On DOS-type systems, pressing CTRL-Z followed by the ENTER key ends input. The boldfaced text is user input; the rest is system output:

The first command, cat win.txt, shows a file that looks like the text we entered in the DOS window. However, the bash shell prompt, /d/tmp$, comes just after the text line 2 from the file — instead of on a new line by itself.
Why? It’s because (as we’ll see below) the contents of win.txt don’t end with a newline character. The shell always prints a prompt immediately after the output of a command (in this case, the cat utility) finishes. There’s no newline at the end of the file, so the shell prompt appears on the same line.
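The missing final newline can also be checked programmatically. What follows is only a minimal sketch, assuming Python 3; the byte string WIN_TXT is reconstructed from the od description given in this article, and the helper name ends_with_newline is made up for illustration.

```python
# Bytes of win.txt as this article describes them: a TAB, "line 1 ",
# a carriage return, a newline, then "line 2" with no line ending.
WIN_TXT = b"\tline 1 \r\nline 2"


def ends_with_newline(data: bytes) -> bool:
    """Return True if the data ends with a newline (LF) byte."""
    return data.endswith(b"\n")


if __name__ == "__main__":
    # The file as typed: CTRL-Z came before ENTER, so no final newline.
    print(ends_with_newline(WIN_TXT))
    # The same text with a DOS line ending appended to the last line.
    print(ends_with_newline(WIN_TXT + b"\r\n"))
```

A file that fails this check is the kind that leaves the shell prompt sitting on the same line as the last line of output.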
The second command shows a lot more:
1. The option -t tells cat to show TAB characters as ^I, so you can see that the indentation before line 1 is caused by a TAB.
2. The option -v tells cat to show “nonprinting” characters visibly, which lets us see that there’s a carriage return character, shown as ^M, after a space character, following the text line 1.
  Each line of a DOS text file ends with two characters: a carriage return and a newline (line feed). After showing the carriage return visibly as ^M, cat outputs the newline, preceded by a $ character that marks the end of the line.
3. The option -e tells cat to mark the end of a line with $. This lets you see just where a newline falls.
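To make the three options concrete, here is a rough Python 3 sketch of the same idea: control characters become caret notation (so TAB is ^I and carriage return is ^M), and each newline is preceded by $. It is an approximation written for this article's sample bytes, not a reimplementation of cat; the names show_visible and _caret are hypothetical.

```python
# Bytes of win.txt as described in this article.
WIN_TXT = b"\tline 1 \r\nline 2"


def _caret(b: int) -> str:
    """Render one 7-bit byte, using ^X notation for control characters."""
    if b < 0x20:
        return "^" + chr(b + 64)   # e.g. 9 (TAB) -> ^I, 13 (CR) -> ^M
    if b == 0x7F:
        return "^?"                # DEL
    return chr(b)


def show_visible(data: bytes) -> str:
    """Approximate what cat shows with -v, -t and -e combined."""
    parts = []
    for b in data:
        if b == 0x0A:
            parts.append("$\n")    # mark the end of line, then end it
        elif b >= 0x80:
            parts.append("M-" + _caret(b - 0x80))  # high-bit bytes
        else:
            parts.append(_caret(b))
    return "".join(parts)


if __name__ == "__main__":
    print(show_visible(WIN_TXT))
```

For the sample bytes, the first output line is ^Iline 1 ^M$ and the second is line 2 with no $ after it, because the file has no final newline.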
The third command, od -c, shows the character representation of bytes one-by-one. The -w6 option lists six bytes per line. Each line starts with the octal offset from the start of the file. You can see:
1. The first six bytes (at offsets 0000000 through 0000005) are a TAB character (which od shows as \t), the word line and a space character.
2. The second six bytes (from offset 0000006) are the digit 1, a space character, a carriage return character (which od shows as \r), a newline character (which od shows as \n), and the first two characters of the next line of the file, li.
  od shows the structure of a text file. The newline character — the end of the first “line” in the file — is just a character. The bytes of the next “line” start immediately after the newline. (As we’ll see later, you can insert newline characters anywhere you want to start new lines.)
3. The last four bytes (octal offsets 0000014 through 0000017) are the letters n, e, a space, and the digit 2. There’s no carriage return, no newline. That’s because, while making the file, we typed the DOS end-of-input character CTRL-Z before pressing RETURN (ENTER) to end the line.
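The byte-by-byte view is easy to imitate. The sketch below, assuming Python 3, prints an octal offset followed by six bytes per row and finishes with the total byte count in octal. The column spacing is simplified and is not claimed to match od exactly; dump_chars and ESCAPES are illustrative names.

```python
# Bytes of win.txt as described in this article.
WIN_TXT = b"\tline 1 \r\nline 2"

# Backslash escapes for common control characters.
ESCAPES = {0: "\\0", 7: "\\a", 8: "\\b", 9: "\\t",
           10: "\\n", 11: "\\v", 12: "\\f", 13: "\\r"}


def dump_chars(data: bytes, width: int = 6) -> list:
    """Return rows of 'octal-offset char char ...', width bytes per row."""
    rows = []
    for offset in range(0, len(data), width):
        cells = []
        for b in data[offset:offset + width]:
            if b in ESCAPES:
                cells.append(ESCAPES[b])
            elif 0x20 <= b < 0x7F:
                cells.append(chr(b))        # printable ASCII as itself
            else:
                cells.append(f"{b:03o}")    # anything else as octal
        rows.append(f"{offset:07o} " + " ".join(f"{c:>3}" for c in cells))
    rows.append(f"{len(data):07o}")         # final offset = bytes read
    return rows


if __name__ == "__main__":
    for row in dump_chars(WIN_TXT):
        print(row)
```

The rows start at offsets 0000000, 0000006 and 0000014, and the closing line is 0000020, which is 16 in decimal.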
The wc utility reports 1 line, 4 words, and 16 characters.
1. Because there’s only one newline character, there’s only one “line”. (The second line isn’t complete.)
2. There are four words: line, 1, line, and 2.
3. The 16 characters include the carriage return and the newline. (You can see them in the od output, and see that the final offset, 0000020 octal, which shows the number of bytes read, is 16 decimal.) Although the TAB makes a lot of whitespace (it moves the cursor to the next “tab stop” on the terminal, as we’ll see below), it’s only a single character.
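The three counts can be reproduced in a few lines. This is a minimal Python 3 sketch of the counting rules described above (lines are newline characters, words are runs of non-whitespace, characters are bytes). It handles plain ASCII only, and count_text is a hypothetical helper name.

```python
# Bytes of win.txt as described in this article.
WIN_TXT = b"\tline 1 \r\nline 2"


def count_text(data: bytes) -> tuple:
    """Return (lines, words, bytes) for the given data."""
    lines = data.count(b"\n")   # a "line" is counted per newline character
    words = len(data.split())   # split() with no argument splits on whitespace
    return lines, words, len(data)


if __name__ == "__main__":
    lines, words, size = count_text(WIN_TXT)
    print(lines, words, size)
```

Appending a newline to the sample would raise the line count to 2 and the byte count to 17, while the word count would stay at 4.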

Tuesday, April 28, 2009
