Concept #1 - files:
It has been said that everything in UNIX is a file. You are familiar with files that reside on disk -- they are where data is stored. Although the definition of file as "where data is stored" may be true, the definition of a file in UNIX is more abstract. You should be aware of the fact that you can read from a file, and that you can write to a file. If you can read from a file (/etc/passwd) then this file is a "source" of data. If you can write to a file (GuiErrMsg.MM.DD.YY.txt) then this file is a "sink" (receptacle) for data. So a better definition for a file under UNIX is a place where you can get information, or a place where you can put information (or both).
When you are logged onto a UNIX system, there are three files that you are using all of the time that you may not even know of. When you type something on the keyboard, the UNIX system is reading that input from a file named "standard input." When the system displays something on your screen, it is writing to a file named "standard output." If the system needs to display an error message, it writes it to a file named "standard error." But you see both regular messages and error messages on your screen. Both standard output and standard error are pointed at your screen, so you can see both types of messages. If you wanted, you could "redirect" standard error to a file, and then never be bothered by those annoying error messages (good to do in a script, bad to do during an interactive session).
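As a rough sketch of the redirection described above, the following sends standard output and standard error to two different files. The `report` function and the file names `out.txt` and `err.txt` are made up for illustration; any command that writes to both streams would behave the same way.

```shell
#!/bin/sh
# Work in a scratch directory so nothing is left behind in your own files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical command: writes one line to standard output and
# one line to standard error (file descriptor 2).
report() {
    echo "regular message"
    echo "error message" >&2
}

# ">" redirects standard output, "2>" redirects standard error.
report > out.txt 2> err.txt

cat out.txt   # holds only the regular message
cat err.txt   # holds only the error message
```

Pointing standard error at /dev/null instead of err.txt discards the error messages entirely, which is the "never be bothered" case mentioned above.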
From the DOS world you may already be aware of "output redirection" where you capture the output of some command in a file:
dir > filelist.txt
A lesser known way to print from DOS is to copy a file directly to the printer:
copy filelist.txt prn
The copy command deals with files, so there must be something in DOS that behaves like UNIX, treating the printer as a file. You can also copy a file to a directory, so the copy command must know that a directory is a special type of file, one that contains other files.
DOS also allows you to "pipe" the output of one command to another command:
dir | more
Now, back to UNIX. If in UNIX everything is a file, then so is the user. The user is just a source of input (usually commands), and a receiver of output. If you think of everything as a file, then much of UNIX and the whole concept of "pipes" makes more sense. If a command usually takes input from the keyboard, and the keyboard is just a file, we can "redirect input" to the command from a file. If a command outputs to standard out, and standard out is just a file, then we can "redirect standard out" to another file, or "pipe" that output to another command. A pipe is just a fancy file, where one command puts its output into one end, and another command takes its input from the other end, and the data never gets written to an intermediary file.
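The three ideas in this paragraph (redirecting input, redirecting output, and piping) might look like this in practice. The file names fruit.txt and sorted.txt are hypothetical sample data, not part of any real system.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data.
printf 'pear\napple\nfig\n' > fruit.txt

# Redirect input: sort reads fruit.txt as if it had been typed at the keyboard.
sort < fruit.txt

# Redirect standard out: the sorted lines go to sorted.txt instead of the screen.
sort < fruit.txt > sorted.txt

# Pipe: the output of sort becomes the input of head, with no file in between.
first=$(sort < fruit.txt | head -n 1)
echo "$first"
```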
If we wanted to print an ordered list of all of the help desk personnel, without pipes we would do something like this:
grep LHD /etc/passwd > data1
sort data1 > data2
lp data2
We now have our printout, but we also have two files that we need to delete. (If you create these files in /tmp, many systems will delete them automatically the next time the machine reboots, although this depends on how the system is configured.)
With pipes, we get the same output with less typing and no files to clean up:
grep LHD /etc/passwd | sort | lp
The grep command is writing its output to a pipe. The sort command is getting its input from that pipe and sending its output to another pipe, with lp waiting for input at the other end.
Pipes are a little confusing because they seem to change the way a command works. In the last command, grep looked the way we are used to seeing it: grep [string] [filename]. When grep is in the middle of a pipeline, it looks like this:
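To check that the two approaches really give the same result, here is a small sketch that can be run without a printer. It assumes a made-up file, passwd.sample, in place of /etc/passwd, and uses output files in place of lp; the user entries are invented.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for /etc/passwd (invented entries).
cat > passwd.sample <<'EOF'
smith:x:101:20:LHD Pat Smith:/home/smith:/bin/sh
jones:x:102:20:ENG Lee Jones:/home/jones:/bin/sh
adams:x:103:20:LHD Kim Adams:/home/adams:/bin/sh
EOF

# Without pipes: two intermediate files that must be removed afterwards.
grep LHD passwd.sample > data1
sort data1 > data2
cat data2 > printout.tmpfiles   # cat stands in for lp here
rm data1 data2

# With pipes: no intermediate files at all.
grep LHD passwd.sample | sort > printout.pipe

# cmp is silent and succeeds when the two files are identical.
cmp printout.tmpfiles printout.pipe && echo "same output"
```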
ps -ef | grep sybase | sort
It looks like something is missing – we only have grep [string] – the filename is missing. Most UNIX commands are smart enough to know when they have data being piped to them, and therefore do not complain about "missing" arguments.
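A short sketch of the two ways grep can receive its data follows. The `list_procs` function is a hypothetical stand-in for ps -ef so that the output is predictable, and procs.txt is a made-up file name.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for "ps -ef" (invented process names).
list_procs() {
    printf 'sybase dataserver\nroot cron\nsybase backupserver\n'
}

# Form 1: grep [string] [filename] -- grep opens the file itself.
list_procs > procs.txt
from_file=$(grep sybase procs.txt | sort)

# Form 2: grep [string] -- no filename, so grep reads the pipe instead.
from_pipe=$(list_procs | grep sybase | sort)

[ "$from_file" = "$from_pipe" ] && echo "same lines either way"
```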
The sort command is a good command to demonstrate the many different ways a command can be used. All of these are valid, and produce the same output, but the differences can be confusing if you have not seen these variations used.
sort [filename]
most common
cat [filename] | sort
usually used by "beginners" (inefficient)
sort < [filename]
less common, but still valid (input redirection)
The middle command is inefficient because it starts two processes (cat and sort) when one could have been used. This is not a problem most of the time, but can lead to problems when the amount of data being processed increases. When a process works with a small amount of data but fails (or takes dramatically longer to run) with a large amount, it is said to have a scalability problem: something does not work well once the "scale" of the data changes from "small" to "large". Some people think of scalability as an "advanced" topic, but it is important that everyone understands the concept.
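The three ways of running sort discussed above can be compared directly. This is only a sketch: names.txt and the output file names are made up for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data.
printf 'cherry\napple\nbanana\n' > names.txt

# One process: sort opens the file itself.
sort names.txt > a.out

# Two processes: cat reads the file and hands it to sort through a pipe.
cat names.txt | sort > b.out

# One process: the shell opens the file and sort reads it as standard input.
sort < names.txt > c.out

# All three results are byte-for-byte identical.
cmp a.out b.out && cmp a.out c.out && echo "all three match"
```

The only difference is how the data reaches sort, which is why the extra cat process in the middle form buys nothing.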
There are many times when somebody writes a "quick and dirty" script (a script is just a sequence of commands in a file) because they think it will never be used again. Then, the next thing they know, it is being used on a daily basis. "Quick and dirty" usually means that not much thought was put into how things are being done: the final result is all that matters, and efficiency does not. This can be a problem when the script then gets used again and again. There can also be a problem when the original requirement was to process 100 lines of a file, and then the same process is used for 10,000 lines. Maybe the best way to handle this is to decide that efficiency always matters.