- Slicing And Dicing
- Comparing Text
- Editing On The Fly
All Unix-like operating systems rely heavily on text files for data storage. So it makes sense that there are many tools for manipulating text. In this chapter, we will look at programs that are used to “slice and dice” text. This chapter will introduce the following commands:
cut - Remove sections from each line of files
paste - Merge lines of files
join - Join lines of two files on a common field
comm - Compare two sorted files line by line
diff - Compare files line by line
patch - Apply a diff file to an original
tr - Translate or delete characters
sed - Stream editor for filtering and transforming text
aspell - Interactive spellchecker
Slicing And Dicing
The next three programs we will discuss are used to peel columns of text out of files and recombine them in useful ways.
The cut program is used to extract a section of text from a line and output the extracted section to standard output. It can accept multiple file arguments or input from standard input.
Here is a list of cut selection options:
-c list - Extract the portion of the line defined by list. The list may consist of one or more comma-separated numerical ranges.
-f list - Extract one or more fields from the line as defined by list. The list may contain one or more fields or field ranges separated by commas.
-d delim - When -f is specified, use delim as the field delimiting character. By default, fields must be separated by a single tab character.
--complement - Extract the entire line of text, except for those portions specified by -c and/or -f.
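The options above can be tried on a single line of input. The following is a minimal sketch, assuming GNU cut (the --complement option is a GNU extension and may not exist in other implementations). The sample record and the variable names are made up for illustration.

```shell
# A made-up, tab-separated record: name, version, release date.
record=$(printf 'Fedora\t10\t11/25/2008')

# -c: extract characters 1 through 6 of the line.
by_chars=$(printf '%s\n' "$record" | cut -c 1-6)

# -f: extract the first and third tab-separated fields.
by_fields=$(printf '%s\n' "$record" | cut -f 1,3)

# -d: use a colon instead of a tab as the field delimiter.
by_delim=$(printf 'root:x:0:0\n' | cut -d ':' -f 1)

# --complement (GNU cut): keep everything except the second field.
all_but_second=$(printf '%s\n' "$record" | cut --complement -f 2)

printf '%s\n' "$by_chars" "$by_fields" "$by_delim" "$all_but_second"
```

Note that -f 1,3 and --complement -f 2 select the same fields here, since the record has only three.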
Let’s take a look at this distros.txt file.
[user@linux ~]$ cat -A distros.txt
SUSE^I10.2^I12/07/2006$
Fedora^I10^I11/25/2008$
SUSE^I11.0^I06/19/2008$
Ubuntu^I8.04^I04/24/2008$
Fedora^I8^I11/08/2007$
SUSE^I10.3^I10/04/2007$
Ubuntu^I6.10^I10/26/2006$
Fedora^I7^I05/31/2007$
Ubuntu^I7.10^I10/18/2007$
Ubuntu^I7.04^I04/19/2007$
SUSE^I10.1^I05/11/2006$
Fedora^I6^I10/24/2006$
Fedora^I9^I05/13/2008$
Ubuntu^I6.06^I06/01/2006$
Ubuntu^I8.10^I10/30/2008$
Fedora^I5^I03/20/2006$
There are no embedded spaces, just single tab characters between the fields. Because the file uses tabs rather than spaces, we’ll use the -f option to extract a field.
[user@linux ~]$ cut -f 3 distros.txt
12/07/2006
11/25/2008
06/19/2008
04/24/2008
11/08/2007
10/04/2007
10/26/2006
05/31/2007
10/18/2007
04/19/2007
05/11/2006
10/24/2006
05/13/2008
06/01/2006
10/30/2008
03/20/2006
Now let’s extract the year from each line.
[user@linux ~]$ cut -f 3 distros.txt | cut -c 7-10
2006
2008
2008
2008
2007
2007
2006
2007
2007
2007
2006
2006
2008
2006
2008
2006
When working with fields, it is possible to specify a field delimiter other than the tab character. Here we will extract the first field from the /etc/passwd file. Using the -d option, we are able to specify the colon character as the field delimiter.
[user@linux ~]$ cut -d ':' -f 1 /etc/passwd | head
root
daemon
bin
sys
sync
games
man
lp
mail
news
The paste command does the opposite of cut. Rather than extracting a column of text from a file, it adds one or more columns of text to a file.
First, let’s produce a list of distros sorted by date and store the result in a file called distros-by-date.txt.
Next, we will use cut to extract the first two fields from the file (the distro name and version) and store that result in a file named distros-versions.txt.
The final piece of preparation is to extract the release dates and store them in a file named distros-dates.txt.
[user@linux ~]$ sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt > distros-by-date.txt
[user@linux ~]$ cut -f 1,2 distros-by-date.txt > distros-versions.txt
[user@linux ~]$ head distros-versions.txt
Fedora 10
Ubuntu 8.10
SUSE 11.0
Fedora 9
Ubuntu 8.04
Fedora 8
Ubuntu 7.10
SUSE 10.3
Fedora 7
Ubuntu 7.04
[user@linux ~]$ cut -f 3 distros-by-date.txt > distros-dates.txt
[user@linux ~]$ head distros-dates.txt
11/25/2008
10/30/2008
06/19/2008
05/13/2008
04/24/2008
11/08/2007
10/18/2007
10/04/2007
05/31/2007
04/19/2007
We now have the parts we need. To complete the process, use paste to put the column of dates ahead of the distro names and versions, thus creating a chronological list.
[user@linux ~]$ paste distros-dates.txt distros-versions.txt
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
05/13/2008 Fedora 9
04/24/2008 Ubuntu 8.04
11/08/2007 Fedora 8
10/18/2007 Ubuntu 7.10
10/04/2007 SUSE 10.3
05/31/2007 Fedora 7
04/19/2007 Ubuntu 7.04
12/07/2006 SUSE 10.2
10/26/2006 Ubuntu 6.10
10/24/2006 Fedora 6
06/01/2006 Ubuntu 6.06
05/11/2006 SUSE 10.1
03/20/2006 Fedora 5
In some ways, join is like paste in that it adds columns to a file, but it does so in a unique way. A join is an operation usually associated with relational databases, where data from multiple tables with a shared key field is combined to form a desired result. The join program performs the same operation. It joins data from multiple files based on a shared key field.
To demonstrate the join program, we’ll need to make a couple of files with a shared key. To do this, we will use our distros-by-date.txt file. From this file, we will construct two additional files. The first contains the release dates (which will be our shared key for this demonstration) and the release names.
[user@linux ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt
[user@linux ~]$ paste distros-dates.txt distros-names.txt > distros-key-names.txt
[user@linux ~]$ head distros-key-names.txt
11/25/2008 Fedora
10/30/2008 Ubuntu
06/19/2008 SUSE
05/13/2008 Fedora
04/24/2008 Ubuntu
11/08/2007 Fedora
10/18/2007 Ubuntu
10/04/2007 SUSE
05/31/2007 Fedora
04/19/2007 Ubuntu
The second file contains the release dates and the version numbers, as shown here.
[user@linux ~]$ cut -f 2,2 distros-by-date.txt > distros-vernums.txt
[user@linux ~]$ paste distros-dates.txt distros-vernums.txt > distros-key-vernums.txt
[user@linux ~]$ head distros-key-vernums.txt
11/25/2008 10
10/30/2008 8.10
06/19/2008 11.0
05/13/2008 9
04/24/2008 8.04
11/08/2007 8
10/18/2007 7.10
10/04/2007 10.3
05/31/2007 7
04/19/2007 7.04
We now have two files with a shared key (the “release date” field). It is important to point out that the files must be sorted on the key field for join to work properly.
[user@linux ~]$ join distros-key-names.txt distros-key-vernums.txt | head
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
05/13/2008 Fedora 9
04/24/2008 Ubuntu 8.04
11/08/2007 Fedora 8
10/18/2007 Ubuntu 7.10
10/04/2007 SUSE 10.3
05/31/2007 Fedora 7
04/19/2007 Ubuntu 7.04
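When the input files are not already in matching order, sorting each one on the key field first is one way to satisfy join's requirement. The following is a minimal sketch; the file names and their contents are hypothetical and are not part of the distros example.

```shell
workdir=$(mktemp -d)

# Two made-up files sharing a key in the first field, in different orders.
printf '%s\n' '103 carol' '101 alice' '102 bob' > "$workdir/names.txt"
printf '%s\n' '102 sales' '103 support' '101 engineering' > "$workdir/depts.txt"

# Sort both files on the key field so join can match the lines.
sort -k 1,1 "$workdir/names.txt" > "$workdir/names-sorted.txt"
sort -k 1,1 "$workdir/depts.txt" > "$workdir/depts-sorted.txt"

# Join on the first field, which is the default key.
joined=$(join "$workdir/names-sorted.txt" "$workdir/depts-sorted.txt")
printf '%s\n' "$joined"

rm -r "$workdir"
```

Each output line carries the key once, followed by the remaining fields from the first file and then those from the second.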
Comparing Text
It is often useful to compare versions of text files. For system administrators and software developers, this is particularly important. A system administrator may, for example, need to compare an existing configuration file to a previous version to diagnose a system problem. Likewise, a programmer frequently needs to see what changes have been made to programs over time.
The comm program compares two text files and displays the lines that are unique to each one and the lines they have in common. To demonstrate, we will create two nearly identical text files.
comm produces three columns of output. The first column contains lines unique to the first file argument, the second column contains the lines unique to the second file argument, and the third column contains the lines shared by both files.
comm supports options in the form -n, where n is either 1, 2, or 3. When used, these options specify which columns to suppress. For example, if we wanted to output only the lines shared by both files, we would suppress the output of the first and second columns.
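The three-column output and the effect of suppressing columns can be sketched as follows. The two files and their contents are made up for illustration. Both are already sorted, as comm expects.

```shell
workdir=$(mktemp -d)

# Two made-up, sorted files that differ only in their first and last lines.
printf '%s\n' a b c d > "$workdir/file1.txt"
printf '%s\n' b c d e > "$workdir/file2.txt"

# Default output: three columns, each one tab stop further to the right.
three_columns=$(comm "$workdir/file1.txt" "$workdir/file2.txt")

# Suppress columns 1 and 2 to keep only the lines shared by both files.
shared=$(comm -12 "$workdir/file1.txt" "$workdir/file2.txt")

printf '%s\n' "$three_columns" '---' "$shared"

rm -r "$workdir"
```

In the default output, the line a appears unindented (unique to the first file), e is indented by one tab (unique to the second), and b, c, and d are indented by two tabs (shared).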
Like comm, diff is used to detect the differences between files. However, diff is a much more complex tool, supporting many output formats and the ability to process large collections of text files at once. diff is often used by software developers to examine changes between different versions of program source code and thus has the ability to recursively examine directories of source code, often referred to as source trees. One common use for diff is the creation of diff files or patches that are used by programs such as patch (which we’ll discuss shortly) to convert one version of a file (or files) to another version.
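As an illustration of one of diff's output formats, the sketch below compares two small files using the unified format (the -u option). The file names and contents are hypothetical.

```shell
workdir=$(mktemp -d)

# Made-up old and new versions of a file.
printf '%s\n' a b c d > "$workdir/old.txt"
printf '%s\n' b c d e > "$workdir/new.txt"

# -u selects the unified format. diff exits with status 1 when the
# files differ, so that status is captured rather than treated as an error.
diff_status=0
changes=$(diff -u "$workdir/old.txt" "$workdir/new.txt") || diff_status=$?

printf '%s\n' "$changes"

rm -r "$workdir"
```

In the unified format, lines beginning with - exist only in the first file and lines beginning with + exist only in the second. Unmarked lines are shared context.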
The patch program is used to apply changes to text files. It accepts output from diff and is generally used to convert older-version files into newer versions. Let’s consider a famous example. The Linux kernel is developed by a large, loosely organized team of contributors who submit a constant stream of small changes to the source code. The Linux kernel consists of several million lines of code, while the changes that are made by one contributor at one time are quite small. It makes no sense for a contributor to send each developer an entire kernel source tree each time a small change is made. Instead, a diff file is submitted. The diff file contains the change from the previous version of the kernel to the new version with the contributor’s changes. The receiver then uses the patch program to apply the change to their own source tree. Using patch offers two significant advantages:
- The diff file is small compared to the full size of the source tree.
- The diff file concisely shows the change being made, allowing reviewers of the patch to quickly evaluate it.
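The round trip described above, creating a diff file and then applying it with patch, can be sketched on a single small file. The file names are hypothetical, and the target file is named explicitly on the patch command line so that the file names recorded in the diff headers do not matter.

```shell
workdir=$(mktemp -d)

# Made-up old and new versions of a file.
printf '%s\n' a b c d > "$workdir/old.txt"
printf '%s\n' b c d e > "$workdir/new.txt"

# Create the diff file. diff exits with status 1 when the files differ,
# so the status is ignored here.
diff -u "$workdir/old.txt" "$workdir/new.txt" > "$workdir/changes.diff" || true

# Apply the diff to a copy of the old file, naming the target explicitly.
cp "$workdir/old.txt" "$workdir/target.txt"
patch "$workdir/target.txt" < "$workdir/changes.diff"

# cmp -s is silent and succeeds only if the two files are identical.
if cmp -s "$workdir/target.txt" "$workdir/new.txt"; then
    patch_result="target matches new version"
else
    patch_result="target differs from new version"
fi
printf '%s\n' "$patch_result"

rm -r "$workdir"
```

After the patch is applied, the copy of the old file should be identical to the new version.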
Editing On The Fly
Our experience with text editors has been largely interactive, meaning that we manually move a cursor around and then type our changes. However, there are non-interactive ways to edit text as well. It’s possible, for example, to apply a set of changes to multiple files with a single command.
The tr program is used to transliterate characters. We can think of this as a sort of character-based search-and-replace operation. Transliteration is the process of changing characters from one alphabet to another. For example, converting characters from lowercase to uppercase is transliteration.
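The lowercase-to-uppercase conversion mentioned above can be sketched as follows, along with the delete mode noted in the command summary at the start of the chapter. The input strings are made up for illustration.

```shell
# Transliterate: each character in the first set is replaced by the
# character at the same position in the second set.
upper=$(printf 'lowercase letters\n' | tr 'a-z' 'A-Z')

# Delete: with -d, every character in the set is removed from the stream.
letters_only=$(printf 'a1b2c3\n' | tr -d '0-9')

printf '%s\n' "$upper" "$letters_only"
```

tr reads only from standard input, which is why both examples feed it through a pipeline.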
The name sed is short for stream editor. It performs text editing on a stream of text, either a set of specified files or standard input. sed is a powerful and somewhat complex program (there are entire books about it), so we will not cover it completely here.
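A small taste of how sed edits a stream: the sketch below uses the substitute command (s), first on every line and then restricted to a single line by an address. The input text is made up for illustration.

```shell
# Replace the first occurrence of "front" on each input line with "back".
replaced=$(printf 'front\n' | sed 's/front/back/')

# Prefix the command with the address 2 so it applies to line 2 only.
second_only=$(printf '%s\n' 'front' 'front' | sed '2s/front/back/')

printf '%s\n' "$replaced" '---' "$second_only"
```

In the second case, the first line passes through unchanged because it does not match the address.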
aspell is an interactive spelling checker. The aspell program is the successor to an earlier program named ispell and can be used, for the most part, as a drop-in replacement. While the aspell program is mostly used by other programs that require spellchecking capability, it can also be used effectively as a stand-alone tool from the command line. It has the ability to intelligently check various types of text files, including HTML documents, C or C++ programs, email messages, and other kinds of specialized texts.
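Interactive checking is started with aspell check followed by a file name, which needs a terminal. For scripts, the list command is a non-interactive alternative: it reads standard input and prints the words aspell considers misspelled. The sketch below assumes aspell and a dictionary for the current language are installed. The sample sentence and variable names are made up, and which words are reported depends on the installed dictionary.

```shell
# A made-up sentence containing two deliberate misspellings.
text='The quick brown fox jimped over the laxy dog.'

if command -v aspell >/dev/null 2>&1; then
    # "aspell list" prints each word it considers misspelled, one per line.
    misspelled=$(printf '%s\n' "$text" | aspell list)
    aspell_status="checked"
else
    misspelled=""
    aspell_status="aspell not installed"
fi

printf '%s\n' "$aspell_status" "$misspelled"
```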
In this chapter, we looked at a few of the many command line tools that operate on text.