Learn Linux 20: Text Processing

Introduction

All Unix-like operating systems rely heavily on text files for data storage. So it makes sense that there are many tools for manipulating text. In this chapter, we will look at programs that are used to “slice and dice” text. This chapter will introduce the following commands:

  • cut - Remove sections from each line of files
  • paste - Merge lines of files
  • join - Join lines of two files on a common field
  • comm - Compare two sorted files line by line
  • diff - Compare files line by line
  • patch - Apply a diff file to an original
  • tr - Translate or delete characters
  • sed - Stream editor for filtering and transforming text
  • aspell - Interactive spellchecker

Slicing And Dicing

The next three programs we will discuss are used to peel columns of text out of files and recombine them in useful ways.

cut

The cut program is used to extract a section of text from a line and output the extracted section to standard output. It can accept multiple file arguments or input from standard input.

Here is a list of cut options:

  • -c list - Extract the portion of the line defined by list. The list may consist of one or more comma-separated numerical ranges.
  • -f list - Extract one or more fields from the line as defined by list. The list may contain one or more fields or field ranges separated by commas.
  • -d delim - When -f is specified, use delim as the field delimiting character. By default, fields must be separated by a single tab character.
  • --complement - Extract the entire line of text, except for those portions specified by -c or -f.

Let’s take a look at this distros.txt file.

[user@linux ~]$ cat -A distros.txt
SUSE^I10.2^I12/07/2006$
Fedora^I10^I11/25/2008$
SUSE^I11.0^I06/19/2008$
Ubuntu^I8.04^I04/24/2008$
Fedora^I8^I11/08/2007$
SUSE^I10.3^I10/04/2007$
Ubuntu^I6.10^I10/26/2006$
Fedora^I7^I05/31/2007$
Ubuntu^I7.10^I10/18/2007$
Ubuntu^I7.04^I04/19/2007$
SUSE^I10.1^I05/11/2006$
Fedora^I6^I10/24/2006$
Fedora^I9^I05/13/2008$
Ubuntu^I6.06^I06/01/2006$
Ubuntu^I8.10^I10/30/2008$
Fedora^I5^I03/20/2006$

There are no embedded spaces, just single tab characters between the fields. Because the file uses tabs rather than spaces, we’ll use the -f option to extract a field.

[user@linux ~]$ cut -f 3 distros.txt
12/07/2006
11/25/2008
06/19/2008
04/24/2008
11/08/2007
10/04/2007
10/26/2006
05/31/2007
10/18/2007
04/19/2007
05/11/2006
10/24/2006
05/13/2008
06/01/2006
10/30/2008
03/20/2006

Now let’s extract the year from each line.

[user@linux ~]$ cut -f 3 distros.txt | cut -c 7-10
2006
2008
2008
2008
2007
2007
2006
2007
2007
2007
2006
2006
2008
2006
2008
2006

When working with fields, we can specify a field delimiter other than the tab character. Here we extract the first field from the /etc/passwd file, using the -d option to specify the colon character as the field delimiter.

[user@linux ~]$ cut -d ':' -f 1 /etc/passwd | head
root
daemon
bin
sys
sync
games
man
lp
mail
news
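
The options listed earlier can also be tried on a single line of input. The following is a minimal sketch, not part of the original examples: the sample line is copied from distros.txt, and the variable names are chosen only for illustration. Note that --complement is a GNU extension and may not be available in other implementations of cut.

```shell
# One tab-separated sample line, copied from distros.txt.
line=$(printf 'SUSE\t10.2\t12/07/2006')

# Fields 1 and 3, selected with a comma-separated list.
name_and_date=$(printf '%s\n' "$line" | cut -f 1,3)

# Characters 1 through 4 of the line.
first_four=$(printf '%s\n' "$line" | cut -c 1-4)

# Everything except field 2 (--complement is a GNU extension).
without_version=$(printf '%s\n' "$line" | cut --complement -f 2)

printf '%s\n' "$name_and_date" "$first_four" "$without_version"
```

Here the first and third results are the same, since selecting fields 1 and 3 of a three-field line is equivalent to dropping field 2.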

paste

The paste command does the opposite of cut. Rather than extracting a column of text from a file, it adds one or more columns of text to a file.

First let’s produce a list of distros sorted by date and store the result in a file called distros-by-date.txt.

Next, we will use cut to extract the first two fields from the file (the distro name and version) and store that result in a file named distros-versions.txt.

The final piece of preparation is to extract the release dates and store them in a file named distros-dates.txt.

[user@linux ~]$ sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt > distros-by-date.txt
[user@linux ~]$ cut -f 1,2 distros-by-date.txt > distros-versions.txt
[user@linux ~]$ head distros-versions.txt
Fedora    10
Ubuntu    8.10
SUSE      11.0
Fedora    9
Ubuntu    8.04
Fedora    8
Ubuntu    7.10
SUSE      10.3
Fedora    7
Ubuntu    7.04
[user@linux ~]$ cut -f 3 distros-by-date.txt > distros-dates.txt
[user@linux ~]$ head distros-dates.txt
11/25/2008
10/30/2008
06/19/2008
05/13/2008
04/24/2008
11/08/2007
10/18/2007
10/04/2007
05/31/2007
04/19/2007

We now have the parts we need. To complete the process, use paste to put the column of dates ahead of the distro names and versions, thus creating a chronological list.

[user@linux ~]$ paste distros-dates.txt distros-versions.txt
11/25/2008   Fedora      10
10/30/2008   Ubuntu      8.10
06/19/2008   SUSE        11.0
05/13/2008   Fedora      9
04/24/2008   Ubuntu      8.04
11/08/2007   Fedora      8
10/18/2007   Ubuntu      7.10
10/04/2007   SUSE        10.3
05/31/2007   Fedora      7
04/19/2007   Ubuntu      7.04
12/07/2006   SUSE        10.2
10/26/2006   Ubuntu      6.10
10/24/2006   Fedora      6
06/01/2006   Ubuntu      6.06
05/11/2006   SUSE        10.1
03/20/2006   Fedora      5
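
To experiment with the same cut-and-paste round trip without touching the real files, it can be reduced to a two-line sample. This is only a sketch: the temporary directory and the sample-* file names are assumptions made for the illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two-line sample in name<TAB>version<TAB>date form (hypothetical file name).
printf 'Fedora\t10\t11/25/2008\nUbuntu\t8.10\t10/30/2008\n' > sample.txt

# Split the columns into two files...
cut -f 1,2 sample.txt > sample-versions.txt
cut -f 3 sample.txt > sample-dates.txt

# ...then glue them back together with the date column first.
pasted=$(paste sample-dates.txt sample-versions.txt)
printf '%s\n' "$pasted"
```

paste separates the merged columns with a tab character by default, so the result is still a tab-delimited file that cut can process again.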

join

In some ways, join is like paste in that it adds columns to a file, but it does so in a different way. A join is an operation usually associated with relational databases, where data from multiple tables with a shared key field is combined to form a desired result. The join program performs the same operation on two files, combining their lines based on a shared key field.

To demonstrate the join program, we’ll need to make a couple of files with a shared key. To do this, we will use our distros-by-date.txt file. From this file, we will construct two additional files. One contains the release dates (which will be our shared key for this demonstration) and the release names.

[user@linux ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt
[user@linux ~]$ paste distros-dates.txt distros-names.txt > distros-key-names.txt
[user@linux ~]$ head distros-key-names.txt
11/25/2008     Fedora
10/30/2008     Ubuntu
06/19/2008     SUSE
05/13/2008     Fedora
04/24/2008     Ubuntu
11/08/2007     Fedora
10/18/2007     Ubuntu
10/04/2007     SUSE
05/31/2007     Fedora
04/19/2007     Ubuntu

The second file contains the release dates and the version numbers, as shown here.

[user@linux ~]$ cut -f 2,2 distros-by-date.txt > distros-vernums.txt
[user@linux ~]$ paste distros-dates.txt distros-vernums.txt > distros-key-vernums.txt
[user@linux ~]$ head distros-key-vernums.txt
11/25/2008     10
10/30/2008     8.10
06/19/2008     11.0
05/13/2008     9
04/24/2008     8.04
11/08/2007     8
10/18/2007     7.10
10/04/2007     10.3
05/31/2007     7
04/19/2007     7.04

We now have two files with a shared key (the “release date” field). It is important to point out that the files must be sorted on the key field for join to work properly.

[user@linux ~]$ join distros-key-names.txt distros-key-vernums.txt | head
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
05/13/2008 Fedora 9
04/24/2008 Ubuntu 8.04
11/08/2007 Fedora 8
10/18/2007 Ubuntu 7.10
10/04/2007 SUSE 10.3
05/31/2007 Fedora 7
04/19/2007 Ubuntu 7.04
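
The sorting requirement is easier to see when the two inputs start out in different orders. The sketch below makes several assumptions for the sake of illustration: the file names are hypothetical, the fields are separated by spaces, and the dates are written as YYYY-MM-DD so that a plain text sort orders them correctly. Setting LC_ALL=C for both sort and join keeps the two programs in agreement about the sort order.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical key/value files; the key is the first field.
printf '2008-11-25 Fedora\n2008-06-19 SUSE\n2008-10-30 Ubuntu\n' > names.txt
printf '2008-10-30 8.10\n2008-11-25 10\n2008-06-19 11.0\n' > vernums.txt

# Sort both files on the key, using the same collation that join will use.
LC_ALL=C sort -k 1,1 names.txt > names.sorted
LC_ALL=C sort -k 1,1 vernums.txt > vernums.sorted

# Each output line holds the key followed by the remaining fields of both files.
joined=$(LC_ALL=C join names.sorted vernums.sorted)
printf '%s\n' "$joined"
```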

Comparing Text

It is often useful to compare versions of text files. For system administrators and software developers, this is particularly important. A system administrator may, for example, need to compare an existing configuration file to a previous version to diagnose a system problem. Likewise, a programmer frequently needs to see what changes have been made to programs over time.

comm

The comm program compares two text files and displays the lines that are unique to each one and the lines they have in common. To demonstrate, we will create two nearly identical text files using cat.

comm produces three columns of output. The first column contains lines unique to the first file argument, the second column contains the lines unique to the second file argument, and the third column contains the lines shared by both files. comm supports options in the form -n, where n is either 1, 2, or 3. When used, these options specify which columns to suppress. For example, if we wanted to output only the lines shared by both files, we would suppress the output of the first and second columns.
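
A minimal sketch of such a comparison follows. The file names and their contents are assumptions chosen for the illustration; the only requirement is that both files are already sorted.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two nearly identical, already sorted files (hypothetical names).
cat > file1.txt <<'EOF'
a
b
c
d
EOF
cat > file2.txt <<'EOF'
b
c
d
e
EOF

# Full three-column output.
comm file1.txt file2.txt

# Suppress columns 1 and 2 to keep only the lines shared by both files.
shared=$(comm -12 file1.txt file2.txt)
printf '%s\n' "$shared"
```

In the full output, a appears in the first column, e in the second, and the shared lines b, c, and d in the third.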

diff

Like the comm program, diff is used to detect the differences between files. However, diff is a much more complex tool, supporting many output formats and the ability to process large collections of text files at once. diff is often used by software developers to examine changes between different versions of program source code and thus has the ability to recursively examine directories of source code, often referred to as source trees. One common use for diff is the creation of diff files or patches that are used by programs such as patch (which we’ll discuss shortly) to convert one version of a file (or files) to another version.
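
As a small illustration, the sketch below creates a diff file from two short text files. The file names are hypothetical, and the -u option (unified format) is just one of the output formats diff supports. Note that diff reports differences through its exit status as well: 0 means the files are identical and 1 means they differ.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# An "old" and a "new" version of the same file (hypothetical names).
printf 'a\nb\nc\nd\n' > old.txt
printf 'b\nc\nd\ne\n' > new.txt

# Save the differences in unified format and record the exit status.
diff_status=0
diff -u old.txt new.txt > changes.diff || diff_status=$?

cat changes.diff
echo "diff exit status: $diff_status"
```

In the unified format, lines removed from the old file are prefixed with a minus sign and lines added in the new file are prefixed with a plus sign.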

patch

The patch program is used to apply changes to text files. It accepts output from diff and is generally used to convert older-version files into newer versions. Let’s consider a famous example. The Linux kernel is developed by a large, loosely organized team of contributors who submit a constant stream of small changes to the source code. The Linux kernel consists of several million lines of code, while the changes that are made by one contributor at one time are quite small. It makes no sense for a contributor to send each developer an entire kernel source tree each time a small change is made. Instead, a diff file is submitted. The diff file contains the change from the previous version of the kernel to the new version with the contributor’s changes. The receiver then uses the patch program to apply the change to their own source tree. Using diff/patch offers two significant advantages.

  • The diff file is small, compared to the full size of the source tree.
  • The diff file concisely shows the change being made, allowing reviewers of the patch to quickly evaluate it.
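
The workflow can be tried on a very small scale. The following is only a sketch with hypothetical file names; it also checks whether patch is available first, since the program is not installed by default on every system.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# An "old" and a "new" version of the same file (hypothetical names).
printf 'a\nb\nc\nd\n' > old.txt
printf 'b\nc\nd\ne\n' > new.txt

# Record the change; "|| true" because diff exits with status 1 when files differ.
diff -u old.txt new.txt > changes.diff || true

# Apply the recorded change to the old file, if patch is available.
if command -v patch >/dev/null 2>&1; then
    patch old.txt < changes.diff
    # cmp -s is silent and succeeds only when the two files are identical.
    cmp -s old.txt new.txt && echo "old.txt now matches new.txt"
else
    echo "patch is not installed on this system" >&2
fi
```

Only changes.diff needs to be sent to someone who already has old.txt; applying it reproduces new.txt on their side.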

Editing On The Fly

Our experience with text editors has been largely interactive, meaning that we manually move a cursor around and then type our changes. However, there are non-interactive ways to edit text as well. It’s possible, for example, to apply a set of changes to multiple files with a single command.

tr

The tr program is used to transliterate characters. We can think of this as a sort of character-based search-and-replace operation. Transliteration is the process of changing characters from one alphabet to another. For example, converting characters from lowercase to uppercase is transliteration.
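
A short sketch of both uses named in the chapter introduction, translating and deleting, might look like this. The input strings and variable names are made up for the illustration.

```shell
# Translate every lowercase letter to its uppercase counterpart.
upper=$(echo "lowercase letters" | tr a-z A-Z)
echo "$upper"    # LOWERCASE LETTERS

# With -d, tr deletes the listed characters instead of translating them.
# Here it strips the carriage return from a DOS-style line ending.
no_cr=$(printf 'line one\r\n' | tr -d '\r')
echo "$no_cr"
```

tr reads only from standard input, which is why both examples feed it through a pipeline.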

sed

The name sed is short for stream editor. It performs text editing on a stream of text, either a set of specified files or standard input. sed is a powerful and somewhat complex program (there are entire books about it), so we will not cover it completely here.
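
To give a first taste, here is a minimal sketch of the most common sed operation, the s (substitute) command. The input text is made up for the illustration.

```shell
# Replace the first occurrence of "front" on each line with "back".
replaced=$(echo "front" | sed 's/front/back/')
echo "$replaced"    # back

# A leading line number restricts the command to that line of the input.
second_only=$(printf 'front\nfront\n' | sed '2s/front/back/')
printf '%s\n' "$second_only"
```

In the second command only line 2 is changed, so the output is front followed by back.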

aspell

aspell is an interactive spelling checker. The aspell program is the successor to an earlier program named ispell and can be used, for the most part, as a drop-in replacement. While the aspell program is mostly used by other programs that require spellchecking capability, it can also be used effectively as a stand-alone tool from the command line. It has the ability to intelligently check various types of text files, including HTML documents, C or C++ programs, email messages, and other kinds of specialized texts.

Summary

In this chapter, we looked at a few of the many command line tools that operate on text.