sed & awk

Chapter 5
Basic sed Commands

5.6 List

The list command (l) displays the contents of the pattern space, showing non-printing characters as backslash-escaped octal ASCII codes (two digits in the examples below). It is similar in function to the list (:l) command in vi. You can use this command to detect "invisible" characters in the input.[6]

[6] GNU sed displays certain characters, such as carriage return, using the ANSI C escape sequences, instead of straight octal. Presumably, this is easier to comprehend for those who are familiar with C (or awk, as we'll see later in the book).

$ cat test/spchar
Here is a string of special characters: ^A  ^B
^M ^G
$ sed -n -e "l" test/spchar
Here is a string of special characters: \01  \02
\15 \07
$ # test with GNU sed too
$ gsed -n -e "l" test/spchar
Here is a string of special characters: \01  \02
\r \a

Because the list command causes immediate output, we suppress the default output with the -n option; otherwise we would get duplicate copies of the lines.
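If you do not have a file of control characters at hand, one way to try the list command is to generate them with printf. The following is a minimal sketch, assuming a POSIX-conforming sed; the sample text and the variable name listed are made up for illustration. On such a sed the output differs slightly from the listings above: other non-printing characters appear as three-digit octal, and the end of each line is marked with a "$".

```shell
# Build one line containing CTRL-A (octal 001), a tab, and a carriage
# return, then list it. -n suppresses the default output so the line
# is shown only once, by the l command.
listed=$(printf 'ctl:\001 tab:\t cr:\r\n' | sed -n 'l')
printf '%s\n' "$listed"
```

Dropping -n from the command shows the duplication described above: the listed form is printed by l, and then the raw line is printed by sed's default output.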

You cannot match a character by ASCII value (nor can you match octal values) in sed.[7] Instead, you have to put the literal character into the script, which means finding a key combination in vi that produces it. Use CTRL-V to quote the character. For instance, you can match an ESC character (^[). Look at the following script:

[7] You can do this in awk, however.

# list line and replace ^[ with "Escape"
l
s/^[/Escape/

Here's a one-line test file:

The Great ^[ is a movie starring Steve McQueen.

Running the script produces the following output:

The Great \33 is a movie starring Steve McQueen.
The Great Escape is a movie starring Steve McQueen.

GNU sed produces this:

The Great \1b is a movie starring Steve McQueen.
The Great Escape is a movie starring Steve McQueen.

The ^[ character was made in vi by entering CTRL-V, then pressing the ESC key.
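When the command is given on the command line rather than typed into a script file with vi, the shell can supply the literal character instead. The sketch below is one way to do that, under the assumption that your shell's printf understands octal escapes; the variable names esc and result are hypothetical. sed still sees a literal ESC character in the pattern, so nothing about sed's matching has changed.

```shell
# Let printf produce a literal ESC (octal 033) and keep it in a variable.
esc=$(printf '\033')

# Feed the one-line test text to sed and replace the ESC with "Escape".
# Double quotes let the shell expand ${esc} before sed reads the command.
result=$(printf 'The Great %s is a movie starring Steve McQueen.\n' "$esc" |
    sed "s/${esc}/Escape/")
printf '%s\n' "$result"
```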

5.6.1 Stripping Out Non-Printable Characters from nroff Files

The UNIX formatter nroff produces output for line printers and CRT displays. To achieve such special effects as bolding, it outputs the character followed by a backspace and then outputs the same character again. A sample of it viewed with a text editor might look like:

N^HN^HN^HNA^HA^HA^HAM^HM^HM^HME^HE^HE^HE

which bolds the word "NAME." There are three overstrikes for each character output. Similarly, underlining is achieved by outputting an underscore, a backspace and then the character to be underlined. The following example is the word "file" surrounded by a sequence for underscoring it.

_^Hf_^Hi_^Hl_^He

It might be necessary at times to strip these printing "special effects," for example when you are given this type of output as a source file. The following line removes the sequences for emboldening and underscoring:

s/.^H//g

It removes any character preceding the backspace along with the backspace itself. In the case of underlining, "." matches the underscore; for emboldening, it matches the overstrike character. Because the g flag applies the substitution to every occurrence on the line, multiple occurrences of the overstrike character are removed, leaving a single character for each sequence. Note that ^H is entered in vi by pressing CTRL-V followed by CTRL-H.
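To see the substitution at work without an nroff output file, the two sample sequences can be rebuilt with printf, where \b stands for the backspace shown as ^H. This is a small sketch only; the variable names bs, bold, and under are made up for the example.

```shell
# A literal backspace character for use inside the sed command.
bs=$(printf '\b')

# The bolded word "NAME": each letter is struck four times.
bold=$(printf 'N\bN\bN\bNA\bA\bA\bAM\bM\bM\bME\bE\bE\bE\n' |
    sed "s/.${bs}//g")

# The underlined word "file": underscore, backspace, letter.
under=$(printf '_\bf_\bi_\bl_\be\n' | sed "s/.${bs}//g")

printf '%s\n%s\n' "$bold" "$under"
```

Each character-plus-backspace pair is deleted, so only the final copy of each letter remains.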

A sample application is "de-formatting" an nroff-produced man page found on an older System V UNIX system.[8] If you should want to access the formatted pages with a text editor, you'd want to get a clean version. (In many ways, this is a similar problem to one we solved in converting a word processing file in the previous chapter.) A formatted man page captured in a file looks like this:

[8] For a while, many System V UNIX vendors only provided preformatted manpages. This allowed the man command to show information quickly, instead of having to format it, but the lack of troff source to the manpages made it difficult to fix documentation errors. Fortunately, most vendors of modern UNIX systems supply source for their manuals.

^[9     who(1)                                             who(1)
^[9 N^HN^HN^HNA^HA^HA^HAM^HM^HM^HME^HE^HE^HE
      who - who is on the system?
  S^HS^HS^HSY^HY^HY^HYN^HN^HN^HNO^HO^HO^HOP^HP^HP^HPS^HS^HS^HSI^HI
      who [-a] [-b] [-d] [-H] [-l] [-p] [-q] [-r] [-s] [-t] [-T]
      [-u] [_^Hf_^Hi_^Hl_^He]
          who am i
          who am I
  D^HD^HD^HDE^HE^HE^HES^HS^HS^HSC^HC^HC^HCR^HR^HR^HRI^HI^HI^HIP^HP
      who can list the user's name, terminal line, login time,
      elapsed time since activity occurred on the line, and the
...

In addition to the bolding and underlining sequences, the file contains strange escape sequences that produce form feeds or various other printer functions, and these need to be stripped as well. You can see the sequence "^[9" at the top of the formatted manpage. This escape sequence can simply be removed:

s/^[9//g

Once again, the ESC character is entered in vi by typing CTRL-V followed by pressing the ESC key. The number 9 is literal. There are also what look to be leading spaces that supply the left margin and indentation. On further examination, it turns out that leading spaces precede headings such as "NAME" but a single tab precedes each line of text. Also, there are tabs that unexpectedly appear in the text, which have to do with how nroff optimizes for display on a CRT screen.
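One way to carry out that kind of examination is the list command from the beginning of this section, since it makes tabs visible. The sketch below uses two made-up lines rather than the real man page, and assumes a POSIX-conforming sed, which writes a tab as \t and marks the end of each line with "$".

```shell
# One line indented with a space and one indented with a tab look alike
# on screen; the l command shows which is which.
inspected=$(printf ' NAME\n\twho - who is on the system?\n' | sed -n 'l')
printf '%s\n' "$inspected"
```

In the listing, the heading still begins with a plain space, while the text line begins with \t.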

To eliminate the left margin and the unwanted tabs, we add two commands to our previous two:

# sedman -- deformat nroff-formatted manpage
s/.^H//g
s/^[9//g
s/^[ ·]*//g
s/·/ /g

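The third command looks for any number of tabs or spaces at the beginning of a line. (A tab is represented by "·" and a space by " ".) Note that in this command "^[" is the beginning-of-line anchor followed by the start of a bracket expression, not the ESC character used in the second command. The last command looks for a tab and replaces it with a single space. Running this script on our sample man page output produces a file that looks like this: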

who(1)                                                     who(1)
NAME
who - who is on the system?
SYNOPSIS
who [-a] [-b] [-d] [-H] [-l] [-p] [-q] [-r] [-s] [-t] [-T]
[-u] [file]
who am i
who am I
DESCRIPTION
who can list the user's name, terminal line, login time,
elapsed time since activity occurred on the line, and the
...

This script does not eliminate the unnecessary blank lines caused by paging. We will look at how to do that in the next chapter, as it requires a multiline operation.
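For experimenting with the four commands without typing control characters in vi, they can be wrapped in a shell function that gets the literal characters from printf. This is a sketch under stated assumptions, not the script file itself: the function names sedman and sample, the variable names, and the three-line sample input are all hypothetical.

```shell
# Hypothetical helper variables holding the literal control characters.
esc=$(printf '\033')   # ESC, shown as ^[ in the text
bs=$(printf '\b')      # backspace, shown as ^H
tab=$(printf '\t')     # tab, shown as a raised dot in the text

# The four commands of the script, in the same order.
sedman() {
    sed -e "s/.${bs}//g" \
        -e "s/${esc}9//g" \
        -e "s/^[ ${tab}]*//g" \
        -e "s/${tab}/ /g"
}

# A made-up three-line sample in the style of the formatted page.
sample() {
    printf '%s9     who(1)\n' "$esc"
    printf '%s9 N\bN\bN\bNA\bA\bA\bAM\bM\bM\bME\bE\bE\bE\n' "$esc"
    printf '\twho - who is on\tthe system?\n'
}

cleaned=$(sample | sedman)
printf '%s\n' "$cleaned"
```

The order of the commands matters for the second sample line: the escape sequence has to be removed before the leading-whitespace command runs, or the space that follows "^[9" would not yet be at the beginning of the line.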

