Unix Power Tools

15.6. Compressing Files to Save Space

gzip is a fast and efficient compression program distributed by the GNU project. The basic function of gzip is to take a file filename, compress it, save the compressed version as filename.gz, and remove the original, uncompressed file. The original file is removed only if gzip is successful; it is very difficult to delete a file accidentally in this manner. Of course, being GNU software, gzip has more options than you want to think about, and many aspects of its behavior can be modified using command-line options.

First, let's say that we have a large file named garbage.txt:

rutabaga% ls -l garbage.txt*
-rw-r--r--   1 mdw      hack       312996 Nov 17 21:44 garbage.txt

If we compress this file using gzip, it replaces garbage.txt with the compressed file garbage.txt.gz. We end up with the following:

rutabaga% gzip garbage.txt
rutabaga% ls -l garbage.txt*
-rw-r--r--   1 mdw      hack       103441 Nov 17 21:48 garbage.txt.gz

Note that garbage.txt is removed when gzip completes.

You can give gzip a list of filenames; it compresses each file in the list, storing each with a .gz extension. (Unlike the zip program for Unix and MS-DOS systems, gzip will not, by default, compress several files into a single .gz archive. That's what tar is for; see Section 15.7.)
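To see the one-file-in, one-file-out behavior for yourself, here is a minimal sketch. The filenames (notes.txt, todo.txt, log.txt) are made up for illustration, and the script assumes mktemp is available so that it can work in a scratch directory:

```shell
# Work in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Three hypothetical sample files.
for name in notes.txt todo.txt log.txt; do
    printf 'sample line for %s\n' "$name" > "$name"
done

# One gzip invocation produces three separate .gz files, not one archive.
gzip notes.txt todo.txt log.txt

# Count what is left behind.
gz_count=$(ls *.gz | wc -l | tr -d ' ')
txt_count=$(ls *.txt 2>/dev/null | wc -l | tr -d ' ')
ls -l
echo "compressed files: $gz_count, originals left: $txt_count"
```

Each original is replaced by its own .gz file, so the directory ends up with three compressed files and no uncompressed ones.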

Go to http://examples.oreilly.com/upt3 for more information on: gzip

How efficiently a file is compressed depends upon its format and contents. For example, many audio and graphics file formats (such as MP3 and JPEG) are already well compressed, and gzip will have little or no effect upon such files. Files that compress well usually include plain-text files and binary files such as executables and libraries. You can get information on a gzipped file using gzip -l. For example:

rutabaga% gzip -l garbage.txt.gz
compressed  uncompr. ratio uncompressed_name
   103115    312996  67.0% garbage.txt
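The effect of file contents on the ratio is easy to check with a rough experiment. The sketch below is an illustration only: it uses random bytes from /dev/urandom as a stand-in for an already-compressed file such as a JPEG or MP3 (an assumption, not a real image or audio file), and it relies on yes, head -c, and mktemp being available. The filenames are invented.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Highly repetitive text: 5000 copies of the same line.
yes 'the quick brown fox jumps over the lazy dog' | head -n 5000 > text.txt

# The same number of bytes of random data, standing in for a file
# that is already compressed.
text_size=$(wc -c < text.txt | tr -d ' ')
head -c "$text_size" /dev/urandom > random.bin

# Compress both and let gzip report the ratios.
gzip text.txt random.bin
gzip -l text.txt.gz random.bin.gz

# Record the compressed sizes for comparison.
text_gz_size=$(wc -c < text.txt.gz | tr -d ' ')
random_gz_size=$(wc -c < random.bin.gz | tr -d ' ')
echo "text:   $text_size -> $text_gz_size bytes"
echo "random: $text_size -> $random_gz_size bytes"
```

The repetitive text should shrink to a small fraction of its size, while the random data should stay about the same size or even grow slightly, because gzip finds no redundancy to remove.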

To get our original file back from the compressed version, we use gunzip, as in:

rutabaga% gunzip garbage.txt.gz
rutabaga% ls -l garbage.txt
-rw-r--r--   1 mdw      hack       312996 Nov 17 21:44 garbage.txt

which is identical to the original file. Note that when you gunzip a file, the compressed version is removed once the uncompression is complete.

gzip stores the name of the original, uncompressed file in the compressed version. This allows the name of the compressed file to be irrelevant; when the file is uncompressed it can be restored to its original splendor. To uncompress a file to its original filename, use the -N option with gunzip. To see the value of this option, consider the following sequence of commands:

rutabaga% gzip garbage.txt
rutabaga% mv garbage.txt.gz rubbish.txt.gz

If we were to gunzip rubbish.txt.gz at this point, the uncompressed file would be named rubbish.txt, after the new (compressed) filename. However, with the -N option, we get the following:

rutabaga% gunzip -N rubbish.txt.gz
rutabaga% ls -l garbage.txt
-rw-r--r--   1 mdw      hack       312996 Nov 17 21:44 garbage.txt
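If you would like to compare the two behaviors side by side, the following sketch repeats the experiment in a scratch directory. The subdirectory names plain and named are made up for the example, and the script assumes mktemp is available:

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Compress a small file, then rename the compressed version.
printf 'some text worth keeping\n' > garbage.txt
gzip garbage.txt
mv garbage.txt.gz rubbish.txt.gz

# Put a copy in each of two directories so both behaviors can be shown.
mkdir plain named
cp rubbish.txt.gz plain/
cp rubbish.txt.gz named/

(cd plain && gunzip rubbish.txt.gz)      # name taken from the .gz filename
(cd named && gunzip -N rubbish.txt.gz)   # name taken from the stored original name

plain_name=$(ls plain)
named_name=$(ls named)
echo "without -N: $plain_name"
echo "with -N:    $named_name"
```

Without -N the result is named after the renamed .gz file (rubbish.txt); with -N it comes back under the name stored at compression time (garbage.txt).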

gzip and gunzip can also compress or uncompress data from standard input and output. If gzip is given no filenames to compress, it attempts to compress data read from standard input. Likewise, if you use the -c option with gunzip, it writes uncompressed data to standard output. For example, you could pipe the output of a command to gzip to compress the output stream and save it to a file in one step, as in:

rutabaga% ls -laR $HOME | gzip > filelist.gz

This will produce a recursive directory listing of your home directory and save it in the compressed file filelist.gz. You can display the contents of this file with the command:

rutabaga% gunzip -c filelist.gz | less

This will uncompress filelist.gz and pipe the output to the less (Section 12.3) command. When you use gunzip -c, the file on disk remains compressed.
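The same pipeline idea works with any command. Here is a small, self-contained sketch that uses seq (assumed to be installed) in place of ls -laR so the result is predictable; the filename numbers.gz is invented for the example:

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Compress a command's output on the fly; no uncompressed file is written.
seq 1 1000 | gzip > numbers.gz

# Read it back through a pipe; numbers.gz stays compressed on disk.
line_count=$(gunzip -c numbers.gz | wc -l | tr -d ' ')
last_line=$(gunzip -c numbers.gz | tail -n 1)
echo "lines: $line_count, last: $last_line"
ls -l numbers.gz
```

Because gunzip -c only writes to standard output, you can run it as many times as you like without disturbing the compressed file.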

The gzcat command is identical to gunzip -c; on many systems, including Linux, it is installed as zcat. You can think of this as a version of cat for compressed files. Some systems, including Linux, even have a version of the pager less for compressed files: zless.

When compressing files, you can use one of the options -1, -2, through -9 to specify the speed and quality of the compression used. -1 (also --fast) specifies the fastest method, which compresses the files less compactly, while -9 (also --best) uses the slowest, but best compression method. If you don't specify one of these options, the default is -6. None of these options has any bearing on how you use gunzip; gunzip can uncompress the file no matter what speed option you use.
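To get a feel for the trade-off, you can compress the same input at several levels and compare the results. This is only a sketch: the sample data comes from seq (assumed to be installed), the filenames are made up, and the size difference you see will vary with your data. It uses gzip -c, which writes the compressed data to standard output and leaves the input file in place.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input: the numbers 1 to 200000, one per line.
seq 1 200000 > sample.txt

# -c writes to standard output, so sample.txt survives each run.
gzip -1 -c sample.txt > fast.gz
gzip -6 -c sample.txt > default.gz
gzip -9 -c sample.txt > best.gz

fast_size=$(wc -c < fast.gz | tr -d ' ')
best_size=$(wc -c < best.gz | tr -d ' ')
ls -l sample.txt fast.gz default.gz best.gz

# Whatever level was used, gunzip needs no matching option.
gunzip -c fast.gz | cmp - sample.txt && echo "fast.gz round-trips"
gunzip -c best.gz | cmp - sample.txt && echo "best.gz round-trips"
```

Put time in front of each gzip command if you also want to see how much longer the higher levels take on your machine.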

Go to http://examples.oreilly.com/upt3 for more information on: bzip, bzip2

Another compression/decompression program has emerged to take the lead from gzip. bzip2 is the new kid on the block and sports even better compression (on average about 10 to 20% better than gzip), at the expense of longer compression times. You cannot use bunzip2 to uncompress files compressed with gzip, or gunzip to uncompress files compressed with bzip2. Since you cannot expect everybody to have bunzip2 installed on their machine, you might want to confine yourself to gzip for the time being if you want to send the compressed file to somebody else (or, as many archives do, provide both gzip- and bzip2-compressed versions of the file). However, it pays to have bzip2 installed, because more and more FTP servers now provide bzip2-compressed packages to conserve disk space and, more importantly these days, bandwidth. You can recognize bzip2-compressed files by their typical .bz2 filename extension.

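The command-line options of bzip2 are not exactly the same as those of gzip, but several of those described in this section carry over, including -c and -1 through -9. Some versions of bzip2 lack --best and --fast; recent ones accept them as aliases for -9 and -1, mainly for compatibility with gzip. bzip2 has no equivalent of -l or -N, because it does not store the original filename in the compressed file. For more information, see the bzip2 manual page.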

The bottom line is that you should use gzip/gunzip or bzip2/bunzip2 for your compression needs. If you encounter a file with the extension .Z, it was probably produced by compress, and gunzip can uncompress it for you.

[These days, the only real use for compress -- if you have gzip and bzip2 -- is for creating compressed images needed by some embedded hardware, such as older Cisco IOS images. -- DJPH]

-- MW, MKD, and LK




Copyright © 2003 O'Reilly & Associates. All rights reserved.