Unix Power Tools

32.18. Limiting the Extent of a Match

A regular expression tries to match the longest string possible, which can cause unexpected problems. For instance, look at the following regular expression, which matches any number of characters inside quotation marks:

".*"

Let's imagine an HTML table with lots of entries, each of which has two quoted strings, as shown below:

<td><a href="#arts"><img src="d_arrow.gif" border=0></a>

All the text in each line of the table is the same, except the text inside the quotes. To match the line through the first quoted string, a novice might describe the pattern with the following regular expression:

<td><a href=".*"

However, the pattern ends up matching almost all of the entry, because .* grabs as much as it can: the second quotation mark in the pattern matches the last quotation mark on the line, not the one that closes the href value! If you know how many quoted strings there are, you can specify each of them:

<td><a href=".*"><img src=".*" border=0></a>
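The greedy behavior described above can be checked with a short sketch. Python's re module is used here only as an assumption for illustration; the same patterns apply in grep, sed, and similar tools. The first search uses the href pattern ending at its closing quotation mark, and the second spells out both quoted strings.

```python
import re

# Sample table entry from the text above.
line = '<td><a href="#arts"><img src="d_arrow.gif" border=0></a>'

# Greedy .* runs to the end of the line, then backs up only as far as
# the LAST quotation mark, so the match swallows the img tag's src too.
greedy = re.search(r'<td><a href=".*"', line).group(0)
print(greedy)  # <td><a href="#arts"><img src="d_arrow.gif"

# Spelling out both quoted strings pins each .* between fixed text,
# so the pattern matches the whole entry and nothing more.
explicit = re.search(r'<td><a href=".*"><img src=".*" border=0></a>', line)
print(explicit.group(0) == line)  # True
```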

Although specifying every quoted string works as you'd expect, some lines in the file might not have the same number of quoted strings, causing misses that should be hits, when all you want is the first quoted string. Here's a different regular expression that matches the shortest possible extent between two quotation marks:

"[^"]*"

It matches "a quote, followed by any number of characters that are not quotes, followed by a quote." Note, however, that it will be fooled by escaped quotes, because it stops at the first quotation mark it sees, even one preceded by a backslash, in strings such as the following:

$strExample = "This sentence contains an escaped \" character.";
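Here is a minimal sketch of both points, again assuming Python's re module purely for illustration. It shows the negated character class stopping at the first closing quote of the table entry, and then being fooled by the escaped quote in the string above. The last pattern is not from the text: it is one common way to allow backslash escapes, and is offered as an assumption rather than a complete solution.

```python
import re

line = '<td><a href="#arts"><img src="d_arrow.gif" border=0></a>'

# [^"]* cannot cross a quotation mark, so the match ends at the first
# closing quote instead of the last one on the line.
first = re.search(r'<td><a href="[^"]*"', line).group(0)
print(first)   # <td><a href="#arts"

# Each quoted string is now found separately.
quoted = re.findall(r'"[^"]*"', line)
print(quoted)  # ['"#arts"', '"d_arrow.gif"']

# The escaped-quote example from the text: the match ends at the
# quotation mark that follows the backslash, cutting the string short.
code = r'$strExample = "This sentence contains an escaped \" character.";'
fooled = re.search(r'"[^"]*"', code).group(0)
print(fooled)

# Assumption (not from the text): treat a backslash plus the character
# after it as a single unit, so an escaped quote cannot end the match.
escaped_ok = re.search(r'"(?:[^"\\]|\\.)*"', code).group(0)
print(escaped_ok)
```

The extra alternative costs some readability, so the plain "[^"]*" form remains the better choice when the input is known to contain no escaped quotes.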

The use of what we might call "negated character classes" like this is one of the things that distinguishes the journeyman regular expression user from the novice.

--DD and JP




Copyright © 2003 O'Reilly & Associates. All rights reserved.