Unix Power Tools

42.7. htmllib and HTMLParser

Python provides the htmllib module for parsing HTML content, which is often useful when dealing with web resources. Python also has an HTMLParser module, which handles both XHTML and HTML and provides a slightly lower-level view of the content. HTMLParser is also slightly simpler to use: htmllib is built on sgmllib, so it carries over many of the complexities of SGML.

HTMLParser provides a class that you subclass, defining methods that are called as tags are found in the input. The example below is a very basic HTML parser that subclasses HTMLParser.HTMLParser to print out tags as they are encountered:

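from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called once for each opening tag, such as <p>
    def handle_starttag(self, tag, attrs):
        print "Encountered the beginning of a %s tag" % tag

    # Called once for each closing tag, such as </p>
    def handle_endtag(self, tag):
        print "Encountered the end of a %s tag" % tag

parser = MyHTMLParser()
parser.feed("<html><body><p>Hello</p></body></html>")
parser.close()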
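
The attrs argument passed to handle_starttag can be put to work as well. What follows is a minimal sketch, not a definitive implementation: it relies on HTMLParser passing attrs as a list of (name, value) pairs with tag and attribute names lowercased. LinkCollector and collect_links are hypothetical names chosen for illustration. The html.parser fallback is an addition for newer Python versions, where the module was renamed.

```python
try:
    from HTMLParser import HTMLParser    # Python 2 name, as used in this section
except ImportError:
    from html.parser import HTMLParser   # assumption: Python 3, where the module was renamed


class LinkCollector(HTMLParser):
    """Hypothetical subclass that records the href of every <a> tag."""

    def __init__(self):
        # Call the base initializer directly; this works in both Python 2 and 3.
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def collect_links(html):
    """Feed a string of HTML to a fresh parser and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling collect_links on a page's source returns the href values in the order in which the tags appear. Keeping the results on the parser instance, rather than printing them from the handler, makes them available to the rest of a script.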

-- DJPH




Copyright © 2003 O'Reilly & Associates. All rights reserved.