Python 2.6 Text Processing: Beginner's Guide Table of Contents


Preface
Chapter 1: Getting Started
Chapter 2: Working with the IO System
Chapter 3: Python String Services
Chapter 4: Text Processing Using the Standard Library
Chapter 5: Regular Expressions
Chapter 6: Structured Markup
Chapter 7: Creating Templates
Chapter 8: Understanding Encodings and i18n
Chapter 9: Advanced Output Formats
Chapter 10: Advanced Parsing and Grammars
Chapter 11: Searching and Indexing
Appendix A: Looking for Additional Resources
Appendix B: Pop Quiz Answers
Index

  • Chapter 1: Getting Started
    • Categorizing types of text data
      • Providing information through markup
      • Meaning through structured formats
      • Understanding freeform content
    • Ensuring you have Python installed
      • Providing support for Python 3
    • Implementing a simple cipher
    • Time for action – implementing a ROT13 encoder
      • Processing structured markup with a filter
    • Time for action – processing as a filter
    • Time for action – skipping over markup tags
      • State machines
    • Supporting third-party modules
      • Packaging in a nutshell
    • Time for action – installing SetupTools
    • Running a virtual environment
      • Configuring virtualenv
    • Time for action – configuring a virtual environment
    • Where to get help?
    • Summary
  • Chapter 2: Working with the IO System
    • Parsing web server logs
    • Time for action – generating transfer statistics
    • Using objects interchangeably
    • Time for action – introducing a new log format
    • Accessing files directly
    • Time for action – accessing files directly
      • Context managers
      • Handling other file types
    • Time for action – handling compressed files
      • Implementing file-like objects
        • File object methods
        • Enabling universal newlines
    • Accessing multiple files
    • Time for action – spell-checking HTML content
      • Simplifying multiple file access
        • Inplace filtering
    • Accessing remote files
    • Time for action – spell-checking live HTML pages
      • Error handling
    • Time for action – handling urllib2 errors
    • Handling string IO instances
    • Understanding IO in Python 3
    • Summary
  • Chapter 3: Python String Services
    • Understanding the basics of string object
      • Defining strings
    • Time for action – employee management
      • Building non-literal strings
    • String formatting
    • Time for action – customizing log processor output
      • Percent (modulo) formatting
        • Mapping key
        • Conversion flags
        • Minimum width
        • Precision
        • Width
        • Conversion type
      • Using the format method approach
    • Time for action – adding status code data
      • Making use of conversion specifiers
    • Creating templates
    • Time for action – displaying warnings on malformed lines
      • Template syntax
      • Rendering a template
    • Calling string object methods
    • Time for action – simple manipulation with string methods
      • Aligning text
      • Detecting character classes
      • Casing
      • Searching strings
      • Dealing with lists of strings
        • Treating strings as sequences
    • Summary
  • Chapter 4: Text Processing Using the Standard Library
    • Reading CSV data
    • Time for action – processing Excel formats
    • Time for action – CSV and formulas
      • Reading non-Excel data
    • Time for action – processing custom CSV formats
    • Writing CSV data
    • Time for action – creating a spreadsheet of UNIX users
    • Modifying application configuration files
    • Time for action – adding basic configuration read support
      • Using value interpolation
    • Time for action – relying on configuration value interpolation
      • Handling default options
    • Time for action – configuration defaults
    • Writing configuration data
    • Time for action – generating a configuration file
    • Reconfiguring our source
      • A note on Python 3
    • Time for action – creating an egg-based package
      • Understanding the setup.py file
    • Working with JSON
    • Time for action – writing JSON data
      • Encoding data
      • Decoding data
    • Summary
  • Chapter 5: Regular Expressions
    • Simple string matching
    • Time for action – testing an HTTP URL
      • Understanding the match function
      • Learning basic syntax
        • Detecting repetition
        • Specifying character sets and classes
        • Applying anchors to restrict matches
      • Wrapping it up
    • Advanced pattern matching
      • Grouping
    • Time for action – regular expression grouping
      • Using greedy versus non-greedy operators
      • Assertions
        • Performing an 'or' operation
    • Implementing Python-specific elements
      • Other search functions
        • search
        • findall and finditer
        • split
        • sub
      • Compiled expression objects
        • Dealing with performance issues
      • Parser flags
      • Unicode regular expressions
      • The match object
        • Processing bind zone files
    • Time for action – reading DNS records
    • Summary
  • Chapter 6: Structured Markup
    • XML data
    • SAX processing
    • Time for action – event-driven processing
      • Incremental processing
    • Time for action – driving incremental processing
      • Building an application
    • Time for action – creating a dungeon adventure game
    • The Document Object Model
      • xml.dom.minidom
    • Time for action – updating our game to use DOM processing
      • Creating and modifying documents programmatically
    • XPath
      • Accessing XML data using ElementTree
    • Time for action – using XPath in our adventure
    • Reading HTML
    • Time for action – displaying links in an HTML page
      • BeautifulSoup
    • Summary
  • Chapter 7: Creating Templates
    • Time for action – installing Mako
    • Basic Mako usage
    • Time for action – loading a simple Mako template
      • Generating a template context
      • Managing execution with control structures
      • Including Python code
    • Time for action – reformatting the date with Python code
      • Adding functionality with tags
        • Rendering files with %include
        • Generating multiline comments with %doc
        • Documenting Mako with %text
        • Defining functions with %def
    • Time for action – defining Mako def tags
      • Importing %def sections using %namespace
    • Time for action – converting mail message to use namespaces
      • Filtering output
        • Expression filters
        • Filtering the output of %def blocks
        • Setting default filters
    • Inheriting from base templates
    • Time for action – updating base template
      • Growing the inheritance chain
    • Time for action – adding another inheritance layer
      • Inheriting attributes
    • Customizing
      • Custom tags
    • Time for action – creating custom Mako tags
      • Customizing filters
    • Overviewing alternative approaches
    • Summary
  • Chapter 8: Understanding Encodings and i18n
    • Understanding basic character encodings
      • ASCII
        • Limitations of ASCII
      • KOI8-R
    • Unicode
      • Using Unicode with Python 3
      • Understanding Unicode
        • Design goals
      • Organizational structure
      • Backwards compatibility
      • Encoding
        • UTF-32
        • UTF-8
    • Encodings in Python
    • Time for action – manually decoding
      • Reading Unicode
      • Writing Unicode strings
    • Time for action – copying Unicode data
    • Time for action – fixing our copy application
    • The codecs module
    • Time for action – changing encodings
    • Adopting good practices
    • Internationalization and Localization
      • Preparing an application for translation
    • Time for action – preparing for multiple languages
    • Time for action – providing translations
      • Looking for more information on internationalization
    • Summary
  • Chapter 9: Advanced Output Formats
    • Dealing with PDF files using PLATYPUS
    • Time for action – installing ReportLab
      • Generating PDF documents
    • Time for action – writing PDF with basic layout and style
    • Writing native Excel data
    • Time for action – installing xlwt
      • Building XLS documents
    • Time for action – generating XLS data
    • Working with OpenDocument files
    • Time for action – installing ODFPy
      • Building an ODT generator
    • Time for action – generating ODT data
    • Summary
  • Chapter 10: Advanced Parsing and Grammars
    • Defining a language syntax
      • Specifying grammar with Backus-Naur Form
      • Grammar-driven parsing
    • PyParsing
    • Time for action – installing PyParsing
    • Time for action – implementing a calculator
      • Parse actions
    • Time for action – handling type translations
      • Suppressing parts of a match
    • Time for action – suppressing portions of a match
    • Processing data using the Natural Language Toolkit
    • Time for action – installing NLTK
      • NLTK processing examples
        • Removing stems
        • Discovering collocations
    • Summary
  • Chapter 11: Searching and Indexing
    • Understanding search complexity
    • Time for action – implementing a linear search
    • Text indexing
    • Time for action – installing Nucular
      • An introduction to Nucular
    • Time for action – full text indexing
    • Time for action – measuring index benefit
      • Scripts provided by Nucular
      • Using XML files
      • Advanced Nucular features
    • Time for action – field-qualified indexes
      • Performing an enhanced search
    • Time for action – performing advanced Nucular queries
    • Indexing and searching other data
    • Time for action – indexing Open Office documents
    • Other index systems
      • Apache Lucene
      • ZODB and zc.catalog
      • SQL text indexing
    • Summary
  • Appendix A: Looking for Additional Resources
    • Python resources
      • Unofficial documentation
      • Python enhancement proposals
      • Self-documenting
        • Using other documentation tools
      • Community resources
        • Following groups and mailing lists
        • Finding a users' group
        • Attending a local Python conference
    • Honorable mention
      • Lucene and Solr
      • Generating C-based parsers with GNU Bison
      • Apache Tika
    • Getting started with Python 3
      • Major language changes
        • Print is now a function
        • Catching exceptions
        • Using metaclasses
        • New reserved words
        • Major library changes
        • Changes to list comprehensions
      • Migrating to Python 3
    • Time for action – using 2to3 to move to Python 3
    • Summary
  • Appendix B: Pop Quiz Answers
    • Chapter 1: Getting Started
      • ROT 13 Processing Answers
    • Chapter 2: Working with the IO System
      • File-like objects
    • Chapter 3: Python String Services
      • String literals
      • String formatting
    • Chapter 4: Text Processing Using the Standard Library
      • CSV handling
      • JSON formatting
    • Chapter 5: Regular Expressions
      • Regular expressions
      • Understanding the Pythonisms
    • Chapter 6: Structured Markup
      • SAX processing
    • Chapter 7: Creating Templates
      • Template inheritance
    • Chapter 8: Understanding Encoding and i18n
      • Character encodings
      • Python encodings
      • Internationalization
    • Chapter 9: Advanced Output Formats
      • Creating XLS documents
    • Chapter 11: Searching and Indexing
      • Introduction to Nucular