Python Text Processing with NLTK 2.0 Cookbook

Table of Contents

Preface
Chapter 1: Tokenizing Text and WordNet Basics
Chapter 2: Replacing and Correcting Words
Chapter 3: Creating Custom Corpora
Chapter 4: Part-of-Speech Tagging
Chapter 5: Extracting Chunks
Chapter 6: Transforming Chunks and Trees
Chapter 7: Text Classification
Chapter 8: Distributed Processing and Handling Large Datasets
Chapter 9: Parsing Specific Data
Appendix: Penn Treebank Part-of-Speech Tags
Index

  • Chapter 1: Tokenizing Text and WordNet Basics
    • Introduction
    • Tokenizing text into sentences
    • Tokenizing sentences into words
    • Tokenizing sentences using regular expressions
    • Filtering stopwords in a tokenized sentence
    • Looking up synsets for a word in WordNet
    • Looking up lemmas and synonyms in WordNet
    • Calculating WordNet synset similarity
    • Discovering word collocations
  • Chapter 2: Replacing and Correcting Words
    • Introduction
    • Stemming words
    • Lemmatizing words with WordNet
    • Translating text with Babelfish
    • Replacing words matching regular expressions
    • Removing repeating characters
    • Spelling correction with Enchant
    • Replacing synonyms
    • Replacing negations with antonyms
  • Chapter 3: Creating Custom Corpora
    • Introduction
    • Setting up a custom corpus
    • Creating a word list corpus
    • Creating a part-of-speech tagged word corpus
    • Creating a chunked phrase corpus
    • Creating a categorized text corpus
    • Creating a categorized chunk corpus reader
    • Lazy corpus loading
    • Creating a custom corpus view
    • Creating a MongoDB backed corpus reader
    • Corpus editing with file locking
  • Chapter 4: Part-of-Speech Tagging
    • Introduction
    • Default tagging
    • Training a unigram part-of-speech tagger
    • Combining taggers with backoff tagging
    • Training and combining Ngram taggers
    • Creating a model of likely word tags
    • Tagging with regular expressions
    • Affix tagging
    • Training a Brill tagger
    • Training the TnT tagger
    • Using WordNet for tagging
    • Tagging proper names
    • Classifier based tagging
  • Chapter 5: Extracting Chunks
    • Introduction
    • Chunking and chinking with regular expressions
    • Merging and splitting chunks with regular expressions
    • Expanding and removing chunks with regular expressions
    • Partial parsing with regular expressions
    • Training a tagger-based chunker
    • Classification-based chunking
    • Extracting named entities
    • Extracting proper noun chunks
    • Extracting location chunks
    • Training a named entity chunker
  • Chapter 6: Transforming Chunks and Trees
    • Introduction
    • Filtering insignificant words
    • Correcting verb forms
    • Swapping verb phrases
    • Swapping noun cardinals
    • Swapping infinitive phrases
    • Singularizing plural nouns
    • Chaining chunk transformations
    • Converting a chunk tree to text
    • Flattening a deep tree
    • Creating a shallow tree
    • Converting tree nodes
  • Chapter 7: Text Classification
    • Introduction
    • Bag of Words feature extraction
    • Training a naive Bayes classifier
    • Training a decision tree classifier
    • Training a maximum entropy classifier
    • Measuring precision and recall of a classifier
    • Calculating high information words
    • Combining classifiers with voting
    • Classifying with multiple binary classifiers
  • Chapter 8: Distributed Processing and Handling Large Datasets
    • Introduction
    • Distributed tagging with execnet
    • Distributed chunking with execnet
    • Parallel list processing with execnet
    • Storing a frequency distribution in Redis
    • Storing a conditional frequency distribution in Redis
    • Storing an ordered dictionary in Redis
    • Distributed word scoring with Redis and execnet
  • Chapter 9: Parsing Specific Data
    • Introduction
    • Parsing dates and times with Dateutil
    • Time zone lookup and conversion
    • Tagging temporal expressions with Timex
    • Extracting URLs from HTML with lxml
    • Cleaning and stripping HTML
    • Converting HTML entities with BeautifulSoup
    • Detecting and converting character encodings
