Table of Contents
Preface
Chapter 1: Tokenizing Text and WordNet Basics
Chapter 2: Replacing and Correcting Words
Chapter 3: Creating Custom Corpora
Chapter 4: Part-of-Speech Tagging
Chapter 5: Extracting Chunks
Chapter 6: Transforming Chunks and Trees
Chapter 7: Text Classification
Chapter 8: Distributed Processing and Handling Large Datasets
Chapter 9: Parsing Specific Data
Appendix: Penn Treebank Part-of-Speech Tags
Index
- Chapter 1: Tokenizing Text and WordNet Basics
  - Introduction
  - Tokenizing text into sentences
  - Tokenizing sentences into words
  - Tokenizing sentences using regular expressions
  - Filtering stopwords in a tokenized sentence
  - Looking up synsets for a word in WordNet
  - Looking up lemmas and synonyms in WordNet
  - Calculating WordNet synset similarity
  - Discovering word collocations
- Chapter 2: Replacing and Correcting Words
  - Introduction
  - Stemming words
  - Lemmatizing words with WordNet
  - Translating text with Babelfish
  - Replacing words matching regular expressions
  - Removing repeating characters
  - Spelling correction with Enchant
  - Replacing synonyms
  - Replacing negations with antonyms
- Chapter 3: Creating Custom Corpora
  - Introduction
  - Setting up a custom corpus
  - Creating a word list corpus
  - Creating a part-of-speech tagged word corpus
  - Creating a chunked phrase corpus
  - Creating a categorized text corpus
  - Creating a categorized chunk corpus reader
  - Lazy corpus loading
  - Creating a custom corpus view
  - Creating a MongoDB backed corpus reader
  - Corpus editing with file locking
- Chapter 4: Part-of-Speech Tagging
  - Introduction
  - Default tagging
  - Training a unigram part-of-speech tagger
  - Combining taggers with backoff tagging
  - Training and combining Ngram taggers
  - Creating a model of likely word tags
  - Tagging with regular expressions
  - Affix tagging
  - Training a Brill tagger
  - Training the TnT tagger
  - Using WordNet for tagging
  - Tagging proper names
  - Classifier based tagging
- Chapter 5: Extracting Chunks
  - Introduction
  - Chunking and chinking with regular expressions
  - Merging and splitting chunks with regular expressions
  - Expanding and removing chunks with regular expressions
  - Partial parsing with regular expressions
  - Training a tagger-based chunker
  - Classification-based chunking
  - Extracting named entities
  - Extracting proper noun chunks
  - Extracting location chunks
  - Training a named entity chunker
- Chapter 6: Transforming Chunks and Trees
  - Introduction
  - Filtering insignificant words
  - Correcting verb forms
  - Swapping verb phrases
  - Swapping noun cardinals
  - Swapping infinitive phrases
  - Singularizing plural nouns
  - Chaining chunk transformations
  - Converting a chunk tree to text
  - Flattening a deep tree
  - Creating a shallow tree
  - Converting tree nodes
- Chapter 7: Text Classification
  - Introduction
  - Bag of Words feature extraction
  - Training a naive Bayes classifier
  - Training a decision tree classifier
  - Training a maximum entropy classifier
  - Measuring precision and recall of a classifier
  - Calculating high information words
  - Combining classifiers with voting
  - Classifying with multiple binary classifiers
- Chapter 8: Distributed Processing and Handling Large Datasets
  - Introduction
  - Distributed tagging with execnet
  - Distributed chunking with execnet
  - Parallel list processing with execnet
  - Storing a frequency distribution in Redis
  - Storing a conditional frequency distribution in Redis
  - Storing an ordered dictionary in Redis
  - Distributed word scoring with Redis and execnet
- Chapter 9: Parsing Specific Data
  - Introduction
  - Parsing dates and times with Dateutil
  - Time zone lookup and conversion
  - Tagging temporal expressions with Timex
  - Extracting URLs from HTML with lxml
  - Cleaning and stripping HTML
  - Converting HTML entities with BeautifulSoup
  - Detecting and converting character encodings