Python Text Processing with NLTK 2.0 Cookbook

Jacob Perkins

Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
Learn how machines and crawlers interpret and process natural languages
Easily work with huge amounts of data and learn how to handle distributed processing
Part of Packt's Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Book Details

Language : English
Paperback : 272 pages [ 235mm x 191mm ]
Release Date : November 2010
ISBN : 1849513600
ISBN 13 : 9781849513609
Author(s) : Jacob Perkins
Topics and Technologies : All Books, Cookbooks, Open Source

Preface

Chapter 1: Tokenizing Text and WordNet Basics
- Introduction
- Tokenizing text into sentences
- Tokenizing sentences into words
- Tokenizing sentences using regular expressions
- Filtering stopwords in a tokenized sentence
- Looking up synsets for a word in WordNet
- Looking up lemmas and synonyms in WordNet
- Calculating WordNet synset similarity
- Discovering word collocations

Chapter 2: Replacing and Correcting Words
- Introduction
- Stemming words
- Lemmatizing words with WordNet
- Translating text with Babelfish
- Replacing words matching regular expressions
- Removing repeating characters
- Spelling correction with Enchant
- Replacing synonyms
- Replacing negations with antonyms

Chapter 3: Creating Custom Corpora
- Introduction
- Setting up a custom corpus
- Creating a word list corpus
- Creating a part-of-speech tagged word corpus
- Creating a chunked phrase corpus
- Creating a categorized text corpus
- Creating a categorized chunk corpus reader
- Lazy corpus loading
- Creating a custom corpus view
- Creating a MongoDB backed corpus reader
- Corpus editing with file locking

Chapter 4: Part-of-Speech Tagging
- Introduction
- Default tagging
- Training a unigram part-of-speech tagger
- Combining taggers with backoff tagging
- Training and combining Ngram taggers
- Creating a model of likely word tags
- Tagging with regular expressions
- Affix tagging
- Training a Brill tagger
- Training the TnT tagger
- Using WordNet for tagging
- Tagging proper names
- Classifier based tagging

Chapter 5: Extracting Chunks
- Introduction
- Chunking and chinking with regular expressions
- Merging and splitting chunks with regular expressions
- Expanding and removing chunks with regular expressions
- Partial parsing with regular expressions
- Training a tagger-based chunker
- Classification-based chunking
- Extracting named entities
- Extracting proper noun chunks
- Extracting location chunks
- Training a named entity chunker

Chapter 6: Transforming Chunks and Trees
- Introduction
- Filtering insignificant words
- Correcting verb forms
- Swapping verb phrases
- Swapping noun cardinals
- Swapping infinitive phrases
- Singularizing plural nouns
- Chaining chunk transformations
- Converting a chunk tree to text
- Flattening a deep tree
- Creating a shallow tree
- Converting tree nodes

Chapter 7: Text Classification
- Introduction
- Bag of Words feature extraction
- Training a naive Bayes classifier
- Training a decision tree classifier
- Training a maximum entropy classifier
- Measuring precision and recall of a classifier
- Calculating high information words
- Combining classifiers with voting
- Classifying with multiple binary classifiers

Chapter 8: Distributed Processing and Handling Large Datasets
- Introduction
- Distributed tagging with execnet
- Distributed chunking with execnet
- Parallel list processing with execnet
- Storing a frequency distribution in Redis
- Storing a conditional frequency distribution in Redis
- Storing an ordered dictionary in Redis
- Distributed word scoring with Redis and execnet

Chapter 9: Parsing Specific Data
- Introduction
- Parsing dates and times with Dateutil
- Time zone lookup and conversion
- Tagging temporal expressions with Timex
- Extracting URLs from HTML with lxml
- Cleaning and stripping HTML
- Converting HTML entities with BeautifulSoup
- Detecting and converting character encodings

Appendix: Penn Treebank Part-of-Speech Tags

Index

Jacob Perkins

Jacob Perkins has been an avid user of open source software since high school, when he first built his own computer and didn't want to pay for Windows. At one point he had 5 operating systems installed, including RedHat Linux, OpenBSD, and BeOS.

While at Washington University in St. Louis, Jacob took classes in Spanish and poetry writing, and worked on an independent study project that eventually became his Master's Project: WUGLE – a GUI for manipulating logical expressions. In his free time, he wrote the Gnome2 version of Seahorse (a GUI for encryption and key management), which has since been translated into over a dozen languages and is included in the default Gnome distribution.

After getting his MS in Computer Science, Jacob tried to start a web development studio with some friends, but since no one knew anything about web development, it didn't work out as planned. Once he'd actually learned web development, he went off and co-founded another company called Weotta, which sparked his interest in Machine Learning and Natural Language Processing.

Jacob is currently the CTO / Chief Hacker for Weotta and blogs about what he's learned along the way at http://streamhacker.com/. He is also applying this knowledge to produce text processing APIs and demos at http://text-processing.com/. This book is a synthesis of his knowledge on processing text using Python, NLTK, and more.

Errata

- 4 submitted: last submission 26 Jul 2012

Errata type: Others | Page number: 29

The WordNetLemmatizer is a thin wrapper around the WordNet corpus, and uses the morphy() function of the WordNetCorpusReader to find a lemma. If no lemma is found, or the word itself is a lemma, the word is returned as it is. Unlike with stemming, knowing the part of speech of the word is important. As demonstrated previously, "cooking" does not return a different lemma unless you specify that the part of speech (pos) is a verb. This is because the default part of speech is a noun, and as a noun, "cooking" is its own lemma. "Cookbooks", on the other hand, is a noun, and its lemma is the singular form, "cookbook".
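The lookup behaviour described in this erratum can be illustrated with a toy sketch. It does not use NLTK: the `LEXICON` and `RULES` tables and the `lemmatize` function are hypothetical stand-ins for the WordNet corpus and the suffix rules applied by morphy(), kept tiny so the example runs on its own.

```python
# Toy stand-in for WordNet (an assumption, not NLTK data):
# a tiny hand-made set of known lemmas per part of speech.
LEXICON = {
    "n": {"cooking", "cookbook"},
    "v": {"cook"},
}

# A small, hypothetical subset of suffix rules per part of speech,
# written as (suffix, replacement) pairs.
RULES = {
    "n": [("s", ""), ("ses", "s"), ("ies", "y")],
    "v": [("s", ""), ("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e")],
}


def lemmatize(word, pos="n"):
    """Return a lemma for word, or word itself if no lemma is found.

    The default part of speech is noun, mirroring the behaviour
    described in the erratum.
    """
    known = LEXICON.get(pos, set())
    if word in known:
        # The word is already a lemma for this part of speech.
        return word
    for suffix, replacement in RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            if candidate in known:
                return candidate
    # No lemma found: return the word as it is.
    return word


print(lemmatize("cooking"))           # → cooking (its own lemma as a noun)
print(lemmatize("cooking", pos="v"))  # → cook
print(lemmatize("cookbooks"))         # → cookbook
```

The same word yields different results depending on `pos`, which is why the part of speech matters for lemmatizing in a way it does not for stemming.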

Errata type: Others | Page number: 35

The replacement string is then used to keep all the matched groups, while discarding the backreference to the second group. So the word "looooove" gets split into

(looo)(o)o(ve)

and then recombined as "loooove", discarding the last "o". This continues until only one "o" remains, when repeat_regexp no longer matches the string, and no more characters are removed.
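A minimal sketch of this loop, using only the standard `re` module. The exact pattern, the class name, and the method name are assumptions chosen to reproduce the split shown above; only `repeat_regexp` is named in the erratum.

```python
import re


class RepeatReplacer:
    """Removes repeated characters one at a time (illustrative sketch)."""

    def __init__(self):
        # Assumed pattern: any prefix, one character, that same character
        # again (backreference \2), then any suffix.
        self.repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")
        # Keep all three groups, but drop the backreferenced duplicate.
        self.repl = r"\1\2\3"

    def replace(self, word):
        # Each pass removes one repeated character; stop once the
        # pattern no longer matches and the word stops changing.
        while True:
            shorter = self.repeat_regexp.sub(self.repl, word)
            if shorter == word:
                return word
            word = shorter


replacer = RepeatReplacer()
print(replacer.replace("looooove"))  # → love
```

Note that a pure regular-expression approach like this one also shortens legitimate double letters ("goose" becomes "gose"), so checking a word against a dictionary before replacing would be a sensible guard.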

Errata type: Code | Page number: 40

In the first line of the code, "wordReplacer" should be "WordReplacer".

Errata type: Code | Page number: 36

Replace the last line of the code snippet:

self.max_dist = 2
with
self.max_dist = max_dist
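To show why this correction matters, here is a hedged sketch. The book's snippet is not reproduced on this page, so the class name, the constructor signature, the `known_words` list, and the `edit_distance` helper are all assumptions made for illustration, and no spell-checking library is used. With the hard-coded value, a caller's `max_dist` argument would be silently ignored; with the corrected line it takes effect.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class SpellingReplacer:
    """Hypothetical replacer backed by a plain word list."""

    def __init__(self, known_words, max_dist=2):
        self.known_words = list(known_words)
        # Corrected line: store the caller's value instead of a fixed 2.
        self.max_dist = max_dist

    def replace(self, word):
        if not self.known_words or word in self.known_words:
            return word
        # Pick the closest known word and accept it only if it is
        # within the configured maximum edit distance.
        best = min(self.known_words, key=lambda w: edit_distance(word, w))
        if edit_distance(word, best) <= self.max_dist:
            return best
        return word


print(SpellingReplacer(["language"]).replace("lenguege"))              # → language
print(SpellingReplacer(["language"], max_dist=1).replace("lenguege"))  # → lenguege
```

"lenguege" is two substitutions away from "language", so it is corrected under the default limit of 2 but left alone when the caller asks for a stricter limit of 1.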

What you will learn from this book

Learn text categorization and topic identification
Learn stemming and lemmatization and how to go beyond the usual spell checker
Replace negations with antonyms in your text
Learn to tokenize text into lists of sentences and words, and gain an insight into WordNet
Transform and manipulate chunks and trees
Learn advanced features of corpus readers and create your own custom corpora
Tag different parts of speech by creating, training, and using a part-of-speech tagger
Improve accuracy by combining multiple part-of-speech taggers
Learn how to do partial parsing to extract small chunks of text from a part-of-speech tagged sentence
Produce an alternative canonical form without changing the meaning by normalizing parsed chunks
Learn how search engines use Natural Language Processing to process text
Make your site more discoverable by learning how to automatically replace words with more searched equivalents
Parse dates, times, and HTML
Train and manipulate different types of classifiers

In Detail

Natural Language Processing is used everywhere – in search engines, spell checkers, mobile phones, computer games – even your washing machine. Python's Natural Language Toolkit (NLTK) suite of libraries has rapidly emerged as one of the most efficient tools for Natural Language Processing. You want to employ nothing less than the best techniques in Natural Language Processing – and this book is your answer.

Python Text Processing with NLTK 2.0 Cookbook is your handy and illustrative guide, which will walk you through all the Natural Language Processing techniques in a step-by-step manner. It will demystify the advanced features of text analysis and text mining using the comprehensive NLTK suite.

This book cuts the preamble short and lets you dive right into the science of text processing with a practical hands-on approach.

Get started by learning how to tokenize text. Get an overview of WordNet and how to use it. Learn the basics as well as advanced features of stemming and lemmatization. Discover various ways to replace words with simpler and more common (read: more searched) variants. Create your own corpora and learn to create custom corpus readers for JSON files as well as for data stored in MongoDB. Use and manipulate POS taggers. Transform and normalize parsed chunks to produce a canonical form without changing their meaning. Dig into feature extraction and text classification. Learn how to easily handle huge amounts of data without any loss in efficiency or speed.

This book will teach you all that and beyond, in a hands-on learn-by-doing manner. Make yourself an expert in using the NLTK for Natural Language Processing with this handy companion.

Approach

The learn-by-doing approach of this book will enable you to dive right into the heart of text processing from the very first page. Each recipe is carefully designed to fulfill your appetite for Natural Language Processing. Packed with numerous illustrative examples and code samples, it will make the task of using the NLTK for Natural Language Processing easy and straightforward.

Who this book is for

This book is for Python programmers who want to quickly get to grips with using the NLTK for Natural Language Processing. Familiarity with basic text processing concepts is required. Programmers experienced in the NLTK will also find it useful. Students of linguistics will find it invaluable.