Getting Hadoop Up and Running in a Cluster
Setting up Hadoop on your machine
Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop (see the sketch at the end of this chapter's list)
Adding the combiner step to the WordCount MapReduce program
HDFS basic command-line file operations
Setting up Hadoop in a distributed cluster environment
Running the WordCount program in a distributed cluster environment
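A minimal sketch of the WordCount job from the recipes above, including the combiner step, written against the org.apache.hadoop.mapreduce API in the Hadoop 1.x style; the class names and the two path arguments are illustrative, not the book's exact listing:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the counts for each word; reused as the combiner.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the combiner step from the recipe above
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Bundled into a JAR, it runs under standalone Hadoop with `hadoop jar wordcount.jar WordCount <input> <output>`.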
Advanced HDFS
Using multiple disks/volumes and limiting HDFS disk usage
Setting the file replication factor
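The replication factor can be changed per file through the HDFS Java API; a small sketch for the recipe above, assuming a running HDFS and a hypothetical /data/sample.txt:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);              // connects to the default filesystem
    Path file = new Path("/data/sample.txt");          // hypothetical file
    boolean done = fs.setReplication(file, (short) 2); // replicate this file's blocks 2x
    System.out.println("Replication changed: " + done);
    fs.close();
  }
}
```

The same change can be made from the shell with hadoop fs -setrep 2 /data/sample.txt.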
Advanced Hadoop MapReduce Administration
Tuning Hadoop configurations for cluster deployments
Running benchmarks to verify the Hadoop installation
Reusing Java VMs to improve performance
Fault tolerance and speculative execution
Debug scripts – analyzing task failures
Setting failure percentages and skipping bad records
Shared-user Hadoop clusters – using fair and other schedulers
Hadoop security – integrating with Kerberos
Using the Hadoop Tool interface
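A minimal sketch of a job driver using the Tool interface from the recipe above, so that ToolRunner parses the generic Hadoop options before the driver runs; MyJobDriver is a hypothetical name:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf(); // already populated with any -D key=value options
    System.out.println("mapred.job.name = " + conf.get("mapred.job.name"));
    // ... build and submit a Job here using conf ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner handles the generic options (-D, -files, -libjars, -archives)
    // and passes only the remaining arguments to run().
    int exitCode = ToolRunner.run(new MyJobDriver(), args);
    System.exit(exitCode);
  }
}
```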
Developing Complex Hadoop MapReduce Applications
Choosing appropriate Hadoop data types
Implementing a custom Hadoop Writable data type
Implementing a custom Hadoop key type
Emitting data of different value types from a mapper
Choosing a suitable Hadoop InputFormat for your input data format
Adding support for new input data formats – implementing a custom InputFormat
Formatting the results of MapReduce computations – using Hadoop OutputFormats
Hadoop intermediate (map to reduce) data partitioning
Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
Using Hadoop with legacy applications – Hadoop Streaming
Adding dependencies between MapReduce jobs
Hadoop counters for reporting custom metrics
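A minimal sketch of the custom-counter recipe above; the LogMapper class and the LineQuality counter group are hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  // Custom counters are declared as an enum; Hadoop aggregates them across
  // all tasks and reports them with the job status.
  public enum LineQuality { GOOD, BAD }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.toString().trim().isEmpty()) {
      context.getCounter(LineQuality.BAD).increment(1);
      return;
    }
    context.getCounter(LineQuality.GOOD).increment(1);
    context.write(new Text("lines"), new LongWritable(1));
  }
}
```

After the job completes, the driver can read the totals with job.getCounters().findCounter(LineQuality.BAD).getValue().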
Hadoop Ecosystem
Data random access using Java client APIs (see the HBase sketch at the end of this list)
Running MapReduce jobs on HBase (table input/output)
Running your first Pig command
Set operations (join, union) and sorting with Pig
Running a SQL-style query with Hive
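For the HBase random-access recipe referenced above, a minimal sketch using the 0.9x-era HBase client API of the book's generation; the table name, column family, and values are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccess {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test"); // hypothetical table with family "cf"

    // Write one cell, then read it back by row key.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
    table.put(put);

    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}
```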
Analytics
Simple analytics using MapReduce
Performing Group-By using MapReduce
Calculating frequency distributions and sorting using MapReduce
Plotting the Hadoop results using gnuplot
Calculating histograms using MapReduce (see the sketch at the end of this list)
Calculating scatter plots using MapReduce
Parsing a complex dataset with Hadoop
Joining two datasets using MapReduce
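A minimal sketch for the histogram recipe referenced above, assuming hypothetical input of one numeric value per line, bucketed into widths of 10; the (bucket, count) output can be fed straight to gnuplot:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Maps each record to its histogram bucket: 0-9 -> 0, 10-19 -> 10, and so on.
public class HistogramMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    double v = Double.parseDouble(value.toString().trim());
    int bucket = (int) (v / 10) * 10;
    context.write(new IntWritable(bucket), ONE);
  }
}

// Sums the occurrences that fell into each bucket.
class HistogramReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
  @Override
  public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get();
    context.write(key, new IntWritable(sum));
  }
}
```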
Searching and Indexing
Generating an inverted index using Hadoop MapReduce (see the sketch at the end of this list)
Intra-domain web crawling using Apache Nutch
Indexing and searching web documents using Apache Solr
Configuring Apache HBase as the backend data store for Apache Nutch
Deploying Apache HBase on a Hadoop cluster
Whole web crawling with Apache Nutch using a Hadoop/HBase cluster
ElasticSearch for indexing and searching
Generating the in-links graph for crawled web pages
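A minimal sketch of the inverted-index recipe referenced above: the mapper emits (term, document) pairs, taking the document name from the input split, and the reducer collapses each term's postings into a deduplicated list; the class names are illustrative:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (term, documentName) for every token in the document.
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String docName = ((FileSplit) context.getInputSplit()).getPath().getName();
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      context.write(new Text(itr.nextToken().toLowerCase()), new Text(docName));
    }
  }
}

// Deduplicates the documents seen for each term into a postings list.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text term, Iterable<Text> docs, Context context)
      throws IOException, InterruptedException {
    Set<String> postings = new HashSet<String>();
    for (Text d : docs) postings.add(d.toString());
    context.write(term, new Text(postings.toString()));
  }
}
```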
Classifications, Recommendations, and Finding Relationships
Clustering an Amazon sales dataset
Collaborative filtering-based recommendations
Classification using the Naive Bayes classifier
Assigning advertisements to keywords using the AdWords balance algorithm
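The balance heuristic behind the last recipe assigns each query to the eligible bidder maximizing bid × (1 − e^(−f)), where f is the bidder's remaining budget fraction; a toy, non-distributed sketch with hypothetical Bidder records:

```java
import java.util.List;

// Toy sketch of the AdWords balance heuristic: pick the eligible advertiser
// that maximizes bid * (1 - e^(-fractionOfBudgetLeft)).
public class AdwordsBalance {

  public static class Bidder {
    final String name;
    final double bid;    // bid for this keyword
    final double budget; // total budget
    double spent;        // amount spent so far

    Bidder(String name, double bid, double budget) {
      this.name = name;
      this.bid = bid;
      this.budget = budget;
    }
  }

  public static Bidder choose(List<Bidder> bidders) {
    Bidder best = null;
    double bestScore = 0;
    for (Bidder b : bidders) {
      if (b.spent + b.bid > b.budget) continue; // budget exhausted
      double fractionLeft = (b.budget - b.spent) / b.budget;
      double score = b.bid * (1 - Math.exp(-fractionLeft));
      if (score > bestScore) {
        bestScore = score;
        best = b;
      }
    }
    if (best != null) best.spent += best.bid; // charge the winner
    return best;
  }
}
```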
Mass Text Data Processing
Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python
Data de-duplication using Hadoop Streaming
Loading large datasets to an Apache HBase data store using importtsv and bulkload tools
Creating TF and TF-IDF vectors for the text data (see the sketch at the end of this list)
Topic discovery using Latent Dirichlet Allocation (LDA)
Document classification using Mahout Naive Bayes classifier
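The TF and TF-IDF recipe referenced above builds the vectors with Apache Mahout's tooling; the weighting those vectors rest on, in one common form, is tf × log(N/df). A tiny sketch of just that formula:

```java
public class TfIdf {
  /**
   * @param termFreq occurrences of the term in the document
   * @param docTerms total terms in the document
   * @param numDocs  documents in the corpus
   * @param docFreq  documents containing the term
   */
  public static double tfIdf(int termFreq, int docTerms, int numDocs, int docFreq) {
    double tf = (double) termFreq / docTerms;          // normalized term frequency
    double idf = Math.log((double) numDocs / docFreq); // inverse document frequency
    return tf * idf;
  }

  public static void main(String[] args) {
    // A term appearing 3 times in a 100-term document, present in 10 of 1000 docs:
    System.out.println(tfIdf(3, 100, 1000, 10)); // ~0.138
  }
}
```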
Cloud Deployments: Using Hadoop on Clouds
Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR) (see the sketch at the end of this list)
Saving money by using Amazon EC2 Spot Instances to execute EMR job flows
Executing a Pig script using EMR
Executing a Hive script using EMR
Creating an Amazon EMR job flow using the Command Line Interface
Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR
Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs
Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment
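For the EMR recipes referenced above, a minimal sketch of launching a job flow from Java with the era's AWS SDK for Java (v1); the credentials, bucket paths, key pair name, and instance types are placeholders:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class EmrJobFlow {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY")); // placeholder credentials

    // One step that runs a MapReduce JAR stored in S3.
    StepConfig step = new StepConfig()
        .withName("WordCount")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://mybucket/wordcount.jar")
            .withArgs("s3://mybucket/input", "s3://mybucket/output"));

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("WordCountJobFlow")
        .withLogUri("s3://mybucket/logs")
        .withSteps(step)
        .withInstances(new JobFlowInstancesConfig()
            .withEc2KeyName("mykey")             // placeholder EC2 key pair
            .withInstanceCount(3)
            .withMasterInstanceType("m1.small")
            .withSlaveInstanceType("m1.small")
            .withKeepJobFlowAliveWhenNoSteps(false)); // terminate when the step finishes

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}
```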