Hadoop Real-World Solutions Cookbook
Hadoop Distributed File System – Importing and Exporting Data
Importing and exporting data into HDFS using Hadoop shell commands
Moving data efficiently between clusters using Distributed Copy
Importing data from MySQL into HDFS using Sqoop
Exporting data from HDFS into MySQL using Sqoop
Configuring Sqoop for Microsoft SQL Server
Exporting data from HDFS into MongoDB
Importing data from MongoDB into HDFS
Exporting data from HDFS into MongoDB using Pig
Using HDFS in a Greenplum external table
Using Flume to load data into HDFS
Reading and writing data to HDFS
Reading and writing data to SequenceFiles
Using Apache Avro to serialize data
Using Apache Thrift to serialize data
Using Protocol Buffers to serialize data
Setting the replication factor for HDFS
Setting the block size for HDFS
Extracting and Transforming Data
Transforming Apache logs into TSV format using MapReduce
Using Apache Pig to filter bot traffic from web server logs
Using Apache Pig to sort web server log data by timestamp
Using Apache Pig to sessionize web server log data
Using Python to extend Apache Pig functionality
Using MapReduce and secondary sort to calculate page views
Using Hive and Python to clean and transform geographical event data
Using Python and Hadoop Streaming to perform a time series analytic
Using MultipleOutputs in MapReduce to name output files
Creating custom Hadoop Writable and InputFormat to read geographical event data
Performing Common Tasks Using Hive, Pig, and MapReduce
Using Hive to map an external table over weblog data in HDFS
Using Hive to dynamically create tables from the results of a weblog query
Using the Hive string UDFs to concatenate fields in weblog data
Using Hive to intersect weblog IPs and determine the country
Generating n-grams over news archives using MapReduce
Using Pig to load a table and perform a SELECT operation with GROUP BY
Joining data in the Mapper using MapReduce
Joining data using Apache Pig replicated join
Joining sorted data using Apache Pig merge join
Joining skewed data using Apache Pig skewed join
Using a map-side join in Apache Hive to analyze geographical events
Using optimized full outer joins in Apache Hive to analyze geographical events
Joining data using an external key-value store (Redis)
Counting distinct IPs in weblog data using MapReduce and Combiners
Using Hive date UDFs to transform and sort event dates from geographic event data
Using Hive to build a per-month report of fatalities over geographic event data
Implementing a custom UDF in Hive to help validate source reliability over geographic event data
Marking the longest period of non-violence using Hive MAP/REDUCE operators and Python
Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
Trimming outliers from the Audioscrobbler dataset using Pig and DataFu
Single-source shortest-path with Apache Giraph
Using Apache Giraph to perform a distributed breadth-first search
Collaborative filtering with Apache Mahout
Sentiment classification with Apache Mahout
Using Counters in a MapReduce job to track bad records
Developing and testing MapReduce jobs with MRUnit
Developing and testing MapReduce jobs running in local mode
Enabling MapReduce jobs to skip bad records
Using Counters in a streaming job
Updating task status messages to display debugging information
Using illustrate to debug Pig jobs
Starting Hadoop in pseudo-distributed mode
Starting Hadoop in distributed mode
Adding new nodes to an existing cluster
Recovering from a NameNode failure
Monitoring cluster health using Ganglia
Tuning MapReduce job parameters
Persistence Using Apache Accumulo
Designing a row key to store geographic events in Accumulo
Using MapReduce to bulk import geographic event data into Accumulo
Setting a custom field constraint for inputting geographic event data in Accumulo
Limiting query results using the regex filtering iterator
Counting fatalities for different versions of the same key using SumCombiner
Enforcing cell-level security on scans using Accumulo