Table of Contents
Preface
Chapter 1: Getting Started
Chapter 2: The Command-line Interface
Chapter 3: Application Programmer Interface
Chapter 4: Performance Tuning
Chapter 5: Consistency, Availability, and Partition Tolerance with Cassandra
Chapter 6: Schema Design
Chapter 7: Administration
Chapter 8: Multiple Datacenter Deployments
Chapter 9: Coding and Internals
Chapter 10: Libraries and Applications
Chapter 11: Hadoop and Cassandra
Chapter 12: Collecting and Analyzing Performance Statistics
Chapter 13: Monitoring Cassandra Servers
Index
- Chapter 1: Getting Started
- Introduction
- A simple single node Cassandra installation
- Reading and writing test data using the command-line interface
- Running multiple instances on a single machine
- Scripting a multiple instance installation
- Setting up a build and test environment for tasks in this book
- Running in the foreground with full debugging
- Calculating ideal Initial Tokens for use with Random Partitioner
- Choosing Initial Tokens for use with Partitioners that preserve ordering
- Insight into Cassandra with JConsole
- Connecting with JConsole over a SOCKS proxy
- Connecting to Cassandra with Java and Thrift
- Chapter 2: The Command-line Interface
- Connecting to Cassandra with the CLI
- Creating a keyspace from the CLI
- Creating a column family with the CLI
- Describing a keyspace
- Writing data with the CLI
- Reading data with the CLI
- Deleting rows and columns from the CLI
- Listing and paginating all rows in a column family
- Dropping a keyspace or a column family
- CLI operations with super columns
- Using the assume keyword to decode column names or column values
- Supplying time-to-live information when inserting columns
- Using built-in CLI functions
- Using column metadata and comparators for type enforcement
- Changing the consistency level of the CLI
- Getting help from the CLI
- Loading CLI statements from a file
- Chapter 3: Application Programmer Interface
- Introduction
- Connecting to a Cassandra server
- Creating a keyspace and column family from the client
- Using MultiGet to limit round trips and overhead
- Writing unit tests with an embedded Cassandra server
- Cleaning up data directories before unit tests
- Generating Thrift bindings for other languages (C++, PHP, and others)
- Using the Cassandra Storage Proxy "Fat Client"
- Using range scans to find and remove old data
- Iterating all the columns of a large key
- Slicing columns in reverse
- Batch mutations to improve insert performance and code robustness
- Using TTL to create columns with self-deletion times
- Working with secondary indexes
- Chapter 4: Performance Tuning
- Introduction
- Choosing an operating system and distribution
- Choosing a Java Virtual Machine
- Using a dedicated Commit Log disk
- Choosing a high-performing RAID level
- File system optimization for hard disk performance
- Boosting read performance with the Key Cache
- Boosting read performance with the Row Cache
- Disabling Swap Memory for predictable performance
- Stopping Cassandra from using swap without disabling it system-wide
- Enabling Memory Mapped Disk modes
- Tuning Memtables for write-heavy workloads
- Saving memory on 64-bit architectures with compressed pointers
- Tuning concurrent readers and writers for throughput
- Setting compaction thresholds
- Garbage collection tuning to avoid JVM pauses
- Raising the open file limit to deal with many clients
- Increasing performance by scaling up
- Chapter 5: Consistency, Availability, and Partition Tolerance with Cassandra
- Introduction
- Working with the formula for strong consistency
- Supplying the timestamp value with write requests
- Disabling the hinted handoff mechanism
- Adjusting read repair chance for less intensive data reads
- Confirming schema agreement across the cluster
- Adjusting replication factor to work with quorum
- Using write consistency ONE, read consistency ONE for low latency operations
- Using write consistency QUORUM, read consistency QUORUM for strong consistency
- Mixing consistency levels: write consistency QUORUM, read consistency ONE
- Choosing consistency over availability with consistency ALL
- Choosing availability over consistency with write consistency ANY
- Demonstrating how consistency is not a lock or a transaction
- Chapter 6: Schema Design
- Introduction
- Saving disk space by using small column names
- Serializing data into large columns for smaller index sizes
- Storing time series data effectively
- Using Super Columns for nested maps
- Using a lower Replication Factor for disk space saving and performance enhancements
- Hybrid Random Partitioner using Order Preserving Partitioner
- Storing large objects
- Using Cassandra for distributed caching
- Storing large or infrequently accessed data in a separate column family
- Storing and searching edge graph data in Cassandra
- Developing secondary data orderings or indexes
- Chapter 7: Administration
- Defining seed nodes for Gossip Communication
- Nodetool Move: Moving a node to a specific ring location
- Nodetool Remove: Removing a downed node
- Nodetool Decommission: Removing a live node
- Joining nodes quickly with auto_bootstrap set to false
- Generating SSH keys for password-less interaction
- Copying the data directory to new hardware
- A node join using external data copy methods
- Nodetool Repair: When to use anti-entropy repair
- Nodetool Drain: Stable files on upgrade
- Lowering gc_grace for faster tombstone cleanup
- Scheduling Major Compaction
- Using nodetool snapshot for backups
- Clearing snapshots with nodetool clearsnapshot
- Restoring from a snapshot
- Exporting data to JSON with sstable2json
- Nodetool Cleanup: Removing excess data
- Nodetool Compact: Defragment data and remove deleted data from disk
- Chapter 8: Multiple Datacenter Deployments
- Changing debugging to determine where read operations are being routed
- Using IPTables to simulate complex network scenarios in a local environment
- Choosing IP addresses to work with RackInferringSnitch
- Scripting a multiple datacenter installation
- Determining natural endpoints, datacenter, and rack for a given key
- Manually specifying Rack and Datacenter configuration with a property file snitch
- Troubleshooting dynamic snitch using JConsole
- Quorum operations in multi-datacenter environments
- Using traceroute to troubleshoot latency between network devices
- Ensuring bandwidth between switches in multiple rack environments
- Increasing rpc_timeout for dealing with latency across datacenters
- Changing consistency level from the CLI to test various consistency levels with multiple datacenter deployments
- Using the consistency levels TWO and THREE
- Calculating Ideal Initial Tokens for use with Network Topology Strategy and Random Partitioner
- Chapter 9: Coding and Internals
- Introduction
- Installing common development tools
- Building Cassandra from source
- Creating your own type by subclassing AbstractType
- Using validation to check data on insertion
- Communicating with the Cassandra developers and users through IRC and e-mail
- Generating a diff using Subversion's diff feature
- Applying a diff using the patch command
- Using strings and od to quickly search through data files
- Customizing the sstable2json export utility
- Configuring index interval ratio for lower memory usage
- Increasing phi_convict_threshold for less reliable networks
- Using the Cassandra maven plugin
- Chapter 10: Libraries and Applications
- Introduction
- Building the contrib stress tool for benchmarking
- Inserting and reading data with the stress tool
- Running the Yahoo! Cloud Serving Benchmark
- Hector, a high-level client for Cassandra
- Doing batch mutations with Hector
- Cassandra with the Java Persistence API (JPA)
- Setting up Solandra for full text indexing with a Cassandra backend
- Setting up ZooKeeper to support Cages for transactional locking
- Using Cages to implement an atomic read and set
- Using Groovandra as a CLI alternative
- Searchable log storage with Logsandra
- Chapter 11: Hadoop and Cassandra
- Introduction
- A pseudo-distributed Hadoop setup
- A Map-only program that reads from Cassandra using the ColumnFamilyInputFormat
- A Map-only program that writes to Cassandra using the CassandraOutputFormat
- Using MapReduce to do grouping and counting with Cassandra input and output
- Setting up Hive with Cassandra Storage Handler support
- Defining a Hive table over a Cassandra Column Family
- Joining two Column Families with Hive
- Grouping and counting column values with Hive
- Co-locating Hadoop TaskTrackers on Cassandra nodes
- Setting up a "Shadow" data center for running only MapReduce jobs
- Setting up DataStax Brisk, the combined stack of Cassandra, Hadoop, and Hive
- Chapter 12: Collecting and Analyzing Performance Statistics
- Finding bottlenecks with nodetool tpstats
- Using nodetool cfstats to retrieve column family statistics
- Monitoring CPU utilization
- Adding read/write graphs to find active column families
- Using Memtable graphs to profile when and why they flush
- Graphing SSTable count
- Monitoring disk utilization and having a performance baseline
- Monitoring compaction by graphing its activity
- Using nodetool compactionstats to check the progress of compaction
- Graphing column family statistics to track average/max row sizes
- Using latency graphs to profile time to seek keys
- Tracking the physical disk size of each column family over time
- Using nodetool cfhistograms to see the distribution of query latencies
- Tracking open networking connections
- Chapter 13: Monitoring Cassandra Servers
- Introduction
- Forwarding Log4j logs to a central server
- Using top to understand overall performance
- Using iostat to monitor current disk performance
- Using sar to review performance over time
- Using JMXTerm to access Cassandra JMX
- Monitoring the garbage collection events
- Using tpstats to find bottlenecks
- Creating a Nagios Check Script for Cassandra
- Keeping an eye out for large rows with compaction limits
- Reviewing network traffic with IPTraf
- Keeping a lookout for dropped messages
- Inspecting column families for dangerous conditions