Understanding Apache Cassandra
Table of contents
- Introduction to Apache Cassandra
- Distribution and Resilience
- Core Components of Cassandra
- Scalability and Performance
- Partitions and Tokens
- Replication and Fault Tolerance
- Consistency and Availability
- Deployment Flexibility
- How to Install Cassandra via Docker
- Brief History of Cassandra
- Differences Between SQL Databases and Cassandra
- Features of Cassandra Over Other NoSQL Databases
- Cassandra CQL Commands
- Best Practices for Defining a Database in Cassandra
- Conclusion
Introduction to Apache Cassandra
Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers. Its masterless design provides high availability with no single point of failure, which distinguishes it from many other NoSQL databases and makes it well suited to applications with large datasets that demand availability and scalability.
Type: NoSQL, non-relational
Characteristics: Lightweight, open-source, distributed
Strengths: Horizontal scalability, flexible schema
Use Case: Ideal for the rapid organization and analysis of high-volume, disparate data types, particularly important in the era of Big Data and cloud scalability.
Distribution and Resilience
Distributed Database: Cassandra operates on multiple machines but appears as a single database to users, ensuring high availability and fault tolerance.
Scalability: Easily scales to handle increased loads without downtime, protecting against data loss from hardware failures.
Isolation of Queries: Read and write performance can be tuned independently, so throughput can be optimized for the workload at hand.
Core Components of Cassandra
Nodes: Individual instances of Cassandra.
Gossip Protocol: Peer-to-peer protocol that enables nodes to communicate and share state information about the cluster.
Masterless Architecture: Every node has the same capabilities and responsibilities, avoiding single points of failure and enhancing robustness.
Clusters/Rings: Multiple nodes organized into a cluster (or ring) for coordinated data storage and processing.
Datacenters: Clusters can span multiple datacenters, enhancing fault tolerance and global availability.
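For a concrete look at what gossip propagates, each node records the cluster state it has learned in its local system tables. A minimal sketch (system.local and system.peers are standard system tables, though the available columns vary slightly between versions):
-- This node's own view of itself (cluster name, datacenter, rack)
SELECT cluster_name, data_center, rack FROM system.local;
-- The other nodes this node has discovered via gossip
SELECT peer, data_center, rack FROM system.peers;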
Scalability and Performance
Dynamic Scaling: Add more nodes to handle increased load without downtime.
Horizontal Scaling: Scale horizontally by adding more nodes using cost-effective hardware.
Elasticity: Supports both expansion and contraction based on application needs.
Partitions and Tokens
Partitions: Data is divided into partitions for distribution across nodes.
Tokens: Each node is assigned a range of tokens that determines the partition of data it will store.
Partition Key: Determines the specific node that will store a piece of data, ensuring even distribution and efficient data retrieval.
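As a minimal sketch (the sensor_readings table and the UUID value are hypothetical), the first column declared in the PRIMARY KEY is the partition key, and the built-in token() function exposes the hash that maps a row to a node's token range:
CREATE TABLE sensor_readings (
    sensor_id UUID,                           -- partition key: hashed to a token
    reading_time TIMESTAMP,                   -- clustering column: sorts rows within a partition
    value DOUBLE,
    PRIMARY KEY ((sensor_id), reading_time)
);

-- token(sensor_id) is the value Cassandra compares against each node's token range
SELECT sensor_id, token(sensor_id), value
FROM sensor_readings
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000;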
Replication and Fault Tolerance
Replication Factor (RF): Defines how many copies of each piece of data are stored across different nodes.
Coordinator Node: Determines which nodes will store the data based on the partition key and token range.
Replica Nodes: Store copies of the data, ensuring redundancy and fault tolerance.
Self-Healing: Automatically recovers and synchronizes nodes if any go down, ensuring no data loss and minimal manual intervention.
Multiple Replicas: Ensures high availability and load balancing, improving read and write performance.
Global Replication: Data can be replicated across different geographic regions, reducing latency for global users and improving access speed.
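A sketch of how the replication factor and multi-datacenter replication are declared when creating a keyspace (the keyspace and datacenter names here are hypothetical and must match the datacenter names your cluster actually reports):
CREATE KEYSPACE shop_data
WITH REPLICATION = {
    'class' : 'NetworkTopologyStrategy',      -- datacenter-aware replica placement
    'dc_us_east' : 3,                         -- three replicas in this datacenter
    'dc_eu_west' : 3                          -- three replicas in this datacenter
};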
Consistency and Availability
CAP Theorem: Designed as an AP (Available, Partition-tolerant) database, meaning it favors availability over strong consistency during a network partition: the cluster keeps accepting reads and writes even when nodes cannot reach each other.
Consistency Level (CL): Configurable per-query; determines how many nodes must acknowledge an operation before it is considered successful.
Quorum: For example, with RF=3 and CL=QUORUM, at least two nodes must acknowledge a read or write operation to ensure it succeeds, balancing consistency and availability.
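In cqlsh the consistency level is set per session (client drivers typically set it per query); a minimal sketch using the same placeholder names as the examples later in this guide:
CONSISTENCY QUORUM;   -- with RF=3, each read or write now needs acknowledgements from 2 replicas

SELECT * FROM keyspace_name.table_name
WHERE partition_key = value;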
Deployment Flexibility
Deployment Agnostic: Can be deployed on-premises, in the cloud, or across multiple cloud providers, offering maximum flexibility.
Hybrid Deployments: Supports a combination of on-premises and cloud environments, providing a versatile solution for various deployment needs.
How to Install Cassandra via Docker
Prerequisite: Docker Desktop installed
- Pull the Cassandra Docker image:
docker pull cassandra:latest
- Create a network and run a Cassandra container:
docker network create cassandra
docker run --rm -d --name cassandra --hostname cassandra -p 9042:9042 --network cassandra cassandra
Explanation of the Docker Run:
docker run: Command to create and start a new container from a specified image.
--rm: Automatically removes the container when it stops.
-d: Runs the container in detached mode.
--name cassandra: Assigns the name "cassandra" to the container.
--hostname cassandra: Sets the hostname of the container to "cassandra". The hostname is used within the container to identify itself on the network.
-p 9042:9042: Maps port 9042 of the host machine to port 9042 of the container.
--network cassandra: Connects the container to a Docker network named "cassandra".
cassandra: Specifies the Docker image to use for the container.
- Access the Cassandra CQL shell:
docker exec -it cassandra cqlsh
or
docker exec -it cassandra bash
cqlsh
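Once connected, a quick sanity check is to query the node's system.local table (cluster_name and release_version are standard columns in modern Cassandra releases):
-- Confirms cqlsh is talking to the container and shows the Cassandra version
SELECT cluster_name, release_version FROM system.local;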
Brief History of Cassandra
Originally developed at Facebook for their Inbox Search feature, Cassandra was released as an open-source project in 2008 and became an Apache Incubator project in 2009. Cassandra combines the best of Amazon's Dynamo and Google's Bigtable, making it highly scalable, decentralized, and fault-tolerant. It is ideal for applications with large datasets that require high availability. Since its release, Cassandra has been adopted by many large organizations, including Netflix, eBay, and Instagram, for its ability to handle real-time big data workloads.
Differences Between SQL Databases and Cassandra
| Feature | SQL Databases | Cassandra |
| --- | --- | --- |
| Data Model | Relational | Column-Family (NoSQL) |
| Schema | Fixed schema | Schema-less |
| Transactions | ACID | BASE |
| Scalability | Vertical | Horizontal |
| Query Language | SQL | CQL |
| Joins | Supported | Not supported |
| Consistency | Strong | Tunable |
| Availability | Limited | High |
Explanation of Terms:
Column-Family (NoSQL): Data stored in structures called column families, similar to tables in SQL databases. Unlike relational databases, Cassandra doesn't require a fixed schema, allowing for more flexibility.
Schema-less: No rigid structure enforced, allowing flexible data storage. This means that each row can have different columns, and new columns can be added without affecting existing rows.
BASE: Basically Available, Soft state, Eventual consistency; a flexible approach compared to ACID. It ensures that the system is always available (Basically Available), data changes can happen in the background without affecting the current state (soft state), and eventually, the data will become consistent (eventual consistency).
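To make the schema flexibility concrete, a minimal sketch (the users table and loyalty_tier column are hypothetical): adding a column is a metadata-only change, and rows written before the change simply return null for the new column.
ALTER TABLE users ADD loyalty_tier TEXT;

-- Existing rows are unaffected; loyalty_tier reads as null for them
SELECT user_id, name, loyalty_tier FROM users;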
Features of Cassandra Over Other NoSQL Databases
| Feature | Cassandra | Other NoSQL Databases |
| --- | --- | --- |
| Data Distribution | Decentralized (peer-to-peer) | Often centralized or master-slave |
| Write Speed | High | Variable |
| Read Speed | Tunable consistency, high with proper data modeling | Variable |
| Query Language | CQL (SQL-like) | Various (e.g., MongoDB uses a JSON-like query language) |
| Support for Multi-Datacenter | Strong, built-in | Variable |
| Scalability | Linear scalability, easy to add nodes | Varies by database |
| Replication | Multiple strategies, including synchronous and asynchronous | Varies by database |
| Consistency | Tunable (from eventual to strong) | Typically eventual consistency |
Cassandra CQL Commands
Keyspace Operations
Purpose: Keyspaces are the outermost container for data in Cassandra. They are used to group tables with similar properties.
- Create a Keyspace:
CREATE KEYSPACE keyspace_name
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
- Use a Keyspace:
USE keyspace_name;
- Describe Keyspaces: Lists all keyspaces in the cluster.
DESCRIBE KEYSPACES;
Table Operations
Purpose: Tables store the actual data and define the structure of the stored data.
Primary Key: In Cassandra, the primary key uniquely identifies a row in a table. It is composed of a partition key and optional clustering columns.
Clustering Key: The clustering key is used to sort the data within the partition.
Create a Table:
CREATE TABLE table_name (
column1_name data_type PRIMARY KEY,
column2_name data_type,
...
);
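The form above uses a single-column primary key. A sketch of a composite primary key, assuming a hypothetical orders_by_user table in which user_id is the partition key and order_date and order_id are clustering columns that sort rows within each partition:
CREATE TABLE orders_by_user (
    user_id UUID,            -- partition key: determines which node stores the row
    order_date TIMESTAMP,    -- clustering column: sorts orders within a user's partition
    order_id UUID,           -- additional clustering column: keeps each row unique
    total DECIMAL,
    PRIMARY KEY ((user_id), order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC, order_id ASC);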
- Describe Tables: Lists all the tables in the current keyspace.
DESCRIBE TABLES;
- Describe a Specific Table: Shows the schema of a specified table.
DESCRIBE TABLE table_name;
- Alter Table: Modifies the structure of an existing table.
ALTER TABLE table_name
ADD column_name data_type;
- Drop Table: Deletes an entire table from the keyspace.
DROP TABLE table_name;
- Truncate Table: Removes all data from a table without deleting the table itself.
TRUNCATE table_name;
Data Operations
- Insert Data: Adds new data to a table.
INSERT INTO table_name (column1_name, column2_name, ...)
VALUES (value1, value2, ...);
- Select Data: Retrieves data from a table.
SELECT * FROM table_name;
- Select Specific Columns:
SELECT column1_name, column2_name
FROM table_name;
- Using WHERE Clause: Filters the result set based on specified conditions.
SELECT * FROM table_name
WHERE column_name = value;
- Update Data: Modifies existing data in a table.
UPDATE table_name
SET column1_name = value1, column2_name = value2, ...
WHERE primary_key_column = primary_key_value;
- Delete Data: Removes data from a table.
DELETE FROM table_name
WHERE primary_key_column = primary_key_value;
Aggregation Commands: Performs aggregate functions on data in a table.
- Count Rows:
SELECT COUNT(*)
FROM table_name;
- Sum of a Column:
SELECT SUM(column_name)
FROM table_name;
- Average of a Column:
SELECT AVG(column_name)
FROM table_name;
- Minimum Value in a Column:
SELECT MIN(column_name)
FROM table_name;
- Maximum Value in a Column:
SELECT MAX(column_name)
FROM table_name;
Note: Aggregation functions like SUM and AVG are generally used on numeric data types, while MIN and MAX can also be used on varchar data types. If they are used on a non-comparable data type (e.g., boolean), an error with code 2200 is expected.
- Group By Command: Groups rows that have the same values in specified columns into aggregate data. In Cassandra, GROUP BY only accepts partition key and clustering columns, in primary-key order.
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
- Order By Command: Sorts the result set by one or more clustering columns.
SELECT * FROM table_name
WHERE partition_key = value
ORDER BY clustering_column DESC;
Note: The ORDER BY clause can only be used when the partition key is restricted by an = or IN clause. It orders the data based on the clustering columns.
- Ascending Order:
SELECT column1_name, column2_name
FROM table_name
WHERE partition_key = value
ORDER BY clustering_column ASC;
- Descending Order:
SELECT column1_name, column2_name
FROM table_name
WHERE partition_key = value
ORDER BY clustering_column DESC;
- Using ALLOW FILTERING: Permits filtering based on columns that are not part of the primary key. However, this is generally not recommended in production environments as it can lead to performance issues. It forces Cassandra to scan multiple nodes and partitions, which can be inefficient and slow.
SELECT * FROM table_name
WHERE non_primary_key_column = value
ALLOW FILTERING;
Pagination Commands
- Using LIMIT for Pagination:
SELECT * FROM table_name
WHERE partition_key = value
LIMIT 10;
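LIMIT alone only caps the page size. One common pattern for fetching the next page (a sketch, assuming the table has a clustering column you can bound on; client drivers also expose automatic paging) is to resume after the last value returned by the previous query:
SELECT * FROM table_name
WHERE partition_key = value
  AND clustering_column > last_seen_value
LIMIT 10;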
Handling Joins in Cassandra
Cassandra does not support traditional SQL joins. To achieve the effect of joins, you can use the following strategies:
Denormalization: Store redundant data in multiple tables to avoid the need for joins.
Example:
Suppose you have two tables, users and orders, and you need to fetch user information along with their orders. Instead of joining these tables, you can denormalize the data and store user information within the orders table.
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    name TEXT,
    email TEXT
);

CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    user_id UUID,
    user_name TEXT,
    user_email TEXT,
    product_id UUID,
    product_name TEXT,
    quantity INT,
    price DECIMAL
);

INSERT INTO users (user_id, name, email)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'Aadarsh', 'aadarsh@example.com');

-- The orders row must carry the same user_id value as the users row;
-- CQL cannot reference another table's column inside VALUES
INSERT INTO orders (order_id, user_id, user_name, user_email, product_id, product_name, quantity, price)
VALUES (uuid(), 123e4567-e89b-12d3-a456-426614174000, 'Aadarsh', 'aadarsh@example.com', uuid(), 'Product A', 2, 29.99);

By storing user_name and user_email directly in the orders table, you can avoid the need to join the users and orders tables when querying orders along with user information.
Materialized Views: Precomputed views that update automatically as data changes.
Example:
Suppose you have a users table and you want to create a materialized view to quickly access user information based on their email.
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    name TEXT,
    email TEXT
);

CREATE MATERIALIZED VIEW users_by_email AS
SELECT * FROM users
WHERE email IS NOT NULL AND user_id IS NOT NULL
PRIMARY KEY (email, user_id);

This materialized view users_by_email allows you to query user information based on their email without the need to perform a join.
SELECT * FROM users_by_email WHERE email = 'aadarsh@example.com';
The view will automatically update as the users table changes, ensuring that the data remains consistent and up-to-date.
Best Practices for Defining a Database in Cassandra
Choose the Right Primary Key: Ensure the primary key provides an even data distribution and supports your query requirements.
Use Composite Keys: Use composite keys for more granular control over data distribution and sorting.
Design for Queries: Model your data based on the queries you need to support. Avoid complex joins and model data to minimize the need for multiple queries.
Denormalize Data: Embrace data denormalization to reduce read latency and improve query performance.
Limit In-Memory Data: Keep partitions small to ensure they fit in memory and avoid performance issues.
Use Appropriate Data Types: Select appropriate data types for columns to optimize storage and performance.
Monitor and Tune: Regularly monitor performance and adjust replication factors, compaction strategies, and other configurations as needed.
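As a hedged illustration of several of these points (query-driven design, composite keys, and bounded partitions), assume a hypothetical messaging application whose main query is "fetch the most recent messages in a conversation". Bucketing the partition by day keeps any single partition from growing without limit:
CREATE TABLE messages_by_conversation (
    conversation_id UUID,
    day DATE,                                        -- bucketing column: bounds partition size
    sent_at TIMESTAMP,
    sender TEXT,
    body TEXT,
    PRIMARY KEY ((conversation_id, day), sent_at)    -- composite partition key plus clustering column
) WITH CLUSTERING ORDER BY (sent_at DESC);

-- The table directly serves the target query: newest messages first for a given conversation and day
SELECT sender, body, sent_at
FROM messages_by_conversation
WHERE conversation_id = 123e4567-e89b-12d3-a456-426614174000
  AND day = '2024-01-15'
LIMIT 50;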
Conclusion
Apache Cassandra is a powerful, distributed database solution designed for high availability and scalability. Its unique architecture, combined with flexible data modeling capabilities, makes it an excellent choice for applications that require large-scale data storage and high availability. With features like tunable consistency, dynamic scaling, and robust fault tolerance, Cassandra provides a resilient and efficient platform for managing big data workloads.