Understanding Apache Cassandra

Introduction to Apache Cassandra

Apache Cassandra is a highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers. It has no single point of failure, which makes it well suited to applications with large datasets that must stay available as they grow. Its masterless design, in which every node can accept reads and writes, distinguishes it from NoSQL databases that depend on a primary node.

  • Type: NoSQL, non-relational

  • Characteristics: Lightweight, open-source, distributed

  • Strengths: Horizontal scalability, flexible schema

  • Use Case: Ideal for the rapid organization and analysis of high-volume, disparate data types, particularly important in the era of Big Data and cloud scalability.

Distribution and Resilience

  • Distributed Database: Cassandra operates on multiple machines but appears as a single database to users, ensuring high availability and fault tolerance.

  • Scalability: Nodes can be added to handle increased load without downtime, and replication across nodes protects against data loss from hardware failures.

  • Isolation of Queries: Developers can tune read and write behavior independently, for example by choosing a different consistency level for reads than for writes.

Core Components of Cassandra

  • Nodes: Individual instances of Cassandra.

  • Gossip Protocol: Peer-to-peer protocol that enables nodes to communicate and share state information about the cluster.

  • Masterless Architecture: Every node has the same capabilities and responsibilities, avoiding single points of failure and enhancing robustness.

  • Clusters/Rings: Multiple nodes organized into a cluster (or ring) for coordinated data storage and processing.

  • Datacenters: Clusters can span multiple datacenters, enhancing fault tolerance and global availability.

Scalability and Performance

  • Dynamic Scaling: Add more nodes to handle increased load without downtime.

  • Horizontal Scaling: Scale horizontally by adding more nodes using cost-effective hardware.

  • Elasticity: Supports both expansion and contraction based on application needs.

Partitions and Tokens

  • Partitions: Data is divided into partitions for distribution across nodes.

  • Tokens: Each node is assigned one or more token ranges, which determine the partitions it stores.

  • Partition Key: Hashed by the partitioner into a token, and that token determines which node stores the row. A well-chosen partition key spreads data evenly across nodes and keeps lookups efficient.
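
The mapping from partition key to node can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's implementation: the four node names and their tokens are hypothetical, the token space is shrunk to 0-255, and MD5 stands in for the Murmur3 hash that Cassandra's default partitioner uses.

```python
import bisect
import hashlib

# Hypothetical four-node ring. Each node owns the token range that ends at its token.
RING = [(0, "node-a"), (64, "node-b"), (128, "node-c"), (192, "node-d")]
TOKENS = [token for token, _ in RING]


def token_for(partition_key: str) -> int:
    """Hash a partition key onto a toy 0-255 token space (MD5 as a stand-in hash)."""
    return hashlib.md5(partition_key.encode("utf-8")).digest()[0]


def owner_of(token: int) -> str:
    """Return the first node whose token is >= the given token, wrapping around the ring."""
    index = bisect.bisect_left(TOKENS, token) % len(RING)
    return RING[index][1]


if __name__ == "__main__":
    for key in ("user:1", "user:2", "user:3"):
        token = token_for(key)
        print(key, "-> token", token, "->", owner_of(token))
```

Because the hash spreads keys roughly uniformly over the token space, each node ends up with a similar share of partitions as long as the partition key itself has many distinct values.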

Replication and Fault Tolerance

  • Replication Factor (RF): Defines how many copies of each piece of data are stored across different nodes.

  • Coordinator Node: The node that receives a client request. It uses the partition key and token ranges to work out which replicas hold the data and forwards the request to them.

  • Replica Nodes: Store copies of the data, ensuring redundancy and fault tolerance.

  • Self-Healing: When a node returns after an outage, mechanisms such as hinted handoff, read repair, and anti-entropy repair bring its data back in sync, reducing both the risk of data loss and the need for manual intervention.

  • Multiple Replicas: Ensures high availability and load balancing, improving read and write performance.

  • Global Replication: Data can be replicated across different geographic regions, reducing latency for global users and improving access speed.
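
As a rough illustration of how a replication factor turns into a set of replica nodes, the sketch below walks a ring clockwise from the node that owns a partition. It resembles the behavior of SimpleStrategy only; NetworkTopologyStrategy also takes racks and datacenters into account. The node names and the `replicas_for` helper are hypothetical.

```python
from typing import List


def replicas_for(owner_index: int, nodes: List[str], rf: int) -> List[str]:
    """Pick rf distinct nodes by walking the ring clockwise from the owning node."""
    if not 1 <= rf <= len(nodes):
        raise ValueError("replication factor must be between 1 and the node count")
    return [nodes[(owner_index + step) % len(nodes)] for step in range(rf)]


if __name__ == "__main__":
    ring = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical ring order
    # A partition owned by node-c with RF=3 is also stored on the next two nodes.
    print(replicas_for(2, ring, 3))
```

With RF=3 on this four-node ring, any single node can fail and two copies of every partition remain reachable.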

Consistency and Availability

  • CAP Theorem: Designed as an AP (Available Partition-tolerant) database, meaning it remains available and partition-tolerant even during network splits.

  • Consistency Level (CL): Configurable per-query; determines how many nodes must acknowledge an operation before it is considered successful.

  • Quorum: For example, with RF=3 and CL=QUORUM, at least two nodes must acknowledge a read or write operation to ensure it succeeds, balancing consistency and availability.
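
The quorum arithmetic above is easy to check in code. The sketch below computes the quorum size for a given replication factor and tests the common rule of thumb that reads are guaranteed to see the latest acknowledged write when the read and write acknowledgement counts together exceed RF. The function names are illustrative only.

```python
def quorum(rf: int) -> int:
    """Number of replicas that form a majority for the given replication factor."""
    return rf // 2 + 1


def read_sees_latest_write(read_acks: int, write_acks: int, rf: int) -> bool:
    """True when every read replica set must overlap every write replica set."""
    return read_acks + write_acks > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print("RF", rf, "quorum", q)
    print("QUORUM reads + QUORUM writes overlap:", read_sees_latest_write(q, q, rf))
    print("ONE reads + ONE writes overlap:", read_sees_latest_write(1, 1, rf))
```

With RF=3, QUORUM on both reads and writes tolerates one unavailable replica while still overlapping, whereas ONE on both sides favors latency and availability over consistency.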

Deployment Flexibility

  • Deployment Agnostic: Can be deployed on-premises, in the cloud, or across multiple cloud providers, offering maximum flexibility.

  • Hybrid Deployments: Supports a combination of on-premises and cloud environments, providing a versatile solution for various deployment needs.

How to Install Cassandra via Docker

Prerequisite: Docker Desktop installed

  1. Pull the Cassandra Docker image:
docker pull cassandra:latest
  2. Create a network and run a Cassandra container:
docker network create cassandra
docker run --rm -d --name cassandra --hostname cassandra -p 9042:9042 --network cassandra cassandra

Explanation of the Docker Run:

  • docker run: Command to create and start a new container from a specified image.

  • --rm: Automatically removes the container when it stops.

  • -d: Runs the container in detached mode.

  • --name cassandra: Assigns the name "cassandra" to the container.

  • --hostname cassandra: Sets the hostname of the container to "cassandra". The hostname is used within the container to identify itself on the network.

  • -p 9042:9042: Maps port 9042 of the host machine to port 9042 of the container.

  • --network cassandra: Connects the container to a Docker network named "cassandra".

  • cassandra: Specifies the Docker image to use for the container.

  3. Access the Cassandra CQL shell:
docker exec -it cassandra cqlsh

or

docker exec -it cassandra bash
cqlsh
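
Cassandra usually needs some time after the container starts before it accepts connections, so cqlsh may report a connection error if it is run too early. A minimal sketch of a readiness check, using only the Python standard library, is shown below. The `wait_for_port` helper and its default timings are assumptions for illustration, and an open port is only a rough proxy for the node being fully ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll a TCP port until it accepts a connection or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out; wait briefly and try again.
            time.sleep(interval_s)
    return False
```

For the container started above, calling `wait_for_port("localhost", 9042)` before opening cqlsh would block until the published CQL port accepts connections or two minutes have passed.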

Brief History of Cassandra

Originally developed at Facebook for their Inbox Search feature, Cassandra was released as an open-source project in 2008 and became an Apache Incubator project in 2009. Cassandra combines the distributed, decentralized design of Amazon's Dynamo with the data model of Google's Bigtable, making it highly scalable and fault-tolerant. It is ideal for applications with large datasets that require high availability. Since its release, Cassandra has been adopted by many large organizations, including Netflix, eBay, and Instagram, for its ability to handle real-time big data workloads.

Differences Between SQL Databases and Cassandra

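| Feature        | SQL Databases | Cassandra             |
|----------------|---------------|-----------------------|
| Data Model     | Relational    | Column-Family (NoSQL) |
| Schema         | Fixed schema  | Schema-less           |
| Transactions   | ACID          | BASE                  |
| Scalability    | Vertical      | Horizontal            |
| Query Language | SQL           | CQL                   |
| Joins          | Supported     | Not supported         |
| Consistency    | Strong        | Tunable               |
| Availability   | Limited       | High                  |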

Explanation of Terms:

  • Column-Family (NoSQL): Data is stored in structures historically called column families, which CQL presents as tables. Rows are grouped into partitions rather than related to each other through foreign keys.

  • Schema-less: Often used to describe Cassandra's flexible schema. CQL tables do declare their columns and types, but a row does not have to populate every column, and new columns can be added with ALTER TABLE without rewriting existing rows.

  • BASE: Basically Available, Soft state, Eventual consistency; a flexible approach compared to ACID. It ensures that the system is always available (Basically Available), data changes can happen in the background without affecting the current state (soft state), and eventually, the data will become consistent (eventual consistency).

Features of Cassandra Over Other NoSQL Databases

| Feature                      | Cassandra                                                       | Other NoSQL Databases                                    |
|------------------------------|-----------------------------------------------------------------|----------------------------------------------------------|
| Data Distribution            | Decentralized (peer-to-peer)                                    | Often centralized or master-slave                        |
| Write Speed                  | High                                                            | Variable                                                 |
| Read Speed                   | Tunable consistency, high with proper data modeling             | Variable                                                 |
| Query Language               | CQL (SQL-like)                                                  | Various (e.g., MongoDB uses a JSON-like query language)  |
| Support for Multi-Datacenter | Strong, built-in                                                | Variable                                                 |
| Scalability                  | Linear scalability, easy to add nodes                           | Varies by database                                       |
| Replication                  | Configurable per keyspace (SimpleStrategy, NetworkTopologyStrategy) | Varies by database                                   |
| Consistency                  | Tunable (from eventual to strong)                               | Varies by database                                       |

Cassandra CQL Commands

Keyspace Operations

Purpose: A keyspace is the outermost container for data in Cassandra. It groups related tables and defines the replication settings they share.

  • Create a Keyspace:
CREATE KEYSPACE keyspace_name
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
  • Use a Keyspace:
USE keyspace_name;
  • Describe Keyspaces: Lists all keyspaces in the cluster.
DESCRIBE KEYSPACES;

Table Operations

Purpose: Tables store the actual data and define the structure of the stored data.

  • Primary Key: In Cassandra, the primary key uniquely identifies a row in a table. It is composed of a partition key and optional clustering columns.

  • Clustering Key: The clustering key is used to sort the data within the partition.

Create a Table:

CREATE TABLE table_name (
    column1_name data_type PRIMARY KEY,
    column2_name data_type,
    ...
);
  • Describe Tables: Lists all the tables in the current keyspace.
DESCRIBE TABLES;
  • Describe a Specific Table: Shows the schema of a specified table.
DESCRIBE TABLE table_name;
  • Alter Table: Modifies the structure of an existing table.
ALTER TABLE table_name
ADD column_name data_type;
  • Drop Table: Deletes an entire table from the keyspace.
DROP TABLE table_name;
  • Truncate Table: Removes all data from a table without deleting the table itself.
TRUNCATE table_name;

Data Operations

  • Insert Data: Adds new data to a table.
INSERT INTO table_name (column1_name, column2_name, ...)
VALUES (value1, value2, ...);
  • Select Data: Retrieves data from a table.
SELECT * FROM table_name;
  • Select Specific Columns:
SELECT column1_name, column2_name
FROM table_name;
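  • Using WHERE Clause: Filters the result set based on specified conditions. The conditions normally have to restrict primary key columns, starting with the partition key, or a column that has a secondary index.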
SELECT * FROM table_name
WHERE column_name = value;
  • Update Data: Modifies existing data in a table.
UPDATE table_name
SET column1_name = value1, column2_name = value2, ...
WHERE primary_key_column = primary_key_value;
  • Delete Data: Removes data from a table.
DELETE FROM table_name
WHERE primary_key_column = primary_key_value;
  • Aggregation Commands: Performs aggregate functions on data in a table.

    • Count Rows:
    SELECT COUNT(*)
    FROM table_name;
    • Sum of a Column:
    SELECT SUM(column_name)
    FROM table_name;
    • Average of a Column:
    SELECT AVG(column_name)
    FROM table_name;
    • Minimum Value in a Column:
    SELECT MIN(column_name)
    FROM table_name;
    • Maximum Value in a Column:
    SELECT MAX(column_name)
    FROM table_name;

Note: Aggregation functions like SUM and AVG are generally used on numeric data types. MIN and MAX can also be used on varchar data types. If an aggregate is applied to an unsupported data type (e.g. boolean), expect an invalid-query error with code 2200.

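  • Group By Command: Groups rows that have the same values in specified columns into aggregate data. In Cassandra, GROUP BY is only allowed on partition key columns and clustering columns, in the order they are declared in the primary key.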
SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name;
  • Order By Command: The ORDER BY command is used to sort the result set by one or more columns.
SELECT * FROM table_name
WHERE partition_key = value
ORDER BY clustering_column DESC;

Note: The ORDER BY clause can only be used when the partition key is restricted by an = or IN clause. It orders the data based on the clustering columns.

  • Ascending Order:
SELECT column1_name, column2_name
FROM table_name
WHERE partition_key = value
ORDER BY clustering_column ASC;
  • Descending Order:
SELECT column1_name, column2_name
FROM table_name
WHERE partition_key = value
ORDER BY clustering_column DESC;
  • Using ALLOW FILTERING: Permits filtering based on columns that are not part of the primary key. However, this is generally not recommended in production environments as it can lead to performance issues. It forces Cassandra to scan multiple nodes and partitions, which can be inefficient and slow.
SELECT * FROM table_name
WHERE non_primary_key_column = value
ALLOW FILTERING;
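
To see why ALLOW FILTERING is costly, the toy Python model below compares a lookup by partition key with a filter on a non-key column. It is a simplified sketch, not how Cassandra executes queries: the table is a plain dictionary keyed by partition key, and the sample data and function names are made up for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Table = Dict[str, List[Row]]  # partition key -> rows stored in that partition


def read_partition(table: Table, partition_key: str) -> Tuple[List[Row], int]:
    """Read one partition; only the rows in that partition are examined."""
    rows = table.get(partition_key, [])
    return rows, len(rows)


def scan_with_filter(table: Table, predicate: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Filter on an arbitrary column; every row in every partition is examined."""
    examined = 0
    matches: List[Row] = []
    for rows in table.values():
        for row in rows:
            examined += 1
            if predicate(row):
                matches.append(row)
    return matches, examined


if __name__ == "__main__":
    table: Table = {
        "u1": [{"city": "Pune"}, {"city": "Delhi"}],
        "u2": [{"city": "Pune"}],
        "u3": [{"city": "Goa"}],
    }
    print(read_partition(table, "u1")[1], "rows examined for a partition lookup")
    print(scan_with_filter(table, lambda r: r["city"] == "Pune")[1], "rows examined for a filter")
```

The cost of the filtered query grows with the size of the whole table rather than with the size of the result, which is the behavior ALLOW FILTERING asks you to accept explicitly.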

Pagination Commands

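  • Using LIMIT for Pagination: LIMIT caps the number of rows returned. To fetch later pages, either rely on the driver's automatic paging or restrict the clustering column to values after the last row already seen.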
SELECT * FROM table_name
WHERE partition_key = value
LIMIT 10;
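
One way to page through a partition is to remember the clustering value of the last row seen and ask for rows after it, which in CQL would correspond to a query of the form WHERE partition_key = ? AND clustering_column > ? LIMIT n. The Python sketch below simulates that idea on an in-memory list that is already sorted by clustering value. The `fetch_page` helper and the sample rows are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (clustering value, payload)


def fetch_page(partition: List[Row], after: Optional[int], limit: int) -> List[Row]:
    """Return up to `limit` rows whose clustering value is greater than `after`."""
    keys = [row[0] for row in partition]
    start = 0 if after is None else bisect.bisect_right(keys, after)
    return partition[start:start + limit]


if __name__ == "__main__":
    partition = [(i, f"event-{i}") for i in range(1, 8)]  # sorted by clustering value
    last_seen: Optional[int] = None
    while True:
        page = fetch_page(partition, last_seen, limit=3)
        if not page:
            break
        print(page)
        last_seen = page[-1][0]  # resume after the last row of this page
```

Unlike an offset, the last-seen value stays valid when rows are inserted or deleted between requests.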

Handling Joins in Cassandra

Cassandra does not support traditional SQL joins. To achieve the effect of joins, you can use the following strategies:

  • Denormalization: Store redundant data in multiple tables to avoid the need for joins.

    Example:

    Suppose you have two tables, users and orders, and you need to fetch user information along with their orders. Instead of joining these tables, you can denormalize the data and store user information within the orders table.

      CREATE TABLE users (
          user_id UUID PRIMARY KEY,
          name TEXT,
          email TEXT
      );
    
      CREATE TABLE orders (
          order_id UUID PRIMARY KEY,
          user_id UUID,
          user_name TEXT,
          user_email TEXT,
          product_id UUID,
          product_name TEXT,
          quantity INT,
          price DECIMAL
      );
    
      -- A fixed example UUID is used so that the order row can reference the same user.
      INSERT INTO users (user_id, name, email)
      VALUES (123e4567-e89b-12d3-a456-426614174000, 'Aadarsh', 'aadarsh@example.com');
      INSERT INTO orders (order_id, user_id, user_name, user_email, product_id, product_name, quantity, price)
      VALUES (uuid(), 123e4567-e89b-12d3-a456-426614174000, 'Aadarsh', 'aadarsh@example.com', uuid(), 'Product A', 2, 29.99);
    

    By storing user_name and user_email directly in the orders table, you can avoid the need to join the users and orders tables when querying orders along with user information.

  • Materialized Views: Precomputed views that update automatically as data changes.

    Example:

    Suppose you have a users table and you want to create a materialized view to quickly access user information based on their email.

      CREATE TABLE users (
          user_id UUID PRIMARY KEY,
          name TEXT,
          email TEXT
      );
    
      CREATE MATERIALIZED VIEW users_by_email AS
          SELECT *
          FROM users
          WHERE email IS NOT NULL AND user_id IS NOT NULL
          PRIMARY KEY (email, user_id);
    

    This materialized view users_by_email allows you to query user information based on their email without the need to perform a join.

      SELECT * FROM users_by_email WHERE email = 'aadarsh@example.com';
    

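    The view is updated automatically as the users table changes. The update is propagated asynchronously, so the view can briefly lag behind the base table. Materialized views are marked experimental in recent Cassandra releases and may need to be enabled in the server configuration before they can be created.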
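
Both strategies trade extra work at write time for single-lookup reads. The Python sketch below mimics that trade-off with two in-memory dictionaries: a base table keyed by user_id and a derived lookup keyed by email that is refreshed on every write. With a materialized view Cassandra maintains the derived data itself, while with manual denormalization the application does it. All names here are illustrative.

```python
from typing import Dict, Optional

users: Dict[str, Dict[str, str]] = {}   # base table: user_id -> row
users_by_email: Dict[str, str] = {}     # derived lookup: email -> user_id


def upsert_user(user_id: str, name: str, email: str) -> None:
    """Write the base row and keep the email lookup in step with it."""
    previous = users.get(user_id)
    if previous is not None and previous["email"] != email:
        # The email changed, so the stale lookup entry has to be removed.
        users_by_email.pop(previous["email"], None)
    users[user_id] = {"name": name, "email": email}
    users_by_email[email] = user_id


def find_by_email(email: str) -> Optional[Dict[str, str]]:
    """Read through the derived lookup, with no scan of the base table."""
    user_id = users_by_email.get(email)
    if user_id is None:
        return None
    return {"user_id": user_id, **users[user_id]}


if __name__ == "__main__":
    upsert_user("u1", "Aadarsh", "aadarsh@example.com")
    print(find_by_email("aadarsh@example.com"))
```

The stale-entry cleanup in `upsert_user` is the part that is easy to forget when denormalizing by hand, and it is the bookkeeping a materialized view is meant to take over.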

Best Practices for Defining a Database in Cassandra

  • Choose the Right Primary Key: Ensure the primary key provides an even data distribution and supports your query requirements.

  • Use Composite Keys: Use composite keys for more granular control over data distribution and sorting.

  • Design for Queries: Model your data based on the queries you need to support. Avoid complex joins and model data to minimize the need for multiple queries.

  • Denormalize Data: Embrace data denormalization to reduce read latency and improve query performance.

  • Limit Partition Size: Keep partitions small so that reads, compaction, and repair stay efficient. A commonly cited guideline is to keep each partition well under roughly 100 MB.

  • Use Appropriate Data Types: Select appropriate data types for columns to optimize storage and performance.

  • Monitor and Tune: Regularly monitor performance and adjust replication factors, compaction strategies, and other configurations as needed.

Conclusion

Apache Cassandra is a powerful, distributed database solution designed for high availability and scalability. Its unique architecture, combined with flexible data modeling capabilities, makes it an excellent choice for applications that require large-scale data storage and high availability. With features like tunable consistency, dynamic scaling, and robust fault tolerance, Cassandra provides a resilient and efficient platform for managing big data workloads.