This tutorial shows how to install and configure Apache Cassandra on CentOS Stream 9 / RHEL 9. Apache Cassandra is an open-source NoSQL database written in Java that is designed to manage massive amounts of data across many servers. It is a lightweight, distributed, non-relational database whose strengths include automatic horizontal scaling, a distributed architecture, and a flexible approach to schema definition. Its NoSQL data model enables rapid, ad-hoc organization and analysis of very large data sets and disparate data types. It is vendor-independent, so it can be installed anywhere and works with any of the major cloud providers.
Each instance of Cassandra is called a node. A node can handle 2-4 TB of data and many thousands of operations per second, depending on the resources allocated to it. Cassandra is a peer-to-peer system with a masterless architecture in which every node provides the same functionality as any other node. The nodes communicate with each other via a protocol called gossip. The design is intended to keep a cluster highly available, with the goal of 100% uptime even when a node fails, because any node can handle an operation that another node would have handled. It also offers geographic distribution: you can set up data centers in different parts of the world and Cassandra handles the communication between the nodes. Being distributed means that Cassandra runs on multiple machines while appearing to users as a unified whole.
Cassandra uses partitioning to distribute data automatically, which benefits performance. Every table has a partition key that is responsible for distributing data among nodes and for determining data locality. Each node owns a particular set of tokens, and Cassandra distributes data across the cluster based on the ranges of these tokens. When data is inserted into a cluster, a hash function is applied to the partition key, and the resulting token determines which node gets the data. The node that owns the token range for that data is called a replica node, and the data can be replicated to multiple replica nodes to ensure reliability and fault tolerance. The replication factor (RF) describes how many copies of your data the cluster should keep.
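The token-based placement described above can be illustrated with a small shell sketch. This is a toy model and not Cassandra's implementation: the three node names, the 0-99 token space, and one token per node are all hypothetical, cksum stands in for the real partitioner hash, and replicas are simply the next nodes clockwise on the ring. It also assumes RF is no larger than the number of nodes.

```shell
# Hypothetical ring of "token:node" pairs sorted by token. Each node owns the
# token range that ends at its token. Real clusters use many tokens per node.
RING="33:node1 66:node2 99:node3"

# Map a partition key to a token in 0..99 (cksum is only a stand-in hash).
token_for_key() {
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % 100 ))
}

# Print the RF nodes that store a token: the owner of the range first,
# then the following nodes clockwise around the ring.
replicas_for_token() {
  token=$1 rf=$2
  found=0 out=""
  # Walk the ring twice so the replica list can wrap past the last node.
  for entry in $RING $RING; do
    if [ "$found" -lt "$rf" ] && { [ "$found" -gt 0 ] || [ "$token" -le "${entry%%:*}" ]; }; then
      out="$out ${entry#*:}"
      found=$(( found + 1 ))
    fi
  done
  echo "${out# }"
}

# Example: which nodes would hold the row with partition key "user:42" at RF=2?
# replicas_for_token "$(token_for_key user:42)" 2
```

With this ring, a token of 42 falls in the range owned by node2, so with RF=2 the copies land on node2 and node3; a token of 99 wraps around to node3 and node1.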
Apache Cassandra is an attractive option for enterprises because it provides linear scalability and proven fault tolerance on commodity hardware, which makes it a strong platform for mission-critical data. Its features include:
- Fault Tolerant – Cassandra replicates across multiple data centers, providing lower latency for your users and the assurance that data will not be lost in case of any failure. Failed nodes are replaced with no downtime.
- Distributed – Cassandra is designed as a distributed system. To benefit from its maximum performance it is recommended for Cassandra to be run on multiple machines. The architecture makes it suitable for applications that are not supposed to lose data even when a node is down.
- Scalable – Cassandra scales linearly when new machines are added across as many geographical sites as needed. It also streams data between nodes during scaling operations such as adding a new node or data center during peak traffic times providing an elastic architecture, particularly in Cloud.
- Performant and quality-focused – Cassandra has been tested on clusters of over 1000 nodes and outperforms popular NoSQL databases in benchmarks and real-world applications.
- Security and observability – With the Audit Logging feature, Cassandra tracks DML, DDL, and DCL activity with minimal impact on normal workload performance.
Apache Cassandra powers mission-critical deployments with improved performance and high scalability. It is run by organizations of all sizes, from startups to the largest enterprises. They include Apple, Alby, BestBuy, Bloomberg, Bigmate, Airship, Instagram, Hulu, eBay, Macy’s, The New York Times, Target, Spotify, Walmart, Uber, Yelp, and many more.
The latest major release of Apache Cassandra is version 4, which comes with the following features.
- Support for Java 11 that can be used to build and run Apache Cassandra 4.0
- Apache Cassandra 4.0 implements virtual tables backed by an API
- Introduced Audit Logging, with configurable limits on heap memory and disk space to prevent out-of-memory errors
- Added Full Query Logging (FQL) for debugging query traffic and for migrations
- Improved Internode Messaging with optimized Internode Messaging Protocol
- Improved Streaming, which is the way the nodes of a cluster exchange data in the form of SSTables
Install Apache Cassandra on CentOS Stream 9 / RHEL 9
The following steps walk you through installing Apache Cassandra on CentOS Stream 9 / RHEL 9.
Install the required dependencies on your system: Java 11, Python 3, and pip.
sudo yum install java-11-openjdk python3 python3-pip
Install cqlsh with pip using the following command
sudo pip3 install cqlsh tox
Confirm that Java and cqlsh are installed
$ java -version
openjdk version "11.0.16" 2022-07-19 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.16.0.8-1.el9_0) (build 11.0.16+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.16.0.8-1.el9_0) (build 11.0.16+8-LTS, mixed mode, sharing)
$ cqlsh --version
cqlsh 6.0.0
Create the repo file to add Cassandra’s repository.
sudo vi /etc/yum.repos.d/cassandra.repo
Then add the Apache repository of Cassandra to the file.
[cassandra]
name=Apache Cassandra
baseurl=https://redhat.cassandra.apache.org/41x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://downloads.apache.org/cassandra/KEYS
Once you have pasted the above lines, press Esc, then type :x and press Enter to save and close the file. Update the package index.
sudo yum update -y
Note: If you get an error while importing the GPG keys, you can set the system crypto policy to LEGACY for compatibility and then reboot your system. Be aware that LEGACY re-enables older, weaker algorithms system-wide.
sudo update-crypto-policies --set LEGACY
sudo reboot
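If you would rather not use an editor, the repository file from the earlier step can also be written from a script. The following is a minimal sketch: the function name write_cassandra_repo is hypothetical, and the file content is exactly the repository definition shown above.

```shell
# Write the Cassandra repository definition to the given path.
# Defaults to the location used in this tutorial, which requires root to write.
write_cassandra_repo() {
  target=${1:-/etc/yum.repos.d/cassandra.repo}
  cat > "$target" <<'EOF'
[cassandra]
name=Apache Cassandra
baseurl=https://redhat.cassandra.apache.org/41x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://downloads.apache.org/cassandra/KEYS
EOF
}

# Example: write to a temporary file first, then copy it into place as root.
# tmp=$(mktemp) && write_cassandra_repo "$tmp"
# sudo install -m 644 "$tmp" /etc/yum.repos.d/cassandra.repo
```

Writing to a temporary file and copying it into place keeps the privileged step to a single command.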
Install Cassandra with the following command
sudo yum install cassandra -y
Verify that Apache Cassandra has been successfully installed using the rpm command below.
$ rpm -qi cassandra
Name : cassandra
Version : 4.1~alpha1
Release : 1
Architecture: noarch
Install Date: Tue 13 Sep 2022 12:37:11 AM EAT
Group : Development/Libraries
Size : 61567406
License : Apache Software License 2.0
Signature : RSA/SHA256, Fri 20 May 2022 11:40:59 PM EAT, Key ID e91335d77e3e87cb
Source RPM : cassandra-4.1~alpha1-1.src.rpm
Build Date : Fri 20 May 2022 11:40:38 PM EAT
Build Host : 10a3cf41bc4a
URL : http://cassandra.apache.org/
Summary : Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store.
Description :
Cassandra is a distributed (peer-to-peer) system for the management and storage of structured data.
Create the Cassandra service
$ sudo vim /etc/systemd/system/cassandra.service
[Unit]
Description=Apache Cassandra
After=network.target
[Service]
PIDFile=/var/run/cassandra/cassandra.pid
User=cassandra
Group=cassandra
ExecStart=/usr/sbin/cassandra -f -p /var/run/cassandra/cassandra.pid
Restart=always
[Install]
WantedBy=multi-user.target
Then reload the daemon.
sudo systemctl daemon-reload
Start the Cassandra service and enable it to start on boot.
sudo systemctl start cassandra
sudo systemctl enable cassandra
Then check the status of the service.
$ sudo systemctl status cassandra
● cassandra.service - Apache Cassandra
Loaded: loaded (/etc/systemd/system/cassandra.service; disabled; vendor pr>
Active: active (running) since Tue 2022-09-13 00:39:36 EAT; 30s ago
Main PID: 2826 (java)
Tasks: 44 (limit: 48809)
Memory: 2.2G
CPU: 10.231s
CGroup: /system.slice/cassandra.service
└─2826 /usr/bin/java -ea -da:net.openhft... -XX:+UseThreadPrioriti
You can also check the status of Cassandra using nodetool.
$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 104.38 KiB 16 100.0% 304ddf04-80c4-460a-9076-cb364443c6fb rack1
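Cassandra usually needs some time after the service starts before nodetool can connect to it. If you script the installation, a small retry helper avoids failing on the first attempt. This is a sketch only: wait_until is a hypothetical helper, and the attempt count and delay in the example are arbitrary values.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  attempts=$1 delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    # Discard the command's output; only its exit status matters here.
    "$@" >/dev/null 2>&1 && return 0
    attempts=$(( attempts - 1 ))
    sleep "$delay"
  done
  return 1
}

# Example: wait up to 30 x 5 seconds for nodetool to reach the node.
# wait_until 30 5 nodetool status && echo "Cassandra is reachable"
```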
To connect to the database, use the following command.
$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042
[cqlsh 6.0.0 | Cassandra 4.1-alpha1 | CQL spec 3.4.5 | Native protocol v5]
Use HELP for help.
cqlsh>
You can change the cluster name with the following command.
> UPDATE system.local SET cluster_name = 'Technixleo Cluster' WHERE KEY = 'local';
Logout.
> quit
Edit the YAML configuration file to change the cluster name there as well.
sudo vi /etc/cassandra/default.conf/cassandra.yaml
Edit the Cluster name
cluster_name: 'Technixleo Cluster'
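The same edit can be scripted instead of made by hand. The sketch below covers only the cassandra.yaml change, not the cqlsh update. The function name set_cluster_name is hypothetical, and it assumes the file contains a single uncommented cluster_name line and that the new name contains no /, & or quote characters.

```shell
# Replace the cluster_name line in a cassandra.yaml file.
# Usage: set_cluster_name <yaml-file> <new-name>
set_cluster_name() {
  file=$1 name=$2
  tmp=$(mktemp)
  # Rewrite via a temporary file, then copy the content back so the original
  # file keeps its owner and permissions.
  sed "s/^cluster_name:.*/cluster_name: '$name'/" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (run from a root shell, since the file is not writable by regular users):
# set_cluster_name /etc/cassandra/default.conf/cassandra.yaml "Technixleo Cluster"
```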
Using a separate secondary disk for Cassandra data
Cassandra is data hungry and can use a lot of space on your main disk. Most users prefer having a dedicated disk to store Cassandra data.
My secondary disk is /dev/sdb. I will create a partition on the disk using the fdisk utility.
$ sudo fdisk /dev/sdb
Welcome to fdisk (util-linux 2.37.4).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0x8ed35d2c.
Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-104857599, default 2048):
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-104857599, default 104857599):
Created a new partition 1 of type 'Linux' and of size 50 GiB.
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
The new partition is /dev/sdb1 as shown below.
$ sudo fdisk -l /dev/sdb
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x8ed35d2c
Device Boot Start End Sectors Size Id Type
/dev/sdb1 2048 104857599 104855552 50G 83 Linux
Create the directory to store Cassandra’s data, if it does not already exist.
sudo mkdir -p /var/lib/cassandra
Make the cassandra user the owner of the directory instead of opening it to every user.
sudo chown cassandra:cassandra /var/lib/cassandra
A new partition has no filesystem yet, so mounting it will fail until it is formatted. Format the partition with a filesystem such as ext4 as shown below.
$ sudo mkfs.ext4 /dev/sdb1
mke2fs 1.46.5 (30-Dec-2021)
Discarding device blocks: done
Creating filesystem with 13106944 4k blocks and 3276800 inodes
Filesystem UUID: 81f42655-7ccd-4698-bc2a-7fdc45dfd106
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424
Allocating group tables: done
Writing inode tables: done
Creating journal (65536 blocks): done
Writing superblocks and filesystem accounting information: done
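Mount the new partition on the directory with the following command. The root of a freshly formatted filesystem is owned by root, so give it back to the cassandra user once it is mounted.
sudo mount /dev/sdb1 /var/lib/cassandra
sudo chown cassandra:cassandra /var/lib/cassandra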
Confirm it is mounted
$ df -h /var/lib/cassandra
Filesystem Size Used Avail Use% Mounted on
/dev/sdb1 49G 24K 47G 1% /var/lib/cassandra
View the UUID of the new device with the following command
sudo file -sL /dev/sdb1
/dev/sdb1: Linux rev 1.0 ext4 filesystem data, UUID=81f42655-7ccd-4698-bc2a-7fdc45dfd106 (needs journal recovery) (extents) (64bit) (large files) (huge files)
Edit /etc/fstab to add the new partition with its filesystem type, so that it is mounted automatically at boot.
UUID=81f42655-7ccd-4698-bc2a-7fdc45dfd106 /var/lib/cassandra ext4 defaults 0 2
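When this step is scripted, it helps to make the append idempotent so that running it twice does not duplicate the entry. The following is a minimal sketch: add_fstab_entry is a hypothetical helper, and it uses the same defaults 0 2 options as the line above.

```shell
# Append a UUID-based entry to an fstab file unless the mount point is
# already listed on an uncommented line.
# Usage: add_fstab_entry <fstab-file> <uuid> <mount-point> <fs-type>
add_fstab_entry() {
  fstab=$1 uuid=$2 mountpoint=$3 fstype=$4
  # awk exits 0 when some uncommented line already uses this mount point.
  if awk -v m="$mountpoint" '$1 !~ /^#/ && $2 == m { found = 1 } END { exit !found }' "$fstab"; then
    return 0
  fi
  printf 'UUID=%s %s %s defaults 0 2\n' "$uuid" "$mountpoint" "$fstype" >> "$fstab"
}

# Example (run from a root shell), reading the UUID with blkid:
# add_fstab_entry /etc/fstab "$(blkid -s UUID -o value /dev/sdb1)" /var/lib/cassandra ext4
```

After editing /etc/fstab, running sudo mount -a is a quick way to confirm that the entry parses and mounts without waiting for a reboot.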
Flush the memtables of the system keyspace to disk.
nodetool flush system
Then restart Cassandra to apply the changes.
sudo systemctl restart cassandra
Then check if Cassandra is running
nodetool status
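To make this check in a script rather than by eye, the output of nodetool status can be filtered for nodes that are not in the UN (Up/Normal) state. The sketch below relies on the output layout shown earlier, where each node line starts with a two-letter state code followed by the address. The function name nodes_not_up_normal is hypothetical.

```shell
# Read `nodetool status` output on stdin and print the address of every node
# whose two-letter state code is anything other than UN (Up/Normal).
nodes_not_up_normal() {
  awk '$1 ~ /^[UD][NLJM]$/ && $1 != "UN" { print $2 }'
}

# Example: report problem nodes, or confirm that everything is up.
# bad=$(nodetool status | nodes_not_up_normal)
# [ -z "$bad" ] && echo "all nodes are UN" || echo "check these nodes: $bad"
```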
nodetool Utility
nodetool is a command line utility that is used to manage a cluster by exposing operations and attributes available with Cassandra.
To print the Cassandra release version reported by nodetool, use the following command.
$ nodetool version
ReleaseVersion: 4.1-alpha1
To return information about a specific node, use the following command.
$ nodetool info
ID : 304ddf04-80c4-460a-9076-cb364443c6fb
Gossip active : true
Native Transport active: true
Load : 125.35 KiB
Generation No : 1663019321
Uptime (seconds) : 194
Heap Memory (MB) : 114.18 / 1902.00
Off Heap Memory (MB) : 0.00
Data Center : datacenter1
Rack : rack1
Exceptions : 0
Key Cache : entries 11, size 984 bytes, capacity 95 MiB, 115 hits, 130 requests, 0.885 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 47 MiB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Percent Repaired : 100.0%
Token : (invoke with -T/--tokens to see all 16 tokens)
The nodetool describecluster command gives you the name of the Cassandra cluster, along with its snitch, partitioner, and schema versions.
$ nodetool describecluster
Cluster Information:
Name: Technixleo Cluster
Snitch: org.apache.cassandra.locator.SimpleSnitch
DynamicEndPointSnitch: enabled
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
54e17321-3f2e-37ca-9b08-d91ba7bdd369: [127.0.0.1]
Stats for all nodes:
Live: 1
Joining: 0
Moving: 0
Leaving: 0
Unreachable: 0
Data Centers:
datacenter1 #Nodes: 1 #Down: 0
Database versions:
4.1.0-alpha1: [127.0.0.1:7000]
Keyspaces:
system_schema -> Replication class: LocalStrategy {}
system -> Replication class: LocalStrategy {}
system_auth -> Replication class: SimpleStrategy {replication_factor=1}
system_distributed -> Replication class: SimpleStrategy {replication_factor=3}
system_traces -> Replication class: SimpleStrategy {replication_factor=2}
To identify which node is responsible for handling which range of tokens:
$ nodetool ring
Datacenter: datacenter1
==========
Address Rack Status State Load Owns Token
7908336325362939036
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -8948403195832713941
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -7329379900135670643
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -6351234447619556568
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -4572600146795969439
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -3476558246561031150
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -2385261135868818890
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -1540884488377657134
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% -100616258475073895
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 970884357864958917
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 1988735762607385932
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 2806407695219736727
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 3882163576445316266
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 4618079269156209907
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 5697452076603811565
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 6940972082284403537
127.0.0.1 rack1 Up Normal 125.35 KiB 100.00% 7908336325362939036
Warning: "nodetool ring" is used to output all the tokens of a node.
To view status related info of a node use "nodetool status" instead.
To query information about a remote node:
nodetool -h <ip-address> -p <jmx-port> info
To remove data that a node is no longer responsible for, use the following command
nodetool cleanup
Wrap up
Apache Cassandra is designed to scale when an application is under heavy load, reducing the risk of losing data or stalling operations. It is billed as an ‘always on’ database and is deployment agnostic, meaning you can run it on-premises, on a single cloud provider, or across multiple cloud providers. Cassandra also lets you tune consistency: the consistency level sets the minimum number of Cassandra nodes that must acknowledge a read or write operation to the coordinator before the operation is considered successful. Check below for some of our other articles:
- Install and Use MySQL Workbench on Kubuntu/KDE Neon
- Install MySQL / MariaDB Database on Solus Linux
- Install PostgreSQL 13 on CentOS |AlmaLinux |RHEL
Hello Ann Kamau, thank you for giving such a detailed explanation for installing Cassandra in Centos Stream, I followed your step by step but currently stuck at the “cqlsh” command to connect to Cassandra.
Whenever i tried to run that command I got the following error :
Traceback (most recent call last):
File “/usr/bin/cqlsh.py”, line 148, in
from cqlshlib import cql3handling, pylexotron, sslhandling, cqlshhandling, authproviderhandling
ImportError: cannot import name ‘authproviderhandling’ from ‘cqlshlib’ (/usr/local/lib/python3.9/site-packages/cqlshlib/__init__.py)
this is after I installed the cqlsh according to step by step that provided above
Sorry forgot to say, Thanks in advance much appreciated your help Ann Kamau
Glad I could help