Apache Cassandra: distributed management of large databases


If you need to manage large amounts of data on the order of several terabytes or even petabytes, traditional database systems often reach their limits. In this case, you need big data applications that scale easily, since it’s often difficult to predict the actual volume of data in advance. One of the most popular examples of such systems is Cassandra, an open-source solution originally developed at Facebook.

What is Apache Cassandra?

Apache Cassandra is an open-source database management system (DBMS) for very large yet structured databases. Because it scales easily, a database can be distributed across many nodes in a cluster, which is why Cassandra is not bound to a single server.

Cassandra is a column-oriented NoSQL database, also referred to as a wide-column store. Here, NoSQL means “Not only SQL” rather than “no SQL”. When it comes to processing large amounts of data, NoSQL systems offer significant advantages over typical SQL databases because they are not bound by the rigid table schemas of the relational model. Apache Cassandra has its own query language called Cassandra Query Language (CQL), which resembles SQL in syntax but is tailored to the special features of Cassandra; for example, it does not support joins.

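As a NoSQL database, Cassandra relies on redundancy to ensure high resilience: each piece of data is stored on several nodes. By contrast, traditional relational databases are often designed around a single primary server, which makes replicating data across many machines harder.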

Fact

Cassandra was originally developed by Avinash Lakshman and Prashant Malik at Facebook and was first released as open source in 2008. In 2009, the Apache Software Foundation, one of the most important open-source developer communities, accepted the project into the Apache Incubator. In February 2010, Apache Cassandra graduated to a top-level project in the Apache Software Foundation, alongside other popular projects such as Apache HTTP Server, the Solr search server, the Kafka messaging platform, and OpenOffice.

Along with the original developers, other big companies such as IBM, Twitter, and Rackspace, one of the largest IT service providers in the United States, contribute to Cassandra. One major contributor to the project is DataStax, a company specializing in subscription-based support, installation assistance, and training courses in the Cassandra database. DataStax contributes 80% of Cassandra’s open-source releases and also offers DataStax Enterprise, a commercial database solution built on the freely available Cassandra system.

According to the DB-Engines Ranking, Apache Cassandra is currently the most popular column-oriented database, ranking ahead of big competitors such as Microsoft Azure Cosmos DB and Google Cloud Bigtable.

Cassandra: core functions

As a truly distributed system, Cassandra does not use a master. All nodes in a cluster have equal rights and can process every database request, which avoids a central bottleneck and significantly increases performance. Data is distributed across the nodes. The system can also be scaled easily by adding more nodes: after installing Cassandra on a new node, all you have to do is distribute the configuration files to it. Cassandra provides tools for this.
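The way data can be spread across equal nodes, and why adding a node only moves part of the data, can be illustrated with a toy token ring. This is a minimal sketch under stated assumptions, not Cassandra's implementation: the node names and keys are hypothetical, MD5 is used as a stand-in hash (Cassandra's default partitioner is based on Murmur3), and each node gets a single token, whereas real clusters usually assign many virtual nodes per server.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: a key belongs to the first node token at or after it."""

    def __init__(self, nodes):
        self._tokens = []   # sorted ring positions
        self._owners = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owners[t] = node

    def owner(self, key: str) -> str:
        # Find the first node token >= the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owners[self._tokens[i % len(self._tokens)]]


# Hypothetical keys and node names, for illustration only.
keys = [f"user:{n}" for n in range(1000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.owner(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node-d")
```

Every key that changes owner ends up on the new node, while all other keys stay where they were. With one token per node the split is uneven, which is one reason real deployments use many tokens per server.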

Apache Cassandra features a configurable replication system to ensure resilience and recovery of data in the event of a failure. The risk of data loss is minimized because the data is automatically replicated between the nodes. Failed nodes can be easily replaced, and the system remains available for requests as long as enough replicas can be reached.
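As a rough illustration of replication, the sketch below picks the nodes that hold copies of a piece of data by walking clockwise around a ring of node tokens until the replication factor is reached. It is a simplification that resembles Cassandra's simple replication strategy; the function name `replicas_for`, the node names, and the hand-picked tokens are assumptions made for this example.

```python
def replicas_for(position: int, ring: list[tuple[int, str]], rf: int) -> list[str]:
    """Return the nodes that store the data at a given ring position.

    ring is a list of (token, node) pairs sorted by token; rf is the
    replication factor, i.e. the number of copies to keep.
    """
    # Start at the first node whose token is >= position, wrapping around.
    start = next((i for i, (t, _) in enumerate(ring) if t >= position), 0)
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == rf:
            break
    return chosen


# Hypothetical four-node ring with hand-picked tokens in the range 0..99.
ring = [(10, "node-a"), (35, "node-b"), (60, "node-c"), (85, "node-d")]

replicas = replicas_for(42, ring, rf=3)
print(replicas)  # ['node-c', 'node-d', 'node-a']

# If one replica fails, the remaining copies can still serve the data.
alive = [n for n in replicas if n != "node-c"]
print(alive)  # ['node-d', 'node-a']
```

With a replication factor of 3, a single failed node still leaves two copies, which is why a failed node can be replaced without taking the data offline.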

Cassandra also offers high availability and partition tolerance. According to the CAP theorem in computer science, a distributed system cannot guarantee consistency, availability, and partition tolerance all at the same time. Consistency, meaning that all nodes see the same data at all times, has the lowest priority in many big data systems. After a failure, consistency can be quickly restored through data recovery, whereas the other two properties must be ensured at all times.
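In practice, Cassandra lets each request choose how many replicas must respond (consistency levels such as ONE, QUORUM, or ALL), so the trade-off between consistency and availability is adjustable. The arithmetic behind this can be sketched as follows; the helper names are hypothetical, and the rule shown is the general quorum-overlap condition rather than code taken from Cassandra.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True if every read set must share at least one replica with every write set."""
    return write_acks + read_acks > replication_factor


rf = 3
q = quorum(rf)
print(q)                                  # 2
print(read_sees_latest_write(rf, q, q))   # True
print(read_sees_latest_write(rf, 1, 1))   # False
print(read_sees_latest_write(rf, 1, rf))  # True
```

Requiring fewer acknowledgements keeps the system responsive when nodes are down, at the cost of possibly reading stale data until the replicas have converged.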

Cassandra databases support the MapReduce programming model, which was developed by Google for calculations involving large amounts of data in distributed systems, through integration with Apache Hadoop. Cassandra’s own query language, CQL (Cassandra Query Language), is designed especially for the data structures of Cassandra.

What are the benefits of Apache Cassandra?

One of the main advantages of Cassandra is that it provides easy scalability with very high resiliency – two fundamental requirements for big data applications. Cassandra is horizontally scalable, which means you can increase the capacity and performance of the system by adding more nodes. This is the opposite of vertical scaling, where you add more powerful CPUs and larger hard drives to a single database server when you need to increase performance or capacity. Horizontal scaling is the cheaper solution in most cases since you can use commercially available server hardware.

Cassandra’s data model is based on multidimensional hash tables where each row can have any number of columns. Unlike columns in a traditional database table, these columns do not have to be the same in every row. In benchmark analyses and real-life application scenarios, Apache Cassandra has also frequently shown a speed advantage over other NoSQL databases.
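The idea of a multidimensional hash table can be pictured as a hash table of hash tables: an outer table keyed by row, and one inner table per row keyed by column name. The following sketch is only an analogy in plain Python, with made-up row keys, columns, and a hypothetical helper `get_column`; it does not use or imitate the Cassandra API.

```python
from collections import defaultdict

# Outer hash table: row key -> inner hash table of column name -> value.
users = defaultdict(dict)

users["alice"]["email"] = "alice@example.com"
users["alice"]["city"] = "Berlin"
users["bob"]["email"] = "bob@example.com"
users["bob"]["last_login"] = "2024-01-15"
users["bob"]["newsletter"] = True

# Each row carries its own set of columns.
for row_key, columns in users.items():
    print(row_key, sorted(columns))


def get_column(table, row_key, column, default=None):
    """Look up one cell; missing rows or columns yield the default."""
    return table.get(row_key, {}).get(column, default)


print(get_column(users, "alice", "last_login"))  # None
```

Here the row for "alice" has two columns and the row for "bob" has three, and asking for a column that a row never stored simply returns nothing instead of an error.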

Where is Apache Cassandra used?

One of the main goals in developing Cassandra was to help Facebook users to search their inboxes more easily. The corporate giant used a cluster of over 150 individual nodes to power this feature. It’s no coincidence that Cassandra, which resembles Amazon Dynamo and Google Bigtable in its basic structures, is now very popular with providers of large social networks in which vast amounts of data are shared between users. Along with Twitter, Instagram, and Spotify, other big-name customers include the social bookmarking website Digg and social news aggregator Reddit.

Note

Facebook has since switched from Cassandra to an in-house solution that combines the HBase database with the HDFS file system, both part of the Apache Hadoop ecosystem.

Many other networks that handle large amounts of data use Cassandra both as a main database and as a secondary component for specific tasks. Examples include eBay, GitHub, Netflix, The Weather Channel, and the Large Hadron Collider at CERN, the European Organization for Nuclear Research (around 30,000 terabytes of data per year). Apple has one of the largest Cassandra installations, with 75,000 nodes.

Getting started with Apache Cassandra

Apache Cassandra runs on UNIX-like systems, preferably Linux servers. A Java Runtime Environment is also required because Cassandra is written in Java. The Apache project provides installation packages in Debian and RPM format. To install Cassandra, you add the corresponding repository to your package manager. After installation, you create the usual data, cache, and log directories and configure the cassandra.yaml file.

Cassandra has its own command line tools for administrator tasks. The most important utility is the Cassandra Query Language shell (cqlsh).

You can use the following command to view a list of all available command-line options; inside a running cqlsh session, the HELP command lists the available shell commands:

cqlsh --help

Tip

DataStax offers OpsCenter, a web-based tool for visual management and monitoring of Cassandra systems.

Reviewers

Sven Ignor
Sven Ignor is a TYPO3 web developer with over 15 years of experience and specialises in bespoke solutions based on TYPO3. He is happy to share his knowledge and is committed to TYPO3 and the community.
Julia Hertler
With over 18 years of experience in content marketing, Julia Hertler has deep expertise in digital communications. For the past 10 years, she has specialized in the areas of domains and hosting at the IONOS Digital Guide, making complex technical topics easy to understand.
