Apache Cassandra is an open source, NoSQL database.

Cassandra Database: Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure. Apache Cassandra: Apache Cassandra is a free and open-source, distributed, wide-column store NoSQL database management system designed to handle large amounts of data, providing high availability with no single point of failure.

Cassandra Database:

Cassandra is a distributed database management system designed to handle a high volume of structured data across commodity servers. Its distributed architecture is what lets it handle such large amounts of data: data is placed on different machines with a replication factor greater than one, which provides high availability and no single point of failure. The design goal of Cassandra is to handle big data workloads across multiple nodes without any single point of failure. Cassandra is a peer-to-peer distributed system, and data is distributed among all the nodes in a cluster.

  1. All the nodes in a cluster play the same role. Each node is independent and at the same time interconnected to other nodes.
  2. Each node in a cluster can accept read and write requests, regardless of where the data is actually located in the cluster.
  3. When a node goes down, read/write requests can be served from other nodes in the network.

The key components of Cassandra:

  1. Node − It is the place where data is stored.
  2. Data center − It is a collection of related nodes.
  3. Cluster − A cluster is a component that contains one or more data centers.
  4. Commit log − The commit log is a crash-recovery mechanism in Cassandra. Every write operation is written to the commit log.
  5. Mem-table − A mem-table is a memory-resident data structure. After the commit log, the data is written to the mem-table. Sometimes, for a single column family, there will be multiple mem-tables.
  6. SSTable − It is a disk file to which the data is flushed from the mem-table when its contents reach a threshold value.
  7. Bloom filter − A Bloom filter is a quick, probabilistic data structure for testing whether an element is a member of a set. It acts as a special kind of cache. Bloom filters are consulted on every read so that SSTables that cannot contain the requested key are skipped.
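
The way these components cooperate on a single node can be sketched roughly as follows. This is a toy model, not Cassandra's implementation: the names ToyNode, BloomFilter and flush_threshold are hypothetical, the Bloom filter uses salted SHA-256 purely for illustration, and details such as commit log segments and compaction are left out.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest (an assumption).
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyNode:
    """Hypothetical single-node model of the write and read path."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory, holds the most recent values
        self.sstables = []    # flushed (bloom_filter, sorted_rows) pairs, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit log first, for crash recovery
        self.memtable[key] = value            # 2. then the mem-table
        if len(self.memtable) >= self.flush_threshold:
            self._flush()                     # 3. flush once the threshold is reached

    def _flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        rows = dict(sorted(self.memtable.items()))  # an SSTable is sorted and never modified
        self.sstables.append((bloom, rows))
        self.memtable = {}
        self.commit_log.clear()  # simplification: flushed writes no longer need replaying

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable first; the Bloom filter lets the read skip files
        # that definitely do not hold the key.
        for bloom, rows in reversed(self.sstables):
            if bloom.might_contain(key) and key in rows:
                return rows[key]
        return None


node = ToyNode(flush_threshold=2)
node.write("a", 1)
node.write("b", 2)   # reaches the threshold, so the mem-table is flushed
node.write("a", 10)  # a newer value that lives only in the mem-table
print(node.read("a"), node.read("b"), node.read("missing"))  # 10 2 None
```

The read checks the mem-table before any SSTable, which is why the newer value for "a" wins over the flushed one.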

Although the Cassandra Query Language resembles SQL, their data modelling methods are totally different. In Cassandra, a bad data model can degrade performance, especially when users try to apply RDBMS concepts to Cassandra.

Cassandra Query Language:

Users can access Cassandra through its nodes using the Cassandra Query Language (CQL). CQL treats the database (keyspace) as a container of tables. Programmers work with CQL either through cqlsh, an interactive prompt, or through separate application language drivers. Clients can approach any of the nodes for their read-write operations. That node (the coordinator) acts as a proxy between the client and the nodes holding the data.

Cluster in Cassandra Database:

The Cassandra database is distributed over several machines that operate together. The outermost container is known as the cluster. For failure handling, every node holds replicas of data owned by other nodes, and in case of a failure, a replica takes charge. Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.

Model Your Data in Cassandra:

The following things should be kept in mind while modelling your data.

  1. Determine what queries you want to support. For example, do you need:
    • Joins
    • Group by
    • Filtering on which column etc.
  2. Create tables according to your queries. Create a table that will satisfy your queries. Try to create a table in such a way that a minimum number of partitions needs to be read.
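
The effect of choosing a table layout to match the query can be illustrated with a small sketch. It is not Cassandra code: the sample events and the helper names build_table and partitions_read are hypothetical, and a partition is modelled simply as a list of rows sharing the same key.

```python
from collections import defaultdict

# Hypothetical sample data: (event_id, user_id, action)
events = [
    (1, "alice", "login"),
    (2, "bob", "login"),
    (3, "alice", "purchase"),
    (4, "alice", "logout"),
]


def build_table(rows, partition_key):
    """Group rows into partitions using the chosen partition-key function."""
    table = defaultdict(list)
    for row in rows:
        table[partition_key(row)].append(row)
    return table


def partitions_read(table, predicate):
    """Count the partitions holding at least one row that matches the query."""
    return sum(1 for rows in table.values() if any(predicate(r) for r in rows))


def is_alice(row):
    """The query to support: all events for user 'alice'."""
    return row[1] == "alice"


by_event = build_table(events, lambda r: r[0])  # partitioned by event_id
by_user = build_table(events, lambda r: r[1])   # partitioned by user_id

print(partitions_read(by_event, is_alice))  # 3
print(partitions_read(by_user, is_alice))   # 1
```

Partitioning by the column the query filters on keeps the read inside a single partition, which is the goal described in step 2.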

Advantages of Cassandra Database:

  • One of the biggest advantages of using Cassandra is its elastic scalability. You don't have to restart the cluster or change the queries of your Cassandra application while scaling up or down. This is why Cassandra is known for sustaining very high throughput as the number of nodes grows.
  • You get availability: replication means your data is available on multiple nodes, datacenters, racks, or zones, and this is configurable. The details of the mechanics of replication are abstracted from the user, which makes it easy to use.
  • Cassandra cluster deployment allows much of its behavior to be specified through a YAML file, and there are other config files that you can tweak to achieve functionality. You can also change the configuration programmatically, for example using sed or a regex in a program.

Disadvantages of Cassandra Database:

  • Replication means data gets replicated across multiple nodes as you configure. For example, every record written can be replicated to 2, 3, or even 10 other nodes. But this also means any bad data gets replicated too, so you have to take care not to write it. For example, if a user ID table accidentally receives id = -1, name = ‘Moe’, address = ‘blah’, and this is incorrect according to business logic, Cassandra will still accept it and replicate it. So replication doesn’t automatically mean your data is safe.
  • As you advance further into Cassandra there are more things you may need to pay attention to, and not all of the figures it reports are accurate (row counts, for example, or read-write related statistics). This means you cannot just randomly tweak settings and fix your cluster if something starts going wrong.
  • You cannot run unanticipated queries, because the way data is stored on disk and in memory means you can’t query on any column you want. You will explicitly have to add indexes. This will bite you if you assume that you just have to create a table using CQL (Cassandra Query Language), which is modeled on SQL.

Apache Cassandra:

Apache Cassandra is a free and open-source, distributed, wide-column store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients. An introduction to Apache Cassandra introduced us to various aspects of Apache Cassandra. In this article we are going to delve into Cassandra’s architecture. Cassandra is a peer-to-peer distributed database that runs on a cluster of homogeneous nodes. Cassandra has been architected from the ground up to handle large volumes of data while providing high availability. Cassandra provides high write and read throughput. A Cassandra cluster has no special nodes, i.e. the cluster has no masters, no slaves or elected leaders. This enables Cassandra to be highly available while having no single point of failure.

Apache Cassandra is the open-source project managed by the Apache Software Foundation, while DataStax Enterprise is a commercial distribution built on Cassandra that tends to be adopted by larger organizations for storing huge amounts of data.

In order to understand Cassandra's architecture it is important to understand some key concepts, data structures and algorithms frequently used by Cassandra.

  • Data Partitioning -  Apache Cassandra is a distributed database system using a shared nothing architecture. A single logical database is spread across a cluster of nodes and thus the need to spread data evenly amongst all participating nodes. At a 10000 foot level Cassandra stores data by dividing data evenly around its cluster of nodes. Each node is responsible for part of the data. The act of distributing data across nodes is referred to as data partitioning.
  • Consistent Hashing - Two main problems crop up when trying to distribute data efficiently. One, determining the node on which a specific piece of data should reside. Two, minimising data movement when adding or removing nodes. Consistent hashing enables us to achieve these goals. A consistent hashing algorithm enables us to map Cassandra row keys to physical nodes. The range of values from a consistent hashing algorithm is a fixed circular space which can be visualised as a ring. Consistent hashing also minimises the key movements when nodes join or leave the cluster. On average only k/n keys need to be remapped, where k is the number of keys and n is the number of slots (nodes). This is in stark contrast to most hashing algorithms, where a change in the number of slots results in the need to remap a large number of keys.
  • Data Replication - Partitioning of data on a shared nothing system results in a single point of failure, i.e. if one of the nodes goes down, part of your data is unavailable. This limitation is overcome by creating copies of the data, known as replicas, thus avoiding a single point of failure. Storing copies of data on multiple nodes is referred to as replication. Replication of data ensures fault tolerance and reliability.
  • Eventual Consistency - Since data is replicated across nodes we need to ensure that data is synchronized across replicas. This is referred to as data consistency. Eventual consistency is a consistency model used in distributed computing. It theoretically guarantees that, provided there are no new updates, all nodes/replicas will eventually return the last updated value. The Domain Name System (DNS) is a good example of an eventually consistent system.
  • Tunable Consistency - Cassandra provides tunable consistency i.e. users can determine the consistency level by tuning it via read and write operations. Eventual consistency often conjures up fear and doubt in the minds of application developers. The key thing to keep in mind is that reaching a consistent state often takes microseconds.
  • Consistency Level - Cassandra enables users to configure the number of replicas in a cluster that must acknowledge a read or write operation before considering the operation successful. The consistency level is a required parameter in any read and write operation and determines the exact number of nodes that must successfully complete the operation before considering the operation successful.
  • Data Centre, Racks, Nodes - A Data Centre (DC) is a centralised place to house computer and networking systems to help meet an organisation's information technology needs. A rack is a unit that contains multiple servers all stacked one on top of another. A rack enables data centres to conserve floor space and consolidates networked resources. A node is a single server in a rack. Why do we care? Often Cassandra is deployed in a DC environment and one must replicate data intelligently to ensure no single point of failure. Data must be replicated to servers in different racks to ensure continued availability in the case of rack failure. Cassandra can be easily configured to work in a multi DC environment to facilitate fail over and disaster recovery.
  • Snitches and Replication Strategies - As mentioned above, it is important to intelligently distribute data across DCs and racks. In Cassandra the distribution of data across nodes is configurable. Cassandra uses snitches and replication strategies to determine how data is replicated across DCs, racks and nodes. Snitches determine the proximity of nodes within a ring. Replication strategies use the proximity information provided by snitches to determine the locality of a particular copy.
  • Gossip Protocol - Cassandra uses a gossip protocol to discover node state for all nodes in a cluster.  Nodes discover information about other nodes by exchanging state information about themselves and other nodes they know about. This is done with a maximum of 3 other nodes. Nodes do not exchange information with every other node in the cluster in order to reduce network load. They just exchange information with a few nodes and over a period of time state information about every node propagates throughout the cluster. The gossip protocol facilitates failure detection.
  • Bloom Filters - A bloom filter is an extremely fast way to test whether an element exists in a set. A bloom filter can tell if an item might exist in a set or definitely does not exist in the set. False positives are possible but false negatives are not. Bloom filters are a good way of avoiding expensive I/O operations.
  • Merkle Tree - A Merkle tree is a hash tree which provides an efficient way to find differences in data blocks. Leaves contain hashes of individual data blocks and parent nodes contain hashes of their respective children. This enables an efficient way of finding differences between nodes.
  • SSTable - A Sorted String Table (SSTable) is an ordered, immutable key-value map. It is basically an efficient way of storing large sorted data segments in a file.
  • Write Back Cache - A write back cache is where the write operation is only directed to the cache and completion is immediately confirmed. This is different from Write-through cache where the write operation is directed at the cache but is only confirmed once the data is written to both the cache and the underlying storage structure.
  • Memtable - A memtable is a write back cache residing in memory which has not been flushed to disk yet.
  • Cassandra Keyspace - A keyspace is similar to a schema in the RDBMS world. A keyspace is a container for all your application data. When defining a keyspace, you need to specify a replication strategy and a replication factor, i.e. the number of nodes that the data must be replicated to.
  • Column Family - A column family is analogous to the concept of a table in an RDBMS, but that is where the similarity ends. Instead of thinking of a column family as an RDBMS table, think of it as a map of sorted maps: Map<RowKey, SortedMap<ColumnKey, ColumnValue>>. A row in the map provides access to a set of columns, which is represented by a sorted map. Please note that in CQL (Cassandra Query Language) lingo a column family is referred to as a table.
  • Row Key - A row key is also known as the partition key and has a number of columns associated with it i.e. a sorted map as shown above. The row key is responsible for determining data distribution across a cluster.
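
The partitioning, consistent hashing and replication ideas in this list can be sketched together in a few lines. This is a simplified illustration under stated assumptions, not Cassandra's partitioner: the token function uses SHA-256 truncated to 64 bits, each node owns exactly one token (no virtual nodes), replicas are simply the next nodes clockwise with no awareness of racks or data centres, and the node names are hypothetical.

```python
import bisect
import hashlib


def token(value):
    """Map a string onto a fixed circular space of 2**64 positions (assumed size)."""
    digest = hashlib.sha256(value.encode()).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy consistent-hashing ring with one token per node."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, row_key):
        """Walk clockwise from the key's token and return the next distinct nodes."""
        start = bisect.bisect_right(self.tokens, (token(row_key), ""))
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


keys = [f"user:{i}" for i in range(1000)]
before = Ring(["node-a", "node-b", "node-c", "node-d"])
after = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])

# Keys whose first replica (primary owner) changed after node-e joined.
moved = [k for k in keys if before.replicas(k)[0] != after.replicas(k)[0]]
print(f"{len(moved)} of {len(keys)} keys changed primary owner")
print(before.replicas("user:1"))
```

Every key that moves ends up on the newly added node; no key is shuffled between the existing nodes, which is the property that keeps data movement small when the cluster grows.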

Main Features:

  • Distributed: Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master as every node can service any request.
  • Supports replication and multi data center replication.
  • Replication strategies are configurable. Cassandra is designed as a distributed system, for deployment of large numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically tailored for multiple-data center deployment, for redundancy, for failover and disaster recovery.
  • Scalability: Designed to have read and write throughput both increase linearly as new machines are added, with the aim of no downtime or interruption to applications.
  • Fault-tolerant: Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
  • Tunable consistency: Cassandra is typically classified as an AP system, meaning that availability and partition tolerance are generally considered to be more important than consistency in Cassandra. Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable", with the quorum level in the middle.
  • MapReduce support: Cassandra has Hadoop integration, with MapReduce support. There is support also for Apache Pig and Apache Hive.
  • Query language: Cassandra introduced the Cassandra Query Language (CQL). CQL is a simple interface for accessing Cassandra, as an alternative to the traditional Structured Query Language (SQL).
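
The quorum level mentioned under tunable consistency can be made concrete with a little arithmetic. The sketch below is plain Python rather than driver code, and the function names are hypothetical. It assumes the usual definition of a quorum as a majority of replicas, and the well-known rule that a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor):
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > replication_factor


rf = 3
q = quorum(rf)
print(q)  # 2
print(reads_see_latest_write(rf, write_acks=q, read_acks=q))  # True
print(reads_see_latest_write(rf, write_acks=1, read_acks=1))  # False
```

With a replication factor of 3, quorum writes and quorum reads always share at least one replica, while writing and reading at a single replica trades that guarantee for lower latency and higher availability.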

Advantages of Apache Cassandra:

  • Open Source: Cassandra is Apache’s open-source project, which means it is available for free. You can download the application and use it the way you want. In fact, its open-source nature has given birth to a huge Cassandra community where like-minded people share their views, queries and suggestions related to Big Data. Further, Cassandra can be integrated with other Apache open-source projects like Hadoop (with the help of MapReduce), Apache Pig and Apache Hive.
  • Peer-to-Peer Architecture: Cassandra follows a peer-to-peer architecture instead of a master-slave architecture. Hence, there is no single point of failure in Cassandra. Moreover, any number of servers/nodes can be added to any Cassandra cluster in any of the datacenters. As all the machines are at an equal level, any server can entertain a request from any client. With its robust architecture and exceptional characteristics, Cassandra has raised the bar far above other databases.
  • High Availability and Fault Tolerance: Another striking feature of Cassandra is data replication, which makes Cassandra highly available and fault-tolerant. Replication means each piece of data is stored at more than one location, so that even if one node fails, the user is still able to retrieve the data with ease from another location. In a Cassandra cluster, each row is replicated based on the row key. You can set the number of replicas you want to create. Just like scaling, data replication can also happen across multiple data centres. This further leads to high-level backup and recovery capabilities in Cassandra.
  • High Performance: The basic idea behind developing Cassandra was to harness the hidden capabilities of several multicore machines. Cassandra has made this dream come true! Cassandra has demonstrated brilliant performance under large sets of data. Thus, Cassandra is loved by those organizations that deal with huge amounts of data every day and at the same time cannot afford to lose such data.