# How Cassandra Stores Data on Filesystem

In order to get optimal performance from cassandra, its important to understand how it stores the data on disk. Its common problem among new users coming from RDBMS to not consider the queries while designing their column families(a.k.a tables). Cassandra’s cql interface return data in tabular format and it might give the illusion that we can query it just like any RDBMS, but that’s not the case.

All cassandra data is persisted in SSTables(Sorted String tables) inside data directory. Default location of data directory is $CASSANDRA_HOME/data/data. You can change it using data_file_directories config in conf/cassandra.yaml. On a fresh setup, here’s what data directory looks like: Each directory in data represent a keyspace. These are internal keyspaces used by cassandra. Lets create a new keyspace Since we have not yet created any table, so you will not see ks1 named directory yet in data dir, but you can check cassandra' system_schema.keyspace table. Let create a table now: As soon as you create the table, here’s how the data directory will look like It created a directory by name of <table>-<table_id>. Cassandra creates a sub directory for each table. All the data for this table will be contained in this dir. You can move around this dir to different location in future by creating a symlink. You can check for this table in system_schema.tables table You will notice the table defaults like gc_grace_seconds and memtable_flush_period_in_ms which got applied. Lets insert some data to this table: Now lets check the table data directory. You will notice that there is no new file created. Thats because C* first writes data in memory and then flushed it to disk after a certain threshold is reached. For durability purpose it writes data to a append only commit log. Commit log is shared across keyspaces. By default its location is$CASSANDRA_HOME/data/commitlog and it can be changed in conf/cassandra.yaml

To force flushing memtables data to disk as SSTable, use following command

After flushing, here’s what the content of table directory looks like. Each flush create a new SSTable on disk. For each SSTable, following set of files is created

All files share common naming convention which is --.db

• version is an alphabetic string which represents SSTable storage format version. Like mc in above in example
• generation is an index number which is incremented every time a new SSTable is created for a table. Like 5 in above example
• component represents the type of information stored in the file. Like CompressionInfo, Data, Digest, Index, etc

The table below provides a brief description for each component.

Lets insert few more entries into the table and see what data directory looks like:

Again do bin/nodetool flush to flush data to SSTable and check filesystem:

You will notice that a new SSTable with generation number 7 is created. Cassandra periodically keep merging these SSTables through a process called compaction. You can force compaction by running following command.

You will notice that after compaction is complete there is just a single SSTable with a new generation. Compaction causes a temporary spike in disk space usage and disk I/O while old and new SSTables co-exist. As it completes, compaction frees up disk space occupied by old SSTables.

Now lets see how data is exactly stored in this SSTable.

C* comes bundled with a utility called sstabledump which can be used to see content of SSTable in json or row format. Here’s what you will see

Here you will notice 2 “rows”. Don’t confuse this row with RDBMS row. This “row” actually means a paritition. Number of rows(partitions) in sstable is determined by your primary key. Here’s the key which we used:

Primary key consists of 2 parts - partitioning and clustering fields

First part of our key, i.e. “user” will be used for partitioning, and remaining part - aka action_category and action_id will be used for clustering.

To understand this better, lets create a new table with exact same definition but changed partitioning key

In above definition, now paritioning key will be user_id + action_category and clustering key will be just action_id. Lets insert exact same 2 rows and notice how they gets stored in sstable

You will notice that in this example, for each user, each category of data went into a different partition.

First rule for data modelling is that you should choose your paritioning key in a way for a single query, only 1 partition is read.

Lets try to run some queries and see the impact of partitioning:

Lets run same query on user_tracking_new

Reason you got the error is that at minimum you need to specify all columns which are part of partition key. C* needs to know a way to get to a parition before it can query data inside it. In first case it worked because only user_id was part of partition key.

That means our user_tracking_new can only be used to query user data when category is also know. You might wonder then why do you even prefer that over first CF. Reason is that in first CF, for huge volume of user active, parition can grow very large and hence have performance issues. Our goal is to keep partition size to a reasonable size to not effect query performance.

Lets try another query:

You will see that it failed to fetch the result. Reason being that above filtering needs to span 2 partitions and hence C* is warning against it. If we want to force it, then we can add ALLOW FILTERING at the end of query

## Clustering order

This is another extremely powerful feature available in Cassandra, and it allows you to naturally store records in a given order based on the value of a particular column. So every time you write to the Stocks table, Cassandra will figure out where that record is supposed to go in the physical data partitions and store the record in the order you told it to. Storing data in sorted order gives drastic query performance improvements for range queries, which is very significant in timeseries data.

Now insert same data in this CF:

If you try to fetch 1 row, you will get it in sorted order

If you change CLUSTERING order to DESC like WITH CLUSTERING ORDER BY (action_id DESC) and do same thing, you can see it now return it sorted in DESC order:

You can see same ordered getting reflected in sstable too

You will notice that for each column stored in cassandra, it has to maintain some other metadata too. You need to understand these overheads too for better capacity planning. I will cover those in some future post.

I hope you got some idea about why cqlsh looks like SQL but doesn’t behave like one. Play around with different combinations of primary key and see what data looks like in SSTable to get more understanding of data storage which will help you to model it based on your queries.