Saumitra's blog

Search / Analytics / Distributed Systems / Machine Learning / DSLs

Writing Your Own Kafka Source Connector for Apache Solr

Kafka provides a common framework, called Kafka Connect, to standardize integration with other data systems. Kafka connectors are ready-to-use components built on the Connect framework. A connector is a Source Connector if it reads from an external system and writes to Kafka, or a Sink Connector if it reads data from Kafka and writes to an external system.

In this post, we will see how to implement our own Kafka source connector. Our source connector will read data from an Apache Solr collection using CursorMark and write to a Kafka topic. Full code for this post with deployment instructions is available at
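The heart of such a connector is the CursorMark pagination loop. Here is a minimal sketch of that loop in Scala, with `fetchPage` standing in for a real SolrJ query that sorts on the uniqueKey field and passes the `cursorMark` parameter (the name and shape of `fetchPage` are assumptions for illustration, not the post's actual code):

```scala
// Hedged sketch of CursorMark deep paging, independent of SolrJ.
// fetchPage(cursor) is assumed to return the page of documents plus
// the nextCursorMark that Solr sent back for that request.
def readAll[A](fetchPage: String => (Seq[A], String)): Seq[A] = {
  @annotation.tailrec
  def loop(cursor: String, acc: Seq[A]): Seq[A] = {
    val (docs, nextCursor) = fetchPage(cursor)
    // Solr signals the end of results when the returned nextCursorMark
    // equals the cursorMark we sent in the request.
    if (nextCursor == cursor) acc ++ docs
    else loop(nextCursor, acc ++ docs)
  }
  loop("*", Seq.empty) // "*" is the initial cursorMark
}
```

In the real connector, the last `nextCursorMark` would be stored as the source offset so the task can resume where it left off after a restart.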

Deploying Kafka Dependent Scala Microservices With Docker


In this post we will see how to use docker-compose and sbt-docker to deploy Scala microservices. We will create microservices to (1) fetch tweets using the Twitter streaming API and put them in Kafka, and (2) read from Kafka and count the hashtags in each tweet. We will then see how to use the sbt-docker plugin to create separate Docker images for our services. Finally, we will use docker-compose to define the environment for our services and run them.
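The environment described above can be sketched as a docker-compose file along these lines (the service and image names here are illustrative assumptions; the post's actual compose file lives in the linked repo):

```yaml
version: "2"
services:
  zookeeper:
    image: wurstmeister/zookeeper
  kafka:
    image: wurstmeister/kafka
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper
  # hypothetical images built locally by the sbt-docker plugin
  tweet-producer:
    image: tweet-producer:latest
    depends_on:
      - kafka
  hashtag-counter:
    image: hashtag-counter:latest
    depends_on:
      - kafka
```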

Search and Analytics on Streaming Data With Kafka, Solr, Cassandra, Spark

In this blog post we will see how to set up a simple search and analytics pipeline on streaming data in Scala.

  • For sample time-series data, we will use the Twitter stream.
  • For data pipelining, we will use Kafka.
  • For search, we will use Solr. We will use Banana as a UI query interface for Solr data.
  • For analytics, we will store data in Cassandra. We will see an example of using Spark to run analytics queries. We will use Zeppelin as a UI query interface.

How Cassandra Stores Data on Filesystem

In order to get optimal performance from Cassandra, it's important to understand how it stores the data on disk. It's a common problem among new users coming from an RDBMS background to not consider the queries while designing their column families (a.k.a. tables). Cassandra's CQL interface returns data in tabular format, and it might give the illusion that we can query it just like any RDBMS, but that's not the case.
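To make the point concrete, here is an illustrative schema (an example of my own, not taken from the post): all rows sharing a partition key live together on one node and are stored contiguously in SSTables, sorted by the clustering column, so queries must follow that layout.

```sql
-- All rows for one sensor_id form a single partition on disk,
-- sorted by event_time within the partition.
CREATE TABLE metrics (
  sensor_id  text,
  event_time timestamp,
  value      double,
  PRIMARY KEY ((sensor_id), event_time)
);

-- Efficient: reads one partition sequentially.
--   SELECT * FROM metrics WHERE sensor_id = 's1' AND event_time > '2016-01-01';
-- Inefficient/disallowed without the partition key:
--   SELECT * FROM metrics WHERE event_time > '2016-01-01';
```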

Creating DSL With Antlr4 and Scala

Domain specific languages, when done right, help a lot in improving developer productivity. The first thing you need while creating a DSL is a parser that can take a piece of text and transform it into a structured format (like an Abstract Syntax Tree) so that your program can understand it and do something useful with it. DSLs tend to stay around for years, so while choosing a tool to create the parser for your DSL you need to make sure it is easy to maintain and evolve the language. For parsing a simple DSL, you can just use regular expressions or Scala's built-in parser combinators, but for even a slightly complex DSL, both of these become performance and maintenance nightmares.

In this post we will see how to use Antlr4 to create a basic grammar and use it in Scala. Full code and grammar for this post is available at
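For a flavour of what an Antlr4 grammar looks like, here is a tiny arithmetic-expression grammar (an illustrative example, not the grammar from the post):

```
grammar Expr;

expr : expr ('*'|'/') expr   // precedence comes from rule order
     | expr ('+'|'-') expr
     | INT
     ;

INT  : [0-9]+ ;
WS   : [ \t\r\n]+ -> skip ;  // ignore whitespace
```

Running the Antlr4 tool on a `.g4` file like this generates the lexer, parser, and the listener/visitor base classes that your Scala code builds on.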

Antlr4 - Visitor vs Listener Pattern

In the previous post, we saw how to create and parse a DSL with antlr4. In this post we will compare the two tree-walking mechanisms provided by the library: Listener vs Visitor. Both approaches have their own advantages, and the choice of method depends on what you are using antlr for.

Setting Up Solr Healthcheck Alert Using Zookeeper Watches

In this post we will see how to get instant health alerts if any replica of any collection becomes unhealthy in a cluster.

For this purpose, we will use Zookeeper watches. But before we go there, let's see how Solr maintains state information for a collection.