In this post we will see how to use Solr’s significantTerms streaming function to surface results that are not easily visible with regular search features.
If you are not familiar with Solr streaming expressions, go through this post for a quick introduction.
Let’s consider a real-world example. We have event logs from an application in the following format:
If I want to check how my application is doing, I can filter for all the logs that are ERROR logs. I can then facet on the className field to see which modules have the highest number of errors.
Here’s the problem with this approach: it puts the results with the highest occurrence count on top. It doesn’t surface the “significant errors”. Max occurrence != significance.
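As a rough illustration of the faceting step, the sketch below builds a Solr /select URL that counts ERROR documents per className value. The base URL, the collection name logs, and the level field are assumptions made for the example; only className comes from this post.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, collection, error_filter, facet_field):
    """Return a /select URL that counts matching documents per facet value."""
    params = {
        "q": error_filter,           # hypothetical filter, e.g. level:ERROR
        "rows": 0,                   # we only want the facet counts, not the documents
        "facet": "true",
        "facet.field": facet_field,  # field to count by
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and level field; className is the field from the post.
url = build_facet_query("http://localhost:8983/solr", "logs", "level:ERROR", "className")
print(url)
```

The facet counts returned by such a query are ordered by raw frequency, which is exactly the limitation discussed next.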
For example, here’s the response from the facet query
Instead of querying from the Solr admin UI, you can also query like below:
This indicates that the highest number of errors occurred in a Cassandra service, cass.Project. It doesn’t tell me how significant these 300 errors are when compared against all events during that time range. For example, it is possible that during that time range I tried to write 100 million events and only 300 of them failed, which makes the errors negligible. The facet counts don’t tell me whether this is a “significant” anomaly or not.
Solr’s significantTerms can be very helpful in such a case. Let’s try to use it.
What the above query says is: give me the list of all significant terms in the className field for documents whose msg field contains the keyword error.
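For readers who want to try this themselves, here is a minimal sketch of how such an expression could be assembled and URL-encoded for Solr’s /stream handler. It is not the exact query used in this post: the collection name logs and the parameter values are assumptions, while limit, minDocFreq, maxDocFreq, and minTermLength are the options documented for the significantTerms stream source.

```python
from urllib.parse import urlencode


def significant_terms_expr(collection, query, field, limit=20,
                           min_doc_freq=5, max_doc_freq=0.3, min_term_length=4):
    """Build a significantTerms streaming expression as a string."""
    return (
        f'significantTerms({collection}, q="{query}", field="{field}", '
        f'limit="{limit}", minDocFreq="{min_doc_freq}", '
        f'maxDocFreq="{max_doc_freq}", minTermLength="{min_term_length}")'
    )


# "logs" is a hypothetical collection; msg and className are the fields from the post.
expr = significant_terms_expr("logs", "msg:error", "className")

# The expression goes in the "expr" parameter of a request to /solr/<collection>/stream.
body = urlencode({"expr": expr})
print(expr)
```

Note that minDocFreq drops terms that appear in very few documents overall, so it may need lowering on a small collection.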
Here’s what we get in response
You will see that the term models.S3Storage has a foreground count of 6 and a background count of 7. This means that there are 7 events in total for the models.S3Storage module, and 6 of them have the keyword error in their msg field.
We can see that models.S3Storage is the culprit: out of 7 attempts it failed for 6, so it is anomalous.
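The intuition can be checked with a few lines of arithmetic. The sketch below ranks terms by the fraction of their events that match the error query. This ratio is a simplification for illustration, not the score Solr computes, and the cass.Project background of 100 million is only the hypothetical figure from the thought experiment earlier in this post.

```python
def foreground_ratio(foreground, background):
    """Fraction of a term's documents that also match the foreground query."""
    if background <= 0:
        raise ValueError("background count must be positive")
    return foreground / background


terms = [
    # Counts taken from the significantTerms response discussed above.
    {"term": "models.S3Storage", "foreground": 6, "background": 7},
    # Hypothetical: 300 errors out of an assumed 100 million events.
    {"term": "cass.Project", "foreground": 300, "background": 100_000_000},
]

ranked = sorted(
    terms,
    key=lambda t: foreground_ratio(t["foreground"], t["background"]),
    reverse=True,
)

for t in ranked:
    ratio = foreground_ratio(t["foreground"], t["background"])
    print(f'{t["term"]}: {ratio:.6f}')
```

Ranked this way, models.S3Storage comes first even though cass.Project has fifty times as many raw errors.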
If you want to go through the code and see exactly how the score is calculated for significant terms, you can check this: https://github.com/apache/lucene-solr/blob/0c1fde664fb1c9456b3fdc2abd08e80dc8f86eb8/solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/SignificantTermsStream.java#L361
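As I read the linked source, the score is a TF-IDF-style product of the log of the foreground count and the log of an inverse background frequency. The Python sketch below mirrors that reading. Treat the exact formula as an assumption to verify against the Java code for your Solr version. The collection size of 1000 and the second term’s counts are made up for the example.

```python
import math


def significance_score(foreground, background, num_docs):
    """TF-IDF-style score: grows with foreground count, shrinks as the term gets common.

    Assumed form (check against SignificantTermsStream.java):
        (ln(foreground) + 1) * (ln((num_docs + 1) / (background + 1)) + 1)
    """
    tf_part = math.log(foreground) + 1.0
    idf_part = math.log((num_docs + 1) / (background + 1)) + 1.0
    return tf_part * idf_part


# Hypothetical collection of 1000 documents.
rare_but_failing = significance_score(foreground=6, background=7, num_docs=1000)
common_term = significance_score(foreground=300, background=900, num_docs=1000)

print(rare_but_failing > common_term)
```

Under this formula, a term that appears in most of the collection is penalised by the second factor, which is why a rare module with a high error share can outrank a module with many more raw errors.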