Saumitra's blog

Search / Analytics / Distributed Systems / Machine Learning / DSLs

Using Solr's Significant Terms to Surface Anomalies

In this post we will see how to use Solr's significantTerms streaming function to surface results that are not easily visible with regular search features.

If you are not familiar with Solr streaming expressions, go through this post for a quick introduction.

Let's consider a real-world example. We have event logs from an application in the following format:

msg,logLevel,host,className,id,thread,ts
Error uploading to S3 because source doesn't exist. Source= /tmp/compressed/30b3ea2d-1c2f-41d9-9457-eadd04f40b2c,ERROR,gbd-std-01,models.S3Storage$,83dcf5bd29d0cbc9d6a809b344d440e1,play-akka.actor.default-dispatcher-5,2016-12-01T06:23:05.254Z
Error uploading to S3 because source doesn't exist. Source= /tmp/compressed/30b3ea2d-1c2f-41d9-9457-eadd04f40b2c,ERROR,gbd-std-01,models.S3Storage$,27e741889c2ceff1bade623e20dd1be4,play-akka.actor.default-dispatcher-5,2016-12-01T06:23:05.253Z

If I want to check how my application is doing, I can filter down to all the ERROR logs and then facet on the className field to see which modules have the most errors.

Here’s the problem with this approach: it puts the results with the maximum number of occurrences on top. It doesn’t surface the “significant” errors. Maximum occurrence != significance.

For example, here’s the response from the facet query

facet(collection1,
         q="msg:error",
         buckets="className",
         bucketSorts="count(*) desc",
         bucketSizeLimit="10",
         count(*))

Instead of querying from the Solr Admin UI, you can also query the stream endpoint directly (shown unencoded here for readability; the expression should be URL-encoded before sending):

http://localhost:8983/solr/collection1/stream?expr=facet(collection1, q=msg:error, buckets=className, bucketSorts=count(*) desc,bucketSizeLimit=10, count(*))
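
If you are scripting against the stream endpoint rather than pasting into a browser, the encoding matters. Here is a minimal Python sketch; the helper name stream_url is my own, and the localhost:8983 / collection1 values are the ones assumed throughout this post:

```python
from urllib.parse import urlencode

# Hypothetical helper: build the /stream request URL for a streaming
# expression. urlencode escapes the spaces, quotes and parentheses
# that would otherwise break the raw URL.
def stream_url(base, collection, expr):
    return f"{base}/solr/{collection}/stream?" + urlencode({"expr": expr})

expr = (
    'facet(collection1, q="msg:error", buckets="className", '
    'bucketSorts="count(*) desc", bucketSizeLimit="10", count(*))'
)
print(stream_url("http://localhost:8983", "collection1", expr))
```

The printed URL can then be fetched with any HTTP client.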

This indicates that the maximum number of errors occurred in some Cassandra service, cass.Project. But it doesn’t tell me how significant those 300 errors are when compared against all events during that time range. For example, it might be that during that window I tried to write 100 million events and 300 of them failed, which makes the errors totally ignorable. It doesn’t tell me whether this is a “significant” anomaly or not.
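A tiny illustration of why raw counts mislead, with made-up numbers for the scenario above:

```python
# Raw counts vs. rates: 300 errors out of 100 million attempts is a far
# smaller failure rate than 6 errors out of 7 attempts, even though 300
# is the bigger count. All numbers here are invented for illustration.
modules = {
    "cass.Project": (300, 100_000_000),   # errors, total attempts
    "models.S3Storage$": (6, 7),
}
for name, (errors, total) in modules.items():
    print(f"{name}: {errors} errors, error rate = {errors / total:.6%}")
```

Sorting by error count puts cass.Project on top; sorting by error rate tells the opposite, and far more useful, story.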

Solr’s significantTerms can be very helpful in such cases. Let’s try to use it.

significantTerms(collection1,
         field="className",
         q="msg:error",
         minDocFreq="10",
         maxDocFreq=".20",
         minTermLength="5")

What I am saying in the above query is: give me a list of all significant terms in the className field across documents whose msg field contains the keyword error. The minDocFreq and maxDocFreq parameters bound which terms are considered (a term must appear in at least 10 documents, but in no more than 20% of the index), and minTermLength ignores terms shorter than 5 characters.

Here’s what we get in response

{
  "result-set": {
    "docs": [
      {
        "score": 14.545833,
        "term": "models.S3Storage$",
        "foreground": 6,
        "background": 7
      },
      {
        "EOF": true,
        "RESPONSE_TIME": 5
      }
    ]
  }
}

You will see that the term models.S3Storage$ has a foreground count of 6 and a background count of 7. This means there are 7 events in total for the models.S3Storage$ module, and 6 of them have the keyword error in their msg field.

We can see that models.S3Storage$ is the culprit: out of 7 attempts, 6 failed, so it’s anomalous.
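Reading those counts out of the response programmatically is straightforward; here is a sketch in Python with the JSON response from above pasted inline:

```python
import json

# The significantTerms response shown earlier, inlined for illustration.
response = json.loads("""
{
  "result-set": {
    "docs": [
      {"score": 14.545833, "term": "models.S3Storage$",
       "foreground": 6, "background": 7},
      {"EOF": true, "RESPONSE_TIME": 5}
    ]
  }
}
""")

# Skip the EOF marker tuple and turn foreground/background counts
# into a per-term error rate.
for doc in response["result-set"]["docs"]:
    if "EOF" in doc:
        continue
    rate = doc["foreground"] / doc["background"]
    print(f'{doc["term"]}: {doc["foreground"]} of {doc["background"]} events errored ({rate:.0%})')
```

The same loop works unchanged when the response contains many terms, since the EOF tuple always terminates the docs list.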

If you want to go through the code and see exactly how the score is calculated for significant terms, you can check SignificantTermsStream: https://github.com/apache/lucene-solr/blob/0c1fde664fb1c9456b3fdc2abd08e80dc8f86eb8/solr/solrj/src/java/org/apache/solr/client/solrj/io/stream/SignificantTermsStream.java#L361