Solr significant terms
In this post we will see how to use Solr's significantTerms streaming function to surface results that are not easily visible with regular search features.
If you are not familiar with Solr streaming expressions, go through this post for a quick introduction.
Let's consider a real-world example. We have event logs from an application in the following format:
msg,logLevel,host,className,id,thread,ts
Error uploading to S3 because source doesn't exist. Source= /tmp/compressed/30b3ea2d-1c2f-41d9-9457-eadd04f40b2c,ERROR,gbd-std-01,models.S3Storage$,83dcf5bd29d0cbc9d6a809b344d440e1,play-akka.actor.default-dispatcher-5,2016-12-01T06:23:05.254Z
Error uploading to S3 because source doesn't exist. Source= /tmp/compressed/30b3ea2d-1c2f-41d9-9457-eadd04f40b2c,ERROR,gbd-std-01,models.S3Storage$,27e741889c2ceff1bade623e20dd1be4,play-akka.actor.default-dispatcher-5,2016-12-01T06:23:05.253Z
If I want to check how my application is doing, I can filter the logs down to the ERROR entries. I can then facet on the className field to see which modules have the most errors.
Here's the problem with this approach: it puts the results with the highest count on top. It doesn't surface the "significant" errors. Max occurrence != significance.
For example, here's the facet query:
facet(collection1,
q="msg:error",
buckets="className",
bucketSorts="count(*) desc",
bucketSizeLimit="10",
count(*))
Instead of running the expression from the Solr Admin UI, you can also send it to the stream endpoint over HTTP:
http://localhost:8983/solr/collection1/stream?expr=facet(collection1, q=msg:error, buckets=className, bucketSorts=count(*) desc,bucketSizeLimit=10, count(*))
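The expression in that URL contains spaces, parentheses, and asterisks, so it is safer to URL-encode it before sending. Here is a minimal Python sketch of that, assuming Solr is reachable at localhost:8983 with a collection named collection1 as in the example above. The helper names build_stream_url and run_stream are my own and not part of Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, taken from the example above; adjust for your setup.
SOLR_BASE = "http://localhost:8983/solr"


def build_stream_url(collection, expr, base=SOLR_BASE):
    """Return the /stream URL with the expression URL-encoded."""
    return f"{base}/{collection}/stream?{urlencode({'expr': expr})}"


def run_stream(collection, expr, base=SOLR_BASE):
    """Send the expression to Solr and return the list of result tuples."""
    with urlopen(build_stream_url(collection, expr, base)) as resp:
        return json.load(resp)["result-set"]["docs"]


# The same facet expression as above, written on one line.
FACET_EXPR = (
    'facet(collection1, q="msg:error", buckets="className", '
    'bucketSorts="count(*) desc", bucketSizeLimit="10", count(*))'
)

if __name__ == "__main__":
    # Only prints the encoded URL; call run_stream() against a live Solr.
    print(build_stream_url("collection1", FACET_EXPR))
```

Calling run_stream("collection1", FACET_EXPR) against a running instance should return the same tuples you would see in the Admin UI, including the trailing EOF tuple.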
The response indicates that the highest number of errors occurred in a Cassandra service, cass.Project. It doesn't tell me how significant these 300 errors are when compared against all events during that time range. For example, it might be that during that time range I tried to write 100 million events and only 300 of them failed, which makes the errors totally ignorable. It doesn't tell me whether this is a "significant" anomaly or not.
Solr's significantTerms can be very helpful in such a case. Let's try it.
significantTerms(collection1,
field="className",
q="msg:error",
minDocFreq="10",
maxDocFreq=".20",
minTermLength="5")
What this query says is: give me the significant terms in the className field among the documents whose msg field contains the keyword error.
Here's what we get in response:
{
"result-set": {
"docs": [
{
"score": 14.545833,
"term": "models.S3Storage$",
"foreground": 6,
"background": 7
},
{
"EOF": true,
"RESPONSE_TIME": 5
}
]
}
}
In the response, the term models.S3Storage$ has a foreground count of 6 and a background count of 7. This means that there are 7 events in total for the models.S3Storage$ module, and 6 of them have the keyword error in their msg field.
We can see that models.S3Storage$ is the culprit: out of 7 attempts, 6 failed, so it's anomalous.
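To make that reading concrete, here is a small Python sketch that parses the response shown above and computes foreground / background for each term. This ratio is only the intuition used in this post; it is not how Solr computes the score field. The function name foreground_share is hypothetical.

```python
import json

# The response shown above, with whitespace condensed.
RESPONSE = """
{"result-set": {"docs": [
  {"score": 14.545833, "term": "models.S3Storage$", "foreground": 6, "background": 7},
  {"EOF": true, "RESPONSE_TIME": 5}
]}}
"""


def foreground_share(response_text):
    """Map each term to foreground / background, skipping the EOF tuple."""
    docs = json.loads(response_text)["result-set"]["docs"]
    return {
        d["term"]: d["foreground"] / d["background"]
        for d in docs
        if not d.get("EOF")
    }


shares = foreground_share(RESPONSE)
for term, share in shares.items():
    print(f"{term}: {share:.1%} of its events match msg:error")

# The hypothetical from earlier: 300 failures out of 100 million writes.
print(f"hypothetical: {300 / 100_000_000:.4%}")
```

For models.S3Storage$ the share is 6 out of 7, roughly 86 percent, while the hypothetical 300 failures in 100 million writes is a tiny fraction of a percent. That gap is what "significance" is meant to capture here.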
If you want to go through the code and see exactly how the score is calculated for significant terms, you can check this link.