Categorizing using Elasticsearch

I’m fortunate to work at a company that holds a ‘hack day’ once a month, when we are allowed to try out new and crazy ideas. As an advocate of using search for much more than just the ‘search page’, I decided to build a small demo of how to use Elasticsearch for categorization, and I want to share some of my ideas.

The approach

We often have a large amount of data that we know falls into different categories, and that we can use as a sample space for predicting the category of unseen data. If we index all the known data and treat the unseen data as ‘queries’, we should be able to construct queries where the score against the known data represents the similarity between the ‘objects’. But once each ‘object’ in the result has a similarity score against the ‘queried’ object, how should we interpret it? One option is to group the items in the result by their known category, calculate the average similarity score within each group, and return the category with the highest average score.

Depending on your data and problem space, constructing the similarity query might be more or less easy, but the beauty of this approach is that it is very easy to implement in Elasticsearch using parent-child mappings.
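The group-and-average step can be sketched in plain Python, outside Elasticsearch, to make the idea concrete. This is only an illustration: the function name `best_category`, the input shape, and the sample scores are all made up for this example.

```python
from collections import defaultdict


def best_category(hits):
    """Return (category, average score) for the category with the highest
    average similarity score, or None when there are no hits.

    `hits` is an iterable of (category, score) pairs, one per matching
    'known' object. The name and input shape are illustrative assumptions.
    """
    scores_by_category = defaultdict(list)
    for category, score in hits:
        scores_by_category[category].append(score)

    if not scores_by_category:
        return None

    # Average the similarity scores within each category group.
    averages = {
        category: sum(scores) / len(scores)
        for category, scores in scores_by_category.items()
    }
    winner = max(averages, key=averages.get)
    return winner, averages[winner]


# Hypothetical similarity scores for one unseen text.
sample_hits = [("sv", 0.5), ("sv", 0.25), ("en", 0.125), ("en", 0.125), ("en", 0.25)]
print(best_category(sample_hits))
```

Here ‘sv’ wins on average score even though ‘en’ has more matching items, which is exactly the behaviour the approach relies on.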

Demo: Implementing a language categorizer

We start by creating an index (I’ve named it ‘myindex’) with a parent mapping between a category type and a data type:

http://localhost:9200/myindex

{
    "mappings": {
        "data": {
            "_parent": {
                "type": "category"
            }
        }
    }
}

Then we add the known categories:

http://localhost:9200/myindex/category/sv

{
    "name": "Swedish"
}
...

We then index the ‘known’ data and relate each document to its parent category:

http://localhost:9200/myindex/data/1?parent=sv

{
    "text": "det här är en text skriven på svenska"
}
...
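With more than a handful of samples you will probably want to generate these requests rather than write them by hand. A minimal sketch, assuming the same index name and URL layout as above; the helper name `data_request` and the second sample text are invented for illustration, and the snippet only builds the URL and body without sending anything:

```python
from urllib.parse import quote, urlencode

# Same host and index as in the examples above.
BASE_URL = "http://localhost:9200/myindex"


def data_request(doc_id, category_id, text):
    """Build the URL and JSON body for one 'known' sample.

    The parent category is passed as the `parent` query parameter,
    mirroring the hand-written request above. Hypothetical helper.
    """
    query = urlencode({"parent": category_id})
    url = f"{BASE_URL}/data/{quote(str(doc_id), safe='')}?{query}"
    return url, {"text": text}


# Illustrative samples: (category id, text).
samples = [
    ("sv", "det här är en text skriven på svenska"),
    ("en", "this is a text written in english"),
]

for doc_id, (category_id, text) in enumerate(samples, start=1):
    print(data_request(doc_id, category_id, text))
```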

Using the has_child query we can then easily implement the described approach: we search for categories and have them scored by the average score of a child query issued against the data. (In this case we just use a simple query_string query with the text we want to detect the language of, ‘en svensk text’):

http://localhost:9200/myindex/category/_search

{
  "query":{
    "has_child": {
      "type": "data",
      "score_type" : "avg",
      "query" : {
        "query_string": {
          "query": "en svensk text"
        }
      }
    }
  }
}

Resulting in:

{
    "took": 27,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.3125,
        "hits": [
            {
                "_index": "myindex",
                "_type": "category",
                "_id": "sv",
                "_score": 0.3125,
                "_source": {
                    "name": "Swedish"
                }
            }
        ]
    }
}
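To run the detection for arbitrary texts, the request body and the response handling might be wrapped as below. This is a sketch under assumptions: the function names `build_category_query` and `top_category` are my own, and the snippet works on plain dictionaries instead of talking to a cluster.

```python
import json


def build_category_query(text):
    """Build the has_child search body used above for an arbitrary text."""
    return {
        "query": {
            "has_child": {
                "type": "data",
                "score_type": "avg",
                "query": {"query_string": {"query": text}},
            }
        }
    }


def top_category(response):
    """Return (id, name, score) of the best-scoring category, or None."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    best = max(hits, key=lambda hit: hit["_score"])
    return best["_id"], best["_source"]["name"], best["_score"]


# Trimmed copy of the response shown above.
response = {
    "hits": {
        "total": 1,
        "max_score": 0.3125,
        "hits": [
            {
                "_index": "myindex",
                "_type": "category",
                "_id": "sv",
                "_score": 0.3125,
                "_source": {"name": "Swedish"},
            }
        ],
    }
}

print(json.dumps(build_category_query("en svensk text")))
print(top_category(response))
```

Keep in mind that query_string interprets its input as query syntax, so free text containing characters such as ‘:’ or quotes may need escaping before it is passed in.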

Voilà! In just four steps you now have your very own language detector.

Discussion

The described language detector might be a bit simplistic, but it can easily be made more advanced by adding better analyzers for your indexed data, using stemmers or the like for the appropriate language. Beyond that, the approach can be tweaked to run much more advanced queries on more complex objects, using all the fancy query types available in Elasticsearch (just ignore the ConstantScore-query ;-)). Only your imagination stops you from creating similarity queries that might give you really good results in just a few easy steps!

So stop looking at your ‘search engine’ as a search page provider and see it as a great tool not just for querying but also for analyzing your data!

Note:

The Elasticsearch 0.20.x releases have a bug in the has_child query: it calculates the sum, not the average, of the child scores when score_type=avg is used. I’ve submitted a pull request, but if you want to try this out you can clone and build from the master branch, where the upgraded Lucene distribution circumvents the problem.
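Why the difference matters: a sum rewards categories that simply have more matching children, while an average does not. A small illustration in plain Python, with made-up scores and a hypothetical `rank` helper:

```python
def rank(scores_by_category, combine):
    """Order categories by their combined child score, best first."""
    combined = {
        category: combine(scores)
        for category, scores in scores_by_category.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


def average(scores):
    return sum(scores) / len(scores)


# Made-up child scores: 'en' has more, but weaker, matches than 'sv'.
scores = {"sv": [0.5, 0.25], "en": [0.25, 0.25, 0.25, 0.25]}

print(rank(scores, average))  # averages: sv 0.375, en 0.25
print(rank(scores, sum))      # sums: sv 0.75, en 1.0
```

With the average, ‘sv’ ranks first; with the sum, the larger ‘en’ group overtakes it.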