A couple of weeks ago I had the privilege of speaking at the Elasticsearch Meetup in Stockholm, where I showed a slightly tweaked version of the ‘language categorizer’ I wrote about a couple of months ago. I’d like to share the changes I made (the naive approach wasn’t accurate enough to be presented live in public ;-)).
First, instead of using the default mapping for the ‘text’ field, we use the nGram tokenizer, configured to index all 2- and 3-letter ngrams of the text. Each language has different frequencies for each letter sequence, and we want to detect the language not by the words in the query but by how common its letter sequences are (for example, ‘th’ is very common in English but rare in Swedish). This way we can detect the language of a word sequence even if none of its words actually appear in the training set. So, we extend our mappings to use the nGram tokenizer:
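A minimal sketch of what such a mapping could look like (the index settings, analyzer and type names, and the `_parent` relation are my assumptions, and the exact DSL depends on your Elasticsearch version):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "sentence": {
      "_parent": { "type": "language" },
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer"
        }
      }
    }
  }
}
```

With this analyzer, a sentence is indexed as its overlapping 2- and 3-letter sequences rather than as whole words.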
When querying, we then want to calculate the average score for each language using the has_child query. However, we don’t want the query to filter out any hits; instead it should return a 0-score for a document that doesn’t match. Otherwise the score for one language might be based on, say, one tenth of all documents for that language while the score for another is based on half of its documents. To achieve this we use the ‘boosting’ query, where we simply give a boost to all documents matching the query:
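I don’t have the original snippet here, but the idea can be sketched roughly as follows: a `match_all` clause ensures every child document matches with a baseline score, while the ngram match adds the actual boost on top, so the average is always taken over all of a language’s documents. The field, type, and example text are assumptions, and clause names like `score_type` vary between Elasticsearch versions:

```json
{
  "query": {
    "has_child": {
      "type": "sentence",
      "score_type": "avg",
      "query": {
        "bool": {
          "should": [
            { "match": { "text": "the text we want to classify" } },
            { "match_all": { "boost": 0 } }
          ]
        }
      }
    }
  }
}
```

Running this against the parent documents and picking the language with the highest average child score gives the classification.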
Using these simple ‘tweaks’ we have actually created a very accurate language detector. If you want to give it a try, you can create training sets from the Leipzig Corpora Collection for the different languages you want to detect.
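The intuition behind the ngram trick can also be demonstrated without Elasticsearch. A small Python sketch (the sample sentences are made up purely for illustration, not real training data) shows how the frequency of a single bigram like ‘th’ already separates the two languages:

```python
from collections import Counter

def ngrams(text, n):
    """All overlapping letter n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Tiny illustrative samples (assumptions, not real corpora).
english = "the quick brown fox jumps over the lazy dog near the river"
swedish = "den snabba bruna raven hoppar over den lata hunden vid floden"

en_counts = Counter(ngrams(english, 2))
sv_counts = Counter(ngrams(swedish, 2))

# 'th' is very common in English but rare in Swedish.
print(en_counts["th"], sv_counts["th"])
```

A real detector compares the full distribution of 2- and 3-grams, which is exactly what the ngram-analyzed scoring above does at scale.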