Implementing Typeahead with Elasticsearch
In this article, we’ll explore implementing typeahead (in other words, autocomplete) for a real dataset using the Completion Suggester. We’ll create phrases that map to our completion field and ask for suggestions based on movie titles. After that, we’ll use contexts to limit our suggestions to certain genres while applying genre-specific boosts.
The goal is an exploration of this feature, its API, and possible usage.
We will not look at performance (although the Completion Suggester is supposedly the fastest option), go through every option suggesters offer, or examine alternatives such as search-as-you-type, the Term suggester, and the Phrase suggester.
I’ll be using Open Distro for Elasticsearch 1.13, but any recent Elasticsearch version should suffice.
For the dataset, we’ll use a list of movies from Kaggle that can be found here. Let’s download and extract it:
Peeking at the structure with `head -n 1 movies_metadata.csv`, we see the columns:
Scala code to transform CSV into JSON
We’ll use a bit of Scala. The full project can be found here. We’ll use Kantan for CSV, and Circe for JSON.
Let’s set up our build.sbt:
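A minimal sketch of what such a `build.sbt` could look like — the exact versions are assumptions, so use whatever is current; note that Circe’s `@JsonCodec` macro annotation needs `-Ymacro-annotations` on Scala 2.13:

```scala
// build.sbt -- versions are assumptions; pick the latest stable releases
scalaVersion := "2.13.8"

// Needed for Circe's @JsonCodec macro annotation on Scala 2.13
scalacOptions += "-Ymacro-annotations"

libraryDependencies ++= Seq(
  "com.nrinaudo" %% "kantan.csv"         % "0.6.2",
  "com.nrinaudo" %% "kantan.csv-generic" % "0.6.2",
  "io.circe"     %% "circe-core"         % "0.14.1",
  "io.circe"     %% "circe-generic"      % "0.14.1",
  "io.circe"     %% "circe-parser"       % "0.14.1"
)
```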
Based on the headers from the previous section, we can create a corresponding case class. We’ll also annotate it with Circe’s @JsonCodec, as we’ll later serialize it into a JSON file.
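A trimmed-down sketch of such a case class (the field subset shown here is illustrative; the real class mirrors every CSV column):

```scala
import io.circe.Json
import io.circe.generic.JsonCodec

// Illustrative subset of the CSV columns; JSON-ish columns such as
// `genres` are kept as raw circe Json and handled by a custom decoder.
// Requires -Ymacro-annotations on Scala 2.13.
@JsonCodec
case class Movie(
  id: String,
  title: String,
  genres: Json,
  releaseDate: Option[String],
  voteAverage: Option[Double]
)
```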
Now we only need to tell our library how the Json type is represented in the CSV file. Looking at the data, it’s a bit hairy - so we’ll do a bit of cleaning as we parse.
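One way to sketch that cleanup is a kantan `CellDecoder`: the dataset stores its JSON-like columns with Python-style single quotes, so we naively swap quotes before handing the cell to Circe. This will mangle values that legitimately contain apostrophes, but it’s good enough for exploration:

```scala
import io.circe.{Json, parser}
import kantan.csv.{CellDecoder, DecodeError}

// Naive cleanup: turn single-quoted pseudo-JSON into something
// Circe can parse. Values containing apostrophes will break.
implicit val jsonCellDecoder: CellDecoder[Json] =
  CellDecoder.from { raw =>
    parser
      .parse(raw.replace("'", "\""))
      .left.map(err => DecodeError.TypeError(err.getMessage))
  }
```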
Let’s parse it and print it as JSON
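A parsing sketch, assuming the `Movie` case class (with fields in CSV column order) and the `Json` cell decoder are in scope:

```scala
import java.io.File
import io.circe.syntax._
import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._ // derives a RowDecoder[Movie] positionally

// Read all rows, drop the ones that fail to decode, and print a sample.
val movies: List[Movie] =
  new File("movies_metadata.csv")
    .asCsvReader[Movie](rfc.withHeader)
    .collect { case Right(movie) => movie }
    .toList

println(movies.head.asJson.spaces2)
```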
Let’s peek at the row to see if it looks fine:
Data looks fine, we’ll import it.
Start local ElasticSearch
You can use docker-compose to start both ES and Kibana
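A docker-compose sketch for Open Distro (the image tags are assumptions; the images ship with demo security defaults):

```yaml
version: '3'
services:
  elasticsearch:
    image: amazon/opendistro-for-elasticsearch:1.13.2
    environment:
      - discovery.type=single-node
    ports:
      - 9200:9200
  kibana:
    image: amazon/opendistro-for-elasticsearch-kibana:1.13.2
    environment:
      - ELASTICSEARCH_HOSTS=https://elasticsearch:9200
    ports:
      - 5601:5601
```

If you don’t need Kibana, a single `docker run --rm -p 9200:9200 -e "discovery.type=single-node" amazon/opendistro-for-elasticsearch:1.13.2` does the trick.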
Or just run ES if you aren’t interested in Kibana
Verify that everything is running properly and import data
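A quick health check, in console syntax (with Open Distro’s security plugin enabled, curl needs something like `-k -u admin:admin` against HTTPS):

```json
GET _cluster/health
```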
It seems our dataset is too large to be imported in a single bulk request. Let’s split it and import each part as a separate bulk.
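The split can be done with the standard `split` utility; the key point is splitting on an even line count so each `_bulk` action line stays paired with its source line. A self-contained sketch with a tiny fabricated payload (file names and chunk sizes are placeholders; in reality you’d split the generated movies file with a much larger `-l`):

```shell
# Fabricate a tiny bulk payload: one action line + one source line per doc
printf '{"index":{}}\n{"title":"movie %d"}\n' 1 2 3 > bulk.ndjson

# Split on an even line count so no document is cut in half
split -l 2 -d bulk.ndjson part_

ls part_*

# Then import each chunk as its own bulk request (assumes ES on localhost):
# for f in part_*; do
#   curl -s -H 'Content-Type: application/x-ndjson' \
#        -XPOST 'localhost:9200/movies/_bulk' --data-binary "@$f"
# done
```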
We have split it into 87 batches that we import.
After waiting a bit, we can verify the movies index contains data.
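Assuming the index is called movies:

```json
GET movies/_count
```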
We see there are 35932 documents.
We can query it:
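For example, with a plain match query on the title (the search term is arbitrary):

```json
GET movies/_search
{
  "query": {
    "match": { "title": "terminator" }
  }
}
```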
Our goal is to try to implement typeahead/completion suggester on the title field of this data.
For the completion suggester, we need to add a new field of type completion. I recommend reading the documentation.
Let’s add the completion field to our index mapping via a template:
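A template sketch — the template name and index pattern are assumptions; the important part is the `title_completion` field of type `completion`:

```json
PUT _template/movies_completion_template
{
  "index_patterns": ["movies_completion*"],
  "mappings": {
    "properties": {
      "title_completion": {
        "type": "completion"
      }
    }
  }
}
```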
We could populate it with the value of the movie title, such as:
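For example, a document indexed like this:

```json
{
  "title": "The Lord of the Rings: The Two Towers",
  "title_completion": "The Lord of the Rings: The Two Towers"
}
```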
This suggester would find matches on terms such as “The lord” and “The two”. However, if you asked for just “Towers” or “Lord of the rings”, you’d get zero matches (the completion suggester is essentially a prefix suggester, so input has to match left-to-right).
We could maintain a set of sensible completions for this entry; however, maintaining one for every entry would be cumbersome in this case, so we’ll generate it from our dataset using Scala instead. We could use a custom analyzer on the completion field, for example with something like a Shingle filter, but I wasn’t able to get analyzers on the completion field to be picked up on my current ES version.
No worries, we’ll get around it with a bit more code ;-). Let’s generate completions based on postfix words, so completion would trigger for inputs such as “towers”, “two towers”, “the two towers”, or “rings the two towers” for the full title. We could also use a different analyzer on the completion field (i.e. to remove stopwords), but that has its own caveats. Let’s write a bit of Scala and generate a new dataset that includes the new field:
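One way to sketch such a generator (the naming and exact cleanup here are my own; the real code lives in the linked project):

```scala
// Naive postfix generator: lowercase, strip punctuation,
// then emit every word-level postfix of the title.
def postfixes(title: String): List[String] = {
  val words = title.toLowerCase
    .replaceAll("[^a-z0-9 ]", " ")
    .split("\\s+")
    .filter(_.nonEmpty)
    .toList
  words.tails.collect { case ws if ws.nonEmpty => ws.mkString(" ") }.toList
}

postfixes("The Two Towers")
// → List("the two towers", "two towers", "towers")
```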
We have a simple function that walks through the string, filters it rather naively, and generates postfixes. We could also assign different weights to each postfix, but let’s keep it simple.
We’ll modify our resulting JSON with the title_completion field.
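Sketched with Circe’s `deepMerge`, assuming the `Movie` case class and a `postfixes` helper that produces the word-level postfixes:

```scala
import io.circe.Json
import io.circe.syntax._

// Attach the generated postfixes as the completion field's input values.
def withCompletion(movie: Movie): Json =
  movie.asJson.deepMerge(Json.obj(
    "title_completion" -> Json.obj(
      "input" -> postfixes(movie.title).asJson
    )
  ))
```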
Same as before, we’ll split the dataset and bulk-insert it into a new index: movies_completion.
Let’s try to search for “rings”:
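The suggest request looks like this (the suggestion name `title_suggest` is arbitrary):

```json
POST movies_completion/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "rings",
      "completion": {
        "field": "title_completion"
      }
    }
  }
}
```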
We get back some titles:
Notice that we also get matches based on words inside the titles, not just on the leading prefix.
Let’s search for “Lord of”:
What if we want to autocomplete only within the Fantasy genre, for example? Say we know the user is a fan of that category, or the user has already selected it. We can use contexts. Let’s set them up in the mapping first.
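A mapping sketch with a category context named movie_category (the template name and index pattern are again assumptions):

```json
PUT _template/movies_context_template
{
  "index_patterns": ["movies_context*"],
  "mappings": {
    "properties": {
      "title_completion": {
        "type": "completion",
        "contexts": [
          {
            "name": "movie_category",
            "type": "category"
          }
        ]
      }
    }
  }
}
```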
We want to index it now, so let’s change our dataset a bit so when we import it, the movie_category is populated. We would like our
title_completion field to have a structure such as this:
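The standard shape for a completion field with category contexts is an object with `input` and `contexts` keys; the concrete values here are illustrative:

```json
"title_completion": {
  "input": ["towers", "two towers", "the two towers"],
  "contexts": {
    "movie_category": ["Adventure", "Fantasy"]
  }
}
```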
We can modify the output JSON, adding all genre names from the original JSON into movie_category.
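A sketch, assuming `genres` was parsed into Circe `Json` shaped like `[{"id": …, "name": …}, …]` and a `postfixes` helper as before:

```scala
import io.circe.Json
import io.circe.syntax._

// Pull genre names out of the parsed genres column.
def genreNames(movie: Movie): List[String] =
  movie.genres.asArray.toList.flatten
    .flatMap(_.hcursor.get[String]("name").toOption)

// Completion input plus category contexts derived from the genres.
def withContexts(movie: Movie): Json =
  movie.asJson.deepMerge(Json.obj(
    "title_completion" -> Json.obj(
      "input"    -> postfixes(movie.title).asJson,
      "contexts" -> Json.obj("movie_category" -> genreNames(movie).asJson)
    )
  ))
```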
And we import to a new index
Let’s check one specific movie:
Input and contexts look good. Let’s try to run our suggester with the Fantasy category and the text query “Lord of”:
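Assuming the new index is called movies_context, the request could look like:

```json
POST movies_context/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "lord of",
      "completion": {
        "field": "title_completion",
        "contexts": {
          "movie_category": "Fantasy"
        }
      }
    }
  }
}
```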
We get back the following fantasy titles:
We could go further and include Horror as well, while still boosting fantasy titles:
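Category contexts accept per-context boosts, so the query becomes (the boost value is illustrative):

```json
POST movies_context/_search
{
  "suggest": {
    "title_suggest": {
      "prefix": "lord of",
      "completion": {
        "field": "title_completion",
        "contexts": {
          "movie_category": [
            { "context": "Fantasy", "boost": 2 },
            { "context": "Horror" }
          ]
        }
      }
    }
  }
}
```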
This yields all the Rings titles on top, plus one horror movie (in 5th place, since the default limit is 5 suggestions).
There is more we can do with completion suggesters (such as fuzzy or regex queries), but this should give a basic overview of how to use them to implement various kinds of typeahead. They are a powerful tool for fast, context-aware completion: completion suggesters are optimized for speed, which they trade for higher memory consumption.
We have chosen to generate the data for the completion field’s input, but I recommend hand-tuning it for the best results. Nonetheless, that requires careful maintenance of the completion data (unlike with the Phrase or Term suggesters), possibly even in a separate index.