After deciding on the tech stack for Potterverse, it was time for me to start building something nice. The process started with creating the first design, downloading and processing the dataset, extracting the required information and dumping everything into Elasticsearch.
The first design
The very first product design I had in mind looked like this
The left half of the page contains the search results for the given term,
and the right half contains a small preview of the page that is selected/tapped. If the user
taps on the title of a result, it opens the source page in a new tab. Each item in the results
list carries some information about the page.
The short preview of the page helps the user get a snippet of information about the
entity that is tapped. After reading the snippet, if the user wants to read more from
the page, there is a
Read More link which opens the source page in a new tab.
The results page also shows the time taken to perform the search and the approximate number of relevant documents for the given search query.
Downloading and processing the dataset
The very basic version of Potterverse is built on top of the Harry Potter Wikia dataset from Wikia. The dataset is very similar to the Wikipedia dataset and is also in MediaWiki format, which makes my life much simpler, as I had already written a SAX parser for my very first search engine built on top of Wikipedia data.
With slight modifications to that code, I converted the entire dataset into
JSON, a format loved by both Python and Elasticsearch. Along with extracting
basic information like the
Body, the parser also extracts
ExternalLinks explicitly. Each record is serialized to JSON and dumped to disk, one
file per document.
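To give an idea of what such a parser looks like, here is a minimal sketch of a SAX handler for a MediaWiki XML export. The tag names (page, title, text) follow the standard MediaWiki dump schema, but the handler itself is illustrative and not Potterverse's actual parser.

```python
import xml.sax

class WikiPageHandler(xml.sax.ContentHandler):
    """Collects Title and Body from each <page> in a MediaWiki dump."""

    def __init__(self, on_page):
        super().__init__()
        self.on_page = on_page  # callback invoked once per completed <page>
        self.tag = None
        self.buffer = []
        self.page = {}

    def startElement(self, name, attrs):
        self.tag = name
        self.buffer = []
        if name == "page":
            self.page = {}

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.tag in ("title", "text"):
            self.buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.page["Title"] = "".join(self.buffer)
        elif name == "text":
            self.page["Body"] = "".join(self.buffer)
        elif name == "page":
            self.on_page(self.page)
        self.tag = None

def parse_dump(xml_string):
    """Parse a dump string and return one dict per page, ready to be JSONed."""
    pages = []
    xml.sax.parseString(xml_string.encode("utf-8"), WikiPageHandler(pages.append))
    return pages
```

Each returned dict can then be written out with `json.dump`, one file per document, as described above.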
The dataset contains tonnes of insignificant documents, which are filtered out before processing. After filtering we are left with around 13,350 documents,
and this now becomes the very first corpus on top of which the first version of the Potterverse
search engine is built.
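The filtering step can be sketched as below. The exact set of pages Potterverse drops isn't spelled out here, so as an illustration this skips non-article MediaWiki namespaces (File:, Category:, and so on) and redirect stubs; the prefixes are assumptions, not the real filter list.

```python
# Illustrative namespace prefixes to skip; the actual filter criteria
# used by Potterverse may differ.
SKIP_PREFIXES = ("File:", "Category:", "Template:", "User:", "Talk:")

def is_significant(page):
    """Keep only real article pages (assumed criteria)."""
    title = page.get("Title", "")
    body = page.get("Body", "")
    if title.startswith(SKIP_PREFIXES):
        return False
    if body.lstrip().lower().startswith("#redirect"):
        return False
    return True

def filter_corpus(pages):
    return [p for p in pages if is_significant(p)]
```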
Indexing the data in Elasticsearch
The list of information required on the interface includes Title, SourceLink, ShortExcerpt and ShortPreview.
Since Elasticsearch should hold everything required on the interface, we need to compute, process and dump the information beforehand. Let's see what we have and what we don't:
Title is present in the dump file as is.
SourceLink is the base URL of the Wikia with the
Title appended to it.
ShortExcerpt is the first 256 characters of the Body.
ShortPreview is the first 3000 characters of the Body.
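Putting the four fields together, the document-building step might look like this sketch. The Wikia base URL and the space-to-underscore link convention are assumptions for illustration.

```python
# Assumed base URL for the Harry Potter Wikia (now hosted on Fandom).
BASE_URL = "https://harrypotter.fandom.com/wiki/"

def build_document(page):
    """Compute the interface fields from a parsed dump record."""
    body = page["Body"]
    return {
        "Title": page["Title"],  # picked as-is from the dump file
        "SourceLink": BASE_URL + page["Title"].replace(" ", "_"),
        "ShortExcerpt": body[:256],    # first 256 characters of Body
        "ShortPreview": body[:3000],   # first 3000 characters of Body
    }
```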
So a sample document dumped into the Elasticsearch index will look something like this
Title is the title of the document picked as is from the dump file,
Excerpt is the
first 3000 characters of the
Body, and the
Body itself is stored alongside. I persist
Excerpt so that, while
querying Elasticsearch, I need not fetch the entire
Body of the page; instead I can fetch only the
Excerpt, in turn saving a lot of network bandwidth.
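The bandwidth saving can be enforced at query time with Elasticsearch's `_source` filtering, which tells the cluster to return only the named fields and never ship the heavy Body back. A small sketch, using the field names described above:

```python
def with_source_filter(query_body):
    """Return a copy of an Elasticsearch request body that fetches only
    the lightweight fields the UI needs, leaving Body out of the response."""
    body = dict(query_body)
    body["_source"] = ["Title", "SourceLink", "Excerpt"]
    return body
```

The resulting body is what gets passed to the client's search call; only Title, SourceLink and Excerpt come back over the wire.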
The mapping for the index is raw and default, with the default analyzer, default tokenizers and default settings; in short, no customizations.
Querying the data
For the first version of the search engine, the query fired on Elasticsearch is a very basic
one, with a boost given to
Title matches and none given to
Body. Fuzziness is set to
AUTO for both fields, which also makes the search engine typo tolerant.
For the query
Harry Potter, the Elasticsearch query that is fired looks like this
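A `multi_match` query matching that description can be built as below; the boost factor of 2 on Title is an assumption for illustration, not the exact value Potterverse uses.

```python
def build_query(term):
    """Build a basic Elasticsearch query body: boost Title over Body and
    enable typo tolerance via AUTO fuzziness on both fields."""
    return {
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["Title^2", "Body"],  # ^2 boost is an assumed value
                "fuzziness": "AUTO",
            }
        }
    }
```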
The above query, when fired on Elasticsearch, returns nice results, good enough to drive the first version of the search engine. The default tf-idf scoring works well and the results are quite relevant.
With very minimal effort, this is how I spun up a nice-looking Harry Potter search engine. It was not difficult, and the entire setup took very little time. As and when modifications are made to Potterverse, a blog post will be published detailing the changes, improvements and results that have been achieved.