I often come across a debate amongst techies about ‘speed vs relevance’ while they are building search/ retrieval information applications. It is an important consideration for this kind of an application. However, there is no single answer to it. I am going to share how I approach this challenge.
I start by asking 2 questions:
- How long do you wait on a page if you are searching for something and waiting for the result? Few secs, few milliseconds? No, we are talking about micro or nanoseconds.
- Do you get upset when your search query provides result that are not relevant to your query? Ofcourse. So, what we do. We try to adjust our search term/word to get the relevant results/answers.
Relevancy has become a really important consideration when we are searching or reading up on the internet or any app especially when the data size is huge.
Relevancy ranking is the process of sorting the document results so that those documents which are most likely to be relevant to your query are shown right at the top.
But this process needs to be defined in order to appear when one is looking for it. And, this is how it can be defined:
- Typo: Will rank higher when a word doesn’t contain a typo or a spelling mistake
- Words: Will rank higher when an object matches the words typed in by a user – if the query contains more than one word
- Proximity: Will rank higher when words are having no unwanted conjunctions. For eg: If you’re looking up George Clooney then people won’t type George ‘and’ Clooney or “George ‘some word’ Clooney. Word proximity matters in ranking.
- Attribute: Will rank higher when the match is a more important attribute (Title, Description).
Say: There is a big article and it has titles, description, subtitles, footnotes etc., then one can define the key/important words and base the ranking accordingly. Or an algorithm may work on its own and pick random words based on the Title.
- Words position: Will rank higher when the accurate words are at the beginning of an attribute
- Exact: Will rank higher when words match exactly without any suffix (Algolia natively handles prefix search)
There are a few algorithms designed to find relevant results. Most commonly used ones are:
- TF-IDF
- Vector Space Model
- Boolean Model
- K Nearest Neighbor (KNN)
- Custom Build
Again, the documents which are most relevant are not necessarily those which are most useful to display on the first page of search results. For example, two duplicate documents might be individually considered quite relevant, but it is only useful to display one of them. A measure called “maximal marginal relevance” (MMR) has been proposed to overcome this shortcoming. It considers the relevance of each document only in terms of how much new information it brings given the previous results.
Apache Lucene, on which the Elasticsearch engine is built, uses TF-IDF, Vector Space Model as well as Boolean Model. One can actually build or change their own method of searching based on Elasticsearch.
Let’s find out how TF-IDF works?
TF- Term Frequency. Number of occurrence of the word in given document/s
IDF- Inverse Document Frequency. Number of given document/s has that given word
TF: One can find the value of given term (word) in given document/s like tf (t,d).This means that number of times the term t occurs in document d.
Let’s take an example using 2 sentences as below:
I have document d1: This is called Cricket.
And I have document d2: This is another game.
Terms d1 | Frequency | Terms d2 | Frequency |
this | 1 | this | 1 |
is | 1 | is | 1 |
called | 2 | another | 2 |
cricket | 1 | game | 3 |
I need to find out weightage of each word in the document so that it is easy to provide a result. By tf formula, suppose I have to find weightage of the word “this” which appears only once in both documents. But the number of words in d2 is more.
Tf{“this”,d1}= 1/5= 0.2
Tf{“this”,d2}=1/7 ~0.14
IDF, in this case, is same for both documents as there are 2 number of documents and term appears twice so
Idf{“this”,D}=log{2/2}=0
So tf-idf in this case is
Tfidf{“this”,d1}=0.2×0=0
Tfidf{“this”,d2}=0.14×0=0
So this is not a very interesting or relevant term in given documents.
Let’s look at the term ‘game’ here.
Tf{“game”,d1}=0/5=0
Tf{“game”,d2}=3/7=~0.429
Idf{“game”D}=log(2/1)=0.301
Finally,
Tfidf{“game”,d1}=0x0.301=0
Tfidf{“game”,d2}=0.429×0.301=~0.13
So, the word ‘game’ is the most relevant word in document d2.
And, that’s how TF-IDF algorithm works.
In case of ElasticSearch engine which is based on Apache Lucene, uses “_score” filed as metadata which gives the weightage as “_score” on each document. This value “_score” calculated based on tfidf and some other algorithms helps provide relevant results. We can customize this search results by providing other mechanisms like ‘Boosting’.
I’ll be taking this up in my next blog – which will cover features/API to be used in ElasticSearch to achieve relevant search results and other popular algorithms. Watch out for the June issue!
Ajit Gadge I Senior Database Consultant
Ajit brings over 16 years of solid experience in solution architecting and implementation. He has successfully delivered solutions on various database technologies including PostgreSQL, MySQL, SQLServer. His derives his strength from his passion for database technologies and love of pre-sales engagement with customers. Ajit has successfully engaged with customers from across the globe including South East Asia, USA and India. His commitment and ability to find solutions has earned him great respect from customers. Prior to Ashnik he has worked with EnterpriseDB for 7 years and helped it grow in South East Asia and India