How to find related or duplicate items with IndexDen

Many applications and websites face the challenge of finding “related items”, such as:

  • Related articles in a blog
  • Related products in a shop
  • And so on

How to determine a related item?
When comparing text objects, related items are determined by the percentage of text similarity. For instance, if text A is more than 80% similar to text B, then the two text objects are related to each other.
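
As a rough illustration of that idea, the Ruby sketch below measures similarity as the share of words in one text that also appear in another. The helper name similarity_percent and the plain lowercase word split are assumptions made for this example; they are not part of IndexDen.

```ruby
# Hypothetical helper: percentage of the words in text_a that also occur in text_b.
# Words are compared case-insensitively; no stemming or stop-word handling.
def similarity_percent(text_a, text_b)
  words_a = text_a.downcase.scan(/\w+/).uniq
  words_b = text_b.downcase.scan(/\w+/).uniq
  return 0 if words_a.empty?
  ((words_a & words_b).size * 100.0 / words_a.size).round
end

a = "the quick brown fox"
b = "the quick red fox"
puts similarity_percent(a, b)   # 3 of the 4 words are shared
```

With an 80% threshold, these two sample texts would fall just short of being treated as related.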

How can IndexDen help?

IndexDen recently added a new feature to the API: the quorum operator. With the quorum operator you can match documents that contain at least a given number of the query words.
For example, a search query like “the world is a wonderful place”/3 will match all documents that contain at least 3 of the 6 specified words.
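
To make the matching rule concrete, here is a small local stand-in written in Ruby. It only mimics the counting logic on plain strings; quorum_match? is a hypothetical name, and the real engine may tokenize or normalize words differently than this exact-word comparison does.

```ruby
# Hypothetical local stand-in for the quorum operator: true when at least
# `limit` distinct query words occur in `text` (exact, case-insensitive match).
def quorum_match?(query, text, limit)
  query_words = query.downcase.scan(/\w+/).uniq
  text_words  = text.downcase.scan(/\w+/).uniq
  (query_words & text_words).size >= limit
end

query = "the world is a wonderful place"
puts quorum_match?(query, "a wonderful world", 3)   # three query words are present
puts quorum_match?(query, "a wonderful day", 3)     # only two query words are present
```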

OK, let's try a real example with IndexDen.

If you don’t have an IndexDen account, create one here.

I created an index named ‘titles’ with 3 documents in it.

 [
{
"docid":"doc1",
"fields":{"title":"the world is a wonderful place to live"}
},
{
"docid":"doc2",
"fields":{"title":"the world is a beautiful place"}
},
{
"docid":"doc3",
"fields":{"title":"I was born and living in a wonderful place "}
}
]

Then, in the IndexDen dashboard, I changed the default formula #0 for the ‘titles’ index from ‘-age’ to ‘relevance’. By default results are sorted by document age, and here we need them sorted by relevance.

Time for testing.

Let's say we want to find all items related to the title “Our world is amazing place to live” with a similarity of around 30%.

First we need to calculate the quorum limit from the number of words in the search term “Our world is amazing place to live”. The term has 7 words, so I set the quorum limit to 2 words to cover around 30% of them.
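
The calculation can be sketched in Ruby as follows. Both helpers (quorum_limit and quorum_query) are hypothetical names introduced for this example, not part of the indextank gem, and rounding down is one possible choice that I assume here.

```ruby
# Hypothetical helpers, not part of the indextank gem.

# Number of words that must match to reach roughly `percent` similarity.
# Integer division rounds down; the result is never less than 1.
def quorum_limit(term, percent)
  word_count = term.split.size
  [word_count * percent / 100, 1].max
end

# Wrap the term in double quotes and append the quorum limit.
def quorum_query(term, percent)
  %("#{term}"/#{quorum_limit(term, percent)})
end

term = "Our world is amazing place to live"
puts quorum_query(term, 30)   # "Our world is amazing place to live"/2
```

Rounding down keeps the limit slightly permissive, so a borderline document is returned rather than dropped.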

For simplicity I will use the Ruby command line to fire queries. See how to install the Ruby client here.

# Initialize the client and the index object
require 'indextank'
api = IndexTank::Client.new ""  # pass your private API URL here
index = api.indexes "titles"

query = '"Our world is amazing place to live"/2'
results = index.search(query, :fetch => 'title')

The results:

[
{"docid"=>"doc1", "query_relevance_score"=>"3500", "title"=>"the world is a wonderful place to live"},
{"docid"=>"doc2", "query_relevance_score"=>"2474", "title"=>"the world is a beautiful place"},
{"docid"=>"doc3", "query_relevance_score"=>"1474", "title"=>"I was born and living in a wonderful place "}]

As you can see, all 3 documents were found because the quorum limit is low. By the way, take a look at the query relevance score: the higher the score, the more similar the document is to the search term. Now let's tune the quorum limit to keep only documents that are around 50% similar.
I set the quorum limit to 3 (3 of the 7 query words), which is around 50% similarity.

query = '"Our world is amazing place to live"/3'
results = index.search(query, :fetch => 'title')

The results:

[
{"docid"=>"doc1", "query_relevance_score"=>"3500", "title"=>"the world is a wonderful place to live"},
{"docid"=>"doc2", "query_relevance_score"=>"2474", "title"=>"the world is a beautiful place"}]

Yep, now only two documents were found.

And finally, let's set the quorum limit to 5 (5 of the 7 query words), which approximates 80% text similarity.

query = '"Our world is amazing place to live"/5'
results = index.search(query, :fetch => 'title')

The results:

[{"docid"=>"doc1", "query_relevance_score"=>"3500", "title"=>"the world is a wonderful place to live"}]

We found that only one of the 3 documents is similar enough to pass this threshold.

So far I have described how to use the quorum operator to find related documents in an IndexDen index.
Other use cases for the quorum operator are:

  • Find duplicate records in the index
  • No more empty search results. See the article http://blog.indexden.com/2012/new-fuzzy-search-feature-quorum-operator-in-indexden-api
  • Forget about putting an ‘OR’ operator between every word to match any of them. You can replace the ‘OR’s with a quorum limit of 1: “many search terms here”/1 is equivalent to “many OR search OR terms OR here”
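
As a small sketch of the last point, the two query forms can be built from the same input string. The helper names any_word_query and or_query are hypothetical and only produce query strings; nothing here talks to the API.

```ruby
# Hypothetical helper: build the quorum form of an "any of these words" query.
def any_word_query(terms)
  %("#{terms}"/1)
end

# The equivalent query written with explicit OR operators, for comparison.
def or_query(terms)
  terms.split.join(" OR ")
end

terms = "many search terms here"
puts any_word_query(terms)  # "many search terms here"/1
puts or_query(terms)        # many OR search OR terms OR here
```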

Don’t hesitate to try IndexDen: it is free for up to 15k documents. We also offer a micro plan (100k document limit) for only $11.
