Inverted Index and its basic implementation
What is an Inverted Index?
It is a data structure used to retrieve documents based on their content efficiently. It is used in an information retrieval system. It is an index data structure storing a mapping from content, such as words or numbers, to its location in a document or a set of documents. In simple words, it is a dictionary-like data structure that directs the user from a word to a document or a web page.
e.g. In a regular dictionary, you look up a word to find its meaning. In an inverted index, you look up a word to see which documents contain it.
Let’s build the simple inverted index using Python:
from collections import defaultdict
# Step 1
index = defaultdict(set)
documents = {
1: "apple banana mango",
2: "banana orange",
3: "mango apple banana"
}
for doc_id, text in documents.items():
for word in text.lower().split():
index[word].add(doc_id)
# Step 2
def search(index, query):
query_words = query.lower().split()
if not query_words:
return set()
result = index.get(query_words[0], set()).copy()
for word in query_words[1:]:
result &= index.get(word, set())
return result
# Step 3
query = "apple banana"
found_docs = search(index, query)
print(f"Search results for '{query}': {sorted(found_docs)}")
In step 1, the code will create an index that contains the unique words from the documents and the keys of the documents in which the word is present. The created index will look like below.
{
'apple': {1, 3},
'banana': {1, 2, 3},
'mango': {1, 3},
'orange': {2}
}
In step 2, the search function is added. The parameters are the index and the user query.
In step 3, the output of the search function will be [1, 3] as both apple and banana are present in the 1st and the 3rd document. But in case the query is modified to (query = " apple orange"), it will return an empty list, and this combination is not present in any of the documents.
Is text preprocessing needed in real-life search engines?
Yes. Text preprocessing methods like tokenization, stopword removal, lowercasing, and stemming/lemmatization are done to create an index with unique words.
Tokenization:
Lowercasing:
Stopword removal:
Lemmatization or Stemming:
These are the basics of an inverted index. I will explain this in a lot more detail at some other time.