Inverted Index and its basic implementation

Abhirup Chandra

Published Jul 25, 2025

What is an Inverted Index?

It is a data structure used to retrieve documents based on their content efficiently. It is used in an information retrieval system. It is an index data structure storing a mapping from content, such as words or numbers, to its location in a document or a set of documents. In simple words, it is a dictionary-like data structure that directs the user from a word to a document or a web page.

e.g. In a regular dictionary, you look up a word to find its meaning. In an inverted index, you look up a word to see which documents contain it.

Let’s build the simple inverted index using Python:

from collections import defaultdict

# Step 1
index = defaultdict(set)
documents = {
    1: "apple banana mango",
    2: "banana orange",
    3: "mango apple banana"
}

for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Step 2
def search(index, query):
    query_words = query.lower().split()
    if not query_words:
        return set()

    result = index.get(query_words[0], set()).copy()
    for word in query_words[1:]:
        result &= index.get(word, set())
    return result

# Step 3
query = "apple banana"
found_docs = search(index, query)
print(f"Search results for '{query}': {sorted(found_docs)}")

In step 1, the code will create an index that contains the unique words from the documents and the keys of the documents in which the word is present. The created index will look like below.

{
'apple': {1, 3},
'banana': {1, 2, 3},
'mango': {1, 3},
'orange': {2}
}

In step 2, the search function is added. The parameters are the index and the user query.

In step 3, the output of the search function will be [1, 3] as both apple and banana are present in the 1st and the 3rd document. But in case the query is modified to (query = " apple orange"), it will return an empty list, and this combination is not present in any of the documents.

Is text preprocessing needed in real-life search engines?

Yes. Text preprocessing methods like tokenization, stopword removal, lowercasing, and stemming/lemmatization are done to create an index with unique words.

Tokenization:

Big sentences are broken down by words or tokens using spaces between them. It also removes unnecessary punctuation.

Lowercasing:

All the tokens are converted to lowercase to make the search function case-insensitive.

Stopword removal:

Removal of common words that don’t add much meaning. e.g the, is, and, of, etc.

Lemmatization or Stemming:

To reduce the tokens to their base/root form.

running to run
cats to cat
better to good

These are the basics of an inverted index. I will explain this in a lot more detail at some other time.

To view or add a comment, sign in

Inverted Index and its basic implementation

Abhirup Chandra

What is an Inverted Index?

Is text preprocessing needed in real-life search engines?

More articles by Abhirup Chandra

Explore content categories

What is an Inverted Index?

Is text preprocessing needed in real-life search engines?

More articles by Abhirup Chandra

Your Database Promised Your Data Is Safe. Here's How It Actually Keeps That Promise.

I in ACID (Database In Depth)

Your Database Transaction Scope Is Holding More Than You Think

Database Connection Pooling

3 Things to remember when you are dealing with Database Indexes

Token Bucket Algorithm

Mutex and Thread safety in Multithread Programs

JAVA with humor: Curious case of Access Modifiers

Explore content categories