Understanding TF-IDF - a First-Principles Computation with Apache Spark
Introduction
TF-IDF stands for Term Frequency-Inverse Document Frequency, one of the most common algorithms in text processing. In this article we explain TF-IDF, explore its use-cases and build a practical implementation from 'first principles' using the popular big data framework Apache Spark.
Just as in calculus you do not perform differentiation from first principles on a normal day, this 'first principles' approach is not for everyday use. It is here to cement our understanding of TF-IDF and showcase some interesting capabilities of Apache Spark. In fact, Spark provides a clean, concise API for computing TF-IDF out of the box, so none of this first-principles work is strictly necessary!
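For reference, here is a minimal sketch of that out-of-the-box route using Spark's ML Pipelines API. It assumes a DataFrame df with a string column named "text"; the column names are illustrative, not from the original article.
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
# df is assumed to be a DataFrame with a string column named "text"
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(df)
# HashingTF hashes each word into a bucket and counts occurrences (the TF part)
featurized = HashingTF(inputCol="words", outputCol="raw_features").transform(words)
# IDF re-weights the raw counts by inverse document frequency
tfidf = IDF(inputCol="raw_features", outputCol="features").fit(featurized).transform(featurized)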
This article is written for technology enthusiasts and developers (beginner to intermediate) who want to gain an intuitive and practical understanding of TF-IDF, as well as of interesting Apache Spark capabilities for processing large datasets, e.g. distributed counts, user-defined functions (UDFs) and broadcast joins, among others.
The collection of documents to be analysed makes up a corpus and provides the context for TF-IDF scores. In technical terms, it provides the vector space in which we represent all the documents in the corpus.
What is discussed in this article?
- What is TF-IDF?
- The use-cases of TF-IDF.
- Understanding TF-IDF using Spark.
- Use-case 1: Document Search.
1. What is TF-IDF?
Intuitively, TF-IDF provides a means of characterizing documents by the words that occur in them. It converts documents into vectors [N1, N2, N3, ..., Nn] for machine learning or further processing. Given a collection of documents that make up a corpus, TF-IDF assigns a score to each word in each document representing how important that word is in describing the document: generally, the more important the word, the higher its TF-IDF score. The score is the product of two distinct terms: TF (term frequency: how often the word occurs in the document) and IDF (inverse document frequency: a weighting factor that favours rare words; it is constant for each word across the corpus and measures how rare the word is in the corpus).
So what we are trying to do with TF-IDF is find a vector that best describes each document using the words that occur in it, giving more weight to words that are rare in the corpus.
The maths
Now let's attempt the short maths, stay with me :).
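The original formula image does not survive here; reconstructed from the computation we perform below, the score of a term t in a document d, for a corpus of N documents, is:
tf-idf(t, d) = tf(t, d) × ln(N / df(t))
where tf(t, d) is the number of times t occurs in d and df(t) is the number of documents in the corpus that contain t.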
Let's now compute this on a toy corpus of two documents. In the table below we compute the individual values that make up the formula above. Notice that the most important words for describing Documents 1 and 2 are 'Spark' and 'Hadoop' respectively. The rest don't matter much, as they occur in all documents and therefore don't contribute to uniquely describing any one of them. Thus 'Spark' and 'Hadoop' have the highest TF-IDF scores in their respective documents. Simple, but I hope you get it.
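The original table is an image that does not survive here; below is a hypothetical reconstruction in the same spirit, with Document 1 = "Spark is great" and Document 2 = "Hadoop is great" (the document contents are ours, for illustration only). N = 2.
Term | TF (Doc 1) | TF (Doc 2) | DF | IDF = ln(N/DF) | TF-IDF (Doc 1) | TF-IDF (Doc 2)
spark | 1 | 0 | 1 | ln(2) ≈ 0.69 | 0.69 | 0
hadoop | 0 | 1 | 1 | ln(2) ≈ 0.69 | 0 | 0.69
is | 1 | 1 | 2 | ln(1) = 0 | 0 | 0
great | 1 | 1 | 2 | ln(1) = 0 | 0 | 0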
A number of approaches exist for computing TF-IDF, including normalizing the term frequencies to account for variation in the lengths of the documents in the corpus. The table below, from Wikipedia, summarizes them (our approach is the one ticked in red).
2. The Use-Cases
- Document Search: TF-IDF has traditionally been applied in information retrieval systems. It allows us to retrieve documents that are closely related to a query phrase without requiring an exact string match. This is demonstrated in section 4 below.
- Recommendation: TF-IDF can be used to compute similarity between documents and hence can drive recommendations, e.g. suggesting articles similar (but not identical) to ones you have already read :) - you probably don't want to be recommended the very same article.
- Pre-processing for other Machine Learning algorithms: TF-IDF is also a common pre-processing step for other machine learning algorithms such as LDA and classifiers, among others. It is essentially used to vectorize text (convert text into vectors) for further processing.
- Text Analysis: TF-IDF provides a simple yet powerful means of analyzing text.
- Customer Care Applications: a more specific text-analysis use-case is in customer care. At the least, we could analyse customer reviews over time to find the key terms for further action. In a more sophisticated application, text from agents' notes at a customer care centre could highlight the onset of a problem in the customer base, e.g. in a telecom network. Combined with other machine learning techniques and data from the network, we could be looking at an application that suggests to a network engineer which part of the network is faulty and at what severity level.
- Document Classification: TF-IDF forms a fundamental building block for classifying documents, e.g. spam vs. not spam.
- Article Auto-tagging: TF-IDF could also be employed in tagging articles. For example, in our miniature computation above, Document 1 could be tagged 'Spark' and Document 2 'Hadoop'. To tag at scale, we compute TF-IDF for all articles, set a threshold, and use the terms with scores above that threshold to auto-tag each article (see the sketch after this list). Since tags can be multi-word, we could apply n-grams before TF-IDF.
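A minimal sketch of that thresholding idea, assuming TF-IDF scores in the same {doc, term, score} shape we compute later in this article; the threshold value and sample data are purely illustrative.
# Hypothetical TF-IDF scores in the {doc, term, score} shape computed later in this article
scores = [
    {"doc": "001.txt", "term": "spark", "score": 4.2},
    {"doc": "001.txt", "term": "the", "score": 0.1},
    {"doc": "002.txt", "term": "hadoop", "score": 3.8},
]
THRESHOLD = 1.0  # illustrative cut-off; tune per corpus
tags = {}
for row in scores:
    if row["score"] > THRESHOLD:
        tags.setdefault(row["doc"], []).append(row["term"])
print(tags)  # {'001.txt': ['spark'], '002.txt': ['hadoop']}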
3. Understanding TF-IDF using Spark
To further cement our understanding of the algorithm, we apply it from first principles to the BBC news dataset (technology news, 2004-2005). Spark allows us to compute TF-IDF in just a few lines of code; however, in this example we avoid the existing TF-IDF libraries in order to provide a deeper understanding of the algorithm and to showcase Spark's capabilities alongside its simple API. We program against Spark's Python API.
Directory of documents (BBC Tech news to be analysed)
Step 1: Load the data
# Create an RDD of text files.
# With wholeTextFiles, the resulting RDD has the filename as key and the file content as value.
# The second argument, minPartitions=8, sets the parallelism of the application.
tech_text = sc.wholeTextFiles("/mnt/dataset/public/bbcnews/tech/", 8)
Below is a simplified view of the first five elements of the resulting RDD.
With the data loaded we proceed to compute all variables required to calculate TF-IDF score.
Step 2: Compute TF-IDF - one variable at a time
2.1 N - Number of Documents: the easiest to compute :) - simply the total number of documents in the corpus.
# Yes, it's that simple: the number of documents in the corpus.
number_of_docs = tech_text.count()
# Output: 401
2.2 TF - Term Frequency: how many times the word occurs in the document. Here we split the text into words using a regular expression and count the word occurrences in each document. To keep our code clean, we define a method called tokenize to do exactly what its name suggests: tokenize the text into words.
import re

def tokenize(s):
    # Split on runs of non-word characters, lower-casing first
    return re.split(r"\W+", s.lower())

# Tokenize the text: (filename, content) -> (filename, [words])
tokenized_text = tech_text.map(lambda kv: (kv[0], tokenize(kv[1])))
# Count word occurrences in each document
term_frequency = tokenized_text.flatMapValues(lambda x: x).countByValue()
list(term_frequency.items())[:20]  # Display the first 20 entries
Output: ((Document, word), TF). We could choose to remove stopwords ('and', 'is', 'i', ...) which don't convey any meaning in describing the document; this is not done in this example.
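If we did want to drop stopwords, a minimal sketch might look like this; the stopword set here is a tiny illustrative sample, not a curated list.
# Illustrative stopword list; in practice use a fuller list (e.g. from NLTK)
stopwords = {"and", "is", "i", "the", "a", "of"}
bc_stopwords = sc.broadcast(stopwords)
# Drop stopwords from each document's token list before counting
filtered_text = tokenized_text.mapValues(lambda words: [w for w in words if w not in bc_stopwords.value])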
2.3 DF - Document Frequency: how many documents in the corpus contain the term.
# distinct() keeps each (document, word) pair once per document,
# so counting by word gives the number of documents containing it
document_frequency = tokenized_text.flatMapValues(lambda x: x).distinct()\
    .map(lambda title_word: (title_word[1], title_word[0])).countByKey()
list(document_frequency.items())[:10]
Output: (Term, DF)
2.4 Compute TF-IDF: finally, compute the TF-IDF scores of the words. We define a function that takes N, the term frequencies and the document frequencies, and returns the TF-IDF scores.
import numpy as np

def tf_idf(number_of_docs, term_frequency, document_frequency):
    result = []
    for (doc, term), value in term_frequency.items():
        df = document_frequency[term]
        if df > 0:
            score = float(value) * np.log(number_of_docs / df)
            result.append({"doc": doc, "score": score, "term": term})
    return result

tf_idf_output = tf_idf(number_of_docs, term_frequency, document_frequency)
tf_idf_output[:10]
# Note: this could be done in a better way that takes advantage of Spark - see the sketch below.
Output:
There you are: the TF-IDF scores of the words in our corpus.
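As the note in the code hints, the same computation can stay distributed instead of collecting Python dictionaries on the driver. One possible sketch, joining a term-frequency RDD with a document-frequency RDD; the variable names are ours, not from the original.
from math import log

# ((doc, word), tf) kept as an RDD rather than a collected dict
tf_rdd = tokenized_text.flatMapValues(lambda x: x)\
    .map(lambda kv: (kv, 1)).reduceByKey(lambda a, b: a + b)
# (word, df)
df_rdd = tokenized_text.flatMapValues(lambda x: x).distinct()\
    .map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
# Re-key TF by word, join with DF, then compute the score per (doc, word)
tfidf_rdd = tf_rdd.map(lambda kv: (kv[0][1], (kv[0][0], kv[1])))\
    .join(df_rdd)\
    .map(lambda kv: {"doc": kv[1][0][0], "term": kv[0],
                     "score": kv[1][0][1] * log(number_of_docs / kv[1][1])})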
4. Use-Case 1: Document Search Example
We can now search for documents in our corpus. When a query phrase is provided, we vectorize it and compute its similarity to the documents in the corpus. To efficiently find documents related to our query, we use Spark's broadcast-join pattern: the tokenized query is small, so we broadcast it to all executors and filter the large TF-IDF RDD against it.
The maths:
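The original formula image does not survive here; reading it off the code below, the score of a document d for a query q appears to be:
score(d, q) = (sum of tf-idf(t, d) over query terms t found in d) × (number of query terms found in d) / (number of distinct terms in q)
i.e. the summed TF-IDF scores of the matching terms, boosted by how many query terms the document matches and normalized by the query length.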
# The search function takes the query and the number of top related documents to return.
tfidf_RDD = sc.parallelize(tf_idf_output).map(lambda x: (x['term'], (x['doc'], x['score'])))  # the corpus with TF-IDF scores

def search(query, topN):
    # Tokenize the query and broadcast it to all executors
    tokens = sc.parallelize(tokenize(query)).map(lambda x: (x, 1)).collectAsMap()
    bcTokens = sc.broadcast(tokens)
    # Keep only entries whose term appears in the query, to limit the computation space
    # so that we don't compute similarity for docs that share no words with the query.
    joined_tfidf = tfidf_RDD.map(lambda kv: (kv[0], bcTokens.value.get(kv[0], '-'), kv[1]))\
        .filter(lambda t: t[1] != '-')
    # Per document, accumulate (sum of TF-IDF scores, number of matching terms) with aggregateByKey
    scount = joined_tfidf.map(lambda t: t[2]).aggregateByKey(
        (0, 0),
        lambda acc, value: (acc[0] + value, acc[1] + 1),
        lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
    scores = scount.map(lambda kv: (kv[1][0] * kv[1][1] / len(tokens), kv[0])).top(topN)
    return scores
Now, with the search function defined, let's use part of document 001.txt to search the corpus.
search("Ink helps drive democracy in Asia The Kyrgyz Republic",5 )
Output: Notice document 001.txt is the most highly ranked match.
Conclusion
To conclude: in this article we have tried to gain an intuitive understanding of TF-IDF, a common text processing algorithm (describing documents by the frequency of the words occurring in them, weighted by the importance of each word in the corpus). We then listed a few use-cases, after which we computed TF-IDF scores for the BBC news corpus from first principles to illustrate some interesting capabilities of Spark. Finally, we performed a simple document search to illustrate one of the typical use-cases of TF-IDF.
Feel free to leave your feedback and comments. Cheers!
See here for the implementation in a Databricks notebook.