Understanding TF-IDF - a First-Principles Computation with Apache Spark
Introduction
TF-IDF stands for Term Frequency-Inverse Document Frequency, one of the most common algorithms in text processing. In this article we explain TF-IDF, explore its use-cases and build a practical implementation from 'first principles' using the popular big data framework Apache Spark.
Just as in calculus you do not perform differentiation from first principles on a normal day, this 'first principles' approach is not for everyday use. It is here to cement our understanding of TF-IDF and showcase some interesting capabilities of Apache Spark. In fact, Spark provides a clean, concise API for computing TF-IDF out of the box, so none of this first-principles work is strictly necessary!
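For reference, here is a minimal sketch of that out-of-the-box route using Spark's ML Pipelines API. It assumes a DataFrame df with a string column named "text"; the column names are illustrative, not from the original article.
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
# df is assumed to be a DataFrame with a string column named "text"
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(df)
# HashingTF hashes each word into a bucket and counts occurrences (the TF part)
featurized = HashingTF(inputCol="words", outputCol="raw_features").transform(words)
# IDF re-weights the raw counts by inverse document frequency
tfidf = IDF(inputCol="raw_features", outputCol="features").fit(featurized).transform(featurized)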
This article is written for technology enthusiasts and developers (beginner to intermediate) who want to gain an intuitive and practical understanding of TF-IDF, as well as of interesting Apache Spark capabilities for processing large datasets, e.g. distributed counts, user-defined functions (UDFs) and broadcast joins, among others.
The collection of documents to be analysed makes up a corpus and provides the context for TF-IDF scores. In technical terms, it provides the vector space in which we represent all the documents in the corpus.
What is discussed in this article?
- What is TF-IDF?
- The use-cases of TF-IDF.
- Understanding TF-IDF using Spark.
- Use-case 1: Document Search.
1. What is TF-IDF?
Intuitively, TF-IDF provides a means of characterizing documents by the words that occur in them. It converts documents into vectors [N1, N2, N3, ..., Nn] for machine learning or further processing. Given a collection of documents that make up a corpus, TF-IDF assigns a score to each word in each document representing how important that word is in describing the document: generally, the more important the word, the higher its TF-IDF score. The score is the product of two distinct terms: TF (term frequency: how often the word occurs in the document) and IDF (inverse document frequency: a weighting factor that favours rare words; it is constant for each word across the corpus and measures how rare the word is in the corpus).
So what we are trying to do with TF-IDF is find a vector that best describes each document using the words that occur in it, giving more weight to words that are rare in the corpus.
The maths
Now let's attempt the short maths, stay with me :).
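The original formula image does not survive here; reconstructed from the computation we perform below, the score of a term t in a document d, for a corpus of N documents, is:
tf-idf(t, d) = tf(t, d) × ln(N / df(t))
where tf(t, d) is the number of times t occurs in d and df(t) is the number of documents in the corpus that contain t.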
Let's now compute this on a toy corpus of two documents. In the table below we compute the individual values that make up the formula above. Notice that the most important words for describing Documents 1 and 2 are 'Spark' and 'Hadoop' respectively. The rest don't matter much, as they occur in all documents and therefore don't contribute to uniquely describing any one of them. Thus 'Spark' and 'Hadoop' have the highest TF-IDF scores in their respective documents. Simple, but I hope you get it.
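The original table is an image that does not survive here; below is a hypothetical reconstruction in the same spirit, with Document 1 = "Spark is great" and Document 2 = "Hadoop is great" (the document contents are ours, for illustration only). N = 2.
Term | TF (Doc 1) | TF (Doc 2) | DF | IDF = ln(N/DF) | TF-IDF (Doc 1) | TF-IDF (Doc 2)
spark | 1 | 0 | 1 | ln(2) ≈ 0.69 | 0.69 | 0
hadoop | 0 | 1 | 1 | ln(2) ≈ 0.69 | 0 | 0.69
is | 1 | 1 | 2 | ln(1) = 0 | 0 | 0
great | 1 | 1 | 2 | ln(1) = 0 | 0 | 0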
A number of approaches exist for computing TF-IDF, including normalizing the term frequencies to account for variation in the lengths of the documents in the corpus. The table below, from Wikipedia, summarizes them (our approach is the one ticked in red).
2. The Use-Cases
- Document Search: TF-IDF has traditionally been applied in information retrieval systems. It allows us to retrieve documents that are closely related to a query phrase without requiring an exact string match. This is demonstrated in section 4 below.
- Recommendation: TF-IDF can be used to compute similarity between documents and hence can drive recommendations, e.g. suggesting articles similar (but not identical) to ones you have already read :) - you probably don't want to be recommended the very same article.
- Pre-processing for other Machine Learning algorithms: TF-IDF is also a common pre-processing step for other machine learning algorithms such as LDA and classifiers, among others. It is essentially used to vectorize text (convert text into vectors) for further processing.
- Text Analysis: TF-IDF provides a simple yet powerful means of analyzing text.
- Customer Care Applications: a more specific text-analysis use-case is in customer care. At the least, we could analyse customer reviews over time to find the key terms for further action. In a more sophisticated application, text from agents' notes at a customer care centre could highlight the onset of a problem in the customer base, e.g. in a telecom network. Combined with other machine learning techniques and data from the network, we could be looking at an application that suggests to a network engineer which part of the network is faulty and at what severity level.
- Document Classification: TF-IDF forms a fundamental building block for classifying documents, e.g. spam vs. not spam.
- Article Auto-tagging: TF-IDF could also be employed in tagging articles. For example, in our miniature computation above, Document 1 could be tagged 'Spark' and Document 2 'Hadoop'. To tag at scale, we compute TF-IDF for all articles, set a threshold, and use the terms with scores above that threshold to auto-tag each article (see the sketch after this list). Since tags can be multi-word, we could apply n-grams before TF-IDF.
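A minimal sketch of that thresholding idea, assuming TF-IDF scores in the same {doc, term, score} shape we compute later in this article; the threshold value and sample data are purely illustrative.
# Hypothetical TF-IDF scores in the {doc, term, score} shape computed later in this article
scores = [
    {"doc": "001.txt", "term": "spark", "score": 4.2},
    {"doc": "001.txt", "term": "the", "score": 0.1},
    {"doc": "002.txt", "term": "hadoop", "score": 3.8},
]
THRESHOLD = 1.0  # illustrative cut-off; tune per corpus
tags = {}
for row in scores:
    if row["score"] > THRESHOLD:
        tags.setdefault(row["doc"], []).append(row["term"])
print(tags)  # {'001.txt': ['spark'], '002.txt': ['hadoop']}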
3. Understanding TF-IDF using Spark
To further cement our understanding of the algorithm, we apply it from first principles to the BBC news dataset (technology news, 2004-2005). Spark allows us to compute TF-IDF in just a few lines of code; however, in this example we avoid the existing TF-IDF libraries in order to provide a deeper understanding of the algorithm and to showcase Spark's capabilities alongside its simple API. We program against Spark's Python API.
Directory of documents (BBC Tech news to be analysed)
Step 1: Load the data
# Create an RDD of text files.
# With wholeTextFiles, the resulting RDD has the filename as key and the file content as value.
# The second argument, minPartitions=8, sets the parallelism of the application.
tech_text = sc.wholeTextFiles("/mnt/dataset/public/bbcnews/tech/", 8)
Below is a simplified view of the first five elements of the resulting RDD.
With the data loaded we proceed to compute all variables required to calculate TF-IDF score.
Step 2: Compute TF-IDF - one variable at a time
2.1 N - Number of Documents: the easiest to compute :) - simply the total number of documents in the corpus.
# Yes, it's that simple: the number of documents in the corpus.
number_of_docs = tech_text.count()
# Output: 401
2.2 TF - Term Frequency: how many times the word occurs in the document. Here we split the text into words using a regular expression and count the word occurrences in each document. To keep our code clean, we define a method called tokenize to do exactly what its name suggests: tokenize the text into words.
import re

def tokenize(s):
    # Split on runs of non-word characters, lower-casing first
    return re.split(r"\W+", s.lower())

# Tokenize the text: (filename, content) -> (filename, [words])
tokenized_text = tech_text.map(lambda kv: (kv[0], tokenize(kv[1])))
# Count word occurrences in each document
term_frequency = tokenized_text.flatMapValues(lambda x: x).countByValue()
list(term_frequency.items())[:20]  # Display the first 20 entries
Output: ((Document, word), TF). We could choose to remove stopwords ('and', 'is', 'i', ...) which don't convey any meaning in describing the document; this is not done in this example.
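If we did want to drop stopwords, a minimal sketch might look like this; the stopword set here is a tiny illustrative sample, not a curated list.
# Illustrative stopword list; in practice use a fuller list (e.g. from NLTK)
stopwords = {"and", "is", "i", "the", "a", "of"}
bc_stopwords = sc.broadcast(stopwords)
# Drop stopwords from each document's token list before counting
filtered_text = tokenized_text.mapValues(lambda words: [w for w in words if w not in bc_stopwords.value])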
2.3 DF - Document Frequency: how many documents in the corpus contain the term.
# distinct() keeps each (document, word) pair once per document,
# so counting by word gives the number of documents containing it
document_frequency = tokenized_text.flatMapValues(lambda x: x).distinct()\
    .map(lambda title_word: (title_word[1], title_word[0])).countByKey()
list(document_frequency.items())[:10]
Output: (Term, DF)
2.4 Compute TF-IDF: finally, compute the TF-IDF scores of the words. We define a function that takes N, the term frequencies and the document frequencies, and returns the TF-IDF scores.
import numpy as np

def tf_idf(number_of_docs, term_frequency, document_frequency):
    result = []
    for (doc, term), value in term_frequency.items():
        df = document_frequency[term]
        if df > 0:
            score = float(value) * np.log(number_of_docs / df)
            result.append({"doc": doc, "score": score, "term": term})
    return result

tf_idf_output = tf_idf(number_of_docs, term_frequency, document_frequency)
tf_idf_output[:10]
# Note: this could be done in a better way that takes advantage of Spark - see the sketch below.
Output:
There you are: the TF-IDF scores of the words in our corpus.
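As the note in the code hints, the same computation can stay distributed instead of collecting Python dictionaries on the driver. One possible sketch, joining a term-frequency RDD with a document-frequency RDD; the variable names are ours, not from the original.
from math import log

# ((doc, word), tf) kept as an RDD rather than a collected dict
tf_rdd = tokenized_text.flatMapValues(lambda x: x)\
    .map(lambda kv: (kv, 1)).reduceByKey(lambda a, b: a + b)
# (word, df)
df_rdd = tokenized_text.flatMapValues(lambda x: x).distinct()\
    .map(lambda kv: (kv[1], 1)).reduceByKey(lambda a, b: a + b)
# Re-key TF by word, join with DF, then compute the score per (doc, word)
tfidf_rdd = tf_rdd.map(lambda kv: (kv[0][1], (kv[0][0], kv[1])))\
    .join(df_rdd)\
    .map(lambda kv: {"doc": kv[1][0][0], "term": kv[0],
                     "score": kv[1][0][1] * log(number_of_docs / kv[1][1])})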
4. Use-Case 1: Document Search Example
We can now search for documents in our corpus. When a query phrase is provided, we vectorize it and compute its similarity to the documents in the corpus. To efficiently find documents related to our query, we use Spark's broadcast-join pattern: the tokenized query is small, so we broadcast it to all executors and filter the large TF-IDF RDD against it.
The maths:
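The original formula image does not survive here; reading it off the code below, the score of a document d for a query q appears to be:
score(d, q) = (sum of tf-idf(t, d) over query terms t found in d) × (number of query terms found in d) / (number of distinct terms in q)
i.e. the summed TF-IDF scores of the matching terms, boosted by how many query terms the document matches and normalized by the query length.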
# The search function takes the query and the number of top related documents to return.
tfidf_RDD = sc.parallelize(tf_idf_output).map(lambda x: (x['term'], (x['doc'], x['score'])))  # the corpus with TF-IDF scores

def search(query, topN):
    # Tokenize the query and broadcast it to all executors
    tokens = sc.parallelize(tokenize(query)).map(lambda x: (x, 1)).collectAsMap()
    bcTokens = sc.broadcast(tokens)
    # Keep only entries whose term appears in the query, to limit the computation space
    # so that we don't compute similarity for docs that share no words with the query.
    joined_tfidf = tfidf_RDD.map(lambda kv: (kv[0], bcTokens.value.get(kv[0], '-'), kv[1]))\
        .filter(lambda t: t[1] != '-')
    # Per document, accumulate (sum of TF-IDF scores, number of matching terms) with aggregateByKey
    scount = joined_tfidf.map(lambda t: t[2]).aggregateByKey(
        (0, 0),
        lambda acc, value: (acc[0] + value, acc[1] + 1),
        lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
    scores = scount.map(lambda kv: (kv[1][0] * kv[1][1] / len(tokens), kv[0])).top(topN)
    return scores
Now, with the search function defined, let's use part of document 001.txt to search the corpus.
search("Ink helps drive democracy in Asia The Kyrgyz Republic",5 )
Output: Notice document 001.txt is the most highly ranked match.
Conclusion
To conclude: in this article we have tried to gain an intuitive understanding of TF-IDF, a common text processing algorithm (describing documents by the frequency of the words occurring in them, weighted by the importance of each word in the corpus). We then listed a few use-cases, after which we computed TF-IDF scores for the BBC news corpus from first principles to illustrate some interesting capabilities of Spark. Finally, we performed a simple document search to illustrate one of the typical use-cases of TF-IDF.
Feel free to leave your feedback and comments. Cheers!
See here for the implementation in a Databricks notebook.