Natural Language Processing for Beginners: Concepts, Tools, and a Real Example
Briefing
Despite all the news about AI and its impact (both positive and negative) on our world, there are many useful techniques beyond LLMs like GPT and Google Gemini. In this article, I would like to present a small and beginner-friendly introduction to Natural Language Processing (NLP), a branch of AI.
Context and Definitions
NLP is an acronym for Natural Language Processing. It is a branch of AI that teaches computers how to understand, interpret and generate human language.
Human language is complex. A single word can have different meanings depending on the context. When we include emotions, intonation, and mood, the complexity increases even more. NLP is the field that enables computers to work with both written and spoken human language despite these challenges.
In this article, I will address a very common and practical problem to demonstrate how NLP can be used to solve real-world scenarios. Imagine you work for a very popular hotel marketplace, where customers frequently leave comments about room accommodations and hotel services. Your task is to analyze these comments and summarize them for the hotel manager.
Manually reading every comment, sentence by sentence, and taking notes would be inefficient and time-consuming. Fortunately, NLP provides techniques that help us process large amounts of text data, extract meaningful insights, and do this in a much smarter way.
To demonstrate this, we will use the Python programming language along with some well-known NLP libraries such as nltk and pandas.
NLTK Library
NLTK, short for Natural Language Toolkit, is a Python package designed for working with human language data. It provides many classes, functions, and tools that help us analyze and process text. Let's explore some of its most common features.
Stopwords
The first one is stopwords. Stopwords are common words that mainly serve a grammatical purpose, such as prepositions, pronouns and adverbs. These words usually carry little meaning by themselves.
In the NLTK stopwords collection, you can find words like "the", "a", "in", "of", "and", and "is".
This collection allows us to focus on the most meaningful words in a text by removing less informative ones during analysis.
Stemming
Another useful tool from NLTK is Stemming. A stemmer removes morphological affixes from words, leaving only the word stem.
Stemming algorithms aim to remove elements related to grammatical role, tense, and derivational morphology. For example, applying stemming to the words connect, connected, connecting, and connection
reduces all of them to a single stem: connect.
However, stemmers have limitations. A stemmer doesn't recognize words with the same meaning but different roots. For example, better and good share a meaning, yet stemming leaves them unrelated because it only strips affixes.
As we can see, stemming may produce incorrect or incomplete results.
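The behavior above can be sketched with NLTK's PorterStemmer (the same stemmer used later in the project); the example words here are generic illustrations, not taken from the dataset:

```python
# Related word forms collapse to one stem, but the result is not
# always a real word, and different roots are never connected.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection"]:
    print(word, "->", ps.stem(word))   # all reduce to "connect"

print(ps.stem("studies"))  # "studi" - not a dictionary word
print(ps.stem("better"))   # "better" - never mapped to "good"
```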
Lemmatization
Lemmatization is another powerful tool from NLTK. It addresses some of the issues introduced by stemming.
While stemming may generate non-words or over-simplified results, lemmatization uses a vocabulary and morphological analysis to return the base or dictionary form of a word, known as the lemma.
Lemmatization is more computationally expensive than stemming because it relies on lexical resources such as WordNet to understand the context. For example, the word better correctly maps to its lemma good, something that stemming cannot achieve.
Tokens and Tokenization
Now, let's discuss tokens, a very common term in NLP. A token is the smallest unit of text that we analyze - think of it as a building block.
Before processing text, we must first break it into these smaller, manageable pieces. This process is called tokenization.
NLTK provides two very useful tokenization functions: sent_tokenize and word_tokenize.
The function sent_tokenize splits a block of text into individual sentences. It understands that abbreviations such as Mr., Ms. and U.S.A. do not indicate the end of a sentence.
Example input:
Hello there! How are you doing today? Mr. Smith is waiting for you.
Output:
Hello there!
How are you doing today?
Mr. Smith is waiting for you.
The function word_tokenize splits a sentence into words and punctuation. For example, applying it to:
Mr. Smith is waiting for you.
Produces:
['Mr.', 'Smith', 'is', 'waiting', 'for', 'you', '.']
N-Grams
Moving forward, let's talk about n-grams. An N-Gram is a contiguous sequence of n items in a given sample of text.
Think of it as a sliding window that moves across a sentence, capturing groups of words that appear next to each other.
We usually give n-grams specific names based on the value of n: a unigram (n = 1), a bigram (n = 2), and a trigram (n = 3).
While word tokenization is useful, it can lose context. For example, looking at the single word good does not capture negation. However, the bigram not good conveys a completely different meaning.
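The sliding-window idea can be sketched with NLTK's ngrams function, which the project uses later; the token list here is a made-up illustration:

```python
# nltk.ngrams slides a window of size n over a token list. The
# bigram ("not", "good") keeps the negation a lone "good" loses.
import nltk

tokens = ["the", "room", "was", "not", "good"]
bigrams = list(nltk.ngrams(tokens, 2))
print(bigrams)
# [('the', 'room'), ('room', 'was'), ('was', 'not'), ('not', 'good')]
```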
Pandas
Pandas is one of the most important Python libraries for data analysis. It helps you store, clean, explore, and analyze data quickly and efficiently, and you’ll often see it used in data science, machine learning, and general data processing tasks.
Let’s start by understanding what Pandas actually does.
Imagine you have a spreadsheet full of sales data — with columns like Product Name, Price, and Quantity Sold.
Pandas allows you to work with this data directly inside Python, just like you would in Excel — but with much more power and flexibility.
You can use Pandas to load data from files such as CSV or Excel, filter and sort rows, compute summary statistics, and group, merge, and reshape tables.
In short — Pandas makes data manipulation easy and efficient.
The "pd.series" Object
A Series is a one-dimensional labeled array — like a single column in a spreadsheet.
Here’s how to create one:
import pandas as pd
numbers = pd.Series([10, 20, 30, 40])
print(numbers)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Each value (10, 20, 30, 40) has a label, called an index (0, 1, 2, 3 by default). The “dtype” at the bottom shows the data type of the elements (in this case, integers).
You can think of a Series as a column of data with labels.
The "pd.Data Frame" Object
A DataFrame is a two-dimensional table — like a full spreadsheet with rows and columns. Here’s an example:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 London
1 Bob 30 Paris
2 Charlie 35 Berlin
Each key in the dictionary ("Name", "Age", "City") becomes a column. Each list inside the dictionary becomes the data for that column. Pandas automatically adds an index column on the left (0, 1, 2).
You can think of a DataFrame as a collection of Series that share the same index.
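That relationship is easy to see in code: selecting a single column of the DataFrame above yields a Series sharing the same index, on which we can compute statistics directly. A minimal sketch:

```python
# A DataFrame column is itself a Series with the same index.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["London", "Paris", "Berlin"],
})

ages = df["Age"]            # a pd.Series
print(type(ages).__name__)  # Series
print(ages.mean())          # 30.0
```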
Let's Move On
Now that we understand the basic concepts of text processing, let's build our project. The goal of this project is to process hotel reviews from a CSV file and identify the most relevant terms within them.
Importing
First, we need to import the required tools for the project.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re
import pandas as pd
import matplotlib.pyplot as plt
stopwords is used to clean the reviews.
stem and tokenize extract relevant terms from the text.
re is the regular expression handler.
pandas loads and processes data from the CSV file.
pyplot from matplotlib is used to plot graphs from our results.
Extracting Data
We start by loading data from the CSV file:
data = pd.read_csv("tripadvisor_hotel_reviews.csv")
data.info()
The data.info() method shows a summary of the loaded dataset, including column details, data types, and memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Review 109 non-null object
1 Rating 109 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.8+ KB
Lowercasing Text
In Python - and in many programming languages - uppercase and lowercase are treated differently.
To ensure consistency during analysis, we convert all the text in the Review column to lowercase and store it in a new column called review_lowercase.
data['review_lowercase'] = data['Review'].str.lower()
data.head()
Removing Stopwords
Next, we remove less informative words from the review_lowercase column and store the result in the review_no_stopwords column. Note that we keep the word not, because negation changes the meaning of what follows it (recall the bigram not good).
en_stopwords = stopwords.words('english')
en_stopwords.remove("not")
data['review_no_stopwords'] = data['review_lowercase'].apply(lambda x: ' '.join([word for word in x.split() if word not in (en_stopwords)]))
data.head()
Removing Punctuation
Now we remove punctuation from the text and store the result in the review_no_stopwords_no_punct column. First, we replace all occurrences of * with the word star.
data['review_no_stopwords_no_punct'] = data.apply(lambda x: re.sub(r"[*]", "star", x["review_no_stopwords"]), axis=1)
Then, we remove remaining punctuation.
data['review_no_stopwords_no_punct'] = data.apply(lambda x: re.sub(r"([^\w\s])", "", x['review_no_stopwords_no_punct']), axis=1)
data.head()
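To make the two substitutions concrete, here is a minimal sketch of both regex patterns applied to an invented sample string (not a review from the dataset):

```python
# The same two re.sub patterns used in the pipeline, step by step.
import re

review = "not 5* experience, room small!"
step1 = re.sub(r"[*]", "star", review)        # '*' becomes 'star'
print(step1)  # not 5star experience, room small!
step2 = re.sub(r"([^\w\s])", "", step1)       # drop remaining punctuation
print(step2)  # not 5star experience room small
```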
Tokenizing
All previous steps aimed to produce cleaner and more concise text. Now, we tokenize the reviews from the review_no_stopwords_no_punct column:
data['tokenized'] = data.apply(lambda x: word_tokenize(x['review_no_stopwords_no_punct']), axis=1)
data.head()
Stemming
Next, we apply stemming to the tokenized words and store the result in the stemmed column:
ps = PorterStemmer()
data['stemmed'] = data['tokenized'].apply(lambda tokens: [ps.stem(token) for token in tokens])
data.head()
Lemmatization
We also apply lemmatization to the tokenized words:
lemmatizer = WordNetLemmatizer()
data['lemmatized'] = data['tokenized'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])
data.head()
N-Grams
Our first step here is to combine all tokens from the lemmatized column into a single list.
tokens_clean = sum(data['lemmatized'], [])
In this project, we adopt a bigram approach to capture meaningful word combinations instead of isolated terms. For example, the phrase great location carries more meaning than the words great and location individually.
bigrams = (pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts())
print(bigrams)
Below, we have the most common bigrams found in the reviews:
(great, location) 24
(space, needle) 21
(hotel, monaco) 16
(great, hotel) 12
(pike, place) 12
..
(breakast, included) 1
(included, deal) 1
(deal, class) 1
(class, sitting) 1
(sitting, pool) 1
Name: count, Length: 8263, dtype: int64
Visualizing Results
Finally, we visualize the top ten most frequent bigrams using a bar chart:
bigrams[0:10].sort_values().plot.barh(color='skyblue', width=.6, figsize=(12, 8))
plt.title('10 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# of Occurrences')
plt.show()
Conclusion
Throughout this presentation, we explored how useful NLP tools can be in practical scenarios.
We started with a clear problem: analyzing hotel reviews. From there, we introduced key NLP concepts and useful Python tools, and finally applied them in a hands-on data exploration to extract meaningful insights from real reviews.
The goal of this article was not to cover everything about NLP, but to offer a simple introduction and inspire you to explore this field further.
Today, NLP is used to solve many real-world problems, such as web search, content moderation, document analysis, and text summarization. There are still many challenges waiting to be addressed, and many opportunities for improvement through Machine Learning, Data Analysis and AI Techniques. As developers and learners, we have the chance to push these boundaries and build smarter solutions.