Natural Language Processing for Beginners: Concepts, Tools, and a Real Example
Briefing
Despite all the news about AI and its impact (both positive and negative) on our world, there are many useful techniques beyond LLMs like GPT and Google Gemini. In this article, I would like to present a small and beginner-friendly introduction to Natural Language Processing (NLP), a branch of AI.
Context and Definitions
NLP is an acronym for Natural Language Processing. It is a branch of AI that teaches computers how to understand, interpret and generate human language.
Human language is complex. A single word can have different meanings depending on the context. When we include emotions, intonation, and mood, the complexity increases even more. NLP is the field that enables computers to work with both written and spoken human language despite these challenges.
In this article, I will address a very common and practical problem to demonstrate how NLP can be used to solve real-world scenarios. Imagine you work for a very popular hotel marketplace, where customers frequently leave comments about room accommodations and hotel services. Your task is to analyze these comments and summarize them for the hotel manager.
Manually reading every comment, sentence by sentence, and taking notes would be inefficient and time-consuming. Fortunately, NLP provides techniques that help us process large amounts of text data, extract meaningful insights, and do this in a much smarter way.
To demonstrate this, we will use the Python programming language along with some well-known NLP libraries such as nltk and pandas.
NLTK Library
NLTK, short for Natural Language Toolkit, is a Python package designed for working with human language data. It provides many classes, functions, and tools that help us analyze and process text. Let's explore some of its most common features.
Stopwords
The first one is stopwords. Stopwords are common words that mainly serve a grammatical purpose, such as prepositions, pronouns and adverbs. These words usually carry little meaning by themselves.
In the NLTK stopwords collection, you can find words like "the", "a", "in", "of", "and", and "is".
This collection allows us to focus on the most meaningful words in a text by removing less informative ones during analysis.
Stemming
Another useful tool from NLTK is Stemming. A stemmer removes morphological affixes from words, leaving only the word stem.
Stemming algorithms aim to remove elements related to grammatical role, tense, and derivational morphology. For example, applying stemming to the words connect, connected, connecting, and connection
reduces all of them to a single stem: connect.
However, stemmers have limitations. A stemmer doesn't recognize words with the same meaning but different roots. For example, better and good share a meaning, yet stemming leaves them unrelated because it only strips affixes.
As we can see, stemming may produce incorrect or incomplete results.
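The behavior above can be sketched with NLTK's PorterStemmer (the same stemmer used later in the project); the example words here are generic illustrations, not taken from the dataset:

```python
# Related word forms collapse to one stem, but the result is not
# always a real word, and different roots are never connected.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ["connect", "connected", "connecting", "connection"]:
    print(word, "->", ps.stem(word))   # all reduce to "connect"

print(ps.stem("studies"))  # "studi" - not a dictionary word
print(ps.stem("better"))   # "better" - never mapped to "good"
```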
Lemmatization
Lemmatization is another powerful tool from NLTK. It addresses some of the issues introduced by stemming.
While stemming may generate non-words or over-simplified results, lemmatization uses a vocabulary and morphological analysis to return the base or dictionary form of a word, known as the lemma.
Lemmatization is more computationally expensive than stemming because it relies on lexical resources such as WordNet to understand the context. For example, the word better correctly maps to its lemma good, something that stemming cannot achieve.
Tokens and Tokenization
Now, let's discuss tokens, a very common term in NLP. A token is the smallest unit of text that we analyze - think of it as a building block.
Before processing text, we must first break it into these smaller, manageable pieces. This process is called tokenization.
NLTK provides two very useful tokenization functions: sent_tokenize and word_tokenize.
The function sent_tokenize splits a block of text into individual sentences. It understands that abbreviations such as Mr., Ms. and U.S.A. do not indicate the end of a sentence.
Example input:
Hello there! How are you doing today? Mr. Smith is waiting for you.
Output:
Hello there!
How are you doing today?
Mr. Smith is waiting for you.
The function word_tokenize splits a sentence into words and punctuation. For example, applying it to:
Mr. Smith is waiting for you.
Produces:
['Mr.', 'Smith', 'is', 'waiting', 'for', 'you', '.']
N-Grams
Moving forward, let's talk about n-grams. An N-Gram is a contiguous sequence of n items in a given sample of text.
Think of it as a sliding window that moves across a sentence, capturing groups of words that appear next to each other.
We usually give n-grams specific names based on the value of n: a unigram (n = 1), a bigram (n = 2), and a trigram (n = 3).
While word tokenization is useful, it can lose context. For example, looking at the single word good does not capture negation. However, the bigram not good conveys a completely different meaning.
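The sliding-window idea can be sketched with NLTK's ngrams function, which the project uses later; the token list here is a made-up illustration:

```python
# nltk.ngrams slides a window of size n over a token list. The
# bigram ("not", "good") keeps the negation a lone "good" loses.
import nltk

tokens = ["the", "room", "was", "not", "good"]
bigrams = list(nltk.ngrams(tokens, 2))
print(bigrams)
# [('the', 'room'), ('room', 'was'), ('was', 'not'), ('not', 'good')]
```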
Pandas
Pandas is one of the most important Python libraries for data analysis. It helps you store, clean, explore, and analyze data quickly and efficiently, and you’ll often see it used in data science, machine learning, and general data processing tasks.
Let’s start by understanding what Pandas actually does.
Imagine you have a spreadsheet full of sales data — with columns like Product Name, Price, and Quantity Sold.
Pandas allows you to work with this data directly inside Python, just like you would in Excel — but with much more power and flexibility.
You can use Pandas to load data from files such as CSV or Excel, filter and sort rows, compute summary statistics, and group, merge, and reshape tables.
In short — Pandas makes data manipulation easy and efficient.
The "pd.series" Object
A Series is a one-dimensional labeled array — like a single column in a spreadsheet.
Here’s how to create one:
import pandas as pd
numbers = pd.Series([10, 20, 30, 40])
print(numbers)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Each value (10, 20, 30, 40) has a label, called an index (0, 1, 2, 3 by default). The “dtype” at the bottom shows the data type of the elements (in this case, integers).
You can think of a Series as a column of data with labels.
The "pd.Data Frame" Object
A DataFrame is a two-dimensional table — like a full spreadsheet with rows and columns. Here’s an example:
import pandas as pd
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 London
1 Bob 30 Paris
2 Charlie 35 Berlin
Each key in the dictionary ("Name", "Age", "City") becomes a column. Each list inside the dictionary becomes the data for that column. Pandas automatically adds an index column on the left (0, 1, 2).
You can think of a DataFrame as a collection of Series that share the same index.
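That relationship is easy to see in code: selecting a single column of the DataFrame above yields a Series sharing the same index, on which we can compute statistics directly. A minimal sketch:

```python
# A DataFrame column is itself a Series with the same index.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["London", "Paris", "Berlin"],
})

ages = df["Age"]            # a pd.Series
print(type(ages).__name__)  # Series
print(ages.mean())          # 30.0
```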
Let's Move On
Now that we understand the basic concepts of text processing, let's build our project. The goal of this project is to process hotel reviews from a CSV file and identify the most relevant terms within them.
Importing
First, we need to import the required tools for the project.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re
import pandas as pd
import matplotlib.pyplot as plt
stopwords is used to clean the reviews.
stem and tokenize extract relevant terms from the text.
re is the regular expression handler.
pandas loads and processes data from the CSV file.
pyplot from matplotlib is used to plot graphs from our results.
Extracting Data
We start by loading data from the CSV file:
data = pd.read_csv("tripadvisor_hotel_reviews.csv")
data.info()
The data.info() method shows a summary of the loaded dataset, including column details, data types, and memory usage:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Review 109 non-null object
1 Rating 109 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.8+ KB
Lowercasing Text
In Python - and in many programming languages - uppercase and lowercase are treated differently.
To ensure consistency during analysis, we convert all the text in the Review column to lowercase and store it in a new column called review_lowercase.
data['review_lowercase'] = data['Review'].str.lower()
data.head()
Removing Stopwords
Next, we remove less informative words from the review_lowercase column and store the result in the review_no_stopwords column. Note that we keep the word not, because negation changes the meaning of what follows it (recall the bigram not good).
en_stopwords = stopwords.words('english')
en_stopwords.remove("not")
data['review_no_stopwords'] = data['review_lowercase'].apply(lambda x: ' '.join([word for word in x.split() if word not in (en_stopwords)]))
data.head()
Removing Punctuation
Now we remove punctuation from the text and store the result in the review_no_stopwords_no_punct column. First, we replace all occurrences of * with the word star.
data['review_no_stopwords_no_punct'] = data.apply(lambda x: re.sub(r"[*]", "star", x["review_no_stopwords"]), axis=1)
Then, we remove remaining punctuation.
data['review_no_stopwords_no_punct'] = data.apply(lambda x: re.sub(r"([^\w\s])", "", x['review_no_stopwords_no_punct']), axis=1)
data.head()
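To make the two substitutions concrete, here is a minimal sketch of both regex patterns applied to an invented sample string (not a review from the dataset):

```python
# The same two re.sub patterns used in the pipeline, step by step.
import re

review = "not 5* experience, room small!"
step1 = re.sub(r"[*]", "star", review)        # '*' becomes 'star'
print(step1)  # not 5star experience, room small!
step2 = re.sub(r"([^\w\s])", "", step1)       # drop remaining punctuation
print(step2)  # not 5star experience room small
```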
Tokenizing
All previous steps aimed to produce cleaner and more concise text. Now, we tokenize the reviews from the review_no_stopwords_no_punct column:
data['tokenized'] = data.apply(lambda x: word_tokenize(x['review_no_stopwords_no_punct']), axis=1)
data.head()
Stemming
Next, we apply stemming to the tokenized words and store the result in the stemmed column:
ps = PorterStemmer()
data['stemmed'] = data['tokenized'].apply(lambda tokens: [ps.stem(token) for token in tokens])
data.head()
Lemmatization
We also apply lemmatization to the tokenized words:
lemmatizer = WordNetLemmatizer()
data['lemmatized'] = data['tokenized'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])
data.head()
N-Grams
Our first step here is to combine all tokens from the lemmatized column into a single list.
tokens_clean = sum(data['lemmatized'], [])
In this project, we adopt a bigram approach to capture meaningful word combinations instead of isolated terms. For example, the phrase great location carries more meaning than the words great and location individually.
bigrams = (pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts())
print(bigrams)
Below, we have the most common bigrams found in the reviews:
(great, location) 24
(space, needle) 21
(hotel, monaco) 16
(great, hotel) 12
(pike, place) 12
..
(breakast, included) 1
(included, deal) 1
(deal, class) 1
(class, sitting) 1
(sitting, pool) 1
Name: count, Length: 8263, dtype: int64
Visualizing Results
Finally, we visualize the top ten most frequent bigrams using a bar chart:
bigrams[0:10].sort_values().plot.barh(color='skyblue', width=.6, figsize=(12, 8))
plt.title('10 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# of Occurrences')
plt.show()
Conclusion
Throughout this presentation, we explored how useful NLP tools can be in practical scenarios.
We started with a clear problem: analyzing hotel reviews. From there, we introduced key NLP concepts and useful Python tools, and finally applied them in a hands-on data exploration to extract meaningful insights from real reviews.
The goal of this article was not to cover everything about NLP, but to offer a simple introduction and inspire you to explore this field further.
Today, NLP is used to solve many real-world problems, such as web search, content moderation, document analysis, and text summarization. There are still many challenges waiting to be addressed, and many opportunities for improvement through Machine Learning, Data Analysis and AI Techniques. As developers and learners, we have the chance to push these boundaries and build smarter solutions.