Natural Language Processing for Beginners: Concepts, Tools, and a Real Example

Briefing

Despite all the news about AI and its impact (both positive and negative) on our world, there are many useful techniques beyond LLMs like GPT and Google Gemini. In this article, I would like to present a small and beginner-friendly introduction to Natural Language Processing (NLP), a branch of AI.

Context and Definitions

NLP is an acronym for Natural Language Processing. It is a branch of AI that teaches computers how to understand, interpret and generate human language.

Human language is complex. A single word can have different meanings depending on the context. When we include emotions, intonation, and mood, the complexity increases even more. NLP is the field that enables computers to work with both written and spoken human language despite these challenges.

In this article, I will address a very common and practical problem to demonstrate how NLP can be used to solve real-world scenarios. Imagine you work for a very popular hotel marketplace, where customers frequently leave comments about room accommodations and hotel services. Your task is to analyze these comments and summarize them for the hotel manager.

Manually reading every comment, sentence by sentence, and taking notes would be inefficient and time-consuming. Fortunately, NLP provides techniques that help us process large amounts of text data, extract meaningful insights, and do this in a much smarter way.

To demonstrate this, we will use the Python programming language along with some well-known NLP libraries such as nltk and pandas.

NLTK Library

NLTK, short for Natural Language Toolkit, is a Python package designed for working with human language data. It provides many classes, functions, and tools that help us analyze and process text. Let's explore some of its most common features.

Stopwords

The first one is stopwords. Stopwords are common words that mainly serve a grammatical purpose, such as prepositions, pronouns and adverbs. These words usually carry little meaning by themselves.

In the NLTK stopwords collection, you can find words like:

  • a
  • about
  • above
  • after
  • below
  • between
  • few
  • for
  • from
  • to
  • too
  • under
  • until
  • while
  • who
  • whom
  • you
  • yours
  • yourself

This collection allows us to focus on the most meaningful words in a text by removing less informative ones during analysis.
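As a minimal sketch of how this filtering works in plain Python (using a small hand-picked subset of stopwords for illustration, not the full NLTK list):

```python
# Toy stopword removal: keep only the informative words.
# The stopword set below is a tiny hand-picked subset, just for illustration.
stop = {"a", "for", "to", "you", "about", "from"}

text = "a quick guide for you about hotels"
kept = [word for word in text.split() if word not in stop]
print(kept)  # ['quick', 'guide', 'hotels']
```

The real NLTK collection works the same way, only with a much larger word list per language.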

Stemming

Another useful tool from NLTK is stemming. A stemmer removes morphological affixes from words, leaving only the word stem.

Stemming algorithms aim to strip elements related to grammatical role, tense, and derivational morphology. For example, using stemming, the list of words:

  • connecting, connected, connectivity, connect, connects

is reduced to a single stem:

  • connect.

However, stemming has limitations. It doesn't recognize words with the same meaning but different roots, and it can produce non-words. For example:

  • likes -> like
  • better -> better
  • worse -> wors

As we can see, stemming may produce incorrect or incomplete results.
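A quick sketch with NLTK's PorterStemmer (assuming nltk is installed) reproduces both behaviors described above:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# The whole "connect" family collapses to a single stem.
words = ["connecting", "connected", "connectivity", "connect", "connects"]
print(sorted({ps.stem(w) for w in words}))  # ['connect']

# But the stemmer cannot relate words with different roots:
print(ps.stem("better"))  # 'better' -- never mapped to 'good'
```

No downloads are needed here; the Porter algorithm is purely rule-based.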

Lemmatization

Lemmatization is another powerful tool from NLTK. It addresses some of the issues introduced by stemming.

While stemming may generate non-words or over-simplified results, lemmatization uses a vocabulary and morphological analysis to return the base or dictionary form of a word, known as the lemma.

Lemmatization is more computationally expensive than stemming because it relies on lexical resources such as WordNet to understand context. For example, when tagged as an adjective, the word better correctly maps to its lemma good, something that stemming cannot achieve.

Tokens and Tokenization

Now, let's discuss tokens, a very common term in NLP. A token is the smallest unit of text that we analyze - think of it as a building block.

Before processing text, we must first break it into these smaller, manageable pieces. This process is called tokenization.

NLTK provides two very useful tokenization functions:

  • sent_tokenize
  • word_tokenize

The function sent_tokenize splits a block of text into individual sentences. It understands that abbreviations such as Mr., Ms., and U.S.A. do not indicate the end of a sentence.

Example input:

Hello there! How are you doing today? Mr. Smith is waiting for you.        

Output:

Hello there!
How are you doing today?
Mr. Smith is waiting for you.

The function word_tokenize splits a sentence into words and punctuation. For example, applying it to:

Mr. Smith is waiting for you.        

Produces:

['Mr.', 'Smith', 'is', 'waiting', 'for', 'you', '.']


N-Grams

Moving forward, let's talk about n-grams. An N-Gram is a contiguous sequence of n items in a given sample of text.

Think of it as a sliding window that moves across a sentence, capturing groups of words that appear next to each other.

We give them specific names based on the value of n:

  • Unigram (n=1): Single words
  • Bigram (n=2): Pairs of consecutive words
  • Trigram (n=3): Triples of consecutive words

While word tokenization is useful, it can lose context. For example, the single word good does not capture negation. However, the bigram not good conveys a completely different meaning.
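Conceptually, the sliding window can be sketched in a few lines of plain Python (nltk.ngrams provides the same behavior):

```python
# A zip-based sliding window over a token list; a minimal sketch of n-grams.
def ngrams(tokens, n):
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["the", "room", "was", "not", "good"]
print(ngrams(tokens, 2))  # bigrams
# [('the', 'room'), ('room', 'was'), ('was', 'not'), ('not', 'good')]
```

Notice how the bigram ('not', 'good') preserves the negation that a bag of single words would lose.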

Pandas

Pandas is one of the most important Python libraries for data analysis. It helps you store, clean, explore, and analyze data quickly and efficiently, and you’ll often see it used in data science, machine learning, and general data-processing tasks.

Let’s start by understanding what Pandas actually does.

Imagine you have a spreadsheet full of sales data — with columns like Product Name, Price, and Quantity Sold.

Pandas allows you to work with this data directly inside Python, just like you would in Excel — but with much more power and flexibility.

You can use Pandas to:

  • Load data from files like CSV, Excel, or SQL databases
  • Filter rows and select specific columns
  • Perform calculations or summaries (like averages or totals)
  • Handle missing values
  • Combine multiple datasets together

In short — Pandas makes data manipulation easy and efficient.

The "pd.Series" Object

A Series is a one-dimensional labeled array — like a single column in a spreadsheet.

Here’s how to create one:

import pandas as pd
numbers = pd.Series([10, 20, 30, 40])
print(numbers)        

Output:

    0    10
    1    20
    2    30
    3    40
    dtype: int64        

Each value (10, 20, 30, 40) has a label, called an index (0, 1, 2, 3 by default). The “dtype” at the bottom shows the data type of the elements (in this case, integers).

You can think of a Series as a column of data with labels.
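The index is not limited to the default 0, 1, 2, ...; you can label it yourself. A small sketch (the room names and prices here are made up for illustration):

```python
import pandas as pd

# Custom string labels instead of the default integer index.
prices = pd.Series([120, 95], index=["deluxe", "standard"])
print(prices["deluxe"])  # 120
```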

The "pd.DataFrame" Object

A DataFrame is a two-dimensional table — like a full spreadsheet with rows and columns. Here’s an example:

import pandas as pd
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["London", "Paris", "Berlin"]
}
df = pd.DataFrame(data)
print(df)        

Output:

      Name  Age    City
0    Alice   25  London
1      Bob   30   Paris
2  Charlie   35  Berlin        

Each key in the dictionary ("Name", "Age", "City") becomes a column. Each list inside the dictionary becomes the data for that column. Pandas automatically adds an index column on the left (0, 1, 2).

You can think of a DataFrame as a collection of Series that share the same index.
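We can see this directly: selecting one column of the DataFrame above returns a Series, which we can then summarize. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["London", "Paris", "Berlin"],
})

ages = df["Age"]            # selecting one column gives a Series
print(type(ages).__name__)  # Series
print(ages.mean())          # 30.0
```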

Let's Move On

Now that we understand the basic concepts of text processing, let's build our project. The goal of this project is to process hotel reviews from a CSV file and identify the most relevant terms within them.

Importing

First, we need to import the required tools for the project.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
import re
import pandas as pd
import matplotlib.pyplot as plt

# Download the NLTK resources used below (only needed once)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')        

  • stopwords is used to clean the reviews
  • stem and tokenize extract relevant terms from the text
  • re is the regular expression handler
  • pandas loads and processes data from the CSV
  • pyplot from matplotlib is used to plot graphs from our results


Extracting Data

We start by loading data from the CSV file:

data = pd.read_csv("tripadvisor_hotel_reviews.csv")
data.info()        

The data.info() method shows a summary of the loaded dataset, including column details, data types, and memory usage:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 109 entries, 0 to 108
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Review  109 non-null    object
 1   Rating  109 non-null    int64
dtypes: int64(1), object(1)
memory usage: 1.8+ KB        


Lowercasing Text

In Python - and in many programming languages - uppercase and lowercase letters are treated as different characters.

To ensure consistency during analysis, we convert all the text in the Review column to lowercase and store it in a new column called review_lowercase.

data['review_lowercase'] = data['Review'].str.lower()
data.head()        
[Screenshot: DataFrame preview showing the review_lowercase column]

Removing Stopwords

Next, we remove less informative words from the review_lowercase column and store the result in the review_no_stopwords column. Note that we drop not from the stopword list first, because negation carries meaning we want to keep.

en_stopwords = stopwords.words('english')
en_stopwords.remove("not")
data['review_no_stopwords'] = data['review_lowercase'].apply(lambda x: ' '.join([word for word in x.split() if word not in (en_stopwords)]))
data.head()        
[Screenshot: DataFrame preview showing the review_no_stopwords column]

Removing Punctuation

Now we remove punctuation from the text and store the result in the review_no_stopwords_no_punct column. First, we replace all occurrences of * with the word star.

data['review_no_stopwords_no_punct'] = data.apply(lambda x: re.sub(r"[*]", "star", x["review_no_stopwords"]), axis=1)        

Then, we remove remaining punctuation.

data['review_no_stopwords_no_punct'] = data.apply(lambda x: re.sub(r"([^\w\s])", "", x['review_no_stopwords_no_punct']), axis=1)
data.head()        
[Screenshot: DataFrame preview showing the review_no_stopwords_no_punct column]

Tokenizing

All previous steps aimed to produce cleaner and more concise text. Now, we tokenize the reviews from the review_no_stopwords_no_punct column:

data['tokenized'] = data.apply(lambda x: word_tokenize(x['review_no_stopwords_no_punct']), axis=1)
data.head()        
[Screenshot: DataFrame preview showing the tokenized column]

Stemming

Next, we apply stemming to the tokenized words and store the result in the stemmed column:

ps = PorterStemmer()
data['stemmed'] = data['tokenized'].apply(lambda tokens: [ps.stem(token) for token in tokens])
data.head()        
[Screenshot: DataFrame preview showing the stemmed column]

Lemmatization

We also apply lemmatization to the tokenized words:

lemmatizer = WordNetLemmatizer()
data['lemmatized'] = data['tokenized'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])
data.head()        
[Screenshot: DataFrame preview showing the lemmatized column]

N-Grams

Our first step here is to combine all tokens from the lemmatized column into a single list.

tokens_clean = sum(data['lemmatized'], [])        

In this project, we adopt a bigram approach to capture meaningful word combinations instead of isolated terms. For example, the phrase great location carries more meaning than the words great and location individually.

bigrams = (pd.Series(nltk.ngrams(tokens_clean, 2)).value_counts())
print(bigrams)        

Below, we have the most common bigrams found in the reviews:

(great, location)       24
(space, needle)         21
(hotel, monaco)         16
(great, hotel)          12
(pike, place)           12
                        ..
(breakast, included)     1
(included, deal)         1
(deal, class)            1
(class, sitting)         1
(sitting, pool)          1
Name: count, Length: 8263, dtype: int64        


Visualizing Results

Finally, we visualize the top ten most frequent bigrams using a bar chart:

bigrams[0:10].sort_values().plot.barh(color='skyblue', width=.6, figsize=(12, 8))
plt.title('10 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# of Occurrences')
plt.show()        
[Chart: 10 Most Frequently Occurring Bigrams]


Conclusion

Throughout this article, we explored how useful NLP tools can be in practical scenarios.

We started with a clear problem: analyzing hotel reviews. From there, we introduced key NLP concepts and useful Python tools, and finally applied them in a hands-on data exploration to extract meaningful insights from real reviews.

The goal of this article was not to cover everything about NLP, but to offer a simple introduction and inspire you to explore this field further.

Today, NLP is used to solve many real-world problems, such as web search, content moderation, document analysis, and text summarization. There are still many challenges waiting to be addressed, and many opportunities for improvement through Machine Learning, Data Analysis and AI Techniques. As developers and learners, we have the chance to push these boundaries and build smarter solutions.
