Python AI for Natural Language Processing (NLP)

L. P. Harisha Lakshan Warnakulasuriya (BSc in Computer Science, OUSL).

Bachelor of Science in Computer Science.

Reading for the Master of Science in Computer Science at the University of Sri Jayewardenepura.

Contents

  • Introduction → Core NLP concepts
  • Pre-processing, tokenization, POS, NER
  • Sentiment analysis, text classification, topic modeling
  • Word embeddings, transformers, real-time projects
  • Deployment, best practices, advanced techniques

1. What Is NLP?

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) devoted to enabling computers to understand, interpret, and generate human language. It sits at the intersection of:

  • Linguistics (structure and meaning of language)
  • Computer Science (algorithms, data structures)
  • Machine Learning / Deep Learning (statistical models that learn from data)

Goal: make machines as fluent as possible in human languages — text, speech, or even mixed media.

2. Why Python Is the #1 Language for NLP

Python dominates NLP development because:

Reason | Explanation
Readable syntax | Easier to express algorithms & prototypes
Massive library ecosystem | nltk, spacy, textblob, gensim, scikit-learn, transformers
Scientific stack | numpy, pandas, matplotlib, scipy
Integration with DL frameworks | TensorFlow, PyTorch, JAX
Vibrant community | Tutorials, Stack Overflow, pre-trained models

In practice, most research papers and production pipelines in NLP have a Python reference implementation.

3. Core Building Blocks of NLP

Building Block | Purpose | Example
Tokenization | Break text into pieces (tokens) | “Chatbots are cool!” → [Chatbots, are, cool, !]
Normalization | Lower-case, remove punctuation, fix spelling | “COOOL!!” → “cool”
Stopword Removal | Drop high-frequency function words | “and”, “the”, “is”
Stemming / Lemmatization | Reduce words to their root | “running” → “run”
POS Tagging | Label each token’s role (noun, verb, adj) | Python/NN is/VBZ great/JJ
Named Entity Recognition (NER) | Find names, organizations, locations | “SpaceX” → ORG
Sentiment Analysis | Measure opinion or emotion | “Fantastic!” → Positive
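
A single spaCy pipeline covers several of these building blocks in one pass; a minimal sketch, assuming the en_core_web_sm model from Section 5 is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("SpaceX builds fantastic rockets!")

for token in doc:
    # text, lemma, universal POS tag, and stopword flag for each token
    print(token.text, token.lemma_, token.pos_, token.is_stop)

for ent in doc.ents:
    # named entities found by the same pipeline
    print(ent.text, ent.label_)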


4. Text Pre-Processing Pipeline

Raw text
   ↓
Cleaning (remove HTML, digits, etc.)
   ↓
Tokenization
   ↓
Lowercase & normalize
   ↓
Stopword removal
   ↓
Stemming / Lemmatization
   ↓
Vectorization (Bag of Words / TF-IDF / Embeddings)
        
A robust pipeline is crucial: garbage-in = garbage-out.

5. Installing NLP Libraries

# Create a virtual environment (optional but recommended)
python -m venv nlp_env
source nlp_env/bin/activate  # or nlp_env\Scripts\activate on Windows

# Install key packages
pip install nltk spacy textblob gensim scikit-learn matplotlib
python -m spacy download en_core_web_sm   # small English model
        

6. Loading and Cleaning Text

# Load text from a file
with open("article.txt", "r", encoding="utf8") as f:
    text = f.read()

# Simple cleaning
import re
clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
clean_text = clean_text.lower()
print(clean_text[:300])
        
Tip: For HTML, use BeautifulSoup to strip tags before regex cleaning.

7. Tokenization

Using nltk

import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

sample = "Natural Language Processing (NLP) is amazing!"
tokens = word_tokenize(sample)
print(tokens)
        

Using spaCy

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(sample)
tokens = [token.text for token in doc]
print(tokens)
        

8. Stopword Removal

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
        

9. Lemmatization

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

lemm = WordNetLemmatizer()
lemmas = [lemm.lemmatize(w) for w in filtered]
print(lemmas)
        

Diagram of the flow:

Raw text
 → Tokenize
   → Remove stopwords
     → Lemmatize
        

10. Combining Steps into a Function

def preprocess(text):
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words("english"))
    filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
    lemm = WordNetLemmatizer()
    return [lemm.lemmatize(w) for w in filtered]

print(preprocess("Dogs are running quickly in the park!"))
        

11. Representing Text as Numbers

After cleaning, text must be converted into numeric vectors before a model can work with it.

Technique | Description | Library
Bag of Words | Counts word frequency | sklearn
TF-IDF | Weights counts by rarity | sklearn
Word Embeddings | Dense vectors capturing meaning | gensim, transformers

Example – TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP", "NLP loves Python", "Python is great"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
        

12. Challenges in NLP

  • Ambiguity: words with multiple meanings
  • Context: “bank” (river vs finance)
  • Domain adaptation: slang, emojis, legal jargon
  • Data sparsity: rare words, languages with limited corpora
  • Bias & fairness: models may inherit stereotypes from data


13. Best Practices for Beginners

  1. Start with small datasets, inspect outputs.
  2. Always evaluate pre-processing impact.
  3. Keep your pipeline modular (each step testable).
  4. Use pre-trained models whenever possible.
  5. Understand tokenization details — crucial for later deep-learning models.


Part 2 – POS Tagging, Named Entity Recognition, Sentiment & Text Classification


14. Part-of-Speech (POS) Tagging

Definition: POS tagging labels each token with its grammatical role: noun, verb, adjective, etc. This helps algorithms understand sentence structure and relationships.

Example with nltk

import nltk
nltk.download("punkt")  # required by word_tokenize
nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
print(tags)
        

Output:

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'),
 ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'),
 ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
        

Example with spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token.text, token.pos_, token.tag_)
        
spaCy provides both the universal POS tag (pos_) and the fine-grained Treebank tag (tag_).

15. Named Entity Recognition (NER)

NER detects and classifies “named entities”: people, organizations, places, dates, monetary values.

spaCy NER

text = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
        

Common labels:

Label | Meaning
PERSON | Person
ORG | Organization
GPE | Country/City
DATE | Calendar date
MONEY | Monetary value


Customizing NER

spaCy allows training your own NER model if your domain has entities like CHEMICAL, LAW_CASE, PRODUCT_ID.

Use spacy.blank("en") to start, then feed labeled data with Example objects.
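
A minimal training sketch, assuming spaCy 3.x; the PRODUCT_ID label, the sample sentence, and its character offsets are illustrative:

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PRODUCT_ID")          # hypothetical domain entity

# One toy example: (text, {"entities": [(start_char, end_char, label)]})
train_data = [
    ("Please restock item AX-42 soon.", {"entities": [(20, 25, "PRODUCT_ID")]}),
]

optimizer = nlp.initialize()
for _ in range(20):                  # several passes over the tiny dataset
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("We need more AX-42 units.")
print([(ent.text, ent.label_) for ent in doc.ents])

Real projects need hundreds of labeled examples per entity type; this sketch only shows the training loop's shape.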

16. Sentiment Analysis

Sentiment = emotional polarity of text (positive, negative, neutral).

Using TextBlob

from textblob import TextBlob

txt = "I absolutely love this new phone, but the battery life is too short."
blob = TextBlob(txt)
print(blob.sentiment)
        

Output:

Sentiment(polarity=0.35, subjectivity=0.75)
        

  • Polarity ranges from –1 (negative) to +1 (positive).
  • Subjectivity ranges from 0 (factual) to 1 (opinionated).


Using VADER (great for social media)

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("This movie was surprisingly good!"))
        

Sentiment Pipeline with Transformers

from transformers import pipeline
sentiment_model = pipeline("sentiment-analysis")
print(sentiment_model("The new update is awesome!"))
        
Transformers handle complex syntax, sarcasm, and emojis better than lexicon-based tools.

17. Text Classification

Goal: assign a label/category to a document. Examples:

  • Spam vs Ham
  • Product review stars
  • News topic: sports, politics, tech

Naive Bayes Classifier with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

docs = [
    "I love NLP",
    "Python is amazing for AI",
    "Spam emails are annoying",
    "Deep learning is exciting",
    "Buy cheap products now!!!"
]
labels = ["pos","pos","neg","pos","neg"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
        

Logistic Regression with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
model = LogisticRegression()
model.fit(X, labels)
print(model.predict(tfidf.transform(["I hate spam messages"])))
        

Evaluating Classifiers

Key metrics:

  • Accuracy
  • Precision, Recall, F1-score
  • Confusion Matrix

Use classification_report from sklearn.metrics.
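
For example, continuing the Naive Bayes snippet above:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1, plus the raw confusion matrix
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))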

18. Topic Modelling

Topic modelling groups documents by hidden themes.

Latent Dirichlet Allocation (LDA)

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love deep learning and neural networks",
    "Dogs and cats are lovely pets",
    "Machine learning is fascinating",
    "My cat is playful",
    "Neural networks drive AI progress"
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:")
    words = [feature_names[i] for i in topic.argsort()[-5:]]
    print(words)
        

19. Visualising Text Data

  • Word clouds for keyword prominence
  • Bar charts of token frequency
  • t-SNE / UMAP for embedding spaces

Example – Word Frequency:

from collections import Counter
import matplotlib.pyplot as plt

tokens = ["nlp","nlp","python","ai","python","learning","nlp"]
freq = Counter(tokens)
plt.bar(freq.keys(), freq.values())
plt.title("Word frequency")
plt.show()
        
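For the word clouds mentioned above, the third-party wordcloud package (pip install wordcloud) is a common choice; a minimal sketch reusing the same tokens list:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word size in the cloud tracks frequency in the joined text
wc = WordCloud(width=400, height=300, background_color="white").generate(" ".join(tokens))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()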

20. Putting It Together: Mini Project

Problem: Sentiment Analysis on Product Reviews

  1. Gather a CSV of reviews (review, rating).
  2. Preprocess text (clean, tokenize, lemmatize).
  3. Vectorize with TF-IDF.
  4. Train Logistic Regression.
  5. Evaluate & deploy as API.

Sketch of the pipeline:

Raw reviews → Clean text → Preprocess
   → Vectorize (TF-IDF)
     → Train model
       → Serve predictions (Flask/FastAPI)
        
This simple flow is the foundation for production systems.
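
A minimal sketch of steps 2–5 (minus deployment), assuming a hypothetical reviews.csv with review and rating columns, and treating 4–5 star ratings as positive:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("reviews.csv")                # hypothetical input file
df["label"] = (df["rating"] >= 4).astype(int)  # assumption: 4-5 stars = positive

X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["label"], test_size=0.2, random_state=42)

tfidf = TfidfVectorizer(stop_words="english")
model = LogisticRegression(max_iter=1000)
model.fit(tfidf.fit_transform(X_train), y_train)

print(classification_report(y_test, model.predict(tfidf.transform(X_test))))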

Part 3 – Embeddings, Transformers, Real-Time Projects & Deployment


21. Word Embeddings

Traditional Bag-of-Words/TF-IDF vectors don’t capture semantic meaning. Word embeddings map words to dense vectors in a continuous space where semantically similar words are closer.

Common Techniques

Method | Description
Word2Vec | Predicts a word given its context or vice versa (CBOW / Skip-gram)
GloVe | Global co-occurrence vectors
FastText | Includes subword information (handles rare words)


Word2Vec Example (gensim)

from gensim.models import Word2Vec

sentences = [
    ["nlp", "is", "fun"],
    ["python", "makes", "nlp", "easy"],
    ["deep", "learning", "and", "nlp"]
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=4)
print(model.wv['nlp'])  # Vector for 'nlp'

# Find similar words
print(model.wv.most_similar('nlp', topn=3))
        
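FastText builds vectors from character n-grams, so it can embed words it never saw during training; a minimal sketch reusing the sentences list above:

from gensim.models import FastText

ft = FastText(sentences, vector_size=50, window=2, min_count=1)

# Subword n-grams let FastText produce a vector even for an unseen word
print(ft.wv["nlps"])
print(ft.wv.most_similar("nlp", topn=3))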

Visualizing Embeddings (t-SNE)

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = list(model.wv.index_to_key)
vectors = np.array([model.wv[w] for w in words])

# perplexity must be below the number of points; this toy vocabulary has ~10 words
tsne = TSNE(n_components=2, random_state=0, perplexity=3)
vec_2d = tsne.fit_transform(vectors)

plt.figure(figsize=(6,6))
plt.scatter(vec_2d[:,0], vec_2d[:,1])
for i, w in enumerate(words):
    plt.annotate(w, xy=(vec_2d[i,0], vec_2d[i,1]))
plt.show()
        

22. Transformer Models

Transformers, introduced in the 2017 paper “Attention Is All You Need”, revolutionized NLP. Key ideas:

  • Self-attention mechanism
  • Parallel processing
  • Pre-training + fine-tuning


Popular Transformer Models

Model | Type | Use Cases
BERT | Encoder | Classification, NER, QA
GPT | Decoder | Text generation, chatbots
T5 | Encoder-Decoder | Text-to-text tasks


Using Hugging Face Transformers

from transformers import pipeline

qa_pipeline = pipeline("question-answering")
context = "Python is a popular programming language for AI and NLP."
question = "Why is Python popular?"

result = qa_pipeline(question=question, context=context)
print(result)
        

23. Real-Time NLP Projects

23.1 Chatbot

  • Data: intents.json (patterns, responses, context)
  • Pipeline:

import random
def chatbot_response(intent_class):
    responses = {
        "greeting": ["Hello!", "Hi there!", "Greetings!"],
        "goodbye": ["Bye!", "See you later!", "Take care!"]
    }
    return random.choice(responses[intent_class])
        
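The missing piece is predicting intent_class from the user's message. A minimal sketch using the TF-IDF plus Logistic Regression approach from Part 2; the patterns and intents below stand in for a real intents.json:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training patterns standing in for intents.json
patterns = ["hello", "hi there", "good morning", "bye", "see you later", "goodbye"]
intents = ["greeting", "greeting", "greeting", "goodbye", "goodbye", "goodbye"]

vec = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(patterns), intents)

intent = clf.predict(vec.transform(["hi, good morning!"]))[0]
print(chatbot_response(intent))   # uses the function defined above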

23.2 Sentiment Dashboard

  • Fetch tweets with Tweepy
  • Preprocess & analyze sentiment (TextBlob/VADER/Transformers)
  • Plot histograms or live dashboards (Plotly / Dash), as sketched below
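
A minimal sketch of the analysis and plotting steps, assuming the tweets have already been fetched into a plain list (the Tweepy call itself is omitted):

import nltk
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

tweets = ["Loving the new release!", "Worst update ever.", "It's okay, I guess."]  # placeholder data
analyzer = SentimentIntensityAnalyzer()
scores = [analyzer.polarity_scores(t)["compound"] for t in tweets]

# Compound score runs from -1 (most negative) to +1 (most positive)
plt.hist(scores, bins=10, range=(-1, 1))
plt.title("Tweet sentiment distribution")
plt.xlabel("VADER compound score")
plt.show()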


23.3 Resume Parser

  • Extract text from PDFs (PyPDF2)
  • NER for names, education, skills (spaCy)
  • Output structured JSON for HR tools, as sketched below
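
A minimal sketch, assuming PyPDF2 3.x and the spaCy model installed earlier; resume.pdf is a hypothetical input file:

import json
import spacy
from PyPDF2 import PdfReader

reader = PdfReader("resume.pdf")   # hypothetical input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Group detected entities by label for downstream HR tools
parsed = {}
for ent in doc.ents:
    parsed.setdefault(ent.label_, []).append(ent.text)

print(json.dumps(parsed, indent=2))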


23.4 Document Summarizer

from transformers import pipeline
summarizer = pipeline("summarization")
text = "Long article about AI..."
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])
        

24. Deployment

  1. API Layer: Flask, FastAPI
  2. Containerization: Docker
  3. Scaling: Kubernetes, Celery for async tasks
  4. Model Caching: Pre-load models in memory
  5. Monitoring: Track latency, errors, and usage


Deployment Example (FastAPI)

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")

@app.get("/sentiment")
def analyze(text: str):
    result = classifier(text)
    return result
        

Run via (assuming the code is saved as app.py):

uvicorn app:app --reload
        

25. Best Practices

  • Preprocess consistently in training & inference
  • Use embeddings for semantic understanding
  • Start with pre-trained transformers, fine-tune if necessary
  • Evaluate with metrics suited to the task (accuracy, F1, BLEU, ROUGE)
  • Log input/output for debugging
  • Avoid overfitting with small datasets


26. Advanced Topics

  • Multilingual NLP: mBERT, XLM-R
  • Question Answering: extractive vs generative
  • Text Generation: GPT-3/4, controlled generation
  • Bias Mitigation: detect & reduce harmful model biases
  • Edge Deployment: quantization, ONNX, TensorRT


27. References & Further Reading

  1. Jurafsky, D., & Martin, J. H. Speech and Language Processing (3rd ed.)
  2. Hugging Face Transformers: https://huggingface.co/transformers
  3. spaCy Documentation: https://spacy.io
  4. Gensim Word2Vec Tutorials: https://radimrehurek.com/gensim
  5. NLTK Book: https://www.nltk.org/book/


28. Conclusion

Python AI for NLP allows developers to turn raw text into intelligent applications: chatbots, sentiment analyzers, summarizers, and more. By following best practices:

  • Clean and preprocess data
  • Choose correct embeddings & models
  • Leverage transformers for deep understanding
  • Deploy robust pipelines for production

The future: multilingual AI, low-resource languages, and edge NLP will make AI ubiquitous in everyday language tasks.

🔗 GitHub Repository (Demo)

You can access the complete source code here: 👉

https://github.com/harishalakshan/Python-AI-for-Natural-Language-Processing-.git

Stay tuned for more real-world AI-powered hardware integration projects!

This lesson series is compiled, crafted, and taught by experienced software engineer L.P. Harisha Lakshan Warnakulasuriya.

My Personal Website: https://www.harishalakshanwarnakulasuriya.com

My Portfolio Website: https://main.harishacrypto.xyz

My Newsletter Series: https://newsletter.harishacrypto.xyz

My email address: uniconrprofessionalbay@gmail.com

My GitHub: Sponsor @harishalakshan on GitHub Sponsors






