Python AI for Natural Language Processing (NLP)
L. P. Harisha Lakshan Warnakulasuriya (BSc in Computer Science, OUSL)
Bachelor of Science in Computer Science.
Reading for a Master of Science in Computer Science at the University of Sri Jayewardenepura.
1. What Is NLP?
Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) devoted to enabling computers to understand, interpret, and generate human language. It sits at the intersection of linguistics, computer science, and machine learning.
Goal: make machines as fluent as possible in human languages — text, speech, or even mixed media.
2. Why Python Is the #1 Language for NLP
Python dominates NLP development because:
Reason | Explanation
Readable syntax | Easier to express algorithms and prototypes
Massive library ecosystem | nltk, spacy, textblob, gensim, scikit-learn, transformers
Scientific stack | numpy, pandas, matplotlib, scipy
Integration with DL frameworks | TensorFlow, PyTorch, JAX
Vibrant community | Tutorials, Stack Overflow, pre-trained models
In practice, most research papers and production pipelines in NLP have a Python reference implementation.
3. Core Building Blocks of NLP
Building Block | Purpose | Example
Tokenization | Break text into pieces (tokens) | “Chatbots are cool!” → [Chatbots, are, cool, !]
Normalization | Lower-case, remove punctuation, fix spelling | “COOOL!!” → “cool”
Stopword Removal | Drop high-frequency function words | “and, the, is”
Stemming / Lemmatization | Reduce words to a root form | “running” → “run”
POS Tagging | Label each token’s role (noun, verb, adjective) | Python/NN is/VB great/JJ
Named Entity Recognition (NER) | Find names, organizations, locations | “SpaceX” → ORG
Sentiment Analysis | Measure opinion or emotion | “Fantastic!” → Positive
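Several of these building blocks can be seen in a single spaCy pass. A minimal sketch, assuming the small English model (installed in Section 5 below):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("SpaceX launches rockets, and chatbots are cool!")

# One line per token: surface form, lemma, and part-of-speech tag
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entities detected in the sentence
for ent in doc.ents:
    print(ent.text, ent.label_)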
4. Text Pre-Processing Pipeline
Raw text
↓
Cleaning (remove HTML, digits, etc.)
↓
Tokenization
↓
Lowercase & normalize
↓
Stopword removal
↓
Stemming / Lemmatization
↓
Vectorization (Bag of Words / TF-IDF / Embeddings)
A robust pipeline is crucial: garbage in, garbage out.
5. Installing NLP Libraries
# Create a virtual environment (optional but recommended)
python -m venv nlp_env
source nlp_env/bin/activate # or nlp_env\Scripts\activate on Windows
# Install key packages
pip install nltk spacy textblob gensim scikit-learn matplotlib
python -m spacy download en_core_web_sm # small English model
6. Loading and Cleaning Text
# Load text from a file
with open("article.txt", "r", encoding="utf8") as f:
    text = f.read()

# Simple cleaning: keep only letters, digits, and whitespace
import re
clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
clean_text = clean_text.lower()
print(clean_text[:300])
Tip: For HTML, use BeautifulSoup to strip tags before regex cleaning.
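A minimal sketch of that tip, assuming beautifulsoup4 is installed (pip install beautifulsoup4) and an illustrative HTML snippet:
from bs4 import BeautifulSoup

html = "<p>Hello <b>NLP</b> world!</p>"
text = BeautifulSoup(html, "html.parser").get_text()
print(text)  # Hello NLP world!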
7. Tokenization
Using nltk
import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize
sample = "Natural Language Processing (NLP) is amazing!"
tokens = word_tokenize(sample)
print(tokens)
Using spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sample)
tokens = [token.text for token in doc]
print(tokens)
8. Stopword Removal
from nltk.corpus import stopwords
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
9. Lemmatization
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")
lemm = WordNetLemmatizer()
lemmas = [lemm.lemmatize(w) for w in filtered]
print(lemmas)
Diagram of the flow:
Raw text
→ Tokenize
→ Remove stopwords
→ Lemmatize
10. Combining Steps into a Function
def preprocess(text):
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
tokens = word_tokenize(text.lower())
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
lemm = WordNetLemmatizer()
return [lemm.lemmatize(w) for w in filtered]
print(preprocess("Dogs are running quickly in the park!"))
11. Representing Text as Numbers
After cleaning text, models need vectors.
Technique | Description | Library
Bag of Words | Counts word frequency | sklearn
TF-IDF | Weights counts by rarity across documents | sklearn
Word Embeddings | Dense vectors capturing meaning | gensim, transformers
Example – TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["I love NLP", "NLP loves Python", "Python is great"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())
12. Challenges in NLP
Ambiguity: the same word or sentence can mean different things in different contexts.
Sarcasm and irony: the literal wording often contradicts the intended sentiment.
Spelling variation, slang, and emojis: common in social-media text.
Domain shift: models trained on news text may fail on medical or legal documents.
Low-resource languages: most tools, corpora, and pre-trained models focus on English.
13. Best Practices for Beginners
Start with small, clean datasets and simple models before reaching for deep learning.
Inspect tokenizer and preprocessing output by hand; bugs here silently poison everything downstream.
Keep preprocessing identical between training and inference.
Prefer pre-trained models and embeddings over training from scratch.
Part 2 – POS Tagging, Named Entity Recognition, Sentiment & Text Classification
14. Part-of-Speech (POS) Tagging
Definition: POS tagging labels each token with its grammatical role: noun, verb, adjective, etc. This helps algorithms understand sentence structure and relationships.
Example with nltk
import nltk
nltk.download("averaged_perceptron_tagger")
sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
print(tags)
Output:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'),
('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'),
('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
Example with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token.text, token.pos_, token.tag_)
spaCy provides both universal POS (pos_) and detailed tag (tag_).
15. Named Entity Recognition (NER)
NER detects and classifies “named entities”: people, organizations, places, dates, monetary values.
spaCy NER
text = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
Common labels:
Label | Meaning
PERSON | Person
ORG | Organization
GPE | Country/City
DATE | Calendar date
MONEY | Monetary value
Customizing NER
spaCy allows training your own NER model if your domain has entities like CHEMICAL, LAW_CASE, PRODUCT_ID.
Use spacy.blank("en") to start, then feed labeled data with Example objects, as in the minimal sketch below.
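A minimal training sketch, assuming spaCy 3.x; the PRODUCT_ID label, the sample sentence, and its character offsets are illustrative only:
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PRODUCT_ID")  # hypothetical custom entity label

# Each item: (text, {"entities": [(start_char, end_char, label)]})
train_data = [("Order AB-123 shipped today.", {"entities": [(6, 12, "PRODUCT_ID")]})]

optimizer = nlp.initialize()
for _ in range(20):  # a few passes over the toy data
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("Order AB-123 shipped today.")
print([(ent.text, ent.label_) for ent in doc.ents])
Real projects use many labeled examples and spaCy's spacy train CLI with a config file; this loop only shows the moving parts.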
16. Sentiment Analysis
Sentiment = emotional polarity of text (positive, negative, neutral).
Using TextBlob
from textblob import TextBlob
txt = "I absolutely love this new phone, but the battery life is too short."
blob = TextBlob(txt)
print(blob.sentiment)
Output:
Sentiment(polarity=0.35, subjectivity=0.75)
Using VADER (great for social media)
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")
analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("This movie was surprisingly good!"))
Sentiment Pipeline with Transformers
from transformers import pipeline
sentiment_model = pipeline("sentiment-analysis")
print(sentiment_model("The new update is awesome!"))
Transformers handle complex syntax, sarcasm, emojis better than lexicon-based tools.
17. Text Classification
Goal: assign a label/category to a document. Examples: spam vs. ham email filtering, positive/negative review sentiment, news topic labeling (sports, politics, tech), and support-ticket routing.
Naive Bayes Classifier with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
docs = [
    "I love NLP",
    "Python is amazing for AI",
    "Spam emails are annoying",
    "Deep learning is exciting",
    "Buy cheap products now!!!"
]
labels = ["pos","pos","neg","pos","neg"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)
clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
Logistic Regression with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
model = LogisticRegression()
model.fit(X, labels)
print(model.predict(tfidf.transform(["I hate spam messages"])))
Evaluating Classifiers
Key metrics: accuracy (fraction of predictions that are correct), precision (fraction of predicted positives that are truly positive), recall (fraction of true positives that are found), and F1 score (harmonic mean of precision and recall).
Use classification_report from sklearn.metrics.
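Continuing the Naive Bayes example above (on such a tiny toy split the numbers mean little, but the mechanics are the same):
from sklearn.metrics import classification_report

# y_test and pred come from the Naive Bayes example in Section 17
print(classification_report(y_test, pred))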
18. Topic Modelling
Topic modelling groups documents by hidden themes.
Latent Dirichlet Allocation (LDA)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    "I love deep learning and neural networks",
    "Dogs and cats are lovely pets",
    "Machine learning is fascinating",
    "My cat is playful",
    "Neural networks drive AI progress"
]
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:")
    words = [vectorizer.get_feature_names_out()[i] for i in topic.argsort()[-5:]]
    print(words)
19. Visualising Text Data
Example – Word Frequency:
from collections import Counter
import matplotlib.pyplot as plt
tokens = ["nlp","nlp","python","ai","python","learning","nlp"]
freq = Counter(tokens)
plt.bar(freq.keys(), freq.values())
plt.title("Word frequency")
plt.show()
20. Putting It Together: Mini Project
Problem: Sentiment Analysis on Product Reviews
Sketch of the pipeline:
Raw reviews → Clean text → Preprocess
→ Vectorize (TF-IDF)
→ Train model
→ Serve predictions (Flask/FastAPI)
This simple flow is the foundation for production systems.
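A compact, self-contained sketch of that flow, with made-up reviews and labels standing in for a real dataset (a real project would add the preprocessing from Sections 6-10 and a proper train/test split):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy reviews standing in for real product data
reviews = [
    "Great phone, fast and reliable",
    "Terrible battery, totally disappointed",
    "Love the camera quality",
    "Worst purchase I have ever made"
]
labels = ["pos", "neg", "pos", "neg"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(reviews)

model = LogisticRegression()
model.fit(X, labels)

print(model.predict(tfidf.transform(["battery life is terrible"])))  # expected: ['neg']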
Part 3 – Embeddings, Transformers, Real-Time Projects & Deployment
21. Word Embeddings
Traditional Bag-of-Words/TF-IDF vectors don’t capture semantic meaning. Word embeddings map words to dense vectors in a continuous space where semantically similar words are closer.
Common Techniques
Method | Description
Word2Vec | Predicts a word given its context, or vice versa (CBOW / Skip-gram)
GloVe | Vectors built from global co-occurrence statistics
FastText | Includes subword information (handles rare words)
Word2Vec Example (gensim)
from gensim.models import Word2Vec
sentences = [
    ["nlp", "is", "fun"],
    ["python", "makes", "nlp", "easy"],
    ["deep", "learning", "and", "nlp"]
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=4)
print(model.wv['nlp']) # Vector for 'nlp'
# Find similar words
print(model.wv.most_similar('nlp', topn=3))
Visualizing Embeddings (t-SNE)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
words = list(model.wv.index_to_key)
vectors = [model.wv[w] for w in words]
# perplexity must be smaller than the number of points (this toy vocabulary has ~10 words)
tsne = TSNE(n_components=2, random_state=0, perplexity=5)
vec_2d = tsne.fit_transform(vectors)
plt.figure(figsize=(6,6))
plt.scatter(vec_2d[:,0], vec_2d[:,1])
for i, w in enumerate(words):
    plt.annotate(w, xy=(vec_2d[i, 0], vec_2d[i, 1]))
plt.show()
22. Transformer Models
Transformers revolutionized NLP (introduced in the 2017 paper “Attention Is All You Need”). Key ideas: self-attention, which lets every token attend to every other token in the sequence; parallel processing of whole sequences instead of word-by-word recurrence; and large-scale pre-training followed by task-specific fine-tuning.
Popular Transformer Models
Model | Type | Use Cases
BERT | Encoder | Classification, NER, QA
GPT | Decoder | Text generation, chatbots
T5 | Encoder-Decoder | Text-to-text tasks
Using Hugging Face Transformers
from transformers import pipeline
qa_pipeline = pipeline("question-answering")
context = "Python is a popular programming language for AI and NLP."
question = "Why is Python popular?"
result = qa_pipeline(question=question, context=context)
print(result)
23. Real-Time NLP Projects
23.1 Chatbot
import random

def chatbot_response(intent_class):
    responses = {
        "greeting": ["Hello!", "Hi there!", "Greetings!"],
        "goodbye": ["Bye!", "See you later!", "Take care!"]
    }
    return random.choice(responses[intent_class])
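To make this respond to raw user text, an intent detector is needed in front of it. A toy keyword-based sketch (a real chatbot would use a trained intent classifier):
def detect_intent(message):
    words = message.lower().split()
    if any(w in ("hello", "hi", "hey") for w in words):
        return "greeting"
    if any(w in ("bye", "goodbye") for w in words):
        return "goodbye"
    return "greeting"  # naive fallback for this toy example

print(chatbot_response(detect_intent("Hi there!")))  # e.g. "Hello!"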
23.2 Sentiment Dashboard
Score incoming reviews or posts with a sentiment model and display the results live; one minimal sketch follows.
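A minimal sketch assuming Streamlit (pip install streamlit) and the TextBlob analyzer from Section 16; save it as, say, dashboard.py (file name is illustrative) and run streamlit run dashboard.py:
import streamlit as st
from textblob import TextBlob

st.title("Sentiment Dashboard")
review = st.text_input("Enter a review:")
if review:
    polarity = TextBlob(review).sentiment.polarity
    st.write("Polarity:", polarity)
    st.write("Positive" if polarity > 0 else "Negative" if polarity < 0 else "Neutral")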
23.3 Resume Parser
Extract names, organizations, and contact details from CVs using NER plus pattern matching; a small sketch follows.
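A small sketch combining spaCy NER with a regex for email addresses (the resume text is made up):
import re
import spacy

nlp = spacy.load("en_core_web_sm")
resume_text = "Jane Doe worked at Google as a data scientist. Contact: jane.doe@example.com"
doc = nlp(resume_text)

names = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", resume_text)
print(names, orgs, emails)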
23.4 Document Summarizer
from transformers import pipeline
summarizer = pipeline("summarization")
text = "Long article about AI..."  # replace with a genuinely long article; very short inputs make summarization pointless
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])
24. Deployment
Deployment Example (FastAPI)
from fastapi import FastAPI
from transformers import pipeline
app = FastAPI()
classifier = pipeline("sentiment-analysis")
@app.get("/sentiment")
def analyze(text: str):
    result = classifier(text)
    return result
Run via (assuming the code above is saved as app.py):
uvicorn app:app --reload
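Once the server is up, a quick client-side check (assuming the requests package is installed):
import requests

resp = requests.get("http://127.0.0.1:8000/sentiment", params={"text": "I love NLP"})
print(resp.json())  # e.g. [{"label": "POSITIVE", "score": 0.99...}]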
25. Best Practices
26. Advanced Topics
27. References & Further Reading
28. Conclusion
Python AI for NLP allows developers to turn raw text into intelligent applications: chatbots, sentiment analyzers, summarizers, and more. Following the best practices above takes you from toy scripts toward reliable production systems.
The future: multilingual AI, low-resource languages, and edge NLP will make AI ubiquitous in everyday language tasks.
🔗 GitHub Repository (Demo)
You can access the complete source code here: 👉
Stay tuned for more real-world AI-powered hardware integration projects!
This lesson series is compiled, crafted, and taught by experienced software engineer L.P. Harisha Lakshan Warnakulasuriya.
My Personal Website -: https://www.harishalakshanwarnakulasuriya.com
My Portfolio Website -: https://main.harishacrypto.xyz
My Newsletter Series -: https://newsletter.harishacrypto.xyz
My email address: uniconrprofessionalbay@gmail.com
My GitHub Portfolio: @harishalakshan (GitHub Sponsors)