Python AI for Natural Language Processing (NLP)

L. P. Harisha Lakshan Warnakulasuriya (BSc in Computer Science, OUSL).

Bachelor of Science in Computer Science.

Reading for the Master of Science in Computer Science at the University of Sri Jayewardenepura.

Contents

  • Introduction → Core NLP concepts
  • Pre-processing, tokenization, POS, NER
  • Sentiment analysis, text classification, topic modeling
  • Word embeddings, transformers, real-time projects
  • Deployment, best practices, advanced techniques

1. What Is NLP?

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) devoted to enabling computers to understand, interpret, and generate human language. It sits at the intersection of:

  • Linguistics (structure and meaning of language)
  • Computer Science (algorithms, data structures)
  • Machine Learning / Deep Learning (statistical models that learn from data)

Goal: make machines as fluent as possible in human languages — text, speech, or even mixed media.

2. Why Python Is the #1 Language for NLP

Python dominates NLP development because:

Reason | Explanation
Readable syntax | Easier to express algorithms & prototypes
Massive library ecosystem | nltk, spacy, textblob, gensim, scikit-learn, transformers
Scientific stack | numpy, pandas, matplotlib, scipy
Integration with DL frameworks | TensorFlow, PyTorch, JAX
Vibrant community | Tutorials, Stack Overflow, pre-trained models

In practice, most research papers and production pipelines in NLP have a Python reference implementation.

3. Core Building Blocks of NLP

Building Block | Purpose | Example
Tokenization | Break text into pieces (tokens) | “Chatbots are cool!” → [Chatbots, are, cool, !]
Normalization | Lower-case, remove punctuation, fix spelling | “COOOL!!” → “cool”
Stopword Removal | Drop high-frequency function words | “and”, “the”, “is”
Stemming / Lemmatization | Reduce words to their root | “running” → “run”
POS Tagging | Label each token’s role (noun, verb, adj) | Python/NN is/VBZ great/JJ
Named Entity Recognition (NER) | Find names, organizations, locations | “SpaceX” → ORG
Sentiment Analysis | Measure opinion or emotion | “Fantastic!” → Positive
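
A single spaCy pipeline covers several of these building blocks in one pass; a minimal sketch, assuming the en_core_web_sm model from Section 5 is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("SpaceX builds fantastic rockets!")

for token in doc:
    # text, lemma, universal POS tag, and stopword flag for each token
    print(token.text, token.lemma_, token.pos_, token.is_stop)

for ent in doc.ents:
    # named entities found by the same pipeline
    print(ent.text, ent.label_)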


4. Text Pre-Processing Pipeline

Raw text
   ↓
Cleaning (remove HTML, digits, etc.)
   ↓
Tokenization
   ↓
Lowercase & normalize
   ↓
Stopword removal
   ↓
Stemming / Lemmatization
   ↓
Vectorization (Bag of Words / TF-IDF / Embeddings)
        
A robust pipeline is crucial: garbage-in = garbage-out.

5. Installing NLP Libraries

# Create a virtual environment (optional but recommended)
python -m venv nlp_env
source nlp_env/bin/activate  # or nlp_env\Scripts\activate on Windows

# Install key packages
pip install nltk spacy textblob gensim scikit-learn matplotlib
python -m spacy download en_core_web_sm   # small English model
        

6. Loading and Cleaning Text

# Load text from a file
with open("article.txt", "r", encoding="utf8") as f:
    text = f.read()

# Simple cleaning
import re
clean_text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
clean_text = clean_text.lower()
print(clean_text[:300])
        
Tip: For HTML, use BeautifulSoup to strip tags before regex cleaning.

7. Tokenization

Using nltk

import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

sample = "Natural Language Processing (NLP) is amazing!"
tokens = word_tokenize(sample)
print(tokens)
        

Using spaCy

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(sample)
tokens = [token.text for token in doc]
print(tokens)
        

8. Stopword Removal

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)
        

9. Lemmatization

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

lemm = WordNetLemmatizer()
lemmas = [lemm.lemmatize(w) for w in filtered]
print(lemmas)
        

Diagram of the flow:

Raw text
 → Tokenize
   → Remove stopwords
     → Lemmatize
        

10. Combining Steps into a Function

def preprocess(text):
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    import nltk
    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words("english"))
    filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
    lemm = WordNetLemmatizer()
    return [lemm.lemmatize(w) for w in filtered]

print(preprocess("Dogs are running quickly in the park!"))
        

11. Representing Text as Numbers

After cleaning, text must be converted into numeric vectors before a model can work with it.

Technique | Description | Library
Bag of Words | Counts word frequency | sklearn
TF-IDF | Weights counts by rarity | sklearn
Word Embeddings | Dense vectors capturing meaning | gensim, transformers

Example – TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love NLP", "NLP loves Python", "Python is great"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
        

12. Challenges in NLP

  • Ambiguity: words with multiple meanings
  • Context: “bank” (river vs finance)
  • Domain adaptation: slang, emojis, legal jargon
  • Data sparsity: rare words, languages with limited corpora
  • Bias & fairness: models may inherit stereotypes from data


13. Best Practices for Beginners

  1. Start with small datasets, inspect outputs.
  2. Always evaluate pre-processing impact.
  3. Keep your pipeline modular (each step testable).
  4. Use pre-trained models whenever possible.
  5. Understand tokenization details — crucial for later deep-learning models.


Part 2 – POS Tagging, Named Entity Recognition, Sentiment & Text Classification


14. Part-of-Speech (POS) Tagging

Definition: POS tagging labels each token with its grammatical role: noun, verb, adjective, etc. This helps algorithms understand sentence structure and relationships.

Example with nltk

import nltk
nltk.download("punkt")  # required by word_tokenize
nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)
print(tags)
        

Output:

[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'),
 ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'),
 ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
        

Example with spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
for token in doc:
    print(token.text, token.pos_, token.tag_)
        
spaCy provides both the universal POS tag (pos_) and the fine-grained Treebank tag (tag_).

15. Named Entity Recognition (NER)

NER detects and classifies “named entities”: people, organizations, places, dates, monetary values.

spaCy NER

text = "Elon Musk founded SpaceX in 2002 and acquired Twitter in 2022."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
        

Common labels:

Label | Meaning
PERSON | Person
ORG | Organization
GPE | Country/City
DATE | Calendar date
MONEY | Monetary value


Customizing NER

spaCy allows training your own NER model if your domain has entities like CHEMICAL, LAW_CASE, PRODUCT_ID.

Use spacy.blank("en") to start, then feed labeled data with Example objects.
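
A minimal training sketch, assuming spaCy 3.x; the PRODUCT_ID label, the sample sentence, and its character offsets are illustrative:

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PRODUCT_ID")          # hypothetical domain entity

# One toy example: (text, {"entities": [(start_char, end_char, label)]})
train_data = [
    ("Please restock item AX-42 soon.", {"entities": [(20, 25, "PRODUCT_ID")]}),
]

optimizer = nlp.initialize()
for _ in range(20):                  # several passes over the tiny dataset
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("We need more AX-42 units.")
print([(ent.text, ent.label_) for ent in doc.ents])

Real projects need hundreds of labeled examples per entity type; this sketch only shows the training loop's shape.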

16. Sentiment Analysis

Sentiment = emotional polarity of text (positive, negative, neutral).

Using TextBlob

from textblob import TextBlob

txt = "I absolutely love this new phone, but the battery life is too short."
blob = TextBlob(txt)
print(blob.sentiment)
        

Output:

Sentiment(polarity=0.35, subjectivity=0.75)
        

  • Polarity ranges from –1 (negative) to +1 (positive).
  • Subjectivity ranges from 0 (factual) to 1 (opinionated).


Using VADER (great for social media)

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download("vader_lexicon")

analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("This movie was surprisingly good!"))
        

Sentiment Pipeline with Transformers

from transformers import pipeline
sentiment_model = pipeline("sentiment-analysis")
print(sentiment_model("The new update is awesome!"))
        
Transformers handle complex syntax, sarcasm, and emojis better than lexicon-based tools.

17. Text Classification

Goal: assign a label/category to a document. Examples:

  • Spam vs Ham
  • Product review stars
  • News topic: sports, politics, tech

Naive Bayes Classifier with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

docs = [
    "I love NLP",
    "Python is amazing for AI",
    "Spam emails are annoying",
    "Deep learning is exciting",
    "Buy cheap products now!!!"
]
labels = ["pos","pos","neg","pos","neg"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

clf = MultinomialNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
        

Logistic Regression with TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
model = LogisticRegression()
model.fit(X, labels)
print(model.predict(tfidf.transform(["I hate spam messages"])))
        

Evaluating Classifiers

Key metrics:

  • Accuracy
  • Precision, Recall, F1-score
  • Confusion Matrix

Use classification_report from sklearn.metrics.
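
For example, continuing the Naive Bayes snippet above:

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision, recall, and F1, plus the raw confusion matrix
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))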

18. Topic Modelling

Topic modelling groups documents by hidden themes.

Latent Dirichlet Allocation (LDA)

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love deep learning and neural networks",
    "Dogs and cats are lovely pets",
    "Machine learning is fascinating",
    "My cat is playful",
    "Neural networks drive AI progress"
]

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

feature_names = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}:")
    words = [feature_names[i] for i in topic.argsort()[-5:]]
    print(words)
        

19. Visualising Text Data

  • Word clouds for keyword prominence
  • Bar charts of token frequency
  • t-SNE / UMAP for embedding spaces

Example – Word Frequency:

from collections import Counter
import matplotlib.pyplot as plt

tokens = ["nlp","nlp","python","ai","python","learning","nlp"]
freq = Counter(tokens)
plt.bar(freq.keys(), freq.values())
plt.title("Word frequency")
plt.show()
        
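For the word clouds mentioned above, the third-party wordcloud package (pip install wordcloud) is a common choice; a minimal sketch reusing the same tokens list:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Word size in the cloud tracks frequency in the joined text
wc = WordCloud(width=400, height=300, background_color="white").generate(" ".join(tokens))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()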

20. Putting It Together: Mini Project

Problem: Sentiment Analysis on Product Reviews

  1. Gather a CSV of reviews (review, rating).
  2. Preprocess text (clean, tokenize, lemmatize).
  3. Vectorize with TF-IDF.
  4. Train Logistic Regression.
  5. Evaluate & deploy as API.

Sketch of the pipeline:

Raw reviews → Clean text → Preprocess
   → Vectorize (TF-IDF)
     → Train model
       → Serve predictions (Flask/FastAPI)
        
This simple flow is the foundation for production systems.
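
A minimal sketch of steps 2–5 (minus deployment), assuming a hypothetical reviews.csv with review and rating columns, and treating 4–5 star ratings as positive:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("reviews.csv")                # hypothetical input file
df["label"] = (df["rating"] >= 4).astype(int)  # assumption: 4-5 stars = positive

X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["label"], test_size=0.2, random_state=42)

tfidf = TfidfVectorizer(stop_words="english")
model = LogisticRegression(max_iter=1000)
model.fit(tfidf.fit_transform(X_train), y_train)

print(classification_report(y_test, model.predict(tfidf.transform(X_test))))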

Part 3 – Embeddings, Transformers, Real-Time Projects & Deployment


21. Word Embeddings

Traditional Bag-of-Words/TF-IDF vectors don’t capture semantic meaning. Word embeddings map words to dense vectors in a continuous space where semantically similar words are closer.

Common Techniques

Method | Description
Word2Vec | Predicts a word given its context or vice versa (CBOW / Skip-gram)
GloVe | Global co-occurrence vectors
FastText | Includes subword information (handles rare words)


Word2Vec Example (gensim)

from gensim.models import Word2Vec

sentences = [
    ["nlp", "is", "fun"],
    ["python", "makes", "nlp", "easy"],
    ["deep", "learning", "and", "nlp"]
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, workers=4)
print(model.wv['nlp'])  # Vector for 'nlp'

# Find similar words
print(model.wv.most_similar('nlp', topn=3))
        
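FastText builds vectors from character n-grams, so it can embed words it never saw during training; a minimal sketch reusing the sentences list above:

from gensim.models import FastText

ft = FastText(sentences, vector_size=50, window=2, min_count=1)

# Subword n-grams let FastText produce a vector even for an unseen word
print(ft.wv["nlps"])
print(ft.wv.most_similar("nlp", topn=3))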

Visualizing Embeddings (t-SNE)

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

words = list(model.wv.index_to_key)
vectors = np.array([model.wv[w] for w in words])

# perplexity must be below the number of points; this toy vocabulary has ~10 words
tsne = TSNE(n_components=2, random_state=0, perplexity=3)
vec_2d = tsne.fit_transform(vectors)

plt.figure(figsize=(6,6))
plt.scatter(vec_2d[:,0], vec_2d[:,1])
for i, w in enumerate(words):
    plt.annotate(w, xy=(vec_2d[i,0], vec_2d[i,1]))
plt.show()
        

22. Transformer Models

Transformers, introduced in the 2017 paper “Attention Is All You Need”, revolutionized NLP. Key ideas:

  • Self-attention mechanism
  • Parallel processing
  • Pre-training + fine-tuning


Popular Transformer Models

Model | Type | Use Cases
BERT | Encoder | Classification, NER, QA
GPT | Decoder | Text generation, chatbots
T5 | Encoder-Decoder | Text-to-text tasks


Using Hugging Face Transformers

from transformers import pipeline

qa_pipeline = pipeline("question-answering")
context = "Python is a popular programming language for AI and NLP."
question = "Why is Python popular?"

result = qa_pipeline(question=question, context=context)
print(result)
        

23. Real-Time NLP Projects

23.1 Chatbot

  • Data: intents.json (patterns, responses, context)
  • Pipeline:

import random
def chatbot_response(intent_class):
    responses = {
        "greeting": ["Hello!", "Hi there!", "Greetings!"],
        "goodbye": ["Bye!", "See you later!", "Take care!"]
    }
    return random.choice(responses[intent_class])
        
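The missing piece is predicting intent_class from the user's message. A minimal sketch using the TF-IDF plus Logistic Regression approach from Part 2; the patterns and intents below stand in for a real intents.json:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training patterns standing in for intents.json
patterns = ["hello", "hi there", "good morning", "bye", "see you later", "goodbye"]
intents = ["greeting", "greeting", "greeting", "goodbye", "goodbye", "goodbye"]

vec = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vec.fit_transform(patterns), intents)

intent = clf.predict(vec.transform(["hi, good morning!"]))[0]
print(chatbot_response(intent))   # uses the function defined above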

23.2 Sentiment Dashboard

  • Fetch tweets with Tweepy
  • Preprocess & analyze sentiment (TextBlob/VADER/Transformers)
  • Plot histograms or live dashboards (Plotly / Dash), as sketched below
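
A minimal sketch of the analysis and plotting steps, assuming the tweets have already been fetched into a plain list (the Tweepy call itself is omitted):

import nltk
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

tweets = ["Loving the new release!", "Worst update ever.", "It's okay, I guess."]  # placeholder data
analyzer = SentimentIntensityAnalyzer()
scores = [analyzer.polarity_scores(t)["compound"] for t in tweets]

# Compound score runs from -1 (most negative) to +1 (most positive)
plt.hist(scores, bins=10, range=(-1, 1))
plt.title("Tweet sentiment distribution")
plt.xlabel("VADER compound score")
plt.show()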


23.3 Resume Parser

  • Extract text from PDFs (PyPDF2)
  • NER for names, education, skills (spaCy)
  • Output structured JSON for HR tools, as sketched below
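
A minimal sketch, assuming PyPDF2 3.x and the spaCy model installed earlier; resume.pdf is a hypothetical input file:

import json
import spacy
from PyPDF2 import PdfReader

reader = PdfReader("resume.pdf")   # hypothetical input file
text = "\n".join(page.extract_text() or "" for page in reader.pages)

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Group detected entities by label for downstream HR tools
parsed = {}
for ent in doc.ents:
    parsed.setdefault(ent.label_, []).append(ent.text)

print(json.dumps(parsed, indent=2))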


23.4 Document Summarizer

from transformers import pipeline
summarizer = pipeline("summarization")
text = "Long article about AI..."
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])
        

24. Deployment

  1. API Layer: Flask, FastAPI
  2. Containerization: Docker
  3. Scaling: Kubernetes, Celery for async tasks
  4. Model Caching: Pre-load models in memory
  5. Monitoring: Track latency, errors, and usage


Deployment Example (FastAPI)

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis")

@app.get("/sentiment")
def analyze(text: str):
    result = classifier(text)
    return result
        

Run via (assuming the code is saved as app.py):

uvicorn app:app --reload
        

25. Best Practices

  • Preprocess consistently in training & inference
  • Use embeddings for semantic understanding
  • Start with pre-trained transformers, fine-tune if necessary
  • Evaluate with metrics suited to the task (accuracy, F1, BLEU, ROUGE)
  • Log input/output for debugging
  • Avoid overfitting with small datasets


26. Advanced Topics

  • Multilingual NLP: mBERT, XLM-R
  • Question Answering: extractive vs generative
  • Text Generation: GPT-3/4, controlled generation
  • Bias Mitigation: detect & reduce harmful model biases
  • Edge Deployment: quantization, ONNX, TensorRT


27. References & Further Reading

  1. Jurafsky, D., & Martin, J. H. Speech and Language Processing (3rd ed.)
  2. Hugging Face Transformers: https://huggingface.co/transformers
  3. spaCy Documentation: https://spacy.io
  4. Gensim Word2Vec Tutorials: https://radimrehurek.com/gensim
  5. NLTK Book: https://www.nltk.org/book/


28. Conclusion

Python AI for NLP allows developers to turn raw text into intelligent applications: chatbots, sentiment analyzers, summarizers, and more. By following best practices:

  • Clean and preprocess data
  • Choose correct embeddings & models
  • Leverage transformers for deep understanding
  • Deploy robust pipelines for production

The future: multilingual AI, low-resource languages, and edge NLP will make AI ubiquitous in everyday language tasks.

🔗 GitHub Repository (Demo)

You can access the complete source code here: 👉

https://github.com/harishalakshan/Python-AI-for-Natural-Language-Processing-.git

Stay tuned for more real-world AI-powered hardware integration projects!

This lesson series is compiled, crafted, and taught by experienced software engineer L.P. Harisha Lakshan Warnakulasuriya.

My Personal Website: https://www.harishalakshanwarnakulasuriya.com

My Portfolio Website: https://main.harishacrypto.xyz

My Newsletter Series: https://newsletter.harishacrypto.xyz

My email address: uniconrprofessionalbay@gmail.com

My GitHub: Sponsor @harishalakshan on GitHub Sponsors






