Text Preprocessing
These are some preprocessing steps to be performed while working on unstructured text data.
Converting into lowercase
1. Noise Removal:
Removal of words based on the domain.
Removal of noise using regular expressions.
Removal of stopwords.
2. Lexicon Normalization:
Lemmatization.
Stemming.
Singularization.
3. POS tagging
4. Spell check
5. Synonyms
6. Named Entity Recognition
Feature Engineering on text data
7. Word Embeddings
Packages:
nltk
textblob
gensim
pattern3
autocorrect
################################################################################
Converting into lowercase:
Converting all the text into lowercase.
df['text'] = df['text'].str.lower()
Noise Removal:
Removal of words based on the domain.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# add domain-specific words that should also be treated as stopwords
stop_words.update(('may', 'eels', 'i e', 'rv', 'ades', 'abaca', 'comma', 'e', ' e ', ' g ', 'g', 'alfalfa', 'we', 'us'))
texts = pd.Series(" ".join([w for w in sentence.split(" ") if w not in stop_words]) for sentence in texts)
Removal of noise using regular expressions & removal of numbers:
import re

def noiseRemoval(st):
    st = st.replace('\n', ' ').replace('\r', '')
    st = st.replace("-", " ")
    st = st.replace("’s", " ")  # drop the possessive before stripping the apostrophe itself
    st = st.replace("’", " ")
    # remove punctuation and special characters
    st = re.sub(r"[.,/?\'\"\\;:\[\]{}!@#$^()]+", " ", st, flags=re.DOTALL | re.U | re.I)
    # drop stray single characters surrounded by spaces
    st = re.sub(r" [a-z0-9]{1} ", " ", st)
    st = st.replace("&", "")
    return st.lower()

df['text'] = df['text'].apply(noiseRemoval)
Removal of numbers.
# drop tokens that are purely numeric from each document
df['text'] = df['text'].apply(lambda s: " ".join(w for w in s.split() if not w.isdigit()))
Removal of unnecessary tags (e.g. leftover HTML markup), as sketched below.
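A minimal sketch of tag removal, assuming the noise is HTML-style markup (the regex and column name are illustrative):
import re

def remove_tags(text):
    # drop anything that looks like an HTML/XML tag, e.g. "<br>" or "</p>"
    return re.sub(r"<[^>]+>", " ", str(text))

df['text'] = df['text'].apply(remove_tags)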
Removal of stopwords.
# Read the txt file that contains stopwords
stopwords = set(w.rstrip() for w in open('stopwords.txt'))
# note: an alternative source of stopwords
# from nltk.corpus import stopwords
# stopwords.words('english')
# add more stopwords specific to this problem
stopwords = stopwords.union({
'introduction', 'edition', 'series', 'application',
'approach', 'card', 'access', 'package', 'plus', 'etext',
'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed',
'third', 'second', 'fourth', })
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def my_tokenizer(s):
    s = s.lower()  # downcase
    tokens = nltk.tokenize.word_tokenize(s)  # split string into words (tokens)
    tokens = [t for t in tokens if len(t) > 2]  # remove short words, they're probably not useful
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]  # put words into base form
    tokens = [t for t in tokens if t not in stopwords]  # remove stopwords
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]  # remove any digits, i.e. "3rd edition"
    return tokens
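The tokenizer above can then be applied to every document; a small usage sketch, assuming the same df['text'] column used earlier:
tokenized_docs = df['text'].apply(my_tokenizer)
print(tokenized_docs.head())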
Lexicon Normalization:
Lemmatization.
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
texts1 = [" ".join([wnl.lemmatize(word, pos='v') for word in sentence.split(" ")]) for sentence in texts]
Stemming.
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))

new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))
# difference between stemming and lemmatization
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('wolves')  # -> 'wolv' (stemming can produce non-words)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('wolves')  # -> 'wolf' (lemmatization returns a valid dictionary form)
Singularization.
from textblob import TextBlob

def singularization(st):
    # singularize every word in the sentence using TextBlob
    sentence = TextBlob(str(st))
    singular_words = list(sentence.words.singularize())
    return " ".join(singular_words)
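A quick usage sketch (the example sentence and expected output are illustrative):
print(singularization("two dogs and three cats"))  # expected output (approximately): "two dog and three cat"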
POS tagging:
import nltk
a = 'Machine learning is a field of computer science that gives computer systems the ability to learn'
nltk.pos_tag(a.split())
from textblob import TextBlob
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
wiki.tags
Spell check:
Check the spelling of a word and suggest the most likely correction for a misspelled word.
from textblob import TextBlob, Word
w = Word('falibility')
w.spellcheck()
w = Word('pupet')
w.spellcheck()
from pattern3.en import suggest
suggest('falibility')
suggest('citru')
from autocorrect import spell
spell('HTe')
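TextBlob can also correct a whole sentence at once (corrections are probabilistic, so results may not always be right):
from textblob import TextBlob
print(TextBlob("I havv goood speling").correct())  # -> "I have good spelling"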
Synonyms:
from nltk.corpus import wordnet as wn
# To get the synsets (sets of synonyms) of 'dog'
wn.synsets('dog')
# To get only the synsets that are verbs
wn.synsets('dog', pos=wn.VERB)
# To get the definition of a synset
print(wn.synset('dog.n.01').definition())
# Output: member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
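To pull out the actual synonym words (rather than synset objects), the lemma names of each synset can be collected; a small sketch:
from nltk.corpus import wordnet as wn
synonyms = set()
for syn in wn.synsets('dog'):
    for lemma in syn.lemmas():  # each lemma is one synonym word in that synset
        synonyms.add(lemma.name())
print(synonyms)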
Named Entity Recognition:
Named entities are real-world objects such as persons, organizations, and locations, for example:
'Albert Einstein' -> person
'Apple' -> organization
s = 'Albert Einstein was born on March 14, 1879'
# need to download the maxent_ne_chunker and words corpora first
nltk.download('maxent_ne_chunker')
nltk.download('words')
tags = nltk.pos_tag(s.split())
tags
nltk.ne_chunk(tags)
nltk.ne_chunk(tags).draw()
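Instead of drawing the tree, the labelled entities can also be extracted programmatically; a minimal sketch:
tree = nltk.ne_chunk(tags)
for subtree in tree:
    # named-entity chunks are subtrees carrying a label such as PERSON or GPE
    if hasattr(subtree, 'label'):
        entity = " ".join(token for token, pos in subtree.leaves())
        print(subtree.label(), '->', entity)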
Word Embeddings:
1. Pubmed Word2vec Embeddings:
These are word2vec vectors trained on biomedical text (Wikipedia, PubMed, and PMC), useful when dealing with medical data; each word vector has 200 dimensions.
download it from here
import gensim
from gensim.models import KeyedVectors
path = "input\\wikipedia-pubmed-and-PMC-w2v.bin"
modelW2v = KeyedVectors.load_word2vec_format(path, binary = True)
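Once loaded, the model behaves like any gensim KeyedVectors object; a small sketch ('protein' is an illustrative query word and must exist in the model's vocabulary):
print(modelW2v.vector_size)  # 200
if 'protein' in modelW2v:
    print(modelW2v['protein'][:5])  # first few components of its 200-dimensional vector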
2. GloVe Embeddings:
GloVe stands for Global Vectors for Word Representation.
download it from here
https://nlp.stanford.edu/projects/glove/
import gensim
from gensim.models import KeyedVectors
path = "input\\glove.6B.300d.txt.word2vec"
glovModel = KeyedVectors.load_word2vec_format(path, binary = False)
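The .word2vec file above is assumed to be the raw GloVe file already converted to word2vec text format; in older gensim versions that conversion can be done with the glove2word2vec script (file names are illustrative):
from gensim.scripts.glove2word2vec import glove2word2vec
# adds the "<vocab_size> <dimensions>" header line that the word2vec text format expects
glove2word2vec("input\\glove.6B.300d.txt", "input\\glove.6B.300d.txt.word2vec")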
3. Google News Word2vec Embeddings
This repository hosts the word2vec pre-trained Google News corpus (3 billion running words) word vector model (3 million 300-dimension English word vectors).
download it from here
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
import gensim
from gensim.models import KeyedVectors
path = "input\\GoogleNews-vectors-negative300.bin.gz"
model = KeyedVectors.load_word2vec_format(path, binary = True)
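A small usage sketch (the query words are assumptions; loading the full model needs several GB of RAM):
print(model.most_similar('king', topn=5))
print(model.similarity('king', 'queen'))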
Conclusion:
There are many techniques for text preprocessing that help deep learning models learn from text.