Text Preprocessing
These are some preprocessing steps to be performed while working on unstructured text data.
Converting into lowercase
1. Noise Removal:
Removal of words based on the domain.
Removal of noise using regular expressions.
Removal of stopwords.
2. Lexicon Normalization:
Lemmatization.
Stemming.
Singularization.
3. POS tagging
4. Spell check
5. Synonyms
6. Named Entity Recognition
Feature Engineering on text data
7. Word Embeddings
Packages:
nltk
textblob
gensim
pattern3
autocorrect
################################################################################
Converting into lowercase:
Converting all the text into lowercase.
df['text'] = df['text'].str.lower()
Noise Removal:
Removal of words based on the domain.
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# add domain-specific words that should also be treated as stopwords
stop_words.update(('may', 'eels', 'i e', 'rv', 'ades', 'abaca', 'comma', 'e', ' e ', ' g ', 'g', 'alfalfa', 'we', 'us'))
texts = pd.Series(" ".join([w for w in sentence.split(" ") if w not in stop_words]) for sentence in texts)
Removal of noise using regular expressions & removal of numbers:
import re

def noiseRemoval(st):
    st = st.replace('\n', ' ').replace('\r', '')
    st = st.replace("-", " ")
    st = st.replace("’s", " ")  # drop the possessive before stripping the apostrophe itself
    st = st.replace("’", " ")
    # remove punctuation and special characters
    st = re.sub(r"[.,/?\'\"\\;:\[\]{}!@#$^()]+", " ", st, flags=re.DOTALL | re.U | re.I)
    # drop stray single characters surrounded by spaces
    st = re.sub(r" [a-z0-9]{1} ", " ", st)
    st = st.replace("&", "")
    return st.lower()

df['text'] = df['text'].apply(noiseRemoval)
Removal of numbers.
# drop tokens that are purely numeric from each document
df['text'] = df['text'].apply(lambda s: " ".join(w for w in s.split() if not w.isdigit()))
Removal of unnecessary tags (e.g. leftover HTML markup), as sketched below.
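A minimal sketch of tag removal, assuming the noise is HTML-style markup (the regex and column name are illustrative):
import re

def remove_tags(text):
    # drop anything that looks like an HTML/XML tag, e.g. "<br>" or "</p>"
    return re.sub(r"<[^>]+>", " ", str(text))

df['text'] = df['text'].apply(remove_tags)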
Removal of stopwords.
# Read the txt file that contains stopwords
stopwords = set(w.rstrip() for w in open('stopwords.txt'))
# note: an alternative source of stopwords
# from nltk.corpus import stopwords
# stopwords.words('english')
# add more stopwords specific to this problem
stopwords = stopwords.union({
'introduction', 'edition', 'series', 'application',
'approach', 'card', 'access', 'package', 'plus', 'etext',
'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed',
'third', 'second', 'fourth', })
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def my_tokenizer(s):
    s = s.lower()  # downcase
    tokens = nltk.tokenize.word_tokenize(s)  # split string into words (tokens)
    tokens = [t for t in tokens if len(t) > 2]  # remove short words, they're probably not useful
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]  # put words into base form
    tokens = [t for t in tokens if t not in stopwords]  # remove stopwords
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]  # remove any digits, i.e. "3rd edition"
    return tokens
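The tokenizer above can then be applied to every document; a small usage sketch, assuming the same df['text'] column used earlier:
tokenized_docs = df['text'].apply(my_tokenizer)
print(tokenized_docs.head())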
Lexicon Normalization:
Lemmatization.
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
texts1 = [" ".join([wnl.lemmatize(word, pos='v') for word in sentence.split(" ")]) for sentence in texts]
Stemming.
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))

new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))
# difference between stemming and lemmatization
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('wolves')  # -> 'wolv' (stemming can produce non-words)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('wolves')  # -> 'wolf' (lemmatization returns a valid dictionary form)
Singularization.
from textblob import TextBlob

def singularization(st):
    # singularize every word in the sentence using TextBlob
    sentence = TextBlob(str(st))
    singular_words = list(sentence.words.singularize())
    return " ".join(singular_words)
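A quick usage sketch (the example sentence and expected output are illustrative):
print(singularization("two dogs and three cats"))  # expected output (approximately): "two dog and three cat"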
POS tagging:
import nltk
a = 'Machine learning is a field of computer science that gives computer systems the ability to learn'
nltk.pos_tag(a.split())
from textblob import TextBlob
wiki = TextBlob("Python is a high-level, general-purpose programming language.")
wiki.tags
Spell check:
Check the spelling of a word and suggest the most likely correction for a misspelled word.
from textblob import TextBlob, Word
w = Word('falibility')
w.spellcheck()
w = Word('pupet')
w.spellcheck()
from pattern3.en import suggest
suggest('falibility')
suggest('citru')
from autocorrect import spell
spell('HTe')
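TextBlob can also correct a whole sentence at once (corrections are probabilistic, so results may not always be right):
from textblob import TextBlob
print(TextBlob("I havv goood speling").correct())  # -> "I have good spelling"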
Synonyms:
from nltk.corpus import wordnet as wn
# To get the synsets (sets of synonyms) of 'dog'
wn.synsets('dog')
# To get only the synsets that are verbs
wn.synsets('dog', pos=wn.VERB)
# To get the definition of a synset
print(wn.synset('dog.n.01').definition())
# Output: member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
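To pull out the actual synonym words (rather than synset objects), the lemma names of each synset can be collected; a small sketch:
from nltk.corpus import wordnet as wn
synonyms = set()
for syn in wn.synsets('dog'):
    for lemma in syn.lemmas():  # each lemma is one synonym word in that synset
        synonyms.add(lemma.name())
print(synonyms)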
Named Entity Recognition:
Named entities are real-world objects such as persons, organizations, and locations, for example:
'Albert Einstein' -> person
'Apple' -> organization
s = 'Albert Einstein was born on March 14, 1879'
# need to download the maxent_ne_chunker and words corpora first
nltk.download('maxent_ne_chunker')
nltk.download('words')
tags = nltk.pos_tag(s.split())
tags
nltk.ne_chunk(tags)
nltk.ne_chunk(tags).draw()
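Instead of drawing the tree, the labelled entities can also be extracted programmatically; a minimal sketch:
tree = nltk.ne_chunk(tags)
for subtree in tree:
    # named-entity chunks are subtrees carrying a label such as PERSON or GPE
    if hasattr(subtree, 'label'):
        entity = " ".join(token for token, pos in subtree.leaves())
        print(subtree.label(), '->', entity)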
Word Embeddings:
1. Pubmed Word2vec Embeddings:
These are word2vec vectors trained on biomedical text (Wikipedia, PubMed, and PMC), useful when dealing with medical data; each word vector has 200 dimensions.
download it from here
import gensim
from gensim.models import KeyedVectors
path = "input\\wikipedia-pubmed-and-PMC-w2v.bin"
modelW2v = KeyedVectors.load_word2vec_format(path, binary = True)
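Once loaded, the model behaves like any gensim KeyedVectors object; a small sketch ('protein' is an illustrative query word and must exist in the model's vocabulary):
print(modelW2v.vector_size)  # 200
if 'protein' in modelW2v:
    print(modelW2v['protein'][:5])  # first few components of its 200-dimensional vector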
2. GloVe Embeddings:
GloVe stands for Global Vectors for Word Representation.
download it from here
https://nlp.stanford.edu/projects/glove/
import gensim
from gensim.models import KeyedVectors
path = "input\\glove.6B.300d.txt.word2vec"
glovModel = KeyedVectors.load_word2vec_format(path, binary = False)
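The .word2vec file above is assumed to be the raw GloVe file already converted to word2vec text format; in older gensim versions that conversion can be done with the glove2word2vec script (file names are illustrative):
from gensim.scripts.glove2word2vec import glove2word2vec
# adds the "<vocab_size> <dimensions>" header line that the word2vec text format expects
glove2word2vec("input\\glove.6B.300d.txt", "input\\glove.6B.300d.txt.word2vec")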
3. Google News Word2vec Embeddings
This repository hosts the word2vec pre-trained Google News corpus (3 billion running words) word vector model (3 million 300-dimension English word vectors).
download it from here
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
import gensim
from gensim.models import KeyedVectors
path = "input\\GoogleNews-vectors-negative300.bin.gz"
model = KeyedVectors.load_word2vec_format(path, binary = True)
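A small usage sketch (the query words are assumptions; loading the full model needs several GB of RAM):
print(model.most_similar('king', topn=5))
print(model.similarity('king', 'queen'))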
Conclusion:
There are many techniques for text preprocessing that help deep learning models learn from text.