Text Preprocessing


These are some preprocessing steps to be performed while working on unstructured text data.

Converting into lowercase

1. Noise Removal:

    Removal of words based on the domain.

    Removal of noise using regular expressions.

    Removal of stopwords.

2. Lexicon Normalization:

    Lemmatization.

    Stemming.

    Singularization.

3. POS tagging

4. Spell check

5. Synonyms

6. Named Entity Recognition

Feature Engineering on text data

7. Word Embeddings


Packages: 

nltk

textblob

gensim

autocorrect

pattern3

################################################################################

Converting into lowercase:

    Converting all the text into lowercase.

df['text'] = df['text'].str.lower()

Noise Removal:

    Removal of words based on the domain.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.update(('may', 'eels', 'i e', 'rv', 'ades', 'abaca', 'comma', 'e', ' e ', ' g ', 'g', 'alfalfa', 'we', 'us'))

# drop every stop word from each sentence
texts = pd.Series([" ".join(w for w in sentence.split(" ") if w not in stop_words) for sentence in texts])

    Removal of noise using regular expressions & removal of numbers:

import re

def noiseRemoval(st):
    # normalize line breaks and hyphens
    st = st.replace('\n', ' ').replace('\r', '')
    st = st.replace("-", " ")
    # strip possessives before removing the lone apostrophe
    st = st.replace("’s", " ")
    st = st.replace("’", " ")
    # remove punctuation and special characters
    st = re.sub(r"[.,/?\'\"\\;:\[\]{}!@#$^()]+", " ", st, flags=re.DOTALL | re.U | re.I)
    # remove isolated single characters such as " a " or " 3 "
    st = re.sub(r" [a-z0-9] ", " ", st)
    st = st.replace("&", "")
    return st.lower()

df['text'] = df['text'].apply(noiseRemoval)

     Removal of numbers.

# drop purely numeric tokens from each document
df['text'] = df['text'].apply(lambda s: " ".join(w for w in s.split() if not w.isdigit()))

       Remove unnecessary tags (e.g. HTML markup).

      Removal of stopwords.

# Read the txt file that contains stopwords
stopwords = set(w.rstrip() for w in open('stopwords.txt'))

# note: an alternative source of stopwords
# from nltk.corpus import stopwords
# stopwords.words('english')

# add more stopwords specific to this problem
stopwords = stopwords.union({
    'introduction', 'edition', 'series', 'application',
    'approach', 'card', 'access', 'package', 'plus', 'etext',
    'brief', 'vol', 'fundamental', 'guide', 'essential', 'printed',
    'third', 'second', 'fourth'})

import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

def my_tokenizer(s):
    s = s.lower()  # downcase
    tokens = nltk.tokenize.word_tokenize(s)  # split string into words (tokens)
    tokens = [t for t in tokens if len(t) > 2]  # remove short words, they're probably not useful
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]  # put words into base form
    tokens = [t for t in tokens if t not in stopwords]  # remove stopwords
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]  # remove tokens containing digits, e.g. "3rd"
    return tokens
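A minimal usage sketch, assuming the same df['text'] column as above:

df['tokens'] = df['text'].apply(my_tokenizer)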

Lexicon Normalization:

    Lemmatization.

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
texts1 = [" ".join(wnl.lemmatize(word, pos='v') for word in sentence.split(" ")) for sentence in texts]
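Passing pos='v' makes WordNet treat each word as a verb, which often changes the result; a quick check:

wnl.lemmatize('running')           # 'running' (default pos is noun)
wnl.lemmatize('running', pos='v')  # 'run'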

    Stemming.

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

example_words = ["python", "pythoner", "pythoning", "pythoned", "pythonly"]
for w in example_words:
    print(ps.stem(w))

new_text = "It is important to be very pythonly while you are pythoning with python. All pythoners have pythoned poorly at least once."
words = word_tokenize(new_text)
for w in words:
    print(ps.stem(w))

# difference between stemming and lemmatization
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
porter_stemmer.stem('wolves')  # 'wolv' (stemming just chops suffixes, which can leave a non-word)

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('wolves')  # 'wolf' (lemmatization maps to a real dictionary word)

    Singularization.

from textblob import TextBlob, Word

def singularization(st):
    # singularize every word in the sentence (works best on nouns)
    sentence = TextBlob(str(st))
    singularWords = list(sentence.words.singularize())
    return ' '.join(singularWords)
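For example, with the function above:

singularization("two dogs and three cats")  # 'two dog and three cat'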

POS tagging:

import nltk

a = 'Machine learning is a field of computer science that gives computer systems the ability to learn'
nltk.pos_tag(a.split())

from textblob import TextBlob

wiki = TextBlob("Python is a high-level, general-purpose programming language.")

wiki.tags
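The (word, tag) pairs can then be filtered, for example to keep only the nouns (Penn Treebank tags starting with 'NN'); a small sketch:

nouns = [w for w, tag in nltk.pos_tag(a.split()) if tag.startswith('NN')]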

Spell check:

Check the spelling of a word and find the most likely correction for a misspelled word.

from textblob import TextBlob, Word

w = Word('falibility')

w.spellcheck()  # [('fallibility', 1.0)]

w = Word('pupet')

w.spellcheck()

from pattern3.en import suggest

suggest('falibility')

suggest('citru')

from autocorrect import spell  # note: newer autocorrect versions expose Speller instead of spell

spell('HTe')

Synonyms:

from nltk.corpus import wordnet as wn

# To get synonym sets
wn.synsets('dog')

# To get only the verb senses
wn.synsets('dog', pos=wn.VERB)

# To get a definition
print(wn.synset('dog.n.01').definition())
# member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
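To pull the actual synonym words out of the synsets, collect the lemma names; a small sketch:

synonyms = set()
for syn in wn.synsets('dog'):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
print(synonyms)  # e.g. {'dog', 'domestic_dog', 'Canis_familiaris', ...}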

Named Entity Recognition:

What are the entities?

'Albert Einstein' -> person
'Apple' -> organization

s = 'Albert Einstein was born on March 14, 1879'

# need the maxent_ne_chunker and words resources
nltk.download('maxent_ne_chunker')
nltk.download('words')

tags = nltk.pos_tag(s.split())
tags
nltk.ne_chunk(tags)
nltk.ne_chunk(tags).draw()
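Instead of drawing the tree, the labeled chunks can also be read off programmatically; a minimal sketch:

tree = nltk.ne_chunk(tags)
entities = [(" ".join(w for w, t in chunk.leaves()), chunk.label())
            for chunk in tree if hasattr(chunk, 'label')]
print(entities)  # e.g. [('Albert Einstein', 'PERSON')]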

Word Embeddings:

1. PubMed Word2vec Embeddings:

These are word vectors trained on biomedical text (PubMed, PMC, and Wikipedia), useful for biomedical natural language processing when dealing with medical data; each vector has 200 dimensions.

download it from here

http://bio.nlplab.org/

import gensim
from gensim.models import KeyedVectors

path = "input\\wikipedia-pubmed-and-PMC-w2v.bin"
modelW2v = KeyedVectors.load_word2vec_format(path, binary=True)
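Once loaded, the model can be queried like any gensim KeyedVectors object; a sketch, assuming the word exists in the vocabulary:

modelW2v['aspirin']               # the 200-dimensional vector for the word
modelW2v.most_similar('aspirin')  # nearest words by cosine similarity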

2. GloVe Embeddings:

GloVe stands for Global Vectors for Word Representation.

download it from here

https://nlp.stanford.edu/projects/glove/

import gensim
from gensim.models import KeyedVectors

path = "input\\glove.6B.300d.txt.word2vec"
glovModel = KeyedVectors.load_word2vec_format(path, binary=False)
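The raw GloVe download is not in word2vec format; the .word2vec file used above can be produced with gensim's converter (a sketch; the file names mirror the path assumed above):

from gensim.scripts.glove2word2vec import glove2word2vec

glove2word2vec("input\\glove.6B.300d.txt", "input\\glove.6B.300d.txt.word2vec")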

3. Google News Word2vec Embeddings:

This repository hosts the word2vec model pre-trained on the Google News corpus (3 billion running words): 3 million 300-dimensional English word vectors.

download it from here

https://github.com/mmihaltz/word2vec-GoogleNews-vectors

import gensim
from gensim.models import KeyedVectors

path = "input\\GoogleNews-vectors-negative300.bin.gz"
model = KeyedVectors.load_word2vec_format(path, binary=True)
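A quick sanity check on the loaded vectors, using the classic analogy example:

model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
# expected to return [('queen', ...)] with high similarity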

Conclusion:

There are many techniques for text preprocessing that help deep learning models learn from text.
