Text Preprocessing
Text must be preprocessed to make it machine understandable. Just as images are normalized before being fed to a model, raw text needs to be normalized once it has been obtained.
Text Normalization
Text Normalization includes:
- Converting all letters to lower or upper case
- Converting numbers into words or removing numbers
- Removing punctuation marks and accent marks (a sketch for accent removal follows this list)
- Removing extra whitespace
- Expanding abbreviations
- Removing stop words, sparse terms, and other unwanted words
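Most of these steps are covered with code below. Accent marks are not, so here is a minimal sketch using the standard-library unicodedata module (the helper name remove_accents is just illustrative): it decomposes accented characters and drops the combining marks.

import unicodedata

def remove_accents(text):
    # Decompose accented characters (NFKD), then drop the combining marks
    normalized = unicodedata.normalize('NFKD', text)
    return normalized.encode('ascii', 'ignore').decode('ascii')

remove_accents('café naïve résumé')
# output: 'cafe naive resume'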
Converting to Lowercase
>>> text = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
>>> text.lower()
'the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.'
Removing Numbers
>>> import re
>>> text = 'Box A contains 3 red and 5 white balls.'
>>> result = re.sub(r'\s*\d+', '', text)  # strip each number together with the space before it
>>> result
'Box A contains red and white balls.'
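The normalization list above also mentions converting numbers into words rather than deleting them. A minimal sketch, assuming the third-party num2words package is installed (pip install num2words); the exact wording it produces may vary between versions.

import re
from num2words import num2words

text = 'Box A contains 3 red and 5 white balls.'
# Replace each digit sequence with its spelled-out form
result = re.sub(r'\d+', lambda m: num2words(int(m.group())), text)
result
# expected output: 'Box A contains three red and five white balls.'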
Removing Punctuation
import re
import string

text = 'hello!, this is what?'
# Build a regex character class containing every punctuation character
regex = re.compile('[%s]' % re.escape(string.punctuation))
result = regex.sub('', text)
result
# output: 'hello this is what'
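The same result can be obtained without a regular expression by using str.translate with a table that deletes every punctuation character:

import string

text = 'hello!, this is what?'
result = text.translate(str.maketrans('', '', string.punctuation))
result
# output: 'hello this is what'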
Removing Whitespace
input_str = " \t a string example\t "
# strip() removes leading and trailing whitespace, including tabs
input_str = input_str.strip()
input_str
# output: 'a string example'
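strip() only removes whitespace at the ends of the string. To also collapse runs of whitespace inside the text, a regular expression works; a minimal sketch:

import re

input_str = " \t a   string \t example\t "
# Replace every run of whitespace with a single space, then trim the ends
result = re.sub(r'\s+', ' ', input_str).strip()
result
# output: 'a string example'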
Removing Stop Words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)
# output: ['NLTK', 'leading','platform', 'building', 'Python', 'programs', 'work','human', 'language', 'data', '.']
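Note that the stop word list, the tokenizer models, and the WordNet data used later ship as separate NLTK data packages; on a fresh installation they need to be downloaded once:

import nltk
nltk.download('stopwords')  # stop word lists
nltk.download('punkt')      # tokenizer models used by word_tokenize (newer NLTK versions may also need 'punkt_tab')
nltk.download('wordnet')    # used by the lemmatization example below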
Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
input_str = "There are several types of stemming algorithms."
tokens = word_tokenize(input_str)
for word in tokens:
    print(stemmer.stem(word))
# output (one stem per line): there are sever type of stem algorithm .
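PorterStemmer is the classic choice; NLTK also provides the Snowball (Porter2) stemmer, a refinement of the Porter algorithm that supports several languages. A minimal sketch:

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer('english')
print(snowball.stem('stemming'))    # stem
print(snowball.stem('algorithms'))  # algorithm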
Lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
tokens = word_tokenize(input_str)
for word in tokens:
    print(lemmatizer.lemmatize(word))
# output (one lemma per line): been had done language city mouse
By default the lemmatizer treats every word as a noun, which is why the verb forms above are unchanged; passing a part-of-speech tag fixes this, as shown below.
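A minimal sketch of the same lemmatizer with pos="v", which reduces the verb forms as well:

for word in ['been', 'had', 'done']:
    print(lemmatizer.lemmatize(word, pos='v'))
# output (one lemma per line): be have do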