Text Normalization#
Overview#
The objective of text normalization is to clean up the text by removing unnecessary and irrelevant components. What to include or exclude for later analysis depends heavily on the research questions. Before any preprocessing step, always ask yourself whether removing the target structures would substantially change the data distribution.
import spacy
import unicodedata
import re
import collections
import requests
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, LancasterStemmer, RegexpStemmer, SnowballStemmer
from bs4 import BeautifulSoup
## NLTK downloads
nltk.download(['wordnet', 'stopwords'])
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
True
Stemming#
Stemming is the process where we standardize word forms into their base stem irrespective of their inflections (or other derivational variations).
The nltk library provides several popular stemmers for English:
nltk.stem.PorterStemmer
nltk.stem.LancasterStemmer
nltk.stem.RegexpStemmer
nltk.stem.SnowballStemmer
We can compare the results of different stemmers.
words = ['jumping', 'jumps', 'jumped', 'jumpy']
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')
rs = RegexpStemmer('ing$|s$|ed$|y$', min=4) # min: the minimum length a word must have to be stemmed
## Porter Stemmer
[ps.stem(w) for w in words]
['jump', 'jump', 'jump', 'jumpi']
## Lancaster Stemmer
[ls.stem(w) for w in words]
['jump', 'jump', 'jump', 'jumpy']
## Snowball Stemmer
[ss.stem(w) for w in words]
['jump', 'jump', 'jump', 'jumpi']
## Regular Expression Stemmer
[rs.stem(w) for w in words]
['jump', 'jump', 'jump', 'jump']
Lemmatization#
Lemmatization is similar to stemming.
It is a process where we remove word affixes to get the root word, not just the root stem.
These root words, i.e., lemmas, are lexicographically correct words that are always present in the dictionary.
Question
Between lemmatization and stemming, which one requires more computational cost? That is, which process might be slower?
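One way to build an intuition is to time both on the same word list. The snippet below is a rough sketch, not a rigorous benchmark: the word list and repetition count are arbitrary, and the lemmatizer's first call also loads the WordNet dictionary.
import timeit
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
sample_words = ['running', 'jumps', 'happily', 'crashes'] * 1000

## time rule-based stemming vs. dictionary-based lemmatization
print(timeit.timeit(lambda: [ps.stem(w) for w in sample_words], number=10))
print(timeit.timeit(lambda: [wnl.lemmatize(w, 'v') for w in sample_words], number=10))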
Two frequently used lemmatizers:
nltk.stem.WordNetLemmatizer
spacy
WordNet Lemmatizer#
WordNetLemmatizer utilizes the dictionary of WordNet.
It requires the part of speech of the word for lemmatization.
In practice, nouns, verbs, and adjectives are the parts of speech you will use most often with WordNetLemmatizer.
Important
For the WordNet-based lemmatizer, it is important to specify the part of speech of the word form.
Valid options are "n" for nouns, "v" for verbs, "a" for adjectives, "r" for adverbs, and "s" for satellite adjectives.
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# nouns
nouns = ['cars','men','feet']
nouns_lemma = [wnl.lemmatize(n, 'n') for n in nouns]
print(nouns_lemma)
['car', 'men', 'foot']
# verbs
verbs = ['running', 'ate', 'grew', 'grown']
verbs_lemma = [wnl.lemmatize(v, 'v') for v in verbs]
print(verbs_lemma)
['run', 'eat', 'grow', 'grow']
# adj
adjs = ['saddest', 'fancier', 'jumpy', 'better', 'least', 'less', 'worse', 'worst','best', 'better']
adjs_lemma = [wnl.lemmatize(a, 'a') for a in adjs]
print(adjs_lemma)
['sad', 'fancy', 'jumpy', 'good', 'least', 'less', 'bad', 'bad', 'best', 'good']
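To see why the part of speech matters, compare the default behavior (nouns) with an explicit verb tag; with the default, inflected verb forms are left unchanged.
## the default POS is 'n' (noun), so the verb form is not lemmatized
print(wnl.lemmatize('ate'))       # 'ate'
print(wnl.lemmatize('ate', 'v'))  # 'eat'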
Spacy#
Warning
To use spacy properly, you need to download and install the English language model before you load its parameter files. Please see the spacy documentation for installation steps.
Also, please remember to install the language model in the right conda environment.
For example, on my Mac:
$ conda activate python-notes
$ pip install spacy
$ python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # disable the dependency parser and named-entity recognizer
text = 'My system keeps crashing his crashed yesterday, ours crashes daily'
text_tagged = nlp(text)
spacy processes the text by tokenizing it into tokens and enriching them with many annotations.
for t in text_tagged:
    print(t.text + '/' + t.lemma_ + '/' + t.pos_)
My/my/PRON
system/system/NOUN
keeps/keep/VERB
crashing/crash/VERB
his/his/PRON
crashed/crashed/NOUN
yesterday/yesterday/NOUN
,/,/PUNCT
ours/ours/PRON
crashes/crash/VERB
daily/daily/ADV
def lemmatize_text(text):
    text = nlp(text)
    # spaCy 2.x lemmatizes pronouns as the placeholder '-PRON-'; keep the original token in that case
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text
lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")
'my system keep crash ! his crashed yesterday , ours crash daily'
Contractions#
For English data, contractions can sometimes be problematic.
Things get even more complicated when different tokenizers handle contractions differently.
A good strategy is to expand all contractions into their original, independent word forms.
## -------------------- ##
## Colab Only ##
## -------------------- ##
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Note
If you are using your local Python environment, please download the TAWP directory from the ENC2045_demo_data. This directory contains code snippets provided in Sarkar's (2020) book.
We assume that you have saved the TAWP directory under GoogleDrive/ENC2045_demo_data.
## -------------------- ##
## Colab Only ##
## -------------------- ##
## working dir
!pwd
/content
## -------------------- ##
## Colab Only ##
## -------------------- ##
## check ENC2045_demo_data
!ls /content/drive/MyDrive/ENC2045_demo_data/
05-3_clauseorders.csv dcard-top100.csv PDF TAWP
addition-student-version.csv glove pdf-firth-text.png TAWP.zip
date-student-version.csv movie_reviews.csv stopwords
## -------------------- ##
## Colab Only ##
## -------------------- ##
!cp -r /content/drive/MyDrive/ENC2045_demo_data/TAWP /content/
import TAWP  ## You have to put the TAWP directory under your WORKING DIRECTORY
from TAWP.contractions import CONTRACTION_MAP
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    ## create a regex pattern of all contracted forms
    contractions_pattern = re.compile('({})'.format('|'.join(
        contraction_mapping.keys())), flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)  # the whole matched contraction
        # if the matched contraction (=key) exists in the dict,
        # get its corresponding uncontracted form (=value)
        expanded_contraction = contraction_mapping.get(match)\
            if contraction_mapping.get(match)\
            else contraction_mapping.get(match.lower())
        return expanded_contraction

    # find each occurrence of contractions_pattern in text
    # and replace it with the output of expand_match
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
Note
In pattern.sub(repl, string), when repl is a function like the one above, the function is called for every non-overlapping occurrence of the pattern. The function expand_match takes a single matched contraction and returns the replacement string, i.e., its uncontracted form in the dictionary.
In other words, contractions_pattern.sub(expand_match, text) above says: substitute every non-overlapping occurrence of the pattern contractions_pattern in text with the output of expand_match().
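As a minimal, self-contained illustration of pattern.sub() with a callable (the pattern and strings below are made up purely for demonstration):
import re

## the callable receives a match object and returns the replacement string
pat = re.compile(r'\d+')
print(pat.sub(lambda m: str(int(m.group(0)) * 2), "3 cats and 5 dogs"))
## => '6 cats and 10 dogs'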
print(expand_contractions("Y'all can't expand contractions I'd think"))
print(expand_contractions("I'm very glad he's here! And it ain't here!"))
you all cannot expand contractions I would think
I am very glad he is here! And it is not here!
type(CONTRACTION_MAP)
dict
list(CONTRACTION_MAP.items())[:5] # check the first five items
[("ain't", 'is not'),
("aren't", 'are not'),
("can't", 'cannot'),
("can't've", 'cannot have'),
("'cause", 'because')]
Accented Characters (Non-ASCII)#
The unicodedata module handles unicode characters very efficiently. Please check the unicodedata documentation for more details.
When dealing with English data, we often encounter foreign characters in texts that are not part of the ASCII character set.
## Function: remove accented chars
def remove_accented_chars(text):
    # NFKD applies the compatibility decomposition, i.e.,
    # replaces all compatibility characters with their equivalents
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
remove_accented_chars('Sómě Áccěntěd těxt')
# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt'))
# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt').encode('ascii','ignore'))
# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt').encode('ascii','ignore').decode('utf-8', 'ignore'))
'Some Accented text'
Note
str.encode() returns an encoded version of the string as a bytes object using the specified encoding.
bytes.decode() returns a string decoded from the given bytes using the specified encoding.
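A quick illustration of why the NFKD normalization step comes first: encoding straight to ASCII simply drops the accented characters, whereas decomposing them first preserves their base letters (a sketch reusing the example string from above).
s = 'Sómě Áccěntěd těxt'
print(s.encode('ascii', 'ignore').decode('utf-8'))  # 'Sm ccntd txt': accented letters dropped entirely
print(unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('utf-8'))  # 'Some Accented text'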
Another common scenario is text that contains both English and Chinese characters, where, to make things worse, the English characters are in full-width form.
## Chinese characters with full-width English letters and punctuations
text = '中英文abc,,。..ABC123'
print(unicodedata.normalize('NFKD', text))
print(unicodedata.normalize('NFKC', text)) # recommended method
print(unicodedata.normalize('NFC', text))
print(unicodedata.normalize('NFD', text))
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
Sometimes, we may even want to keep characters of one language only.
text = "中文CHINESE。!==.= ^o^ 2020/5/20 alvin@gmal.cob@%&*"
# remove puncs/symbols
print(''.join(
    [c for c in text if unicodedata.category(c)[0] not in ["P", "S"]]))
# select letters
print(''.join([c for c in text if unicodedata.category(c)[0] in ["L"]]))
# remove alphabets
print(''.join(
    [c for c in text if unicodedata.category(c)[:2] not in ["Lu", 'Ll']]))
# select Chinese chars?
print(''.join([c for c in text if unicodedata.category(c)[:2] in ["Lo"]]))
中文CHINESE o 2020520 alvingmalcob
中文CHINESEoalvingmalcob
中文。!==.= ^^ 2020/5/20 @.@%&*
中文
Note
Please check this page for unicode category names.
The unicode category Lo ("Letter, other") appears to be a reasonable way to identify Chinese characters.
We can also make use of the category names to identify punctuation, as shown below.
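For instance, a small sketch (with a made-up sample string) that keeps only punctuation characters, i.e., those whose category starts with "P":
sample = "Hello, world! 中文，測試。"
print(''.join([c for c in sample if unicodedata.category(c).startswith("P")]))
## => ',!，。'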
Note
This page shows how unicode deals with combining or decomposing character classes.
Special Characters#
Depending on the research questions and the defined tasks, we often need to decide whether to remove irrelevant characters.
Common irrelevant (aka. non-informative) characters may include:
Punctuation marks and symbols
Digits
Any other non-alphanumeric characters
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text
s = "This is a simple case! Removing 1 or 2 symbols is probably ok...:)"
print(remove_special_characters(s))
print(remove_special_characters(s, True))
This is a simple case Removing 1 or 2 symbols is probably ok
This is a simple case Removing or symbols is probably ok
Warning
In the following example, if we use the same remove_special_characters()
to pre-process the text, what additional problems will we encounter?
Any suggestions or better alternative methods?
s = "It's a complex sentences, and I'm not sure if it's ok to replace all symbols then :( What now!!??)"
print(remove_special_characters(s))
Its a complex sentences and Im not sure if its ok to replace all symbols then What now
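One possible alternative is to expand the contractions first (a sketch reusing expand_contractions() defined earlier), so that forms like it's and I'm are not collapsed into its and Im when the symbols are stripped.
## expand contractions before removing the remaining symbols
print(remove_special_characters(expand_contractions(s)))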
Stopwords#
At the word-token level, there are words that have little semantic information and are usually removed from text in text preprocessing. These words are often referred to as stopwords.
However, there is no universal stopword list. Whether a word is informative or not depends on your research/project objective. It is a linguistic decision.
The nltk.corpus.stopwords.words() function provides a standard English stopword list.
tokenizer = ToktokTokenizer()
# assuming one sentence per line;
# for other tokenizers, sentence tokenization may need to come first
stopword_list = nltk.corpus.stopwords.words('english')

## Function: remove stopwords (English)
def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
remove_stopwords("The, and, if are stopwords, computer is not")
', , stopwords , computer'
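Because the informativeness of a word depends on the task, a common adjustment is to customize the stopword list. For example, a sentiment-oriented project might want to keep the negations (a sketch; which words to keep is your decision):
## keep negations in the text
custom_stopwords = [w for w in stopword_list if w not in ('no', 'not', 'nor')]
remove_stopwords("The movie is not good", stopwords=custom_stopwords)
## => 'movie not good'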
We can check the languages of the stopword lists provided by nltk.
nltk.corpus.stopwords.fileids()
['arabic',
'azerbaijani',
'basque',
'bengali',
'catalan',
'chinese',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'greek',
'hebrew',
'hinglish',
'hungarian',
'indonesian',
'italian',
'kazakh',
'nepali',
'norwegian',
'portuguese',
'romanian',
'russian',
'slovene',
'spanish',
'swedish',
'tajik',
'turkish']
Redundant Whitespaces#
Very often we see redundant whitespaces in texts.
Sometimes, when we remove special characters (punctuation, digits, etc.), we replace those characters with whitespaces (not empty strings), which may lead to duplicate whitespaces in the texts.
## function: remove redundant spaces
def remove_redundant_whitespaces(text):
    text = re.sub(r'\s+', " ", text)
    return text.strip()
s = "We are humans and we often have typos. "
remove_redundant_whitespaces(s)
'We are humans and we often have typos.'
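A sketch of the scenario described above: when special characters are replaced with whitespaces rather than empty strings, the resulting duplicates can then be collapsed with remove_redundant_whitespaces() (the sample string is made up).
s2 = "Hello--world!! 123"
s2 = re.sub(r'[^a-zA-Z\s]', ' ', s2)  # replace symbols/digits with spaces
print(remove_redundant_whitespaces(s2))  # => 'Hello world'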
Packing things together#
Ideally, we can wrap all the relevant steps of text preprocessing in one coherent procedure.
Please study Sarkar's text_normalizer.py.
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True,
                     text_stemming=False, text_lemmatization=True,
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list):
    ## Your codes here
    return corpus_normalized
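As a reference point, here is a minimal sketch of one possible implementation that simply chains the helper functions defined in this section; HTML stripping and stemming are omitted, and Sarkar's text_normalizer.py covers the full version.
def normalize_corpus_sketch(corpus, contraction_expansion=True,
                            accented_char_removal=True, text_lower_case=True,
                            text_lemmatization=True, special_char_removal=True,
                            remove_digits=True, stopword_removal=True,
                            stopwords=stopword_list):
    normalized_corpus = []
    for doc in corpus:
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        if contraction_expansion:
            doc = expand_contractions(doc)
        if text_lower_case:
            doc = doc.lower()
        if text_lemmatization:
            doc = lemmatize_text(doc)
        if special_char_removal:
            doc = remove_special_characters(doc, remove_digits=remove_digits)
        doc = remove_redundant_whitespaces(doc)
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case,
                                   stopwords=stopwords)
        normalized_corpus.append(doc)
    return normalized_corpus

normalize_corpus_sketch(["My system keeps crashing! His crashed yesterday."])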
References#
Sarkar (2020), Chapter 3.