Text Normalization#

Overview#

The objective of text normalization is to clean up the text by removing unnecessary and irrelevant components. What to include or exclude for later analysis depends heavily on the research questions. Before any preprocessing step, always ask yourself whether removing the target structures would substantially change the data distribution.

import spacy
import unicodedata
import re
import collections
import requests
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, LancasterStemmer, RegexpStemmer, SnowballStemmer
from bs4 import BeautifulSoup
## NLTK downloads
nltk.download(['wordnet', 'stopwords'])
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
True

HTML Tags#

  • One of the most frequent data sources is the Internet. Assuming that we web-crawl text data from a Wikipedia page, we need to clean up the HTML code quite a bit.

  • Important steps:

    • Download and parse the HTML codes of the webpage

    • Identify the elements from the page that we are interested in

    • Extract the textual contents of the HTML elements

    • Remove unnecessary HTML tags

    • Remove extra spacing/spaces

## Get html data
data = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)')
content = data.content
print(content[:500])
b'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-p'
## Function: remove html tags

def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    ## Can include more HTML preprocessing here...
    stripped_html_elements = soup.find_all(name='div', attrs={'id': 'mw-content-text'})
    stripped_text = ' '.join([h.get_text() for h in stripped_html_elements])
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)  # collapse runs of line breaks
    return stripped_text

clean_content = strip_html_tags(content)
print(clean_content[:1000])
General-purpose programming language
PythonParadigmMulti-paradigm: Object-oriented,[1] Procedural (Imperative), Functional, Structured, ReflectiveDesigned byGuido van RossumDeveloperPython Software FoundationFirst appeared20 February 1991; 33 years ago (1991-02-20)[2]Stable release3.12.2 
   / 7 February 2024; 21 days ago (7 February 2024)
Typing disciplineDuck, Dynamic, Strong typing;[3] Optional type annotations (since 3.5, but those hints are ignored, except with unofficial tools)[4]OSWindows, macOS, Linux/UNIX, Android, Unix-like systems, BSD variants[5][6] and a few other platforms[7]LicensePython Software Foundation LicenseFilename extensions.py, .pyw, .pyz, [8]
.pyi, .pyc, .pydWebsitepython.orgMajor implementationsCPython, PyPy, Stackless Python, MicroPython, CircuitPython, IronPython, JythonDialectsCython, RPython, Starlark[9]Influenced byABC,[10] Ada,[11] ALGOL 68,[12] APL,[13] C,[14] C++,[15] CLU,[16] Dylan,[17] Haskell,[18][13] Icon,[19] Lisp,[20] Modula-3,[12][15] Perl,[21]

Tip

The above preprocessing of the wiki page content is not ideal. One can further improve the text processing by targeting the actual text content of the entry (see the sketch below).
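A minimal sketch of such a refinement is given below. It reuses the content downloaded above and assumes the Wikipedia layout with the mw-content-text container; only the <p> elements, i.e., the running text of the article, are kept. The helper name strip_html_paragraphs is ours, for illustration only.

## Sketch: keep only the paragraph text of the article body
## (assumes the Wikipedia 'mw-content-text' container used above)
def strip_html_paragraphs(text):
    soup = BeautifulSoup(text, "html.parser")
    body = soup.find(name='div', attrs={'id': 'mw-content-text'})
    paragraphs = body.find_all('p') if body else []
    stripped_text = ' '.join([p.get_text() for p in paragraphs])
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text

# print(strip_html_paragraphs(content)[:500])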

Stemming#

  • Stemming is the process where we standardize word forms into their base stem irrespective of their inflections (or other derivational variations).

  • The nltk module provides several popular stemmers for English:

    • nltk.stem.PorterStemmer

    • nltk.stem.LancasterStemmer

    • nltk.stem.RegexpStemmer

    • nltk.stem.SnowballStemmer

  • We can compare the results of different stemmers.

words = ['jumping', 'jumps', 'jumped', 'jumpy']
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

rs = RegexpStemmer('ing$|s$|ed$|y$', min=4) # minimum length a string must have to be stemmed
## Porter Stemmer
[ps.stem(w) for w in words]
['jump', 'jump', 'jump', 'jumpi']
## Lancaster Stemmer
[ls.stem(w) for w in words]
['jump', 'jump', 'jump', 'jumpy']
## Snowball Stemmer
[ss.stem(w) for w in words]
['jump', 'jump', 'jump', 'jumpi']
## Regular Expression Stemmer
[rs.stem(w) for w in words]
['jump', 'jump', 'jump', 'jump']

Lemmatization#

  • Lemmatization is similar to stemming.

  • It is a process where we remove word affixes to get the root word (the lemma), rather than the root stem.

  • These root words, i.e., lemmas, are lexicographically correct words and always present in the dictionary.

Question

Between lemmatization and stemming, which one requires more computational cost? That is, which process might be slower?

  • Two frequently-used lemmatizers

    • nltk.stem.WordNetLemmatizer

    • spacy

WordNet Lemmatizer#

  • WordNetLemmatizer utilizes the dictionary of WordNet.

  • It requires the parts of speech of the word for lemmatization.

  • In practice, nouns, verbs, and adjectives are the word classes for which lemmatization matters most.

Important

For the WordNet-based lemmatizer, it is important to specify the part of speech of the word form.

Valid options are "n" for nouns, "v" for verbs, "a" for adjectives, "r" for adverbs and "s" for satellite adjectives.

from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# nouns
nouns = ['cars','men','feet']
nouns_lemma = [wnl.lemmatize(n, 'n') for n in nouns]
print(nouns_lemma)
['car', 'men', 'foot']
# verbs
verbs = ['running', 'ate', 'grew', 'grown']
verbs_lemma = [wnl.lemmatize(v, 'v') for v in verbs]
print(verbs_lemma)
['run', 'eat', 'grow', 'grow']
# adj
adjs = ['saddest', 'fancier', 'jumpy', 'better', 'least', 'less', 'worse', 'worst','best', 'better']
adjs_lemma = [wnl.lemmatize(a, 'a') for a in adjs]
print(adjs_lemma)
['sad', 'fancy', 'jumpy', 'good', 'least', 'less', 'bad', 'bad', 'best', 'good']
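To see why the POS argument matters, here is a small demonstration (not in the original notebook): the lemmatizer's default POS is 'n' (noun), so verb inflections are left untouched unless we say otherwise.

# The default POS is 'n' (noun)
print(wnl.lemmatize('running'))        # treated as a noun
print(wnl.lemmatize('running', 'v'))   # treated as a verb
print(wnl.lemmatize('better', 'r'))    # treated as an adverb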

Spacy#

Warning

To use spacy properly, you need to download/install the English language model before you load it. Please see the spacy documentation for installation steps.

Also, please remember to install the language models in the right conda environment.

  • For example, in my Mac:

$ conda activate python-notes
$ pip install spacy
$ python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # keep the tagger for POS tags and lemmas
text = 'My system keeps crashing his crashed yesterday, ours crashes daily'
text_tagged = nlp(text)
  • spacy processes the text by segmenting it into tokens and enriching each token with many annotations.

for t in text_tagged:
    print(t.text+'/'+t.lemma_ + '/'+ t.pos_)
My/my/PRON
system/system/NOUN
keeps/keep/VERB
crashing/crash/VERB
his/his/PRON
crashed/crashed/NOUN
yesterday/yesterday/NOUN
,/,/PUNCT
ours/ours/PRON
crashes/crash/VERB
daily/daily/ADV
def lemmatize_text(text):
    text = nlp(text)
    ## '-PRON-' is the placeholder lemma spaCy v2 assigns to pronouns;
    ## keep the original token text in that case
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")
'my system keep crash ! his crashed yesterday , ours crash daily'

Contractions#

  • In English data, contractions can sometimes be problematic.

  • Things may get even more complicated when different tokenizers handle contractions differently (see the short comparison below).

  • A good solution is to expand all contractions into their original, independent word forms.
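As a quick illustration (a toy example, not part of the original pipeline), the sketch below compares a plain whitespace split with spacy's tokenizer, which splits a contraction into two tokens; it reuses the nlp object loaded in the previous section.

## How differently tokenizers treat contractions
s = "I can't believe it isn't here"
print(s.split())                 # whitespace split keeps the contractions intact
print([t.text for t in nlp(s)])  # spacy splits them into two tokens (e.g., "ca" + "n't")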

## -------------------- ##
## Colab Only           ##
## -------------------- ##
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Note

If you are using your local Python environment, please download the TAWP directory from the ENC2045_demo_data. This directory contains code snippets provided in Sarkar’s (2020) book.

We assume that you have saved the TAWP directory under GoogleDrive/ENC2045_demo_data.

## -------------------- ##
## Colab Only           ##
## -------------------- ##
## working dir
!pwd
/content
## -------------------- ##
## Colab Only           ##
## -------------------- ##
## check ENC2045_demo_data
!ls /content/drive/MyDrive/ENC2045_demo_data/
05-3_clauseorders.csv	      dcard-top100.csv	 PDF		     TAWP
addition-student-version.csv  glove		 pdf-firth-text.png  TAWP.zip
date-student-version.csv      movie_reviews.csv  stopwords
## -------------------- ##
## Colab Only           ##
## -------------------- ##
!cp -r /content/drive/MyDrive/ENC2045_demo_data/TAWP /content/
import TAWP  ## You have to put the TAWP directory under your WORKING DIRECTORY
from TAWP.contractions import CONTRACTION_MAP

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    ## create a regex pattern of all contracted forms
    contractions_pattern = re.compile('({})'.format('|'.join(
        contraction_mapping.keys())),flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)  # the whole matched contraction

        # if the matched contraction (=keys) exists in the dict,
        # get its corresponding uncontracted form (=values)
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())

        return expanded_contraction

    # find each contraction in the pattern,
    # find it from text,
    # and replace it using the output of
    # expand_match
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text

Note

In pattern.sub(repl, string), when repl is a function like the one above, the function is called for every non-overlapping occurrence of the pattern contractions_pattern. The function expand_match takes a single matched contraction and returns the replacement string, i.e., its uncontracted form in the dictionary.

In other words, contractions_pattern.sub(expand_match, text) above says: substitute every non-overlapping occurrence of the pattern contractions_pattern in text with the output of expand_match() (see the toy example below).
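Here is a minimal, self-contained toy example of re.sub() with a function as the replacement argument (unrelated to the contraction map, for illustration only):

## Toy example: the replacement can be a function that receives
## each match object and returns the replacement string
print(re.sub(r'\d+', lambda m: str(int(m.group(0)) * 2), "3 cats and 12 dogs"))
## -> '6 cats and 24 dogs'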

print(expand_contractions("Y'all can't expand contractions I'd think"))
print(expand_contractions("I'm very glad he's here! And it ain't here!"))
you all cannot expand contractions I would think
I am very glad he is here! And it is not here!
type(CONTRACTION_MAP)
dict
list(CONTRACTION_MAP.items())[:5] # check the first five items
[("ain't", 'is not'),
 ("aren't", 'are not'),
 ("can't", 'cannot'),
 ("can't've", 'cannot have'),
 ("'cause", 'because')]

Accented Characters (Non-ASCII)#

  • The unicodedata module handles unicode characters very efficiently. Please check the unicodedata documentation for more details.

  • When dealing with the English data, we may often encounter foreign characters in texts that are not part of the ASCII character set.

## Function: remove accented chars

def remove_accented_chars(text):
    ## NFKD applies the compatibility decomposition, i.e.,
    ## it replaces all compatibility characters with their equivalents;
    ## encoding to ASCII (ignoring errors) then drops the diacritics
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text


remove_accented_chars('Sómě Áccěntěd těxt')

# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt'))
# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt').encode('ascii','ignore'))
# print(unicodedata.normalize('NFKD', 'Sómě Áccěntěd těxt').encode('ascii','ignore').decode('utf-8', 'ignore'))
'Some Accented text'

Note

  • str.encode() returns an encoded version of the string as a bytes object using the specified encoding.

  • bytes.decode() returns a string decoded from the given bytes using the specified encoding.

  • Another common scenario is when texts include both English and Chinese characters. What’s worse, the English characters may be in full-width forms.

## Chinese characters with full-width English letters and punctuations
text = '中英文abc,,。..ABC123'
print(unicodedata.normalize('NFKD', text))
print(unicodedata.normalize('NFKC', text))  # recommended method
print(unicodedata.normalize('NFC', text))
print(unicodedata.normalize('NFD', text))
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
中英文abc,,。..ABC123
  • Sometimes, we may even want to keep characters of one language only.

text = "中文CHINESE。!==.= ^o^ 2020/5/20 alvin@gmal.cob@%&*"

# remove puncs/symbols
print(''.join(
    [c for c in text if unicodedata.category(c)[0] not in ["P", "S"]]))

# select letters
print(''.join([c for c in text if unicodedata.category(c)[0] in ["L"]]))

# remove alphabets
print(''.join(
    [c for c in text if unicodedata.category(c)[:2] not in ["Lu", 'Ll']]))

# select Chinese chars?
print(''.join([c for c in text if unicodedata.category(c)[:2] in ["Lo"]]))
中文CHINESE o 2020520 alvingmalcob
中文CHINESEoalvingmalcob
中文。!==.= ^^ 2020/5/20 @.@%&*
中文

Note

Please check this page for unicode category names.

The unicode category Lo seems to be a good way to identify Chinese characters.

We can also make use of the category names to identify punctuation marks.

Note

This page shows how unicode deals with combining or decomposing character classes.

Special Characters#

  • Depending on the research questions and the defined tasks, we often need to decide whether to remove irrelevant characters.

  • Common irrelevant (i.e., non-informative) characters may include:

    • Punctuation marks and symbols

    • Digits

    • Any other non-alphanumeric characters

def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

s = "This is a simple case! Removing 1 or 2 symbols is probably ok...:)"
print(remove_special_characters(s))
print(remove_special_characters(s, True))
This is a simple case Removing 1 or 2 symbols is probably ok
This is a simple case Removing  or  symbols is probably ok

Warning

In the following example, if we use the same remove_special_characters() to pre-process the text, what additional problems will we encounter?

Any suggestions or better alternative methods?

s = "It's a complex sentences, and I'm not sure if it's ok to replace all symbols then :( What now!!??)"
print(remove_special_characters(s))
Its a complex sentences and Im not sure if its ok to replace all symbols then  What now
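One possible remedy (a sketch, not the only solution): expand the contractions first, so that forms like "It's" and "I'm" are restored to full words before the apostrophes are stripped.

## Sketch: expand contractions before removing the remaining symbols,
## so "It's" / "I'm" are not collapsed into "Its" / "Im"
print(remove_special_characters(expand_contractions(s)))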

Stopwords#

  • At the word-token level, there are words that have little semantic information and are usually removed from text in text preprocessing. These words are often referred to as stopwords.

  • However, there is no universal stopword list. Whether a word is informative or not depends on your research/project objective. It is a linguistic decision.

  • nltk.corpus.stopwords.words('english') provides a standard English stopword list.

tokenizer = ToktokTokenizer()
# assuming one sentence per line;
# for other tokenizers, nltk.sent_tokenize may need to come first
stopword_list = nltk.corpus.stopwords.words('english')

## Function: remove stopwords (English)

def remove_stopwords(text, is_lower_case=False, stopwords=stopword_list):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

remove_stopwords("The, and, if are stopwords, computer is not")
', , stopwords , computer'
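Because the stopword list is a linguistic decision (see above), the stopwords parameter of remove_stopwords() lets us pass a customized list. A hedged example: keeping negations, which are often informative for sentiment-related tasks.

## Customized stopword list: keep negations such as "not" and "no"
custom_stopwords = [w for w in stopword_list if w not in ('not', 'no')]
print(remove_stopwords("The result is not good", stopwords=custom_stopwords))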
  • We can check the languages of the stopwords lists provided by nltk.

nltk.corpus.stopwords.fileids()
['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

Redundant Whitespaces#

  • Very often we see redundant, duplicate whitespaces in texts.

  • Sometimes, when we remove special characters (punctuation, digits, etc.), we replace them with whitespaces rather than empty strings, which may leave duplicate whitespaces in the text.

## function: remove redundant spaces
def remove_redundant_whitespaces(text):
    text = re.sub(r'\s+', " ", text)
    return text.strip()

s = "We are humans  and we   often have typos.  "
remove_redundant_whitespaces(s)
'We are humans and we often have typos.'

Packing things together#

  • Ideally, we can wrap all the relevant steps of text preprocessing in one coherent procedure.

  • Please study Sarkar’s text_normalizer.py; a rough sketch that chains the functions defined above is given after the skeleton below.

def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True,
                     text_stemming=False, text_lemmatization=True,
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list):

    ## Your codes here

    return corpus_normalized
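For reference, here is one possible sketch (not Sarkar's implementation) that simply chains the helper functions defined earlier in this notebook; the order of the steps is an assumption you may want to adjust, and note that strip_html_tags() above is Wikipedia-specific.

## A possible sketch: chain the helpers defined above,
## toggled by the keyword arguments (not Sarkar's text_normalizer.py)
def normalize_corpus(corpus, html_stripping=True, contraction_expansion=True,
                     accented_char_removal=True, text_lower_case=True,
                     text_stemming=False, text_lemmatization=True,
                     special_char_removal=True, remove_digits=True,
                     stopword_removal=True, stopwords=stopword_list):
    corpus_normalized = []
    for doc in corpus:
        if html_stripping:
            doc = strip_html_tags(doc)       # Wikipedia-specific (see above)
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        if contraction_expansion:
            doc = expand_contractions(doc)
        if text_lower_case:
            doc = doc.lower()
        if text_stemming and not text_lemmatization:
            doc = ' '.join([ps.stem(w) for w in doc.split()])
        if text_lemmatization:
            doc = lemmatize_text(doc)
        if special_char_removal:
            doc = remove_special_characters(doc, remove_digits=remove_digits)
        if stopword_removal:
            doc = remove_stopwords(doc, is_lower_case=text_lower_case,
                                   stopwords=stopwords)
        doc = remove_redundant_whitespaces(doc)
        corpus_normalized.append(doc)
    return corpus_normalized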

References#

  • Sarkar (2020), Chapter 3.