Text Preprocessing
Overview
Before analyzing or mining text data, we usually need to preprocess the raw text into formats that are easier to interpret.
This step is tedious but crucial, and it often involves a wide variety of techniques that convert raw text into well-defined sequences of linguistic components.
Types of Preprocessing
Text Normalization
  HTML Tags
  Stemming
  Lemmatization
  Contractions
  Accented Characters
  Special Characters
  Redundant Whitespaces
  Stopwords
Text Tokenization
  Words
  Sentences
  Other self-defined linguistic units
Text Enrichment
  POS Tagging
  Chunking
  Parsing
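To make the first two categories concrete, here is a minimal sketch that uses only the Python standard library. The `CONTRACTIONS` and `STOPWORDS` tables, the function names, and the sample string are all hypothetical and deliberately tiny; a real project would likely rely on fuller resources. Stemming, lemmatization, and the enrichment steps (POS tagging, chunking, parsing) usually need a dedicated NLP library or trained model, so they are left out here.

```python
import html
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"isn't": "is not", "can't": "cannot", "won't": "will not"}
STOPWORDS = {"the", "a", "an", "is", "it", "of", "to", "and"}


def strip_html(text):
    """Replace HTML tags with a space and decode entities such as &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def strip_accents(text):
    """Replace accented characters with their closest ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def expand_contractions(text):
    """Expand the contractions listed in CONTRACTIONS (case-insensitive)."""
    pattern = re.compile(
        "|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def squeeze_whitespace(text):
    """Collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence tokenizer: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_words(sentence):
    """Naive word tokenizer: lowercase runs of letters and digits.

    Punctuation and other special characters are dropped as a side effect.
    """
    return re.findall(r"[a-z0-9]+", sentence.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the STOPWORDS set."""
    return [t for t in tokens if t not in STOPWORDS]


raw = "<p>The   café isn't  open.</p> <p>You can't beat the menu!</p>"

# Normalization
clean = squeeze_whitespace(expand_contractions(strip_accents(strip_html(raw))))
# Tokenization (sentences, then words), followed by stopword removal
sentences = split_sentences(clean)
tokens = [remove_stopwords(split_words(s)) for s in sentences]

print(clean)   # The cafe is not open. You cannot beat the menu!
print(tokens)  # [['cafe', 'not', 'open'], ['you', 'cannot', 'beat', 'menu']]
```

Each step is a small, independent function so that steps can be reordered or dropped to suit the task at hand.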
Warning
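These three types of preprocessing do NOT necessarily proceed in a fixed serial fashion. The order in which the individual steps are applied can nevertheless be crucial, because the same steps applied in a different order can produce different results.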
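A small sketch of why ordering matters, again using only the standard library. The two helper functions and the sample sentence are hypothetical; the point is only that removing special characters before sentence tokenization erases the punctuation a naive sentence splitter depends on.

```python
import re


def remove_special_chars(text):
    """Keep only letters, digits and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


def split_sentences(text):
    """Naive sentence tokenizer: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


text = "Stop! Read the label. Then decide."

# Order A: normalize first, then tokenize. The punctuation that marks
# sentence boundaries is already gone, so the text cannot be split.
order_a = split_sentences(remove_special_chars(text))

# Order B: tokenize first, then normalize each sentence.
order_b = [remove_special_chars(s) for s in split_sentences(text)]

print(order_a)  # ['Stop Read the label Then decide']
print(order_b)  # ['Stop', 'Read the label', 'Then decide']
```

The same two functions give one "sentence" in order A and three in order B, so the pipeline order should be chosen with the downstream task in mind.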