Text Preprocessing

Overview

Before any real data analysis or text mining, we usually need to preprocess the textual data into formats that are easier to interpret.
This step is tedious but crucial, and it often involves a wide variety of techniques that convert raw text into well-defined sequences of linguistic components.

Types of Preprocessing

- Text Normalization
  - HTML tags
  - Stemming
  - Lemmatization
  - Contractions
  - Accented characters
  - Special characters
  - Redundant whitespace
  - Stopwords
- Text Tokenization
  - Words
  - Sentences
  - Other self-defined linguistic units
- Text Enrichment
  - POS tagging
  - Chunking
  - Parsing
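
As a rough illustration of the first two categories above, the following sketch chains several normalization steps with a whitespace word tokenizer, using only the Python standard library. The `CONTRACTIONS` and `STOPWORDS` tables and all function names are toy assumptions made up for this example, not part of any library. Stemming, lemmatization, and the enrichment steps (POS tagging, chunking, parsing) are left out because they typically rely on a dedicated NLP library and trained models.

```python
import html
import re
import unicodedata

# Toy resources, assumed for illustration only; real projects use much larger lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"the", "a", "an", "is", "it", "to", "of", "and", "do"}


def strip_html(text):
    """Drop anything that looks like a tag, then decode entities such as &amp;.

    A regex is only a rough approximation; messy real-world HTML needs a parser.
    """
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def remove_accents(text):
    """Decompose accented characters and discard the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_contractions(text):
    """Replace each known contraction with its expanded (lowercase) form."""
    alternatives = "|".join(re.escape(c) for c in CONTRACTIONS)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_special_characters(text):
    """Keep ASCII letters, digits, and whitespace; replace the rest with a space."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def collapse_whitespace(text):
    """Squeeze runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def normalize(text):
    """Apply the normalization steps in one fixed order and lowercase the result."""
    text = strip_html(text)
    text = remove_accents(text)
    text = expand_contractions(text)
    text = remove_special_characters(text)
    return collapse_whitespace(text).lower()


def tokenize_words(text):
    """Whitespace tokenizer; adequate only because the text is already normalized."""
    return text.split()


def remove_stopwords(tokens):
    """Filter out tokens found in the toy stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


raw = "<p>It's   a caf\u00e9, and I don't  like the noise!</p>"
clean = normalize(raw)
tokens = remove_stopwords(tokenize_words(clean))
print(clean)   # it is a cafe and i do not like the noise
print(tokens)  # ['cafe', 'i', 'not', 'like', 'noise']
```

Stopword removal is applied to the token list here, after tokenization, even though it is conceptually a normalization step; that is one small instance of the three categories interleaving in practice.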

Warning

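These three types of preprocessing do NOT necessarily proceed in a fixed serial order, and steps from different types are often interleaved. The order in which they are applied can still be crucial, because each step changes the input that later steps receive. For example, stripping punctuation before sentence tokenization removes the very cues a sentence tokenizer relies on.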
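
To make the effect of ordering concrete, here is a minimal sketch that applies the same two operations in both orders. The naive regex-based `split_sentences` and the `remove_special_characters` helper are hypothetical stand-ins written for this example; a real sentence tokenizer is more robust but depends on punctuation in much the same way.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def remove_special_characters(text):
    """Replace everything except ASCII letters, digits, and whitespace with a space."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


text = "Preprocessing is tedious. It is also crucial! Order matters."

# Order A: tokenize into sentences first, then strip punctuation from each one.
tokenize_first = [remove_special_characters(s).strip() for s in split_sentences(text)]

# Order B: strip punctuation first, then try to tokenize into sentences.
normalize_first = split_sentences(remove_special_characters(text))

print(tokenize_first)        # ['Preprocessing is tedious', 'It is also crucial', 'Order matters']
print(len(normalize_first))  # 1
```

With order B the sentence boundaries are lost before the tokenizer ever sees them, so the whole text comes back as a single "sentence".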