Text Preprocessing

Overview

  • Before any real data analysis or mining, we usually need to preprocess the textual data into formats that are easier to interpret.

  • This step is tedious but crucial and often involves a wide variety of techniques that convert raw text into well-defined sequences of linguistic components.

Types of Preprocessing

  • Text Normalization

    • HTML tags

    • Stemming

    • Lemmatization

    • Contractions

    • Accented Characters

    • Special Characters

    • Redundant Whitespaces

    • Stopwords

  • Text Tokenization

    • words

    • sentences

    • other self-defined linguistic units

  • Text Enrichment

    • POS Tagging

    • Chunking

    • Parsing
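As a rough illustration of the normalization and tokenization items above, here is a minimal sketch that uses only the Python standard library. The `CONTRACTIONS` and `STOPWORDS` tables and all function names are hypothetical and deliberately tiny; they are assumptions made for this example, not a recommended resource. Stemming, lemmatization, and the enrichment steps (POS tagging, chunking, parsing) normally rely on a dedicated NLP library and are not shown.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"the", "a", "an", "is", "it", "to", "of", "and"}


def strip_html(text):
    # Naive tag removal; a real HTML parser is more robust.
    return re.sub(r"<[^>]+>", " ", text)


def remove_accents(text):
    # Decompose characters, then drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_contractions(text):
    # Replace each known contraction with its expanded (lowercase) form.
    pattern = re.compile(
        "|".join(re.escape(k) for k in CONTRACTIONS), re.IGNORECASE
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def remove_special_chars(text):
    # Keep only ASCII letters, digits, and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


def normalize(text):
    # One possible ordering; contractions are expanded before
    # apostrophes are removed as special characters.
    text = strip_html(text)
    text = remove_accents(text)
    text = expand_contractions(text)
    text = remove_special_chars(text)
    return collapse_whitespace(text).lower()


def tokenize_words(text):
    # Whitespace tokenization is enough once the text is normalized.
    return text.split()


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


raw = "<p>It's a   café, and I don't like the   noise!</p>"
tokens = remove_stopwords(tokenize_words(normalize(raw)))
print(tokens)  # ['cafe', 'i', 'do', 'not', 'like', 'noise']
```

Note that the contraction step runs before special characters are stripped. With the two steps swapped, the apostrophes would already be gone and "don't" would no longer match the lookup table.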

Warning

These three types of preprocessing do NOT necessarily proceed in a fixed serial fashion. The order in which the individual steps are applied, however, can be crucial, because each step changes the input that later steps receive.
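To make the point about ordering concrete, the sketch below applies the same two operations in both orders. The regex-based sentence splitter and the function names are assumptions made for illustration; a real tokenizer handles abbreviations and other edge cases that this one ignores.

```python
import re


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def remove_special_chars(text):
    # Drop everything except ASCII letters, digits, and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


text = "Preprocessing is tedious. It is also crucial!"

# Order A: split into sentences first, then clean each sentence.
order_a = [remove_special_chars(s) for s in split_sentences(text)]

# Order B: clean first, then try to split into sentences.
order_b = split_sentences(remove_special_chars(text))

print(order_a)  # ['Preprocessing is tedious', 'It is also crucial']
print(order_b)  # ['Preprocessing is tedious It is also crucial']
```

In order B the punctuation that marks sentence boundaries is removed before the splitter runs, so the two sentences can no longer be separated.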