NLP Pipeline#

A General NLP Pipeline#


Variations of the NLP Pipeline#

  • The process may not always be linear.

  • There are loops in between.

  • These procedures may depend on the specific task at hand.

Data Collection#

Data Acquisition: Heart of ML System#

  • Ideal Setting: We have everything needed.

  • Labels and Annotations

  • Very often we are dealing with less-than-ideal scenarios

Less-than-ideal Scenarios#

  • Initial datasets with limited annotations/labels

  • Initial datasets labeled/extracted based on regular expressions or heuristics

  • Public datasets (cf. Google Dataset Search or Kaggle)

  • Scraped data

  • Product intervention (Product intervention refers to the deliberate manipulation or curation of data by product developers or platform owners, which may introduce biases or distortions in the dataset.)
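As an illustration of datasets labeled with regular expressions or heuristics, the following is a minimal sketch of a rule-based labeling function. The keyword patterns and the sentiment task are hypothetical and chosen only for the example; texts that match no rule, or that match conflicting rules, are left unlabeled for manual annotation.

```python
import re

# Hypothetical keyword patterns; a real project would derive these from the domain.
POSITIVE = re.compile(r"\b(great|excellent|love)\b", re.IGNORECASE)
NEGATIVE = re.compile(r"\b(terrible|awful|hate)\b", re.IGNORECASE)

def heuristic_label(text):
    """Return 'pos', 'neg', or None when the heuristics do not apply or conflict."""
    has_pos = bool(POSITIVE.search(text))
    has_neg = bool(NEGATIVE.search(text))
    if has_pos and not has_neg:
        return "pos"
    if has_neg and not has_pos:
        return "neg"
    return None  # leave unlabeled for manual annotation

samples = [
    "I love this phone, the camera is excellent.",
    "Terrible battery life.",
    "It arrived on Tuesday.",
]
labels = [heuristic_label(s) for s in samples]
print(labels)
```

Labels produced this way are noisy, so they are usually treated as a starting point rather than as ground truth.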

Data Augmentation#

  • Data augmentation in natural language processing (NLP) is a technique used to increase the diversity and quantity of training data by leveraging language properties to create new text samples that are syntactically similar to the original source text data.

  • Types of strategies:

    • Synonym replacement

    • Related word replacement (based on association metrics)

    • Back translation

    • Replacing entities

    • Adding noise to data (e.g. spelling errors, random words)
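Two of the strategies above, synonym replacement and noise injection, can be sketched with the standard library alone. The synonym table, the function names, and the replacement probabilities below are assumptions made for illustration; in practice the synonyms might come from WordNet or from word-association metrics.

```python
import random

# Hypothetical toy synonym table used only for this sketch.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def synonym_replace(tokens, rng, p=0.5):
    """Replace each token that has known synonyms with probability p."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out

def add_typo(token, rng):
    """Swap two adjacent characters to simulate a spelling error."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    chars = list(token)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

rng = random.Random(42)  # fixed seed so the augmentation is reproducible
tokens = "the quick dog is happy".split()
augmented = synonym_replace(tokens, rng, p=1.0)
noisy = [add_typo(t, rng) if rng.random() < 0.3 else t for t in augmented]
print(augmented)
print(noisy)
```

Each augmented sentence keeps the label of its source sentence, which is why replacements should preserve the meaning relevant to the task.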

Text Extraction and Cleanup#

Text Extraction#

  • Extracting raw texts from the input data

    • HTML

    • PDF

  • Relevant vs. irrelevant information

    • non-textual information

    • markup

    • metadata

  • Encoding format
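To make the distinction between textual content and markup concrete, here is a minimal sketch that strips tags from an HTML string using only Python's built-in html.parser module. The class name, the helper function, and the sample page are hypothetical; real pages usually need site-specific handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical page fragment used only for illustration.
page = """<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Hello</h1><p>NLP &amp; text <b>extraction</b>.</p>
<script>var x = 1;</script></body></html>"""
extracted = html_to_text(page)
print(extracted)
```

Character references such as `&amp;` are decoded by the parser, while script and style contents are dropped as non-textual information.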

Extracting texts from webpages#

Extracting textual content from websites is a very common way to obtain data. It requires a close study of the HTML structure of the web pages.

The following code attempts to extract hyperlinks from the index page of PTT, a renowned public bulletin board in Taiwan, and automatically navigates to the first hyperlink to access its textual content.

## -------------------- ##
## Colab Only           ##
## -------------------- ##

! pip install requests_html
! apt install tesseract-ocr
! apt install libtesseract-dev
! pip install pytesseract
## Example 1: PTT Gossiping
from requests_html import HTMLSession
import requests
from bs4 import BeautifulSoup
import pandas as pd

## HTML session
session = HTMLSession()
session.cookies.set('over18', '1')  # age verification

## Processing index page
newsurl = ''
response = session.get(newsurl)
web_content = response.text

## Extract all hyperlinks
soup = BeautifulSoup(web_content,'html.parser')
title = [x['href'] for x in soup.select('div.title > a')]
print(title[:5])

## Extract first article content
first_art_link = ''+ title[0]

response = session.get(first_art_link)
web_content = response.text

## Extract article from HTML
soup = BeautifulSoup(web_content,'html.parser')

## Article Metadata
header = soup.find_all('span','article-meta-value')

# Author
author = header[0].text
# Board
board = header[1].text
# Title
title = header[2].text
# Date
date = header[3].text

## Article Content
main_container = soup.find(id='main-container')

## Paragraphs
all_text = main_container.text
pre_text = all_text.split('--')[0]
texts = pre_text.split('\n')
contents = texts[2:]
content = '\n'.join(contents)

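# Printing (labels: author, title, date, content)
print('作者:' + author + '\n' + '標題:' + title + '\n' + '日期:' + date + '\n' + '內容:' + content)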
['/bbs/Gossiping/M.1708555641.A.B86.html', '/bbs/Gossiping/M.1708555709.A.95A.html', '/bbs/Gossiping/M.1708555913.A.A76.html', '/bbs/Gossiping/M.1708556060.A.DD7.html', '/bbs/Gossiping/M.1708556086.A.06B.html']
作者:A98454 ( 阿肥是肥宅的肥)
標題:Re: [問卦] 科舉制度是控制人民的手段嗎?
日期:Thu Feb 22 03:47:22 2024
內容: 米那桑 空尼幾蛙 關於J個問題  阿肥覺得是這樣的

 說到控制人民就要說說 法家思想了,荀子認為任性本惡,因此人需要學習去壓制內心慾望


就別想有的沒的好好跟著國家給的kpi ,乖乖種田就好了

 至於那些有其他想法的,思想hen 危險的,就發配充軍思想改造,避免這些人影響農民種




,大概武則天開始,出現科舉制度的雛形,(也許要拉攏一些底層勢力抗衡 朝中的貴族階


 所以 科舉制度 算是 帝國2.0的新制度發明吧


 樓下  李白 怎麽看?

Extracting texts from scanned PDF#


Tesseract is an open source text recognition (OCR) Engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) using an API to extract printed text from images. It supports a wide variety of languages.

You have to manually install it before you can use it in Python.

## -------------------- ##
## Colab Only           ##
## -------------------- ##

from google.colab import drive
from PIL import Image
from pytesseract import image_to_string

#import os

YOUR_DEMO_DATA_PATH = "/content/drive/MyDrive/ENC2045_demo_data/"  # please change your file path on local machines
filename = YOUR_DEMO_DATA_PATH+'pdf-firth-text.png'

## run OCR on the scanned page image
text = image_to_string(Image.open(filename))
print(text)
Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96

SPIL 14 (1986) 31- 6¢ 31


Nigel Love

"The study of the living votce of a
man tn aectton ts a very btg job in-

deed." --- J.R. Firth

John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known

as the "London school” of linguistics.

"The business of linguistics", according to Firth, "is to


describe languages". In saying as much he would have the
assent of most twentieth-century linguistic theorists.

Where he parts company with many is in holding that this
enterprise is not incompatible with, or even separable from,
studying “the living voice of a man in action"; and his
chief interest as a linguistic thinker lies in his attempt
to resist the idea that synchronic descriptive linguistics
should treat what he calis “speech-events" as no more than
a means of access to what really interests the linguist:

the Language-system underlying them.

Languages, according to many theorists, are to be envisaged
as systems of abstract entities. These entities are units

of linguistic “form. Units of linguistic form are of two

Unicode normalization#

text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'

## encode the string as bytes
text2 = text.encode('utf-8')
print(text)
print(text2)
I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3 \xc8\x80\xc3\x86\xc4\x8e\xc7\xa6\xc6\x93'
import unicodedata

## normalize by decomposition
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
'I feel really . GOGOGO!!    ADG'
## normalize by composition
unicodedata.normalize('NFKC', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
'I feel really . GOGOGO!!    '
## normalize full-width punctuation
text_punks = ',。「」『』!《'
text_punks2 = unicodedata.normalize('NFKD', text_punks).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
## Check comma and exclamation mark
print(ord(text_punks2[0])) ## normalized

print(ord(text_punks2[6])) ## normalized
## char vs. code number (decimal), code number (hexadecimal)
print(text_punks2[0], ord(text_punks2[0]), hex(ord(text_punks2[0])))
  • Please check unicodedata documentation for more detail on character normalization.

  • Other useful libraries

    • Spelling check: pyenchant, Microsoft REST API

    • PDF: PyPDF, PDFMiner

    • OCR: pytesseract
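One reason normalization matters is that visually identical strings can differ at the code-point level. The sketch below, using only the standard unicodedata module, compares a composed and a decomposed spelling of the same word and shows one common way to strip accents; the helper name strip_accents is an assumption for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The raw strings are not equal, although they render the same.
raw_equal = composed == decomposed

# After normalizing both to the same form, they compare equal.
nfc_equal = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
nfd_equal = unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed)
print(raw_equal, nfc_equal, nfd_equal)

def strip_accents(s):
    """Decompose, then drop combining marks (Unicode category 'Mn')."""
    return "".join(
        ch for ch in unicodedata.normalize("NFD", s)
        if unicodedata.category(ch) != "Mn"
    )

print(strip_accents("café naïve"))
```

Unlike encoding to ASCII with errors ignored, this approach keeps characters that have no decomposition instead of silently deleting them.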

Cleanup (Preprocessing)#

Segmentation and Tokenization#


The NLTK package provides many useful datasets for text analysis. Some of the code may require you to download the corpus data first. Please see Installing NLTK Data for more information.

## -------------------- ##
## Colab Only           ##
## -------------------- ##

import nltk
nltk.download(['punkt', 'stopwords', 'wordnet'])
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sent segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))
Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']
  • Frequent preprocessing

    • Stopword removal

    • Stemming and/or lemmatization

    • Digits/Punctuation removal

    • Case normalization

Removing stopwords, punctuation, digits#

from nltk.corpus import stopwords
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)
print(words)

# remove stopwords, punctuation, digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)
['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']

Stemming and lemmatization#

## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])
['car', 'revolut', 'better']
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## Wordnet requires POS of words
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))
  • Task-specific preprocessing

    • Unicode normalization

    • Language detection

    • Code mixing

    • Transliteration (e.g., using pinyin for Chinese words in English-Chinese code-switching texts)

  • Automatic annotations

    • POS tagging

    • Syntactic/Dependency Parsing

    • Named Entity Recognition

    • Coreference Resolution
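As a rough illustration of one task-specific step, detecting code mixing, the sketch below groups characters into runs by script using code-point ranges. The function names, the single CJK range checked, and the sample sentence are assumptions for illustration; a production system would rely on a proper language-identification model.

```python
def script_of(ch):
    """Very rough script guess for one character (assumed ranges only)."""
    if "\u4e00" <= ch <= "\u9fff":      # CJK Unified Ideographs block
        return "Han"
    if ch.isascii() and ch.isalpha():   # basic Latin letters
        return "Latin"
    return "Other"

def segment_by_script(text):
    """Group consecutive same-script characters into (script, chunk) runs."""
    runs = []
    prev = None
    for ch in text:
        if ch.isspace():
            prev = None  # whitespace always ends the current run
            continue
        s = script_of(ch)
        if s == prev:
            script, chunk = runs[-1]
            runs[-1] = (script, chunk + ch)
        else:
            runs.append((s, ch))
        prev = s
    return runs

# Hypothetical Chinese-English code-switching sentence.
text = "我今天要去meeting然後寫report"
runs = segment_by_script(text)
print(runs)

scripts = {s for s, _ in runs}
is_code_mixed = {"Han", "Latin"} <= scripts
print("code-mixed:", is_code_mixed)
```

Once the runs are identified, each one can be routed to a language-appropriate tokenizer or transliteration step.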

Important Reminders for Preprocessing#

  • Not all steps are necessary

  • These steps are NOT sequential

  • These steps are task-dependent and language-dependent

  • Goals

    • Text Normalization

    • Text Tokenization

    • Text Enrichment/Annotation

Feature Engineering#

What is feature engineering?#

  • Feature engineering is the process of transforming raw data into informative features that are suitable for machine learning algorithms.

  • It usually involves creating, selecting, or modifying features from the original dataset to improve the performance of a machine learning model.

  • Feature engineering aims to extract relevant information from the data and represent it in a way that enhances the model’s ability to learn patterns and make accurate predictions.

  • It aims at capturing the characteristics of the text into a numeric vector that can be understood by the ML algorithms. (Cf. construct, operational definitions, and measurement in experimental science)

  • In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.

Feature Engineering for Classical ML#

  • Word-based frequency lists

  • Bag-of-words representations

  • Domain-specific word frequency lists

  • Handcrafted features based on domain-specific knowledge
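A bag-of-words representation can be sketched in a few lines with the standard library. The two toy documents and the helper name bow_vector are assumptions for illustration; libraries such as scikit-learn provide more complete implementations.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Naive whitespace tokenization; a real pipeline would use a proper tokenizer.
tokenized = [d.split() for d in docs]

# Fixed vocabulary, sorted so that the column order is deterministic.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocab):
    """Map a token list to a vector of raw counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

vectors = [bow_vector(t, vocab) for t in tokenized]
print(vocab)
for v in vectors:
    print(v)
```

Each document becomes a fixed-length numeric vector whose positions correspond to vocabulary words, so word order is discarded and only frequencies remain.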

Feature Engineering for DL#

  • DL models take the (tokenized) texts as input, with little or no handcrafted feature engineering.

  • Deep learning models, particularly deep neural networks, are capable of automatically extracting meaningful features from raw data through multiple layers of abstraction.

  • This process is known as feature learning or representation learning, and it is a key advantage of deep learning over traditional machine learning approaches.

  • However, a drawback is that deep learning models are often less interpretable compared to traditional machine learning approaches.

Strengths and Weaknesses#


From Simple to Complex#

  • Start with heuristics or rules

  • Experiment with different ML models

    • From heuristics to features

    • From manual annotation to automatic extraction

    • Feature importance (weights)

  • Find the optimal model

    • Ensemble and stacking

    • Redo feature engineering

    • Transfer learning

    • Reapply heuristics


  • Simple ensembling combines the predictions of multiple individual models directly (e.g., by voting or averaging), while stacking trains a meta-model to combine the predictions of the base models.

  • Simple ensembling is typically easier to implement but may not capture complex relationships among the base models' predictions as effectively as stacking.

  • Stacking can lead to improved performance but requires additional data for training the meta-model and is more complex to implement and tune.
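The direct-combination case can be sketched as majority voting over the labels predicted by several base models. The three prediction lists below are hypothetical, and the function name majority_vote is an assumption for illustration.

```python
from collections import Counter

# Hypothetical predictions from three base classifiers on five emails.
preds_a = ["spam", "ham", "spam", "ham", "spam"]
preds_b = ["spam", "spam", "ham", "ham", "spam"]
preds_c = ["ham", "ham", "spam", "ham", "ham"]

def majority_vote(*model_preds):
    """Combine per-model label lists by taking the most common label per item."""
    combined = []
    for labels in zip(*model_preds):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined

ensemble = majority_vote(preds_a, preds_b, preds_c)
print(ensemble)
```

In stacking, the same per-model predictions would instead serve as input features for a meta-model, which has to be trained on data that the base models did not see during their own training.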


Why evaluation?#

  • We need to know how good the model we’ve built is – “Goodness”

  • Factors relating to the evaluation methods

    • Model building

    • Deployment

    • Production

  • ML metrics vs. Business metrics

Intrinsic vs. Extrinsic Evaluation#

Intrinsic evaluation assesses the performance of a machine learning model based solely on its performance on a specific task or dataset. Extrinsic evaluation assesses the performance of a machine learning model within the context of a larger system or real-world application.

  • Take a spam-classification system as an example

    • Intrinsic:

      • the precision and recall of the spam classification/prediction

    • Extrinsic:

      • the amount of time users spend on spam emails
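For the intrinsic side of the spam example, precision and recall can be computed directly from the gold labels and the predictions. The label lists and the function name below are hypothetical; the F1 score is included as the harmonic mean of the two.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical gold labels and model predictions for six emails.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers how many of the emails flagged as spam really were spam, while recall answers how many of the actual spam emails were caught.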

General Principles#

  • Do intrinsic evaluation before extrinsic.

  • Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.

  • Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.

  • Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.

Common Intrinsic Metrics#

  • Principles for Evaluation Metrics Selection

  • Data type of the labels (ground truths)

    • Binary (e.g., sentiment)

    • Ordinal (e.g., information retrieval)

    • Categorical (e.g., POS tags)

    • Textual (e.g., named entity, machine translation, text generation)

  • Automatic vs. Human Evaluation

Post-Modeling Phases#

  • Deployment of the model in a production environment (e.g., web service)

  • Monitoring system performance on a regular basis

  • Updating the system with newly arriving data
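Regular monitoring can be as simple as tracking accuracy over the most recent predictions once their true labels become available. The class below is a minimal sketch under that assumption; its name, the window size, and the alert threshold are all hypothetical.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the last `window` labeled predictions."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes drop out automatically
        self.threshold = threshold

    def update(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True when recent accuracy has fallen below the threshold."""
        acc = self.accuracy
        return acc is not None and acc < self.threshold

monitor = RollingAccuracyMonitor(window=4, threshold=0.75)
stream = [("spam", "spam"), ("ham", "ham"), ("spam", "ham"),
          ("ham", "spam"), ("ham", "spam")]
for pred, actual in stream:
    monitor.update(pred, actual)

print(monitor.accuracy, monitor.needs_attention())
```

A sustained drop in the rolling accuracy is one possible signal that the model should be retrained on more recent data.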