NLP Pipeline#
A General NLP Pipeline#
Variations of the NLP Pipeline#
The process may not always be linear.
There are loops in between.
These procedures may depend on the specific task at hand.
Data Collection#
Data Acquisition: Heart of ML System#
Ideal Setting: We have everything needed.
Labels and Annotations
Very often we are dealing with less-than-ideal scenarios
Less-than-ideal Scenarios#
Initial datasets with limited annotations/labels
Initial datasets labeled/extracted based on regular expressions or heuristics
Public datasets (cf. Google Dataset Search or Kaggle)
Scraped data
Product intervention (Product intervention refers to the deliberate manipulation or curation of data by product developers or platform owners, which may introduce biases or distortions in the dataset.)
Data Augmentation#
Data augmentation in natural language processing (NLP) is a technique used to increase the diversity and quantity of training data by leveraging language properties to create new text samples that are syntactically similar to the original source text data.
Types of strategies:
Synonym replacement
Related word replacement (based on association metrics)
Back translation
Replacing entities
Adding noise to data (e.g. spelling errors, random words)
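Two of the strategies above (synonym replacement and adding noise) can be sketched with the standard library alone. This is a minimal illustration, not a production recipe: the tiny SYNONYMS table and the function names synonym_replace and add_typo_noise are hypothetical, and a real system would more likely draw synonyms from a resource such as WordNet.

```python
import random

# Toy synonym table (hypothetical; a real system might draw on WordNet).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}

def synonym_replace(tokens, rng):
    """Replace every token found in SYNONYMS with a randomly chosen synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]

def add_typo_noise(tokens, rng, rate=0.3):
    """Swap two adjacent characters in roughly `rate` of the longer tokens."""
    noisy = []
    for t in tokens:
        if len(t) > 3 and rng.random() < rate:
            i = rng.randrange(len(t) - 1)
            t = t[:i] + t[i + 1] + t[i] + t[i + 2:]
        noisy.append(t)
    return noisy

rng = random.Random(0)  # fixed seed so the augmentation is reproducible
tokens = "the quick dog looks happy today".split()
augmented = synonym_replace(tokens, rng)
noisy = add_typo_noise(tokens, rng)
print(augmented)
print(noisy)
```

Both functions keep the number of tokens unchanged, so any token-level labels attached to the original sentence can be reused for the augmented copies.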
Text Extraction and Cleanup#
Text Extraction#
Extracting raw texts from the input data
Relevant vs. irrelevant information
Non-textual information
Encoding format
Extracting texts from webpages#
Extracting textual content from websites is a very common way to obtain data. It requires a close study of the HTML structure of the web pages.
The following code attempts to extract hyperlinks from the index page of PTT, a renowned public bulletin board in Taiwan, and automatically navigates to the first hyperlink to access its textual content.
## -------------------- ##
## Colab Only ##
## -------------------- ##
! pip install requests_html
! apt install tesseract-ocr
! apt install libtesseract-dev
! pip install pytesseract
## Example 1: PTT Gossiping
from requests_html import HTMLSession
import requests
from bs4 import BeautifulSoup
import pandas as pd
## HTML session
session = HTMLSession()
session.cookies.set('over18', '1') # age verification
## Processing index page
newsurl = ''
response = session.get(newsurl)
web_content = response.text
## Extract all hyperlinks
soup = BeautifulSoup(web_content,'html.parser')
title = [x['href'] for x in soup.select('div.title > a')]
print(title[:5])
## Extract first article content
first_art_link = ''+ title[0]
response = session.get(first_art_link)
web_content = response.text
## Extract article from HTML
soup = BeautifulSoup(web_content,'html.parser')
## Article Metadata
header = soup.find_all('span','article-meta-value')
# Author
author = header[0].text
# Board
board = header[1].text
# Title
title = header[2].text
# Date
date = header[3].text
## Article Content
main_container = soup.find(id='main-container')
## Paragraphs
all_text = main_container.text
pre_text = all_text.split('--')[0]
texts = pre_text.split('\n')
contents = texts[2:]
content = '\n'.join(contents)
# Printing
print('作者:' + author)
print('標題:' + title)
print('日期:' + date)
print('內容:' + content)
['/bbs/Gossiping/M.1708555641.A.B86.html', '/bbs/Gossiping/M.1708555709.A.95A.html', '/bbs/Gossiping/M.1708555913.A.A76.html', '/bbs/Gossiping/M.1708556060.A.DD7.html', '/bbs/Gossiping/M.1708556086.A.06B.html']
作者:A98454 ( 阿肥是肥宅的肥)
標題:Re: [問卦] 科舉制度是控制人民的手段嗎?
日期:Thu Feb 22 03:47:22 2024
內容: 米那桑 空尼幾蛙 關於J個問題 阿肥覺得是這樣的
說到控制人民就要說說 法家思想了,荀子認為任性本惡,因此人需要學習去壓制內心慾望
就別想有的沒的好好跟著國家給的kpi ,乖乖種田就好了
至於那些有其他想法的,思想hen 危險的,就發配充軍思想改造,避免這些人影響農民種
,大概武則天開始,出現科舉制度的雛形,(也許要拉攏一些底層勢力抗衡 朝中的貴族階
所以 科舉制度 算是 帝國2.0的新制度發明吧
樓下 李白 怎麽看?
Extracting texts from scanned PDF#
Tesseract is an open-source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) through an API, to extract printed text from images. It supports a wide variety of languages.
You have to manually install it before you can use it in Python.
## -------------------- ##
## Colab Only ##
## -------------------- ##
from google.colab import drive
drive.mount('/content/drive')
from PIL import Image
from pytesseract import image_to_string
#import os
YOUR_DEMO_DATA_PATH = "/content/drive/MyDrive/ENC2045_demo_data/" # please change your file path on local machines
filename = YOUR_DEMO_DATA_PATH+'pdf-firth-text.png'
## open the scanned page as an image and run OCR on it
text = image_to_string(Image.open(filename))
print(text)
Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96
SPIL 14 (1986) 31- 6¢ 31
Nigel Love
"The study of the living votce of a
man tn aectton ts a very btg job in-
deed." --- J.R. Firth
John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known
as the "London school” of linguistics.
"The business of linguistics", according to Firth, "is to
describe languages". In saying as much he would have the
assent of most twentieth-century linguistic theorists.
Where he parts company with many is in holding that this
enterprise is not incompatible with, or even separable from,
studying “the living voice of a man in action"; and his
chief interest as a linguistic thinker lies in his attempt
to resist the idea that synchronic descriptive linguistics
should treat what he calis “speech-events" as no more than
a means of access to what really interests the linguist:
the Language-system underlying them.
Languages, according to many theorists, are to be envisaged
as systems of abstract entities. These entities are units
of linguistic “form. Units of linguistic form are of two
Unicode normalization#
text = 'I feel really 😡. GOGOGO!! 💪💪💪 🤣🤣 ȀÆĎǦƓ'
print(text)
## encode the strings in bytes
text2 = text.encode('utf-8')
print(text2)
I feel really 😡. GOGOGO!! 💪💪💪 🤣🤣 ȀÆĎǦƓ
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3 \xc8\x80\xc3\x86\xc4\x8e\xc7\xa6\xc6\x93'
import unicodedata
## normalize by decomposition
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
'I feel really . GOGOGO!! ADG'
## normalize by composition
unicodedata.normalize('NFKC', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
'I feel really . GOGOGO!! '
## normalize full-width punctuation
text_punks = ',。「」『』!《'
text_punks2 = unicodedata.normalize('NFKD', text_punks).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
## Check comma and exclamation mark
print(ord(text_punks2[0])) ## normalized
print(ord(text_punks2[6])) ## normalized
NFKC (Normalization Form Compatibility Composition):
Compatibility: NFKC replaces compatibility variants (e.g., full-width letters, ligatures, superscripts) with their standard equivalents, so that characters with similar meanings or appearances but different encodings are unified.
Composition: After decomposing, NFKC recombines base characters and combining marks into a single precomposed character wherever one exists, which gives a more compact representation.
NFKD (Normalization Form Compatibility Decomposition):
Compatibility: Like NFKC, NFKD replaces compatibility variants with their standard equivalents, but it stops after decomposition and does not recompose.
Decomposition: NFKD breaks characters down into their basic components, i.e., base characters followed by any combining diacritical marks or modifiers.
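The difference between the two forms is easiest to see on individual characters. The sketch below uses only the standard unicodedata module; the three sample characters are chosen here for illustration and are not part of the examples above.

```python
import unicodedata

samples = {
    "precomposed e-acute": "\u00e9",   # é as a single code point
    "fi ligature": "\ufb01",           # fi
    "full-width comma": "\uff0c",      # ,
}

for name, ch in samples.items():
    nfkd = unicodedata.normalize("NFKD", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    # Print code points so that invisible combining marks become visible.
    print(name,
          [hex(ord(c)) for c in nfkd],
          [hex(ord(c)) for c in nfkc])
```

Under NFKD the precomposed é becomes the base letter e followed by the combining acute accent (U+0301), while NFKC keeps it as one code point. The ligature and the full-width comma are compatibility characters, so both forms map them to their plain ASCII equivalents.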
See also
A few notes on Unicode:
Unicode: Unicode is a standard for encoding characters in digital systems. Each character is assigned a unique code point, a numerical value that represents that character. For example, the Unicode code point for the character “我” is U+6211.
ord(): In Python, the ord() function returns the Unicode code point (in decimal format) for a given character. For example, ord("我") returns 25105.
hex(): The hex() function in Python converts an integer to its hexadecimal representation. For example, hex(25105) returns '0x6211', where '0x' indicates that the following digits are in hexadecimal format.
## char vs. code number (decimal), code number (hexadecimal)
char = "我"
print(char, ord(char), hex(ord(char)))
Please check the unicodedata documentation for more details on character normalization.
Other useful libraries
Spelling check: pyenchant, Microsoft REST API
OCR: pytesseract
Cleanup (Preprocessing)#
Segmentation and Tokenization#
The NLTK package provides many useful datasets for text analysis. Some of the code may require you to download the corpus data first. Please see Installing NLTK Data for more information.
## -------------------- ##
## Colab Only ##
## -------------------- ##
import nltk
nltk.download(['punkt', 'stopwords', 'wordnet'])
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
from nltk.tokenize import sent_tokenize, word_tokenize
text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''
## sent segmentation
sents = sent_tokenize(text)
## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))
Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']
Frequent preprocessing
Stopword removal
Stemming and/or lemmatization
Digits/Punctuation removal
Case normalization
Removing stopwords, punctuations, digits#
from nltk.corpus import stopwords
from string import punctuation
eng_stopwords = stopwords.words('english')
text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."
words = word_tokenize(text)
print(words)
# remove stopwords, punctuation, digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)
['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Stemming and lemmatization#
## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])
['car', 'revolut', 'better']
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
## Wordnet requires POS of words
poss = ['n','n','a']
for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))
Task-specific preprocessing
Unicode normalization
Language detection
Code mixing
Transliteration (e.g., using pinyin for Chinese words in English-Chinese code-switching texts)
Automatic annotations
POS tagging
Syntactic/Dependency Parsing
Named Entity Recognition
Coreference Resolution
Important Reminders for Preprocessing#
Not all steps are necessary
These steps are NOT sequential
These steps are task-dependent and language-dependent
Text Normalization
Text Tokenization
Text Enrichment/Annotation
Feature Engineering#
What is feature engineering?#
Feature engineering is the process of transforming raw data into informative features that are suitable for machine learning algorithms.
It usually involves creating, selecting, or modifying features from the original dataset to improve the performance of a machine learning model.
Feature engineering aims to extract relevant information from the data and represent it in a way that enhances the model’s ability to learn patterns and make accurate predictions.
It aims at capturing the characteristics of the text into a numeric vector that can be understood by the ML algorithms. (Cf. construct, operational definitions, and measurement in experimental science)
In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.
Feature Engineering for Classical ML#
Word-based frequency lists
Bag-of-words representations
Domain-specific word frequency lists
Handcrafted features based on domain-specific knowledge
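As a concrete instance of the bag-of-words idea listed above, the sketch below turns two toy sentences into count vectors using only the standard library. The sentences, the whitespace tokenization, and the helper name bow_vector are assumptions made for illustration; libraries such as scikit-learn offer more complete vectorizers.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Vocabulary: every distinct word in the corpus, in a fixed (sorted) order.
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

vectors = [bow_vector(d, vocab) for d in docs]
print(vocab)
for v in vectors:
    print(v)
```

Each document becomes a vector of the same length as the vocabulary, which is what allows a classical ML algorithm to compare texts numerically. Word order is discarded, which is the main limitation of this representation.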
Feature Engineering for DL#
DL directly takes the texts as inputs to the model.
Deep learning models, particularly deep neural networks, are capable of automatically extracting meaningful features from raw data through multiple layers of abstraction.
This process is known as feature learning or representation learning, and it is a key advantage of deep learning over traditional machine learning approaches.
However, a drawback is that deep learning models are often less interpretable compared to traditional machine learning approaches.
Strengths and Weaknesses#
From Simple to Complex#
Start with heuristics or rules
Experiment with different ML models
From heuristics to features
From manual annotation to automatic extraction
Feature importance (weights)
Find the optimal model
Ensemble and stacking
Redo feature engineering
Transfer learning
Reapply heuristics
Ensemble learning combines predictions from multiple individual models directly, while stacking involves training a meta-model to combine predictions from base models.
Ensemble learning is typically simpler to implement but may not capture complex relationships as effectively as stacking.
Stacking can lead to improved performance but requires additional data for training the meta-model and is more complex to implement and tune.
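To make the first half of this contrast concrete, here is a minimal sketch of combining predictions directly by majority vote. The three rule-based "models" are hypothetical stand-ins for trained classifiers, invented for this example. A stacking setup would instead pass their outputs to a trained meta-model, which is not shown here.

```python
from collections import Counter

# Three hypothetical base "models": simple rules standing in for classifiers.
def model_keywords(text):
    return "spam" if "free" in text.lower() else "ham"

def model_exclaim(text):
    return "spam" if text.count("!") >= 2 else "ham"

def model_length(text):
    return "spam" if len(text.split()) < 4 else "ham"

BASE_MODELS = [model_keywords, model_exclaim, model_length]

def majority_vote(text):
    """Return the label predicted by most of the base models."""
    votes = Counter(m(text) for m in BASE_MODELS)
    return votes.most_common(1)[0][0]

print(majority_vote("FREE prize!!"))
print(majority_vote("See you at the meeting tomorrow, free lunch included"))
```

In the second message one rule fires on the word "free", but the other two outvote it. This shows how a direct ensemble can smooth over the mistakes of a single model without any extra training step.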
Why evaluation?#
We need to know how good the model we’ve built is – “Goodness”
Factors relating to the evaluation methods
Model building
ML metrics vs. Business metrics
Intrinsic vs. Extrinsic Evaluation#
Intrinsic evaluation assesses the performance of a machine learning model based solely on its performance on a specific task or dataset. Extrinsic evaluation assesses the performance of a machine learning model within the context of a larger system or real-world application.
Take a spam-classification system as an example:
Intrinsic: the precision and recall of the spam classification/prediction
Extrinsic: the amount of time users spend on a spam email
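The precision and recall mentioned above can be computed directly from gold labels and predictions. The sketch below uses made-up labels and a hypothetical helper named precision_recall; it treats "spam" as the positive class.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up gold labels and predictions for six emails.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision answers "of the emails flagged as spam, how many really were spam?", while recall answers "of the real spam emails, how many were caught?". Both are intrinsic measures: they say nothing yet about how much time users actually save.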
General Principles#
Do intrinsic evaluation before extrinsic.
Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.
Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.
Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.
Common Intrinsic Metrics#
Principles for Evaluation Metrics Selection
Data type of the labels (ground truths)
Binary (e.g., sentiment)
Ordinal (e.g., information retrieval)
Categorical (e.g., POS tags)
Textual (e.g., named entity, machine translation, text generation)
Automatic vs. Human Evaluation
Post-Modeling Phases#
Deployment of the model in a production environment (e.g., web service)
Monitoring system performance on a regular basis
Updating system with new-coming data
Chapter 2 of Practical Natural Language Processing. [Vajjala et al., 2020]