NLP Pipeline#

A General NLP Pipeline#

(Figure: a general NLP pipeline)

Variations of the NLP Pipeline#

  • The process may not always be linear.

  • There may be loops between the steps.

  • The exact procedure depends on the specific task at hand.

Data Collection#

Data Acquisition: Heart of ML System#

  • Ideal Setting: We have everything needed.

  • Labels and Annotations

  • Very often we are dealing with less-than-ideal scenarios

Less-than-ideal Scenarios#

  • Initial datasets with limited annotations/labels

  • Initial datasets labeled/extracted based on regular expressions or heuristics

  • Public datasets (cf. Google Dataset Search or Kaggle)

  • Scraped data

  • Product intervention (i.e., the deliberate manipulation or curation of data by product developers or platform owners, which may introduce biases or distortions into the dataset)

Data Augmentation#

  • Data augmentation in natural language processing (NLP) is a technique for increasing the diversity and quantity of training data by exploiting language properties to create new text samples that are syntactically similar to the original source data.

  • Types of strategies (a minimal synonym-replacement sketch follows this list):

    • Synonym replacement

    • Related word replacement (based on association metrics)

    • Back translation

    • Replacing entities

    • Adding noise to data (e.g. spelling errors, random words)
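
The sketch below illustrates the first strategy, synonym replacement, using NLTK's WordNet. It is a minimal sketch, not a production augmenter: the helper name synonym_replace and the sample sentence are made up for illustration, and the 'wordnet' and 'punkt' data must be downloaded first.

## A minimal synonym-replacement sketch based on NLTK's WordNet
## (assumes the 'wordnet' and 'punkt' NLTK data have been downloaded)
import random
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

def synonym_replace(text, n=2):
    """Replace up to n words that have WordNet synonyms (hypothetical helper)."""
    words = word_tokenize(text)
    ## indices of words that have at least one WordNet synset
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        ## collect all lemma names across the word's synsets
        lemmas = {l.name().replace('_', ' ')
                  for s in wordnet.synsets(words[i])
                  for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
    return ' '.join(words)

print(synonym_replace('The quick brown fox jumps over the lazy dog'))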

Text Extraction and Cleanup#

Text Extraction#

  • Extracting raw texts from the input data

    • HTML

    • PDF

  • Relevant vs. irrelevant information

    • non-textual information

    • markup

    • metadata

  • Encoding format

Extracting texts from webpages#

Extracting textual content from webpages is a very common way to obtain data. It requires a close study of the structure of the HTML source of the pages concerned.

The following code extracts hyperlinks from the index page of PTT, a well-known public bulletin board in Taiwan, and then navigates to the first hyperlink to retrieve its textual content.

## -------------------- ##
## Colab Only           ##
## -------------------- ##

! pip install requests_html
! apt install tesseract-ocr
! apt install libtesseract-dev
! pip install pytesseract
## Example 1: PTT Gossiping
from requests_html import HTMLSession
from bs4 import BeautifulSoup

## HTML session
session = HTMLSession()
session.cookies.set('over18', '1')  # age verification

## Processing index page
newsurl = 'https://www.ptt.cc/bbs/Gossiping/index.html'
response = session.get(newsurl)
web_content = response.text

## Extract all hyperlinks
soup = BeautifulSoup(web_content,'html.parser')
links = [x['href'] for x in soup.select('div.title > a')]
print(links[:5])

## Extract first article content
first_art_link = 'https://www.ptt.cc' + links[0]  # hrefs already start with '/'
print(first_art_link)

response = session.get(first_art_link)
web_content = response.text

## Extract article from HTML
soup = BeautifulSoup(web_content,'html.parser')

## Article Metadata
header = soup.find_all('span','article-meta-value')

# Author
author = header[0].text
# Board
board = header[1].text
# Title
title = header[2].text
# Date
date = header[3].text


## Article Content
main_container = soup.find(id='main-container')

## Paragraphs
all_text = main_container.text
pre_text = all_text.split('--')[0]
texts = pre_text.split('\n')
contents = texts[2:]
content = '\n'.join(contents)

# Printing
print('作者:'+author)
print('看板:'+board)
print('標題:'+title)
print('日期:'+date)
print('內容:'+content)
['/bbs/Gossiping/M.1708555641.A.B86.html', '/bbs/Gossiping/M.1708555709.A.95A.html', '/bbs/Gossiping/M.1708555913.A.A76.html', '/bbs/Gossiping/M.1708556060.A.DD7.html', '/bbs/Gossiping/M.1708556086.A.06B.html']
https://www.ptt.cc/bbs/Gossiping/M.1708555641.A.B86.html
作者:A98454 ( 阿肥是肥宅的肥)
看板:Gossiping
標題:Re: [問卦] 科舉制度是控制人民的手段嗎?
日期:Thu Feb 22 03:47:22 2024
內容: 米那桑 空尼幾蛙 關於J個問題  阿肥覺得是這樣的

 說到控制人民就要說說 法家思想了,荀子認為任性本惡,因此人需要學習去壓制內心慾望
的邪惡,遵行禮法

 法家延續了這個思想,認為學習的成效還是不夠,人就是犯賤必須要有法來約束他們,才
會乖乖

 因此商鞅變法並提出農戰的理論,農戰理論就是說呢,國家糧食充足就能變強,所以人民
就別想有的沒的好好跟著國家給的kpi ,乖乖種田就好了

 至於那些有其他想法的,思想hen 危險的,就發配充軍思想改造,避免這些人影響農民種
田,所以人民不要想有的沒乖乖種田就好了

 然後糧食充足了,飽暖思淫慾,就有更多農奴了

 如果農奴太多土地不夠怎麽辦呢?這時候就要發動戰爭消耗多餘的人口了

 這就是農戰的核心思想

  至於科舉制度嗎,算是一種安撫懷柔政策吧,早期九品門第思想,造成平民沒有上升通道
,大概武則天開始,出現科舉制度的雛形,(也許要拉攏一些底層勢力抗衡 朝中的貴族階
級,畢竟武后的政敵很多)

  農民有了科舉制度後,也就是下階層能夠上升,國家有更多人才使用,另外下階層會跟上
階層內鬥,間接分化內部矛盾指向統治者,保護王權

 所以 科舉制度 算是 帝國2.0的新制度發明吧


 大概就是醬

 樓下  李白 怎麽看?

Extracting texts from scanned PDF#

Important

Tesseract is an open-source text recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly, or (for programmers) through an API, to extract printed text from images. It supports a wide variety of languages.

You have to install the Tesseract engine itself (e.g., via apt, as shown above) before you can use it from Python through pytesseract.

## -------------------- ##
## Colab Only           ##
## -------------------- ##

from google.colab import drive
drive.mount('/content/drive')
from PIL import Image
from pytesseract import image_to_string

#import os
#print(os.getcwd())

YOUR_DEMO_DATA_PATH = "/content/drive/MyDrive/ENC2045_demo_data/"  # please change your file path on local machines
filename = YOUR_DEMO_DATA_PATH+'pdf-firth-text.png'

text = image_to_string(Image.open(filename))
print(text)
Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96

SPIL 14 (1986) 31- 6¢ 31

THE LINGUISTIC THOUGHT OF J.R. FIRTH

Nigel Love

"The study of the living votce of a
man tn aectton ts a very btg job in-

ii
deed." --- J.R. Firth

John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known

as the "London school” of linguistics.

"The business of linguistics", according to Firth, "is to

1}

describe languages". In saying as much he would have the
assent of most twentieth-century linguistic theorists.

Where he parts company with many is in holding that this
enterprise is not incompatible with, or even separable from,
studying “the living voice of a man in action"; and his
chief interest as a linguistic thinker lies in his attempt
to resist the idea that synchronic descriptive linguistics
should treat what he calis “speech-events" as no more than
a means of access to what really interests the linguist:

the Language-system underlying them.

Languages, according to many theorists, are to be envisaged
as systems of abstract entities. These entities are units

of linguistic “form. Units of linguistic form are of two


Unicode normalization#

text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'
print(text)

## encode the strings in bytes
text2 = text.encode('utf-8') 
print(text2)
I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3 \xc8\x80\xc3\x86\xc4\x8e\xc7\xa6\xc6\x93'
import unicodedata

## normalize by decomposition
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
'I feel really . GOGOGO!!    ADG'
## normalize by composition
unicodedata.normalize('NFKC', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
'I feel really . GOGOGO!!    '
## normalize full-width punctuation
text_punks = ',。「」『』!《'
text_punks2 = unicodedata.normalize('NFKD', text_punks).encode('utf-8', 'ignore').decode('utf-8', 'ignore')
print(text_punks)
print(text_punks2)
,。「」『』!《
,。「」『』!《
## Check comma and exclamation mark
print(ord(text_punks[0]))
print(ord(text_punks2[0])) ## normalized

print(ord(text_punks[6]))
print(ord(text_punks2[6])) ## normalized
65292
44
65281
33
## char vs. code point (decimal), code point (hexadecimal)
print(ord('我'))
print(hex(ord('我')))
print('\u6211')
25105
0x6211
我
  • Please check the unicodedata documentation for more details on character normalization.

  • Other useful libraries

    • Spelling check: pyenchant, Microsoft REST API

    • PDF: PyPDF, PDFMiner (a minimal pypdf sketch follows this list)

    • OCR: pytesseract
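
For non-scanned PDFs (i.e., PDFs with an embedded text layer, where OCR is unnecessary), here is a minimal sketch using pypdf; 'sample.pdf' is a hypothetical local file.

## A minimal text-extraction sketch with pypdf (assumes pypdf is installed)
from pypdf import PdfReader

reader = PdfReader('sample.pdf')  # 'sample.pdf' is a hypothetical local file
for page in reader.pages:
    ## extract the embedded text layer page by page
    print(page.extract_text())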

Cleanup (Preprocessing)#

Segmentation and Tokenization#

Important

The NLTK package provides many useful datasets for text analysis. Some of the code below may require you to download the corpus data first. Please see Installing NLTK Data for more information.

## -------------------- ##
## Colab Only           ##
## -------------------- ##

import nltk
nltk.download(['punkt', 'stopwords', 'wordnet'])
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
True
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sent segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))
Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']
  • Frequent preprocessing

    • Stopword removal

    • Stemming and/or lemmatization

    • Digit/punctuation removal

    • Case normalization

Removing stopwords, punctuations, digits#

from nltk.corpus import stopwords
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# remove stopwords, punctuations, digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)
['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.

Stemming and lemmatization#

## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])
['car', 'revolut', 'better']
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## Wordnet requires POS of words
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))
car
revolution
good
  • Task-specific preprocessing

    • Unicode normalization

    • Language detection

    • Code mixing

    • Transliteration (e.g., using pinyin for Chinese words in English-Chinese code-switching texts)

  • Automatic annotations

    • POS tagging (a minimal NLTK sketch follows this list)

    • Syntactic/Dependency Parsing

    • Named Entity Recognition

    • Coreference Resolution
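
As a quick illustration of automatic annotation, here is a minimal POS-tagging sketch with NLTK; it assumes the 'averaged_perceptron_tagger' data has been downloaded.

## A minimal POS-tagging sketch with NLTK
## (assumes the 'averaged_perceptron_tagger' data has been downloaded)
from nltk import pos_tag
from nltk.tokenize import word_tokenize

print(pos_tag(word_tokenize('Python is an interpreted language.')))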

Important Reminders for Preprocessing#

  • Not all steps are necessary

  • These steps are NOT sequential

  • These steps are task-dependent and language-dependent

  • Goals

    • Text Normalization

    • Text Tokenization

    • Text Enrichment/Annotation

Feature Engineering#

What is feature engineering?#

  • Feature engineering is the process of transforming raw data into informative features that are suitable for machine learning algorithms.

  • It usually involves creating, selecting, or modifying features from the original dataset to improve the performance of a machine learning model.

  • Feature engineering aims to extract relevant information from the data and represent it in a way that enhances the model’s ability to learn patterns and make accurate predictions.

  • For texts, this means capturing the characteristics of the text in a numeric vector that the ML algorithms can work with. (Cf. construct, operational definitions, and measurement in experimental science.)

  • In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.

Feature Engineering for Classical ML#

  • Word-based frequency lists

  • Bag-of-words representations (a minimal sketch follows this list)

  • Domain-specific word frequency lists

  • Handcrafted features based on domain-specific knowledge
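
Here is a minimal bag-of-words sketch, assuming scikit-learn is available; the toy corpus is made up for illustration.

## A minimal bag-of-words sketch with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I love this movie.',
          'I hate this movie.',
          'What a great movie!']

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # document-term count matrix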

Feature Engineering for DL#

  • DL models take the raw texts directly as inputs.

  • Deep learning models, particularly deep neural networks, are capable of automatically extracting meaningful features from raw data through multiple layers of abstraction.

  • This process is known as feature learning or representation learning, and it is a key advantage of deep learning over traditional machine learning approaches.

  • However, a drawback is that deep learning models are often less interpretable compared to traditional machine learning approaches.

Strengths and Weaknesses#

Modeling#

From Simple to Complex#

  • Start with heuristics or rules

  • Experiment with different ML models

    • From heuristics to features

    • From manual annotation to automatic extraction

    • Feature importance (weights)

  • Find the optimal model

    • Ensemble and stacking

    • Redo feature engineering

    • Transfer learning

    • Reapply heuristics

Note

  • Ensemble learning combines predictions from multiple individual models directly, while stacking involves training a meta-model to combine predictions from base models.

  • Ensemble learning is typically simpler to implement but may not capture complex relationships as effectively as stacking.

  • Stacking can lead to improved performance but requires additional data for training the meta-model and is more complex to implement and tune. (A minimal scikit-learn sketch of both follows this note.)
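
Here is a minimal sketch contrasting the two, assuming scikit-learn is available; the base models, meta-model, and synthetic data are chosen arbitrarily for illustration.

## A minimal sketch contrasting voting ensembles and stacking
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

## synthetic data for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

base_models = [('nb', GaussianNB()),
               ('dt', DecisionTreeClassifier(random_state=42))]

## Ensemble: majority vote over the base models' predictions
voter = VotingClassifier(estimators=base_models, voting='hard')

## Stacking: a meta-model learns to combine the base models' predictions
stacker = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression())

print('Voting  :', cross_val_score(voter, X, y, cv=5).mean())
print('Stacking:', cross_val_score(stacker, X, y, cv=5).mean())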

Evaluation#

Why evaluation?#

  • We need to know how good the model we’ve built is – “Goodness”

  • Factors relating to the evaluation methods

    • Model building

    • Deployment

    • Production

  • ML metrics vs. Business metrics

Intrinsic vs. Extrinsic Evaluation#

Intrinsic evaluation assesses a machine learning model directly on a specific task or dataset, using task-specific metrics. Extrinsic evaluation assesses the model within the context of a larger system or real-world application.

  • Take a spam-classification system as an example

    • Intrinsic:

      • the precision and recall of the spam classification/prediction (a toy computation follows this list)

    • Extrinsic:

      • the amount of time users spent on a spam email
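
Here is a toy computation of the intrinsic metrics above, assuming scikit-learn is available; the label vectors are made up.

## A toy illustration of intrinsic metrics for spam classification
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth (1 = spam)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))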

General Principles#

  • Do intrinsic evaluation before extrinsic.

  • Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.

  • Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.

  • Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.

Common Intrinsic Metrics#

  • Principles for Evaluation Metrics Selection

  • Data type of the labels (ground truths)

    • Binary (e.g., sentiment)

    • Ordinal (e.g., information retrieval)

    • Categorical (e.g., POS tags)

    • Textual (e.g., named entity, machine translation, text generation)

  • Automatic vs. Human Evaluation

Post-Modeling Phases#

  • Deployment of the model in a production environment (e.g., web service)

  • Monitoring system performance on a regular basis

  • Updating the system with newly incoming data

References#