
# NLP Pipeline

## A General NLP Pipeline

![nlp-pipeline](../images/nlp-pipeline.png)

### Variations of the NLP Pipeline

- The process is not always linear.
- Later steps may loop back to earlier ones.
- The procedures depend on the specific task at hand.

:::{contents}
:depth: 3
:::

## Data Collection

### Data Acquisition: The Heart of an ML System

- Ideal setting: we have everything needed, including labels and annotations.
- Very often, however, we are dealing with less-than-ideal scenarios.

### Less-than-ideal Scenarios

- Initial datasets with limited annotations/labels
- Initial datasets labeled/extracted based on regular expressions or heuristics
- Public datasets (cf. [Google Dataset Search](https://datasetsearch.research.google.com/) or [Kaggle](https://www.kaggle.com/))
- Scraped data
- Product intervention: the deliberate manipulation or curation of data by product developers or platform owners, which may introduce biases or distortions into the dataset

### Data Augmentation

- Data augmentation in natural language processing (NLP) increases the diversity and quantity of training data by exploiting properties of language to create new text samples that are syntactically similar to the original source text.
- Types of strategies:
    - Synonym replacement
    - Related word replacement (based on association metrics)
    - Back translation
    - Replacing entities
    - Adding noise to data (e.g. spelling errors, random words)
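
Two of the strategies above, synonym replacement and noise injection, can be sketched in a few lines. This is only an illustration under stated assumptions: the `SYNONYMS` table and the helper names `synonym_replace` and `add_typo` are hypothetical, and a real system would more likely draw synonyms from a lexical resource such as WordNet.

```python
import random

# Toy synonym table (hypothetical); real systems might use WordNet or embeddings.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}


def synonym_replace(tokens, rng):
    """Replace every token that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]


def add_typo(word, rng):
    """Simulate a spelling error by swapping two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


rng = random.Random(42)  # fixed seed so the augmented samples are reproducible
tokens = "the quick dog looks happy".split()
print(" ".join(synonym_replace(tokens, rng)))
print(" ".join(add_typo(t, rng) for t in tokens))
```

Each call produces a new variant of the same sentence, so a small labeled dataset can be expanded without additional annotation.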

## Text Extraction and Cleanup

### Text Extraction

- Extracting raw texts from the input data
    - HTML
    - PDF
- Relevant vs. irrelevant information
    - non-textual information
    - markup
    - metadata
- Encoding format
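
The points above (decoding with the right encoding, then separating text from markup and non-textual content) can be sketched with the standard library alone. The class `TextExtractor` and the function `extract_text` are hypothetical names used for illustration. The examples later in this section use BeautifulSoup instead.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(raw_bytes, encoding="utf-8"):
    """Decode raw bytes with the given encoding and return the visible text."""
    parser = TextExtractor()
    parser.feed(raw_bytes.decode(encoding, errors="replace"))
    parser.close()
    return " ".join(parser.chunks)


html = (b"<html><head><title>Demo</title><style>p {color: red}</style></head>"
        b"<body><p>Hello, <b>world</b>!</p><script>var x = 1;</script></body></html>")
print(extract_text(html))
```

The tags themselves never reach `handle_data`, so markup is dropped automatically. Only the script and style contents need explicit filtering.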

#### Extracting texts from webpages

Extracting text from web pages is a very common way to obtain data. It requires a close study of the HTML structure of the target pages.

The following code extracts the hyperlinks from the index page of PTT, a well-known public bulletin board in Taiwan, and then follows the first hyperlink to retrieve the text of that article.

In [20]:
## -------------------- ##
## Colab Only           ##
## -------------------- ##

! pip install requests_html
! apt install tesseract-ocr
! apt install libtesseract-dev
! pip install pytesseract




In [37]:
## Example 1: PTT Gossiping
from requests_html import HTMLSession
import requests
from bs4 import BeautifulSoup
import pandas as pd

## HTML session
session = HTMLSession()
session.cookies.set('over18', '1')  # age verification

## Processing index page
newsurl = 'https://www.ptt.cc/bbs/Gossiping/index.html'
response = session.get(newsurl)
web_content = response.text

## Extract all hyperlinks
soup = BeautifulSoup(web_content,'html.parser')
title = [x['href'] for x in soup.select('div.title > a')]
print(title[:5])

## Extract first article content
first_art_link = 'https://www.ptt.cc/'+ title[0]
print(first_art_link)

response = session.get(first_art_link)
web_content = response.text

## Extract article from HTML
soup = BeautifulSoup(web_content,'html.parser')

## Article Metadata
header = soup.find_all('span','article-meta-value')

# Author
author = header[0].text
# Board
board = header[1].text
# Title
title = header[2].text
# Date
date = header[3].text


## Article Content
main_container = soup.find(id='main-container')

## Paragraphs
all_text = main_container.text
pre_text = all_text.split('--')[0]
texts = pre_text.split('\n')
contents = texts[2:]
content = '\n'.join(contents)

# Printing
print('作者：'+author)   # author
print('看板：'+board)    # board
print('標題：'+title)    # title
print('日期：'+date)     # date
print('內容：'+content)  # content

['/bbs/Gossiping/M.1708555641.A.B86.html', '/bbs/Gossiping/M.1708555709.A.95A.html', '/bbs/Gossiping/M.1708555913.A.A76.html', '/bbs/Gossiping/M.1708556060.A.DD7.html', '/bbs/Gossiping/M.1708556086.A.06B.html']
https://www.ptt.cc//bbs/Gossiping/M.1708555641.A.B86.html
‰ΩúËÄÖÔºöA98454 ( ÈòøËÇ•ÊòØËÇ•ÂÆÖÁöÑËÇ•)
ÁúãÊùøÔºöGossiping
Ê®ôÈ°åÔºöRe: [ÂïèÂç¶] ÁßëËàâÂà∂Â∫¶ÊòØÊéßÂà∂‰∫∫Ê∞ëÁöÑÊâãÊÆµÂóéÔºü
Êó•ÊúüÔºöThu Feb 22 03:47:22 2024
ÂÖßÂÆπÔºö Á±≥ÈÇ£Ê°ë Á©∫Â∞ºÂπæËõô ÈóúÊñºJÂÄãÂïèÈ°å  ÈòøËÇ•Ë¶∫ÂæóÊòØÈÄôÊ®£ÁöÑ

 Ë™™Âà∞ÊéßÂà∂‰∫∫Ê∞ëÂ∞±Ë¶ÅË™™Ë™™ Ê≥ïÂÆ∂ÊÄùÊÉ≥‰∫ÜÔºåËçÄÂ≠êË™çÁÇ∫‰ªªÊÄßÊú¨ÊÉ°ÔºåÂõ†Ê≠§‰∫∫ÈúÄË¶ÅÂ≠∏ÁøíÂéªÂ£ìÂà∂ÂÖßÂøÉÊÖæÊúõ
ÁöÑÈÇ™ÊÉ°ÔºåÈÅµË°åÁ¶ÆÊ≥ï

 Ê≥ïÂÆ∂Âª∂Á∫å‰∫ÜÈÄôÂÄãÊÄùÊÉ≥ÔºåË™çÁÇ∫Â≠∏ÁøíÁöÑÊàêÊïàÈÇÑÊòØ‰∏çÂ§†Ôºå‰∫∫Â∞±ÊòØÁäØË≥§ÂøÖÈ†àË¶ÅÊúâÊ≥ï‰æÜÁ¥ÑÊùü‰ªñÂÄëÔºåÊâç
ÊúÉ‰πñ‰πñ

 Âõ†Ê≠§ÂïÜÈûÖËÆäÊ≥ï‰∏¶ÊèêÂá∫Ëæ≤Êà∞ÁöÑÁêÜË´ñÔºåËæ≤Êà∞ÁêÜË´ñÂ∞±ÊòØË™™Âë¢ÔºåÂúãÂÆ∂Á≥ßÈ£üÂÖÖË∂≥Â∞±ËÉΩËÆäÂº∑ÔºåÊâÄ‰ª•‰∫∫Ê∞ë
Â∞±Âà•ÊÉ≥ÊúâÁöÑÊ≤íÁöÑÂ•ΩÂ•ΩË∑üËëóÂúãÂÆ∂Áµ¶ÁöÑkpi ,‰πñ‰πñÁ®ÆÁî∞Â∞±Â•Ω‰∫Ü

 Ëá≥ÊñºÈÇ£‰∫õÊúâÂÖ∂‰ªñÊÉ≥Ê≥

In [None]:
## Example 2: China Times (not working because the IP is blocked)
import requests
from bs4 import BeautifulSoup
import pandas as pd

## China Times Homepage
newsurl = 'https://www.chinatimes.com/realtimenews/?chdtv'
r = requests.get(newsurl)
web_content = r.text
print(r.text)
soup = BeautifulSoup(web_content,'html.parser')
title = [x['href'] for x in soup.select('h3.title > a')]

first_art_link = 'https://www.chinatimes.com'+ title[0]+'?chdtv'

#print(first_art_link)
art_request = requests.get(first_art_link)
art_request.encoding='utf8'
soup_art = BeautifulSoup(art_request.text,'html.parser')

art_content = soup_art.find_all('p')
art_texts = [p.text for p in art_content]
print(art_texts)

#### Extracting text from scanned PDFs

:::{important}
[Tesseract](https://tesseract-ocr.github.io/tessdoc/Installation.html) is an open-source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly from the command line or, for programmers, through an API to extract printed text from images. It supports a wide variety of languages.

You have to **manually** install it before you can use it in Python.
:::

In [None]:
## -------------------- ##
## Colab Only           ##
## -------------------- ##

from google.colab import drive
drive.mount('/content/drive')

In [38]:
from PIL import Image
from pytesseract import image_to_string

#import os
#print(os.getcwd())

YOUR_DEMO_DATA_PATH = "/content/drive/MyDrive/ENC2045_demo_data/"  # please change your file path on local machines
filename = YOUR_DEMO_DATA_PATH+'pdf-firth-text.png'

text = image_to_string(Image.open(filename))
print(text)

Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96

SPIL 14 (1986) 31- 6¢ 31

THE LINGUISTIC THOUGHT OF J.R. FIRTH

Nigel Love

"The study of the living votce of a
man tn aectton ts a very btg job in-

ii
deed." --- J.R. Firth

John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known

as the "London school” of linguistics.

"The business of linguistics", according to Firth, "is to

1}

describe languages". In saying as much he would 

### Unicode normalization

In [15]:
text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ'
print(text)

## encode the strings in bytes
text2 = text.encode('utf-8') 
print(text2)


I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣 ȀÆĎǦƓ
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3 \xc8\x80\xc3\x86\xc4\x8e\xc7\xa6\xc6\x93'


In [13]:
import unicodedata

## normalize by decomposition
unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')


'I feel really . GOGOGO!!    ADG'

In [14]:
## normalize by composition
unicodedata.normalize('NFKC', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

'I feel really . GOGOGO!!    '

In [55]:
## normalize full-width punctuation
text_punks = '，。「」『』！《'
text_punks2 = unicodedata.normalize('NFKD', text_punks).encode('utf-8', 'ignore').decode('utf-8', 'ignore')

In [56]:
print(text_punks)
print(text_punks2)

，。「」『』！《
,。「」『』!《


In [64]:
## Check comma and exclamation mark
print(ord(text_punks[0]))
print(ord(text_punks2[0])) ## normalized

print(ord(text_punks[6]))
print(ord(text_punks2[6])) ## normalized

65292
44
65281
33


:::{note}
:class: dropdown

1. **NFKC (Normalization Form Compatibility Composition)**:
   - **Compatibility**: Characters that are compatibility variants of one another (e.g., full-width vs. half-width letters, ligatures) are mapped to a common form, even though they are encoded differently.
   - **Composition**: After decomposition, base characters and combining marks are recombined into precomposed characters wherever possible, which gives a more compact representation.

2. **NFKD (Normalization Form Compatibility Decomposition)**:
   - **Compatibility**: Like NFKC, NFKD maps compatibility variants to a common form.
   - **Decomposition**: Characters are split into their basic components, i.e., base characters followed by any diacritics or other combining marks, and are left in this decomposed form.
:::
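
The difference between the two forms is easiest to see at the level of code points. The following minimal sketch uses three illustrative characters chosen here (they do not come from the examples above): a precomposed accented letter, a ligature, and a full-width letter.

```python
import unicodedata

samples = {
    "precomposed e-acute": "\u00e9",   # é as a single code point
    "fi ligature": "\ufb01",           # ﬁ, a compatibility character
    "full-width A": "\uff21",          # Ａ
}

for name, s in samples.items():
    nfkd = unicodedata.normalize("NFKD", s)
    nfkc = unicodedata.normalize("NFKC", s)
    # Print code points so that composed and decomposed forms can be told apart
    print(name,
          [hex(ord(c)) for c in nfkd],
          [hex(ord(c)) for c in nfkc])
```

Both forms map the ligature and the full-width letter to plain ASCII. They differ only for the accented letter, which NFKD leaves as a base letter plus a combining accent and NFKC recombines into one code point.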

:::{seealso}
:class: dropdown

A few notes on Unicode:

1. **Unicode**: Unicode is a standard for encoding characters in digital systems. Each character is assigned a unique code point, which is a numerical value that represents that character. For example, the Unicode code point for the character "我" is `U+6211`.

2. **ord()**: In Python, the `ord()` function returns the Unicode code point (in decimal format) for a given character. For example, to get the Unicode code point for the character "我", you would use `ord("我")`, which returns `25105`.

3. **hex()**: The `hex()` function in Python converts an integer to its hexadecimal representation. So, if we pass the Unicode code point of "我" to `hex()`, it will return its hexadecimal representation. For example, `hex(25105)` will return `'0x6211'`, where `'0x'` indicates that the following digits are in hexadecimal format.
:::

In [66]:
## char vs. code number (decimal), code number (hexadecimal)
print(ord('我'))
print(hex(ord('我')))
print('\u6211')

25105
0x6211
我


- Please check the [unicodedata documentation](https://docs.python.org/3/library/unicodedata.html) for more details on character normalization.
- Other useful libraries
    - Spell checking: pyenchant, Microsoft REST API
    - PDF: PyPDF, PDFMiner
    - OCR: pytesseract


### Cleanup (Preprocessing)
    

#### Segmentation and Tokenization

:::{important}
The NLTK package provides many useful datasets for text analysis. Some of the code may require you to download the corpus data first. Please see [Installing NLTK Data](https://www.nltk.org/data.html) for more information.
:::

In [32]:
## -------------------- ##
## Colab Only           ##
## -------------------- ##

import nltk
nltk.download(['punkt', 'stopwords', 'wordnet'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [26]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sent segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))


Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']


- Frequent preprocessing steps
    - Stopword removal
    - Stemming and/or lemmatization
    - Digit/punctuation removal
    - Case normalization
    

#### Removing stopwords, punctuation, and digits

In [29]:
from nltk.corpus import stopwords
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# remove stopwords, punctuation, and digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)

['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.


#### Stemming and lemmatization

In [35]:
## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])


['car', 'revolut', 'better']


In [36]:
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## Wordnet requires POS of words
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))

car
revolution
good


- Task-specific preprocessing
    - Unicode normalization
    - Language detection
    - Code mixing
    - Transliteration (e.g., using pinyin for Chinese words in English-Chinese code-switching texts)
    

- Automatic annotations
    - POS tagging
    - Syntactic/Dependency Parsing
    - Named Entity Recognition
    - Coreference Resolution
    

### Important Reminders for Preprocessing

- Not all steps are necessary
- These steps are NOT sequential
- These steps are task-dependent and language-dependent
- Goals
    - Text Normalization
    - Text Tokenization
    - Text Enrichment/Annotation
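
Because the steps are optional and task-dependent, it can help to expose each one as a switch. The sketch below is a standard-library illustration only: the function `preprocess`, its parameters, and the tiny `STOPWORDS` set are hypothetical, and the regular-expression tokenizer is a crude stand-in for `word_tokenize`.

```python
import re
import unicodedata

# Tiny illustrative stopword list (hypothetical); NLTK's list is far larger.
STOPWORDS = {"a", "an", "the", "is", "at", "of", "and"}


def preprocess(text, lowercase=True, remove_stopwords=True, remove_digits=True):
    """Normalize and tokenize `text`; each step can be switched off per task."""
    text = unicodedata.normalize("NFKC", text)       # text normalization
    if lowercase:
        text = text.lower()                          # case normalization
    tokens = re.findall(r"\w+", text)                # naive word tokenization
    if remove_digits:
        tokens = [t for t in tokens if not t.isdigit()]
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens


print(preprocess("The office is at 245 Goleta Avenue！"))
print(preprocess("The office is at 245 Goleta Avenue！", remove_stopwords=False))
```

A sentiment classifier might keep stopwords such as negations, whereas a topic model would usually drop them, so the same function can serve both with different switches.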

## Feature Engineering

### What is feature engineering?

- Feature engineering is the process of transforming raw data into informative features that are suitable for machine learning algorithms. 
- It usually involves creating, selecting, or modifying features from the original dataset to improve the performance of a machine learning model.
- Feature engineering aims to extract relevant information from the data and represent it in a way that enhances the model's ability to learn patterns and make accurate predictions.
- It aims to capture the characteristics of a text in a numeric vector that ML algorithms can process. (Cf. *construct*, *operational definitions*, and *measurement* in experimental science)
- In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.

### Feature Engineering for Classical ML

- Word-based frequency lists
- Bag-of-words representations
- Domain-specific word frequency lists
- Handcrafted features based on domain-specific knowledge
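
As a concrete illustration of a bag-of-words representation, the following minimal sketch turns two toy sentences into count vectors over a shared vocabulary. The corpus and the helper name `bow_vector` are made up for this example, and whitespace splitting stands in for proper tokenization.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Vocabulary: every distinct word in the corpus, in a fixed (sorted) order
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})


def bow_vector(tokens, vocab):
    """Return the count of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


vectors = [bow_vector(doc, vocab) for doc in tokenized]
print(vocab)
for v in vectors:
    print(v)
```

Each document becomes a vector of the same length, which is what allows classical ML algorithms to compare documents numerically. Word order is discarded in the process.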

### Feature Engineering for DL

- Deep learning (DL) models take the text directly as input.
- Deep learning models, particularly deep neural networks, are capable of automatically extracting meaningful features from raw data through multiple layers of abstraction. 
- This process is known as feature learning or representation learning, and it is a key advantage of deep learning over traditional machine learning approaches.
- However, a drawback is that deep learning models are often less interpretable compared to traditional machine learning approaches.
    

### Strengths and Weaknesses

![](../images/feature-engineer-strengths.png)

![](../images/feature-engineer-weakness.png)

## Modeling

### From Simple to Complex

- Start with heuristics or rules
- Experiment with different ML models
    - From heuristics to features
    - From manual annotation to automatic extraction
    - Feature importance (weights)
- Find the optimal model
    - Ensemble and stacking
    - Redo feature engineering
    - Transfer learning
    - Reapply heuristics

:::{note}
- Simple ensembling (e.g., voting or averaging) combines the predictions of multiple individual models directly, while stacking trains a meta-model to combine the predictions of the base models.
- Simple ensembling is typically easier to implement but may not capture complex relationships among the base models' predictions as effectively as stacking.
- Stacking can lead to improved performance but requires additional data for training the meta-model and is more complex to implement and tune.
:::
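
The contrast between combining predictions directly and training a meta-model can be sketched without any ML library. Everything in this example is hypothetical: the base-model predictions are made-up data, and `TableMetaModel` is a deliberately tiny lookup-table learner standing in for a real meta-model such as logistic regression.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Combine base-model predictions directly: the most frequent label wins."""
    return Counter(predictions).most_common(1)[0][0]


class TableMetaModel:
    """A tiny meta-model for stacking.

    It memorizes, for each combination of base-model predictions seen in
    held-out data, the true label that most often accompanied it.
    """

    def fit(self, base_preds, y_true):
        table = defaultdict(Counter)
        for preds, y in zip(base_preds, y_true):
            table[tuple(preds)][y] += 1
        self.table_ = {k: c.most_common(1)[0][0] for k, c in table.items()}
        return self

    def predict(self, preds):
        # Fall back to majority voting for combinations never seen in training
        return self.table_.get(tuple(preds), majority_vote(preds))


# Hypothetical predictions of three base models on four held-out emails
held_out_preds = [("spam", "ham", "ham"), ("spam", "ham", "ham"),
                  ("ham", "ham", "spam"), ("spam", "spam", "ham")]
held_out_labels = ["spam", "spam", "ham", "spam"]

meta = TableMetaModel().fit(held_out_preds, held_out_labels)
new_preds = ("spam", "ham", "ham")
print(majority_vote(new_preds))   # voting follows the two models that say "ham"
print(meta.predict(new_preds))    # the meta-model follows the pattern it learned
```

The meta-model needs its own labeled held-out set, which is the extra data cost of stacking mentioned in the note.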

## Evaluation

### Why evaluation?

- We need to know how *good* the model we have built is, i.e., its "goodness"
- Factors relating to the evaluation methods
    - Model building
    - Deployment
    - Production
- ML metrics vs. Business metrics


### Intrinsic vs. Extrinsic Evaluation

Intrinsic evaluation measures a machine learning model directly on the task it was built for, using a test dataset and task-specific metrics. Extrinsic evaluation measures the model's performance within the context of a larger system or real-world application.

- Take a spam-classification system as an example
    -  Intrinsic:
        - the precision and recall of the spam classification/prediction
    - Extrinsic:
        - the amount of time users spent on a spam email
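
The intrinsic metrics in the spam example can be computed directly from gold labels and predictions. The sketch below assumes a small set of made-up labels, and `precision_recall` is a hypothetical helper name. In practice, a library such as scikit-learn would normally be used.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical gold labels and classifier predictions for six emails
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision answers how many of the emails flagged as spam really were spam, while recall answers how many of the real spam emails were caught. Neither says anything about how much user time was saved, which is what the extrinsic measure captures.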
    

### General Principles

- Do intrinsic evaluation before extrinsic evaluation.
- Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.
- Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.
- Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.

### Common Intrinsic Metrics

- Principles for Evaluation Metrics Selection
- Data type of the labels (ground truths)
    - Binary (e.g., sentiment)
    - Ordinal (e.g., information retrieval)
    - Categorical (e.g., POS tags)
    - Textual (e.g., named entity, machine translation, text generation)
- Automatic vs. Human Evaluation

## Post-Modeling Phases

### Post-Modeling Phases

- Deploying the model in a production environment (e.g., as a web service)
- Monitoring system performance on a regular basis
- Updating the system with newly arriving data

## References

- Chapter 2 of Practical Natural Language Processing. {cite}`vajjala2020`