NLP Pipeline#

A General NLP Pipeline#

(Figure: nlp-pipeline, a schematic of the general NLP pipeline)

Variations of the NLP Pipeline#

  • The process is not always linear.

  • There are often loops back to earlier steps.

  • The exact procedures depend on the specific task at hand.

Data Collection#

Data Acquisition: The Heart of an ML System#

  • Ideal Setting: We have everything needed.

  • Labels and Annotations

  • Very often we are dealing with less-than-ideal scenarios.

Less-than-ideal Scenarios#

  • Initial datasets with limited annotations/labels

  • Initial datasets labeled based on regular expressions or heuristics (see the sketch after this list)

  • Public datasets (cf. Google Dataset Search)

  • Scraping data

  • Product intervention

  • Data augmentation
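One of these scenarios, labeling with regular expressions or heuristics (so-called weak labeling), can be illustrated with a minimal sketch; the spam cues and labels below are made up purely for illustration:

import re

## Hypothetical spam cues for heuristic (weak) labeling
SPAM_PATTERN = re.compile(r'free|winner|click here|\$\d+', re.I)

def heuristic_label(text):
    """Label a text 'spam' if any spam cue matches, else 'ham'."""
    return 'spam' if SPAM_PATTERN.search(text) else 'ham'

print(heuristic_label('You are a WINNER! Click here.'))  # spam
print(heuristic_label('Meeting moved to 3 pm.'))          # ham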

Data Augmentation#

  • It is a technique that exploits properties of language to create new texts that are syntactically similar to the source text data.

  • Types of strategies:

    • Synonym replacement (illustrated below)

    • Related-word replacement (based on association metrics)

    • Back translation

    • Replacing entities

    • Adding noise to the data (e.g., spelling errors, random words)
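A minimal sketch of the first strategy, synonym replacement, using NLTK's WordNet interface (this assumes the wordnet data has been downloaded via nltk.download('wordnet'); the helper function is our own illustration, not a standard API):

import random
from nltk.corpus import wordnet as wn

def replace_with_synonym(word, pos=wn.NOUN):
    """Return a random WordNet synonym of `word`, or the word itself."""
    lemmas = {l.name().replace('_', ' ')
              for s in wn.synsets(word, pos=pos) for l in s.lemmas()}
    lemmas.discard(word)                  # don't replace a word with itself
    return random.choice(sorted(lemmas)) if lemmas else word

print(replace_with_synonym('car'))        # e.g., 'automobile'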

Text Extraction and Cleanup#

Text Extraction#

  • Extracting raw texts from the input data

    • HTML

    • PDF

  • Relevant vs. irrelevant information

    • non-textual information

    • markup

    • metadata

  • Encoding format

Extracting texts from webpages#

import requests
from bs4 import BeautifulSoup

## Google News topic page (Traditional Chinese edition)
url = 'https://news.google.com/topics/CAAqJQgKIh9DQkFTRVFvSUwyMHZNRFptTXpJU0JYcG9MVlJYS0FBUAE?hl=zh-TW&gl=TW&ceid=TW%3Azh-Hant'
r = requests.get(url)
web_content = r.text
soup = BeautifulSoup(web_content, 'html.parser')

## Collect the article links (the class name is specific to Google News's
## markup at the time of writing and may change)
title = soup.find_all('a', class_='DY5T1d')

## The hrefs are relative (they start with '.'); prepend the base URL
first_art_link = title[0]['href'].replace('.', 'https://news.google.com', 1)

## Fetch the first article and extract the text of its <p> elements
art_request = requests.get(first_art_link)
art_request.encoding = 'utf8'
soup_art = BeautifulSoup(art_request.text, 'html.parser')

art_content = soup_art.find_all('p')
art_texts = [p.text for p in art_content]
print(art_texts)
['衛福部立桃園醫院群聚感染九人確診,中央流行疫情指揮中心昨證實,院內六、七、九、十樓出現確診案例,為高風險區,昨起啟動「清空醫院」計畫,今將二百廿名患者依風險移至他院的專責隔離病房;,院方已採紅、黃、綠區醫護分艙分流管制,其中,三、六、七、十三樓加護病房及內科病房「紅區」全面管制醫護進出。', '\n', ...]

Extracting texts from scanned PDF#

from PIL import Image
from pytesseract import image_to_string

## Run OCR on a scanned page saved as an image
filename = '../../../RepositoryData/data/pdf-firth-text.png'
text = image_to_string(Image.open(filename))
print(text)
Stellenbosch Papers in Linguistics, Vol. 15, 1986, 31-60 doi: 10.5774/15-0-96

SPIL 14 (1986) 31- 6¢ 31

THE LINGUISTIC THOUGHT OF J.R. FIRTH

Nigel Love

"The study of the living votce of a
man tn aectton ts a very btg job in-

ii
deed." --- J.R. Firth

John Rupert Firth was born in 1890. After serving as Pro-
fessor of English at the University of the Punjab from 1919
to 1928, he took up a pest in the phonetics department of
University College, London. In 1938 he moved to the lin-
guistics department of the School of Oriental and African
Studies in London, where from 1944 until his retirement in
1956 he was Professor of Generali Linguistics. He died in
1960. He was an influential teacher, some of whose doctrines
(especially those concerning phonology) were widely propa-~
gated and developed by his students in what came to be known

as the "London school” of linguistics.

"The business of linguistics", according to Firth, "is to

1}

describe languages". In saying as much he would have the
assent of most twentieth-century linguistic theorists.

Where he parts company with many is in holding that this
enterprise is not incompatible with, or even separable from,
studying “the living voice of a man in action"; and his
chief interest as a linguistic thinker lies in his attempt
to resist the idea that synchronic descriptive linguistics
should treat what he calis “speech-events" as no more than
a means of access to what really interests the linguist:

the Language-system underlying them.

Languages, according to many theorists, are to be envisaged
as systems of abstract entities. These entities are units

of linguistic “form. Units of linguistic form are of two

Note that the OCR output above contains recognition errors (e.g., “votce” for “voice”), which is part of why text cleanup must follow extraction.

Unicode normalization#

text = 'I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣'
print(text)
text2 = text.encode('utf-8')
print(text2)
I feel really 😡. GOGOGO!! 💪💪💪  🤣🤣
b'I feel really \xf0\x9f\x98\xa1. GOGOGO!! \xf0\x9f\x92\xaa\xf0\x9f\x92\xaa\xf0\x9f\x92\xaa  \xf0\x9f\xa4\xa3\xf0\x9f\xa4\xa3'
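The snippet above shows UTF-8 encoding of the raw string; Unicode normalization proper (e.g., composing accented characters into canonical NFC form) can be done with the standard library's unicodedata module. A minimal sketch, with made-up example strings:

import unicodedata

s1 = 'café'                # accented e as a single code point (U+00E9)
s2 = 'cafe\u0301'          # 'e' followed by a combining acute accent
print(s1 == s2)                                 # False
print(unicodedata.normalize('NFC', s2) == s1)   # True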
  • Other useful libraries

    • Spelling check: pyenchant, Microsoft REST API

    • PDF: PyPDF, PDFMiner (see the sketch below)

    • OCR: pytesseract
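For machine-readable (non-scanned) PDFs, text extraction can be sketched with pypdf, the maintained successor of PyPDF2 (the file name below is hypothetical):

from pypdf import PdfReader

reader = PdfReader('sample.pdf')    # hypothetical input file
## Concatenate the text of all pages
text = '\n'.join(page.extract_text() for page in reader.pages)
print(text[:200])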

Cleanup#

  • Preliminaries

    • Sentence segmentation

    • Word tokenization

Segmentation and Tokenization#

from nltk.tokenize import sent_tokenize, word_tokenize

text = '''
Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
'''

## sent segmentation
sents = sent_tokenize(text)

## word tokenization
for sent in sents:
    print(sent)
    print(word_tokenize(sent))
Python is an interpreted, high-level and general-purpose programming language.
['Python', 'is', 'an', 'interpreted', ',', 'high-level', 'and', 'general-purpose', 'programming', 'language', '.']
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
['Python', "'s", 'design', 'philosophy', 'emphasizes', 'code', 'readability', 'with', 'its', 'notable', 'use', 'of', 'significant', 'whitespace', '.']
Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.
['Its', 'language', 'constructs', 'and', 'object-oriented', 'approach', 'aim', 'to', 'help', 'programmers', 'write', 'clear', ',', 'logical', 'code', 'for', 'small', 'and', 'large-scale', 'projects', '.']
  • Frequent preprocessing

    • Stopword removal

    • Stemming and/or lemmatization

    • Digits/punctuation removal

    • Case normalization

Removing stopwords, punctuation, and digits#

from nltk.corpus import stopwords
from string import punctuation

eng_stopwords = stopwords.words('english')

text = "Mr. John O'Neil works at Wonderland, located at 245 Goleta Avenue, CA., 74208."

words = word_tokenize(text)

print(words)

# remove stopwords, punctuations, digits
for w in words:
    if w not in eng_stopwords and w not in punctuation and not w.isdigit():
        print(w)
['Mr.', 'John', "O'Neil", 'works', 'at', 'Wonderland', ',', 'located', 'at', '245', 'Goleta', 'Avenue', ',', 'CA.', ',', '74208', '.']
Mr.
John
O'Neil
works
Wonderland
located
Goleta
Avenue
CA.

Stemming and lemmatization#

## Stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

words = ['cars','revolution', 'better']
print([stemmer.stem(w) for w in words])
['car', 'revolut', 'better']
## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## The WordNet lemmatizer requires the POS of each word
poss = ['n','n','a']

for w,p in zip(words,poss):
    print(lemmatizer.lemmatize(w, pos=p))
car
revolution
good
  • Task-specific preprocessing

    • Unicode normalization

    • Language detection

    • Code mixing

    • Transliteration

  • Automatic annotations (see the sketch below)

    • POS tagging

    • Parsing

    • Named entity recognition

    • Coreference resolution
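A minimal sketch of automatic annotation with NLTK's off-the-shelf POS tagger and named-entity chunker (this assumes the averaged_perceptron_tagger, maxent_ne_chunker, and words data packages have been downloaded; the sentence is our own toy example):

from nltk import pos_tag, ne_chunk, word_tokenize

sent = 'John works at Wonderland in California.'
tagged = pos_tag(word_tokenize(sent))   # POS tagging
print(tagged)
print(ne_chunk(tagged))                 # named entity recognition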

Important Reminders for Preprocessing#

  • Not all steps are necessary

  • These steps are NOT sequential

  • These steps are task-dependent

Feature Engineering#

What is feature engineering?#

  • It refers to the process of turning the extracted and preprocessed texts into input that a machine-learning algorithm can consume.

  • It aims to capture the characteristics of the text in a numeric vector that ML algorithms can understand (cf. constructs, operational definitions, and measurement in experimental science).

  • In short, it concerns how to meaningfully represent texts quantitatively, i.e., text representation.

Feature Engineering for Classical ML#

  • word-based frequency lists

  • bag-of-words representations (illustrated below)

  • domain-specific word frequency lists

  • handcrafted features based on domain-specific knowledge
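A minimal sketch of a bag-of-words representation with scikit-learn's CountVectorizer (scikit-learn is assumed installed; the toy documents are our own):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['Python is interpreted.', 'Python emphasizes readability.']
vec = CountVectorizer()
X = vec.fit_transform(docs)            # document-term matrix
print(vec.get_feature_names_out())     # the learned vocabulary
print(X.toarray())                     # word counts per document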

Feature Engineering for DL#

  • DL models take the (preprocessed) texts directly as inputs.

  • The DL model learns features from the texts on its own (e.g., embeddings); see the sketch below.

  • The learned features are less interpretable.
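A minimal sketch of this idea with a Keras Embedding layer, whose word vectors are learned during training (TensorFlow is assumed installed; the vocabulary size, dimensions, and toy word-index sequences are arbitrary):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=64),  # learned word vectors
    tf.keras.layers.GlobalAveragePooling1D(),                   # average over tokens
    tf.keras.layers.Dense(1, activation='sigmoid'),             # binary classifier
])
model.compile(optimizer='adam', loss='binary_crossentropy')

## Toy batch: two 'documents', each a sequence of five word indices
X = np.array([[1, 4, 9, 2, 0], [3, 3, 7, 0, 0]])
print(model.predict(X))   # one probability per document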

Modeling#

From Simple to Complex#

  • Start with heuristics or rules

  • Experiment with different ML models

    • from heuristics to features

    • from manual annotation to automatic extraction

    • feature importance (weights)

  • Find the optimal model

    • Ensemble and stacking (see the sketch below)

    • Redo feature engineering

    • Transfer learning

    • Reapply heuristics
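A minimal sketch of model ensembling with scikit-learn's VotingClassifier, using synthetic toy data purely for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

## Synthetic features/labels standing in for real text features
X, y = make_classification(n_samples=200, random_state=42)

ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('nb', GaussianNB()),
    ('rf', RandomForestClassifier(random_state=42)),
])
ensemble.fit(X, y)
print(ensemble.score(X, y))   # training accuracy of the ensemble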

Evaluation#

Why evaluation?#

  • We need to know how good the model we’ve built is, i.e., its “goodness”.

  • The choice of evaluation method relates to several phases:

    • model building

    • deployment

    • production

  • ML metrics vs. business metrics

Intrinsic vs. Extrinsic Evaluation#

  • Take a spam-classification system as an example:

  • Intrinsic:

    • the precision and recall of the spam classification/prediction

  • Extrinsic:

    • the amount of time users spent on a spam email

General Principles#

  • Do intrinsic evaluation before extrinsic evaluation.

  • Extrinsic evaluation is more expensive because it often involves project stakeholders outside the AI team.

  • Only when we get consistently good results in intrinsic evaluation should we go for extrinsic evaluation.

  • Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation as well.

Common Intrinsic Metrics#

  • Principles for Evaluation Metrics Selection

  • Data type of the labels (ground truths)

    • Binary (e.g., sentiment; see the sketch below)

    • Ordinal (e.g., information retrieval)

    • Categorical (e.g., POS tags)

    • Textual (e.g., named entity, machine translation, text generation)

  • Automatic vs. human evaluation
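A minimal sketch of intrinsic metrics for binary labels with scikit-learn (the gold labels and predictions are made up for illustration):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # gold labels (e.g., spam = 1)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions
print(precision_score(y_true, y_pred))   # 1.0
print(recall_score(y_true, y_pred))      # 0.75
print(f1_score(y_true, y_pred))          # ~0.86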

Post-Modeling Phases#

Post-Modeling Phases#

  • Deployment of the model in a production environment (e.g., as a web service; see the sketch below)

  • Monitoring system performance on a regular basis

  • Updating the system with newly incoming data
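A minimal sketch of serving a model as a web service with Flask (Flask is assumed installed; the endpoint name and the placeholder prediction function are our own illustration, not a real trained model):

from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_label(text):
    """Placeholder standing in for a real trained model's prediction."""
    return 'spam' if 'free' in text.lower() else 'ham'

@app.route('/predict', methods=['POST'])
def predict():
    text = request.get_json()['text']
    return jsonify({'label': predict_label(text)})

if __name__ == '__main__':
    app.run(port=5000)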

References#

  • Chapter 2 of Practical Natural Language Processing. [1]

[1] Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, and Harshit Surana. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. O'Reilly Media, 2020.