Natural Language Processing (ckipnlp)#

  • Chinese NLP toolkit developed by Academia Sinica

  • The CPU version works pretty slowly

  • The documentation of ckipnlp is limited. Need more time to figure out what is what and how to do what :)

  • Documentation:

from ckipnlp.pipeline import CkipPipeline, CkipDocument

pipeline = CkipPipeline()
doc = CkipDocument(raw='中研院的開發系統,來測試看看,挺酷的!')
# Word Segmentation
pipeline.get_ws(doc)
print(doc.ws)
for line in doc.ws:
    print(line.to_text())

# Part-of-Speech Tagging
pipeline.get_pos(doc)
print(doc.pos)
for line in doc.pos:
    print(line.to_text())

# Named-Entity Recognition
pipeline.get_ner(doc)
print(doc.ner)
# Constituency Parsing
#pipeline.get_conparse(doc)
#print(doc.conparse)
/Users/Alvin/opt/anaconda3/envs/ckiptagger/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/Users/Alvin/opt/anaconda3/envs/ckiptagger/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/Users/Alvin/opt/anaconda3/envs/ckiptagger/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/Users/Alvin/opt/anaconda3/envs/ckiptagger/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/Users/Alvin/opt/anaconda3/envs/ckiptagger/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/Users/Alvin/opt/anaconda3/envs/ckiptagger/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
[['中研院', '的', '開發', '系統', ',', '來', '測試', '看看', ',', '挺', '酷', '的', '!']]
中研院 的 開發 系統 , 來 測試 看看 , 挺 酷 的 !
[['Nc', 'DE', 'Nv', 'Na', 'COMMACATEGORY', 'D', 'VC', 'Di', 'COMMACATEGORY', 'Dfa', 'VH', 'T', 'EXCLAMATIONCATEGORY']]
Nc DE Nv Na COMMACATEGORY D VC Di COMMACATEGORY Dfa VH T EXCLAMATIONCATEGORY
[[NerToken(word='中研院', ner='ORG', idx=(0, 3))]]
from ckipnlp.container.util.wspos import WsPosParagraph

# Word Segmentation & Part-of-Speech Tagging
for line in WsPosParagraph.to_text(doc.ws, doc.pos):
    print(line)
中研院(Nc) 的(DE) 開發(Nv) 系統(Na) ,(COMMACATEGORY) 來(D) 測試(VC) 看看(Di) ,(COMMACATEGORY) 挺(Dfa) 酷(VH) 的(T) !(EXCLAMATIONCATEGORY)
from ckipnlp.container.util.wspos import WsPosSentence
for line in WsPosParagraph.to_text(doc.ws, doc.pos):
    print(line)
中研院(Nc) 的(DE) 開發(Nv) 系統(Na) ,(COMMACATEGORY) 來(D) 測試(VC) 看看(Di) ,(COMMACATEGORY) 挺(Dfa) 酷(VH) 的(T) !(EXCLAMATIONCATEGORY)
doc2 = CkipDocument(raw='武漢肺炎全球肆虐,至今已有2906萬人確診、92萬染疫身亡,而流亡美國的中國大陸病毒學家閻麗夢,14日時開通了推特帳號,並公布一份長達26頁的科學論文,研究直指武肺病毒與自然人畜共通傳染病的病毒不同,並呼籲追查武漢P4實驗室及美國衛生研究院(NIH)之間的金流,引發討論。')
# Word Segmentation & Part-of-Speech Tagging

# Word Segmentation
pipeline.get_ws(doc2)
print(doc2.ws)
for line in doc2.ws:
    print(line.to_text())
[['武漢', '肺炎', '全球', '肆虐', ',', '至今', '已', '有', '2906萬', '人', '確診', '、', '92萬', '染疫', '身亡', ',', '而', '流亡', '美國', '的', '中國', '大陸', '病毒學家', '閻麗夢', ',', '14日', '時', '開通', '了', '推特', '帳號', ',', '並', '公布', '一', '份', '長達', '26', '頁', '的', '科學', '論文', ',', '研究', '直指', '武肺', '病毒', '與', '自然', '人畜', '共通', '傳染病', '的', '病毒', '不同', ',', '並', '呼籲', '追查', '武漢', 'P4', '實驗室', '及', '美國', '衛生', '研究院', '(', 'NIH', ')', '之間', '的', '金流', ',', '引發', '討論', '。']]
武漢 肺炎 全球 肆虐 , 至今 已 有 2906萬 人 確診 、 92萬 染疫 身亡 , 而 流亡 美國 的 中國 大陸 病毒學家 閻麗夢 , 14日 時 開通 了 推特 帳號 , 並 公布 一 份 長達 26 頁 的 科學 論文 , 研究 直指 武肺 病毒 與 自然 人畜 共通 傳染病 的 病毒 不同 , 並 呼籲 追查 武漢 P4 實驗室 及 美國 衛生 研究院 ( NIH ) 之間 的 金流 , 引發 討論 。
# Part-of-Speech Tagging
pipeline.get_pos(doc2)
print(doc2.pos)
for line in doc2.pos:
    print(line.to_text())
[['Nc', 'Na', 'Nc', 'VC', 'COMMACATEGORY', 'D', 'D', 'V_2', 'Neu', 'Na', 'VA', 'PAUSECATEGORY', 'Neu', 'Na', 'VH', 'COMMACATEGORY', 'Cbb', 'VCL', 'Nc', 'DE', 'Nc', 'Nc', 'Na', 'Nb', 'COMMACATEGORY', 'Nd', 'Ng', 'VH', 'Di', 'Na', 'Na', 'COMMACATEGORY', 'Cbb', 'VE', 'Neu', 'Nf', 'VJ', 'Neu', 'Nf', 'DE', 'Na', 'Na', 'COMMACATEGORY', 'VE', 'VE', 'Na', 'Na', 'Caa', 'Na', 'Na', 'A', 'Na', 'DE', 'Na', 'VH', 'COMMACATEGORY', 'Cbb', 'VE', 'VC', 'Nc', 'Nb', 'Nc', 'Caa', 'Nc', 'Na', 'Nc', 'PARENTHESISCATEGORY', 'FW', 'PARENTHESISCATEGORY', 'Ng', 'DE', 'Na', 'COMMACATEGORY', 'VC', 'VE', 'PERIODCATEGORY']]
Nc Na Nc VC COMMACATEGORY D D V_2 Neu Na VA PAUSECATEGORY Neu Na VH COMMACATEGORY Cbb VCL Nc DE Nc Nc Na Nb COMMACATEGORY Nd Ng VH Di Na Na COMMACATEGORY Cbb VE Neu Nf VJ Neu Nf DE Na Na COMMACATEGORY VE VE Na Na Caa Na Na A Na DE Na VH COMMACATEGORY Cbb VE VC Nc Nb Nc Caa Nc Na Nc PARENTHESISCATEGORY FW PARENTHESISCATEGORY Ng DE Na COMMACATEGORY VC VE PERIODCATEGORY
for line in WsPosParagraph.to_text(doc2.ws, doc2.pos):
    print(line)
武漢(Nc) 肺炎(Na) 全球(Nc) 肆虐(VC) ,(COMMACATEGORY) 至今(D) 已(D) 有(V_2) 2906萬(Neu) 人(Na) 確診(VA) 、(PAUSECATEGORY) 92萬(Neu) 染疫(Na) 身亡(VH) ,(COMMACATEGORY) 而(Cbb) 流亡(VCL) 美國(Nc) 的(DE) 中國(Nc) 大陸(Nc) 病毒學家(Na) 閻麗夢(Nb) ,(COMMACATEGORY) 14日(Nd) 時(Ng) 開通(VH) 了(Di) 推特(Na) 帳號(Na) ,(COMMACATEGORY) 並(Cbb) 公布(VE) 一(Neu) 份(Nf) 長達(VJ) 26(Neu) 頁(Nf) 的(DE) 科學(Na) 論文(Na) ,(COMMACATEGORY) 研究(VE) 直指(VE) 武肺(Na) 病毒(Na) 與(Caa) 自然(Na) 人畜(Na) 共通(A) 傳染病(Na) 的(DE) 病毒(Na) 不同(VH) ,(COMMACATEGORY) 並(Cbb) 呼籲(VE) 追查(VC) 武漢(Nc) P4(Nb) 實驗室(Nc) 及(Caa) 美國(Nc) 衛生(Na) 研究院(Nc) ((PARENTHESISCATEGORY) NIH(FW) )(PARENTHESISCATEGORY) 之間(Ng) 的(DE) 金流(Na) ,(COMMACATEGORY) 引發(VC) 討論(VE) 。(PERIODCATEGORY)
pipeline.get_ner(doc2)
print(doc2.ner)

#WsPosSentence.to_text(doc2.ws, doc2.pos)
[[NerToken(word='14日', ner='DATE', idx=(48, 51)), NerToken(word='NIH', ner='ORG', idx=(121, 124)), NerToken(word='閻麗夢', ner='PERSON', idx=(44, 47)), NerToken(word='武漢', ner='GPE', idx=(0, 2)), NerToken(word='美國', ner='GPE', idx=(33, 35)), NerToken(word='92萬', ner='CARDINAL', idx=(22, 25)), NerToken(word='美國衛生研究院', ner='ORG', idx=(113, 120)), NerToken(word='武漢P4實驗室', ner='ORG', idx=(105, 112)), NerToken(word='中國大陸', ner='GPE', idx=(36, 40)), NerToken(word='26', ner='CARDINAL', idx=(67, 69)), NerToken(word='2906萬', ner='CARDINAL', idx=(13, 18))]]