Chinese Natural Language Processing (spaCy)#
Installation#
# Install package
## In terminal:
!pip install spacy
## Download language models for English and Chinese
!python -m spacy download en_core_web_sm
!python -m spacy download zh_core_web_sm
Documentation: [spaCy Chinese models](https://spacy.io/models/zh)
import spacy
from spacy import displacy
# load language model
nlp_zh = spacy.load('zh_core_web_sm')  ## add disable=["parser"] to skip the dependency parser
# parse text
doc = nlp_zh('這是一個中文的句子')
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/n7/ltpzwx813c599nfxfb94s_640000gn/T/jieba.cache
Loading model cost 0.742 seconds.
Prefix dict has been built successfully.
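Before parsing more text, it can help to check which components the loaded pipeline provides (a small check; the component names vary by spaCy version):
# list the processing components of the Chinese pipeline
print(nlp_zh.pipe_names)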
Linguistic Features#
After we parse and tag a given text, we can extract token-level information:
Text: the original word text
Lemma: the base form of the word
POS: the simple universal POS tag
Tag: the detailed POS tag
Dep: Syntactic dependency
Shape: word shape (capitalization, punctuation, digits)
is_alpha: whether the token consists of alphabetic characters
is_stop: whether the token is a stop word
Note
For more information on POS tags, see spaCy's [POS tag scheme documentation](https://spacy.io/api/annotation#pos-tagging).
# parts of speech tagging
for token in doc:
    print((token.text,
           token.lemma_,
           token.pos_,
           token.tag_,
           token.dep_,
           token.shape_,
           token.is_alpha,
           token.is_stop,
           ))
('這', '這', 'ADV', 'AD', 'advmod', 'x', True, False)
('是', '是', 'VERB', 'VC', 'cop', 'x', True, True)
('一', '一', 'NUM', 'CD', 'nummod', 'x', True, True)
('個', '個', 'NUM', 'M', 'mark:clf', 'x', True, False)
('中文', '中文', 'NOUN', 'NN', 'nmod:assmod', 'xx', True, False)
('的', '的', 'PART', 'DEG', 'case', 'x', True, True)
('句子', '句子', 'NOUN', 'NN', 'ROOT', 'xx', True, False)
## Output in different ways
for token in doc:
    print('%s_%s' % (token.text, token.tag_))

out = ''
for token in doc:
    out = out + ' ' + '/'.join((token.text, token.tag_))
print(out)
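If pandas is available, the same token attributes can also be collected into a data frame for easier inspection (a minimal sketch, assuming pandas is installed):
import pandas as pd

# collect one row of attributes per token
rows = [(t.text, t.lemma_, t.pos_, t.tag_, t.dep_, t.shape_, t.is_alpha, t.is_stop)
        for t in doc]
pd.DataFrame(rows, columns=['text', 'lemma', 'pos', 'tag', 'dep', 'shape', 'is_alpha', 'is_stop'])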
# Noun chunking: doc.noun_chunks relies on a language-specific syntax iterator,
# which the Chinese models do not define, so this prints nothing
# (see the Matcher-based workaround in the NP Chunking section below)
for n in doc.noun_chunks:
    print(n.text)
## Check the meaning of a POS tag
## ('VC' is a Chinese treebank tag that is not covered by spaCy's built-in glossary,
##  so explain() returns None here)
spacy.explain('VC')
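spacy.explain() does work for labels that are in the built-in glossary, such as universal POS tags, dependency labels, and entity types:
# look up labels that are covered by spaCy's glossary
print(spacy.explain('ADV'))       # universal POS tag
print(spacy.explain('nsubj'))     # dependency label
print(spacy.explain('CARDINAL'))  # entity type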
Visualizing Linguistic Features#
# Visualize
displacy.render(doc, style="dep")

options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro",
           "distance": 120}
displacy.render(doc, style="dep", options=options)
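Outside a notebook, the rendered graph can also be captured as markup and saved to disk; a minimal sketch (the file name dep_parse_zh.svg is just an example):
from pathlib import Path

# jupyter=False makes displacy.render() return the SVG markup instead of displaying it
svg = displacy.render(doc, style="dep", options=options, jupyter=False)
Path("dep_parse_zh.svg").write_text(svg, encoding="utf-8")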
## longer paragraphs
text_long = '''武漢肺炎全球肆虐,至今已有2906萬人確診、92萬染疫身亡,而流亡美國的中國大陸病毒學家閻麗夢,14日時開通了推特帳號,並公布一份長達26頁的科學論文,研究直指武肺病毒與自然人畜共通傳染病的病毒不同,並呼籲追查武漢P4實驗室及美國衛生研究院(NIH)之間的金流,引發討論。'''
text_long_list = text_long.split(sep=",")
len(text_long_list)
for c in text_long_list:
    print(c)
武漢肺炎全球肆虐
至今已有2906萬人確診、92萬染疫身亡
而流亡美國的中國大陸病毒學家閻麗夢
14日時開通了推特帳號
並公布一份長達26頁的科學論文
研究直指武肺病毒與自然人畜共通傳染病的病毒不同
並呼籲追查武漢P4實驗室及美國衛生研究院(NIH)之間的金流
引發討論。
## parse the texts
doc2 = list(nlp_zh.pipe(text_long_list))
len(doc2)
8
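Instead of splitting on commas by hand, the dependency parser also provides sentence segmentation, so the paragraph can be parsed once and iterated over with doc.sents. Note that the boundaries then come from the parse (typically sentence-final punctuation) rather than from the commas, so the result may differ from the manual split above. A minimal sketch:
# parse the whole paragraph and let spaCy segment it into sentences
doc_long = nlp_zh(text_long)
for sent in doc_long.sents:
    print(sent.text)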
# Visualize the dependency parse for each sentence-like chunk
sentence_spans = list(doc2)
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro",
           "distance": 120}
displacy.render(sentence_spans, style="dep", options=options)
## Highlight CARDINAL entities with a custom colour
colors = {"CARDINAL": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["CARDINAL"], "colors": colors}
displacy.render(sentence_spans[1], style="ent", options=options)
(displaCy entity view of sentence_spans[1]: 至今已有 [2906 CARDINAL] 萬人確診、 [92 CARDINAL] 萬染疫身亡)
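The same entity view can be rendered for all the chunks at once by passing the whole list of Docs; the CARDINAL-only filter and colours defined above apply here too:
# entity view for every chunk, restricted to CARDINAL entities
displacy.render(sentence_spans, style="ent", options=options)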
NP Chunking#
## Print noun phrases for each doc
## (empty again: the Chinese models do not define a noun-chunk iterator)
for d in doc2:
    for np in d.noun_chunks:
        print(np.text,
              np.root.text,
              np.root.dep_,
              np.root.head.text)
    print('---')
---
---
---
---
---
---
---
---
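Since noun_chunks yields nothing for Chinese, one possible workaround is to approximate noun phrases with a rule-based Matcher over the POS tags. The pattern below is just one crude heuristic, not spaCy's built-in chunker:
from spacy.matcher import Matcher
from spacy.util import filter_spans

matcher = Matcher(nlp_zh.vocab)
# heuristic: zero or more adjectives/nouns followed by a final noun
pattern = [{"POS": {"IN": ["ADJ", "NOUN", "PROPN"]}, "OP": "*"},
           {"POS": {"IN": ["NOUN", "PROPN"]}}]
matcher.add("NP", [pattern])  # spaCy v3 signature; in v2 use matcher.add("NP", None, pattern)

for d in doc2:
    spans = [d[start:end] for _, start, end in matcher(d)]
    for span in filter_spans(spans):  # keep the longest non-overlapping matches
        print(span.text)
    print('---')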
Named Entity Recognition#
Text: original entity text
Start: index of start of entity in the Doc
End: index of end of entity in the Doc
Label: entity label, i.e. type
## Print ents for each doc
for d in doc2:
    for e in d.ents:
        print(e.text, e.label_)
    print('---')
---
2906 CARDINAL
92 CARDINAL
---
---
14日 CARDINAL
---
26 CARDINAL
---
---
---
---
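The Start and End indices listed above are available on each entity span as well; a small sketch printing both token and character offsets:
## Print entities with token and character offsets
for d in doc2:
    for e in d.ents:
        print(e.text, e.start, e.end, e.start_char, e.end_char, e.label_)
    print('---')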