Natural Language Processing (spaCy)
Installation
spaCy language models
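# Install package
## In a notebook cell (drop the leading "!" when running in a terminal):
!pip install spacy
## Download language models for English and Chinese
## (full package names; the 'en'/'zh' shortcut names no longer work in spaCy v3)
!python -m spacy download en_core_web_sm
!python -m spacy download zh_core_web_sm
## Pretrained word vectors (a spaCy v2 package; in v3 the vectors ship with en_core_web_lg)
!python -m spacy download en_vectors_web_lg
import spacy
from spacy import displacy
# load language model
nlp_en = spacy.load('en_core_web_sm') ## pass disable=["parser"] to skip dependency parsing
# parse text
doc = nlp_en('This is a sentence')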
Linguistic Features
After we parse and tag a given text, we can extract token-level information:
Text: the original word text
Lemma: the base form of the word
POS: the simple universal POS tag
Tag: the detailed POS tag
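Dep: the syntactic dependency relation
Shape: the word shape (capitalization, punctuation, digits)
is alpha: whether the token consists of alphabetic characters
is stop: whether the token is on the stop list (the most common words of the language)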
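To make the Shape attribute concrete, here is a minimal pure-Python sketch of the mapping as the spaCy documentation describes it: alphabetic characters become x or X, digits become d, and runs of the same shape character are truncated after four. The function name `word_shape` is hypothetical, and this approximates `token.shape_` rather than reproducing spaCy's own implementation.

```python
def word_shape(text):
    """Approximate spaCy's word shape for a single token string."""
    shape = []
    last = ""   # previous shape character
    run = 0     # how many times `last` has repeated so far
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as they are
        run = run + 1 if mapped == last else 0
        last = mapped
        if run < 4:  # keep at most four identical shape characters in a row
            shape.append(mapped)
    return "".join(shape)


print(word_shape("This"), word_shape("is"), word_shape("sentence"))
# prints: Xxxx xx xxxx
```

The truncation explains why an eight-letter word such as "sentence" has the shape xxxx.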
# parts of speech tagging
for token in doc:
    print((token.text,
           token.lemma_,
           token.pos_,
           token.tag_,
           token.dep_,
           token.shape_,
           token.is_alpha,
           token.is_stop,
           ))
('This', 'this', 'DET', 'DT', 'nsubj', 'Xxxx', True, True)
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True)
('a', 'a', 'DET', 'DT', 'det', 'x', True, True)
('sentence', 'sentence', 'NOUN', 'NN', 'attr', 'xxxx', True, False)
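Token attributes like these are often used to filter a text down to its content words. The sketch below works on the four tuples printed above, so it runs without a language model. `TokenRow` and its field names are hypothetical and introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical record type mirroring the eight attributes printed above.
TokenRow = namedtuple(
    "TokenRow", "text lemma pos tag dep shape is_alpha is_stop"
)

rows = [
    TokenRow('This', 'this', 'DET', 'DT', 'nsubj', 'Xxxx', True, True),
    TokenRow('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True),
    TokenRow('a', 'a', 'DET', 'DT', 'det', 'x', True, True),
    TokenRow('sentence', 'sentence', 'NOUN', 'NN', 'attr', 'xxxx', True, False),
]

# Keep the lemmas of alphabetic tokens that are not stop words.
content_lemmas = [r.lemma for r in rows if r.is_alpha and not r.is_stop]
print(content_lemmas)
# prints: ['sentence']
```

With a parsed `doc`, the same filter can be written directly over the tokens as `[t.lemma_ for t in doc if t.is_alpha and not t.is_stop]`.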
## Output in different ways
for token in doc:
    print('%s_%s' % (token.text, token.tag_))
out = ''
for token in doc:
    out = out + ' ' + '/'.join((token.text, token.tag_))
print(out)
This_DT
is_VBZ
a_DT
sentence_NN
This/DT is/VBZ a/DT sentence/NN
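The word/TAG format is easy to read back in, which helps when tagged text is stored as plain strings. The following sketch parses such a string and counts the tags. The helper name `parse_tagged` is an assumption made for illustration and is not part of spaCy.

```python
from collections import Counter


def parse_tagged(tagged, sep="/"):
    """Split a 'word/TAG word/TAG' string into (word, tag) pairs."""
    # rsplit splits on the last separator only, so a word that itself
    # contains the separator (e.g. "1/2/CD") keeps its text intact.
    return [tuple(item.rsplit(sep, 1)) for item in tagged.split()]


pairs = parse_tagged("This/DT is/VBZ a/DT sentence/NN")
tag_freq = Counter(tag for _, tag in pairs)
print(pairs[1])
# prints: ('is', 'VBZ')
print(tag_freq.most_common(1))
# prints: [('DT', 2)]
```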
## Check meaning of a POS tag
spacy.explain('VBZ')
'verb, 3rd person singular present'
Visualizing Linguistic Features
# Visualize
displacy.render(doc, style="dep")
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}
displacy.render(doc, style="dep", options=options)
## longer paragraphs
text_long = '''In a presidency of unprecedented disruption and turmoil, Donald Trump's support has remained remarkably stable. That stability, paradoxically, points toward years of rising turbulence in American politics and life.
Trump's approval ratings and support in the presidential race against Democratic nominee Joe Biden have oscillated in a strikingly narrow range of around 40%-45% that appears largely immune to both good news -- the long economic boom during his presidency's first years -- and bad -- impeachment, the worst pandemic in more than a century, revelations that he's disparaged military service and blunt warnings that he is unfit for the job from former senior officials in his own government. Perhaps the newest disclosures, in the upcoming book from Bob Woodward, that Trump knew the coronavirus was far more dangerous than the common flu even as he told Americans precisely the opposite, will break this pattern, but most political strategists in both parties are skeptical that it will.
'''
## parse the texts
doc2 = nlp_en(text_long)
## get the sentences of doc2
sentence_spans = list(doc2.sents)
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro",
           "distance": 120}
displacy.render(sentence_spans, style="dep", options=options)
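Outside a notebook, the rendering is more useful as a file. Passing `jupyter=False` makes `displacy.render` return the markup as a string instead of displaying it, and that string can be written to disk. The sketch below assumes spaCy is installed. The helper names `render_dep_markup` and `save_markup` are hypothetical, and spaCy is imported inside the function so that the file-writing part can be tried on its own.

```python
from pathlib import Path


def render_dep_markup(docs, options=None):
    """Return the dependency visualization as an SVG markup string."""
    from spacy import displacy  # imported lazily; requires spaCy

    # jupyter=False forces displaCy to return markup rather than display it.
    return displacy.render(docs, style="dep", jupyter=False,
                           options=options or {})


def save_markup(markup, path):
    """Write rendered markup to a file and return its path."""
    path = Path(path)
    path.write_text(markup, encoding="utf-8")
    return path
```

For example, `save_markup(render_dep_markup(sentence_spans[0]), "sentence0.svg")` would write the first sentence's dependency tree to a file. The file name here is only an illustration.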
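colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["ORG"], "colors": colors}
## `options` is not passed below, so all entity types are shown in their default colors;
## add options=options to highlight only ORG entities with the gradient above
displacy.render(sentence_spans[0], style="ent")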
In a presidency of unprecedented disruption and turmoil, [Donald Trump's | PERSON] support has remained remarkably stable.
NP Chunking
for c in doc2.noun_chunks:
    print(c.text, c.root.text, c.root.dep_, c.root.head.text)
a presidency presidency pobj In
unprecedented disruption disruption pobj of
turmoil turmoil conj disruption
Donald Trump's support support nsubj remained
That stability stability nsubj points
years years pobj toward
rising turbulence turbulence pobj of
American politics politics pobj in
life life conj politics
Trump's approval ratings ratings nsubj oscillated
support support conj ratings
the presidential race race pobj in
Democratic nominee Joe Biden Biden pobj against
a strikingly narrow range range pobj in
around 40%-45% % pobj of
both good news news pobj to
the long economic boom boom appos news
his presidency's first years years pobj during
bad -- impeachment impeachment conj years
the worst pandemic pandemic appos impeachment
a century century pobj than
he he nsubj 's
disparaged military service service attr 's
blunt warnings warnings conj service
he he nsubj is
the job job pobj for
former senior officials officials pobj from
his own government government pobj in
the newest disclosures disclosures nsubj knew
the upcoming book book pobj in
Bob Woodward Woodward pobj from
Trump Trump nsubj knew
the coronavirus coronavirus nsubj was
the common flu flu pobj than
he he nsubj told
Americans Americans dobj told
precisely the opposite opposite dobj told
this pattern pattern dobj break
most political strategists strategists nsubj are
both parties parties pobj in
it it nsubj will
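A flat listing like this is easier to explore once the chunks are grouped by the dependency relation of their root, for instance to pull out all subject noun phrases. The sketch below uses the first five rows of the output above as plain tuples, so it runs without a model. The function name `chunks_by_dep` is hypothetical.

```python
from collections import defaultdict


def chunks_by_dep(chunks):
    """Group chunk texts by the dependency label of the chunk root."""
    grouped = defaultdict(list)
    for text, root, dep, head in chunks:
        grouped[dep].append(text)
    return dict(grouped)


# (chunk text, root text, root dependency, head text), copied from the output above
sample = [
    ("a presidency", "presidency", "pobj", "In"),
    ("unprecedented disruption", "disruption", "pobj", "of"),
    ("turmoil", "turmoil", "conj", "disruption"),
    ("Donald Trump's support", "support", "nsubj", "remained"),
    ("That stability", "stability", "nsubj", "points"),
]

grouped = chunks_by_dep(sample)
print(grouped["nsubj"])
# prints: ["Donald Trump's support", 'That stability']
```

With the parsed document, the input tuples can be built as `[(c.text, c.root.text, c.root.dep_, c.root.head.text) for c in doc2.noun_chunks]`.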
Named Entity Recognition
Text: the original entity text
Start: character offset of the start of the entity in the Doc
End: character offset of the end of the entity in the Doc
Label: the entity label (type)
for ent in doc2.ents:
    print(ent.text,
          ent.start_char,
          ent.end_char,
          ent.label_)
Donald Trump's 57 71 PERSON
years 157 162 DATE
American 187 195 NORP
Trump 215 220 PERSON
Democratic 285 295 NORP
Joe Biden 304 313 PERSON
around 40%-45% 362 376 PERCENT
more than a century 534 553 DATE
Bob Woodward 763 775 PERSON
Trump 782 787 PERSON
Americans 868 877 NORP
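Because `start_char` and `end_char` are character offsets, they can be used to slice the original string directly. The sketch below marks entities inline in plain text, using the first sentence of the sample text and the first entity from the output above. The function name `annotate` and the bracket format are assumptions made for illustration.

```python
def annotate(text, spans):
    """Wrap each (start_char, end_char, label) span as [entity | LABEL].

    Spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])            # text before the entity
        pieces.append(f"[{text[start:end]} | {label}]")
        cursor = end
    pieces.append(text[cursor:])                     # text after the last entity
    return "".join(pieces)


sentence = ("In a presidency of unprecedented disruption and turmoil, "
            "Donald Trump's support has remained remarkably stable.")
print(sentence[57:71])
# prints: Donald Trump's
print(annotate(sentence, [(57, 71, "PERSON")]))
```

For the whole document, the spans can be collected as `[(e.start_char, e.end_char, e.label_) for e in doc2.ents]` and passed together with `text_long`.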