Natural Language Processing (spaCy)

Installation

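# Install the package
## In a notebook cell (in a terminal, drop the leading "!"):
!pip install spacy

## Download language models for English and Chinese
## ("en" and "zh" are spaCy 2.x shortcut names; spaCy 3 requires full
## package names such as en_core_web_sm and zh_core_web_sm)
!python -m spacy download en
!python -m spacy download zh
!python -m spacy download en_vectors_web_lg ## pretrained word vectors (spaCy 2.x package)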
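
import spacy
from spacy import displacy

# Load the English language model
# (to skip a pipeline component, pass e.g. disable=["parser"];
# in spaCy 3, load by full name, e.g. spacy.load('en_core_web_sm'))
nlp_en = spacy.load('en')
# Parse the text
doc = nlp_en('This is a sentence')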

Linguistic Features

  • After we parse and tag a given text, we can extract token-level information:

    • Text: the original word text

    • Lemma: the base form of the word

    • POS: the simple universal POS tag

    • Tag: the detailed POS tag

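    • Dep: the syntactic dependency relation

    • Shape: the word shape (capitalization, punctuation, digits)

    • Is alpha: whether the token consists of alphabetic characters

    • Is stop: whether the token is a stop word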

# Part-of-speech tagging and other token-level attributes
for token in doc:
    print((token.text,
           token.lemma_,
           token.pos_,
           token.tag_,
           token.dep_,
           token.shape_,
           token.is_alpha,
           token.is_stop))
('This', 'this', 'DET', 'DT', 'nsubj', 'Xxxx', True, True)
('is', 'be', 'AUX', 'VBZ', 'ROOT', 'xx', True, True)
('a', 'a', 'DET', 'DT', 'det', 'x', True, True)
('sentence', 'sentence', 'NOUN', 'NN', 'attr', 'xxxx', True, False)
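The Shape column above can be reproduced by hand. The sketch below is a plain-Python approximation of the rule spaCy documents for word shapes: letters become x or X, digits become d, other characters are kept, and runs of the same shape character are cut off after four. `word_shape` is a hypothetical helper written for illustration, not spaCy's own function, and it needs no language model.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style word shape for a single token."""
    shape = []
    last = ""   # previous shape character
    run = 0     # length of the current run of identical shape characters
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch  # punctuation and other symbols are kept as-is
        run = run + 1 if s == last else 1
        last = s
        if run <= 4:  # drop anything beyond four identical shape characters
            shape.append(s)
    return "".join(shape)


for word in ["This", "is", "a", "sentence"]:
    print(word, word_shape(word))
```

For the four tokens of the example sentence this gives Xxxx, xx, x, and xxxx, matching the `token.shape_` values printed above.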
## Output in different ways
for token in doc:
    print('%s_%s' % (token.text, token.tag_))
    
out = ''
for token in doc:
    out = out + ' '+ '/'.join((token.text, token.tag_))
print(out)
This_DT
is_VBZ
a_DT
sentence_NN
 This/DT is/VBZ a/DT sentence/NN
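The string concatenation above leaves a leading space in the result. As a minimal sketch, the same word/TAG line can be built with `str.join` instead. The `tagged` list is copied from the output above so that the example runs without a language model; with a parsed text it would presumably come from `[(t.text, t.tag_) for t in doc]`.

```python
# (text, tag) pairs copied from the tagging output above
tagged = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("sentence", "NN")]

# One token per line, underscore-separated
lines = [f"{text}_{tag}" for text, tag in tagged]
print("\n".join(lines))

# All tokens on one line, slash-separated, with no leading space
out = " ".join("/".join(pair) for pair in tagged)
print(out)
```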
## Check meaning of a POS tag
spacy.explain('VBZ')
'verb, 3rd person singular present'

Visualizing Linguistic Features

# Visualize
displacy.render(doc, style="dep")
[Figure: displaCy dependency diagram of "This is a sentence". POS tags: This/DET, is/AUX, a/DET, sentence/NOUN; arcs: nsubj, det, attr.]
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}
displacy.render(doc, style="dep", options=options)
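[Figure: displaCy dependency diagram of "This is a sentence" in compact mode, white on a #09a3d5 background. POS tags: This/DET, is/AUX, a/DET, sentence/NOUN; arcs: nsubj, det, attr.]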
## longer paragraphs
text_long = '''In a presidency of unprecedented disruption and turmoil, Donald Trump's support has remained remarkably stable. That stability, paradoxically, points toward years of rising turbulence in American politics and life.
Trump's approval ratings and support in the presidential race against Democratic nominee Joe Biden have oscillated in a strikingly narrow range of around 40%-45% that appears largely immune to both good news -- the long economic boom during his presidency's first years -- and bad -- impeachment, the worst pandemic in more than a century, revelations that he's disparaged military service and blunt warnings that he is unfit for the job from former senior officials in his own government. Perhaps the newest disclosures, in the upcoming book from Bob Woodward, that Trump knew the coronavirus was far more dangerous than the common flu even as he told Americans precisely the opposite, will break this pattern, but most political strategists in both parties are skeptical that it will.
'''
## parse the texts
doc2 = nlp_en(text_long)
## get the sentences of the doc2
sentence_spans = list(doc2.sents)
options = {"compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro",
           "distance": 120}
displacy.render(sentence_spans, style="dep", options=options)
[Figure: displaCy dependency diagrams, one per sentence of doc2, showing each token with its POS tag and the labeled dependency arcs.]
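# Custom color for ORG entities. To apply it, pass options=options to
# displacy.render(); note that "ents": ["ORG"] would also restrict the
# display to ORG entities.
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["ORG"], "colors": colors}

# Rendered here with the default options, so all entity types are shown
displacy.render(sentence_spans[0], style="ent")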
In a presidency of unprecedented disruption and turmoil, Donald Trump's PERSON support has remained remarkably stable.

NP Chunking

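# For each noun chunk: its text, its root token, the root's dependency
# label, and the head that the root attaches to
for c in doc2.noun_chunks:
    print(c.text, c.root.text, c.root.dep_, c.root.head.text)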
a presidency presidency pobj In
unprecedented disruption disruption pobj of
turmoil turmoil conj disruption
Donald Trump's support support nsubj remained
That stability stability nsubj points
years years pobj toward
rising turbulence turbulence pobj of
American politics politics pobj in
life life conj politics
Trump's approval ratings ratings nsubj oscillated
support support conj ratings
the presidential race race pobj in
Democratic nominee Joe Biden Biden pobj against
a strikingly narrow range range pobj in
around 40%-45% % pobj of
both good news news pobj to
the long economic boom boom appos news
his presidency's first years years pobj during
bad -- impeachment impeachment conj years
the worst pandemic pandemic appos impeachment
a century century pobj than
he he nsubj 's
disparaged military service service attr 's
blunt warnings warnings conj service
he he nsubj is
the job job pobj for
former senior officials officials pobj from
his own government government pobj in
the newest disclosures disclosures nsubj knew
the upcoming book book pobj in
Bob Woodward Woodward pobj from
Trump Trump nsubj knew
the coronavirus coronavirus nsubj was
the common flu flu pobj than
he he nsubj told
Americans Americans dobj told
precisely the opposite opposite dobj told
this pattern pattern dobj break
most political strategists strategists nsubj are
both parties parties pobj in
it it nsubj will
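Because each chunk exposes its root token, the dependency label of the root gives a quick way to pick out noun phrases in a particular grammatical role. The sketch below works on the first five rows of the output above, copied by hand into tuples so that it runs without a model; `rows`, `role_counts`, and `subjects` are illustrative names.

```python
from collections import Counter

# (chunk text, root text, root dependency, head text), from the first rows above
rows = [
    ("a presidency", "presidency", "pobj", "In"),
    ("unprecedented disruption", "disruption", "pobj", "of"),
    ("turmoil", "turmoil", "conj", "disruption"),
    ("Donald Trump's support", "support", "nsubj", "remained"),
    ("That stability", "stability", "nsubj", "points"),
]

# How often each dependency role occurs among the chunk roots
role_counts = Counter(dep for _, _, dep, _ in rows)
print(role_counts)

# Noun chunks acting as subjects, paired with the head they attach to
subjects = [(chunk, head) for chunk, _, dep, head in rows if dep == "nsubj"]
print(subjects)
```

With a parsed Doc, the same tuples could be collected inside the `doc2.noun_chunks` loop rather than typed in by hand.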

Named Entity Recognition

  • Text: original entity text

  • Start: character offset of the start of the entity in the Doc (start_char)

  • End: character offset of the end of the entity in the Doc (end_char)

  • Label: Entity label, type

for ent in doc2.ents:
    print(ent.text,
         ent.start_char,
         ent.end_char,
         ent.label_)
Donald Trump's 57 71 PERSON
years 157 162 DATE
American 187 195 NORP
Trump 215 220 PERSON
Democratic 285 295 NORP
Joe Biden 304 313 PERSON
around 40%-45% 362 376 PERCENT
more than a century 534 553 DATE
Bob Woodward 763 775 PERSON
Trump 782 787 PERSON
Americans 868 877 NORP
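`start_char` and `end_char` are character offsets into the original text, so slicing the text with them recovers the entity string. A minimal sketch, using the first sentence and the first entity from the output above (copied by hand, so no model is needed); `annotate` is a hypothetical helper written for illustration, not part of spaCy.

```python
sentence = ("In a presidency of unprecedented disruption and turmoil, "
            "Donald Trump's support has remained remarkably stable.")

# (text, start_char, end_char, label), copied from the output above
entities = [("Donald Trump's", 57, 71, "PERSON")]


def annotate(text, ents):
    """Wrap each entity span as [entity](LABEL) using character offsets."""
    pieces = []
    cursor = 0
    for _, start, end, label in sorted(ents, key=lambda e: e[1]):
        pieces.append(text[cursor:start])              # text before the entity
        pieces.append(f"[{text[start:end]}]({label})")  # the marked-up entity
        cursor = end
    pieces.append(text[cursor:])                       # text after the last entity
    return "".join(pieces)


# The slice defined by the offsets matches the entity text
for ent_text, start, end, label in entities:
    assert sentence[start:end] == ent_text

print(annotate(sentence, entities))
```

The offsets line up here because the sentence opens the document; for entities later in `text_long`, the slice has to be taken from the full text rather than from a single sentence.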