2. Assignment II: Preprocessing
2.1. Question 1
Please use the `gutenberg` corpus provided in `nltk` and extract the text written by Lewis Carroll, i.e., `fileid == 'carroll-alice.txt'`, as your corpus data.
With this corpus data, please perform text preprocessing on the sentences of the corpus.
In particular, please:
- POS-tag all the sentences to get the part of speech of each word
- lemmatize all words using `WordNetLemmatizer` in NLTK on a sentential basis
Please provide your output as shown below:
- it is a data frame
- the column `alice_sents` includes the original sentence texts
- the column `alice_sents_pos` includes the annotated version of the sentences, with each token as `word/postag`
- the column `sents_lem` includes the lemmatized version of the sentences
Note
Please note that the lemmatized form of the BE verbs (e.g., `was`) should be `be`. This is a quick check of whether your lemmatization works.
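One possible way to approach the steps above is sketched here; it is not the reference solution. It assumes that Penn Treebank tags are mapped to WordNet POS letters by their first character, with noun as the fallback, so that `was/VBD` is lemmatized as a verb. The helper names `penn_to_wordnet`, `lemmatize_tagged`, and `build_alice_df` are hypothetical, and the column names follow the expected output shown below rather than the prose description.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (assumed default: noun)."""
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[:1], "n")


def lemmatize_tagged(tagged_sent, lemmatize):
    """Lemmatize one POS-tagged sentence and return it as a single string.

    `lemmatize` is any callable taking (word, wordnet_pos),
    e.g. WordNetLemmatizer().lemmatize.
    """
    return " ".join(lemmatize(w, penn_to_wordnet(t)) for w, t in tagged_sent)


def build_alice_df():
    """Build a three-column data frame for carroll-alice.txt."""
    # Imported here so the two helpers above can be used without NLTK installed.
    import nltk
    import pandas as pd
    from nltk.corpus import gutenberg
    from nltk.stem import WordNetLemmatizer

    # Resource names differ across NLTK versions; a name the installed
    # version does not know should be reported and skipped.
    for res in ("gutenberg", "wordnet", "averaged_perceptron_tagger",
                "averaged_perceptron_tagger_eng"):
        nltk.download(res, quiet=True)

    sents = gutenberg.sents("carroll-alice.txt")  # one token list per sentence
    tagged = nltk.pos_tag_sents(sents)            # [(word, tag), ...] per sentence
    wnl = WordNetLemmatizer()

    return pd.DataFrame({
        "sents": [" ".join(s) for s in sents],
        "sents_pos": [" ".join(f"{w}/{t}" for w, t in ts) for ts in tagged],
        "sents_lem": [lemmatize_tagged(ts, wnl.lemmatize) for ts in tagged],
    })
```

Calling `alice_sents_df = build_alice_df()` would then produce the data frame. Passing the POS to the lemmatizer is what allows verb forms to be reduced; without it, `WordNetLemmatizer` treats every word as a noun.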
alice_sents_df[:20]
| | sents | sents_pos | sents_lem |
|---|---|---|---|
0 | [ Alice ' s Adventures in Wonderland by Lewis ... | [/JJ Alice/NNP '/POS s/NN Adventures/NNS in/IN... | [ Alice ' s Adventures in Wonderland by Lewis ... |
1 | CHAPTER I . | CHAPTER/NN I/PRP ./. | CHAPTER I . |
2 | Down the Rabbit - Hole | Down/IN the/DT Rabbit/NNP -/: Hole/NN | Down the Rabbit - Hole |
3 | Alice was beginning to get very tired of sitti... | Alice/NNP was/VBD beginning/VBG to/TO get/VB v... | Alice be begin to get very tired of sit by her... |
4 | So she was considering in her own mind ( as we... | So/IN she/PRP was/VBD considering/VBG in/IN he... | So she be consider in her own mind ( as well a... |
5 | There was nothing so VERY remarkable in that ;... | There/EX was/VBD nothing/NN so/RB VERY/RB rema... | There be nothing so VERY remarkable in that ; ... |
6 | Oh dear ! | Oh/UH dear/NN !/. | Oh dear ! |
7 | I shall be late !' | I/PRP shall/MD be/VB late/JJ !'/NN | I shall be late !' |
8 | ( when she thought it over afterwards , it occ... | (/( when/WRB she/PRP thought/VBD it/PRP over/I... | ( when she think it over afterwards , it occur... |
9 | In another moment down went Alice after it , n... | In/IN another/DT moment/NN down/RP went/VBD Al... | In another moment down go Alice after it , nev... |
10 | The rabbit - hole went straight on like a tunn... | The/DT rabbit/NN -/: hole/NN went/VBD straight... | The rabbit - hole go straight on like a tunnel... |
11 | Either the well was very deep , or she fell ve... | Either/CC the/DT well/NN was/VBD very/RB deep/... | Either the well be very deep , or she fell ver... |
12 | First , she tried to look down and make out wh... | First/RB ,/, she/PRP tried/VBD to/TO look/VB d... | First , she try to look down and make out what... |
13 | She took down a jar from one of the shelves as... | She/PRP took/VBD down/RP a/DT jar/NN from/IN o... | She take down a jar from one of the shelf a sh... |
14 | ' Well !' | '/POS Well/NNP !'/NN | ' Well !' |
15 | thought Alice to herself , ' after such a fall... | thought/VBN Alice/NNP to/TO herself/VB ,/, '/'... | think Alice to herself , ' after such a fall a... |
16 | How brave they ' ll all think me at home ! | How/WRB brave/VBP they/PRP '/'' ll/VBP all/DT ... | How brave they ' ll all think me at home ! |
17 | Why , I wouldn ' t say anything about it , eve... | Why/WRB ,/, I/PRP wouldn/VBP '/'' t/NNS say/VB... | Why , I wouldn ' t say anything about it , eve... |
18 | ( Which was very likely true .) | (/( Which/NNP was/VBD very/RB likely/JJ true/J... | ( Which be very likely true .) |
19 | Down , down , down . | Down/NNP ,/, down/RB ,/, down/RB ./. | Down , down , down . |
2.2. Question 2
Based on the output of Question 1, please create a lemma frequency list of the corpus, `carroll-alice.txt`, using the lemmatized forms and including only lemmas which:
- consist only of alphabetic characters or hyphens
- are at least 5 characters long
The casing is irrelevant (i.e., case normalization is needed).
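The filtering criteria can be sketched as follows. This is a minimal illustration, assuming the lemmatized sentences are whitespace-separated strings (as in the `sents_lem` column) and that "alphabets" means the ASCII letters a to z; the function name `lemma_frequencies` is hypothetical.

```python
import re
from collections import Counter

# After lowercasing: only letters a-z or hyphens, at least 5 characters.
LEMMA_PATTERN = re.compile(r"[a-z-]{5,}")


def lemma_frequencies(lemmatized_sents):
    """Return (lemma, frequency) pairs sorted by descending frequency."""
    counts = Counter(
        token
        for sent in lemmatized_sents
        for token in sent.lower().split()   # case normalization
        if LEMMA_PATTERN.fullmatch(token)
    )
    return counts.most_common()


sample = [
    "Alice be begin to get very tired",
    "alice think about the Rabbit - Hole",
    "mock-turtle say nothing to Alice !",
]
print(lemma_frequencies(sample)[0])  # ('alice', 3)
```

The resulting pairs can be passed to `pandas.DataFrame(..., columns=["LEMMA", "FREQ"])` to get a table in the layout shown below. Note that this pattern would also accept a token made only of hyphens; whether that matters depends on the tokenization.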
The expected output is provided as follows (showing the top 20 lemmas and their frequencies).
# top 20
alice_df[:21]
| | LEMMA | FREQ |
|---|---|---|
0 | alice | 398 |
1 | little | 128 |
2 | think | 118 |
3 | there | 99 |
4 | about | 94 |
5 | begin | 92 |
6 | would | 83 |
7 | again | 83 |
8 | herself | 83 |
9 | thing | 80 |
10 | could | 77 |
11 | queen | 75 |
12 | turtle | 61 |
13 | hatter | 57 |
14 | quite | 55 |
15 | gryphon | 55 |
16 | rabbit | 52 |
17 | their | 52 |
18 | first | 51 |
19 | voice | 51 |
20 | which | 49 |
2.3. Question 3
Please identify the top verbs that co-occur with the name Alice in the text, with Alice being the subject of the verb.
Please use the `en_core_web_sm` model in `spacy` for English dependency parsing.
To simplify the task, please identify all the verbs that have a dependency relation of `nsubj` with the noun `Alice` (where `Alice` is the dependent, and the verb is the head).
The expected output is provided below (showing the top 20 heads of `Alice` for the `nsubj` dependency relation).
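A rough sketch of the head extraction follows, under a few assumptions: the whole raw text is parsed as one document, head words are counted by their surface form, and no POS filter is applied to the heads (the expected output includes forms such as `was`, which a verb-only filter might exclude). The function names `count_nsubj_heads` and `parse_alice` are hypothetical.

```python
from collections import Counter


def count_nsubj_heads(docs, name="Alice"):
    """Count the head words of tokens equal to `name` in an nsubj relation."""
    counts = Counter(
        tok.head.text
        for doc in docs
        for tok in doc
        if tok.text == name and tok.dep_ == "nsubj"
    )
    return counts.most_common()


def parse_alice():
    """Parse carroll-alice.txt with en_core_web_sm; returns a one-item list of docs."""
    # Imported here so count_nsubj_heads can be used without spaCy installed.
    import spacy
    from nltk.corpus import gutenberg

    nlp = spacy.load("en_core_web_sm")  # the model must be installed beforehand
    return [nlp(gutenberg.raw("carroll-alice.txt"))]
```

Calling `count_nsubj_heads(parse_alice())` returns (head, frequency) pairs that can be placed in a data frame with the columns `nsubj-head` and `FREQ`. The exact counts may vary with the spaCy and model versions, so small differences from the table below would not necessarily indicate a mistake.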
alice_nsubj_df[:21]
| | nsubj-head | FREQ |
|---|---|---|
0 | said | 127 |
1 | thought | 32 |
2 | replied | 13 |
3 | was | 10 |
4 | began | 8 |
5 | went | 7 |
6 | looked | 7 |
7 | felt | 5 |
8 | like | 5 |
9 | think | 4 |
10 | had | 4 |
11 | ventured | 4 |
12 | beginning | 3 |
13 | been | 3 |
14 | heard | 3 |
15 | hear | 3 |
16 | waited | 3 |
17 | remarked | 3 |
18 | asked | 3 |
19 | see | 3 |
20 | got | 2 |