2. Assignment II: Preprocessing

2.1. Question 1

Please use the Gutenberg corpus provided in NLTK and extract the text written by Lewis Carroll, i.e., fileid == 'carroll-alice.txt', as your corpus data.

With this corpus data, please perform text preprocessing on the sentences of the corpus.

In particular, please:

  • pos-tag all the sentences to get the parts-of-speech of each word

  • lemmatize all words using WordNetLemmatizer in NLTK on a sentential basis
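The two steps above could be approached along the following lines. This is a minimal sketch, assuming NLTK's default `pos_tag` tagger (Penn Treebank tags) and assuming that `WordNetLemmatizer` is given a WordNet part of speech for each word; without that argument it treats every word as a noun, so verb forms stay unchanged. The helper names `penn_to_wordnet`, `lemmatize_tagged`, and `demo_with_nltk` are hypothetical, and the exact NLTK data packages to download (corpus, WordNet, tagger model) depend on the installed NLTK version.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective (same value as nltk.corpus.wordnet.ADJ)
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb tags (RB, RBR, RBS); RP also lands here
    return "n"      # everything else is treated as a noun


def lemmatize_tagged(tagged_sent, lemmatize):
    """Lemmatize one POS-tagged sentence.

    tagged_sent: list of (word, penn_tag) pairs, e.g. from nltk.pos_tag
    lemmatize:   callable taking (word, wordnet_pos), e.g.
                 WordNetLemmatizer().lemmatize
    """
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_sent]


def demo_with_nltk(n_sents=5):
    """Assumed NLTK wiring; requires the Gutenberg, WordNet, and tagger data."""
    import nltk
    from nltk.corpus import gutenberg
    from nltk.stem import WordNetLemmatizer

    lemmatize = WordNetLemmatizer().lemmatize
    sents = gutenberg.sents("carroll-alice.txt")[:n_sents]
    tagged = [nltk.pos_tag(sent) for sent in sents]
    return [" ".join(lemmatize_tagged(t, lemmatize)) for t in tagged]
```

Passing the lemmatizer in as a callable keeps the mapping logic independent of NLTK, so it can be checked without downloading any data.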

Please provide your output as shown below:

  • it is a data frame

  • the column sents includes the original sentence texts

  • the column sents_pos includes the annotated version of the sentences, with each token given as word/postag

  • the column sents_lem includes the lemmatized version of the sentences
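A data frame of this shape could be assembled roughly as follows. This is only a sketch: it assumes the tagged sentences are lists of (word, tag) pairs and the lemmatized sentences are lists of strings, and the helper names `tagged_to_string`, `make_rows`, and `to_dataframe` are hypothetical. The column names follow the printed output (sents, sents_pos, sents_lem) and can be renamed as needed.

```python
def tagged_to_string(tagged_sent):
    """Render [(word, tag), ...] as 'word/TAG word/TAG ...'."""
    return " ".join(f"{word}/{tag}" for word, tag in tagged_sent)


def make_rows(tagged_sents, lemma_sents):
    """Build one dict per sentence with the three expected columns."""
    rows = []
    for tagged, lemmas in zip(tagged_sents, lemma_sents):
        rows.append({
            "sents": " ".join(word for word, _ in tagged),
            "sents_pos": tagged_to_string(tagged),
            "sents_lem": " ".join(lemmas),
        })
    return rows


def to_dataframe(rows):
    """Convert the rows to a pandas DataFrame (pandas assumed installed)."""
    import pandas as pd
    return pd.DataFrame(rows)
```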

Note

Please note that the lemmatized form of BE verbs (e.g., was) should be "be". This is a quick check of whether your lemmatization works correctly.

alice_sents_df[:20]
sents sents_pos sents_lem
0 [ Alice ' s Adventures in Wonderland by Lewis ... [/JJ Alice/NNP '/POS s/NN Adventures/NNS in/IN... [ Alice ' s Adventures in Wonderland by Lewis ...
1 CHAPTER I . CHAPTER/NN I/PRP ./. CHAPTER I .
2 Down the Rabbit - Hole Down/IN the/DT Rabbit/NNP -/: Hole/NN Down the Rabbit - Hole
3 Alice was beginning to get very tired of sitti... Alice/NNP was/VBD beginning/VBG to/TO get/VB v... Alice be begin to get very tired of sit by her...
4 So she was considering in her own mind ( as we... So/IN she/PRP was/VBD considering/VBG in/IN he... So she be consider in her own mind ( as well a...
5 There was nothing so VERY remarkable in that ;... There/EX was/VBD nothing/NN so/RB VERY/RB rema... There be nothing so VERY remarkable in that ; ...
6 Oh dear ! Oh/UH dear/NN !/. Oh dear !
7 I shall be late !' I/PRP shall/MD be/VB late/JJ !'/NN I shall be late !'
8 ( when she thought it over afterwards , it occ... (/( when/WRB she/PRP thought/VBD it/PRP over/I... ( when she think it over afterwards , it occur...
9 In another moment down went Alice after it , n... In/IN another/DT moment/NN down/RP went/VBD Al... In another moment down go Alice after it , nev...
10 The rabbit - hole went straight on like a tunn... The/DT rabbit/NN -/: hole/NN went/VBD straight... The rabbit - hole go straight on like a tunnel...
11 Either the well was very deep , or she fell ve... Either/CC the/DT well/NN was/VBD very/RB deep/... Either the well be very deep , or she fell ver...
12 First , she tried to look down and make out wh... First/RB ,/, she/PRP tried/VBD to/TO look/VB d... First , she try to look down and make out what...
13 She took down a jar from one of the shelves as... She/PRP took/VBD down/RP a/DT jar/NN from/IN o... She take down a jar from one of the shelf a sh...
14 ' Well !' '/POS Well/NNP !'/NN ' Well !'
15 thought Alice to herself , ' after such a fall... thought/VBN Alice/NNP to/TO herself/VB ,/, '/'... think Alice to herself , ' after such a fall a...
16 How brave they ' ll all think me at home ! How/WRB brave/VBP they/PRP '/'' ll/VBP all/DT ... How brave they ' ll all think me at home !
17 Why , I wouldn ' t say anything about it , eve... Why/WRB ,/, I/PRP wouldn/VBP '/'' t/NNS say/VB... Why , I wouldn ' t say anything about it , eve...
18 ( Which was very likely true .) (/( Which/NNP was/VBD very/RB likely/JJ true/J... ( Which be very likely true .)
19 Down , down , down . Down/NNP ,/, down/RB ,/, down/RB ./. Down , down , down .

2.2. Question 2

Based on the output of Question 1, please create a lemma frequency list of the corpus, carroll-alice.txt, using the lemmatized forms. Include only lemmas that:

  • consist only of alphabetic characters or hyphens

  • are at least five characters long

The casing is irrelevant (i.e., case normalization is needed).
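The filtering and counting described above could be sketched as follows. It assumes the lemmatized sentences are whitespace-joined strings (as in the sents_lem column) and interprets "alphabets" as the ASCII letters a to z after lowercasing; the names `LEMMA_PATTERN` and `lemma_frequencies` are hypothetical.

```python
import re
from collections import Counter

# Letters or hyphens only, at least five characters (applied after lowercasing).
LEMMA_PATTERN = re.compile(r"[a-z-]{5,}")


def lemma_frequencies(lemmatized_sents):
    """Count lowercased lemmas that pass the character and length filters."""
    counts = Counter()
    for sent in lemmatized_sents:
        for token in sent.split():
            token = token.lower()
            if LEMMA_PATTERN.fullmatch(token):
                counts[token] += 1
    return counts
```

Calling `lemma_frequencies(...).most_common(20)` on the lemmatized column would then give the most frequent lemmas in descending order. Note that this pattern would also accept a token made up only of hyphens; a stricter pattern may be preferable if such tokens occur.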

The expected output is provided as follows (showing the 21 most frequent lemmas and their frequencies).

# top 21 (rows 0 to 20)
alice_df[:21]
LEMMA FREQ
0 alice 398
1 little 128
2 think 118
3 there 99
4 about 94
5 begin 92
6 would 83
7 again 83
8 herself 83
9 thing 80
10 could 77
11 queen 75
12 turtle 61
13 hatter 57
14 quite 55
15 gryphon 55
16 rabbit 52
17 their 52
18 first 51
19 voice 51
20 which 49

2.3. Question 3

Please identify the top verbs that co-occur with the name Alice in the text, with Alice being the subject of the verb.

Please use the en_core_web_sm model in spaCy for English dependency parsing.

To simplify the task, please identify all the verbs that have a dependency relation of nsubj with the noun Alice (where Alice is the dependent, and the verb is the head).
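The extraction described above could be sketched as follows. It assumes the parse is reduced to (token text, dependency label, head text) triples, that heads are counted by their surface form, and that no part-of-speech filter is applied to the head; none of these choices is stated in the assignment. The helper names `token_triples`, `nsubj_head_counts`, and `demo_with_spacy` are hypothetical, and the demo assumes both the en_core_web_sm model and the NLTK Gutenberg corpus are installed.

```python
from collections import Counter


def token_triples(doc):
    """Reduce a parsed spaCy Doc to (text, dependency label, head text) triples."""
    return [(tok.text, tok.dep_, tok.head.text) for tok in doc]


def nsubj_head_counts(triples, subject="Alice"):
    """Count the heads of every token equal to `subject` with relation nsubj."""
    return Counter(
        head for text, dep, head in triples
        if text == subject and dep == "nsubj"
    )


def demo_with_spacy(top_n=20):
    """Assumed wiring: parse the raw Alice text and return the top heads."""
    import spacy
    from nltk.corpus import gutenberg

    nlp = spacy.load("en_core_web_sm")
    doc = nlp(gutenberg.raw("carroll-alice.txt"))
    return nsubj_head_counts(token_triples(doc)).most_common(top_n)
```

Separating the counting from the parsing keeps the logic easy to check on a handful of hand-written triples before running the full parse.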

The expected output is provided below (showing the 21 most frequent heads of Alice for the nsubj dependency relation).

alice_nsubj_df[:21]
nsubj-head FREQ
0 said 127
1 thought 32
2 replied 13
3 was 10
4 began 8
5 went 7
6 looked 7
7 felt 5
8 like 5
9 think 4
10 had 4
11 ventured 4
12 beginning 3
13 been 3
14 heard 3
15 hear 3
16 waited 3
17 remarked 3
18 asked 3
19 see 3
20 got 2