{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment II: Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "\n", "Please use the `gutenberg` corpus provided in `nltk` and extract the text written by Lewis Carroll, i.e., `fileid == 'carroll-alice.txt'`, as your corpus data.\n", "\n", "With this corpus data, please perform text preprocessing on the **sentences** of the corpus.\n", "\n", "In particular, please:\n", "\n", "- POS-tag all the sentences to get the part of speech of each word\n", "- lemmatize all words using `WordNetLemmatizer` in NLTK on a sentence-by-sentence basis\n", "\n", "Please provide your output as shown below:\n", "\n", "- it is a data frame\n", "- the column `sents` includes the original sentence texts\n", "- the column `sents_pos` includes the annotated version of the sentences, with each token as `word/postag`\n", "- the column `sents_lem` includes the lemmatized version of the sentences\n", "\n", "\n", "```{note}\n", "Please note that the lemmatized form of the BE verbs (e.g., *was*) should be *be*. This is a quick check of whether your lemmatization works.\n", "```\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>sents</th><th>sents_pos</th><th>sents_lem</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>[ Alice ' s Adventures in Wonderland by Lewis ...</td><td>[/JJ Alice/NNP '/POS s/NN Adventures/NNS in/IN...</td><td>[ Alice ' s Adventures in Wonderland by Lewis ...</td></tr>\n",
"<tr><th>1</th><td>CHAPTER I .</td><td>CHAPTER/NN I/PRP ./.</td><td>CHAPTER I .</td></tr>\n",
"<tr><th>2</th><td>Down the Rabbit - Hole</td><td>Down/IN the/DT Rabbit/NNP -/: Hole/NN</td><td>Down the Rabbit - Hole</td></tr>\n",
"<tr><th>3</th><td>Alice was beginning to get very tired of sitti...</td><td>Alice/NNP was/VBD beginning/VBG to/TO get/VB v...</td><td>Alice be begin to get very tired of sit by her...</td></tr>\n",
"<tr><th>4</th><td>So she was considering in her own mind ( as we...</td><td>So/IN she/PRP was/VBD considering/VBG in/IN he...</td><td>So she be consider in her own mind ( as well a...</td></tr>\n",
"<tr><th>5</th><td>There was nothing so VERY remarkable in that ;...</td><td>There/EX was/VBD nothing/NN so/RB VERY/RB rema...</td><td>There be nothing so VERY remarkable in that ; ...</td></tr>\n",
"<tr><th>6</th><td>Oh dear !</td><td>Oh/UH dear/NN !/.</td><td>Oh dear !</td></tr>\n",
"<tr><th>7</th><td>I shall be late !'</td><td>I/PRP shall/MD be/VB late/JJ !'/NN</td><td>I shall be late !'</td></tr>\n",
"<tr><th>8</th><td>( when she thought it over afterwards , it occ...</td><td>(/( when/WRB she/PRP thought/VBD it/PRP over/I...</td><td>( when she think it over afterwards , it occur...</td></tr>\n",
"<tr><th>9</th><td>In another moment down went Alice after it , n...</td><td>In/IN another/DT moment/NN down/RP went/VBD Al...</td><td>In another moment down go Alice after it , nev...</td></tr>\n",
"<tr><th>10</th><td>The rabbit - hole went straight on like a tunn...</td><td>The/DT rabbit/NN -/: hole/NN went/VBD straight...</td><td>The rabbit - hole go straight on like a tunnel...</td></tr>\n",
"<tr><th>11</th><td>Either the well was very deep , or she fell ve...</td><td>Either/CC the/DT well/NN was/VBD very/RB deep/...</td><td>Either the well be very deep , or she fell ver...</td></tr>\n",
"<tr><th>12</th><td>First , she tried to look down and make out wh...</td><td>First/RB ,/, she/PRP tried/VBD to/TO look/VB d...</td><td>First , she try to look down and make out what...</td></tr>\n",
"<tr><th>13</th><td>She took down a jar from one of the shelves as...</td><td>She/PRP took/VBD down/RP a/DT jar/NN from/IN o...</td><td>She take down a jar from one of the shelf a sh...</td></tr>\n",
"<tr><th>14</th><td>' Well !'</td><td>'/POS Well/NNP !'/NN</td><td>' Well !'</td></tr>\n",
"<tr><th>15</th><td>thought Alice to herself , ' after such a fall...</td><td>thought/VBN Alice/NNP to/TO herself/VB ,/, '/'...</td><td>think Alice to herself , ' after such a fall a...</td></tr>\n",
"<tr><th>16</th><td>How brave they ' ll all think me at home !</td><td>How/WRB brave/VBP they/PRP '/'' ll/VBP all/DT ...</td><td>How brave they ' ll all think me at home !</td></tr>\n",
"<tr><th>17</th><td>Why , I wouldn ' t say anything about it , eve...</td><td>Why/WRB ,/, I/PRP wouldn/VBP '/'' t/NNS say/VB...</td><td>Why , I wouldn ' t say anything about it , eve...</td></tr>\n",
"<tr><th>18</th><td>( Which was very likely true .)</td><td>(/( Which/NNP was/VBD very/RB likely/JJ true/J...</td><td>( Which be very likely true .)</td></tr>\n",
"<tr><th>19</th><td>Down , down , down .</td><td>Down/NNP ,/, down/RB ,/, down/RB ./.</td><td>Down , down , down .</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " sents \\\n", "0 [ Alice ' s Adventures in Wonderland by Lewis ... \n", "1 CHAPTER I . \n", "2 Down the Rabbit - Hole \n", "3 Alice was beginning to get very tired of sitti... \n", "4 So she was considering in her own mind ( as we... \n", "5 There was nothing so VERY remarkable in that ;... \n", "6 Oh dear ! \n", "7 I shall be late !' \n", "8 ( when she thought it over afterwards , it occ... \n", "9 In another moment down went Alice after it , n... \n", "10 The rabbit - hole went straight on like a tunn... \n", "11 Either the well was very deep , or she fell ve... \n", "12 First , she tried to look down and make out wh... \n", "13 She took down a jar from one of the shelves as... \n", "14 ' Well !' \n", "15 thought Alice to herself , ' after such a fall... \n", "16 How brave they ' ll all think me at home ! \n", "17 Why , I wouldn ' t say anything about it , eve... \n", "18 ( Which was very likely true .) \n", "19 Down , down , down . \n", "\n", " sents_pos \\\n", "0 [/JJ Alice/NNP '/POS s/NN Adventures/NNS in/IN... \n", "1 CHAPTER/NN I/PRP ./. \n", "2 Down/IN the/DT Rabbit/NNP -/: Hole/NN \n", "3 Alice/NNP was/VBD beginning/VBG to/TO get/VB v... \n", "4 So/IN she/PRP was/VBD considering/VBG in/IN he... \n", "5 There/EX was/VBD nothing/NN so/RB VERY/RB rema... \n", "6 Oh/UH dear/NN !/. \n", "7 I/PRP shall/MD be/VB late/JJ !'/NN \n", "8 (/( when/WRB she/PRP thought/VBD it/PRP over/I... \n", "9 In/IN another/DT moment/NN down/RP went/VBD Al... \n", "10 The/DT rabbit/NN -/: hole/NN went/VBD straight... \n", "11 Either/CC the/DT well/NN was/VBD very/RB deep/... \n", "12 First/RB ,/, she/PRP tried/VBD to/TO look/VB d... \n", "13 She/PRP took/VBD down/RP a/DT jar/NN from/IN o... \n", "14 '/POS Well/NNP !'/NN \n", "15 thought/VBN Alice/NNP to/TO herself/VB ,/, '/'... \n", "16 How/WRB brave/VBP they/PRP '/'' ll/VBP all/DT ... \n", "17 Why/WRB ,/, I/PRP wouldn/VBP '/'' t/NNS say/VB... \n", "18 (/( Which/NNP was/VBD very/RB likely/JJ true/J... 
\n", "19 Down/NNP ,/, down/RB ,/, down/RB ./. \n", "\n", " sents_lem \n", "0 [ Alice ' s Adventures in Wonderland by Lewis ... \n", "1 CHAPTER I . \n", "2 Down the Rabbit - Hole \n", "3 Alice be begin to get very tired of sit by her... \n", "4 So she be consider in her own mind ( as well a... \n", "5 There be nothing so VERY remarkable in that ; ... \n", "6 Oh dear ! \n", "7 I shall be late !' \n", "8 ( when she think it over afterwards , it occur... \n", "9 In another moment down go Alice after it , nev... \n", "10 The rabbit - hole go straight on like a tunnel... \n", "11 Either the well be very deep , or she fell ver... \n", "12 First , she try to look down and make out what... \n", "13 She take down a jar from one of the shelf a sh... \n", "14 ' Well !' \n", "15 think Alice to herself , ' after such a fall a... \n", "16 How brave they ' ll all think me at home ! \n", "17 Why , I wouldn ' t say anything about it , eve... \n", "18 ( Which be very likely true .) \n", "19 Down , down , down . " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alice_sents_df[:20]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2\n", "\n", "Based on the output of **Question 1**, please create a lemma frequency list of the corpus, `carroll-alice.txt`, using the lemmatized forms. Include only lemmas that:\n", "- consist only of letters or hyphens\n", "- are at least 5 characters long\n", "\n", "Casing is irrelevant (i.e., case normalization is needed).\n", "\n", "The expected output is shown below (the top 21 lemmas and their frequencies).\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>LEMMA</th><th>FREQ</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>alice</td><td>398</td></tr>\n",
"<tr><th>1</th><td>little</td><td>128</td></tr>\n",
"<tr><th>2</th><td>think</td><td>118</td></tr>\n",
"<tr><th>3</th><td>there</td><td>99</td></tr>\n",
"<tr><th>4</th><td>about</td><td>94</td></tr>\n",
"<tr><th>5</th><td>begin</td><td>92</td></tr>\n",
"<tr><th>6</th><td>would</td><td>83</td></tr>\n",
"<tr><th>7</th><td>again</td><td>83</td></tr>\n",
"<tr><th>8</th><td>herself</td><td>83</td></tr>\n",
"<tr><th>9</th><td>thing</td><td>80</td></tr>\n",
"<tr><th>10</th><td>could</td><td>77</td></tr>\n",
"<tr><th>11</th><td>queen</td><td>75</td></tr>\n",
"<tr><th>12</th><td>turtle</td><td>61</td></tr>\n",
"<tr><th>13</th><td>hatter</td><td>57</td></tr>\n",
"<tr><th>14</th><td>quite</td><td>55</td></tr>\n",
"<tr><th>15</th><td>gryphon</td><td>55</td></tr>\n",
"<tr><th>16</th><td>rabbit</td><td>52</td></tr>\n",
"<tr><th>17</th><td>their</td><td>52</td></tr>\n",
"<tr><th>18</th><td>first</td><td>51</td></tr>\n",
"<tr><th>19</th><td>voice</td><td>51</td></tr>\n",
"<tr><th>20</th><td>which</td><td>49</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " LEMMA FREQ\n", "0 alice 398\n", "1 little 128\n", "2 think 118\n", "3 there 99\n", "4 about 94\n", "5 begin 92\n", "6 would 83\n", "7 again 83\n", "8 herself 83\n", "9 thing 80\n", "10 could 77\n", "11 queen 75\n", "12 turtle 61\n", "13 hatter 57\n", "14 quite 55\n", "15 gryphon 55\n", "16 rabbit 52\n", "17 their 52\n", "18 first 51\n", "19 voice 51\n", "20 which 49" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# top 20\n", "alice_df[:21]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3\n", "\n", "Please identify top verbs that co-occcur with the name *Alice* in the text, with *Alice* being the **subject** of the verb. \n", "\n", "Please use the `en_core_web_sm` model in `spacy` for English dependency parsing.\n", "\n", "To simply the task, please identify all the verbs that have a dependency relation of `nsubj` with the noun `Alice` (where `Alice` is the **dependent**, and the verb is the **head**).\n", "\n", "The expected output is provided below (showing the top 20 heads of `Alice` for the `nsubj` dependency relation.)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>nsubj-head</th><th>FREQ</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>said</td><td>127</td></tr>\n",
"<tr><th>1</th><td>thought</td><td>32</td></tr>\n",
"<tr><th>2</th><td>replied</td><td>13</td></tr>\n",
"<tr><th>3</th><td>was</td><td>10</td></tr>\n",
"<tr><th>4</th><td>began</td><td>8</td></tr>\n",
"<tr><th>5</th><td>went</td><td>7</td></tr>\n",
"<tr><th>6</th><td>looked</td><td>7</td></tr>\n",
"<tr><th>7</th><td>felt</td><td>5</td></tr>\n",
"<tr><th>8</th><td>like</td><td>5</td></tr>\n",
"<tr><th>9</th><td>think</td><td>4</td></tr>\n",
"<tr><th>10</th><td>had</td><td>4</td></tr>\n",
"<tr><th>11</th><td>ventured</td><td>4</td></tr>\n",
"<tr><th>12</th><td>beginning</td><td>3</td></tr>\n",
"<tr><th>13</th><td>been</td><td>3</td></tr>\n",
"<tr><th>14</th><td>heard</td><td>3</td></tr>\n",
"<tr><th>15</th><td>hear</td><td>3</td></tr>\n",
"<tr><th>16</th><td>waited</td><td>3</td></tr>\n",
"<tr><th>17</th><td>remarked</td><td>3</td></tr>\n",
"<tr><th>18</th><td>asked</td><td>3</td></tr>\n",
"<tr><th>19</th><td>see</td><td>3</td></tr>\n",
"<tr><th>20</th><td>got</td><td>2</td></tr>\n",
"</tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " nsubj-head FREQ\n", "0 said 127\n", "1 thought 32\n", "2 replied 13\n", "3 was 10\n", "4 began 8\n", "5 went 7\n", "6 looked 7\n", "7 felt 5\n", "8 like 5\n", "9 think 4\n", "10 had 4\n", "11 ventured 4\n", "12 beginning 3\n", "13 been 3\n", "14 heard 3\n", "15 hear 3\n", "16 waited 3\n", "17 remarked 3\n", "18 asked 3\n", "19 see 3\n", "20 got 2" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alice_nsubj_df[:21]" ] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "python-notes", "language": "python", "name": "python-notes" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }