3. Assignment III: Chinese Language Processing#

3.1. Question 1#

The CSV file dcard-top100.csv includes the top 100 posts from Dcard, an online discussion forum for school life in Taiwan. The texts are in the content column.

Please preprocess the data by:

  • removing punctuation, emoticons, and other non-linguistic symbols

  • removing stopwords (Please use the stopword list provided in demo_data/stopwords/tomlinNTUB-chinese-stopwords.txt)

  • performing word segmentation on the corpus using ckip-transformers

  • creating a word frequency list of this tiny corpus

  • including in the frequency list only word tokens that have at least two characters

Warning

Please note that the preprocessing steps are important: which characters you remove from the texts can strongly affect word segmentation performance. A minimal sketch of one possible pipeline is given below.
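The sketch below is one possible pipeline, assuming the file paths given above. The cleaning regex (keeping only CJK characters) and the segmentation model name "bert-base" are illustrative choices, not the only acceptable solution.

import re
from collections import Counter

import pandas as pd
from ckip_transformers.nlp import CkipWordSegmenter

# Load the corpus and the stopword list (paths as given in the instructions)
dcards = pd.read_csv("dcard-top100.csv")
with open("demo_data/stopwords/tomlinNTUB-chinese-stopwords.txt", encoding="utf-8") as f:
    stopwords = set(f.read().split())

# Keep only CJK characters; punctuation, emoticons, digits, and Latin letters
# are replaced by spaces so they cannot mislead the segmenter
texts = [re.sub(r"[^\u4e00-\u9fff]+", " ", str(t)) for t in dcards["content"]]

# Word segmentation with ckip-transformers (the model is downloaded on first use)
ws_driver = CkipWordSegmenter(model="bert-base")
segmented = ws_driver(texts)

# Frequency list: keep tokens of at least two characters that are not stopwords
tokens = [w for sent in segmented for w in sent
          if len(w.strip()) >= 2 and w not in stopwords]
dcards_df = (pd.Series(Counter(tokens), name="FREQ")
             .sort_values(ascending=False)
             .rename_axis("LEMMA")
             .reset_index())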

dcards_df[:21]
LEMMA FREQ
0 真的 115
1 沒有 92
2 覺得 90
3 知道 70
4 看到 67
5 現在 63
6 喜歡 56
7 朋友 54
8 其實 52
9 一直 52
10 不會 51
11 發現 43
12 男友 42
13 一下 41
14 已經 41
15 很多 40
16 時間 40
17 工作 40
18 分享 39
19 感覺 39
20 一起 38

3.2. Question 2#

Use ckip-transformers to extract all named entities and create a frequency list of the named entities.

In particular, please identify named entities of organizations (ORG) and geo-political entities (GPE) and provide their frequencies in the Dcard corpus. A minimal sketch follows.
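The sketch below reuses the cleaned texts from Question 1. CkipNerChunker returns NerToken objects whose .ner field carries the entity class; restricting it to ORG and GPE is one way to meet the requirement.

from collections import Counter

import pandas as pd
from ckip_transformers.nlp import CkipNerChunker

# Named-entity recognition with ckip-transformers
ner_driver = CkipNerChunker(model="bert-base")
ner_results = ner_driver(texts)          # `texts` from Question 1

# Keep only organizations (ORG) and geo-political entities (GPE)
entities = [tok.word for doc in ner_results for tok in doc
            if tok.ner in ("ORG", "GPE")]
ner_df = (pd.Series(Counter(entities), name="FREQ")
          .sort_values(ascending=False)
          .rename_axis("LEMMA")
          .reset_index())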

ner_df[:21]
LEMMA FREQ
0 台灣 23
1 日本 18
2 台南 7
3 6
4 英國 5
5 台中 4
6 沖繩 4
7 韓國 4
8 聖圭 4
9 台北 3
10 德國 3
11 台大 2
12 杜克大學 2
13 東京 2
14 SHINee 2
15 2
16 板橋 2
17 武林 2
18 巴黎 2
19 鹿港 2
20 Celine 2

3.3. Question 3#

In this exercise, please work with spacy for Chinese processing (use the model zh_core_web_trf).

Please process the same Dcard Corpus (from the csv file) by:

  • performing the word tokenization

  • identifying all nouns and verbs (i.e., words whose tags start with N or V)

  • identifying all words with at least two characters

  • removing all words that contain alphabetic characters or digits

  • removing all words that are included in the stopword_list (cf. Question 1)

Based on the above text-preprocessing criteria, your goal is to create a word frequency list and visualize the result in a word cloud (a minimal sketch follows the tip below).

Note

spacy uses jieba for Chinese word segmentation, so there may be more tagging errors. The expected results presented below were produced without any self-defined dictionary. For this exercise, please ignore any tagging errors made by the module for the moment.

Tip

Please check the wordcloud module for the visualization.
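A minimal sketch under the criteria above. The stopwords set comes from the Question 1 sketch; the font file name passed to WordCloud is a placeholder and must point to a CJK-capable font on your system.

import re
from collections import Counter

import pandas as pd
import spacy
from wordcloud import WordCloud

nlp = spacy.load("zh_core_web_trf")

freq = Counter()
for doc in nlp.pipe(dcards["content"].astype(str)):
    for tok in doc:
        w = tok.text
        if (tok.tag_.startswith(("N", "V"))             # nouns and verbs only
                and len(w) >= 2                         # at least two characters
                and not re.search(r"[A-Za-z0-9]", w)    # no alphabets or digits
                and w not in stopwords):                # stopword list from Question 1
            freq[w] += 1

nouns_verbs_df = (pd.Series(freq, name="FREQ")
                  .sort_values(ascending=False)
                  .rename_axis("N-V")
                  .reset_index())

# Word cloud; the font path is a placeholder
wc = WordCloud(font_path="NotoSansTC-Regular.otf",
               background_color="white", width=800, height=600)
wc.generate_from_frequencies(freq)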

nouns_verbs_df[:21]
N-V FREQ
0 知道 70
1 看到 67
2 朋友 55
3 喜歡 55
4 分享 43
5 男友 42
6 沒有 41
7 工作 40
8 很多 39
9 第一 38
10 感覺 38
11 時間 37
12 發現 36
13 == 36
14 希望 35
15 感情 33
16 今天 33
17 蛋糕 32
18 部分 30
19 出去 30
20 想要 30
import matplotlib.pyplot as plt

# Render the word cloud generated above
plt.figure(figsize=(7, 5))
plt.imshow(wc)
plt.axis("off")
plt.show()
[Figure: word cloud of the most frequent nouns and verbs in the Dcard corpus]

3.4. Question 4#

Following Question 3, after you process each article with spacy, please extract all the subject + predicate word pairs from the corpus.

To simplify the task, please extract word token pairs whose dependency relation is nsubj, with the predicate being the head and the subject being the dependent.

  • Remove word pairs in which either word contains alphabetic characters or digits (see the sketch below)
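A minimal sketch, reusing the nlp pipeline from Question 3. It counts each dependent/head pair whose relation is nsubj, joining the two tokens with an underscore as in the expected output.

import re
from collections import Counter

import pandas as pd

pairs = Counter()
for doc in nlp.pipe(dcards["content"].astype(str)):      # `nlp` from Question 3
    for tok in doc:
        if tok.dep_ == "nsubj":
            subj, pred = tok.text, tok.head.text          # dependent + its head
            # Drop pairs in which either word contains alphabets or digits
            if not re.search(r"[A-Za-z0-9]", subj + pred):
                pairs[f"{subj}_{pred}"] += 1

sub_pred_df = (pd.Series(pairs, name="FREQ")
               .sort_values(ascending=False)
               .rename_axis("SUBJ-PRED")
               .reset_index())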

sub_pred_df[:21]
SUBJ-PRED FREQ
0 我_喜歡 21
1 他_說 20
2 我_想 19
3 我_覺 19
4 我_知道 16
5 我_看 14
6 我_看到 11
7 我_用 10
8 我_說 8
9 大家_好 8
10 我_愛 8
11 我_有 7
12 你_有 7
13 我_去 6
14 他_覺 6
15 我_要 6
16 我_決定 6
17 她_說 6
18 我_發現 5
19 我_問 5
20 我_在 5