Vectorizing Texts

  • Feature engineering for text representation

import nltk
from nltk.corpus import brown
brown.categories()
['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']
corpus_id = brown.fileids(categories=['reviews','fiction','humor'])
corpus_text = [' '.join(brown.words(fileids=cid)) for cid in corpus_id]
print(corpus_text[0][:100])
corpus_cat = [brown.categories(fileids=cid)[0] for cid in corpus_id]

print(len(corpus_text))
print(len(corpus_id))
print(len(corpus_cat))
type(corpus_id)
It is not news that Nathan Milstein is a wizard of the violin . Certainly not in Orchestra Hall wher
55
55
55
list
import numpy as np
import pandas as pd
import re

assert len(corpus_text)==len(corpus_cat)

corpus_df = pd.DataFrame({'Text': corpus_text, 'Category': corpus_cat, 'ID': corpus_id})
corpus_df
Text Category ID
0 It is not news that Nathan Milstein is a wizar... reviews cc01
1 Television has yet to work out a living arrang... reviews cc02
2 Francois D'Albert , Hungarian-born violinist w... reviews cc03
3 The Theatre-by-the-Sea , Matunuck , presents `... reviews cc04
4 The superb intellectual and spiritual vitality... reviews cc05
5 George Kennan's account of relations between R... reviews cc06
6 Some of the New York Philharmonic musicians wh... reviews cc07
7 Had a funny experience at Newport yesterday af... reviews cc08
8 Murray Louis and his dance company appeared at... reviews cc09
9 Ring Of Bright Water , by Gavin Maxwell . 211 ... reviews cc10
10 Mischa Elman shared last night's Lewisohn Stad... reviews cc11
11 Radio is easily outdistancing television in it... reviews cc12
12 A tribe in ancient India believed the earth wa... reviews cc13
13 Elisabeth Schwarzkopf sang so magnificently Sa... reviews cc14
14 As autumn starts its annual sweep , few Americ... reviews cc15
15 A year ago it was bruited that the primary cha... reviews cc16
16 The reading public , the theatergoing public ,... reviews cc17
17 Thirty-three Scotty did not go back to school ... fiction ck01
18 Where their sharp edges seemed restless as sea... fiction ck02
19 Mickie sat over his second whisky-on-the-rocks... fiction ck03
20 The Bishop looked at him coldly and said , `` ... fiction ck04
21 Payne dismounted in Madison Place and handed t... fiction ck05
22 With a sneer , the man spread his legs and , a... fiction ck06
23 If the crummy bastard could write ! ! That's h... fiction ck07
24 Rousseau is so persuasive that Voltaire is alm... fiction ck08
25 It was the first time any of us had laughed si... fiction ck09
26 That summer the gambling houses were closed , ... fiction ck10
27 Standing in the shelter of the tent -- a rejec... fiction ck11
28 She was a child too much a part of her environ... fiction ck12
29 In the dim underwater light they dressed and s... fiction ck13
30 He brought with him a mixture of myrrh and alo... fiction ck14
31 Beth was very still and her breath came in sma... fiction ck15
32 The red glow from the cove had died out of the... fiction ck16
33 Burly leathered men and wrinkled women in drab... fiction ck17
34 She was getting real dramatic . I'd have been ... fiction ck18
35 There was one fact which Rector could not over... fiction ck19
36 She concluded by asking him to name another ho... fiction ck20
37 Beckworth handed the pass to the colonel . He ... fiction ck21
38 I would not want to be one of those writers wh... fiction ck22
39 It was not as though she noted clearly that he... fiction ck23
40 His eyes were old and they never saw well , bu... fiction ck24
41 He was in his mid-fifties at this time , long ... fiction ck25
42 But they all said , `` No , your time will com... fiction ck26
43 `` But tell me , doctor , where do you plan to... fiction ck27
44 Going downstairs with the tray , Winston wishe... fiction ck28
45 Was it love ? ? I had no doubt that it was . D... fiction ck29
46 It was among these that Hinkle identified a ph... humor cr01
47 I realized that Hamlet was faced with an entir... humor cr02
48 Needless to say , I was furious at this unpara... humor cr03
49 Up to date , however , his garden was still mo... humor cr04
50 Ambiguity Nothing in English has been ridicule... humor cr05
51 I called the other afternoon on my old friend ... humor cr06
52 One day , the children had wanted to get up on... humor cr07
53 Pueri aquam de silvas ad agricolas portant , a... humor cr08
54 Dear Sirs : Let me begin by clearing up any po... humor cr09
## Clean up texts
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
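The stop-word list printed above can be used to drop high-frequency function words before vectorizing. A minimal sketch of that filtering step, assuming a small hand-copied subset of the list (the names `STOP_WORDS_SUBSET` and `remove_stop_words` are illustrative and not part of NLTK):

```python
# A few entries copied from the NLTK English stop-word list printed above;
# in the notebook, the full `stop_words` list would be used instead.
STOP_WORDS_SUBSET = {"it", "is", "not", "that", "a", "of", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS_SUBSET):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "it is not news that nathan milstein is a wizard of the violin".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['news', 'nathan', 'milstein', 'wizard', 'violin']
```

Storing the stop words in a set rather than a list keeps each membership test constant-time, which matters once the corpus grows.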
## function
def normalize_text(text):
    ## remove special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I|re.A)
    text = text.lower().strip()
    tokens = wpt.tokenize(text)
    ## filtering
    #tokens_filtered = [w for w in tokens if w not in stop_words and re.search(r'\D+', w)]
    tokens_filtered = tokens
    text_output = ' '.join(tokens_filtered)
    
    return text_output

## wrap the function so it can be applied element-wise to the whole corpus
normalize_corpus = np.vectorize(normalize_text)

corpus_norm = normalize_corpus(corpus_text)
corpus_norm[1][:200]
'television has yet to work out a living arrangement with jazz which comes to the medium more as an uneasy guest than as a relaxed member of the family there seems to be an unfortunate assumption that '
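A note on `re.sub`: its signature is `re.sub(pattern, repl, string, count=0, flags=0)`, so a flag value passed as the fourth positional argument is read as `count`, not as flags. The sketch below illustrates the difference on a synthetic string (the string and variable names are made up for the demonstration):

```python
import re

text = "x1" * 300  # 300 letters interleaved with 300 digits

# Same effect as passing re.I | re.A as the fourth positional argument:
# the integer value of the flags lands in `count`, which caps the number
# of replacements.
as_count = re.sub(r"[^a-zA-Z\s]", "", text, count=int(re.I | re.A))

# Passing the flags by keyword treats them as flags; every match is replaced.
as_flags = re.sub(r"[^a-zA-Z\s]", "", text, flags=re.I | re.A)

print(len(as_count), len(as_flags))
print(any(ch.isdigit() for ch in as_count))  # True: some digits survived
```

On long documents the capped version silently leaves later punctuation and digits in place, so passing `flags=` by keyword is the safer habit.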

Bag of Words

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=0.2, max_df=1.)
cv_matrix = cv.fit_transform(corpus_norm)
cv_matrix


## view the array
type(cv_matrix)

cv_matrix = cv_matrix.toarray()

vocab = cv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
boa_unigram = pd.DataFrame(cv_matrix, columns=vocab)
boa_unigram
able about above across act added after afternoon again against ... written wrote year years yes yet york you young your
0 1 5 0 0 0 0 1 0 1 0 ... 0 0 0 0 0 1 0 3 0 1
1 0 3 0 1 2 0 1 0 1 0 ... 0 0 0 4 0 2 1 8 1 4
2 1 3 0 0 1 0 3 0 0 0 ... 1 0 0 5 0 0 3 4 2 1
3 0 2 0 0 4 0 2 0 3 0 ... 1 0 0 1 0 0 2 0 2 1
4 1 4 1 0 0 0 1 0 1 1 ... 1 0 0 0 0 1 0 6 0 2
5 0 11 0 0 0 0 4 0 1 1 ... 0 0 1 1 0 1 0 6 0 0
6 1 1 1 0 1 1 6 2 2 0 ... 0 1 1 0 1 0 2 0 2 0
7 1 10 0 0 0 0 4 4 3 0 ... 0 1 3 1 0 0 0 3 1 2
8 1 1 1 0 0 1 0 0 2 0 ... 2 1 0 1 0 2 6 1 2 0
9 0 1 0 1 0 0 3 0 0 1 ... 0 0 2 4 0 1 2 4 1 0
10 1 6 1 0 0 1 1 0 1 2 ... 0 0 2 2 0 1 0 0 1 0
11 0 3 0 0 0 0 1 0 0 1 ... 0 0 3 4 0 1 5 0 2 0
12 2 4 0 0 3 0 2 0 1 0 ... 0 0 0 6 0 0 2 1 1 1
13 0 3 1 0 1 0 3 0 1 0 ... 4 0 0 1 0 0 3 0 1 0
14 0 5 1 1 1 0 1 0 0 0 ... 0 0 4 0 0 0 3 2 1 0
15 1 9 1 0 0 0 2 0 1 1 ... 1 0 1 2 0 1 1 0 0 0
16 0 4 1 0 1 0 1 0 1 0 ... 0 0 5 2 0 0 1 0 0 0
17 1 11 0 2 0 0 3 2 2 0 ... 0 0 0 0 0 4 0 16 1 0
18 0 9 2 0 0 0 1 1 2 2 ... 0 0 0 0 0 3 0 9 0 2
19 0 10 1 2 0 0 0 0 1 0 ... 0 0 0 1 0 1 2 22 3 2
20 0 10 0 3 0 0 3 0 2 0 ... 0 0 0 8 0 1 0 10 2 1
21 1 6 2 1 3 0 3 0 2 0 ... 0 0 0 0 0 2 1 6 1 0
22 0 1 0 3 0 0 1 1 5 0 ... 0 0 0 0 0 2 1 11 0 2
23 0 5 1 0 0 0 1 0 6 0 ... 0 0 0 0 0 1 0 11 1 2
24 0 4 0 0 0 0 6 0 2 2 ... 1 1 0 2 4 0 0 13 0 4
25 0 9 1 1 0 0 3 0 0 1 ... 0 0 0 0 0 2 0 6 1 2
26 0 1 2 0 0 0 3 0 2 2 ... 1 0 0 1 0 0 0 13 0 2
27 0 3 2 4 0 0 3 1 7 1 ... 0 0 0 0 0 0 0 4 4 0
28 0 5 0 2 0 0 1 0 3 0 ... 0 0 0 1 3 2 0 1 0 1
29 1 5 0 2 1 0 2 1 1 1 ... 0 0 0 0 0 0 0 7 1 0
30 1 4 0 0 0 1 4 1 1 0 ... 0 0 1 1 0 5 0 8 3 1
31 0 6 0 1 0 0 5 2 3 4 ... 1 4 0 1 1 0 0 11 2 4
32 0 0 0 2 0 0 0 0 2 0 ... 1 0 1 1 1 1 3 1 0 0
33 1 4 0 1 0 0 0 1 2 0 ... 0 0 0 3 2 0 0 21 0 4
34 0 16 1 1 1 1 2 0 2 0 ... 0 1 0 1 0 0 1 6 0 0
35 2 12 0 0 0 1 2 2 0 1 ... 1 0 0 0 0 1 0 13 0 0
36 0 7 0 0 1 0 0 0 3 2 ... 0 3 0 1 0 2 0 12 0 2
37 0 0 0 2 0 0 1 1 3 4 ... 1 0 0 0 4 0 0 9 0 0
38 0 1 0 0 0 0 4 2 0 1 ... 0 0 3 2 0 0 1 14 0 5
39 0 2 1 1 0 1 0 0 5 2 ... 0 1 0 7 1 0 0 8 2 1
40 0 12 0 0 0 0 3 0 2 0 ... 0 0 0 1 1 0 0 21 0 4
41 0 1 2 1 0 0 4 1 5 0 ... 0 0 0 6 2 2 0 0 1 0
42 0 12 0 0 0 0 3 1 2 2 ... 0 0 2 1 1 0 0 9 0 1
43 0 2 0 2 0 0 2 1 2 0 ... 0 0 0 1 0 0 0 5 1 0
44 0 9 0 1 0 1 4 0 0 0 ... 0 1 0 5 0 0 0 30 3 2
45 0 4 0 1 1 0 2 1 3 0 ... 0 1 2 4 0 2 1 1 1 0
46 0 2 0 0 0 0 4 0 3 0 ... 0 0 1 2 0 0 0 5 0 1
47 1 6 0 0 0 1 8 1 2 1 ... 0 0 2 1 1 0 2 11 0 2
48 0 3 0 0 0 0 1 0 1 4 ... 0 3 1 1 0 2 0 1 3 0
49 2 6 1 0 0 0 5 1 2 1 ... 0 0 0 1 0 0 0 27 0 1
50 0 1 0 0 0 0 1 0 1 0 ... 0 1 3 2 1 1 0 26 0 8
51 1 7 0 0 0 0 2 1 2 2 ... 1 3 1 9 1 0 0 20 0 3
52 1 5 0 0 0 1 2 0 3 0 ... 0 0 0 0 2 0 0 20 0 5
53 0 4 0 2 0 0 5 0 1 1 ... 0 0 0 2 0 1 2 2 0 1
54 1 2 2 1 0 1 1 0 0 2 ... 0 0 0 3 0 0 0 45 1 17

55 rows × 739 columns
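To make the table above easier to read, here is a rough pure-Python sketch of what a bag-of-words matrix with a `min_df` proportion threshold contains. It is an illustration under simplifying assumptions (whitespace tokenization, three made-up toy documents), not a reimplementation of `CountVectorizer`, whose default tokenizer also drops single-character tokens:

```python
from collections import Counter

docs = [
    "the violin sang and the hall listened",
    "the hall was empty",
    "jazz came to the medium as a guest",
]

# Tokenize on whitespace (a simplification of the real tokenizer).
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Keep terms that occur in at least a `min_df` proportion of the documents.
min_df = 0.5
vocab = sorted(t for t, df in doc_freq.items() if df / len(docs) >= min_df)

# One row of raw counts per document, one column per vocabulary term.
bow = [[Counter(tokens)[term] for term in vocab] for tokens in tokenized]

print(vocab)  # ['hall', 'the']
for row in bow:
    print(row)
```

With `min_df=0.2` and 55 documents, as in the notebook, a term has to appear in at least 11 documents to be kept, which is why the vocabulary shrinks to a few hundred columns.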

More Complex Bag-of-Words

  • Filter features based on word classes

  • Include n-gram features
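As a sketch of the first idea, filtering by word class, the snippet below keeps only content words from a list of (word, tag) pairs. The pairs and their tags are hand-written for illustration; in practice they might come from a tagged corpus (the Brown corpus in NLTK exposes `brown.tagged_words`) or from a POS tagger. The names `KEEP_TAGS` and `filter_by_word_class` are hypothetical:

```python
# Hypothetical (word, tag) pairs with universal-style POS tags, written by
# hand for this example rather than produced by a tagger.
tagged = [
    ("television", "NOUN"), ("has", "VERB"), ("yet", "ADV"),
    ("to", "PRT"), ("work", "VERB"), ("out", "PRT"),
    ("a", "DET"), ("living", "ADJ"), ("arrangement", "NOUN"),
    ("with", "ADP"), ("jazz", "NOUN"),
]

KEEP_TAGS = {"NOUN", "VERB", "ADJ"}  # word classes to retain

def filter_by_word_class(tagged_tokens, keep=KEEP_TAGS):
    """Return the words whose tag is in `keep`, joined back into a string."""
    return " ".join(word for word, tag in tagged_tokens if tag in keep)

filtered_text = filter_by_word_class(tagged)
print(filtered_text)  # television has work living arrangement jazz
```

The filtered string has the same shape as the normalized texts used elsewhere in this notebook, so it could be passed to a vectorizer in the same way.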

Bag of N-grams Model

## N-grams

cv_ngram = CountVectorizer(ngram_range=(1,3), min_df = 0.2)
cv_ngram_matrix = cv_ngram.fit_transform(corpus_norm)

boa_ngram = pd.DataFrame(cv_ngram_matrix.toarray(), columns=cv_ngram.get_feature_names_out())
boa_ngram.head()
able able to about about it about the above across across the act added ... yet york you you are you can you could you have you know young your
0 1 0 5 0 3 0 0 0 0 0 ... 1 0 3 0 0 0 0 0 0 1
1 0 0 3 0 0 0 1 0 2 0 ... 2 1 8 1 0 0 1 0 1 4
2 1 0 3 0 0 0 0 0 1 0 ... 0 3 4 1 0 0 1 0 2 1
3 0 0 2 0 1 0 0 0 4 0 ... 0 2 0 0 0 0 0 0 2 1
4 1 1 4 1 1 1 0 0 0 0 ... 1 0 6 0 1 1 1 0 0 2

5 rows × 1193 columns
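The column names above mix single words with two- and three-word sequences because `ngram_range=(1,3)` asks for all contiguous word n-grams of length 1 to 3. A small sketch of that enumeration, assuming whitespace-tokenized input (the helper `word_ngrams` is illustrative, not a scikit-learn function):

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "you could have known".split()
ngrams = word_ngrams(tokens)
print(ngrams)
```

Four tokens yield 4 unigrams, 3 bigrams, and 2 trigrams, which shows how quickly the feature space grows and why a `min_df` threshold is useful here.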

TF-IDF Model

from sklearn.feature_extraction.text import TfidfTransformer
tf = TfidfTransformer(norm = 'l2', use_idf=True)
tf_matrix = tf.fit_transform(cv_matrix)

tfidf_unigram = pd.DataFrame(tf_matrix.toarray(), columns=cv.get_feature_names_out())
tfidf_unigram
able about above across act added after afternoon again against ... written wrote year years yes yet york you young your
0 0.009462 0.024754 0.000000 0.000000 0.000000 0.000000 0.005318 0.000000 0.005614 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.008088 0.000000 0.016540 0.000000 0.007161
1 0.000000 0.014533 0.000000 0.008444 0.022309 0.000000 0.005204 0.000000 0.005493 0.000000 ... 0.000000 0.000000 0.000000 0.024528 0.000000 0.015829 0.009042 0.043160 0.007592 0.028028
2 0.009405 0.014763 0.000000 0.000000 0.011331 0.000000 0.015859 0.000000 0.000000 0.000000 ... 0.011331 0.000000 0.000000 0.031143 0.000000 0.000000 0.027553 0.021920 0.015424 0.007117
3 0.000000 0.010555 0.000000 0.000000 0.048609 0.000000 0.011339 0.000000 0.017954 0.000000 ... 0.012152 0.000000 0.000000 0.006680 0.000000 0.000000 0.019701 0.000000 0.016542 0.007634
4 0.009887 0.020690 0.009887 0.000000 0.000000 0.000000 0.005557 0.000000 0.005866 0.008821 ... 0.011910 0.000000 0.000000 0.000000 0.000000 0.008451 0.000000 0.034563 0.000000 0.014963
5 0.000000 0.046936 0.000000 0.000000 0.000000 0.000000 0.018335 0.000000 0.004839 0.007276 ... 0.000000 0.000000 0.008356 0.005401 0.000000 0.006971 0.000000 0.028511 0.000000 0.000000
6 0.009412 0.004924 0.009412 0.000000 0.011338 0.012071 0.031740 0.018824 0.011168 0.000000 ... 0.000000 0.011338 0.009644 0.000000 0.010704 0.000000 0.018382 0.000000 0.015434 0.000000
7 0.009725 0.050883 0.000000 0.000000 0.000000 0.000000 0.021865 0.038902 0.017310 0.000000 ... 0.000000 0.011716 0.029895 0.006441 0.000000 0.000000 0.000000 0.017000 0.007974 0.014719
8 0.008883 0.004648 0.008883 0.000000 0.000000 0.011393 0.000000 0.000000 0.010540 0.000000 ... 0.021403 0.010701 0.000000 0.005883 0.000000 0.015186 0.052047 0.005176 0.014567 0.000000
9 0.000000 0.004898 0.000000 0.008538 0.000000 0.000000 0.015786 0.000000 0.000000 0.008353 ... 0.000000 0.000000 0.019186 0.024800 0.000000 0.008003 0.018285 0.021820 0.007677 0.000000
10 0.009336 0.029307 0.009336 0.000000 0.000000 0.011973 0.005247 0.000000 0.005539 0.016659 ... 0.000000 0.000000 0.019132 0.012365 0.000000 0.007980 0.000000 0.000000 0.007655 0.000000
11 0.000000 0.013760 0.000000 0.000000 0.000000 0.000000 0.004927 0.000000 0.000000 0.007821 ... 0.000000 0.000000 0.026948 0.023223 0.000000 0.007493 0.042804 0.000000 0.014376 0.000000
12 0.017945 0.018778 0.000000 0.000000 0.032428 0.000000 0.010086 0.000000 0.005323 0.000000 ... 0.000000 0.000000 0.000000 0.035652 0.000000 0.000000 0.017524 0.005228 0.007357 0.006790
13 0.000000 0.013423 0.008552 0.000000 0.010303 0.000000 0.014420 0.000000 0.005074 0.000000 ... 0.041210 0.000000 0.000000 0.005663 0.000000 0.000000 0.025054 0.000000 0.007012 0.000000
14 0.000000 0.026945 0.010300 0.009394 0.012409 0.000000 0.005789 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.042216 0.000000 0.000000 0.000000 0.030175 0.012003 0.008446 0.000000
15 0.009780 0.046052 0.009780 0.000000 0.000000 0.000000 0.010994 0.000000 0.005802 0.008725 ... 0.011782 0.000000 0.010021 0.012953 0.000000 0.008360 0.009550 0.000000 0.000000 0.000000
16 0.000000 0.020279 0.009690 0.000000 0.011674 0.000000 0.005446 0.000000 0.005749 0.000000 ... 0.000000 0.000000 0.049644 0.012834 0.000000 0.000000 0.009463 0.000000 0.000000 0.000000
17 0.009886 0.056898 0.000000 0.018032 0.000000 0.000000 0.016670 0.019773 0.011731 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.033802 0.000000 0.092167 0.008106 0.000000
18 0.000000 0.044869 0.019058 0.000000 0.000000 0.000000 0.005356 0.009529 0.011307 0.017003 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.024435 0.000000 0.049969 0.000000 0.014422
19 0.000000 0.049857 0.009529 0.017381 0.000000 0.000000 0.000000 0.000000 0.005654 0.000000 ... 0.000000 0.000000 0.000000 0.006311 0.000000 0.008145 0.018611 0.122152 0.023440 0.014423
20 0.000000 0.046828 0.000000 0.024488 0.000000 0.000000 0.015092 0.000000 0.010620 0.000000 ... 0.000000 0.000000 0.000000 0.047418 0.000000 0.007650 0.000000 0.052150 0.014677 0.006773
21 0.009171 0.028791 0.018343 0.008364 0.033146 0.000000 0.015464 0.000000 0.010883 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.015679 0.008956 0.032063 0.007520 0.000000
22 0.000000 0.004591 0.000000 0.024006 0.000000 0.000000 0.004932 0.008774 0.026028 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.015000 0.008568 0.056236 0.000000 0.013280
23 0.000000 0.027181 0.010390 0.000000 0.000000 0.000000 0.005840 0.000000 0.036987 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.008881 0.000000 0.066595 0.008519 0.015726
24 0.000000 0.019541 0.000000 0.000000 0.000000 0.000000 0.031488 0.000000 0.011079 0.016661 ... 0.011248 0.011248 0.000000 0.012367 0.042476 0.000000 0.000000 0.070725 0.000000 0.028264
25 0.000000 0.041602 0.008835 0.008057 0.000000 0.000000 0.014897 0.000000 0.000000 0.007882 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.015104 0.000000 0.030887 0.007244 0.013372
26 0.000000 0.004729 0.018077 0.000000 0.000000 0.000000 0.015240 0.000000 0.010725 0.016128 ... 0.010889 0.000000 0.000000 0.005986 0.000000 0.000000 0.000000 0.068463 0.000000 0.013680
27 0.000000 0.012572 0.016019 0.029219 0.000000 0.000000 0.013506 0.008010 0.033264 0.007146 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.018668 0.026270 0.000000
28 0.000000 0.026287 0.000000 0.018328 0.000000 0.000000 0.005648 0.000000 0.017885 0.000000 ... 0.000000 0.000000 0.000000 0.006655 0.034284 0.017178 0.000000 0.005855 0.000000 0.007604
29 0.008547 0.022359 0.000000 0.015590 0.010297 0.000000 0.009608 0.008547 0.005071 0.007626 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.034861 0.007008 0.000000
30 0.009866 0.020647 0.000000 0.000000 0.000000 0.012653 0.022180 0.009866 0.005853 0.000000 ... 0.000000 0.000000 0.010109 0.006533 0.000000 0.042164 0.000000 0.045987 0.024268 0.007466
31 0.000000 0.028949 0.000000 0.008410 0.000000 0.000000 0.025916 0.018444 0.016414 0.032910 ... 0.011109 0.044438 0.000000 0.006107 0.010488 0.000000 0.000000 0.059105 0.015123 0.027914
32 0.000000 0.000000 0.000000 0.016363 0.000000 0.000000 0.000000 0.000000 0.010645 0.000000 ... 0.010807 0.000000 0.009192 0.005941 0.010203 0.007668 0.026281 0.005227 0.000000 0.000000
33 0.010317 0.021592 0.000000 0.009409 0.000000 0.000000 0.000000 0.010317 0.012242 0.000000 ... 0.000000 0.000000 0.000000 0.020497 0.023467 0.000000 0.000000 0.126240 0.000000 0.031230
34 0.000000 0.092212 0.011015 0.010046 0.013270 0.014127 0.012382 0.000000 0.013071 0.000000 ... 0.000000 0.013270 0.000000 0.007295 0.000000 0.000000 0.010757 0.038510 0.000000 0.000000
35 0.016857 0.052918 0.000000 0.000000 0.000000 0.010810 0.009475 0.016857 0.000000 0.007520 ... 0.010154 0.000000 0.000000 0.000000 0.000000 0.007204 0.000000 0.063843 0.000000 0.000000
36 0.000000 0.036244 0.000000 0.000000 0.011922 0.000000 0.000000 0.000000 0.017614 0.017659 ... 0.000000 0.035766 0.000000 0.006554 0.000000 0.016918 0.000000 0.069194 0.000000 0.014978
37 0.000000 0.000000 0.000000 0.015838 0.000000 0.000000 0.004880 0.008683 0.015455 0.030987 ... 0.010460 0.000000 0.000000 0.000000 0.039500 0.000000 0.000000 0.045534 0.000000 0.000000
38 0.000000 0.005114 0.000000 0.000000 0.000000 0.000000 0.021975 0.019548 0.000000 0.008720 ... 0.000000 0.000000 0.030045 0.012946 0.000000 0.000000 0.009545 0.079731 0.000000 0.036983
39 0.000000 0.008647 0.008264 0.007536 0.000000 0.010598 0.000000 0.000000 0.024514 0.014745 ... 0.000000 0.009955 0.000000 0.038308 0.009398 0.000000 0.000000 0.038519 0.013551 0.006254
40 0.000000 0.062135 0.000000 0.000000 0.000000 0.000000 0.016687 0.000000 0.011743 0.000000 ... 0.000000 0.000000 0.000000 0.006554 0.011255 0.000000 0.000000 0.121093 0.000000 0.029957
41 0.000000 0.005035 0.019245 0.008776 0.000000 0.000000 0.021634 0.009623 0.028545 0.000000 ... 0.000000 0.000000 0.000000 0.038235 0.021887 0.016450 0.000000 0.000000 0.007890 0.000000
42 0.000000 0.062556 0.000000 0.000000 0.000000 0.000000 0.016800 0.009964 0.011823 0.017779 ... 0.000000 0.000000 0.020418 0.006598 0.011331 0.000000 0.000000 0.052249 0.000000 0.007540
43 0.000000 0.008408 0.000000 0.014657 0.000000 0.000000 0.009033 0.008036 0.009535 0.000000 ... 0.000000 0.000000 0.000000 0.005321 0.000000 0.000000 0.000000 0.023410 0.006589 0.000000
44 0.000000 0.045169 0.000000 0.008748 0.000000 0.012303 0.021566 0.000000 0.000000 0.000000 ... 0.000000 0.011556 0.000000 0.031763 0.000000 0.000000 0.000000 0.167676 0.023596 0.014518
45 0.000000 0.020685 0.000000 0.009014 0.011907 0.000000 0.011111 0.009884 0.017592 0.000000 ... 0.000000 0.011907 0.020255 0.026182 0.000000 0.016897 0.009652 0.005759 0.008104 0.000000
46 0.000000 0.010388 0.000000 0.000000 0.000000 0.000000 0.022318 0.000000 0.017669 0.000000 ... 0.000000 0.000000 0.010171 0.013148 0.000000 0.000000 0.000000 0.028920 0.000000 0.007512
47 0.010698 0.033583 0.000000 0.000000 0.000000 0.013720 0.048102 0.010698 0.012694 0.009544 ... 0.000000 0.000000 0.021923 0.007085 0.012167 0.000000 0.020893 0.068566 0.000000 0.016191
48 0.000000 0.015427 0.000000 0.000000 0.000000 0.000000 0.005524 0.000000 0.005831 0.035076 ... 0.000000 0.035522 0.010071 0.006509 0.000000 0.016802 0.000000 0.005727 0.024177 0.000000
49 0.019850 0.031157 0.009925 0.000000 0.000000 0.000000 0.027892 0.009925 0.011777 0.008855 ... 0.000000 0.000000 0.000000 0.006573 0.000000 0.000000 0.000000 0.156141 0.000000 0.007511
50 0.000000 0.005730 0.000000 0.000000 0.000000 0.000000 0.006155 0.000000 0.006497 0.000000 ... 0.000000 0.013193 0.033662 0.014504 0.012454 0.009361 0.000000 0.165899 0.000000 0.066297
51 0.009381 0.034356 0.000000 0.000000 0.000000 0.000000 0.010545 0.009381 0.011131 0.016739 ... 0.011301 0.033903 0.009612 0.055911 0.010669 0.000000 0.000000 0.109317 0.000000 0.021297
52 0.010202 0.026688 0.000000 0.000000 0.000000 0.013084 0.011468 0.000000 0.018158 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.023205 0.000000 0.000000 0.118885 0.000000 0.038601
53 0.000000 0.018951 0.000000 0.016517 0.000000 0.000000 0.025448 0.000000 0.005373 0.008079 ... 0.000000 0.000000 0.000000 0.011994 0.000000 0.007740 0.017686 0.010553 0.000000 0.006853
54 0.010298 0.010776 0.020596 0.009392 0.000000 0.013208 0.005788 0.000000 0.000000 0.018376 ... 0.000000 0.000000 0.000000 0.020460 0.000000 0.000000 0.000000 0.270016 0.008444 0.132484

55 rows × 739 columns
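The weights in the table can be understood with a short pure-Python sketch. It assumes the settings used above (`norm='l2'`, `use_idf=True`) together with the scikit-learn default `smooth_idf=True`, for which the IDF of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. The function name `tfidf` and the toy count matrix are illustrative:

```python
import math

def tfidf(counts):
    """Smoothed IDF weighting followed by L2 normalization of each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows with a non-zero count for the term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted = [[c * w for c, w in zip(row, idf)] for row in counts]
    result = []
    for row in weighted:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result

# Toy term-count matrix: 6 documents, 3 terms.
counts = [[3, 0, 1], [2, 0, 0], [3, 0, 0], [4, 0, 0], [3, 2, 0], [3, 0, 2]]
weights = tfidf(counts)
for row in weights:
    print([round(v, 4) for v in row])
```

Because each row is scaled to unit length, long and short documents become comparable, and a term that occurs in every document (the first column here) gets the minimum IDF of 1.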

tf_matrix2 = tf.fit_transform(cv_ngram_matrix)
tfidf_ngram = pd.DataFrame(tf_matrix2.toarray(), columns=cv_ngram.get_feature_names_out())
tfidf_ngram
able able to about about it about the above across across the act added ... yet york you you are you can you could you have you know young your
0 0.009217 0.000000 0.024112 0.000000 0.023146 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.007878 0.000000 0.016111 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006975
1 0.000000 0.000000 0.014143 0.000000 0.000000 0.000000 0.008217 0.000000 0.021710 0.000000 ... 0.015404 0.008799 0.042001 0.011192 0.000000 0.000000 0.010248 0.000000 0.007388 0.027275
2 0.009164 0.000000 0.014384 0.000000 0.000000 0.000000 0.000000 0.000000 0.011040 0.000000 ... 0.000000 0.026847 0.021358 0.011383 0.000000 0.000000 0.010422 0.000000 0.015028 0.006935
3 0.000000 0.000000 0.010331 0.000000 0.008264 0.000000 0.000000 0.000000 0.047577 0.000000 ... 0.000000 0.019283 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.016191 0.007472
4 0.009664 0.011305 0.020224 0.011642 0.008089 0.009664 0.000000 0.000000 0.000000 0.000000 ... 0.008260 0.000000 0.033784 0.000000 0.010990 0.012003 0.010990 0.000000 0.000000 0.014626
5 0.000000 0.000000 0.045861 0.009600 0.020010 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.006811 0.000000 0.027858 0.000000 0.009063 0.000000 0.009063 0.000000 0.000000 0.000000
6 0.009253 0.010825 0.004841 0.000000 0.000000 0.009253 0.000000 0.000000 0.011147 0.011867 ... 0.000000 0.018072 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.015174 0.000000
7 0.009444 0.011048 0.049410 0.000000 0.015810 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.016508 0.000000 0.000000 0.000000 0.010740 0.000000 0.007743 0.014293
8 0.008653 0.000000 0.004527 0.000000 0.000000 0.008653 0.000000 0.000000 0.000000 0.011097 ... 0.014792 0.050696 0.005041 0.000000 0.000000 0.000000 0.000000 0.000000 0.014189 0.000000
9 0.000000 0.000000 0.004790 0.000000 0.007663 0.000000 0.008349 0.008940 0.000000 0.000000 ... 0.007826 0.017881 0.021338 0.000000 0.010412 0.000000 0.000000 0.000000 0.007507 0.000000
10 0.009146 0.000000 0.028710 0.000000 0.022966 0.009146 0.000000 0.000000 0.000000 0.011729 ... 0.007817 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007499 0.000000
11 0.000000 0.000000 0.013455 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.007327 0.041855 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.014057 0.000000
12 0.017607 0.020598 0.018424 0.000000 0.014738 0.000000 0.000000 0.000000 0.031817 0.000000 ... 0.000000 0.017194 0.005130 0.000000 0.000000 0.000000 0.000000 0.000000 0.007218 0.006662
13 0.000000 0.000000 0.013175 0.000000 0.000000 0.008394 0.000000 0.000000 0.010112 0.000000 ... 0.000000 0.024591 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006883 0.000000
14 0.000000 0.000000 0.026481 0.000000 0.008473 0.010123 0.009232 0.009885 0.012195 0.000000 ... 0.000000 0.029655 0.011796 0.000000 0.011512 0.000000 0.000000 0.000000 0.008300 0.000000
15 0.009561 0.000000 0.045023 0.000000 0.008003 0.009561 0.000000 0.000000 0.000000 0.000000 ... 0.008173 0.009337 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
16 0.000000 0.000000 0.019991 0.000000 0.000000 0.009552 0.000000 0.000000 0.011508 0.000000 ... 0.000000 0.009328 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
17 0.009487 0.000000 0.054599 0.022858 0.007941 0.000000 0.017304 0.018528 0.000000 0.000000 ... 0.032436 0.000000 0.088442 0.000000 0.000000 0.011784 0.000000 0.000000 0.007779 0.000000
18 0.000000 0.000000 0.043551 0.000000 0.015484 0.018498 0.000000 0.000000 0.000000 0.000000 ... 0.023717 0.000000 0.048500 0.011488 0.010519 0.000000 0.010519 0.000000 0.000000 0.013998
19 0.000000 0.000000 0.048336 0.000000 0.015466 0.009239 0.016851 0.018043 0.000000 0.000000 ... 0.007897 0.018043 0.118424 0.000000 0.000000 0.022951 0.021014 0.010507 0.022725 0.013983
20 0.000000 0.000000 0.045488 0.041896 0.000000 0.000000 0.023787 0.025470 0.000000 0.000000 ... 0.007432 0.000000 0.050658 0.000000 0.000000 0.000000 0.000000 0.009888 0.014257 0.006579
21 0.008779 0.010270 0.027558 0.010576 0.000000 0.017557 0.008006 0.000000 0.031727 0.000000 ... 0.015008 0.008573 0.030690 0.000000 0.000000 0.000000 0.000000 0.000000 0.007198 0.000000
22 0.000000 0.000000 0.004498 0.000000 0.000000 0.000000 0.023519 0.025183 0.000000 0.000000 ... 0.014696 0.008394 0.055096 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.013011
23 0.000000 0.000000 0.026297 0.012110 0.000000 0.010052 0.000000 0.000000 0.000000 0.000000 ... 0.008592 0.000000 0.064428 0.000000 0.022865 0.000000 0.000000 0.011432 0.008242 0.015214
24 0.000000 0.000000 0.018921 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.068481 0.011230 0.010282 0.011230 0.000000 0.000000 0.000000 0.027367
25 0.000000 0.000000 0.040605 0.020777 0.021654 0.008623 0.007864 0.008421 0.000000 0.000000 ... 0.014742 0.000000 0.030146 0.000000 0.000000 0.021422 0.009807 0.000000 0.007070 0.013051
26 0.000000 0.000000 0.004620 0.000000 0.000000 0.017660 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.066884 0.010968 0.000000 0.000000 0.000000 0.010042 0.000000 0.013364
27 0.000000 0.000000 0.012231 0.000000 0.000000 0.015585 0.028427 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.018162 0.000000 0.000000 0.000000 0.000000 0.008863 0.025558 0.000000
28 0.000000 0.000000 0.025542 0.000000 0.008173 0.000000 0.017809 0.019069 0.000000 0.000000 ... 0.016691 0.000000 0.005689 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007389
29 0.008238 0.009637 0.021549 0.000000 0.000000 0.000000 0.015025 0.016088 0.009924 0.000000 ... 0.000000 0.000000 0.033598 0.000000 0.000000 0.000000 0.000000 0.000000 0.006754 0.000000
30 0.009582 0.011210 0.020054 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.012289 ... 0.040953 0.000000 0.044666 0.000000 0.010898 0.000000 0.000000 0.000000 0.023571 0.007251
31 0.000000 0.000000 0.028117 0.000000 0.014995 0.000000 0.008168 0.008746 0.000000 0.000000 ... 0.000000 0.000000 0.057407 0.000000 0.000000 0.000000 0.000000 0.010186 0.014688 0.027112
32 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.016031 0.008583 0.000000 0.000000 ... 0.007513 0.025749 0.005121 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
33 0.010092 0.011807 0.021122 0.012158 0.008448 0.000000 0.009204 0.009855 0.000000 0.000000 ... 0.000000 0.000000 0.123490 0.012536 0.022956 0.000000 0.000000 0.000000 0.000000 0.030550
34 0.000000 0.000000 0.088805 0.012780 0.017760 0.010608 0.009675 0.010359 0.012780 0.013605 ... 0.000000 0.010359 0.037087 0.000000 0.000000 0.013177 0.000000 0.012065 0.000000 0.000000
35 0.016226 0.018982 0.050935 0.029320 0.006791 0.000000 0.000000 0.000000 0.000000 0.010405 ... 0.006935 0.000000 0.061451 0.010077 0.018453 0.020154 0.009227 0.000000 0.000000 0.000000
36 0.000000 0.000000 0.035172 0.000000 0.016077 0.000000 0.000000 0.000000 0.011569 0.000000 ... 0.016418 0.000000 0.067148 0.000000 0.000000 0.000000 0.010922 0.021844 0.000000 0.014535
37 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.015380 0.016468 0.000000 0.000000 ... 0.000000 0.000000 0.044218 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
38 0.000000 0.000000 0.004955 0.000000 0.007927 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.009247 0.077247 0.000000 0.000000 0.000000 0.010770 0.000000 0.000000 0.035831
39 0.000000 0.000000 0.008419 0.009693 0.000000 0.008046 0.007338 0.007857 0.000000 0.010319 ... 0.000000 0.000000 0.037504 0.000000 0.000000 0.009994 0.000000 0.009150 0.013194 0.006089
40 0.000000 0.000000 0.059723 0.000000 0.023887 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.116393 0.011815 0.000000 0.000000 0.000000 0.010818 0.000000 0.028794
41 0.000000 0.000000 0.004826 0.000000 0.000000 0.018449 0.008413 0.009008 0.000000 0.000000 ... 0.015770 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007564 0.000000
42 0.000000 0.000000 0.059596 0.000000 0.015891 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.049777 0.000000 0.010795 0.011790 0.000000 0.010795 0.000000 0.007183
43 0.000000 0.000000 0.008240 0.000000 0.006591 0.000000 0.014363 0.015379 0.000000 0.000000 ... 0.000000 0.000000 0.022941 0.000000 0.000000 0.000000 0.000000 0.000000 0.006457 0.000000
44 0.000000 0.000000 0.043550 0.011142 0.007742 0.000000 0.008435 0.009031 0.000000 0.011862 ... 0.000000 0.000000 0.161665 0.000000 0.000000 0.011488 0.000000 0.010518 0.022750 0.013998
45 0.000000 0.000000 0.020163 0.000000 0.000000 0.000000 0.008786 0.009408 0.011606 0.000000 ... 0.016470 0.009408 0.005614 0.000000 0.000000 0.000000 0.000000 0.010957 0.007900 0.000000
46 0.000000 0.000000 0.010144 0.000000 0.008115 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.028243 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007336
47 0.010314 0.012066 0.032378 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.013228 ... 0.000000 0.020144 0.066105 0.012811 0.011730 0.000000 0.011730 0.000000 0.000000 0.015610
48 0.000000 0.000000 0.015133 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.016482 0.000000 0.005618 0.000000 0.000000 0.000000 0.000000 0.000000 0.023716 0.000000
49 0.019175 0.022432 0.030097 0.011550 0.000000 0.009587 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.150827 0.000000 0.010904 0.011909 0.000000 0.010904 0.000000 0.007255
50 0.000000 0.000000 0.005582 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.009119 0.000000 0.161614 0.013251 0.012133 0.000000 0.000000 0.000000 0.000000 0.064585
51 0.009137 0.010689 0.033463 0.000000 0.022945 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.106476 0.034048 0.000000 0.000000 0.020783 0.000000 0.000000 0.020743
52 0.009785 0.011447 0.025597 0.000000 0.024571 0.000000 0.000000 0.000000 0.000000 0.012549 ... 0.000000 0.000000 0.114022 0.000000 0.000000 0.012154 0.000000 0.011128 0.000000 0.037023
53 0.000000 0.000000 0.018548 0.000000 0.000000 0.000000 0.016166 0.017310 0.000000 0.000000 ... 0.007576 0.017310 0.010328 0.000000 0.000000 0.000000 0.010080 0.000000 0.000000 0.006707
54 0.010130 0.011851 0.010600 0.000000 0.000000 0.020260 0.009238 0.009892 0.000000 0.012992 ... 0.000000 0.000000 0.265603 0.012582 0.023041 0.000000 0.011521 0.000000 0.008306 0.130319

55 rows × 1193 columns

  • We can also build the TF-IDF model directly from the corpus with TfidfVectorizer, which combines the counting and TF-IDF weighting steps

from sklearn.feature_extraction.text import TfidfVectorizer

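tv = TfidfVectorizer(min_df=0.2,          # keep n-grams found in at least 20% of the documents
                     ngram_range=(1, 3),  # unigrams, bigrams, and trigrams
                     use_idf=True)
tv_matrix = tv.fit_transform(corpus_norm)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_ngram2 = pd.DataFrame(tv_matrix.toarray(), columns=tv.get_feature_names_out())
tfidf_ngram2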
able able to about about it about the above across across the act added ... yet york you you are you can you could you have you know young your
0 0.009217 0.000000 0.024112 0.000000 0.023146 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.007878 0.000000 0.016111 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006975
1 0.000000 0.000000 0.014143 0.000000 0.000000 0.000000 0.008217 0.000000 0.021710 0.000000 ... 0.015404 0.008799 0.042001 0.011192 0.000000 0.000000 0.010248 0.000000 0.007388 0.027275
2 0.009164 0.000000 0.014384 0.000000 0.000000 0.000000 0.000000 0.000000 0.011040 0.000000 ... 0.000000 0.026847 0.021358 0.011383 0.000000 0.000000 0.010422 0.000000 0.015028 0.006935
3 0.000000 0.000000 0.010331 0.000000 0.008264 0.000000 0.000000 0.000000 0.047577 0.000000 ... 0.000000 0.019283 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.016191 0.007472
4 0.009664 0.011305 0.020224 0.011642 0.008089 0.009664 0.000000 0.000000 0.000000 0.000000 ... 0.008260 0.000000 0.033784 0.000000 0.010990 0.012003 0.010990 0.000000 0.000000 0.014626
5 0.000000 0.000000 0.045861 0.009600 0.020010 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.006811 0.000000 0.027858 0.000000 0.009063 0.000000 0.009063 0.000000 0.000000 0.000000
6 0.009253 0.010825 0.004841 0.000000 0.000000 0.009253 0.000000 0.000000 0.011147 0.011867 ... 0.000000 0.018072 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.015174 0.000000
7 0.009444 0.011048 0.049410 0.000000 0.015810 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.016508 0.000000 0.000000 0.000000 0.010740 0.000000 0.007743 0.014293
8 0.008653 0.000000 0.004527 0.000000 0.000000 0.008653 0.000000 0.000000 0.000000 0.011097 ... 0.014792 0.050696 0.005041 0.000000 0.000000 0.000000 0.000000 0.000000 0.014189 0.000000
9 0.000000 0.000000 0.004790 0.000000 0.007663 0.000000 0.008349 0.008940 0.000000 0.000000 ... 0.007826 0.017881 0.021338 0.000000 0.010412 0.000000 0.000000 0.000000 0.007507 0.000000
10 0.009146 0.000000 0.028710 0.000000 0.022966 0.009146 0.000000 0.000000 0.000000 0.011729 ... 0.007817 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007499 0.000000
11 0.000000 0.000000 0.013455 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.007327 0.041855 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.014057 0.000000
12 0.017607 0.020598 0.018424 0.000000 0.014738 0.000000 0.000000 0.000000 0.031817 0.000000 ... 0.000000 0.017194 0.005130 0.000000 0.000000 0.000000 0.000000 0.000000 0.007218 0.006662
13 0.000000 0.000000 0.013175 0.000000 0.000000 0.008394 0.000000 0.000000 0.010112 0.000000 ... 0.000000 0.024591 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.006883 0.000000
14 0.000000 0.000000 0.026481 0.000000 0.008473 0.010123 0.009232 0.009885 0.012195 0.000000 ... 0.000000 0.029655 0.011796 0.000000 0.011512 0.000000 0.000000 0.000000 0.008300 0.000000
15 0.009561 0.000000 0.045023 0.000000 0.008003 0.009561 0.000000 0.000000 0.000000 0.000000 ... 0.008173 0.009337 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
16 0.000000 0.000000 0.019991 0.000000 0.000000 0.009552 0.000000 0.000000 0.011508 0.000000 ... 0.000000 0.009328 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
17 0.009487 0.000000 0.054599 0.022858 0.007941 0.000000 0.017304 0.018528 0.000000 0.000000 ... 0.032436 0.000000 0.088442 0.000000 0.000000 0.011784 0.000000 0.000000 0.007779 0.000000
18 0.000000 0.000000 0.043551 0.000000 0.015484 0.018498 0.000000 0.000000 0.000000 0.000000 ... 0.023717 0.000000 0.048500 0.011488 0.010519 0.000000 0.010519 0.000000 0.000000 0.013998
19 0.000000 0.000000 0.048336 0.000000 0.015466 0.009239 0.016851 0.018043 0.000000 0.000000 ... 0.007897 0.018043 0.118424 0.000000 0.000000 0.022951 0.021014 0.010507 0.022725 0.013983
20 0.000000 0.000000 0.045488 0.041896 0.000000 0.000000 0.023787 0.025470 0.000000 0.000000 ... 0.007432 0.000000 0.050658 0.000000 0.000000 0.000000 0.000000 0.009888 0.014257 0.006579
21 0.008779 0.010270 0.027558 0.010576 0.000000 0.017557 0.008006 0.000000 0.031727 0.000000 ... 0.015008 0.008573 0.030690 0.000000 0.000000 0.000000 0.000000 0.000000 0.007198 0.000000
22 0.000000 0.000000 0.004498 0.000000 0.000000 0.000000 0.023519 0.025183 0.000000 0.000000 ... 0.014696 0.008394 0.055096 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.013011
23 0.000000 0.000000 0.026297 0.012110 0.000000 0.010052 0.000000 0.000000 0.000000 0.000000 ... 0.008592 0.000000 0.064428 0.000000 0.022865 0.000000 0.000000 0.011432 0.008242 0.015214
24 0.000000 0.000000 0.018921 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.068481 0.011230 0.010282 0.011230 0.000000 0.000000 0.000000 0.027367
25 0.000000 0.000000 0.040605 0.020777 0.021654 0.008623 0.007864 0.008421 0.000000 0.000000 ... 0.014742 0.000000 0.030146 0.000000 0.000000 0.021422 0.009807 0.000000 0.007070 0.013051
26 0.000000 0.000000 0.004620 0.000000 0.000000 0.017660 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.066884 0.010968 0.000000 0.000000 0.000000 0.010042 0.000000 0.013364
27 0.000000 0.000000 0.012231 0.000000 0.000000 0.015585 0.028427 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.018162 0.000000 0.000000 0.000000 0.000000 0.008863 0.025558 0.000000
28 0.000000 0.000000 0.025542 0.000000 0.008173 0.000000 0.017809 0.019069 0.000000 0.000000 ... 0.016691 0.000000 0.005689 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007389
29 0.008238 0.009637 0.021549 0.000000 0.000000 0.000000 0.015025 0.016088 0.009924 0.000000 ... 0.000000 0.000000 0.033598 0.000000 0.000000 0.000000 0.000000 0.000000 0.006754 0.000000
30 0.009582 0.011210 0.020054 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.012289 ... 0.040953 0.000000 0.044666 0.000000 0.010898 0.000000 0.000000 0.000000 0.023571 0.007251
31 0.000000 0.000000 0.028117 0.000000 0.014995 0.000000 0.008168 0.008746 0.000000 0.000000 ... 0.000000 0.000000 0.057407 0.000000 0.000000 0.000000 0.000000 0.010186 0.014688 0.027112
32 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.016031 0.008583 0.000000 0.000000 ... 0.007513 0.025749 0.005121 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
33 0.010092 0.011807 0.021122 0.012158 0.008448 0.000000 0.009204 0.009855 0.000000 0.000000 ... 0.000000 0.000000 0.123490 0.012536 0.022956 0.000000 0.000000 0.000000 0.000000 0.030550
34 0.000000 0.000000 0.088805 0.012780 0.017760 0.010608 0.009675 0.010359 0.012780 0.013605 ... 0.000000 0.010359 0.037087 0.000000 0.000000 0.013177 0.000000 0.012065 0.000000 0.000000
35 0.016226 0.018982 0.050935 0.029320 0.006791 0.000000 0.000000 0.000000 0.000000 0.010405 ... 0.006935 0.000000 0.061451 0.010077 0.018453 0.020154 0.009227 0.000000 0.000000 0.000000
36 0.000000 0.000000 0.035172 0.000000 0.016077 0.000000 0.000000 0.000000 0.011569 0.000000 ... 0.016418 0.000000 0.067148 0.000000 0.000000 0.000000 0.010922 0.021844 0.000000 0.014535
37 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.015380 0.016468 0.000000 0.000000 ... 0.000000 0.000000 0.044218 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
38 0.000000 0.000000 0.004955 0.000000 0.007927 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.009247 0.077247 0.000000 0.000000 0.000000 0.010770 0.000000 0.000000 0.035831
39 0.000000 0.000000 0.008419 0.009693 0.000000 0.008046 0.007338 0.007857 0.000000 0.010319 ... 0.000000 0.000000 0.037504 0.000000 0.000000 0.009994 0.000000 0.009150 0.013194 0.006089
40 0.000000 0.000000 0.059723 0.000000 0.023887 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.116393 0.011815 0.000000 0.000000 0.000000 0.010818 0.000000 0.028794
41 0.000000 0.000000 0.004826 0.000000 0.000000 0.018449 0.008413 0.009008 0.000000 0.000000 ... 0.015770 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007564 0.000000
42 0.000000 0.000000 0.059596 0.000000 0.015891 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.049777 0.000000 0.010795 0.011790 0.000000 0.010795 0.000000 0.007183
43 0.000000 0.000000 0.008240 0.000000 0.006591 0.000000 0.014363 0.015379 0.000000 0.000000 ... 0.000000 0.000000 0.022941 0.000000 0.000000 0.000000 0.000000 0.000000 0.006457 0.000000
44 0.000000 0.000000 0.043550 0.011142 0.007742 0.000000 0.008435 0.009031 0.000000 0.011862 ... 0.000000 0.000000 0.161665 0.000000 0.000000 0.011488 0.000000 0.010518 0.022750 0.013998
45 0.000000 0.000000 0.020163 0.000000 0.000000 0.000000 0.008786 0.009408 0.011606 0.000000 ... 0.016470 0.009408 0.005614 0.000000 0.000000 0.000000 0.000000 0.010957 0.007900 0.000000
46 0.000000 0.000000 0.010144 0.000000 0.008115 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.028243 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.007336
47 0.010314 0.012066 0.032378 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.013228 ... 0.000000 0.020144 0.066105 0.012811 0.011730 0.000000 0.011730 0.000000 0.000000 0.015610
48 0.000000 0.000000 0.015133 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.016482 0.000000 0.005618 0.000000 0.000000 0.000000 0.000000 0.000000 0.023716 0.000000
49 0.019175 0.022432 0.030097 0.011550 0.000000 0.009587 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.150827 0.000000 0.010904 0.011909 0.000000 0.010904 0.000000 0.007255
50 0.000000 0.000000 0.005582 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.009119 0.000000 0.161614 0.013251 0.012133 0.000000 0.000000 0.000000 0.000000 0.064585
51 0.009137 0.010689 0.033463 0.000000 0.022945 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.106476 0.034048 0.000000 0.000000 0.020783 0.000000 0.000000 0.020743
52 0.009785 0.011447 0.025597 0.000000 0.024571 0.000000 0.000000 0.000000 0.000000 0.012549 ... 0.000000 0.000000 0.114022 0.000000 0.000000 0.012154 0.000000 0.011128 0.000000 0.037023
53 0.000000 0.000000 0.018548 0.000000 0.000000 0.000000 0.016166 0.017310 0.000000 0.000000 ... 0.007576 0.017310 0.010328 0.000000 0.000000 0.000000 0.010080 0.000000 0.000000 0.006707
54 0.010130 0.011851 0.010600 0.000000 0.000000 0.020260 0.009238 0.009892 0.000000 0.012992 ... 0.000000 0.000000 0.265603 0.012582 0.023041 0.000000 0.011521 0.000000 0.008306 0.130319

55 rows × 1193 columns

Document Similarity#

  • Cluster analysis with R seems more intuitive to me

from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
0 1 2 3 4 5 6 7 8 9 ... 45 46 47 48 49 50 51 52 53 54
0 1.000000 0.903099 0.901749 0.901110 0.907569 0.930193 0.874775 0.894402 0.919195 0.918026 ... 0.774057 0.861176 0.731567 0.844309 0.757781 0.843069 0.879384 0.712273 0.938602 0.866187
1 0.903099 1.000000 0.871965 0.867151 0.880732 0.913532 0.861910 0.902770 0.887822 0.877728 ... 0.784384 0.841174 0.729602 0.828479 0.717505 0.830369 0.836896 0.694544 0.907940 0.852431
2 0.901749 0.871965 1.000000 0.905119 0.873995 0.901962 0.883693 0.882181 0.896813 0.889755 ... 0.786636 0.821484 0.770126 0.842209 0.727968 0.815429 0.854184 0.710397 0.888556 0.842170
3 0.901110 0.867151 0.905119 1.000000 0.887274 0.898775 0.872573 0.882392 0.895758 0.890735 ... 0.780414 0.825479 0.750377 0.848413 0.720307 0.813878 0.852632 0.694705 0.899165 0.838796
4 0.907569 0.880732 0.873995 0.887274 1.000000 0.907642 0.840482 0.865856 0.884396 0.913948 ... 0.782834 0.862465 0.739062 0.856936 0.783967 0.854642 0.876750 0.717906 0.910563 0.878084
5 0.930193 0.913532 0.901962 0.898775 0.907642 1.000000 0.874888 0.896952 0.915662 0.923222 ... 0.783759 0.849392 0.744431 0.842141 0.746043 0.840681 0.871048 0.702883 0.931814 0.865160
6 0.874775 0.861910 0.883693 0.872573 0.840482 0.874888 1.000000 0.860151 0.882502 0.852995 ... 0.752524 0.796356 0.742264 0.828274 0.704168 0.770870 0.818364 0.670490 0.864791 0.816389
7 0.894402 0.902770 0.882181 0.882392 0.865856 0.896952 0.860151 1.000000 0.878723 0.877895 ... 0.799398 0.859184 0.767273 0.847824 0.746111 0.826701 0.843524 0.740893 0.896333 0.845723
8 0.919195 0.887822 0.896813 0.895758 0.884396 0.915662 0.882502 0.878723 1.000000 0.883385 ... 0.761283 0.827417 0.731610 0.825890 0.745811 0.822658 0.854792 0.682048 0.910304 0.840261
9 0.918026 0.877728 0.889755 0.890735 0.913948 0.923222 0.852995 0.877895 0.883385 1.000000 ... 0.765622 0.859278 0.729147 0.838511 0.755803 0.839939 0.859853 0.695509 0.920241 0.863758
10 0.893628 0.867787 0.881295 0.877463 0.889364 0.902471 0.860747 0.873244 0.907378 0.879919 ... 0.770757 0.847329 0.725159 0.838996 0.775047 0.825331 0.832768 0.706481 0.895585 0.835030
11 0.918032 0.922543 0.877476 0.873176 0.896175 0.923494 0.865996 0.897044 0.897752 0.905827 ... 0.764435 0.851556 0.709733 0.835111 0.720147 0.822738 0.840674 0.663372 0.924252 0.848980
12 0.927320 0.915016 0.882587 0.879687 0.894711 0.922641 0.856478 0.898665 0.918277 0.897774 ... 0.769405 0.836532 0.729273 0.825734 0.719975 0.836236 0.863377 0.685054 0.929722 0.853080
13 0.916993 0.900951 0.913324 0.892250 0.897556 0.923803 0.888728 0.888100 0.920030 0.894838 ... 0.773878 0.831713 0.738261 0.832680 0.736512 0.822101 0.852300 0.694546 0.913104 0.843727
14 0.879863 0.865431 0.878511 0.863772 0.863142 0.878933 0.844643 0.846818 0.871842 0.868329 ... 0.759897 0.779336 0.719440 0.804986 0.670269 0.786946 0.833296 0.641161 0.878097 0.832097
15 0.903929 0.853194 0.873932 0.883668 0.904154 0.896867 0.832619 0.854780 0.879057 0.894117 ... 0.775972 0.852444 0.743408 0.847293 0.767151 0.829395 0.866052 0.701045 0.895306 0.857095
16 0.907692 0.874330 0.866760 0.871648 0.892364 0.905170 0.836240 0.858848 0.892530 0.886955 ... 0.747956 0.819538 0.711203 0.831813 0.700731 0.818488 0.875829 0.658974 0.907822 0.840660
17 0.722748 0.676630 0.720698 0.704657 0.750416 0.708952 0.690665 0.723350 0.682600 0.749956 ... 0.704389 0.801974 0.702493 0.763639 0.812151 0.724943 0.714101 0.750128 0.712810 0.747868
18 0.852268 0.820648 0.800214 0.789683 0.854566 0.840020 0.779562 0.839935 0.816550 0.854762 ... 0.755226 0.880251 0.746072 0.826234 0.822705 0.801768 0.809347 0.741113 0.859247 0.840234
19 0.840175 0.788794 0.822618 0.804291 0.837034 0.824228 0.792047 0.830945 0.803000 0.838556 ... 0.792280 0.870559 0.778227 0.851410 0.852138 0.812832 0.831184 0.790361 0.832035 0.858552
20 0.833332 0.798374 0.834168 0.817030 0.821097 0.825280 0.808250 0.843359 0.813096 0.817110 ... 0.805719 0.861150 0.809262 0.852424 0.808533 0.789840 0.820379 0.772771 0.826265 0.832442
21 0.745682 0.691691 0.724727 0.708737 0.732307 0.710049 0.695002 0.755924 0.712325 0.740827 ... 0.711484 0.829470 0.712595 0.772976 0.817639 0.731415 0.722279 0.752113 0.731588 0.744636
22 0.816264 0.769034 0.799091 0.782308 0.812292 0.794071 0.787110 0.810181 0.795496 0.818210 ... 0.763423 0.839915 0.735678 0.826365 0.796230 0.762080 0.791648 0.710864 0.812984 0.831387
23 0.741841 0.669861 0.689774 0.685933 0.735625 0.692563 0.669424 0.719684 0.691292 0.729020 ... 0.665508 0.783379 0.650770 0.725722 0.814485 0.740312 0.731499 0.707777 0.725159 0.747473
24 0.841091 0.804408 0.805012 0.808854 0.862205 0.827907 0.791908 0.808168 0.798110 0.839622 ... 0.769851 0.871765 0.755433 0.838300 0.842024 0.792607 0.807654 0.767430 0.844423 0.839348
25 0.820705 0.785139 0.815059 0.803249 0.810044 0.822846 0.793844 0.816898 0.798980 0.809980 ... 0.814741 0.836125 0.792851 0.847521 0.765031 0.766337 0.812754 0.758284 0.821126 0.822159
26 0.846585 0.808344 0.789634 0.800136 0.853079 0.832964 0.775050 0.817873 0.808568 0.846014 ... 0.755862 0.872566 0.715610 0.819897 0.798217 0.796102 0.801881 0.712568 0.854132 0.846940
27 0.851304 0.797714 0.797572 0.784558 0.826298 0.826986 0.778319 0.824567 0.817484 0.844344 ... 0.742222 0.853406 0.695424 0.809665 0.777829 0.787153 0.810296 0.704694 0.851361 0.817153
28 0.824971 0.816717 0.786819 0.777690 0.816804 0.818824 0.771025 0.831528 0.799508 0.818772 ... 0.812883 0.881882 0.791936 0.827054 0.787598 0.803041 0.778636 0.785832 0.829446 0.819105
29 0.852327 0.814751 0.839825 0.817154 0.840744 0.839730 0.819181 0.842704 0.833108 0.843954 ... 0.784725 0.857058 0.806187 0.837527 0.817567 0.806701 0.837980 0.771634 0.847044 0.846029
30 0.794010 0.776042 0.755943 0.754476 0.827370 0.790795 0.733149 0.788210 0.750026 0.821269 ... 0.756161 0.876442 0.717054 0.805261 0.837379 0.780225 0.753993 0.744322 0.800286 0.814796
31 0.823972 0.813417 0.782432 0.775116 0.811153 0.810251 0.763265 0.833756 0.791667 0.811959 ... 0.796111 0.861117 0.752117 0.811259 0.759871 0.807596 0.790919 0.762348 0.827733 0.827300
32 0.878325 0.852549 0.866249 0.841610 0.862310 0.875473 0.855769 0.872726 0.864158 0.875575 ... 0.803574 0.878432 0.769746 0.864909 0.787805 0.799803 0.833120 0.746661 0.883848 0.849817
33 0.883584 0.852411 0.845208 0.821051 0.875078 0.886667 0.807734 0.857516 0.837904 0.877588 ... 0.794945 0.866058 0.759273 0.848622 0.798370 0.851570 0.853594 0.754632 0.879834 0.888921
34 0.675786 0.661941 0.672636 0.671671 0.700025 0.663805 0.654760 0.720069 0.656132 0.667695 ... 0.823057 0.785372 0.758352 0.792715 0.756819 0.732708 0.709800 0.786138 0.674175 0.726947
35 0.810933 0.773532 0.763758 0.756611 0.822137 0.789180 0.739294 0.791340 0.764288 0.814760 ... 0.752378 0.885662 0.722331 0.812275 0.863883 0.804854 0.787881 0.757975 0.804107 0.825393
36 0.844746 0.831847 0.830006 0.822755 0.857587 0.844472 0.792833 0.853440 0.813177 0.853282 ... 0.822397 0.904533 0.784227 0.852288 0.821485 0.847995 0.818471 0.805294 0.845315 0.849890
37 0.824112 0.780643 0.765167 0.757907 0.814186 0.805177 0.750204 0.798305 0.784583 0.826856 ... 0.732387 0.864541 0.679961 0.799603 0.799258 0.766278 0.765276 0.691883 0.825864 0.812077
38 0.765444 0.759691 0.782322 0.748915 0.764829 0.766151 0.741087 0.792770 0.748863 0.761727 ... 0.791290 0.807429 0.776408 0.797459 0.774330 0.787869 0.759214 0.814736 0.757332 0.777322
39 0.841180 0.842735 0.795105 0.789407 0.812744 0.834937 0.781318 0.857953 0.821731 0.821977 ... 0.793983 0.853737 0.748958 0.811971 0.724716 0.813735 0.795875 0.747016 0.847919 0.821548
40 0.661117 0.601695 0.645461 0.626393 0.674573 0.623891 0.609137 0.665443 0.603241 0.663814 ... 0.661901 0.745358 0.657060 0.713643 0.785833 0.696702 0.673876 0.738064 0.640975 0.710919
41 0.753643 0.745934 0.777722 0.751371 0.776517 0.751814 0.733240 0.787569 0.724689 0.771514 ... 0.780559 0.830094 0.754532 0.793743 0.770456 0.755606 0.735875 0.800807 0.752646 0.760544
42 0.728488 0.710539 0.732329 0.718885 0.723947 0.712334 0.703711 0.757266 0.698572 0.712019 ... 0.768337 0.806658 0.818925 0.771588 0.763566 0.746599 0.733842 0.828015 0.708562 0.748552
43 0.880814 0.860635 0.835784 0.835429 0.860004 0.869494 0.842492 0.876014 0.862838 0.867116 ... 0.807759 0.890710 0.774815 0.864797 0.781265 0.807796 0.828113 0.729236 0.891353 0.874514
44 0.705313 0.679096 0.705116 0.679263 0.714611 0.681784 0.660269 0.728331 0.681556 0.704017 ... 0.727093 0.783006 0.706936 0.744437 0.797124 0.744491 0.711070 0.772361 0.692165 0.748060
45 0.774057 0.784384 0.786636 0.780414 0.782834 0.783759 0.752524 0.799398 0.761283 0.765622 ... 1.000000 0.840797 0.809486 0.882053 0.723502 0.768755 0.783202 0.751529 0.783454 0.811948
46 0.861176 0.841174 0.821484 0.825479 0.862465 0.849392 0.796356 0.859184 0.827417 0.859278 ... 0.840797 1.000000 0.788950 0.880254 0.833002 0.833217 0.814247 0.782475 0.861600 0.861988
47 0.731567 0.729602 0.770126 0.750377 0.739062 0.744431 0.742264 0.767273 0.731610 0.729147 ... 0.809486 0.788950 1.000000 0.797582 0.748504 0.738363 0.751753 0.789740 0.724311 0.776450
48 0.844309 0.828479 0.842209 0.848413 0.856936 0.842141 0.828274 0.847824 0.825890 0.838511 ... 0.882053 0.880254 0.797582 1.000000 0.767375 0.806944 0.838058 0.758050 0.859021 0.857034
49 0.757781 0.717505 0.727968 0.720307 0.783967 0.746043 0.704168 0.746111 0.745811 0.755803 ... 0.723502 0.833002 0.748504 0.767375 1.000000 0.763742 0.751773 0.760735 0.741123 0.794523
50 0.843069 0.830369 0.815429 0.813878 0.854642 0.840681 0.770870 0.826701 0.822658 0.839939 ... 0.768755 0.833217 0.738363 0.806944 0.763742 1.000000 0.866217 0.757738 0.840602 0.871413
51 0.879384 0.836896 0.854184 0.852632 0.876750 0.871048 0.818364 0.843524 0.854792 0.859853 ... 0.783202 0.814247 0.751753 0.838058 0.751773 0.866217 1.000000 0.729374 0.869848 0.869194
52 0.712273 0.694544 0.710397 0.694705 0.717906 0.702883 0.670490 0.740893 0.682048 0.695509 ... 0.751529 0.782475 0.789740 0.758050 0.760735 0.757738 0.729374 1.000000 0.696812 0.737664
53 0.938602 0.907940 0.888556 0.899165 0.910563 0.931814 0.864791 0.896333 0.910304 0.920241 ... 0.783454 0.861600 0.724311 0.859021 0.741123 0.840602 0.869848 0.696812 1.000000 0.871567
54 0.866187 0.852431 0.842170 0.838796 0.878084 0.865160 0.816389 0.845723 0.840261 0.863758 ... 0.811948 0.861988 0.776450 0.857034 0.794523 0.871413 0.869194 0.737664 0.871567 1.000000

55 rows × 55 columns
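
As a rough illustration of what cosine_similarity computes, the sketch below reproduces it by hand on a small made-up matrix and then looks up the nearest neighbour of each row. The array X and the names manual_sim, sk_sim, masked, and nearest are assumptions for this example only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document vectors (one row per document), for illustration only
X = np.array([
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],   # same direction as row 0
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])

# Cosine similarity is the dot product of the L2-normalized rows
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
manual_sim = X_unit @ X_unit.T

sk_sim = cosine_similarity(X)
print(np.allclose(manual_sim, sk_sim))

# Most similar *other* document for each row: mask the diagonal first,
# otherwise every document is its own nearest neighbour
masked = sk_sim.copy()
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)
print(nearest)
```

The same masking trick should carry over to similarity_matrix above if you want the closest text for each of the 55 documents. Because TfidfVectorizer L2-normalizes its rows by default, the cosine similarity of those rows reduces to a plain dot product.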

%load_ext rpy2.ipython
%%R -i similarity_matrix
library(dplyr)
#library(ggplot2)

head(similarity_matrix)
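# Note: as.dist() treats the values as dissimilarities, so the cosine similarities
# are clustered as if they were distances; 1 - similarity_matrix gives a cosine distance
hclust(as.dist(similarity_matrix), method="ward.D2") %>%
  plot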
R[write to console]: 
Attaching package: ‘dplyr’


R[write to console]: The following objects are masked from ‘package:stats’:

    filter, lag


R[write to console]: The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union
../_images/4ce3cb4f31f86016d8d7d42ed4a7d355f98577948795370079a43f31caaa4ccb.png
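from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage. Note that linkage() treats each row of a 2-D array as an
# observation vector, so the similarity matrix is not read as precomputed distances
Z = linkage(similarity_matrix, 'ward')
pd.DataFrame(Z)  # not displayed: only the last expression of a cell is shown

# Draw the dendrogram; the call also returns a dict of plot coordinates
dendrogram(Z)
# I find this plot less readable than the R version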
{'icoord': [...], 'dcoord': [...], 'ivl': [...], 'leaves': [...], 'color_list': [...]}
(Return value of dendrogram(), abbreviated. 'icoord' and 'dcoord' hold the coordinates of the 54 links; the final merge is at height 3.1556. 'color_list' assigns 'g' to 12 links, 'r' to 41, and 'b' to 1. 'ivl' repeats 'leaves' as strings. Leaf order, left to right: 38, 41, 45, 47, 23, 49, 17, 21, 40, 44, 34, 42, 52, 7, 5, 0, 53, 15, 4, 9, 6, 14, 16, 1, 11, 12, 2, 3, 10, 8, 13, 18, 26, 22, 27, 37, 24, 30, 35, 50, 51, 25, 20, 48, 39, 28, 31, 36, 46, 19, 29, 32, 43, 33, 54)
../_images/ffeb73daeaadb812de7b090cd24a2731617491e90c210c494d4f008474bbfe98.png
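
One possible way to cluster on cosine distance in Python is sketched below. It converts a similarity matrix into distances, passes the condensed form to linkage(), and cuts the tree into flat clusters. The 4x4 matrix sim and the names dist, condensed, Z_dist, labels, and leaf_order are hypothetical. Average linkage is swapped in for Ward here, since SciPy documents Ward as well defined only for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical similarity matrix: documents 0-1 and 2-3 form two tight pairs
sim = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

# Turn similarities into distances and force an exactly zero diagonal
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

# linkage() reads a 1-D input as the condensed (upper-triangle) distance matrix
condensed = squareform(dist, checks=True)

# Average linkage on cosine distance (an assumption of this sketch, not the
# method used in the cells above)
Z_dist = linkage(condensed, method="average")

# Cut the tree into two flat clusters
labels = fcluster(Z_dist, t=2, criterion="maxclust")
print(labels)

# Leaf order of the dendrogram, computed without drawing it
leaf_order = dendrogram(Z_dist, no_plot=True)["leaves"]
print(leaf_order)
```

Applied to the corpus above, the same steps would start from 1 - similarity_matrix. Passing labels=corpus_id to dendrogram() should replace the row numbers on the axis with the Brown file IDs, which makes the plot easier to compare with the genre categories.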