Vectorizing Texts
Feature engineering for text representation
import nltk
from nltk.corpus import brown
brown.categories()
['adventure',
'belles_lettres',
'editorial',
'fiction',
'government',
'hobbies',
'humor',
'learned',
'lore',
'mystery',
'news',
'religion',
'reviews',
'romance',
'science_fiction']
corpus_id = brown.fileids(categories=['reviews','fiction','humor'])
corpus_text = [' '.join(brown.words(fileids=cid)) for cid in corpus_id]
print(corpus_text[0][:100])
corpus_cat = [brown.categories(fileids=cid)[0] for cid in corpus_id]
print(len(corpus_text))
print(len(corpus_id))
print(len(corpus_cat))
type(corpus_id)
It is not news that Nathan Milstein is a wizard of the violin . Certainly not in Orchestra Hall wher
55
55
55
list
import numpy as np
import pandas as pd
import re
assert len(corpus_text)==len(corpus_cat)
corpus_df = pd.DataFrame({'Text': corpus_text, 'Category': corpus_cat, 'ID': corpus_id})
corpus_df
| | Text | Category | ID |
---|---|---|---|
0 | It is not news that Nathan Milstein is a wizar... | reviews | cc01 |
1 | Television has yet to work out a living arrang... | reviews | cc02 |
2 | Francois D'Albert , Hungarian-born violinist w... | reviews | cc03 |
3 | The Theatre-by-the-Sea , Matunuck , presents `... | reviews | cc04 |
4 | The superb intellectual and spiritual vitality... | reviews | cc05 |
5 | George Kennan's account of relations between R... | reviews | cc06 |
6 | Some of the New York Philharmonic musicians wh... | reviews | cc07 |
7 | Had a funny experience at Newport yesterday af... | reviews | cc08 |
8 | Murray Louis and his dance company appeared at... | reviews | cc09 |
9 | Ring Of Bright Water , by Gavin Maxwell . 211 ... | reviews | cc10 |
10 | Mischa Elman shared last night's Lewisohn Stad... | reviews | cc11 |
11 | Radio is easily outdistancing television in it... | reviews | cc12 |
12 | A tribe in ancient India believed the earth wa... | reviews | cc13 |
13 | Elisabeth Schwarzkopf sang so magnificently Sa... | reviews | cc14 |
14 | As autumn starts its annual sweep , few Americ... | reviews | cc15 |
15 | A year ago it was bruited that the primary cha... | reviews | cc16 |
16 | The reading public , the theatergoing public ,... | reviews | cc17 |
17 | Thirty-three Scotty did not go back to school ... | fiction | ck01 |
18 | Where their sharp edges seemed restless as sea... | fiction | ck02 |
19 | Mickie sat over his second whisky-on-the-rocks... | fiction | ck03 |
20 | The Bishop looked at him coldly and said , `` ... | fiction | ck04 |
21 | Payne dismounted in Madison Place and handed t... | fiction | ck05 |
22 | With a sneer , the man spread his legs and , a... | fiction | ck06 |
23 | If the crummy bastard could write ! ! That's h... | fiction | ck07 |
24 | Rousseau is so persuasive that Voltaire is alm... | fiction | ck08 |
25 | It was the first time any of us had laughed si... | fiction | ck09 |
26 | That summer the gambling houses were closed , ... | fiction | ck10 |
27 | Standing in the shelter of the tent -- a rejec... | fiction | ck11 |
28 | She was a child too much a part of her environ... | fiction | ck12 |
29 | In the dim underwater light they dressed and s... | fiction | ck13 |
30 | He brought with him a mixture of myrrh and alo... | fiction | ck14 |
31 | Beth was very still and her breath came in sma... | fiction | ck15 |
32 | The red glow from the cove had died out of the... | fiction | ck16 |
33 | Burly leathered men and wrinkled women in drab... | fiction | ck17 |
34 | She was getting real dramatic . I'd have been ... | fiction | ck18 |
35 | There was one fact which Rector could not over... | fiction | ck19 |
36 | She concluded by asking him to name another ho... | fiction | ck20 |
37 | Beckworth handed the pass to the colonel . He ... | fiction | ck21 |
38 | I would not want to be one of those writers wh... | fiction | ck22 |
39 | It was not as though she noted clearly that he... | fiction | ck23 |
40 | His eyes were old and they never saw well , bu... | fiction | ck24 |
41 | He was in his mid-fifties at this time , long ... | fiction | ck25 |
42 | But they all said , `` No , your time will com... | fiction | ck26 |
43 | `` But tell me , doctor , where do you plan to... | fiction | ck27 |
44 | Going downstairs with the tray , Winston wishe... | fiction | ck28 |
45 | Was it love ? ? I had no doubt that it was . D... | fiction | ck29 |
46 | It was among these that Hinkle identified a ph... | humor | cr01 |
47 | I realized that Hamlet was faced with an entir... | humor | cr02 |
48 | Needless to say , I was furious at this unpara... | humor | cr03 |
49 | Up to date , however , his garden was still mo... | humor | cr04 |
50 | Ambiguity Nothing in English has been ridicule... | humor | cr05 |
51 | I called the other afternoon on my old friend ... | humor | cr06 |
52 | One day , the children had wanted to get up on... | humor | cr07 |
53 | Pueri aquam de silvas ad agricolas portant , a... | humor | cr08 |
54 | Dear Sirs : Let me begin by clearing up any po... | humor | cr09 |
## Clean up texts
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
## function
def normalize_text(text):
    ## remove special characters
    ## (flags must be passed by keyword; the fourth positional argument of re.sub is count)
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I|re.A)
    text = text.lower().strip()
    tokens = wpt.tokenize(text)
    ## filtering
    #tokens_filtered = [w for w in tokens if w not in stop_words and re.search(r'\D+', w)]
    tokens_filtered = tokens
    text_output = ' '.join(tokens_filtered)
    return text_output
## vectorize function
normalize_corpus= np.vectorize(normalize_text)
corpus_norm = normalize_corpus(corpus_text)
corpus_norm[1][:200]
'television has yet to work out a living arrangement with jazz which comes to the medium more as an uneasy guest than as a relaxed member of the family there seems to be an unfortunate assumption that '
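The stop-word filter in `normalize_text` is commented out, so the normalized texts still contain function words. A minimal sketch of what the filtering step would do, using only the standard library: the stop-word set `TOY_STOP_WORDS` and the function name `normalize_and_filter` are hypothetical, and whitespace splitting stands in for `WordPunctTokenizer`.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook
# loads the full list from nltk.corpus.stopwords.words('english').
TOY_STOP_WORDS = {"is", "a", "of", "the", "it", "not", "that"}

def normalize_and_filter(text, stop_words=TOY_STOP_WORDS):
    # Keep ASCII letters and whitespace only (flags passed by keyword).
    text = re.sub(r"[^a-zA-Z\s]", "", text, flags=re.A)
    # Lowercase and split on whitespace (a stand-in for WordPunctTokenizer).
    tokens = text.lower().split()
    # Drop stop words and rejoin into a single string.
    return " ".join(w for w in tokens if w not in stop_words)

example = "It is not news that Nathan Milstein is a wizard of the violin ."
print(normalize_and_filter(example))
# news nathan milstein wizard violin
```

Whether to remove stop words depends on the task: for genre classification, function-word frequencies can themselves be informative, which may be why the filter is disabled here.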
Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=0.2, max_df=1.)
cv_matrix = cv.fit_transform(corpus_norm)
cv_matrix
## view the array
type(cv_matrix)
cv_matrix = cv_matrix.toarray()
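## get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (scikit-learn >= 1.0)
vocab = cv.get_feature_names_out()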
boa_unigram = pd.DataFrame(cv_matrix, columns=vocab)
boa_unigram
| | able | about | above | across | act | added | after | afternoon | again | against | ... | written | wrote | year | years | yes | yet | york | you | young | your |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 |
1 | 0 | 3 | 0 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 4 | 0 | 2 | 1 | 8 | 1 | 4 |
2 | 1 | 3 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 5 | 0 | 0 | 3 | 4 | 2 | 1 |
3 | 0 | 2 | 0 | 0 | 4 | 0 | 2 | 0 | 3 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 2 | 1 |
4 | 1 | 4 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 6 | 0 | 2 |
5 | 0 | 11 | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 6 | 0 | 0 |
6 | 1 | 1 | 1 | 0 | 1 | 1 | 6 | 2 | 2 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 2 | 0 | 2 | 0 |
7 | 1 | 10 | 0 | 0 | 0 | 0 | 4 | 4 | 3 | 0 | ... | 0 | 1 | 3 | 1 | 0 | 0 | 0 | 3 | 1 | 2 |
8 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | ... | 2 | 1 | 0 | 1 | 0 | 2 | 6 | 1 | 2 | 0 |
9 | 0 | 1 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | ... | 0 | 0 | 2 | 4 | 0 | 1 | 2 | 4 | 1 | 0 |
10 | 1 | 6 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 2 | ... | 0 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 1 | 0 |
11 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 3 | 4 | 0 | 1 | 5 | 0 | 2 | 0 |
12 | 2 | 4 | 0 | 0 | 3 | 0 | 2 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 6 | 0 | 0 | 2 | 1 | 1 | 1 |
13 | 0 | 3 | 1 | 0 | 1 | 0 | 3 | 0 | 1 | 0 | ... | 4 | 0 | 0 | 1 | 0 | 0 | 3 | 0 | 1 | 0 |
14 | 0 | 5 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 4 | 0 | 0 | 0 | 3 | 2 | 1 | 0 |
15 | 1 | 9 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 1 | ... | 1 | 0 | 1 | 2 | 0 | 1 | 1 | 0 | 0 | 0 |
16 | 0 | 4 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 5 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
17 | 1 | 11 | 0 | 2 | 0 | 0 | 3 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 16 | 1 | 0 |
18 | 0 | 9 | 2 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 9 | 0 | 2 |
19 | 0 | 10 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 22 | 3 | 2 |
20 | 0 | 10 | 0 | 3 | 0 | 0 | 3 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 8 | 0 | 1 | 0 | 10 | 2 | 1 |
21 | 1 | 6 | 2 | 1 | 3 | 0 | 3 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 6 | 1 | 0 |
22 | 0 | 1 | 0 | 3 | 0 | 0 | 1 | 1 | 5 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 11 | 0 | 2 |
23 | 0 | 5 | 1 | 0 | 0 | 0 | 1 | 0 | 6 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 11 | 1 | 2 |
24 | 0 | 4 | 0 | 0 | 0 | 0 | 6 | 0 | 2 | 2 | ... | 1 | 1 | 0 | 2 | 4 | 0 | 0 | 13 | 0 | 4 |
25 | 0 | 9 | 1 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 6 | 1 | 2 |
26 | 0 | 1 | 2 | 0 | 0 | 0 | 3 | 0 | 2 | 2 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 13 | 0 | 2 |
27 | 0 | 3 | 2 | 4 | 0 | 0 | 3 | 1 | 7 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 4 | 0 |
28 | 0 | 5 | 0 | 2 | 0 | 0 | 1 | 0 | 3 | 0 | ... | 0 | 0 | 0 | 1 | 3 | 2 | 0 | 1 | 0 | 1 |
29 | 1 | 5 | 0 | 2 | 1 | 0 | 2 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 1 | 0 |
30 | 1 | 4 | 0 | 0 | 0 | 1 | 4 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 5 | 0 | 8 | 3 | 1 |
31 | 0 | 6 | 0 | 1 | 0 | 0 | 5 | 2 | 3 | 4 | ... | 1 | 4 | 0 | 1 | 1 | 0 | 0 | 11 | 2 | 4 |
32 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 1 | 3 | 1 | 0 | 0 |
33 | 1 | 4 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | ... | 0 | 0 | 0 | 3 | 2 | 0 | 0 | 21 | 0 | 4 |
34 | 0 | 16 | 1 | 1 | 1 | 1 | 2 | 0 | 2 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 6 | 0 | 0 |
35 | 2 | 12 | 0 | 0 | 0 | 1 | 2 | 2 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 13 | 0 | 0 |
36 | 0 | 7 | 0 | 0 | 1 | 0 | 0 | 0 | 3 | 2 | ... | 0 | 3 | 0 | 1 | 0 | 2 | 0 | 12 | 0 | 2 |
37 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 1 | 3 | 4 | ... | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 9 | 0 | 0 |
38 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 2 | 0 | 1 | ... | 0 | 0 | 3 | 2 | 0 | 0 | 1 | 14 | 0 | 5 |
39 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 5 | 2 | ... | 0 | 1 | 0 | 7 | 1 | 0 | 0 | 8 | 2 | 1 |
40 | 0 | 12 | 0 | 0 | 0 | 0 | 3 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 21 | 0 | 4 |
41 | 0 | 1 | 2 | 1 | 0 | 0 | 4 | 1 | 5 | 0 | ... | 0 | 0 | 0 | 6 | 2 | 2 | 0 | 0 | 1 | 0 |
42 | 0 | 12 | 0 | 0 | 0 | 0 | 3 | 1 | 2 | 2 | ... | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 9 | 0 | 1 |
43 | 0 | 2 | 0 | 2 | 0 | 0 | 2 | 1 | 2 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 5 | 1 | 0 |
44 | 0 | 9 | 0 | 1 | 0 | 1 | 4 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 5 | 0 | 0 | 0 | 30 | 3 | 2 |
45 | 0 | 4 | 0 | 1 | 1 | 0 | 2 | 1 | 3 | 0 | ... | 0 | 1 | 2 | 4 | 0 | 2 | 1 | 1 | 1 | 0 |
46 | 0 | 2 | 0 | 0 | 0 | 0 | 4 | 0 | 3 | 0 | ... | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 5 | 0 | 1 |
47 | 1 | 6 | 0 | 0 | 0 | 1 | 8 | 1 | 2 | 1 | ... | 0 | 0 | 2 | 1 | 1 | 0 | 2 | 11 | 0 | 2 |
48 | 0 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | ... | 0 | 3 | 1 | 1 | 0 | 2 | 0 | 1 | 3 | 0 |
49 | 2 | 6 | 1 | 0 | 0 | 0 | 5 | 1 | 2 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 27 | 0 | 1 |
50 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 3 | 2 | 1 | 1 | 0 | 26 | 0 | 8 |
51 | 1 | 7 | 0 | 0 | 0 | 0 | 2 | 1 | 2 | 2 | ... | 1 | 3 | 1 | 9 | 1 | 0 | 0 | 20 | 0 | 3 |
52 | 1 | 5 | 0 | 0 | 0 | 1 | 2 | 0 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 20 | 0 | 5 |
53 | 0 | 4 | 0 | 2 | 0 | 0 | 5 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 2 | 0 | 1 | 2 | 2 | 0 | 1 |
54 | 1 | 2 | 2 | 1 | 0 | 1 | 1 | 0 | 0 | 2 | ... | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 45 | 1 | 17 |
55 rows × 739 columns
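To make the table above less of a black box, here is a rough standard-library sketch of what a bag-of-words matrix with a document-frequency cutoff contains. It is not how `CountVectorizer` is implemented: the toy corpus and the function `bag_of_words` are made up for illustration, tokens come from a plain whitespace split, and `min_df` is an absolute document count here. In the notebook, `min_df=0.2` is a float and is therefore read as a proportion of documents.

```python
from collections import Counter

# Toy corpus (made up for illustration; not taken from the Brown corpus).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat chased a dog",
]

def bag_of_words(docs, min_df=1):
    """Return (vocab, rows): sorted vocabulary and one count row per document."""
    counts = [Counter(d.split()) for d in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    # Keep only terms that occur in at least min_df documents.
    vocab = sorted(t for t, n in df.items() if n >= min_df)
    rows = [[c[t] for t in vocab] for c in counts]
    return vocab, rows

vocab, rows = bag_of_words(docs, min_df=2)
print(vocab)   # ['cat', 'dog', 'on', 'sat', 'the']
for row in rows:
    print(row)
# [1, 0, 1, 1, 2]
# [0, 1, 1, 1, 2]
# [1, 1, 0, 0, 0]
```

Each row is one document and each column one vocabulary term, which is the same layout as `boa_unigram`.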
More Complex Bag-of-Words
Filter features based on word classes
Include n-gram features
Bag of N-grams Model
## N-grams
cv_ngram = CountVectorizer(ngram_range=(1,3), min_df = 0.2)
cv_ngram_matrix = cv_ngram.fit_transform(corpus_norm)
boa_ngram = pd.DataFrame(cv_ngram_matrix.toarray(), columns = cv_ngram.get_feature_names_out())
boa_ngram.head()
| | able | able to | about | about it | about the | above | across | across the | act | added | ... | yet | york | you | you are | you can | you could | you have | you know | young | your |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 5 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 2 | 1 | 8 | 1 | 0 | 0 | 1 | 0 | 1 | 4 |
2 | 1 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 3 | 4 | 1 | 0 | 0 | 1 | 0 | 2 | 1 |
3 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
4 | 1 | 1 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 6 | 0 | 1 | 1 | 1 | 0 | 0 | 2 |
5 rows × 1193 columns
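With `ngram_range=(1,3)`, the features include every contiguous sequence of one, two, or three words, which is why columns such as `able to` and `you are` appear above. A small sketch of that enumeration, assuming the text is already tokenized; the helper name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "you are able to".split()
print(word_ngrams(tokens))
# ['you', 'are', 'able', 'to', 'you are', 'are able', 'able to',
#  'you are able', 'are able to']
```

The number of candidate features grows quickly with the upper bound of the range, so a document-frequency cutoff such as `min_df` matters more here than in the unigram model.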
TF-IDF Model
from sklearn.feature_extraction.text import TfidfTransformer
tf = TfidfTransformer(norm = 'l2', use_idf=True)
tf_matrix = tf.fit_transform(cv_matrix)
tfidf_unigram = pd.DataFrame(tf_matrix.toarray(), columns = cv.get_feature_names_out())
tfidf_unigram
| | able | about | above | across | act | added | after | afternoon | again | against | ... | written | wrote | year | years | yes | yet | york | you | young | your |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.009462 | 0.024754 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.005318 | 0.000000 | 0.005614 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.008088 | 0.000000 | 0.016540 | 0.000000 | 0.007161 |
1 | 0.000000 | 0.014533 | 0.000000 | 0.008444 | 0.022309 | 0.000000 | 0.005204 | 0.000000 | 0.005493 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.024528 | 0.000000 | 0.015829 | 0.009042 | 0.043160 | 0.007592 | 0.028028 |
2 | 0.009405 | 0.014763 | 0.000000 | 0.000000 | 0.011331 | 0.000000 | 0.015859 | 0.000000 | 0.000000 | 0.000000 | ... | 0.011331 | 0.000000 | 0.000000 | 0.031143 | 0.000000 | 0.000000 | 0.027553 | 0.021920 | 0.015424 | 0.007117 |
3 | 0.000000 | 0.010555 | 0.000000 | 0.000000 | 0.048609 | 0.000000 | 0.011339 | 0.000000 | 0.017954 | 0.000000 | ... | 0.012152 | 0.000000 | 0.000000 | 0.006680 | 0.000000 | 0.000000 | 0.019701 | 0.000000 | 0.016542 | 0.007634 |
4 | 0.009887 | 0.020690 | 0.009887 | 0.000000 | 0.000000 | 0.000000 | 0.005557 | 0.000000 | 0.005866 | 0.008821 | ... | 0.011910 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.008451 | 0.000000 | 0.034563 | 0.000000 | 0.014963 |
5 | 0.000000 | 0.046936 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.018335 | 0.000000 | 0.004839 | 0.007276 | ... | 0.000000 | 0.000000 | 0.008356 | 0.005401 | 0.000000 | 0.006971 | 0.000000 | 0.028511 | 0.000000 | 0.000000 |
6 | 0.009412 | 0.004924 | 0.009412 | 0.000000 | 0.011338 | 0.012071 | 0.031740 | 0.018824 | 0.011168 | 0.000000 | ... | 0.000000 | 0.011338 | 0.009644 | 0.000000 | 0.010704 | 0.000000 | 0.018382 | 0.000000 | 0.015434 | 0.000000 |
7 | 0.009725 | 0.050883 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.021865 | 0.038902 | 0.017310 | 0.000000 | ... | 0.000000 | 0.011716 | 0.029895 | 0.006441 | 0.000000 | 0.000000 | 0.000000 | 0.017000 | 0.007974 | 0.014719 |
8 | 0.008883 | 0.004648 | 0.008883 | 0.000000 | 0.000000 | 0.011393 | 0.000000 | 0.000000 | 0.010540 | 0.000000 | ... | 0.021403 | 0.010701 | 0.000000 | 0.005883 | 0.000000 | 0.015186 | 0.052047 | 0.005176 | 0.014567 | 0.000000 |
9 | 0.000000 | 0.004898 | 0.000000 | 0.008538 | 0.000000 | 0.000000 | 0.015786 | 0.000000 | 0.000000 | 0.008353 | ... | 0.000000 | 0.000000 | 0.019186 | 0.024800 | 0.000000 | 0.008003 | 0.018285 | 0.021820 | 0.007677 | 0.000000 |
10 | 0.009336 | 0.029307 | 0.009336 | 0.000000 | 0.000000 | 0.011973 | 0.005247 | 0.000000 | 0.005539 | 0.016659 | ... | 0.000000 | 0.000000 | 0.019132 | 0.012365 | 0.000000 | 0.007980 | 0.000000 | 0.000000 | 0.007655 | 0.000000 |
11 | 0.000000 | 0.013760 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.004927 | 0.000000 | 0.000000 | 0.007821 | ... | 0.000000 | 0.000000 | 0.026948 | 0.023223 | 0.000000 | 0.007493 | 0.042804 | 0.000000 | 0.014376 | 0.000000 |
12 | 0.017945 | 0.018778 | 0.000000 | 0.000000 | 0.032428 | 0.000000 | 0.010086 | 0.000000 | 0.005323 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.035652 | 0.000000 | 0.000000 | 0.017524 | 0.005228 | 0.007357 | 0.006790 |
13 | 0.000000 | 0.013423 | 0.008552 | 0.000000 | 0.010303 | 0.000000 | 0.014420 | 0.000000 | 0.005074 | 0.000000 | ... | 0.041210 | 0.000000 | 0.000000 | 0.005663 | 0.000000 | 0.000000 | 0.025054 | 0.000000 | 0.007012 | 0.000000 |
14 | 0.000000 | 0.026945 | 0.010300 | 0.009394 | 0.012409 | 0.000000 | 0.005789 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.042216 | 0.000000 | 0.000000 | 0.000000 | 0.030175 | 0.012003 | 0.008446 | 0.000000 |
15 | 0.009780 | 0.046052 | 0.009780 | 0.000000 | 0.000000 | 0.000000 | 0.010994 | 0.000000 | 0.005802 | 0.008725 | ... | 0.011782 | 0.000000 | 0.010021 | 0.012953 | 0.000000 | 0.008360 | 0.009550 | 0.000000 | 0.000000 | 0.000000 |
16 | 0.000000 | 0.020279 | 0.009690 | 0.000000 | 0.011674 | 0.000000 | 0.005446 | 0.000000 | 0.005749 | 0.000000 | ... | 0.000000 | 0.000000 | 0.049644 | 0.012834 | 0.000000 | 0.000000 | 0.009463 | 0.000000 | 0.000000 | 0.000000 |
17 | 0.009886 | 0.056898 | 0.000000 | 0.018032 | 0.000000 | 0.000000 | 0.016670 | 0.019773 | 0.011731 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.033802 | 0.000000 | 0.092167 | 0.008106 | 0.000000 |
18 | 0.000000 | 0.044869 | 0.019058 | 0.000000 | 0.000000 | 0.000000 | 0.005356 | 0.009529 | 0.011307 | 0.017003 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.024435 | 0.000000 | 0.049969 | 0.000000 | 0.014422 |
19 | 0.000000 | 0.049857 | 0.009529 | 0.017381 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.005654 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.006311 | 0.000000 | 0.008145 | 0.018611 | 0.122152 | 0.023440 | 0.014423 |
20 | 0.000000 | 0.046828 | 0.000000 | 0.024488 | 0.000000 | 0.000000 | 0.015092 | 0.000000 | 0.010620 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.047418 | 0.000000 | 0.007650 | 0.000000 | 0.052150 | 0.014677 | 0.006773 |
21 | 0.009171 | 0.028791 | 0.018343 | 0.008364 | 0.033146 | 0.000000 | 0.015464 | 0.000000 | 0.010883 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015679 | 0.008956 | 0.032063 | 0.007520 | 0.000000 |
22 | 0.000000 | 0.004591 | 0.000000 | 0.024006 | 0.000000 | 0.000000 | 0.004932 | 0.008774 | 0.026028 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015000 | 0.008568 | 0.056236 | 0.000000 | 0.013280 |
23 | 0.000000 | 0.027181 | 0.010390 | 0.000000 | 0.000000 | 0.000000 | 0.005840 | 0.000000 | 0.036987 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.008881 | 0.000000 | 0.066595 | 0.008519 | 0.015726 |
24 | 0.000000 | 0.019541 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.031488 | 0.000000 | 0.011079 | 0.016661 | ... | 0.011248 | 0.011248 | 0.000000 | 0.012367 | 0.042476 | 0.000000 | 0.000000 | 0.070725 | 0.000000 | 0.028264 |
25 | 0.000000 | 0.041602 | 0.008835 | 0.008057 | 0.000000 | 0.000000 | 0.014897 | 0.000000 | 0.000000 | 0.007882 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015104 | 0.000000 | 0.030887 | 0.007244 | 0.013372 |
26 | 0.000000 | 0.004729 | 0.018077 | 0.000000 | 0.000000 | 0.000000 | 0.015240 | 0.000000 | 0.010725 | 0.016128 | ... | 0.010889 | 0.000000 | 0.000000 | 0.005986 | 0.000000 | 0.000000 | 0.000000 | 0.068463 | 0.000000 | 0.013680 |
27 | 0.000000 | 0.012572 | 0.016019 | 0.029219 | 0.000000 | 0.000000 | 0.013506 | 0.008010 | 0.033264 | 0.007146 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.018668 | 0.026270 | 0.000000 |
28 | 0.000000 | 0.026287 | 0.000000 | 0.018328 | 0.000000 | 0.000000 | 0.005648 | 0.000000 | 0.017885 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.006655 | 0.034284 | 0.017178 | 0.000000 | 0.005855 | 0.000000 | 0.007604 |
29 | 0.008547 | 0.022359 | 0.000000 | 0.015590 | 0.010297 | 0.000000 | 0.009608 | 0.008547 | 0.005071 | 0.007626 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.034861 | 0.007008 | 0.000000 |
30 | 0.009866 | 0.020647 | 0.000000 | 0.000000 | 0.000000 | 0.012653 | 0.022180 | 0.009866 | 0.005853 | 0.000000 | ... | 0.000000 | 0.000000 | 0.010109 | 0.006533 | 0.000000 | 0.042164 | 0.000000 | 0.045987 | 0.024268 | 0.007466 |
31 | 0.000000 | 0.028949 | 0.000000 | 0.008410 | 0.000000 | 0.000000 | 0.025916 | 0.018444 | 0.016414 | 0.032910 | ... | 0.011109 | 0.044438 | 0.000000 | 0.006107 | 0.010488 | 0.000000 | 0.000000 | 0.059105 | 0.015123 | 0.027914 |
32 | 0.000000 | 0.000000 | 0.000000 | 0.016363 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010645 | 0.000000 | ... | 0.010807 | 0.000000 | 0.009192 | 0.005941 | 0.010203 | 0.007668 | 0.026281 | 0.005227 | 0.000000 | 0.000000 |
33 | 0.010317 | 0.021592 | 0.000000 | 0.009409 | 0.000000 | 0.000000 | 0.000000 | 0.010317 | 0.012242 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.020497 | 0.023467 | 0.000000 | 0.000000 | 0.126240 | 0.000000 | 0.031230 |
34 | 0.000000 | 0.092212 | 0.011015 | 0.010046 | 0.013270 | 0.014127 | 0.012382 | 0.000000 | 0.013071 | 0.000000 | ... | 0.000000 | 0.013270 | 0.000000 | 0.007295 | 0.000000 | 0.000000 | 0.010757 | 0.038510 | 0.000000 | 0.000000 |
35 | 0.016857 | 0.052918 | 0.000000 | 0.000000 | 0.000000 | 0.010810 | 0.009475 | 0.016857 | 0.000000 | 0.007520 | ... | 0.010154 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007204 | 0.000000 | 0.063843 | 0.000000 | 0.000000 |
36 | 0.000000 | 0.036244 | 0.000000 | 0.000000 | 0.011922 | 0.000000 | 0.000000 | 0.000000 | 0.017614 | 0.017659 | ... | 0.000000 | 0.035766 | 0.000000 | 0.006554 | 0.000000 | 0.016918 | 0.000000 | 0.069194 | 0.000000 | 0.014978 |
37 | 0.000000 | 0.000000 | 0.000000 | 0.015838 | 0.000000 | 0.000000 | 0.004880 | 0.008683 | 0.015455 | 0.030987 | ... | 0.010460 | 0.000000 | 0.000000 | 0.000000 | 0.039500 | 0.000000 | 0.000000 | 0.045534 | 0.000000 | 0.000000 |
38 | 0.000000 | 0.005114 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.021975 | 0.019548 | 0.000000 | 0.008720 | ... | 0.000000 | 0.000000 | 0.030045 | 0.012946 | 0.000000 | 0.000000 | 0.009545 | 0.079731 | 0.000000 | 0.036983 |
39 | 0.000000 | 0.008647 | 0.008264 | 0.007536 | 0.000000 | 0.010598 | 0.000000 | 0.000000 | 0.024514 | 0.014745 | ... | 0.000000 | 0.009955 | 0.000000 | 0.038308 | 0.009398 | 0.000000 | 0.000000 | 0.038519 | 0.013551 | 0.006254 |
40 | 0.000000 | 0.062135 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.016687 | 0.000000 | 0.011743 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.006554 | 0.011255 | 0.000000 | 0.000000 | 0.121093 | 0.000000 | 0.029957 |
41 | 0.000000 | 0.005035 | 0.019245 | 0.008776 | 0.000000 | 0.000000 | 0.021634 | 0.009623 | 0.028545 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.038235 | 0.021887 | 0.016450 | 0.000000 | 0.000000 | 0.007890 | 0.000000 |
42 | 0.000000 | 0.062556 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.016800 | 0.009964 | 0.011823 | 0.017779 | ... | 0.000000 | 0.000000 | 0.020418 | 0.006598 | 0.011331 | 0.000000 | 0.000000 | 0.052249 | 0.000000 | 0.007540 |
43 | 0.000000 | 0.008408 | 0.000000 | 0.014657 | 0.000000 | 0.000000 | 0.009033 | 0.008036 | 0.009535 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.005321 | 0.000000 | 0.000000 | 0.000000 | 0.023410 | 0.006589 | 0.000000 |
44 | 0.000000 | 0.045169 | 0.000000 | 0.008748 | 0.000000 | 0.012303 | 0.021566 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.011556 | 0.000000 | 0.031763 | 0.000000 | 0.000000 | 0.000000 | 0.167676 | 0.023596 | 0.014518 |
45 | 0.000000 | 0.020685 | 0.000000 | 0.009014 | 0.011907 | 0.000000 | 0.011111 | 0.009884 | 0.017592 | 0.000000 | ... | 0.000000 | 0.011907 | 0.020255 | 0.026182 | 0.000000 | 0.016897 | 0.009652 | 0.005759 | 0.008104 | 0.000000 |
46 | 0.000000 | 0.010388 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.022318 | 0.000000 | 0.017669 | 0.000000 | ... | 0.000000 | 0.000000 | 0.010171 | 0.013148 | 0.000000 | 0.000000 | 0.000000 | 0.028920 | 0.000000 | 0.007512 |
47 | 0.010698 | 0.033583 | 0.000000 | 0.000000 | 0.000000 | 0.013720 | 0.048102 | 0.010698 | 0.012694 | 0.009544 | ... | 0.000000 | 0.000000 | 0.021923 | 0.007085 | 0.012167 | 0.000000 | 0.020893 | 0.068566 | 0.000000 | 0.016191 |
48 | 0.000000 | 0.015427 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.005524 | 0.000000 | 0.005831 | 0.035076 | ... | 0.000000 | 0.035522 | 0.010071 | 0.006509 | 0.000000 | 0.016802 | 0.000000 | 0.005727 | 0.024177 | 0.000000 |
49 | 0.019850 | 0.031157 | 0.009925 | 0.000000 | 0.000000 | 0.000000 | 0.027892 | 0.009925 | 0.011777 | 0.008855 | ... | 0.000000 | 0.000000 | 0.000000 | 0.006573 | 0.000000 | 0.000000 | 0.000000 | 0.156141 | 0.000000 | 0.007511 |
50 | 0.000000 | 0.005730 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006155 | 0.000000 | 0.006497 | 0.000000 | ... | 0.000000 | 0.013193 | 0.033662 | 0.014504 | 0.012454 | 0.009361 | 0.000000 | 0.165899 | 0.000000 | 0.066297 |
51 | 0.009381 | 0.034356 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010545 | 0.009381 | 0.011131 | 0.016739 | ... | 0.011301 | 0.033903 | 0.009612 | 0.055911 | 0.010669 | 0.000000 | 0.000000 | 0.109317 | 0.000000 | 0.021297 |
52 | 0.010202 | 0.026688 | 0.000000 | 0.000000 | 0.000000 | 0.013084 | 0.011468 | 0.000000 | 0.018158 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.023205 | 0.000000 | 0.000000 | 0.118885 | 0.000000 | 0.038601 |
53 | 0.000000 | 0.018951 | 0.000000 | 0.016517 | 0.000000 | 0.000000 | 0.025448 | 0.000000 | 0.005373 | 0.008079 | ... | 0.000000 | 0.000000 | 0.000000 | 0.011994 | 0.000000 | 0.007740 | 0.017686 | 0.010553 | 0.000000 | 0.006853 |
54 | 0.010298 | 0.010776 | 0.020596 | 0.009392 | 0.000000 | 0.013208 | 0.005788 | 0.000000 | 0.000000 | 0.018376 | ... | 0.000000 | 0.000000 | 0.000000 | 0.020460 | 0.000000 | 0.000000 | 0.000000 | 0.270016 | 0.008444 | 0.132484 |
55 rows × 739 columns
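The weights above can be reproduced by hand. The sketch below assumes the `TfidfTransformer` defaults that the notebook leaves unchanged (`smooth_idf=True`, `sublinear_tf=False`): the inverse document frequency is ln((1 + n) / (1 + df)) + 1, each raw count is multiplied by it, and every row is then scaled to unit L2 length. The toy count matrix and the function name `tfidf_l2` are made up for illustration.

```python
import math

# Toy count matrix (rows = documents, columns = terms); values are made up.
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 2, 0],
]

def tfidf_l2(counts):
    """TF-IDF with smoothed idf and L2 row normalization."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2 norm of the row; fall back to 1.0 for an all-zero row.
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result

tfidf = tfidf_l2(counts)
for row in tfidf:
    print([round(v, 4) for v in row])
```

The first term occurs in every document, so its idf is exactly 1 and it gets no boost. The second and third terms each occur in one document and are weighted up. Because of the L2 step, values are comparable across documents of different lengths but do not sum to 1 within a row.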
tf_matrix2 = tf.fit_transform(cv_ngram_matrix)
tfidf_ngram = pd.DataFrame(tf_matrix2.toarray(), columns = cv_ngram.get_feature_names_out())
tfidf_ngram
| | able | able to | about | about it | about the | above | across | across the | act | added | ... | yet | york | you | you are | you can | you could | you have | you know | young | your |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.009217 | 0.000000 | 0.024112 | 0.000000 | 0.023146 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.007878 | 0.000000 | 0.016111 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006975 |
1 | 0.000000 | 0.000000 | 0.014143 | 0.000000 | 0.000000 | 0.000000 | 0.008217 | 0.000000 | 0.021710 | 0.000000 | ... | 0.015404 | 0.008799 | 0.042001 | 0.011192 | 0.000000 | 0.000000 | 0.010248 | 0.000000 | 0.007388 | 0.027275 |
2 | 0.009164 | 0.000000 | 0.014384 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.011040 | 0.000000 | ... | 0.000000 | 0.026847 | 0.021358 | 0.011383 | 0.000000 | 0.000000 | 0.010422 | 0.000000 | 0.015028 | 0.006935 |
3 | 0.000000 | 0.000000 | 0.010331 | 0.000000 | 0.008264 | 0.000000 | 0.000000 | 0.000000 | 0.047577 | 0.000000 | ... | 0.000000 | 0.019283 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.016191 | 0.007472 |
4 | 0.009664 | 0.011305 | 0.020224 | 0.011642 | 0.008089 | 0.009664 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.008260 | 0.000000 | 0.033784 | 0.000000 | 0.010990 | 0.012003 | 0.010990 | 0.000000 | 0.000000 | 0.014626 |
5 | 0.000000 | 0.000000 | 0.045861 | 0.009600 | 0.020010 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.006811 | 0.000000 | 0.027858 | 0.000000 | 0.009063 | 0.000000 | 0.009063 | 0.000000 | 0.000000 | 0.000000 |
6 | 0.009253 | 0.010825 | 0.004841 | 0.000000 | 0.000000 | 0.009253 | 0.000000 | 0.000000 | 0.011147 | 0.011867 | ... | 0.000000 | 0.018072 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015174 | 0.000000 |
7 | 0.009444 | 0.011048 | 0.049410 | 0.000000 | 0.015810 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.016508 | 0.000000 | 0.000000 | 0.000000 | 0.010740 | 0.000000 | 0.007743 | 0.014293 |
8 | 0.008653 | 0.000000 | 0.004527 | 0.000000 | 0.000000 | 0.008653 | 0.000000 | 0.000000 | 0.000000 | 0.011097 | ... | 0.014792 | 0.050696 | 0.005041 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014189 | 0.000000 |
9 | 0.000000 | 0.000000 | 0.004790 | 0.000000 | 0.007663 | 0.000000 | 0.008349 | 0.008940 | 0.000000 | 0.000000 | ... | 0.007826 | 0.017881 | 0.021338 | 0.000000 | 0.010412 | 0.000000 | 0.000000 | 0.000000 | 0.007507 | 0.000000 |
10 | 0.009146 | 0.000000 | 0.028710 | 0.000000 | 0.022966 | 0.009146 | 0.000000 | 0.000000 | 0.000000 | 0.011729 | ... | 0.007817 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007499 | 0.000000 |
11 | 0.000000 | 0.000000 | 0.013455 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.007327 | 0.041855 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014057 | 0.000000 |
12 | 0.017607 | 0.020598 | 0.018424 | 0.000000 | 0.014738 | 0.000000 | 0.000000 | 0.000000 | 0.031817 | 0.000000 | ... | 0.000000 | 0.017194 | 0.005130 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007218 | 0.006662 |
13 | 0.000000 | 0.000000 | 0.013175 | 0.000000 | 0.000000 | 0.008394 | 0.000000 | 0.000000 | 0.010112 | 0.000000 | ... | 0.000000 | 0.024591 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006883 | 0.000000 |
14 | 0.000000 | 0.000000 | 0.026481 | 0.000000 | 0.008473 | 0.010123 | 0.009232 | 0.009885 | 0.012195 | 0.000000 | ... | 0.000000 | 0.029655 | 0.011796 | 0.000000 | 0.011512 | 0.000000 | 0.000000 | 0.000000 | 0.008300 | 0.000000 |
15 | 0.009561 | 0.000000 | 0.045023 | 0.000000 | 0.008003 | 0.009561 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.008173 | 0.009337 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
16 | 0.000000 | 0.000000 | 0.019991 | 0.000000 | 0.000000 | 0.009552 | 0.000000 | 0.000000 | 0.011508 | 0.000000 | ... | 0.000000 | 0.009328 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
17 | 0.009487 | 0.000000 | 0.054599 | 0.022858 | 0.007941 | 0.000000 | 0.017304 | 0.018528 | 0.000000 | 0.000000 | ... | 0.032436 | 0.000000 | 0.088442 | 0.000000 | 0.000000 | 0.011784 | 0.000000 | 0.000000 | 0.007779 | 0.000000 |
18 | 0.000000 | 0.000000 | 0.043551 | 0.000000 | 0.015484 | 0.018498 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.023717 | 0.000000 | 0.048500 | 0.011488 | 0.010519 | 0.000000 | 0.010519 | 0.000000 | 0.000000 | 0.013998 |
19 | 0.000000 | 0.000000 | 0.048336 | 0.000000 | 0.015466 | 0.009239 | 0.016851 | 0.018043 | 0.000000 | 0.000000 | ... | 0.007897 | 0.018043 | 0.118424 | 0.000000 | 0.000000 | 0.022951 | 0.021014 | 0.010507 | 0.022725 | 0.013983 |
20 | 0.000000 | 0.000000 | 0.045488 | 0.041896 | 0.000000 | 0.000000 | 0.023787 | 0.025470 | 0.000000 | 0.000000 | ... | 0.007432 | 0.000000 | 0.050658 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.009888 | 0.014257 | 0.006579 |
21 | 0.008779 | 0.010270 | 0.027558 | 0.010576 | 0.000000 | 0.017557 | 0.008006 | 0.000000 | 0.031727 | 0.000000 | ... | 0.015008 | 0.008573 | 0.030690 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007198 | 0.000000 |
22 | 0.000000 | 0.000000 | 0.004498 | 0.000000 | 0.000000 | 0.000000 | 0.023519 | 0.025183 | 0.000000 | 0.000000 | ... | 0.014696 | 0.008394 | 0.055096 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013011 |
23 | 0.000000 | 0.000000 | 0.026297 | 0.012110 | 0.000000 | 0.010052 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.008592 | 0.000000 | 0.064428 | 0.000000 | 0.022865 | 0.000000 | 0.000000 | 0.011432 | 0.008242 | 0.015214 |
24 | 0.000000 | 0.000000 | 0.018921 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.068481 | 0.011230 | 0.010282 | 0.011230 | 0.000000 | 0.000000 | 0.000000 | 0.027367 |
25 | 0.000000 | 0.000000 | 0.040605 | 0.020777 | 0.021654 | 0.008623 | 0.007864 | 0.008421 | 0.000000 | 0.000000 | ... | 0.014742 | 0.000000 | 0.030146 | 0.000000 | 0.000000 | 0.021422 | 0.009807 | 0.000000 | 0.007070 | 0.013051 |
26 | 0.000000 | 0.000000 | 0.004620 | 0.000000 | 0.000000 | 0.017660 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.066884 | 0.010968 | 0.000000 | 0.000000 | 0.000000 | 0.010042 | 0.000000 | 0.013364 |
27 | 0.000000 | 0.000000 | 0.012231 | 0.000000 | 0.000000 | 0.015585 | 0.028427 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.018162 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.008863 | 0.025558 | 0.000000 |
28 | 0.000000 | 0.000000 | 0.025542 | 0.000000 | 0.008173 | 0.000000 | 0.017809 | 0.019069 | 0.000000 | 0.000000 | ... | 0.016691 | 0.000000 | 0.005689 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007389 |
29 | 0.008238 | 0.009637 | 0.021549 | 0.000000 | 0.000000 | 0.000000 | 0.015025 | 0.016088 | 0.009924 | 0.000000 | ... | 0.000000 | 0.000000 | 0.033598 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006754 | 0.000000 |
30 | 0.009582 | 0.011210 | 0.020054 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.012289 | ... | 0.040953 | 0.000000 | 0.044666 | 0.000000 | 0.010898 | 0.000000 | 0.000000 | 0.000000 | 0.023571 | 0.007251 |
31 | 0.000000 | 0.000000 | 0.028117 | 0.000000 | 0.014995 | 0.000000 | 0.008168 | 0.008746 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.057407 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010186 | 0.014688 | 0.027112 |
32 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.016031 | 0.008583 | 0.000000 | 0.000000 | ... | 0.007513 | 0.025749 | 0.005121 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
33 | 0.010092 | 0.011807 | 0.021122 | 0.012158 | 0.008448 | 0.000000 | 0.009204 | 0.009855 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.123490 | 0.012536 | 0.022956 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.030550 |
34 | 0.000000 | 0.000000 | 0.088805 | 0.012780 | 0.017760 | 0.010608 | 0.009675 | 0.010359 | 0.012780 | 0.013605 | ... | 0.000000 | 0.010359 | 0.037087 | 0.000000 | 0.000000 | 0.013177 | 0.000000 | 0.012065 | 0.000000 | 0.000000 |
35 | 0.016226 | 0.018982 | 0.050935 | 0.029320 | 0.006791 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010405 | ... | 0.006935 | 0.000000 | 0.061451 | 0.010077 | 0.018453 | 0.020154 | 0.009227 | 0.000000 | 0.000000 | 0.000000 |
36 | 0.000000 | 0.000000 | 0.035172 | 0.000000 | 0.016077 | 0.000000 | 0.000000 | 0.000000 | 0.011569 | 0.000000 | ... | 0.016418 | 0.000000 | 0.067148 | 0.000000 | 0.000000 | 0.000000 | 0.010922 | 0.021844 | 0.000000 | 0.014535 |
37 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015380 | 0.016468 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.044218 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
38 | 0.000000 | 0.000000 | 0.004955 | 0.000000 | 0.007927 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.009247 | 0.077247 | 0.000000 | 0.000000 | 0.000000 | 0.010770 | 0.000000 | 0.000000 | 0.035831 |
39 | 0.000000 | 0.000000 | 0.008419 | 0.009693 | 0.000000 | 0.008046 | 0.007338 | 0.007857 | 0.000000 | 0.010319 | ... | 0.000000 | 0.000000 | 0.037504 | 0.000000 | 0.000000 | 0.009994 | 0.000000 | 0.009150 | 0.013194 | 0.006089 |
40 | 0.000000 | 0.000000 | 0.059723 | 0.000000 | 0.023887 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.116393 | 0.011815 | 0.000000 | 0.000000 | 0.000000 | 0.010818 | 0.000000 | 0.028794 |
41 | 0.000000 | 0.000000 | 0.004826 | 0.000000 | 0.000000 | 0.018449 | 0.008413 | 0.009008 | 0.000000 | 0.000000 | ... | 0.015770 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007564 | 0.000000 |
42 | 0.000000 | 0.000000 | 0.059596 | 0.000000 | 0.015891 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.049777 | 0.000000 | 0.010795 | 0.011790 | 0.000000 | 0.010795 | 0.000000 | 0.007183 |
43 | 0.000000 | 0.000000 | 0.008240 | 0.000000 | 0.006591 | 0.000000 | 0.014363 | 0.015379 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.022941 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006457 | 0.000000 |
44 | 0.000000 | 0.000000 | 0.043550 | 0.011142 | 0.007742 | 0.000000 | 0.008435 | 0.009031 | 0.000000 | 0.011862 | ... | 0.000000 | 0.000000 | 0.161665 | 0.000000 | 0.000000 | 0.011488 | 0.000000 | 0.010518 | 0.022750 | 0.013998 |
45 | 0.000000 | 0.000000 | 0.020163 | 0.000000 | 0.000000 | 0.000000 | 0.008786 | 0.009408 | 0.011606 | 0.000000 | ... | 0.016470 | 0.009408 | 0.005614 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010957 | 0.007900 | 0.000000 |
46 | 0.000000 | 0.000000 | 0.010144 | 0.000000 | 0.008115 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.028243 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007336 |
47 | 0.010314 | 0.012066 | 0.032378 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013228 | ... | 0.000000 | 0.020144 | 0.066105 | 0.012811 | 0.011730 | 0.000000 | 0.011730 | 0.000000 | 0.000000 | 0.015610 |
48 | 0.000000 | 0.000000 | 0.015133 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.016482 | 0.000000 | 0.005618 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.023716 | 0.000000 |
49 | 0.019175 | 0.022432 | 0.030097 | 0.011550 | 0.000000 | 0.009587 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.150827 | 0.000000 | 0.010904 | 0.011909 | 0.000000 | 0.010904 | 0.000000 | 0.007255 |
50 | 0.000000 | 0.000000 | 0.005582 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.009119 | 0.000000 | 0.161614 | 0.013251 | 0.012133 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.064585 |
51 | 0.009137 | 0.010689 | 0.033463 | 0.000000 | 0.022945 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.106476 | 0.034048 | 0.000000 | 0.000000 | 0.020783 | 0.000000 | 0.000000 | 0.020743 |
52 | 0.009785 | 0.011447 | 0.025597 | 0.000000 | 0.024571 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.012549 | ... | 0.000000 | 0.000000 | 0.114022 | 0.000000 | 0.000000 | 0.012154 | 0.000000 | 0.011128 | 0.000000 | 0.037023 |
53 | 0.000000 | 0.000000 | 0.018548 | 0.000000 | 0.000000 | 0.000000 | 0.016166 | 0.017310 | 0.000000 | 0.000000 | ... | 0.007576 | 0.017310 | 0.010328 | 0.000000 | 0.000000 | 0.000000 | 0.010080 | 0.000000 | 0.000000 | 0.006707 |
54 | 0.010130 | 0.011851 | 0.010600 | 0.000000 | 0.000000 | 0.020260 | 0.009238 | 0.009892 | 0.000000 | 0.012992 | ... | 0.000000 | 0.000000 | 0.265603 | 0.012582 | 0.023041 | 0.000000 | 0.011521 | 0.000000 | 0.008306 | 0.130319 |
55 rows × 1193 columns
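We can also create a TF-IDF model directly from the corpus with TfidfVectorizer. Here min_df=0.2 keeps only terms that occur in at least 20% of the documents, and ngram_range=(1,3) extracts unigrams, bigrams, and trigrams.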
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0.2,
                     ngram_range=(1, 3),
                     use_idf=True)
tv_matrix = tv.fit_transform(corpus_norm)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_ngram2 = pd.DataFrame(tv_matrix.toarray(), columns=tv.get_feature_names_out())
tfidf_ngram2
able | able to | about | about it | about the | above | across | across the | act | added | ... | yet | york | you | you are | you can | you could | you have | you know | young | your | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.009217 | 0.000000 | 0.024112 | 0.000000 | 0.023146 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.007878 | 0.000000 | 0.016111 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006975 |
1 | 0.000000 | 0.000000 | 0.014143 | 0.000000 | 0.000000 | 0.000000 | 0.008217 | 0.000000 | 0.021710 | 0.000000 | ... | 0.015404 | 0.008799 | 0.042001 | 0.011192 | 0.000000 | 0.000000 | 0.010248 | 0.000000 | 0.007388 | 0.027275 |
2 | 0.009164 | 0.000000 | 0.014384 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.011040 | 0.000000 | ... | 0.000000 | 0.026847 | 0.021358 | 0.011383 | 0.000000 | 0.000000 | 0.010422 | 0.000000 | 0.015028 | 0.006935 |
3 | 0.000000 | 0.000000 | 0.010331 | 0.000000 | 0.008264 | 0.000000 | 0.000000 | 0.000000 | 0.047577 | 0.000000 | ... | 0.000000 | 0.019283 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.016191 | 0.007472 |
4 | 0.009664 | 0.011305 | 0.020224 | 0.011642 | 0.008089 | 0.009664 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.008260 | 0.000000 | 0.033784 | 0.000000 | 0.010990 | 0.012003 | 0.010990 | 0.000000 | 0.000000 | 0.014626 |
5 | 0.000000 | 0.000000 | 0.045861 | 0.009600 | 0.020010 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.006811 | 0.000000 | 0.027858 | 0.000000 | 0.009063 | 0.000000 | 0.009063 | 0.000000 | 0.000000 | 0.000000 |
6 | 0.009253 | 0.010825 | 0.004841 | 0.000000 | 0.000000 | 0.009253 | 0.000000 | 0.000000 | 0.011147 | 0.011867 | ... | 0.000000 | 0.018072 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015174 | 0.000000 |
7 | 0.009444 | 0.011048 | 0.049410 | 0.000000 | 0.015810 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.016508 | 0.000000 | 0.000000 | 0.000000 | 0.010740 | 0.000000 | 0.007743 | 0.014293 |
8 | 0.008653 | 0.000000 | 0.004527 | 0.000000 | 0.000000 | 0.008653 | 0.000000 | 0.000000 | 0.000000 | 0.011097 | ... | 0.014792 | 0.050696 | 0.005041 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014189 | 0.000000 |
9 | 0.000000 | 0.000000 | 0.004790 | 0.000000 | 0.007663 | 0.000000 | 0.008349 | 0.008940 | 0.000000 | 0.000000 | ... | 0.007826 | 0.017881 | 0.021338 | 0.000000 | 0.010412 | 0.000000 | 0.000000 | 0.000000 | 0.007507 | 0.000000 |
10 | 0.009146 | 0.000000 | 0.028710 | 0.000000 | 0.022966 | 0.009146 | 0.000000 | 0.000000 | 0.000000 | 0.011729 | ... | 0.007817 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007499 | 0.000000 |
11 | 0.000000 | 0.000000 | 0.013455 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.007327 | 0.041855 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.014057 | 0.000000 |
12 | 0.017607 | 0.020598 | 0.018424 | 0.000000 | 0.014738 | 0.000000 | 0.000000 | 0.000000 | 0.031817 | 0.000000 | ... | 0.000000 | 0.017194 | 0.005130 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007218 | 0.006662 |
13 | 0.000000 | 0.000000 | 0.013175 | 0.000000 | 0.000000 | 0.008394 | 0.000000 | 0.000000 | 0.010112 | 0.000000 | ... | 0.000000 | 0.024591 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006883 | 0.000000 |
14 | 0.000000 | 0.000000 | 0.026481 | 0.000000 | 0.008473 | 0.010123 | 0.009232 | 0.009885 | 0.012195 | 0.000000 | ... | 0.000000 | 0.029655 | 0.011796 | 0.000000 | 0.011512 | 0.000000 | 0.000000 | 0.000000 | 0.008300 | 0.000000 |
15 | 0.009561 | 0.000000 | 0.045023 | 0.000000 | 0.008003 | 0.009561 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.008173 | 0.009337 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
16 | 0.000000 | 0.000000 | 0.019991 | 0.000000 | 0.000000 | 0.009552 | 0.000000 | 0.000000 | 0.011508 | 0.000000 | ... | 0.000000 | 0.009328 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
17 | 0.009487 | 0.000000 | 0.054599 | 0.022858 | 0.007941 | 0.000000 | 0.017304 | 0.018528 | 0.000000 | 0.000000 | ... | 0.032436 | 0.000000 | 0.088442 | 0.000000 | 0.000000 | 0.011784 | 0.000000 | 0.000000 | 0.007779 | 0.000000 |
18 | 0.000000 | 0.000000 | 0.043551 | 0.000000 | 0.015484 | 0.018498 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.023717 | 0.000000 | 0.048500 | 0.011488 | 0.010519 | 0.000000 | 0.010519 | 0.000000 | 0.000000 | 0.013998 |
19 | 0.000000 | 0.000000 | 0.048336 | 0.000000 | 0.015466 | 0.009239 | 0.016851 | 0.018043 | 0.000000 | 0.000000 | ... | 0.007897 | 0.018043 | 0.118424 | 0.000000 | 0.000000 | 0.022951 | 0.021014 | 0.010507 | 0.022725 | 0.013983 |
20 | 0.000000 | 0.000000 | 0.045488 | 0.041896 | 0.000000 | 0.000000 | 0.023787 | 0.025470 | 0.000000 | 0.000000 | ... | 0.007432 | 0.000000 | 0.050658 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.009888 | 0.014257 | 0.006579 |
21 | 0.008779 | 0.010270 | 0.027558 | 0.010576 | 0.000000 | 0.017557 | 0.008006 | 0.000000 | 0.031727 | 0.000000 | ... | 0.015008 | 0.008573 | 0.030690 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007198 | 0.000000 |
22 | 0.000000 | 0.000000 | 0.004498 | 0.000000 | 0.000000 | 0.000000 | 0.023519 | 0.025183 | 0.000000 | 0.000000 | ... | 0.014696 | 0.008394 | 0.055096 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013011 |
23 | 0.000000 | 0.000000 | 0.026297 | 0.012110 | 0.000000 | 0.010052 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.008592 | 0.000000 | 0.064428 | 0.000000 | 0.022865 | 0.000000 | 0.000000 | 0.011432 | 0.008242 | 0.015214 |
24 | 0.000000 | 0.000000 | 0.018921 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.068481 | 0.011230 | 0.010282 | 0.011230 | 0.000000 | 0.000000 | 0.000000 | 0.027367 |
25 | 0.000000 | 0.000000 | 0.040605 | 0.020777 | 0.021654 | 0.008623 | 0.007864 | 0.008421 | 0.000000 | 0.000000 | ... | 0.014742 | 0.000000 | 0.030146 | 0.000000 | 0.000000 | 0.021422 | 0.009807 | 0.000000 | 0.007070 | 0.013051 |
26 | 0.000000 | 0.000000 | 0.004620 | 0.000000 | 0.000000 | 0.017660 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.066884 | 0.010968 | 0.000000 | 0.000000 | 0.000000 | 0.010042 | 0.000000 | 0.013364 |
27 | 0.000000 | 0.000000 | 0.012231 | 0.000000 | 0.000000 | 0.015585 | 0.028427 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.018162 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.008863 | 0.025558 | 0.000000 |
28 | 0.000000 | 0.000000 | 0.025542 | 0.000000 | 0.008173 | 0.000000 | 0.017809 | 0.019069 | 0.000000 | 0.000000 | ... | 0.016691 | 0.000000 | 0.005689 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007389 |
29 | 0.008238 | 0.009637 | 0.021549 | 0.000000 | 0.000000 | 0.000000 | 0.015025 | 0.016088 | 0.009924 | 0.000000 | ... | 0.000000 | 0.000000 | 0.033598 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006754 | 0.000000 |
30 | 0.009582 | 0.011210 | 0.020054 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.012289 | ... | 0.040953 | 0.000000 | 0.044666 | 0.000000 | 0.010898 | 0.000000 | 0.000000 | 0.000000 | 0.023571 | 0.007251 |
31 | 0.000000 | 0.000000 | 0.028117 | 0.000000 | 0.014995 | 0.000000 | 0.008168 | 0.008746 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.057407 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010186 | 0.014688 | 0.027112 |
32 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.016031 | 0.008583 | 0.000000 | 0.000000 | ... | 0.007513 | 0.025749 | 0.005121 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
33 | 0.010092 | 0.011807 | 0.021122 | 0.012158 | 0.008448 | 0.000000 | 0.009204 | 0.009855 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.123490 | 0.012536 | 0.022956 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.030550 |
34 | 0.000000 | 0.000000 | 0.088805 | 0.012780 | 0.017760 | 0.010608 | 0.009675 | 0.010359 | 0.012780 | 0.013605 | ... | 0.000000 | 0.010359 | 0.037087 | 0.000000 | 0.000000 | 0.013177 | 0.000000 | 0.012065 | 0.000000 | 0.000000 |
35 | 0.016226 | 0.018982 | 0.050935 | 0.029320 | 0.006791 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010405 | ... | 0.006935 | 0.000000 | 0.061451 | 0.010077 | 0.018453 | 0.020154 | 0.009227 | 0.000000 | 0.000000 | 0.000000 |
36 | 0.000000 | 0.000000 | 0.035172 | 0.000000 | 0.016077 | 0.000000 | 0.000000 | 0.000000 | 0.011569 | 0.000000 | ... | 0.016418 | 0.000000 | 0.067148 | 0.000000 | 0.000000 | 0.000000 | 0.010922 | 0.021844 | 0.000000 | 0.014535 |
37 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.015380 | 0.016468 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.044218 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
38 | 0.000000 | 0.000000 | 0.004955 | 0.000000 | 0.007927 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.009247 | 0.077247 | 0.000000 | 0.000000 | 0.000000 | 0.010770 | 0.000000 | 0.000000 | 0.035831 |
39 | 0.000000 | 0.000000 | 0.008419 | 0.009693 | 0.000000 | 0.008046 | 0.007338 | 0.007857 | 0.000000 | 0.010319 | ... | 0.000000 | 0.000000 | 0.037504 | 0.000000 | 0.000000 | 0.009994 | 0.000000 | 0.009150 | 0.013194 | 0.006089 |
40 | 0.000000 | 0.000000 | 0.059723 | 0.000000 | 0.023887 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.116393 | 0.011815 | 0.000000 | 0.000000 | 0.000000 | 0.010818 | 0.000000 | 0.028794 |
41 | 0.000000 | 0.000000 | 0.004826 | 0.000000 | 0.000000 | 0.018449 | 0.008413 | 0.009008 | 0.000000 | 0.000000 | ... | 0.015770 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007564 | 0.000000 |
42 | 0.000000 | 0.000000 | 0.059596 | 0.000000 | 0.015891 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.049777 | 0.000000 | 0.010795 | 0.011790 | 0.000000 | 0.010795 | 0.000000 | 0.007183 |
43 | 0.000000 | 0.000000 | 0.008240 | 0.000000 | 0.006591 | 0.000000 | 0.014363 | 0.015379 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.022941 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.006457 | 0.000000 |
44 | 0.000000 | 0.000000 | 0.043550 | 0.011142 | 0.007742 | 0.000000 | 0.008435 | 0.009031 | 0.000000 | 0.011862 | ... | 0.000000 | 0.000000 | 0.161665 | 0.000000 | 0.000000 | 0.011488 | 0.000000 | 0.010518 | 0.022750 | 0.013998 |
45 | 0.000000 | 0.000000 | 0.020163 | 0.000000 | 0.000000 | 0.000000 | 0.008786 | 0.009408 | 0.011606 | 0.000000 | ... | 0.016470 | 0.009408 | 0.005614 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.010957 | 0.007900 | 0.000000 |
46 | 0.000000 | 0.000000 | 0.010144 | 0.000000 | 0.008115 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.028243 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.007336 |
47 | 0.010314 | 0.012066 | 0.032378 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013228 | ... | 0.000000 | 0.020144 | 0.066105 | 0.012811 | 0.011730 | 0.000000 | 0.011730 | 0.000000 | 0.000000 | 0.015610 |
48 | 0.000000 | 0.000000 | 0.015133 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.016482 | 0.000000 | 0.005618 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.023716 | 0.000000 |
49 | 0.019175 | 0.022432 | 0.030097 | 0.011550 | 0.000000 | 0.009587 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.150827 | 0.000000 | 0.010904 | 0.011909 | 0.000000 | 0.010904 | 0.000000 | 0.007255 |
50 | 0.000000 | 0.000000 | 0.005582 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.009119 | 0.000000 | 0.161614 | 0.013251 | 0.012133 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.064585 |
51 | 0.009137 | 0.010689 | 0.033463 | 0.000000 | 0.022945 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.106476 | 0.034048 | 0.000000 | 0.000000 | 0.020783 | 0.000000 | 0.000000 | 0.020743 |
52 | 0.009785 | 0.011447 | 0.025597 | 0.000000 | 0.024571 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.012549 | ... | 0.000000 | 0.000000 | 0.114022 | 0.000000 | 0.000000 | 0.012154 | 0.000000 | 0.011128 | 0.000000 | 0.037023 |
53 | 0.000000 | 0.000000 | 0.018548 | 0.000000 | 0.000000 | 0.000000 | 0.016166 | 0.017310 | 0.000000 | 0.000000 | ... | 0.007576 | 0.017310 | 0.010328 | 0.000000 | 0.000000 | 0.000000 | 0.010080 | 0.000000 | 0.000000 | 0.006707 |
54 | 0.010130 | 0.011851 | 0.010600 | 0.000000 | 0.000000 | 0.020260 | 0.009238 | 0.009892 | 0.000000 | 0.012992 | ... | 0.000000 | 0.000000 | 0.265603 | 0.012582 | 0.023041 | 0.000000 | 0.011521 | 0.000000 | 0.008306 | 0.130319 |
55 rows × 1193 columns
Document Similarity#
Cluster analysis with R seems more intuitive to me
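Cosine similarity between two document vectors is their dot product divided by the product of their lengths. As a rough sketch of what `cosine_similarity` computes, the toy example below compares a manual calculation with the scikit-learn function; the matrix `X` is invented for illustration and has nothing to do with the TF-IDF matrix used in this notebook.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-document, 3-feature matrix, for illustration only
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as row 0, twice the length
    [0.0, 0.0, 3.0],   # shares no features with rows 0 and 1
])

# Manual route: scale each row to unit length, then take pairwise dot products
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_unit = X / norms
manual_sim = X_unit @ X_unit.T

# Library route
sklearn_sim = cosine_similarity(X)

print(np.round(manual_sim, 3))
print(np.allclose(manual_sim, sklearn_sim))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 regardless of length, while row 2 shares no features with them and scores 0. When the rows are already L2-normalized, as TfidfVectorizer does by default, the cosine similarity reduces to a plain dot product.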
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.000000 | 0.903099 | 0.901749 | 0.901110 | 0.907569 | 0.930193 | 0.874775 | 0.894402 | 0.919195 | 0.918026 | ... | 0.774057 | 0.861176 | 0.731567 | 0.844309 | 0.757781 | 0.843069 | 0.879384 | 0.712273 | 0.938602 | 0.866187 |
1 | 0.903099 | 1.000000 | 0.871965 | 0.867151 | 0.880732 | 0.913532 | 0.861910 | 0.902770 | 0.887822 | 0.877728 | ... | 0.784384 | 0.841174 | 0.729602 | 0.828479 | 0.717505 | 0.830369 | 0.836896 | 0.694544 | 0.907940 | 0.852431 |
2 | 0.901749 | 0.871965 | 1.000000 | 0.905119 | 0.873995 | 0.901962 | 0.883693 | 0.882181 | 0.896813 | 0.889755 | ... | 0.786636 | 0.821484 | 0.770126 | 0.842209 | 0.727968 | 0.815429 | 0.854184 | 0.710397 | 0.888556 | 0.842170 |
3 | 0.901110 | 0.867151 | 0.905119 | 1.000000 | 0.887274 | 0.898775 | 0.872573 | 0.882392 | 0.895758 | 0.890735 | ... | 0.780414 | 0.825479 | 0.750377 | 0.848413 | 0.720307 | 0.813878 | 0.852632 | 0.694705 | 0.899165 | 0.838796 |
4 | 0.907569 | 0.880732 | 0.873995 | 0.887274 | 1.000000 | 0.907642 | 0.840482 | 0.865856 | 0.884396 | 0.913948 | ... | 0.782834 | 0.862465 | 0.739062 | 0.856936 | 0.783967 | 0.854642 | 0.876750 | 0.717906 | 0.910563 | 0.878084 |
5 | 0.930193 | 0.913532 | 0.901962 | 0.898775 | 0.907642 | 1.000000 | 0.874888 | 0.896952 | 0.915662 | 0.923222 | ... | 0.783759 | 0.849392 | 0.744431 | 0.842141 | 0.746043 | 0.840681 | 0.871048 | 0.702883 | 0.931814 | 0.865160 |
6 | 0.874775 | 0.861910 | 0.883693 | 0.872573 | 0.840482 | 0.874888 | 1.000000 | 0.860151 | 0.882502 | 0.852995 | ... | 0.752524 | 0.796356 | 0.742264 | 0.828274 | 0.704168 | 0.770870 | 0.818364 | 0.670490 | 0.864791 | 0.816389 |
7 | 0.894402 | 0.902770 | 0.882181 | 0.882392 | 0.865856 | 0.896952 | 0.860151 | 1.000000 | 0.878723 | 0.877895 | ... | 0.799398 | 0.859184 | 0.767273 | 0.847824 | 0.746111 | 0.826701 | 0.843524 | 0.740893 | 0.896333 | 0.845723 |
8 | 0.919195 | 0.887822 | 0.896813 | 0.895758 | 0.884396 | 0.915662 | 0.882502 | 0.878723 | 1.000000 | 0.883385 | ... | 0.761283 | 0.827417 | 0.731610 | 0.825890 | 0.745811 | 0.822658 | 0.854792 | 0.682048 | 0.910304 | 0.840261 |
9 | 0.918026 | 0.877728 | 0.889755 | 0.890735 | 0.913948 | 0.923222 | 0.852995 | 0.877895 | 0.883385 | 1.000000 | ... | 0.765622 | 0.859278 | 0.729147 | 0.838511 | 0.755803 | 0.839939 | 0.859853 | 0.695509 | 0.920241 | 0.863758 |
10 | 0.893628 | 0.867787 | 0.881295 | 0.877463 | 0.889364 | 0.902471 | 0.860747 | 0.873244 | 0.907378 | 0.879919 | ... | 0.770757 | 0.847329 | 0.725159 | 0.838996 | 0.775047 | 0.825331 | 0.832768 | 0.706481 | 0.895585 | 0.835030 |
11 | 0.918032 | 0.922543 | 0.877476 | 0.873176 | 0.896175 | 0.923494 | 0.865996 | 0.897044 | 0.897752 | 0.905827 | ... | 0.764435 | 0.851556 | 0.709733 | 0.835111 | 0.720147 | 0.822738 | 0.840674 | 0.663372 | 0.924252 | 0.848980 |
12 | 0.927320 | 0.915016 | 0.882587 | 0.879687 | 0.894711 | 0.922641 | 0.856478 | 0.898665 | 0.918277 | 0.897774 | ... | 0.769405 | 0.836532 | 0.729273 | 0.825734 | 0.719975 | 0.836236 | 0.863377 | 0.685054 | 0.929722 | 0.853080 |
13 | 0.916993 | 0.900951 | 0.913324 | 0.892250 | 0.897556 | 0.923803 | 0.888728 | 0.888100 | 0.920030 | 0.894838 | ... | 0.773878 | 0.831713 | 0.738261 | 0.832680 | 0.736512 | 0.822101 | 0.852300 | 0.694546 | 0.913104 | 0.843727 |
14 | 0.879863 | 0.865431 | 0.878511 | 0.863772 | 0.863142 | 0.878933 | 0.844643 | 0.846818 | 0.871842 | 0.868329 | ... | 0.759897 | 0.779336 | 0.719440 | 0.804986 | 0.670269 | 0.786946 | 0.833296 | 0.641161 | 0.878097 | 0.832097 |
15 | 0.903929 | 0.853194 | 0.873932 | 0.883668 | 0.904154 | 0.896867 | 0.832619 | 0.854780 | 0.879057 | 0.894117 | ... | 0.775972 | 0.852444 | 0.743408 | 0.847293 | 0.767151 | 0.829395 | 0.866052 | 0.701045 | 0.895306 | 0.857095 |
16 | 0.907692 | 0.874330 | 0.866760 | 0.871648 | 0.892364 | 0.905170 | 0.836240 | 0.858848 | 0.892530 | 0.886955 | ... | 0.747956 | 0.819538 | 0.711203 | 0.831813 | 0.700731 | 0.818488 | 0.875829 | 0.658974 | 0.907822 | 0.840660 |
17 | 0.722748 | 0.676630 | 0.720698 | 0.704657 | 0.750416 | 0.708952 | 0.690665 | 0.723350 | 0.682600 | 0.749956 | ... | 0.704389 | 0.801974 | 0.702493 | 0.763639 | 0.812151 | 0.724943 | 0.714101 | 0.750128 | 0.712810 | 0.747868 |
18 | 0.852268 | 0.820648 | 0.800214 | 0.789683 | 0.854566 | 0.840020 | 0.779562 | 0.839935 | 0.816550 | 0.854762 | ... | 0.755226 | 0.880251 | 0.746072 | 0.826234 | 0.822705 | 0.801768 | 0.809347 | 0.741113 | 0.859247 | 0.840234 |
19 | 0.840175 | 0.788794 | 0.822618 | 0.804291 | 0.837034 | 0.824228 | 0.792047 | 0.830945 | 0.803000 | 0.838556 | ... | 0.792280 | 0.870559 | 0.778227 | 0.851410 | 0.852138 | 0.812832 | 0.831184 | 0.790361 | 0.832035 | 0.858552 |
20 | 0.833332 | 0.798374 | 0.834168 | 0.817030 | 0.821097 | 0.825280 | 0.808250 | 0.843359 | 0.813096 | 0.817110 | ... | 0.805719 | 0.861150 | 0.809262 | 0.852424 | 0.808533 | 0.789840 | 0.820379 | 0.772771 | 0.826265 | 0.832442 |
21 | 0.745682 | 0.691691 | 0.724727 | 0.708737 | 0.732307 | 0.710049 | 0.695002 | 0.755924 | 0.712325 | 0.740827 | ... | 0.711484 | 0.829470 | 0.712595 | 0.772976 | 0.817639 | 0.731415 | 0.722279 | 0.752113 | 0.731588 | 0.744636 |
22 | 0.816264 | 0.769034 | 0.799091 | 0.782308 | 0.812292 | 0.794071 | 0.787110 | 0.810181 | 0.795496 | 0.818210 | ... | 0.763423 | 0.839915 | 0.735678 | 0.826365 | 0.796230 | 0.762080 | 0.791648 | 0.710864 | 0.812984 | 0.831387 |
23 | 0.741841 | 0.669861 | 0.689774 | 0.685933 | 0.735625 | 0.692563 | 0.669424 | 0.719684 | 0.691292 | 0.729020 | ... | 0.665508 | 0.783379 | 0.650770 | 0.725722 | 0.814485 | 0.740312 | 0.731499 | 0.707777 | 0.725159 | 0.747473 |
24 | 0.841091 | 0.804408 | 0.805012 | 0.808854 | 0.862205 | 0.827907 | 0.791908 | 0.808168 | 0.798110 | 0.839622 | ... | 0.769851 | 0.871765 | 0.755433 | 0.838300 | 0.842024 | 0.792607 | 0.807654 | 0.767430 | 0.844423 | 0.839348 |
25 | 0.820705 | 0.785139 | 0.815059 | 0.803249 | 0.810044 | 0.822846 | 0.793844 | 0.816898 | 0.798980 | 0.809980 | ... | 0.814741 | 0.836125 | 0.792851 | 0.847521 | 0.765031 | 0.766337 | 0.812754 | 0.758284 | 0.821126 | 0.822159 |
26 | 0.846585 | 0.808344 | 0.789634 | 0.800136 | 0.853079 | 0.832964 | 0.775050 | 0.817873 | 0.808568 | 0.846014 | ... | 0.755862 | 0.872566 | 0.715610 | 0.819897 | 0.798217 | 0.796102 | 0.801881 | 0.712568 | 0.854132 | 0.846940 |
27 | 0.851304 | 0.797714 | 0.797572 | 0.784558 | 0.826298 | 0.826986 | 0.778319 | 0.824567 | 0.817484 | 0.844344 | ... | 0.742222 | 0.853406 | 0.695424 | 0.809665 | 0.777829 | 0.787153 | 0.810296 | 0.704694 | 0.851361 | 0.817153 |
28 | 0.824971 | 0.816717 | 0.786819 | 0.777690 | 0.816804 | 0.818824 | 0.771025 | 0.831528 | 0.799508 | 0.818772 | ... | 0.812883 | 0.881882 | 0.791936 | 0.827054 | 0.787598 | 0.803041 | 0.778636 | 0.785832 | 0.829446 | 0.819105 |
29 | 0.852327 | 0.814751 | 0.839825 | 0.817154 | 0.840744 | 0.839730 | 0.819181 | 0.842704 | 0.833108 | 0.843954 | ... | 0.784725 | 0.857058 | 0.806187 | 0.837527 | 0.817567 | 0.806701 | 0.837980 | 0.771634 | 0.847044 | 0.846029 |
30 | 0.794010 | 0.776042 | 0.755943 | 0.754476 | 0.827370 | 0.790795 | 0.733149 | 0.788210 | 0.750026 | 0.821269 | ... | 0.756161 | 0.876442 | 0.717054 | 0.805261 | 0.837379 | 0.780225 | 0.753993 | 0.744322 | 0.800286 | 0.814796 |
31 | 0.823972 | 0.813417 | 0.782432 | 0.775116 | 0.811153 | 0.810251 | 0.763265 | 0.833756 | 0.791667 | 0.811959 | ... | 0.796111 | 0.861117 | 0.752117 | 0.811259 | 0.759871 | 0.807596 | 0.790919 | 0.762348 | 0.827733 | 0.827300 |
32 | 0.878325 | 0.852549 | 0.866249 | 0.841610 | 0.862310 | 0.875473 | 0.855769 | 0.872726 | 0.864158 | 0.875575 | ... | 0.803574 | 0.878432 | 0.769746 | 0.864909 | 0.787805 | 0.799803 | 0.833120 | 0.746661 | 0.883848 | 0.849817 |
33 | 0.883584 | 0.852411 | 0.845208 | 0.821051 | 0.875078 | 0.886667 | 0.807734 | 0.857516 | 0.837904 | 0.877588 | ... | 0.794945 | 0.866058 | 0.759273 | 0.848622 | 0.798370 | 0.851570 | 0.853594 | 0.754632 | 0.879834 | 0.888921 |
34 | 0.675786 | 0.661941 | 0.672636 | 0.671671 | 0.700025 | 0.663805 | 0.654760 | 0.720069 | 0.656132 | 0.667695 | ... | 0.823057 | 0.785372 | 0.758352 | 0.792715 | 0.756819 | 0.732708 | 0.709800 | 0.786138 | 0.674175 | 0.726947 |
35 | 0.810933 | 0.773532 | 0.763758 | 0.756611 | 0.822137 | 0.789180 | 0.739294 | 0.791340 | 0.764288 | 0.814760 | ... | 0.752378 | 0.885662 | 0.722331 | 0.812275 | 0.863883 | 0.804854 | 0.787881 | 0.757975 | 0.804107 | 0.825393 |
36 | 0.844746 | 0.831847 | 0.830006 | 0.822755 | 0.857587 | 0.844472 | 0.792833 | 0.853440 | 0.813177 | 0.853282 | ... | 0.822397 | 0.904533 | 0.784227 | 0.852288 | 0.821485 | 0.847995 | 0.818471 | 0.805294 | 0.845315 | 0.849890 |
37 | 0.824112 | 0.780643 | 0.765167 | 0.757907 | 0.814186 | 0.805177 | 0.750204 | 0.798305 | 0.784583 | 0.826856 | ... | 0.732387 | 0.864541 | 0.679961 | 0.799603 | 0.799258 | 0.766278 | 0.765276 | 0.691883 | 0.825864 | 0.812077 |
38 | 0.765444 | 0.759691 | 0.782322 | 0.748915 | 0.764829 | 0.766151 | 0.741087 | 0.792770 | 0.748863 | 0.761727 | ... | 0.791290 | 0.807429 | 0.776408 | 0.797459 | 0.774330 | 0.787869 | 0.759214 | 0.814736 | 0.757332 | 0.777322 |
39 | 0.841180 | 0.842735 | 0.795105 | 0.789407 | 0.812744 | 0.834937 | 0.781318 | 0.857953 | 0.821731 | 0.821977 | ... | 0.793983 | 0.853737 | 0.748958 | 0.811971 | 0.724716 | 0.813735 | 0.795875 | 0.747016 | 0.847919 | 0.821548 |
40 | 0.661117 | 0.601695 | 0.645461 | 0.626393 | 0.674573 | 0.623891 | 0.609137 | 0.665443 | 0.603241 | 0.663814 | ... | 0.661901 | 0.745358 | 0.657060 | 0.713643 | 0.785833 | 0.696702 | 0.673876 | 0.738064 | 0.640975 | 0.710919 |
41 | 0.753643 | 0.745934 | 0.777722 | 0.751371 | 0.776517 | 0.751814 | 0.733240 | 0.787569 | 0.724689 | 0.771514 | ... | 0.780559 | 0.830094 | 0.754532 | 0.793743 | 0.770456 | 0.755606 | 0.735875 | 0.800807 | 0.752646 | 0.760544 |
42 | 0.728488 | 0.710539 | 0.732329 | 0.718885 | 0.723947 | 0.712334 | 0.703711 | 0.757266 | 0.698572 | 0.712019 | ... | 0.768337 | 0.806658 | 0.818925 | 0.771588 | 0.763566 | 0.746599 | 0.733842 | 0.828015 | 0.708562 | 0.748552 |
43 | 0.880814 | 0.860635 | 0.835784 | 0.835429 | 0.860004 | 0.869494 | 0.842492 | 0.876014 | 0.862838 | 0.867116 | ... | 0.807759 | 0.890710 | 0.774815 | 0.864797 | 0.781265 | 0.807796 | 0.828113 | 0.729236 | 0.891353 | 0.874514 |
44 | 0.705313 | 0.679096 | 0.705116 | 0.679263 | 0.714611 | 0.681784 | 0.660269 | 0.728331 | 0.681556 | 0.704017 | ... | 0.727093 | 0.783006 | 0.706936 | 0.744437 | 0.797124 | 0.744491 | 0.711070 | 0.772361 | 0.692165 | 0.748060 |
45 | 0.774057 | 0.784384 | 0.786636 | 0.780414 | 0.782834 | 0.783759 | 0.752524 | 0.799398 | 0.761283 | 0.765622 | ... | 1.000000 | 0.840797 | 0.809486 | 0.882053 | 0.723502 | 0.768755 | 0.783202 | 0.751529 | 0.783454 | 0.811948 |
46 | 0.861176 | 0.841174 | 0.821484 | 0.825479 | 0.862465 | 0.849392 | 0.796356 | 0.859184 | 0.827417 | 0.859278 | ... | 0.840797 | 1.000000 | 0.788950 | 0.880254 | 0.833002 | 0.833217 | 0.814247 | 0.782475 | 0.861600 | 0.861988 |
47 | 0.731567 | 0.729602 | 0.770126 | 0.750377 | 0.739062 | 0.744431 | 0.742264 | 0.767273 | 0.731610 | 0.729147 | ... | 0.809486 | 0.788950 | 1.000000 | 0.797582 | 0.748504 | 0.738363 | 0.751753 | 0.789740 | 0.724311 | 0.776450 |
48 | 0.844309 | 0.828479 | 0.842209 | 0.848413 | 0.856936 | 0.842141 | 0.828274 | 0.847824 | 0.825890 | 0.838511 | ... | 0.882053 | 0.880254 | 0.797582 | 1.000000 | 0.767375 | 0.806944 | 0.838058 | 0.758050 | 0.859021 | 0.857034 |
49 | 0.757781 | 0.717505 | 0.727968 | 0.720307 | 0.783967 | 0.746043 | 0.704168 | 0.746111 | 0.745811 | 0.755803 | ... | 0.723502 | 0.833002 | 0.748504 | 0.767375 | 1.000000 | 0.763742 | 0.751773 | 0.760735 | 0.741123 | 0.794523 |
50 | 0.843069 | 0.830369 | 0.815429 | 0.813878 | 0.854642 | 0.840681 | 0.770870 | 0.826701 | 0.822658 | 0.839939 | ... | 0.768755 | 0.833217 | 0.738363 | 0.806944 | 0.763742 | 1.000000 | 0.866217 | 0.757738 | 0.840602 | 0.871413 |
51 | 0.879384 | 0.836896 | 0.854184 | 0.852632 | 0.876750 | 0.871048 | 0.818364 | 0.843524 | 0.854792 | 0.859853 | ... | 0.783202 | 0.814247 | 0.751753 | 0.838058 | 0.751773 | 0.866217 | 1.000000 | 0.729374 | 0.869848 | 0.869194 |
52 | 0.712273 | 0.694544 | 0.710397 | 0.694705 | 0.717906 | 0.702883 | 0.670490 | 0.740893 | 0.682048 | 0.695509 | ... | 0.751529 | 0.782475 | 0.789740 | 0.758050 | 0.760735 | 0.757738 | 0.729374 | 1.000000 | 0.696812 | 0.737664 |
53 | 0.938602 | 0.907940 | 0.888556 | 0.899165 | 0.910563 | 0.931814 | 0.864791 | 0.896333 | 0.910304 | 0.920241 | ... | 0.783454 | 0.861600 | 0.724311 | 0.859021 | 0.741123 | 0.840602 | 0.869848 | 0.696812 | 1.000000 | 0.871567 |
54 | 0.866187 | 0.852431 | 0.842170 | 0.838796 | 0.878084 | 0.865160 | 0.816389 | 0.845723 | 0.840261 | 0.863758 | ... | 0.811948 | 0.861988 | 0.776450 | 0.857034 | 0.794523 | 0.871413 | 0.869194 | 0.737664 | 0.871567 | 1.000000 |
55 rows × 55 columns
%load_ext rpy2.ipython
%%R -i similarity_matrix
library(dplyr)
#library(ggplot2)
head(similarity_matrix)
# hclust() expects distances, so convert cosine similarity to distance first
hclust(as.dist(1 - as.matrix(similarity_matrix)), method = "ward.D2") %>%
plot
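from scipy.cluster.hierarchy import dendrogram, linkage
# A square 2-D input is treated as one observation vector per row, so this
# clusters documents by the Euclidean distance between their similarity profiles
Z = linkage(similarity_matrix, 'ward')
pd.DataFrame(Z)
# Draw the dendrogram; the call also returns the plot data as a dict
dendrogram(Z)
# The default Python plot is harder to read than the R one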
{'icoord': [[5.0, 5.0, 15.0, 15.0],
[25.0, 25.0, 35.0, 35.0],
[10.0, 10.0, 30.0, 30.0],
[65.0, 65.0, 75.0, 75.0],
[55.0, 55.0, 70.0, 70.0],
[45.0, 45.0, 62.5, 62.5],
[115.0, 115.0, 125.0, 125.0],
[105.0, 105.0, 120.0, 120.0],
[95.0, 95.0, 112.5, 112.5],
[85.0, 85.0, 103.75, 103.75],
[53.75, 53.75, 94.375, 94.375],
[20.0, 20.0, 74.0625, 74.0625],
[155.0, 155.0, 165.0, 165.0],
[145.0, 145.0, 160.0, 160.0],
[185.0, 185.0, 195.0, 195.0],
[175.0, 175.0, 190.0, 190.0],
[152.5, 152.5, 182.5, 182.5],
[135.0, 135.0, 167.5, 167.5],
[205.0, 205.0, 215.0, 215.0],
[245.0, 245.0, 255.0, 255.0],
[235.0, 235.0, 250.0, 250.0],
[225.0, 225.0, 242.5, 242.5],
[265.0, 265.0, 275.0, 275.0],
[295.0, 295.0, 305.0, 305.0],
[285.0, 285.0, 300.0, 300.0],
[270.0, 270.0, 292.5, 292.5],
[233.75, 233.75, 281.25, 281.25],
[210.0, 210.0, 257.5, 257.5],
[151.25, 151.25, 233.75, 233.75],
[315.0, 315.0, 325.0, 325.0],
[345.0, 345.0, 355.0, 355.0],
[335.0, 335.0, 350.0, 350.0],
[320.0, 320.0, 342.5, 342.5],
[375.0, 375.0, 385.0, 385.0],
[365.0, 365.0, 380.0, 380.0],
[331.25, 331.25, 372.5, 372.5],
[395.0, 395.0, 405.0, 405.0],
[425.0, 425.0, 435.0, 435.0],
[415.0, 415.0, 430.0, 430.0],
[400.0, 400.0, 422.5, 422.5],
[455.0, 455.0, 465.0, 465.0],
[445.0, 445.0, 460.0, 460.0],
[475.0, 475.0, 485.0, 485.0],
[495.0, 495.0, 505.0, 505.0],
[480.0, 480.0, 500.0, 500.0],
[515.0, 515.0, 525.0, 525.0],
[535.0, 535.0, 545.0, 545.0],
[520.0, 520.0, 540.0, 540.0],
[490.0, 490.0, 530.0, 530.0],
[452.5, 452.5, 510.0, 510.0],
[411.25, 411.25, 481.25, 481.25],
[351.875, 351.875, 446.25, 446.25],
[192.5, 192.5, 399.0625, 399.0625],
[47.03125, 47.03125, 295.78125, 295.78125]],
'dcoord': [[0.0, 0.21333867488882308, 0.21333867488882308, 0.0],
[0.0, 0.3805535089335521, 0.3805535089335521, 0.0],
[0.21333867488882308,
0.5855848260443224,
0.5855848260443224,
0.3805535089335521],
[0.0, 0.2872644901251715, 0.2872644901251715, 0.0],
[0.0, 0.34472959074707116, 0.34472959074707116, 0.2872644901251715],
[0.0, 0.38938388145098235, 0.38938388145098235, 0.34472959074707116],
[0.0, 0.33352399998340854, 0.33352399998340854, 0.0],
[0.0, 0.3974263535175602, 0.3974263535175602, 0.33352399998340854],
[0.0, 0.4140604517840557, 0.4140604517840557, 0.3974263535175602],
[0.0, 0.6053583701171363, 0.6053583701171363, 0.4140604517840557],
[0.38938388145098235,
0.7629594367621403,
0.7629594367621403,
0.6053583701171363],
[0.5855848260443224,
0.9584521635877752,
0.9584521635877752,
0.7629594367621403],
[0.0, 0.10629097243894707, 0.10629097243894707, 0.0],
[0.0, 0.14589106217232223, 0.14589106217232223, 0.10629097243894707],
[0.0, 0.14845298440347174, 0.14845298440347174, 0.0],
[0.0, 0.20323545927584558, 0.20323545927584558, 0.14845298440347174],
[0.14589106217232223,
0.27053775393863067,
0.27053775393863067,
0.20323545927584558],
[0.0, 0.2834693852894986, 0.2834693852894986, 0.27053775393863067],
[0.0, 0.27579034064360586, 0.27579034064360586, 0.0],
[0.0, 0.14055830099509545, 0.14055830099509545, 0.0],
[0.0, 0.17317330451288437, 0.17317330451288437, 0.14055830099509545],
[0.0, 0.22889162471627206, 0.22889162471627206, 0.17317330451288437],
[0.0, 0.1646798411677444, 0.1646798411677444, 0.0],
[0.0, 0.14195089035900987, 0.14195089035900987, 0.0],
[0.0, 0.20457278941880547, 0.20457278941880547, 0.14195089035900987],
[0.1646798411677444,
0.23724513617131898,
0.23724513617131898,
0.20457278941880547],
[0.22889162471627206,
0.35990544890675036,
0.35990544890675036,
0.23724513617131898],
[0.27579034064360586,
0.43126319905416755,
0.43126319905416755,
0.35990544890675036],
[0.2834693852894986,
0.5929383229764894,
0.5929383229764894,
0.43126319905416755],
[0.0, 0.19468395945144223, 0.19468395945144223, 0.0],
[0.0, 0.21074083639138996, 0.21074083639138996, 0.0],
[0.0, 0.23596337662440936, 0.23596337662440936, 0.21074083639138996],
[0.19468395945144223,
0.2733141711848642,
0.2733141711848642,
0.23596337662440936],
[0.0, 0.17997077302830203, 0.17997077302830203, 0.0],
[0.0, 0.3285798092301384, 0.3285798092301384, 0.17997077302830203],
[0.2733141711848642,
0.45274108992749557,
0.45274108992749557,
0.3285798092301384],
[0.0, 0.26903843779564635, 0.26903843779564635, 0.0],
[0.0, 0.2778464314878544, 0.2778464314878544, 0.0],
[0.0, 0.30565801021330624, 0.30565801021330624, 0.2778464314878544],
[0.26903843779564635,
0.5132012196645103,
0.5132012196645103,
0.30565801021330624],
[0.0, 0.17361359077926955, 0.17361359077926955, 0.0],
[0.0, 0.26920063399006655, 0.26920063399006655, 0.17361359077926955],
[0.0, 0.19234199258680965, 0.19234199258680965, 0.0],
[0.0, 0.23259832292710847, 0.23259832292710847, 0.0],
[0.19234199258680965,
0.3207803415976658,
0.3207803415976658,
0.23259832292710847],
[0.0, 0.1772785970481172, 0.1772785970481172, 0.0],
[0.0, 0.2078965544789089, 0.2078965544789089, 0.0],
[0.1772785970481172,
0.34725719635513774,
0.34725719635513774,
0.2078965544789089],
[0.3207803415976658,
0.4830351672375104,
0.4830351672375104,
0.34725719635513774],
[0.26920063399006655,
0.5837152751551123,
0.5837152751551123,
0.4830351672375104],
[0.5132012196645103,
0.6895392014624909,
0.6895392014624909,
0.5837152751551123],
[0.45274108992749557,
0.8150132507151894,
0.8150132507151894,
0.6895392014624909],
[0.5929383229764894,
2.081013463489638,
2.081013463489638,
0.8150132507151894],
[0.9584521635877752,
3.1556008507593947,
3.1556008507593947,
2.081013463489638]],
'ivl': ['38', '41', '45', '47', '23', '49', '17', '21', '40', '44', '34',
'42', '52', '7', '5', '0', '53', '15', '4', '9', '6', '14',
'16', '1', '11', '12', '2', '3', '10', '8', '13', '18', '26',
'22', '27', '37', '24', '30', '35', '50', '51', '25', '20', '48',
'39', '28', '31', '36', '46', '19', '29', '32', '43', '33', '54'],
'leaves': [38, 41, 45, 47, 23, 49, 17, 21, 40, 44, 34,
42, 52, 7, 5, 0, 53, 15, 4, 9, 6, 14,
16, 1, 11, 12, 2, 3, 10, 8, 13, 18, 26,
22, 27, 37, 24, 30, 35, 50, 51, 25, 20, 48,
39, 28, 31, 36, 46, 19, 29, 32, 43, 33, 54],
'color_list': ['g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g', 'g',
'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r',
'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r',
'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r',
'r', 'r', 'r', 'r', 'r',
'b']}
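The SciPy call above receives the square similarity matrix directly, so it treats every row as a feature vector. One alternative, sketched here only as an illustration, is to turn cosine similarity into a distance and pass it to `linkage` in condensed form. The random `toy_vectors`, the `doc0`...`doc5` labels, and the choice of average linkage are all assumptions made for this sketch, not part of the original analysis. Average linkage is used because Ward's method presumes Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import squareform

# Toy stand-in for the 55 x 55 document similarity matrix (hypothetical data).
rng = np.random.default_rng(0)
toy_vectors = rng.random((6, 20))
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
toy_similarity = unit @ unit.T

# Cosine similarity -> cosine distance; clip rounding noise, zero the diagonal.
toy_distance = np.clip(1.0 - toy_similarity, 0.0, None)
toy_distance = (toy_distance + toy_distance.T) / 2
np.fill_diagonal(toy_distance, 0.0)

# When given distances, linkage() expects the condensed (upper-triangle) vector.
condensed = squareform(toy_distance, checks=False)
Z_toy = linkage(condensed, method="average")

# no_plot=True returns only the plot data; drop it to draw with matplotlib.
toy_labels = [f"doc{i}" for i in range(len(toy_distance))]
tree = dendrogram(Z_toy, labels=toy_labels, orientation="right", no_plot=True)

# Cut the tree into at most three flat clusters.
clusters = fcluster(Z_toy, t=3, criterion="maxclust")
print(tree["ivl"])
print(clusters)
```

With the real data, the same pattern would use `similarity_matrix.values` in place of `toy_similarity` and `corpus_id` or `corpus_cat` as the labels, which makes the leaves easier to read than bare row indices.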