4. Assignment IV: Text Vectorization
4.1. Question 1
Use `nltk.corpus.inaugural` as the corpus for this exercise. It is a collection of US presidential inaugural speeches. Cluster all the inaugural speeches in the corpus based on their bag-of-words vectorized representations.
Please use the following settings for the bag-of-words model:
- Use the English stopwords provided in `nltk.corpus.stopwords.words('english')` to remove uninformative words.
- Lemmatize word tokens using `WordNetLemmatizer()`.
- Normalize the letter casing.
- Include in the bag-of-words model only words consisting of letters or hyphens.
- Use `TfidfVectorizer()` for bag-of-words vectorization.
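The settings above could be sketched roughly as follows. This is a minimal illustration, not the assignment's reference solution: the helper names (`normalize`, `load_inaugural_tokens`, `build_tfidf`), the exact regular expression for "letters or hyphens", and the order of the steps (lowercase, filter, remove stopwords, lemmatize) are all assumptions. The toy stopword set and lemma table stand in for the NLTK resources so that the sketch runs without downloading any data.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed reading of "letters or hyphens": lowercase letters,
# optionally joined by single internal hyphens (e.g. "well-being").
WORD_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def normalize(tokens, stopwords, lemmatize):
    """Lowercase, keep letter/hyphen words, drop stopwords, then lemmatize."""
    kept = []
    for token in tokens:
        token = token.lower()
        if WORD_RE.fullmatch(token) and token not in stopwords:
            kept.append(lemmatize(token))
    return kept


def load_inaugural_tokens():
    """Apply normalize() to every speech in the corpus.

    Needs the NLTK 'inaugural', 'stopwords', and 'wordnet' data packages,
    so it is defined here but not called in this sketch.
    """
    from nltk.corpus import inaugural, stopwords
    from nltk.stem import WordNetLemmatizer

    stop = set(stopwords.words("english"))
    lemmatize = WordNetLemmatizer().lemmatize
    return {
        fileid: normalize(inaugural.words(fileid), stop, lemmatize)
        for fileid in inaugural.fileids()
    }


def build_tfidf(token_lists):
    """Vectorize pre-tokenized documents.

    A callable analyzer receives each document (a token list) and returns
    its tokens unchanged, bypassing the vectorizer's own tokenization.
    """
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    matrix = vectorizer.fit_transform(token_lists)
    return vectorizer, matrix


# Toy stand-ins (hypothetical data) for the NLTK stopwords and lemmatizer.
toy_stopwords = {"the", "of", "and"}
toy_lemmas = {"nations": "nation", "duties": "duty"}
toy_speeches = [
    "The duties of the Nations and the well-being of citizens in 1789".split(),
    "The nation and the duty of self-government".split(),
]
toy_tokens = [
    normalize(speech, toy_stopwords, lambda w: toy_lemmas.get(w, w))
    for speech in toy_speeches
]
vectorizer, tfidf = build_tfidf(toy_tokens)
print(sorted(vectorizer.vocabulary_))
print(tfidf.shape)
```

Whether stopwords are removed before or after lemmatization and case normalization changes the resulting vocabulary, so it is worth stating the chosen order explicitly in the notebook.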
tfidf_df
Speech | abandon | abandonment | abate | abdicate | abeyance | abhorring | abide | abiding | ability | abnormal | ... | yesterday | yield | yielding | york | yorktown | young | your | youth | zeal | zone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1789-Washington | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1793-Washington | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1797-Adams | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 |
1801-Jefferson | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 |
1805-Jefferson | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.10 | 0.00 |
1809-Madison | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1813-Madison | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1817-Monroe | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.04 |
1821-Monroe | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | ... | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1825-Adams | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 |
1829-Jackson | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 |
1833-Jackson | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.00 | ... | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1837-VanBuren | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.03 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1841-Harrison | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1845-Polk | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1849-Taylor | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1853-Pierce | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1857-Buchanan | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1861-Lincoln | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.01 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1865-Lincoln | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1869-Grant | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.00 | 0.04 | 0.00 | 0.10 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1873-Grant | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1877-Hayes | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1881-Garfield | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.02 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1885-Cleveland | 0.03 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 |
1889-Harrison | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 | ... | 0.00 | 0.02 | 0.00 | 0.06 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1893-Cleveland | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1897-McKinley | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.07 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.06 | 0.00 |
1901-McKinley | 0.03 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1905-Roosevelt | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1909-Taft | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1913-Wilson | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1917-Wilson | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1921-Harding | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.03 | ... | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1925-Coolidge | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1929-Hoover | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.03 | 0.06 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 |
1933-Roosevelt | 0.00 | 0.04 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1937-Roosevelt | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1941-Roosevelt | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1945-Roosevelt | 0.00 | 0.09 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1949-Truman | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1953-Eisenhower | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.02 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1957-Eisenhower | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1961-Kennedy | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1965-Johnson | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1969-Nixon | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.07 | 0.00 | 0.00 |
1973-Nixon | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1977-Carter | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1981-Reagan | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1985-Reagan | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 |
1989-Bush | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 |
1993-Clinton | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | ... | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 |
1997-Clinton | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2001-Bush | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2005-Bush | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2009-Obama | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2013-Obama | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 |
2017-Trump | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
58 rows × 5240 columns
similarity_df
Speech | 1789-Washington | 1793-Washington | 1797-Adams | 1801-Jefferson | 1805-Jefferson | 1809-Madison | 1813-Madison | 1817-Monroe | 1821-Monroe | 1825-Adams | ... | 1981-Reagan | 1985-Reagan | 1989-Bush | 1993-Clinton | 1997-Clinton | 2001-Bush | 2005-Bush | 2009-Obama | 2013-Obama | 2017-Trump |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1789-Washington | 1.000000 | 0.100469 | 0.263077 | 0.222896 | 0.221114 | 0.235420 | 0.155258 | 0.271341 | 0.252055 | 0.271323 | ... | 0.132736 | 0.143454 | 0.128986 | 0.098833 | 0.114395 | 0.135037 | 0.141243 | 0.137906 | 0.137122 | 0.100266 |
1793-Washington | 0.100469 | 1.000000 | 0.116753 | 0.110464 | 0.105334 | 0.058795 | 0.083094 | 0.085385 | 0.098381 | 0.079854 | ... | 0.089135 | 0.062981 | 0.055985 | 0.085991 | 0.057881 | 0.079049 | 0.113584 | 0.053607 | 0.093894 | 0.101117 |
1797-Adams | 0.263077 | 0.116753 | 1.000000 | 0.271978 | 0.255818 | 0.269615 | 0.177193 | 0.373816 | 0.325464 | 0.363519 | ... | 0.187246 | 0.206405 | 0.176901 | 0.177555 | 0.212440 | 0.191061 | 0.226490 | 0.211674 | 0.200487 | 0.192239 |
1801-Jefferson | 0.222896 | 0.110464 | 0.271978 | 1.000000 | 0.312475 | 0.256926 | 0.159447 | 0.292838 | 0.249709 | 0.311659 | ... | 0.207168 | 0.227679 | 0.192643 | 0.133177 | 0.177914 | 0.169416 | 0.190955 | 0.211151 | 0.198564 | 0.131511 |
1805-Jefferson | 0.221114 | 0.105334 | 0.255818 | 0.312475 | 1.000000 | 0.252321 | 0.170299 | 0.333710 | 0.315158 | 0.299334 | ... | 0.172774 | 0.174084 | 0.171594 | 0.117881 | 0.135964 | 0.180568 | 0.181756 | 0.168434 | 0.189254 | 0.111912 |
1809-Madison | 0.235420 | 0.058795 | 0.269615 | 0.256926 | 0.252321 | 1.000000 | 0.189923 | 0.311826 | 0.272209 | 0.303731 | ... | 0.116073 | 0.148220 | 0.119446 | 0.109400 | 0.124340 | 0.136778 | 0.158336 | 0.143664 | 0.153434 | 0.105658 |
1813-Madison | 0.155258 | 0.083094 | 0.177193 | 0.159447 | 0.170299 | 0.189923 | 1.000000 | 0.252817 | 0.255972 | 0.187200 | ... | 0.124457 | 0.115699 | 0.105333 | 0.089038 | 0.105538 | 0.134604 | 0.135726 | 0.124087 | 0.144395 | 0.095246 |
1817-Monroe | 0.271341 | 0.085385 | 0.373816 | 0.292838 | 0.333710 | 0.311826 | 0.252817 | 1.000000 | 0.526531 | 0.424570 | ... | 0.189510 | 0.215947 | 0.185875 | 0.140203 | 0.172928 | 0.173909 | 0.188759 | 0.187230 | 0.194673 | 0.158218 |
1821-Monroe | 0.252055 | 0.098381 | 0.325464 | 0.249709 | 0.315158 | 0.272209 | 0.255972 | 0.526531 | 1.000000 | 0.373039 | ... | 0.190434 | 0.192262 | 0.170365 | 0.139803 | 0.163219 | 0.173564 | 0.173309 | 0.181176 | 0.202169 | 0.140568 |
1825-Adams | 0.271323 | 0.079854 | 0.363519 | 0.311659 | 0.299334 | 0.303731 | 0.187200 | 0.424570 | 0.373039 | 1.000000 | ... | 0.178828 | 0.229070 | 0.176181 | 0.138276 | 0.200806 | 0.188288 | 0.194322 | 0.189036 | 0.212285 | 0.136648 |
1829-Jackson | 0.239658 | 0.079782 | 0.300506 | 0.236003 | 0.264835 | 0.252786 | 0.128543 | 0.329080 | 0.285537 | 0.295693 | ... | 0.127304 | 0.149233 | 0.111519 | 0.087705 | 0.102312 | 0.134401 | 0.145167 | 0.127328 | 0.141798 | 0.088108 |
1833-Jackson | 0.253547 | 0.081887 | 0.301258 | 0.262312 | 0.232326 | 0.202349 | 0.149420 | 0.394226 | 0.295446 | 0.344810 | ... | 0.196121 | 0.218858 | 0.142032 | 0.127417 | 0.154544 | 0.139271 | 0.171167 | 0.140782 | 0.160684 | 0.111765 |
1837-VanBuren | 0.271228 | 0.117099 | 0.341596 | 0.291945 | 0.320544 | 0.289802 | 0.202584 | 0.407942 | 0.361319 | 0.352325 | ... | 0.190075 | 0.211343 | 0.167520 | 0.143890 | 0.175345 | 0.184472 | 0.205019 | 0.200026 | 0.202815 | 0.158900 |
1841-Harrison | 0.297393 | 0.127392 | 0.396508 | 0.329862 | 0.340092 | 0.260406 | 0.205536 | 0.434200 | 0.410776 | 0.407258 | ... | 0.228042 | 0.224249 | 0.221103 | 0.144520 | 0.185594 | 0.199540 | 0.229179 | 0.186926 | 0.225010 | 0.176415 |
1845-Polk | 0.266331 | 0.096425 | 0.340340 | 0.310211 | 0.359256 | 0.270406 | 0.201489 | 0.483244 | 0.415317 | 0.477900 | ... | 0.212794 | 0.240690 | 0.152218 | 0.126507 | 0.182691 | 0.162359 | 0.191676 | 0.163536 | 0.191139 | 0.160435 |
1849-Taylor | 0.231938 | 0.127836 | 0.267764 | 0.222268 | 0.228333 | 0.196377 | 0.151193 | 0.301321 | 0.273624 | 0.332504 | ... | 0.148453 | 0.147958 | 0.132474 | 0.103793 | 0.116114 | 0.154935 | 0.144307 | 0.135801 | 0.150633 | 0.132662 |
1853-Pierce | 0.241781 | 0.082371 | 0.289567 | 0.310205 | 0.296786 | 0.285845 | 0.192135 | 0.333828 | 0.321594 | 0.349790 | ... | 0.210606 | 0.215175 | 0.204147 | 0.162242 | 0.190930 | 0.187278 | 0.217798 | 0.219303 | 0.216425 | 0.150565 |
1857-Buchanan | 0.229841 | 0.105716 | 0.352369 | 0.253419 | 0.278403 | 0.243105 | 0.190539 | 0.439213 | 0.386490 | 0.360350 | ... | 0.179700 | 0.200743 | 0.160270 | 0.126147 | 0.169273 | 0.155799 | 0.197082 | 0.180739 | 0.191561 | 0.153631 |
1861-Lincoln | 0.217834 | 0.137906 | 0.281549 | 0.258098 | 0.244461 | 0.197658 | 0.153655 | 0.339499 | 0.313925 | 0.319026 | ... | 0.169377 | 0.190612 | 0.171882 | 0.121957 | 0.142565 | 0.131084 | 0.159304 | 0.141254 | 0.172601 | 0.114982 |
1865-Lincoln | 0.087317 | 0.041699 | 0.102704 | 0.112267 | 0.117207 | 0.103497 | 0.146779 | 0.162541 | 0.168557 | 0.169242 | ... | 0.105916 | 0.128205 | 0.114306 | 0.087169 | 0.116270 | 0.098314 | 0.101175 | 0.151608 | 0.123847 | 0.080439 |
1869-Grant | 0.157543 | 0.092843 | 0.188218 | 0.163297 | 0.219834 | 0.166446 | 0.109324 | 0.215937 | 0.222320 | 0.200927 | ... | 0.120316 | 0.134302 | 0.101294 | 0.080857 | 0.114267 | 0.119278 | 0.128227 | 0.153464 | 0.139444 | 0.143127 |
1873-Grant | 0.167341 | 0.064751 | 0.212574 | 0.193820 | 0.214197 | 0.162942 | 0.133407 | 0.252082 | 0.256386 | 0.209966 | ... | 0.159831 | 0.177298 | 0.150492 | 0.124808 | 0.163144 | 0.158463 | 0.148972 | 0.158733 | 0.170947 | 0.134293 |
1877-Hayes | 0.268814 | 0.089404 | 0.334941 | 0.264048 | 0.257693 | 0.240323 | 0.195639 | 0.369109 | 0.353596 | 0.360152 | ... | 0.172962 | 0.223189 | 0.179651 | 0.136826 | 0.177776 | 0.199904 | 0.191357 | 0.182017 | 0.199402 | 0.152157 |
1881-Garfield | 0.198924 | 0.115467 | 0.333774 | 0.271410 | 0.268331 | 0.242683 | 0.178135 | 0.376154 | 0.327899 | 0.357122 | ... | 0.222055 | 0.273979 | 0.203552 | 0.195838 | 0.255653 | 0.194294 | 0.257129 | 0.212226 | 0.242525 | 0.175955 |
1885-Cleveland | 0.242538 | 0.150555 | 0.313967 | 0.248439 | 0.247422 | 0.184372 | 0.148094 | 0.331348 | 0.262678 | 0.314950 | ... | 0.211754 | 0.215103 | 0.155788 | 0.170707 | 0.208044 | 0.183904 | 0.185560 | 0.175272 | 0.208878 | 0.168875 |
1889-Harrison | 0.201107 | 0.110286 | 0.311506 | 0.210770 | 0.274758 | 0.237249 | 0.171349 | 0.339874 | 0.328590 | 0.289515 | ... | 0.184922 | 0.211650 | 0.186595 | 0.152990 | 0.192581 | 0.191090 | 0.199910 | 0.186237 | 0.194092 | 0.159556 |
1893-Cleveland | 0.187081 | 0.082868 | 0.245104 | 0.224297 | 0.204598 | 0.168568 | 0.132638 | 0.275993 | 0.233625 | 0.243179 | ... | 0.182321 | 0.197097 | 0.130463 | 0.145739 | 0.170313 | 0.177705 | 0.175770 | 0.166623 | 0.185582 | 0.136337 |
1897-McKinley | 0.198859 | 0.093912 | 0.311027 | 0.240926 | 0.252417 | 0.220070 | 0.192731 | 0.369348 | 0.386539 | 0.302438 | ... | 0.200912 | 0.218691 | 0.184370 | 0.166455 | 0.166244 | 0.161263 | 0.180224 | 0.182362 | 0.205185 | 0.161228 |
1901-McKinley | 0.194617 | 0.074237 | 0.284042 | 0.228160 | 0.208129 | 0.186097 | 0.179577 | 0.325199 | 0.349830 | 0.283190 | ... | 0.201454 | 0.225115 | 0.191309 | 0.154646 | 0.181927 | 0.158640 | 0.220130 | 0.179475 | 0.212481 | 0.144608 |
1905-Roosevelt | 0.135136 | 0.029364 | 0.205791 | 0.167768 | 0.163999 | 0.143277 | 0.106975 | 0.182755 | 0.172575 | 0.181196 | ... | 0.202199 | 0.198707 | 0.224674 | 0.174449 | 0.220328 | 0.192387 | 0.184496 | 0.239070 | 0.200623 | 0.160882 |
1909-Taft | 0.199444 | 0.078073 | 0.226108 | 0.195742 | 0.225772 | 0.194128 | 0.143382 | 0.298367 | 0.332566 | 0.264082 | ... | 0.174716 | 0.200757 | 0.178160 | 0.151725 | 0.176534 | 0.169025 | 0.172847 | 0.177021 | 0.185307 | 0.148024 |
1913-Wilson | 0.144212 | 0.043186 | 0.178558 | 0.187577 | 0.156692 | 0.129842 | 0.100065 | 0.170188 | 0.154201 | 0.168813 | ... | 0.216933 | 0.216021 | 0.252401 | 0.174070 | 0.207795 | 0.166648 | 0.194795 | 0.242959 | 0.184979 | 0.174285 |
1917-Wilson | 0.146761 | 0.033300 | 0.227297 | 0.202814 | 0.210788 | 0.194649 | 0.154554 | 0.226717 | 0.227365 | 0.217365 | ... | 0.186601 | 0.215518 | 0.241470 | 0.219048 | 0.222669 | 0.209503 | 0.230307 | 0.238202 | 0.218702 | 0.174091 |
1921-Harding | 0.162135 | 0.079501 | 0.217242 | 0.196333 | 0.190327 | 0.176106 | 0.196644 | 0.225617 | 0.216628 | 0.230081 | ... | 0.242715 | 0.271511 | 0.250049 | 0.309930 | 0.270029 | 0.231276 | 0.280244 | 0.261933 | 0.250398 | 0.232015 |
1925-Coolidge | 0.180862 | 0.084242 | 0.317610 | 0.265334 | 0.279402 | 0.254756 | 0.196499 | 0.343191 | 0.342836 | 0.327664 | ... | 0.264206 | 0.291312 | 0.241890 | 0.223416 | 0.230321 | 0.252573 | 0.289345 | 0.248131 | 0.270035 | 0.229347 |
1929-Hoover | 0.179763 | 0.079659 | 0.284060 | 0.225666 | 0.212693 | 0.192845 | 0.174830 | 0.271156 | 0.275473 | 0.270245 | ... | 0.234234 | 0.277041 | 0.208748 | 0.207688 | 0.223184 | 0.227476 | 0.268296 | 0.228143 | 0.217975 | 0.192496 |
1933-Roosevelt | 0.140593 | 0.042248 | 0.171998 | 0.149116 | 0.164020 | 0.142785 | 0.125921 | 0.215969 | 0.194958 | 0.179669 | ... | 0.206676 | 0.217863 | 0.252645 | 0.194562 | 0.211253 | 0.158872 | 0.182693 | 0.237826 | 0.217244 | 0.150715 |
1937-Roosevelt | 0.149017 | 0.066298 | 0.235792 | 0.192260 | 0.192132 | 0.154735 | 0.148608 | 0.242199 | 0.212352 | 0.250178 | ... | 0.266448 | 0.280185 | 0.282505 | 0.248074 | 0.293379 | 0.226946 | 0.252626 | 0.274907 | 0.255194 | 0.222135 |
1941-Roosevelt | 0.121813 | 0.060130 | 0.251591 | 0.164989 | 0.155462 | 0.128153 | 0.117272 | 0.179671 | 0.174861 | 0.187263 | ... | 0.250740 | 0.281791 | 0.301133 | 0.267443 | 0.309926 | 0.284725 | 0.361932 | 0.307005 | 0.295888 | 0.239297 |
1945-Roosevelt | 0.059130 | 0.059212 | 0.101460 | 0.129558 | 0.099616 | 0.099088 | 0.073456 | 0.124281 | 0.102751 | 0.118771 | ... | 0.177419 | 0.235870 | 0.264253 | 0.200823 | 0.210729 | 0.180100 | 0.194650 | 0.210754 | 0.239829 | 0.162344 |
1949-Truman | 0.139416 | 0.048009 | 0.248022 | 0.209795 | 0.186282 | 0.196736 | 0.153536 | 0.257644 | 0.247849 | 0.228649 | ... | 0.263074 | 0.343316 | 0.273920 | 0.237350 | 0.246739 | 0.214087 | 0.323666 | 0.246192 | 0.274052 | 0.182489 |
1953-Eisenhower | 0.146783 | 0.061330 | 0.211000 | 0.211501 | 0.190056 | 0.167791 | 0.155883 | 0.231503 | 0.207911 | 0.202824 | ... | 0.285598 | 0.329671 | 0.301611 | 0.282566 | 0.281789 | 0.263716 | 0.323174 | 0.299918 | 0.299235 | 0.222314 |
1957-Eisenhower | 0.103462 | 0.058415 | 0.209156 | 0.193006 | 0.172302 | 0.163121 | 0.142133 | 0.178976 | 0.188737 | 0.228676 | ... | 0.273809 | 0.327630 | 0.289657 | 0.273558 | 0.272879 | 0.247863 | 0.347072 | 0.296111 | 0.274539 | 0.226652 |
1961-Kennedy | 0.125081 | 0.058710 | 0.151539 | 0.200763 | 0.156918 | 0.108401 | 0.166571 | 0.155277 | 0.159073 | 0.162137 | ... | 0.301872 | 0.307210 | 0.277737 | 0.279308 | 0.271597 | 0.234492 | 0.263239 | 0.288503 | 0.265859 | 0.195977 |
1965-Johnson | 0.114979 | 0.042360 | 0.170738 | 0.207508 | 0.175268 | 0.145923 | 0.108162 | 0.165974 | 0.153818 | 0.232030 | ... | 0.265267 | 0.301875 | 0.268056 | 0.271136 | 0.307208 | 0.245794 | 0.272477 | 0.305062 | 0.279320 | 0.219613 |
1969-Nixon | 0.122489 | 0.072958 | 0.185052 | 0.232233 | 0.159178 | 0.120675 | 0.112196 | 0.167725 | 0.165441 | 0.171592 | ... | 0.326510 | 0.380931 | 0.366201 | 0.326139 | 0.359952 | 0.276771 | 0.291767 | 0.332912 | 0.296507 | 0.311116 |
1973-Nixon | 0.124212 | 0.069505 | 0.194943 | 0.201252 | 0.152232 | 0.146956 | 0.118058 | 0.185654 | 0.185928 | 0.188753 | ... | 0.297767 | 0.377705 | 0.315695 | 0.431685 | 0.435901 | 0.319748 | 0.326282 | 0.342594 | 0.312722 | 0.310858 |
1977-Carter | 0.112771 | 0.074625 | 0.187136 | 0.173800 | 0.157600 | 0.157880 | 0.142972 | 0.172554 | 0.165357 | 0.179238 | ... | 0.270177 | 0.272716 | 0.244791 | 0.245557 | 0.304649 | 0.256859 | 0.276639 | 0.256244 | 0.256056 | 0.221400 |
1981-Reagan | 0.132736 | 0.089135 | 0.187246 | 0.207168 | 0.172774 | 0.116073 | 0.124457 | 0.189510 | 0.190434 | 0.178828 | ... | 1.000000 | 0.425823 | 0.324284 | 0.324647 | 0.327586 | 0.276996 | 0.324140 | 0.317933 | 0.319519 | 0.306585 |
1985-Reagan | 0.143454 | 0.062981 | 0.206405 | 0.227679 | 0.174084 | 0.148220 | 0.115699 | 0.215947 | 0.192262 | 0.229070 | ... | 0.425823 | 1.000000 | 0.380280 | 0.359553 | 0.381739 | 0.308768 | 0.370062 | 0.337947 | 0.350950 | 0.292282 |
1989-Bush | 0.128986 | 0.055985 | 0.176901 | 0.192643 | 0.171594 | 0.119446 | 0.105333 | 0.185875 | 0.170365 | 0.176181 | ... | 0.324284 | 0.380280 | 1.000000 | 0.344350 | 0.352426 | 0.317701 | 0.315920 | 0.348850 | 0.331689 | 0.305273 |
1993-Clinton | 0.098833 | 0.085991 | 0.177555 | 0.133177 | 0.117881 | 0.109400 | 0.089038 | 0.140203 | 0.139803 | 0.138276 | ... | 0.324647 | 0.359553 | 0.344350 | 1.000000 | 0.441915 | 0.318053 | 0.364296 | 0.394513 | 0.376239 | 0.377192 |
1997-Clinton | 0.114395 | 0.057881 | 0.212440 | 0.177914 | 0.135964 | 0.124340 | 0.105538 | 0.172928 | 0.163219 | 0.200806 | ... | 0.327586 | 0.381739 | 0.352426 | 0.441915 | 1.000000 | 0.359197 | 0.366349 | 0.365223 | 0.389452 | 0.368267 |
2001-Bush | 0.135037 | 0.079049 | 0.191061 | 0.169416 | 0.180568 | 0.136778 | 0.134604 | 0.173909 | 0.173564 | 0.188288 | ... | 0.276996 | 0.308768 | 0.317701 | 0.318053 | 0.359197 | 1.000000 | 0.373461 | 0.318852 | 0.350078 | 0.308994 |
2005-Bush | 0.141243 | 0.113584 | 0.226490 | 0.190955 | 0.181756 | 0.158336 | 0.135726 | 0.188759 | 0.173309 | 0.194322 | ... | 0.324140 | 0.370062 | 0.315920 | 0.364296 | 0.366349 | 0.373461 | 1.000000 | 0.352493 | 0.377653 | 0.329433 |
2009-Obama | 0.137906 | 0.053607 | 0.211674 | 0.211151 | 0.168434 | 0.143664 | 0.124087 | 0.187230 | 0.181176 | 0.189036 | ... | 0.317933 | 0.337947 | 0.348850 | 0.394513 | 0.365223 | 0.318852 | 0.352493 | 1.000000 | 0.442743 | 0.334788 |
2013-Obama | 0.137122 | 0.093894 | 0.200487 | 0.198564 | 0.189254 | 0.153434 | 0.144395 | 0.194673 | 0.202169 | 0.212285 | ... | 0.319519 | 0.350950 | 0.331689 | 0.376239 | 0.389452 | 0.350078 | 0.377653 | 0.442743 | 1.000000 | 0.328498 |
2017-Trump | 0.100266 | 0.101117 | 0.192239 | 0.131511 | 0.111912 | 0.105658 | 0.095246 | 0.158218 | 0.140568 | 0.136648 | ... | 0.306585 | 0.292282 | 0.305273 | 0.377192 | 0.368267 | 0.308994 | 0.329433 | 0.334788 | 0.328498 | 1.000000 |
58 rows × 58 columns
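A document-by-document matrix like the one above, symmetric with ones on the diagonal, can be produced by taking pairwise cosine similarities of the TF-IDF rows. The sketch below assumes cosine similarity is the measure used (the hidden cell is not shown, so this is a guess), and the three toy documents and their labels are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the 58 speeches.
toy_docs = {
    "doc-a": "liberty union constitution government",
    "doc-b": "union constitution government law",
    "doc-c": "freedom world peace hope",
}
labels = list(toy_docs)

# Rows are L2-normalized by default, so each document has similarity 1 with itself.
tfidf = TfidfVectorizer().fit_transform(list(toy_docs.values()))
similarity = cosine_similarity(tfidf)

# Label both axes with the document names, as in the table above.
similarity_df = pd.DataFrame(similarity, index=labels, columns=labels)
print(similarity_df.round(3))
```

Documents that share no vocabulary (here `doc-a` and `doc-c`) get a similarity of zero, while documents with overlapping terms score higher.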
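import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Z (linkage matrix) and textsid (speech labels) come from earlier cells not shown here.
plt.figure(figsize=(7, 5))
plt.title('US Presidential Inaugural Speech Analysis')
plt.xlabel('Inaugural Speech')
plt.ylabel('Distance')
color_threshold = 1.8
# Branches that merge below the threshold are colored by cluster.
dendrogram(Z, color_threshold=color_threshold, labels=textsid)
# Mark the threshold with a dashed horizontal line.
plt.axhline(y=color_threshold, c='k', ls='--', lw=0.5)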
[Figure: dendrogram titled "US Presidential Inaugural Speech Analysis", with a dashed line at the color threshold of 1.8]
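The plotting code relies on a linkage matrix `Z` that is computed in a cell not shown here. One possible way to obtain such a matrix is sketched below, assuming cosine distance between document vectors and average linkage; both choices, the 0.5 cut-off, and the toy matrix are illustrative assumptions, and the original notebook may well use a different distance or linkage method.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical TF-IDF-like matrix: 4 documents x 4 terms.
names = ["doc-a", "doc-b", "doc-c", "doc-d"]
X = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
])

# Condensed pairwise cosine distances (1 - cosine similarity).
distances = pdist(X, metric="cosine")

# Agglomerative clustering; each row of Z records one merge.
Z = linkage(distances, method="average")

# Cutting the tree at a distance threshold yields flat cluster labels,
# the same grouping that color_threshold highlights in a dendrogram.
threshold = 0.5
cluster_ids = fcluster(Z, t=threshold, criterion="distance")

# no_plot=True returns the layout without requiring matplotlib.
tree = dendrogram(Z, labels=names, color_threshold=threshold, no_plot=True)
print(dict(zip(names, cluster_ids)))
print(tree["ivl"])
```

Because the dendrogram's y-axis is in the units of the chosen distance and linkage method, a threshold such as 1.8 is only meaningful relative to those choices.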
4.2. Question 2
Please use the Chinese song lyrics in the directory `demo_data/ChineseSongLyrics` as the corpus for this exercise. The directory contains song lyrics from nine Chinese pop-song artists.
Please use the bag-of-words method to vectorize each artist's song lyrics, and provide a cluster analysis of the artists based on the textual similarity of their lyrics.
A few notes for data processing:
- Please use `ckip-transformers` for word segmentation and POS tagging.
- Please build the bag-of-words model using `TfidfVectorizer()`.
- Please include in the model only words (a) whose POS tags start with 'N' or 'V', and (b) which contain no digits, alphabetic letters, symbols, or punctuation.
- Please make sure the word tokens stay intact when vectorizing with `TfidfVectorizer()`.
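The filtering and "tokens intact" requirements above might be sketched as follows. Several pieces are assumptions rather than the assignment's prescribed solution: the helper names are hypothetical, the character test (every character in Unicode category `Lo`, which covers Chinese characters but excludes digits, Latin letters, symbols, and punctuation) is one possible reading of condition (b), and the `bert-base` model name depends on the installed `ckip-transformers` version. The toy tagged tokens replace real segmenter output so that nothing is downloaded.

```python
import unicodedata

from sklearn.feature_extraction.text import TfidfVectorizer


def is_clean_word(word):
    """True if every character is in Unicode category 'Lo' (e.g. Chinese characters)."""
    return bool(word) and all(unicodedata.category(ch) == "Lo" for ch in word)


def keep_token(word, pos):
    """Keep nouns and verbs (POS starting with 'N' or 'V') made of clean characters."""
    return pos.startswith(("N", "V")) and is_clean_word(word)


def segment_and_tag(texts):
    """Hypothetical wrapper around ckip-transformers.

    Downloads pretrained models on first use, so it is not called in this sketch.
    """
    from ckip_transformers.nlp import CkipPosTagger, CkipWordSegmenter

    words = CkipWordSegmenter(model="bert-base")(texts)
    tags = CkipPosTagger(model="bert-base")(words)
    return [list(zip(w, t)) for w, t in zip(words, tags)]


# Toy (word, POS) pairs standing in for real segmenter output.
toy_tagged = {
    "artist-a": [("我", "Nh"), ("愛", "VL"), ("你", "Nh"), ("，", "COMMACATEGORY"),
                 ("Baby", "FW"), ("的", "DE"), ("鼻", "Na")],
    "artist-b": [("龍捲風", "Na"), ("來", "VA"), ("了", "Di"), ("2", "Neu"), ("愛", "VL")],
}

# Join the kept tokens with spaces so that splitting on whitespace
# recovers exactly the segmenter's tokens.
docs = [
    " ".join(word for word, pos in pairs if keep_token(word, pos))
    for pairs in toy_tagged.values()
]

# Whitespace tokenizer keeps every token, including one-character words.
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
tv = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))

# For contrast: the default token pattern only keeps tokens of 2+ characters.
default_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)
print(default_vocab)
```

The contrast at the end is the reason for the last note in the list: with the default settings, single-character words, which are very common in Chinese, would be silently dropped from the model.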
The expected result is a dendrogram like the one shown below. Note that, depending on how you preprocess the data and set the parameters of the bag-of-words representation, your results may differ somewhat. Please state and justify your parameter settings for the bag-of-words model (i.e., using markdown cells in the notebook).
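As an illustration of how parameter settings can change the representation, the sketch below compares the vocabulary produced under different `min_df` and `max_df` values. The three toy documents and the specific thresholds are hypothetical and only meant to show the mechanism; suitable values for the lyrics corpus need to be chosen and justified separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-segmented documents (tokens separated by spaces).
docs = ["愛 你 天空", "愛 我 海", "愛 你 夢"]


def vocabulary(**params):
    """Fit a whitespace-tokenizing TfidfVectorizer and return its vocabulary."""
    vectorizer = TfidfVectorizer(
        tokenizer=str.split, token_pattern=None, lowercase=False, **params
    )
    vectorizer.fit(docs)
    return set(vectorizer.vocabulary_)


# No document-frequency filtering: every token is kept.
full = vocabulary()

# min_df=2 drops words that occur in fewer than two documents.
frequent = vocabulary(min_df=2)

# max_df=0.9 additionally drops words found in more than 90% of documents.
trimmed = vocabulary(min_df=2, max_df=0.9)

print(len(full), sorted(frequent), sorted(trimmed))
```

Removing words that appear in nearly every document can matter when all pairwise similarities are very high, since such words contribute little to telling the documents apart.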
tv_df
Artist | ㄧ | ㄧ起 | 一 | 一下 | 一世 | 一九四三 | 一二三 | 一些 | 一些些 | 一個個 | ... | 鼓起 | 鼻 | 鼻子 | 鼻尖 | 鼻涕 | 齊 | 齣 | 龍 | 龍捲風 | 龜 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
周杰倫 | 0.003903 | 0.001115 | 0.173294 | 0.002746 | 0.000000 | 0.001115 | 0.000558 | 0.001398 | 0.001115 | 0.000000 | ... | 0.000000 | 0.000000 | 0.001147 | 0.000558 | 0.000000 | 0.000000 | 0.001939 | 0.015998 | 0.005576 | 0.000558 |
周興哲 | 0.000000 | 0.000000 | 0.151433 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001383 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
張惠妹 | 0.000000 | 0.000000 | 0.159486 | 0.001253 | 0.000679 | 0.000000 | 0.000000 | 0.004084 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000590 | 0.000000 | 0.000000 |
林俊傑 | 0.000000 | 0.000000 | 0.225602 | 0.004501 | 0.000000 | 0.000914 | 0.000000 | 0.006876 | 0.000000 | 0.000000 | ... | 0.001589 | 0.000000 | 0.000627 | 0.000000 | 0.000914 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
田馥甄 | 0.000000 | 0.000000 | 0.193735 | 0.000000 | 0.000000 | 0.000000 | 0.003713 | 0.003724 | 0.000000 | 0.000000 | ... | 0.000000 | 0.001614 | 0.000000 | 0.001856 | 0.000000 | 0.000000 | 0.001614 | 0.000000 | 0.000000 | 0.000000 |
蔡依林 | 0.000692 | 0.000692 | 0.175459 | 0.007239 | 0.000000 | 0.000000 | 0.000000 | 0.003470 | 0.000692 | 0.000000 | ... | 0.000000 | 0.000601 | 0.003794 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
鄧紫棋 | 0.000000 | 0.000000 | 0.191900 | 0.001905 | 0.000000 | 0.000000 | 0.000000 | 0.000517 | 0.000000 | 0.001032 | ... | 0.000897 | 0.000000 | 0.003536 | 0.000000 | 0.001032 | 0.004126 | 0.000000 | 0.000000 | 0.006189 | 0.002063 |
陳奕迅 | 0.000000 | 0.000000 | 0.231057 | 0.004663 | 0.000842 | 0.000000 | 0.000000 | 0.006543 | 0.000000 | 0.000421 | ... | 0.001464 | 0.004391 | 0.000289 | 0.000000 | 0.000000 | 0.001263 | 0.003659 | 0.000366 | 0.000000 | 0.000000 |
高爾宣 | 0.000000 | 0.000000 | 0.077348 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
9 rows × 7337 columns
similarity_doc_df
Artist | 周杰倫 | 周興哲 | 張惠妹 | 林俊傑 | 田馥甄 | 蔡依林 | 鄧紫棋 | 陳奕迅 | 高爾宣 |
---|---|---|---|---|---|---|---|---|---|
周杰倫 | 1.000000 | 0.924684 | 0.954221 | 0.964494 | 0.938640 | 0.929480 | 0.928913 | 0.942586 | 0.911681 |
周興哲 | 0.924684 | 1.000000 | 0.971298 | 0.949029 | 0.941841 | 0.966280 | 0.961808 | 0.945111 | 0.789708 |
張惠妹 | 0.954221 | 0.971298 | 1.000000 | 0.969764 | 0.966961 | 0.974818 | 0.973992 | 0.967873 | 0.834848 |
林俊傑 | 0.964494 | 0.949029 | 0.969764 | 1.000000 | 0.962102 | 0.953864 | 0.954434 | 0.967477 | 0.866838 |
田馥甄 | 0.938640 | 0.941841 | 0.966961 | 0.962102 | 1.000000 | 0.957751 | 0.948707 | 0.956214 | 0.833359 |
蔡依林 | 0.929480 | 0.966280 | 0.974818 | 0.953864 | 0.957751 | 1.000000 | 0.965339 | 0.957491 | 0.795709 |
鄧紫棋 | 0.928913 | 0.961808 | 0.973992 | 0.954434 | 0.948707 | 0.965339 | 1.000000 | 0.967545 | 0.792193 |
陳奕迅 | 0.942586 | 0.945111 | 0.967873 | 0.967477 | 0.956214 | 0.957491 | 0.967545 | 1.000000 | 0.823111 |
高爾宣 | 0.911681 | 0.789708 | 0.834848 | 0.866838 | 0.833359 | 0.795709 | 0.792193 | 0.823111 | 1.000000 |
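import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

# Z (linkage matrix) and fileids (artist labels) come from earlier cells not shown here.
plt.figure(figsize=(7, 5))
plt.title('Chinese Pop Song Artists')
plt.xlabel('Artists')
plt.ylabel('Distance')
color_threshold = 0.1
# Branches above the threshold are drawn in blue; leaf labels are rotated 90 degrees.
dendrogram(Z, labels=fileids, leaf_rotation=90, color_threshold=color_threshold, above_threshold_color='b')
# Mark the threshold with a dashed horizontal line.
plt.axhline(y=color_threshold, c='k', ls='--', lw=0.5)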
[Figure: dendrogram titled "Chinese Pop Song Artists", with a dashed line at the color threshold of 0.1]