Word Embeddings#

  • The state-of-the-art approach to vectorizing texts is to learn numeric representations of words using deep learning methods.

  • These deep-learning based numeric representations of linguistic units are commonly referred to as embeddings.

  • Word embeddings can be learned either along with the target NLP task (e.g., the Embedding layer in RNN Language Model) or via an unsupervised method based on a large number of texts.

  • In this tutorial, we will look at two main algorithms in word2vec that allow us to learn the word embeddings in an unsupervised manner from a large collection of texts.

  • Strengths of word embeddings

    • They can be learned using unsupervised methods.

    • They capture a considerable amount of lexical semantics.

    • They can be learned in batches; we don’t have to process the entire corpus at once and build a full word-by-document matrix for vectorization.

    • Therefore, we are less likely to run into memory issues even with huge corpora.

Overview#

What is word2vec?#

  • Word2vec is one of the most popular techniques to learn word embeddings using a two-layer neural network.

  • The input is a text corpus and the output is a set of word vectors.

  • Research has shown that these embeddings include rich semantic information of words, which allows us to perform interesting semantic computations (see Mikolov et al.’s work in References).

Basis of Word Embeddings: Distributional Semantics#

  • “You shall know a word by the company it keeps” (Firth, 1957).

  • Word distributions show a considerable amount of lexical semantics.

  • Construction/Pattern distributions show a considerable amount of the constructional semantics.

  • Semantics of linguistic units are implicitly or explicitly embedded in their distributions (i.e., occurrences and co-occurrences) in language use (Distributional Semantics).
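
  • As a minimal illustration of this idea (a toy sketch only, not part of word2vec itself), we can count word–word co-occurrences in two short sentences; words that keep similar company end up with similar co-occurrence profiles, which is exactly the kind of signal embeddings compress into dense vectors.

# Toy sketch of the distributional idea: a word-by-word co-occurrence counter
# built from two hand-made sentences (illustration only).
from collections import Counter
from itertools import combinations

sentences = [['the', 'sky', 'is', 'blue'], ['the', 'sky', 'is', 'beautiful']]
cooc = Counter()
for sent in sentences:
    for w1, w2 in combinations(sent, 2):  # every co-occurring pair within a sentence
        cooc[(w1, w2)] += 1
        cooc[(w2, w1)] += 1

# 'blue' and 'beautiful' share the same co-occurrence contexts here
print(cooc[('sky', 'blue')], cooc[('sky', 'beautiful')])  # 1 1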

Main training algorithms of word2vec#

  • Continuous Bag-of-Words (CBOW): the language modeling task for embeddings training is to learn a model that uses the context words to predict a target word.

  • Skip-Gram: the language modeling task for embeddings training is to learn a model that uses a target word to predict its context words. (A minimal sketch of how the training pairs are generated for both algorithms follows this list.)

  • Other variants of embeddings training:

    • fastText from Facebook

    • GloVe from Stanford NLP Group

  • There are many ways to train word embeddings:

    • gensim: the simplest and most straightforward implementation of word2vec.

    • Training with deep learning packages (e.g., keras, tensorflow).

    • spacy (it comes with pre-trained embedding models based on GloVe).

  • See Sarkar (2019), Chapter 4, for more comprehensive reviews.
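
  • To make the two training objectives concrete, here is a minimal sketch (illustrative only, not the actual word2vec implementation) that generates (context, target) pairs for CBOW and (target, context) pairs for Skip-Gram from a toy sentence, assuming a symmetric context window of size 2.

# Illustrative sketch: generating training pairs for CBOW and Skip-Gram
# from a toy sentence with a symmetric context window of 2.
sentence = "the quick brown fox jumps".split()
window = 2

cbow_pairs, skipgram_pairs = [], []
for i, target in enumerate(sentence):
    # context = words within `window` positions of the target (excluding the target itself)
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))                  # CBOW: context words -> target word
    skipgram_pairs.extend((target, c) for c in context)   # Skip-Gram: target word -> each context word

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]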

An Intuitive Understanding of CBOW#

An Intuitive Understanding of Skip-gram#

Import necessary dependencies and settings#

import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['figure.dpi'] = 300
pd.options.display.max_colwidth = 200
# # Google Colab Adhoc Setting
# !nvidia-smi
# nltk.download(['gutenberg','punkt','stopwords'])
# !pip show spacy
# !pip install --upgrade spacy
# #!python -m spacy download en_core_web_trf
# !python -m spacy download en_core_web_lg

Sample Corpus: A Naive Example#

corpus = [
    'The sky is blue and beautiful.', 'Love this blue and beautiful sky!',
    'The quick brown fox jumps over the lazy dog.',
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
    'I love green eggs, ham, sausages and bacon!',
    'The brown fox is quick and the blue dog is lazy!',
    'The sky is very blue and the sky is very beautiful today',
    'The dog is lazy but the brown fox is quick!'
]
labels = [
    'weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather',
    'animals'
]

corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus, 'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df
Document Category
0 The sky is blue and beautiful. weather
1 Love this blue and beautiful sky! weather
2 The quick brown fox jumps over the lazy dog. animals
3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans food
4 I love green eggs, ham, sausages and bacon! food
5 The brown fox is quick and the blue dog is lazy! animals
6 The sky is very blue and the sky is very beautiful today weather
7 The dog is lazy but the brown fox is quick! animals

Simple text pre-processing#

  • Usually for unsupervised word2vec learning, we don’t really need much text preprocessing.

  • So we keep our preprocessing to a minimum:

    • Remove only symbols/punctuation marks, as well as redundant whitespace.

    • Perform word tokenization, which would also determine the base units for embeddings learning.

Suggestions#

  • If you are using keras to build the network for embeddings training, please prepare your input corpus data for Tokenizer() in a format where tokens are delimited by whitespace.

  • If you are using gensim to train word embeddings, please tokenize your corpus data first. That is, gensim only requires a tokenized version of the corpus, and it will learn the word embeddings for you (see the sketch below).
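
  • As a quick illustration of the two input formats, here is a small sketch (assuming tensorflow is installed; keras’s Tokenizer() is only one of several ways to index a corpus):

# Sketch of the two input formats mentioned above (assumes tensorflow is installed).
from tensorflow.keras.preprocessing.text import Tokenizer

whitespace_delimited = ['the sky is blue and beautiful',
                        'love this blue and beautiful sky']  # keras-style: whitespace-delimited strings
tokenized = [s.split(' ') for s in whitespace_delimited]     # gensim-style: lists of tokens

tokenizer = Tokenizer()
tokenizer.fit_on_texts(whitespace_delimited)  # keras Tokenizer consumes the raw strings
print(tokenizer.word_index)                   # word -> integer index, ordered by frequency
print(tokenized[0])                           # gensim's Word2Vec consumes the token lists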

wpt = nltk.WordPunctTokenizer()
# stop_words = nltk.corpus.stopwords.words('english')
def preprocess_document(doc):
    # lowercase and remove special characters/extra whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    doc = ' '.join(tokens)
    return doc

corpus_norm = [preprocess_document(text) for text in corpus]
corpus_tokens = [preprocess_document(text).split(' ') for text in corpus]
print(corpus_norm)
print(corpus_tokens)
['the sky is blue and beautiful', 'love this blue and beautiful sky', 'the quick brown fox jumps over the lazy dog', 'a kings breakfast has sausages ham bacon eggs toast and beans', 'i love green eggs ham sausages and bacon', 'the brown fox is quick and the blue dog is lazy', 'the sky is very blue and the sky is very beautiful today', 'the dog is lazy but the brown fox is quick']
[['the', 'sky', 'is', 'blue', 'and', 'beautiful'], ['love', 'this', 'blue', 'and', 'beautiful', 'sky'], ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'], ['a', 'kings', 'breakfast', 'has', 'sausages', 'ham', 'bacon', 'eggs', 'toast', 'and', 'beans'], ['i', 'love', 'green', 'eggs', 'ham', 'sausages', 'and', 'bacon'], ['the', 'brown', 'fox', 'is', 'quick', 'and', 'the', 'blue', 'dog', 'is', 'lazy'], ['the', 'sky', 'is', 'very', 'blue', 'and', 'the', 'sky', 'is', 'very', 'beautiful', 'today'], ['the', 'dog', 'is', 'lazy', 'but', 'the', 'brown', 'fox', 'is', 'quick']]

Training Embeddings Using word2vec#

  • The expected input of gensim.models.word2vec is a tokenized corpus object (i.e., a list of token lists).

%%time

from gensim.models import word2vec

# Set values for various parameters
feature_size = 10  
window_context = 5  
min_word_count = 1  

w2v_model = word2vec.Word2Vec(
    corpus_tokens,
    size=feature_size,        # Word embeddings dimensionality
    window=window_context,    # Context window size
    min_count=min_word_count, # Minimum word count
    sg=1,                     # `1` for skip-gram; otherwise CBOW.
    seed = 123,               # random seed
    workers=1,                # number of cores to use
    negative = 5,             # how many negative samples should be drawn
    cbow_mean = 1,            # for CBOW: use the mean (1) or the sum (0) of context word embeddings
    iter=10000,               # number of epochs for the entire corpus
    batch_words=10000,        # batch size
)
CPU times: user 3.75 s, sys: 1.11 s, total: 4.85 s
Wall time: 8.88 s

Visualizing Word Embeddings#

  • Embeddings represent words in multidimensional space.

  • We can inspect the quality of the embeddings by applying dimensionality reduction and visualizing the words in a 2D plot.

from sklearn.manifold import TSNE

words = w2v_model.wv.index2word ## get all word forms in the vocabulary
wvs = w2v_model.wv[words] ## get the embeddings of all word forms

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(wvs)
labels = words

plt.figure(figsize=(12, 6))
plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label,
                 xy=(x + 1, y + 1),
                 xytext=(0, 0),
                 textcoords='offset points')
[Output figure: t-SNE projection of the word2vec word embeddings]
  • All trained word embeddings are included in w2v_model.wv.

  • We can extract all word forms in the vocabulary from w2v_model.wv.index2word.

  • We can easily extract embeddings for any specific words from w2v_model.wv.

w2v_model.wv.index2word[:5]
['the', 'is', 'and', 'sky', 'blue']
[w2v_model.wv[w] for w in w2v_model.wv.index2word[:5]]
[array([ 0.784301  ,  0.25009292,  1.4975246 , -1.0075508 ,  0.61629665,
         0.6063493 ,  0.15591188, -0.8170678 ,  0.36189288,  0.4369783 ],
       dtype=float32),
 array([ 1.1491109 ,  0.71390206,  1.6037664 , -1.4727064 ,  0.11018104,
         0.34038126,  0.5746719 , -0.900755  , -0.10498449,  0.32840192],
       dtype=float32),
 array([ 0.42948523,  0.46259484,  0.7993391 ,  0.26920947,  1.0111133 ,
        -0.24869905, -0.32832903,  0.8455077 , -0.17631389, -0.55040485],
       dtype=float32),
 array([ 1.2447667 ,  2.4593425 ,  0.61004704, -0.84361315, -0.20513304,
         0.30263227,  0.47103634,  0.55743533,  0.8404286 , -0.9310767 ],
       dtype=float32),
 array([-0.33282068,  1.8938363 ,  1.8055332 , -0.63900596, -0.49224195,
         0.5072632 ,  0.13565086,  0.12676433,  0.64882666,  0.4231832 ],
       dtype=float32)]

From Word Embeddings to Document Embeddings#

  • With word embeddings, we can compute the average embedding over all words of a document, i.e., the document embedding.

  • These document embeddings are assumed to capture a considerable amount of the document’s semantic information.

  • We can, for example, use them for document classification or clustering.

def average_word_vectors(words, model, vocabulary, num_features):

    feature_vector = np.zeros((num_features, ), dtype="float64")
    nwords = 0.

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model.wv[word])

    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    features = [
        average_word_vectors(tokenized_sentence, model, vocabulary,
                             num_features) for tokenized_sentence in corpus
    ]
    return np.array(features)
w2v_feature_array = averaged_word_vectorizer(corpus=corpus_tokens,
                                             model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array, index=corpus_norm)
0 1 2 3 4 5 6 7 8 9
the sky is blue and beautiful 0.881949 1.282482 1.169370 -0.766623 0.046958 0.318763 0.103750 0.160109 0.445177 -0.037771
love this blue and beautiful sky 0.993389 1.442332 0.907495 -0.437124 0.158677 0.673241 0.480995 0.830446 0.618259 -0.151463
the quick brown fox jumps over the lazy dog 0.800939 0.774233 0.671456 -1.122219 1.047285 -0.288948 -0.183970 -1.174881 0.382866 1.032578
a kings breakfast has sausages ham bacon eggs toast and beans 0.855796 1.194596 -0.217047 1.199348 0.968665 0.441076 -0.001284 0.060136 -0.823116 0.794330
i love green eggs ham sausages and bacon 0.498210 1.100473 0.000850 0.787160 1.128485 0.799985 0.501790 0.930674 -0.078503 0.863962
the brown fox is quick and the blue dog is lazy 0.808819 0.703581 0.927804 -1.018162 0.665659 0.047246 0.053088 -0.788958 0.434566 0.594682
the sky is very blue and the sky is very beautiful today 1.178684 1.438528 1.165024 -0.811881 -0.052740 0.396082 0.169894 -0.133908 0.448993 -0.253550
the dog is lazy but the brown fox is quick 0.875077 0.730577 0.911616 -1.165358 0.759716 -0.055843 0.023659 -1.040530 0.533149 0.764571
  • Let’s cluster these documents based on their document embeddings.

from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

similarity_doc_matrix = cosine_similarity(w2v_feature_array)
similarity_doc_df = pd.DataFrame(similarity_doc_matrix)
similarity_doc_df
0 1 2 3 4 5 6 7
0 1.000000 0.912405 0.571349 0.170281 0.321300 0.763058 0.981439 0.699931
1 0.912405 1.000000 0.317095 0.287722 0.560264 0.533041 0.879302 0.454163
2 0.571349 0.317095 1.000000 0.224509 0.158110 0.951416 0.570610 0.973488
3 0.170281 0.287722 0.224509 1.000000 0.826242 0.158243 0.173629 0.139257
4 0.321300 0.560264 0.158110 0.826242 1.000000 0.179381 0.253323 0.134353
5 0.763058 0.533041 0.951416 0.158243 0.179381 1.000000 0.762788 0.994132
6 0.981439 0.879302 0.570610 0.173629 0.253323 0.762788 1.000000 0.704748
7 0.699931 0.454163 0.973488 0.139257 0.134353 0.994132 0.704748 1.000000
from scipy.cluster.hierarchy import dendrogram, linkage

Z = linkage(similarity_doc_matrix, 'ward')
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
dendrogram(Z,
           labels=corpus_norm,
           leaf_rotation=0,
           leaf_font_size=8,
           orientation='right',
           color_threshold=0.5)
plt.axvline(x=0.5, c='k', ls='--', lw=0.5)
<matplotlib.lines.Line2D at 0x7fa328bb0128>
[Output figure: hierarchical clustering dendrogram of the documents]
## Other Clustering Methods

from sklearn.cluster import AffinityPropagation

ap = AffinityPropagation(random_state=0)  # fix the random state for reproducible clusters
ap.fit(w2v_feature_array)
cluster_labels = ap.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)

## PCA Plotting
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=0)
pcs = pca.fit_transform(w2v_feature_array)
labels = ap.labels_
categories = list(corpus_df['Category'])
plt.figure(figsize=(8, 6))

for i in range(len(labels)):
    label = labels[i]
    color = 'orange' if label == 0 else 'blue' if label == 1 else 'green'
    annotation_label = categories[i]
    x, y = pcs[i]
    plt.scatter(x, y, c=color, edgecolors='k')
    plt.annotate(annotation_label,
                 xy=(x + 1e-4, y + 1e-3),
                 xytext=(0, 0),
                 textcoords='offset points')
[Output figure: PCA projection of the document embeddings, colored by Affinity Propagation cluster]

Using Pre-trained Embeddings: GloVe in spacy#

import spacy


nlp = spacy.load('en_core_web_lg', disable=['parser', 'ner'])

total_vectors = len(nlp.vocab.vectors)
print('Total word vectors:', total_vectors)
Total word vectors: 684830
print(spacy.__version__)
3.0.5

Visualize GloVe word embeddings#

  • Let’s extract the GloVe pretrained embeddings for all the words in our simple corpus.

  • Then we visualize their embeddings in a 2D plot via dimensionality reduction.

Warning

When using pre-trained embeddings, there are two important things to keep in mind:

  • Be very careful about the tokenization method used in your text preprocessing. If it differs substantially from the one used to build the pretrained model, you may end up with many unknown words that are not covered by the pretrained vectors.

  • Always check the proportion of unknown words when vectorizing your corpus texts with pre-trained embeddings (see the sketch below).
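
  • Below is a minimal sketch of one way to check the out-of-vocabulary (OOV) proportion with spacy, assuming the nlp model loaded above and the tokenized corpus corpus_tokens; Vocab.has_vector() is used here as a proxy for “covered by the pretrained vectors”.

# Sketch: estimate the proportion of corpus tokens without a pretrained vector.
# Assumes `nlp` (en_core_web_lg) and `corpus_tokens` are defined as above.
all_tokens = sum(corpus_tokens, [])
oov_tokens = [w for w in all_tokens if not nlp.vocab.has_vector(w)]
oov_rate = len(oov_tokens) / len(all_tokens)
print('OOV rate: {:.2%}; examples: {}'.format(oov_rate, oov_tokens[:10]))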

# get vocab of the corpus
unique_words = set(sum(corpus_tokens,[]))

# extract pre-trained embeddings of all words
word_glove_vectors = np.array([nlp(word).vector for word in unique_words])
pd.DataFrame(word_glove_vectors, index=unique_words)
0 1 2 3 4 5 6 7 8 9 ... 290 291 292 293 294 295 296 297 298 299
has 0.085520 0.501520 0.112660 -0.188260 0.031677 -0.071113 0.050192 -0.357280 -0.099622 2.83730 ... -0.271210 -0.062170 -0.007174 0.036467 0.261120 -0.189210 -0.052768 -0.155080 0.042269 0.014490
sky 0.312550 -0.303080 0.019587 -0.354940 0.100180 -0.141530 -0.514270 0.886110 -0.530540 1.55660 ... -0.667050 0.279110 0.500970 -0.277580 -0.143720 0.342710 0.287580 0.537740 0.363490 0.496920
today -0.156570 0.594890 -0.031445 -0.077586 0.278630 -0.509210 -0.066350 -0.081890 -0.047986 2.80360 ... -0.326580 -0.413380 0.367910 -0.262630 -0.203690 -0.296560 -0.014873 -0.250060 -0.115940 0.083741
quick -0.445630 0.191510 -0.249210 0.465900 0.161950 0.212780 -0.046480 0.021170 0.417660 1.68690 ... -0.329460 0.421860 -0.039543 0.150180 0.338220 0.049554 0.149420 -0.038789 -0.019069 0.348650
beans -0.423290 -0.264500 0.200870 0.082187 0.066944 1.027600 -0.989140 -0.259950 0.145960 0.76645 ... 0.048760 0.351680 -0.786260 -0.368790 -0.528640 0.287650 -0.273120 -1.114000 0.064322 0.223620
dog -0.401760 0.370570 0.021281 -0.341250 0.049538 0.294400 -0.173760 -0.279820 0.067622 2.16930 ... 0.022908 -0.259290 -0.308620 0.001754 -0.189620 0.547890 0.311940 0.246930 0.299290 -0.074861
kings 0.259230 -0.854690 0.360010 -0.642000 0.568530 -0.321420 0.173250 0.133030 -0.089720 1.52860 ... -0.470090 0.063743 -0.545210 -0.192310 -0.301020 1.068500 0.231160 -0.147330 0.662490 -0.577420
is -0.084961 0.502000 0.002382 -0.167550 0.307210 -0.237620 0.160690 -0.367860 -0.058347 2.49900 ... 0.087003 -0.078201 -0.069673 -0.169930 0.235980 0.275500 -0.067180 -0.215110 -0.263040 -0.006017
and -0.185670 0.066008 -0.252090 -0.117250 0.265130 0.064908 0.122910 -0.093979 0.024321 2.49260 ... -0.593960 -0.097729 0.200720 0.170550 -0.004736 -0.039709 0.324980 -0.023452 0.123020 0.331200
breakfast 0.073378 0.227670 0.208420 -0.456790 -0.078219 0.601960 -0.024494 -0.467980 0.054627 2.28370 ... 0.647710 0.373820 0.019931 -0.033672 -0.073184 0.296830 0.340420 -0.599390 -0.061114 0.232200
green -0.072368 0.233200 0.137260 -0.156630 0.248440 0.349870 -0.241700 -0.091426 -0.530150 1.34130 ... -0.405170 0.243570 0.437300 -0.461520 -0.352710 0.336250 0.069899 -0.111550 0.532930 0.712680
blue 0.129450 0.036518 0.032298 -0.060034 0.399840 -0.103020 -0.507880 0.076630 -0.422920 0.81573 ... -0.501280 0.169010 0.548250 -0.319380 -0.072887 0.382950 0.237410 0.052289 0.182060 0.412640
beautiful 0.171200 0.534390 -0.348540 -0.097234 0.101800 -0.170860 0.295650 -0.041816 -0.516550 2.11720 ... -0.285540 0.104670 0.126310 0.120040 0.254380 0.247400 0.207670 0.172580 0.063875 0.350990
sausages -0.174290 -0.064869 -0.046976 0.287420 -0.128150 0.647630 0.056315 -0.240440 -0.025094 0.50222 ... 0.302240 0.195470 -0.653980 -0.291150 -0.684290 -0.266370 0.304310 -0.806830 0.619540 0.201200
love 0.139490 0.534530 -0.252470 -0.125650 0.048748 0.152440 0.199060 -0.065970 0.128830 2.05590 ... -0.124380 0.178440 -0.099469 0.008682 0.089213 -0.075513 -0.049069 -0.015228 0.088408 0.302170
a 0.043798 0.024779 -0.209370 0.497450 0.360190 -0.375030 -0.052078 -0.605550 0.036744 2.20850 ... -0.103470 0.003363 0.217600 -0.204090 0.092415 0.080421 -0.061246 -0.300990 -0.145840 0.281880
fox -0.348680 -0.077720 0.177750 -0.094953 -0.452890 0.237790 0.209440 0.037886 0.035064 0.89901 ... -0.283050 0.270240 -0.654800 0.105300 -0.068738 -0.534750 0.061783 0.123610 -0.553700 -0.544790
but -0.016890 0.174020 -0.302470 -0.300630 0.214150 0.063863 0.101070 -0.241550 -0.095228 2.92530 ... -0.407830 -0.001680 0.015798 0.035886 0.116640 0.023909 -0.053473 -0.326400 0.179570 0.110950
jumps -0.334840 0.215990 -0.350440 -0.260020 0.411070 0.154010 -0.386110 0.206380 0.386700 1.46050 ... -0.107030 -0.279480 -0.186200 -0.543140 -0.479980 -0.284680 0.036022 0.190290 0.692290 -0.071501
the 0.272040 -0.062030 -0.188400 0.023225 -0.018158 0.006719 -0.138770 0.177080 0.177090 2.58820 ... -0.428100 0.168990 0.225110 -0.285570 -0.102800 -0.018168 0.114070 0.130150 -0.183170 0.132300
over -0.304940 0.012330 0.038375 0.097028 0.122100 -0.170170 -0.328420 0.354570 0.376260 2.71370 ... -0.353240 -0.113480 0.311560 -0.204100 -0.326290 -0.137840 0.115910 0.011121 0.140950 -0.021138
very -0.313420 0.372670 -0.416000 0.185190 0.084357 0.053412 0.253420 -0.442260 -0.108610 2.46910 ... -0.374230 0.056770 0.018460 0.002891 0.515560 0.297570 0.334740 -0.308780 0.126210 0.130530
i 0.187330 0.405950 -0.511740 -0.554820 0.039716 0.128870 0.451370 -0.591490 0.155910 1.51370 ... 0.116330 0.079300 -0.391420 -0.324830 0.634510 -0.189100 0.054050 0.164950 0.187570 0.538740
toast 0.130740 -0.193730 0.253270 0.090102 -0.272580 -0.030571 0.096945 -0.115060 0.484000 0.84838 ... 0.142080 0.481910 0.045167 0.057151 -0.149520 -0.495130 -0.086677 -0.569040 -0.359290 0.097443
eggs -0.417810 -0.035192 -0.126150 -0.215930 -0.669740 0.513250 -0.797090 -0.068611 0.634660 1.25630 ... -0.232860 -0.139740 -0.681080 -0.370920 -0.545510 0.073728 0.111620 -0.324700 0.059721 0.159160
brown -0.374120 -0.076264 0.109260 0.186620 0.029943 0.182700 -0.631980 0.133060 -0.128980 0.60343 ... -0.015404 0.392890 -0.034826 -0.720300 -0.365320 0.740510 0.108390 -0.365760 -0.288190 0.114630
lazy -0.353320 -0.299710 -0.176230 -0.321940 -0.385640 0.586110 0.411160 -0.418680 0.073093 1.48650 ... 0.402310 -0.038554 -0.288670 -0.244130 0.460990 0.514170 0.136260 0.344190 -0.845300 -0.077383
bacon -0.430730 -0.016025 0.484620 0.101390 -0.299200 0.761820 -0.353130 -0.325290 0.156730 0.87321 ... 0.304240 0.413440 -0.540730 -0.035930 -0.429450 -0.246590 0.161490 -1.065400 -0.244940 0.269540
this -0.087595 0.355020 0.063868 0.292920 -0.236350 -0.062773 -0.161050 -0.228420 0.041587 2.48440 ... -0.039495 0.093723 0.093557 -0.097551 0.306390 -0.273250 -0.331120 0.034460 -0.150270 0.406730
ham -0.773320 -0.282540 0.580760 0.841480 0.258540 0.585210 -0.021890 -0.463680 0.139070 0.65872 ... 0.464470 0.481400 -0.829200 0.354910 0.224530 -0.493920 0.456930 -0.649100 -0.131930 0.372040

30 rows × 300 columns

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=5)
np.set_printoptions(suppress=True)
T = tsne.fit_transform(word_glove_vectors)
labels = unique_words

plt.figure(figsize=(12, 6))
plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')
for label, x, y in zip(labels, T[:, 0], T[:, 1]):
    plt.annotate(label,
                 xy=(x + 1, y + 1),
                 xytext=(0, 0),
                 textcoords='offset points')
    
[Output figure: t-SNE projection of the pretrained GloVe embeddings of the corpus words]
  • It is clear that when embeddings are trained on a larger corpus, they capture more lexical semantic content.

  • Semantically similar words are indeed closer to each other in the 2D plot.

  • We can of course perform the document-level clustering again using the GloVe embeddings.

  • A nice feature of spacy is that it computes the average document embedding (Doc.vector) automatically.

doc_glove_vectors = np.array([nlp(str(doc)).vector for doc in corpus_norm])

import sklearn
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3, random_state=0)
km.fit_transform(doc_glove_vectors)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
Document Category ClusterLabel
0 The sky is blue and beautiful. weather 2
1 Love this blue and beautiful sky! weather 2
2 The quick brown fox jumps over the lazy dog. animals 1
3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans food 0
4 I love green eggs, ham, sausages and bacon! food 0
5 The brown fox is quick and the blue dog is lazy! animals 1
6 The sky is very blue and the sky is very beautiful today weather 2
7 The dog is lazy but the brown fox is quick! animals 1

fasttext#

  • This section shows a quick example of how to train word embeddings on the nltk.corpus.brown corpus using another algorithm, fastText.

  • The FastText model was introduced by Facebook in 2016 as an improved and extended version of word2vec (see Bojanowski et al. [2017] in References below).

  • We will focus on the implementation here. Please see Bojanowski et al. (2017) as well as Sarkar (2019), Chapter 4, for more comprehensive descriptions of the method.

  • Pretrained FastText Embeddings are available here.

from gensim.models.fasttext import FastText
from nltk.corpus import brown

brown_tokens = [brown.words(fileids=f) for f in brown.fileids()]
%%time
# Set values for various parameters
feature_size = 100  # Word vector dimensionality
window_context = 5  # Context window size
min_word_count = 5  # Minimum word count

ft_model = FastText(brown_tokens,
                    size=feature_size,
                    window=window_context,
                    min_count=min_word_count,
                    sg=1,
                    iter=50)
CPU times: user 12min 19s, sys: 26.6 s, total: 12min 46s
Wall time: 5min 22s
  • We can use the trained embedding model to identify words that are similar to a set of seed words.

  • We then plot all of these words (i.e., the seed words and their semantic neighbors) in one 2D plot after reducing the dimensionality of their embeddings.

# view similar words based on gensim's model
similar_words = {
    search_term:
    [item[0] for item in ft_model.wv.most_similar([search_term], topn=5)]
    for search_term in
    ['think', 'say','news', 'report','nation', 'democracy']
}
similar_words
{'think': ['know', 'Think', 'tell', 'suppose', 'say'],
 'say': ['think', 'said', 'know', 'hear', 'want'],
 'news': ['forthcoming', 'Burr', 'newspapers', 'hailed', 'remarks'],
 'report': ['reports', 'Investigation', 'reporting', 'Report', "officer's"],
 'nation': ['nations', 'national', 'nationally', 'nationwide', 'nationalism'],
 'democracy': ['democratic',
  'cultural',
  'world',
  'capitalism',
  'Christianity']}
from sklearn.decomposition import PCA

words = sum([[k] + v for k, v in similar_words.items()], [])
wvs = ft_model.wv[words]

pca = PCA(n_components=2)
np.set_printoptions(suppress=True)
P = pca.fit_transform(wvs)
labels = words

plt.figure(figsize=(12, 10))
plt.scatter(P[:, 0], P[:, 1], c='lightgreen', edgecolors='g')
for label, x, y in zip(labels, P[:, 0], P[:, 1]):
    plt.annotate(label,
                 xy=(x + 0.03, y + 0.03),
                 xytext=(0, 0),
                 textcoords='offset points')
[Output figure: PCA projection of the seed words and their most similar words (fastText embeddings)]
ft_model.wv['democracy']
array([ 0.6285441 , -0.38983855,  0.11425309,  0.06012838, -0.6053551 ,
       -0.16956608, -0.22312629, -0.10888387,  0.12648048, -0.07467075,
       -0.6946541 ,  0.21674407, -0.22717321,  0.4057445 ,  0.29266223,
        0.37295696, -0.08018252,  0.28890017, -0.21639867, -0.9471755 ,
       -0.0367635 , -0.04349962,  0.30649963, -0.55968153, -0.01258501,
       -0.5630713 ,  0.30955324,  0.32570282, -0.5851471 ,  0.7197201 ,
       -0.9000047 , -0.06964479,  0.28557375,  0.0619022 ,  0.3351349 ,
       -0.13018262,  0.02778896, -0.03295739, -0.5289631 , -0.21316352,
       -0.1927315 ,  0.35496324, -0.79570115,  0.02665308, -0.5433889 ,
       -0.27868316,  0.36585337, -0.25602865, -0.2014406 , -0.2385113 ,
       -0.9931128 , -0.05011513, -0.89226973, -0.09251308, -0.49075928,
        0.2526952 ,  0.17661068, -0.32355413,  0.30298945, -0.6417367 ,
       -0.96245843,  0.34119967,  0.50603724, -0.10167728,  0.37157422,
        0.27186954, -0.24821736,  0.33283603,  0.31510666, -0.4702585 ,
        0.08978494, -0.03058098, -0.3733476 ,  0.6416439 ,  0.01975565,
       -0.35120457, -0.1267091 , -0.28540707, -0.41264024,  0.53472847,
       -0.78829867,  0.6358656 ,  0.13061173,  0.78743094,  0.3419407 ,
        0.0620064 , -0.02179374, -0.06819019,  0.30834115, -0.09986465,
       -0.24703988, -0.11666609, -0.19774927, -0.2915043 ,  0.19712956,
        0.16191272,  0.5221547 ,  0.23896886, -0.20449135,  0.35715654],
      dtype=float32)
print(ft_model.wv.similarity(w1='taiwan', w2='freedom'))
print(ft_model.wv.similarity(w1='china', w2='freedom'))
0.22647002
0.18219654

Wrap-up#

  • Two fundamental deep-learning-based models of word representation learning: CBOW and Skip-Gram.

  • From word embeddings to document embeddings

  • More advanced representation learning models: GloVe and FastText.

  • What is more challenging is how to assess the quality of the learned representations (embeddings). Embedding models are usually evaluated on their performance on semantics-related tasks, such as word similarity and analogy (a minimal sketch follows below). For those who are interested, you can start with the two papers on Chinese embeddings listed in the References below.
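
  • Such qualitative checks can be run directly with gensim. The sketch below assumes a model trained on a reasonably large corpus (e.g., the ft_model above); the eight-sentence toy corpus is far too small for meaningful results, and the analogy query is only sensible if the queried words are in (or, for fastText, composable from) the vocabulary.

# Sketch: qualitative checks of embedding quality via similarity and analogy queries.
# Assumes a gensim model trained on a sufficiently large corpus, e.g., `ft_model` above.
wv = ft_model.wv

print(wv.similarity('nation', 'country'))           # cosine similarity between two words
print(wv.most_similar(positive=['king', 'woman'],   # analogy: king - man + woman ~ ?
                      negative=['man'], topn=3))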

References#