Topic Modeling: A Naive Example#

What is Topic Modeling?#

  • Topic modeling is an unsupervised learning method, whose objective is to extract the underlying semantic patterns among a collection of texts. These underlying semantic structures are commonly referred to as topics of the corpus.

  • In particular, topic modeling first extracts features from the words in the documents and then uses mathematical frameworks such as matrix factorization and SVD (Singular Value Decomposition) to identify clusters of words that share greater semantic coherence.

  • These clusters of words form the notions of topics.

  • Meanwhile, the mathematical framework will also determine the distribution of these topics for each document.

  • In short, an intuitive understanding of Topic Modeling:

    • Each document consists of several topics (a distribution of different topics).

    • Each topic is connected to particular groups of words (a distribution of different words).

    • With our current data collection, we use a mathematical algorithm (here, Latent Dirichlet Allocation) to learn these document-by-topic and topic-by-word distributions in an unsupervised way.

  • Potential applications of LDA

    • Topic discovery and document classification

    • Information retrieval based on topic clusters

    • Content recommendation by modeling the topics of interest to a user

    • Understanding trends over time
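
  • The document-by-topic and topic-by-word intuition described above can be sketched as a tiny generative process. This is only an illustration, not the fitting algorithm: the topic names, the probabilities, and the helper `generate_document` are all made up for the example, and real LDA additionally places Dirichlet priors on both distributions.

```python
import random

# Hypothetical topic-by-word distributions (each one sums to 1).
topic_word = {
    "weather": {"sky": 0.5, "blue": 0.3, "beautiful": 0.2},
    "food": {"ham": 0.4, "eggs": 0.4, "bacon": 0.2},
}

# Hypothetical document-by-topic distribution for a single document.
doc_topic = {"weather": 0.8, "food": 0.2}


def generate_document(doc_topic, topic_word, n_words, seed=0):
    """Sample each word by first drawing a topic, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        dist = topic_word[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words


doc = generate_document(doc_topic, topic_word, n_words=10)
print(doc)
```

  • Topic modeling runs this story in reverse: it only observes the words and has to infer the two distributions.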

Flowchart for Topic Modeling#

Data Preparation and Preprocessing#

Import Necessary Dependencies and Settings#

import warnings
warnings.filterwarnings('ignore') ## silence non-critical warnings

import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

pd.options.display.max_colwidth = 100
# %matplotlib inline

Simple Corpus#

  • We will again use a simple corpus for illustration.

  • It is a corpus consisting of eight documents, each of which consists of a simple sentence.

  • Please note that usually for the task of topic modeling, we do not have the topic categories for the documents in the corpus.

corpus = [
    'The sky is blue and beautiful.', 'Love this blue and beautiful sky!',
    'The quick brown fox jumps over the lazy dog.',
    "A king's breakfast has sausages, ham, bacon, eggs, toast and beans",
    'I love green eggs, ham, sausages and bacon!',
    'The brown fox is quick and the blue dog is lazy!',
    'The sky is very blue and the sky is very beautiful today',
    'The dog is lazy but the brown fox is quick!'
]

corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus})
corpus_df = corpus_df[['Document']]
corpus_df
Document
0 The sky is blue and beautiful.
1 Love this blue and beautiful sky!
2 The quick brown fox jumps over the lazy dog.
3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans
4 I love green eggs, ham, sausages and bacon!
5 The brown fox is quick and the blue dog is lazy!
6 The sky is very blue and the sky is very beautiful today
7 The dog is lazy but the brown fox is quick!

Simple Text Pre-processing#

  • Depending on the nature of the raw corpus data, we may need to implement more specific steps in text preprocessing.

  • In our current naive example, we may consider:

    • removing symbols and punctuation

    • normalizing the letter case

    • stripping unnecessary/redundant whitespace

    • removing stopwords (which requires an intermediate tokenization step)

Tip

Other important considerations in text preprocessing include:

  • whether to remove hyphens

  • whether to lemmatize word forms

  • whether to stem word forms

  • whether to remove short word tokens (e.g., keeping only disyllabic or multisyllabic words)

  • whether to remove unknown words (e.g., words not listed in WordNet)

  • whether to remove functional words (e.g., including only content words, such as nouns, verbs, adjectives)

## Initialize word-tokenizer
wpt = nltk.WordPunctTokenizer()

## Prepare English stopword list
stop_words = nltk.corpus.stopwords.words('english')


def normalize_document(doc):
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A) ## punctuation, symbols
    doc = doc.lower() ## casing
    doc = doc.strip() ## redundant spaces
    tokens = wpt.tokenize(doc) ## word-tokenize
    filtered_tokens = [token for token in tokens if token not in stop_words] ## stopwords
    doc = ' '.join(filtered_tokens) ## concat
    return doc

## Vectorize the function (applicable to a list)
normalize_corpus = np.vectorize(normalize_document)
## preprocess corpus
norm_corpus = normalize_corpus(corpus)
norm_corpus
array(['sky blue beautiful', 'love blue beautiful sky',
       'quick brown fox jumps lazy dog',
       'kings breakfast sausages ham bacon eggs toast beans',
       'love green eggs ham sausages bacon',
       'brown fox quick blue dog lazy', 'sky blue sky beautiful today',
       'dog lazy brown fox quick'], dtype='<U51')
  • The norm_corpus will be the input for our next step, text vectorization.
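
  • A side note on `re.sub`: its fourth positional parameter is `count`, not `flags`, so flags are safest passed by keyword. The short check below illustrates the difference; the toy string is made up for the example.

```python
import re

text = "a1b2c3d4"

# Flags passed by keyword: every non-letter character is removed.
cleaned = re.sub(r"[^a-z]", "", text, flags=re.I)

# Limiting the number of replacements via `count`: only the first two are removed.
# A flag placed in this position would be read as a number in the same way
# (re.I has the integer value 2).
limited = re.sub(r"[^a-z]", "", text, count=2)

print(cleaned)
print(limited)
print(int(re.I))
```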

Text Vectorization#

Bag of Words Model#

  • In topic modeling, the simplest method of text vectorization involves using the naive Bag-of-Words (BOW) model.

  • Key characteristics of the BOW model:

    • It converts texts into numeric representations by tallying the frequency of words in the corpus vocabulary.

    • Sequential word order is disregarded in this approach.

    • Various filtering techniques can be applied to the document-by-word matrix. (For more details, refer to the lecture notes on Text Vectorization.)

  • For topic modeling, it’s advisable to utilize the count-based vectorizer. Most topic modeling algorithms will handle weightings during mathematical computations.
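
  • To make the counting concrete, the sketch below builds the document-by-word matrix by hand with the standard library. It is only meant to show what a count-based vectorizer computes; the variable names `docs`, `bow_vocab`, and `bow_matrix` are made up for this example.

```python
from collections import Counter

# The normalized toy corpus from the preprocessing step.
docs = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "kings breakfast sausages ham bacon eggs toast beans",
    "love green eggs ham sausages bacon",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
    "dog lazy brown fox quick",
]

# Vocabulary: the sorted set of all word types in the corpus.
bow_vocab = sorted({word for doc in docs for word in doc.split()})

# Document-by-word matrix: one row of counts per document.
bow_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[word] for word in bow_vocab])

print(len(bow_vocab))                          # vocabulary size
print(bow_matrix[6][bow_vocab.index("sky")])   # count of 'sky' in document 6
```

  • Word order plays no role here: shuffling the words of a document leaves its row unchanged.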

Warning

  • We should use CountVectorizer instead of TfidfVectorizer when fitting LDA because LDA is based on term count and document count.

  • Fitting LDA with TfidfVectorizer will result in rare words being disproportionately sampled.

  • As a result, they will have greater impact and influence on the final topic distribution.
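
  • To see why TF-IDF weighting shifts influence toward rare words, the sketch below computes inverse document frequencies by hand. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the documented scikit-learn default; the document frequencies are counted from the normalized toy corpus.

```python
import math

n_docs = 8

# Document frequencies in the normalized toy corpus:
# 'blue' occurs in four documents, 'today' in only one.
doc_freq = {"blue": 4, "today": 1}

# Smoothed IDF (assumed to match scikit-learn's default, smooth_idf=True).
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

for word, weight in idf.items():
    print(word, round(weight, 3))
```

  • A single occurrence of the rare word 'today' would therefore be weighted more heavily than a single occurrence of 'blue', which is not what a count-based model like LDA expects.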

## Countvectorize
cv = CountVectorizer(min_df=0., max_df=1.)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix
<8x20 sparse matrix of type '<class 'numpy.int64'>'
	with 42 stored elements in Compressed Sparse Row format>
## Inspect BOW matrix
## warning: this might raise a memory error if the data is too big
cv_matrix = cv_matrix.toarray()
cv_matrix
array([[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0]])
## Check all lexical features
vocab = cv.get_feature_names_out()

## View BOW in df
pd.DataFrame(cv_matrix, columns=vocab)
bacon beans beautiful blue breakfast brown dog eggs fox green ham jumps kings lazy love quick sausages sky toast today
0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
2 0 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 0 0 0
3 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 1 0
4 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0
5 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0
6 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1
7 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0

Latent Dirichlet Allocation#

Intuition of LDA#

  • Latent Dirichlet Allocation (LDA) is a method used to uncover the underlying themes or topics in a collection of documents.

  • In LDA, a “topic” represents a distribution of words across the entire vocabulary of the corpus. Essentially, it tells us which words are likely to co-occur within a topic.

  • Specifically, LDA operates by assuming that each document is generated based on two crucial probabilistic models: document-by-topic and topic-by-word models.

  • LDA provides us with two main matrices:

    • the Topic by Word Matrix, which shows the likelihood of words being associated with specific topics;

    • the Document by Topic Matrix, which indicates the likelihood of documents being associated with specific topics.

  • To understand a topic, we look at the ranked list of the most probable words within that topic. However, since common words often rank high across multiple topics, it can be challenging to distinguish between them.

  • To overcome this, it helps to consider both the frequency of words within each topic and the exclusivity (i.e., the association) of words to a particular topic, which shows how strongly each word relates to its assigned topic. Visualization tools such as pyLDAvis (used later in these notes) expose both.
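
  • One concrete way to balance frequency against exclusivity is the relevance score of Sievert and Shirley (2014), which LDAvis-style visualizations use to rank words. The sketch below assumes that definition, relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w)); the probabilities and the helper `relevance` are invented for illustration.

```python
import math


def relevance(p_w_given_t, p_w, lam=0.6):
    """Blend in-topic probability with lift, p(w|t) / p(w)."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)


# Hypothetical words: 'love' is frequent everywhere, 'sky' is topic-specific.
words = {
    "love": {"p_w_given_t": 0.10, "p_w": 0.10},
    "sky": {"p_w_given_t": 0.08, "p_w": 0.02},
}

# lam=1 ranks by raw in-topic probability; a lower lam rewards exclusivity.
for lam in (1.0, 0.6):
    ranked = sorted(words, key=lambda w: relevance(lam=lam, **words[w]), reverse=True)
    print(lam, ranked)
```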

Building LDA Model#

%%time 

lda = LatentDirichletAllocation(n_components=3,  ## Number of topics to generate by the LDA model.
                                max_iter=10000,  ## Maximum number of iterations for the optimization algorithm.
                                random_state=0)  ## Random Seed

doc_topic_matrix = lda.fit_transform(cv_matrix)  # Fit the LDA model
CPU times: user 1.81 s, sys: 1.87 ms, total: 1.81 s
Wall time: 1.83 s

Model Performance Metrics#

  • In topic modeling, we can diagnose the model performance using perplexity and log-likelihood.

    • The higher the log-likelihood, the better.

    • The lower the perplexity, the better.

## log-likelihood
print(lda.score(cv_matrix))

## perplexity
print(lda.perplexity(cv_matrix))
-138.9126330364425
25.292966412842105
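
  • The two numbers printed above are linked: perplexity is the exponential of the negative log-likelihood per word. The quick check below assumes this is how scikit-learn derives `perplexity()` from `score()`, and takes the token count (43) from the document-by-word matrix shown earlier.

```python
import math

log_likelihood = -138.9126330364425  # value printed by lda.score(cv_matrix)
n_tokens = 43                        # sum of all counts in cv_matrix

# Perplexity = exp(-log-likelihood / number of tokens)
perplexity = math.exp(-log_likelihood / n_tokens)
print(round(perplexity, 4))
```

  • This is why the two diagnostics always move in opposite directions on the same data.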

Interpretation#

  • To properly interpret the results provided by LDA, we need to get the two important probabilistic models:

    • Document-by-Topic Matrix: This is the matrix returned by the LatentDirichletAllocation object when we fit_transform() the model with the data.

    • Topic-by-Word Matrix: We can retrieve this matrix from a fitted LatentDirichletAllocation object, i.e., its components_ attribute.

  • We have 8 documents and our vocabulary size is 20 (words).

  • We can check the shapes of the two matrices.

print(doc_topic_matrix.shape) ## doc-by-topic matrix
print(lda.components_.shape) ## topic-by-word matrix
(8, 3)
(3, 20)

Document-by-Topic Matrix#

  • In the Document-by-Topic matrix, we can see how each document in the corpus (row) is connected to each topic.

  • In particular, the numbers refer to the probability value of a specific document being connected to a particular topic.

## doc-topic matrix
doc_topic_df = pd.DataFrame(doc_topic_matrix, columns=['T1', 'T2', 'T3'])
doc_topic_df
T1 T2 T3
0 0.832191 0.083480 0.084329
1 0.863554 0.069100 0.067346
2 0.047794 0.047776 0.904430
3 0.037243 0.925559 0.037198
4 0.049121 0.903076 0.047802
5 0.054902 0.047778 0.897321
6 0.888287 0.055697 0.056016
7 0.055704 0.055689 0.888607
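
  • Because each row is a probability distribution over topics, it sums to one, and the dominant topic of a document is simply the column with the largest value. A small check with the values copied from the output above (the variable names are made up for this example):

```python
# Rows copied from the document-by-topic output above (T1, T2, T3).
doc_topic_rows = [
    [0.832191, 0.083480, 0.084329],
    [0.863554, 0.069100, 0.067346],
    [0.047794, 0.047776, 0.904430],
    [0.037243, 0.925559, 0.037198],
    [0.049121, 0.903076, 0.047802],
    [0.054902, 0.047778, 0.897321],
    [0.888287, 0.055697, 0.056016],
    [0.055704, 0.055689, 0.888607],
]

# Each row sums to (approximately) 1; the printed values are rounded.
row_sums = [round(sum(row), 4) for row in doc_topic_rows]

# Index of the most probable topic for each document.
dominant = [row.index(max(row)) for row in doc_topic_rows]

print(row_sums)
print(dominant)
```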

Topic-by-Word Matrix#

  • In the Topic-by-Word matrix, we can see how each topic (row) is connected to each word in the BOW vocabulary.

  • In particular, the numbers are unnormalized weights that reflect the importance of each word with respect to each topic (they are not probabilities and do not sum to one).

topic_word_matrix = lda.components_
pd.DataFrame(topic_word_matrix, columns=vocab)
bacon beans beautiful blue breakfast brown dog eggs fox green ham jumps kings lazy love quick sausages sky toast today
0 0.333699 0.333647 3.332365 3.373774 0.333647 0.333891 0.333891 0.333699 0.333891 0.333793 0.333699 0.333814 0.333647 0.333891 1.330416 0.333891 0.333699 4.332439 0.333647 1.332558
1 2.332696 1.332774 0.333853 0.334283 1.332774 0.333761 0.333761 2.332696 0.333761 1.332543 2.332696 0.333767 1.332774 0.333761 1.335461 0.333761 2.332696 0.333812 1.332774 0.333744
2 0.333606 0.333579 0.333782 1.291942 0.333579 3.332347 3.332347 0.333606 3.332347 0.333664 0.333606 1.332419 0.333579 3.332347 0.334123 3.332347 0.333606 0.333749 0.333579 0.333698
  • We can transpose the matrix for clarity of inspection (with each topic as a column).

pd.DataFrame(np.transpose(topic_word_matrix), index=vocab)
0 1 2
bacon 0.333699 2.332696 0.333606
beans 0.333647 1.332774 0.333579
beautiful 3.332365 0.333853 0.333782
blue 3.373774 0.334283 1.291942
breakfast 0.333647 1.332774 0.333579
brown 0.333891 0.333761 3.332347
dog 0.333891 0.333761 3.332347
eggs 0.333699 2.332696 0.333606
fox 0.333891 0.333761 3.332347
green 0.333793 1.332543 0.333664
ham 0.333699 2.332696 0.333606
jumps 0.333814 0.333767 1.332419
kings 0.333647 1.332774 0.333579
lazy 0.333891 0.333761 3.332347
love 1.330416 1.335461 0.334123
quick 0.333891 0.333761 3.332347
sausages 0.333699 2.332696 0.333606
sky 4.332439 0.333812 0.333749
toast 0.333647 1.332774 0.333579
today 1.332558 0.333744 0.333698
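
  • The weights in `components_` are not probabilities. According to the scikit-learn documentation, they can be read as pseudo-counts, and dividing each row by its sum gives the topic's distribution over words. The sketch below does this by hand for the first topic, using the values printed above (the variable names are made up for this example).

```python
# Vocabulary and weights of the first topic (column 0), copied from the output above.
topic0_vocab = [
    "bacon", "beans", "beautiful", "blue", "breakfast", "brown", "dog",
    "eggs", "fox", "green", "ham", "jumps", "kings", "lazy", "love",
    "quick", "sausages", "sky", "toast", "today",
]
topic0_weights = [
    0.333699, 0.333647, 3.332365, 3.373774, 0.333647, 0.333891, 0.333891,
    0.333699, 0.333891, 0.333793, 0.333699, 0.333814, 0.333647, 0.333891,
    1.330416, 0.333891, 0.333699, 4.332439, 0.333647, 1.332558,
]

# Normalize the weights so that they sum to 1.
total = sum(topic0_weights)
topic0_probs = {w: x / total for w, x in zip(topic0_vocab, topic0_weights)}

# The three most probable words of the topic.
top3 = sorted(topic0_probs, key=topic0_probs.get, reverse=True)[:3]
print(top3)
print(round(topic0_probs["sky"], 3))
```

  • Normalizing does not change the ranking of words within a topic, so the interpretation of topics below works equally well on the raw weights.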

Interpreting the Meanings of Topics#

  • This is the most crucial step in topic modeling. LDA does not give us a label for each topic.

  • It is the analyst who determines the meanings of the topics.

  • These decisions are based on the words under each topic that show high importance weights.

## customized function

def get_topics_meanings(tw_m,
                        vocab,
                        display_weights=False,
                        topn=5,
                        weight_cutoff=0.6):
    """
    This function sorts the words under each topic by importance and
    allows selecting words based on ranks or a cutoff on weights.

    Parameters:
    - tw_m : array-like
        Topic by Word matrix containing the weights of each word in each topic.
    - vocab : list
        List of words in the vocabulary.
    - display_weights : bool, optional (default=False)
        Flag to indicate whether to display words with their weights.
    - topn : int, optional (default=5)
        Number of top words to display for each topic if display_weights is False.
    - weight_cutoff : float, optional (default=0.6)
        Weight cutoff threshold for displaying words if display_weights is True.
    """
    for i, topic_weights in enumerate(tw_m):
        topic = [(token, np.round(weight, 2)) for token, weight in zip(vocab, topic_weights)]
        topic = sorted(topic, key=lambda x: -x[1])
        if display_weights:
            topic = [item for item in topic if item[1] > weight_cutoff]
            print(f"Topic #{i} :\n{topic}")
            print("=" * 20)
        else:
            topic_topn = topic[:topn]
            topic_topn = ' '.join([word for word, weight in topic_topn])
            print(f"Topic #{i} :\n{topic_topn}")
            print('=' * 20)
  • To use the above function:

    • If we are to display the weights of words, then we need to specify the weight_cutoff.

    • If we are to display only the top N words, then we need to specify the topn.

get_topics_meanings(topic_word_matrix,
                    vocab,
                    display_weights=True,
                    weight_cutoff=2)
Topic #0 :
[('sky', 4.33), ('blue', 3.37), ('beautiful', 3.33)]
====================
Topic #1 :
[('bacon', 2.33), ('eggs', 2.33), ('ham', 2.33), ('sausages', 2.33)]
====================
Topic #2 :
[('brown', 3.33), ('dog', 3.33), ('fox', 3.33), ('lazy', 3.33), ('quick', 3.33)]
====================
get_topics_meanings(topic_word_matrix, 
                    vocab, 
                    display_weights=False, 
                    topn=5)
Topic #0 :
sky blue beautiful love today
====================
Topic #1 :
bacon eggs ham sausages love
====================
Topic #2 :
brown dog fox lazy quick
====================

Topics in Documents#

  • After we determine the meanings of the topics, we can analyze how each document is associated with these topics.

  • That is, we can now look at the Document-by-Topic Matrix.

## interpretation of TOPICS
topics = ['weather', 'food', 'animal']

## Checking
doc_topic_df.columns = topics
doc_topic_df['corpus'] = norm_corpus
doc_topic_df
weather food animal corpus
0 0.832191 0.083480 0.084329 sky blue beautiful
1 0.863554 0.069100 0.067346 love blue beautiful sky
2 0.047794 0.047776 0.904430 quick brown fox jumps lazy dog
3 0.037243 0.925559 0.037198 kings breakfast sausages ham bacon eggs toast beans
4 0.049121 0.903076 0.047802 love green eggs ham sausages bacon
5 0.054902 0.047778 0.897321 brown fox quick blue dog lazy
6 0.888287 0.055697 0.056016 sky blue sky beautiful today
7 0.055704 0.055689 0.888607 dog lazy brown fox quick
  • We can visualize the topic distribution for each document using a stacked bar plot.

## create stacked bar plot
x_axis = ['DOC' + str(i) for i in range(len(norm_corpus))]
y_axis = doc_topic_df[['weather', 'food', 'animal']]

fig, ax = plt.subplots(figsize=(15, 8))

# Create stacked bar plot
bottom = np.zeros(len(x_axis))
for col in y_axis.columns:
    ax.bar(x_axis, y_axis[col], bottom=bottom, label=col)
    bottom += np.array(y_axis[col])

# Move the legend outside of the chart
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.show()
[Figure: stacked bar plot of the topic distribution for each document]

Clustering documents using topic model features#

  • We can use the topic distributions of each document as our features to further group the documents into clusters.

  • Here is a quick example of k-means clustering.

## initialize kmeans
km = KMeans(n_clusters=3, random_state=0)

## fit kmeans (the distances returned by fit_transform are not needed here)
km.fit(doc_topic_matrix)

## get cluster labels
cluster_labels = km.labels_

## check results
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
Document ClusterLabel
0 The sky is blue and beautiful. 2
1 Love this blue and beautiful sky! 2
2 The quick brown fox jumps over the lazy dog. 1
3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans 0
4 I love green eggs, ham, sausages and bacon! 0
5 The brown fox is quick and the blue dog is lazy! 1
6 The sky is very blue and the sky is very beautiful today 2
7 The dog is lazy but the brown fox is quick! 1

Visualizing Topic Models#

import pyLDAvis
import pyLDAvis.lda_model ## for sklearn LDA; for gensim models, use `pyLDAvis.gensim_models`
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

pyLDAvis.enable_notebook()
cv_matrix = cv.fit_transform(norm_corpus) ## re-create the sparse matrix (it was converted to a dense array earlier)
pyLDAvis.lda_model.prepare(lda, cv_matrix, cv, mds='mmds')

Hyperparameter Tuning#

  • One thing we haven’t made explicit is that the number of topics has so far been fixed before the analysis.

  • Finding the optimal number of topics can be challenging in topic modeling.

  • We can treat it as a hyperparameter of the model and use Grid Search to find the optimal number of topics.

  • Similarly, we can fine-tune the other hyperparameters of LDA as well (e.g., learning_decay).

  • learning_method: The default is batch; that is, all training data are used in each update of the parameters. If it is online, the model updates the parameters incrementally, one mini-batch of documents at a time.

  • learning_decay: If the learning_method is online, this parameter controls the learning rate of the online updates (usually set between 0.5 and 1.0). It has no effect when the learning_method is batch.
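
  • For the online method, the step size shrinks as more mini-batches are processed. The sketch below assumes the schedule from online variational Bayes for LDA (Hoffman et al., 2010), weight = (learning_offset + t) ** (-learning_decay), where `learning_offset` is a second scikit-learn parameter (default 10.0) not discussed above. The helper `online_weight` is made up for illustration.

```python
def online_weight(t, learning_decay=0.7, learning_offset=10.0):
    """Weight given to mini-batch number t when updating the topics."""
    return (learning_offset + t) ** (-learning_decay)


# Compare how quickly the weight decays for different learning_decay values.
for decay in (0.5, 0.7, 1.0):
    weights = [round(online_weight(t, learning_decay=decay), 4) for t in (0, 10, 100)]
    print(decay, weights)
```

  • A larger learning_decay makes the model forget early mini-batches more slowly and trust later ones less.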

Tip

Doing Grid Search with LDA models can be very slow. There are some other topic modeling algorithms that are a lot faster. Please refer to Sarkar (2019) Chapter 6 for more information.

Grid Search for Topic Number#

%%time

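## Prepare hyperparameters dict
## (learning_decay only takes effect when learning_method='online';
## with 'batch', as used below, both values yield identical models)
search_params = {'n_components': range(3,8), 'learning_decay': [.5, .7]}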

## Initialize LDA
model = LatentDirichletAllocation(learning_method='batch', ## `online` for large datasets
                                  max_iter=10000,
                                  random_state=0)

## Gridsearch
gridsearch = GridSearchCV(model,
                          param_grid=search_params,
                          n_jobs=-1,
                          verbose=1)
gridsearch.fit(cv_matrix)

## Save the best model
best_lda = gridsearch.best_estimator_
Fitting 5 folds for each of 10 candidates, totalling 50 fits
CPU times: user 2.11 s, sys: 13.2 ms, total: 2.12 s
Wall time: 14.2 s
# What did we find?
print("Best Model's Params: ", gridsearch.best_params_)
print("Best Log Likelihood Score: ", gridsearch.best_score_)
print('Best Model Perplexity: ', best_lda.perplexity(cv_matrix))
Best Model's Params:  {'learning_decay': 0.5, 'n_components': 3}
Best Log Likelihood Score:  -50.410868132698795
Best Model Perplexity:  25.292966412842105

Examining the Grid Search Results#

cv_results_df = pd.DataFrame(gridsearch.cv_results_)
cv_results_df
mean_fit_time std_fit_time mean_score_time std_score_time param_learning_decay param_n_components params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 2.534712 0.342016 0.000322 0.000025 0.5 3 {'learning_decay': 0.5, 'n_components': 3} -45.022883 -73.947625 -60.589126 -37.947911 -34.546796 -50.410868 14.789188 1
1 2.991487 0.781605 0.000419 0.000197 0.5 4 {'learning_decay': 0.5, 'n_components': 4} -49.145599 -82.726223 -65.081852 -42.126564 -37.038050 -55.223657 16.689926 3
2 2.830407 0.823932 0.000411 0.000164 0.5 5 {'learning_decay': 0.5, 'n_components': 5} -50.024617 -86.998110 -70.381182 -44.942218 -42.224895 -58.914204 17.163750 5
3 2.613521 0.476992 0.000515 0.000262 0.5 6 {'learning_decay': 0.5, 'n_components': 6} -52.609495 -93.522521 -73.924983 -47.956165 -49.996366 -63.601906 17.621249 7
4 2.508645 0.242184 0.000381 0.000128 0.5 7 {'learning_decay': 0.5, 'n_components': 7} -54.878310 -99.846527 -77.129032 -50.917763 -46.861191 -65.926565 19.934270 9
5 2.602352 0.392834 0.000512 0.000230 0.7 3 {'learning_decay': 0.7, 'n_components': 3} -45.022883 -73.947625 -60.589126 -37.947911 -34.546796 -50.410868 14.789188 1
6 2.955229 0.633029 0.000425 0.000211 0.7 4 {'learning_decay': 0.7, 'n_components': 4} -49.145599 -82.726223 -65.081852 -42.126564 -37.038050 -55.223657 16.689926 3
7 2.843962 0.751603 0.000349 0.000029 0.7 5 {'learning_decay': 0.7, 'n_components': 5} -50.024617 -86.998110 -70.381182 -44.942218 -42.224895 -58.914204 17.163750 5
8 2.597273 0.372056 0.000321 0.000039 0.7 6 {'learning_decay': 0.7, 'n_components': 6} -52.609495 -93.522521 -73.924983 -47.956165 -49.996366 -63.601906 17.621249 7
9 2.111250 0.215773 0.000330 0.000028 0.7 7 {'learning_decay': 0.7, 'n_components': 7} -54.878310 -99.846527 -77.129032 -50.917763 -46.861191 -65.926565 19.934270 9
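
  • The mean_test_score column is the average of the per-fold scores, each being the (approximate) log-likelihood of the held-out documents only. This is why the best score (-50.41) is not comparable to the full-corpus log-likelihood reported earlier (-138.91). A quick check with the values from the first row, assuming std_test_score is a population standard deviation:

```python
import statistics

# Held-out log-likelihoods of the five folds for n_components=3 (row 0 above).
fold_scores = [-45.022883, -73.947625, -60.589126, -37.947911, -34.546796]

mean_score = statistics.fmean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population standard deviation

print(round(mean_score, 6), round(std_score, 6))
```

  • With only eight documents, each test fold contains one or two documents, so the fold scores vary widely and the ranking should be read with caution.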
import seaborn as sns
sns.set_theme(rc={"figure.dpi":150, 'savefig.dpi':150})
sns.pointplot(x="param_n_components",
              y="mean_test_score",
              hue="param_learning_decay",
              data=cv_results_df)
[Figure: point plot of mean_test_score by param_n_components, grouped by param_learning_decay]
get_topics_meanings(best_lda.components_,
                    vocab,
                    display_weights=True,
                    weight_cutoff=2)
Topic #0 :
[('sky', 4.33), ('blue', 3.37), ('beautiful', 3.33)]
====================
Topic #1 :
[('bacon', 2.33), ('eggs', 2.33), ('ham', 2.33), ('sausages', 2.33)]
====================
Topic #2 :
[('brown', 3.33), ('dog', 3.33), ('fox', 3.33), ('lazy', 3.33), ('quick', 3.33)]
====================

Topic Prediction#

  • We can use our LDA to make predictions of topics for new documents.

  • We need to perform the same procedures with the new document as we did with the training data:

    • text preprocessing

    • count-vectorization

new_texts = ['The sky is so blue', 'Love burger with ham']

## normalize
new_texts_norm = normalize_corpus(new_texts)

## vectorize
new_texts_cv = cv.transform(new_texts_norm)

## Check
new_texts_cv.shape
(2, 20)
## Apply our trained LDA
new_texts_doc_topic_matrix = best_lda.transform(new_texts_cv)

## Check results
topics = ['weather', 'food', 'animal']


new_texts_doc_topic_df = pd.DataFrame(new_texts_doc_topic_matrix,
                                      columns=topics)
new_texts_doc_topic_df['predicted_topic'] = [
    topics[i] for i in np.argmax(new_texts_doc_topic_df.values, axis=1)
]

new_texts_doc_topic_df['corpus'] = new_texts_norm
new_texts_doc_topic_df
weather food animal predicted_topic corpus
0 0.775601 0.111301 0.113098 weather sky blue
1 0.123415 0.764965 0.111620 food love burger ham
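
  • One detail worth noting is that the fitted vectorizer only knows the training vocabulary, so unseen words such as 'burger' contribute nothing to the prediction. The sketch below mimics that behaviour in plain Python; the helper `to_counts` is hypothetical and not part of scikit-learn.

```python
# The 20-word vocabulary learned from the training corpus.
train_vocab = [
    "bacon", "beans", "beautiful", "blue", "breakfast", "brown", "dog",
    "eggs", "fox", "green", "ham", "jumps", "kings", "lazy", "love",
    "quick", "sausages", "sky", "toast", "today",
]


def to_counts(text, vocab):
    """Count only the words that are present in the training vocabulary."""
    tokens = text.split()
    return [tokens.count(word) for word in vocab]


new_doc = "love burger ham"
vector = to_counts(new_doc, train_vocab)

# Words of the new document that actually reach the model.
known = [w for w, c in zip(train_vocab, vector) if c > 0]
print(known)
```

  • The second new document is therefore classified on the basis of 'love' and 'ham' alone.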

Additional Notes#

  • We can calculate a metric to evaluate the coherence of each topic.

  • The coherence computation is implemented in gensim. To apply the coherence computation to a sklearn-trained LDA, we need tmtoolkit (tmtoolkit.topicmod.evaluate.metric_coherence_gensim).

  • The notes below are kept for future reference, in case we need to compute coherence metrics.
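
  • To give a feel for what a coherence score measures, the sketch below implements the simpler UMass coherence (Mimno et al., 2011) in plain Python. Note that this is a different measure from the `c_v` score used in these notes, and the helpers `doc_freq` and `umass_coherence` are hypothetical, not gensim or tmtoolkit functions.

```python
import math
from itertools import combinations

# The normalized toy corpus, with each document reduced to a set of words.
docs = [
    "sky blue beautiful",
    "love blue beautiful sky",
    "quick brown fox jumps lazy dog",
    "kings breakfast sausages ham bacon eggs toast beans",
    "love green eggs ham sausages bacon",
    "brown fox quick blue dog lazy",
    "sky blue sky beautiful today",
    "dog lazy brown fox quick",
]
doc_sets = [set(doc.split()) for doc in docs]


def doc_freq(*words):
    """Number of documents that contain all the given words."""
    return sum(all(w in d for w in words) for d in doc_sets)


def umass_coherence(top_words):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ranked word pairs."""
    score = 0.0
    for w_l, w_m in combinations(top_words, 2):  # w_l is ranked above w_m
        score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score


coherent = umass_coherence(["sky", "blue", "beautiful"])
mixed = umass_coherence(["sky", "ham", "dog"])
print(round(coherent, 3), round(mixed, 3))
```

  • Words that tend to appear in the same documents yield a higher score than words that never co-occur, which is the intuition shared by most coherence measures.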

Warning

tmtoolkit does not support spacy 3+. Also, tmtoolkit will downgrade several important packages to lower versions. Please use it with caution. I would suggest creating another virtual environment for this.

  • The following code demonstrates how to find the optimal topic number based on the coherence scores of the topic models.

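# Step 5: Calculate Coherence (gensim fragment; `top_words_per_topic`, `tokenized_text`,
# and `feature_names` are assumed to be defined; `dictionary` expects a gensim Dictionary)
from gensim.models.coherencemodel import CoherenceModel
coherence_model = CoherenceModel(topics=top_words_per_topic, texts=tokenized_text, dictionary=feature_names, coherence='c_v')
coherence_score = coherence_model.get_coherence()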
import tmtoolkit
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim
def topic_model_coherence_generator(topic_num_start=2,
                                    topic_num_end=6,
                                    norm_corpus=None,
                                    cv_matrix=None,
                                    cv=None):
    norm_corpus_tokens = [doc.split() for doc in norm_corpus]
    models = []
    coherence_scores = []

    for i in range(topic_num_start, topic_num_end):
        print(i)
        cur_lda = LatentDirichletAllocation(n_components=i,
                                            max_iter=10000,
                                            random_state=0)
        cur_lda.fit_transform(cv_matrix)
        cur_coherence_score = metric_coherence_gensim(
            measure='c_v',
            top_n=5,
            topic_word_distrib=cur_lda.components_,
            dtm=cv.fit_transform(norm_corpus),
            vocab=np.array(cv.get_feature_names_out()),
            texts=norm_corpus_tokens)
        models.append(cur_lda)
        coherence_scores.append(np.mean(cur_coherence_score))
    return models, coherence_scores
%%time
ts = 2
te = 10
models, coherence_scores = topic_model_coherence_generator(
    ts, te, norm_corpus=norm_corpus, cv=cv, cv_matrix=cv_matrix)
2
3
4
5
6
7
8
9
CPU times: user 17.5 s, sys: 463 ms, total: 18 s
Wall time: 45.8 s
coherence_scores
[0.6517493196892892,
 0.8866946211961687,
 0.765479848024043,
 0.8341252576008902,
 0.8572203857656319,
 0.7066394264220808,
 0.6391495323490347,
 0.6086551968156503]
coherence_df = pd.DataFrame({
    'TOPIC_NUMBER': [str(i) for i in range(ts, te)],
    'COHERENCE_SCORE': np.round(coherence_scores, 4)
})

coherence_df.sort_values(by=["COHERENCE_SCORE"], ascending=False)
TOPIC_NUMBER COHERENCE_SCORE
1 3 0.8867
4 6 0.8572
3 5 0.8341
2 4 0.7655
5 7 0.7066
0 2 0.6517
6 8 0.6391
7 9 0.6087
import plotnine
from plotnine import ggplot, aes, geom_point, geom_line, labs
plotnine.options.dpi = 150

g = (ggplot(coherence_df) + aes(x="TOPIC_NUMBER", y="COHERENCE_SCORE") +
     geom_point(stat="identity") + geom_line(group=1, color="lightgrey") +
     labs(x="Number of Topics", y="Average Coherence Score"))
g
[Figure: average coherence score by number of topics]

Challenges#

  • There are several challenges and limitations with LDA:

    • Choosing the number of topics

    • Interpretability

    • Scalability

    • Polysemy and context

  • One possible extension is to combine this exploratory method with an LLM: we can feed the key terms of each topic to the LLM and prompt it for a more comprehensive description and interpretation of the topics.

References#