16.5 Topic Modeling with Latent Dirichlet Allocation
Now that we have some clean data—meaning that each text or tweet has had stop words removed, words stemmed, and its contents converted to a list of unigrams, bigrams, and trigrams—we can perform some interesting analyses. Our goal is to convert the text of the tweet into useful features that can be used in a predictive model. One way we can do this is through topic modeling. Latent Dirichlet allocation (LDA) is one popular topic modeling technique. LDA is a generative statistical model that extracts the high-level topics being discussed across many text documents.
The goal of topic modeling (e.g., LDA) is to identify topics such that the documents or tweets assigned to each topic are conceptually close together, while the topics themselves are as distinct and separate from each other as possible. Study the table below to understand the conceptual process:
In summary, the process begins by first cleaning (or preprocessing) the text. This is what we performed in the prior sections of this chapter. In the table below, this is what is occurring in Step 0: clean the text. Stop words are removed, words are stemmed, and uni/bi/trigrams are identified. Step 1 (where we begin this section) is when the uni/bi/trigrams are used to generate the corpus of possible words. Step 2 is to identify the high-level topics based on words and phrases that load together or are commonly found together across all documents. Step 3 involves scoring each document across each of the identified high-level topics. Let’s learn how to do this in Python below.
Latent Dirichlet Allocation Process

- Step 0: clean the text
- Step 1: hash (i.e., generate many new features from a single feature) the corpus into a column or feature representing the presence of each possible word or phrase in each record
- Step 2: identify topics from the features (these topics are identified mathematically using Bayesian inference)
- Step 3: score results on each topic

| Original Text | Preprocessed Text (Step 0) | hash_run (Step 1) | hash_store (Step 1) | hash_fast (Step 1) | hash_walk (Step 1) | topic 1 (Step 3) | topic 2 (Step 3) | topic 3 (Step 3) |
|---|---|---|---|---|---|---|---|---|
| I ran to the store | run store | 1 | 1 | 0 | 0 | 0.990 | 0.005 | 0.005 |
| She runs fast | run fast | 1 | 0 | 1 | 0 | 0.005 | 0.990 | 0.005 |
| He walks fast to the store | walk fast store | 0 | 1 | 1 | 1 | 0.003 | 0.003 | 0.993 |
Step 1: Generate the Dictionary and Corpus
As mentioned above, the first step is to generate a dictionary and corpus. A dictionary is simply a list of all possible words and phrases (i.e., unigrams, bigrams, trigrams, and four-grams) along with assigned identifiers, which are usually just integers from 0 to n - 1 (n = total number of words and phrases). Let's do this in Python using the corpora module from the gensim package:
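Recall that docs is the list of tokenized documents produced by the preprocessing steps in the prior sections. As a minimal sketch of its shape (the tokens below are hypothetical, not the actual Twitter data), it looks like this:

# Hypothetical example of the structure of 'docs': a list of documents,
# each already tokenized into unigrams/bigrams/trigrams
docs_example = [
    ['nice', 'predictable', 'profit', 'engine'],   # tokens from one tweet
    ['run', 'store'],                               # tokens from another tweet
    ['walk', 'fast', 'store', 'fast_store']         # note the bigram token
]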
# Create Dictionary
from gensim import corpora
id2word = corpora.Dictionary(docs)
for row in id2word.items():
    print(row)
# Output:
# (0, 'engine')
# (1, 'nice')
# (2, 'predictable')
# (3, 'profit')
# (4, 'Region')
# ... [through 2234]
In the code above, only the first line is needed to generate the dictionary and store it in the variable named 'id2word'. The loop is only used to let us see what the structure of the object looks like. Basically, gensim stores the dictionary as a mapping between integer identifiers and their corresponding words or phrases, much like a standard Python dict. That is why we can iterate through its (id, token) pairs with the .items() method.
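Because id2word behaves like a mapping, you can also look up a token by its ID, or an ID by its token, directly. A quick sketch (the token 'engine' and ID 0 come from the output above):

# Look up the token assigned to ID 0
print(id2word[0])                    # 'engine'
# Look up the ID assigned to a token using the token2id mapping
print(id2word.token2id['engine'])    # 0
# Total number of unique words and phrases in the dictionary
print(len(id2word))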
Next, we need to use this dictionary to generate the topic modeling corpus, which is a set of word or phrase identifiers and frequencies for each document. In other words, it is a list of lists in which an inner list is generated for every record containing the word or phrase identifier from the dictionary, along with a count of the number of times that the word or phrase appears in the document. See the corpus for this Twitter dataset below:
# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in docs]
corpus
# Output: (wordID, quantity)
# [[(0, 1), (1, 1), (2, 1), (3, 1)], # This is the first Twitter/X post
# [(4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)], # Second post
# [(10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1)], # Third post
# [(18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1)], # Fourth post
# ...
In the output, notice that there is an outer list denoted by the first '[' character. Immediately inside that list is another list denoted by another '[' character. This inner list contains tuples of (wordID, quantity) pairs. The wordIDs correspond to the word and phrase identifiers in the dictionary above. For example, 0 is the ID for the word 'engine', 1 = 'nice', 2 = 'predictable', etc. Now we can see how the IDs in the dictionary were assigned: based on the order in which those words first showed up across the documents. The second value in each tuple is the number of times that word or phrase appears in the document. In the data shown above, each word appears only once because the second number in each tuple is always 1. However, if you look further through the data, you'll find some instances where the quantity is greater than one.
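If you would rather read the corpus in plain words instead of IDs, you can translate a document's (wordID, quantity) pairs back through the dictionary. A minimal sketch using the first post:

# Translate the first document's bag-of-words back into readable (token, count) pairs
readable = [(id2word[word_id], count) for word_id, count in corpus[0]]
print(readable)
# Expected output given the dictionary above: [('engine', 1), ('nice', 1), ('predictable', 1), ('profit', 1)]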
Step 2: Build the LDA
LDA modeling requires you to specify the number of topics you want to extract before you begin—much like how the k-means algorithm requires you to specify the number of clusters before you begin. However, you do not know upfront how many distinct topics there will be in the dataset. This means you will often iterate through the remaining steps multiple times as you try out different numbers of topics. Let’s begin with something simple, like four topics.
The random_state parameter is simply the random seed used during the modeling process. The chunksize parameter refers to the number of documents that are processed at once by the training algorithm. If you have a lot of memory, you can increase this number to speed up training.
The passes parameter refers to the number of times the model is trained on the entire corpus. This is also known as the number of epochs and is used to achieve model convergence. The default value is 1, but higher numbers, if you have the processing power for them, can produce more accurate models.
According to the gensim documentation, if per_word_topics is set to True, then “the model also computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length (i.e., word count).”
# Import the full gensim package (earlier we only imported corpora)
import gensim

# Change the number of topics in the LDA here
topics = 4

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=topics,
                                            random_state=1,
                                            chunksize=20,
                                            passes=10,
                                            per_word_topics=True)
# Store the topics as (word, weight) tuples (not used below, but handy for inspection)
ldatopics = lda_model.show_topics(formatted=False)
for topic in lda_model.print_topics():
    print(topic)
# Output:
# (0, '0.062*"support" + 0.040*"CloudComputing" + 0.024*"cloud" + 0.022*"check" + 0.020*"EC2" + 0.017*"fast" + 0.014*"console" + 0.013*"feature" + 0.013*"run" + 0.011*"BigData"')
# (1, '0.024*"S3" + 0.021*"update" + 0.017*"Cloud" + 0.015*"Regions" + 0.015*"TB" + 0.012*"miss" + 0.012*"Beanstalk" + 0.011*"elastic" + 0.010*"Acceleration" + 0.010*"Transfer"')
# (2, '0.047*"new" + 0.033*"AWSLaunch" + 0.029*"use" + 0.020*"region" + 0.017*"RDS" + 0.015*"Region" + 0.013*"redshift" + 0.012*"Lambda" + 0.011*"instance" + 0.010*"awscloud"')
# (3, '0.109*"amp" + 0.029*"available" + 0.026*"learn" + 0.021*"datum" + 0.015*"Service" + 0.012*"add" + 0.012*"VPC" + 0.012*"app" + 0.011*"cloudformation_support" + 0.011*"Kinesis_Firehose"')
The output is a list of tuples, one per topic, each containing the topic's highest-weighted words and their weights. The first topic (0) is primarily represented by the word 'support' at a weight of 0.062, followed by 'CloudComputing' at 0.040. These weights work much like the beta coefficients in a multiple linear regression (MLR) model.
The results of the trained LDA may feel a bit familiar as you examine them. Much like a predictive model, each topic is determined by a set of inputs (words and phrases) that are assigned a weight, which is conceptually like a regression coefficient you may have learned about earlier. The weight in front of each term indicates how much that term contributes to the overall topic relative to the other terms within that topic.
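If you want to work with these weights programmatically rather than parse the printed strings, gensim can return them as (word, weight) pairs. A small sketch for the first topic:

# Retrieve the top terms and weights for topic 0 as (word, weight) tuples
for word, weight in lda_model.show_topic(0, topn=10):
    print(f'{word}: {weight:.3f}')
# Based on the topic printed above, this starts with: support: 0.062, CloudComputing: 0.040, ...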
How Many Topics Should There Be?
One of the primary problems people have when running LDA models is determining the number of topics to build. There is no single rule that always answers this question, but there are a few metrics that can help. We will learn to calculate two of them here: perplexity and coherence. Perplexity is a statistical measure of how well a probability model predicts a sample. In other words, it is a fit metric like R², RMSE, or accuracy. A perplexity measure is calculated for a given LDA model with n topics. The idea is that the theoretical word distributions represented by the topics are compared to the actual topic mixtures. However, since human judgment is not used to label the topics, it is possible for the theoretical word distributions to be poor representations of how humans would actually infer the topics. Therefore, another metric is needed to assess the coherence of the topics.
Coherence is the degree to which a set of words or phrases agree with or complement each other. Therefore, LDA topic coherence measures the degree of semantic similarity between the high-scoring words in a topic. These measurements help distinguish between topics that are semantically interpretable and topics that are merely artifacts of statistical inference. For example, an LDA may assign the terms 'service', 'support', 'help', and 'fantastic' to a single topic. These terms are coherent because they all represent a buyer's experience with service after a sale. However, the terms 'https', 'www', and '@email.com' may also be part of that topic, not because they have anything to do with the concept of support, but because customers often reference where they go for help in the same statements they use to describe the help they are getting. Lower coherence scores occur when those types of terms are combined.
In summary, the goal is to identify the best number of topics n for an LDA model such that perplexity is lowest and coherence is highest. However, perplexity and coherence tend to be positively correlated, meaning that higher coherence often comes at the cost of greater perplexity. Let's create models for n = 3 through 9 topics and compare their perplexity and coherence scores:
from gensim.models import CoherenceModel
import pandas as pd

# Create an empty DataFrame to store the fit metrics for each number of topics
df_fit = pd.DataFrame(columns=['index', 'perplexity', 'coherence'])
df_fit.set_index('index', inplace=True)

for n in range(3, 10):
    # Fit LDA model with n topics
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=n,
                                                random_state=1,
                                                chunksize=100,
                                                passes=5,
                                                per_word_topics=True)
    # Generate fit metrics
    coherence = CoherenceModel(model=lda_model, texts=docs, dictionary=id2word, coherence='c_v').get_coherence()
    perplexity = lda_model.log_perplexity(corpus)
    # Add metrics to df_fit
    df_fit.loc[n] = [perplexity, coherence]

df_fit
If the goal is to find the greatest coherence and the smallest perplexity, then let’s plot both scores and examine their trends:
# Visualize results
import seaborn as sns, matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
# Normalize these scores to the same scale
scaler = MinMaxScaler()
df_fit[['perplexity', 'coherence']] = scaler.fit_transform(df_fit[['perplexity', 'coherence']])
plt.plot(df_fit.index, df_fit.perplexity, marker='o');
plt.plot(df_fit.index, df_fit.coherence, marker='o');
plt.legend(['Perplexity', 'Coherence'], loc='best')
plt.xlabel('Number of Topics')
plt.ylabel('Score')
plt.show()
Notice in this chart that it will not always be possible to identify a number of topics where coherence is maximized while perplexity is simultaneously minimized. In the case above, coherence is maximized and perplexity is minimized at 8 topics, but it does not always work out that both metrics are optimized at the same number of topics. Let's proceed with the 8-topic model, but keep in mind that these metrics are just a guide or starting point. We may decide later, based on how we interpret the topics, that a different number is more appropriate.
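If you prefer to pull the candidate out of df_fit programmatically instead of reading it off the chart, something like the following works. Here we simply take the number of topics with the highest coherence, which is one reasonable rule of thumb, not the only one:

# Number of topics with the highest (normalized) coherence score
best_n = int(df_fit['coherence'].idxmax())
print(best_n)   # 8 for this dataset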
Step 3: Score Topics
Finally, now that we have the topic weights assigned to each word and phrase, we can generate a new feature for each topic and then generate a topic score for every document. Conceptually, this is accomplished by multiplying the weight for a given word in a given topic by the number of occurrences of that word in the document and summing those products into a single score per topic per document. The code below is somewhat complex. If you need help understanding it, break the loop after each step and print out the results. Let's use eight topics for this scoring model since that is what we decided on above.
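Before walking through the full loop, it can help to see what gensim returns for a single document. Once the 8-topic model below has been built, you can preview one document's topic scores with get_document_topics (the numbers in the comment are illustrative, not actual output):

# Preview the topic scores for the first document (run after building lda_model below)
bow = id2word.doc2bow(docs[0])
print(lda_model.get_document_topics(bow))
# Illustrative output: [(0, 0.91), (3, 0.06)]  -- topics below the probability threshold are omitted,
# which is why we initialize every topic column to 0.0 in the loop below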
pd.options.display.max_colwidth = 50

# Rebuild the LDA model with 8 topics
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=8,
                                            random_state=1,
                                            chunksize=100,
                                            passes=5,
                                            per_word_topics=True)

df_topics = df.copy()
num_topics = len(lda_model.get_topics())   # store the number of topics from the last model

# Generate a new (empty) column for each topic
for col in range(num_topics):
    df_topics[f'topic_{col + 1}'] = 0.0

# Store the topic scores
for i, words in enumerate(docs):
    doc = lda_model[id2word.doc2bow(words)]   # get the model output for this document's bag of words
    for j, score in enumerate(doc[0]):        # doc[0] holds the (topic_id, score) pairs for this document
        # Store the topic score in the appropriate topic column
        df_topics.iat[i, (len(df_topics.columns) - (num_topics - score[0]))] = score[1]

df_topics.head()
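A related feature you may also want is each document's dominant topic, that is, the topic with the highest score. A minimal sketch, assuming the topic_1 through topic_8 columns created above:

# Identify each document's dominant topic (the topic column with the highest score)
topic_cols = [f'topic_{k + 1}' for k in range(num_topics)]
df_topics['dominant_topic'] = df_topics[topic_cols].idxmax(axis=1)
df_topics[['dominant_topic'] + topic_cols].head()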
Now we have new features to work with that can potentially improve our predictive accuracy. We can use these topic scores in a regression or classification model, for example, to improve the prediction of retweet count. That way, we could predict the number of retweets a particular tweet will get before it is even posted, which would allow us to adjust the actual text of the tweet to maximize the predicted retweet count.
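As a final illustration, here is a minimal sketch of using the topic scores as predictors of retweet count with scikit-learn. The column name 'retweet_count' is an assumption about the original DataFrame; substitute whatever your dataset actually calls that column.

# A minimal sketch: use the topic scores as features to predict retweets
# NOTE: 'retweet_count' is a hypothetical column name; adjust to match your data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = df_topics[[f'topic_{k + 1}' for k in range(num_topics)]]
y = df_topics['retweet_count']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
reg = LinearRegression().fit(X_train, y_train)
print(reg.score(X_test, y_test))   # R² on the held-out test set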