16.7 Visualize the LDA
Besides the tabular reports we generated to help us understand the topics, there are several useful visualizations that can help explain the latent topics and the differences among them. Let's begin with a frequency distribution.
Frequency Distribution
A frequency distribution is simply a histogram showing the number of terms in each tweet. Ideally, this histogram has either a normal (bell-shaped) or a fairly even distribution rather than a heavily skewed one.
import seaborn as sns
from matplotlib import pyplot as plt

# Number of words in each post (len(doc) alone would count characters, not words)
doc_lengths = [len(str(doc).split()) for doc in df['text']]
sns.displot(doc_lengths, bins=17)
plt.gca().set(ylabel='Number of Posts', xlabel='Post Word Count')
plt.title('Distribution of Post Word Counts')
plt.show()
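If you want a quick numeric check of the skew rather than just eyeballing the histogram, something like the following works. This is only a sketch that reuses the doc_lengths list computed above and scipy's skew function:

from scipy.stats import skew
import numpy as np

# A skewness near 0 suggests a roughly symmetric distribution;
# a large positive value indicates a long right tail (a few very long posts).
print('Mean length:  ', np.mean(doc_lengths))
print('Median length:', np.median(doc_lengths))
print('Skewness:     ', skew(doc_lengths))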
Clouds of Top N Keywords
Everyone loves word clouds, right? Well, if you do, here is how you can create a word cloud for the top n words in each topic. You can explore modifying the parameters of the WordCloud() object to customize the look and feel:
# 1. Word cloud of the top N words in each topic
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors
import math

# more colors: 'mcolors.XKCD_COLORS', fewer colors: 'mcolors.TABLEAU_COLORS'
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]

cloud = WordCloud(stopwords=stop_words_spacy,  # We have already removed stop words, but just in case
                  background_color='white',
                  width=2500,
                  height=1800,
                  max_words=20,
                  colormap='tab10',
                  color_func=lambda *args, **kwargs: cols[i],  # 'i' is picked up from the loop below
                  prefer_horizontal=1.0)

topics = lda_model.show_topics(num_topics=num_topics, formatted=False)

matrix_size = math.ceil(num_topics**(1/2))  # Computes the n by n grid of plots to generate
fig, axes = plt.subplots(matrix_size, matrix_size, figsize=(10,10), sharex=True, sharey=True)

for i, ax in enumerate(axes.flatten()):
    if i >= len(topics):  # Turn off any unused subplots
        ax.axis('off')
        continue
    topic_words = dict(topics[i][1])
    cloud.generate_from_frequencies(topic_words, max_font_size=300)
    ax.imshow(cloud)
    ax.set_title('Topic ' + str(i+1), fontdict=dict(size=16))
    ax.axis('off')

plt.subplots_adjust(wspace=0, hspace=0)
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
Topic Keyword Counts
When it comes to the keywords in a topic, their importance (weights) matters, but it is also informative to see how frequently each word appears in the tweets. Let's plot the word counts and the weights of each keyword in the same chart. This makes it easy to apply an important rule: we should eliminate words that (1) occur in multiple topics or (2) have a relative frequency greater than their weight. In the first case, the word does not meaningfully differentiate between topics; in the second, the word is simply less important and carries more error. (A small sketch after the chart code below shows one way to flag these words programmatically.)
# Bar chart of word counts and weights for each topic
from collections import Counter

topics = lda_model.show_topics(num_topics=num_topics, formatted=False)
data_flat = [w for w_list in docs for w in w_list]
counter = Counter(data_flat)

out = []
for i, topic in topics:
    for word, weight in topic:
        out.append([word, i + 1, weight, counter[word]])
df_temp = pd.DataFrame(out, columns=['word', 'topic_id', 'importance', 'word_count'])

# Plot word count and weights of topic keywords
matrix_size = math.ceil(num_topics**(1/2))  # Computes the n by n grid of plots to generate
fig, axes = plt.subplots(matrix_size, matrix_size, figsize=(20,20), sharey=True, dpi=160)
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]

for i, ax in enumerate(axes.flatten()):
    if i >= len(topics):  # Turn off the unneeded subplots
        ax.axis('off')
        continue
    topic_df = df_temp.loc[df_temp.topic_id == i + 1, :]
    ax.bar(x='word', height='word_count', data=topic_df, color=cols[i+1], width=0.5, alpha=0.3, label='Word Count')
    ax_twin = ax.twinx()
    ax_twin.bar(x='word', height='importance', data=topic_df, color=cols[i+1], width=0.2, label='Weights')
    ax.set_ylabel('Word Count', color=cols[i+1])
    ax.set_title('Topic: ' + str(i + 1), color=cols[i+1], fontsize=16)
    ax.tick_params(axis='y', left=False)
    ax.set_xticks(ax.get_xticks())
    ax.set_xticklabels(topic_df['word'], rotation=30, horizontalalignment='right')
    ax.legend(loc='upper center')
    ax_twin.legend(loc='upper right')

fig.tight_layout(w_pad=2)
fig.suptitle('Word Count and Importance of Topic Keywords', fontsize=20, y=1.03)
plt.show()
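If you would rather not rely only on eyeballing the chart, the following sketch flags candidate words directly from df_temp. Rule 1 looks for keywords assigned to more than one topic; rule 2 uses one possible interpretation of the visual comparison, checking whether a keyword's relative frequency (its count divided by the total number of tokens) exceeds its weight:

# Sketch: flag keywords that violate the two rules (not part of the chart code above)
total_tokens = sum(counter.values())

# Rule 1: keywords that appear in more than one topic
topic_counts = df_temp.groupby('word')['topic_id'].nunique()
rule1_words = sorted(topic_counts[topic_counts > 1].index)

# Rule 2: keywords whose relative frequency exceeds their weight
rel_freq = df_temp['word_count'] / total_tokens
rule2_words = sorted(df_temp.loc[rel_freq > df_temp['importance'], 'word'].unique())

print('Rule 1 (multiple topics):   ', rule1_words)
print('Rule 2 (frequency > weight):', rule2_words)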
See if you can identify all of the terms that should be removed based on the two rules given above. Here’s what I see as the problem terms:
Topic 1: use (Rule 1 & 2)
Topic 2: none
Topic 3: amp (Rule 1 & 2)
Topic 4: none
Topic 5: amp (Rule 1), join (Rule 2), learn (Rule 1)
Topic 6: amp (Rule 1), use (Rule 1 & 2)
Topic 7: available (Rule 1)
Topic 8: available (Rule 1), learn (Rule 1), use (Rule 1)
Again, the rule is that we remove any term that either (1) appears in more than one topic or (2) has a shaded bar (count) taller than its solid bar (weight). To do that, we go back to the code we created in Section 16.2 and add those words (once each; there is no need to repeat them) to the custom stop word list, as sketched below. In practice, we would then re-run the LDA models and re-determine the optimal number of topics after removing those additional stop words. This process might take multiple rounds until no more words violate the rules.
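For example, assuming the custom stop word list from Section 16.2 is the stop_words_spacy collection referenced in the word cloud code above, the update might look something like this sketch:

# Sketch: extend the custom stop word list with the terms flagged above
# (assumes stop_words_spacy is the custom list built in Section 16.2)
extra_stop_words = ['use', 'amp', 'join', 'learn', 'available']
stop_words_spacy = set(stop_words_spacy) | set(extra_stop_words)

# After updating the list, repeat the preprocessing, rebuild the dictionary and
# corpus, re-fit the LDA models, and re-check the optimal number of topics.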
t-SNE Clustering Chart
Another useful way to visualize the number of documents or tweets attributed to each topic is a t-distributed Stochastic Neighbor Embedding (t-SNE) chart. t-SNE is a form of dimension reduction, similar to principal components analysis (PCA), except that it does not rely on a linearity assumption; that is, it does not require the reduced features to be linear combinations of the originals, the way PCA fits straight lines through the data. This makes it well suited to text data.
Here is a particularly good article that explains t-SNE in greater detail: t-SNE clearly explained: An intuitive explanation of t-SNE algorithm and why it’s so useful in practice.
# Get topic weights and dominant topics
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show
from bokeh.io import output_notebook
import numpy as np

# Get the topic weights for each document
# (row_list[0] holds the topic distribution because the model was built with per_word_topics=True)
topic_weights = []
for i, row_list in enumerate(lda_model[corpus]):
    topic_weights.append([w for topic_id, w in row_list[0]])

# Array of topic weights
arr = pd.DataFrame(topic_weights).fillna(0).values

# Keep only the well-separated points (optional)
arr = arr[np.amax(arr, axis=1) > 0.35]

# Dominant topic number in each doc
topic_num = np.argmax(arr, axis=1)

# t-SNE dimension reduction
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(arr)

# Plot the topic clusters using Bokeh
output_notebook()
mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
plot = figure(title=f"t-SNE Clustering of {num_topics} LDA Topics")
plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
show(plot)
In summary, t-SNE reduced the document-topic weights for our four topics to a two-dimensional space so we can see where the topics fit relative to each other. In our case, notice that Topic 3 is very under-represented (which we already knew) and sits roughly halfway between Topic 0 and Topic 2. I think we have seen enough to conclude that we should revise this analysis to use either three or five topics and check whether the fit improves.
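One way to compare those alternatives is to re-fit the model for each candidate number of topics and compare coherence scores. The following is only a sketch; it assumes the corpus, the tokenized documents (docs), and the gensim dictionary (here called id2word, which may be named differently in your earlier code) are still in memory:

from gensim.models import LdaModel, CoherenceModel

# Sketch: compare coherence for three, four, and five topics
# (assumes corpus, docs, and a gensim dictionary named id2word exist from earlier sections)
for k in [3, 4, 5]:
    lda_k = LdaModel(corpus=corpus, id2word=id2word, num_topics=k, random_state=0, passes=10)
    coherence_k = CoherenceModel(model=lda_k, texts=docs, dictionary=id2word, coherence='c_v')
    print(f'{k} topics: coherence = {coherence_k.get_coherence():.4f}')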