Visualize the LDA

Besides the tabular reports we generated to help us understand the topics, several useful visualizations can help explain the latent topics and the differences among them. Let’s begin with a frequency distribution.

Frequency Distribution

A frequency distribution is simply a histogram showing the number of terms in each tweet. We would like this histogram to have either a normal (bell-shaped) or a roughly uniform (even) shape rather than a heavily skewed one.

      import seaborn as sns
      from matplotlib import pyplot as plt

      # Word count per post; df['text'] is assumed to hold raw post text,
      # so we split on whitespace (len() alone would count characters)
      doc_lengths = [len(str(doc).split()) for doc in df['text']]

      sns.displot(doc_lengths, bins=17)
      plt.gca().set(ylabel='Number of Posts', xlabel='Post Word Count')
      plt.title('Distribution of Post Word Counts')
      plt.show()
      
Figure 12.6: Frequency Distribution.
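
Beyond eyeballing the histogram, it can help to quantify its shape. The short sketch below reuses the doc_lengths list computed above and prints summary statistics along with the skewness, where a value near zero indicates a roughly symmetric distribution:

      # Quantify the shape of the post-length distribution (reuses doc_lengths from above)
      length_summary = pd.Series(doc_lengths)
      print(length_summary.describe())            # count, mean, std, min, quartiles, max
      print('Skewness:', length_summary.skew())   # near 0 = symmetric; large positive = long right tail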

Clouds of Top N Keywords

Everyone loves word clouds, right? Well, if you do, here is how you can create a word cloud for the top n words in each topic. You can explore modifying the parameters of the WordCloud() object to customize the look and feel:

      # 1. Word cloud of top N words in each topic
      from matplotlib import pyplot as plt
      from wordcloud import WordCloud, STOPWORDS
      import matplotlib.colors as mcolors
      import math
        
      # more colors: 'mcolors.XKCD_COLORS', fewer colors: 'mcolors.TABLEAU_COLORS'
      cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
      
      cloud = WordCloud(stopwords=stop_words_spacy, # We have already removed stop words, but just in case
                        background_color='white',
                        width=2500,
                        height=1800,
                        max_words=20,
                        colormap='tab10',
                        # 'i' is looked up when each cloud is drawn, so every topic
                        # gets the color of its position in the loop below
                        color_func=lambda *args, **kwargs: cols[i],
                        prefer_horizontal=1.0)
      
      # Top keywords per topic (show_topics returns the 10 highest-weighted words by default)
      topics = lda_model.show_topics(formatted=False)
      
      matrix_size = math.ceil(num_topics**(1/2))  # Computes the n by n grid of plots to generate
      fig, axes = plt.subplots(matrix_size, matrix_size, figsize=(10,10), sharex=True, sharey=True)
        
      for i, ax in enumerate(axes.flatten()):
        try:
          topic_words = dict(topics[i][1])
          cloud.generate_from_frequencies(topic_words, max_font_size=300)
          ax.imshow(cloud)
          ax.set_title('Topic ' + str(i+1), fontdict=dict(size=16))
          ax.axis('off')
        except IndexError:  # more grid cells than topics; hide the empty axes
          ax.axis('off')
      
      plt.subplots_adjust(wspace=0, hspace=0)
      plt.margins(x=0, y=0)
      plt.tight_layout()
      plt.show()
      
Figure 12.7: Clouds of Top N Keywords.

Topic Keyword Counts

For the keywords in each topic, the importance (weight) of each keyword matters, but so does how frequently the word appears across the tweets. Let’s plot the word counts and the weights of each keyword on the same chart. This lets us apply a useful rule visually: we should eliminate words that (1) occur in multiple topics or (2) have a relative frequency (count) greater than their weight. In the former case, the word does not meaningfully differentiate the topics; in the latter, the word is less important to the topic than its raw frequency suggests and contributes more noise than signal.

      # Bar chart of word counts and weights for each topic keyword
      from collections import Counter
      topics = lda_model.show_topics(formatted=False)

      # Count how often each word appears across the tokenized documents
      data_flat = [w for w_list in docs for w in w_list]
      counter = Counter(data_flat)
        
      # One row per topic keyword: word, topic number, LDA weight, corpus-wide count
      out = []
      for i, topic in topics:
        for word, weight in topic:
          out.append([word, i + 1, weight, counter[word]])
        
      df_temp = pd.DataFrame(out, columns=['word', 'topic_id', 'importance', 'word_count'])
        
      # Plot Word Count and Weights of Topic Keywords
      matrix_size = math.ceil(num_topics**(1/2))  # Computes the n by n grid of plots to generate
      fig, axes = plt.subplots(matrix_size, matrix_size, figsize=(20,20), sharey=True, dpi=160)
      cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
      for i, ax in enumerate(axes.flatten()):
        if i >= len(topics):  # Turn off the unneeded subplots
          ax.axis('off')
          continue
        # Wide, translucent bars show raw counts; narrow, solid bars show LDA weights
        ax.bar(x='word', height='word_count', data=df_temp.loc[df_temp.topic_id==i+1, :], color=cols[i+1], width=0.5, alpha=0.3, label='Word Count')
        ax_twin = ax.twinx()
        ax_twin.bar(x='word', height='importance', data=df_temp.loc[df_temp.topic_id==i+1, :], color=cols[i+1], width=0.2, label='Weights')
        ax.set_ylabel('Word Count', color=cols[i+1])
        ax.set_title('Topic: ' + str(i + 1), color=cols[i+1], fontsize=16)
        ax.tick_params(axis='y', left=False)
        ax.set_xticks(ax.get_xticks())
        ax.set_xticklabels(df_temp.loc[df_temp.topic_id==i+1, 'word'], rotation=30, horizontalalignment='right')
        ax.legend(loc='upper center')
        ax_twin.legend(loc='upper right')
        
      fig.tight_layout(w_pad=2)
      fig.suptitle('Word Count and Importance of Topic Keywords', fontsize=20, y=1.03)
      plt.show()
      
Figure 12.8: Topic Keyword Counts.

See if you can identify all of the terms that should be removed based on the two rules given above. Here’s what I see as the problem terms:

  • Topic 1: use (Rule 1 & 2)

  • Topic 2: none

  • Topic 3: amp (Rule 1 & 2)

  • Topic 4: none

  • Topic 5: amp (Rule 1), join (Rule 2), learn (Rule 1)

  • Topic 6: amp (Rule 1), use (Rule 1 & 2)

  • Topic 7: available (Rule 1)

  • Topic 8: available (Rule 1), learn (Rule 1), use (Rule 1)

Again, the rule is that we remove any term that either (1) appears in more than one topic or (2) has a shaded bar (count) taller than its solid bar (weight). To do that, we go back to the code we created in Section 12.2 and add those words (each word once; no need to repeat it) to the custom stop word list, as sketched below. In practice, we would then re-run all of the LDA models and re-determine the optimal number of topics after removing those additional stop words. This process might require multiple rounds until no remaining words violate those rules.
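
As a reminder of what that update looks like, here is a minimal sketch. It assumes the custom stop word list from Section 12.2 is a Python list named stop_words_spacy (the same name passed to WordCloud() above); adjust the name if yours differs:

      # Sketch: extend the custom stop word list with the terms flagged by the two rules
      extra_stop_words = ['use', 'amp', 'join', 'learn', 'available']  # each word added once
      stop_words_spacy = list(set(stop_words_spacy) | set(extra_stop_words))

      # Re-run the preprocessing, re-fit the LDA models, and re-check the optimal
      # number of topics before regenerating these plots.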

t-SNE Clustering Chart

Another useful way to visualize how the documents (tweets) are distributed across the topics is a t-distributed Stochastic Neighbor Embedding (t-SNE) chart. t-SNE is a form of dimensionality reduction, similar in purpose to principal component analysis (PCA), but it does not rely on a linearity assumption; that is, it does not assume the reduced features can be captured by fitting a straight line through them. This makes it well suited to text data.

For More Information

Here is a particularly good article that explains t-SNE in greater detail: t-SNE clearly explained: An intuitive explanation of t-SNE algorithm and why it’s so useful in practice.

      # Get topic weights and dominant topics
      from sklearn.manifold import TSNE
      from bokeh.plotting import figure, show
      from bokeh.io import output_notebook
      import numpy as np
        
      # Get topic weights, aligned by topic id. row_list[0] holds the document's
      # (topic_id, probability) pairs when the model is built with per_word_topics=True;
      # gensim drops topics below its minimum_probability threshold, so missing topics get 0
      topic_weights = []
      for row_list in lda_model[corpus]:
        doc_topics = dict(row_list[0])
        topic_weights.append([doc_topics.get(t, 0) for t in range(num_topics)])
        
      # Array of topic weights
      arr = pd.DataFrame(topic_weights).fillna(0).values
        
      # Keep only the well separated points (optional)
      arr = arr[np.amax(arr, axis=1) > 0.35]
        
      # Dominant topic number in each doc
      topic_num = np.argmax(arr, axis=1)
      
      # t-SNE dimension reduction
      tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
      tsne_lda = tsne_model.fit_transform(arr)
        
      # Plot the topic clusters using Bokeh
      output_notebook()
      mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
      plot = figure(title=f"t-SNE Clustering of {num_topics} LDA Topics")
      plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])
      show(plot)
      
Figure 12.9: t-SNE Clustering Chart.

In summary, t-SNE reduced each tweet's four-dimensional topic weights to a two-dimensional space so we can see where the topics sit relative to one another. In our case, notice that Topic 3 is very under-represented (which we already knew) and falls about halfway between Topic 0 and Topic 2. I think we have seen enough to conclude that we should revise this analysis to use either three topics or five topics, since a better fit may come from decreasing or increasing the number of topics.
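
One way to act on that conclusion is to re-fit the model for a few candidate topic counts and compare their coherence scores. Here is a minimal sketch; it assumes the gensim dictionary built earlier is named id2word and that the tokenized documents are stored in docs (both names are assumptions, so rename them to match your own code):

      # Hypothetical sketch: compare coherence across candidate topic counts
      from gensim.models import LdaModel
      from gensim.models.coherencemodel import CoherenceModel

      for k in [3, 4, 5]:
        candidate = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                             random_state=42, passes=10)
        coherence = CoherenceModel(model=candidate, texts=docs, dictionary=id2word,
                                   coherence='c_v').get_coherence()
        print(f'{k} topics: coherence = {coherence:.4f}')

Higher coherence generally indicates more interpretable topics, but the final choice should still be checked against the keyword and t-SNE plots above.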