N-Grams

Now that we have a clean list of single words, let’s add pairs, triplets, and quadruplets of words that occur in the dataset. We refer to these phrases as n-grams. An n-gram is a phrase of n words that appear together, in order, in a corpus. For example, a bigram is a pair of words and a trigram is a group of three words. Importantly, this does not include every possible combination of words that could be created out of those in the corpus, but rather only those phrases that actually appear in the dataset. For example, if a tweet says, “I really like the AWS portal,” the stop words are removed first, leaving “like AWS portal.” Two bigrams would be created: “like AWS” and “AWS portal.” But the bigram “like portal” would not be created because those two words do not appear next to each other. However, the trigram “like AWS portal” would be added since those words appear together in that order.
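
To make the “only phrases that actually appear” point concrete, here is a minimal sketch in plain Python (not gensim’s implementation) that reads contiguous n-grams off a token list with a sliding window:

      # A sliding window over the token list yields only adjacent word groups,
      # never arbitrary combinations of words pulled from across the document.
      def sliding_ngrams(tokens, n):
        return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

      tokens = ['like', 'AWS', 'portal']  # "I really like the AWS portal" after stop word removal
      print(sliding_ngrams(tokens, 2))    # ['like_AWS', 'AWS_portal'], but not 'like_portal'
      print(sliding_ngrams(tokens, 3))    # ['like_AWS_portal']

The gensim library applies the same idea across the entire corpus while filtering out rare and weak phrases: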

      import gensim
          
      # gensim.models.Phrases ties together pairs of words that appear together in docs
      bigram = gensim.models.Phrases(docs, min_count=5, threshold=10)
        
      # Repeat the process using the bigrams + docs to train trigrams
      trigram = gensim.models.Phrases(bigram[docs], min_count=5, threshold=10)
        
      # Repeat the process using the bigrams + docs to train fourgrams
      fourgram = gensim.models.Phrases(trigram[docs], min_count=5, threshold=10)
        
      # These are "frozen" (i.e. immutable) versions of the same files that are quicker in memory
      bigram_mod = gensim.models.phrases.Phraser(bigram)
      trigram_mod = gensim.models.phrases.Phraser(trigram)
      fourgram_mod = gensim.models.phrases.Phraser(fourgram)
      

The objects bigram_mod, trigram_mod, and fourgram_mod contain the phrases detected in the corpus. Trigrams are generated by running the corpus, with its bigrams already merged, through a second Phrases pass, so a trigram is really a bigram joined to an adjacent unigram; fourgrams are then built the same way from the trigrammed corpus. The min_count parameter indicates how many times a bigram, trigram, or fourgram must appear in the entire corpus for it to be included. A higher threshold parameter means that fewer n-grams will be kept. Understanding exactly what the threshold measures requires a quick look at how gensim scores candidate phrases; the exact formulas can be found in gensim's documentation.
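
Before digging into the score itself, here is a small, self-contained illustration of that chaining. The corpus is made up, and min_count and threshold are loosened only so that the tiny dataset actually produces merges:

      from gensim.models import Phrases
      from gensim.models.phrases import Phraser

      # Toy corpus in which "new york" (and then "new_york city") co-occur often
      # enough, relative to this tiny dataset, to clear the loosened cutoffs below
      toy_docs = [['visit', 'new', 'york', 'city', 'today'],
                  ['new', 'york', 'city', 'hosts', 'the', 'event'],
                  ['flights', 'to', 'new', 'york', 'city', 'are', 'cheap'],
                  ['new', 'york', 'city', 'never', 'sleeps'],
                  ['she', 'moved', 'to', 'new', 'york', 'city']]

      bi = Phraser(Phrases(toy_docs, min_count=1, threshold=1))
      tri = Phraser(Phrases(bi[toy_docs], min_count=1, threshold=1))

      print(bi[toy_docs[0]])       # ['visit', 'new_york', 'city', 'today']
      print(tri[bi[toy_docs[0]]])  # ['visit', 'new_york_city', 'today']

The second Phrases pass never sees “new” and “york” as separate raw words; it sees the merged token “new_york” and learns that it frequently precedes “city,” which is exactly how the trigram emerges.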

Conceptually, the score compared against threshold measures how strongly two words are associated: it weighs how often the pair actually appears together in the corpus against how often each word appears on its own. A higher score means the pair behaves more like a fixed phrase and less like two common words that occasionally happen to land next to each other. Here is a summary of the default scoring method (called original_scorer()) that is used for the threshold parameter of the gensim.models.Phrases object; a worked example with made-up counts follows the parameter list:

  • gensim.models.phrases.original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count)
    • Based on: Mikolov et al. (2013), “Distributed Representations of Words and Phrases and their Compositionality”.
  • Formula: ((bigram_count - min_count) * len_vocab) / (worda_count * wordb_count)
  • Parameters:
    • worda_count (int) – Number of occurrences for first word.
    • wordb_count (int) – Number of occurrences for second word.
    • bigram_count (int) – Number of co-occurrences for phrase “worda_wordb”.
    • len_vocab (int) – Size of vocabulary.
    • min_count (int) – Minimum collocation count threshold.
    • corpus_word_count (int) – Not used in this particular scoring technique.
  • Returns: Score for given phrase. Can be negative.
  • Return type: float
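
To make the formula concrete, the snippet below plugs made-up counts into original_scorer(); every number here is hypothetical and chosen only to show how the result compares against threshold=10:

      from gensim.models.phrases import original_scorer

      # Hypothetical counts: "aws" appears 120 times, "portal" 80 times, and the
      # pair "aws portal" 40 times in a 100,000-word corpus with a 5,000-term vocabulary
      score = original_scorer(worda_count=120, wordb_count=80, bigram_count=40,
                              len_vocab=5000, min_count=5, corpus_word_count=100000)

      print(score)  # (40 - 5) * 5000 / (120 * 80) ≈ 18.2, which clears threshold=10

If the pair had appeared only 10 times instead of 40, the score would fall to (10 - 5) * 5000 / 9600 ≈ 2.6 and the phrase would be discarded.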

One way to understand what was created using those objects in the code above is to print their contents and see which n-grams made the cut based on the min_count and threshold parameters. We can print out the terms that were generated.

      # Which n-grams were generated? Let's put their results in a DataFrame just to understand
      import pandas as pd

      bigram_list = list(bigram_mod.phrasegrams.keys())
      trigram_list = list(trigram_mod.phrasegrams.keys())
      fourgram_list = list(fourgram_mod.phrasegrams.keys())

      df_ngrams = pd.DataFrame(columns=['bigrams', 'trigrams', 'fourgrams'],
                               index=list(bigram_list + trigram_list + fourgram_list))
      df_ngrams = df_ngrams[~df_ngrams.index.duplicated(keep='first')]  # Drop duplicate index labels
        
      for ngram in bigram_list + trigram_list + fourgram_list:
        if ngram in bigram_list: df_ngrams.at[ngram, 'bigrams'] = 'x'
        if ngram in trigram_list: df_ngrams.at[ngram, 'trigrams'] = 'x'
        if ngram in fourgram_list: df_ngrams.at[ngram, 'fourgrams'] = 'x'
        
      pd.set_option('display.max_rows', None)
      df_ngrams.sort_index()
      

The resulting table is too long to show in full here, but you can view it completely in your own notebook. You may be wondering why not all possible n-grams were generated. Remember that, based on the min_count and threshold parameters, not every candidate makes the cut. The next step is to actually add those n-grams to the bag-of-words list for each document in the corpus. I'm going to create a function that generates the n-grams and adds them to the document lists all at once. The function below recreates those models and accepts their parameters in the ngrams function definition.

      def ngrams(docs, min_count=5, threshold=10):
        from gensim.models import Phrases
        from gensim.models.phrases import Phraser
        
        # Generate n-gram models
        bigram = Phrases(docs, min_count=min_count, threshold=threshold)
        trigram = Phrases(bigram[docs], min_count=min_count, threshold=threshold)
        fourgram = Phrases(trigram[docs], min_count=min_count, threshold=threshold)
          
        # Convert models to "frozen", immutable versions for speed
        bigram_mod = Phraser(bigram)
        trigram_mod = Phraser(trigram)
        fourgram_mod = Phraser(fourgram)
          
        docs = [bigram_mod[doc] for doc in docs]    # Merge bigrams into each doc
        docs = [trigram_mod[doc] for doc in docs]   # Then merge trigrams
        docs = [fourgram_mod[doc] for doc in docs]  # Then merge fourgrams
        
        return docs
        
      # Call the function
      docs = ngrams(docs)
        
      # Print some samples to see what happened
      for doc in docs[:5]:
        print(doc)
        
      # Output:
      # ['nice', 'predictable', 'profit', 'engine']
      # ['announce', 'new', 'vpn', 'feature', 'Sao_Paulo', 'Region']
      # ['user', 'use', 'Zadara', '+', 'enahnce', 'storage', 'click', 'away']
      # ['CloudFormation', 'add', 'Support', 'VPC', 'NAT', 'Gateway', 'EC2_Container', 'Registry', 'More']
      # ['database', 'migration', 'service', 'available']
      

You may be wondering how the list comprehension works in this function. The code bigram_mod[doc] applies the bigram model to a particular document; meaning, it looks for adjacent pairs of unigrams in the doc that match one of the phrases generated and stored in the bigram_mod object, and concatenates them (unigram_unigram). Notice in the output that only two were identified: Sao_Paulo and EC2_Container. These pairs appeared together at least 5 times in the corpus (min_count) and earned a phrase score above 10 (threshold). Again, that score measures how much more often the two unigrams "Sao" and "Paulo" appear side by side than we would expect if they were unrelated words; a pair that co-occurs that consistently is treated as a single term. Pairs that appear together only occasionally, or that are simply two very common words landing next to each other now and then, fall below the threshold and remain separate unigrams.
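
For instance, assuming the bigram_mod object trained earlier (and the second document above in its pre-merge form), applying the model to a single document looks like this:

      # Apply the frozen bigram model to one tokenized document. Adjacent tokens
      # whose pair cleared min_count and threshold come back joined by "_";
      # everything else passes through unchanged.
      doc = ['announce', 'new', 'vpn', 'feature', 'Sao', 'Paulo', 'Region']
      print(bigram_mod[doc])
      # ['announce', 'new', 'vpn', 'feature', 'Sao_Paulo', 'Region']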

There is one last useful cleaning step we can perform related to n-grams. As you hopefully remember from the prior chapter, there are many different parts of speech that can be identified, and not all of them are particularly useful for identifying topics. Some experts recommend nouns, verbs, adjectives, and adverbs as the most useful. Let's create a function that keeps only the parts of speech we want. I also included proper nouns in the allowed list below in case there are names that draw attention in social media posts.

      def filter_pos(docs, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV', 'PROPN']):
        # Assumes nlp is the spaCy language model loaded in the prior chapter
        docs_with_ngrams = [] # Create a new list to store the final docs
        
        # Keep only tokens (including n-grams) whose part-of-speech (pos) tag is allowed
        for doc in docs:
          doc = nlp(" ".join(doc)) # Join the tokens back into one string so spaCy can tag them
          docs_with_ngrams.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
          
        return docs_with_ngrams
        
      # Call the function
      docs = filter_pos(docs)
        
      # Print some samples to see what happened
      for doc in docs[:5]:
        print(doc)
        
      # Output:
      # ['nice', 'predictable', 'profit', 'engine']
      # ['announce', 'new', 'vpn', 'feature', 'Sao_Paulo', 'Region']
      # ['user', 'use', 'Zadara', '+', 'enahnce', 'storage', 'click', 'away']
      # ['CloudFormation', 'add', 'Support', 'VPC', 'NAT', 'Gateway', 'EC2_Container', 'Registry', 'more']
      # ['database', 'migration', 'service', 'available']
      

It looks like all of the terms in the first five documents were already one of the allowed parts of speech. But if you look further through the documents, you will find that several more terms have been removed. Also, notice that even after all we have done, we still get misspellings (see "enahnce" in the third row) and other unrecognized terms and acronyms. That's okay. Once we perform topic modeling, we'll see which words are not helping to identify topics, and we can add them to the original stop word list.
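
When that time comes, extending the stop word list is straightforward. Here is a minimal sketch, assuming the list built during the earlier cleaning step is a Python set named stop_words (that name, and the example terms, are hypothetical):

      # Hypothetical: add low-value terms surfaced by topic modeling to the
      # existing stop word set (assumed here to be named stop_words) and re-filter
      stop_words = stop_words | {'enahnce', 'away'}
      docs = [[token for token in doc if token not in stop_words] for doc in docs]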

Now that the documents are tokenized, cleaned, and updated to include relevant n-grams, we are ready to generate topic models.