16.3 Text Cleaning
Remove Duplicates
Human language varies greatly and carries a large amount of complexity. Before we can begin extracting topics from the text, there is a lot of cleaning we can do to standardize the data and make it easier to identify topics. Let's begin by simply removing duplicate posts. This step is not mandatory for every corpus; re-posted social media messages (retweets) are specific to this context.
print(f'Total tweets: {len(df)}')
df = df.loc[~df['text'].str.contains("RT @")] # Remove anything containing the 'RT @' text
df = df.drop_duplicates(subset=['text']) # Remove any duplicate post
print(f'Original tweets: {len(df)}')
# Output
# Total tweets: 1000
# Original tweets: 979
Looks like we had 21 retweets and duplicate posts in our dataset.
One of the major problems with social media posts is that such a large volume of them is posted by bots. Bots are software programs designed to drive social media trends by posting automated content. Bots are a somewhat gray area when it comes to social media ethics. Many bots are useful and used for public services, such as emergency alerts. Others are used by organizations as part of a marketing strategy. There are also government-sponsored bots that, while operating on publicly available social media platforms and within legal boundaries, manage to influence politics (Aral and Eckles, 2019), impact elections (Linvill et al., 2019), and sow public discord concerning vaccinations (Walter et al., 2020).1
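Our DataFrame contains only the tweet text, so we will not attempt bot filtering here, but if account metadata were available, a crude first pass might use simple heuristics such as bot-like screen names or implausibly high posting volume. The sketch below runs on a made-up toy DataFrame; the 'screen_name' column and the thresholds are assumptions for illustration only, and real bot detection relies on far richer signals.
# A minimal, hypothetical sketch of heuristic bot filtering on toy data.
# The 'screen_name' column is assumed for illustration; it is not part of
# the df used in this chapter, and these rules are far from real bot detection.
import pandas as pd
toy = pd.DataFrame({
    'screen_name': ['alice', 'newsbot', 'promo99', 'promo99', 'promo99', 'bob'],
    'text': ['hello', 'auto post', 'buy now', 'buy now!', 'buy now!!', 'hey']
})
bot_like_name = toy['screen_name'].str.lower().str.endswith('bot')         # names ending in 'bot'
post_count = toy.groupby('screen_name')['screen_name'].transform('count')  # posts per account
toy = toy.loc[~bot_like_name & (post_count <= 2)]                          # keep low-volume, human-looking accounts
print(toy['screen_name'].tolist())  # ['alice', 'bob']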
Remove Emails, URLs, and Other Unnecessary Characters
Next, we will remove line breaks, single quotes, email addresses, and URLs. Basically, we can remove any characters that we do not believe will help identify unique topics in the tweets. For example, email addresses and URLs are specific, rarely repeated, and rarely named in a way that relates to a topic. As a result, they add unnecessary variance to the data that inhibits topic modeling. The function below uses regular expressions (RegEx) to identify and remove this text.
import re

def re_mod(doc):
    doc = re.sub(r'\S*@\S*\s?', '', doc)   # remove emails (this also catches @mentions)
    doc = re.sub(r'\s+', ' ', doc)         # collapse whitespace, including newline chars
    doc = re.sub(r"\'", "", doc)           # remove single quotes
    doc = re.sub(r"http\S+", "url", doc)   # replace URLs with 'url'
    return doc
# Clean each tweet and store the results in a list (one cleaned string per tweet)
docs = df['text'].map(lambda x: re_mod(x)).values.tolist()
# Print the first five records to see what they look like
for doc in docs[:5]:
    print(doc)
# Output
# Amazon Web Services is becoming a nice predictable profit engine
# Announcing four new VPN features in our Sao Paulo Region.
# Are you an user? Use #Zadara + #AWS to enahnce your storage just one click away!
# AWS CloudFormation Adds Support for Amazon VPC NAT Gateway Amazon EC2 Container Registry and More via
# AWS database migration service now available:
This may be the first time you have seen the .map() function of pandas or a Python lambda function. Lambda functions are small anonymous functions created with the lambda keyword; they are defined inline and do not require a separate def statement or a name. In other words, we are representing each cell in the df['text'] series as x, and the .map() function of pandas applies re_mod(x) to every cell in the series df['text']. There is a supplementary chapter in this book (Supplement) that explains lambda functions in greater detail, as well as the .apply(), .applymap(), and .map() functions of pandas that make it easy to use lambda functions on DataFrames and Series, if you want to learn more.
Essentially, the .map() function converts every value in df['text'] to its cleaner version by applying the re_mod() function, and we store the results in a Python list (docs) rather than a new column in the DataFrame. When we are processing large amounts of text, it is much faster to work with a plain Python list than with the whole DataFrame df.
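If this pattern is new to you, here is a minimal toy illustration (the Series and values are made up for this example) showing that .map() with a lambda is equivalent to passing a named function:
import pandas as pd
# Toy Series used only for illustration
s = pd.Series(['  Hello ', 'WORLD  '])
def tidy(x):
    return x.strip().lower()
# These two lines produce the same result: each cell is passed to the function as x
print(s.map(lambda x: tidy(x)).tolist())  # ['hello', 'world']
print(s.map(tidy).tolist())               # ['hello', 'world']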
Remove Stop Words and Punctuation; Lemmatization
Next, we are going to do some more serious cleaning that we also performed in the prior chapter. In particular, let's identify stop words and punctuation and remove them. At the same time, we will lemmatize the words to reduce variance and make it easier to identify topics. We can modify and use the function we wrote in the last chapter for this task. But first, which list of stop words should we begin with? There are several standard sets available in different packages. For example, the nltk package provides a list with 179 words:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words_nltk = stopwords.words('english')
print(f'Stopwords in NLTK:\t{len(stop_words_nltk)}')
print(stop_words_nltk)
# After reviewing the LDA, return to add words that you want to eliminate
stop_words_nltk.extend(['AWS', 'Amazon', 'Web', 'Services'])
# Output:
# Stopwords in NLTK: 179
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
There is an even longer list available in the spaCy package, which we used in the last chapter:
import spacy
nlp = spacy.load('en_core_web_sm')
stop_words_spacy = nlp.Defaults.stop_words
print(f'Stopwords in spaCy:\t{len(stop_words_spacy)}')
print(stop_words_spacy)
# After reviewing the LDA, return to add words that you want to eliminate
stop_words_spacy |= {"AWS", "Amazon", "Web", "Services"}
# Output:
# Stopwords in spaCy: 326
# {'former', 'whither', 'both', 'thru', 'whoever', '’s', 'seemed', 'i', 'of', 'already', 'and', 'mine', 'afterwards', 'thence', 'anywhere', 'hereupon', 'until', 'name', 'except', 'whereupon', 'the', 'own', 'beside', 'him', 'your', 'even', 'only', 'say', 'is', 'yourself', 'nine', 'thereafter', 'get', 'more', 'sometime', 'well', 'me', 'at', 'neither', 'himself', 'several', 'therefore', 'whatever', 'about', 'did', 'either', 'something', 'we', 'that', 'her', 'there', 'top', 'n’t', 'nor', 'on', 'four', 'am', 'has', 'rather', 'alone', 'via', 'can', 'he', 'twenty', 'each', 'also', 'give', 'hers', 'seem', 'whenever', 'be', 'may', 'must', 'forty', 'being', "n't", 'none', 'mostly', 'part', 'which', 'all', 'too', 'moreover', 'where', 'using', 'due', 'thereupon', 'just', 'other', 'while', 'everywhere', '‘ll', 'then', 'these', 'bottom', 'else', 'fifteen', 'hence', 'beyond', "'ll", 'empty', 'ourselves', 'this', 'behind', 'their', 'why', 'eight', 'elsewhere', 'anyone', 'nowhere', 'yet', '‘ve', 'throughout', 'together', 'without', 'between', 'although', 'myself', 'might', 'hundred', 'never', 'various', 'they', 'nothing', 'call', 'used', 'meanwhile', 'somewhere', 'with', 'should', 'eleven', 'see', 'same', 'further', 'go', 'very', 'everyone', 'six', 'after', 'those', 'much', 'back', 'again', 'when', 'itself', 'within', 'quite', 'take', 'almost', 'themselves', 'what', 'last', 'amount', 'you', 'its', 'no', 'off', 'became', 'since', 'third', 'it', 'many', 'among', 'during', 'anyway', 'everything', 'along', 'becoming', 'how', '’m', 'however', 'latter', 'yourselves', 'than', 'less', 'once', 'another', 'not', 'noone', 'side', 'somehow', 'one', 'front', 'toward', 'was', 'such', 'before', 'herself', 'often', 'now', '‘m', 'whence', 'ten', 'anyhow', 'by', 'regarding', 'amongst', 'beforehand', '‘d', 'therein', 'serious', 'keep', 'per', 'someone', 'had', 'into', 'few', 'were', "'d", 'hereafter', 'towards', 'across', 'sixty', 'ca', 'would', 'thus', 'to', 'onto', 'twelve', 'if', 'three', 'above', '’d', 'my', 'whereas', 'seems', 'herein', 'two', 'a', 'us', '‘re', 'besides', 'down', "'re", 'whereafter', 'but', '’ll', 'up', 'been', 'latterly', 'enough', 'becomes', 'upon', 'any', 'have', 'below', 'against', 'show', 'always', 'make', 'over', 'made', 'she', 'five', "'ve", 'whether', 'because', '‘s', 'whereby', 'thereby', 'do', 'really', 'anything', 'or', 'fifty', 'formerly', 'an', 'who', 'wherever', 'sometimes', 're', 'least', 'cannot', 'most', 'out', 'namely', 'perhaps', 'otherwise', 'seeming', 'move', 'from', 'are', 'them', '’ve', 'nevertheless', 'under', 'next', 'some', 'doing', 'his', 'ours', 'whose', 'become', 'our', 'does', 'nobody', 'whom', 'so', 'could', 'whole', '’re', 'wherein', 'done', 'here', 'first', 'in', "'m", 'n‘t', 'indeed', 'others', 'ever', 'as', 'full', 'through', 'put', 'around', 'for', 'unless', 'though', 'every', 'will', 'hereby', "'s", 'please', 'still', 'yours'}
Both packages make it easy to modify their lists by adding or removing stop words. For this dataset, I recommend we use spaCy's longer list with those four AWS-related terms added. Every post is about AWS, so that term will not help in identifying topics. The function below is just like the one we created in the last chapter, except I've added an extra check to make sure that neither the original word nor the lemmatized version is in the stop word list. I also changed the name of the function. Lastly, I created a second version that is identical but uses Python list comprehension. List comprehension is a nice shorthand for performing iterations with conditions, and because it avoids repeated calls to .append(), it is often faster than the equivalent explicit loop. Therefore, I recommend learning list comprehension and using it wherever it keeps the code readable.
# These two functions do the same thing; the second uses list comprehension
def docs_lemma_stop(doc, stop_words):
    unigrams = []
    for unigram in doc:                          # remove stop words and punctuation
        if unigram.text not in stop_words:       # check the original word...
            if unigram.lemma_ not in stop_words: # ...and the lemmatized version
                if not unigram.is_punct:         # remove punctuation
                    unigrams.append(unigram.lemma_)  # append the lemmatized version
    return unigrams

# This version does the same thing and is typically faster, especially on larger datasets
def docs_lemma_stop(doc, stop_words):
    return [unigram.lemma_ for unigram in doc
            if unigram.lemma_ not in stop_words
            and unigram.text not in stop_words
            and not unigram.is_punct]
# Call the function to remove stop words, punctuation and perform lemmatization on each doc
docs = [docs_lemma_stop(nlp(doc), stop_words_spacy) for doc in docs]
# Print the first five records to see what they look like
for doc in docs[:5]:
    print(doc)
# Output:
# ['nice', 'predictable', 'profit', 'engine']
# ['announce', 'new', 'vpn', 'feature', 'Sao', 'Paulo', 'Region']
# ['user', 'use', 'Zadara', '+', 'enahnce', 'storage', 'click', 'away']
# ['CloudFormation', 'add', 'Support', 'VPC', 'NAT', 'Gateway', 'EC2', 'Container', 'Registry', 'More']
# ['database', 'migration', 'service', 'available']
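If the dataset were much larger, one option would be to batch the cleaned strings through spaCy's nlp.pipe(), which is usually faster than calling nlp() on each tweet individually. The line below is a drop-in replacement for the list comprehension above (use one or the other, not both):
# Optional: nlp.pipe() streams the texts through the pipeline in batches,
# which is usually faster than calling nlp(doc) once per tweet
docs = [docs_lemma_stop(doc, stop_words_spacy) for doc in nlp.pipe(docs)]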
Notice that each document is now a list of tokens. In text analytics, this tokenized form was originally referred to as a "bag of words." A "bag of words" is simply the collection of words and phrases that make up a document or corpus. You know what a document is (in this case, each tweet is an individual "document"), but what is a corpus? The corpus is the collection of all the lists of words from all documents. We will use the corpus to determine the topics within the dataset. But first, now that the documents are a bit cleaner, there are some extra features we can generate to improve the dataset.
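For example, taking two of the cleaned tweets above, the corpus is simply the list containing both token lists:
# Two of the cleaned tweets ("documents") from above
doc_a = ['announce', 'new', 'vpn', 'feature', 'Sao', 'Paulo', 'Region']
doc_b = ['database', 'migration', 'service', 'available']
corpus = [doc_a, doc_b]   # the corpus is just the collection of all documents
print(len(corpus))        # 2 documents
print(corpus[0])          # tokens of the first document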