12.1 Introduction

Now that you've had a small taste of how text analytics and NLP models can help you deal with unstructured text, let's take it further by using slightly more advanced models that tell us what topics exist within documents. This involves a category of NLP models used for what is referred to as "topic modeling." Topic modeling is a technique for discovering the general topics that occur throughout a collection of text documents. Put another way, it is the process of identifying hidden semantic structures in a body of text.

There are many types of topic modeling, each with different use cases, strengths, and weaknesses. The one we will learn in this chapter, Latent Dirichlet Allocation (LDA), is a very common form with decent general applicability. Before we run the LDA algorithm, we first need to clean and prepare the text. This is one way that text analytics departs a bit from the normal CRISP-DM process: we often have to perform some preliminary Data Preparation before we can perform the Data Understanding phase. Then we clean again, explore a bit more, and continue the cycle until it "feels" like we have clean topics, even though we can never know for certain. You'll see what I mean as we dive in.
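To make the "clean and prepare" step a little more concrete, here is a minimal sketch of what a first cleaning pass might look like. It is only an illustration under stated assumptions, not the chapter's actual pipeline: the names `STOP_WORDS`, `clean_text`, and `term_counts` are hypothetical, the stop-word list is deliberately tiny, and the two sample sentences are made up. The idea is simply to lowercase the text, keep alphabetic tokens, drop common filler words, and count what remains, since per-document word counts are the kind of input LDA works from.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only; real projects
# usually rely on a much longer list from an NLP library.
STOP_WORDS = {"the", "a", "an", "and", "is", "are", "of", "to", "in", "on", "at"}


def clean_text(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_counts(documents):
    """Return one word-count mapping per document (a bag-of-words view)."""
    return [Counter(clean_text(doc)) for doc in documents]


# Made-up sample documents, one per "topic" a reader might expect.
docs = [
    "The pitcher threw the ball to the catcher.",
    "Voters are heading to the polls in the election!",
]
counts = term_counts(docs)

for doc_counts in counts:
    print(dict(doc_counts))
```

In practice you would look at the resulting counts, notice leftover noise, adjust the cleaning rules, and repeat, which is the clean-and-explore cycle described above.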

Images in this section were created using DALL·E from OpenAI.
