Improving Model Fit

Finally, what did we do all of this for anyway? The purpose of text analytics is to generate new features that can improve model fit. Let's start with a baseline model that predicts RetweetCount without the topic scores. Notice in the code below that we drop all eight topic scores from this initial model so that we can see how much the R² improves once we add them back in:

      # Generate a baseline model that doesn't include topic scores

      import statsmodels.api as sm
      import pandas as pd

      df = df_topics.copy()
      y = df.RetweetCount

      # Drop the target, Reach, the raw text, and all eight topic scores
      X = df.drop(columns=['RetweetCount', 'Reach', 'text', 'topic_1', 'topic_2',
                           'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8'])

      # One-hot encode the categorical features and add an intercept column
      X = pd.get_dummies(X, drop_first=True).assign(const=1)

      results = sm.OLS(y, X.astype('float64')).fit()
      print(results.summary())
      
      # Output:
      #                             OLS Regression Results                            
      # ==============================================================================
      # Dep. Variable:           RetweetCount   R-squared:                       0.215
      # Model:                            OLS   Adj. R-squared:                  0.205
      # Method:                 Least Squares   F-statistic:                     20.37
      # Date:                Tue, 26 Mar 2024   Prob (F-statistic):           7.22e-43
      # Time:                        06:09:50   Log-Likelihood:                -4255.7
      # No. Observations:                 979   AIC:                             8539.
      # Df Residuals:                     965   BIC:                             8608.
      # Df Model:                          13                                         
      # Covariance Type:            nonrobust                                         
      # =====================================================================================
      #                      coef    std err          t      P>|t|      [0.025      0.975]
      # -------------------------------------------------------------------------------------
      # Hour                  0.3644      0.116      3.153      0.002       0.138       0.591
      # Day                  -0.1085      0.076     -1.421      0.156      -0.258       0.041
      # Klout                 0.5355      0.051     10.522      0.000       0.436       0.635
      # Sentiment             1.4602      0.680      2.149      0.032       0.127       2.794
      # Gender_Male           1.2055      4.491      0.268      0.788      -7.608      10.019
      # Gender_Unisex        -1.8570      5.630     -0.330      0.742     -12.905       9.191
      # Gender_Unknown        6.1547      4.421      1.392      0.164      -2.521      14.830
      # Weekday_Monday        1.2932      2.351      0.550      0.582      -3.321       5.907
      # Weekday_Saturday      1.9597      3.156      0.621      0.535      -4.234       8.153
      # Weekday_Sunday        1.5432      3.520      0.438      0.661      -5.364       8.451
      # Weekday_Thursday      3.8403      2.106      1.824      0.068      -0.292       7.972
      # Weekday_Tuesday       2.2534      2.162      1.042      0.297      -1.989       6.495
      # Weekday_Wednesday     3.2981      2.201      1.498      0.134      -1.022       7.618
      # const               -28.9080      5.473     -5.282      0.000     -39.648     -18.168
      # ==============================================================================
      # Omnibus:                      958.912   Durbin-Watson:                   2.020
      # Prob(Omnibus):                  0.000   Jarque-Bera (JB):            56871.665
      # Skew:                           4.438   Prob(JB):                         0.00
      # Kurtosis:                      39.268   Cond. No.                     1.01e+03
      # ==============================================================================
      

As you can see, we get an R² of 21.5 percent.
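
If you want to work with the fit statistics programmatically rather than reading them off the summary table, the fitted results object exposes them directly. The short sketch below pulls out R² and adjusted R², and also stashes the baseline fit under a name of our own choosing, baseline_results, so we can compare the two models formally later:

      # Keep a handle on the baseline fit for a later model comparison
      baseline_results = results

      # R-squared and adjusted R-squared are attributes of the results object
      print(f"Baseline R-squared:      {baseline_results.rsquared:.3f}")       # 0.215
      print(f"Baseline adj. R-squared: {baseline_results.rsquared_adj:.3f}")   # 0.205

Next, let's generate another model, this time including the topic scores: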

      # Compare an improved model that does include topic scores

      # This time keep the eight topic scores; drop only the target, Reach, and raw text
      X = df.drop(columns=['RetweetCount', 'Reach', 'text'])
      X = pd.get_dummies(X, drop_first=True).assign(const=1)

      results = sm.OLS(y, X.astype('float64')).fit()
      print(results.summary())
      
      # Output:
      #                         OLS Regression Results                            
      # ==============================================================================
      # Dep. Variable:           RetweetCount   R-squared:                       0.239
      # Model:                            OLS   Adj. R-squared:                  0.222
      # Method:                 Least Squares   F-statistic:                     14.28
      # Date:                Tue, 26 Mar 2024   Prob (F-statistic):           8.19e-44
      # Time:                        06:09:57   Log-Likelihood:                -4241.0
      # No. Observations:                 979   AIC:                             8526.
      # Df Residuals:                     957   BIC:                             8633.
      # Df Model:                          21                                         
      # Covariance Type:            nonrobust                                         
      # =====================================================================================
      #                      coef    std err          t      P>|t|      [0.025      0.975]
      # -------------------------------------------------------------------------------------
      # Hour                  0.2962      0.116      2.564      0.011       0.069       0.523
      # Day                  -0.1179      0.076     -1.553      0.121      -0.267       0.031
      # Klout                 0.5248      0.053      9.877      0.000       0.421       0.629
      # Sentiment             1.1501      0.686      1.676      0.094      -0.197       2.497
      # topic_1              52.3840     69.374      0.755      0.450     -83.759     188.527
      # topic_2              45.7537     69.666      0.657      0.511     -90.963     182.470
      # topic_3              54.2263     69.678      0.778      0.437     -82.512     190.965
      # topic_4              56.1795     69.510      0.808      0.419     -80.230     192.589
      # topic_5              47.1367     69.426      0.679      0.497     -89.109     183.382
      # topic_6              41.2176     69.650      0.592      0.554     -95.467     177.902
      # topic_7              45.2370     69.430      0.652      0.515     -91.016     181.490
      # topic_8              55.2635     69.432      0.796      0.426     -80.993     191.520
      # Gender_Male           0.4825      4.453      0.108      0.914      -8.256       9.221
      # Gender_Unisex        -2.8251      5.604     -0.504      0.614     -13.822       8.172
      # Gender_Unknown        5.5380      4.389      1.262      0.207      -3.075      14.151
      # Weekday_Monday        1.8285      2.336      0.783      0.434      -2.756       6.413
      # Weekday_Saturday      3.2080      3.156      1.016      0.310      -2.986       9.402
      # Weekday_Sunday        3.6341      3.513      1.034      0.301      -3.261      10.529
      # Weekday_Thursday      4.3879      2.093      2.096      0.036       0.281       8.495
      # Weekday_Tuesday       2.8010      2.142      1.308      0.191      -1.403       7.005
      # Weekday_Wednesday     3.2736      2.179      1.502      0.133      -1.003       7.550
      # const               -76.9587     68.854     -1.118      0.264    -212.082      58.164
      # ==============================================================================
      # Omnibus:                      958.822   Durbin-Watson:                   2.022
      # Prob(Omnibus):                  0.000   Jarque-Bera (JB):            57980.284
      # Skew:                           4.429   Prob(JB):                         0.00
      # Kurtosis:                      39.646   Cond. No.                     2.37e+04
      # ==============================================================================
      

As you can see, the R² value jumps to 23.9 percent, and the adjusted R² rises from 0.205 to 0.222, so the gain is not simply an artifact of adding eight more predictors. An improvement of about 2.4 percentage points may not seem like much, but even a modest gain in explained variance can be meaningful in practice, and an improvement that is small in this context may be much larger in others. In summary, text analytics is a valuable way to improve the predictive accuracy of our models.
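
Because the baseline model is nested inside this expanded one (the baseline simply constrains the eight topic coefficients to zero), we can go beyond eyeballing R² and formally test whether the topic scores jointly improve the fit. Below is a minimal sketch using statsmodels' compare_f_test; it assumes the baseline_results handle saved earlier and the results object from the second fit are both still in scope:

      # F-test of the expanded model against the nested baseline:
      # do the eight topic scores jointly improve the fit?
      f_value, p_value, df_diff = results.compare_f_test(baseline_results)
      print(f"F = {f_value:.3f}, p = {p_value:.4f}, extra parameters = {int(df_diff)}")

A small p-value here would indicate that the topic scores, taken together, explain variation in RetweetCount beyond what the baseline features already capture.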