12.9 Improving Model Fit
Finally, what did we do all of this for anyway? The purpose of text analytics is to generate new features that can be used to improve model fit. Let’s create a baseline model that predicts RetweetCount without the topic scores. Notice in the code below that we drop the eight topic scores from this initial model so that we can see how much the R² improves once we add them back in:
# Generate a baseline model that doesn't include topic scores
import statsmodels.api as sm
import pandas as pd
df = df_topics.copy()
y = df.RetweetCount
X = df.drop(columns=['RetweetCount', 'Reach', 'text', 'topic_1', 'topic_2',
'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8'])
X = pd.get_dummies(X, drop_first=True).assign(const=1)
results = sm.OLS(y, X.astype('float64')).fit()
print(results.summary())
# Output:
# OLS Regression Results
# ==============================================================================
# Dep. Variable: RetweetCount R-squared: 0.215
# Model: OLS Adj. R-squared: 0.205
# Method: Least Squares F-statistic: 20.37
# Date: Tue, 26 Mar 2024 Prob (F-statistic): 7.22e-43
# Time: 06:09:50 Log-Likelihood: -4255.7
# No. Observations: 979 AIC: 8539.
# Df Residuals: 965 BIC: 8608.
# Df Model: 13
# Covariance Type: nonrobust
# =====================================================================================
# coef std err t P>|t| [0.025 0.975]
# -------------------------------------------------------------------------------------
# Hour 0.3644 0.116 3.153 0.002 0.138 0.591
# Day -0.1085 0.076 -1.421 0.156 -0.258 0.041
# Klout 0.5355 0.051 10.522 0.000 0.436 0.635
# Sentiment 1.4602 0.680 2.149 0.032 0.127 2.794
# Gender_Male 1.2055 4.491 0.268 0.788 -7.608 10.019
# Gender_Unisex -1.8570 5.630 -0.330 0.742 -12.905 9.191
# Gender_Unknown 6.1547 4.421 1.392 0.164 -2.521 14.830
# Weekday_Monday 1.2932 2.351 0.550 0.582 -3.321 5.907
# Weekday_Saturday 1.9597 3.156 0.621 0.535 -4.234 8.153
# Weekday_Sunday 1.5432 3.520 0.438 0.661 -5.364 8.451
# Weekday_Thursday 3.8403 2.106 1.824 0.068 -0.292 7.972
# Weekday_Tuesday 2.2534 2.162 1.042 0.297 -1.989 6.495
# Weekday_Wednesday 3.2981 2.201 1.498 0.134 -1.022 7.618
# const -28.9080 5.473 -5.282 0.000 -39.648 -18.168
# ==============================================================================
# Omnibus: 958.912 Durbin-Watson: 2.020
# Prob(Omnibus): 0.000 Jarque-Bera (JB): 56871.665
# Skew: 4.438 Prob(JB): 0.00
# Kurtosis: 39.268 Cond. No. 1.01e+03
# ==============================================================================
As you can see, the baseline model has an R² of 0.215; that is, it explains 21.5 percent of the variance in RetweetCount. Next, let’s generate another model that includes the topic scores:
# Compare an improved model that does include topic scores
X = df.drop(columns=['RetweetCount', 'Reach', 'text'])
X = pd.get_dummies(X, drop_first=True).assign(const=1)
results = sm.OLS(y, X.astype('float64')).fit()
print(results.summary())
# Output:
# OLS Regression Results
# ==============================================================================
# Dep. Variable: RetweetCount R-squared: 0.239
# Model: OLS Adj. R-squared: 0.222
# Method: Least Squares F-statistic: 14.28
# Date: Tue, 26 Mar 2024 Prob (F-statistic): 8.19e-44
# Time: 06:09:57 Log-Likelihood: -4241.0
# No. Observations: 979 AIC: 8526.
# Df Residuals: 957 BIC: 8633.
# Df Model: 21
# Covariance Type: nonrobust
# =====================================================================================
# coef std err t P>|t| [0.025 0.975]
# -------------------------------------------------------------------------------------
# Hour 0.2962 0.116 2.564 0.011 0.069 0.523
# Day -0.1179 0.076 -1.553 0.121 -0.267 0.031
# Klout 0.5248 0.053 9.877 0.000 0.421 0.629
# Sentiment 1.1501 0.686 1.676 0.094 -0.197 2.497
# topic_1 52.3840 69.374 0.755 0.450 -83.759 188.527
# topic_2 45.7537 69.666 0.657 0.511 -90.963 182.470
# topic_3 54.2263 69.678 0.778 0.437 -82.512 190.965
# topic_4 56.1795 69.510 0.808 0.419 -80.230 192.589
# topic_5 47.1367 69.426 0.679 0.497 -89.109 183.382
# topic_6 41.2176 69.650 0.592 0.554 -95.467 177.902
# topic_7 45.2370 69.430 0.652 0.515 -91.016 181.490
# topic_8 55.2635 69.432 0.796 0.426 -80.993 191.520
# Gender_Male 0.4825 4.453 0.108 0.914 -8.256 9.221
# Gender_Unisex -2.8251 5.604 -0.504 0.614 -13.822 8.172
# Gender_Unknown 5.5380 4.389 1.262 0.207 -3.075 14.151
# Weekday_Monday 1.8285 2.336 0.783 0.434 -2.756 6.413
# Weekday_Saturday 3.2080 3.156 1.016 0.310 -2.986 9.402
# Weekday_Sunday 3.6341 3.513 1.034 0.301 -3.261 10.529
# Weekday_Thursday 4.3879 2.093 2.096 0.036 0.281 8.495
# Weekday_Tuesday 2.8010 2.142 1.308 0.191 -1.403 7.005
# Weekday_Wednesday 3.2736 2.179 1.502 0.133 -1.003 7.550
# const -76.9587 68.854 -1.118 0.264 -212.082 58.164
# ==============================================================================
# Omnibus: 958.822 Durbin-Watson: 2.022
# Prob(Omnibus): 0.000 Jarque-Bera (JB): 57980.284
# Skew: 4.429 Prob(JB): 0.00
# Kurtosis: 39.646 Cond. No. 2.37e+04
# ==============================================================================
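As you can see, the R² rises to 23.9 percent, and the adjusted R², which penalizes the model for the eight extra predictors, rises from 20.5 to 22.2 percent. An improvement of about 2.4 percentage points may not seem like much, but even a small gain in fit can be worthwhile in practice. Note that no single topic coefficient is statistically significant on its own (every p-value is above 0.4), so the gain comes from the topic scores taken together rather than from any one topic. Furthermore, just because the improvement was small in this context, that does not mean it will not be much larger in other contexts. In summary, text analytics is a valuable way to improve the predictive accuracy of our models.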
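One way to check whether the gain from the topic scores is more than noise is a partial F-test on the two nested models. The sketch below is only an illustration: it works from the rounded values printed in the two summaries (R-squared of 0.215 and 0.239, eight added topic columns, and 957 residual degrees of freedom in the larger model), and the helper name partial_f_statistic is hypothetical rather than part of statsmodels.

```python
# Partial F-test computed by hand from the two regression summaries above.
# partial_f_statistic is a hypothetical helper, not a statsmodels function.

def partial_f_statistic(r2_restricted, r2_full, num_added, df_resid_full):
    """F statistic for the joint null that all added coefficients are zero."""
    if num_added <= 0 or df_resid_full <= 0:
        raise ValueError("num_added and df_resid_full must be positive")
    # Extra variance explained per added predictor
    gain_per_predictor = (r2_full - r2_restricted) / num_added
    # Variance still unexplained per residual degree of freedom
    unexplained_per_df = (1.0 - r2_full) / df_resid_full
    return gain_per_predictor / unexplained_per_df

# Values read off the printed summaries (rounded to three decimals there)
r2_baseline = 0.215      # model without topic scores
r2_topics = 0.239        # model with topic scores
num_topics = 8           # topic_1 ... topic_8
df_resid_topics = 957    # "Df Residuals" of the larger model

f_stat = partial_f_statistic(r2_baseline, r2_topics, num_topics, df_resid_topics)
print(f"Partial F({num_topics}, {df_resid_topics}) = {f_stat:.2f}")  # roughly 3.77
```

Because the inputs are rounded, the statistic is approximate. If SciPy is available, scipy.stats.f.sf(f_stat, num_topics, df_resid_topics) converts it to a p-value. If both fitted models are kept under separate names instead of being overwritten in results, calling compare_f_test on the larger model's results with the baseline results as the argument should run the same test on the unrounded values.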