Two-Class Classification: Airline Satisfaction

In this case study, we will practice two-class classification modeling using a dataset collected from airline customers who rated their satisfaction with their experience. This type of dataset is useful for managerial decision making, but it may be less useful in a machine learning context since it is based on survey data that is difficult to collect before a customer has had a flight experience. However, that doesn't mean we can't benefit from using a classification model to understand what causes a customer to be satisfied. We will document each step of this process based on the Cross-Industry Standard Process for Data Mining (CRISP-DM).

Business Understanding or Problem Definition

The purpose of this analysis is to understand what causes airline customers to be satisfied with their experience. The final deliverable will be a set of analyses that go into a report and/or presentation. The results will inform airline decision makers on how they can spend money most effectively to improve future customer satisfaction.

Let's begin by importing the dataset that has been identified to help us address the issue defined above:

      # Don't forget to mount Google Drive if you haven't already:
      # from google.colab import drive
      # drive.mount('/content/drive')

      import pandas as pd

      df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/airline_satisfaction.csv')
      df = df.sample(n=10000) # Reducing the sample size during the development process since there are 129,880 rows
      print(df.shape)
      df.head()
      

Data Understanding

Next, let's explore the univariate properties and visualizations to see what data cleaning tasks need to be performed. To do this, let's write a dynamic loop that will print out all relevant univariate properties as well as a couple more loops to display some basic charts:

      import seaborn as sns
      import matplotlib.pyplot as plt

      # We will use this DataFrame to store the results of the univariate analysis
      df_results = pd.DataFrame(columns=['flag', 'type', 'missing', 'unique', 'min', 'q1', 'median',
                                          'q3', 'max', 'mode', 'mean', 'std', 'skew', 'kurt'])

      for col in df:  # Iterate through each column in the DataFrame
        # Calculate features that apply to all data types
        dtype = df[col].dtype
        missing = df[col].isna().sum()
        unique = df[col].nunique()
        mode = df[col].mode()[0]

        if pd.api.types.is_numeric_dtype(df[col]): # Depending on whether each column is numeric or not, we will calculate different features
          # Features for numeric dtypes only
          col_min = df[col].min()  # Renamed to avoid shadowing the built-in min()
          q1 = df[col].quantile(0.25)
          median = df[col].median()
          q3 = df[col].quantile(0.75)
          col_max = df[col].max()  # Renamed to avoid shadowing the built-in max()
          mean = df[col].mean()
          std = df[col].std()
          skew = df[col].skew()
          kurt = df[col].kurt()

          # Insert a record into df_results to display the univariate properties relevant to numeric features
          df_results.loc[col] = ['-', dtype, missing, unique, col_min, q1, median, q3, col_max, mode,
                                round(mean, 2), round(std, 2), round(skew, 2), round(kurt, 2)]
        else:
          # This line will count the number of group values that represent less than 5% of the data
          flag = df[col].value_counts()[(df[col].value_counts() / df.shape[0]) < 0.05].shape[0]
          # Insert a record into df_results to display the univariate properties relevant to categorical features
          df_results.loc[col] = [flag, dtype, missing, unique, '-', '-', '-', '-', '-',
                                mode, '-', '-', '-', '-']

      # Next, let's create a set of countplots for the categorical features and histograms for the numeric features
      # First, filter df_results into two sets of features: one for countplots and one for histograms
      countplots = df_results[df_results['type'] == 'object']  # Categorical (object) features get countplots
      histograms = df_results[(df_results['type'] == 'float64') | (df_results['unique'] > 10)] # Floats and high-cardinality features get histograms

      # Second, create a loop that dynamically plots the countplots and adjust the figure size based on the number of features
      f, ax = plt.subplots(1, countplots.shape[0], figsize=[countplots.shape[0] * 2, 2])
      for i, col in enumerate(countplots.index):
        g = sns.countplot(data=df, x=col, color='g', ax=ax[i]);
        g.set_yticklabels('')
        g.set_ylabel('')
        ax[i].tick_params(labelrotation=90, left=False)
        ax[i].xaxis.set_label_position('top')
        sns.despine(left=True, top=True, right=True)

      # Third, create a loop that dynamically plots the histograms and adjust the figure size based on the number of features
      f, ax = plt.subplots(1, histograms.shape[0], figsize=[histograms.shape[0] * 2, 2])
      for i, col in enumerate(histograms.index):
        g = sns.histplot(data=df, x=col, color='b', ax=ax[i], kde=True);
        g.set_yticklabels(labels=[])
        g.set_ylabel('')
        ax[i].tick_params(left=False)
        sns.despine(left=True, top=True, right=True)
      plt.show()

      # Finally, display the univariate properties in a DataFrame
      df_results
      

That was a lot of code to get through! However, I wanted you to see an example of automating the exploratory data analysis. What do we use this for? It helps us understand what data cleaning tasks need to be performed next. During the data understanding phase, we would typically also explore the bivariate relationships between each feature and the label, including both statistics and visualizations. However, since we will be modeling later (which will provide an even more accurate estimation of those relationships), I am going to skip that step to keep this example as short as possible.

Data Preparation

What do the univariate data exploration results above tell us that we can use for the data preparation phase? With the categorical features, we are looking for those that have group values representing less than five percent of the dataset. We need to either drop those records or bin them together because we cannot produce a valid analysis of a group value that is not well-represented in the dataset; the results will not be reliable. Thankfully, our results indicate that none of the categorical features have group values with less than five percent of the dataset.
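
For reference, if some group values had fallen below that threshold, binning them together might look like the following sketch (the 'Other' label is our own choice; nothing would change for this dataset):

      # A minimal sketch of binning rare group values into an 'Other' category (not needed for this dataset)
      for col in df.select_dtypes(include='object').columns:
        freqs = df[col].value_counts() / df.shape[0]    # Proportion of each group value
        rare = freqs[freqs < 0.05].index                # Group values below the 5% threshold
        df[col] = df[col].replace(list(rare), 'Other')  # Bin them into a single group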

For numeric features, we typically check for skewness if we plan to model the data using a linear algorithm with linear assumptions. However, since we plan to use a decision tree algorithm, that is not relevant, so we don't have to worry about the two features (Departure Delay in Minutes and Arrival Delay in Minutes) that are highly skewed.
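
For reference, if we were planning to use a linear algorithm, a log transform is a common remedy for right skew. Here is a minimal sketch (not applied here) that compares the skew of each delay feature before and after the transform:

      import numpy as np

      # log1p, i.e. log(1 + x), compresses the right tail and handles zero-minute delays
      for col in ['Departure Delay in Minutes', 'Arrival Delay in Minutes']:
        print(col, round(df[col].skew(), 2), '->', round(np.log1p(df[col]).skew(), 2))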

For all features, we need to eliminate missing data because we cannot fit a model with missing values. Arrival Delay in Minutes is the only feature with missing data; it is missing in 32 of the 10,000 sampled records (as the before-and-after shapes below confirm). Typically, if the data were missing completely at random, I would impute those few missing values since no other columns are missing. However, since you may not have covered that topic yet, let's simply drop those records:

      # Drop rows with missing data and view the record count before and after to ensure it worked as expected

      print(df.shape)
      df.dropna(inplace=True)
      df.shape

      # Output:
      # (10000, 23)
      # (9968, 23)
      
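For reference, imputing the missing values instead of dropping them might look like the following sketch using sklearn's SimpleImputer (we did not run this here):

      # A minimal sketch of median imputation, an alternative to dropping the rows (not run here)
      from sklearn.impute import SimpleImputer

      imputer = SimpleImputer(strategy='median')
      df[['Arrival Delay in Minutes']] = imputer.fit_transform(df[['Arrival Delay in Minutes']])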

To be clear, there are other steps involved in data cleaning/preparation like identifying and addressing outliers. However, to keep this example simple, I'm going to skip that step this time.
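
For reference, one common approach is the 1.5 × IQR rule; here is a minimal sketch (not applied in this case study) of flagging potential outliers in a single feature:

      # A minimal sketch of the 1.5 * IQR rule applied to one feature (not applied here)
      q1, q3 = df['Flight Distance'].quantile([0.25, 0.75])
      iqr = q3 - q1
      outliers = df[(df['Flight Distance'] < q1 - 1.5 * iqr) | (df['Flight Distance'] > q3 + 1.5 * iqr)]
      print(f'{outliers.shape[0]} potential outliers in Flight Distance')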

Modeling

Now, let's proceed with the modeling phase by:

  • Dividing the dataset into features and label

  • Dummy-coding the features only

  • Splitting the dataset into training set and test set

  • Creating and training a classifier model

      # Divide the dataset into features and label
      y = df.satisfaction
      X = df.drop(columns=['satisfaction'])

      # Dummy-code the features only
      X = pd.get_dummies(X, drop_first=True)
      
      # Split the dataset into training set and test set
      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
      
      # Create and train the classifier model
      from sklearn.tree import ExtraTreeClassifier # A single, extremely randomized decision tree
      clf = ExtraTreeClassifier(random_state=12345) # Create the classifier
      clf.fit(X_train, y_train)                     # Train the classifier
      

Evaluation

Now that we have a trained classifier model, let's evaluate it by:

  • Predicting the labels for the test dataset

  • Calculating the fit metrics

  • Generating the confusion matrix

  • Visualizing the trained decision tree

      # Compare the actual satisfaction to the predicted satisfaction
      y_pred = clf.predict(X_test)

      df_results = pd.DataFrame({'Actual Satisfaction':y_test, 'Predicted Satisfaction':y_pred})
      df_results.head(10)
      

These predictions look decent but not perfect; we have a few wrong predictions. However, that is just the first 10 records. Let's calculate all of the classification model metrics so that we can understand exactly how well the overall model performs.

      # Examine the model fit metrics
      from sklearn import metrics

      # Binarize the string labels so that 'satisfied' maps to 1 (the positive class for precision/recall)
      y_test_dummies = pd.get_dummies(y_test, drop_first=True)
      y_pred_dummies = pd.get_dummies(y_pred, drop_first=True)
      print(f"Accuracy:\t{metrics.accuracy_score(y_test, y_pred)}")
      print(f"Precision:\t{metrics.precision_score(y_test_dummies, y_pred_dummies)}")
      print(f"Recall:\t\t{metrics.recall_score(y_test_dummies, y_pred_dummies)}")
      print(f"F1:\t\t{metrics.f1_score(y_test_dummies, y_pred_dummies)}")

      # Output:
      # Accuracy:	0.8518890003343363
      # Precision:	0.8610591900311526
      # Recall:		0.8626716604244694
      # F1:		0.8618646710321172
      
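As an aside, sklearn can bundle all of these per-class metrics into one call with metrics.classification_report, which works directly on the string labels:

      # Per-class precision, recall, and F1 in a single call
      print(metrics.classification_report(y_test, y_pred))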

From these results, we expect our predictions to be accurate about 85 percent of the time. Precision and recall are very similar, which tells us that our rates of false positives and false negatives are very close; we are not erring more on one type than the other. However, let's complement these results by generating the confusion matrix:

      # Generate the confusion matrix
      from sklearn import metrics
      from matplotlib import pyplot as plt

      labels = y_test.sort_values().unique()
      cm = metrics.confusion_matrix(y_test, y_pred)
      cm_display = metrics.ConfusionMatrixDisplay(cm, display_labels=labels)
      cm_display.plot(values_format='d')
      plt.show()

      print(y_test.value_counts())

      # Output:
      # satisfied       1602
      # dissatisfied    1389
      # Name: satisfaction, dtype: int64
      

Based on the confusion matrix, it is easy to see that it is slightly easier to predict dissatisfied customers (false-dissatisfied: 220) than satisfied customers (false-satisfied: 223), but that difference may not be particularly meaningful. Finally, let's print out the visualized decision tree to see which features were most important in determining satisfaction.

      # Generate the decision tree visualization
      import matplotlib.pyplot as plt
      from sklearn import tree

      fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(20,7), dpi=600)

      tree.plot_tree(clf, feature_names=X.columns, class_names=labels, filled=True);
      plt.savefig('airline_satisfaction_tree_v2.png', dpi=600)
      plt.show()
      

Because this tree is so large, I set the dots per inch (DPI) to 600. You may need to download it and view it in your computer's default image viewer (and zoom in) to see the details. If you do, you'll see that Online Boarding <= 1.783 is the first node. That means that those who had a poor experience with the online boarding process (essentially a rating below 2) were the most likely to be dissatisfied. If you continue to follow the branches down the tree, you'll see which features mattered most when that first node split was True versus False. This is useful information for management decision makers.
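
As a side note, if the full tree is too unwieldy to read, sklearn's tree.plot_tree accepts a max_depth parameter, so we can render just the top few splits (usually the most informative ones):

      # Plot only the top three levels of the tree for readability (deeper branches are truncated)
      fig, axes = plt.subplots(figsize=(20, 7))
      tree.plot_tree(clf, feature_names=X.columns, class_names=labels, filled=True, max_depth=3)
      plt.show()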

We can take this case study further in a few ways. I'd like to focus on two additional tasks in particular: 1) trying a different algorithm to see if we can improve the accuracy, and 2) generating a logistic regression classification model to get interpretable coefficients that explain which features matter most better than the decision tree visualization can.

Repeat Modeling and Evaluation: Improve Model Accuracy with a Different Algorithm

To find the best algorithm for this dataset, I typically go through two general rounds. First, I try every applicable algorithm using its default parameters; just as we did in a prior section, this can involve dozens of algorithms. I then take note of which family of algorithms (e.g., ensemble, linear, KNN, neural network) tends to sit toward the top of the list. Second, I choose the best algorithms from the best "family" and modify their parameters to "fine-tune" the accuracy score as high as possible. This can involve many rounds of model testing.

However, to keep this example simple, we will save that detailed process for later. For now, let's simply try a few other algorithms to see if we can get a better accuracy score.

      import sklearn.linear_model as lm
      import sklearn.ensemble as se
      import sklearn.tree as tree
      from sklearn import svm
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.neural_network import MLPClassifier
      from sklearn.model_selection import train_test_split
      from xgboost import XGBClassifier

      # Create a DataFrame to store and output the results
      df_clf = pd.DataFrame(columns=['Family', 'Accuracy'])

      # Insert a row into the DataFrame with the Accuracy score of the trained model from each algorithm to sort and compare
      df_clf.loc['Logistic'] = ['Linear', lm.LogisticRegression(max_iter=10000).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Ridge'] = ['Linear', lm.RidgeClassifier().fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Stochastic Gradient Descent'] = ['Linear', lm.SGDClassifier(max_iter=1000, tol=1e-3).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Passive Aggressive'] = ['Linear', lm.PassiveAggressiveClassifier(max_iter=1000, random_state=1, tol=1e-3).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Perceptron'] = ['Linear', lm.Perceptron(fit_intercept=False, max_iter=10, tol=None, shuffle=False).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Decision Tree'] = ['Decision Tree', tree.DecisionTreeClassifier(random_state=1).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Extra Trees'] = ['Decision Tree', tree.ExtraTreeClassifier(random_state=1).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['KNN'] = ['KNN', KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Support Vector Machine'] = ['SVM', svm.SVC(decision_function_shape='ovo').fit(X_train, y_train).score(X_test, y_test)] # The decision_function_shape parameter only matters for multiclass problems; it is harmless here
      df_clf.loc['Bagging'] = ['Ensemble', se.BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['AdaBoost'] = ['Ensemble', se.AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Extra Trees ensemble'] = ['Ensemble', se.ExtraTreesClassifier(n_estimators=100, random_state=1).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Random Forest'] = ['Ensemble', se.RandomForestClassifier(n_estimators=10, random_state=1).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Histogram Gradient Boosting'] = ['Ensemble', se.HistGradientBoostingClassifier(max_iter=100, random_state=1).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Gradient Boosting'] = ['Ensemble', se.GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=1).fit(X_train, y_train).score(X_test, y_test)]
      df_clf.loc['Neural Network'] = ['Neural Network', MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1).fit(X_train, y_train).score(X_test, y_test)]

      # XGBoost needs the label encoded as integers [0, 1, 2, ..., n - 1]; it can't handle string categories
      from sklearn.preprocessing import LabelEncoder
      y_encoded = LabelEncoder().fit_transform(y) # Converts y from ['satisfied', 'dissatisfied', ...] to [1, 0, ...], much like dummy coding
      # Re-split with the encoded y, using new variable names so the original string-labeled split stays intact
      X_train_enc, X_test_enc, y_train_enc, y_test_enc = train_test_split(X, y_encoded, test_size=0.3, random_state=1)
      df_clf.loc['XGBoost'] = ['Ensemble', XGBClassifier(random_state=1).fit(X_train_enc, y_train_enc).score(X_test_enc, y_test_enc)]

      # Now print out the DataFrame sorted by best to worst
      df_clf.sort_values(by=['Accuracy'], ascending=False)
      

Based on these results, it appears that the Histogram Gradient Boosting algorithm provides the best accuracy for this dataset. If our goal is to generate a machine learning model to predict satisfaction, then I would use that algorithm moving forward. Let's try an out-of-sample prediction based on this model. This prediction will be for a hypothetical customer with the following data:

  • Age: 29
  • Flight Distance: 4211
  • Seat comfort rating: 3
  • Departure/Arrival time convenience rating: 5
  • Food and drink rating: 5
  • Gate location rating: 5
  • Inflight wifi service rating: 3
  • Inflight entertainment rating: 3
  • Online support rating: 3
  • Ease of Online booking rating: 4
  • On-board service rating: 2
  • Leg room service rating: 3
  • Baggage handling rating: 3
  • Checkin service rating: 3
  • Cleanliness rating: 1
  • Online boarding rating: 3
  • Departure Delay in Minutes: 72
  • Arrival Delay in Minutes: 31
  • Gender_Male: 0
  • Customer Type_disloyal Customer: 1
  • Type of Travel_Personal Travel: 0
  • Class_Eco: 0
  • Class_Eco Plus: 0
      # Refit the winning algorithm so that clf now refers to the Histogram Gradient Boosting model
      clf = se.HistGradientBoostingClassifier(max_iter=100, random_state=1).fit(X_train, y_train)
      clf.predict([[29, 4211, 3, 5, 5, 5, 3, 3, 3, 4, 2, 3, 3, 3, 1, 3, 72, 31, 0, 1, 0, 0, 0]])[0]
      
      # Output:
      # /usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning: X does not have valid feature names, but HistGradientBoostingClassifier was fitted with feature names
      # warnings.warn(
      # 'dissatisfied'
      

You can ignore that warning. Sklearn is simply reminding us that, since we didn't pass the out-of-sample data in a DataFrame with feature names, we had better make sure we entered the values in the correct order so that each one lines up with the appropriate column.
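
If we want to avoid the warning altogether, we can wrap the same values in a DataFrame that carries the training feature names (X_new is just an illustrative name):

      # Passing a DataFrame with the training column names suppresses the warning
      X_new = pd.DataFrame([[29, 4211, 3, 5, 5, 5, 3, 3, 3, 4, 2, 3, 3, 3, 1, 3, 72, 31, 0, 1, 0, 0, 0]],
                           columns=X.columns)
      clf.predict(X_new)[0]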

Our model predicts this customer will be 'dissatisfied'. Normally, we would export and save this model as a .sav or .pkl file at this point, but we haven't learned how to do that yet, so we will skip it for now.
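
For reference, exporting the model with joblib (a common choice for sklearn models) might look like the following sketch; the file name is just an example:

      # A minimal sketch of saving and reloading the trained model (not run here)
      import joblib

      joblib.dump(clf, 'airline_satisfaction_model.pkl')          # Save the trained model to disk
      clf_loaded = joblib.load('airline_satisfaction_model.pkl')  # Reload it later for predictions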

Extend Modeling and Evaluation: Generate a Logistic Regression Model For Coefficients

In the business understanding, or problem identification, phase above, we indicated that our goal was to understand which features have the greatest effect on customer satisfaction. The histogram gradient boosting algorithm does not give us a decision tree or a set of coefficients to help with that understanding. There are two techniques we can use to understand feature importance. First, we could generate what are referred to as "feature importance" scores; however, at this point in the book, you may not have learned about those yet.
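
For reference, one such technique is permutation importance, which sklearn provides in its inspection module; here is a minimal sketch (we will cover this in more depth later):

      # A minimal sketch of permutation importance: shuffle one feature at a time and measure the accuracy drop
      from sklearn.inspection import permutation_importance

      result = permutation_importance(clf, X_test, y_test, n_repeats=5, random_state=1)
      pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False).head(10)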

A second option is to use a different algorithm called logistic regression to generate standardized coefficients, just like we did in an earlier chapter with MLR/OLS. Logistic regression can be considered a "linear regression" algorithm for classification problems (predicting categories) as opposed to regression problems (predicting numbers). The code below demonstrates how to use logistic regression to get comparable coefficients:

      from sklearn.preprocessing import MinMaxScaler
      from sklearn.linear_model import LogisticRegression
      import matplotlib.pyplot as plt
        
      # Scale each feature to the same 0-1 range (min-max normalization) so the coefficients are comparable
      X_minmax = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
        
      # Use the scaled X, but the same y
      clf = LogisticRegression(max_iter=10000).fit(X_minmax, y)
        
      # Store the coefficients in a dataframe
      df_coef = pd.DataFrame({'Coefficients':clf.coef_[0]}, index=clf.feature_names_in_)
      df_coef.sort_values(by=['Coefficients'], ascending=False, inplace=True)
        
      # Visualize the coefficients to make them easier to interpret
      plt.bar(list(df_coef.index), list(df_coef.Coefficients))
      plt.xticks(rotation=90)
      plt.show()
      

As you can see in this visualization, in-flight entertainment has the largest positive effect on satisfaction. Arrival delays have the next-largest effect, but they reduce satisfaction rather than increase it (as expected). It may help to visualize the coefficients intermixed so that negative effects can be compared with positive ones. We can simply color them differently so we know which features have negative versus positive effects:

      # Store the coefficients in a dataframe
      df_coef = pd.DataFrame({'Coefficients':clf.coef_[0]}, index=clf.feature_names_in_)
      df_coef['sign'] = 'positive'
      # Flag negative coefficients and flip them to absolute values so every bar shares one scale
      for coef in df_coef.itertuples():
        if coef[1] < 0:
          df_coef.at[coef[0], 'sign'] = 'negative'
          df_coef.at[coef[0], 'Coefficients'] = coef[1] * -1
      df_coef.sort_values(by=['Coefficients'], inplace=True)
        
      colors = {'positive': 'mediumseagreen', 'negative': 'lightcoral'}
      df_coef['Coefficients'].plot(kind='barh', color=[colors[i] for i in df_coef['sign']])
        
      labels = df_coef['sign'].unique()
      handles = [plt.Rectangle((0,0),1,1, color=colors[l]) for l in labels]
      plt.legend(handles, labels, title="Relationship")
      plt.show()
      

This analysis makes it very easy for airline decision makers to understand where they will get the biggest increases in customer satisfaction for their efforts: 1) improve in-flight entertainment, 2) minimize delays, 3) improve seat comfort, and so on. What can we do with the feature Customer Type_disloyal Customer? That field simply indicates whether this is a first-time or repeat customer. My recommendation would be to filter our data based on Customer Type and generate a separate model for each type so that our recommendations can be further customized. However, we will stop this case study here for brevity.
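
For reference, a per-segment modeling loop might look like the following sketch, assuming the un-dummied column is named 'Customer Type':

      # A minimal sketch of fitting one model per customer type (assumes a 'Customer Type' column in df)
      for ctype, df_seg in df.groupby('Customer Type'):
        y_seg = df_seg.satisfaction
        X_seg = pd.get_dummies(df_seg.drop(columns=['satisfaction', 'Customer Type']), drop_first=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X_seg, y_seg, test_size=0.3, random_state=1)
        model = se.HistGradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
        print(ctype, round(model.score(X_te, y_te), 3))  # Accuracy for this customer segment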