22.4 Multi-Class Classification: Network Attacks
In this case study, we will practice multi-class classification modeling using a dataset that was collected from a company's internal network. Each row represents a packet of data that was transmitted over the network. Capturing traffic this way is a common practice in information technology (IT) security to detect potential hackers. We can use this dataset both for managerial decision making (e.g. to identify types of packets to block in the firewall) and in machine learning environments to detect and fight hackers in real time. Let's walk through the CRISP-DM process focusing on managerial decision making; then we will use this dataset again later to create an ML pipeline version.
Business Understanding or Problem Definition
The purpose of this analysis is to understand which features of a data packet traveling over our company's internal network can predict whether it is normal traffic or an attack. In addition, let's see if we can predict the type of attack. In this dataset, each packet has been labelled as either "normal" or one of more than a dozen different types of attacks. Some of those attack types may be very rare, so we may have to drop them or bin them into "other."
Let's begin by importing the dataset that has been identified to help us address the issue defined above:
# Don't forget to mount Google Drive if you haven't already:
# from google.colab import drive
# drive.mount('/content/drive')
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/network_traffic.csv')
print(df.shape)
df.head()
# Output
# (125973, 43)
Notice that this is a fairly large dataset with almost 126k rows and 43 columns. We will likely have to sample this dataset while developing our process so that we don't have to wait so long for the processing to finish.
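For example, while we are still developing the code, we could work from a smaller random sample and only switch back to the full df for the final run (the df_dev name and the 10,000-row size below are just illustrative choices, not part of the case study):
# Optional: work from a smaller random sample while developing (illustrative only)
df_dev = df.sample(n=10000, random_state=1)  # hypothetical development copy of the data
print(df_dev.shape)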
Data Understanding
As with our prior examples, let's explore only the univariate properties and visualizations so that we can see what data cleaning needs to take place. We will skip the bivariate analyses again to keep this case study relatively short. However, this time, let's create a function to calculate all of our univariate analyses so that we can reuse this function later:
def univariate(df, sample=500):
    import seaborn as sns
    import matplotlib.pyplot as plt
    import math
    df_results = pd.DataFrame(columns=['bin_groups', 'type', 'missing', 'unique', 'min',
                                       'median', 'max', 'mode', 'mean', 'std', 'skew'])
    for col in df:
        # Features that apply to all dtypes
        dtype = df[col].dtype
        missing = df[col].isna().sum()
        unique = df[col].nunique()
        mode = df[col].mode()[0]
        if pd.api.types.is_numeric_dtype(df[col]):
            # Features for numeric dtypes only
            min = df[col].min()
            max = df[col].max()
            mean = df[col].mean()
            median = df[col].median()
            std = df[col].std()
            skew = df[col].skew()
            df_results.loc[col] = ['-', dtype, missing, unique, min, median, max, mode,
                                   round(mean, 2), round(std, 2), round(skew, 2)]
        else:
            # Features for object dtypes only
            flag = df[col].value_counts()[(df[col].value_counts() / df.shape[0]) < 0.05].shape[0]
            df_results.loc[col] = [flag, dtype, missing, unique, '-', '-', '-', mode, '-', '-', '-']

    # Make a sub-DataFrame of features that are objects or have only two values; they will need countplots
    countplots = df_results[(df_results['type'] == 'object') | (df_results['unique'] == 2)]
    # Make a sub-DataFrame of features that are floats or ints with many values, which will need histograms
    histograms = df_results[(df_results['type'] == 'float64') | ((df_results['unique'] > 10) & (df_results['type'] == 'int64'))]
    histograms = histograms[histograms['unique'] > 2]  # Remove those that are binary

    # Create a set of countplots for the categorical features
    f, ax = plt.subplots(1, countplots.shape[0], figsize=[countplots.shape[0] * 1.5, 1.5])
    for i, col in enumerate(countplots.index):
        g = sns.countplot(data=df, x=col, color='g', ax=ax[i])
        g.set_yticklabels('')
        g.set_ylabel('')
        ax[i].tick_params(labelrotation=90, left=False)
        ax[i].xaxis.set_label_position('top')
        sns.despine(left=True, top=True, right=True)
    plt.subplots_adjust(hspace=2, wspace=.5)
    plt.show()

    # Create a set of histograms for the numeric features
    f, ax = plt.subplots(1, histograms.shape[0], figsize=[histograms.shape[0] * 1.5, 1.5])
    for i, col in enumerate(histograms.index):
        g = sns.histplot(data=df.sample(n=sample, random_state=1), x=col, color='b', ax=ax[i], kde=True)
        g.set_yticklabels(labels=[])
        g.set_ylabel('')
        ax[i].tick_params(left=False)
        sns.despine(left=True, top=True, right=True)
    plt.subplots_adjust(hspace=2, wspace=.5)
    plt.show()
    return df_results
This function is similar to the code we used previously but includes a few modifications. I removed the q1, q3, and kurtosis scores for brevity. I also modified the layout of the visualizations slightly. In addition, I included any feature with only two values in the list of countplots to catch booleans and 0/1 features, and I filtered those features out of the list of histograms. Finally, and most importantly, I added a function parameter called 'sample', which is used in the histogram-plotting loop (via df.sample()) to reduce the size of the dataset when plotting, since plotting every row can take a long time.
Now, let's call the function using the dataset we pulled in:
df_results = univariate(df)
df_results
What do we learn from this? First, the countplots indicate that we have several features with rare values. The features land, root_shell, is_host_login, and is_guest_login each have very few "1" values. We should drop those for now. However, it's possible that those are very useful features. If we don't get a good accuracy score, then we may want to resample the dataset so that those values are better represented and we can get a more accurate understanding of their importance. But, for now, we will remove them.
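If you want to see just how rare those "1" values are before dropping the features, a quick optional check is to print the relative frequency of each value:
# Optional check: proportion of each value in the rare binary features
for col in ['land', 'root_shell', 'is_host_login', 'is_guest_login']:
    print(df[col].value_counts(normalize=True).round(4), '\n')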
Second, our label, 'attack', has many unique values and not all of them are well-represented (i.e. make up at least 5 percent of the dataset). We are going to have to bin many of them into an 'other' category. However, since we have so much data, perhaps we can bend the 5 percent rule a bit, down to 2 percent, so that we can at least represent a handful of attack types. The features service and flag will also need to be binned.
Third, because we have so many histograms, you may need to click on the image for it to expand and see them more clearly. As you can see, we have a lot of features that are highly skewed. Others have a bimodal distribution where the frequency is high on the extremes and low in the middle (e.g. serror_rate, srv_serror_rate, and many others). For this reason, it is essential that we use a decision tree-based algorithm that does not depend on the assumption of normal distributions.
Finally, it appears that there is no missing data in this dataset. We just need to drop a few features.
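If you would rather confirm that numerically than read it off the 'missing' column of df_results, one line will do it:
print(df.isna().sum().sum())  # total number of missing cells across the entire DataFrame; should be 0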
Data Preparation
Let's proceed with cleaning this data. Let's begin by dropping the features whose countplots indicated that a single value makes up almost the entire dataset: is_host_login, is_guest_login, land, and root_shell.
df.drop(columns=['is_host_login', 'is_guest_login', 'land', 'root_shell'], inplace=True)
Next, let's iterate through the other categorical features that have values that do not make up at least 2 percent of the dataset. Again, we are bending the "5% rule" because we have such a large dataset. In the routine below, I'm going to print out a "before-and-after" DataFrame of each feature that gets binned so that we can review how the binning turns out.
for feat in df.columns:
    # For each categorical feature in the dataset
    if not pd.api.types.is_numeric_dtype(df[feat]):
        # Create a list of the categorical groups that comprise less than 2% of the total dataset
        other_list = df[feat].value_counts()[df[feat].value_counts() / df.shape[0] < 0.02].index
        if len(other_list) > 0:  # If there were some group values < 2%
            # Save the original distribution so we can compare it after binning
            df_result = pd.DataFrame(df[feat].value_counts())
            # Set the new group to "other" for those in the other_list
            df.loc[df[feat].isin(other_list), feat] = "other"
            # Display a before-and-after DataFrame for each categorical feature that gets binned
            df_result = df_result.merge(df[feat].value_counts(), left_index=True, right_index=True, how='outer')
            display(df_result.sort_values(by=[f'{feat}_y', f'{feat}_x'], ascending=False))
Again, the purpose of those last two lines of code is only to review what binning was performed in the process. First, you can see that the feature 'service' originally had 70 unique values, but those that represented less than 2 percent of the rows were binned into 'other'. Because that DataFrame is so large, the view is cut short and we can't see exactly how many group values remained. But that's okay because we have so many records in this dataset. Second, the feature 'flag' was binned from 11 values down to four, including 'other.' Finally, and most importantly, the label 'attack' was binned from 23 unique values down to 7, including the new 'other' group. Those 17 binned attack types could be very useful to predict individually, but I would recommend getting even more data before we try that with a classification model. To be clear, there are other types of models (e.g. clustering) that may do a better job of handling rare label values like those we binned. But since we haven't covered that topic yet, we will stick with this binning method.
Before we move on to the modeling phase, let's revisit our data understanding phase by re-running the univariate function from above, using more samples now that we have performed some binning.
df_results = univariate(df, sample=5000)
df_results.head(df.shape[1])
Notice how much better this looks now. Service and attack no longer have so many values that we can't read any of them on the countplots. We did not make any changes to the numeric features displayed in histograms, so I only displayed the countplots above.
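If you want the exact percentages behind those countplots, you can also print the relative frequencies of the binned features; every remaining group should now make up roughly 2 percent or more of the rows:
# Optional: relative frequency of each remaining group in the binned categorical features
for col in ['service', 'flag', 'attack']:
    print(df[col].value_counts(normalize=True).round(3), '\n')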
Modeling
Next, let's create our multi-class model. However, rather than use a basic decision tree, let's try a popular algorithm from a package called XGBoost. You may remember this from a prior chapter/section when we tried many different classification models all at once. XGBoost follows the scikit-learn API conventions and provides a fast implementation of a gradient-boosted decision tree algorithm, an ensemble method that, like the random forest, is designed to reduce overfitting.
Let's begin by setting up the y and X dataset and then dummy-code the X.
y = df.attack                           # The label: the (binned) attack type
X = df.drop(columns=['attack'])         # The features: everything except the label
X = pd.get_dummies(X, drop_first=True)  # Dummy-code the categorical features
You may remember that xgboost requires that the label groups (normal, neptune, other, ipsweep, portsweep, satan, and smurf) be encoded into a single column with values 0 through n-1 to represent the different groups in the label. Note that this does not mean that there is an implied order to the group values. This is simply how xgboost is designed and part of the reason that the algorithm is so fast. We will insert the actual label group values back into our evaluation when we are done.
To accomplish this, we can use the LabelEncoder object from sklearn.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() # Create the LabelEncoder() object
le.fit(y) # Fit the object with the y data
y_encoded = le.transform(y) # Make a transformed version of the y to use in the model
# Let's see exactly what LabelEncoder() is doing by comparing its results to the original values
pd.DataFrame({'Original y':y.sort_values().unique(), 'LabelEncoded y':range(y.unique().shape[0])}).head(15)
Again, this LabelEncoder() process is not required for most sklearn classification algorithms, but xgboost requires it. Next, let's create our train and test datasets making sure to use the encoded version of y: y_encoded.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=1) # 70% training and 30% test
Now, let's create and train the model.
from xgboost import XGBClassifier
xgc = XGBClassifier()      # Create the XGBoost classifier object
xgc.fit(X_train, y_train)  # Train the XGBoost classifier
That blue box that displays after training the model simply shows all of the parameters that were used with the XGBClassifier model. You may remember this box from when we trained other sklearn models that have parameter options. Usually I omit those outputs from the book, but I wanted to show this one to draw your attention to the large number of options we have to customize this algorithm. We don't have time for a full explanation of these options now, but we will learn more about them in a later chapter. For now, we will simply use all of the defaults.
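Just to illustrate what setting a few of those options looks like, here is a hedged sketch with a handful of commonly adjusted parameters (the specific values are arbitrary examples, not tuned recommendations for this dataset):
# Illustrative only: a few commonly tuned XGBClassifier parameters (values are arbitrary examples)
xgc_tuned = XGBClassifier(n_estimators=200,   # number of boosted trees (the default is 100)
                          max_depth=4,        # maximum depth of each individual tree
                          learning_rate=0.1,  # shrinkage applied to each tree's contribution
                          subsample=0.8,      # fraction of rows sampled for each tree
                          random_state=1)
# xgc_tuned.fit(X_train, y_train)             # would be trained the same way as the default model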
Evaluation
Now let's evaluate our model beginning by comparing a few predictions to actual values. Note that this step isn't mandatory, but I want to remind you how to generate and view predictions.
y_pred = xgc.predict(X_test)
df_results = pd.DataFrame({'Actual Attack':y_test, 'Predicted Attack':y_pred})
df_results.head(10)
These predictions look pretty good so far. But since we had to encode the y label, all of our predictions have come back in the same numeric form. No problem; we can reverse the encoding like this:
y_pred_reversed = le.inverse_transform(y_pred)
y_test_reversed = le.inverse_transform(y_test)
df_results = pd.DataFrame({'Actual Attack':y_test_reversed, 'Predicted Attack':y_pred_reversed})
df_results.head(10)
That looks better. Let's proceed with examining the overall model fit metrics. Remember that the xgc model was trained on the encoded labels, so we have to use X_test and the encoded y_test with the .score() method. But we can use the reversed versions of y_test and y_pred with the classification_report() function.
from sklearn.metrics import classification_report
print(f'Accuracy: {xgc.score(X_test, y_test)}')
print(classification_report(y_test_reversed, y_pred_reversed))
# Output:
# Accuracy: 0.999126799322608
#               precision    recall  f1-score   support
#
#      ipsweep       1.00      1.00      1.00      1114
#      neptune       1.00      1.00      1.00     12283
#       normal       1.00      1.00      1.00     20266
#        other       0.99      0.99      0.99      1368
#    portsweep       1.00      1.00      1.00       821
#        satan       1.00      0.99      1.00      1133
#        smurf       1.00      1.00      1.00       807
#
#     accuracy                           1.00     37792
#    macro avg       1.00      1.00      1.00     37792
# weighted avg       1.00      1.00      1.00     37792
This is a very impressive set of evaluation metrics. But to get a true idea of just how good the results are, we need to keep in mind the proportion of the most common label outcome. In this case, most of the traffic is labeled as 'normal'; 53.625 percent (20266 / 37792) to be exact. Therefore, for this model to be useful, it needs an accuracy score greater than 53.625 percent, which makes our 99.9 percent accuracy pretty awesome. Based on these results, my first inclination is to lower the binning threshold (2%) that we used during the data preparation phase and see how much our overall accuracy decreases. If we can account for a few more types of attacks without lowering our accuracy too much, then that could give us some added defense against more forms of attack.
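If you wanted to run that experiment, one approach is to wrap the binning loop from the data preparation phase in a small helper function that takes the threshold as a parameter. This is only a sketch; the bin_rare_categories name and the 1 percent value are hypothetical choices, and you would need to start from a fresh copy of the raw data since df has already been binned at 2 percent:
# Sketch only: parameterize the binning threshold so we can experiment with smaller values
def bin_rare_categories(df, threshold=0.01):  # hypothetical helper; 1% is an arbitrary example
    df = df.copy()
    for feat in df.columns:
        if not pd.api.types.is_numeric_dtype(df[feat]):
            rare = df[feat].value_counts()[df[feat].value_counts() / df.shape[0] < threshold].index
            if len(rare) > 0:
                df.loc[df[feat].isin(rare), feat] = 'other'
    return df
After re-binning, you would repeat the dummy-coding, label encoding, train/test split, and model fitting steps above and compare the two accuracy scores.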
However, we will not do that now to keep this example short. Instead, let's proceed with the confusion matrix to see exactly where our inaccurate predictions are coming from. The only slight change we need to make here from our prior examples involves the y_test_reversed object. Because we encoded y before splitting and then used LabelEncoder's inverse_transform(), y_test_reversed is a NumPy array rather than a Pandas Series. Therefore, we either need to 1) sort the NumPy array and extract the unique values using NumPy methods, or 2) wrap y_test_reversed in a Pandas Series before calling sort_values().unique(). I'll use the second option below and show the NumPy alternative right after the code.
from sklearn import metrics
from matplotlib import pyplot as plt
labels = pd.Series(y_test_reversed).sort_values().unique()
cm = metrics.confusion_matrix(y_test, y_pred)
cm_display = metrics.ConfusionMatrixDisplay(cm, display_labels=labels)
cm_display.plot(values_format='d')
plt.xticks(rotation=45)
plt.show()
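As an aside, the first option mentioned above is just as short, because NumPy's unique() returns the values already sorted:
# Alternative (option 1): get the sorted unique labels with NumPy instead of wrapping in a Series
import numpy as np
labels = np.unique(y_test_reversed)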
Overall, this prediction is really good. Most of our mistakes are happening when either the prediction or actual value is 'other.' That is to be expected since there are over a dozen different attacks in that group. We do get a few predictions of 'normal' when the actual event is a 'portsweep' (2), 'satan' (3), or 'neptune' (1) attack. Those are the hacking attempts that get through and should concern us the most. Information security professionals should examine those 6 records in detail to see why they were predicted to be normal traffic. There may be other features about those attacks that are not captured in our dataset that would indicate they are attacks. Perhaps some of the features we dropped would explain those false negatives.
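If the security team wants to pull those specific packets for review, here is one hedged way to do it (a sketch that assumes we still have the reversed label arrays from above): filter X_test to the rows where an actual attack was predicted as 'normal'.
# Sketch: view the test rows where an actual attack was predicted as normal traffic
missed = (y_pred_reversed == 'normal') & (y_test_reversed != 'normal')
display(X_test[missed])         # the feature values for those packets
print(y_test_reversed[missed])  # which attack types slipped through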
Finally, we need to understand which features contributed most to these predictions. In prior cases, we learned how to print out a decision tree and how to visualize feature coefficients from a logistic regression model. But there are other options as well. First, XGBoost provides a plot_tree() function (sklearn has a similar one for its own tree models) that is quite useful. By default, the XGBClassifier object creates 100 distinct decision trees (based on the n_estimators parameter default) whose predictions are combined, "averaged" together so to speak, into a final prediction. We can't visualize that combined result as a single tree, but we can view any of the 100 trees by using the num_trees parameter. The code below visualizes the first tree and rotates the layout to read from left to right using the parameter rankdir='LR'. You may need to examine the saved version in order to read the text.
from xgboost import plot_tree
import matplotlib.pyplot as plt
plot_tree(xgc, num_trees=0, rankdir='LR')        # Plot the first of the 100 trees, laid out left-to-right
plt.savefig('network_attack_tree.png', dpi=600)  # Save a high-resolution copy so the text is readable
plt.show()
This is a decent start. We can see that the most important feature in this first tree is whether the value for dst_host_srv_diff_host is less than .5; meaning, did the packet get sent to a different destination than where it came from? This is something that would make sense to an IT security manager. However, this is just one of 100 trees based on a sub-sample of the data. Is this truly the most important feature? As we demonstrated before, one way to get feature importance is to calculate a logistic regression to examine the coefficients. However, that is a completely different model than the XGBClassifier. It would be nice to understand feature importance for this XGBClassifier model.
XGBoost provides a function called plot_importance() that allows us to visualize the combined effect of each feature across all of the trees averaged together. You can find the complete documentation online.
import xgboost, seaborn as sns
g = xgboost.plot_importance(xgc,
                            importance_type='gain',   # See options below
                            grid=False,               # Remove gridlines
                            height=.9,                # Set height of each bar
                            show_values=False,        # Set to True to see the numbers
                            values_format='{v:.2f}',  # Format decimals if numbers are showing
                            title='',                 # Remove title
                            )
# "gain" = average gain of splits which use the feature
# "weight" = number of times a feature appears in a tree
# "cover" = average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split
sns.despine(bottom=True)
plt.xlabel('')
plt.xticks([])
plt.show()
Based on these results, we can see that across the combined average of all 100 trees, same_srv_rate has the largest effect on predicting attack type of any feature. After that, the 'http' service has the next largest effect. However, with a multi-class model, it is still difficult to understand exactly what effect these features are having. Is it on differentiating 'normal' traffic or some specific attack? To understand that, we would have to dig more deeply into the specific trees. But that's a deeper discussion for another chapter.
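If you ever want the underlying numbers rather than the bar chart, one optional way (not required for this case study) is to pull the same gain-based scores from the fitted model's booster object; the keys come back as the dummy-coded column names:
# Optional: view the raw gain-based importance scores as a sorted table
gain_scores = xgc.get_booster().get_score(importance_type='gain')
pd.Series(gain_scores).sort_values(ascending=False).head(10)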