Data Partitions Revisited

We have learned several algorithms and how models are evaluated for predictive accuracy. This is a good time to update our understanding of the overall modeling process and of how the training, validation, and test partitions are used along the way.

By reading this section, you will learn answers to the following questions:

  1. Should we split the data into two or three partitions?

  2. What is the role of the train, validation, and test sets in the overall modeling and model evaluation process?

  3. How is overfitting introduced during the model training and evaluation process?

  4. Why is the test partition considered to be the gold standard for model evaluation?

  5. What are model parameters? What are hyperparameters?

The figure below shows how the original data can be split into training and testing partitions. Sometimes the training data is further randomly subdivided into training and validation partitions.

Figure 13.11: Data partitions used in the learning and evaluation process
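
As a concrete illustration of the two-stage split shown in the figure, here is a minimal sketch using scikit-learn's train_test_split. The synthetic dataset, the 60/20/20 proportions, and the variable names are illustrative assumptions, not requirements.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the "original data" in the figure.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# First split: hold out 20% of the rows as the test partition.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Second split: carve a validation partition out of the remaining 80%
# (0.25 of 80% = 20% of the original), leaving 60% for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42)
```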

Model parameters and hyperparameters

Training data is used during initial training to determine model parameters. Model parameters are the parameters of the model itself that a learning algorithm derives exclusively from the training data. For example, the estimated slope coefficients and intercept of a linear regression model fitted to the training data are model parameters. Another example is the set of best split points a decision tree algorithm derives from the training data.
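
The sketch below (a minimal example assuming scikit-learn and a synthetic dataset) fits a linear regression and reads back the learned slope and intercept coefficients. These are model parameters because the algorithm derives them entirely from the training data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic training data; in practice this would be the training partition.
X_train_reg, y_train_reg = make_regression(
    n_samples=200, n_features=3, noise=5.0, random_state=0)

model = LinearRegression().fit(X_train_reg, y_train_reg)

# Model parameters: learned entirely from the training data.
print("Slope coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```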

A secondary training process happens for some algorithms, where the validation data is used to tune the hyperparameters of the algorithm. Examples of hyperparameters tuned with the validation data include the best k in KNN, the maximum depth of a decision tree, and the number of nodes in an ANN model. For some algorithms, more than one hyperparameter must be chosen. For a neural network, for example, the hyperparameters include which transfer function to use, how many nodes to include, and how many network layers to use.
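
Here is a minimal sketch of tuning a single hyperparameter, k in KNN, with a validation set. It assumes the X_train, X_val, y_train, y_val arrays from the split sketch above; the candidate values of k are arbitrary.

```python
from sklearn.neighbors import KNeighborsClassifier

# Model parameters are learned from the training partition; the validation
# partition is used only to compare candidate values of the hyperparameter k.
best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9, 15, 25]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = knn.score(X_val, y_val)  # accuracy on the validation set
    if score > best_score:
        best_k, best_score = k, score

print("Best k on the validation set:", best_k)
```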

Some modeling environments, like Azure, provide the option to do what is called a parameter sweep. That is, the environment systematically tries a variety of hyperparameter settings to determine which combination produces the best result.
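
Azure's parameter sweep is specific to that environment, but scikit-learn's GridSearchCV performs an analogous sweep (here using cross-validation within the training partition rather than a single fixed validation set). The hyperparameter grid below is an illustrative assumption.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train and y_train from the split sketch above.
# Sweep two decision-tree hyperparameters; every combination is tried.
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 5, 10],
}
sweep = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
sweep.fit(X_train, y_train)

print("Best hyperparameter combination:", sweep.best_params_)
```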

How use of the validation set can cause overfitting

When the validation dataset is used to tune hyperparameters, some overfitting to the validation data can be introduced. As a result, the prediction error estimated on the validation data may be artificially low.

In addition to hyperparameter tuning, the validation set may introduce some overfitting simply by being used to evaluate models during preliminary model comparison. The validation data may not perfectly represent the data the model will ultimately be used on, so choosing among models based on validation performance alone may select a model that turns out not to be ideal on the test dataset. This occurs frequently.
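
This optimism can be made visible with a small sketch. Reusing the KNN tuning example and the splits above, it compares the winning model's validation accuracy with its test accuracy; because k was chosen to maximize the validation score, that score is typically the more optimistic of the two.

```python
from sklearn.neighbors import KNeighborsClassifier

# Assumes best_k and the partition arrays from the sketches above.
chosen = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)

# The validation score was used to pick k, so it tends to be optimistic;
# the test score gives a more honest picture of performance on new data.
print("Validation accuracy (used for selection):", chosen.score(X_val, y_val))
print("Test accuracy (held-out estimate):", chosen.score(X_test, y_test))
```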

During model selection, we sometimes compare models that used the validation data to tune hyperparameters with models that did not. If we used three partitions in some cases and two partitions in others, we would be comparing models where the validation set introduced overfitting with models where it did not. To avoid this, it is simpler to use a single test set to evaluate all models.

The test set is generally well curated. It contains carefully sampled data that reflects the distribution of the various classes the model will face when used in the real world.

Use of the test set is the best practice for classic hold-out model evaluation

For these reasons, best practice is to use the test set for final model evaluation. Some practitioners use the validation set as the test set, but this is not good practice. Using the test partition as the basis for final comparison across all models provides the most conservative comparison. The test dataset is the gold standard for model evaluation, and it is used only once a model has been completely trained using the train and validation sets.

The test set is used to assess the generalization error of the final chosen model when applied to new "unseen" data. Ideally, the test set should be kept in a "vault" and brought out only at the end of the data analysis. In no way is the test set used to refine the model parameters or the hyperparameters.
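
One common convention (an assumption here, not something mandated by the text) is to refit the chosen model on the combined training and validation data once all tuning is finished, and only then touch the test set, exactly once, for the final estimate.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumes best_k and the partition arrays from the sketches above.
# Refit on train + validation now that tuning is complete, then score once
# on the test set for the final, "gold standard" estimate.
X_fit = np.concatenate([X_train, X_val])
y_fit = np.concatenate([y_train, y_val])

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_fit, y_fit)
print("Final test accuracy:", final_model.score(X_test, y_test))
```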

The test set is generally what is used to evaluate competing, fully trained models. For example, in many Kaggle competitions, the validation set is released initially along with the training set, but the test set is only released when the competition is about to close. The winner is chosen based on how well each model performs on the test set.

Use of the model on production data

The final step is application of the finished model to predict outcome values for instances that have known input values but no known output values. At the time of model application, the analyst does not know exactly how well the model is performing, but the analyst has an informed estimate of how well it will do. Later, as the actual outcome values are discovered, they can be compared to the predicted outcome values to see how well the model did.
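
A minimal sketch of this application step, assuming the final_model from above and a hypothetical batch of production rows X_new with known inputs but no known outcomes:

```python
from sklearn.metrics import accuracy_score

# Assumes final_model from the sketch above and X_new, a batch of production
# instances with known input values but unknown outcome values.
y_pred = final_model.predict(X_new)

# Later, once the actual outcomes (y_actual) are discovered, compare them to
# the stored predictions to see how well the model really performed:
# realized_accuracy = accuracy_score(y_actual, y_pred)
```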