8.4 Partitioning Time Series Data
Not All Data Is Created Equal
Recall from past chapters that we needed to divide our data into training and validation sets. The validation set is set aside while the model is built on the training set; it then serves as the unseen data used to validate the accuracy of the trained model. We will now introduce different methods of partitioning the data and explain why you would partition the data in a particular way. The methods for training and validating the data are Random Partitioning, Time Series Partitioning, Standard Cross Validation, K-Fold Cross Validation, and Time Series Cross Validation.
Even though there are different methods for partitioning, we will focus on Time Series Partitioning in JMP so that we can create our future forecasts using MLR.
Random Partitioning
Random partitioning treats all observations equally and ignores the fact that some observations may have been produced at different times. This is appropriate when the time of data collection is not assumed to be material to the outcome. Since all records can be treated as equal, randomization is used to select which records will be used for training versus validation. This is a common method when preparing data for any type of machine learning model. The proportion of records placed in the training and validation sets varies; the size of the partitions is left to the discretion of the data analyst. Usually, however, the majority of the data is used for training and the lesser portion for validation.
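As a minimal sketch in Python (the record count, seed, and 75/25 split below are illustrative assumptions, not values from the text), a random partition can be drawn like this:

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed so the split is reproducible

n_records = 100                        # illustrative sample size
train_fraction = 0.75                  # common choice: majority of data for training

# Shuffle the record indices, then cut them into training and validation sets.
indices = rng.permutation(n_records)
n_train = int(n_records * train_fraction)
train_idx = indices[:n_train]          # randomly chosen training records
valid_idx = indices[n_train:]          # the remainder becomes the validation set

print(f"training records: {len(train_idx)}, validation records: {len(valid_idx)}")
```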
Time Series Partitioning
A different form of partitioning is used when time sequence is material to the outcome of predictions. We do not treat all records as equal, so we do not use randomization to select the records in the training and validation sets. Rather, we keep the sequence in order and use the most recent records as the validation set and the earlier observations as the training set. The figure below shows how the data should be partitioned for a time series forecast. Notice in the time series validation that all of the validation data is the most recent data.
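The same split can be sketched in Python (again with an illustrative record count and holdout size; the records are assumed to be sorted from oldest to newest):

```python
import numpy as np

n_records = 100        # illustrative: records assumed sorted by time
valid_size = 20        # illustrative: hold out the 20 most recent records

indices = np.arange(n_records)          # record positions, oldest (0) to newest
train_idx = indices[:-valid_size]       # earlier observations -> training set
valid_idx = indices[-valid_size:]       # most recent observations -> validation set

# No shuffling: the time order of the records is preserved in both partitions.
print(f"training: records 0-{train_idx[-1]}, "
      f"validation: records {valid_idx[0]}-{valid_idx[-1]}")
```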
If we were to partition the DeptStore data set this way, it would look like the right-hand validation column in JMP in the figure below:
The right-hand validation column in JMP will be used to train and validate the various models that we will produce for Time Series Forecasting using MLR. The next section, "Performing Time Series Forecasting with MLR", will go through the details and process of running the validation column with several methods/models to get the best fit/R². Once you find the best fit, you can use JMP to create forecasts/predicted values for your forecast periods.
Standard Cross Validation
How the overall data is partitioned into training and validation portions can cause the partitions to have different characteristics. For example, if a group of records that are not typical of the rest of the sample is selected into the training partition, the model overfits to them, so when the model is run on the validation data, it does not do well. Now consider what happens if the non-typical records are put in the validation set instead: the model may be trained well on the training data, but because the validation data is different, the error rate in the validation set will be higher than it should be.
To overcome this problem, the process of cross-validation was developed. Cross-validation segregates different records into the training and validation partitions across many partitioning events, and then all of the partition sets are run through the training and validation process.
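One simple way to realize the "many partitioning events" idea is repeated random splitting, sketched below (the record count, number of events, and validation size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed for reproducibility

n_records = 100    # illustrative sample size
n_events = 5       # illustrative number of partitioning events
valid_size = 20    # illustrative validation-set size

for event in range(1, n_events + 1):
    # Each partitioning event draws a fresh random split of the records.
    indices = rng.permutation(n_records)
    valid_idx = indices[:valid_size]
    train_idx = indices[valid_size:]
    # ... train the model on train_idx and validate it on valid_idx here ...
    print(f"event {event}: {len(train_idx)} training / {len(valid_idx)} validation")
```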
K-Fold Cross Validation
In K-Fold cross validation, the data is divided into k subsets of contiguous records. The holdout method is then repeated k times, such that each time one of the k subsets is used as the validation set and the other k-1 subsets are put together to form the training set. The error estimate is averaged over all k trials to get the total effectiveness of the model. With this approach, every data record is in a validation set exactly once and in the training set k-1 times. This significantly reduces the bias that can be injected into the process by doing just one split of the records into training and validation partitions. Because all of the records are in the training partition multiple times, the model benefits from seeing many sets of training data, and because each subset is used once for validation, all of the records contribute to the validation evaluation. As a general rule supported by empirical evidence, k = 5 or 10 is preferred, but the value is not fixed and k can take any value.
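A minimal sketch of K-Fold cross validation, assuming the scikit-learn library is available (the record count and k = 5 are illustrative, and the model fitting step is left as a placeholder):

```python
import numpy as np
from sklearn.model_selection import KFold

n_records = 100               # illustrative sample size
X = np.arange(n_records)      # stand-in for the real predictor matrix

# shuffle=False keeps each fold a block of contiguous records, as described above.
kfold = KFold(n_splits=5, shuffle=False)

errors = []
for fold, (train_idx, valid_idx) in enumerate(kfold.split(X), start=1):
    # ... fit the model on X[train_idx], score it on X[valid_idx] here ...
    errors.append(0.0)        # placeholder for the fold's validation error
    print(f"fold {fold}: {len(train_idx)} training / {len(valid_idx)} validation")

# The overall error estimate is the average over all k folds.
print("mean validation error:", np.mean(errors))
```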
Time Series Cross Validation
In time series cross validation, different blocks of records at the end of the dataset with known outcome values are selected in turn as the validation partition, and each model is trained only on the records that precede its validation block. This repeats the training-and-validation process several times while preserving the time ordering in every split.
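A minimal sketch of this idea, assuming scikit-learn's TimeSeriesSplit utility (the record count and number of splits are illustrative; records are assumed sorted oldest to newest):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_records = 100               # illustrative: records sorted oldest to newest
X = np.arange(n_records)      # stand-in for the real predictor matrix

# Each split validates on a later block and trains only on the records before it.
tscv = TimeSeriesSplit(n_splits=4)

for fold, (train_idx, valid_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold}: train records 0-{train_idx[-1]}, "
          f"validate records {valid_idx[0]}-{valid_idx[-1]}")
```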