Neural Network Configuration Options

Neural networks have a number of features that can be configured, and the choices made can have a large effect on how well the model performs.

ANN Configuration Options in JMP

Hidden Layer Structure

This refers to the number of hidden layers and the number of nodes in those layers. The number entered for each type of transfer function (TanH, Linear, and Gaussian) tells JMP how many nodes of that function to include. There is no reliable rule of thumb for how many hidden layers to use or how many transfer functions of each type to include; instead, a process of trial and error is required to achieve the best results. It is typically good to include both TanH and Linear functions: TanH nodes can capture nonlinear relationships, while Linear nodes capture linear relationships when they exist.
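
JMP builds this structure for you, but a rough sketch in Python may make the idea concrete. The layer sizes and weights below are arbitrary illustrations, not JMP's defaults; the point is that a single hidden layer can mix TanH and Linear nodes, and the prediction combines the outputs of both node types.

    import numpy as np

    def hidden_layer(x, W_tanh, W_lin, w_out):
        # TanH nodes squash a weighted sum, capturing nonlinear structure...
        h_tanh = np.tanh(W_tanh @ x)
        # ...while Linear nodes pass their weighted sum through unchanged.
        h_lin = W_lin @ x
        # The prediction is a weighted combination of both node types.
        h = np.concatenate([h_tanh, h_lin])
        return w_out @ h

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)             # 4 input variables (arbitrary)
    W_tanh = rng.normal(size=(3, 4))   # 3 TanH nodes
    W_lin = rng.normal(size=(2, 4))    # 2 Linear nodes
    w_out = rng.normal(size=5)         # output weights over all 5 nodes
    print(hidden_layer(x, W_tanh, W_lin, w_out))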

Boosting

Boosting is a method for improving performance by fitting a series of weak learning models and combining them, so that the ensemble predicts better than any single model alone. It will be explained in a subsequent chapter of this book.
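
As a preview, here is a minimal sketch of the general boosting idea, using a deliberately trivial weak learner (a model that only predicts the mean); this is not JMP's exact procedure. Each weak model is fit to the residuals of the ensemble so far, and a damped copy of its prediction is added to the running total.

    import numpy as np

    def fit_stump(x, y):
        # A deliberately weak learner: ignore x and predict the mean of y.
        m = y.mean()
        return lambda x: np.full_like(x, m)

    def boost(x, y, n_models=10, rate=0.5):
        pred = np.zeros_like(y)
        for _ in range(n_models):
            residual = y - pred           # what the ensemble still gets wrong
            model = fit_stump(x, residual)  # fit a weak model to the residual
            pred += rate * model(x)         # add a damped copy of its prediction
        return pred

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(size=100)
    print(boost(x, y)[:5])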

Fitting Options

Transform Covariates

The transform covariates option is very powerful. If it is selected, JMP checks each input variable to see whether it is normally distributed and, if not, transforms that variable so the network can learn from its values more effectively. Doing this manually would be a tedious process; checking the ‘Transform Covariates’ option does it automatically.
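
As an illustration of the general idea (not JMP's internal procedure), the sketch below applies a hypothetical rule: measure the skewness of each column and log-transform any column that is strongly skewed. The cutoff value here is arbitrary.

    import numpy as np
    from scipy import stats

    def transform_covariates(X, skew_cutoff=1.0):
        # For each column, measure skewness; heavily skewed columns get a
        # log transform (shifted so all values are positive first).
        Xt = X.copy()
        for j in range(X.shape[1]):
            col = X[:, j]
            if abs(stats.skew(col)) > skew_cutoff:
                Xt[:, j] = np.log(col - col.min() + 1.0)
        return Xt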

Penalty Method

The penalty method refers to how JMP penalizes model complexity while searching for the best model. The squared penalty is analogous to the penalty used in ridge regression: the sum of the squared parameter estimates is added to the fitting objective, which discourages large weights.

The squared penalty is good to use if you know that all of your input variables are statistically significant, or in other words that each input variable is important to the accuracy of your model. The absolute and weight decay methods are good to use if you do not know whether all of your input variables are statistically significant, since they can shrink the weights on unimportant variables toward zero.
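
The sketch below shows the form such penalized fitting typically takes. The squared and absolute penalties are standard (they match ridge regression and the lasso); the weight decay form shown, w²/(1+w²), is the one described in JMP's documentation, and lam is a tuning parameter chosen during fitting.

    import numpy as np

    def penalty(weights, method="squared"):
        w = np.asarray(weights)
        if method == "squared":        # like ridge regression: sum of w^2
            return np.sum(w ** 2)
        if method == "absolute":       # like the lasso: sum of |w|
            return np.sum(np.abs(w))
        if method == "weight_decay":   # w^2 / (1 + w^2)
            return np.sum(w ** 2 / (1 + w ** 2))
        raise ValueError(method)

    def penalized_objective(sse, weights, lam, method):
        # The model is fit by minimizing training error plus the penalty.
        return sse + lam * penalty(weights, method)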

Number of Tours

The number of tours refers to the number of random starting points JMP will test. Each neural network model begins training from a randomly selected set of initial weights, and if those starting values are poor, the result is a suboptimal model. Specifying multiple tours lets JMP test that many starting points and keep the one that produces the best results. Because the starting point is randomly selected each time a new model is created, we can use the exact same settings and parameters when we invoke the learning algorithm and still end up with slightly different outcomes.
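
Here is a minimal sketch of the tour idea, assuming a hypothetical fit_model function that trains a network from the given random seed and returns the fitted model along with its validation error:

    import random

    def best_of_tours(fit_model, n_tours=20):
        # Try n_tours random starting points and keep the model with the
        # lowest validation error.
        best_model, best_err = None, float("inf")
        for tour in range(n_tours):
            model, val_err = fit_model(seed=random.randrange(2**31))
            if val_err < best_err:
                best_model, best_err = model, val_err
        return best_model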

Other Common Model Learning Parameters

Learning rate and momentum are learning parameters that are often used to regulate the training of neural networks.

Learning Rate

Because error is used to update the weights after each iteration, it is important to control how much each update changes them. In most data mining tools we can tweak the ‘learning rate’, which regulates how fast the weights change at each iteration. When the learning rate is low, the impact of new evidence is diminished; when it is high, the impact of new evidence is magnified. A low learning rate slows learning but reduces the tendency to overfit the model (Shmueli, Bruce, and Patel, 2016).
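
As a sketch, the standard gradient descent update shows where the learning rate enters; the weights and gradients here are plain Python lists, and the gradient values are assumed to come from the surrounding training procedure:

    def update_weights(weights, gradients, learning_rate=0.01):
        # Each weight moves against its error gradient; the learning rate
        # scales the step, so new evidence has less impact when it is small.
        return [w - learning_rate * g for w, g in zip(weights, gradients)]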

Momentum

Some data mining tools also have a setting to change the momentum of the weights. A high momentum value keeps the weights changing in the same direction as in the previous iteration, while a low value allows the weights to change direction quickly, for example from negative to positive. High momentum slows learning but helps avoid overfitting to local structures.
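
Extending the update above with a momentum term (a common formulation, though individual tools may differ):

    def update_with_momentum(weights, velocities, gradients,
                             learning_rate=0.01, momentum=0.9):
        # The velocity blends the previous step with the new gradient, so
        # high momentum keeps weights moving in their previous direction.
        new_v = [momentum * v - learning_rate * g
                 for v, g in zip(velocities, gradients)]
        new_w = [w + v for w, v in zip(weights, new_v)]
        return new_w, new_v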

Avoid Overfitting

One of the weaknesses of ANNs is that they can easily overfit the data, making the error rate on the validation data too large. As with classification and regression trees, overfitting can be detected by examining performance on the validation data and noting when it starts to deteriorate while performance on the training set is still improving. The point of minimum validation error is a good indicator of the number of epochs to train for in order to achieve the best error rate (Shmueli, Bruce, and Patel, 2016).
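
The sketch below illustrates this early stopping rule, assuming hypothetical train_one_epoch and validation_error functions supplied by the surrounding training code:

    def train_with_early_stopping(model, train_one_epoch, validation_error,
                                  max_epochs=500, patience=10):
        # Stop once validation error has not improved for `patience` epochs;
        # the epoch of minimum validation error marks the best model.
        best_err, best_epoch = float("inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch(model)
            err = validation_error(model)
            if err < best_err:
                best_err, best_epoch = err, epoch
            elif epoch - best_epoch >= patience:
                break
        return best_epoch, best_err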