13.2 Training the Model
Training the model means estimating the intercepts of the nodes and the weights of the arrows that produce the best predictive results.
Example of an ANN on a Small Dataset
Consider the following dataset. The table contains information on tasting scores for a certain processed cheese. The two predictors are scores for fat and salt, indicating the relative presence of fat and salt in each cheese sample (where 0 is the minimum possible score and 1 the maximum). The output variable is the consumer's preference for the cheese: “1” (like) or “0” (dislike).
The input nodes take as input the values of the predictors. In this example there are two predictors, so the input layer has two nodes, each feeding into each node of the hidden layer. An input node performs no computation: its output is simply its input. Consider the first observation: the input into the input layer is fat = 0.2 and salt = 0.9, and the output of this layer is likewise x1 = 0.2 and x2 = 0.9.
The hidden layer nodes then take as input the output values from the input layer. The hidden layer in this example consists of three nodes, each receiving input from all the input nodes. To compute the output of a hidden layer or output layer node, we take a weighted sum of the node's inputs, add an intercept, and apply a function to the result. The weighted sum plus the intercept is called the logit, z; in this example, the function applied to z is the logistic function.
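As a sketch of this computation in Python (the intercept and weights below are illustrative placeholders, not fitted values from the example):

```python
import numpy as np

def node_output(inputs, weights, bias):
    """Output of one node: intercept plus weighted sum, passed through the logistic function."""
    z = bias + np.dot(weights, inputs)   # the logit z
    return 1 / (1 + np.exp(-z))          # logistic response function

# Observation 1: fat = 0.2, salt = 0.9
x = np.array([0.2, 0.9])

# Hypothetical intercept and weights for one hidden node (placeholders)
theta3 = -0.3
w3 = np.array([0.05, 0.01])

print(node_output(x, w3, theta3))  # a value between 0 and 1 (about 0.43 here)
```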
Let’s do this to compute the output of node 6. In our example, output node 6 receives input from the three hidden layer nodes. We first compute its logit: the intercept plus the sum, over the incoming arrows, of each hidden node's output multiplied by that arrow's weight. The logit z is then plugged into the logistic response function to produce the node's output: output = 1 / (1 + e^(−z)).
Using a probability cutoff of p = 0.5, we map the output for the first observation, 0.506, to a classification of “1” (like).
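Continuing the sketch, output node 6 applies the same computation to the three hidden node outputs, followed by the cutoff; all numbers below are again placeholders chosen only to illustrate the mechanics:

```python
import numpy as np

def node_output(inputs, weights, bias):
    z = bias + np.dot(weights, inputs)   # logit
    return 1 / (1 + np.exp(-z))          # logistic response

# Hypothetical outputs of the three hidden nodes for observation 1
hidden = np.array([0.43, 0.51, 0.52])

# Hypothetical weights on the arrows into output node 6, plus its intercept
w6 = np.array([0.01, 0.05, 0.015])
theta6 = -0.015

p = node_output(hidden, w6, theta6)   # with these placeholders, p is about 0.506
label = 1 if p >= 0.5 else 0          # apply the p = 0.5 cutoff
print(p, label)                       # classified as "1" (like)
```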
The computation we carried out for the first observation is repeated for all the observations in the training set. For each record, the prediction produced by the model is compared with the actual response; their difference is the error for the output node. The error is then propagated back and distributed to all the hidden nodes and used to update the estimated weights. The weights are then updated after each record is run through the network.
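One common form of this weight update, shown here for the output node only, scales the prediction error by the derivative of the logistic function and a learning rate; the exact variant can differ by implementation, and the values below are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical state after the forward pass for one record
hidden = np.array([0.43, 0.51, 0.52])   # outputs of the hidden nodes
w6 = np.array([0.01, 0.05, 0.015])      # weights into the output node
theta6 = -0.015                         # intercept of the output node
y = 1                                   # actual response for this record

y_hat = sigmoid(theta6 + np.dot(w6, hidden))   # predicted probability

# Error term for a logistic output node: the prediction error scaled by
# the derivative of the logistic function, then a gradient-style update
lr = 0.5                                # learning rate (step size)
err = y_hat * (1 - y_hat) * (y - y_hat)
w6 = w6 + lr * err * hidden             # update each incoming weight
theta6 = theta6 + lr * err              # update the intercept
```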
Case Updating
In case updating with the tiny dataset, the weights are first updated after running observation 1 through the network. These new weights are updated again after the second observation is run through, then the third, and so on, until all the observations have been used. One complete pass of all records through the network is called an epoch (also a sweep or iteration). Case updating usually yields more accurate results but requires longer runtimes.
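A minimal sketch of case updating for a network with this 2–3–1 structure; the records and initial weights below are synthetic stand-ins, since only the first observation's values appear above:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

# Synthetic stand-in data (the real table holds fat/salt scores and 0/1 likes)
X = rng.random((6, 2))           # six records, two predictors
y = rng.integers(0, 2, 6)        # six 0/1 responses

# Random initial weights: 2 inputs -> 3 hidden nodes -> 1 output node
W1, b1 = rng.normal(0, 0.1, (3, 2)), np.zeros(3)
W2, b2 = rng.normal(0, 0.1, 3), 0.0
lr = 0.5                         # learning rate

for epoch in range(100):             # repeated sweeps (epochs) through the data
    for xi, yi in zip(X, y):         # case updating: one record at a time
        h = sigmoid(W1 @ xi + b1)    # forward pass: hidden layer
        y_hat = sigmoid(W2 @ h + b2) # forward pass: output node
        # Backpropagate: output error, then each hidden node's share of it
        err_out = y_hat * (1 - y_hat) * (yi - y_hat)
        err_hid = h * (1 - h) * W2 * err_out
        # Update the weights immediately after this record (case updating)
        W2 += lr * err_out * h
        b2 += lr * err_out
        W1 += lr * np.outer(err_hid, xi)
        b1 += lr * err_hid
```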
The reason this works is that large errors produce large changes in the weights, while small errors leave the weights nearly unchanged. Over the course of thousands of updates, a given weight keeps changing until the error associated with it becomes negligible, at which point training terminates.