17.2 Components of Support Vector Machines
Support Vector Machines can be broken down into four components that, taken together, explain how the algorithm works:
- Separating Hyperplane
- Maximum Margin Hyperplane
- Soft Margin
- Kernel Functions
Separating Hyperplane
Looking at the two-dimensional data in figure x.x, our eye has a natural tendency to identify distinct groups and to see where a line could be drawn to separate the data points. This is true of many data sets, and it is the main objective the SVM is trying to achieve. The SVM calculates where the separating hyperplane is located and then classifies each data point according to which side of the hyperplane the point falls on.
Hyperplanes can also be constructed in higher dimensions. For example, figure x.x shows a 3-dimensional model in which an SVM is used to distinguish which of the Federalist Papers were written by Hamilton and which were written by Madison. Looking at the 3-dimensional image, there is a clear hyperplane that separates the papers by author and shows which characteristics make the two writers distinguishable. This may not have been apparent had we analyzed the data in only two dimensions.
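As a minimal sketch of the "which side of the hyperplane" idea, the snippet below fits a linear SVM to a handful of made-up two-dimensional points and classifies a new point by the sign of its distance to the hyperplane. The use of scikit-learn's SVC is an assumption about tooling, not part of the discussion above.

```python
import numpy as np
from sklearn.svm import SVC

# Two small made-up clusters of two-dimensional points
X = np.array([[1.0, 1.2], [1.5, 0.8], [2.0, 1.0],   # one group
              [4.0, 4.2], [4.5, 3.8], [5.0, 4.5]])  # the other group
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear")        # fit a separating hyperplane
svm.fit(X, y)

new_point = np.array([[3.0, 1.0]])
# decision_function returns a signed score; its sign tells us which side
# of the hyperplane the new point falls on, and predict() reports the class
print(svm.decision_function(new_point), svm.predict(new_point))
```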
Maximum Margin Hyperplane
A question we may ask ourselves, however, is: what is the best separating hyperplane? Looking at the two-dimensional data in figure x.x, we can see that there are several ways to separate the data, but which is best if we are trying to classify our data points?
To find the best separating hyperplane, we look for the hyperplane with the maximum margin width. For two-dimensional data, the margin width is defined by two parallel lines passing through the support vectors. Support vectors are usually the points that lie toward the outside of a cluster, nearest the other class. See figure x.x for an example.
Observing figure x.x, you can see that different hyperplanes can be created from different choices of support vectors. We can also see that margin width #1 is the larger of the two, so the hyperplane it defines is the one we should use. The separating hyperplane runs down the middle of the maximum margin, equidistant from the support vectors on either side.
The reason the SVM chooses the maximum margin width is to help reduce overfitting. When test data is introduced, the wider margin increases the probability that a test point falls on the correct side of the hyperplane and is therefore classified correctly.
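A short sketch of this idea, again assuming scikit-learn and made-up points: for a linear SVM, the margin width can be recovered from the fitted weight vector as 2 / ||w||, and the support vectors themselves can be inspected directly.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, cleanly separable two-dimensional data
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6)    # a very large C approximates a hard margin
svm.fit(X, y)

w = svm.coef_[0]                     # normal vector of the separating hyperplane
print("support vectors:\n", svm.support_vectors_)
print("maximum margin width:", 2.0 / np.linalg.norm(w))
```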
Soft Margin
Sometimes, however, the data cannot be strictly separated by a hyperplane. In this case we allow a soft margin, which lets a few data points fall on the "wrong" side of the separating hyperplane. The soft margin, in summary, specifies the tradeoff between hyperplane violations and the size of the maximum margin, as illustrated in the sketch after the list below.
Observing figure x.x, the soft margin for classification is calculated using the following steps:
- Choose a hyperplane that splits the data points as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples.
- Misclassified observations are re-labeled as instances of a slack variable ξi. The distances between each misclassified instance and the margin on the other side of the hyperplane are summed as a measure of total error.
- The SVM then finds a balance between a wide margin and a small amount of total error.
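As a hedged sketch of this tradeoff (assuming scikit-learn, where the regularization parameter C plays the role of the soft-margin penalty, and synthetic overlapping clusters), a small C tolerates more margin violations while a large C penalizes them heavily:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping made-up clusters, so some points must violate the margin
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# C controls the tradeoff: a small C tolerates more violations (a wider,
# softer margin), while a large C penalizes violations heavily
for C in [0.01, 1.0, 100.0]:
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: training accuracy={svm.score(X, y):.2f}, "
          f"support vectors={len(svm.support_)}")
```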
Kernel Functions
So far we have covered how an SVM can classify data using straight, linear boundaries. But how do SVMs handle situations in which the data would be better separated by a non-linear boundary? Figures x.x and x.x are examples of data points that are better classified using non-straight lines.
The way SVMs create non-straight boundaries is by using different kernel functions. A kernel function is a transformation of the data into a higher-dimensional space, one that cannot always be drawn out or visualized.
For example, figure x.x shows on the left-hand side two categories of data to be classified in two-dimensional space. It is clear that the data can be separated by drawing a circle around the red group. Behind the scenes, however, the SVM uses a specific kernel to convert the data into a higher dimension, 3-dimensional in this case, and then separates it with a hyperplane as discussed before. Once the ideal hyperplane has been chosen, it can be mapped back down to the original two-dimensional space, where it appears as a non-linear boundary.
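A minimal sketch of this lifting idea, assuming scikit-learn and synthetic "circle" data: adding a third coordinate equal to x1² + x2² makes the two rings separable by a flat plane, which is essentially what a kernel SVM does implicitly.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Made-up "circular" data: one class forms a ring around the other
X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

# Explicitly lift the points into a third dimension, (x1, x2, x1^2 + x2^2);
# in this 3-dimensional space the two rings can be split by a flat plane
X_3d = np.column_stack([X, (X ** 2).sum(axis=1)])
linear_in_3d = SVC(kernel="linear").fit(X_3d, y)

# A kernel SVM performs an equivalent lift implicitly, without building X_3d
rbf_in_2d = SVC(kernel="rbf").fit(X, y)

print(linear_in_3d.score(X_3d, y), rbf_in_2d.score(X, y))
```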
Another example would be separating linear, or one-dimensional, data as shown in figure x.x. Looking at the figure, we can easily draw a single split to separate the data into two categories: every data point to the left of the split is classified as red, while everything to the right is classified as blue.
However, with a more complicated one-dimensional data set, as shown in figure x.x, we can no longer separate the data with a single split.
Fortunately, this problem can be solved with a kernel function. With a kernel function we can convert the one-dimensional data into two-dimensional space, as shown in figure x.x. As you can see, once the data has been transformed by the kernel, we can classify it with a single linear split.
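A sketch of this one-dimensional case, under the same scikit-learn assumption and with made-up points where one class sits between two groups of the other: the feature map x → (x, x²) lifts the line into a plane where a single straight boundary works.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up one-dimensional data: the blue class sits between two red groups,
# so no single split along the line can separate the classes
x = np.array([-4.0, -3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0, 4.0])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

# Map each point into two dimensions with the feature map x -> (x, x^2);
# after the transformation a single straight line separates the classes
X_2d = np.column_stack([x, x ** 2])
svm = SVC(kernel="linear").fit(X_2d, y)
print(svm.score(X_2d, y))   # 1.0: perfectly separable after the mapping
```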
Which Kernel to Use?
The kernel is something of a double-edged sword. A strength of the SVM is that, with the right kernel, it can classify data that other algorithms may struggle to separate. On the other hand, many kernels exist, and it can take time and an understanding of each one to make the right choice. Discovering the right kernel is often a process of trial and error. To simplify the process, most data mining software offers a few common kernels that are known to give good results, but these defaults will not always model the data best.
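One hedged sketch of the trial-and-error process, assuming scikit-learn: cross-validated grid search over a few common kernels (and penalty values) automates the comparison on a synthetic data set.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic "two moons" data that no single straight line separates well
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Try a few common kernels (and C values) and let cross-validation pick one
param_grid = {"kernel": ["linear", "poly", "rbf", "sigmoid"], "C": [0.1, 1, 10]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
```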
Underfitting and Overfitting
Another reason finding the right kernel can take time and trial and error is that there must be a balance between the number of dimensions a kernel creates and the amount of error allowed by the soft margin.
One might suppose that increasing the number of dimensions enormously would allow us to find a hyperplane whose soft margin has very little error. However, greatly increasing the number of dimensions also greatly increases the number of hyperplanes that could divide the data. This makes choosing a single separating hyperplane extremely complex for the SVM, and even if one is found when the number of dimensions is very high, the model will more than likely overfit.
On the other end, if there are too few dimensions and the soft-margin error is too large, the result can be underfitting and many misclassified data points.
Therefore, a kernel that does not create too many dimensions while keeping the soft-margin error low is the most likely to give good results.
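As a final hedged sketch of this balance, assuming scikit-learn's RBF kernel: its gamma parameter controls how flexible the implicit feature space is, and comparing training accuracy with held-out accuracy shows underfitting at one extreme and overfitting at the other.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# gamma controls how flexible the implicit feature space is for the RBF kernel:
# a very small gamma tends to underfit, a very large gamma tends to overfit
for gamma in [0.01, 1.0, 100.0]:
    svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_train, y_train)
    print(f"gamma={gamma:>6}: train={svm.score(X_train, y_train):.2f}, "
          f"test={svm.score(X_test, y_test):.2f}")
```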