SVM
Support Vector Machine
Given dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ with $x^{(i)} \in \mathbb{R}^d$ and labels $y^{(i)} \in \{-1, +1\}$.

Suppose that our data is linearly separable. Then, we can separate it with a hyperplane classifier:

$$h_{w,b}(x) = \operatorname{sign}(w^\top x + b)$$

The functional margin associated with training example $(x^{(i)}, y^{(i)})$ for a specific set of parameters $(w, b)$ is:

$$\hat{\gamma}^{(i)} = y^{(i)}\left(w^\top x^{(i)} + b\right)$$

The functional margin of the dataset is the minimum over all examples, $\hat{\gamma} = \min_i \hat{\gamma}^{(i)}$. When the functional margin of an example is positive, it is classified correctly, and a larger value indicates higher confidence.
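A quick worked example (numbers chosen arbitrarily for illustration): take $w = (2, 1)^\top$, $b = -2$, and the example $x = (2, 1)^\top$ with $y = +1$. Then

$$\hat{\gamma} = y\,(w^\top x + b) = 1 \cdot (2 \cdot 2 + 1 \cdot 1 - 2) = 3,$$

and scaling $(w, b) \mapsto (10w, 10b)$ inflates $\hat{\gamma}$ to $30$ without moving the decision boundary, a scaling issue we return to below.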
In support vector machines the decision boundary is chosen to be the one for which the functional margin (confidence) is maximized.
Distance to Plane
Let $x$ be a point and $x_p$ its orthogonal projection onto the hyperplane $w^\top x + b = 0$, so that

$$x = x_p + \gamma \frac{w}{\|w\|}$$

where $\gamma$ is the signed distance from $x$ to the plane. Since $x_p$ lies on the plane, $w^\top x_p + b = 0$, and therefore

$$w^\top x + b = w^\top x_p + b + \gamma \frac{w^\top w}{\|w\|} = \gamma \|w\| \quad\Longrightarrow\quad \gamma = \frac{w^\top x + b}{\|w\|}$$

For a labeled example, the geometric margin is $\gamma^{(i)} = y^{(i)}\,\dfrac{w^\top x^{(i)} + b}{\|w\|}$, i.e., the functional margin normalized by $\|w\|$.
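As a sanity check (an addition, not from the original notes), here is a small numeric verification of the distance formula; the vectors are made up:

```python
import numpy as np

# Verify gamma = (w^T x + b) / ||w|| by explicit projection (made-up numbers).
w = np.array([3.0, 4.0])                  # normal of the plane w^T x + b = 0
b = -5.0
x = np.array([4.0, 3.0])

gamma = (w @ x + b) / np.linalg.norm(w)   # signed distance of x to the plane
x_p = x - gamma * w / np.linalg.norm(w)   # project x onto the plane

assert np.isclose(w @ x_p + b, 0.0)                     # x_p lies on the plane
assert np.isclose(np.linalg.norm(x - x_p), abs(gamma))  # displacement = |gamma|
print(gamma)                              # 3.8
```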
Goal
The concept of the margin is intuitively simple: it is the distance of the separating hyperplane to the closest examples in the dataset, assuming that our dataset is linearly separable. That is:

$$\max_{w, b}\ \hat{\gamma} \quad \text{s.t.} \quad y^{(i)}(w^\top x^{(i)} + b) \ge \hat{\gamma}, \quad i = 1, \dots, n$$
This optimization problem is unbounded, because one can make the functional margin arbitrarily large by simply scaling the parameters by a constant $c > 1$: under $(w, b) \mapsto (cw, cb)$, $\hat{\gamma} \mapsto c\hat{\gamma}$. This has no effect on the decision plane because:

$$\{x : (cw)^\top x + cb = 0\} = \{x : w^\top x + b = 0\}, \qquad \operatorname{sign}\left((cw)^\top x + cb\right) = \operatorname{sign}\left(w^\top x + b\right)$$
Thus, we need to transform the optimization problem to maximize the distance between the samples and the decision boundary instead of maximizing the functional margin. Suppose we require all functional margins to be at least 1 (easily achieved by scaling the parameters by a constant):

$$y^{(i)}(w^\top x^{(i)} + b) \ge 1, \quad i = 1, \dots, n$$
Then, for the points closest to the decision boundary, the constraint is tight, $y^{(i)}(w^\top x^{(i)} + b) = 1$, and their distance to the boundary is $\dfrac{1}{\|w\|}$.
Then, we can formulate our objective as:

$$\max_{w, b}\ \frac{1}{\|w\|} \quad \text{s.t.} \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \quad i = 1, \dots, n$$
Which is equivalent to:

$$\min_{w, b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1, \quad i = 1, \dots, n$$
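This is itself a quadratic program, so a QP solver can handle it directly. Below is a minimal sketch using cvxopt (the library also used in the Implementation section); the toy data is made up, the variable is the stacked vector $z = (w, b)$, and a tiny ridge on the $b$ entry of $P$ may be needed if the solver complains about $P$ being singular:

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

# Hard-margin primal as a QP over z = (w, b):
#   min 1/2 ||w||^2   s.t.   y_i (w^T x_i + b) >= 1
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # made-up data
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

P = matrix(np.diag([1.0] * d + [0.0]))                # penalize ||w||^2, not b
q = matrix(np.zeros(d + 1))
G = matrix(-np.hstack([y[:, None] * X, y[:, None]]))  # -y_i (x_i^T w + b) <= -1
h = matrix(-np.ones(n))

z = np.array(solvers.qp(P, q, G, h)["x"]).ravel()
w, b = z[:d], z[d]
print(np.sign(X @ w + b))   # should reproduce y
```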
Soft Margin SVM
Slack Variables
Notice that in the above formulation we have hard constraints on the margins, which do not allow any misclassification. However, real-world data is rarely linearly separable and datasets contain outliers, so we may wish to allow some examples to be on the wrong side of the hyperplane or to have margin less than 1.
To resolve this problem, we can introduce slack variables $\xi_i \ge 0$, one for each data point, to relax the hard constraints:

$$y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n$$
To encourage correct classification of the samples, we add a penalty $C \sum_i \xi_i$ to the objective:

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$
Thus, sample points are now permitted to have margin less than 1, and if an example has $\xi_i > 0$, the objective pays a cost $C\,\xi_i$; the hyperparameter $C$ controls the trade-off between a wide margin and few margin violations.
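As a practical aside (an addition, not from the original notes), scikit-learn's SVC implements this soft-margin formulation, with C playing exactly the trade-off role described above; a minimal usage sketch on made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # toy data
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # smaller C tolerates more slack
print(clf.support_)                           # indices of the support vectors
print(clf.predict([[2.5, 2.0]]))              # predicted label for a new point
```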
Dual Problem
Using Lagrange multipliers, we can transform the constrained problem into a max-min problem whose inner minimization is unconstrained (and whose dual function is concave):

$$\mathcal{L}(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y^{(i)}(w^\top x^{(i)} + b) - 1 + \xi_i\right] - \sum_{i=1}^{n}\mu_i \xi_i$$

$$\max_{\alpha \ge 0,\ \mu \ge 0}\ \min_{w, b, \xi}\ \mathcal{L}(w, b, \xi, \alpha, \mu)$$
where the inner minimization is the dual function $g(\alpha, \mu) = \min_{w, b, \xi} \mathcal{L}(w, b, \xi, \alpha, \mu)$, and the maximization w.r.t. $\alpha \ge 0$, $\mu \ge 0$ is the dual problem.
KKT
For an unconstrained convex optimization problem, we know we are at the global minimum if the gradient is zero. The KKT conditions are the analogous conditions for the global minimum of a constrained convex optimization problem.
Stationarity: if strong duality holds and $(w^*, b^*, \xi^*)$, $(\alpha^*, \mu^*)$ are optimal, then $(w^*, b^*, \xi^*)$ minimizes $\mathcal{L}(w, b, \xi, \alpha^*, \mu^*)$, so the gradient vanishes (these conditions are formulated as constraints in the dual problem):

$$\nabla_w \mathcal{L} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n}\alpha_i y^{(i)} x^{(i)}, \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y^{(i)} = 0, \qquad \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \mu_i = C$$

Complementary Slackness:

$$\alpha_i\left[y^{(i)}(w^\top x^{(i)} + b) - 1 + \xi_i\right] = 0, \qquad \mu_i \xi_i = 0$$

Primal Feasibility:

$$y^{(i)}(w^\top x^{(i)} + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$

Dual Feasibility:

$$\alpha_i \ge 0, \qquad \mu_i \ge 0$$
Solving the Dual Problem
We now solve for the dual function by fixing $\alpha$ and $\mu$ and minimizing $\mathcal{L}$ over $w$, $b$, and $\xi$. Setting the gradients to zero recovers the stationarity conditions:

$$w = \sum_{i=1}^{n}\alpha_i y^{(i)} x^{(i)}, \qquad \sum_{i=1}^{n}\alpha_i y^{(i)} = 0, \qquad \alpha_i + \mu_i = C$$
Substituting these back into the Lagrangian, we obtain the dual function:

$$g(\alpha) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle$$
Then, using $\mu_i = C - \alpha_i \ge 0$ to eliminate $\mu$, we have the dual problem:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \alpha_j y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n}\alpha_i y^{(i)} = 0$$
This is a quadratic programming problem that we can solve with any standard QP solver.
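To hand this to an off-the-shelf solver, one flips the maximization into a minimization in the solver's standard form; a sketch of the mapping (matching cvxopt's convention, used in the Implementation section below):

$$\min_{\alpha}\ \frac{1}{2}\alpha^\top P \alpha - \mathbf{1}^\top \alpha, \qquad P_{ij} = y^{(i)} y^{(j)} \langle x^{(i)}, x^{(j)} \rangle, \qquad \text{s.t. } 0 \le \alpha_i \le C,\ \ \sum_{i}\alpha_i y^{(i)} = 0$$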
Interpretation
We can conclude:

- If $0 < \alpha_i < C$: since $\mu_i = C - \alpha_i > 0$, complementary slackness forces $\xi_i = 0$, so $y^{(i)}(w^\top x^{(i)} + b) = 1$ and the point lies exactly on the margin.
- If $\alpha_i = C$: the slack $\xi_i$ may be positive.
    - $0 < \xi_i < 1$: the point is inside the margin but on the correct side.
    - $\xi_i = 1$: the point is on the decision boundary.
    - $\xi_i > 1$: the point is on the wrong side of the margin and misclassified.
- If $\alpha_i = 0$: the point is not a support vector and has no effect on the weights, since $w = \sum_i \alpha_i y^{(i)} x^{(i)}$.
After finding the optimal values for $\alpha$, we can recover the primal solution.

We obtain the optimal weights:

$$w^* = \sum_{i=1}^{n}\alpha_i y^{(i)} x^{(i)}$$

Let $S = \{i : 0 < \alpha_i < C\}$ be the support vectors that lie exactly on the margin. For any $i \in S$ we have $y^{(i)}(w^{*\top} x^{(i)} + b^*) = 1$, so

$$b^* = y^{(i)} - w^{*\top} x^{(i)}$$

(in practice, averaged over all $i \in S$ for numerical stability).
Kernel Tricks
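The dual problem and the resulting classifier depend on the data only through inner products $\langle x^{(i)}, x^{(j)} \rangle$, so we can replace them with a kernel $K(x, z) = \langle \phi(x), \phi(z) \rangle$ and implicitly work in a high-dimensional feature space without ever computing $\phi$. As a small sketch (the bandwidth value is a made-up choice), an RBF Gram matrix that can stand in for $X X^\top$ in the dual QP:

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    """K_ij = exp(-gamma * ||x_i - x_j||^2), usable in place of X @ X.T."""
    sq = np.sum(X**2, axis=1)                     # ||x_i||^2 for each row
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # pairwise squared distances
    return np.exp(-gamma * d2)
```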
Implementation
```python
from cvxopt import matrix, solvers
```
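A fuller sketch of how the dual could be solved with cvxopt, following the standard-form mapping above; the toy data, the tolerance, and the function name `svm_dual_fit` are my own additions, not from the original notes:

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False  # silence solver output

def svm_dual_fit(X, y, C=1.0):
    """Solve the soft-margin dual with cvxopt and recover (w*, b*, alpha)."""
    n = X.shape[0]
    # cvxopt form: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b
    P = matrix(np.outer(y, y) * (X @ X.T))            # P_ij = y_i y_j <x_i, x_j>
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # encodes 0 <= a_i <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.astype(float).reshape(1, -1))        # sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)["x"]).ravel()

    w = (alpha * y) @ X                               # w* = sum_i a_i y_i x_i
    on_margin = (alpha > 1e-6) & (alpha < C - 1e-6)   # SVs with 0 < a_i < C
    b_star = np.mean(y[on_margin] - X[on_margin] @ w) # average b* over margin SVs
    return w, b_star, alpha

# Toy usage on made-up, linearly separable data:
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = svm_dual_fit(X, y, C=10.0)
print(np.sign(X @ w + b))   # should reproduce y
```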