SVM & Decision Tree
10 September, 2020
Dataset: SGEMM GPU kernel
Dataset Description: This data consists of 241600 observations, one per possible parameter combination of an SGEMM GPU kernel. The last four columns give the running times of four independent runs of the kernel, and since the prediction target is the running time, the average of these four is treated as the response variable.
Support Vector Machines
SVM is a supervised learning algorithm. It can be used for classification as well as for regression. For classification, the data points in our dataset need not be linearly separable; some other function may separate the data well. We therefore need different functions to classify our data points, and SVM provides this through its kernels. The similarity function used to build features for classification is called the feature (kernel) function, and the reference points against which similarity is measured are called landmarks. The training points that end up defining the separating boundary are the support vectors, which give the Support Vector Machine its name.
SVM also provides a parameter C that controls how the data points are separated. If the value of C is small, we prefer the maximum-margin classifier, i.e., a separator with a large margin, so that accuracy stays good on data points not included in the training set: the wide margin gives points that differ somewhat from the training data room to still be classified correctly. But if misclassification of the training data is not acceptable, the separator that classifies the training data best is used instead; this corresponds to a large value of C.
Also, the cost as a function of the coefficients is convex, so there is a strong guarantee that the coefficients at which the cost is minimum can be found.
Hence SVM provides a flexible way to classify the data, because different kernels let us test different ways of separating it. It also provides a regularization parameter to obtain the lowest misclassification rate for the type of data used, and since its objective is convex, the risk of the search getting stuck in a local minimum is removed.
Linear Kernel Function:
If the data can be classified by a linear function of the independent variables, then this kernel is used: the similarity function is simply a linear function.
Here SVM is used with the linear kernel for the following values of the regularization parameter: 0.001, 0.005, 0.01, 0.02, 0.05. The number of folds used for cross-validation is 7.
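A minimal sketch of this experiment, assuming scikit-learn; the make_classification call is a hypothetical stand-in for the SGEMM data, which is not reproduced here:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Synthetic stand-in for the SGEMM features and binarized response
    X, y = make_classification(n_samples=2000, n_features=14, random_state=0)

    # Evaluate each regularization value with 7-fold cross-validation
    for C in [0.001, 0.005, 0.01, 0.02, 0.05]:
        clf = SVC(kernel="linear", C=C)
        scores = cross_val_score(clf, X, y, cv=7)
        print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")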
Below is the graph showing accuracy as a function of the regularization parameter, along with the classification report and confusion matrix:
Gaussian Kernel Function:
The Gaussian Kernel is used when the feature function classifying the data points is the similarity K(x, l) = exp(-||x - l||^2 / (2 * sigma^2)), where l is a landmark point.
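As a small illustration of this similarity (a NumPy sketch; the points and sigma are arbitrary):

    import numpy as np

    def gaussian_kernel(x, l, sigma=1.0):
        # Similarity of point x to landmark l: exp(-||x - l||^2 / (2 * sigma^2))
        return np.exp(-np.linalg.norm(x - l) ** 2 / (2 * sigma ** 2))

    l = np.array([1.0, 2.0])
    print(gaussian_kernel(np.array([1.0, 2.0]), l))  # 1.0: x sits on the landmark
    print(gaussian_kernel(np.array([4.0, 6.0]), l))  # ~3.7e-6: far away, similarity near 0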
The Gaussian Kernel is applied to the training data set with values of C (the regularization parameter) of 0.01, 0.02, 0.05, 0.1, 0.2. Again, 7-fold cross-validation is applied on the training data set to obtain the train and test scores.
Because we use the distance between a training point and a landmark to build the feature function, the data points must be scaled (standardized); otherwise variables measured in large units dominate the distance and the model overfits to them.
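A sketch of this run, reusing the synthetic X and y from the linear-kernel sketch above; the StandardScaler step performs the scaling just discussed:

    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    for C in [0.01, 0.02, 0.05, 0.1, 0.2]:
        # Scale features first so no variable dominates the distance ||x - l||
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
        scores = cross_val_score(model, X, y, cv=7)
        print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")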
Below is the graph showing accuracy as a function of the regularization parameter, along with the classification report and confusion matrix:
Sigmoid Kernel Function:
The Sigmoid Kernel is applied to the training data set with values of C (the regularization parameter) of 0.01, 0.02, 0.05, 0.1, 0.5. Again, 7-fold cross-validation is applied on the training data set to obtain the train and test scores.
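A sketch of this run differs from the Gaussian one only in the kernel argument and the C grid (again reusing the earlier pipeline imports and synthetic data):

    for C in [0.01, 0.02, 0.05, 0.1, 0.5]:
        model = make_pipeline(StandardScaler(), SVC(kernel="sigmoid", C=C))
        scores = cross_val_score(model, X, y, cv=7)
        print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")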
Below is the graph showing accuracy as a function of the regularization parameter, along with the classification report and confusion matrix:
Decision Tree
Decision Tree is a non-parametric supervised learning method. Non-parametric means we do not assume a functional form by which the data points are classified. This makes it slower, because the method must search for the classifying function itself rather than merely fitting the parameters of an assumed one. But since it is not restricted to a predefined function, it can draw on a broad range of classifying functions and can therefore be more accurate.
In the creation of a decision tree, the whole data set is divided into nodes such that the elements within a node are as similar to each other as possible; this is described as reducing impurity, or maximizing information gain. Here the Gini criterion is used for splitting the data points rather than entropy: both criteria classify with almost the same accuracy, but entropy requires computing logarithms, which is computationally more expensive. Gini is also a common choice when the response variable is binary, as it is here.
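To make the comparison concrete, here is a small sketch computing both impurity measures for a binary node (the 70/30 split is an arbitrary example):

    import numpy as np

    def gini(p):
        # Gini impurity: 1 - sum(p_k^2); no logarithm needed
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        # Entropy in bits: -sum(p_k * log2(p_k)), ignoring zero proportions
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # A binary node with a 70/30 class split
    print(gini([0.7, 0.3]))     # 0.42
    print(entropy([0.7, 0.3]))  # ~0.881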
To decide which variable a node should split on, the impurity reduction is calculated for each candidate split, and the variable providing the greatest reduction in impurity is chosen to classify the data points. When a node is left with all of its members belonging to the same class of the output variable, the growth of the tree stops there and that node is called a "leaf" node.
The deeper the tree, the better the training data is classified, but the same is not true for the test data. Beyond a certain depth the model overfits the training data, and the test error starts to increase. Thus an optimal depth has to be chosen: not so small that the data points are poorly classified, and not so large that the model overfits the training data.
Below is the plot of Accuracy against Depth. We can see that beyond depth = 5 the training accuracy keeps increasing, while the test accuracy stays roughly the same at every depth after 5. Thus, we proceed to evaluate the test accuracy at depth = 5.
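The depth sweep behind such a plot can be sketched as follows, again on the synthetic X and y from the sketches above (the hold-out split is an assumption standing in for the actual train/test division):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in range(1, 11):
        # Gini criterion, as discussed above; fixed seed for reproducibility
        tree = DecisionTreeClassifier(max_depth=depth, criterion="gini", random_state=0)
        tree.fit(X_train, y_train)
        print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))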
Keeping the depth at 5, the tree is pruned with the following values of the pruning parameter: 0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.05.
From the graph above, it can be seen that the model attains its highest accuracy at tree depth = 5 and a pruning parameter (ccp_alpha) of 0.005.
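A sketch of the pruning step, assuming scikit-learn's cost-complexity pruning via the ccp_alpha parameter and the split from the previous sketch:

    for alpha in [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.05]:
        tree = DecisionTreeClassifier(max_depth=5, ccp_alpha=alpha, random_state=0)
        tree.fit(X_train, y_train)
        print(f"ccp_alpha={alpha}: test accuracy = {tree.score(X_test, y_test):.3f}")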
Below are the classification report and confusion matrix:
[Figure: Linear kernel — accuracy vs. regularization parameter]
[Figure: Gaussian kernel — accuracy vs. regularization parameter]
[Figure: Sigmoid kernel — accuracy vs. regularization parameter]
[Figure: Decision tree — accuracy at different depths]