SVM & Decision Tree
10 September, 2020
Dataset: SGEMM GPU kernel
Dataset Description: This data consists of 241600 observations, one per possible parameter combination of an SGEMM GPU kernel. The last four columns give the running times of four independent runs of the kernel, and since the prediction target is the running time, the average of these four is treated as the response variable.
Support Vector Machines
SVM is a supervised learning algorithm. It can be used for classification as well as for regression. For classification, the data points in our dataset need not be linearly separable; some other function may separate the data well. We therefore need different functions to classify our data points, and SVM provides this through its kernels. The similarity function used to build features for classification is called the feature (kernel) function, and the reference points against which similarity is measured are called landmarks. The training points that end up defining the separating boundary are the support vectors, which give the Support Vector Machine its name.
SVM also provides a parameter C that controls how the data points are separated. If the value of C is small, we prefer the maximum-margin classifier, i.e., a separator with a large margin, so that accuracy stays good on data points not included in the training set: the wide margin gives points that differ somewhat from the training data room to still be classified correctly. But if misclassification of the training data is not acceptable, the separator that classifies the training data best is used instead; this corresponds to a large value of C.
Also, the cost as a function of the coefficients is convex, so there is a strong guarantee that the coefficients at which the cost is minimum can be found.
Hence SVM provides a flexible way to classify the data, because different kernels let us test different ways of separating it. It also provides a regularization parameter to obtain the lowest misclassification rate for the type of data used, and since its objective is convex, the risk of the search getting stuck in a local minimum is removed.
Linear Kernel Function:
If the data can be classified by a linear function of the independent variables, then this kernel is used: the similarity function is simply a linear function.
Here SVM is used with the linear kernel for the following values of the regularization parameter: 0.001, 0.005, 0.01, 0.02, 0.05. The number of folds used for cross-validation is 7.
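A minimal sketch of this experiment, assuming scikit-learn; the make_classification call is a hypothetical stand-in for the SGEMM data, which is not reproduced here:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Synthetic stand-in for the SGEMM features and binarized response
    X, y = make_classification(n_samples=2000, n_features=14, random_state=0)

    # Evaluate each regularization value with 7-fold cross-validation
    for C in [0.001, 0.005, 0.01, 0.02, 0.05]:
        clf = SVC(kernel="linear", C=C)
        scores = cross_val_score(clf, X, y, cv=7)
        print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")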
Below is the graph showing accuracy as a function of the regularization parameter, along with the classification report and confusion matrix:
Gaussian Kernel Function:
The Gaussian Kernel is used when the feature function classifying the data points is the similarity K(x, l) = exp(-||x - l||^2 / (2 * sigma^2)), where l is a landmark point.
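As a small illustration of this similarity (a NumPy sketch; the points and sigma are arbitrary):

    import numpy as np

    def gaussian_kernel(x, l, sigma=1.0):
        # Similarity of point x to landmark l: exp(-||x - l||^2 / (2 * sigma^2))
        return np.exp(-np.linalg.norm(x - l) ** 2 / (2 * sigma ** 2))

    l = np.array([1.0, 2.0])
    print(gaussian_kernel(np.array([1.0, 2.0]), l))  # 1.0: x sits on the landmark
    print(gaussian_kernel(np.array([4.0, 6.0]), l))  # ~3.7e-6: far away, similarity near 0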
The Gaussian Kernel is applied to the training data set with values of C (the regularization parameter) of 0.01, 0.02, 0.05, 0.1, 0.2. Again, 7-fold cross-validation is applied on the training data set to obtain the train and test scores.
Because we use the distance between a training point and a landmark to build the feature function, the data points must be scaled (standardized); otherwise variables measured in large units dominate the distance and the model overfits to them.
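A sketch of this run, reusing the synthetic X and y from the linear-kernel sketch above; the StandardScaler step performs the scaling just discussed:

    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    for C in [0.01, 0.02, 0.05, 0.1, 0.2]:
        # Scale features first so no variable dominates the distance ||x - l||
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
        scores = cross_val_score(model, X, y, cv=7)
        print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")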
Below is the graph showing accuracy as a function of the regularization parameter, along with the classification report and confusion matrix:
Sigmoid Kernel Function:
The Sigmoid Kernel is applied to the training data set with values of C (the regularization parameter) of 0.01, 0.02, 0.05, 0.1, 0.5. Again, 7-fold cross-validation is applied on the training data set to obtain the train and test scores.
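A sketch of this run differs from the Gaussian one only in the kernel argument and the C grid (again reusing the earlier pipeline imports and synthetic data):

    for C in [0.01, 0.02, 0.05, 0.1, 0.5]:
        model = make_pipeline(StandardScaler(), SVC(kernel="sigmoid", C=C))
        scores = cross_val_score(model, X, y, cv=7)
        print(f"C={C}: mean CV accuracy = {scores.mean():.3f}")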
Below is the graph showing accuracy as a function of the regularization parameter, along with the classification report and confusion matrix:
Decision Tree
Decision Tree is a non-parametric supervised learning method. Non-parametric means we do not assume a functional form by which the data points are classified. This makes it slower, because the method must search for the classifying function itself rather than merely fitting the parameters of an assumed one. But since it is not restricted to a predefined function, it can draw on a broad range of classifying functions and can therefore be more accurate.
In the creation of a decision tree, the whole data set is divided into nodes such that the elements within a node are as similar to each other as possible; this is described as reducing impurity, or maximizing information gain. Here the Gini criterion is used for splitting the data points rather than entropy: both criteria classify with almost the same accuracy, but entropy requires computing logarithms, which is computationally more expensive. Gini is also a common choice when the response variable is binary, as it is here.
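To make the comparison concrete, here is a small sketch computing both impurity measures for a binary node (the 70/30 split is an arbitrary example):

    import numpy as np

    def gini(p):
        # Gini impurity: 1 - sum(p_k^2); no logarithm needed
        p = np.asarray(p, dtype=float)
        return 1.0 - np.sum(p ** 2)

    def entropy(p):
        # Entropy in bits: -sum(p_k * log2(p_k)), ignoring zero proportions
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # A binary node with a 70/30 class split
    print(gini([0.7, 0.3]))     # 0.42
    print(entropy([0.7, 0.3]))  # ~0.881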
To decide which variable a node should split on, the impurity reduction is calculated for each candidate split, and the variable providing the greatest reduction in impurity is chosen to classify the data points. When a node is left with all of its members belonging to the same class of the output variable, the growth of the tree stops there and that node is called a "leaf" node.
The deeper the tree, the better the training data is classified, but the same is not true for the test data. Beyond a certain depth the model overfits the training data, and the test error starts to increase. Thus an optimal depth has to be chosen: not so small that the data points are poorly classified, and not so large that the model overfits the training data.
Below is the plot of Accuracy against Depth. We can see that beyond depth = 5 the training accuracy keeps increasing, while the test accuracy stays roughly the same at every depth after 5. Thus, we proceed to evaluate the test accuracy at depth = 5.
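The depth sweep behind such a plot can be sketched as follows, again on the synthetic X and y from the sketches above (the hold-out split is an assumption standing in for the actual train/test division):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for depth in range(1, 11):
        # Gini criterion, as discussed above; fixed seed for reproducibility
        tree = DecisionTreeClassifier(max_depth=depth, criterion="gini", random_state=0)
        tree.fit(X_train, y_train)
        print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))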
Keeping the depth at 5, the tree is pruned with the following values of the pruning parameter: 0.0000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.05.
From the graph above, it can be seen that the model attains its highest accuracy at tree depth = 5 and a pruning parameter (ccp_alpha) of 0.005.
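A sketch of the pruning step, assuming scikit-learn's cost-complexity pruning via the ccp_alpha parameter and the split from the previous sketch:

    for alpha in [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 0.05]:
        tree = DecisionTreeClassifier(max_depth=5, ccp_alpha=alpha, random_state=0)
        tree.fit(X_train, y_train)
        print(f"ccp_alpha={alpha}: test accuracy = {tree.score(X_test, y_test):.3f}")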
Below are the classification report and confusion matrix:
[Figure: Linear kernel — accuracy vs. regularization parameter]
[Figure: Gaussian kernel — accuracy vs. regularization parameter]
[Figure: Sigmoid kernel — accuracy vs. regularization parameter]
[Figure: Decision tree — accuracy at different depths]