Outlier check with SVM novelty detection in R

   Support vector machines (SVMs) are widely used for classification, regression, and novelty detection. In this post, I will show how to use one-class novelty detection to find outliers in a given dataset. We use the kernel-based ksvm function of the kernlab package and the svm function of the e1071 package.

Using the kernel-based SVM method (ksvm)

   The kernlab package provides kernel-based functions in R. You may need to install it first, for example with install.packages("kernlab"), if it is not available on your machine.

> library(kernlab)

First, we generate random test data for this tutorial.

> test <- runif(100)*10

We add some random values to the test data as outliers.

> test[sample(1:100,5)] <- sample(10:20,5)
> head(test, 10)
 [1]  5.31377537  2.10634674  0.04266631  5.82521947  8.29579331  6.42224791
 [7]  9.66175330  1.16918444  6.35891347 19.00000000

Here, we replaced five randomly chosen elements of the test data with five integers drawn from the range 10 to 20. They act as outliers because they are higher than all the other values, which lie below 10.
The plot below shows the data we have created.

> plot(test, type="l", col="blue")
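Because the data above is random, your values will differ from mine on every run. Here is a minimal sketch of a reproducible variant, assuming an arbitrary seed of 123. The names test_demo, inject_idx, and truth_idx are my own for illustration and are not used elsewhere in this post.

```r
# Reproducible variant of the data generation step (the seed value is arbitrary).
set.seed(123)
test_demo <- runif(100) * 10                 # 100 uniform values below 10
inject_idx <- sample(1:100, 5)               # 5 distinct positions to overwrite
test_demo[inject_idx] <- sample(10:20, 5)    # 5 distinct integers from 10 to 20

# Every injected value is >= 10 and every other value is < 10,
# so the injected positions can be recovered exactly.
truth_idx <- which(test_demo >= 10)
print(sort(inject_idx))
print(truth_idx)
```

Keeping the injected positions around makes it easy to check later how many of them a model actually finds.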


The test data is ready, so we can build a model.

> svm_model=kernlab::ksvm(test,nu=0.09, type="one-svc", kernel="vanilladot")

The ksvm function arguments used here are:
  • x (the first argument) - the data to use; here it is the test vector.
  • nu - sets the upper bound on the training error and the lower bound on the fraction of data points that become support vectors.
  • type - the ksvm usage type. Here we use one-svc, which is the novelty detection type.
  • kernel - the kernel function used in training and predicting. vanilladot is the linear kernel function.

You can get more information about the usage and argument types on the help page of the ksvm function (?ksvm).
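To see what the nu argument does in practice, one option is to look at how many support vectors the fitted model keeps. This is only a sketch under my own assumptions: it regenerates similar data with an arbitrary seed, uses the variable names x, m, and n_sv for illustration, and skips the fit when kernlab is not installed.

```r
# Regenerate data like the test vector above (the seed value is arbitrary).
set.seed(123)
x <- runif(100) * 10
x[sample(1:100, 5)] <- sample(10:20, 5)

n_sv <- NA_integer_   # stays NA when kernlab is not available
if (requireNamespace("kernlab", quietly = TRUE)) {
  # Same call as in the post, with every argument named.
  m <- kernlab::ksvm(x, nu = 0.09, type = "one-svc", kernel = "vanilladot")
  n_sv <- kernlab::nSV(m)   # number of support vectors in the fitted model
  cat("support vectors:", n_sv, "of", length(x), "\n")
  cat("fraction:", n_sv / length(x), "(nu was 0.09)\n")
}
```

Comparing the printed fraction with nu gives a rough feel for the bound described in the list above.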

Next, we predict with the model and get the result.

> get_index=predict(svm_model)
> head(get_index)
      [,1]
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] FALSE
[5,] FALSE
[6,] FALSE

Now we extract the indices where the prediction is TRUE and look up the corresponding values.

> out_index=which(get_index[,1]==TRUE)
> out_index
[1] 10 38 46 97
> test[out_index]
[1] 19 15 16 20

Four of the five injected outliers are detected in the vector.
Finally, we mark them in a plot.

> plot(test, col="blue", type="l")
> points(x=out_index, y=test[out_index], pch=18, col="red")
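Since the injected values are the only ones at or above 10, the flagged positions can be checked against the true ones. Below is a small sketch of such a check. The helper compare_flags and the toy vector are hypothetical and only illustrate the idea; they are not the data from this post.

```r
# Hypothetical helper: compare flagged positions with known outlier positions.
compare_flags <- function(flagged_idx, truth_idx) {
  list(
    hits         = intersect(flagged_idx, truth_idx),  # flagged and injected
    missed       = setdiff(truth_idx, flagged_idx),    # injected, not flagged
    false_alarms = setdiff(flagged_idx, truth_idx)     # flagged, not injected
  )
}

# Toy input, not the post's data: positions 3 and 7 hold values >= 10.
toy <- c(2.5, 7.1, 15, 4.4, 9.0, 0.3, 12, 6.6)
truth_idx <- which(toy >= 10)
flagged_idx <- c(3L)   # pretend a model flagged only position 3
res <- compare_flags(flagged_idx, truth_idx)
str(res)
```

With the real data you would pass out_index as flagged_idx and which(test >= 10) as truth_idx.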


Using the svm function of the e1071 package

   The e1071 package also provides an svm function for regression and classification problems, including one-class classification. Make sure the e1071 package is installed on your computer.

> library(e1071)

We use the same test data and build the model with the svm function.

> svm <- e1071::svm(test, nu=0.09, type="one-classification", kernel="linear")

The svm model description

> svm
Call:
svm.default(x = test, type = "one-classification", kernel = "linear", 
    nu = 0.09)


Parameters:
   SVM-Type:  one-classification 
 SVM-Kernel:  linear 
      gamma:  1 
         nu:  0.09 

Number of Support Vectors:  10

Predicting on the data and getting the indices of the TRUE values.

> out_svm <- predict(svm)
> index_svm <- which(out_svm==TRUE)

Checking the output data.

> test[index_svm]
[1] 19 15 16 20

Plotting the result in a chart.

> plot(test, col="blue", type="l")
> points(x=index_svm, y=test[index_svm],pch=18, col="orange")


As you can see, we get the same result with the svm and ksvm functions.
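Instead of comparing the two results by eye, the index vectors can be compared in code. The sketch below uses toy stand-ins for the two predict() results. Their shapes are my assumption: a one-column logical matrix as shown for ksvm above, and a named logical vector for svm. All variable names here are made up for illustration.

```r
# Toy stand-ins for the two predict() results (shapes are assumptions).
ksvm_like <- matrix(c(FALSE, TRUE, FALSE, TRUE), ncol = 1)        # one-column matrix
svm_like  <- c("1" = FALSE, "2" = TRUE, "3" = FALSE, "4" = TRUE)  # named vector

idx_a <- which(ksvm_like[, 1])   # plain integer positions
idx_b <- which(svm_like)         # positions that keep the names "2" and "4"

# identical() also compares names, so strip them before comparing positions.
same_with_names    <- identical(idx_a, idx_b)
same_without_names <- identical(idx_a, unname(idx_b))
print(c(with_names = same_with_names, without_names = same_without_names))
```

If the names get in the way, setequal(idx_a, idx_b) is another way to compare just the positions.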

Conclusion

   We detected outliers in simple, simulated data with the ksvm and svm functions.
   There are cases where the ksvm and svm novelty detection functions may not work well. To get an accurate result, we have to tune the parameters of the ksvm and svm functions correctly, especially the nu argument. For this test, I tried many nu values before settling on one. If you don't get the correct result in your analysis, check and adjust the nu parameter.
   Depending on your target function and data, check the other arguments to improve your results.
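One simple way to explore nu is to fit the model over a few values and count how many points come back TRUE each time. The sketch below does this with the svm function. The seed, the grid of nu values, and the names x, nu_grid, and flagged are my own choices for illustration, and the loop is skipped when e1071 is not installed. It is also worth confirming on your own data which side TRUE marks before treating it as the outlier flag.

```r
# Regenerate data like the test vector in this post (the seed value is arbitrary).
set.seed(123)
x <- runif(100) * 10
x[sample(1:100, 5)] <- sample(10:20, 5)

nu_grid <- c(0.01, 0.05, 0.09, 0.2)   # illustrative values, not a recommendation
flagged <- NULL                        # stays NULL when e1071 is not available

if (requireNamespace("e1071", quietly = TRUE)) {
  flagged <- sapply(nu_grid, function(nu) {
    m <- e1071::svm(x, nu = nu, type = "one-classification", kernel = "linear")
    sum(predict(m))                    # number of TRUE predictions on the training data
  })
  print(data.frame(nu = nu_grid, n_true = flagged))
}
```

A nu value whose TRUE count is close to the number of outliers you expect is a reasonable starting point for further checks.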

   In this post, we've briefly learned how to use the ksvm and svm functions to detect outlier points in vector data. If you have any comments, please leave them below. Thank you!

