Classification with the adabag Boosting Model in R

   AdaBoost (Adaptive Boosting) is a boosting algorithm in machine learning. The key concept of boosting is to improve weak learners and aggregate them into a combined model with higher accuracy. A weak learner is defined as a classifier with poor performance, only slightly better than random guessing. AdaBoost improves those classifiers by iteratively increasing the weights of misclassified observations, so each new learner focuses on the hard cases, and by combining the learners' weighted votes into the final model.
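To make the reweighting idea concrete, below is a minimal hand-rolled sketch of an AdaBoost-style loop, assuming rpart depth-1 stumps as the weak learners and a two-class subset of iris. It illustrates the mechanism only; it is not the adabag implementation we use later.

library(rpart)

# two-class subset of iris for a simple illustration
df = iris[iris$Species != "setosa", ]
df$Species = factor(df$Species)
w = rep(1 / nrow(df), nrow(df))   # start with uniform observation weights

for (m in 1:5) {
  # fit a depth-1 tree (a "stump") as the weak learner, using current weights
  stump = rpart(Species ~ ., data = df, weights = w,
                control = rpart.control(maxdepth = 1))
  pred = predict(stump, df, type = "class")
  miss = pred != df$Species
  err = sum(w * miss) / sum(w)    # weighted training error
  alpha = log((1 - err) / err)    # vote weight of this learner
  w = w * exp(alpha * miss)       # increase weights of misclassified rows
  w = w / sum(w)
  cat(sprintf("iteration %d: weighted error %.3f, alpha %.3f\n", m, err, alpha))
}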
 
   In this post, we'll learn how to use the adabag package's boosting function to classify data in R. The tutorial covers:
  1. Preparing data
  2. Classification with boosting
  3. Classification with boosting.cv
  4. Source code listing

We'll start by loading the required libraries.

library(adabag)
library(caret)

Preparing data

    In this tutorial, we'll use the Iris dataset as the target classification data. We'll split it into train and test parts, using 10 percent of the dataset as test data.

indexes = createDataPartition(iris$Species, p = .90, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]
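Since createDataPartition draws a stratified random sample, the exact rows selected will differ between runs; a quick check confirms the class balance of the split.

# class counts in the stratified train/test split
table(train$Species)
table(test$Species)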


Classification with boosting


   We'll define the model with the boosting function and train it with the train data. The boosting function applies the AdaBoost.M1 and SAMME algorithms using classification trees as base learners. If 'boos' is TRUE, a bootstrap sample of the training set is drawn using the observation weights at each iteration; if FALSE, every observation is used with its weight. 'mfinal' is the number of iterations, i.e., the number of trees in the ensemble.

model = boosting(Species~., data=train, boos=TRUE, mfinal=50)

We can check the components of the fitted model.

print(names(model))
[1] "formula"  "trees"   "weights"  "votes"   "prob"   "class"     
[7] "importance" "terms"      "call"

print(model$trees[1])
[[1]]
n= 135 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 135 88 versicolor (0.3185185 0.3481481 0.3333333)  
  2) Petal.Length< 2.7 43  0 setosa (1.0000000 0.0000000 0.0000000) *
  3) Petal.Length>=2.7 92 45 versicolor (0.0000000 0.5108696 0.4891304)  
    6) Petal.Width< 1.75 50  3 versicolor (0.0000000 0.9400000 0.0600000) *
    7) Petal.Width>=1.75 42  0 virginica (0.0000000 0.0000000 1.0000000) *
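The importance component listed above measures the relative contribution of each predictor across the ensemble, and we can inspect it directly. adabag also provides importanceplot for a bar-plot view of the same values.

# relative importance of each predictor in the ensemble
print(model$importance)
importanceplot(model)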

The model is ready, and we can predict the test data. The prediction output also includes the confusion matrix and the error rate.

pred = predict(model, test)

print(pred$confusion)
               Observed Class
Predicted Class setosa versicolor virginica
     setosa          5          0         0
     versicolor      0          5         0
     virginica       0          0         5
 
print(pred$error)
[1] 0
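adabag also provides the errorevol function to see how the error changes as trees are added to the ensemble, which can help when choosing 'mfinal'.

# error evolution of the ensemble on the test set
evol = errorevol(model, newdata = test)
plot(evol$error, type = "l", xlab = "iteration", ylab = "test error")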

We can also print the class probabilities for each observation in the test data.

result = data.frame(test$Species, pred$prob, pred$class)
print(result)
   test.Species         X1         X2         X3 pred.class
1        setosa 0.92897958 0.07102042 0.00000000     setosa
2        setosa 0.90999935 0.07693250 0.01306815     setosa
3        setosa 0.88902756 0.09790429 0.01306815     setosa
4        setosa 0.92897958 0.07102042 0.00000000     setosa
5        setosa 0.88902756 0.09790429 0.01306815     setosa
6    versicolor 0.01288461 0.91943143 0.06768396 versicolor
7    versicolor 0.01288461 0.84235917 0.14475622 versicolor
8    versicolor 0.03205498 0.95093238 0.01701263 versicolor
9    versicolor 0.03205498 0.95093238 0.01701263 versicolor
10   versicolor 0.03205498 0.95093238 0.01701263 versicolor
11    virginica 0.00000000 0.04468596 0.95531404  virginica
12    virginica 0.00000000 0.01577596 0.98422404  virginica
13    virginica 0.00000000 0.05561801 0.94438199  virginica
14    virginica 0.00000000 0.05561801 0.94438199  virginica
15    virginica 0.00000000 0.33446425 0.66553575  virginica
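Since caret is already loaded, we can cross-check the result with its confusionMatrix function, which also reports per-class statistics. pred$class is a character vector, so we convert it to a factor first.

# cross-check accuracy with caret
cm = confusionMatrix(factor(pred$class, levels = levels(test$Species)), test$Species)
print(cm$overall["Accuracy"])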


Classification with boosting.cv

   The boosting.cv function runs boosting with v-fold cross-validation. The data is divided into v non-overlapping subsets; in each fold, the model is trained on v-1 subsets and used to predict the held-out subset, so every observation in the dataset receives a prediction. Here, 'v' is the number of cross-validation folds.

cvmodel = boosting.cv(Species~., data=iris, boos=TRUE, mfinal=10, v=5)

We'll check the accuracy.

print(cvmodel[-1])
$confusion
               Observed Class
Predicted Class setosa versicolor virginica
     setosa         50          0         0
     versicolor      0         45         3
     virginica       0          5        47

$error
[1] 0.05333333

You can compare the original and predicted classes.

data.frame(iris$Species, cvmodel$class)
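The error component gives the cross-validated misclassification rate, so the overall accuracy can be derived directly.

# cross-validated accuracy from the error rate
print(1 - cvmodel$error)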


   In this post, we've briefly learned how to classify data with the adabag boosting model in R. The full source code is listed below.


Source code listing

library(adabag)
library(caret)

indexes = createDataPartition(iris$Species, p = .90, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]

model = boosting(Species~., data = train, boos = TRUE, mfinal = 50)
print(names(model))
print(model$trees[1])

pred = predict(model, test)
print(pred$confusion)
print(pred$error)

result = data.frame(test$Species, pred$prob, pred$class)
print(result)

# cross-validation method
cvmodel = boosting.cv(Species~., data = iris, boos = TRUE, mfinal = 10, v = 5)
print(cvmodel[-1])
print(data.frame(iris$Species, cvmodel$class))