Classification with a bagging (treebag) method in R

   Bagging (bootstrap aggregating) is an ensemble learning method used to improve model accuracy in regression and classification problems. The basic idea is to draw multiple bootstrap samples (random samples taken with replacement) from the training data, fit a model to each sample, and combine their predictions, by majority vote for classification or by averaging for regression. The aggregated model is usually more stable and accurate than any single model.
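To make the idea concrete, here is a minimal hand-rolled sketch of bagging. It is only an illustration, not how caret implements it: it assumes the rpart package for the individual trees, and the number of models (10) and the seed are arbitrary choices.

```r
library(rpart)  # CART trees; assumed here only for illustration

data(iris)
set.seed(1)

n_models <- 10  # number of bootstrap replicates (arbitrary)

# Fit one tree per bootstrap sample (rows drawn with replacement)
models <- lapply(seq_len(n_models), function(i) {
  boot_idx <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[boot_idx, ])
})

# Collect each tree's class prediction: one column per tree
votes <- sapply(models, function(m) {
  as.character(predict(m, iris, type = "class"))
})

# Majority vote across trees gives the bagged prediction for each row
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

# Accuracy on the same data used for fitting, so it is optimistic
mean(bagged == iris$Species)
```

The accuracy here is measured on the data the trees were fitted to, which is why the rest of the post holds out a separate test set.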

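   In this post, we'll go through simple usage of bagging for a classification problem in R, first with caret's 'bag' function and then with the 'treebag' method of the 'train' function. You may read the help page of each function and other resources if you are interested in learning more about them.
   We need the caret library and the iris dataset in this tutorial. caret may ask you to install additional packages it relies on for these models, such as party and ipred. We'll start by loading the library and the data.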

library(caret)
data(iris)

Preparing data 

Next, we'll split the iris dataset into train and test parts, keeping 90 percent of the rows for training.

set.seed(12) 
indexes <- createDataPartition(iris$Species, p = .9, list = FALSE)
train <- iris[indexes, ]
test <- iris[-indexes, ]
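A quick check of the split can be useful before modelling. The sketch below repeats the split from above and tabulates the classes; since createDataPartition samples within each class, both parts should stay balanced (45 rows per species in train and 5 per species in test).

```r
library(caret)
data(iris)

set.seed(12)
indexes <- createDataPartition(iris$Species, p = .9, list = FALSE)
train <- iris[indexes, ]
test <- iris[-indexes, ]

# Class counts in each part of the split
table(train$Species)
table(test$Species)
```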

The 'bag' function
 
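The 'bag' function requires a bagControl object that tells it how to fit each model, how to predict from it, and how to aggregate the predictions. Here we use the ctreeBag helper functions that ship with caret, which build conditional inference trees, and define the control as below.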

bagCtrl <- bagControl(fit = ctreeBag$fit,
                      predict = ctreeBag$pred,
                      aggregate = ctreeBag$aggregate)

Next, we fit a model with the 'bag' function.

fit <- bag(Species~., data = train, bagControl = bagCtrl)
print(fit)

Call:
bag.formula(formula = Species ~ ., data = train, bagControl = bagCtrl)


B: 10 
Training data: 4 variables and 135 samples
All variables were used in each model
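The "B: 10" line in the output is the number of bagged models, which is what 'bag' used here without being told otherwise. As a hedged sketch, the count can be raised through the B argument; the value 25 and the name fit25 are arbitrary choices for illustration, and the code assumes the caret and party packages are installed.

```r
library(caret)
data(iris)

set.seed(12)
indexes <- createDataPartition(iris$Species, p = .9, list = FALSE)
train <- iris[indexes, ]

bagCtrl <- bagControl(fit = ctreeBag$fit,
                      predict = ctreeBag$pred,
                      aggregate = ctreeBag$aggregate)

# Same model as above, but with 25 bootstrap replicates instead of 10
fit25 <- bag(Species ~ ., data = train, B = 25, bagControl = bagCtrl)
print(fit25)
```

More replicates generally make the aggregated prediction more stable, at the cost of longer fitting time.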

Finally, we'll predict the test data and print the result.

pred <- predict(fit, test)
df <- data.frame(predicted = pred, actual = test$Species)
print(df)
    predicted     actual
1      setosa     setosa
2      setosa     setosa
3      setosa     setosa
4      setosa     setosa
5      setosa     setosa
6  versicolor versicolor
7  versicolor versicolor
8  versicolor versicolor
9  versicolor versicolor
10 versicolor versicolor
11  virginica  virginica
12  virginica  virginica
13  virginica  virginica
14  virginica  virginica
15 versicolor  virginica
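Reading the table row by row works for 15 rows, but a confusion matrix and an accuracy figure summarize it better. The sketch below rebuilds the predicted and actual columns from the printed output above so that it runs on its own; in the real session the same table could be built from df$predicted and df$actual.

```r
classes <- c("setosa", "versicolor", "virginica")

# Values copied from the printed result above
actual <- factor(rep(classes, each = 5), levels = classes)
predicted <- factor(c(rep("setosa", 5),
                      rep("versicolor", 5),
                      rep("virginica", 4), "versicolor"),
                    levels = classes)

# Rows are predictions, columns are true classes
conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Share of correct predictions: 14 of 15
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

The single error is a virginica flower predicted as versicolor. caret also provides a confusionMatrix function that reports this table together with further statistics.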

The caret 'train' function

The caret 'train' function takes a training control object, which we define below. Here, we use 5-fold cross-validation.

trCtrl <- trainControl(method = "cv", number = 5)

Next, we build a model with the 'train' function.

cr.fit <- train(Species~., data = train, method = "treebag",
               trControl = trCtrl, metric = "Accuracy")
 
print(cr.fit)
Bagged CART 

135 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 108, 108, 108, 108, 108 
Resampling results:

  Accuracy   Kappa    
  0.9407407  0.9111111

Then, we predict the test data and print the result.

cr.pred <- predict(cr.fit, test)
 
cr.df <- data.frame(predicted = cr.pred, actual = test$Species)
print(cr.df)
    predicted     actual
1      setosa     setosa
2      setosa     setosa
3      setosa     setosa
4      setosa     setosa
5      setosa     setosa
6  versicolor versicolor
7  versicolor versicolor
8  versicolor versicolor
9  versicolor versicolor
10 versicolor versicolor
11  virginica  virginica
12  virginica  virginica
13  virginica  virginica
14  virginica  virginica
15 versicolor  virginica


   In this post, we have learned how to use the 'bag' function and the 'treebag' method of 'train' for a classification problem in R. I hope you have found this post useful.
