Gradient Boosting Classification with GBM in R

    Boosting is one of the ensemble learning techniques in machine learning, and it is widely used in regression and classification problems. The main idea of this method is to improve (boost) weak learners sequentially and increase model accuracy by combining them into a single strong model. There are several boosting algorithms, such as gradient boosting, AdaBoost (Adaptive Boosting), XGBoost, and others.
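
    To make the idea concrete, below is a minimal, illustrative sketch of a boosting loop for a toy regression problem. This is not the gbm package itself; the rpart stumps and all variable names are just assumptions for the example. Each step fits a depth-1 tree to the current residuals and adds a shrunken version of its predictions to the ensemble.

# Illustrative boosting loop: fit a stump to the residuals at each step
library(rpart)
set.seed(1)
x = runif(100)
y = sin(2 * pi * x) + rnorm(100, sd = 0.1)
pred = rep(mean(y), 100)                  # start from a constant model
shrinkage = 0.1
for (i in 1:200) {
  resid = y - pred                        # pseudo-residuals for squared loss
  stump = rpart(resid ~ x, data = data.frame(resid, x),
                control = rpart.control(maxdepth = 1))
  pred = pred + shrinkage * predict(stump)  # add the shrunken update
}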

    In this post, we'll learn how to classify data with the gbm (Generalized Boosted Models) package's gbm() function. This package implements extensions of J. Friedman's gradient boosting machine and Freund and Schapire's AdaBoost algorithm. The tutorial covers:

  1. Preparing the data
  2. Classification with gbm
  3. Classification with caret train method
  4. Source code listing

    We'll start by loading the required packages.

library(gbm)
library(caret) 
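
If gbm and caret are not installed yet, they can be installed once from CRAN:

# install.packages(c("gbm", "caret"))   # run once if the packages are missing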


Preparing the data

    We'll use the Iris dataset as the target classification data and prepare it by splitting it into train and test parts. Here, we'll use 10 percent of the dataset as test data.

indexes = createDataPartition(iris$Species, p = .90, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]
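
Note that createDataPartition() samples at random, so the split changes on every run. For a reproducible split, set a seed before partitioning (the seed value below is arbitrary):

# Optional: fix the random seed so the train/test split is reproducible
set.seed(123)
indexes = createDataPartition(iris$Species, p = .90, list = F)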


Classification with gbm

    We'll define the gbm model and fit it with the train data. Here, we'll use the multinomial distribution, 10-fold cross-validation, a shrinkage of 0.01, and 200 trees.

mod_gbm = gbm(Species ~.,
              data = train,
              distribution = "multinomial",
              cv.folds = 10,
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 200)
 
print(mod_gbm)
gbm(formula = Species ~ ., distribution = "multinomial", data = train, 
    n.trees = 200, n.minobsinnode = 10, shrinkage = 0.01, cv.folds = 10)
A gradient boosted model with multinomial loss function.
200 iterations were performed.
The best cross-validation iteration was 200.
There were 4 predictors of which 3 had non-zero influence.
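
Since we ran cross-validation, the gbm package can also estimate the optimal number of iterations from the stored CV error with gbm.perf(); the returned value could then be passed as n.trees when predicting.

# Estimate the best iteration count from the cross-validation error
best_iter = gbm.perf(mod_gbm, method = "cv")
print(best_iter)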

The model is ready, and we'll predict the test data.

pred = predict.gbm(object = mod_gbm,
                   newdata = test,
                   n.trees = 200,
                   type = "response")

    The prediction returns class probabilities rather than labels, so for each row we'll extract the name of the class with the highest predicted probability.

labels = colnames(pred)[apply(pred, 1, which.max)]
result = data.frame(test$Species, labels)
 
print(result)
   test.Species     labels
1        setosa     setosa
2        setosa     setosa
3        setosa     setosa
4        setosa     setosa
5        setosa     setosa
6    versicolor versicolor
7    versicolor versicolor
8    versicolor versicolor
9    versicolor  virginica
10   versicolor versicolor
11    virginica versicolor
12    virginica  virginica
13    virginica  virginica
14    virginica  virginica
15    virginica  virginica
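
As a quick sanity check before the full confusion matrix, we can compute the raw accuracy, i.e. the fraction of matching labels:

# Fraction of test rows whose predicted label matches the true species
print(mean(labels == as.character(test$Species)))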

Finally, we'll check the confusion matrix.

cm = confusionMatrix(test$Species, as.factor(labels))
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          4         1
  virginica       0          1         4

Overall Statistics
                                          
               Accuracy : 0.8667          
                 95% CI : (0.5954, 0.9834)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 3.143e-05       
                                          
                  Kappa : 0.8             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8000           0.8000
Specificity                 1.0000            0.9000           0.9000
Pos Pred Value              1.0000            0.8000           0.8000
Neg Pred Value              1.0000            0.9000           0.9000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2667           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.8500           0.8500
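
If you only need the headline numbers, they can be extracted from the confusionMatrix object directly:

# Pull the overall accuracy and Kappa out of the confusion matrix object
print(cm$overall["Accuracy"])
print(cm$overall["Kappa"])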


Classification with caret train method

    In the second method, we use the caret package's train() function for model fitting. The train() function requires a train control parameter, which we can define as below.

tc = trainControl(method = "repeatedcv", number = 10)

Next, we'll define the model and train it with the train data.

model = train(Species ~., data=train, method="gbm", trControl=tc)
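
By default, train() tunes gbm over a small built-in grid. If you want to control the search yourself, a custom grid can be supplied through tuneGrid; the column names below are the tuning parameters caret expects for method = "gbm", and verbose = FALSE is passed through to gbm to silence its fitting output.

# Optional: tune over an explicit parameter grid instead of the default
grid = expand.grid(n.trees = c(100, 200),
                   interaction.depth = c(1, 2),
                   shrinkage = .01,
                   n.minobsinnode = 10)
model = train(Species ~., data = train, method = "gbm",
              trControl = tc, tuneGrid = grid, verbose = FALSE)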

We can predict the test data with the fitted model.

pred = predict(model, test)
result = data.frame(test$Species, pred)
print(result)
   test.Species       pred
1        setosa     setosa
2        setosa     setosa
3        setosa     setosa
4        setosa     setosa
5        setosa     setosa
6    versicolor versicolor
7    versicolor versicolor
8    versicolor versicolor
9    versicolor versicolor
10   versicolor versicolor
11    virginica  virginica
12    virginica versicolor
13    virginica  virginica
14    virginica  virginica
15    virginica  virginica

Finally, we'll check the confusion matrix.

cm = confusionMatrix(test$Species, as.factor(pred))
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         0
  virginica       0          1         4

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.6805, 0.9983)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : 2.523e-05       
                                          
                  Kappa : 0.9             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9091
Pos Pred Value              1.0000            1.0000           0.8000
Neg Pred Value              1.0000            0.9000           1.0000
Prevalence                  0.3333            0.4000           0.2667
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9167           0.9545


   In this tutorial, we've learned how to classify data with the gbm method in R. The full source code is listed below.


Source code listing

library(gbm)
library(caret)

indexes = createDataPartition(iris$Species, p = .90, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]

mod_gbm = gbm(Species ~.,
              data = train,
              distribution = "multinomial",
              cv.folds = 10,
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 200)
print(mod_gbm)

pred = predict.gbm(object = mod_gbm,
                   newdata = test,
                   n.trees = 200,
                   type = "response")

labels = colnames(pred)[apply(pred, 1, which.max)]
result = data.frame(test$Species, labels)
print(result)

cm = confusionMatrix(test$Species, as.factor(labels))
print(cm)

# caret train method
tc = trainControl(method = "repeatedcv", number = 10)
model = train(Species ~., data=train, method="gbm", trControl=tc)
print(model)

pred = predict(model, test)
result = data.frame(test$Species, pred)
print(result)

cm = confusionMatrix(test$Species, as.factor(pred))
print(cm)