3 Statistical Learning

Statistical learning is closely related to, and often used interchangeably with, machine learning (ML). Here we develop ML models to predict the diabetic outcome.

3.1 Data Preprocessing

The data set contains several features measured on different scales, which need to be brought onto a common scale. This is done in the data preprocessing step, which starts by splitting the data into a training set and a test set.

library(caret) # ML package for various methods

# Create the training and test datasets
set.seed(100)
hci <- diab

# Step 1: Get row numbers for the training data
# createDataPartition() draws a stratified 80% sample, so the class balance of
# Outcome is preserved in both the training set and the held-out test set
trainRowNumbers <- createDataPartition(hci$Outcome, p = 0.8, list = FALSE)

# Step 2: Create the training dataset
trainData <- hci[trainRowNumbers,]

# Step 3: Create the test dataset
testData <- hci[-trainRowNumbers,]

# Store X and Y for later use
x <- trainData[, 1:8]
y <- trainData$Outcome

xt <- testData[, 1:8]
yt <- testData$Outcome
# str(trainData)  # inspect the structure of the new dataset
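To check that the stratified split kept the class balance, the proportion of each outcome can be compared across the two subsets. The following is a minimal sketch on a hypothetical toy data frame (`toy`, `train_toy` and `test_toy` are illustrative names, not objects from the analysis above), and it assumes the outcome is stored as a factor.

```r
library(caret)

# Hypothetical toy data standing in for `diab`: 100 rows, 30% in class 1
set.seed(100)
toy <- data.frame(
  Glucose = rnorm(100, mean = 120, sd = 30),
  Outcome = factor(rep(c(0, 1), times = c(70, 30)))
)

# Stratified 80/20 split, as in the steps above
idx <- createDataPartition(toy$Outcome, p = 0.8, list = FALSE)
train_toy <- toy[idx, ]
test_toy  <- toy[-idx, ]

# Class proportions should be close in both subsets
prop_train <- prop.table(table(train_toy$Outcome))
prop_test  <- prop.table(table(test_toy$Outcome))
print(round(rbind(train = prop_train, test = prop_test), 3))
```

If the two rows of the printed table differ markedly, the split was probably not stratified on the outcome.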

3.2 Normalization of features

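The features are rescaled to the range [0, 1] using caret's preProcess() function with method='range'. Note that a separate range model is fitted here for the training set and for the test set, so each set is scaled with its own minima and maxima. The usual practice is to fit the model on the training set only and apply it to both sets, so that the test features are on the same scale the classifier was trained on.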

preProcess_range_modeltr <- preProcess(trainData, method='range')
preProcess_range_modelts <- preProcess(testData, method='range')

trainData <- predict(preProcess_range_modeltr, newdata = trainData)
testData <- predict(preProcess_range_modelts, newdata = testData)

# Restore the Y variable saved earlier
trainData$Outcome <- y
testData$Outcome <- yt
# Rename the factor levels to valid R names; caret requires this when classProbs = TRUE
levels(trainData$Outcome) <- c("Class0", "Class1")
levels(testData$Outcome) <- c("Class0", "Class1")

#apply(trainData[, 1:8], 2, FUN=function(x){c('min'=min(x), 'max'=max(x))})
#str(trainData)

3.3 Options for training process

# Resampling and evaluation settings shared by all models
fitControl <- trainControl(
  method = 'cv',                     # k-fold cross validation
  number = 5,                        # number of folds
  savePredictions = 'final',         # save hold-out predictions for the optimal tuning parameters
  classProbs = TRUE,                 # compute class probabilities (needed for ROC)
  summaryFunction = twoClassSummary  # report ROC AUC, sensitivity and specificity
)
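When several models are compared fold by fold, it helps if every model is resampled on exactly the same folds. One way to arrange this, sketched below under the assumption that the outcome is a two-level factor, is to create the folds once with `createFolds()` and hand them to `trainControl()` through its `index` argument. The vector `outcome` and the object `sharedControl` are hypothetical stand-ins, not objects from the analysis above.

```r
library(caret)

set.seed(100)
# Hypothetical outcome vector standing in for trainData$Outcome
outcome <- factor(rep(c("Class0", "Class1"), times = c(65, 35)))

# Row indices of the training part of each of the 5 stratified folds
folds <- createFolds(outcome, k = 5, returnTrain = TRUE)

sharedControl <- trainControl(
  method = 'cv',
  number = 5,
  index = folds,                     # reuse the same folds for every model
  savePredictions = 'final',
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

print(sapply(folds, length))  # size of the training part of each fold
```

Passing the same control object to every `train()` call then makes the per-fold results directly comparable, independent of where `set.seed()` is called.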

3.4 Classical ML Models

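The ML models we have chosen are: LDA, KNN, SVM (radial kernel), a decision tree, and AdaBoost. Note that the tree is a single CART tree fitted with rpart, not a random forest, although it is labelled RF in the output below. The caret package provides a uniform interface to all of these models.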

# Step 1: Train each model; tuneLength sets how many candidate values per tuning parameter are tried
set.seed(100)
model1 <- train(Outcome ~ ., data = trainData, method = 'lda', tuneLength = 5, metric = 'ROC', trControl = fitControl)       # LDA (has no tuning parameters)

model2 <- train(Outcome ~ ., data = trainData, method = 'knn', tuneLength = 2, metric = 'ROC', trControl = fitControl)       # KNN
model3 <- train(Outcome ~ ., data = trainData, method = 'svmRadial', tuneLength = 2, metric = 'ROC', trControl = fitControl) # SVM with radial kernel
model4 <- train(Outcome ~ ., data = trainData, method = 'rpart', tuneLength = 2, metric = 'ROC', trControl = fitControl)     # CART decision tree (not a random forest)
model5 <- train(Outcome ~ ., data = trainData, method = 'adaboost', tuneLength = 2, metric = 'ROC', trControl = fitControl)  # AdaBoost

# Compare model performances using resamples()
# (model4 keeps the label RF to match the output below, although it is an rpart tree)
models_compare <- resamples(list(LDA = model1, KNN = model2, SVM = model3, RF = model4, ADA = model5))

# Summary of the model performances
summary(models_compare)

## 
## Call:
## summary.resamples(object = models_compare)
## 
## Models: LDA, KNN, SVM, RF, ADA 
## Number of resamples: 5 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA 0.8084302 0.8095930 0.8186047 0.8225000 0.8290698 0.8468023    0
## KNN 0.7332849 0.7409884 0.7819767 0.7731395 0.7904070 0.8190407    0
## SVM 0.7781977 0.8191860 0.8287791 0.8241860 0.8348837 0.8598837    0
## RF  0.6415698 0.6619186 0.7068314 0.7105233 0.7531977 0.7890988    0
## ADA 0.7741279 0.7752907 0.7796512 0.7942442 0.8098837 0.8322674    0
## 
## Sens 
##       Min. 1st Qu. Median   Mean 3rd Qu.   Max. NA's
## LDA 0.8375  0.8625 0.8750 0.8725  0.8875 0.9000    0
## KNN 0.7750  0.7875 0.8375 0.8325  0.8625 0.9000    0
## SVM 0.8000  0.8375 0.8500 0.8550  0.8875 0.9000    0
## RF  0.7250  0.7625 0.7875 0.8100  0.8500 0.9250    0
## ADA 0.7250  0.7875 0.8125 0.8075  0.8500 0.8625    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LDA 0.4651163 0.5348837 0.5813953 0.5534884 0.5813953 0.6046512    0
## KNN 0.3488372 0.4883721 0.4883721 0.4837209 0.5348837 0.5581395    0
## SVM 0.4651163 0.5116279 0.5116279 0.5488372 0.5813953 0.6744186    0
## RF  0.4186047 0.5581395 0.5813953 0.6000000 0.6511628 0.7906977    0
## ADA 0.5581395 0.5813953 0.6279070 0.6139535 0.6279070 0.6744186    0

# Draw box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)

3.5 Performance on the test data set

# Step 2: Predict on testData and compute the confusion matrix
# Using the LDA model
predicted <- predict(model1, testData[, 1:8])
# By default caret treats the first factor level (Class0) as the 'positive' class
confusionMatrix(reference = testData$Outcome, data = predicted, mode = 'everything')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Class0 Class1
##     Class0     48      5
##     Class1     52     48
##                                           
##                Accuracy : 0.6275          
##                  95% CI : (0.5457, 0.7042)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 0.7788          
##                                           
##                   Kappa : 0.3192          
##  Mcnemar's Test P-Value : 1.109e-09       
##                                           
##             Sensitivity : 0.4800          
##             Specificity : 0.9057          
##          Pos Pred Value : 0.9057          
##          Neg Pred Value : 0.4800          
##               Precision : 0.9057          
##                  Recall : 0.4800          
##                      F1 : 0.6275          
##              Prevalence : 0.6536          
##          Detection Rate : 0.3137          
##    Detection Prevalence : 0.3464          
##       Balanced Accuracy : 0.6928          
##                                           
##        'Positive' Class : Class0          
##
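The statistics reported by `confusionMatrix()` can be reproduced by hand from the four counts in the table, which makes explicit that they are computed with Class0 as the 'positive' class. The sketch below uses only the counts printed above; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference; matrix() fills column by column)
cm <- matrix(c(48, 52, 5, 48), nrow = 2,
             dimnames = list(Prediction = c("Class0", "Class1"),
                             Reference  = c("Class0", "Class1")))

# Class0 is the 'positive' class
tp <- cm["Class0", "Class0"]  # predicted Class0, truly Class0
fn <- cm["Class1", "Class0"]  # predicted Class1, truly Class0
fp <- cm["Class0", "Class1"]  # predicted Class0, truly Class1
tn <- cm["Class1", "Class1"]  # predicted Class1, truly Class1

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced    <- (sensitivity + specificity) / 2

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, Precision = precision,
              F1 = f1, BalancedAccuracy = balanced), 4))
```

Swapping the roles of the two classes gives a sensitivity of 48/53 (about 0.906) and a specificity of 0.48 for Class1. In caret, this view is available through the `positive = 'Class1'` argument of `confusionMatrix()`.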