1 Introduction

The center of any complex engineering application or analyis is the data/dataset. It is the root node that entire organizations and corporations run their business. There is a famous saying nowadays that Data is the new oil. In this book, we are going to focus about analyzing biomedical data.

1.1 Dataset

The biomedical data we are going to use is obtained from Kaggle website. The dataset contains information about diabetes of cohort of sample subjects. This dataset arises from a research study of the National Institute of Diabetes and Digestive and Kidney Diseases. The purpose of the dataset is to predict whether or not a patient has diabetes. It is based on certain test measurements included in the dataset. Here, the patients are all females at least 21 years old of Pima Indian heritage.

The datasets consists of several medical predictors/features and one target/response variable named as Outcome. We will first import the dataset into the workspace. Before that we have setup the Rstudio for carrying out the analysis. First, set the working directory using setwd(). Second, import the required library into the R Markdown file. It is shown here.

setwd("C:/Users/RajuPC/Documents/MyR/RajuBook")
library(tidyverse) #Required for analysis, visualization
library(plotly) # Required for interactive plotting of graphs
library(ggsci) #Themes package for the plots
diab<-read_csv("diabetes.csv") # It reads the CSV file and assigns to diab object

1.2 Data Exploration

The diab is a R data frame and we can see the data in many as follows:

head(diab) # Elements of first few rows

## # A tibble: 6 x 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##         <int>   <int>         <int>         <int>   <int> <dbl>
## 1           6     148            72            35       0  33.6
## 2           1      85            66            29       0  26.6
## 3           8     183            64             0       0  23.3
## 4           1      89            66            23      94  28.1
## 5           0     137            40            35     168  43.1
## 6           5     116            74             0       0  25.6
## # ... with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <int>,
## #   Outcome <int>

tail(diab) # Elements of Last few rows

## # A tibble: 6 x 9
##   Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
##         <int>   <int>         <int>         <int>   <int> <dbl>
## 1           9      89            62             0       0  22.5
## 2          10     101            76            48     180  32.9
## 3           2     122            70            27       0  36.8
## 4           5     121            72            23     112  26.2
## 5           1     126            60             0       0  30.1
## 6           1      93            70            31       0  30.4
## # ... with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <int>,
## #   Outcome <int>

colnames(diab) #Names of Columns which are the names of predictors and outcome variables

## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"

str(diab) # Structure of the dataset

## Classes 'tbl_df', 'tbl' and 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 9
##   .. ..$ Pregnancies             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Glucose                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BloodPressure           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ SkinThickness           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Insulin                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BMI                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ DiabetesPedigreeFunction: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ Age                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ Outcome                 : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

1.3 Descriptive statistics

Here using a summary() function one can easily obtain the descriptive statistics of the imported dataframe.

summary(diab)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Factoring the variables Factors are special type of datatype in R. It is used for categorical variables. But here the outcome variable is specified as integer. It is better to represent categorical variables as factors in R. This can be done as follows:

diab$Outcome<-factor(diab$Outcome)
class(diab$Outcome)

## [1] "factor"