1 Introduction
At the center of any complex engineering application or analysis is the data, or dataset. It is the root on which entire organizations and corporations run their business. A popular saying nowadays is that "data is the new oil." In this book, we focus on analyzing biomedical data.
1.1 Dataset
The biomedical data we are going to use is obtained from the Kaggle website. The dataset contains information about diabetes in a cohort of sample subjects. It originates from a research study of the National Institute of Diabetes and Digestive and Kidney Diseases. The purpose of the dataset is to predict whether or not a patient has diabetes, based on certain test measurements included in the dataset. All patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictors (features) and one target (response) variable named Outcome. We will first import the dataset into the workspace. Before that, we have to set up RStudio for carrying out the analysis. First, set the working directory using setwd(). Second, load the required libraries into the R Markdown file, as shown here.
setwd("C:/Users/RajuPC/Documents/MyR/RajuBook") # Set the working directory (adjust to your own machine)
library(tidyverse) # Required for analysis and visualization
library(plotly) # Required for interactive plotting of graphs
library(ggsci) # Color palettes for the plots
diab <- read_csv("diabetes.csv") # Read the CSV file and assign the result to the diab object
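For readers who do not have the CSV file at hand yet, the following is a minimal sketch of what the import step produces. It reads the first three records of the dataset (copied from the output printed later in this chapter) from an inline string. Note the assumptions: the object names csv_text and diab_demo are hypothetical, and base R's read.csv() is used here instead of read_csv() only so that the sketch runs without any extra packages.

```r
# First three records of the dataset, copied from the printed output in this chapter.
csv_text <- "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1"

# read.csv() accepts literal CSV text through its `text` argument.
# diab_demo is a hypothetical name used only for this illustration.
diab_demo <- read.csv(text = csv_text)

dim(diab_demo)            # number of rows and number of columns
sapply(diab_demo, class)  # storage type guessed for each column
```

As with read_csv(), the column types are guessed from the values: whole-number columns are read as integers, and columns with decimals, such as BMI, are read as doubles.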
1.2 Data Exploration
The diab object is an R data frame (more precisely, a tibble), and we can inspect the data in several ways, as follows:
head(diab) # Elements of first few rows
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <int> <int> <int> <int> <int> <dbl>
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## # ... with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <int>,
## # Outcome <int>
tail(diab) # Elements of last few rows
## # A tibble: 6 x 9
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## <int> <int> <int> <int> <int> <dbl>
## 1 9 89 62 0 0 22.5
## 2 10 101 76 48 180 32.9
## 3 2 122 70 27 0 36.8
## 4 5 121 72 23 112 26.2
## 5 1 126 60 0 0 30.1
## 6 1 93 70 31 0 30.4
## # ... with 3 more variables: DiabetesPedigreeFunction <dbl>, Age <int>,
## # Outcome <int>
colnames(diab) # Column names, i.e. the names of the predictors and the outcome variable
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
str(diab) # Structure of the dataset
## Classes 'tbl_df', 'tbl' and 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 9
## .. ..$ Pregnancies : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Glucose : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BloodPressure : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ SkinThickness : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Insulin : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BMI : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ DiabetesPedigreeFunction: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Age : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Outcome : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
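The output of str() is rather long for a tibble created by read_csv(), because it also prints the column specification. When only a quick overview is needed, a few base R calls may be enough. The following is a small sketch under stated assumptions: demo is a hypothetical data frame holding just four of the nine columns for the first six subjects (values copied from the output above), so that the example is self-contained.

```r
# Hypothetical six-row subset of the dataset (values copied from the output above).
demo <- data.frame(
  Glucose = c(148L, 85L, 183L, 89L, 137L, 116L),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  Age     = c(50L, 31L, 32L, 21L, 33L, 30L),
  Outcome = c(1L, 0L, 1L, 0L, 1L, 0L)
)

dim(demo)                          # number of rows and number of columns
vapply(demo, class, character(1))  # one class name per column
colSums(is.na(demo))               # number of missing (NA) values per column
```

The same three calls can be applied to diab directly once the full dataset has been imported.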
1.3 Descriptive statistics
Using the summary() function, one can easily obtain descriptive statistics for each column of the imported data frame.
summary(diab)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
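To see where the six numbers reported by summary() come from, the sketch below recomputes them by hand for a short vector. The vectors glucose and outcome hold the first ten values of the Glucose and Outcome columns as printed by str() above, and the name manual is hypothetical. The sketch assumes R's default quantile method, which interpolates between neighboring values.

```r
# First ten Glucose and Outcome values, copied from the str() output above.
glucose <- c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125)
outcome <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)

summary(glucose)  # minimum, quartiles, median, mean, and maximum in one call

# The same six statistics, computed one at a time.
manual <- c(
  Min    = min(glucose),
  Q1     = unname(quantile(glucose, 0.25)),
  Median = median(glucose),
  Mean   = mean(glucose),
  Q3     = unname(quantile(glucose, 0.75)),
  Max    = max(glucose)
)
manual

# For a 0/1 variable, the mean is the proportion of ones.
mean(outcome)
```

The last line explains how to read the Outcome column in the table above: a mean of 0.349 indicates that about 34.9% of the 768 subjects are coded as 1.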
Factoring the variables
Factors are a special data type in R used for categorical variables. Here, however, the outcome variable is stored as an integer. It is better to represent categorical variables as factors in R, which can be done as follows:
diab$Outcome<-factor(diab$Outcome)
class(diab$Outcome)
## [1] "factor"
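A factor can also carry descriptive labels, which often makes tables and plot legends easier to read. The following is a minimal sketch, not part of the analysis above: the vector outcome holds the first ten Outcome values printed earlier, the labels "No" and "Yes" are our own assumption about how one might name the two classes, and the object names outcome_f and recovered are hypothetical.

```r
# First ten Outcome values, copied from the str() output earlier in the chapter.
outcome <- c(1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L)

# Map 0 and 1 to descriptive labels (the label names are an assumption).
outcome_f <- factor(outcome, levels = c(0, 1), labels = c("No", "Yes"))

levels(outcome_f)  # the categories, in the order given by `levels`
table(outcome_f)   # number of observations in each category

# Caution: as.integer() on a factor returns the internal level codes
# (1, 2, ...), not the original values.
as.integer(factor(outcome))

# To recover the original 0/1 values, convert through character first.
recovered <- as.integer(as.character(factor(outcome)))
recovered
```

The conversion back through as.character() is worth remembering, because some modeling functions expect a numeric 0/1 response rather than a factor.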