Using Random Forest to Predict Heart Disease

By Saulo Gil

February 6, 2024

Introduction

Heart disease, also known as cardiovascular disease (CVDs), encompasses a range of conditions that affect the heart and blood vessels. It is the leading cause of death globally. It consists to a group of disorders that affect the heart and blood vessels such as coronary artery disease, stroke, peripheral artery disease, hypertension and, heart failure. It is noteworthy that CVDs are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.

Heart disease are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Indeed, CVDs can significantly impact mortality rates, especially if left untreated or poorly managed. Early diagnosis and management of risk factors can help prevent heart disease and improve outcomes for those living with the condition.

Random Forest is a popular machine learning algorithm used for both classification and regression tasks. It consists of multiple random decision trees. Briefly, two types of randomnesses are built into the trees. First, each tree is built on a random sample from the original data. Second, at each tree node, a subset of features is randomly selected to generate the best split. Thus, it is reasonable to assume that this machine learning algorithm may be useful for the early detection and treatment of heart disease and, finally, for avoiding complications and improving outcomes.

Herein, we utilized the Random Forest to predict cardiovascular disease.

Packages

# Pacotes
library(tidyverse)
library(tidymodels)
library(GGally)
library(DataExplorer)
library(doParallel)

Reading dataset

This data set dates from 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It contains 76 attributes, including the predicted attribute, but all published experiments refer to using a subset of 14 of them. The “target” field refers to the presence of heart disease in the patient. It is integer valued 0 = no disease and 1 = disease.

The features are following:

  • age;

  • sex;

  • chest pain type (4 values);

  • resting blood pressure;

  • serum cholestoral in mg/dl;

  • fasting blood sugar > 120 mg/dl;

  • resting electrocardiographic results (values 0,1,2);

  • maximum heart rate achieved;

  • exercise induced angina;

  • oldpeak = ST depression induced by exercise relative to rest;

  • the slope of the peak exercise ST segment;

  • number of major vessels (0-3) colored by flourosopy;

  • thal: 0 = normal; 1 = fixed defect; 2 = reversable defect.

Let’s load the dataset!

# Lendo a base
df <- read.csv('heart.csv') |> 
  janitor::clean_names() |> #lowercase columns
  mutate(
    heart_disease = as.factor(heart_disease)
  )

Pre-processing

Pre-processing of data is a crucial step for preparing and cleaning the dataset before feeding it into a machine learning model. It plays a significant role in improving the performance, accuracy, and reliability of the model. Therefore, let’s deep into the dataset!!!

Looking the outcome (heart disease)

df |>
  ggplot(mapping = aes(x = heart_disease, fill = sex)) +
  geom_bar() + 
  theme_bw()

df |> 
  count(heart_disease)
##   heart_disease   n
## 1             0 410
## 2             1 508

The number of classes of the outcome seems balanced!

Data engineering

df_adjust <- 
  df |>
  mutate(
    # sex for dummies
    sex = case_when(sex == "M" ~ "0",
                    sex == "F" ~ "1"),
    # creating dummies for chest_pain_type
    chest_pain_type_ATA = case_when(chest_pain_type == "ATA" ~ 1,
                                    chest_pain_type != "ATA" ~ 0),
    chest_pain_type_NAP = case_when(chest_pain_type == "NAP" ~ 1,
                                    chest_pain_type != "NAP" ~ 0),
    chest_pain_type_ASY = case_when(chest_pain_type == "ASY" ~ 1,
                                    chest_pain_type != "ASY" ~ 0),
    chest_pain_type_TA = case_when(chest_pain_type == "TA" ~ 1,
                                    chest_pain_type != "TA" ~ 0),
    # creating dummies for resting_ecg
    resting_ecg_normal = case_when(resting_ecg == "Normal" ~ 1,
                                   resting_ecg != "Normal" ~ 0),
    resting_ecg_st = case_when(resting_ecg == "ST" ~ 1,
                               resting_ecg != "ST" ~ 0),
    resting_ecg_lvh = case_when(resting_ecg == "LVH" ~ 1,
                                 resting_ecg != "LVH" ~ 0),
    # creating dummies for exercise_angina
    exercise_angina = case_when(exercise_angina == "Y" ~ 1,
                                exercise_angina == "N" ~ 0),
    # creating dummies for st_slope
    st_slope_up = case_when(st_slope == "Up" ~ 1,
                            st_slope != "Up" ~ 0),
    st_slope_flat = case_when(st_slope == "Flat" ~ 1,
                            st_slope != "Flat" ~ 0),
    st_slope_down = case_when(st_slope == "Down" ~ 1,
                            st_slope != "Down" ~ 0)
    ) |>
  # selecting features
  select(age,
         sex,
         resting_bp,
         cholesterol,
         fasting_bs,
         max_hr,
         oldpeak,
         chest_pain_type_ATA,
         chest_pain_type_NAP,
         chest_pain_type_ASY,
         chest_pain_type_TA,
         resting_ecg_normal,
         resting_ecg_st,
         resting_ecg_lvh,
         exercise_angina,
         st_slope_up,
         st_slope_flat,
         st_slope_down,
         heart_disease
         ) |>
  #adjusting type of data
  mutate(sex = as.character(sex),
         fasting_bs = as.character(fasting_bs),
         chest_pain_type_ATA = as.character(chest_pain_type_ATA),
         chest_pain_type_NAP = as.character(chest_pain_type_NAP),
         chest_pain_type_ASY = as.character(chest_pain_type_ASY),
         chest_pain_type_TA = as.character(chest_pain_type_TA),
         resting_ecg_normal = as.character(resting_ecg_normal),
         resting_ecg_st = as.character(resting_ecg_st),
         resting_ecg_lvh = as.character(resting_ecg_lvh),
         exercise_angina = as.character(exercise_angina),
         st_slope_up = as.character(st_slope_up),
         st_slope_flat = as.character(st_slope_flat),
         st_slope_down = as.character(st_slope_down)
         )

glimpse(df_adjust)
## Rows: 918
## Columns: 19
## $ age                 <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39…
## $ sex                 <chr> "0", "1", "0", "1", "0", "0", "1", "0", "0", "1", …
## $ resting_bp          <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, …
## $ cholesterol         <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, …
## $ fasting_bs          <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ max_hr              <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 1…
## $ oldpeak             <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, …
## $ chest_pain_type_ATA <chr> "1", "0", "1", "0", "0", "0", "1", "1", "0", "1", …
## $ chest_pain_type_NAP <chr> "0", "1", "0", "0", "1", "1", "0", "0", "0", "0", …
## $ chest_pain_type_ASY <chr> "0", "0", "0", "1", "0", "0", "0", "0", "1", "0", …
## $ chest_pain_type_TA  <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ resting_ecg_normal  <chr> "1", "1", "0", "1", "1", "1", "1", "1", "1", "1", …
## $ resting_ecg_st      <chr> "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", …
## $ resting_ecg_lvh     <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ exercise_angina     <chr> "0", "0", "0", "1", "0", "0", "0", "0", "1", "0", …
## $ st_slope_up         <chr> "1", "0", "1", "0", "1", "1", "1", "1", "0", "1", …
## $ st_slope_flat       <chr> "0", "1", "0", "1", "0", "0", "0", "0", "1", "0", …
## $ st_slope_down       <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", …
## $ heart_disease       <fct> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1,…

Done!!! Lest’s explore the dataset.

Spliting dataset - Train/Test

Since our dataset is size-limited (n-918), I opted to separate 70% of the sample size for training.

set.seed(32)

heart_initial_split <- initial_split(df,
                                     prop = 0.7,
                                     strata = "heart_disease")
heart_initial_split
## <Training/Testing/Total>
## <642/276/918>
# train and test datasets
heart_train <- training(heart_initial_split)

heart_test <- testing(heart_initial_split)

Exploratory analysis

Overview

skimr::skim(heart_train)
Table 1: Data summary
Name heart_train
Number of rows 642
Number of columns 12
_______________________
Column type frequency:
character 5
factor 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1 1 1 0 2 0
chest_pain_type 0 1 2 3 0 4 0
resting_ecg 0 1 2 6 0 3 0
exercise_angina 0 1 1 1 0 2 0
st_slope 0 1 2 4 0 3 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
heart_disease 0 1 FALSE 2 1: 355, 0: 287

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 53.47 9.42 29.0 47.00 54.0 60.0 77.0 ▂▅▇▇▂
resting_bp 0 1 132.62 18.99 0.0 120.00 130.0 140.0 200.0 ▁▁▃▇▁
cholesterol 0 1 196.50 107.87 0.0 175.25 221.0 263.0 603.0 ▃▇▆▁▁
fasting_bs 0 1 0.23 0.42 0.0 0.00 0.0 0.0 1.0 ▇▁▁▁▂
max_hr 0 1 137.17 25.99 70.0 119.00 138.0 157.0 202.0 ▂▆▇▇▂
oldpeak 0 1 0.87 1.07 -2.6 0.00 0.5 1.5 5.6 ▁▇▆▁▁

Missing values

plot_missing(heart_train,ggtheme = theme_bw())

Checking the correlation and distribution of numerical features

heart_train |>  
  select(where(is.numeric)) |>  
  cor(use = "pairwise.complete.obs") |> 
  corrplot::corrplot(method = "number")

heart_train |>  
  select(where(is.numeric), heart_disease) |> 
  ggpairs(aes(colour = as_factor(heart_disease))) +
    theme_bw() +
    theme(
      legend.position = "top",
      legend.title = element_blank()
      )

heart_train |>  
  select(c(where(is.numeric), heart_disease)) |> 
  pivot_longer(-heart_disease, 
               names_to = "Feature",
               values_to = "Value") |> 
  ggplot(aes(y = as.factor(heart_disease), 
             x = Value, 
             fill = as.factor(heart_disease))) +
  geom_boxplot() +
  facet_wrap(~Feature,
             scales = "free_x") +
  ggtitle("Heart_disease vs. Numeric Features") +
    theme_bw() +
    theme(
      legend.position = "top",
      legend.title = element_blank()
      )

heart_train |>  
  select(c(where(is.numeric), heart_disease)) |> 
  pivot_longer(-heart_disease, 
               names_to = "Feature",
               values_to = "Value") |> 
  ggplot(aes(x = Value, 
             colour = as.factor(heart_disease))) +
  stat_ecdf() +
  facet_wrap(~Feature, scales = "free_x") +
  labs(title = "Heart_disease vs. Numeric Features",
       subtitle = "Cumulative Distribution") +
    theme_bw() +
    theme(
      legend.position = "top",
      legend.title = element_blank()
      )

grafico_de_barras_das_vars_continuas <- 
  function(dados) {
    dados |>  
    select(c(where(is.numeric), heart_disease)) |> 
    pivot_longer(-heart_disease,
                 names_to = "Feature",
                 values_to = "Value") |> 
    dplyr::group_by(Feature) |> 
    dplyr::mutate(
      Value = factor(dplyr::ntile(Value, 10),
                     levels = 1:10)
    ) |> 
    dplyr::count(heart_disease, 
                 Feature,
                 Value) |> 
    ggplot(aes(y = (Value),
               x = n,
               fill = as.factor(heart_disease))) +
    geom_col(position = "fill") +
    geom_label(aes(label = n),
               position = position_fill(vjust = 0.5)) +
    facet_wrap(~Feature,
               scales = "free_y",
               ncol = 3) +
    ggtitle("Heart_disease vs. Numeric Features") +
    theme_bw() +
    theme(
      legend.position = "top",
      legend.title = element_blank()
      )
}

grafico_de_barras_das_vars_continuas(heart_train)

Checking the distribution of categorical features

contagens <- 
  heart_train |>  
  select(c(where(is.character),
           heart_disease)) |> 
  pivot_longer(-heart_disease,
               names_to = "Feature",
               values_to = "Value") |> 
  count(heart_disease,
        Feature,
        Value)

# tabela
contagens |> 
  pivot_wider(names_from = heart_disease,
              values_from = n) |> 
  DT::datatable()
contagens |> 
  ggplot(aes(y = Value,
             x = n,
             fill = as.factor(heart_disease))) +
  geom_col(position = "fill") +
  geom_label(aes(label = n),
             position = position_fill(vjust = 0.5)) +
  facet_wrap(~Feature,
             scales = "free_y",
             ncol = 3) +
  ggtitle("Heart_disease vs. Categorical Features") +
    theme_bw() +
    theme(
      legend.position = "top",
      legend.title = element_blank()
      )

Preprocessing and Feature Engineering Steps using Recipes package

For numeric features, the values were normalized, multicollineariry higher than 0.8 were removed and missing values were imputed with median.

For nominal features, missing values were imputed with more frequent value, and all features were transformed in a dummy feature.

Features containing only zero were removed.

The processed dataset can be seen below.

heart_recipe <-
  recipe(formula = heart_disease ~ ., heart_train) |>
  step_normalize(all_numeric_predictors()) |> # normalize all continuous features
  step_corr(all_numeric_predictors(), threshold = .8) |>  # multicollinearity <- r > 0.8
  step_impute_median(all_numeric_predictors()) |>  # Processing missing continuous data - imputing median 
  step_impute_mode(all_nominal_predictors()) |> # Processing missing categorical data - imputing more frequent class
  step_zv(all_predictors()) |>  # removing variables that contain only zero (zero variance filter)
  step_dummy(all_nominal_predictors()) # dummy - nominal predictors

# Checking the result of the recipe (and whether the steps worked!)
heart_recipe |> 
    prep() |> 
    bake(new_data = heart_train) |>
    glimpse()
## Rows: 642
## Columns: 16
## $ age                 <dbl> -1.4292242, -1.5353499, -0.8985962, 0.0565342, -0.…
## $ resting_bp          <dbl> 0.3887070, -0.6642461, -0.1377696, -1.1907226, -0.…
## $ cholesterol         <dbl> 0.85757304, 1.32110261, 0.37550228, 0.10665512, 0.…
## $ fasting_bs          <dbl> -0.5517274, -0.5517274, -0.5517274, -0.5517274, -0…
## $ max_hr              <dbl> 1.339869171, 1.262925121, 1.262925121, 0.185708420…
## $ oldpeak             <dbl> -0.8149987, -0.8149987, -0.8149987, -0.8149987, -0…
## $ heart_disease       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sex_M               <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0,…
## $ chest_pain_type_ATA <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,…
## $ chest_pain_type_NAP <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
## $ chest_pain_type_TA  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ resting_ecg_Normal  <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ resting_ecg_ST      <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exercise_angina_Y   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ st_slope_Flat       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ st_slope_Up         <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Definition of cross-validation

The cross-validation was set-up in 5 folds.

heart_resamples <- 
  vfold_cv(heart_train,
           v = 5, 
           strata = heart_disease) # 5 folds

heart_resamples$splits
## [[1]]
## <Analysis/Assess/Total>
## <513/129/642>
## 
## [[2]]
## <Analysis/Assess/Total>
## <513/129/642>
## 
## [[3]]
## <Analysis/Assess/Total>
## <514/128/642>
## 
## [[4]]
## <Analysis/Assess/Total>
## <514/128/642>
## 
## [[5]]
## <Analysis/Assess/Total>
## <514/128/642>

Modeling - Random Forest

The hyperameters tuned were:

  • mtry: It refers to an integer for the number of predictors that will be randomly sampled at each split when creating the tree models;

  • trees: It refers to an integer for the number of trees contained in the ensemble;

  • min_n: It refers to an integer for the minimum number of data points in a node that are required for the node to be split further.

Model

# Modelo
heart_rf_model <- rand_forest(
  mtry = tune(),
  trees = tune(),
  min_n = tune()
) |>
  set_mode("classification") |>
  set_engine("ranger")

Workflow

# Workflow
heart_rf_wf <- workflow() |>
  add_model(heart_rf_model) |>
  add_recipe(heart_recipe)

heart_rf_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## • step_normalize()
## • step_corr()
## • step_impute_median()
## • step_impute_mode()
## • step_zv()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Computational engine: ranger

Tuning

A Random grid was used for tuning the hyperparameters.

# Tune
heart_grid_rf <- grid_random(
  min_n(range = c(20, 80)),
  mtry(range = c(4, 18)),
  trees(range = c(200, 400))
)

heart_grid_rf |> 
  DT::datatable()
doParallel::registerDoParallel(4)

heart_rf_tune_grid <-
  tune_grid(
  heart_rf_wf,
  resamples = heart_resamples,
  grid = heart_grid_rf,
  metrics = metric_set(roc_auc, accuracy, sensitivity, specificity),
  verbose = TRUE
  )
autoplot(heart_rf_tune_grid) +
  theme_bw()

# codes to look metrics --- remove comment to run
# collect_metrics(heart_rf_tune_grid) |> 
#   filter(.metric == "roc_auc") |> 
#   filter(mean == max(mean))
# 
# heart_rf_tune_grid |> 
#   select_best(metric = "roc_auc")
# 
# show_best(heart_rf_tune_grid, n = 6)
heart_rf_tune_grid |> 
  select_best(metric = "roc_auc") |> 
  select(-".config") |> 
  DT::datatable(caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left;',
    "Best hyperparameters."))

Performance

heart_rf_best_params <- select_best(heart_rf_tune_grid,
                                    metric =  "roc_auc")

heart_rf_best_wf <- heart_rf_wf |> 
  finalize_workflow(heart_rf_best_params)

heart_rf_last_fit <- last_fit(object = heart_rf_best_wf,
                              split =  heart_initial_split)

heart_test_preds <- 
  bind_rows(
    collect_predictions(heart_rf_last_fit) |>
    mutate(modelo = "RandomForest")
)

heart_test_preds |> 
  select(
    id, .pred_class, heart_disease
  ) |> 
  mutate(
    Prediction = if_else(.pred_class == heart_disease, "RIGHT", "WRONG")
  ) |> 
  relocate(Prediction, .before = .pred_class) |> 
  DT::datatable()
## ROC
heart_test_preds |>
  group_by(modelo) |>
  roc_curve(heart_disease, .pred_0) |>
  autoplot()+
  theme(
    legend.position = "top",
    legend.title = element_blank()
  )

Conclusions

In conclusion, Random Forest proved effective in predicting heart disease. Therefore, this artificial intelligence approach can be implemented in clinical settings to assist healthcare professionals in the early detection of heart disease.

It is noteworthy that the analyses performed herein are not intended to be used in practice. However, these analysis reinforce the utilization of ML algorithm in the health care.

Hope you enjoyed it!😜!