Heart Disease Data set - An Unsupervised Learning Approach

By Saulo Gil

June 22, 2022

This project consists in an Unsupervised Learning analysis of the Heart Disease data set available in the UCI Machine Learning Repository.

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. So, we opted to use just 14 tributes of the Cleveland database that are commonly used ML researchers.

The attributes are following:

age = age in years.

sex = sex (1 = male; 0 = female).

cp = chest pain type

Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic,

trestbps = resting blood pressure (in mm Hg on admission to the hospital),

chol = serum cholestoral in mg/dl,

fbs = fasting blood sugar > 120 mg/dl (1 = true; 0 = false),

restecg = resting electrocardiographic results,

thalach = maximum heart rate achieved,

exang = exercise induced angina (1 = yes; 0 = no),

oldpeak = ST depression induced by exercise relative to rest,

slope = the slope of the peak exercise ST segment,

ca = number of major vessels (0-3) colored by flourosopy,

thal = 3 = normal; 6 = fixed defect; 7 = reversable defect,

num = diagnosis of heart disease (angiographic disease status)(1 = true; 0 = false)

LET’S START ❗️

Packages

library(tidyverse)
library(DT)
library(factoextra)
library(gridExtra)

Pre-processing

Loading and adjusting database

The dataset was downloaded here and, as dataset comes without column names, I included column names using r colnames function from {r base}. Then, I removed all categorical variables considering since that i intend to reduce the dimensionality of the data using Principal Component Analysis (PCA) which do not appear to be an adequate approach for categorical variables. More details may be checked here

data <- 
  read_csv("processed.cleveland.data", 
    col_names = FALSE)

colnames(data) <- c(
  "age",
  "sex",
  "cp",
  "trestbps",
  "chol",
  "fbs",
  "restecg",
  "thalach",
  "exang",
  "oldpeak",
  "slope",
  "ca",
  "thal",
  "num"
)

data <- 
  data |>
  mutate(ca = as.double(ca)) |> 
  select(-sex,
         -cp,
         -fbs,
         -exang,
         -thal,
         -num)

The filtered dataset can be checked above.

datatable(data = data, rownames = FALSE)

Verifying missing values

DataExplorer::plot_missing(data) +
  theme_bw() + 
  theme(legend.title = element_blank(),
        legend.position = "none")

Looking missing value plot, we observed an low frequency of missing value in the variable ca, and I opted to remove then.

Removing missing data

data <-
  drop_na(data)

Lets see missing data again

DataExplorer::plot_missing(data) +
  theme_bw() + 
  theme(legend.title = element_blank(),
        legend.position = "none")

Yeah, now, we does not have missing values ❗️ Now, let’s to see data summary.

Data summary

skimr::skim(data)

Table 1: Data summary
Name	data
Number of rows	299
Number of columns	8
_______________________
Column type frequency:
numeric	8
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	54.53	9.02	29	48	56.0	61.0	77.0	▁▅▇▇▁
trestbps	1	131.67	17.71	94	120	130.0	140.0	200.0	▃▇▅▁▁
chol	1	247.10	51.91	126	211	242.0	275.5	564.0	▃▇▂▁▁
restecg	1	1.00	0.99	0	0	1.0	2.0	2.0	▇▁▁▁▇
thalach	1	149.51	22.95	71	133	153.0	165.5	202.0	▁▂▅▇▂
oldpeak	1	1.05	1.16	0	0	0.8	1.6	6.2	▇▂▁▁▁
slope	1	1.60	0.62	1	1	2.0	2.0	3.0	▇▁▇▁▁
ca	1	0.67	0.94	0	0	0.0	1.0	3.0	▇▃▁▂▁

We may observed a wide range in the units of the variables. So, I opted to scale all variables in order to put then in standardized units, in the case, using z-score. Now, we may observe standardized units for all variables.

Now, let’s to reduce dimentionality ❗️

Scaling data - z-score approach

data_scaled <- scale(data)

skimr::skim(data_scaled)

Table 2: Data summary
Name	data_scaled
Number of rows	299
Number of columns	8
_______________________
Column type frequency:
numeric	8
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	sd	p0	p25	p50	p75	p100	hist
age	1	1	-2.83	-0.72	0.16	0.72	2.49	▁▅▇▇▁
trestbps	1	1	-2.13	-0.66	-0.09	0.47	3.86	▃▇▅▁▁
chol	1	1	-2.33	-0.70	-0.10	0.55	6.10	▃▇▂▁▁
restecg	1	1	-1.00	-1.00	0.00	1.01	1.01	▇▁▁▁▇
thalach	1	1	-3.42	-0.72	0.15	0.70	2.29	▁▂▅▇▂
oldpeak	1	1	-0.90	-0.90	-0.22	0.47	4.42	▇▂▁▁▁
slope	1	1	-0.97	-0.97	0.64	0.64	2.26	▇▁▇▁▁
ca	1	1	-0.72	-0.72	-0.72	0.35	2.48	▇▃▁▂▁

Reducing the dimensionality

Following to apply PCA and plot, we may observe that first component shows 30% of the variance while second component just 16%. Also, we are not able to identify any visual cluster. Therefore, let’s apply k-means and hierarchical clustering approaches in order to identify clusters.

Let’s start by hierarchical clustering.

In brief,

PCA <- princomp(data_scaled)

PCA |> 
  fviz(element = "ind") +
  theme_bw()

Clustering

Checking optimal number of clustering using Hierarchical Clustering

Hierarchical clustering is an unsupervised machine learning algorithm for identifying groups in the dataset. It does not require us to pre-specify the number of clusters to be generated. Furthermore, hierarchical clustering has an added advantage over other commonly used methods in that it results in an attractive tree-based representation of the observations, called a dendrogram. More details may be checked here

Firstly, let’s to determine optimal number of cluster elbow method. As we can observe above, the “elbow” is not clear. Let-s to see another method, i.e., silhouette method, to indicate optimal number of cluster.

fviz_nbclust(data_scaled, hcut, method = "wss")

fviz_nbclust(data_scaled, hcut, method = "silhouette")

Despite of elbow and silhouette methods are not equivalent, i will used 4 cluster. Let’s to see dendrogram and

#hierarchichal clustering using k = 4

hc <- hcut(data_scaled, k = 4,
           hc_method = "ward.D2",
           hc_metric = "euclidean")

# Plot the obtained dendrogram
plot(hc, cex = 0.6, hang = -10)

fviz_cluster(hc, data = data_scaled) +
  theme_bw()

After trying distinct approaches using Hierarchical Cluster method the figure below illustrated the 4 clusters as silhouette method suggested. Thus, it is notable that this approach were not able to identify appropriately the cluster.

Let’s to try K-means method ❗️

In brief, K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e, centroid) which corresponds to the mean of points assigned to the cluster. For more details click here

Checking optimal number of clustering using K-Means

fviz_nbclust(data_scaled, kmeans, method = "wss")

fviz_nbclust(data_scaled, kmeans, method = "silhouette")

Differently of the hierarchical clustering, both elbow and silhouette method indicated 2 cluster in the dataset.

Let’s obsverve it ❗️

k2 <- kmeans(data_scaled,
             centers = 2,
             nstart = 25)

fviz_cluster(k2,data = data_scaled) +
  theme_bw()

Wowww, now we are able to see clusters ❗️. Nice, but let’s to try some more cluster for us to see if something is better ❗️

k3 <- kmeans(data_scaled, centers = 3, nstart = 25)
k4 <- kmeans(data_scaled, centers = 4, nstart = 25)
k5 <- kmeans(data_scaled, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k2, geom = "point", data = data_scaled) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = data_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = data_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = data_scaled) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4,
             nrow = 2) +
  theme_bw()

## NULL

Observing plots above, we can observe that elbow and silhouette methods indicated adequately the number of cluster, since that there are some overlaps when using more cluster.

Final

In the conclusion, the K-means method with 2 cluster appear to offer an adequate clusterization of the heart disease dataset. Thus, this algorithm would may be utilized to determine specific medical care for each cluster of patients. It is noteworthy that the analyses performed herein are not intended to be used in practice. However, these analysis reinforce the utilization of unsupervised machine learning in the health care scenario.

Hope you enjoyed it!😜!

Posted on:: June 22, 2022

Length:: 7 minute read, 1380 words

See Also: