Martinacap02

Init deploy branch for HF Space

f7d11f7 about 1 month ago

Dataset Card

Dataset Description
Dataset Structure
- Data Instances
- Data Fields
Dataset Creation
Considerations for Using the Data
- Social Impact of Dataset
- Discussion of Biases
Additional Information
- Dataset Curators
- Citation Information

Dataset Description

Homepage: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

Dataset Summary

This dataset contains anonymized clinical data used to predict the risk of heart failure.
It includes 918 patient records, 11 clinical features, and one target variable.
The original dataset was downloaded from Kaggle and was created by merging five well-known cardiology datasets.

The version used in this project underwent additional preprocessing steps, including standardization, normalization, categorical encoding, and removal of the Sex feature. The resulting dataset is used for experimentation and model development.
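The exact preprocessing pipeline is not documented in this card. As a rough illustration only, the sketch below applies three of the named steps (removing Sex, one-hot encoding the categorical fields, and z-score standardization of the numeric fields) using the Python standard library. The helper name `preprocess`, the choice of one-hot encoding, and the choice of which columns to standardize are assumptions, not the project's actual code. The first record is the example instance from this card; the second is a made-up row included only so the column statistics are non-trivial.

```python
from statistics import mean, pstdev

# Row 1: the example instance from this card.
# Row 2: a hypothetical row, invented purely for illustration.
rows = [
    {"Age": 54, "Sex": "M", "ChestPainType": "ASY", "RestingBP": 140,
     "Cholesterol": 239, "FastingBS": 0, "RestingECG": "Normal",
     "MaxHR": 160, "ExerciseAngina": "N", "Oldpeak": 1.2,
     "ST_Slope": "Flat", "HeartDisease": 1},
    {"Age": 40, "Sex": "F", "ChestPainType": "ATA", "RestingBP": 120,
     "Cholesterol": 200, "FastingBS": 0, "RestingECG": "ST",
     "MaxHR": 170, "ExerciseAngina": "Y", "Oldpeak": 0.0,
     "ST_Slope": "Up", "HeartDisease": 0},
]

# Assumed split of columns; the card does not say which were scaled.
NUMERIC = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]
CATEGORICAL = {
    "ChestPainType": ["TA", "ATA", "NAP", "ASY"],
    "RestingECG": ["Normal", "ST", "LVH"],
    "ST_Slope": ["Up", "Flat", "Down"],
}


def preprocess(records):
    """Drop Sex, one-hot encode categoricals, z-score numeric columns."""
    # Per-column mean and population standard deviation.
    stats = {}
    for col in NUMERIC:
        values = [r[col] for r in records]
        stats[col] = (mean(values), pstdev(values))

    processed = []
    for r in records:
        row = {}
        for col in NUMERIC:
            mu, sigma = stats[col]
            # Guard against constant columns (sigma == 0).
            row[col] = (r[col] - mu) / sigma if sigma else 0.0
        for col, levels in CATEGORICAL.items():
            for level in levels:
                row[f"{col}_{level}"] = int(r[col] == level)
        row["ExerciseAngina"] = int(r["ExerciseAngina"] == "Y")
        row["FastingBS"] = r["FastingBS"]
        row["HeartDisease"] = r["HeartDisease"]
        # "Sex" is never copied, so it is absent from the output.
        processed.append(row)
    return processed


processed = preprocess(rows)
print(sorted(processed[0]))
```

In a real pipeline the scaling statistics would be fitted on the training split only and reused for validation and test data, to avoid leakage.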

Supported Tasks

This dataset can be used for a variety of machine learning tasks, including:

Binary Classification: predicting whether a patient has heart disease.
Risk Scoring / Clinical Risk Stratification: estimating cardiac risk based on clinical variables.
Explainable AI (XAI): feature-importance analysis and interpretability.

Languages

English (en)

Dataset Structure

Data Instances

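Each instance represents one patient. Example, shown in the original Kaggle schema (before the Sex column was removed and the categorical fields were encoded):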

Age	Sex	ChestPainType	RestingBP	Cholesterol	FastingBS	RestingECG	MaxHR	ExerciseAngina	Oldpeak	ST_Slope	HeartDisease
54	M	ASY	140	239	0	Normal	160	N	1.2	Flat	1

Data Fields

Field	Type	Description
Age	int	Patient age in years
Sex	binary	Patient sex (M = male, F = female)
ChestPainType	category	Chest pain type (TA = typical angina, ATA = atypical angina, NAP = non-anginal pain, ASY = asymptomatic)
RestingBP	int	Resting blood pressure (mm Hg)
Cholesterol	int	Serum cholesterol (mg/dL)
FastingBS	binary	Fasting blood sugar (1 if >120 mg/dL, 0 otherwise)
RestingECG	category	Resting ECG results (Normal, ST = ST-T wave abnormality, LVH = left ventricular hypertrophy)
MaxHR	int	Maximum heart rate achieved
ExerciseAngina	binary	Exercise-induced angina (Y/N)
Oldpeak	float	ST depression induced by exercise relative to rest
ST_Slope	category	Slope of the ST segment (Up, Flat, Down)
HeartDisease	binary	Target variable (1 = disease, 0 = no disease)
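
As an illustration of how the field table above can be used, the following sketch parses the tab-separated example instance and checks each value against the listed types and allowed codes. The function name `parse_row` and the grouping constants are hypothetical helpers written for this example. It targets the raw schema shown in this section, not the preprocessed files.

```python
HEADER = ("Age\tSex\tChestPainType\tRestingBP\tCholesterol\tFastingBS\t"
          "RestingECG\tMaxHR\tExerciseAngina\tOldpeak\tST_Slope\tHeartDisease")
EXAMPLE = "54\tM\tASY\t140\t239\t0\tNormal\t160\tN\t1.2\tFlat\t1"

# Field groups taken from the Data Fields table.
INT_FIELDS = {"Age", "RestingBP", "Cholesterol", "MaxHR"}
FLOAT_FIELDS = {"Oldpeak"}
BINARY_INT_FIELDS = {"FastingBS", "HeartDisease"}
CODED_FIELDS = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
}


def parse_row(header, line):
    """Convert one tab-separated row into a typed, validated dict."""
    names = header.split("\t")
    values = line.split("\t")
    if len(names) != len(values):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")

    record = {}
    for name, value in zip(names, values):
        if name in INT_FIELDS:
            record[name] = int(value)
        elif name in FLOAT_FIELDS:
            record[name] = float(value)
        elif name in BINARY_INT_FIELDS:
            if value not in {"0", "1"}:
                raise ValueError(f"{name}: expected 0 or 1, got {value!r}")
            record[name] = int(value)
        elif name in CODED_FIELDS:
            if value not in CODED_FIELDS[name]:
                raise ValueError(f"{name}: unexpected code {value!r}")
            record[name] = value
        else:
            raise ValueError(f"unknown field {name!r}")
    return record


record = parse_row(HEADER, EXAMPLE)
print(record)
```

A row with a code outside the documented sets, or with the wrong number of columns, raises `ValueError` instead of being silently accepted.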

Dataset Creation

Source Data

The preprocessed dataset used in this project originates from the Kaggle dataset “Heart Failure Prediction Dataset”.

The raw dataset was created by merging five widely used cardiology datasets:

Cleveland (303 samples)
Hungarian (294 samples)
Switzerland (123 samples)
Long Beach VA (200 samples)
Statlog (270 samples)

The Kaggle author selected the 11 common features and merged the datasets into a unified collection of 1,190 records, then removed 272 duplicates, resulting in 918 unique samples.

All initial merging and normalization steps were performed by the dataset author on Kaggle.
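
The record counts above can be checked with a few lines of arithmetic. The sketch below also includes a simple duplicate filter. The card does not say how the Kaggle author identified duplicates, so treating a duplicate as an exact match on every field is an assumption, and `drop_duplicates` is a hypothetical helper.

```python
# Sample counts per source dataset, as listed in this card.
SOURCES = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
    "Statlog": 270,
}
DUPLICATES_REMOVED = 272

merged_total = sum(SOURCES.values())
unique_total = merged_total - DUPLICATES_REMOVED
print(merged_total, unique_total)


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order.

    Assumes a duplicate is an exact match on all fields.
    """
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows
```

The totals agree with the figures stated in this section: 1,190 merged records and 918 after removing 272 duplicates.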

Annotations

No manual annotations were added.
The target variable HeartDisease is already included in the original dataset.

Personal and Sensitive Information

Although the dataset contains clinical information (sensitive under GDPR), it is fully anonymized:

No personal identifiers (name, address, contact details, IDs).
All sources were already anonymized before publication.
No biometric or genetic data are included.

Thus, while the data are clinically sensitive, they contain no direct personal identifiers.

Considerations for Using the Data

Social Impact of Dataset

The dataset can support research and development of models for cardiac risk prediction and early detection.

However:

Models trained on this dataset must not be used as standalone diagnostic tools.
They should not be the sole basis for clinical decisions.
Misuse in healthcare contexts may lead to incorrect risk assessment.

Discussion of Biases

This dataset may contain several sources of bias that can affect model performance and fairness:

The data comes from multiple hospitals and countries, each with different patient profiles and clinical protocols. Some groups may be underrepresented.
Source datasets used different diagnostic practices and measurement standards, which may introduce noise or inconsistency in labels and clinical values.
Only 11 features are included, omitting other relevant clinical variables. This can cause proxy bias or oversimplification of cardiac risk.
Some datasets are older and may not reflect current medical practices or population characteristics.

Additional Information

Dataset Curators

The original dataset was created and published by fedesoriano on Kaggle.

The preprocessed dataset was curated by the CardioTrack team, as part of the Software Engineering for AI-Enabled Systems program at the University of Bari.

Citation Information

If you use this dataset, please cite:

Original Dataset
Soriano, F. (2021). Heart Failure Prediction Dataset. Kaggle.
https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction