Heart Disease Classification

A scikit-learn Logistic Regression model that predicts the presence of heart disease from 13 routine clinical features. Trained on the UCI Cleveland Heart Disease dataset, the model achieves 88.52% accuracy on the held-out test set and 87.05% F1 under 5-fold cross-validation.

Intended Use

This model is built for educational and research purposes, demonstrating an end-to-end classical ML workflow on tabular medical data: exploratory analysis, model comparison, hyperparameter tuning, and evaluation. It is not a medical device and must not be used for clinical decision-making.

Dataset

The model is trained on the UCI Cleveland Heart Disease dataset, which contains 303 patient records with 13 input features and a binary target (1 = heart disease present, 0 = absent).

Feature	Description
age	Age of the patient in years
sex	Sex (1 = male, 0 = female)
cp	Chest pain type (0: typical angina, 1: atypical angina, 2: non-anginal, 3: asymptomatic)
trestbps	Resting blood pressure on hospital admission (mm Hg)
chol	Serum cholesterol (mg/dl)
fbs	Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg	Resting electrocardiographic results (0: normal, 1: ST-T abnormality, 2: LV hypertrophy)
thalach	Maximum heart rate achieved
exang	Exercise-induced angina (1 = yes, 0 = no)
oldpeak	ST depression induced by exercise relative to rest
slope	Slope of the peak exercise ST segment
ca	Number of major vessels (0-3) colored by fluoroscopy
thal	Thalassemia (1: normal, 2: fixed defect, 3: reversible defect)

Class balance: 165 positive cases, 138 negative cases (54% / 46%).

Methodology

1. Data Exploration

The dataset has no missing values in any column. Crosstab analysis showed that chest pain type and sex are both strongly associated with the target.
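
The two checks described above (missing values and crosstabs) can be sketched as follows. The rows here are toy stand-in data, not the Cleveland records, and the `target` column name is an assumption; only `sex` and `cp` come from the feature table.

```python
import pandas as pd

# Toy stand-in rows, NOT the real Cleveland data.
# "target" is an assumed column name for the binary label.
df = pd.DataFrame({
    "sex":    [1, 1, 0, 0, 1, 0, 1, 1],
    "cp":     [0, 2, 1, 2, 0, 1, 3, 0],
    "target": [0, 1, 1, 1, 0, 1, 1, 0],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Cross-tabulate categorical features against the target.
sex_vs_target = pd.crosstab(df["target"], df["sex"])
cp_vs_target = pd.crosstab(df["cp"], df["target"])

print(missing_per_column)
print(sex_vs_target)
print(cp_vs_target)
```

On the real data, the same calls would be applied to the full 303-row frame.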

2. Train/Test Split

The data was split 80/20 with a fixed random seed for reproducibility, yielding 242 training and 61 test samples.
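
A minimal sketch of the split, using random placeholder arrays with the same shape as the dataset (303 rows, 13 features). Passing `random_state=42` directly to `train_test_split` is an assumption; the source only states that a fixed seed was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the dataset's shape; values are random, not real records.
rng = np.random.default_rng(42)
X = rng.normal(size=(303, 13))
y = rng.integers(0, 2, size=303)

# 80/20 split; random_state=42 is an assumed way of fixing the seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

scikit-learn rounds the test size up (20% of 303 is 60.6), which is why the split gives 61 test samples and 242 training samples.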

3. Model Comparison

Three classifiers were trained with default parameters and evaluated on the test set:

Model	Test Accuracy
Logistic Regression	88.52%
Random Forest Classifier	83.61%
K-Nearest Neighbors	68.85%
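
A comparison loop of this kind might look like the sketch below. It runs on synthetic data from `make_classification`, so the printed accuracies will not match the table. `max_iter=1000` is a small departure from the defaults, added here only to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 303 x 13 dataset.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Fit each model on the training split and record its test accuracy.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

for name, acc in test_accuracy.items():
    print(f"{name}: {acc:.2%}")
```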

4. Hyperparameter Tuning

All three models were tuned with RandomizedSearchCV (5-fold CV, 20 iterations).

KNN improved to a peak of 75.41% test accuracy after sweeping n_neighbors from 1 to 20.
Random Forest reached a best CV score of 82.64% with n_estimators=710, max_depth=10, min_samples_split=8, min_samples_leaf=3, max_features=1.
Logistic Regression retained the top test accuracy at 88.52% with C=0.2336, solver='liblinear'.

Logistic Regression was selected as the final model.
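
A hedged sketch of the Logistic Regression search. The candidate grid `np.logspace(-4, 4, 20)` is an assumption: the source does not state the search space, although the reported C=0.2336 is consistent with one point on that grid. The data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the 242-row training split.
X, y = make_classification(
    n_samples=242, n_features=13, n_informative=6, random_state=42
)

# Assumed search space; not confirmed by the source.
param_distributions = {
    "C": np.logspace(-4, 4, 20),
    "solver": ["liblinear"],
}

search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions=param_distributions,
    n_iter=20,        # 20 iterations, as reported
    cv=5,             # 5-fold CV, as reported
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2%}")
```

With 20 candidate values and a single solver, 20 iterations cover this particular grid exhaustively, so the random search behaves like a grid search here.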

Results

Test Set Performance (n=61)

Class	Precision	Recall	F1	Support
0 (No disease)	0.89	0.86	0.88	29
1 (Disease)	0.88	0.91	0.89	32
Accuracy			0.89	61
Macro avg	0.89	0.88	0.88	61
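
As a consistency check, the table can be reproduced from a confusion matrix. The four counts below are inferred from the reported supports and rounded recalls; they are not given in the source and should be read as an assumption.

```python
# Inferred counts (assumption): class 1 = disease, class 0 = no disease.
tn, fp, fn, tp = 25, 4, 3, 29

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1 (disease)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = f1(precision_1, recall_1)

# Class 0 (no disease): the roles of the two classes are swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = f1(precision_0, recall_0)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: P={precision_0:.2f} R={recall_0:.2f} F1={f1_0:.2f}")
print(f"class 1: P={precision_1:.2f} R={recall_1:.2f} F1={f1_1:.2f}")
print(f"accuracy: {accuracy:.4f}")
```

Under these counts, 54 of 61 test samples are classified correctly, which matches the reported 88.52% accuracy.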

5-Fold Cross-Validation

Metric	Score
Accuracy	84.80%
Precision	82.16%
Recall	92.73%
F1	87.05%

The high recall (92.73%) would matter most in a screening context, where false negatives (missed disease cases) are typically more costly than false positives.
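
The four cross-validated metrics can be computed in one pass with `cross_validate`. This is a sketch on synthetic data, so its scores will differ from the table. The hyperparameters are the ones reported for the tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the full 303 x 13 dataset.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)

clf = LogisticRegression(C=0.2336, solver="liblinear")
metrics = ["accuracy", "precision", "recall", "f1"]

# One 5-fold run scoring all four metrics on each fold.
scores = cross_validate(clf, X, y, cv=5, scoring=metrics)

# Average each metric across the five folds.
cv_means = {m: scores[f"test_{m}"].mean() for m in metrics}
for name, value in cv_means.items():
    print(f"{name}: {value:.2%}")
```

Because F1 is averaged per fold, it does not have to equal the harmonic mean of the averaged precision and recall.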

Feature Importance

Logistic Regression coefficients indicate which features push predictions toward the positive class (heart disease present):

Feature	Coefficient	Direction
cp (chest pain type)	+0.675	Strong positive
slope	+0.471	Positive
restecg	+0.335	Positive
fbs	+0.048	Weak positive
thalach	+0.025	Weak positive
age	+0.004	Negligible
sex	-0.904	Strong negative
thal	-0.700	Strong negative
ca	-0.652	Strong negative
exang	-0.631	Strong negative
oldpeak	-0.576	Strong negative
trestbps	-0.012	Negligible
chol	-0.002	Negligible

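Chest pain type is the most influential positive predictor, while sex, thalassemia status, and number of major vessels are the strongest negative predictors. Because the features were not scaled, coefficient magnitudes depend on each feature's units, so the small coefficients for age, trestbps, chol, and thalach do not by themselves mean those features are unimportant.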
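
One way to read these coefficients is as odds ratios. In logistic regression, exp(coefficient) is the multiplicative change in the odds of the positive class for a one-unit increase in that feature, with the other features held fixed. The sketch below applies this to the values in the table.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "cp": 0.675, "slope": 0.471, "restecg": 0.335, "fbs": 0.048,
    "thalach": 0.025, "age": 0.004, "sex": -0.904, "thal": -0.700,
    "ca": -0.652, "exang": -0.631, "oldpeak": -0.576,
    "trestbps": -0.012, "chol": -0.002,
}

# Odds ratio per one-unit increase in each feature.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

# Print from the strongest odds-decreasing to the strongest odds-increasing.
for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:>8}: {ratio:.3f}")
```

For example, a one-step increase in the cp code roughly doubles the odds (about 1.96x), while sex = 1 multiplies them by about 0.40. A "unit" for chol is a single mg/dl, so a per-unit ratio close to 1 does not mean the feature has little effect across its full range.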

Usage

Installation

pip install scikit-learn==1.4.0 joblib pandas huggingface_hub

Loading the Model

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="abduleyo/heart-disease-classification-model",
    filename="model.joblib",
)
model = joblib.load(path)

Making a Prediction

sample = pd.DataFrame([{
    "age": 54, "sex": 1, "cp": 2, "trestbps": 150, "chol": 232,
    "fbs": 0, "restecg": 0, "thalach": 165, "exang": 0,
    "oldpeak": 1.6, "slope": 2, "ca": 0, "thal": 3,
}])

prediction = model.predict(sample)[0]
probability = model.predict_proba(sample)[0][1]

print(f"Prediction: {'Disease' if prediction == 1 else 'No Disease'}")
print(f"Probability of disease: {probability:.2%}")

Limitations

Small sample size. The model was trained on only 303 records, which limits generalizability.
Cohort bias. The Cleveland dataset reflects a specific patient population from a single institution. Performance may degrade on data from other regions, ethnicities, or clinical settings.
Feature engineering. No scaling, normalization, or polynomial features were applied. Adding StandardScaler would likely improve numerical stability for chol and trestbps.
Not a clinical tool. This model is a demonstration. Heart disease diagnosis requires comprehensive evaluation by qualified medical professionals.
Categorical encoding. Categorical features (cp, restecg, slope, thal) are treated as ordinal integers rather than one-hot encoded, so Logistic Regression fits a single linear effect across category codes that may have no meaningful order.
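
The scaling and encoding limitations above could be addressed with a preprocessing pipeline. The sketch below is an illustration only and is not how the published model was trained. The grouping of columns into numeric, categorical and passthrough is an assumption, and the data is random placeholder values with arbitrary ranges.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column grouping; the remaining columns are passed through unchanged.
numeric = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical = ["cp", "restecg", "slope", "thal"]

# Random placeholder data with arbitrary ranges, NOT real patient records.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(30, 80, n), "sex": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n), "trestbps": rng.integers(90, 200, n),
    "chol": rng.integers(120, 400, n), "fbs": rng.integers(0, 2, n),
    "restecg": rng.integers(0, 3, n), "thalach": rng.integers(70, 200, n),
    "exang": rng.integers(0, 2, n), "oldpeak": rng.uniform(0, 6, n).round(1),
    "slope": rng.integers(0, 3, n), "ca": rng.integers(0, 4, n),
    "thal": rng.integers(1, 4, n),
})
y = rng.integers(0, 2, n)

preprocess = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), numeric),                            # zero mean, unit variance
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),  # one column per category
    ],
    remainder="passthrough",  # sex, fbs, exang, ca are kept as-is
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(df, y)

proba = pipeline.predict_proba(df.head(3))[:, 1]
print(proba)
```

Wrapping the preprocessing in a Pipeline means the scaler and encoder are fitted only on training folds during cross-validation, which avoids leaking test-fold statistics into training.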

Reproducibility

Framework: scikit-learn
Random seed: 42 (numpy)
Cross-validation: 5-fold
Search strategy: RandomizedSearchCV with 20 iterations

License

MIT License. Free to use, modify, and distribute with attribution.

Citation

If you use this model in your work, please cite:

@misc{heart-disease-classification-2026,
  author = {Abdule},
  title  = {Heart Disease Classification with Logistic Regression},
  year   = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abduleyo/heart-disease-classification-model}
}

Acknowledgements

Dataset: UCI Machine Learning Repository, Heart Disease Data Set (Cleveland). Original investigators: Andras Janosi, William Steinbrunn, Matthias Pfisterer, Robert Detrano.
