Heart Disease Classification
A scikit-learn Logistic Regression model that predicts the presence of heart disease from 13 routine clinical features. Trained on the UCI Cleveland Heart Disease dataset, the model achieves 88.52% accuracy on the held-out test set and 87.05% F1 under 5-fold cross-validation.
Intended Use
This model is built for educational and research purposes, demonstrating an end-to-end classical ML workflow on tabular medical data: exploratory analysis, model comparison, hyperparameter tuning, and evaluation. It is not a medical device and must not be used for clinical decision-making.
Dataset
The model is trained on the UCI Cleveland Heart Disease dataset, which contains 303 patient records with 13 input features and a binary target (1 = heart disease present, 0 = absent).
| Feature | Description |
|---|---|
| age | Age of the patient in years |
| sex | Sex (1 = male, 0 = female) |
| cp | Chest pain type (0: typical angina, 1: atypical angina, 2: non-anginal, 3: asymptomatic) |
| trestbps | Resting blood pressure on hospital admission (mm Hg) |
| chol | Serum cholesterol (mg/dl) |
| fbs | Fasting blood sugar > 120 mg/dl (1 = true, 0 = false) |
| restecg | Resting electrocardiographic results (0: normal, 1: ST-T abnormality, 2: LV hypertrophy) |
| thalach | Maximum heart rate achieved |
| exang | Exercise-induced angina (1 = yes, 0 = no) |
| oldpeak | ST depression induced by exercise relative to rest |
| slope | Slope of the peak exercise ST segment |
| ca | Number of major vessels (0-3) colored by fluoroscopy |
| thal | Thallium stress test result, often mislabeled as thalassemia (1: normal, 2: fixed defect, 3: reversible defect) |
Class balance: 165 positive cases, 138 negative cases (54% / 46%).
Methodology
1. Data Exploration
The dataset has no missing values across any column. Crosstab analysis surfaced clear patterns: chest pain type and sex showed strong association with the target.
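
The two checks described here can be sketched with pandas. The six rows below are made-up stand-ins, not records from the Cleveland dataset; only the column names follow the feature table.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "sex":    [1, 1, 0, 0, 1, 0],
    "cp":     [0, 2, 1, 0, 3, 2],
    "target": [0, 1, 1, 0, 1, 1],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = df.isna().sum()

# Cross-tabulate the target against categorical features.
target_by_sex = pd.crosstab(df["target"], df["sex"])
target_by_cp = pd.crosstab(df["cp"], df["target"])

print(missing_per_column)
print(target_by_sex)
print(target_by_cp)
```

On the real data, the same two `pd.crosstab` calls would show how the positive and negative cases distribute across `sex` and `cp`.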
2. Train/Test Split
The data was split 80/20 with a fixed random seed for reproducibility, yielding 242 training and 61 test samples.
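
A minimal sketch of such a split, assuming scikit-learn's `train_test_split` was used with `random_state=42` (the source gives the seed but not how it was passed). The arrays are random placeholders with the dataset's shape.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's shape: 303 rows, 13 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(303, 13))
y = rng.integers(0, 2, size=303)

# test_size=0.2 rounds the test set up: ceil(0.2 * 303) = 61, leaving 242.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The shapes match the 242/61 split reported above because the test count is rounded up from 60.6.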
3. Model Comparison
Three classifiers were trained with default parameters and evaluated on the test set:
| Model | Test Accuracy |
|---|---|
| Logistic Regression | 88.52% |
| Random Forest Classifier | 83.61% |
| K-Nearest Neighbors | 68.85% |
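
The comparison can be sketched as a loop over default estimators. The data here comes from `make_classification` as a synthetic stand-in, so the printed accuracies will not match the table; the `random_state` values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real data: 303 rows, 13 features (assumption).
X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters, apart from a seed for the random forest.
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```

Keeping the models in a dictionary makes it easy to add another classifier without touching the evaluation loop.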
4. Hyperparameter Tuning
All three models were tuned with RandomizedSearchCV (5-fold CV, 20 iterations).
- KNN improved to a peak of 75.41% test accuracy after sweeping `n_neighbors` from 1 to 20.
- Random Forest reached a best CV score of 82.64% with `n_estimators=710`, `max_depth=10`, `min_samples_split=8`, `min_samples_leaf=3`, `max_features=1`.
- Logistic Regression retained the top score at 88.52% with `C=0.2336`, `solver='liblinear'`.
Logistic Regression was selected as the final model.
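
A hedged sketch of the Logistic Regression search follows. The search space is an assumption: the source does not list it, but the reported `C=0.2336` coincides with one point of `np.logspace(-4, 4, 20)`, so that grid is used for illustration. The training data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the 242 training rows (assumption).
X_train, y_train = make_classification(
    n_samples=242, n_features=13, random_state=42
)

# Assumed search space: 20 log-spaced C values. Not confirmed by the source.
param_distributions = {
    "C": np.logspace(-4, 4, 20),
    "solver": ["liblinear"],
}

search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions=param_distributions,
    n_iter=20,        # 20 sampled settings, as stated above
    cv=5,             # 5-fold cross-validation
    random_state=42,  # assumed seed
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.2%}")
```

With only 20 candidate settings and `n_iter=20`, this particular search visits every grid point, so it behaves like an exhaustive grid search.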
Results
Test Set Performance (n=61)
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 (No disease) | 0.89 | 0.86 | 0.88 | 29 |
| 1 (Disease) | 0.88 | 0.91 | 0.89 | 32 |
| Accuracy | | | 0.89 | 61 |
| Macro avg | 0.89 | 0.88 | 0.88 | 61 |
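
As a worked check, the per-class figures can be recomputed from confusion-matrix counts. The counts below are not stated in the source; they are inferred from the rounded recalls and supports in the table (25 of 29 and 29 of 32 correct) and should be read as an assumption.

```python
# Confusion-matrix counts inferred from the rounded table (assumption).
tn, fp = 25, 4   # true class 0: 29 samples
fn, tp = 3, 29   # true class 1: 32 samples

def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For class 1 the hits are tp; for class 0 the roles of the cells swap.
disease = prf(tp, fp, fn)
no_disease = prf(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("class 0:", [round(v, 2) for v in no_disease])
print("class 1:", [round(v, 2) for v in disease])
print("accuracy:", round(accuracy, 4))
```

Under these counts, 54 of 61 test samples are classified correctly, which agrees with the 88.52% accuracy reported earlier.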
5-Fold Cross-Validation
| Metric | Score |
|---|---|
| Accuracy | 84.80% |
| Precision | 82.16% |
| Recall | 92.73% |
| F1 | 87.05% |
The high recall (92.73%) is particularly important in a screening context, since false negatives (missed disease cases) are more costly than false positives.
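
One way to obtain all four cross-validated metrics in a single pass is scikit-learn's `cross_validate`. This is a sketch on synthetic data, using the tuned hyperparameters reported above; whether the author used `cross_validate` or separate `cross_val_score` calls is not stated.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the full 303-row dataset (assumption).
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# Hyperparameters reported in the tuning step.
clf = LogisticRegression(C=0.2336, solver="liblinear")

# Score all four metrics over the same 5 folds.
metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(clf, X, y, cv=5, scoring=metrics)

# Average each metric across folds.
mean_scores = {m: cv_results[f"test_{m}"].mean() for m in metrics}
for name, value in mean_scores.items():
    print(f"{name}: {value:.2%}")
```

Because each metric is averaged per fold, the mean F1 need not equal the harmonic mean of the mean precision and mean recall.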
Feature Importance
Logistic Regression coefficients indicate which features push predictions toward the positive class (heart disease present):
| Feature | Coefficient | Direction |
|---|---|---|
| cp (chest pain type) | +0.675 | Strong positive |
| slope | +0.471 | Positive |
| restecg | +0.335 | Positive |
| fbs | +0.048 | Weak positive |
| thalach | +0.025 | Weak positive |
| age | +0.004 | Negligible |
| sex | -0.904 | Strong negative |
| thal | -0.700 | Strong negative |
| ca | -0.652 | Strong negative |
| exang | -0.631 | Strong negative |
| oldpeak | -0.576 | Strong negative |
| trestbps | -0.012 | Negligible |
| chol | -0.002 | Negligible |
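Chest pain type has the largest positive coefficient, while sex, `thal`, and `ca` (number of major vessels) have the largest negative ones. Because no feature scaling was applied, coefficient magnitudes depend on each feature's units and are not directly comparable across features.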
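
To make the coefficients easier to read, they can be converted to odds ratios with `exp(coefficient)`. The sketch below uses only the values from the table; the conversion is standard for logistic regression and is not something the source reports.

```python
import math

# Coefficients copied from the table above.
coefficients = {
    "cp": 0.675, "slope": 0.471, "restecg": 0.335, "fbs": 0.048,
    "thalach": 0.025, "age": 0.004, "sex": -0.904, "thal": -0.700,
    "ca": -0.652, "exang": -0.631, "oldpeak": -0.576,
    "trestbps": -0.012, "chol": -0.002,
}

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

# List features from the largest to the smallest absolute coefficient.
for name in sorted(coefficients, key=lambda n: abs(coefficients[n]), reverse=True):
    print(f"{name:>8}  coef={coefficients[name]:+.3f}  odds ratio={odds_ratios[name]:.3f}")
```

For example, the `sex` coefficient of -0.904 corresponds to an odds ratio of about 0.40, meaning the model assigns male patients (1) roughly 40% of the odds it assigns female patients (0) with otherwise identical inputs. A one-unit step means something different for each feature, so these ratios describe direction and per-unit effect rather than a ranking of importance.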
Usage
Installation
pip install scikit-learn==1.4.0 joblib huggingface_hub
Loading the Model
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# Download the serialized model from the Hugging Face Hub.
path = hf_hub_download(
    repo_id="abduleyo/heart-disease-classification-model",
    filename="model.joblib",
)
# Load the fitted scikit-learn estimator.
model = joblib.load(path)
Making a Prediction
# One patient record; keys follow the feature table in the Dataset section.
sample = pd.DataFrame([{
    "age": 54, "sex": 1, "cp": 2, "trestbps": 150, "chol": 232,
    "fbs": 0, "restecg": 0, "thalach": 165, "exang": 0,
    "oldpeak": 1.6, "slope": 2, "ca": 0, "thal": 3,
}])

# Predicted class (1 = disease, 0 = no disease) and probability of class 1.
prediction = model.predict(sample)[0]
probability = model.predict_proba(sample)[0][1]

print(f"Prediction: {'Disease' if prediction == 1 else 'No Disease'}")
print(f"Probability of disease: {probability:.2%}")
Limitations
- Small sample size. The dataset contains only 303 records, of which 242 were used for training, which limits generalizability.
- Cohort bias. The Cleveland dataset reflects a specific patient population from a single institution. Performance may degrade on data from other regions, ethnicities, or clinical settings.
- Feature engineering. No scaling, normalization, or polynomial features were applied. Adding `StandardScaler` would likely improve numerical stability for `chol` and `trestbps`.
- Not a clinical tool. This model is a demonstration. Heart disease diagnosis requires comprehensive evaluation by qualified medical professionals.
- Categorical encoding. Categorical features (`cp`, `restecg`, `slope`, `thal`) are treated as ordinal integers rather than one-hot encoded, which Logistic Regression may interpret incorrectly.
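
The scaling and encoding limitations could be addressed with a preprocessing pipeline along the following lines. This is a sketch of one possible remedy, not the published model: the column grouping is an assumption, and the rows are random placeholders used only to show that the pipeline fits and predicts.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed grouping of the 13 features.
categorical = ["cp", "restecg", "slope", "thal"]
numeric = ["age", "trestbps", "chol", "thalach", "oldpeak"]
passthrough = ["sex", "fbs", "exang", "ca"]  # binary flags and the vessel count

preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
    ("keep", "passthrough", passthrough),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(C=0.2336, solver="liblinear")),
])

# Hypothetical random rows; not records from the Cleveland dataset.
rng = np.random.default_rng(42)
n = 60
df = pd.DataFrame({
    "age": rng.integers(29, 78, n), "sex": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n), "trestbps": rng.integers(94, 201, n),
    "chol": rng.integers(126, 400, n), "fbs": rng.integers(0, 2, n),
    "restecg": rng.integers(0, 3, n), "thalach": rng.integers(71, 203, n),
    "exang": rng.integers(0, 2, n), "oldpeak": rng.uniform(0, 6, n).round(1),
    "slope": rng.integers(0, 3, n), "ca": rng.integers(0, 4, n),
    "thal": rng.integers(1, 4, n),
})
y = rng.integers(0, 2, n)

pipeline.fit(df, y)
print(pipeline.predict_proba(df.head(3))[:, 1])
```

Wrapping the preprocessing and the classifier in one `Pipeline` keeps the scaler and encoder fitted on training folds only when the pipeline is passed to cross-validation or a hyperparameter search.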
Reproducibility
- Framework: scikit-learn
- Random seed: 42 (numpy)
- Cross-validation: 5-fold
- Search strategy: RandomizedSearchCV with 20 iterations
License
MIT License. Free to use, modify, and distribute with attribution.
Citation
If you use this model in your work, please cite:
@misc{heart-disease-classification-2026,
author = {Abdule},
title = {Heart Disease Classification with Logistic Regression},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/abduleyo/heart-disease-classification-model}
}
Acknowledgements
Dataset: UCI Machine Learning Repository, Heart Disease Data Set (Cleveland). Original investigators: Andras Janosi, William Steinbrunn, Matthias Pfisterer, Robert Detrano.