Heart Disease Classification

A scikit-learn Logistic Regression model that predicts the presence of heart disease from 13 routine clinical features. Trained on the UCI Cleveland Heart Disease dataset, the model achieves 88.52% accuracy on the held-out test set and 87.05% F1 under 5-fold cross-validation.

Intended Use

This model is built for educational and research purposes, demonstrating an end-to-end classical ML workflow on tabular medical data: exploratory analysis, model comparison, hyperparameter tuning, and evaluation. It is not a medical device and must not be used for clinical decision-making.

Dataset

The model is trained on the UCI Cleveland Heart Disease dataset, which contains 303 patient records with 13 input features and a binary target (1 = heart disease present, 0 = absent).

Feature Description
age Age of the patient in years
sex Sex (1 = male, 0 = female)
cp Chest pain type (0: typical angina, 1: atypical angina, 2: non-anginal, 3: asymptomatic)
trestbps Resting blood pressure on hospital admission (mm Hg)
chol Serum cholesterol (mg/dl)
fbs Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
restecg Resting electrocardiographic results (0: normal, 1: ST-T abnormality, 2: LV hypertrophy)
thalach Maximum heart rate achieved
exang Exercise-induced angina (1 = yes, 0 = no)
oldpeak ST depression induced by exercise relative to rest
slope Slope of the peak exercise ST segment
ca Number of major vessels (0-3) colored by fluoroscopy
thal Thallium stress test result (1: normal, 2: fixed defect, 3: reversible defect)

Class balance: 165 positive cases, 138 negative cases (54% / 46%).

Methodology

1. Data Exploration

The dataset has no missing values in any column. Crosstab analysis showed that chest pain type and sex are both strongly associated with the target.
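The two checks described above can be sketched with pandas. The rows below are hypothetical toy values, not the Cleveland data, and the column name `target` is an assumption about how the label column is named.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; NOT the real Cleveland data.
df = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0, 0, 0],
    "target": [1, 0, 0, 0, 1, 1, 1, 0],
})

# Missing-value check: number of NaNs in each column.
missing = df.isna().sum()

# Contingency table of target by sex, as raw counts and as per-row rates.
counts = pd.crosstab(df["sex"], df["target"])
rates = pd.crosstab(df["sex"], df["target"], normalize="index")

print(missing, counts, rates, sep="\n\n")
```

Normalizing by row (`normalize="index"`) makes the groups comparable when they differ in size.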

2. Train/Test Split

The data was split 80/20 with a fixed random seed for reproducibility, yielding 242 training and 61 test samples.
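A minimal sketch of how an 80/20 split of 303 rows produces the 242/61 partition. The arrays are random placeholders, and `random_state=42` is assumed to be how the seed was passed; the source does not say whether the split was stratified.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the dataset (303 rows, 13 features).
rng = np.random.default_rng(42)
X = rng.normal(size=(303, 13))
y = rng.integers(0, 2, size=303)

# test_size=0.2 rounds the test set up to ceil(0.2 * 303) = 61 rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (242, 13) (61, 13)
```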

3. Model Comparison

Three classifiers were trained with default parameters and evaluated on the test set:

Model Test Accuracy
Logistic Regression 88.52%
Random Forest Classifier 83.61%
K-Nearest Neighbors 68.85%
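The comparison loop can be sketched as follows. It runs on synthetic data from `make_classification`, so the printed scores will not match the table; the dictionary layout and variable names are illustrative assumptions rather than the author's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with the same shape as the dataset (303 x 13).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters; only the forest gets a seed so the run is repeatable.
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest Classifier": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Fit each model on the training split and record test-set accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.2%}")
```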

4. Hyperparameter Tuning

All three models were tuned with RandomizedSearchCV (5-fold CV, 20 iterations).

  • KNN improved to a peak of 75.41% test accuracy after sweeping n_neighbors from 1 to 20.
  • Random Forest reached a best CV score of 82.64% with n_estimators=710, max_depth=10, min_samples_split=8, min_samples_leaf=3, max_features=1.
  • Logistic Regression retained the top test accuracy at 88.52% with C=0.2336, solver='liblinear'.

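Logistic Regression was selected as the final model. The Random Forest figure above is a cross-validation score, whereas the KNN and Logistic Regression figures are test-set accuracies, so the three numbers are not directly comparable.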
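A sketch of the Logistic Regression search under the stated settings (5-fold CV, 20 iterations). The search space is an assumption: a log-spaced grid for C is a common choice, and the reported C=0.2336 happens to be consistent with one point of `np.logspace(-4, 4, 20)`, but the source does not give the actual grid. The data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the 242-row training split.
X, y = make_classification(
    n_samples=242, n_features=13, n_informative=6, random_state=42
)

# Assumed search space; the real grid is not stated in the source.
param_distributions = {
    "C": np.logspace(-4, 4, 20),
    "solver": ["liblinear"],
}

search = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions=param_distributions,
    n_iter=20,        # 20 sampled candidates
    cv=5,             # 5-fold cross-validation
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.2%}")
```

`best_score_` is the mean cross-validated score on the training data, so it should be compared with other CV scores rather than with test-set accuracy.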

Results

Test Set Performance (n=61)

Class Precision Recall F1 Support
0 (No disease) 0.89 0.86 0.88 29
1 (Disease) 0.88 0.91 0.89 32
Accuracy - - 0.89 61
Macro avg 0.89 0.88 0.88 61
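The per-class figures can be checked by hand. The confusion matrix below is inferred from the reported supports and rounded metrics; it is not given in the source, so treat it as one matrix that is consistent with the table rather than as a reported result.

```python
import numpy as np

# Inferred confusion matrix (rows = true class, columns = predicted class).
# Assumption: chosen to be consistent with the table, not taken from the source.
cm = np.array([
    [25, 4],   # true 0: 25 predicted 0, 4 predicted 1
    [3, 29],   # true 1: 3 predicted 0, 29 predicted 1
])

support = cm.sum(axis=1)                   # samples per true class
precision = np.diag(cm) / cm.sum(axis=0)   # TP / all predicted as that class
recall = np.diag(cm) / support             # TP / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

print("support  ", support)
print("precision", np.round(precision, 2))
print("recall   ", np.round(recall, 2))
print("f1       ", np.round(f1, 2))
print(f"accuracy  {accuracy:.2%}")
```

With these counts, 54 of 61 test samples are classified correctly, which matches the reported 88.52% accuracy.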

5-Fold Cross-Validation

Metric Score
Accuracy 84.80%
Precision 82.16%
Recall 92.73%
F1 87.05%

The high recall (92.73%) matters most in a screening-style setting, where false negatives (missed disease cases) are typically more costly than false positives. It comes with lower precision (82.16%), meaning more false alarms.
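A sketch of how the four cross-validated metrics could be computed in one pass with `cross_validate`. The data is synthetic, and whether the author ran CV on the full dataset or only the training split is not stated, so the 303-row shape is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the dataset; scores will differ from the table.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)

# Tuned hyperparameters as reported in the model card.
model = LogisticRegression(C=0.2336, solver="liblinear")

metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(model, X, y, cv=5, scoring=metrics)

# Average each metric over the 5 folds.
summary = {m: cv_results[f"test_{m}"].mean() for m in metrics}
for name, value in summary.items():
    print(f"{name}: {value:.2%}")
```

Each metric is averaged over folds independently, so the mean F1 need not equal the F1 computed from mean precision and mean recall. For the reported figures, 2PR/(P+R) with P=82.16% and R=92.73% gives about 87.1%, close to but not identical with the reported 87.05%.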

Feature Importance

Logistic Regression coefficients indicate which features push predictions toward the positive class (heart disease present):

Feature Coefficient Direction
cp (chest pain type) +0.675 Strong positive
slope +0.471 Positive
restecg +0.335 Positive
fbs +0.048 Weak positive
thalach +0.025 Weak positive
age +0.004 Negligible
sex -0.904 Strong negative
thal -0.700 Strong negative
ca -0.652 Strong negative
exang -0.631 Strong negative
oldpeak -0.576 Strong negative
trestbps -0.012 Negligible
chol -0.002 Negligible

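Among the positive coefficients, cp (chest pain type) is the largest; sex, thal, and ca (number of major vessels) have the largest negative coefficients. Because the features were not scaled, coefficient magnitudes depend on each feature's units and are not a strict ranking of importance: a small coefficient on a wide-range feature such as chol or trestbps can still shift a prediction.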
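A logistic regression coefficient is a change in log-odds per unit of the feature, so exponentiating it gives an odds ratio. The short sketch below applies that to three coefficients from the table, holding the other features fixed.

```python
import math

# Coefficients copied from the table above.
coefficients = {"cp": 0.675, "sex": -0.904, "age": 0.004}

# exp(coefficient) = multiplicative change in the odds of the positive class
# for a one-unit increase in the feature, all else equal.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.3f}")
```

Under this model, each step up in the cp code roughly doubles the odds of a positive prediction (about 1.96x), while sex = 1 multiplies them by about 0.40 relative to sex = 0.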

Usage

Installation

pip install scikit-learn==1.4.0 joblib pandas huggingface_hub

Loading the Model

import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="abduleyo/heart-disease-classification-model",
    filename="model.joblib",
)
model = joblib.load(path)

Making a Prediction

sample = pd.DataFrame([{
    "age": 54, "sex": 1, "cp": 2, "trestbps": 150, "chol": 232,
    "fbs": 0, "restecg": 0, "thalach": 165, "exang": 0,
    "oldpeak": 1.6, "slope": 2, "ca": 0, "thal": 3,
}])

prediction = model.predict(sample)[0]
probability = model.predict_proba(sample)[0][1]

print(f"Prediction: {'Disease' if prediction == 1 else 'No Disease'}")
print(f"Probability of disease: {probability:.2%}")

Limitations

  • Small sample size. The model was trained on only 303 records, which limits generalizability.
  • Cohort bias. The Cleveland dataset reflects a specific patient population from a single institution. Performance may degrade on data from other regions, ethnicities, or clinical settings.
  • Feature engineering. No scaling, normalization, or polynomial features were applied. Adding StandardScaler would put wide-range features such as chol and trestbps on a common scale, which typically helps solver convergence and makes coefficients easier to compare.
  • Not a clinical tool. This model is a demonstration. Heart disease diagnosis requires comprehensive evaluation by qualified medical professionals.
  • Categorical encoding. Categorical features (cp, restecg, slope, thal) are treated as ordinal integers rather than one-hot encoded, so Logistic Regression fits a single linear effect across category codes that may have no natural order.
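The scaling and encoding limitations above could be addressed with a preprocessing pipeline. This is a sketch of one possible approach, not the published model: the column grouping is an assumption, the data is randomly generated, and a pipeline like this would need to be retrained and re-evaluated before any comparison with the reported results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed grouping of the 13 features; remaining columns pass through unchanged.
numeric = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical = ["cp", "restecg", "slope", "thal"]

# Random placeholder data with plausible ranges; NOT the Cleveland data.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "age": rng.integers(29, 78, n),
    "sex": rng.integers(0, 2, n),
    "cp": rng.integers(0, 4, n),
    "trestbps": rng.integers(94, 201, n),
    "chol": rng.integers(126, 565, n),
    "fbs": rng.integers(0, 2, n),
    "restecg": rng.integers(0, 3, n),
    "thalach": rng.integers(71, 203, n),
    "exang": rng.integers(0, 2, n),
    "oldpeak": rng.uniform(0.0, 6.2, n).round(1),
    "slope": rng.integers(0, 3, n),
    "ca": rng.integers(0, 4, n),
    "thal": rng.integers(1, 4, n),
})
y = rng.integers(0, 2, n)

# Scale continuous columns, one-hot encode nominal ones, keep the rest as-is.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(C=0.2336, solver="liblinear")),
])
pipeline.fit(df, y)

proba = pipeline.predict_proba(df)
print(proba[:3])
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fit only on the training folds during cross-validation, which avoids leaking test-fold statistics.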

Reproducibility

  • Framework: scikit-learn
  • Random seed: 42 (numpy)
  • Cross-validation: 5-fold
  • Search strategy: RandomizedSearchCV with 20 iterations

License

MIT License. Free to use, modify, and distribute with attribution.

Citation

If you use this model in your work, please cite:

@misc{heart-disease-classification-2026,
  author = {Abdule},
  title  = {Heart Disease Classification with Logistic Regression},
  year   = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abduleyo/heart-disease-classification-model}
}

Acknowledgements

Dataset: UCI Machine Learning Repository, Heart Disease Data Set (Cleveland). Original investigators: Andras Janosi, William Steinbrunn, Matthias Pfisterer, Robert Detrano.
