---
title: CardioTrack API
emoji: ❤️
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860
---
# Predicting Outcomes in Heart Failure
## Table of Contents
1. [Project Overview](#project-overview)
2. [Project Organization](#project-organization)
3. [DVC Pipeline Defined](#dvc-pipeline-defined)
4. [Milestones Summary](#milestones-summary)
- [Milestone 1 - Inception](#milestone-1---inception)
- [Milestone 2 - Reproducibility](#milestone-2---reproducibility)
- [Milestone 3 - Quality Assurance](#milestone-3---quality-assurance)
- [Milestone 4 - API Integration](#milestone-4---api-integration)
## Project Overview
<a target="_blank" href="https://cookiecutter-data-science.drivendata.org/">
<img src="https://img.shields.io/badge/CCDS-Project%20template-328F97?logo=cookiecutter" />
</a>
This project develops a pipeline for predicting patient outcomes in heart failure from a publicly available dataset of clinical records. The goal is to design and evaluate machine learning models within a reproducible workflow that can be integrated into larger systems for clinical decision support. The workflow addresses data heterogeneity, defines consistent preprocessing and feature engineering strategies, and compares alternative modeling approaches using clinically relevant metrics. It also emphasizes model transparency and auditability, so that the resulting pipeline can be deployed as a reliable, adaptable software component in healthcare applications. The project aims both to improve on baseline predictive performance and to show how data-driven models can be integrated into end-to-end AI-enabled healthcare systems.
## Project Organization
```
├── LICENSE            <- Open-source license if one is chosen
├── Makefile           <- Makefile with convenience commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default mkdocs project; see www.mkdocs.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── pyproject.toml     <- Project configuration file with package metadata for
│                         predicting_outcomes_in_heart_failure and configuration for tools like black
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.cfg          <- Configuration file for flake8
│
└── predicting_outcomes_in_heart_failure   <- Source code for use in this project.
    │
    ├── __init__.py             <- Makes predicting_outcomes_in_heart_failure a Python module
    │
    ├── config.py               <- Store useful variables and configuration
    │
    ├── data
    │   ├── __init__.py
    │   ├── dataset.py          <- Scripts to download or generate data
    │   ├── preprocess.py       <- Data preprocessing code
    │   └── split_data.py       <- Split dataset into train and test code
    │
    ├── features.py             <- Code to create features for modeling
    │
    ├── modeling
    │   ├── __init__.py
    │   ├── predict.py          <- Code to run model inference with trained models
    │   └── train.py            <- Code to train models
    │
    └── plots.py                <- Code to create visualizations
```
## DVC Pipeline Defined
```
     +---------------+
     | download_data |
     +---------------+
             *
             *
             *
     +---------------+
     | preprocessing |
     +---------------+
             *
             *
             *
      +------------+
      | split_data |
      +------------+
       ***        ***
      *              *
    **                ***
+----------+             *
| training |           ***
+----------+          *
       ***         ***
          *       *
           **   **
      +------------+
      | evaluation |
      +------------+
```
## Milestones Summary
### Milestone 1 - Inception
During this milestone, the **CCDS Project Template** was used as the foundation for organizing the project.
The main conceptual and structural components of the system were defined, following the template guidelines to ensure consistency and traceability.
Additionally, a **Machine Learning Canvas** has been added in the [`docs/`](./docs) folder.
It outlines the model objectives, the data to be used, and the key methodological aspects planned for the next phases of the project.
### Milestone 2 - Reproducibility
Milestone 2 introduces **reproducibility** across the whole workflow, from **data management** to **model training and evaluation**. This includes a fully automated pipeline, experiment tracking, and model registry integration, so that every step can be consistently reproduced and monitored.
#### Exploratory Data Analysis (EDA)
As part of the early steps, we added and refined an **Exploratory Data Analysis** to better understand the dataset, its distribution, and relationships between variables. This helped define the preprocessing and modeling strategies used later.
#### DVC Initialization and Pipeline Setup
We initialized **DVC** and configured a full pipeline to automate the main steps of the ML workflow:
- Automatic data **download**
- **Preprocessing**
- **Data splitting**
- **Training** and **evaluation**
The pipeline is fully reproducible and version-controlled through DVC.
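The stage order implied by the diagram above can be sketched with the Python standard library. This is only an illustration of the dependency graph, not the project's actual `dvc.yaml`: the stage names are taken from the diagram, and the direct edge from `split_data` to `evaluation` is read off the drawing.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (names from the diagram).
STAGE_DEPENDENCIES = {
    "download_data": set(),
    "preprocessing": {"download_data"},
    "split_data": {"preprocessing"},
    "training": {"split_data"},
    "evaluation": {"split_data", "training"},
}


def execution_order(dependencies):
    """Return one valid order in which the stages can run."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(STAGE_DEPENDENCIES)))
```

With DVC itself, `dvc repro` resolves this order and skips stages whose dependencies have not changed.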
#### Model Training and Experiment Tracking
We implemented the **training scripts** and integrated **MLflow** for experiment tracking.
Three models are trained and evaluated within this workflow:
- Decision Tree
- Random Forest
- Logistic Regression
Each experiment is logged to MLflow.
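As a rough sketch of what such a training loop might look like, the example below fits the three model families with scikit-learn on synthetic data and collects one score per model. The synthetic dataset, the hyperparameters, the F1 metric, and the function name `train_and_score` are all assumptions made for illustration; the project's own scripts and MLflow logging calls are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_score(random_state=42):
    """Train the three model families and return their test-set F1 scores."""
    # Synthetic stand-in data; the real project uses the heart failure dataset.
    X, y = make_classification(n_samples=300, n_features=8, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=random_state),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=random_state),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # In a tracked setup, this is where each run's metric would be logged.
        scores[name] = f1_score(y_test, model.predict(X_test))
    return scores


if __name__ == "__main__":
    for name, score in train_and_score().items():
        print(f"{name}: F1 = {score:.3f}")
```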
#### Model Registry and Thresholds
Models that reach or exceed the predefined **performance thresholds** (as defined in the ML Canvas) are automatically **saved to the model registry**.
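The gating logic described here can be illustrated with a small helper. The threshold values, metric names, and candidate scores below are made up for the example; the real thresholds are the ones defined in the ML Canvas, and the actual registration step is not shown.

```python
def meets_thresholds(metrics, thresholds):
    """Return True when every thresholded metric is present and high enough."""
    return all(
        name in metrics and metrics[name] >= minimum
        for name, minimum in thresholds.items()
    )


# Hypothetical values; the real thresholds are defined in the ML Canvas.
EXAMPLE_THRESHOLDS = {"f1": 0.80, "recall": 0.80}

# Made-up scores, used only to exercise the helper.
candidates = {
    "random_forest": {"f1": 0.86, "recall": 0.88},
    "decision_tree": {"f1": 0.78, "recall": 0.83},
}

# Only models passing every threshold would be sent to the registry.
to_register = [
    name
    for name, metrics in candidates.items()
    if meets_thresholds(metrics, EXAMPLE_THRESHOLDS)
]
```

A metric that is missing from the results is treated as a failure, which keeps an incomplete evaluation from being registered by accident.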
### Milestone 3 - Quality Assurance
In this milestone, we introduced a **Quality Assurance** layer to the system.
#### Static Linters
Two static linters were added to improve code style and consistency:
- **Ruff** for Python files in the `predicting_outcomes_in_heart_failure` and `tests` folders.
It checks formatting, syntax, and common anti-patterns, and is integrated into the GitHub workflow via an *action*.
- **Pynblint** for Jupyter notebooks, also integrated into the GitHub workflow through a dedicated *action*.
#### Data Quality
We implemented **data quality checks** on both raw and processed data using **Great Expectations**.
These validations help to:
- detect anomalies or invalid values at the data source
- prevent the propagation of data issues into downstream processes
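To make the idea of a range check concrete, here is a plain-Python stand-in. It deliberately does not use the Great Expectations API, and the column names and bounds are hypothetical; it only shows the kind of rule such a validation suite might encode.

```python
def validate_rows(rows, rules):
    """Return (row_index, column, value) for every rule violation."""
    violations = []
    for index, row in enumerate(rows):
        for column, (low, high) in rules.items():
            value = row.get(column)
            # A missing value or one outside [low, high] counts as invalid.
            if value is None or not (low <= value <= high):
                violations.append((index, column, value))
    return violations


# Hypothetical columns and bounds, for illustration only.
RULES = {"age": (0, 120), "resting_bp": (50, 250)}

rows = [
    {"age": 54, "resting_bp": 130},
    {"age": -3, "resting_bp": 120},   # invalid age
    {"age": 61, "resting_bp": None},  # missing blood pressure
]

violations = validate_rows(rows, RULES)
```

Failing the pipeline as soon as `violations` is non-empty is what stops bad records from reaching preprocessing and training.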
#### Code Quality
We added automated **unit and integration tests** using **pytest**, covering the main modules and functionalities of the system.
#### ML Pipeline Enhancements
We applied the following enhancements to the ML pipeline:
- Refactored preprocessing with gender-based dataset variants.
- Added validation (e.g., error on single-row datasets).
- Saved StandardScaler as preprocessing artifact.
- Updated split logic and DVC pipeline.
- Training now creates variant-specific MLflow experiments.
- Added RandomOverSampler to address class imbalance.
- Updated evaluation and inference to align with the new structure.
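Two of these items, the single-row validation and random oversampling, can be sketched with the standard library. The functions below are illustrative stand-ins with hypothetical names: the project uses `RandomOverSampler` (from imbalanced-learn) rather than this hand-written version, and its exact validation rules are not shown in this README.

```python
import random
from collections import Counter


def check_min_rows(rows, minimum=2):
    """Raise if the dataset is too small to be split or scaled."""
    if len(rows) < minimum:
        raise ValueError(f"Dataset has {len(rows)} row(s); at least {minimum} required.")


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == label]
        # Draw with replacement to fill the gap to the majority class size.
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: four negatives and one positive.
X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 0, 1]

check_min_rows(X)
X_res, y_res = random_oversample(X, y)
```

Oversampling is normally applied to the training split only, so that duplicated samples never leak into the test set.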
#### Explainability
We added an explainability layer to the system:
- Added a SHAP-based explainability module.
- Added tests for the explainability functionality.
#### Risk Classification
We added a **Risk Classification** analysis for the system in accordance with the **IMDRF** framework and the EU **AI Act**.
The documentation is available in the [`docs/`](./docs) folder.
### Milestone 4 - API Integration
During Milestone 4, we implemented a fully functional API, together with a dataset card for the dataset used and a model card for the champion model.
The API is organized into four main routers:
#### **General Router**
- **GET /**
Returns a welcome message and confirms that the API is running.
#### **Prediction Router**
- **POST /predictions**
Generates a binary prediction (0/1) for a single patient sample.
- **POST /predict-batch**
Accepts a list of patient samples and returns a prediction for each element in the batch.
- **POST /explanations**
Produces SHAP-based explanations for a single input and returns the URL of the generated SHAP waterfall plot.
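A minimal client for the single-sample endpoint might look like the sketch below. The base URL assumes the container is running locally on the `app_port` from the front matter, and the payload fields are hypothetical; the real field names come from the API's input schema, which this README does not list.

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:7860"  # assumption: local container on app_port 7860


def build_prediction_request(sample, base_url=BASE_URL):
    """Build (but do not send) a POST /predictions request with a JSON body."""
    return Request(
        url=f"{base_url}/predictions",
        data=json.dumps(sample).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(sample, base_url=BASE_URL):
    """Send the request and decode the JSON response."""
    with urlopen(build_prediction_request(sample, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payload; replace the keys with the fields the API expects.
example_sample = {"age": 54, "sex": "M", "cholesterol": 230}
request = build_prediction_request(example_sample)
```

Calling `predict(example_sample)` sends the request and requires the API to be running; `build_prediction_request` alone is enough to inspect the URL and body offline.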
#### **Model Info Router**
- **GET /model/hyperparameters**
Returns the hyperparameters and cross-validation results of the model defined in `MODEL_PATH`.
- **GET /model/metrics**
Returns the test-set metrics stored during the model evaluation stage.
#### **Cards Router**
- **GET /card/{card_type}**
Returns the content of a “card” file (dataset card or model card).
#### **Cards**
During this milestone, we also created:
- a **dataset card** describing the dataset used by the champion model
- a **model card** documenting the champion model itself