---
title: README
emoji: πŸ“‰
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
---
# Tuva Project: Open-Source Healthcare Modeling
Welcome to the Tuva ML Models Hub β€” an open-source ecosystem for healthcare risk prediction, cost benchmarking, and expected value modeling.
---
## Mission
The Tuva Project is dedicated to democratizing healthcare knowledge.
We believe that access to robust models should not be locked behind paywalls or proprietary systems.
These models are typically:
- Expensive to build and maintain
- Trained on complex healthcare data
- Essential for policy, research, and actuarial strategy
By open-sourcing these tools, we empower health systems, researchers, and startups to build with transparency and scale with trust.
---
## What You'll Find Here
This hub is a growing library of machine learning models designed to support:
- Cost prediction
- Encounter forecasting
- Risk stratification
- Benchmarking for Medicare, Medicaid, and commercial populations
Each model includes:
- Trained model artifacts (e.g., `.pkl`, `.joblib`)
- Scripts for running predictions
- Complete documentation and evaluation metrics
---
## Quick Start: End-to-End Workflow
This section provides high-level instructions for running a model with the Tuva Project. The workflow involves preparing benchmark data using dbt, running a Python prediction script, and optionally ingesting the results back into dbt for analysis.
### 1. Configure Your dbt Project
You need to enable the correct variables in your `dbt_project.yml` file to control the workflow.
#### A. Enable Benchmark Marts
These two variables control which parts of the Tuva Project are active. They are `false` by default.
```yaml
# in dbt_project.yml
vars:
  benchmarks_train: true
  benchmarks_already_created: true
```
- `benchmarks_train`: Set to `true` to build the datasets that the ML models will use for making predictions.
- `benchmarks_already_created`: Set to `true` to ingest model predictions back into the project as a new dbt source.
#### B. (Optional) Set Prediction Source Locations
If you plan to bring predictions back into dbt for analysis, you must define where dbt can find the prediction data.
```yaml
# in dbt_project.yml
vars:
  predictions_person_year: "{{ source('benchmark_output', 'person_year') }}"
  predictions_inpatient: "{{ source('benchmark_output', 'inpatient') }}"
  predictions_inpatient_prospective: "{{ source('benchmark_output', 'inpatient_predictions_prospective') }}"
  predictions_person_year_prospective: "{{ source('benchmark_output', 'pmpm_predictions_prospective') }}"
```
#### C. Configure `sources.yml`
Ensure your `sources.yml` file includes a definition for the source you referenced above (e.g., `benchmark_output`) that points to the database and schema where your model's prediction outputs are stored.
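For example, a minimal `sources.yml` entry might look like the following. The `database` and `schema` values are placeholders for your environment; the table names match the `source()` references shown above.

```yaml
# in sources.yml (illustrative; adjust database/schema to your warehouse)
version: 2

sources:
  - name: benchmark_output
    database: your_database
    schema: your_prediction_schema
    tables:
      - name: person_year
      - name: inpatient
      - name: inpatient_predictions_prospective
      - name: pmpm_predictions_prospective
```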
---
### 2. The 3-Step Run Process
This workflow can be managed by any orchestration tool (e.g., Airflow, Prefect, Fabric Notebooks) or run manually from the command line.
#### Step 1: Generate the Training & Benchmarking Data
Run the Tuva Project with `benchmarks_train` enabled. This creates the input data required by the ML model.
```bash
dbt build --vars '{benchmarks_train: true}'
```
To run only the benchmark mart:
```bash
dbt build --select tag:benchmarks_train --vars '{benchmarks_train: true}'
```
#### Step 2: Run the Prediction Python Code
Execute the Python script to generate predictions. This script will read the data created in Step 1 and write the prediction outputs to a persistent location (e.g., a table in your data warehouse).
*We have provided example Snowflake Notebook code within each model's repository that was used in Tuva's environment.*
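As a rough illustration of the shape of a prediction script, the sketch below loads a serialized model artifact and scores benchmark rows. The model class, feature names, and output column are invented for this example and are not taken from any Tuva model repository; a real script would load the `.pkl`/`.joblib` artifact shipped with the model and write results to your warehouse.

```python
import pickle

class DummyCostModel:
    """Stand-in for a trained artifact (e.g., a .pkl from a model repo)."""
    def predict(self, rows):
        # Placeholder prediction: a flat expected cost per member-year.
        return [1000.0 for _ in rows]

# Serialize and deserialize to mimic loading a saved model artifact.
artifact = pickle.dumps(DummyCostModel())
model = pickle.loads(artifact)

# Rows as produced by the benchmarks_train mart (illustrative schema).
input_rows = [
    {"person_id": "A1", "year": 2022, "age": 67},
    {"person_id": "B2", "year": 2022, "age": 72},
]
predictions = model.predict(input_rows)

# Attach predictions; in practice these rows would be written to the
# warehouse table that your benchmark_output dbt source points at.
output_rows = [
    {**row, "predicted_paid_amount": pred}
    for row, pred in zip(input_rows, predictions)
]
```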
#### Step 3: (Optional) Analyze Predictions in dbt
To bring the predictions back into the Tuva Project for analysis, run dbt again with `benchmarks_already_created` enabled. This populates the analytics marts.
```bash
dbt build --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```
To run only the analysis models:
```bash
dbt build --select tag:benchmarks_analysis --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```
---
## Current Focus: Medicare (CMS)
Our initial models use de-identified CMS data to calculate:
- Expected values for paid amounts and encounter counts at the member-year level
- Readmission rates
- Discharge locations
- Lengths of stay
Models like the **Encounter Cost Prediction Model** are trained on the 2022/23 Medicare Standard Analytic Files (SAF), using standardized preprocessing and evaluation pipelines.
---
## What's Next
We are expanding to include:
- Commercial claims models (e.g., ESI, employer-based populations)
- Medicaid utilization and cost models
---
## Contribute
This hub is open to community contributions.
If you're working on a healthcare machine learning model and want to share it:
1. Fork one of our repositories
2. Upload your trained model and code
3. Document your inputs, outputs, and evaluation
4. Open a pull request or reach out to our team
We believe risk modeling should be open infrastructure.
Help us build a future where healthcare knowledge is free and shared.