---
title: README
emoji: 📉
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
---

# Tuva Project: Open-Source Healthcare Modeling

Welcome to the Tuva ML Models Hub — an open-source ecosystem for healthcare risk prediction, cost benchmarking, and expected value modeling.

---

## Mission

The Tuva Project is dedicated to democratizing healthcare knowledge. We believe that access to robust models should not be locked behind paywalls or proprietary systems. These models are typically:

- Expensive to build and maintain
- Trained on complex healthcare data
- Essential for policy, research, and actuarial strategy

By open-sourcing these tools, we empower health systems, researchers, and startups to build with transparency and scale with trust.

---

## What You'll Find Here

This hub is a growing library of machine learning models designed to support:

- Cost prediction
- Encounter forecasting
- Risk stratification
- Benchmarking for Medicare, Medicaid, and commercial populations

Each model includes:

- Trained model artifacts (e.g., `.pkl`, `.joblib`)
- Scripts for running predictions
- Complete documentation and evaluation metrics

---

## Quick Start: End-to-End Workflow

This section provides high-level instructions for running a model with the Tuva Project. The workflow involves preparing benchmark data using dbt, running a Python prediction script, and optionally ingesting the results back into dbt for analysis.

### 1. Configure Your dbt Project

You need to enable the correct variables in your `dbt_project.yml` file to control the workflow.

#### A. Enable Benchmark Marts

These two variables control which parts of the Tuva Project are active. They are `false` by default.

```yaml
# in dbt_project.yml
vars:
  benchmarks_train: true
  benchmarks_already_created: true
```

- `benchmarks_train`: Set to `true` to build the datasets that the ML models will use for making predictions.
- `benchmarks_already_created`: Set to `true` to ingest model predictions back into the project as a new dbt source.

#### B. (Optional) Set Prediction Source Locations

If you plan to bring predictions back into dbt for analysis, you must define where dbt can find the prediction data:

```yaml
# in dbt_project.yml
vars:
  predictions_person_year: "{{ source('benchmark_output', 'person_year') }}"
  predictions_inpatient: "{{ source('benchmark_output', 'inpatient') }}"
  predictions_inpatient_prospective: "{{ source('benchmark_output', 'inpatient_predictions_prospective') }}"
  predictions_person_year_prospective: "{{ source('benchmark_output', 'pmpm_predictions_prospective') }}"
```

#### C. Configure `sources.yml`

Ensure your `sources.yml` file includes a definition for the source referenced above (e.g., `benchmark_output`) that points to the database and schema where your model's prediction outputs are stored.

---

### 2. The 3-Step Run Process

This workflow can be managed by any orchestration tool (e.g., Airflow, Prefect, Fabric Notebooks) or run manually from the command line.

#### Step 1: Generate the Training & Benchmarking Data

Run the Tuva Project with `benchmarks_train` enabled. This creates the input data required by the ML model.

```bash
dbt build --vars '{benchmarks_train: true}'
```

To run only the benchmark mart:

```bash
dbt build --select tag:benchmarks_train --vars '{benchmarks_train: true}'
```

#### Step 2: Run the Prediction Python Code

Execute the Python script to generate predictions. The script reads the data created in Step 1 and writes the prediction outputs to a persistent location (e.g., a table in your data warehouse).

*Each model's repository includes the example Snowflake Notebook code used in Tuva's environment.*

#### Step 3: (Optional) Analyze Predictions in dbt

To bring the predictions back into the Tuva Project for analysis, run dbt again with `benchmarks_already_created` enabled. This populates the analytics marts.
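As a concrete illustration of Step 2, here is a minimal sketch of what a prediction script does. The function name, the `person_id` column, and the `predicted_paid_amount` output field are hypothetical placeholders; consult each model's repository for its actual inputs, outputs, and notebook code.

```python
import pandas as pd

def score_benchmark_rows(features: pd.DataFrame, model,
                         id_col: str = "person_id") -> pd.DataFrame:
    """Score a benchmark feature table with a trained model.

    `features` stands in for the rows produced by the benchmarks_train mart;
    `model` is any object with a scikit-learn style .predict() method, e.g.
    an artifact loaded via joblib.load("model.joblib").
    """
    X = features.drop(columns=[id_col])
    preds = model.predict(X)
    # Keep the member identifier alongside each prediction so the output
    # can be written to the warehouse and joined back in dbt (Step 3).
    return pd.DataFrame({id_col: features[id_col],
                         "predicted_paid_amount": preds})
```

In practice the script would read `features` from the tables built in Step 1 and write the returned frame to the location your `benchmark_output` source points at.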
```bash
dbt build --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```

To run only the analysis models:

```bash
dbt build --select tag:benchmarks_analysis --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```

---

## Current Focus: Medicare (CMS)

Our initial models use de-identified CMS data to calculate:

- Expected values for paid amounts and encounter counts at the member-year level
- Readmission rate
- Discharge location
- Length of stay

Models like the **Encounter Cost Prediction Model** are trained on the 2022/23 Medicare Standard Analytic Files (SAF), using standardized preprocessing and evaluation pipelines.

---

## What's Next

We are expanding to include:

- Commercial claims models (e.g., ESI, employer-based populations)
- Medicaid utilization and cost models

---

## Contribute

This hub is open to community contributions. If you're working on a healthcare machine learning model and want to share it:

1. Fork one of our repositories
2. Upload your trained model and code
3. Document your inputs, outputs, and evaluation
4. Open a pull request or reach out to our team

We believe risk modeling should be open infrastructure. Help us build a future where healthcare knowledge is free and shared.