---
title: README
emoji: 📉
colorFrom: indigo
colorTo: purple
sdk: static
pinned: false
---

# Tuva Project: Open-Source Healthcare Modeling

Welcome to the Tuva ML Models Hub — an open-source ecosystem for healthcare risk prediction, cost benchmarking, and expected value modeling.

---

## Mission

The Tuva Project is dedicated to democratizing healthcare knowledge. We believe that access to robust models should not be locked behind paywalls or proprietary systems. These models are typically:

- Expensive to build and maintain
- Trained on complex healthcare data
- Essential for policy, research, and actuarial strategy

By open-sourcing these tools, we empower health systems, researchers, and startups to build with transparency and scale with trust.

---

## What You'll Find Here

This hub is a growing library of machine learning models designed to support:

- Cost prediction
- Encounter forecasting
- Risk stratification
- Benchmarking for Medicare, Medicaid, and commercial populations

Each model includes:

- Trained model artifacts (e.g., `.pkl`, `.joblib`)
- Scripts for running predictions
- Complete documentation and evaluation metrics

---

## Quick Start: End-to-End Workflow

This section provides high-level instructions for running a model with the Tuva Project. The workflow involves preparing benchmark data using dbt, running a Python prediction script, and optionally ingesting the results back into dbt for analysis.

### 1. Configure Your dbt Project

You need to enable the correct variables in your `dbt_project.yml` file to control the workflow.

#### A. Enable Benchmark Marts

These two variables control which parts of the Tuva Project are active. They are `false` by default.

```yaml
# in dbt_project.yml
vars:
  benchmarks_train: true
  benchmarks_already_created: true
```

- `benchmarks_train`: Set to `true` to build the datasets that the ML models will use for making predictions.
- `benchmarks_already_created`: Set to `true` to ingest model predictions back into the project as a new dbt source.

#### B. (Optional) Set Prediction Source Locations

If you plan to bring predictions back into dbt for analysis, you must define where dbt can find the prediction data:

```yaml
# in dbt_project.yml
vars:
  predictions_person_year: "{{ source('benchmark_output', 'person_year') }}"
  predictions_inpatient: "{{ source('benchmark_output', 'inpatient') }}"
  predictions_inpatient_prospective: "{{ source('benchmark_output', 'inpatient_predictions_prospective') }}"
  predictions_person_year_prospective: "{{ source('benchmark_output', 'pmpm_predictions_prospective') }}"
```

#### C. Configure `sources.yml`

Ensure your `sources.yml` file includes a definition for the source referenced above (e.g., `benchmark_output`) that points to the database and schema where your model's prediction outputs are stored.

---

### 2. The 3-Step Run Process

This workflow can be managed by any orchestration tool (e.g., Airflow, Prefect, Fabric Notebooks) or run manually from the command line.

#### Step 1: Generate the Training & Benchmarking Data

Run the Tuva Project with `benchmarks_train` enabled. This creates the input data required by the ML model.

```bash
dbt build --vars '{benchmarks_train: true}'
```

To run only the benchmark mart:

```bash
dbt build --select tag:benchmarks_train --vars '{benchmarks_train: true}'
```

#### Step 2: Run the Prediction Python Code

Execute the Python script to generate predictions. The script reads the data created in Step 1 and writes the prediction outputs to a persistent location (e.g., a table in your data warehouse).

*Each model's repository includes the example Snowflake Notebook code used in Tuva's environment.*

#### Step 3: (Optional) Analyze Predictions in dbt

To bring the predictions back into the Tuva Project for analysis, run dbt again with `benchmarks_already_created` enabled. This populates the analytics marts.
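As a concrete illustration of Step 2, here is a minimal sketch of what a prediction script does. The function name, the `person_id` column, and the `predicted_paid_amount` output field are hypothetical placeholders; consult each model's repository for its actual inputs, outputs, and notebook code.

```python
import pandas as pd

def score_benchmark_rows(features: pd.DataFrame, model,
                         id_col: str = "person_id") -> pd.DataFrame:
    """Score a benchmark feature table with a trained model.

    `features` stands in for the rows produced by the benchmarks_train mart;
    `model` is any object with a scikit-learn style .predict() method, e.g.
    an artifact loaded via joblib.load("model.joblib").
    """
    X = features.drop(columns=[id_col])
    preds = model.predict(X)
    # Keep the member identifier alongside each prediction so the output
    # can be written to the warehouse and joined back in dbt (Step 3).
    return pd.DataFrame({id_col: features[id_col],
                         "predicted_paid_amount": preds})
```

In practice the script would read `features` from the tables built in Step 1 and write the returned frame to the location your `benchmark_output` source points at.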
```bash
dbt build --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```

To run only the analysis models:

```bash
dbt build --select tag:benchmarks_analysis --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```

---

## Current Focus: Medicare (CMS)

Our initial models use de-identified CMS data to calculate:

- Expected values for paid amounts and encounter counts at the member-year level
- Readmission rate
- Discharge location
- Length of stay

Models like the **Encounter Cost Prediction Model** are trained on the 2022/23 Medicare Standard Analytic Files (SAF), using standardized preprocessing and evaluation pipelines.

---

## What's Next

We are expanding to include:

- Commercial claims models (e.g., ESI, employer-based populations)
- Medicaid utilization and cost models

---

## Contribute

This hub is open to community contributions. If you're working on a healthcare machine learning model and want to share it:

1. Fork one of our repositories
2. Upload your trained model and code
3. Document your inputs, outputs, and evaluation
4. Open a pull request or reach out to our team

We believe risk modeling should be open infrastructure. Help us build a future where healthcare knowledge is free and shared.