bradtuva committed
Commit 62ec3bf · verified · 1 Parent(s): 015ae94

Update README.md

Files changed (1): README.md (+102 −23)
 
colorTo: purple
sdk: static
pinned: false
---

# Tuva Project: Open-Source Healthcare Modeling

Welcome to the Tuva ML Models Hub — an open-source ecosystem for healthcare risk prediction, cost benchmarking, and expected value modeling.
 
The Tuva Project is dedicated to democratizing healthcare knowledge.
We believe that access to robust models should not be locked behind paywalls or proprietary systems.

These models are typically:

- Expensive to build and maintain
- Trained on complex healthcare data
- Essential for policy, research, and actuarial strategy

By open-sourcing these tools, we empower health systems, researchers, and startups to build with transparency and scale with trust.
This hub is a growing library of machine learning models designed to support:

- Cost prediction
- Encounter forecasting
- Risk stratification
- Benchmarking for Medicare, Medicaid, and commercial populations

Each model includes:

- Trained model artifacts (e.g., `.pkl`, `.joblib`)
- Scripts for running predictions
- Complete documentation and evaluation metrics

---

## Quick Start: End-to-End Workflow

This section provides high-level instructions for running a model with the Tuva Project. The workflow involves preparing benchmark data using dbt, running a Python prediction script, and optionally ingesting the results back into dbt for analysis.

### 1. Configure Your dbt Project

You need to enable the correct variables in your `dbt_project.yml` file to control the workflow.

#### A. Enable Benchmark Marts

These two variables control which parts of the Tuva Project are active. They are `false` by default.

```yaml
# in dbt_project.yml
vars:
  benchmarks_train: true
  benchmarks_already_created: true
```

- `benchmarks_train`: Set to `true` to build the datasets that the ML models will use for making predictions.
- `benchmarks_already_created`: Set to `true` to ingest model predictions back into the project as a new dbt source.

#### B. (Optional) Set Prediction Source Locations

If you plan to bring predictions back into dbt for analysis, you must define where dbt can find the prediction data.

```yaml
# in dbt_project.yml
vars:
  predictions_person_year: "{{ source('benchmark_output', 'person_year') }}"
  predictions_inpatient: "{{ source('benchmark_output', 'inpatient') }}"
```

#### C. Configure `sources.yml`

Ensure your `sources.yml` file includes a definition for the source you referenced above (e.g., `benchmark_output`) that points to the database and schema where your model's prediction outputs are stored.
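A matching `sources.yml` entry might look like the following sketch; the `database` and `schema` values here are placeholders for your environment, not Tuva defaults:

```yaml
# in sources.yml
sources:
  - name: benchmark_output
    database: analytics      # placeholder: your warehouse database
    schema: ml_predictions   # placeholder: schema holding model outputs
    tables:
      - name: person_year
      - name: inpatient
```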

---

### 2. The 3-Step Run Process

This workflow can be managed by any orchestration tool (e.g., Airflow, Prefect, Fabric Notebooks) or run manually from the command line.
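The dbt invocations in the steps below lend themselves to programmatic construction inside an orchestrator. A minimal Python sketch (the `build_dbt_command` helper is hypothetical, not part of the Tuva Project):

```python
import json
import subprocess  # an orchestrator task would use this to invoke dbt

def build_dbt_command(dbt_vars, select=None):
    """Construct a `dbt build` argument list for a given set of --vars.

    `dbt_vars` is a dict such as {"benchmarks_train": True}; `select` is an
    optional dbt node selector such as "tag:benchmarks_train".
    """
    cmd = ["dbt", "build"]
    if select:
        cmd += ["--select", select]
    # dbt accepts a YAML/JSON dict string for --vars; JSON is valid YAML.
    cmd += ["--vars", json.dumps(dbt_vars)]
    return cmd

# Step 1 and Step 3 of the workflow as argument lists:
step1 = build_dbt_command({"benchmarks_train": True})
step3 = build_dbt_command({"benchmarks_already_created": True, "benchmarks_train": False})

# An orchestrator would then run, e.g.: subprocess.run(step1, check=True)
```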
#### Step 1: Generate the Training & Benchmarking Data

Run the Tuva Project with `benchmarks_train` enabled. This creates the input data required by the ML model.

```bash
dbt build --vars '{benchmarks_train: true}'
```

To run only the benchmark mart:

```bash
dbt build --select tag:benchmarks_train --vars '{benchmarks_train: true}'
```

#### Step 2: Run the Prediction Python Code

Execute the Python script to generate predictions. This script reads the data created in Step 1 and writes the prediction outputs to a persistent location (e.g., a table in your data warehouse).

*Example Snowflake Notebook code, as used in Tuva's environment, is provided in each model's repository.*
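As a rough, warehouse-agnostic sketch of what a Step 2 script does — every name here (the model class, artifact path, and column names) is hypothetical, and the real prediction code lives in each model's repository:

```python
import os
import pickle
import tempfile

class EncounterCostModel:
    """Stand-in for a real Tuva model artifact (the released .pkl/.joblib
    files contain trained estimators; this toy rule is for illustration)."""
    def predict(self, rows):
        return [100.0 * r["member_months"] for r in rows]

# Serialize and reload, mimicking "download the artifact, then load it".
artifact = os.path.join(tempfile.gettempdir(), "encounter_cost_model.pkl")
with open(artifact, "wb") as f:
    pickle.dump(EncounterCostModel(), f)

with open(artifact, "rb") as f:
    model = pickle.load(f)

# Score the benchmark rows prepared in Step 1 (column names hypothetical).
rows = [{"member_months": 6.0}, {"member_months": 12.0}]
predictions = [
    {**r, "predicted_paid_amount": p} for r, p in zip(rows, model.predict(rows))
]
# In production, write `predictions` to the warehouse table that your
# `benchmark_output` source points at.
```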
#### Step 3: (Optional) Analyze Predictions in dbt

To bring the predictions back into the Tuva Project for analysis, run dbt again with `benchmarks_already_created` enabled. This populates the analytics marts.

```bash
dbt build --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```

To run only the analysis models:

```bash
dbt build --select tag:benchmarks_analysis --vars '{benchmarks_already_created: true, benchmarks_train: false}'
```
 
---

Our initial models use de-identified CMS data to calculate:

- Expected values for paid amounts and encounter counts at the member-year level
- Readmission rate
- Discharge location
- Length of stay

Models like the **Encounter Cost Prediction Model** are trained on the 2020 Medicare Standard Analytic Files (SAF), using standardized preprocessing and evaluation pipelines.

Models trained on 2022 and 2023 data are coming soon.

---

We are expanding to include:

- Commercial claims models (e.g., ESI, employer-based populations)
- Medicaid utilization and cost models

---

This hub is open to community contributions.

If you're working on a healthcare machine learning model and want to share it:

1. Fork one of our repositories
2. Upload your trained model and code
3. Document your inputs, outputs, and evaluation
4. Open a pull request or reach out to our team

We believe risk modeling should be open infrastructure.
Help us build a future where healthcare knowledge is free and shared.