Using the CPTAC Ovarian Offline Dataset
This guide explains how to generate the CPTAC ovarian cohort with OVdata.py and connect it to the Real-time Interactive Clinical Navigator UI.
1. Generate the dataset artifacts
python OVdata.py --output-dir RL0910/data/cptac_ovarian
The script will download the proteomics, transcriptomics, and clinical tables (only on the first run), build a multi-modal feature matrix, and export the following files inside the chosen output directory:
| File | Purpose |
|---|---|
| `ovarian_offline_dataset.csv` | Flat table with `patient_id`, `timestep`, `action`, `reward`, and normalized `state_*` feature columns. |
| `ovarian_offline_dataset.npz` | NumPy archive with arrays for states, actions, rewards, patient IDs, and supporting metadata. |
| `ovarian_offline_schema.yaml` | Schema mapping that allows the UI to align the CSV with the RLDT state/action format. |
| `ovarian_offline_metadata.json` | Summary statistics for reproducibility (feature catalog, normalization info, action labels, etc.). |
Tip: Run `python OVdata.py --help` to see the options for adjusting feature caps or disabling min-max scaling if you need a different preprocessing recipe.
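Before uploading the CSV, it can help to confirm that it has the columns listed in the table above. The following is a minimal sketch using only the standard library; the function name `summarize_offline_csv` and the returned dictionary keys are illustrative and not part of `OVdata.py`.

```python
import csv
from pathlib import Path

# Column names taken from the file table above.
REQUIRED_COLUMNS = ("patient_id", "timestep", "action", "reward")


def summarize_offline_csv(path):
    """Check the header of an exported CSV and return basic counts.

    Hypothetical helper for illustration; it is not provided by OVdata.py.
    """
    with Path(path).open(newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []

        missing = [name for name in REQUIRED_COLUMNS if name not in header]
        if missing:
            raise ValueError(f"missing required columns: {missing}")

        state_columns = [name for name in header if name.startswith("state_")]
        if not state_columns:
            raise ValueError("no state_* feature columns found")

        patients = set()
        n_rows = 0
        for row in reader:
            patients.add(row["patient_id"])
            n_rows += 1

    return {
        "rows": n_rows,
        "patients": len(patients),
        "state_features": len(state_columns),
    }
```

Calling `summarize_offline_csv("RL0910/data/cptac_ovarian/ovarian_offline_dataset.csv")` after step 1 should return the record, patient, and feature counts, which can then be compared with the success banner shown by the UI.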
2. Load the cohort inside the UI
- Launch `python RL0910/enhanced_chat_ui.py`.
- In the 📊 Data Management tab:
  - Select Real Data in the “Data Source” selector.
  - Upload `ovarian_offline_dataset.csv` with the “Load Real Data” file picker.
  - (Recommended) Upload `ovarian_offline_schema.yaml` in the schema slot so the UI can apply the same column mappings used during export.
- Press Load Real Data. A success banner will confirm the number of patients and records loaded, and the patient dropdown becomes active.
- Explore patients, generate reports, or switch to other tabs. Online training remains paused until you explicitly press Start Online Training in the 📊 Online Learning Monitor tab.
3. Optional: reuse the artifacts elsewhere
- The CSV is compatible with most pandas- or PyTorch-based offline RL pipelines.
- The `.npz` bundle loads directly with `numpy.load`, exposing arrays named `states`, `actions`, `rewards`, `patient_ids`, `timesteps`, and `terminals`.
- The YAML schema can be referenced by the adapters in `RL0910/schema.py` to map new cohorts that follow the same column layout.
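As a rough illustration of reusing the `.npz` bundle, the sketch below groups the flat arrays into one trajectory per patient. It assumes that all six arrays share the same first dimension (one entry per record) and that `patient_ids` is stored with a plain string or numeric dtype; neither assumption is confirmed by this guide, and `load_trajectories` is a hypothetical helper name.

```python
import numpy as np

# Array names as listed above for the .npz bundle.
EXPECTED_KEYS = ("states", "actions", "rewards", "patient_ids", "timesteps", "terminals")


def load_trajectories(npz_path):
    """Return {patient_id: {array_name: per-patient array}} sorted by timestep.

    Assumes every array has one entry per record along axis 0.
    """
    with np.load(npz_path) as bundle:
        missing = [key for key in EXPECTED_KEYS if key not in bundle.files]
        if missing:
            raise KeyError(f"archive is missing arrays: {missing}")
        arrays = {key: bundle[key] for key in EXPECTED_KEYS}

    trajectories = {}
    for pid in np.unique(arrays["patient_ids"]):
        mask = arrays["patient_ids"] == pid
        # Order each patient's records by timestep in case rows are interleaved.
        order = np.argsort(arrays["timesteps"][mask], kind="stable")
        trajectories[str(pid)] = {
            key: arrays[key][mask][order]
            for key in ("states", "actions", "rewards", "timesteps", "terminals")
        }
    return trajectories
```

Each returned entry can be passed to an offline RL pipeline as a single episode. If the archive stores any of these arrays with an object dtype, `numpy.load` would need `allow_pickle=True`, which should only be used with files you trust.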
Troubleshooting
- If you see a message about missing CPTAC assets, run `python -m cptac download` once with an internet connection.
- Errors mentioning `_ARRAY_API` typically indicate an older `pyarrow` build. The UI now disables the Arrow backend automatically, but upgrading `pyarrow` (or installing `pyarrow>=14`) will also resolve it.
- Loading failures in the UI will keep the patient dropdown disabled; re-run `OVdata.py` or double-check the schema path if that happens.
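To check whether the installed `pyarrow` is older than the version mentioned above, something like the following sketch may be useful. It uses only the standard library; the helper names are illustrative, and the version parsing assumes a conventional `MAJOR.MINOR.PATCH` version string.

```python
from importlib import metadata


def installed_major_version(package):
    """Return the installed major version of `package`, or None if it is absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    # Assumes a version string that starts with an integer major component.
    return int(version.split(".")[0])


def pyarrow_needs_upgrade(minimum_major=14):
    """True if pyarrow is installed and older than `minimum_major`."""
    major = installed_major_version("pyarrow")
    return major is not None and major < minimum_major
```

If `pyarrow_needs_upgrade()` returns `True`, upgrading `pyarrow` as described in the list above is the likely fix.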