Using the CPTAC Ovarian Offline Dataset

This guide explains how to generate the CPTAC ovarian cohort with OVdata.py and connect it to the Real-time Interactive Clinical Navigator UI.

1. Generate the dataset artifacts

python OVdata.py --output-dir RL0910/data/cptac_ovarian

The script will download the proteomics, transcriptomics, and clinical tables (only on the first run), build a multi-modal feature matrix, and export the following files inside the chosen output directory:

| File | Purpose |
| --- | --- |
| `ovarian_offline_dataset.csv` | Flat table with `patient_id`, `timestep`, `action`, `reward`, and normalized `state_*` feature columns. |
| `ovarian_offline_dataset.npz` | NumPy archive with arrays for states, actions, rewards, patient IDs, and supporting metadata. |
| `ovarian_offline_schema.yaml` | Schema mapping that allows the UI to align the CSV with the RLDT state/action format. |
| `ovarian_offline_metadata.json` | Summary statistics for reproducibility (feature catalog, normalization info, action labels, etc.). |
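To confirm that the export finished, a quick check for the four files listed above could look like the following. This is a minimal sketch, not part of `OVdata.py`; the helper name `missing_artifacts` is hypothetical, and only the file names come from this guide.

```python
from pathlib import Path

# File names as listed in the table above.
EXPECTED_ARTIFACTS = (
    "ovarian_offline_dataset.csv",
    "ovarian_offline_dataset.npz",
    "ovarian_offline_schema.yaml",
    "ovarian_offline_metadata.json",
)


def missing_artifacts(output_dir):
    """Return the expected artifact names that are not present in output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]
```

Calling `missing_artifacts("RL0910/data/cptac_ovarian")` should return an empty list after a successful run with the output directory used above.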

Tip: Run `python OVdata.py --help` to see the options for adjusting feature caps or disabling min-max scaling if you need a different preprocessing recipe.

2. Load the cohort inside the UI

1. Launch the UI with `python RL0910/enhanced_chat_ui.py`.
2. In the 📊 Data Management tab:
   - Select Real Data in the “Data Source” selector.
   - Upload `ovarian_offline_dataset.csv` with the “Load Real Data” file picker.
   - (Recommended) Upload `ovarian_offline_schema.yaml` in the schema slot so the UI can apply the same column mappings used during export.
3. Press Load Real Data. A success banner confirms the number of patients and records loaded, and the patient dropdown becomes active.
4. Explore patients, generate reports, or switch to other tabs. Online training remains paused until you explicitly press Start Online Training in the 📊 Online Learning Monitor tab.
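Before uploading, it can be useful to know what patient and record counts to expect in the success banner. The sketch below counts them from the CSV using only the standard library. It assumes the column names listed in section 1; the helper name `summarize_offline_csv` is hypothetical, and the UI's own validation may apply different or stricter checks.

```python
import csv

# Column names listed for ovarian_offline_dataset.csv in section 1.
REQUIRED_COLUMNS = ("patient_id", "timestep", "action", "reward")


def summarize_offline_csv(path):
    """Verify the expected columns and count patients, records, and state features."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []

        missing = [col for col in REQUIRED_COLUMNS if col not in header]
        if missing:
            raise ValueError(f"CSV is missing required columns: {missing}")

        state_columns = [col for col in header if col.startswith("state_")]
        if not state_columns:
            raise ValueError("CSV has no state_* feature columns")

        patients = set()
        records = 0
        for row in reader:
            patients.add(row["patient_id"])
            records += 1

    return {
        "patients": len(patients),
        "records": records,
        "state_features": len(state_columns),
    }
```

For example, `summarize_offline_csv("RL0910/data/cptac_ovarian/ovarian_offline_dataset.csv")` returns a small dictionary whose `patients` and `records` values can be compared against the banner.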

3. Optional: reuse the artifacts elsewhere

- The CSV is compatible with most pandas- or PyTorch-based offline RL pipelines.
- The `.npz` bundle loads directly with `numpy.load`, exposing arrays named `states`, `actions`, `rewards`, `patient_ids`, `timesteps`, and `terminals`.
- The YAML schema can be referenced by the adapters in `RL0910/schema.py` to map new cohorts that follow the same column layout.
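As an illustration of reusing the `.npz` bundle, the sketch below groups the flat arrays into one time-ordered trajectory per patient. It assumes that all six arrays share the same leading dimension (one row per transition) and that none of them is stored as a pickled object array; if `patient_ids` is an object array in your export, `allow_pickle=True` would be needed. The helper name `load_trajectories` is hypothetical.

```python
from collections import defaultdict

import numpy as np


def load_trajectories(npz_path):
    """Group the flat arrays into one time-ordered trajectory per patient."""
    # Array names as documented for the .npz bundle.
    with np.load(npz_path, allow_pickle=False) as bundle:
        states = bundle["states"]
        actions = bundle["actions"]
        rewards = bundle["rewards"]
        patient_ids = bundle["patient_ids"]
        timesteps = bundle["timesteps"]
        terminals = bundle["terminals"]

    # Collect the row indices that belong to each patient.
    rows_by_patient = defaultdict(list)
    for row, pid in enumerate(patient_ids):
        rows_by_patient[str(pid)].append(row)

    trajectories = {}
    for pid, rows in rows_by_patient.items():
        rows = np.asarray(rows)
        # Sort each patient's rows by timestep so transitions are in order.
        order = rows[np.argsort(timesteps[rows], kind="stable")]
        trajectories[pid] = {
            "states": states[order],
            "actions": actions[order],
            "rewards": rewards[order],
            "timesteps": timesteps[order],
            "terminals": terminals[order],
        }
    return trajectories
```

The returned dictionary maps each patient ID (as a string) to its arrays, which can then be fed to whatever offline RL pipeline you use.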

Troubleshooting

- If you see a message about missing CPTAC assets, run `python -m cptac download` once with an internet connection.
- Errors mentioning `_ARRAY_API` typically indicate an older pyarrow build. The UI now disables the Arrow backend automatically, but upgrading pyarrow to version 14 or later (`pyarrow>=14`) will also resolve it.
- If loading fails, the UI keeps the patient dropdown disabled; re-run `OVdata.py` or double-check the schema path.
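When diagnosing `_ARRAY_API` errors, it may help to record which package versions are actually installed. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `environment_report` are hypothetical, and the default package list is an assumption about which packages are relevant.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def environment_report(packages=("pyarrow", "numpy", "pandas")):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}
```

Printing `environment_report()` shows at a glance whether pyarrow is installed and which version is in use.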