# Using the CPTAC Ovarian Offline Dataset

This guide explains how to generate the CPTAC ovarian cohort with `OVdata.py` and connect it to the Real-time Interactive Clinical Navigator UI.

## 1. Generate the dataset artifacts

```bash
python OVdata.py --output-dir RL0910/data/cptac_ovarian
```

The script will download the proteomics, transcriptomics, and clinical tables (only on the first run), build a multi-modal feature matrix, and export the following files inside the chosen output directory:

| File | Purpose |
| --- | --- |
| `ovarian_offline_dataset.csv` | Flat table with `patient_id`, `timestep`, `action`, `reward`, and normalized `state_*` feature columns. |
| `ovarian_offline_dataset.npz` | NumPy archive with arrays for states, actions, rewards, patient IDs, and supporting metadata. |
| `ovarian_offline_schema.yaml` | Schema mapping that allows the UI to align the CSV with the RLDT state/action format. |
| `ovarian_offline_metadata.json` | Summary statistics for reproducibility (feature catalog, normalization info, action labels, etc.). |

> **Tip:** Use `python OVdata.py --help` to adjust feature caps or disable min-max scaling if you need a different preprocessing recipe.
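
After the script finishes, a quick check can confirm that all four artifacts were written. The sketch below is not part of `OVdata.py`: the helper name `missing_artifacts` is hypothetical, and the file names are simply copied from the table above.

```python
from pathlib import Path

# File names as listed in the export table; adjust if your run uses other names.
EXPECTED_ARTIFACTS = (
    "ovarian_offline_dataset.csv",
    "ovarian_offline_dataset.npz",
    "ovarian_offline_schema.yaml",
    "ovarian_offline_metadata.json",
)


def missing_artifacts(output_dir):
    """Return the expected artifact names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    # Same directory as the --output-dir value used in the command above.
    missing = missing_artifacts("RL0910/data/cptac_ovarian")
    if missing:
        print("Missing artifacts:", ", ".join(missing))
    else:
        print("All expected artifacts are present.")
```

An empty list means every file is in place; anything else suggests re-running the generation step before moving on to the UI.
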

## 2. Load the cohort inside the UI

1. Launch `python RL0910/enhanced_chat_ui.py`.
2. In the **📊 Data Management** tab:
   - Select **Real Data** in the “Data Source” selector.
   - Upload `ovarian_offline_dataset.csv` with the “Load Real Data” file picker.
   - (Recommended) Upload `ovarian_offline_schema.yaml` in the schema slot so the UI can apply the same column mappings used during export.
3. Press **Load Real Data**. A success banner will confirm the number of patients and records loaded, and the patient dropdown becomes active.
4. Explore patients, generate reports, or switch to other tabs. Online training remains paused until you explicitly press **Start Online Training** in the **📊 Online Learning Monitor** tab.

## 3. Optional: reuse the artifacts elsewhere

* The CSV is compatible with most pandas- or PyTorch-based offline RL pipelines.
* The `.npz` bundle loads directly with `numpy.load`, exposing arrays named `states`, `actions`, `rewards`, `patient_ids`, `timesteps`, and `terminals`.
* The YAML schema can be referenced by the adapters in `RL0910/schema.py` to map new cohorts that follow the same column layout.
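
As a rough illustration of the `.npz` route, the sketch below loads the archive and groups row indices by patient, ordered by timestep. The array names come from the list above; the shapes, the dtypes, and the assumption that every array holds one row per transition are guesses, and both `episode_indices` and `write_demo_bundle` are hypothetical helpers rather than code shipped with the repository. The demo writes a tiny synthetic archive so the block runs without the real data.

```python
import os
import tempfile
from collections import defaultdict

import numpy as np


def episode_indices(npz_path):
    """Map each patient ID to its row indices, ordered by timestep."""
    with np.load(npz_path) as bundle:
        patient_ids = bundle["patient_ids"]
        timesteps = bundle["timesteps"]
    groups = defaultdict(list)
    for row, pid in enumerate(patient_ids):
        groups[str(pid)].append(row)
    return {
        pid: sorted(rows, key=lambda r: timesteps[r])
        for pid, rows in groups.items()
    }


def write_demo_bundle(path):
    """Write a tiny synthetic archive (assumed layout, not real CPTAC data)."""
    np.savez(
        path,
        states=np.zeros((4, 3), dtype=np.float32),
        actions=np.array([0, 1, 1, 0]),
        rewards=np.array([0.0, 1.0, 0.5, 0.0]),
        patient_ids=np.array(["P1", "P2", "P1", "P2"]),
        timesteps=np.array([1, 0, 0, 1]),
        terminals=np.array([True, False, False, True]),
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.npz")
        write_demo_bundle(demo_path)
        print(episode_indices(demo_path))
```

With the real bundle, the returned indices can be used to slice `states`, `actions`, and `rewards` into per-patient trajectories. If the actual archive stores patient IDs as Python objects, `numpy.load` would additionally need `allow_pickle=True`.
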

## Troubleshooting

* If you see a message about missing CPTAC assets, run `python -m cptac download` once with an internet connection.
* Errors mentioning `_ARRAY_API` typically indicate an older `pyarrow` build. The UI now disables the Arrow backend automatically, but upgrading `pyarrow` (or installing `pyarrow>=14`) will also resolve it.
* Loading failures in the UI will keep the patient dropdown disabled; re-run `OVdata.py` or double-check the schema path if that happens.
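
To see whether the `pyarrow` note applies to your environment, a small check along these lines may help. It is only a sketch: the helper names are hypothetical, the threshold of 14 is taken from the bullet above, and the comparison looks at the major version number only.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '14.0.2'."""
    return int(version_string.split(".", 1)[0])


def pyarrow_status(minimum_major=14):
    """Describe the installed pyarrow relative to the suggested minimum."""
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return "pyarrow is not installed"
    if major_version(installed) < minimum_major:
        return f"pyarrow {installed} is older than {minimum_major}; consider upgrading"
    return f"pyarrow {installed} meets the suggested minimum"


if __name__ == "__main__":
    print(pyarrow_status())
```
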