# Using the CPTAC Ovarian Offline Dataset

This guide explains how to generate the CPTAC ovarian cohort with `OVdata.py` and connect it to the Real-time Interactive Clinical Navigator UI.

## 1. Generate the dataset artifacts

```bash
python OVdata.py --output-dir RL0910/data/cptac_ovarian
```

The script downloads the proteomics, transcriptomics, and clinical tables (on the first run only), builds a multi-modal feature matrix, and exports the following files to the chosen output directory:

| File | Purpose |
| --- | --- |
| `ovarian_offline_dataset.csv` | Flat table with `patient_id`, `timestep`, `action`, `reward`, and normalized `state_*` feature columns. |
| `ovarian_offline_dataset.npz` | NumPy archive with arrays for states, actions, rewards, patient IDs, and supporting metadata. |
| `ovarian_offline_schema.yaml` | Schema mapping that allows the UI to align the CSV with the RLDT state/action format. |
| `ovarian_offline_metadata.json` | Summary statistics for reproducibility (feature catalog, normalization info, action labels, etc.). |

> **Tip:** Use `python OVdata.py --help` to adjust feature caps or disable min-max scaling if you need a different preprocessing recipe.
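
To confirm that a run produced everything listed in the table, a small check like the following can help. It is a minimal sketch: the file names come from the table above, while `missing_artifacts` is a hypothetical helper that is not part of `OVdata.py`.

```python
from pathlib import Path

# File names as listed in the artifact table above.
EXPECTED_ARTIFACTS = (
    "ovarian_offline_dataset.csv",
    "ovarian_offline_dataset.npz",
    "ovarian_offline_schema.yaml",
    "ovarian_offline_metadata.json",
)


def missing_artifacts(output_dir):
    """Return the expected artifact names that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


if __name__ == "__main__":
    # Same directory as the --output-dir value used in the command above.
    missing = missing_artifacts("RL0910/data/cptac_ovarian")
    print("All artifacts present." if not missing else f"Missing: {missing}")
```

An empty result means all four files exist; it does not validate their contents.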

## 2. Load the cohort inside the UI

1. Launch `python RL0910/enhanced_chat_ui.py`.
2. In the **📊 Data Management** tab:
   - Select **Real Data** in the “Data Source” selector.
   - Upload `ovarian_offline_dataset.csv` with the “Load Real Data” file picker.
   - (Recommended) Upload `ovarian_offline_schema.yaml` in the schema slot so the UI can apply the same column mappings used during export.
3. Press **Load Real Data**. A success banner will confirm the number of patients and records loaded, and the patient dropdown becomes active.
4. Explore patients, generate reports, or switch to other tabs. Online training remains paused until you explicitly press **Start Online Training** in the **📊 Online Learning Monitor** tab.
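
Before uploading, it can be useful to check that the CSV header has the columns described in section 1. The sketch below uses only the standard library; `check_header` and `read_header` are hypothetical helpers, and the UI may apply different or additional checks of its own.

```python
import csv

# Column names taken from the CSV description in section 1.
REQUIRED_COLUMNS = ("patient_id", "timestep", "action", "reward")
STATE_PREFIX = "state_"


def check_header(columns):
    """Return a list of human-readable problems found in a CSV header."""
    problems = [
        f"missing column: {name}"
        for name in REQUIRED_COLUMNS
        if name not in columns
    ]
    if not any(col.startswith(STATE_PREFIX) for col in columns):
        problems.append(f"no '{STATE_PREFIX}*' feature columns found")
    return problems


def read_header(csv_path):
    """Read only the first row of the CSV, which holds the column names."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])
```

For example, `check_header(read_header("RL0910/data/cptac_ovarian/ovarian_offline_dataset.csv"))` returns an empty list when the header looks as expected.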

## 3. Optional: reuse the artifacts elsewhere

* The CSV is a flat table, so it loads directly with pandas and can feed most PyTorch-based offline RL pipelines.
* The `.npz` bundle loads directly with `numpy.load`, exposing arrays named `states`, `actions`, `rewards`, `patient_ids`, `timesteps`, and `terminals`.
* The YAML schema can be referenced by the adapters in `RL0910/schema.py` to map new cohorts that follow the same column layout.
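
As a rough illustration of consuming the `.npz` bundle, the sketch below loads the archive and checks the array names listed above. It assumes, without confirmation from the exporter, that every array holds one entry per transition, so all first-axis lengths should match; `summarize_bundle` and `load_bundle` are hypothetical helpers.

```python
# Array names as listed in the bullet above.
EXPECTED_ARRAYS = (
    "states", "actions", "rewards",
    "patient_ids", "timesteps", "terminals",
)


def summarize_bundle(arrays):
    """Check array names and first-axis lengths in a mapping of arrays.

    Assumption: each array has one entry per transition, so all lengths
    should agree. Returns the shared length.
    """
    missing = [name for name in EXPECTED_ARRAYS if name not in arrays]
    if missing:
        raise KeyError(f"bundle is missing arrays: {missing}")
    lengths = {name: len(arrays[name]) for name in EXPECTED_ARRAYS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"array lengths differ: {lengths}")
    return lengths["states"]


def load_bundle(npz_path):
    """Load the .npz archive into a plain dict of NumPy arrays."""
    import numpy as np  # imported here so summarize_bundle has no dependencies

    # If this fails with an allow_pickle error, the archive contains object
    # arrays; pass allow_pickle=True only for files you generated yourself.
    with np.load(npz_path) as bundle:
        return {name: bundle[name] for name in bundle.files}
```

Calling `summarize_bundle(load_bundle("RL0910/data/cptac_ovarian/ovarian_offline_dataset.npz"))` then gives the number of transitions, or raises if the bundle does not match the assumed layout.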

## Troubleshooting

* If you see a message about missing CPTAC assets, run `python -m cptac download` once with an internet connection.
* Errors mentioning `_ARRAY_API` typically indicate an older `pyarrow` build. The UI now disables the Arrow backend automatically, but upgrading also resolves it, for example with `pip install "pyarrow>=14"`.
* If loading fails, the patient dropdown stays disabled; re-run `OVdata.py` or double-check the schema path.
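
When diagnosing the issues above, it helps to know which package versions are installed. The following sketch reads them with the standard library's `importlib.metadata`; `installed_version` and `major_version` are hypothetical helpers, and the threshold of 14 simply mirrors the `pyarrow>=14` suggestion above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version):
    """Extract the leading integer from a version string such as '14.0.2'."""
    return int(version.split(".")[0])


if __name__ == "__main__":
    for name in ("pyarrow", "cptac"):
        print(name, installed_version(name) or "not installed")
    arrow = installed_version("pyarrow")
    if arrow is not None and major_version(arrow) < 14:
        print("pyarrow is older than 14; consider upgrading.")
```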