# OpenMed Cross-Dataset Transfer Discovery

## Goal

Find the optimal training curriculum for biomedical NER by systematically exploring cross-dataset transfer learning. The metric is **val_f1** on the NCBI disease NER test set (higher is better).

## Setup

Agree on a run tag: propose a tag based on today's date (e.g. `apr8`). The branch `autoresearch/` must not already exist — this is a fresh run.

1. Create the branch: `git checkout -b autoresearch/` from current master.
2. Read the in-scope files: `README.md`, `prepare.py` (DO NOT MODIFY), `train.py` (the file you modify).
3. Verify data exists: check that `~/.cache/openmed-autoresearch/` contains dataset folders and `meta.json`. If not, tell the human to run `pip install -r requirements.txt && python prepare.py`.
4. Initialize `results.tsv` with the header: `experiment\tdescription\tval_f1\tpeak_vram_mb\tkept`
5. Run a baseline first (no curriculum, just NCBI disease fine-tuning for the full 5 minutes).
6. Confirm and go.

## The Experiment Loop

Each experiment:

1. **Hypothesize**: Think about which cross-dataset transfer might help. Consider:
   - Which entity types share semantic overlap (chemicals↔diseases, genes↔proteins)?
   - Whether pre-training on broader entity types builds better representations
   - Curriculum ordering effects (broad→narrow vs narrow→broad)
   - Multi-dataset mixing ratios
   - Time allocation between stages
   - Hyperparameter interactions with curriculum choices
2. **Modify `train.py`**: Change the `CURRICULUM` list and/or hyperparameters. The `CURRICULUM` format is:

   ```python
   CURRICULUM = [
       (["dataset_names"], proportion_of_time, {"dataset": ratio} or None),
       ...
   ]
   ```

   Available datasets: `bc5cdr_chem`, `ncbi_disease`, `bc2gm`, `jnlpba`, `linnaeus`

   You may also modify: `LEARNING_RATE`, `WEIGHT_DECAY`, `WARMUP_RATIO`, `BATCH_SIZE`, `GRADIENT_ACCUMULATION_STEPS`, `LR_SCHEDULER_TYPE`, `FREEZE_LAYERS`, `DROPOUT_OVERRIDE`, `FP16`, and any architecture changes to the model/classifier head.
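As a concrete illustration, the `CURRICULUM` format might be filled in like this for a 30/70 chemical-then-disease curriculum with a mixed first stage (hypothetical values; the ratio dict is assumed to hold relative sampling weights):

```python
# Hypothetical example curriculum (values chosen for illustration only).
CURRICULUM = [
    # Stage 1 (30% of the time budget): mixed pretraining on chemicals + genes,
    # sampling bc5cdr_chem twice as often as bc2gm (assumed ratio semantics).
    (["bc5cdr_chem", "bc2gm"], 0.3, {"bc5cdr_chem": 2, "bc2gm": 1}),
    # Stage 2 (70% of the time budget): fine-tune on the target task, no mixing.
    (["ncbi_disease"], 0.7, None),
]
```

Note that the final stage should always be `ncbi_disease`, since that is the evaluation target.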
3. **Run**: `python train.py > run.log 2>&1`
4. **Read results**: `grep "^val_f1:\|^peak_vram_mb:" run.log`. If the output is empty, the run crashed: run `tail -n 50 run.log` and attempt a fix. Give up after 3 attempts.
5. **Record** the result in `results.tsv` (do NOT commit this file).
6. **Keep or revert**:
   - If `val_f1` improved → `git add train.py && git commit -m "experiment N: (f1=X.XXXX)"` — ADVANCE
   - If equal or worse → `git checkout -- train.py` — REVERT

## Research Directions (explore in roughly this order)

### Phase 1: Transfer Affinity Discovery (experiments 1-30)

Map which source datasets help NCBI disease NER. Try each one individually as a pretraining stage:

- bc5cdr_chem → ncbi_disease
- bc2gm → ncbi_disease
- jnlpba → ncbi_disease
- linnaeus → ncbi_disease

Vary the time split (30/70, 50/50, 70/30) for each pair.

### Phase 2: Multi-Source Curricula (experiments 31-60)

Based on the Phase 1 winners, try:

- Two-source pretraining (best pair mixed → ncbi_disease)
- Three-stage curricula (source1 → source2 → ncbi_disease)
- Simultaneous multi-dataset mixing in a single stage
- Varying the mixing ratios for the best combinations

### Phase 3: Hyperparameter Interactions (experiments 61-80)

Take the best curriculum from Phase 2 and optimize:

- Learning rate per stage (you can modify the code to use a different LR per stage)
- Layer freezing during pretraining, unfreezing for fine-tuning
- Warmup and scheduler differences per stage
- Batch size effects
- Gradient accumulation

### Phase 4: Architecture Tweaks (experiments 81-100)

With the best curriculum + hyperparameters:

- Add a CRF layer on top of the classifier
- Try a wider/deeper classification head (MLP instead of linear)
- Experiment with attention dropout
- Try intermediate pooling strategies

## Important Notes

- **Do not modify `prepare.py`** or the evaluation logic in `train.py` (the `compute_f1` function).
- The total time budget is always 5 minutes. The agent competes on what it can achieve in that fixed window.
- **Track your hypotheses**: In the git commit messages AND in `results.tsv`, note WHY you tried something, not just what you changed. This makes the results scientifically useful.
- If a direction shows no promise after 3-4 experiments, move on.
- The baseline (just NCBI disease, no curriculum) is the number to beat. Everything is relative to that.
- Be creative! The best discoveries will be non-obvious interactions (e.g., species NER helping disease NER through shared biomedical context).
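Since every run shares the same fixed 5-minute window, a stage's `proportion_of_time` translates directly into a wall-clock budget. A minimal stdlib-only sketch of that arithmetic (a hypothetical helper; `train.py`'s actual scheduling may differ):

```python
TOTAL_SECONDS = 300  # the fixed 5-minute budget every experiment competes in


def stage_budgets(curriculum, total_seconds=TOTAL_SECONDS):
    """Split the wall-clock budget across stages by their proportions.

    Proportions are normalized by their sum, so they need not add to 1.0.
    """
    total = sum(prop for _, prop, _ in curriculum)
    return [total_seconds * prop / total for _, prop, _ in curriculum]


# A 30/70 pretrain -> fine-tune split of the 300 s budget:
example = [
    (["bc5cdr_chem"], 0.3, None),
    (["ncbi_disease"], 0.7, None),
]
print(stage_budgets(example))  # [90.0, 210.0]
```

One implication worth remembering when designing curricula: every second spent on a source dataset is a second taken away from fine-tuning on `ncbi_disease`.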