# OpenMed Cross-Dataset Transfer Discovery

## Goal

Find the optimal training curriculum for biomedical NER by systematically exploring cross-dataset transfer learning. The metric is **val_f1** on the NCBI disease NER test set (higher is better).

## Setup

Agree on a run tag: propose a tag based on today's date (e.g. `apr8`). The branch `autoresearch/<tag>` must not already exist — this is a fresh run.

1. Create the branch: `git checkout -b autoresearch/<tag>` from current master.
2. Read the in-scope files: `README.md`, `prepare.py` (DO NOT MODIFY), and `train.py` (the file you modify).
3. Verify data exists: check that `~/.cache/openmed-autoresearch/` contains dataset folders and `meta.json`. If not, tell the human to run `pip install -r requirements.txt && python prepare.py`.
4. Initialize `results.tsv` with the header: `experiment\tdescription\tval_f1\tpeak_vram_mb\tkept`
5. Run a baseline first (no curriculum, just NCBI disease fine-tuning for the full 5 minutes).
6. Confirm and go.
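Step 4 can be sketched as a small helper (an illustrative sketch, not part of the repo; the file name and columns come from this doc, the function names are made up):

```python
import csv
import os

RESULTS = "results.tsv"
COLUMNS = ["experiment", "description", "val_f1", "peak_vram_mb", "kept"]

def init_results(path=RESULTS):
    """Write the TSV header once, before the first experiment."""
    if not os.path.exists(path):
        with open(path, "w", newline="") as f:
            csv.writer(f, delimiter="\t").writerow(COLUMNS)

def record(experiment, description, val_f1, peak_vram_mb, kept, path=RESULTS):
    """Append one experiment row; `kept` marks whether the change was committed."""
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [experiment, description, f"{val_f1:.4f}", peak_vram_mb, kept]
        )
```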

## The Experiment Loop

Each experiment:

1. **Hypothesize**: Think about which cross-dataset transfer might help. Consider:
   - Which entity types share semantic overlap (chemicals↔diseases, genes↔proteins)?
   - Whether pre-training on broader entity types builds better representations
   - Curriculum ordering effects (broad→narrow vs narrow→broad)
   - Multi-dataset mixing ratios
   - Time allocation between stages
   - Hyperparameter interactions with curriculum choices

2. **Modify `train.py`**: Change the `CURRICULUM` list and/or hyperparameters. The `CURRICULUM` format is:

   ```python
   CURRICULUM = [
       (["dataset_names"], proportion_of_time, {"dataset": ratio} or None),
       ...
   ]
   ```

   Available datasets: `bc5cdr_chem`, `ncbi_disease`, `bc2gm`, `jnlpba`, `linnaeus`

   You may also modify: `LEARNING_RATE`, `WEIGHT_DECAY`, `WARMUP_RATIO`, `BATCH_SIZE`, `GRADIENT_ACCUMULATION_STEPS`, `LR_SCHEDULER_TYPE`, `FREEZE_LAYERS`, `DROPOUT_OVERRIDE`, `FP16`, and any architecture changes to the model/classifier head.
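As a concrete illustration of the format (hypothetical proportions, chosen only for the example, not a recommendation):

```python
# Baseline: spend all 5 minutes fine-tuning on NCBI disease, no transfer.
CURRICULUM = [
    (["ncbi_disease"], 1.0, None),
]

# Two-stage transfer: 30% of the time budget pretraining on chemical NER,
# then 70% fine-tuning on the target disease task.
CURRICULUM = [
    (["bc5cdr_chem"], 0.3, None),
    (["ncbi_disease"], 0.7, None),
]
```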

3. **Run**: `python train.py > run.log 2>&1`

4. **Read results**: `grep "^val_f1:\|^peak_vram_mb:" run.log`
   If the output is empty, the run crashed: inspect with `tail -n 50 run.log` and attempt a fix. Give up after 3 attempts.
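Equivalently, the metrics can be parsed in Python (a sketch; the `val_f1:` and `peak_vram_mb:` line formats come from this doc, the function name is illustrative):

```python
import re

def read_metrics(log_path="run.log"):
    """Return {'val_f1': ..., 'peak_vram_mb': ...} or None if the run crashed."""
    metrics = {}
    with open(log_path) as f:
        for line in f:
            m = re.match(r"^(val_f1|peak_vram_mb):\s*([\d.]+)", line)
            if m:
                metrics[m.group(1)] = float(m.group(2))
    # Both lines must be present; a crash usually leaves neither.
    if {"val_f1", "peak_vram_mb"} <= metrics.keys():
        return metrics
    return None
```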

5. **Record** in `results.tsv` (do NOT commit this file).

6. **Keep or revert**:
   - If `val_f1` improved → `git add train.py && git commit -m "experiment N: <description> (f1=X.XXXX)"` — ADVANCE
   - If equal or worse → `git checkout -- train.py` — REVERT

## Research Directions (explore in roughly this order)

### Phase 1: Transfer Affinity Discovery (experiments 1-30)
Map which source datasets help NCBI disease NER. Try each one individually as a pretraining stage:
- bc5cdr_chem → ncbi_disease
- bc2gm → ncbi_disease
- jnlpba → ncbi_disease
- linnaeus → ncbi_disease

Vary the time split (30/70, 50/50, 70/30) for each pair.
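The Phase 1 grid (4 sources × 3 time splits) can be enumerated programmatically (a sketch; the tuple format follows the `CURRICULUM` spec earlier in this doc, the helper name is illustrative):

```python
SOURCES = ["bc5cdr_chem", "bc2gm", "jnlpba", "linnaeus"]
SPLITS = [(0.3, 0.7), (0.5, 0.5), (0.7, 0.3)]

def phase1_curricula():
    """One two-stage curriculum per (source, time-split) combination."""
    grid = []
    for source in SOURCES:
        for pretrain, finetune in SPLITS:
            grid.append([
                ([source], pretrain, None),
                (["ncbi_disease"], finetune, None),
            ])
    return grid
```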

### Phase 2: Multi-Source Curricula (experiments 31-60)
Based on Phase 1 winners, try:
- Two-source pretraining (best pair mixed → ncbi_disease)
- Three-stage curricula (source1 → source2 → ncbi_disease)
- Simultaneous multi-dataset mixing in a single stage
- Varying the mixing ratios for the best combinations
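For instance, a three-stage curriculum with a mixed first stage might be expressed as follows (hypothetical datasets, ratios, and proportions, chosen only to illustrate the format):

```python
# Stage 1: mixed chemical + gene pretraining (sampled 1:1) for 30% of the time.
# Stage 2: gene NER alone for 20%, narrowing toward single-entity tagging.
# Stage 3: the target disease task for the remaining 50%.
CURRICULUM = [
    (["bc5cdr_chem", "bc2gm"], 0.3, {"bc5cdr_chem": 1, "bc2gm": 1}),
    (["bc2gm"], 0.2, None),
    (["ncbi_disease"], 0.5, None),
]
```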

### Phase 3: Hyperparameter Interactions (experiments 61-80)
Take the best curriculum from Phase 2 and optimize:
- Learning rate per stage (you can modify the code to use a different LR per stage)
- Layer freezing during pretraining, unfreezing for fine-tuning
- Warmup and scheduler differences per stage
- Batch size effects
- Gradient accumulation
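A minimal way to wire a per-stage learning rate (a sketch; `STAGE_LEARNING_RATES` and `stage_lr` are illustrative names you would add to `train.py`, not existing variables):

```python
LEARNING_RATE = 3e-5  # assumed global default, as in train.py

# Optional overrides, indexed by stage position in CURRICULUM:
# a higher LR for broad pretraining, a lower one for careful fine-tuning.
STAGE_LEARNING_RATES = {0: 5e-5, 1: 2e-5}

def stage_lr(stage_index):
    """LR for a given curriculum stage, falling back to the global default."""
    return STAGE_LEARNING_RATES.get(stage_index, LEARNING_RATE)
```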

### Phase 4: Architecture Tweaks (experiments 81-100)
With the best curriculum + hyperparameters:
- Add a CRF layer on top of the classifier
- Try a wider/deeper classification head (an MLP instead of a linear layer)
- Experiment with attention dropout
- Try intermediate pooling strategies
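For the MLP-head idea, a drop-in replacement for a linear token classifier could look like this (a sketch assuming a PyTorch model with 768-dim encoder hidden states; the class name and sizes are illustrative, not from `train.py`):

```python
import torch
import torch.nn as nn

class MLPTokenClassifier(nn.Module):
    """Two-layer head: encoder hidden states -> intermediate projection -> tag logits."""

    def __init__(self, hidden_size=768, intermediate_size=256, num_labels=3, dropout=0.1):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(intermediate_size, num_labels),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        return self.head(hidden_states)
```

Because the output shape stays (batch, seq_len, num_labels), swapping this in for a plain linear classifier leaves the evaluation path untouched.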

## Important Notes

- **Do not modify `prepare.py`** or the evaluation logic in `train.py` (the `compute_f1` function).
- The total time budget is always 5 minutes per run. You compete on what you can achieve in that fixed window.
- **Track your hypotheses**: In the git commit messages AND in `results.tsv`, note WHY you tried something, not just what you changed. This makes the results scientifically useful.
- If a direction shows no promise after 3-4 experiments, move on.
- The baseline (just NCBI disease, no curriculum) is the number to beat. Everything is relative to that.
- Be creative! The best discoveries will be non-obvious interactions (e.g., species NER helping disease NER through shared biomedical context).