# OpenMed Cross-Dataset Transfer Discovery

## Goal

Find the optimal training curriculum for biomedical NER by systematically exploring cross-dataset transfer learning. The metric is `val_f1` on the NCBI disease NER test set (higher is better).
## Setup

- Agree on a run tag: propose a tag based on today's date (e.g. `apr8`). The branch `autoresearch/<tag>` must not already exist — this is a fresh run.
- Create the branch: `git checkout -b autoresearch/<tag>` from current master.
- Read the in-scope files: `README.md`, `prepare.py` (DO NOT MODIFY), `train.py` (the file you modify).
- Verify data exists: check that `~/.cache/openmed-autoresearch/` contains dataset folders and `meta.json`. If not, tell the human to run `pip install -r requirements.txt && python prepare.py`.
- Initialize `results.tsv` with the header `experiment\tdescription\tval_f1\tpeak_vram_mb\tkept`.
- Run a baseline first (no curriculum, just NCBI disease fine-tuning for the full 5 minutes).
- Confirm and go.
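The setup steps above can be sketched as a short shell snippet (`apr8` is the example tag from above; the `|| true` is only so the sketch keeps going outside a real repo):

```shell
# Sketch of the setup steps; "apr8" is the example tag from above.
tag="apr8"

# Fresh branch from current master (must not already exist).
git checkout -b "autoresearch/$tag" 2>/dev/null || true

# Verify prepared data; otherwise ask the human to run prepare.py.
if [ ! -f "$HOME/.cache/openmed-autoresearch/meta.json" ]; then
    echo "missing data: pip install -r requirements.txt && python prepare.py"
fi

# Initialize the results log with the required header.
printf 'experiment\tdescription\tval_f1\tpeak_vram_mb\tkept\n' > results.tsv
```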
## The Experiment Loop

Each experiment:

1. Hypothesize: think about what cross-dataset transfer might help. Consider:
   - Which entity types share semantic overlap (chemicals↔diseases, genes↔proteins)?
   - Whether pre-training on broader entity types builds better representations
   - Curriculum ordering effects (broad→narrow vs narrow→broad)
   - Multi-dataset mixing ratios
   - Time allocation between stages
   - Hyperparameter interactions with curriculum choices
2. Modify `train.py`: change the `CURRICULUM` list and/or hyperparameters. The CURRICULUM format is:

   ```python
   CURRICULUM = [
       (["dataset_names"], proportion_of_time, {"dataset": ratio} or None),
       ...
   ]
   ```

   Available datasets: `bc5cdr_chem`, `ncbi_disease`, `bc2gm`, `jnlpba`, `linnaeus`. You may also modify `LEARNING_RATE`, `WEIGHT_DECAY`, `WARMUP_RATIO`, `BATCH_SIZE`, `GRADIENT_ACCUMULATION_STEPS`, `LR_SCHEDULER_TYPE`, `FREEZE_LAYERS`, `DROPOUT_OVERRIDE`, `FP16`, and any architecture changes to the model/classifier head.
3. Run: `python train.py > run.log 2>&1`
4. Read results: `grep "^val_f1:\|^peak_vram_mb:" run.log`. If the output is empty, the run crashed: run `tail -n 50 run.log` and attempt a fix. Give up after 3 attempts.
5. Record in `results.tsv` (do NOT commit this file).
6. Keep or revert:
   - If `val_f1` improved → `git add train.py && git commit -m "experiment N: <description> (f1=X.XXXX)"` — ADVANCE
   - If equal or worse → `git checkout -- train.py` — REVERT
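As a concrete illustration of the `CURRICULUM` format above (the stage values here are hypothetical examples, not recommendations; stage proportions presumably should sum to 1.0 so the full time budget is used):

```python
# Hypothetical two-stage curriculum: 40% of the time budget pretraining on
# chemical NER, then 60% fine-tuning on the target disease task.
CURRICULUM = [
    (["bc5cdr_chem"], 0.4, None),   # single dataset: no mixing ratios needed
    (["ncbi_disease"], 0.6, None),
]

# Single mixed stage: both datasets for the whole run at a 1:3 sampling ratio.
CURRICULUM_MIXED = [
    (["bc5cdr_chem", "ncbi_disease"], 1.0, {"bc5cdr_chem": 1, "ncbi_disease": 3}),
]
```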
## Research Directions (explore in roughly this order)

### Phase 1: Transfer Affinity Discovery (experiments 1-30)

Map which source datasets help NCBI disease NER. Try each one individually as a pretraining stage:
- bc5cdr_chem → ncbi_disease
- bc2gm → ncbi_disease
- jnlpba → ncbi_disease
- linnaeus → ncbi_disease

Vary the time split (30/70, 50/50, 70/30) for each pair.
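Phase 1 is a small grid: 4 sources × 3 time splits = 12 two-stage curricula. A sketch of enumerating them (assuming the CURRICULUM format above; `PHASE1` is a hypothetical name):

```python
from itertools import product

SOURCES = ["bc5cdr_chem", "bc2gm", "jnlpba", "linnaeus"]
SPLITS = [(0.3, 0.7), (0.5, 0.5), (0.7, 0.3)]  # (pretrain, fine-tune) time shares

# One two-stage curriculum per (source, split) combination.
PHASE1 = [
    [([src], pre, None), (["ncbi_disease"], fine, None)]
    for src, (pre, fine) in product(SOURCES, SPLITS)
]
```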
### Phase 2: Multi-Source Curricula (experiments 31-60)
Based on Phase 1 winners, try:
- Two-source pretraining (best pair mixed → ncbi_disease)
- Three-stage curricula (source1 → source2 → ncbi_disease)
- Simultaneous multi-dataset mixing in a single stage
- Vary mixing ratios for the best combinations
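For example, a hypothetical three-stage curriculum combining these ideas (ordering, mixing ratios, and proportions are illustrative only, not Phase 1 results):

```python
# Hypothetical Phase 2 curriculum: mixed gene/protein pretraining,
# then chemical NER, then the disease target.
CURRICULUM = [
    (["bc2gm", "jnlpba"], 0.3, {"bc2gm": 1, "jnlpba": 1}),  # equal mixing
    (["bc5cdr_chem"], 0.2, None),
    (["ncbi_disease"], 0.5, None),
]
```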
### Phase 3: Hyperparameter Interactions (experiments 61-80)
Take the best curriculum from Phase 2 and optimize:
- Learning rate per stage (you can modify the code to use different LR per stage)
- Layer freezing during pretraining, unfreezing for fine-tuning
- Warmup and scheduler differences per stage
- Batch size effects
- Gradient accumulation
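Per-stage learning rates are not built in, so they require a small code change. A minimal sketch of the idea (`STAGE_LR` and `lr_for_stage` are hypothetical names, not part of `train.py`):

```python
# Hypothetical per-stage overrides; falls back to the global LEARNING_RATE.
LEARNING_RATE = 3e-5
STAGE_LR = {0: 5e-5, 1: 2e-5}  # e.g. faster pretraining, gentler fine-tuning

def lr_for_stage(stage_idx):
    """Learning rate to use when (re)building the optimizer for a stage."""
    return STAGE_LR.get(stage_idx, LEARNING_RATE)
```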
### Phase 4: Architecture Tweaks (experiments 81-100)
With the best curriculum + hyperparameters:
- Add a CRF layer on top of the classifier
- Try a wider/deeper classification head (MLP instead of linear)
- Experiment with attention dropout
- Try intermediate pooling strategies
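As one Phase 4 sketch, a wider MLP classification head (assumes PyTorch; `make_mlp_head` is a hypothetical helper, and `hidden_size` must match the encoder's output width):

```python
import torch
import torch.nn as nn

def make_mlp_head(hidden_size: int, num_labels: int,
                  width: int = 2, dropout: float = 0.2) -> nn.Module:
    """Two-layer MLP head replacing the usual single nn.Linear classifier."""
    return nn.Sequential(
        nn.Linear(hidden_size, hidden_size * width),
        nn.GELU(),
        nn.Dropout(dropout),
        nn.Linear(hidden_size * width, num_labels),
    )
```

If the model follows the common HuggingFace token-classification layout, something like `model.classifier = make_mlp_head(model.config.hidden_size, num_labels)` would swap it in, but check how `train.py` actually names its head.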
## Important Notes

- Do not modify `prepare.py` or the evaluation logic in `train.py` (the `compute_f1` function).
- The total time budget is always 5 minutes. The agent competes on what it can achieve in that fixed window.
- Track your hypotheses: in the git commit messages AND in `results.tsv`, note WHY you tried something, not just what you changed. This makes the results scientifically useful.
- If a direction shows no promise after 3-4 experiments, move on.
- The baseline (just NCBI disease, no curriculum) is the number to beat. Everything is relative to that.
- Be creative! The best discoveries will be non-obvious interactions (e.g., species NER helping disease NER through shared biomedical context).