AutoResearch Agent

OpenMed Cross-Dataset Transfer Discovery

Goal

Find the optimal training curriculum for biomedical NER by systematically exploring cross-dataset transfer learning. The metric is val_f1 on the NCBI disease NER test set (higher is better).

Setup

Agree on a run tag: propose a tag based on today's date (e.g. apr8). The branch autoresearch/<tag> must not already exist — this is a fresh run.

  1. Create the branch: git checkout -b autoresearch/<tag> from current master.
  2. Read the in-scope files: README.md, prepare.py (DO NOT MODIFY), train.py (the file you modify).
  3. Verify data exists: Check that ~/.cache/openmed-autoresearch/ contains dataset folders and meta.json. If not, tell the human to run pip install -r requirements.txt && python prepare.py.
  4. Initialize results.tsv with header: experiment\tdescription\tval_f1\tpeak_vram_mb\tkept
  5. Run a baseline first (no curriculum, just NCBI disease fine-tuning for the full 5 minutes).
  6. Confirm and go.
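Steps 3 and 4 above can be sketched as follows (the helper name and its signature are illustrative, not part of the repo):

```python
from pathlib import Path

DATA_DIR = Path.home() / ".cache" / "openmed-autoresearch"
RESULTS_HEADER = "experiment\tdescription\tval_f1\tpeak_vram_mb\tkept\n"

def check_and_init(data_dir=DATA_DIR, results_path="results.tsv"):
    """Verify prepared data exists, then initialize results.tsv with the header."""
    if not (Path(data_dir) / "meta.json").exists():
        # Data is missing: the agent should stop and ask the human to prepare it.
        raise SystemExit(
            "Data missing - run: pip install -r requirements.txt && python prepare.py"
        )
    Path(results_path).write_text(RESULTS_HEADER)
```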

The Experiment Loop

Each experiment:

  1. Hypothesize: Think about what cross-dataset transfer might help. Consider:

    • Which entity types share semantic overlap (chemicals↔diseases, genes↔proteins)?
    • Whether pre-training on broader entity types builds better representations
    • Curriculum ordering effects (broad→narrow vs narrow→broad)
    • Multi-dataset mixing ratios
    • Time allocation between stages
    • Hyperparameter interactions with curriculum choices
  2. Modify train.py: Change the CURRICULUM list and/or hyperparameters. The CURRICULUM format is:

    CURRICULUM = [
        (["dataset_names"], proportion_of_time, {"dataset": ratio} or None),
        ...
    ]
    

    Available datasets: bc5cdr_chem, ncbi_disease, bc2gm, jnlpba, linnaeus

    You may also modify: LEARNING_RATE, WEIGHT_DECAY, WARMUP_RATIO, BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS, LR_SCHEDULER_TYPE, FREEZE_LAYERS, DROPOUT_OVERRIDE, FP16, as well as the architecture of the model/classifier head.

  3. Run: python train.py > run.log 2>&1

  4. Read results: grep "^val_f1:\|^peak_vram_mb:" run.log. If the output is empty, the run crashed: run tail -n 50 run.log and attempt a fix. Give up after 3 attempts.

  5. Record in results.tsv (do NOT commit this file).

  6. Keep or revert:

    • If val_f1 improved → git add train.py && git commit -m "experiment N: <description> (f1=X.XXXX)" — ADVANCE
    • If equal or worse → git checkout -- train.py — REVERT
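Steps 4 through 6 can be sketched in Python (function names and the exact log format assumed here are illustrative):

```python
import re
import subprocess

def parse_run_log(log_text):
    """Extract val_f1 and peak_vram_mb from run.log; return None if the run crashed."""
    f1 = re.search(r"^val_f1:\s*([0-9.]+)", log_text, re.M)
    vram = re.search(r"^peak_vram_mb:\s*([0-9.]+)", log_text, re.M)
    if not (f1 and vram):
        return None  # crashed: inspect tail -n 50 run.log
    return float(f1.group(1)), float(vram.group(1))

def record(path, experiment, description, val_f1, vram, kept):
    """Append one row to results.tsv (which is never committed)."""
    with open(path, "a") as fh:
        fh.write(f"{experiment}\t{description}\t{val_f1:.4f}\t{vram:.0f}\t{kept}\n")

def keep_or_revert(improved, n, description, val_f1):
    """Commit train.py on improvement, otherwise discard the change."""
    if improved:
        subprocess.run(["git", "add", "train.py"], check=True)
        subprocess.run(
            ["git", "commit", "-m", f"experiment {n}: {description} (f1={val_f1:.4f})"],
            check=True,
        )
    else:
        subprocess.run(["git", "checkout", "--", "train.py"], check=True)
```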

Research Directions (explore in roughly this order)

Phase 1: Transfer Affinity Discovery (experiments 1-30)

Map which source datasets help NCBI disease NER. Try each one individually as a pretraining stage:

  • bc5cdr_chem → ncbi_disease
  • bc2gm → ncbi_disease
  • jnlpba → ncbi_disease
  • linnaeus → ncbi_disease

Vary the time split (30/70, 50/50, 70/30) for each pair.
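One Phase 1 configuration, a 30/70 bc5cdr_chem → ncbi_disease split in the CURRICULUM format above (the proportions are one of the suggested splits, not a recommendation):

```python
# Stage 1: 30% of the time budget on chemical NER, then 70% on the target task.
CURRICULUM = [
    (["bc5cdr_chem"], 0.3, None),   # single pretraining dataset, no mixing ratios
    (["ncbi_disease"], 0.7, None),  # fine-tuning stage on the target task
]
```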

Phase 2: Multi-Source Curricula (experiments 31-60)

Based on Phase 1 winners, try:

  • Two-source pretraining (best pair mixed → ncbi_disease)
  • Three-stage curricula (source1 → source2 → ncbi_disease)
  • Simultaneous multi-dataset mixing in a single stage
  • Vary mixing ratios for the best combinations
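A Phase 2 three-stage curriculum with a mixed first stage might look like this (datasets, ratios, and time splits are illustrative; Phase 1 results should pick the actual sources):

```python
CURRICULUM = [
    # Stage 1: mix chemicals and genes 60/40 for 20% of the budget.
    (["bc5cdr_chem", "bc2gm"], 0.2, {"bc5cdr_chem": 0.6, "bc2gm": 0.4}),
    # Stage 2: narrow to the stronger single source for another 20%.
    (["bc5cdr_chem"], 0.2, None),
    # Stage 3: fine-tune on the target for the remaining 60%.
    (["ncbi_disease"], 0.6, None),
]
```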

Phase 3: Hyperparameter Interactions (experiments 61-80)

Take the best curriculum from Phase 2 and optimize:

  • Learning rate per stage (you can modify the code to use different LR per stage)
  • Layer freezing during pretraining, unfreezing for fine-tuning
  • Warmup and scheduler differences per stage
  • Batch size effects
  • Gradient accumulation
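The per-stage learning rate idea could be implemented as a list parallel to CURRICULUM that the training loop consults when it advances to the next stage. This is an assumed extension, not existing train.py machinery; the names and values are illustrative:

```python
# Hypothetical: one learning rate per curriculum stage
# (hotter during pretraining, cooler during fine-tuning).
STAGE_LEARNING_RATES = [5e-5, 3e-5, 1e-5]

def lr_for_stage(stage_idx, rates=STAGE_LEARNING_RATES):
    """Return the LR for a stage, falling back to the last rate
    if the curriculum has more stages than configured rates."""
    return rates[min(stage_idx, len(rates) - 1)]
```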

Phase 4: Architecture Tweaks (experiments 81-100)

With the best curriculum + hyperparameters:

  • Add a CRF layer on top of the classifier
  • Try a wider/deeper classification head (MLP instead of linear)
  • Experiment with attention dropout
  • Try intermediate pooling strategies
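The "MLP instead of linear" head could look like the following sketch, assuming train.py is PyTorch-based; the class name and dimensions are illustrative, and it would stand in for the existing linear classifier:

```python
import torch
import torch.nn as nn

class MLPTokenClassifier(nn.Module):
    """Hypothetical wider classification head: hidden -> GELU -> dropout -> labels."""

    def __init__(self, hidden_size, num_labels, mlp_dim=512, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, num_labels),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) encoder output
        return self.net(hidden_states)  # (batch, seq_len, num_labels)
```

Note this adds parameters, so watch peak_vram_mb when comparing against the linear baseline.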

Important Notes

  • Do not modify prepare.py or the evaluation logic in train.py (the compute_f1 function).
  • The total time budget is always 5 minutes. The agent competes on what it can achieve in that fixed window.
  • Track your hypotheses: In the git commit messages AND in results.tsv, note WHY you tried something, not just what you changed. This makes the results scientifically useful.
  • If a direction shows no promise after 3-4 experiments, move on.
  • The baseline (just NCBI disease, no curriculum) is the number to beat. Everything is relative to that.
  • Be creative! The best discoveries will be non-obvious interactions (e.g., species NER helping disease NER through shared biomedical context).