OpenMed × Autoresearch: Cross-Dataset Transfer Discovery
Autonomous discovery of optimal training curricula for biomedical NER, using Karpathy's autoresearch loop on OpenMed datasets.
Results (~120 experiments on RTX 4090)
Disease NER (NCBI Disease target — 93 experiments)
| Configuration | val_f1 (mean ± std) | Improvement |
|---|---|---|
| Baseline (ncbi_disease only) | 0.8033 | — |
| + bc5cdr_chem pretrain (50/50) | 0.8470 | +4.4% |
| + 3-stage curriculum (chem→jnlpba→disease) | 0.8535 ± 0.007 | +5.0% |
Chemical NER (BC5CDR-Chem target — 26 experiments)
| Configuration | val_f1 (mean ± std) | Improvement |
|---|---|---|
| Baseline (bc5cdr_chem only) | 0.8090 ± 0.005 | — |
| + 3-stage curriculum (jnlpba→disease→chem) | 0.8195 ± 0.002 | +1.3% |
Transfer Affinity Matrix (ΔF1 from 50/50 pretrain→finetune)
| Source ↓ · Target → | ncbi_disease | bc5cdr_chem |
|---|---|---|
| No pretrain (baseline) | 0.8033 | 0.8090 |
| bc5cdr_chem | **+0.044** | — |
| ncbi_disease | — | -0.001 |
| jnlpba | +0.013 | -0.004 |
| bc2gm | -0.007 | -0.001 |
| linnaeus | -0.033 | +0.001 |
Reading the matrix: each cell shows the F1 change when pretraining on the source (row) before fine-tuning on the target (column). Bold = significant positive transfer.
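The matrix cells are plain differences against the target-only baselines. A minimal sketch of that arithmetic, using the numbers reported in this README (absolute pretrain scores other than 0.8470 are back-derived from the deltas above; all variable and function names are illustrative, not part of this repo):

```python
# Transfer affinity: F1 delta of (pretrain on source, finetune on target)
# relative to training on the target alone. Values come from the tables
# in this README.
BASELINES = {"ncbi_disease": 0.8033, "bc5cdr_chem": 0.8090}

RUNS = {  # (pretrain_source, target) -> val_f1 after 50/50 pretrain+finetune
    ("bc5cdr_chem", "ncbi_disease"): 0.8470,
    ("jnlpba", "ncbi_disease"): 0.8163,
    ("bc2gm", "ncbi_disease"): 0.7963,
    ("linnaeus", "ncbi_disease"): 0.7703,
    ("ncbi_disease", "bc5cdr_chem"): 0.8080,
    ("jnlpba", "bc5cdr_chem"): 0.8050,
    ("bc2gm", "bc5cdr_chem"): 0.8080,
    ("linnaeus", "bc5cdr_chem"): 0.8100,
}

def affinity(runs, baselines):
    """Map (source, target) -> F1 gain over the target-only baseline."""
    return {key: round(f1 - baselines[key[1]], 3) for key, f1 in runs.items()}

matrix = affinity(RUNS, BASELINES)
```

A negative cell (e.g. `("linnaeus", "ncbi_disease")`) directly reads off as negative transfer.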
Key Discovery: Asymmetric Transfer
| Direction | Improvement | Pretrain Budget |
|---|---|---|
| Chemicals → Disease NER | +5.0% | 40% pretrain |
| Disease → Chemical NER | +1.3% | 15% pretrain |
Transfer is 3.8x stronger from chemicals to diseases than the reverse. This asymmetry likely arises because (1) BC5CDR contains both chemical AND disease annotations, and (2) chemical entities are more lexically distinctive while disease entities benefit more from contextual pretraining.
Other key findings:
- Sequential 3-stage curriculum beats single-source and mixing approaches (both targets)
- Cosine LR scheduler is more reliable than `constant_with_warmup` (higher mean, lower variance)
- Batch size 64 is critical under the time-budgeted training regime (~3× higher GPU utilization)
- JNLPBA (proteins) helps both targets — proteins interact with both chemicals and diseases
- Negative transfer from species (Linnaeus) and gene-only (BC2GM) datasets
- Default BERT hyperparameters are near-optimal — 50+ tuning experiments found no improvement
See FINDINGS.md, FINDINGS_CHEM.md, and results.tsv for full analysis.
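The sequential curriculum from the first finding amounts to splitting a fixed training budget across stages. A minimal sketch, assuming a step-based budget; the exact split between the two pretrain stages is illustrative (the results above only fix the 40% total pretrain share), and the helper name is not from this repo:

```python
def curriculum_steps(total_steps, stages):
    """Split a fixed step budget across sequential curriculum stages.

    stages: list of (dataset_name, fraction) pairs, fractions summing to 1.0,
    ending with the target dataset.
    """
    plan = [(name, int(round(total_steps * frac))) for name, frac in stages]
    # Give any rounding remainder to the final (target) stage.
    name, steps = plan[-1]
    plan[-1] = (name, steps + total_steps - sum(s for _, s in plan))
    return plan

# Winning disease curriculum: chem -> jnlpba -> disease, 40% pretrain overall
# (the 25/15 split between the pretrain stages is an assumption).
plan = curriculum_steps(1000, [
    ("bc5cdr_chem", 0.25),
    ("jnlpba", 0.15),
    ("ncbi_disease", 0.60),
])
```

Handing the rounding remainder to the last stage keeps the total budget exact while biasing any slack toward the target dataset.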
What this does
An AI agent (Claude Code) runs ~100 experiments overnight on your GPU, systematically exploring which biomedical NER datasets help each other through transfer learning. The target: maximize F1 on NCBI disease NER by finding the best cross-dataset pretraining curriculum.
Available datasets
| Short name | Source | Entity types |
|---|---|---|
| `bc5cdr_chem` | BC5CDR | Chemicals, drugs |
| `ncbi_disease` | NCBI Disease | Disease mentions |
| `bc2gm` | BC2GM | Gene/protein mentions |
| `jnlpba` | BioNLP 2004 | DNA, RNA, cell lines, cell types, proteins |
| `linnaeus` | Linnaeus | Species |
Setup
```bash
# 1. Clone and enter
git clone <this-repo>
cd openmed-autoresearch

# 2. Install dependencies
pip install -r requirements.txt

# 3. Prepare data (downloads & tokenizes all datasets)
python prepare.py

# 4. Verify
ls ~/.cache/openmed-autoresearch/
# Should see: bc5cdr_chem/ ncbi_disease/ bc2gm/ jnlpba/ linnaeus/ meta.json

# 5. Test baseline
python train.py
# Should print val_f1 and peak_vram_mb after ~5 minutes

# 6. Run autoresearch with Claude Code
claude --dangerously-skip-permissions
# Then tell it: "Read program.md and start the autoresearch loop"
```
Changing the target dataset
To evaluate on a different entity type (e.g., chemicals instead of diseases), edit `TARGET_EVAL_DATASET` in `train.py`. The curriculum exploration then discovers what helps that entity type.
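For example, switching the target to chemical NER would look like this, assuming the constant is defined at the top of `train.py` (the exact location in the file is an assumption):

```python
# in train.py: switch the evaluation target from diseases to chemicals.
# Valid values are the short dataset names from the table above.
TARGET_EVAL_DATASET = "bc5cdr_chem"  # was "ncbi_disease"
```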
Analyzing results
After a run, use `analyze.py` to generate the transfer affinity heatmap:

```bash
python analyze.py results.tsv
```
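`analyze.py` itself is not shown in this README; as a rough, stdlib-only illustration of the kind of aggregation it performs, the sketch below derives ΔF1 values from a tab-separated log. The column names and the `none` sentinel for baseline runs are assumptions about the `results.tsv` schema, and the sample rows reuse numbers from the tables above:

```python
import csv
import io

# Sample log in an assumed results.tsv-like schema (tab-separated).
TSV = (
    "pretrain_source\ttarget\tval_f1\n"
    "none\tncbi_disease\t0.8033\n"
    "bc5cdr_chem\tncbi_disease\t0.8470\n"
    "jnlpba\tncbi_disease\t0.8163\n"
    "none\tbc5cdr_chem\t0.8090\n"
    "jnlpba\tbc5cdr_chem\t0.8050\n"
)

rows = list(csv.DictReader(io.StringIO(TSV), delimiter="\t"))

# Baseline F1 per target comes from the pretrain_source == "none" runs.
base = {r["target"]: float(r["val_f1"]) for r in rows if r["pretrain_source"] == "none"}

# Delta of every pretrained run against its target's baseline.
delta = {
    (r["pretrain_source"], r["target"]): round(float(r["val_f1"]) - base[r["target"]], 3)
    for r in rows
    if r["pretrain_source"] != "none"
}

for (src, tgt), d in sorted(delta.items()):
    print(f"{src:>12} -> {tgt:<12} delta_F1 = {d:+.3f}")
```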
Output
- `results.tsv`: full experiment log with F1 scores
- Git history on the `autoresearch/<tag>` branch: only the improvements
- Transfer affinity insights from the experiment data
Credits
- OpenMed by Maziyar Panahi — datasets and models
- autoresearch by Andrej Karpathy — the autonomous experiment loop pattern
- Base model: ModernBERT-base