# OpenMed Cross-Dataset Transfer Discovery
## Goal
Find the optimal training curriculum for biomedical NER by systematically exploring cross-dataset transfer learning. The metric is **val_f1** on the NCBI disease NER test set (higher is better).
## Setup
Agree on a run tag: propose a tag based on today's date (e.g. `apr8`). The branch `autoresearch/<tag>` must not already exist — this is a fresh run.
1. Create the branch: `git checkout -b autoresearch/<tag>` from current master.
2. Read the in-scope files: `README.md`, `prepare.py` (DO NOT MODIFY), `train.py` (the file you modify).
3. Verify the data exists: check that `~/.cache/openmed-autoresearch/` contains the dataset folders and `meta.json`. If not, tell the human to run `pip install -r requirements.txt && python prepare.py`.
4. Initialize `results.tsv` with header: `experiment\tdescription\tval_f1\tpeak_vram_mb\tkept`
5. Run a baseline first (no curriculum, just NCBI disease fine-tuning for the full 5 minutes).
6. Confirm and go.
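Steps 1-4 above can be sketched as a shell session. The tag `apr8` is only an example, and the commands are guarded so they merely warn (rather than abort) when a prerequisite is missing:

```shell
# step 1: create a fresh run branch from master (warn if it already exists)
tag=apr8
git checkout -b "autoresearch/$tag" master 2>/dev/null \
  || echo "branch exists or not in the project repo -- pick a fresh tag"

# step 3: verify the prepared datasets exist
[ -f "$HOME/.cache/openmed-autoresearch/meta.json" ] \
  || echo "run: pip install -r requirements.txt && python prepare.py"

# step 4: initialize the results log with its tab-separated header row
printf 'experiment\tdescription\tval_f1\tpeak_vram_mb\tkept\n' > results.tsv
```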
## The Experiment Loop
Each experiment:
1. **Hypothesize**: Think about what cross-dataset transfer might help. Consider:
- Which entity types share semantic overlap (chemicals↔diseases, genes↔proteins)?
- Whether pre-training on broader entity types builds better representations
- Curriculum ordering effects (broad→narrow vs narrow→broad)
- Multi-dataset mixing ratios
- Time allocation between stages
- Hyperparameter interactions with curriculum choices
2. **Modify `train.py`**: Change the `CURRICULUM` list and/or hyperparameters. The CURRICULUM format is:
```python
CURRICULUM = [
    (["dataset_names"], proportion_of_time, {"dataset": ratio} or None),
    ...
]
```
Available datasets: `bc5cdr_chem`, `ncbi_disease`, `bc2gm`, `jnlpba`, `linnaeus`
You may also modify: `LEARNING_RATE`, `WEIGHT_DECAY`, `WARMUP_RATIO`, `BATCH_SIZE`, `GRADIENT_ACCUMULATION_STEPS`, `LR_SCHEDULER_TYPE`, `FREEZE_LAYERS`, `DROPOUT_OVERRIDE`, `FP16`, as well as the architecture of the model/classifier head.
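For instance, a concrete two-stage curriculum and a mixed-stage variant might look like the following (the stage proportions and ratios here are illustrative values, not recommendations):

```python
# Two stages: spend 40% of the time budget pretraining on chemical NER,
# then the remaining 60% fine-tuning on the target disease task.
CURRICULUM = [
    (["bc5cdr_chem"], 0.4, None),   # single source, no mixing ratios needed
    (["ncbi_disease"], 0.6, None),  # target task
]

# A mixed stage lists several datasets and gives per-dataset sampling ratios.
MIXED_CURRICULUM = [
    (["bc5cdr_chem", "bc2gm"], 0.3, {"bc5cdr_chem": 0.7, "bc2gm": 0.3}),
    (["ncbi_disease"], 0.7, None),
]

# Sanity check: stage proportions should cover the whole time budget.
for cur in (CURRICULUM, MIXED_CURRICULUM):
    assert abs(sum(p for _, p, _ in cur) - 1.0) < 1e-9
```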
3. **Run**: `python train.py > run.log 2>&1`
4. **Read results**: `grep "^val_f1:\|^peak_vram_mb:" run.log`
If the output is empty, the run crashed. Inspect `tail -n 50 run.log` and attempt a fix. Give up after 3 attempts.
5. **Record** in `results.tsv` (do NOT commit this file).
6. **Keep or revert**:
- If `val_f1` improved → `git add train.py && git commit -m "experiment N: <description> (f1=X.XXXX)"` — ADVANCE
- If equal or worse → `git checkout -- train.py` — REVERT
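Steps 4-5 can be scripted. In this sketch a synthetic `run.log` stands in for a real run's output, and the experiment number, description, and metric values are placeholders:

```shell
# Synthetic run.log, mimicking the lines train.py prints (illustration only)
printf 'val_f1: 0.8412\npeak_vram_mb: 5120\n' > run.log

# Extract the metric; an empty result would mean the run crashed
f1=$(grep '^val_f1:' run.log | cut -d' ' -f2)

# Append this experiment's row to the results log (tab-separated fields)
printf '7\tchem pretrain 40/60\t%s\t5120\tkept\n' "$f1" >> results.tsv
```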
## Research Directions (explore in roughly this order)
### Phase 1: Transfer Affinity Discovery (experiments 1-30)
Map which source datasets help NCBI disease NER. Try each one individually as a pretraining stage:
- bc5cdr_chem → ncbi_disease
- bc2gm → ncbi_disease
- jnlpba → ncbi_disease
- linnaeus → ncbi_disease
Vary the time split (30/70, 50/50, 70/30) for each pair.
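The Phase 1 grid above (each source crossed with each time split) can be enumerated in plain Python, with no project dependencies:

```python
# Enumerate the single-source pretraining experiments for Phase 1:
# every source dataset crossed with every pretrain/finetune time split.
sources = ["bc5cdr_chem", "bc2gm", "jnlpba", "linnaeus"]
splits = [(0.3, 0.7), (0.5, 0.5), (0.7, 0.3)]

phase1 = [
    [([src], pre, None), (["ncbi_disease"], fine, None)]
    for src in sources
    for pre, fine in splits
]

assert len(phase1) == 12  # 4 sources x 3 splits
```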
### Phase 2: Multi-Source Curricula (experiments 31-60)
Based on Phase 1 winners, try:
- Two-source pretraining (best pair mixed → ncbi_disease)
- Three-stage curricula (source1 → source2 → ncbi_disease)
- Simultaneous multi-dataset mixing in a single stage
- Vary mixing ratios for the best combinations
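As an illustration of the Phase 2 shapes (the dataset choices below are placeholders for whatever Phase 1 identifies as winners):

```python
# Three-stage curriculum: broad source -> related source -> target.
three_stage = [
    (["jnlpba"], 0.2, None),
    (["bc5cdr_chem"], 0.3, None),
    (["ncbi_disease"], 0.5, None),
]

# Simultaneous mixing: one pretraining stage sampling from two sources.
mixed = [
    (["bc5cdr_chem", "bc2gm"], 0.4, {"bc5cdr_chem": 0.6, "bc2gm": 0.4}),
    (["ncbi_disease"], 0.6, None),
]

# Both variants must still cover the full time budget.
for cur in (three_stage, mixed):
    assert abs(sum(p for _, p, _ in cur) - 1.0) < 1e-9
```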
### Phase 3: Hyperparameter Interactions (experiments 61-80)
Take the best curriculum from Phase 2 and optimize:
- Learning rate per stage (you can modify the code to use different LR per stage)
- Layer freezing during pretraining, unfreezing for fine-tuning
- Warmup and scheduler differences per stage
- Batch size effects
- Gradient accumulation
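`train.py` as shipped exposes a single global `LEARNING_RATE` and `FREEZE_LAYERS`; per-stage values require a small code change. One hypothetical way to express them is a parallel list of per-stage overrides that the training loop consults when each stage starts. The `STAGE_OVERRIDES` mechanism below is an assumption about how you might refactor the code, not part of the stock format:

```python
CURRICULUM = [
    (["bc5cdr_chem"], 0.4, None),
    (["ncbi_disease"], 0.6, None),
]

# Hypothetical per-stage overrides, one dict per curriculum stage.
STAGE_OVERRIDES = [
    {"LEARNING_RATE": 5e-5, "FREEZE_LAYERS": 6},  # pretraining: lower layers frozen
    {"LEARNING_RATE": 2e-5, "FREEZE_LAYERS": 0},  # fine-tuning: fully unfrozen
]

# The two lists must stay aligned stage-for-stage.
assert len(STAGE_OVERRIDES) == len(CURRICULUM)
```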
### Phase 4: Architecture Tweaks (experiments 81-100)
With the best curriculum + hyperparameters:
- Add a CRF layer on top of the classifier
- Try a wider/deeper classification head (MLP instead of linear)
- Experiment with attention dropout
- Try intermediate pooling strategies
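A minimal sketch of the wider MLP head, assuming the encoder in `train.py` exposes per-token hidden states of size `hidden_size` and the label set has `num_labels` entries (both names are assumptions here; wire the head into wherever the current linear classifier sits):

```python
import torch
import torch.nn as nn

class MLPTokenClassifierHead(nn.Module):
    """Two-layer head replacing the usual single nn.Linear classifier."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),  # widen before projecting to labels
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_size) -> (batch, seq_len, num_labels)
        return self.net(hidden_states)
```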
## Important Notes
- **Do not modify `prepare.py`** or the evaluation logic in `train.py` (the `compute_f1` function).
- The total time budget is always 5 minutes. The agent competes on what it can achieve in that fixed window.
- **Track your hypotheses**: In the git commit messages AND in results.tsv, note WHY you tried something, not just what you changed. This makes the results scientifically useful.
- If a direction shows no promise after 3-4 experiments, move on.
- The baseline (just NCBI disease, no curriculum) is the number to beat. Everything is relative to that.
- Be creative! The best discoveries will be non-obvious interactions (e.g., species NER helping disease NER through shared biomedical context).