	---
	license: mit
	language:
	- en
	- ko
	tags:
	- agent
	- biology
	- biomedical
	- code
	- awq
	- quantized
	- 4-bit
	- vllm
	- qwen3
	base_model:
	- biomni/Biomni-R0-32B-Preview
	base_model_relation: quantized
	pipeline_tag: text-generation
	---

	# nwirandx/Biomni-R0-32B-Preview-AWQ

	A 4-bit AWQ (W4A16) quantization of
	[biomni/Biomni-R0-32B-Preview](https://huggingface.co/biomni/Biomni-R0-32B-Preview),
	the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of
	`Qwen/Qwen3-32B` and trained end-to-end with multi-turn reinforcement learning
	inside the Biomni-E1 tool environment.

	This release shrinks the BF16 weights from ~64 GB to ~22 GB, so the model fits
	on a single GPU in the 24–48 GB class for inference (a 24 GB card holds the
	weights but leaves almost no room for the KV cache), and it preserves the
	original chat / tool-use behaviour. The calibration mixture is bilingual
	(English + Korean) so that quantization is calibrated for Korean biomedical
	prompts as well as the original English benchmark distribution.

	## TL;DR

	| Property | Original | This repo |
	|---|---|---|
	| Precision | BF16 | W4A16 (AWQ, group size 128, asymmetric) |
	| Disk size | ~64 GB | ~22 GB |
	| Min single-GPU VRAM (weights only, no KV cache) | ~70 GB | ~24 GB |
	| Architecture | Qwen3ForCausalLM | unchanged |
	| Context length | 32k native / 131k YaRN | unchanged |
	| Tool / agent behaviour | Biomni-E1 compatible | unchanged |
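
The VRAM row above counts weights only. As a rough back-of-envelope sketch of what the KV cache adds on top, the snippet below assumes the published Qwen3-32B shape (64 layers, 8 KV heads, head dimension 128) and a 16-bit cache. These defaults are assumptions, not values taken from this repo, so check them against `config.json` before relying on the numbers.

```python
def kv_cache_bytes(tokens, layers=64, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for `tokens` cached positions.

    The defaults are ASSUMED Qwen3-32B values (not read from this repo).
    The leading factor 2 accounts for storing both keys and values.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token


if __name__ == "__main__":
    for n in (4096, 32768, 131072):
        print(f"{n:>7} tokens: {kv_cache_bytes(n) / 2**30:.1f} GiB")
```

Under these assumptions a single full 32k-token sequence needs about 8 GiB of cache in addition to the weights, which is why the 24 GB figure is only a lower bound.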

	## Quick start

	### vLLM (recommended)

	```bash
	vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
	--tensor-parallel-size 1 \
	--max-model-len 32768 \
	--trust-remote-code
	```
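
Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch, assuming the server runs on vLLM's default `localhost:8000` and is registered under the repo name; `build_chat_request` and `ask` are illustrative helper names, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "nwirandx/Biomni-R0-32B-Preview-AWQ"  # served model name defaults to the repo id


def build_chat_request(question, max_tokens=512, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /chat/completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(question), timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("Which gene is associated with Rett syndrome?")` returns the model's reply as a string.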

	For YaRN-extended context (up to 131k tokens), pass:

	```bash
	vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
	--max-model-len 131072 \
	--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
	--trust-remote-code
	```
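
The `factor` in the command above is the ratio of the target context length to the native 32,768-token window. A small sketch for deriving the flags for another target length follows; the helper names are hypothetical, and whether a smaller factor suits a given workload is something to verify against the upstream Qwen3 guidance.

```python
import json
import shlex


def yarn_rope_scaling(target_len, native_len=32768):
    """Return a YaRN rope-scaling dict whose factor is target_len / native_len."""
    if target_len <= native_len:
        raise ValueError("YaRN scaling is only needed beyond the native context length")
    return {
        "rope_type": "yarn",
        "factor": target_len / native_len,
        "original_max_position_embeddings": native_len,
    }


def vllm_args(target_len, native_len=32768):
    """Format the two context-related flags used in the command above."""
    cfg = json.dumps(yarn_rope_scaling(target_len, native_len), separators=(",", ":"))
    return f"--max-model-len {target_len} --rope-scaling {shlex.quote(cfg)}"


if __name__ == "__main__":
    print(vllm_args(131072))
```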

	### transformers

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	# llm-compressor checkpoints typically need the `compressed-tensors` package installed.
	repo = "nwirandx/Biomni-R0-32B-Preview-AWQ"
	tok = AutoTokenizer.from_pretrained(repo)
	model = AutoModelForCausalLM.from_pretrained(
	    repo,
	    device_map="auto",
	    torch_dtype="auto",
	    trust_remote_code=True,
	)

	messages = [
	    {"role": "user",
	     "content": "Given a patient with HP:0001249 and HP:0000750, "
	                "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"}
	]
	prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tok(prompt, return_tensors="pt").to(model.device)
	output_ids = model.generate(**inputs, max_new_tokens=512)
	# Decode only the newly generated tokens, not the echoed prompt.
	new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
	print(tok.decode(new_tokens, skip_special_tokens=True))
	```

	### Running the Biomni agent loop

	The quantized model is a drop-in replacement for the BF16 base model in the
	[snap-stanford/biomni](https://github.com/snap-stanford/biomni) repo: point
	the agent at this checkpoint (or at a vLLM endpoint serving it) and use it as
	documented upstream.

	## Quantization recipe

	| Setting | Value |
	|---|---|
	| Method | AWQ (Activation-aware Weight Quantization) |
	| Toolkit | [llm-compressor](https://github.com/vllm-project/llm-compressor) `0.10.0.1` |
	| Scheme | `W4A16_ASYM` |
	| Group size | 128 |
	| Symmetric | False (zero-point quantization) |
	| Skipped modules | `lm_head` |
	| Calibration samples used | 256 (max_seq_len = 2048) |
	| Sequential pipeline | per decoder block |
	| Hardware | 4 × NVIDIA RTX A6000 48 GB |

	The recipe is also stored as `recipe.yaml` next to the weights for full
	reproducibility.
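
To make the `W4A16_ASYM` / group-size-128 settings concrete, here is a pure-Python sketch of asymmetric (zero-point) 4-bit quantization applied per group of 128 weights. It illustrates only the storage format. It is not the AWQ algorithm itself, which additionally searches for activation-aware per-channel scales before rounding, and it is not llm-compressor's implementation. All function names are illustrative.

```python
def quantize_group_asym(weights, bits=4):
    """Quantize one group to unsigned `bits`-bit codes with a scale and zero-point."""
    qmax = (1 << bits) - 1  # 15 for 4-bit
    lo = min(min(weights), 0.0)  # keep 0.0 representable inside the range
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:  # all-zero group: any positive scale works
        scale = 1.0
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row, group_size=128):
    """Split a weight row into consecutive groups and quantize each independently."""
    return [quantize_group_asym(row[i:i + group_size])
            for i in range(0, len(row), group_size)]
```

Each group stores its own scale and zero-point, so the per-weight rounding error stays within about half a quantization step of that group's range.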

	## Calibration data — bilingual biomedical mix

	A core design choice for this release: the calibration set is dominated by the
	actual Biomni-Eval1 evaluation prompts, in both English and Korean, so the
	quantized activation statistics match the deployment distribution as closely
	as possible.

	| Source | Samples | Notes |
	|---|---|---|
	| `biomni/Eval1` (English) | 433 | All 10 tasks, full set |
	| `biomni/Eval1` (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) |
	| `allenai/c4` (English) | 64 | Short general-domain text for natural-language coverage |
	| `qiaojin/PubMedQA` | 64 | `pqa_labeled` split, formatted as `Question / Context / Answer` |
	| Pool total | 994 | Stratified shuffle, first 256 used for AWQ |

	All calibration prompts were rendered with the official Qwen3 chat template
	(`enable_thinking=False`) before tokenization.
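
The exact procedure behind "stratified shuffle, first 256" is not spelled out here, so the following is only one plausible reading, sketched under that assumption: shuffle within each source, then interleave the sources so that any prefix of the pool keeps roughly the 433 / 433 / 64 / 64 proportions. The source labels and function names are hypothetical.

```python
import random

# Pool sizes from the table above; the dictionary keys are illustrative labels.
POOL_SIZES = {"eval1_en": 433, "eval1_ko": 433, "c4_en": 64, "pubmedqa": 64}


def stratified_order(sizes, seed=0):
    """Order (source, index) pairs so every prefix is roughly proportional.

    ASSUMED interpretation of "stratified shuffle": each source is shuffled,
    then its items are spread evenly over [0, 1) with a small random jitter,
    and all items are sorted by that position.
    """
    rng = random.Random(seed)
    keyed = []
    for source, n in sizes.items():
        order = list(range(n))
        rng.shuffle(order)  # shuffle within the source
        for rank, idx in enumerate(order):
            keyed.append(((rank + rng.random()) / n, source, idx))
    keyed.sort()
    return [(source, idx) for _, source, idx in keyed]


def take_calibration(sizes, k=256, seed=0):
    """Return the first k entries of the stratified order."""
    return stratified_order(sizes, seed)[:k]
```

With this scheme the 256 selected samples contain roughly 111 prompts from each Eval1 language and roughly 16 each from C4 and PubMedQA.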

	### Why a Korean half?

	The base model was fine-tuned on English data only, but downstream users in
	Korean clinical / biomedical settings often submit prompts in Korean. Including
	a Korean half in calibration noticeably stabilises Korean activation magnitudes
	and reduces post-quantization regression on Korean biomedical prompts compared
	with an English-only calibration set, while the English half (which matches the
	bulk of the model's RL training distribution) keeps English performance intact.

	### Translation methodology

	Korean prompts were produced by an LLM translator under explicit constraints:

	- Verbatim preservation of gene symbols (`APOA4`, `BRCA1`, …), variant
	rsIDs (`rs4253311`), Ensembl/OMIM/HPO identifiers (`ENSG…`, `HP:…`),
	cell-line names (`HEK293T`), drug / protein / enzyme names, and any JSON
	schema keys (e.g. `{"causal_gene": [...]}`, `{"OMIM_ID": "..."}`).
	- Native Korean medical terminology for natural-language portions
	(희귀질환 진단, 변이 우선순위 결정, 유전체 연관 분석, …) with the
	English term in parentheses on first mention.
	- Structural fidelity: bullet lists, code blocks, and answer-format
	instructions kept identical to the source.
	- `answer` fields were never modified.

	The translated dataset and the original English prompts are both shipped in
	the source kit used to build this model so the calibration is fully reproducible.
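
The verbatim-preservation constraint can be spot-checked mechanically. The sketch below is a hypothetical validator, not part of the source kit: it extracts HPO terms, rsIDs, and Ensembl gene IDs from a source prompt and reports any that are missing from its translation. The pattern is deliberately narrow and does not cover gene symbols, OMIM numbers, or JSON keys.

```python
import re

# Narrow, illustrative pattern: HPO terms, Ensembl gene IDs, dbSNP rsIDs.
# ASCII lookarounds are used instead of \b so that an ID followed directly
# by a Korean particle (e.g. "rs4253311은") still matches.
ID_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])(?:HP:\d{7}|ENSG\d{11}|rs\d+)(?![0-9])"
)


def missing_identifiers(source, translation):
    """Return identifiers present in `source` but absent from `translation`."""
    return set(ID_PATTERN.findall(source)) - set(ID_PATTERN.findall(translation))


if __name__ == "__main__":
    en = "Patient has HP:0001249; variant rs4253311 lies in ENSG00000110244."
    ko = "환자는 HP:0001249를 보이며, 변이 rs4253311은 ENSG00000110244에 위치합니다."
    print(missing_identifiers(en, ko))
```

An empty set means every matched identifier survived translation. Any non-empty result flags a prompt for manual review.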

	## Sanity check (vLLM)

	A 1-prompt-per-task spot check was run on the quantized model with greedy
	decoding (`temperature=0, max_tokens=256`). The model produces well-formed
	output (correct JSON structure for the JSON-output tasks, correct
	single-letter outputs for multiple-choice tasks where the answer fits in
	the budget, valid gene symbols / rsIDs, no garbled tokens). The short token
	budget truncates several reasoning-heavy tasks before a final answer is
	emitted; full benchmark accuracy should be measured with the official
	`biomni/eval/biomni_eval1.py` harness and a normal generation budget.

	This is not a benchmark report — for rigorous accuracy numbers please run
	the upstream evaluation harness against this checkpoint.
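
For readers who want to repeat a similar spot check, the well-formedness criteria described above can be approximated with two small predicates. This is an illustrative sketch rather than the script that was actually used. The choice letters and required keys are assumptions, and it checks output format only, not correctness.

```python
import json


def is_single_letter(output, choices="ABCDE"):
    """True if the output is exactly one of the allowed choice letters."""
    return output.strip() in set(choices)


def has_json_keys(output, required_keys):
    """True if the output parses as a JSON object containing all required keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(key in obj for key in required_keys)


if __name__ == "__main__":
    print(has_json_keys('{"causal_gene": ["MECP2"]}', ["causal_gene"]))
    print(is_single_letter(" B\n"))
```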

	## Intended use

	- Biomedical research assistance (literature triage, hypothesis exploration,
	variant / gene prioritisation, rare-disease differential reasoning).
	- Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the
	Biomni-E1 environment.
	- Research and evaluation of quantized biomedical agents.

	## Limitations and out-of-scope use

	- This model is a research preview. It is not a medical device and must
	not be used for clinical diagnosis, treatment decisions, or any
	patient-facing application without qualified medical oversight.
	- Outputs may contain factual errors, hallucinated identifiers, or outdated
	biomedical knowledge.
	- 4-bit AWQ typically costs some accuracy relative to the BF16 base model, and
	the size of that regression has not been benchmarked for this checkpoint. For
	maximum accuracy, use the original BF16 weights.
	- The Korean calibration half improves Korean prompt stability but the
	underlying model was not trained on Korean biomedical RL data, so Korean
	performance is bounded by the base model.

	## Files

	- `model-0000{1..5}-of-00005.safetensors`: quantized weights (W4A16 AWQ)
	- `model.safetensors.index.json`: shard index
	- `generation_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`,
	`chat_template.jinja`, `added_tokens.json`, `special_tokens_map.json`: same
	as the base model
	- `config.json`: the base model config plus the `quantization_config` entry
	written by llm-compressor
	- `recipe.yaml`: llm-compressor recipe used to produce these weights

	## License

	MIT, inherited from the base model.

	## Citation

	If you use this checkpoint, please cite the original Biomni-R0 work:

	```bibtex
	@misc{biomnir0,
	title = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level},
	author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec},
	year = {2025},
	month = {September},
	note = {Technical Report},
	url = {https://biomni.stanford.edu/blog/biomni-r0-technical-report}
	}
	```

	## Acknowledgements

	- [Stanford SNAP / Biomni](https://biomni.stanford.edu/) for the base model
	and the Biomni-E1 environment.
	- [vLLM project / llm-compressor](https://github.com/vllm-project/llm-compressor)
	for the AWQ implementation.
	- [FutureHouse LAB-Bench](https://huggingface.co/datasets/futurehouse/lab-bench),
	PubMedQA, and the C4 corpus for calibration data sources.