# nwirandx/Biomni-R0-32B-Preview-AWQ
A 4-bit AWQ (W4A16) quantization of
biomni/Biomni-R0-32B-Preview,
the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of
Qwen/Qwen3-32B and trained end-to-end with multi-turn reinforcement learning
inside the Biomni-E1 tool environment.
This release shrinks the FP (BF16) weights from ~64 GB to ~22 GB, fits the model on a single 24–48 GB GPU for inference, and preserves the original chat / tool-use behaviour.

The calibration mixture is bilingual (English + Korean), so the quantized model retains Korean biomedical inference quality in addition to the original English benchmark distribution.
## TL;DR
| | Original | This repo |
|---|---|---|
| Precision | BF16 | W4A16 (AWQ, group 128, asym) |
| Disk size | ~64 GB | ~22 GB |
| Min single-GPU VRAM (weights only, no KV cache) | ~70 GB | ~24 GB |
| Architecture | Qwen3ForCausalLM | unchanged |
| Context length | 32k native / 131k YaRN | unchanged |
| Tool / agent behaviour | Biomni-E1 compatible | unchanged |
## Quick start
### vLLM (recommended)

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --trust-remote-code
```
For YaRN-extended context (up to 131k tokens), pass:

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
  --max-model-len 131072 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --trust-remote-code
```
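Once the server is up, any OpenAI-compatible client can talk to it. A minimal standard-library sketch (the base URL assumes vLLM's default port; the prompt is purely illustrative):

```python
import json
import urllib.request

# OpenAI-style chat completion request for the vLLM server above.
payload = {
    "model": "nwirandx/Biomni-R0-32B-Preview-AWQ",
    "messages": [
        {"role": "user",
         "content": "Which gene is most strongly associated with Rett syndrome?"}
    ],
    "temperature": 0.0,
    "max_tokens": 512,
}

def query(base_url="http://localhost:8000/v1"):
    """POST the payload to the server's /chat/completions endpoint
    and return the assistant message text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

The official `openai` Python client works the same way with `base_url="http://localhost:8000/v1"`.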
### transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nwirandx/Biomni-R0-32B-Preview-AWQ")
model = AutoModelForCausalLM.from_pretrained(
    "nwirandx/Biomni-R0-32B-Preview-AWQ",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user",
     "content": "Given a patient with HP:0001249 and HP:0000750, "
                "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=512)[0], skip_special_tokens=True))
```
## Running the Biomni agent loop
The quantized model is a drop-in replacement for the FP base model in the snap-stanford/biomni repo — point the agent at this checkpoint (or a vLLM endpoint serving it) and use it as documented upstream.
## Quantization recipe
| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Toolkit | llm-compressor 0.10.0.1 |
| Scheme | W4A16_ASYM |
| Group size | 128 |
| Symmetric | False (zero-point quantization) |
| Skipped modules | lm_head |
| Calibration samples used | 256 (max_seq_len = 2048) |
| Sequential pipeline | per decoder block |
| Hardware | 4 × NVIDIA RTX A6000 48 GB |
The recipe is also stored as `recipe.yaml` next to the weights for full reproducibility.
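For orientation, the table above maps onto llm-compressor's declarative recipe format roughly as follows. This is an illustrative sketch only (field names assumed from llm-compressor's `AWQModifier` API); the authoritative copy is the `recipe.yaml` shipped in this repo:

```yaml
# Sketch of the AWQ settings; see recipe.yaml in this repo for the real recipe.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head"]     # output head kept in full precision
      targets: ["Linear"]
      scheme: W4A16_ASYM      # 4-bit weights, 16-bit activations, zero-point (asymmetric), group 128
```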
## Calibration data: bilingual biomedical mix
A core design choice for this release: the calibration set is dominated by the actual Biomni-Eval1 evaluation prompts, in both English and Korean, so the quantized activation statistics match the deployment distribution as closely as possible.
| Source | Samples | Notes |
|---|---|---|
| biomni/Eval1 (English) | 433 | All 10 tasks, full set |
| biomni/Eval1 (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) |
| allenai/c4 (English) | 64 | Short general-domain text for natural-language coverage |
| qiaojin/PubMedQA | 64 | `pqa_labeled` split, formatted as Question / Context / Answer |
| **Pool total** | **994** | Stratified shuffle, first 256 used for AWQ |
All calibration prompts were rendered with the official Qwen3 chat template (`enable_thinking=False`) before tokenization.
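The "stratified shuffle, first 256" selection can be sketched as follows. This is a minimal illustration, not the shipped sampling code; the source names are shorthand for the rows of the calibration table, and the helper is hypothetical:

```python
import random

# Pool composition from the calibration table above: source -> sample count.
POOL_SIZES = {"eval1_en": 433, "eval1_ko": 433, "c4_en": 64, "pubmedqa": 64}

def select_calibration(pool_sizes, num_samples=256, seed=0):
    """Pick num_samples items so each source keeps roughly its share of the
    994-sample pool, then shuffle the selection (one simple reading of
    'stratified shuffle, first 256 used')."""
    total = sum(pool_sizes.values())
    rng = random.Random(seed)
    picks = []
    for name, size in pool_sizes.items():
        quota = round(num_samples * size / total)  # proportional quota per source
        picks += [(name, i) for i in rng.sample(range(size), min(quota, size))]
    rng.shuffle(picks)
    return picks[:num_samples]
```

With these pool sizes the proportional quotas are 112 + 112 + 16 + 16 = 256, so the two Eval1 halves dominate the calibration set, matching the design goal of anchoring activation statistics to the deployment distribution.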
### Why a Korean half?
The base model was fine-tuned on English only, but downstream users in Korean clinical / biomedical settings often submit prompts in Korean. Including a Korean half in calibration noticeably stabilises Korean activation magnitudes and reduces post-quantization regression on Korean biomedical prompts compared with an English-only calibration set. Meanwhile the English half, which covers the bulk of the model's RL training distribution, keeps English performance intact.
### Translation methodology
Korean prompts were produced by an LLM translator under explicit constraints:

- Verbatim preservation of gene symbols (`APOA4`, `BRCA1`, …), variant rsIDs (`rs4253311`), Ensembl / OMIM / HPO identifiers (`ENSG…`, `HP:…`), cell-line names (HEK293T), drug / protein / enzyme names, and any JSON schema keys (e.g. `{"causal_gene": [...]}`, `{"OMIM_ID": "..."}`).
- Native Korean medical terminology for natural-language portions (희귀질환 진단 "rare-disease diagnosis", 변이 우선순위 결정 "variant prioritisation", 유전체 연관 분석 "genomic association analysis", …) with the English term in parentheses on first mention.
- Structural fidelity: bullet lists, code blocks, and answer-format instructions kept identical to the source.
- `answer` fields were never modified.
The translated dataset and the original English prompts are both shipped in the source kit used to build this model so the calibration is fully reproducible.
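The identifier-preservation constraint is mechanically checkable. A minimal sketch (the regexes and function names are ours, not part of the shipped kit) that extracts the identifiers that must survive translation and compares the two versions:

```python
import re

# Patterns for identifiers that must appear verbatim in both languages.
ID_PATTERNS = [
    r"\brs\d+\b",        # variant rsIDs, e.g. rs4253311
    r"\bENSG\d{11}\b",   # Ensembl gene IDs
    r"\bHP:\d{7}\b",     # HPO terms, e.g. HP:0001249
    r"\bOMIM:?\d{6}\b",  # OMIM entries
]

def identifiers(text):
    """Return the sorted multiset of biomedical identifiers found in text."""
    found = []
    for pat in ID_PATTERNS:
        found += re.findall(pat, text)
    return sorted(found)

def translation_preserves_ids(english, korean):
    """True iff both versions carry exactly the same identifiers."""
    return identifiers(english) == identifiers(korean)
```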
## Sanity check (vLLM)
A one-prompt-per-task spot check was run on the quantized model with greedy decoding (`temperature=0`, `max_tokens=256`). The model produces well-formed output: correct JSON structure for the JSON-output tasks, correct single-letter answers for multiple-choice tasks where the answer fits in the budget, valid gene symbols / rsIDs, and no garbled tokens. The short token budget truncates several reasoning-heavy tasks before a final answer is emitted; full benchmark accuracy should be measured with the official `biomni/eval/biomni_eval1.py` harness and a normal generation budget.
This is not a benchmark report; for rigorous accuracy numbers, please run the upstream evaluation harness against this checkpoint.
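The well-formedness part of such a spot check is easy to automate. A small sketch (the answer schema and function names are illustrative, not the official harness) that validates a JSON-output task response:

```python
import json

def parse_model_json(raw):
    """Extract and parse the first JSON object in a model response;
    models often wrap the JSON in prose or code fences."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(raw[start:end + 1])

def check_causal_gene_answer(raw, candidates):
    """Verify the response names exactly one gene from the candidate list
    (illustrative schema: {"causal_gene": ["MECP2"]})."""
    obj = parse_model_json(raw)
    genes = obj.get("causal_gene", [])
    return len(genes) == 1 and genes[0] in candidates
```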
## Intended use
- Biomedical research assistance (literature triage, hypothesis exploration, variant / gene prioritisation, rare-disease differential reasoning).
- Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the Biomni-E1 environment.
- Research and evaluation of quantized biomedical agents.
## Limitations and out-of-scope use
- This model is a research preview. It is not a medical device and must not be used for clinical diagnosis, treatment decisions, or any patient-facing application without qualified medical oversight.
- Outputs may contain factual errors, hallucinated identifiers, or outdated biomedical knowledge.
- AWQ at 4 bits introduces a small quality regression vs. the BF16 base model. For maximum accuracy, use the original FP weights.
- The Korean calibration half improves Korean prompt stability but the underlying model was not trained on Korean biomedical RL data, so Korean performance is bounded by the base model.
## Files
- `model-0000{1..5}-of-00005.safetensors` – quantized weights (W4A16 AWQ)
- `model.safetensors.index.json` – shard index
- `config.json`, `generation_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`, `chat_template.jinja`, `added_tokens.json`, `special_tokens_map.json` – same as the base model
- `recipe.yaml` – llm-compressor recipe used to produce these weights
## License
MIT, inherited from the base model.
## Citation
If you use this checkpoint, please cite the original Biomni-R0 work:
```bibtex
@misc{biomnir0,
  title  = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level},
  author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec},
  year   = {2025},
  month  = {September},
  note   = {Technical Report},
  url    = {https://biomni.stanford.edu/blog/biomni-r0-technical-report}
}
```
## Acknowledgements
- Stanford SNAP / Biomni for the base model and the Biomni-E1 environment.
- vLLM project / llm-compressor for the AWQ implementation.
- FutureHouse LAB-Bench, PubMedQA, and the C4 corpus for calibration data sources.