nwirandx/Biomni-R0-32B-Preview-AWQ

A 4-bit AWQ (W4A16) quantization of biomni/Biomni-R0-32B-Preview, the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of Qwen/Qwen3-32B and trained end-to-end with multi-turn reinforcement learning inside the Biomni-E1 tool environment.

This release shrinks the BF16 weights from ~64 GB to ~22 GB, fits the model on a single 24–48 GB GPU for inference, and preserves the original chat / tool-use behaviour. The calibration mixture is bilingual (English + Korean), so the quantized model retains Korean biomedical inference quality in addition to the original English benchmark distribution.

TL;DR

| | Original | This repo |
|---|---|---|
| Precision | BF16 | W4A16 (AWQ, group 128, asym) |
| Disk size | ~64 GB | ~22 GB |
| Min single-GPU VRAM (no KV) | ~70 GB | ~24 GB |
| Architecture | Qwen3ForCausalLM | unchanged |
| Context length | 32k native / 131k YaRN | unchanged |
| Tool / agent behaviour | Biomni-E1 compatible | unchanged |

Quick start

vLLM (recommended)

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```

For YaRN-extended context (up to 131k tokens), pass:

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
    --max-model-len 131072 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --trust-remote-code
```

transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nwirandx/Biomni-R0-32B-Preview-AWQ")
model = AutoModelForCausalLM.from_pretrained(
    "nwirandx/Biomni-R0-32B-Preview-AWQ",
    device_map="auto",   # spread across available GPUs
    torch_dtype="auto",  # activations run in 16-bit; weights stay W4A16
    trust_remote_code=True,
)

# HPO terms: HP:0001249 = intellectual disability,
# HP:0000750 = delayed speech and language development.
messages = [
    {"role": "user",
     "content": "Given a patient with HP:0001249 and HP:0000750, "
                "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=512)[0], skip_special_tokens=True))
```

Running the Biomni agent loop

The quantized model is a drop-in replacement for the FP base model in the snap-stanford/biomni repo — point the agent at this checkpoint (or a vLLM endpoint serving it) and use it as documented upstream.
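Outside the agent loop, the served endpoint also speaks the OpenAI-compatible API. A minimal sketch (the base URL matches the `vllm serve` command above; the prompt is illustrative):

```python
# Minimal sketch: query the vLLM endpoint through its OpenAI-compatible API.
# Assumes the `vllm serve` command from "Quick start" is running on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="nwirandx/Biomni-R0-32B-Preview-AWQ",
    messages=[{
        "role": "user",
        "content": "Rank the candidate genes [FOXP2, MECP2, SHANK3] for a "
                   "patient with HP:0001249 and HP:0000750.",
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```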

Quantization recipe

| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Toolkit | llm-compressor 0.10.0.1 |
| Scheme | W4A16_ASYM |
| Group size | 128 |
| Symmetric | False (zero-point quantization) |
| Skipped modules | lm_head |
| Calibration samples | 256 (max_seq_len = 2048) |
| Pipeline | Sequential, per decoder block |
| Hardware | 4 × NVIDIA RTX A6000 48 GB |

The recipe is also stored as recipe.yaml next to the weights for full reproducibility.
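For orientation, the settings above correspond to an llm-compressor one-shot run along the following lines. This is a sketch, not the authoritative recipe (that is recipe.yaml): the dataset wiring via `calib_ds` is a placeholder for the bilingual calibration mix described in the next section.

```python
# Sketch of the llm-compressor call behind recipe.yaml (the shipped recipe
# is authoritative; the calibration dataset `calib_ds` is a placeholder).
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "biomni/Biomni-R0-32B-Preview", torch_dtype="auto", device_map="auto"
)

recipe = [
    # W4A16_ASYM: 4-bit weights, group size 128, asymmetric (zero-point);
    # lm_head stays in full precision.
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

oneshot(
    model=model,
    dataset=calib_ds,  # 256-sample bilingual mix, see next section
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
model.save_pretrained("Biomni-R0-32B-Preview-AWQ", save_compressed=True)
```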

Calibration data — bilingual biomedical mix

A core design choice for this release: the calibration set is dominated by the actual Biomni-Eval1 evaluation prompts, in both English and Korean, so the activation statistics used for quantization match the deployment distribution as closely as possible.

| Source | Samples | Notes |
|---|---|---|
| biomni/Eval1 (English) | 433 | All 10 tasks, full set |
| biomni/Eval1 (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) |
| allenai/c4 (English) | 64 | Short general-domain text for natural-language coverage |
| qiaojin/PubMedQA | 64 | pqa_labeled split, formatted as Question / Context / Answer |
| Pool total | 994 | Stratified shuffle; first 256 used for AWQ |

All calibration prompts were rendered with the official Qwen3 chat template (enable_thinking=False) before tokenization.
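In code, that rendering step looks roughly like this (a sketch; the exact preprocessing script ships in the source kit):

```python
# Sketch: render one calibration prompt with the official Qwen3 chat template.
# enable_thinking=False suppresses the <think> block, matching the recipe.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("biomni/Biomni-R0-32B-Preview")

def render(prompt: str) -> str:
    return tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
```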

Why a Korean half?

The base model was fine-tuned on English only, but downstream users in Korean clinical and biomedical settings often submit prompts in Korean. Including a Korean half in calibration noticeably stabilises Korean activation magnitudes and reduces post-quantization regression on Korean biomedical prompts compared to an English-only calibration set. Meanwhile the English half, which covers the bulk of the model's RL training distribution, keeps English performance intact.

Translation methodology

Korean prompts were produced by an LLM translator under explicit constraints:

  • Verbatim preservation of gene symbols (APOA4, BRCA1, …), variant rsIDs (rs4253311), Ensembl/OMIM/HPO identifiers (ENSG…, HP:…), cell-line names (HEK293T), drug / protein / enzyme names, and any JSON schema keys (e.g. {"causal_gene": [...]}, {"OMIM_ID": "..."}).
  • Native Korean medical terminology for natural-language portions (희귀질환 진단 / rare-disease diagnosis, 변이 우선순위 결정 / variant prioritisation, 유전체 연관 분석 / genomic association analysis, …) with the English term in parentheses on first mention.
  • Structural fidelity: bullet lists, code blocks, and answer-format instructions kept identical to the source.
  • Gold answer fields were never modified.

The translated dataset and the original English prompts are both shipped in the source kit used to build this model so the calibration is fully reproducible.
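The identifier-preservation constraint is mechanically checkable. A minimal sketch of such a check (the regex set is illustrative, not the exact one used in the build):

```python
# Illustrative check that biomedical identifiers survive translation verbatim.
# The pattern list is an assumption; extend it for drug names, cell lines, etc.
import re

ID_PATTERNS = [
    r"rs\d+",           # variant rsIDs, e.g. rs4253311
    r"ENSG\d{11}",      # Ensembl gene IDs
    r"HP:\d{7}",        # HPO terms
    r"OMIM:?\s?\d{6}",  # OMIM entries
]

def extract_ids(text: str) -> list[str]:
    ids = []
    for pat in ID_PATTERNS:
        ids.extend(re.findall(pat, text))
    return sorted(ids)

def identifiers_preserved(english: str, korean: str) -> bool:
    # A translation passes only if both sides carry the same identifier multiset.
    return extract_ids(english) == extract_ids(korean)
```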

Sanity check (vLLM)

A 1-prompt-per-task spot check was run on the quantized model with greedy decoding (temperature=0, max_tokens=256). The model produces well-formed output (correct JSON structure for the JSON-output tasks, correct single-letter outputs for multiple-choice tasks where the answer fits in the budget, valid gene symbols / rsIDs, no garbled tokens). The short token budget truncates several reasoning-heavy tasks before a final answer is emitted; full benchmark accuracy should be measured with the official biomni/eval/biomni_eval1.py harness and a normal generation budget.

This is not a benchmark report — for rigorous accuracy numbers please run the upstream evaluation harness against this checkpoint.
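For reference, the spot check amounts to something like the following with vLLM's offline API (the prompt is illustrative; the actual check used one prompt per Eval1 task):

```python
# Sketch of the greedy spot check described above, using vLLM offline.
from vllm import LLM, SamplingParams

llm = LLM(model="nwirandx/Biomni-R0-32B-Preview-AWQ", max_model_len=32768)
greedy = SamplingParams(temperature=0, max_tokens=256)  # matches the spot check

# One illustrative prompt; a real run iterates over all ten Eval1 tasks.
conversation = [{"role": "user",
                 "content": "Which gene is most likely causal for HP:0001249? "
                            'Answer as JSON: {"causal_gene": [...]}'}]
out = llm.chat(conversation, greedy)
print(out[0].outputs[0].text)
```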

Intended use

  • Biomedical research assistance (literature triage, hypothesis exploration, variant / gene prioritisation, rare-disease differential reasoning).
  • Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the Biomni-E1 environment.
  • Research and evaluation of quantized biomedical agents.

Limitations and out-of-scope use

  • This model is a research preview. It is not a medical device and must not be used for clinical diagnosis, treatment decisions, or any patient-facing application without qualified medical oversight.
  • Outputs may contain factual errors, hallucinated identifiers, or outdated biomedical knowledge.
  • AWQ at 4 bits introduces a small quality regression vs. the BF16 base model. For maximum accuracy, use the original FP weights.
  • The Korean calibration half improves Korean prompt stability but the underlying model was not trained on Korean biomedical RL data, so Korean performance is bounded by the base model.

Files

  • model-0000{1..5}-of-00005.safetensors — quantized weights (W4A16 AWQ)
  • model.safetensors.index.json — shard index
  • config.json, generation_config.json, tokenizer*, vocab.json, merges.txt, chat_template.jinja, added_tokens.json, special_tokens_map.json — same as the base model
  • recipe.yaml — llm-compressor recipe used to produce these weights

License

MIT, inherited from the base model.

Citation

If you use this checkpoint, please cite the original Biomni-R0 work:

```bibtex
@misc{biomnir0,
  title  = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level},
  author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec},
  year   = {2025},
  month  = {September},
  note   = {Technical Report},
  url    = {https://biomni.stanford.edu/blog/biomni-r0-technical-report}
}
```
