nwirandx/Biomni-R0-32B-Preview-AWQ

A 4-bit AWQ (W4A16) quantization of biomni/Biomni-R0-32B-Preview, the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of Qwen/Qwen3-32B and trained end-to-end with multi-turn reinforcement learning inside the Biomni-E1 tool environment.

This release shrinks the BF16 weights from ~64 GB to ~22 GB, fits the model on a single 24–48 GB GPU for inference, and preserves the original chat / tool-use behaviour. The calibration mixture is bilingual (English + Korean), so the quantized model retains Korean biomedical inference quality in addition to the original English benchmark distribution.

TL;DR

| | Original | This repo |
|---|---|---|
| Precision | BF16 | W4A16 (AWQ, group 128, asym) |
| Disk size | ~64 GB | ~22 GB |
| Min single-GPU VRAM (no KV) | ~70 GB | ~24 GB |
| Architecture | Qwen3ForCausalLM | unchanged |
| Context length | 32k native / 131k YaRN | unchanged |
| Tool / agent behaviour | Biomni-E1 compatible | unchanged |

Quick start

vLLM (recommended)

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```

For YaRN-extended context (up to 131k tokens), pass:

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
    --max-model-len 131072 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --trust-remote-code
```

transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nwirandx/Biomni-R0-32B-Preview-AWQ")
model = AutoModelForCausalLM.from_pretrained(
    "nwirandx/Biomni-R0-32B-Preview-AWQ",
    device_map="auto",   # spread across available GPUs
    torch_dtype="auto",  # activations run in 16-bit; weights stay W4A16
    trust_remote_code=True,
)

# HPO terms: HP:0001249 = intellectual disability,
# HP:0000750 = delayed speech and language development.
messages = [
    {"role": "user",
     "content": "Given a patient with HP:0001249 and HP:0000750, "
                "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"}
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=512)[0], skip_special_tokens=True))
```

Running the Biomni agent loop

The quantized model is a drop-in replacement for the FP base model in the snap-stanford/biomni repo — point the agent at this checkpoint (or a vLLM endpoint serving it) and use it as documented upstream.
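Outside the agent loop, the served endpoint also speaks the OpenAI-compatible API. A minimal sketch (the base URL matches the `vllm serve` command above; the prompt is illustrative):

```python
# Minimal sketch: query the vLLM endpoint through its OpenAI-compatible API.
# Assumes the `vllm serve` command from "Quick start" is running on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="nwirandx/Biomni-R0-32B-Preview-AWQ",
    messages=[{
        "role": "user",
        "content": "Rank the candidate genes [FOXP2, MECP2, SHANK3] for a "
                   "patient with HP:0001249 and HP:0000750.",
    }],
    max_tokens=1024,
)
print(resp.choices[0].message.content)
```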

Quantization recipe

| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Toolkit | llm-compressor 0.10.0.1 |
| Scheme | W4A16_ASYM |
| Group size | 128 |
| Symmetric | False (zero-point quantization) |
| Skipped modules | lm_head |
| Calibration samples | 256 (max_seq_len = 2048) |
| Pipeline | Sequential, per decoder block |
| Hardware | 4 × NVIDIA RTX A6000 48 GB |

The recipe is also stored as recipe.yaml next to the weights for full reproducibility.
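For orientation, the settings above correspond to an llm-compressor one-shot run along the following lines. This is a sketch, not the authoritative recipe (that is recipe.yaml): the dataset wiring via `calib_ds` is a placeholder for the bilingual calibration mix described in the next section.

```python
# Sketch of the llm-compressor call behind recipe.yaml (the shipped recipe
# is authoritative; the calibration dataset `calib_ds` is a placeholder).
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "biomni/Biomni-R0-32B-Preview", torch_dtype="auto", device_map="auto"
)

recipe = [
    # W4A16_ASYM: 4-bit weights, group size 128, asymmetric (zero-point);
    # lm_head stays in full precision.
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

oneshot(
    model=model,
    dataset=calib_ds,  # 256-sample bilingual mix, see next section
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)
model.save_pretrained("Biomni-R0-32B-Preview-AWQ", save_compressed=True)
```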

Calibration data — bilingual biomedical mix

A core design choice for this release: the calibration set is dominated by the actual Biomni-Eval1 evaluation prompts, in both English and Korean, so the activation statistics used for quantization match the deployment distribution as closely as possible.

| Source | Samples | Notes |
|---|---|---|
| biomni/Eval1 (English) | 433 | All 10 tasks, full set |
| biomni/Eval1 (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) |
| allenai/c4 (English) | 64 | Short general-domain text for natural-language coverage |
| qiaojin/PubMedQA | 64 | pqa_labeled split, formatted as Question / Context / Answer |
| Pool total | 994 | Stratified shuffle; first 256 used for AWQ |

All calibration prompts were rendered with the official Qwen3 chat template (enable_thinking=False) before tokenization.
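In code, that rendering step looks roughly like this (a sketch; the exact preprocessing script ships in the source kit):

```python
# Sketch: render one calibration prompt with the official Qwen3 chat template.
# enable_thinking=False suppresses the <think> block, matching the recipe.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("biomni/Biomni-R0-32B-Preview")

def render(prompt: str) -> str:
    return tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )
```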

Why a Korean half?

The base model was fine-tuned on English only, but downstream users in Korean clinical and biomedical settings often submit prompts in Korean. Including a Korean half in calibration noticeably stabilises Korean activation magnitudes and reduces post-quantization regression on Korean biomedical prompts compared to an English-only calibration set. Meanwhile the English half, which covers the bulk of the model's RL training distribution, keeps English performance intact.

Translation methodology

Korean prompts were produced by an LLM translator under explicit constraints:

  • Verbatim preservation of gene symbols (APOA4, BRCA1, …), variant rsIDs (rs4253311), Ensembl/OMIM/HPO identifiers (ENSG…, HP:…), cell-line names (HEK293T), drug / protein / enzyme names, and any JSON schema keys (e.g. {"causal_gene": [...]}, {"OMIM_ID": "..."}).
  • Native Korean medical terminology for natural-language portions (희귀질환 진단 / rare-disease diagnosis, 변이 우선순위 결정 / variant prioritisation, 유전체 연관 분석 / genomic association analysis, …) with the English term in parentheses on first mention.
  • Structural fidelity: bullet lists, code blocks, and answer-format instructions kept identical to the source.
  • Gold answer fields were never modified.

The translated dataset and the original English prompts are both shipped in the source kit used to build this model so the calibration is fully reproducible.
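The identifier-preservation constraint is mechanically checkable. A minimal sketch of such a check (the regex set is illustrative, not the exact one used in the build):

```python
# Illustrative check that biomedical identifiers survive translation verbatim.
# The pattern list is an assumption; extend it for drug names, cell lines, etc.
import re

ID_PATTERNS = [
    r"rs\d+",           # variant rsIDs, e.g. rs4253311
    r"ENSG\d{11}",      # Ensembl gene IDs
    r"HP:\d{7}",        # HPO terms
    r"OMIM:?\s?\d{6}",  # OMIM entries
]

def extract_ids(text: str) -> list[str]:
    ids = []
    for pat in ID_PATTERNS:
        ids.extend(re.findall(pat, text))
    return sorted(ids)

def identifiers_preserved(english: str, korean: str) -> bool:
    # A translation passes only if both sides carry the same identifier multiset.
    return extract_ids(english) == extract_ids(korean)
```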

Sanity check (vLLM)

A 1-prompt-per-task spot check was run on the quantized model with greedy decoding (temperature=0, max_tokens=256). The model produces well-formed output (correct JSON structure for the JSON-output tasks, correct single-letter outputs for multiple-choice tasks where the answer fits in the budget, valid gene symbols / rsIDs, no garbled tokens). The short token budget truncates several reasoning-heavy tasks before a final answer is emitted; full benchmark accuracy should be measured with the official biomni/eval/biomni_eval1.py harness and a normal generation budget.

This is not a benchmark report — for rigorous accuracy numbers please run the upstream evaluation harness against this checkpoint.
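For reference, the spot check amounts to something like the following with vLLM's offline API (the prompt is illustrative; the actual check used one prompt per Eval1 task):

```python
# Sketch of the greedy spot check described above, using vLLM offline.
from vllm import LLM, SamplingParams

llm = LLM(model="nwirandx/Biomni-R0-32B-Preview-AWQ", max_model_len=32768)
greedy = SamplingParams(temperature=0, max_tokens=256)  # matches the spot check

# One illustrative prompt; a real run iterates over all ten Eval1 tasks.
conversation = [{"role": "user",
                 "content": "Which gene is most likely causal for HP:0001249? "
                            'Answer as JSON: {"causal_gene": [...]}'}]
out = llm.chat(conversation, greedy)
print(out[0].outputs[0].text)
```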

Intended use

  • Biomedical research assistance (literature triage, hypothesis exploration, variant / gene prioritisation, rare-disease differential reasoning).
  • Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the Biomni-E1 environment.
  • Research and evaluation of quantized biomedical agents.

Limitations and out-of-scope use

  • This model is a research preview. It is not a medical device and must not be used for clinical diagnosis, treatment decisions, or any patient-facing application without qualified medical oversight.
  • Outputs may contain factual errors, hallucinated identifiers, or outdated biomedical knowledge.
  • AWQ at 4 bits introduces a small quality regression vs. the BF16 base model. For maximum accuracy, use the original FP weights.
  • The Korean calibration half improves Korean prompt stability but the underlying model was not trained on Korean biomedical RL data, so Korean performance is bounded by the base model.

Files

  • model-0000{1..5}-of-00005.safetensors — quantized weights (W4A16 AWQ)
  • model.safetensors.index.json — shard index
  • config.json, generation_config.json, tokenizer*, vocab.json, merges.txt, chat_template.jinja, added_tokens.json, special_tokens_map.json — same as the base model
  • recipe.yaml — llm-compressor recipe used to produce these weights

License

MIT, inherited from the base model.

Citation

If you use this checkpoint, please cite the original Biomni-R0 work:

```bibtex
@misc{biomnir0,
  title  = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level},
  author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec},
  year   = {2025},
  month  = {September},
  note   = {Technical Report},
  url    = {https://biomni.stanford.edu/blog/biomni-r0-technical-report}
}
```
