---
license: mit
language:
- en
- ko
tags:
- agent
- biology
- biomedical
- code
- awq
- quantized
- 4-bit
- vllm
- qwen3
base_model:
- biomni/Biomni-R0-32B-Preview
base_model_relation: quantized
pipeline_tag: text-generation
---
# nwirandx/Biomni-R0-32B-Preview-AWQ
A 4-bit AWQ (W4A16) quantization of
[**biomni/Biomni-R0-32B-Preview**](https://huggingface.co/biomni/Biomni-R0-32B-Preview),
the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of
`Qwen/Qwen3-32B` and trained end-to-end with multi-turn reinforcement learning
inside the Biomni-E1 tool environment.
This release shrinks the BF16 weights from **~64 GB → ~22 GB**, so the model
fits on a single 24–48 GB GPU for inference, and preserves the original chat /
tool-use behaviour. The calibration mixture is **bilingual (English + Korean)**
so that the quantized model retains quality on Korean biomedical prompts as
well as on the original English benchmark distribution.
## TL;DR
| | Original | This repo |
|---|---|---|
| Precision | BF16 | W4A16 (AWQ, group 128, asym) |
| Disk size | ~64 GB | ~22 GB |
| Min single-GPU VRAM (no KV) | ~70 GB | ~24 GB |
| Architecture | Qwen3ForCausalLM | unchanged |
| Context length | 32k native / 131k YaRN | unchanged |
| Tool / agent behaviour | Biomni-E1 compatible | unchanged |
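The "no KV" figures in the table cover weights only. As a rough sketch of how much extra memory the KV cache adds, the snippet below applies the usual per-token formula (keys plus values, per layer, per KV head). The architecture numbers (64 layers, 8 KV heads, head dimension 128) and the 16-bit cache are assumptions based on the public Qwen3-32B configuration; check them against this repo's `config.json` before relying on the result.

```python
def kv_cache_bytes(num_tokens, num_layers=64, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache needed for `num_tokens` cached tokens.

    Defaults are ASSUMED Qwen3-32B values, not read from this checkpoint.
    The leading factor 2 accounts for storing both keys and values.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 1024 ** 3
for ctx in (4096, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / GIB:.1f} GiB")
```

Under these assumptions a single full 32k-token sequence needs about 8 GiB of cache on top of the weights, which is why a 24 GB card is only practical with a much shorter `--max-model-len`.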
## Quick start
### vLLM (recommended)
```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--trust-remote-code
```
For YaRN-extended context (up to 131k tokens), pass:
```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
--max-model-len 131072 \
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
--trust-remote-code
```
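Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch; the base URL assumes vLLM's default host and port (`localhost:8000`), and the helper names `build_chat_request` and `chat` are illustrative, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "nwirandx/Biomni-R0-32B-Preview-AWQ"


def build_chat_request(prompt, max_tokens=512, temperature=0.0):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to the running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the `vllm serve` process above to be running):
# print(chat("Which gene is most often mutated in Rett syndrome?"))
```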
### transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nwirandx/Biomni-R0-32B-Preview-AWQ"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",   # place layers on the available GPU(s)
    torch_dtype="auto",  # use the dtype recorded in the checkpoint config
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": "Given a patient with HP:0001249 and HP:0000750, "
                   "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?",
    }
]

# Render the conversation with the model's chat template, then generate.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```
### Running the Biomni agent loop
The quantized model is a drop-in replacement for the BF16 base model in the
[snap-stanford/biomni](https://github.com/snap-stanford/biomni) repo: point
the agent at this checkpoint (or a vLLM endpoint serving it) and use it as
documented upstream.
## Quantization recipe
| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Toolkit | [llm-compressor](https://github.com/vllm-project/llm-compressor) `0.10.0.1` |
| Scheme | `W4A16_ASYM` |
| Group size | 128 |
| Symmetric | False (zero-point quantization) |
| Skipped modules | `lm_head` |
| Calibration samples used | 256 (max_seq_len = 2048) |
| Sequential pipeline | per decoder block |
| Hardware | 4 × NVIDIA RTX A6000 48 GB |
The recipe is also stored as `recipe.yaml` next to the weights for full
reproducibility.
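To make the `W4A16_ASYM` / group-size-128 settings concrete, here is a small pure-Python sketch of asymmetric (zero-point) round-to-nearest quantization for one weight group. It illustrates the number format only: it does not perform AWQ's activation-aware scale search and is not llm-compressor's implementation. The storage widths used in `effective_bits_per_weight` (16-bit scale, 4-bit zero point per group) are assumptions.

```python
def quantize_group_asym(weights, bits=4):
    """Asymmetric quantization of one group of floats to unsigned `bits`-bit ints."""
    qmax = (1 << bits) - 1                      # 15 for 4-bit
    lo = min(min(weights), 0.0)                 # range must include zero
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:                            # all-zero group
        scale = 1.0
    zero_point = round(-lo / scale)             # integer code that represents 0.0
    q = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Map integer codes back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def effective_bits_per_weight(bits=4, group_size=128,
                              scale_bits=16, zero_point_bits=4):
    """Average storage per weight including per-group metadata (ASSUMED widths)."""
    return bits + (scale_bits + zero_point_bits) / group_size


group = [0.01 * (i - 40) for i in range(128)]   # toy weights, not real model data
codes, scale, zp = quantize_group_asym(group)
restored = dequantize_group(codes, scale, zp)
print(max(abs(a - b) for a, b in zip(group, restored)), scale)
print(effective_bits_per_weight())
```

Each restored weight lands within roughly half a quantization step (`scale / 2`) of the original, and the per-group metadata adds only a fraction of a bit per weight at group size 128.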
## Calibration data — bilingual biomedical mix
A core design choice for this release: the calibration set is dominated by the
**actual Biomni-Eval1 evaluation prompts**, in both English and Korean, so the
activation statistics that AWQ uses to scale the weights match the deployment
distribution as closely as possible.
| Source | Samples | Notes |
|---|---|---|
| `biomni/Eval1` (English) | 433 | All 10 tasks, full set |
| `biomni/Eval1` (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) |
| `allenai/c4` (English) | 64 | Short general-domain text for natural-language coverage |
| `qiaojin/PubMedQA` | 64 | `pqa_labeled` split, formatted as `Question / Context / Answer` |
| **Pool total** | **994** | Stratified shuffle, first 256 used for AWQ |
All calibration prompts were rendered with the official Qwen3 chat template
(`enable_thinking=False`) before tokenization.
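The card does not spell out how the stratified shuffle was implemented. One plausible reading, sketched below, is to shuffle within each source and then interleave the sources by evenly spaced positions, so that any prefix (here the first 256) roughly mirrors the pool proportions. The function name, the seed, and the use of integer placeholders instead of real prompts are all illustrative assumptions.

```python
import random
from collections import Counter


def stratified_order(pools, seed=0):
    """Interleave sources so every prefix roughly mirrors pool proportions.

    This is one possible interpretation of "stratified shuffle",
    not the exact procedure used to build this checkpoint.
    """
    rng = random.Random(seed)
    keyed = []
    for name, items in pools.items():
        items = list(items)
        rng.shuffle(items)                       # shuffle within each source
        n = len(items)
        for i, item in enumerate(items):
            # Spread each source evenly over the interval [0, 1].
            keyed.append(((i + 0.5) / n, name, item))
    keyed.sort(key=lambda entry: entry[0])       # merge sources by position
    return [(name, item) for _, name, item in keyed]


# Placeholder pools with the sizes from the table above.
pools = {
    "eval1_en": range(433),
    "eval1_ko": range(433),
    "c4": range(64),
    "pubmedqa": range(64),
}
calibration = stratified_order(pools)[:256]
print(Counter(name for name, _ in calibration))
```

With these pool sizes, each Eval1 language contributes about 112 of the 256 samples and each general-domain source about 16.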
### Why a Korean half?
The base model was fine-tuned on English data only, but downstream users in
Korean clinical / biomedical settings often submit prompts in Korean. Including
a Korean half in calibration noticeably stabilises Korean activation magnitudes
and reduces post-quantization regression on Korean biomedical prompts compared
to an English-only calibration set. The English half, which matches the bulk of
the model's RL training distribution, keeps English performance intact.
### Translation methodology
Korean prompts were produced by an LLM translator under explicit constraints:
- **Verbatim preservation** of gene symbols (`APOA4`, `BRCA1`, …), variant
rsIDs (`rs4253311`), Ensembl/OMIM/HPO identifiers (`ENSG…`, `HP:…`),
cell-line names (`HEK293T`), drug / protein / enzyme names, and any JSON
schema keys (e.g. `{"causal_gene": [...]}`, `{"OMIM_ID": "..."}`).
- **Native Korean medical terminology** for natural-language portions
(희귀질환 진단, 변이 우선순위 결정, 유전체 연관 분석, …) with the
English term in parentheses on first mention.
- **Structural fidelity**: bullet lists, code blocks, and answer-format
instructions kept identical to the source.
- `answer` fields were never modified.
The translated dataset and the original English prompts are both shipped in
the source kit used to build this model so the calibration is fully reproducible.
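The verbatim-preservation constraint can be spot-checked automatically. The sketch below extracts a few identifier types with regular expressions and reports any that appear in the English source but not in the translation. The patterns and helper names are illustrative assumptions (they cover rsIDs, HPO terms, Ensembl gene IDs, and JSON keys, but not free-form gene symbols or drug names), and this is not necessarily the check used when building the dataset.

```python
import re

# Lookarounds are used instead of \b because Korean particles attach
# directly to identifiers (e.g. "HP:0001249를") and count as word characters.
ID_PATTERNS = [
    r"(?<![A-Za-z0-9])rs\d+(?!\d)",          # dbSNP rsIDs, e.g. rs4253311
    r"(?<![A-Za-z0-9])HP:\d{7}(?!\d)",       # HPO terms
    r"(?<![A-Za-z0-9])ENSG\d{11}(?!\d)",     # Ensembl gene IDs
    r'"[A-Za-z_][A-Za-z0-9_]*"(?=\s*:)',     # JSON schema keys, e.g. "causal_gene":
]


def extract_ids(text):
    """Collect every identifier in `text` that matches one of the patterns."""
    found = set()
    for pattern in ID_PATTERNS:
        found.update(re.findall(pattern, text))
    return found


def missing_ids(source, translation):
    """Identifiers present in the source but absent from the translation."""
    return extract_ids(source) - extract_ids(translation)


# Made-up example pair for illustration only.
src = 'Patient has HP:0001249; variant rs4253311. Answer as {"causal_gene": [...]}.'
ko = '환자는 HP:0001249를 보이며 변이 rs4253311이 있습니다. {"causal_gene": [...]} 형식으로 답하세요.'
print(missing_ids(src, ko))
```

An empty result means every extracted identifier survived translation; any non-empty set flags a prompt for manual review.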
## Sanity check (vLLM)
A 1-prompt-per-task spot check was run on the quantized model with greedy
decoding (`temperature=0, max_tokens=256`). The model produces well-formed
output (correct JSON structure for the JSON-output tasks, correct
single-letter outputs for multiple-choice tasks where the answer fits in
the budget, valid gene symbols / rsIDs, no garbled tokens). The short token
budget truncates several reasoning-heavy tasks before a final answer is
emitted; full benchmark accuracy should be measured with the official
`biomni/eval/biomni_eval1.py` harness and a normal generation budget.
This is **not a benchmark report** — for rigorous accuracy numbers please run
the upstream evaluation harness against this checkpoint.
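A well-formedness check of this kind can be sketched in a few lines. The helpers below look for the first parseable JSON object in a completion and test whether a multiple-choice reply is a single letter. They are illustrative assumptions (including the `ABCDE` choice set), not the script used for the spot check, and they say nothing about whether an answer is correct.

```python
import json


def first_json_object(text):
    """Return the first parseable JSON object embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass                                  # not valid JSON from this brace
        start = text.find("{", start + 1)
    return None


def is_choice_letter(text, choices="ABCDE"):
    """True if the reply is exactly one of the allowed option letters."""
    return text.strip() in set(choices)


print(first_json_object('Answer: {"causal_gene": ["MECP2"]}'))
print(first_json_object('{"causal_gene": ["MEC'))   # truncated by the token budget
print(is_choice_letter(" B\n"))
```

A truncated completion, such as one cut off by the 256-token budget, fails the JSON check, which matches the behaviour described above for reasoning-heavy tasks.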
## Intended use
- Biomedical research assistance (literature triage, hypothesis exploration,
variant / gene prioritisation, rare-disease differential reasoning).
- Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the
Biomni-E1 environment.
- Research and evaluation of quantized biomedical agents.
## Limitations and out-of-scope use
- This model is a **research preview**. It is not a medical device and must
not be used for clinical diagnosis, treatment decisions, or any
patient-facing application without qualified medical oversight.
- Outputs may contain factual errors, hallucinated identifiers, or outdated
biomedical knowledge.
- AWQ at 4 bits typically introduces some quality regression vs. the BF16 base
model, and the size of that regression has not been benchmarked for this
checkpoint. For maximum accuracy, use the original BF16 weights.
- The Korean calibration half improves Korean prompt stability but the
underlying model was not trained on Korean biomedical RL data, so Korean
performance is bounded by the base model.
## Files
- `model-0000{1..5}-of-00005.safetensors`: quantized weights (W4A16 AWQ)
- `model.safetensors.index.json`: shard index
- `config.json`: base model config plus the `quantization_config` block written
by llm-compressor
- `generation_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`,
`chat_template.jinja`, `added_tokens.json`, `special_tokens_map.json`: same as
the base model
- `recipe.yaml`: llm-compressor recipe used to produce these weights
## License
MIT, inherited from the base model.
## Citation
If you use this checkpoint, please cite the original Biomni-R0 work:
```bibtex
@misc{biomnir0,
title = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level},
author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec},
year = {2025},
month = {September},
note = {Technical Report},
url = {https://biomni.stanford.edu/blog/biomni-r0-technical-report}
}
```
## Acknowledgements
- [Stanford SNAP / Biomni](https://biomni.stanford.edu/) for the base model
and the Biomni-E1 environment.
- [vLLM project / llm-compressor](https://github.com/vllm-project/llm-compressor)
for the AWQ implementation.
- [FutureHouse LAB-Bench](https://huggingface.co/datasets/futurehouse/lab-bench),
PubMedQA, and the C4 corpus for calibration data sources.