---
license: mit
language:
- en
- ko
tags:
- agent
- biology
- biomedical
- code
- awq
- quantized
- 4-bit
- vllm
- qwen3
base_model:
- biomni/Biomni-R0-32B-Preview
base_model_relation: quantized
pipeline_tag: text-generation
---

# nwirandx/Biomni-R0-32B-Preview-AWQ

A 4-bit AWQ (W4A16) quantization of
[**biomni/Biomni-R0-32B-Preview**](https://huggingface.co/biomni/Biomni-R0-32B-Preview),
the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of
`Qwen/Qwen3-32B` and trained end-to-end with multi-turn reinforcement learning
inside the Biomni-E1 tool environment.

This release shrinks the BF16 weights from **~64 GB to ~22 GB**, so the model
fits on a single 24–48 GB GPU for inference while preserving the original
chat / tool-use behaviour. On a 24 GB card only a few GB remain for the KV
cache, so expect a short usable context there. The calibration mixture is
**bilingual (English + Korean)**, with the aim of retaining Korean biomedical
inference quality in addition to the original English benchmark distribution.
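
As a rough guide to how much context fits next to the weights, the sketch below estimates the KV-cache size of a single sequence. The layer count, KV-head count, and head dimension are assumptions taken from the public Qwen3-32B configuration (64 layers, 8 KV heads, head dimension 128), and the cache is assumed to be 16-bit; check these against this repo's `config.json` before relying on the numbers.

```python
def kv_cache_bytes(num_tokens, num_layers=64, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache needed for one sequence of `num_tokens` tokens.

    Defaults are ASSUMED Qwen3-32B values; verify them in config.json.
    """
    # Factor 2: each layer stores one key and one value vector per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 1024 ** 3
for context in (4096, 32768, 131072):
    print(f"{context:>6} tokens: {kv_cache_bytes(context) / GIB:.1f} GiB")
```

Under these assumptions a full 32k-token sequence needs about 8 GiB of cache on top of the weights, which is why the larger cards in the 24–48 GB range are the comfortable choice for long prompts.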

## TL;DR

| | Original | This repo |
|---|---|---|
| Precision | BF16 | W4A16 (AWQ, group 128, asym) |
| Disk size | ~64 GB | ~22 GB |
| Min single-GPU VRAM (weights only, excluding KV cache) | ~70 GB | ~24 GB |
| Architecture | Qwen3ForCausalLM | unchanged |
| Context length | 32k native / 131k YaRN | unchanged |
| Tool / agent behaviour | Biomni-E1 compatible | unchanged |

## Quick start

### vLLM (recommended)

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```
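
Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server's default address `http://localhost:8000/v1`; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "nwirandx/Biomni-R0-32B-Preview-AWQ"


def build_chat_request(prompt, model=MODEL_ID, max_tokens=512, temperature=0.0):
    """Build the JSON payload for a single-turn chat completion."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_chat_request(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the chat completions endpoint and return the reply text.

    `base_url` is an ASSUMPTION (vLLM's default host and port); adjust as needed.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `send_chat_request(build_chat_request("Which gene is associated with Rett syndrome?"))` returns the assistant message as a string.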

For YaRN-extended context (up to 131k tokens), pass:

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
    --max-model-len 131072 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --trust-remote-code
```
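
The `factor` in the flag above is the ratio of the target context length to the native one. A small helper (hypothetical name `yarn_rope_scaling`) can generate the JSON for other targets; the 32768 default is the native length stated in this card.

```python
import json


def yarn_rope_scaling(target_len, original_len=32768):
    """Return a YaRN rope-scaling dict for the requested context length."""
    return {
        "rope_type": "yarn",
        "factor": target_len / original_len,
        "original_max_position_embeddings": original_len,
    }


for target in (65536, 131072):
    # Compact separators produce a string that is easy to paste into a shell flag.
    print(target, json.dumps(yarn_rope_scaling(target), separators=(",", ":")))
```

The upstream Qwen3 documentation suggests matching the factor to the typical context length of the workload (for example 2.0 for 65,536 tokens) rather than always using the maximum, so it may be worth serving with a smaller factor when prompts rarely approach 131k tokens.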

### transformers

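```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "nwirandx/Biomni-R0-32B-Preview-AWQ"

# Checkpoints written by llm-compressor use the compressed-tensors format,
# so install the `compressed-tensors` package before loading.
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    device_map="auto",   # place layers on the available GPU(s)
    torch_dtype="auto",  # take the compute dtype from config.json
    trust_remote_code=True,
)

messages = [
    {"role": "user",
     "content": "Given a patient with HP:0001249 and HP:0000750, "
                "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"}
]
# Render the Qwen3 chat template to a string, then tokenize it.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))
```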

### Running the Biomni agent loop

The quantized model is a drop-in replacement for the BF16 base model in the
[snap-stanford/biomni](https://github.com/snap-stanford/biomni) repo: point
the agent at this checkpoint (or at a vLLM endpoint serving it) and use it as
documented upstream.

## Quantization recipe

| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Toolkit | [llm-compressor](https://github.com/vllm-project/llm-compressor) `0.10.0.1` |
| Scheme | `W4A16_ASYM` |
| Group size | 128 |
| Symmetric | False (zero-point quantization) |
| Skipped modules | `lm_head` |
| Calibration samples used | 256 (max_seq_len = 2048) |
| Sequential pipeline | per decoder block |
| Hardware | 4 × NVIDIA RTX A6000 48 GB |

The recipe is also stored as `recipe.yaml` next to the weights for full
reproducibility.
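
To make the `W4A16_ASYM` / group-size-128 storage format concrete, here is a minimal pure-Python sketch of asymmetric 4-bit quantization of a single weight group. It illustrates only the round-to-nearest scale / zero-point format; it does not implement AWQ's activation-aware scaling and is not the llm-compressor code. The overhead estimate assumes a 16-bit scale and a 4-bit zero point per group, which is an assumption about the packed layout.

```python
import math


def quantize_group(weights, num_bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << num_bits) - 1                 # 15 for 4-bit codes
    w_min = min(min(weights), 0.0)             # keep 0.0 exactly representable
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax or 1.0      # fall back to 1.0 for an all-zero group
    zero_point = round(-w_min / scale)         # integer code that decodes to 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def effective_bits(num_bits=4, group_size=128, scale_bits=16, zero_point_bits=4):
    """Average bits per weight, including ASSUMED per-group metadata sizes."""
    return num_bits + (scale_bits + zero_point_bits) / group_size


group = [math.sin(i) * 0.05 for i in range(128)]   # synthetic stand-in weights
codes, scale, zp = quantize_group(group)
restored = dequantize_group(codes, scale, zp)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.5f} zero_point={zp} max_abs_error={max_err:.5f}")
print(f"effective bits per weight: {effective_bits()}")
```

The reconstruction error of each weight is bounded by half the group's scale, which is why a smaller group size (more scales) trades extra storage for accuracy.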

## Calibration data — bilingual biomedical mix

A core design choice for this release: the calibration set is dominated by the
**actual Biomni-Eval1 evaluation prompts**, in both English and Korean, so that
the activation statistics collected during calibration match the deployment
distribution as closely as possible.

| Source | Samples | Notes |
|---|---|---|
| `biomni/Eval1` (English) | 433 | All 10 tasks, full set |
| `biomni/Eval1` (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) |
| `allenai/c4` (English) | 64 | Short general-domain text for natural-language coverage |
| `qiaojin/PubMedQA` | 64 | `pqa_labeled` split, formatted as `Question / Context / Answer` |
| **Pool total** | **994** | Stratified shuffle, first 256 used for AWQ |

All calibration prompts were rendered with the official Qwen3 chat template
(`enable_thinking=False`) before tokenization.
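
The sketch below shows one plausible reading of "stratified shuffle": allocate the 256 calibration slots to each source in proportion to its pool size, sample within each source, then shuffle. This is an assumption for illustration; the exact procedure used for this release is the one in the source kit and may differ. The pool entries are placeholder strings, and `stratified_sample` is a hypothetical helper name.

```python
import random
from collections import Counter


def stratified_sample(pools, n_total, seed=0):
    """Draw n_total items, keeping each source's share proportional to its pool size."""
    rng = random.Random(seed)
    total = sum(len(items) for items in pools.values())
    exact = {name: len(items) * n_total / total for name, items in pools.items()}
    quota = {name: int(share) for name, share in exact.items()}
    # Largest-remainder rule: hand leftover slots to the biggest fractional parts.
    leftover = n_total - sum(quota.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - quota[name], reverse=True)
    for name in by_remainder[:leftover]:
        quota[name] += 1
    picked = []
    for name, items in pools.items():
        picked.extend((name, item) for item in rng.sample(items, quota[name]))
    rng.shuffle(picked)
    return picked


# Placeholder pools with the sizes listed in the table above.
pools = {
    "eval1_en": [f"eval1_en_{i}" for i in range(433)],
    "eval1_ko": [f"eval1_ko_{i}" for i in range(433)],
    "c4": [f"c4_{i}" for i in range(64)],
    "pubmedqa": [f"pubmedqa_{i}" for i in range(64)],
}
subset = stratified_sample(pools, n_total=256)
print(Counter(source for source, _ in subset))
```

Under this proportional scheme the 256-sample subset would contain 112 prompts from each Eval1 language and 16 each from C4 and PubMedQA, so roughly 88% of the calibration tokens come from the Eval1 prompt distribution.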

### Why a Korean half?

The base model was fine-tuned on English data only, but downstream users in
Korean clinical / biomedical settings often submit prompts in Korean. Including
a Korean half in calibration noticeably stabilises Korean activation magnitudes
and reduces post-quantization regression on Korean biomedical prompts compared
to an English-only calibration set, while the English half (which covers the
bulk of the model's RL training distribution) keeps English performance intact.

### Translation methodology

Korean prompts were produced by an LLM translator under explicit constraints:

- **Verbatim preservation** of gene symbols (`APOA4`, `BRCA1`, …), variant
  rsIDs (`rs4253311`), Ensembl/OMIM/HPO identifiers (`ENSG…`, `HP:…`),
  cell-line names (`HEK293T`), drug / protein / enzyme names, and any JSON
  schema keys (e.g. `{"causal_gene": [...]}`, `{"OMIM_ID": "..."}`).
- **Native Korean medical terminology** for natural-language portions
  (희귀질환 진단, 변이 우선순위 결정, 유전체 연관 분석, …) with the
  English term in parentheses on first mention.
- **Structural fidelity**: bullet lists, code blocks, and answer-format
  instructions kept identical to the source.
- `answer` fields were never modified.

The translated dataset and the original English prompts are both shipped in
the source kit used to build this model so the calibration is fully reproducible.
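
A constraint like verbatim identifier preservation can be checked mechanically. The following is a minimal sketch, not the validation used for this release: the regular expressions cover only rsIDs, Ensembl gene IDs, HPO terms, and JSON keys, and the helper names are hypothetical. Gene symbols and cell-line names would need a curated list rather than a pattern.

```python
import re

# Patterns for identifier types that must survive translation unchanged.
ID_PATTERNS = [
    r"\brs\d+\b",              # dbSNP variant IDs, e.g. rs4253311
    r"\bENSG\d{11}\b",         # Ensembl gene IDs
    r"\bHP:\d{7}\b",           # HPO term IDs
    r'"[A-Za-z_]+"(?=\s*:)',   # quoted JSON schema keys
]


def extract_identifiers(text):
    """Collect every identifier in `text` that matches one of the patterns."""
    found = set()
    for pattern in ID_PATTERNS:
        found.update(re.findall(pattern, text))
    return found


def missing_identifiers(source, translation):
    """Identifiers present in the English source but absent from the translation.

    A plain substring test is used on the translation because Korean particles
    may be attached directly to an identifier, which defeats `\\b` boundaries.
    """
    return {ident for ident in extract_identifiers(source) if ident not in translation}


source = ('Patient has HP:0001249; variant rs4253311 in ENSG00000012048. '
          'Answer as {"causal_gene": [...]}.')
good = ('환자는 HP:0001249 표현형과 ENSG00000012048 내 rs4253311 변이를 가집니다. '
        '{"causal_gene": [...]} 형식으로 답하세요.')
print(missing_identifiers(source, good))
```

An empty result means every extracted identifier appears verbatim in the translation; any non-empty set flags a prompt for manual review.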

## Sanity check (vLLM)

A spot check with one prompt per task was run on the quantized model with
greedy decoding (`temperature=0, max_tokens=256`). The model produces
well-formed output: correct JSON structure for the JSON-output tasks, correct
single-letter outputs for multiple-choice tasks where the answer fits in the
budget, valid gene symbols / rsIDs, and no garbled tokens. The short token
budget truncates several reasoning-heavy tasks before a final answer is
emitted, so full benchmark accuracy should be measured with the official
`biomni/eval/biomni_eval1.py` harness and a normal generation budget.

This is **not a benchmark report** — for rigorous accuracy numbers please run
the upstream evaluation harness against this checkpoint.
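
For readers who want to repeat a similar spot check, the well-formedness criteria can be expressed as small predicates. This is a sketch with hypothetical helper names, not the script used for this card; the `causal_gene` key comes from the example schema quoted earlier, and the choice letters are an assumption.

```python
import json
import re


def is_valid_json_answer(text, required_key):
    """True if `text` contains a parseable JSON object that has `required_key`."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return False          # no complete object, e.g. output cut off by max_tokens
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_key in parsed


def is_single_letter_answer(text, choices="ABCDE"):
    """True if the whole output is exactly one of the allowed choice letters."""
    return text.strip() in set(choices)


print(is_valid_json_answer('{"causal_gene": ["MECP2"]}', "causal_gene"))
print(is_single_letter_answer(" B\n"))
```

A truncated generation fails both predicates, which matches the behaviour described above for reasoning-heavy tasks under a 256-token budget.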

## Intended use

- Biomedical research assistance (literature triage, hypothesis exploration,
  variant / gene prioritisation, rare-disease differential reasoning).
- Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the
  Biomni-E1 environment.
- Research and evaluation of quantized biomedical agents.

## Limitations and out-of-scope use

- This model is a **research preview**. It is not a medical device and must
  not be used for clinical diagnosis, treatment decisions, or any
  patient-facing application without qualified medical oversight.
- Outputs may contain factual errors, hallucinated identifiers, or outdated
  biomedical knowledge.
- AWQ at 4 bits introduces a small quality regression vs. the BF16 base
  model. For maximum accuracy, use the original BF16 weights.
- The Korean calibration half improves Korean prompt stability, but the
  underlying model was not trained on Korean biomedical RL data, so Korean
  performance is bounded by the base model.

## Files

- `model-0000{1..5}-of-00005.safetensors` — quantized weights (W4A16 AWQ)
- `model.safetensors.index.json` — shard index
- `config.json`, `generation_config.json`, `tokenizer*`, `vocab.json`,
  `merges.txt`, `chat_template.jinja`, `added_tokens.json`,
  `special_tokens_map.json` — same as the base model
- `recipe.yaml` — llm-compressor recipe used to produce these weights

## License

MIT, inherited from the base model.

## Citation

If you use this checkpoint, please cite the original Biomni-R0 work:

```bibtex
@misc{biomnir0,
  title  = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level},
  author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec},
  year   = {2025},
  month  = {September},
  note   = {Technical Report},
  url    = {https://biomni.stanford.edu/blog/biomni-r0-technical-report}
}
```

## Acknowledgements

- [Stanford SNAP / Biomni](https://biomni.stanford.edu/) for the base model
  and the Biomni-E1 environment.
- [vLLM project / llm-compressor](https://github.com/vllm-project/llm-compressor)
  for the AWQ implementation.
- [FutureHouse LAB-Bench](https://huggingface.co/datasets/futurehouse/lab-bench),
  PubMedQA, and the C4 corpus for calibration data sources.