| --- |
| license: mit |
| language: |
| - en |
| - ko |
| tags: |
| - agent |
| - biology |
| - biomedical |
| - code |
| - awq |
| - quantized |
| - 4-bit |
| - vllm |
| - qwen3 |
| base_model: |
| - biomni/Biomni-R0-32B-Preview |
| base_model_relation: quantized |
| pipeline_tag: text-generation |
| --- |
| |
| # nwirandx/Biomni-R0-32B-Preview-AWQ |
|
|
| A 4-bit AWQ (W4A16) quantization of |
| [**biomni/Biomni-R0-32B-Preview**](https://huggingface.co/biomni/Biomni-R0-32B-Preview), |
| the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of |
| `Qwen/Qwen3-32B` and trained end-to-end with multi-turn reinforcement learning |
| inside the Biomni-E1 tool environment. |
|
|
| This release shrinks the BF16 weights from **~64 GB to ~22 GB**, so the model |
| fits on a single 24–48 GB GPU for inference (a 24 GB card leaves little |
| headroom for KV cache), and it preserves the original chat / tool-use |
| behaviour. The calibration mixture is **bilingual (English + Korean)** so that |
| the quantized model retains quality on Korean biomedical prompts as well as on |
| the original English benchmark distribution. |
|
|
| ## TL;DR |
|
|
| | | Original | This repo | |
| |---|---|---| |
| | Precision | BF16 | W4A16 (AWQ, group 128, asym) | |
| | Disk size | ~64 GB | ~22 GB | |
| | Min single-GPU VRAM (no KV) | ~70 GB | ~24 GB | |
| | Architecture | Qwen3ForCausalLM | unchanged | |
| | Context length | 32k native / 131k YaRN | unchanged | |
| | Tool / agent behaviour | Biomni-E1 compatible | unchanged | |
|
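The "(no KV)" qualifier matters because the KV cache grows linearly with context length. The arithmetic can be sketched as follows, assuming the attention shape believed to match Qwen3-32B (64 layers, 8 key/value heads, head dimension 128) and a 16-bit cache. These shape values are assumptions, not figures stated in this card, so check them against `config.json` before relying on the numbers.

```python
# Back-of-the-envelope KV-cache size. The shape values below are assumptions
# (believed to match Qwen3-32B); verify them against this repo's config.json.
NUM_LAYERS = 64
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit cache (BF16 / FP16)


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


for context in (4096, 32768, 131072):
    gib = kv_cache_bytes(context) / 2**30
    print(f"{context:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumptions a single full 32k-token sequence needs about 8 GiB of cache on top of the weights, which is why a 24 GB card is only practical with a reduced `--max-model-len`.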
|
| ## Quick start |
|
|
| ### vLLM (recommended) |
|
|
| ```bash |
| vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \ |
| --tensor-parallel-size 1 \ |
| --max-model-len 32768 \ |
| --trust-remote-code |
| ``` |
|
|
| For YaRN-extended context (up to 131k tokens), pass: |
|
|
| ```bash |
| vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \ |
| --max-model-len 131072 \ |
| --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \ |
| --trust-remote-code |
| ``` |
|
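Once either `vllm serve` command is running, the model can be queried through vLLM's OpenAI-compatible HTTP API. The client below is a minimal sketch using only the standard library; it assumes the server is listening on vLLM's default `http://localhost:8000`, and the helper names `build_payload` and `ask` are illustrative rather than part of any library.

```python
import json
import urllib.request

# Assumption: vLLM is serving on its default port on this machine.
BASE_URL = "http://localhost:8000/v1"
MODEL = "nwirandx/Biomni-R0-32B-Preview-AWQ"


def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat-completions request body for a single user turn."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def ask(prompt: str) -> str:
    """POST the prompt to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires the server to be running):
# print(ask("Given a patient with HP:0001249 and HP:0000750, which causal "
#           "gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"))
```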
|
| ### transformers |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| tok = AutoTokenizer.from_pretrained("nwirandx/Biomni-R0-32B-Preview-AWQ") |
| model = AutoModelForCausalLM.from_pretrained( |
|     "nwirandx/Biomni-R0-32B-Preview-AWQ", |
|     device_map="auto", |
|     torch_dtype="auto", |
|     trust_remote_code=True, |
| ) |
| |
| messages = [ |
|     {"role": "user", |
|      "content": "Given a patient with HP:0001249 and HP:0000750, " |
|                 "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"} |
| ] |
| prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
| inputs = tok(prompt, return_tensors="pt").to(model.device) |
| output_ids = model.generate(**inputs, max_new_tokens=512) |
| # Decode only the newly generated tokens so the prompt is not echoed back. |
| new_tokens = output_ids[0][inputs["input_ids"].shape[1]:] |
| print(tok.decode(new_tokens, skip_special_tokens=True)) |
| ``` |
|
|
| ### Running the Biomni agent loop |
|
|
| The quantized model is a drop-in replacement for the BF16 base model in the |
| [snap-stanford/biomni](https://github.com/snap-stanford/biomni) repo: point |
| the agent at this checkpoint (or at a vLLM endpoint serving it) and use it as |
| documented upstream. |
|
|
| ## Quantization recipe |
|
|
| | Setting | Value | |
| |---|---| |
| | Method | AWQ (Activation-aware Weight Quantization) | |
| | Toolkit | [llm-compressor](https://github.com/vllm-project/llm-compressor) `0.10.0.1` | |
| | Scheme | `W4A16_ASYM` | |
| | Group size | 128 | |
| | Symmetric | False (zero-point quantization) | |
| | Skipped modules | `lm_head` | |
| | Calibration samples used | 256 (max_seq_len = 2048) | |
| | Sequential pipeline | per decoder block | |
| | Hardware | 4 × NVIDIA RTX A6000 48 GB | |
|
|
| The recipe is also stored as `recipe.yaml` next to the weights for full |
| reproducibility. |
|
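For readers unfamiliar with the scheme name: `W4A16_ASYM` with group size 128 means that each group of 128 weights shares one scale and one zero-point, and each weight is stored as a 4-bit code, while activations stay in 16-bit. The sketch below illustrates only that number format on a toy group of six values. It is not the llm-compressor implementation, it omits AWQ's activation-aware scaling step, and the function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Asymmetric (zero-point) quantization of one weight group to `bits` bits."""
    qmax = (1 << bits) - 1                  # 15 for 4-bit codes
    lo = min(min(weights), 0.0)             # widen the range so 0.0 is representable
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax or 1.0         # fall back to 1.0 for an all-zero group
    zero_point = round(-lo / scale)         # integer code that maps back to 0.0
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


# Toy data; a real group in this recipe holds 128 weights.
group = [-0.31, -0.12, 0.0, 0.07, 0.20, 0.44]
codes, scale, zero_point = quantize_group(group)
restored = dequantize_group(codes, scale, zero_point)
worst = max(abs(w - r) for w, r in zip(group, restored))
print(codes, round(scale, 4), zero_point, round(worst, 4))
```

The reconstruction error of each weight is bounded by about half the group's scale, which is why smaller groups (finer scales) trade extra metadata for accuracy.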
|
| ## Calibration data — bilingual biomedical mix |
|
|
| A core design choice for this release: the calibration pool is dominated by the |
| **actual Biomni-Eval1 evaluation prompts**, in both English and Korean, so the |
| activation statistics that AWQ collects during calibration match the expected |
| deployment distribution as closely as possible. |
|
|
| | Source | Samples | Notes | |
| |---|---|---| |
| | `biomni/Eval1` (English) | 433 | All 10 tasks, full set | |
| | `biomni/Eval1` (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) | |
| | `allenai/c4` (English) | 64 | Short general-domain text for natural-language coverage | |
| | `qiaojin/PubMedQA` | 64 | `pqa_labeled` split, formatted as `Question / Context / Answer` | |
| | **Pool total** | **994** | Stratified shuffle, first 256 used for AWQ | |
|
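The card does not spell out how the stratified shuffle was implemented. One plausible reading, sketched below, allocates the 256 calibration slots to the four sources in proportion to their pool sizes and then shuffles the selected samples. The largest-remainder allocation rule, the seed, the source keys, and the function names are all assumptions made for illustration.

```python
import random

POOL_SIZES = {  # sample counts from the table above
    "eval1_en": 433,
    "eval1_ko": 433,
    "c4_en": 64,
    "pubmedqa": 64,
}


def proportional_allocation(sizes, budget):
    """Split `budget` slots across sources in proportion to their sizes."""
    total = sum(sizes.values())
    alloc = {name: n * budget // total for name, n in sizes.items()}
    # Hand out the slots lost to flooring, largest remainder first.
    by_remainder = sorted(
        sizes, key=lambda name: (sizes[name] * budget) % total, reverse=True
    )
    for name in by_remainder[: budget - sum(alloc.values())]:
        alloc[name] += 1
    return alloc


def draw_calibration_indices(sizes, budget, seed=0):
    """Return a shuffled list of (source, index) pairs of length `budget`."""
    rng = random.Random(seed)
    picks = []
    for name, k in proportional_allocation(sizes, budget).items():
        picks += [(name, i) for i in rng.sample(range(sizes[name]), k)]
    rng.shuffle(picks)
    return picks


print(proportional_allocation(POOL_SIZES, 256))
```

Under this rule the 256 samples would split 112 / 112 / 16 / 16, so 87.5% of the calibration data would come from Biomni-Eval1.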
|
| All calibration prompts were rendered with the official Qwen3 chat template |
| (`enable_thinking=False`) before tokenization. |
|
|
| ### Why a Korean half? |
|
|
| The base model was fine-tuned on English data only, but downstream users in |
| Korean clinical / biomedical settings often submit prompts in Korean. Including |
| a Korean half in calibration noticeably stabilises Korean activation magnitudes |
| and reduces post-quantization regression on Korean biomedical prompts compared |
| to an English-only calibration set, while the English half, which matches the |
| bulk of the model's RL training distribution, keeps English performance intact. |
|
|
| ### Translation methodology |
|
|
| Korean prompts were produced by an LLM translator under explicit constraints: |
|
|
| - **Verbatim preservation** of gene symbols (`APOA4`, `BRCA1`, …), variant |
| rsIDs (`rs4253311`), Ensembl/OMIM/HPO identifiers (`ENSG…`, `HP:…`), |
| cell-line names (`HEK293T`), drug / protein / enzyme names, and any JSON |
| schema keys (e.g. `{"causal_gene": [...]}`, `{"OMIM_ID": "..."}`). |
| - **Native Korean medical terminology** for natural-language portions |
| (희귀질환 진단, 변이 우선순위 결정, 유전체 연관 분석, …) with the |
| English term in parentheses on first mention. |
| - **Structural fidelity**: bullet lists, code blocks, and answer-format |
| instructions kept identical to the source. |
| - `answer` fields were never modified. |
|
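The verbatim-preservation constraint above lends itself to an automatic check. The sketch below is one hypothetical way to flag translations that dropped or altered an identifier; the regular expressions are assumptions that cover only rsIDs, Ensembl gene IDs, HPO terms, and JSON keys, and this is not the tooling actually used to build the dataset. The patterns use ASCII lookarounds instead of `\b` because Korean particles attach directly to identifiers (for example `rs4253311과`) and Hangul counts as a word character in Python regexes.

```python
import re

# Hypothetical patterns for identifiers that must survive translation unchanged.
ID_PATTERNS = [
    r"(?<![A-Za-z0-9])rs\d+",              # variant rsIDs
    r"(?<![A-Za-z0-9])ENSG\d{11}(?!\d)",   # Ensembl gene IDs
    r"(?<![A-Za-z0-9])HP:\d{7}(?!\d)",     # HPO terms
    r'"[A-Za-z_]+"(?=\s*:)',               # JSON schema keys, quotes included
]


def extract_identifiers(text):
    """Collect every identifier in `text` that matches one of the patterns."""
    found = set()
    for pattern in ID_PATTERNS:
        found.update(re.findall(pattern, text))
    return found


def missing_identifiers(source, translation):
    """Identifiers present in the English source but absent from the translation."""
    return sorted(extract_identifiers(source) - extract_identifiers(translation))


source = 'Variant rs4253311 with phenotype HP:0001249; answer as {"causal_gene": [...]}.'
translation = '변이 rs4253311과 표현형 HP:0001249를 고려해 {"causal_gene": [...]} 형식으로 답하세요.'
print(missing_identifiers(source, translation))
```

Gene symbols, cell-line names, and drug names have no reliable surface pattern, so a check of this kind would need a curated term list for those categories.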
|
| The translated dataset and the original English prompts are both shipped in |
| the source kit used to build this model so the calibration is fully reproducible. |
|
|
| ## Sanity check (vLLM) |
|
|
| A spot check with one prompt per task was run on the quantized model with |
| greedy decoding (`temperature=0, max_tokens=256`). The model produces |
| well-formed output: correct JSON structure for the JSON-output tasks, correct |
| single-letter outputs for multiple-choice tasks where the answer fits in the |
| budget, valid gene symbols / rsIDs, and no garbled tokens. The short token |
| budget truncates several reasoning-heavy tasks before a final answer is |
| emitted, so full benchmark accuracy should be measured with the official |
| `biomni/eval/biomni_eval1.py` harness and a normal generation budget. |
|
|
| This is **not a benchmark report** — for rigorous accuracy numbers please run |
| the upstream evaluation harness against this checkpoint. |
|
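The well-formedness criteria described above can be expressed as small predicates. The following is a minimal sketch with hypothetical helper names, assuming multiple-choice options are labelled A to E; it checks output format only and says nothing about whether an answer is correct.

```python
import json
import re


def has_valid_json_object(output: str) -> bool:
    """True if the text from the first '{' to the last '}' parses as a JSON object."""
    start, end = output.find("{"), output.rfind("}")
    if start == -1 or end < start:
        return False
    try:
        return isinstance(json.loads(output[start : end + 1]), dict)
    except json.JSONDecodeError:
        return False


def is_single_letter_choice(output: str, letters: str = "ABCDE") -> bool:
    """True if the stripped output is exactly one allowed option letter."""
    return re.fullmatch(f"[{letters}]", output.strip()) is not None


print(has_valid_json_object('Answer: {"causal_gene": ["MECP2"]}'))
print(is_single_letter_choice(" B\n"))
```

An output truncated by the token budget fails both predicates, which makes truncation easy to separate from genuinely malformed output in a spot check.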
|
| ## Intended use |
|
|
| - Biomedical research assistance (literature triage, hypothesis exploration, |
| variant / gene prioritisation, rare-disease differential reasoning). |
| - Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the |
| Biomni-E1 environment. |
| - Research and evaluation of quantized biomedical agents. |
|
|
| ## Limitations and out-of-scope use |
|
|
| - This model is a **research preview**. It is not a medical device and must |
| not be used for clinical diagnosis, treatment decisions, or any |
| patient-facing application without qualified medical oversight. |
| - Outputs may contain factual errors, hallucinated identifiers, or outdated |
| biomedical knowledge. |
| - AWQ at 4 bits introduces a small quality regression vs. the BF16 base |
| model. For maximum accuracy, use the original BF16 weights. |
| - The Korean calibration half improves Korean prompt stability but the |
| underlying model was not trained on Korean biomedical RL data, so Korean |
| performance is bounded by the base model. |
|
|
| ## Files |
|
|
| - `model-0000{1..5}-of-00005.safetensors` — quantized weights (W4A16 AWQ) |
| - `model.safetensors.index.json` — shard index |
| - `config.json`, `generation_config.json`, `tokenizer*`, `vocab.json`, |
| `merges.txt`, `chat_template.jinja`, `added_tokens.json`, |
| `special_tokens_map.json` — same as the base model |
| - `recipe.yaml` — llm-compressor recipe used to produce these weights |
|
|
| ## License |
|
|
| MIT, inherited from the base model. |
|
|
| ## Citation |
|
|
| If you use this checkpoint, please cite the original Biomni-R0 work: |
|
|
| ```bibtex |
| @misc{biomnir0, |
| title = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level}, |
| author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec}, |
| year = {2025}, |
| month = {September}, |
| note = {Technical Report}, |
| url = {https://biomni.stanford.edu/blog/biomni-r0-technical-report} |
| } |
| ``` |
|
|
| ## Acknowledgements |
|
|
| - [Stanford SNAP / Biomni](https://biomni.stanford.edu/) for the base model |
| and the Biomni-E1 environment. |
| - [vLLM project / llm-compressor](https://github.com/vllm-project/llm-compressor) |
| for the AWQ implementation. |
| - [FutureHouse LAB-Bench](https://huggingface.co/datasets/futurehouse/lab-bench), |
| PubMedQA, and the C4 corpus for calibration data sources. |
|
|