---
license: mit
language:
- en
- ko
tags:
- agent
- biology
- biomedical
- code
- awq
- quantized
- 4-bit
- vllm
- qwen3
base_model:
- biomni/Biomni-R0-32B-Preview
base_model_relation: quantized
pipeline_tag: text-generation
---

# nwirandx/Biomni-R0-32B-Preview-AWQ

A 4-bit AWQ (W4A16) quantization of [**biomni/Biomni-R0-32B-Preview**](https://huggingface.co/biomni/Biomni-R0-32B-Preview), the Stanford SNAP / Biomni team's biomedical reasoning agent built on top of `Qwen/Qwen3-32B` and trained end-to-end with multi-turn reinforcement learning inside the Biomni-E1 tool environment.

This release shrinks the BF16 weights from **~64 GB to ~22 GB**, fits the weights on a single 24–48 GB GPU for inference (a 24 GB card leaves very little headroom for KV cache), and preserves the original chat / tool-use behaviour.

The calibration mixture is **bilingual (English + Korean)** so the quantized model retains Korean biomedical inference quality in addition to the original English benchmark distribution.

## TL;DR

| | Original | This repo |
|---|---|---|
| Precision | BF16 | W4A16 (AWQ, group 128, asym) |
| Disk size | ~64 GB | ~22 GB |
| Min single-GPU VRAM (no KV) | ~70 GB | ~24 GB |
| Architecture | Qwen3ForCausalLM | unchanged |
| Context length | 32k native / 131k YaRN | unchanged |
| Tool / agent behaviour | Biomni-E1 compatible | unchanged |

## Quick start

### vLLM (recommended)

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --trust-remote-code
```

For YaRN-extended context (up to 131k tokens), pass:

```bash
vllm serve nwirandx/Biomni-R0-32B-Preview-AWQ \
  --max-model-len 131072 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --trust-remote-code
```

### transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the quantized weights; device_map="auto" places
# the layers on the available GPU(s).
tok = AutoTokenizer.from_pretrained("nwirandx/Biomni-R0-32B-Preview-AWQ")
model = AutoModelForCausalLM.from_pretrained(
    "nwirandx/Biomni-R0-32B-Preview-AWQ",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user",
     "content": "Given a patient with HP:0001249 and HP:0000750, "
                "which causal gene is most likely from candidates [FOXP2, MECP2, SHANK3]?"}
]

# Render the conversation with the model's chat template, then generate.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=512)[0], skip_special_tokens=True))
```

### Running the Biomni agent loop

The quantized model is a drop-in replacement for the BF16 base model in the [snap-stanford/biomni](https://github.com/snap-stanford/biomni) repo: point the agent at this checkpoint (or at a vLLM endpoint serving it) and use it as documented upstream.

## Quantization recipe

| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Toolkit | [llm-compressor](https://github.com/vllm-project/llm-compressor) `0.10.0.1` |
| Scheme | `W4A16_ASYM` |
| Group size | 128 |
| Symmetric | False (zero-point quantization) |
| Skipped modules | `lm_head` |
| Calibration samples used | 256 (max_seq_len = 2048) |
| Sequential pipeline | per decoder block |
| Hardware | 4 × NVIDIA RTX A6000 48 GB |

The recipe is also stored as `recipe.yaml` next to the weights for full reproducibility.

## Calibration data: bilingual biomedical mix

A core design choice for this release: the calibration set is dominated by the **actual Biomni-Eval1 evaluation prompts**, in both English and Korean, so the quantized activation statistics match the deployment distribution as closely as possible.

| Source | Samples | Notes |
|---|---|---|
| `biomni/Eval1` (English) | 433 | All 10 tasks, full set |
| `biomni/Eval1` (Korean) | 433 | Translated by an LLM with strict identifier preservation (gene symbols, rsIDs, ENSG / OMIM / HPO IDs, JSON schema keys all kept verbatim) |
| `allenai/c4` (English) | 64 | Short general-domain text for natural-language coverage |
| `qiaojin/PubMedQA` | 64 | `pqa_labeled` split, formatted as `Question / Context / Answer` |
| **Pool total** | **994** | Stratified shuffle, first 256 used for AWQ |

All calibration prompts were rendered with the official Qwen3 chat template (`enable_thinking=False`) before tokenization.

### Why a Korean half?

The base model was fine-tuned on English data only, but downstream users in Korean clinical / biomedical settings often submit prompts in Korean. Including a Korean half in calibration noticeably stabilises Korean activation magnitudes and reduces post-quantization regression on Korean biomedical prompts compared to an English-only calibration set. The English half, which matches the bulk of the model's RL training distribution, keeps English performance intact.

### Translation methodology

Korean prompts were produced by an LLM translator under explicit constraints:

- **Verbatim preservation** of gene symbols (`APOA4`, `BRCA1`, …), variant rsIDs (`rs4253311`), Ensembl/OMIM/HPO identifiers (`ENSG…`, `HP:…`), cell-line names (`HEK293T`), drug / protein / enzyme names, and any JSON schema keys (e.g. `{"causal_gene": [...]}`, `{"OMIM_ID": "..."}`).
- **Native Korean medical terminology** for natural-language portions (희귀질환 진단, 변이 우선순위 결정, 유전체 연관 분석, …) with the English term in parentheses on first mention.
- **Structural fidelity**: bullet lists, code blocks, and answer-format instructions kept identical to the source.
- `answer` fields were never modified.
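The verbatim-preservation constraint lends itself to an automated check. The following is a minimal sketch, not the tooling used for this release: the regular expressions and the `extract_identifiers` / `missing_identifiers` helpers are illustrative assumptions, and they cover only rsIDs, Ensembl gene IDs, HPO terms, and ASCII JSON keys.

```python
import re

# Hypothetical patterns for identifiers that should survive translation unchanged.
# re.ASCII makes \b treat Hangul as non-word characters, so an identifier followed
# directly by a Korean particle (e.g. "rs4253311이") is still matched.
ID_PATTERNS = [
    re.compile(r"\brs\d+\b", re.ASCII),                         # variant rsIDs
    re.compile(r"\bENSG\d{11}\b", re.ASCII),                    # Ensembl gene IDs
    re.compile(r"\bHP:\d{7}\b", re.ASCII),                      # HPO terms
    re.compile(r'"([A-Za-z_][A-Za-z0-9_]*)"\s*:', re.ASCII),    # JSON schema keys
]


def extract_identifiers(text: str) -> set:
    """Collect every identifier matched by any of the patterns above."""
    found = set()
    for pattern in ID_PATTERNS:
        found.update(pattern.findall(text))
    return found


def missing_identifiers(source: str, translation: str) -> set:
    """Identifiers present in the English source but absent from the translation."""
    return extract_identifiers(source) - extract_identifiers(translation)


english = ('Which gene is causal for HP:0001249 given variant rs4253311? '
           'Answer as {"causal_gene": [...]}.')
korean_ok = ('표현형 HP:0001249와 변이 rs4253311이 주어졌을 때 원인 유전자는 무엇입니까? '
             '{"causal_gene": [...]} 형식으로 답하십시오.')
# A deliberately broken translation: truncated rsID and a translated JSON key.
korean_bad = ('표현형 HP:0001249와 변이 rs425331이 주어졌을 때 원인 유전자는 무엇입니까? '
              '{"원인_유전자": [...]} 형식으로 답하십시오.')

print(missing_identifiers(english, korean_ok))
print(missing_identifiers(english, korean_bad))
```

A check like this only catches dropped or altered identifiers; it says nothing about the quality of the surrounding Korean text, which still needs human or model review.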

The translated dataset and the original English prompts are both shipped in the source kit used to build this model, so the calibration is fully reproducible.

## Sanity check (vLLM)

A one-prompt-per-task spot check was run on the quantized model with greedy decoding (`temperature=0, max_tokens=256`). The model produces well-formed output: correct JSON structure for the JSON-output tasks, correct single-letter outputs for multiple-choice tasks where the answer fits in the budget, valid gene symbols / rsIDs, and no garbled tokens. The short token budget truncates several reasoning-heavy tasks before a final answer is emitted; full benchmark accuracy should be measured with the official `biomni/eval/biomni_eval1.py` harness and a normal generation budget.

This is **not a benchmark report**. For rigorous accuracy numbers, please run the upstream evaluation harness against this checkpoint.

## Intended use

- Biomedical research assistance (literature triage, hypothesis exploration, variant / gene prioritisation, rare-disease differential reasoning).
- Bilingual EN/KO biomedical Q&A and tool-augmented agent workflows via the Biomni-E1 environment.
- Research and evaluation of quantized biomedical agents.

## Limitations and out-of-scope use

- This model is a **research preview**. It is not a medical device and must not be used for clinical diagnosis, treatment decisions, or any patient-facing application without qualified medical oversight.
- Outputs may contain factual errors, hallucinated identifiers, or outdated biomedical knowledge.
- AWQ at 4 bits introduces a small quality regression vs. the BF16 base model. For maximum accuracy, use the original BF16 weights.
- The Korean calibration half improves Korean prompt stability, but the underlying model was not trained on Korean biomedical RL data, so Korean performance is bounded by the base model.
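To make the source of the 4-bit regression concrete, here is a minimal sketch of group-wise asymmetric (zero-point) quantization with group size 128, the storage scheme named in the recipe. It is plain round-to-nearest in pure Python: it omits the activation-aware channel scaling that AWQ applies before quantizing, and it is not llm-compressor's implementation. The function names are hypothetical.

```python
import random


def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = (1 << bits) - 1                      # 15 for 4-bit codes
    lo = min(min(weights), 0.0)                 # keep 0.0 inside the range so the
    hi = max(max(weights), 0.0)                 # zero point stays within [0, qmax]
    scale = (hi - lo) / qmax or 1.0             # guard against an all-zero group
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def max_roundtrip_error(weights, group_size=128):
    """Largest absolute error after quantizing each group independently."""
    worst = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, zero_point = quantize_group(group)
        restored = dequantize_group(codes, scale, zero_point)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


random.seed(0)
toy_weights = [random.gauss(0.0, 0.02) for _ in range(512)]  # synthetic, not real weights
print(max_roundtrip_error(toy_weights))
```

Each group stores one scale and one zero point, so the per-weight error is bounded by roughly half that group's scale. Smaller groups tighten the bound at the cost of more metadata.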
## Files

- `model-0000{1..5}-of-00005.safetensors`: quantized weights (W4A16 AWQ)
- `model.safetensors.index.json`: shard index
- `config.json`, `generation_config.json`, `tokenizer*`, `vocab.json`, `merges.txt`, `chat_template.jinja`, `added_tokens.json`, `special_tokens_map.json`: same as the base model
- `recipe.yaml`: llm-compressor recipe used to produce these weights

## License

MIT, inherited from the base model.

## Citation

If you use this checkpoint, please cite the original Biomni-R0 work:

```bibtex
@misc{biomnir0,
  title  = {Biomni-R0: Using RL to Hill-Climb Biomedical Reasoning Agents to Expert-Level},
  author = {Ryan Li and Kexin Huang and Shiyi Cao and Yuanhao Qu and Jure Leskovec},
  year   = {2025},
  month  = {September},
  note   = {Technical Report},
  url    = {https://biomni.stanford.edu/blog/biomni-r0-technical-report}
}
```

## Acknowledgements

- [Stanford SNAP / Biomni](https://biomni.stanford.edu/) for the base model and the Biomni-E1 environment.
- [vLLM project / llm-compressor](https://github.com/vllm-project/llm-compressor) for the AWQ implementation.
- [FutureHouse LAB-Bench](https://huggingface.co/datasets/futurehouse/lab-bench), PubMedQA, and the C4 corpus for calibration data sources.