Biomni-R0-32B-PTQ-INT8

This is a Post-Training Quantization (PTQ) INT8 version of Biomni-R0-32B-Preview.

Quantization Details

Parameter            Value
Scheme               INT8 (8-bit weights and activations)
Method               Post-Training Quantization (PTQ)
Backend              Optimum Quanto
Calibration Samples  120
Model Size           ~33 GB (about 1 byte per weight for 33B parameters, plus scales; vs ~66 GB in BF16)
Memory Reduction     ~50% compared to BF16
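
To make the INT8 scheme concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It only illustrates the general idea (one scale per tensor, integers in [-127, 127]). The function names are hypothetical, and Optimum Quanto's actual implementation may differ in granularity and rounding details.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [x * scale for x in q]


weights = [0.02, -0.5, 0.31, 1.27]  # illustrative values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The round-trip error per value is bounded by half the scale, which is why a tensor with a few large outliers (a large absmax) loses more precision on its small values.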

Usage

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "hassanshka/Biomni-R0-32B-PTQ-INT8",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("hassanshka/Biomni-R0-32B-PTQ-INT8")

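# Inference
messages = [{"role": "user", "content": "Your medical question here"}]
# add_generation_prompt=True appends the assistant turn marker so the model answers
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))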

Using vLLM (Recommended)

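vLLM normally detects the quantization method from the checkpoint's configuration, so no quantization argument is passed here. This checkpoint was produced with Optimum Quanto, not AWQ, so check that your vLLM version can load it before relying on this path:

from vllm import LLM, SamplingParams

llm = LLM(
    model="hassanshka/Biomni-R0-32B-PTQ-INT8",
    trust_remote_code=True
)

# max_tokens defaults to 16 in vLLM, so set it explicitly
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)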
prompts = ["Your medical question here"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Quantization Process

This model was quantized using Optimum Quanto with post-training quantization:

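import torch
from optimum.quanto import freeze, quantize, qint8
from transformers import AutoModelForCausalLM

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "biomni/Biomni-R0-32B-Preview",
    torch_dtype=torch.float16
)

# Quantize to INT8 (weights stay in float form until the model is frozen)
quantize(model, weights=qint8, activations=qint8)

# Because activations are quantized, run the calibration samples through the
# model inside optimum.quanto's `with Calibration():` context here (not shown).

# Freeze: replace the float weights with their INT8 form
freeze(model)

# Save the quantized model
model.save_pretrained("./Biomni-R0-32B-PTQ-INT8")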
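
The role of the calibration samples can be illustrated without any library: activation quantization needs a scale, and that scale is estimated from the activation ranges seen on sample inputs. The sketch below keeps a running average of the per-batch absolute maximum. The function name, the momentum value, and the sample data are all hypothetical, and Quanto's own calibration logic may differ.

```python
def calibrate_scale(batches, momentum=0.9, qmax=127):
    """Estimate an activation scale from a running average of per-batch absmax."""
    running = None
    for batch in batches:
        absmax = max(abs(x) for x in batch)
        if running is None:
            running = absmax  # first batch initializes the estimate
        else:
            # Exponential moving average smooths out batch-to-batch spikes
            running = momentum * running + (1 - momentum) * absmax
    return running / qmax


# Three made-up "activation" batches standing in for calibration samples
batches = [[0.5, -2.0, 1.0], [3.0, -1.0, 0.2], [-2.5, 0.1, 0.4]]
scale = calibrate_scale(batches)
print(scale)
```

More calibration samples give a more stable range estimate, which is the reason the sample count is reported alongside the scheme.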

Performance

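  • Memory Reduction: ~50% for weights compared to BF16 (about 1 byte per weight instead of 2, so ~33 GB instead of ~66 GB for 33B parameters)
  • Inference Speed: No benchmarks are reported here; any speedup depends on the GPU and on whether the serving backend provides INT8 kernels
  • Accuracy: No evaluation is reported here; INT8 PTQ typically causes only a small degradation when activations are calibrated on representative data
  • Hardware: Compatible with most modern GPUs (no special hardware required), provided the weights and KV cache fit in memory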
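
As a rough check on weight memory, storage scales with bytes per parameter. The sketch below assumes 33B parameters, 2 bytes per BF16 weight, and 1 byte per INT8 weight. It ignores quantization scales and any layers left unquantized, so a real checkpoint will be somewhat larger.

```python
PARAMS = 33e9  # assumption: parameter count rounded to 33B
BYTES_PER_PARAM = {"bf16": 2, "int8": 1}


def weight_gb(params, dtype):
    """Weight storage in GB (10^9 bytes) for a given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


bf16_gb = weight_gb(PARAMS, "bf16")
int8_gb = weight_gb(PARAMS, "int8")
reduction = 1 - int8_gb / bf16_gb
print(f"BF16: {bf16_gb:.0f} GB, INT8: {int8_gb:.0f} GB, reduction: {reduction:.0%}")
```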

Model Information

  • Architecture: Qwen3ForCausalLM
  • Hidden Size: 5120
  • Number of Layers: 64
  • Attention Heads: 64
  • Key-Value Heads: 8
  • Vocabulary Size: 151936
  • Max Position Embeddings: 40960
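
These figures also determine how much memory the KV cache needs on top of the weights. The estimate below uses the layer and key-value head counts listed above. The head dimension of 128 and the 16-bit cache precision are assumptions not stated in this card, so check them against the model's config before relying on the result.

```python
NUM_LAYERS = 64
NUM_KV_HEADS = 8
HEAD_DIM = 128        # assumption: not listed above, check the model config
BYTES_PER_VALUE = 2   # assumption: KV cache kept in 16-bit precision


def kv_cache_bytes(num_tokens):
    """KV cache size for one sequence of num_tokens tokens."""
    # Factor 2: one key and one value vector per layer and per KV head
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * num_tokens


per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(40960) / 1024**3
print(f"{per_token_kib:.0f} KiB per token, {full_context_gib:.1f} GiB at 40960 tokens")
```

Under these assumptions, a single sequence at the full context length adds a cache on the order of 10 GiB, which matters when sizing a GPU for long prompts.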

Quantization Info

This model was quantized from the original Biomni-R0-32B-Preview model using:

  • Method: PTQ-INT8 (Post-Training Quantization)
  • Calibration Samples: 120
  • Backend: Optimum Quanto
  • Quantization Scheme: INT8 weights and activations

License

Apache 2.0 (same as base model)

Citation

If you use this model, please cite the original Biomni model:

@misc{biomni-r0-32b-preview,
  title={Biomni-R0-32B-Preview},
  author={Biomni Team},
  year={2024},
  url={https://huggingface.co/biomni/Biomni-R0-32B-Preview}
}

Checkpoint Metadata

  • Format: Safetensors
  • Model size: 33B params
  • Tensor types: F16, I8
  • Base model: Qwen/Qwen3-32B