# Biomni-R0-32B-AWQ-INT4

This is an AWQ INT4 quantized version of Biomni-R0-32B-Preview.

## Quantization Details

| Parameter   | Value                                      |
|-------------|--------------------------------------------|
| Scheme      | W4A16 (4-bit weights, 16-bit activations)  |
| Method      | AWQ (Activation-aware Weight Quantization) |
| Group Size  | 128                                        |
| Calibration | Custom biomedical dataset                  |
| Framework   | LLM Compressor                             |
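
The W4A16 scheme stores weights as 4-bit integers with one higher-precision scale per group of 128 values. The snippet below is an illustrative sketch of plain group-wise symmetric INT4 quantization; AWQ additionally rescales channels using activation statistics from the calibration set, which is not shown here.

```python
import numpy as np

def quantize_groupwise_int4(weights: np.ndarray, group_size: int = 128):
    """Symmetric 4-bit quantization with one scale per group of `group_size` values."""
    groups = weights.reshape(-1, group_size)
    # Symmetric INT4 covers [-8, 7]; scale each group by its largest magnitude.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Reconstruct an approximation of the original weights from INT4 + scales.
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_groupwise_int4(w)
w_hat = dequantize(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

The per-element rounding error is bounded by half a group scale, which is why group-wise scales (rather than one scale per tensor) keep quantization error small.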

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "hassanshka/Biomni-R0-32B-AWQ-INT4",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("hassanshka/Biomni-R0-32B-AWQ-INT4")

# Inference
messages = [{"role": "user", "content": "Your medical question here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Quantization Script

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# `model` is the full-precision base model (Biomni-R0-32B-Preview) and
# `calibration_data` is the custom biomedical calibration dataset.
recipe = AWQModifier(
    scheme="W4A16",
    targets="Linear",
    ignore=["lm_head"],
)

oneshot(
    model=model,
    dataset=calibration_data,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=len(calibration_data),
)
```
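
The `calibration_data` above is the custom biomedical dataset; the exact corpus is not published here. As a minimal sketch, a list of calibration samples might be assembled like this (the texts and the `format_sample` helper are hypothetical; LLM Compressor typically accepts a Hugging Face `Dataset`, so a real pipeline would convert this list accordingly):

```python
# Hypothetical biomedical snippets standing in for the real calibration corpus.
raw_texts = [
    "Metformin is a first-line agent for type 2 diabetes mellitus.",
    "The BRCA1 gene is implicated in hereditary breast and ovarian cancer.",
    "ACE inhibitors reduce afterload by blocking angiotensin II formation.",
]

def format_sample(text: str) -> dict:
    """Wrap a raw text in a chat-style record (assumed format, for illustration)."""
    return {"messages": [{"role": "user", "content": text}]}

calibration_data = [format_sample(t) for t in raw_texts]
print(len(calibration_data), "calibration samples")
```

Calibrating on in-domain biomedical text is what lets AWQ pick channel scales that preserve the activations the model actually sees at inference time.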

## Performance

- **Memory Reduction:** ~75% compared to BF16
- **Inference Speed:** Optimized for consumer GPUs (RTX 3090/4090)
- **Accuracy:** Minimal degradation, aided by calibration on a custom biomedical dataset
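
The ~75% figure follows directly from the bit widths: BF16 stores 2 bytes per weight, while W4A16 stores 0.5 bytes per weight plus one FP16 scale per 128-weight group. A back-of-the-envelope check for 32B parameters (ignoring embeddings, `lm_head`, and runtime overhead):

```python
params = 32e9
bf16_bytes = params * 2            # 2 bytes per BF16 weight
int4_bytes = params * 0.5          # 4 bits per weight
scale_bytes = (params / 128) * 2   # one FP16 scale per group of 128 weights
w4a16_bytes = int4_bytes + scale_bytes

reduction = 1 - w4a16_bytes / bf16_bytes
print(f"BF16:  {bf16_bytes / 1e9:.1f} GB")   # 64.0 GB
print(f"W4A16: {w4a16_bytes / 1e9:.1f} GB")  # 16.5 GB
print(f"Reduction: {reduction:.1%}")         # 74.2%
```

The small gap between 74.2% and 75% comes from the group scales; layers kept in higher precision (such as `lm_head`) shrink the savings slightly further in practice.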

## License

Apache 2.0 (same as base model)

## Citation

If you use this model, please cite the original Biomni model.
