# Biomni-R0-32B-AWQ-INT4
This is an AWQ INT4 quantized version of Biomni-R0-32B-Preview.
## Quantization Details
| Parameter | Value |
|---|---|
| Scheme | W4A16 (4-bit weights, 16-bit activations) |
| Method | AWQ (Activation-aware Weight Quantization) |
| Group Size | 128 |
| Calibration | Custom biomedical dataset |
| Framework | LLM Compressor |
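For intuition on the "Group Size 128" entry: in group-wise W4A16 quantization, each run of 128 weights shares one higher-precision scale, so quantization error stays local to each group. The following is a toy NumPy sketch of a simple symmetric variant (illustrative only; it is not llmcompressor's actual AWQ kernel, which also applies activation-aware scaling):

```python
import numpy as np

def quantize_group(w: np.ndarray, num_bits: int = 4):
    """Toy symmetric INT4 quantization of one group of weights."""
    qmax = 2 ** (num_bits - 1) - 1             # 7 for INT4
    scale = np.abs(w).max() / qmax             # one scale per group
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

# Split a weight row into groups of 128; each group gets its own scale.
row = np.random.randn(4096).astype(np.float32)
groups = row.reshape(-1, 128)                  # group size 128
quantized = [quantize_group(g) for g in groups]

# Dequantize to inspect the reconstruction error.
recon = np.concatenate([q * s for q, s in quantized])
print("max abs error:", np.abs(row - recon).max())
```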
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "hassanshka/Biomni-R0-32B-AWQ-INT4",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("hassanshka/Biomni-R0-32B-AWQ-INT4")

# Inference
messages = [{"role": "user", "content": "Your medical question here"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
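The checkpoint can also be served with vLLM, which reads the quantization scheme from the model config. A minimal sketch, assuming your vLLM version supports this checkpoint's compressed-tensors/AWQ format:

```python
from vllm import LLM, SamplingParams

# vLLM detects the W4A16 quantization from the model config.
llm = LLM(model="hassanshka/Biomni-R0-32B-AWQ-INT4", max_model_len=4096)
params = SamplingParams(max_tokens=512, temperature=0.7)

outputs = llm.chat(
    [{"role": "user", "content": "Your medical question here"}],
    sampling_params=params,
)
print(outputs[0].outputs[0].text)
```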
## Quantization Script
```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

# Load the full-precision base model (Biomni-R0-32B-Preview).
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype="auto", device_map="auto"
)

# 4-bit weights, 16-bit activations; lm_head stays in full precision.
recipe = AWQModifier(
    scheme="W4A16",
    targets="Linear",
    ignore=["lm_head"],
)

# One-shot calibration pass over the custom biomedical dataset
# (calibration_data; see the sketch below).
oneshot(
    model=model,
    dataset=calibration_data,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=len(calibration_data),
)
```
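The calibration dataset used here is custom and not distributed with this repo. As a hypothetical sketch of what `calibration_data` might look like, `oneshot` can take a Hugging Face `Dataset`; depending on your llmcompressor version, the samples may need to be tokenized first:

```python
from datasets import Dataset

# Hypothetical stand-in for the custom biomedical calibration corpus.
biomedical_texts = [
    "Metformin is a first-line therapy for type 2 diabetes...",
    "The BRCA1 gene encodes a tumor suppressor protein...",
    # ... a few hundred representative in-domain samples
]
calibration_data = Dataset.from_dict({"text": biomedical_texts})
```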
## Performance
- Memory Reduction: ~75% lower weight memory than the BF16 checkpoint (see the estimate below)
- Inference Speed: fits and runs well on consumer GPUs such as the RTX 3090/4090 (24 GB)
- Accuracy: minimal degradation, helped by calibration on in-domain biomedical data
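As a rough back-of-envelope check of the memory claim (weights only, ignoring the KV cache, embeddings, and runtime overhead):

```python
params = 32e9                            # ~32B parameters
bf16_gb = params * 2 / 1e9               # 2 bytes per weight -> ~64 GB
# INT4 weights (0.5 byte) plus one FP16 scale per 128-weight group.
int4_gb = params * (0.5 + 2 / 128) / 1e9  # -> ~16.5 GB
print(f"BF16: {bf16_gb:.0f} GB, W4A16: {int4_gb:.1f} GB, "
      f"reduction: {1 - int4_gb / bf16_gb:.0%}")
```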
## License

Apache 2.0 (same as the base model).

## Citation

If you use this model, please cite the original Biomni model.