Behemoth-R1-123B-v2 FP8 Dynamic
FP8 Dynamic quantization of TheDrummer/Behemoth-R1-123B-v2 using llmcompressor.
Model Details
- Base Model: TheDrummer/Behemoth-R1-123B-v2 (finetune of mistralai/Mistral-Large-Instruct-2411)
- Quantization: FP8 Dynamic (W8A8) via llmcompressor
- Scheme: FP8_DYNAMIC, lm_head excluded
- Size: ~123 GB (vs 246 GB FP16)
- Format: SafeTensors with compressed-tensors metadata
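The size figures above follow from simple bytes-per-parameter arithmetic. The sketch below assumes a round 123 billion parameters and ignores the small overhead from quantization scales and the unquantized lm_head; the function name is illustrative only.

```python
def weight_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Approximate checkpoint size in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 123e9  # assumption: rounded parameter count taken from the model name

fp8_gb = weight_size_gb(NUM_PARAMS, 1)   # FP8 stores 1 byte per weight
fp16_gb = weight_size_gb(NUM_PARAMS, 2)  # FP16/BF16 store 2 bytes per weight

print(f"FP8:  ~{fp8_gb:.0f} GB")
print(f"FP16: ~{fp16_gb:.0f} GB")
```

The real checkpoint is slightly larger than the FP8 estimate because lm_head stays in 16-bit and per-channel scales are stored alongside the weights.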
Usage with vLLM
python3 -m vllm.entrypoints.openai.api_server \
  --model Irvollo/Behemoth-R1-123B-v2-FP8-Dynamic \
  --quantization compressed-tensors \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --trust-remote-code
Reasoning / Thinking
Supports native reasoning via <think> tag prefill:
{
  "messages": [
    {"role": "user", "content": "Your question"},
    {"role": "assistant", "content": "<think>\n"}
  ],
  "continue_final_message": true,
  "add_generation_prompt": false
}
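A minimal client sketch for the prefill request above, using only the Python standard library. The URL assumes the vLLM server is running locally on its default port 8000, and `build_payload`, `split_reasoning`, `ask`, and the `max_tokens` value are illustrative choices, not part of the model card. It also assumes the completion continues the reasoning and closes it with a `</think>` tag before the final answer; if the server is started with a reasoning parser, the response layout may differ.

```python
import json
import urllib.request

MODEL = "Irvollo/Behemoth-R1-123B-v2-FP8-Dynamic"
URL = "http://localhost:8000/v1/chat/completions"  # assumption: local server, default port


def build_payload(question: str, model: str = MODEL, max_tokens: int = 1024) -> dict:
    """Build a chat request whose last assistant message is prefilled with <think>."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": "<think>\n"},
        ],
        # Continue the prefilled assistant turn instead of opening a new one.
        "continue_final_message": True,
        "add_generation_prompt": False,
        "max_tokens": max_tokens,
    }


def split_reasoning(completion: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer) at the first </think> tag."""
    reasoning, sep, answer = completion.partition("</think>")
    if not sep:
        # No closing tag (for example, generation was truncated): all reasoning.
        return completion.strip(), ""
    return reasoning.strip(), answer.strip()


def ask(question: str, url: str = URL) -> tuple[str, str]:
    """Send the request and return (reasoning, answer). Requires a running server."""
    data = json.dumps(build_payload(question)).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return split_reasoning(body["choices"][0]["message"]["content"])
```

Calling `ask("Your question")` against a running server returns the reasoning text and the final answer as separate strings.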
Hardware Requirements
- Single GPU: H200 NVL (141 GB) fits, but it is tight: only ~18 GB remains for KV cache
- Recommended: 2x A100 80GB or 2x H100 80GB for comfortable KV cache headroom
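To see what an ~18 GB KV cache budget means in tokens, here is a rough estimate. It assumes the Mistral Large architecture values of 88 layers, 8 KV heads, and a head dimension of 128 (check these against the model's config.json) and a 16-bit KV cache; the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int = 88, num_kv_heads: int = 8,
                       head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def kv_token_capacity(kv_budget_gb: float, **arch) -> int:
    """Number of whole tokens that fit in a KV budget given in decimal GB."""
    return int(kv_budget_gb * 1e9 // kv_bytes_per_token(**arch))


per_token = kv_bytes_per_token()
capacity = kv_token_capacity(18)  # ~18 GB left on a single H200 NVL

print(f"{per_token} bytes per token, about {capacity} tokens in 18 GB")
```

Under these assumptions the single-GPU setup holds roughly 50k cached tokens, enough for one full 32768-token sequence but not two, which is why a two-GPU setup is more comfortable for concurrent requests.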
Quantization Details
- Quantized on 2x NVIDIA B200 (358 GB VRAM)
- Calibration: 616 linear layers quantized in <1 second (FP8_DYNAMIC needs no calibration data)
- Total pipeline: ~11 minutes
- Tool: llmcompressor
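The exact script used for this checkpoint is not included here. The following sketch shows how an FP8_DYNAMIC run with lm_head excluded typically looks in llmcompressor. The import paths and the `torch_dtype`/`device_map` arguments depend on the installed llmcompressor and transformers versions, so treat them as assumptions; `quantize_fp8_dynamic`, `expected_linear_count`, and the output directory name are illustrative.

```python
MODEL_ID = "TheDrummer/Behemoth-R1-123B-v2"
SAVE_DIR = "Behemoth-R1-123B-v2-FP8-Dynamic"  # hypothetical local output directory


def expected_linear_count(num_layers: int = 88, linears_per_layer: int = 7) -> int:
    """Linear layers targeted, assuming 88 decoder layers with q, k, v, o,
    gate, up, and down projections each (lm_head is excluded)."""
    return num_layers * linears_per_layer


def quantize_fp8_dynamic(model_id: str = MODEL_ID, save_dir: str = SAVE_DIR) -> None:
    """Apply data-free FP8 dynamic quantization and save in compressed-tensors format."""
    # Heavy imports live inside the function so the module imports without a GPU stack.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    # Older llmcompressor releases expose this as llmcompressor.transformers.oneshot.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Static per-channel FP8 weights, dynamic per-token FP8 activations.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    oneshot(model=model, recipe=recipe)  # no calibration dataset is passed

    model.save_pretrained(save_dir, save_compressed=True)
    tokenizer.save_pretrained(save_dir)


print(expected_linear_count())  # matches the 616 linear layers reported above
```

Calling `quantize_fp8_dynamic()` needs enough GPU or CPU memory to hold the 16-bit weights, which is consistent with the 2x B200 setup listed above.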
Credits
- Original model by TheDrummer
- FP8 quantization by Irvollo