---
tags:
- fp8
- quantized
- mistral
- roleplay
- creative-writing
- reasoning
base_model: TheDrummer/Behemoth-R1-123B-v2
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
---

# Behemoth-R1-123B-v2 FP8 Dynamic

FP8 Dynamic quantization of [TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2) using llmcompressor.

## Model Details

- **Base Model**: TheDrummer/Behemoth-R1-123B-v2 (Mistral Large 2411 finetune)
- **Quantization**: FP8 Dynamic (W8A8) via llmcompressor
- **Scheme**: FP8_DYNAMIC, lm_head excluded
- **Size**: ~123 GB (vs. 246 GB FP16)
- **Format**: SafeTensors with compressed-tensors metadata

## Usage with vLLM

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model Irvollo/Behemoth-R1-123B-v2-FP8-Dynamic \
  --quantization compressed-tensors \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95 \
  --enable-prefix-caching \
  --trust-remote-code
```

## Reasoning / Thinking

Supports native reasoning via `<think>` tag prefill:

```json
{
  "messages": [
    {"role": "user", "content": "Your question"},
    {"role": "assistant", "content": "<think>\n"}
  ],
  "continue_final_message": true,
  "add_generation_prompt": false
}
```

## Hardware Requirements

- **Single GPU**: H200 NVL (141 GB): tight, leaving roughly 18 GB for KV cache
- **Recommended**: 2x A100 80GB or H100 for comfortable KV headroom

## Quantization Details

- Quantized on 2x NVIDIA B200 (358 GB VRAM)
- Calibration: 616 linear layers in under 1 second
- Total pipeline: ~11 minutes
- Tool: [llmcompressor](https://github.com/vllm-project/llm-compressor)

## Credits

- Original model by [TheDrummer](https://huggingface.co/TheDrummer)
- FP8 quantization by [Irvollo](https://huggingface.co/Irvollo)
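## Appendix: Building the Reasoning Payload

The reasoning prefill shown above can be assembled programmatically. Below is a minimal Python sketch; the helper name `build_reasoning_payload` is illustrative, and the `<think>` tag is an assumption based on the card's prefill pattern — check the model's chat template to confirm the exact tag.

```python
import json


def build_reasoning_payload(question: str, model: str) -> dict:
    """Build an OpenAI-style chat payload that prefills the assistant turn
    so the model continues inside its reasoning block.

    Sketch only: the <think> tag is assumed; verify against the model's
    chat template before relying on it.
    """
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": question},
            # Prefilled assistant turn opening the reasoning block.
            {"role": "assistant", "content": "<think>\n"},
        ],
        # Ask the server to continue the prefilled assistant message
        # rather than opening a fresh assistant turn.
        "continue_final_message": True,
        "add_generation_prompt": False,
    }


payload = build_reasoning_payload(
    "Your question", "Irvollo/Behemoth-R1-123B-v2-FP8-Dynamic"
)
print(json.dumps(payload, indent=2))
```

The resulting dict can be POSTed to the vLLM server's `/v1/chat/completions` endpoint; `continue_final_message` and `add_generation_prompt` are vLLM extensions to the OpenAI chat API.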