# Behemoth-T1-123B-FP8
The party where literary craft meets unhinged creative writing: the production sweet spot.
## The pitch
This is the FP8 W8A8 dynamic quantized version of tacodevs/Behemoth-T1-123B, a 123B Mistral Large roleplay model that thinks like a literary author before it writes like a storyteller.
FP8 is the production sweet spot: half the VRAM of BF16, ~99% of the quality (essentially lossless), and clean execution on Hopper GPUs (H100, H200) with native FP8 acceleration. This is the variant most users should pick.
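A back-of-envelope check on the headline numbers (assuming ~123B parameters and counting weights only, not KV cache or runtime overhead):

```text
123e9 params × 2 bytes/param (BF16) ≈ 246 GB ≈ 229 GiB of weights
123e9 params × 1 byte/param  (FP8)  ≈ 123 GB ≈ 115 GiB  (matches the ~115 GB on-disk figure below)
```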
For the full pitch, training details, and the philosophy behind T1, see the BF16 model card.
## This variant
| | Value |
|---|---|
| Base | tacodevs/Behemoth-T1-123B (BF16) |
| Quantization | FP8 W8A8 dynamic (8-bit weights, 8-bit activations) |
| Calibration | Data-free (per-tensor scale computed analytically) |
| Quantizer | llm-compressor `QuantizationModifier` (sketch below) |
| Size on disk | ~115 GB (2× smaller than BF16) |
| VRAM (8k ctx) | ~125 GB; fits on 2× 80 GB or 1× 144 GB GPU |
| Quality vs BF16 | ~99% (essentially lossless) |
| Speed | Faster than BF16 on H100/H200 (native FP8 tensor cores) |
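The exact quantization script is not published here, but a minimal sketch of how an equivalent FP8 W8A8 dynamic checkpoint is typically produced with llm-compressor's `QuantizationModifier` (a recent llm-compressor release is assumed; `MODEL_ID` and `OUTPUT_DIR` are placeholders):

```python
# Minimal sketch: data-free FP8 W8A8 dynamic quantization with llm-compressor.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "tacodevs/Behemoth-T1-123B"
OUTPUT_DIR = "Behemoth-T1-123B-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: FP8 weights with scales derived from the weights themselves plus
# dynamic per-token FP8 activations, so no calibration dataset is needed.
# lm_head is typically left unquantized.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)  # data-free: no calibration data passed

model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```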
## How to use
T1 expects a prefilled `<think>` block to enter literary thinking mode. Use the same 7 prefill phrases as the BF16 model (three of them shown below):
```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# Prefill phrases that steer the <think> block (same set as the BF16 model card).
PREFILLS = {
    "analytical": "Ok i need to think about how to respond - what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative": "Ok i need to think as a creative writer - what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged": "Ok i need to think as an unhinged author - raw, explicit, intense, fully in character with no holding back, so",
}

# CHARACTER_CARD, conversation_history and user_message come from your own app.
response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-FP8",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        # Prefilled assistant turn: the model continues inside this <think> block.
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```
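Because the request uses `continue_final_message`, the completion comes back as a continuation of the prefilled `<think>` block: the rest of the thinking, then the visible reply. A minimal way to split the two (assuming the model closes the block with `</think>`, as described in the BF16 card):

```python
# Split the continuation into thinking and visible reply.
raw = response.choices[0].message.content
thinking, closed, reply = raw.partition("</think>")
reply = reply.strip() if closed else raw.strip()  # fall back to the raw text if no closing tag
print(reply)
```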
## Serving with vLLM
```bash
vllm serve tacodevs/Behemoth-T1-123B-FP8 \
  --tokenizer-mode auto \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --kv-cache-dtype fp8
```
**Important:** use `--tokenizer-mode auto`, not `mistral`; the `mistral_common` mode silently mis-templates merged-LoRA checkpoints.
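A quick sanity check after launch, sketched with the same OpenAI-compatible client as above; if the template were mis-applied, the probe reply may contain raw template tokens such as `[INST]` or read as broken prose:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# The served model id should show up here.
print([m.id for m in client.models.list().data])

# Trivial probe request to confirm the chat template is applied cleanly.
probe = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-FP8",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
    temperature=0.0,
)
print(probe.choices[0].message.content)
```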
Recommended hardware:
- 2× H100 80GB: comfortable, ~62GB per GPU
- 2× H200 144GB: luxury, plenty of headroom for 32k+ context
- 1× H100 NVL 188GB: single-card option
FP8 W8A8 uses native FP8 tensor cores on Hopper GPUs for both weights and activations, so this variant is also faster than BF16 in practice, not just smaller.
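For context-length planning, here is a rough KV-cache sizing sketch. The attention geometry used below (88 layers, 8 KV heads, head dim 128) is the stock Mistral Large layout and is an assumption, not something re-verified against this checkpoint:

```python
# Back-of-envelope KV-cache size per token (K + V, all layers).
layers, kv_heads, head_dim = 88, 8, 128                   # assumed stock Mistral Large geometry
bf16_per_token = 2 * layers * kv_heads * head_dim * 2     # ~352 KiB/token in BF16
fp8_per_token = bf16_per_token // 2                       # --kv-cache-dtype fp8 halves it

ctx = 8192
print(f"BF16 KV cache @ {ctx} ctx: {bf16_per_token * ctx / 2**30:.2f} GiB")  # ~2.75 GiB
print(f"FP8  KV cache @ {ctx} ctx: {fp8_per_token * ctx / 2**30:.2f} GiB")   # ~1.38 GiB
```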
## Quality notes
FP8 W8A8 dynamic quantization is essentially lossless for 100B+ models:
- Stream-of-consciousness thinking shape: preserved
- Detail surfacing from character cards: preserved
- Word-for-word output similarity to BF16: typically 95%+ token overlap on greedy sampling (see the sketch after this list)
- Beats base R1 in side-by-side: preserved
- Production reliability: same recipe Mistral officially uses for Mistral Large 3 FP8
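A hypothetical way to spot-check the token-overlap claim yourself: run the same prompt through a BF16 and an FP8 endpoint at temperature 0 and compare the tokenized outputs. The hostnames and prompt are illustrative placeholders, and the crude positional metric below is just one way to define "overlap":

```python
import openai
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tacodevs/Behemoth-T1-123B")

def greedy(base_url: str, model: str, prompt: str) -> list[int]:
    """Greedy completion from an OpenAI-compatible vLLM endpoint, returned as token ids."""
    client = openai.OpenAI(base_url=base_url, api_key="-")
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=512,
    )
    return tok.encode(out.choices[0].message.content)

prompt = "Describe a rainy harbor at dusk."
a = greedy("http://bf16-host:8000/v1", "tacodevs/Behemoth-T1-123B", prompt)
b = greedy("http://fp8-host:8000/v1", "tacodevs/Behemoth-T1-123B-FP8", prompt)

# Fraction of aligned positions (over the longer output) where the tokens match.
matched = sum(1 for x, y in zip(a, b) if x == y)
print(f"token overlap: {matched / max(len(a), len(b)):.1%}")
```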
If you need absolute reference quality and have 4× 80 GB GPUs, use the BF16 reference. For most production use cases, FP8 is the right pick.
## Training details (from base T1)
T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning). Key hyperparameters are below; an illustrative LoRA config sketch follows the table.
| | Value |
|---|---|
| LoRA rank | 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Loss masking | Think-only (only the post-prefill thinking continuation gets loss) |
| Sequence length | 4096 |
| Epochs | 2 |
| Final eval loss | 0.9898 |
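For reference, an illustrative PEFT configuration matching the hyperparameters above; the module names assume the standard Mistral projection layout, and this is not the published training script:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                    # LoRA rank
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[         # "all 7 projection modules"
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```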
The LoRA only learns the shape of literary thinking. The base model's RP prose engine receives zero gradient updates, so the underlying creative writing voice is structurally preserved.
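Think-only loss masking in sketch form (a hypothetical helper, not the published training code): every label outside the post-prefill thinking span is set to -100, so only the thinking continuation contributes gradient:

```python
IGNORE_INDEX = -100  # ignored by PyTorch cross-entropy loss

def think_only_labels(input_ids: list[int], think_start: int, think_end: int) -> list[int]:
    """Keep loss only on [think_start, think_end): the thinking tokens after the prefill."""
    labels = [IGNORE_INDEX] * len(input_ids)
    labels[think_start:think_end] = input_ids[think_start:think_end]
    return labels
```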
## Citation
```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```
*The party doesn't end. We just go to bed.*