
# 🌴 Behemoth-T1-123B-FP8 🌴

The party where literary craft meets unhinged creative writing: the production sweet spot.

Variants: BF16 · FP8 (this repo) · GPTQ

β˜€οΈ The pitch

This is the FP8 W8A8 dynamic quantized version of tacodevs/Behemoth-T1-123B: a 123B Mistral Large roleplay model that thinks like a literary author before it writes like a storyteller.

FP8 is the production sweet spot: half the VRAM of BF16 at ~99% of the quality (essentially lossless), and it runs cleanly on Hopper GPUs (H100, H200) with native FP8 acceleration. This is the variant most users should pick.

For the full pitch, training details, and the philosophy behind T1, see the BF16 model card.

## ⚡ This variant

| | Value |
|---|---|
| Base | tacodevs/Behemoth-T1-123B (BF16) |
| Quantization | FP8 W8A8 dynamic (8-bit weights, 8-bit activations) |
| Calibration | Data-free (per-tensor scale computed analytically) |
| Quantizer | llm-compressor `QuantizationModifier` |
| Size on disk | ~115 GB (half the size of BF16) |
| VRAM (8k ctx) | ~125 GB → fits on 2× 80 GB GPUs or 1× 144 GB GPU |
| Quality vs BF16 | ~99% (essentially lossless) |
| Speed | Faster than BF16 on H100/H200 (native FP8 tensor cores) |
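The headline numbers above can be sanity-checked with quick arithmetic. A sketch, with one caveat: the layer/head counts below are our assumptions for a Mistral Large-class model, not figures stated in this card.

```python
# Back-of-envelope size check for the FP8 variant.
# ASSUMED architecture numbers (Mistral Large class, not from this card):
# 88 layers, 8 KV heads (GQA), head_dim 128.

PARAMS = 123e9  # total parameters

# FP8 stores one byte per weight, so the weights alone are ~123 GB decimal,
# which is ~115 GiB -- matching the ~115 GB on-disk figure above.
weights_gib = PARAMS * 1 / 2**30

# FP8 KV cache at 8k context: 2 tensors (K and V) per layer, 1 byte each.
LAYERS, KV_HEADS, HEAD_DIM, CTX = 88, 8, 128, 8192
kv_gib = 2 * LAYERS * KV_HEADS * HEAD_DIM * 1 * CTX / 2**30

print(f"weights ~ {weights_gib:.0f} GiB, 8k-ctx KV cache ~ {kv_gib:.2f} GiB")
# Weights + KV cache + activation/runtime overhead lands near the ~125 GB
# VRAM figure quoted above.
```

The KV cache is tiny relative to the weights here, which is why `--kv-cache-dtype fp8` mostly buys headroom for longer contexts rather than changing the GPU count.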

## 🎤 How to use

T1 expects a prefilled `<think>` block to enter literary thinking mode. Use the same 7 prefill phrases as the BF16 model (three of them shown below):

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# Three of the seven prefill phrases; see the BF16 model card for the full set.
PREFILLS = {
    "analytical": "Ok i need to think about how to respond — what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer — what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author — raw, explicit, intense, fully in character with no holding back, so",
}

# CHARACTER_CARD, conversation_history, and user_message are your own
# character card string, prior chat turns, and latest user input.
response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-FP8",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```
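With `continue_final_message`, the completion continues straight from the prefill, so the text you get back is thinking followed by prose. A small helper to separate the two (a sketch under our own naming; it assumes the model emits a closing `</think>` tag, which is how thinking models conventionally end the block):

```python
def split_think(text: str) -> tuple[str, str]:
    """Split a continued completion into (thinking, prose).

    Assumes the model closes its literary-thinking block with </think>.
    If no closing tag appears, the whole text is treated as prose.
    """
    thinking, sep, prose = text.partition("</think>")
    if not sep:
        return "", text.strip()
    # Tolerate an echoed opening tag if the server returns the prefill too.
    return thinking.removeprefix("<think>").strip(), prose.strip()

reply = split_think("Ok i need to think as a creative writer ...\n</think>\nShe smiled.")
# reply == ("Ok i need to think as a creative writer ...", "She smiled.")
```

Show the prose half to the user; log or discard the thinking half as you see fit.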

## 🚀 Serving with vLLM

```bash
vllm serve tacodevs/Behemoth-T1-123B-FP8 \
    --tokenizer-mode auto \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --kv-cache-dtype fp8
```

**Important:** use `--tokenizer-mode auto`, not `mistral`. The mistral_common mode silently mis-templates merged-LoRA checkpoints.

Recommended hardware:

  • 2Γ— H100 80GB β€” comfortable, ~62GB per GPU
  • 2Γ— H200 144GB β€” luxury, plenty of headroom for 32k+ context
  • 1Γ— H100 NVL 188GB β€” single-card option

FP8 W8A8 uses native FP8 tensor cores on Hopper GPUs for both weights and activations, so this variant is also faster than BF16 in practice, not just smaller.

## ✅ Quality notes

FP8 W8A8 dynamic quantization is essentially lossless for 100B+ models:

  • βœ… Stream-of-consciousness thinking shape β€” preserved
  • βœ… Detail surfacing from character cards β€” preserved
  • βœ… Word-for-word output similarity to BF16 β€” typically 95%+ token overlap on greedy sampling
  • βœ… Beats base R1 in side-by-side β€” preserved
  • βœ… Production reliability β€” same recipe Mistral officially uses for Mistral Large 3 FP8

If you need absolute reference quality and have 4× 80 GB GPUs, use the BF16 reference. For most production use cases, FP8 is the right pick.
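The 95%+ greedy-overlap claim above is easy to reproduce yourself: generate at temperature 0 from both variants on the same prompts and compare token IDs. A minimal sketch (the function names are ours; real comparisons often also report the longest common prefix, since greedy outputs tend to diverge permanently after the first mismatched token):

```python
def positional_overlap(a: list[int], b: list[int]) -> float:
    """Fraction of positions (up to the shorter length) where two greedy
    token-ID sequences agree."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared prefix before the first divergence."""
    k = 0
    for x, y in zip(a, b):
        if x != y:
            break
        k += 1
    return k

bf16_ids = [5, 9, 2, 7, 7, 1]  # toy example, not real model output
fp8_ids  = [5, 9, 2, 7, 4, 1]
# 5 of 6 positions match; the shared prefix is 4 tokens long.
print(positional_overlap(bf16_ids, fp8_ids), common_prefix_len(bf16_ids, fp8_ids))
```

Averaging `positional_overlap` across a prompt set is the simplest way to check the ~95%+ figure on your own workload.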

πŸ› οΈ Training details (from base T1)

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing and Behemoth-R1 reasoning).

| | Value |
|---|---|
| LoRA | rank 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Loss masking | Think-only (only the post-prefill thinking continuation gets loss) |
| Sequence length | 4096 |
| Epochs | 2 |
| Final eval loss | 0.9898 |

The LoRA only learns the shape of literary thinking. The base model's RP prose engine receives zero gradient updates, so the underlying creative writing voice is structurally preserved.
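Think-only loss masking as described above can be sketched in a few lines: every token outside the post-prefill thinking continuation gets the conventional cross-entropy ignore index, so only the thinking span produces gradients. This is an illustration under our own naming; the actual training code is not published in this card.

```python
IGNORE_INDEX = -100  # the ignore value cross-entropy losses conventionally skip

def think_only_labels(input_ids: list[int], think_start: int, think_end: int) -> list[int]:
    """Copy input_ids as labels, masking everything outside the half-open
    thinking-continuation span [think_start, think_end).

    System prompt, chat history, the prefill phrase itself, and the final
    prose all get IGNORE_INDEX, so only the model's own thinking tokens
    receive loss -- the prose engine sees zero gradient.
    """
    return [tok if think_start <= i < think_end else IGNORE_INDEX
            for i, tok in enumerate(input_ids)]

# Toy sequence: tokens at positions 3..5 are the thinking continuation.
labels = think_only_labels([11, 12, 13, 14, 15, 16, 17], 3, 6)
# labels == [-100, -100, -100, 14, 15, 16, -100]
```

Masking the prefill itself (not just the prompt) is what keeps the model from merely memorizing the seven fixed openers.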

## 📜 Citation

```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```

The party doesn't end. We just go to bed.
