# 🌴 Behemoth-T1-123B-GPTQ 🌴

The party where literary craft meets unhinged creative writing, now in 4-bit.

Variants: BF16 · FP8 · GPTQ (this repo)

β˜€οΈ The pitch

This is the W4A16 GPTQ-quantized version of tacodevs/Behemoth-T1-123B, a 123B Mistral Large roleplay model that thinks like a literary author before it writes like a storyteller.

This GPTQ build targets single-GPU users: the full model fits on a single 80 GB or 96 GB GPU. Quality is ~95-97% of the BF16 reference and virtually indistinguishable from full precision in normal use.

For the full pitch, training details, and the philosophy behind T1, see the BF16 model card.

## ⚡ This variant

|  | Value |
|---|---|
| Base | tacodevs/Behemoth-T1-123B (BF16) |
| Quantization | GPTQ W4A16 (4-bit weights, 16-bit activations) |
| Group size | 128 |
| Calibration | 256 in-distribution samples from tacodevs/rp-opus-4.6-x1000 |
| Quantizer | llm-compressor GPTQModifier |
| Size on disk | ~62 GB (4× smaller than BF16) |
| VRAM (8k ctx) | ~62 GB → fits on 1× 80 GB or 1× 96 GB GPU |
| Quality vs BF16 | ~95-97% (literary thinking pattern preserved) |
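The size and compression figures above follow directly from the bit width. A weights-only back-of-envelope check (scales, zero points, and unquantized layers add a little on top of the 4-bit figure):

```python
# Weights-only size estimate for a 123B-parameter model.
params = 123e9

bf16_gb = params * 2 / 1e9      # BF16: 2 bytes per weight
w4_gb = params * 0.5 / 1e9      # W4: 4 bits = 0.5 bytes per weight

print(f"BF16 ≈ {bf16_gb:.0f} GB, W4A16 ≈ {w4_gb:.0f} GB ({bf16_gb / w4_gb:.0f}x smaller)")
# → BF16 ≈ 246 GB, W4A16 ≈ 62 GB (4x smaller)
```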

## 🎤 How to use

T1 expects a prefilled `<think>` block to enter literary thinking mode. Use the same 7 prefill phrases as the BF16 model:

```python
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="-")

# Three of the seven prefill phrases (see the BF16 card for the full set):
PREFILLS = {
    "analytical": "Ok i need to think about how to respond — what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer — what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author — raw, explicit, intense, fully in character with no holding back, so",
}

# CHARACTER_CARD, conversation_history, and user_message come from your app.
response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B-GPTQ",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,  # continue the prefilled assistant turn
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```
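With `continue_final_message`, the completion continues inside the prefilled `<think>` block, so the returned text holds the thinking continuation followed by the actual reply. A minimal helper to separate the two, assuming the model closes the block with `</think>` (the sample string is invented for illustration):

```python
def split_think(completion: str) -> tuple[str, str]:
    """Split a finished turn into (thinking, reply) at the closing tag."""
    thinking, sep, reply = completion.partition("</think>")
    if not sep:                      # the model never closed the block
        return thinking.strip(), ""
    return thinking.strip(), reply.strip()

# e.g. split_think(response.choices[0].message.content)
sample = "what twist would surprise here? A confession, so...\n</think>\nShe looked away."
print(split_think(sample))
```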

## 🚀 Serving with vLLM

```bash
vllm serve tacodevs/Behemoth-T1-123B-GPTQ \
    --tokenizer-mode auto \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```

Important: use `--tokenizer-mode auto`, not `mistral`. The `mistral_common` mode silently mis-templates merged-LoRA checkpoints.

The model fits comfortably on a single 80 GB H100, a single 80 GB A100, or a single 96 GB H100 NVL.

## 🟡 Quality notes

W4A16 quantization has a measurable but small impact on the literary thinking pattern T1 was trained for:

- ✅ Stream-of-consciousness thinking shape – preserved (encoded across many attention layers)
- ✅ Detail surfacing from character cards – preserved
- ✅ Advantage over base R1 in side-by-side comparisons – preserved (the gap remains large)
- 🟡 Specific word choices – may differ token-by-token from BF16
- 🟡 The "cleverest" inventive details – occasionally replaced with alternatives of equivalent quality

For most users, GPTQ T1 is indistinguishable from BF16 T1 in normal use. Only A/B testing with the same seed would expose the differences.

If you want maximum quality and have the VRAM, use the BF16 reference or the FP8 W8A8 variant (~99% of BF16, fits on 2× 80 GB).

πŸ› οΈ Training details (from base T1)

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).

|  | Value |
|---|---|
| LoRA rank | 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Loss masking | Think-only (only the post-prefill thinking continuation gets loss) |
| Sequence length | 4096 |
| Epochs | 2 |
| Final eval loss | 0.9898 |
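The 559M figure is consistent with rank-32 adapters on all seven projection modules. A cross-check, where the layer geometry (hidden 12288, KV dim 1024, FFN 28672, 88 layers) is assumed from the Mistral Large 2 architecture rather than taken from this card:

```python
r = 32                                                    # LoRA rank from the table
d_model, d_kv, d_ffn, n_layers = 12288, 1024, 28672, 88   # assumed geometry

def lora_params(d_in, d_out):
    # Adapter A is (r x d_in), adapter B is (d_out x r).
    return r * (d_in + d_out)

per_layer = (
    lora_params(d_model, d_model) * 2   # q_proj, o_proj
    + lora_params(d_model, d_kv) * 2    # k_proj, v_proj
    + lora_params(d_model, d_ffn) * 2   # gate_proj, up_proj
    + lora_params(d_ffn, d_model)       # down_proj
)
total = per_layer * n_layers
print(f"{total / 1e6:.0f}M trainable")  # → 559M trainable
```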

The LoRA learns only the shape of literary thinking. The base model's RP prose engine receives zero gradient updates, so the underlying creative writing voice is structurally preserved.
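The think-only loss masking from the table can be sketched like this (a hypothetical helper using the Hugging Face label convention, not the actual training code): every label before the post-prefill boundary is set to -100, so only the thinking continuation contributes to the loss.

```python
IGNORE_INDEX = -100   # tokens with this label are skipped by the loss

def mask_to_think_only(input_ids: list[int], continuation_start: int) -> list[int]:
    """Labels = input_ids, with everything before the continuation ignored."""
    return [IGNORE_INDEX] * continuation_start + input_ids[continuation_start:]

# Prompt + prefill occupy positions 0-2; the model is graded on positions 3-4.
print(mask_to_think_only([101, 5, 6, 7, 8], 3))  # → [-100, -100, -100, 7, 8]
```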

## 📜 Citation

```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```

The party doesn't end. We just go to bed.
