---
license: other
license_name: mistral-ai-research-license
license_link: https://mistral.ai/licenses/MNPL-0.1.md
language:
- en
base_model:
- tacodevs/Behemoth-X-R1-123B
- TheDrummer/Behemoth-X-123B-v2
- TheDrummer/Behemoth-R1-123B-v2
tags:
- mistral
- mistral-large
- 123b
- roleplay
- creative-writing
- thinking
- reasoning
- lora
- distillation
- claude-opus
pipeline_tag: text-generation
library_name: transformers
---
## ☀️ The pitch
Behemoth-T1 is a **123B Mistral Large** roleplay model with one trick the
others don't have: it **thinks like an author before it writes like a
storyteller**.
Most RP models either reason in dry bullet-point lists (cold) or skip
reasoning entirely and improvise (sloppy). T1 reasons in **literary
stream-of-consciousness** — the way a working novelist talks to themselves
while drafting — and then hands the scene off to a fully-preserved creative
prose engine.
The result: **scenes that hit harder on the hard cases.** Long character
cards, emotional complexity, multi-character beats, the moments where lesser
models flatten out — those are exactly where T1 pulls ahead.
## 🎨 Three thinking modes, one model
T1 ships with **three personality modes** for the thinking phase. You pick
which one fits the scene. Each one is a different angle on the same craft,
like three friends hyping each other up at a beach party.
| Mode | Role |
|---|---|
| 🧠 Analytical | The planner. |
| 🌸 Creative | The storyteller. |
| 🔥 Unhinged | The troublemaker. |
## 🎤 How it works
T1 uses a **prefill** technique to enter thinking mode: you start the
assistant turn with the opening of a thinking block, and the model continues
the reasoning in literary stream-of-consciousness before handing off to the
prose engine for the actual scene.
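A minimal sketch of the prefill mechanic as pure prompt construction. The `<think>` delimiter and the simplified Mistral instruct format here are assumptions for illustration; in real use, go through the model's own chat template (e.g. `tokenizer.apply_chat_template(..., continue_final_message=True)` in `transformers`).

```python
# Sketch: pre-open the assistant turn with a thinking block so the
# model's first generated tokens land *inside* the reasoning phase.
# The `<think>` tag is an assumption -- check the model's chat template.
def build_prefilled_prompt(system: str, user: str, prefill: str = "<think>\n") -> str:
    """Return a simplified Mistral-style prompt whose assistant turn
    already starts with the opening of a thinking block."""
    return f"<s>[INST] {system}\n\n{user} [/INST]{prefill}"

prompt = build_prefilled_prompt(
    system="You are a co-author in an ongoing roleplay.",
    user="*The tavern door slams open.*",
)
print(prompt.endswith("<think>\n"))  # True -- generation continues mid-thought
```

Because the thinking block is opened by you rather than the model, you control *whether* it thinks on a per-turn basis, which is what makes the mode selection above possible.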
## ⚡ Quantizations
Three flavors. Pick your VRAM budget.
| Variant | VRAM (8k ctx) | Quality | Repo |
|---|---|---|---|
| BF16 | ~246 GB (4×80 GB or 2×144 GB) | Reference | Behemoth-T1-123B |
| FP8 W8A8 | ~125 GB (2×80 GB) | ~99% of BF16 | Behemoth-T1-123B-FP8 |
| GPTQ W4A16 | ~62 GB (1×80 GB) | ~96% of BF16 | Behemoth-T1-123B-GPTQ |
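For the FP8 row, a launch fragment using standard vLLM flags (the `tacodevs/` repo path is inferred from the table above; adjust `--tensor-parallel-size` to your GPU count):

```shell
# Serve the FP8 W8A8 variant across 2 GPUs at the 8k context
# the VRAM column assumes. Repo path is an assumption from the table.
vllm serve tacodevs/Behemoth-T1-123B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 8192
```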
## 🛠️ Training details
T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto
[`tacodevs/Behemoth-X-R1-123B`](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)
(itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).
| Detail | Value |
|---|---|
| Base | tacodevs/Behemoth-X-R1-123B (Mistral Large 123B arch) |
| Method | LoRA fine-tune, think-only loss masking |
| LoRA rank | 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Sequence length | 4096 |
| Epochs | 2 |
| Effective batch | 32 (1 per-device × 4 grad_accum × 8 GPUs) |
| Optimizer | DeepSpeed AdamW + WarmupDecayLR |
| Learning rate | 3e-5 with 3% warmup |
| Hardware | 8× NVIDIA H200 SXM 144GB |
| Training time | 32 minutes |
| Final train loss | 0.8165 |
| Final eval loss | 0.9898 (gap 0.17, healthy generalization) |
| Token accuracy | 69.4% on held-out validation |
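The 559M trainable-parameter figure can be sanity-checked from first principles. The architecture dims below are from the public Mistral Large 2407 config (hidden 12288, 88 layers, 8 KV heads of dim 128, FFN 28672), not from this card, so treat them as an assumption:

```python
# Back-of-envelope check: LoRA rank 32 on all 7 projection modules of a
# Mistral Large 2 style transformer. Each adapted linear of shape
# (d_in, d_out) adds r * (d_in + d_out) parameters (the A and B matrices).
HIDDEN, LAYERS, FFN = 12288, 88, 28672
KV_DIM = 8 * 128  # 8 KV heads x head_dim 128 (grouped-query attention)
RANK = 32

projections = {  # (d_in, d_out) per adapted module
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj":   (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}

per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in projections.values())
total = per_layer * LAYERS
print(f"{total / 1e6:.0f}M trainable")  # prints "559M trainable"
```

This matches the table's 559M, which is a good sign the "all 7 projection modules" claim means every attention and MLP linear in every layer.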
## 🌊 Lineage
T1 stands on the shoulders of three earlier models:
- **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**
— uncensored creative writing fine-tune of Mistral Large 2407.
*Provides the prose voice.*
- **[TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)**
— reasoning fine-tune that adds a thinking phase before the reply.
*Provides the reasoning.*
- **[tacodevs/Behemoth-X-R1-123B](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)**
— SCE merge of the two models above. *Provides the base T1 was trained on.*
## 🎭 What changes vs base
After training, T1 differs from base Behemoth-X-R1 in exactly one way:
**when prefilled into thinking mode, it reasons in literary
stream-of-consciousness instead of dry analytical notes.** Because the
training used think-only loss masking, only thinking-phase tokens
contributed to the loss, so the prose engine outside the thinking block is
untouched.
## 📜 Citation
If T1 helps you ship something, a link back is appreciated.
```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```
The party doesn't end. We just go to bed.