---
license: other
license_name: mistral-ai-research-license
license_link: https://mistral.ai/licenses/MNPL-0.1.md
language:
- en
base_model:
- tacodevs/Behemoth-X-R1-123B
- TheDrummer/Behemoth-X-123B-v2
- TheDrummer/Behemoth-R1-123B-v2
tags:
- mistral
- mistral-large
- 123b
- roleplay
- creative-writing
- thinking
- reasoning
- lora
- distillation
- claude-opus
pipeline_tag: text-generation
library_name: transformers
---
<div align="center">
<img src="images/01_header_main.png" alt="Behemoth-T1" width="100%" />
<h1>🌴 Behemoth-T1-123B 🌴</h1>
<p><i>The party where literary craft meets unhinged creative writing.</i></p>
<p>
<a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><img src="https://img.shields.io/badge/BF16-tacodevs%2FBehemoth--T1--123B-ff7eb6?style=for-the-badge" alt="BF16"/></a>
<a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><img src="https://img.shields.io/badge/FP8-Behemoth--T1--123B--FP8-7eddff?style=for-the-badge" alt="FP8"/></a>
<a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><img src="https://img.shields.io/badge/GPTQ--W4A16-Behemoth--T1--123B--GPTQ-bdff7e?style=for-the-badge" alt="GPTQ"/></a>
</p>
</div>
<img src="images/divider_palms.png" alt="" width="100%" />
## ☀️ The pitch
Behemoth-T1 is a **123B Mistral Large** roleplay model with one trick the
others don't have: it **thinks like an author before it writes like a
storyteller**.
Most RP models either reason in dry bullet-point lists (cold) or skip
reasoning entirely and improvise (sloppy). T1 reasons in **literary
stream-of-consciousness**, the way a working novelist talks to themselves
while drafting, and then hands the scene off to a fully preserved creative
prose engine.
The result: **scenes that hit harder on the hard cases.** Long character
cards, emotional complexity, multi-character beats, the moments where lesser
models flatten out: those are exactly where T1 pulls ahead.
<img src="images/divider_lights.png" alt="" width="100%" />
## 🎨 Three thinking modes, one model
T1 ships with **three personality modes** for the thinking phase. You pick
which one fits the scene. Each one is a different angle on the same craft,
like three friends hyping each other up at a beach party.
<table>
<tr>
<td width="33%" align="center">
<img src="images/chibi_silver.png" alt="Analytical" width="220" />
<h3>🧠 Analytical</h3>
<p><i>The planner.</i><br/>
Reasons about what the character feels, what their experience pulls in,
what they value, what they're trying to achieve. Cool, deliberate, surgical.</p>
</td>
<td width="33%" align="center">
<img src="images/chibi_pink.png" alt="Creative" width="220" />
<h3>🌸 Creative</h3>
<p><i>The storyteller.</i><br/>
Looks for the unexpected angle, the twist nobody saw coming, the line of
escalation that feels earned instead of cheap. Curious, generative,
narratively confident.</p>
</td>
<td width="33%" align="center">
<img src="images/chibi_red.png" alt="Unhinged" width="220" />
<h3>🔥 Unhinged</h3>
<p><i>The troublemaker.</i><br/>
Raw, explicit, intense, fully in character with no holding back. Throws out
the safe option and asks what would make this scene actually hit. Pure id
energy with craft underneath.</p>
</td>
</tr>
</table>
<img src="images/divider_waves.png" alt="" width="100%" />
## 🎀 How it works
T1 uses a **prefill** technique to enter thinking mode. You provide the
model with the start of a `<think>` block containing one of seven seed
phrases, and the model continues from there with literary craft notes
before producing the actual response.
```python
# vLLM OpenAI-compatible endpoint with prefill via continue_final_message
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",  # dummy value; the OpenAI client requires one to be set
)

# Placeholders: substitute your own card, chat history, and latest user turn.
CHARACTER_CARD = "..."
conversation_history = []  # earlier {"role": ..., "content": ...} messages
user_message = "..."

# Seed phrases, one per thinking mode ("\u2014" is an em dash).
PREFILLS = {
    "analytical": "Ok i need to think about how to respond \u2014 what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative": "Ok i need to think as a creative writer \u2014 what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged": "Ok i need to think as an unhinged author \u2014 raw, explicit, intense, fully in character with no holding back, so",
}

response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        # Prefilled assistant turn: the model continues from the seed phrase.
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,  # extend the last message instead of opening a new turn
        "add_generation_prompt": False,  # must be off when continuing the final message
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```
The model responds with the rest of the thinking block, closes `</think>`,
and then writes the in-character prose response, all in one continuous
stream.
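Because the prefill already opened the `<think>` block, the text that comes back should be the remainder of the thinking, the closing tag, and then the prose. Here is a minimal sketch of separating the two, assuming the server returns only the continuation and the model emits a single `</think>`. `split_thinking` is a hypothetical helper name, and the seed in the usage lines is shortened for illustration.

```python
def split_thinking(completion: str, prefill: str = "") -> tuple[str, str]:
    """Split a T1 completion into (thinking, prose).

    `completion` is the text returned by the server; `prefill` is the seed
    text that was sent, prepended so the thinking notes read as a whole.
    """
    thinking, tag, prose = completion.partition("</think>")
    if not tag:
        # No closing tag (for example max_tokens hit mid-thought):
        # treat everything as thinking and return empty prose.
        return (prefill + completion).strip(), ""
    return (prefill + thinking).strip(), prose.strip()


thinking, prose = split_thinking(
    "she would not say it outright.\n</think>\nMara set the cup down.",
    prefill="Ok i need to think as a creative writer, so ",
)
print(prose)
```

Returning empty prose when the tag is missing makes truncated generations easy to detect and retry with a larger `max_tokens`.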
<img src="images/02_party_dance.png" alt="" width="100%" />
## ⚑ Quantizations
Three flavors. Pick your VRAM budget.
<table>
<tr>
<th>Variant</th>
<th>VRAM (8k ctx)</th>
<th>Quality</th>
<th>Repo</th>
</tr>
<tr>
<td><b>BF16</b></td>
<td>~246 GB (4×80 GB or 2×144 GB)</td>
<td>Reference</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><code>Behemoth-T1-123B</code></a></td>
</tr>
<tr>
<td><b>FP8 W8A8</b></td>
<td>~125 GB (2×80 GB)</td>
<td>~99% of BF16</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><code>Behemoth-T1-123B-FP8</code></a></td>
</tr>
<tr>
<td><b>GPTQ W4A16</b></td>
<td>~62 GB (1×80 GB)</td>
<td>~96% of BF16</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><code>Behemoth-T1-123B-GPTQ</code></a></td>
</tr>
</table>
All variants serve cleanly via vLLM with `--tokenizer-mode auto` (do **not**
use `mistral` mode: it silently mis-templates merged-LoRA checkpoints).
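The VRAM figures in the table track raw weight storage closely. This rough back-of-the-envelope check assumes 123e9 parameters (taken from the model name) and counts weights only. KV cache, activations, and quantization scales are ignored, so real usage will run somewhat higher.

```python
PARAMS = 123e9  # parameter count, taken from the model name

# Bytes per weight for each variant.
# Assumption: quantization scales and zero-points are ignored.
BYTES_PER_WEIGHT = {"BF16": 2.0, "FP8 W8A8": 1.0, "GPTQ W4A16": 0.5}


def weight_gb(variant: str) -> float:
    """Approximate weight footprint in decimal gigabytes."""
    return PARAMS * BYTES_PER_WEIGHT[variant] / 1e9


for name in BYTES_PER_WEIGHT:
    print(f"{name}: ~{weight_gb(name):.1f} GB")
```

The estimates come out at about 246, 123, and 61.5 GB, which lines up with the table once a little headroom is added.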
<img src="images/03_party_drinks.png" alt="" width="100%" />
## 🛠️ Training details
T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto
[`tacodevs/Behemoth-X-R1-123B`](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)
(itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).
<table>
<tr><td><b>Base</b></td><td>tacodevs/Behemoth-X-R1-123B (Mistral Large 123B arch)</td></tr>
<tr><td><b>Method</b></td><td>LoRA fine-tune, think-only loss masking</td></tr>
<tr><td><b>LoRA rank</b></td><td>32 (alpha 64, dropout 0.05, all 7 projection modules)</td></tr>
<tr><td><b>Trainable params</b></td><td>559M / 123B (0.45%)</td></tr>
<tr><td><b>Dataset</b></td><td>1000 Claude Opus 4.5 thinking traces on real RP conversations</td></tr>
<tr><td><b>Sequence length</b></td><td>4096</td></tr>
<tr><td><b>Epochs</b></td><td>2</td></tr>
<tr><td><b>Effective batch</b></td><td>32 (1 × 4 grad_accum × 8 GPUs)</td></tr>
<tr><td><b>Optimizer</b></td><td>DeepSpeed AdamW + WarmupDecayLR</td></tr>
<tr><td><b>Learning rate</b></td><td>3e-5 with 3% warmup</td></tr>
<tr><td><b>Hardware</b></td><td>8× NVIDIA H200 SXM 144GB</td></tr>
<tr><td><b>Training time</b></td><td>32 minutes</td></tr>
<tr><td><b>Final train loss</b></td><td>0.8165</td></tr>
<tr><td><b>Final eval loss</b></td><td>0.9898 (gap: 0.17, healthy generalization)</td></tr>
<tr><td><b>Token accuracy</b></td><td>69.4% on held-out validation</td></tr>
</table>
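The trainable-parameter figure can be sanity-checked by hand: a rank-r LoRA on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below assumes the published Mistral Large 2407 dimensions (88 layers, hidden size 12288, MLP size 28672, 8 KV heads of dimension 128). Those values are not stated in this card, so treat them as assumptions.

```python
# Assumed Mistral Large 2407 dimensions (not stated in this card).
LAYERS = 88
HIDDEN = 12288
INTERMEDIATE = 28672
KV_DIM = 8 * 128  # 8 KV heads x head dim 128 (grouped-query attention)
RANK = 32

# (in_features, out_features) for the 7 projection modules in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS


trainable = lora_params(RANK)
print(f"{trainable:,} trainable parameters")  # 559,415,296

# Effective batch = per-device batch x gradient accumulation x GPUs.
effective_batch = 1 * 4 * 8
print(effective_batch)  # 32
```

Under those assumptions the count lands on roughly 559M, matching the table, and 559M / 123B is about 0.45%.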
### The think-only loss trick
Loss is computed **only** on the post-prefill thinking continuation, up
through `</think>`. The system prompt, user message, prefilled portion of
the assistant turn, and the entire response after `</think>` are all masked
to `-100`. This means:
1. The response tokens contribute **zero gradient**, so the base model's
creative writing voice is never a training target.
2. The LoRA only learns the *shape* of literary thinking β€” what to surface,
how to chain ideas, where to land the craft.
3. At inference, T1 thinks in the new Opus-style stream-of-consciousness,
then hands off to the unmodified base prose engine for the actual
response.
Masking the loss this way gives you new thinking *without* training on
the prose voice you wanted to preserve.
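Here is a minimal sketch of the masking described above, using plain Python lists in place of tensors. The function name and the span indices are hypothetical. The only convention taken from the text is that masked positions get the label `-100`, which is the index PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def think_only_labels(input_ids: list[int], think_start: int, think_end: int) -> list[int]:
    """Keep labels only for tokens in [think_start, think_end).

    think_start: index of the first token after the prefilled seed phrase.
    think_end:   index one past the last token of `</think>`.
    Everything else (system, user, prefill, final response) is masked.
    """
    if not 0 <= think_start <= think_end <= len(input_ids):
        raise ValueError("think span must lie inside the sequence")
    return [
        tok if think_start <= i < think_end else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]


# Toy sequence: 4 prompt+prefill tokens, 3 thinking tokens, 2 response tokens.
ids = [11, 12, 13, 14, 21, 22, 23, 31, 32]
labels = think_only_labels(ids, think_start=4, think_end=7)
print(labels)  # [-100, -100, -100, -100, 21, 22, 23, -100, -100]
```

In a real pipeline the two indices would come from locating the end of the prefill and the `</think>` tokens in the tokenized conversation.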
<img src="images/04_party_water.png" alt="" width="100%" />
## 🌊 Lineage
T1 stands on the shoulders of three earlier models:
- **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**:
uncensored creative writing fine-tune of Mistral Large 2407.
*Provides the prose voice.*
- **[TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)**:
reasoning fine-tune that adds `<think>` capability.
*Provides the thinking infrastructure.*
- **[tacodevs/Behemoth-X-R1-123B](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)**:
SCE merge of X + R1 (55%/45%, `select_topk: 1.0`).
*The direct base for T1's LoRA.*
T1 then distills literary thinking patterns from **Claude Opus 4.5** on top
of that merge, keeping the creative voice while replacing R1's bullet-point
thinking with stream-of-consciousness craft notes.
<img src="images/05_party_floats.png" alt="" width="100%" />
## 🎭 What changes vs base
After training, T1 differs from base Behemoth-X-R1 in exactly one way:
**when given a `<think>` prefill, it produces literary author-craft notes
instead of structured bullets.**
The prose generation, character voice handling, NSFW handling, long context
attention, system prompt comprehension: none of that was a training target.
No loss was ever computed on those tokens.
What you should notice:
- **Hard scenes hit harder.** Long character cards, emotionally complex
beats, multi-character POV moments: these are where the literary
thinking earns its compute. ~15-25% better scene quality on these cases
in our internal evals.
- **Easy scenes are unchanged.** A simple horny prompt with a one-line
card? Base behavior. T1 doesn't try to be clever where cleverness isn't
needed.
- **Refusals are not added.** T1 inherits Behemoth-X-R1's lack of safety
alignment for creative fiction. We did not retrain that surface.
## ⚠️ Limitations
- T1's improvement is **conditional on the prefill**. Without a prefilled
`<think>` block, the model behaves like base Behemoth-X-R1. The LoRA only
fires when seeded.
- Sequence length cap during training was 4096. The model still handles
longer contexts at inference (it's a 131k context Mistral Large), but the
thinking style was learned on shorter conversations.
- The literary thinking style is opinionated. If you want sparse bullet
thinking, prefill `<think>\n` with no seed phrase and the model will fall
back to base behavior.
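The two prefill shapes mentioned in this card (seeded for literary thinking, bare for base-style thinking) can be wrapped in one small helper. This is only a sketch: `build_prefill` is a hypothetical name, and the seed passed in the usage lines is shortened for illustration.

```python
from typing import Optional


def build_prefill(seed: Optional[str] = None) -> dict:
    """Build the final assistant message used as a prefill.

    With a seed phrase, the `<think>` block opens with it; with no seed,
    the block is left bare.
    """
    content = "<think>\n" if seed is None else f"<think>\n{seed}\n"
    return {"role": "assistant", "content": content}


bare = build_prefill()
seeded = build_prefill("Ok i need to think as a creative writer, so")
print(repr(bare["content"]))
print(repr(seeded["content"]))
```

Either message goes last in the `messages` list, with `continue_final_message` enabled so the server extends it rather than starting a new turn.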
<img src="images/divider_palms.png" alt="" width="100%" />
## 📜 Citation
If T1 helps you ship something, a link back is appreciated.
```bibtex
@misc{behemoth-t1-2026,
title = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
author = {tacodevs},
year = {2026},
url = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```
<div align="center">
<img src="images/06_footer_sunset.png" alt="" width="100%" />
<i>The party doesn't end. We just go to bed.</i>
</div>