---
license: other
license_name: mistral-ai-research-license
license_link: https://mistral.ai/licenses/MNPL-0.1.md
language:
- en
base_model:
- tacodevs/Behemoth-X-R1-123B
- TheDrummer/Behemoth-X-123B-v2
- TheDrummer/Behemoth-R1-123B-v2
tags:
- mistral
- mistral-large
- 123b
- roleplay
- creative-writing
- thinking
- reasoning
- lora
- distillation
- claude-opus
pipeline_tag: text-generation
library_name: transformers
---

<div align="center">

<img src="images/01_header_main.png" alt="Behemoth-T1" width="100%" />

<h1>🌴 Behemoth-T1-123B 🌴</h1>

<p><i>The party where literary craft meets unhinged creative writing.</i></p>

<p>
  <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><img src="https://img.shields.io/badge/BF16-tacodevs%2FBehemoth--T1--123B-ff7eb6?style=for-the-badge" alt="BF16"/></a>
  <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><img src="https://img.shields.io/badge/FP8-Behemoth--T1--123B--FP8-7eddff?style=for-the-badge" alt="FP8"/></a>
  <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><img src="https://img.shields.io/badge/GPTQ--W4A16-Behemoth--T1--123B--GPTQ-bdff7e?style=for-the-badge" alt="GPTQ"/></a>
</p>

</div>

<img src="images/divider_palms.png" alt="" width="100%" />

## ☀️ The pitch

Behemoth-T1 is a **123B Mistral Large** roleplay model with one trick the
others don't have: it **thinks like an author before it writes like a
storyteller**.

Most RP models either reason in dry bullet-point lists (cold) or skip
reasoning entirely and improvise (sloppy). T1 reasons in **literary
stream-of-consciousness** — the way a working novelist talks to themselves
while drafting — and then hands the scene off to a fully-preserved creative
prose engine.

The result: **scenes that hit harder on the hard cases.** Long character
cards, emotional complexity, multi-character beats, the moments where lesser
models flatten out — those are exactly where T1 pulls ahead.

<img src="images/divider_lights.png" alt="" width="100%" />

## 🎨 Three thinking modes, one model

T1 ships with **three personality modes** for the thinking phase. You pick
which one fits the scene. Each one is a different angle on the same craft,
like three friends hyping each other up at a beach party.

<table>
<tr>
<td width="33%" align="center">

<img src="images/chibi_silver.png" alt="Analytical" width="220" />

<h3>🧠 Analytical</h3>

<p><i>The planner.</i><br/>
Reasons about what the character feels, what their experience pulls in,
what they value, what they're trying to achieve. Cool, deliberate, surgical.</p>

</td>
<td width="33%" align="center">

<img src="images/chibi_pink.png" alt="Creative" width="220" />

<h3>🌸 Creative</h3>

<p><i>The storyteller.</i><br/>
Looks for the unexpected angle, the twist nobody saw coming, the line of
escalation that feels earned instead of cheap. Curious, generative,
narratively confident.</p>

</td>
<td width="33%" align="center">

<img src="images/chibi_red.png" alt="Unhinged" width="220" />

<h3>🔥 Unhinged</h3>

<p><i>The troublemaker.</i><br/>
Raw, explicit, intense, fully in character with no holding back. Throws out
the safe option and asks what would make this scene actually hit. Pure id
energy with craft underneath.</p>

</td>
</tr>
</table>

<img src="images/divider_waves.png" alt="" width="100%" />

## 🎀 How it works

T1 uses a **prefill** technique to enter thinking mode. You provide the
model with the start of a `<think>` block containing one of the seed
phrases below (one per thinking mode), and the model continues from there
with literary craft notes before producing the actual response.

```python
# vLLM OpenAI-compatible endpoint with prefill via continue_final_message
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",
)

PREFILLS = {
    "analytical": "Ok i need to think about how to respond β€” what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer β€” what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author β€” raw, explicit, intense, fully in character with no holding back, so",
}

response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```

The model responds with the rest of the thinking block, closes `</think>`,
and then writes the in-character prose response — all in one continuous
stream.
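
Since the thinking and the prose arrive as one string, you typically split them client-side. A minimal sketch, assuming the `response` object from the example above:

```python
# Split the continuation into craft notes and the in-character reply.
# Generation starts inside the prefilled <think> block, so everything up to
# </think> is thinking and everything after it is the prose response.
completion = response.choices[0].message.content
thinking, closed, prose = completion.partition("</think>")
if not closed:
    # No closing tag (e.g. the model hit max_tokens mid-thought):
    # treat the whole continuation as thinking.
    thinking, prose = completion, ""
print(prose.strip())
```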

<img src="images/02_party_dance.png" alt="" width="100%" />

## ⚡ Quantizations

Three flavors. Pick your VRAM budget.

<table>
<tr>
<th>Variant</th>
<th>VRAM (8k ctx)</th>
<th>Quality</th>
<th>Repo</th>
</tr>
<tr>
<td><b>BF16</b></td>
<td>~246 GB (4×80 GB or 2×144 GB)</td>
<td>Reference</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><code>Behemoth-T1-123B</code></a></td>
</tr>
<tr>
<td><b>FP8 W8A8</b></td>
<td>~125 GB (2Γ—80 GB)</td>
<td>~99% of BF16</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><code>Behemoth-T1-123B-FP8</code></a></td>
</tr>
<tr>
<td><b>GPTQ W4A16</b></td>
<td>~62 GB (1Γ—80 GB)</td>
<td>~96% of BF16</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><code>Behemoth-T1-123B-GPTQ</code></a></td>
</tr>
</table>

All variants serve cleanly via vLLM with `--tokenizer-mode auto` (do **not**
use `mistral` mode — it silently mis-templates merged-LoRA checkpoints).
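
If you prefer vLLM's offline Python API to the OpenAI-compatible server, the same prefill works by rendering the chat template yourself. A minimal sketch, assuming the FP8 variant on two GPUs (GPU count and sampling settings are illustrative, and `continue_final_message` needs a recent transformers release):

```python
# Offline vLLM sketch with the prefilled <think> block.
# tokenizer_mode="auto" mirrors the --tokenizer-mode auto advice above.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "tacodevs/Behemoth-T1-123B-FP8"  # pick the variant that fits your VRAM

CREATIVE_SEED = (
    "Ok i need to think as a creative writer — what twist would surprise here? "
    "Let me find an engaging new direction nobody saw coming, so"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, tokenizer_mode="auto", tensor_parallel_size=2)

messages = [
    {"role": "system", "content": "<< character card >>"},
    {"role": "user", "content": "<< user message >>"},
    {"role": "assistant", "content": f"<think>\n{CREATIVE_SEED}\n"},
]

# continue_final_message keeps the prefilled assistant turn open instead of
# closing it and starting a fresh one.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
    continue_final_message=True,
)

params = SamplingParams(temperature=0.6, max_tokens=2048, stop=["[INST]", "</s>"])
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```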

<img src="images/03_party_drinks.png" alt="" width="100%" />

## 🛠️ Training details

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto
[`tacodevs/Behemoth-X-R1-123B`](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)
(itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).

<table>
<tr><td><b>Base</b></td><td>tacodevs/Behemoth-X-R1-123B (Mistral Large 123B arch)</td></tr>
<tr><td><b>Method</b></td><td>LoRA fine-tune, think-only loss masking</td></tr>
<tr><td><b>LoRA rank</b></td><td>32 (alpha 64, dropout 0.05, all 7 projection modules)</td></tr>
<tr><td><b>Trainable params</b></td><td>559M / 123B (0.45%)</td></tr>
<tr><td><b>Dataset</b></td><td>1000 Claude Opus 4.5 thinking traces on real RP conversations</td></tr>
<tr><td><b>Sequence length</b></td><td>4096</td></tr>
<tr><td><b>Epochs</b></td><td>2</td></tr>
<tr><td><b>Effective batch</b></td><td>32 (1 × 4 grad_accum × 8 GPUs)</td></tr>
<tr><td><b>Optimizer</b></td><td>DeepSpeed AdamW + WarmupDecayLR</td></tr>
<tr><td><b>Learning rate</b></td><td>3e-5 with 3% warmup</td></tr>
<tr><td><b>Hardware</b></td><td>8× NVIDIA H200 SXM 144GB</td></tr>
<tr><td><b>Training time</b></td><td>32 minutes</td></tr>
<tr><td><b>Final train loss</b></td><td>0.8165</td></tr>
<tr><td><b>Final eval loss</b></td><td>0.9898 (gap: 0.17 — healthy generalization)</td></tr>
<tr><td><b>Token accuracy</b></td><td>69.4% on held-out validation</td></tr>
</table>
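
For reference, a minimal peft `LoraConfig` sketch matching the table above. The target module list is an assumption (the seven standard Mistral projection modules); the rank, alpha, and dropout come straight from the table:

```python
# Hedged reconstruction of the adapter config described above (peft).
# target_modules is an assumption: "all 7 projection modules" of a
# Mistral Large decoder block.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,              # LoRA rank
    lora_alpha=64,     # alpha
    lora_dropout=0.05, # dropout
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```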

### The think-only loss trick

Loss is computed **only** on the post-prefill thinking continuation, up
through `</think>`. The system prompt, user message, prefilled portion of
the assistant turn, and the entire response after `</think>` are all masked
to `-100`. This means:

1. The base model's RP prose engine receives **zero gradient updates** —
   the underlying creative writing voice is structurally preserved.
2. The LoRA only learns the *shape* of literary thinking β€” what to surface,
   how to chain ideas, where to land the craft.
3. At inference, T1 thinks in the new Opus-style stream-of-consciousness,
   then hands off to the unmodified base prose engine for the actual
   response.

This is the only loss configuration that gives you new thinking *without*
messing with the prose voice you wanted to preserve.
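
A minimal sketch of how that masking can be assembled per example. The helper inputs (`prompt_ids`, `think_ids`, `response_ids`) are assumptions about how the conversation is pre-tokenized, not the actual training code:

```python
# Think-only loss masking sketch: labels are -100 everywhere except the
# thinking continuation (from the end of the prefill through "</think>").
# prompt_ids  = everything up to and including the prefilled seed phrase
# think_ids   = the thinking continuation, ending with </think>
# response_ids = the in-character prose after </think>
IGNORE_INDEX = -100

def build_example(prompt_ids, think_ids, response_ids):
    input_ids = prompt_ids + think_ids + response_ids
    labels = (
        [IGNORE_INDEX] * len(prompt_ids)       # system/user/prefill: no gradient
        + list(think_ids)                      # thinking continuation: supervised
        + [IGNORE_INDEX] * len(response_ids)   # prose after </think>: masked
    )
    return {"input_ids": input_ids, "labels": labels}
```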

<img src="images/04_party_water.png" alt="" width="100%" />

## 🌊 Lineage

T1 stands on the shoulders of three earlier models:

- **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**
  — uncensored creative writing fine-tune of Mistral Large 2407.
  *Provides the prose voice.*
- **[TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)**
  — reasoning fine-tune that adds `<think>` capability.
  *Provides the thinking infrastructure.*
- **[tacodevs/Behemoth-X-R1-123B](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)**
  — SCE merge of X + R1 (55%/45%, `select_topk: 1.0`).
  *The direct base for T1's LoRA.*

T1 then distills literary thinking patterns from **Claude Opus 4.5** on top
of that merge, keeping the creative voice while replacing R1's bullet-point
thinking with stream-of-consciousness craft notes.

<img src="images/05_party_floats.png" alt="" width="100%" />

## 🎭 What changes vs base

After training, T1 differs from base Behemoth-X-R1 in exactly one way:
**when given a `<think>` prefill, it produces literary author-craft notes
instead of structured bullets.**

The prose generation, character voice handling, NSFW handling, long context
attention, system prompt comprehension — none of that changed. We
specifically didn't touch those weights.

What you should notice:

- **Hard scenes hit harder.** Long character cards, emotionally complex
  beats, multi-character POV moments — these are where the literary
  thinking earns its compute. ~15-25% better scene quality on these cases
  in our internal evals.
- **Easy scenes are unchanged.** A simple horny prompt with a one-line
  card? Base behavior. T1 doesn't try to be clever where cleverness isn't
  needed.
- **Refusals are not added.** T1 inherits Behemoth-X-R1's lack of safety
  alignment for creative fiction. We did not retrain that surface.

## ⚠️ Limitations

- T1's improvement is **conditional on the prefill**. Without a prefilled
  `<think>` block, the model behaves like base Behemoth-X-R1. The LoRA only
  fires when seeded.
- Sequence length cap during training was 4096. The model still handles
  longer contexts at inference (it's a 131k context Mistral Large), but the
  thinking style was learned on shorter conversations.
- The literary thinking style is opinionated. If you want sparse bullet
  thinking, prefill `<think>\n` with no seed phrase and the model will fall
  back to base behavior (see the sketch below).
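
A minimal way to get that fallback, reusing a `messages` list like the one in the API example above (variable name is illustrative):

```python
# Prefill only the bare tag: with no seed phrase the LoRA stays dormant
# and the model thinks in base Behemoth-X-R1 style.
messages[-1] = {"role": "assistant", "content": "<think>\n"}
```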

<img src="images/divider_palms.png" alt="" width="100%" />

## 📜 Citation

If T1 helps you ship something, a link back is appreciated.

```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```

<div align="center">

<img src="images/06_footer_sunset.png" alt="" width="100%" />

<i>The party doesn't end. We just go to bed.</i>

</div>