---
license: other
license_name: mistral-ai-research-license
license_link: https://mistral.ai/licenses/MNPL-0.1.md
language:
- en
base_model:
- tacodevs/Behemoth-X-R1-123B
- TheDrummer/Behemoth-X-123B-v2
- TheDrummer/Behemoth-R1-123B-v2
tags:
- mistral
- mistral-large
- 123b
- roleplay
- creative-writing
- thinking
- reasoning
- lora
- distillation
- claude-opus
pipeline_tag: text-generation
library_name: transformers
---

<div align="center">

<img src="images/01_header_main.png" alt="Behemoth-T1" width="100%" />

<h1>🌴 Behemoth-T1-123B 🌴</h1>

<p><i>The party where literary craft meets unhinged creative writing.</i></p>

<p>
  <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><img src="https://img.shields.io/badge/BF16-tacodevs%2FBehemoth--T1--123B-ff7eb6?style=for-the-badge" alt="BF16"/></a>
  <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><img src="https://img.shields.io/badge/FP8-Behemoth--T1--123B--FP8-7eddff?style=for-the-badge" alt="FP8"/></a>
  <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><img src="https://img.shields.io/badge/GPTQ--W4A16-Behemoth--T1--123B--GPTQ-bdff7e?style=for-the-badge" alt="GPTQ"/></a>
</p>

</div>

<img src="images/divider_palms.png" alt="" width="100%" />

## ☀️ The pitch

Behemoth-T1 is a **123B Mistral Large** roleplay model with one trick the
others don't have: it **thinks like an author before it writes like a
storyteller**.

Most RP models either reason in dry bullet-point lists (cold) or skip
reasoning entirely and improvise (sloppy). T1 reasons in **literary
stream-of-consciousness** — the way a working novelist talks to themselves
while drafting — and then hands the scene off to a fully-preserved creative
prose engine.

The result: **scenes that hit harder on the hard cases.** Long character
cards, emotional complexity, multi-character beats, the moments where lesser
models flatten out — those are exactly where T1 pulls ahead.

<img src="images/divider_lights.png" alt="" width="100%" />

## 🎨 Three thinking modes, one model

T1 ships with **three personality modes** for the thinking phase. You pick
which one fits the scene. Each one is a different angle on the same craft,
like three friends hyping each other up at a beach party.

<table>
<tr>
<td width="33%" align="center">

<img src="images/chibi_silver.png" alt="Analytical" width="220" />

<h3>🧠 Analytical</h3>

<p><i>The planner.</i><br/>
Reasons about what the character feels, what their experience pulls in,
what they value, what they're trying to achieve. Cool, deliberate, surgical.</p>

</td>
<td width="33%" align="center">

<img src="images/chibi_pink.png" alt="Creative" width="220" />

<h3>🌸 Creative</h3>

<p><i>The storyteller.</i><br/>
Looks for the unexpected angle, the twist nobody saw coming, the line of
escalation that feels earned instead of cheap. Curious, generative,
narratively confident.</p>

</td>
<td width="33%" align="center">

<img src="images/chibi_red.png" alt="Unhinged" width="220" />

<h3>🔥 Unhinged</h3>

<p><i>The troublemaker.</i><br/>
Raw, explicit, intense, fully in character with no holding back. Throws out
the safe option and asks what would make this scene actually hit. Pure id
energy with craft underneath.</p>

</td>
</tr>
</table>

<img src="images/divider_waves.png" alt="" width="100%" />

## 🎀 How it works

T1 uses a **prefill** technique to enter thinking mode. You provide the
model with the start of a `<think>` block containing one of the seed
phrases below (one per thinking mode), and the model continues from there
with literary craft notes before producing the actual response.

```python
# vLLM OpenAI-compatible endpoint with prefill via continue_final_message
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",
)

PREFILLS = {
    "analytical": "Ok i need to think about how to respond β€” what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer β€” what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author β€” raw, explicit, intense, fully in character with no holding back, so",
}

response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```

The model responds with the rest of the thinking block, closes `</think>`,
and then writes the in-character prose response — all in one continuous
stream.
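
Since the thinking and the prose arrive as one string, you typically split them client-side. A minimal sketch, assuming the `response` object from the example above:

```python
# Split the continuation into craft notes and the in-character reply.
# Generation starts inside the prefilled <think> block, so everything up to
# </think> is thinking and everything after it is the prose response.
completion = response.choices[0].message.content
thinking, closed, prose = completion.partition("</think>")
if not closed:
    # No closing tag (e.g. the model hit max_tokens mid-thought):
    # treat the whole continuation as thinking.
    thinking, prose = completion, ""
print(prose.strip())
```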

<img src="images/02_party_dance.png" alt="" width="100%" />

## ⚡ Quantizations

Three flavors. Pick your VRAM budget.

<table>
<tr>
<th>Variant</th>
<th>VRAM (8k ctx)</th>
<th>Quality</th>
<th>Repo</th>
</tr>
<tr>
<td><b>BF16</b></td>
<td>~246 GB (4×80 GB or 2×144 GB)</td>
<td>Reference</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><code>Behemoth-T1-123B</code></a></td>
</tr>
<tr>
<td><b>FP8 W8A8</b></td>
<td>~125 GB (2Γ—80 GB)</td>
<td>~99% of BF16</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><code>Behemoth-T1-123B-FP8</code></a></td>
</tr>
<tr>
<td><b>GPTQ W4A16</b></td>
<td>~62 GB (1Γ—80 GB)</td>
<td>~96% of BF16</td>
<td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><code>Behemoth-T1-123B-GPTQ</code></a></td>
</tr>
</table>

All variants serve cleanly via vLLM with `--tokenizer-mode auto` (do **not**
use `mistral` mode — it silently mis-templates merged-LoRA checkpoints).
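
If you prefer vLLM's offline Python API to the OpenAI-compatible server, the same prefill works by rendering the chat template yourself. A minimal sketch, assuming the FP8 variant on two GPUs (GPU count and sampling settings are illustrative, and `continue_final_message` needs a recent transformers release):

```python
# Offline vLLM sketch with the prefilled <think> block.
# tokenizer_mode="auto" mirrors the --tokenizer-mode auto advice above.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "tacodevs/Behemoth-T1-123B-FP8"  # pick the variant that fits your VRAM

CREATIVE_SEED = (
    "Ok i need to think as a creative writer — what twist would surprise here? "
    "Let me find an engaging new direction nobody saw coming, so"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, tokenizer_mode="auto", tensor_parallel_size=2)

messages = [
    {"role": "system", "content": "<< character card >>"},
    {"role": "user", "content": "<< user message >>"},
    {"role": "assistant", "content": f"<think>\n{CREATIVE_SEED}\n"},
]

# continue_final_message keeps the prefilled assistant turn open instead of
# closing it and starting a fresh one.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
    continue_final_message=True,
)

params = SamplingParams(temperature=0.6, max_tokens=2048, stop=["[INST]", "</s>"])
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```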

<img src="images/03_party_drinks.png" alt="" width="100%" />

## 🛠️ Training details

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto
[`tacodevs/Behemoth-X-R1-123B`](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)
(itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).

<table>
<tr><td><b>Base</b></td><td>tacodevs/Behemoth-X-R1-123B (Mistral Large 123B arch)</td></tr>
<tr><td><b>Method</b></td><td>LoRA fine-tune, think-only loss masking</td></tr>
<tr><td><b>LoRA rank</b></td><td>32 (alpha 64, dropout 0.05, all 7 projection modules)</td></tr>
<tr><td><b>Trainable params</b></td><td>559M / 123B (0.45%)</td></tr>
<tr><td><b>Dataset</b></td><td>1000 Claude Opus 4.5 thinking traces on real RP conversations</td></tr>
<tr><td><b>Sequence length</b></td><td>4096</td></tr>
<tr><td><b>Epochs</b></td><td>2</td></tr>
<tr><td><b>Effective batch</b></td><td>32 (1 × 4 grad_accum × 8 GPUs)</td></tr>
<tr><td><b>Optimizer</b></td><td>DeepSpeed AdamW + WarmupDecayLR</td></tr>
<tr><td><b>Learning rate</b></td><td>3e-5 with 3% warmup</td></tr>
<tr><td><b>Hardware</b></td><td>8× NVIDIA H200 SXM 144GB</td></tr>
<tr><td><b>Training time</b></td><td>32 minutes</td></tr>
<tr><td><b>Final train loss</b></td><td>0.8165</td></tr>
<tr><td><b>Final eval loss</b></td><td>0.9898 (gap: 0.17 — healthy generalization)</td></tr>
<tr><td><b>Token accuracy</b></td><td>69.4% on held-out validation</td></tr>
</table>
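
For reference, a minimal peft `LoraConfig` sketch matching the table above. The target module list is an assumption (the seven standard Mistral projection modules); the rank, alpha, and dropout come straight from the table:

```python
# Hedged reconstruction of the adapter config described above (peft).
# target_modules is an assumption: "all 7 projection modules" of a
# Mistral Large decoder block.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,              # LoRA rank
    lora_alpha=64,     # alpha
    lora_dropout=0.05, # dropout
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```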

### The think-only loss trick

Loss is computed **only** on the post-prefill thinking continuation, up
through `</think>`. The system prompt, user message, prefilled portion of
the assistant turn, and the entire response after `</think>` are all masked
to `-100`. This means:

1. The base model's RP prose engine receives **zero gradient updates** —
   the underlying creative writing voice is structurally preserved.
2. The LoRA only learns the *shape* of literary thinking β€” what to surface,
   how to chain ideas, where to land the craft.
3. At inference, T1 thinks in the new Opus-style stream-of-consciousness,
   then hands off to the unmodified base prose engine for the actual
   response.

This is the only loss configuration that gives you new thinking *without*
messing with the prose voice you wanted to preserve.
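
A minimal sketch of how that masking can be assembled per example. The helper inputs (`prompt_ids`, `think_ids`, `response_ids`) are assumptions about how the conversation is pre-tokenized, not the actual training code:

```python
# Think-only loss masking sketch: labels are -100 everywhere except the
# thinking continuation (from the end of the prefill through "</think>").
# prompt_ids  = everything up to and including the prefilled seed phrase
# think_ids   = the thinking continuation, ending with </think>
# response_ids = the in-character prose after </think>
IGNORE_INDEX = -100

def build_example(prompt_ids, think_ids, response_ids):
    input_ids = prompt_ids + think_ids + response_ids
    labels = (
        [IGNORE_INDEX] * len(prompt_ids)       # system/user/prefill: no gradient
        + list(think_ids)                      # thinking continuation: supervised
        + [IGNORE_INDEX] * len(response_ids)   # prose after </think>: masked
    )
    return {"input_ids": input_ids, "labels": labels}
```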

<img src="images/04_party_water.png" alt="" width="100%" />

## 🌊 Lineage

T1 stands on the shoulders of three earlier models:

- **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**
  — uncensored creative writing fine-tune of Mistral Large 2407.
  *Provides the prose voice.*
- **[TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)**
  — reasoning fine-tune that adds `<think>` capability.
  *Provides the thinking infrastructure.*
- **[tacodevs/Behemoth-X-R1-123B](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)**
  — SCE merge of X + R1 (55%/45%, `select_topk: 1.0`).
  *The direct base for T1's LoRA.*

T1 then distills literary thinking patterns from **Claude Opus 4.5** on top
of that merge, keeping the creative voice while replacing R1's bullet-point
thinking with stream-of-consciousness craft notes.

<img src="images/05_party_floats.png" alt="" width="100%" />

## 🎭 What changes vs base

After training, T1 differs from base Behemoth-X-R1 in exactly one way:
**when given a `<think>` prefill, it produces literary author-craft notes
instead of structured bullets.**

The prose generation, character voice handling, NSFW handling, long context
attention, system prompt comprehension — none of that changed. We
specifically didn't touch those weights.

What you should notice:

- **Hard scenes hit harder.** Long character cards, emotionally complex
  beats, multi-character POV moments — these are where the literary
  thinking earns its compute. ~15-25% better scene quality on these cases
  in our internal evals.
- **Easy scenes are unchanged.** A simple horny prompt with a one-line
  card? Base behavior. T1 doesn't try to be clever where cleverness isn't
  needed.
- **Refusals are not added.** T1 inherits Behemoth-X-R1's lack of safety
  alignment for creative fiction. We did not retrain that surface.

## ⚠️ Limitations

- T1's improvement is **conditional on the prefill**. Without a prefilled
  `<think>` block, the model behaves like base Behemoth-X-R1. The LoRA only
  fires when seeded.
- Sequence length cap during training was 4096. The model still handles
  longer contexts at inference (it's a 131k context Mistral Large), but the
  thinking style was learned on shorter conversations.
- The literary thinking style is opinionated. If you want sparse bullet
  thinking, prefill `<think>\n` with no seed phrase and the model will fall
  back to base behavior (see the sketch below).
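
A minimal way to get that fallback, reusing a `messages` list like the one in the API example above (variable name is illustrative):

```python
# Prefill only the bare tag: with no seed phrase the LoRA stays dormant
# and the model thinks in base Behemoth-X-R1 style.
messages[-1] = {"role": "assistant", "content": "<think>\n"}
```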

<img src="images/divider_palms.png" alt="" width="100%" />

## 📜 Citation

If T1 helps you ship something, a link back is appreciated.

```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```

<div align="center">

<img src="images/06_footer_sunset.png" alt="" width="100%" />

<i>The party doesn't end. We just go to bed.</i>

</div>