| --- |
| license: other |
| license_name: mistral-ai-research-license |
| license_link: https://mistral.ai/licenses/MNPL-0.1.md |
| language: |
| - en |
| base_model: |
| - tacodevs/Behemoth-X-R1-123B |
| - TheDrummer/Behemoth-X-123B-v2 |
| - TheDrummer/Behemoth-R1-123B-v2 |
| tags: |
| - mistral |
| - mistral-large |
| - 123b |
| - roleplay |
| - creative-writing |
| - thinking |
| - reasoning |
| - lora |
| - distillation |
| - claude-opus |
| pipeline_tag: text-generation |
| library_name: transformers |
| --- |
| |
| <div align="center"> |
|
|
| <img src="images/01_header_main.png" alt="Behemoth-T1" width="100%" /> |
|
|
| <h1>🌴 Behemoth-T1-123B 🌴</h1> |
|
|
| <p><i>The party where literary craft meets unhinged creative writing.</i></p> |
|
|
| <p> |
| <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><img src="https://img.shields.io/badge/BF16-tacodevs%2FBehemoth--T1--123B-ff7eb6?style=for-the-badge" alt="BF16"/></a> |
| <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><img src="https://img.shields.io/badge/FP8-Behemoth--T1--123B--FP8-7eddff?style=for-the-badge" alt="FP8"/></a> |
| <a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><img src="https://img.shields.io/badge/GPTQ--W4A16-Behemoth--T1--123B--GPTQ-bdff7e?style=for-the-badge" alt="GPTQ"/></a> |
| </p> |
|
|
| </div> |
|
|
| <img src="images/divider_palms.png" alt="" width="100%" /> |
|
|
| ## The pitch |
|
|
| Behemoth-T1 is a **123B Mistral Large** roleplay model with one trick the |
| others don't have: it **thinks like an author before it writes like a |
| storyteller**. |
|
|
| Most RP models either reason in dry bullet-point lists (cold) or skip |
| reasoning entirely and improvise (sloppy). T1 reasons in **literary |
| stream-of-consciousness**, the way a working novelist talks to themselves |
| while drafting, and then hands the scene off to a fully preserved creative |
| prose engine. |
|
|
| The result: **scenes that hit harder on the hard cases.** Long character |
| cards, emotional complexity, multi-character beats, the moments where lesser |
| models flatten out: those are exactly where T1 pulls ahead. |
|
|
| <img src="images/divider_lights.png" alt="" width="100%" /> |
|
|
| ## Three thinking modes, one model |
|
|
| T1 ships with **three personality modes** for the thinking phase. You pick |
| which one fits the scene. Each one is a different angle on the same craft, |
| like three friends hyping each other up at a beach party. |
|
|
| <table> |
| <tr> |
| <td width="33%" align="center"> |
|
|
| <img src="images/chibi_silver.png" alt="Analytical" width="220" /> |
|
|
| <h3>Analytical</h3> |
|
|
| <p><i>The planner.</i><br/> |
| Reasons about what the character feels, what their experience pulls in, |
| what they value, what they're trying to achieve. Cool, deliberate, surgical.</p> |
|
|
| </td> |
| <td width="33%" align="center"> |
|
|
| <img src="images/chibi_pink.png" alt="Creative" width="220" /> |
|
|
| <h3>Creative</h3> |
|
|
| <p><i>The storyteller.</i><br/> |
| Looks for the unexpected angle, the twist nobody saw coming, the line of |
| escalation that feels earned instead of cheap. Curious, generative, |
| narratively confident.</p> |
|
|
| </td> |
| <td width="33%" align="center"> |
|
|
| <img src="images/chibi_red.png" alt="Unhinged" width="220" /> |
|
|
| <h3>Unhinged</h3> |
|
|
| <p><i>The troublemaker.</i><br/> |
| Raw, explicit, intense, fully in character with no holding back. Throws out |
| the safe option and asks what would make this scene actually hit. Pure id |
| energy with craft underneath.</p> |
|
|
| </td> |
| </tr> |
| </table> |
|
|
| <img src="images/divider_waves.png" alt="" width="100%" /> |
|
|
| ## How it works |
|
|
| T1 uses a **prefill** technique to enter thinking mode. You provide the |
| model with the start of a `<think>` block containing one of seven seed |
| phrases, and the model continues from there with literary craft notes |
| before producing the actual response. |
|
|
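| ```python |
| # vLLM OpenAI-compatible endpoint with prefill via continue_final_message |
| import openai |
| |
| client = openai.OpenAI( |
|     base_url="http://localhost:8000/v1", |
|     api_key="-",  # placeholder; use a real key if your server requires one |
| ) |
| |
| # Placeholder inputs: swap in your own character card, history, and message. |
| CHARACTER_CARD = "..." |
| conversation_history = []  # earlier {"role": ..., "content": ...} turns |
| user_message = "..." |
| |
| # Seed phrase per thinking mode. \u2014 is an em dash, written as an escape |
| # so the seeds stay intact regardless of file encoding. |
| PREFILLS = { |
|     "analytical": "Ok i need to think about how to respond \u2014 what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so", |
|     "creative": "Ok i need to think as a creative writer \u2014 what twist would surprise here? Let me find an engaging new direction nobody saw coming, so", |
|     "unhinged": "Ok i need to think as an unhinged author \u2014 raw, explicit, intense, fully in character with no holding back, so", |
| } |
| |
| response = client.chat.completions.create( |
|     model="tacodevs/Behemoth-T1-123B", |
|     messages=[ |
|         {"role": "system", "content": CHARACTER_CARD}, |
|         *conversation_history, |
|         {"role": "user", "content": user_message}, |
|         # Trailing assistant turn: the prefill the model continues from. |
|         {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"}, |
|     ], |
|     extra_body={ |
|         "continue_final_message": True,  # keep writing inside the assistant turn |
|         "add_generation_prompt": False,  # do not open a fresh assistant turn |
|     }, |
|     temperature=0.6, |
|     max_tokens=2048, |
|     stop=["[INST]", "</s>"], |
| ) |
| |
| print(response.choices[0].message.content) |
| ``` |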
|
|
| The model responds with the rest of the thinking block, closes `</think>`, |
| and then writes the in-character prose response, all in one continuous |
| stream. |
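
Because the thinking and the prose arrive in one string, most front ends will want to separate them before display. The sketch below is one way to do that. `split_thinking` is a hypothetical helper, not part of this repo, and it assumes the server returns only the continuation, so the prefill you sent is glued back on first.

```python
def split_thinking(prefill: str, completion: str) -> tuple[str, str]:
    """Return (thinking, response) from a prefilled generation.

    prefill:    the assistant text you sent, starting with "<think>".
    completion: the text the server returned as the continuation.
    """
    full = prefill + completion
    thinking, closer, response = full.partition("</think>")
    thinking = thinking.removeprefix("<think>").strip()
    if not closer:
        # The block was never closed, e.g. max_tokens ran out mid-thought.
        return thinking, ""
    return thinking, response.strip()


if __name__ == "__main__":
    notes, prose = split_thinking(
        "<think>\nOk so",
        " she hesitates.\n</think>\n\nShe looks away.",
    )
    print("THINKING:", notes)
    print("RESPONSE:", prose)
```

An empty response string is a useful signal that the thinking block was cut off and `max_tokens` may need to be raised.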
|
|
| <img src="images/02_party_dance.png" alt="" width="100%" /> |
|
|
| ## ⚡ Quantizations |
|
|
| Three flavors. Pick your VRAM budget. |
|
|
| <table> |
| <tr> |
| <th>Variant</th> |
| <th>VRAM (8k ctx)</th> |
| <th>Quality</th> |
| <th>Repo</th> |
| </tr> |
| <tr> |
| <td><b>BF16</b></td> |
| <td>~246 GB (4×80 GB or 2×144 GB)</td> |
| <td>Reference</td> |
| <td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B"><code>Behemoth-T1-123B</code></a></td> |
| </tr> |
| <tr> |
| <td><b>FP8 W8A8</b></td> |
| <td>~125 GB (2×80 GB)</td> |
| <td>~99% of BF16</td> |
| <td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-FP8"><code>Behemoth-T1-123B-FP8</code></a></td> |
| </tr> |
| <tr> |
| <td><b>GPTQ W4A16</b></td> |
| <td>~62 GB (1×80 GB)</td> |
| <td>~96% of BF16</td> |
| <td><a href="https://huggingface.co/tacodevs/Behemoth-T1-123B-GPTQ"><code>Behemoth-T1-123B-GPTQ</code></a></td> |
| </tr> |
| </table> |
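
The VRAM column tracks weight size closely, which a quick back-of-the-envelope calculation shows. This is a rough sketch, not a sizing tool: it counts weights only, assumes exactly 123e9 parameters (taken from the model name), and ignores KV cache, activations, and quantization scales, so real usage will be somewhat higher.

```python
PARAMS = 123e9  # assumed parameter count, taken from the model name


def weight_gb(bits_per_param: float, params: float = PARAMS) -> float:
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for name, bits in [("BF16", 16), ("FP8 W8A8", 8), ("GPTQ W4A16", 4)]:
        print(f"{name}: ~{weight_gb(bits):.1f} GB of weights")
```

This gives 246.0, 123.0, and 61.5 GB respectively. The small gap to the FP8 and GPTQ figures in the table is consistent with per-tensor or per-group scales and any layers left unquantized.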
|
|
| All variants serve cleanly via vLLM with `--tokenizer-mode auto`. Do **not** |
| use `mistral` mode: it silently mis-templates merged-LoRA checkpoints. |
|
|
| <img src="images/03_party_drinks.png" alt="" width="100%" /> |
|
|
| ## 🛠️ Training details |
|
|
| T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto |
| [`tacodevs/Behemoth-X-R1-123B`](https://huggingface.co/tacodevs/Behemoth-X-R1-123B) |
| (itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning). |
|
|
| <table> |
| <tr><td><b>Base</b></td><td>tacodevs/Behemoth-X-R1-123B (Mistral Large 123B arch)</td></tr> |
| <tr><td><b>Method</b></td><td>LoRA fine-tune, think-only loss masking</td></tr> |
| <tr><td><b>LoRA rank</b></td><td>32 (alpha 64, dropout 0.05, all 7 projection modules)</td></tr> |
| <tr><td><b>Trainable params</b></td><td>559M / 123B (0.45%)</td></tr> |
| <tr><td><b>Dataset</b></td><td>1000 Claude Opus 4.5 thinking traces on real RP conversations</td></tr> |
| <tr><td><b>Sequence length</b></td><td>4096</td></tr> |
| <tr><td><b>Epochs</b></td><td>2</td></tr> |
| <tr><td><b>Effective batch</b></td><td>32 (1 × 4 grad_accum × 8 GPUs)</td></tr> |
| <tr><td><b>Optimizer</b></td><td>DeepSpeed AdamW + WarmupDecayLR</td></tr> |
| <tr><td><b>Learning rate</b></td><td>3e-5 with 3% warmup</td></tr> |
| <tr><td><b>Hardware</b></td><td>8× NVIDIA H200 SXM 144GB</td></tr> |
| <tr><td><b>Training time</b></td><td>32 minutes</td></tr> |
| <tr><td><b>Final train loss</b></td><td>0.8165</td></tr> |
| <tr><td><b>Final eval loss</b></td><td>0.9898 (gap: 0.17, healthy generalization)</td></tr> |
| <tr><td><b>Token accuracy</b></td><td>69.4% on held-out validation</td></tr> |
| </table> |
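
The trainable-parameter figure can be sanity-checked from the LoRA rank alone. The sketch below assumes the publicly documented Mistral Large 2407 dimensions (hidden size 12288, MLP size 28672, 88 layers, 8 key/value heads of dimension 128); none of these are stated in this card, so treat them as assumptions.

```python
# Assumed Mistral Large 2407 dimensions (not stated in this card).
HIDDEN = 12288
INTERMEDIATE = 28672
LAYERS = 88
KV_DIM = 8 * 128  # 8 key/value heads x head_dim 128
RANK = 32         # LoRA rank from the table above

# (in_features, out_features) for the 7 projection modules in each layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank: int = RANK) -> int:
    """LoRA adds A (rank x in) and B (out x rank): rank * (in + out) per module."""
    per_layer = sum(rank * (i + o) for i, o in PROJECTIONS.values())
    return per_layer * LAYERS


if __name__ == "__main__":
    total = lora_params()
    print(f"{total:,} trainable params ({total / 123e9:.2%} of 123B)")
    # -> 559,415,296 trainable params (0.45% of 123B)
```

Under those assumptions the count lands on the 559M / 0.45% reported in the table.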
| |
| ### The think-only loss trick |
| |
| Loss is computed **only** on the post-prefill thinking continuation, up |
| through `</think>`. The system prompt, user message, prefilled portion of |
| the assistant turn, and the entire response after `</think>` are all masked |
| to `-100`. This means: |
| |
| 1. The base model's RP prose engine receives **zero gradient signal from |
| response tokens**, so the underlying creative writing voice is preserved by design. |
| 2. The LoRA only learns the *shape* of literary thinking: what to surface, |
| how to chain ideas, where to land the craft. |
| 3. At inference, T1 thinks in the new Opus-style stream-of-consciousness, |
| then hands off to the unmodified base prose engine for the actual |
| response. |
| |
| This is the only loss configuration that gives you new thinking *without* |
| messing with the prose voice you wanted to preserve. |
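
The masking described above can be sketched in a few lines. This is an illustration under stated assumptions, not the training code used for T1: `think_only_labels` is a hypothetical helper, and it assumes you have already located the token index where the prefill ends and the index just past `</think>`.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def think_only_labels(input_ids, prefill_end, think_end):
    """Build labels that keep only the thinking continuation.

    prefill_end: index of the first token after the prefilled seed phrase.
    think_end:   index one past the last token of `</think>`.
    Tokens outside [prefill_end, think_end) are set to IGNORE_INDEX.
    """
    return [
        tok if prefill_end <= i < think_end else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]


if __name__ == "__main__":
    # Toy sequence: 4 tokens of prompt + prefill, 4 of thinking, 2 of response.
    ids = list(range(10, 20))
    print(think_only_labels(ids, prefill_end=4, think_end=8))
    # -> [-100, -100, -100, -100, 14, 15, 16, 17, -100, -100]
```

Only the four middle tokens contribute to the loss; the prompt, the prefill, and the response are all ignored.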
| |
| <img src="images/04_party_water.png" alt="" width="100%" /> |
| |
| ## Lineage |
| |
| T1 stands on the shoulders of three earlier models: |
| |
| - **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)**: |
| uncensored creative writing fine-tune of Mistral Large 2407. |
| *Provides the prose voice.* |
| - **[TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)**: |
| reasoning fine-tune that adds `<think>` capability. |
| *Provides the thinking infrastructure.* |
| - **[tacodevs/Behemoth-X-R1-123B](https://huggingface.co/tacodevs/Behemoth-X-R1-123B)**: |
| SCE merge of X + R1 (55%/45%, `select_topk: 1.0`). |
| *The direct base for T1's LoRA.* |
|
|
| T1 then distills literary thinking patterns from **Claude Opus 4.5** on top |
| of that merge, keeping the creative voice while replacing R1's bullet-point |
| thinking with stream-of-consciousness craft notes. |
|
|
| <img src="images/05_party_floats.png" alt="" width="100%" /> |
|
|
| ## What changes vs base |
|
|
| After training, T1 differs from base Behemoth-X-R1 in exactly one way: |
| **when given a `<think>` prefill, it produces literary author-craft notes |
| instead of structured bullets.** |
|
|
| The prose generation, character voice handling, NSFW handling, long context |
| attention, system prompt comprehension: none of that was a training target. |
| We specifically masked those tokens out of the loss. |
|
|
| What you should notice: |
|
|
| - **Hard scenes hit harder.** Long character cards, emotionally complex |
| beats, multi-character POV moments: these are where the literary |
| thinking earns its compute. ~15-25% better scene quality on these cases |
| in our internal evals. |
| - **Easy scenes are unchanged.** A simple horny prompt with a one-line |
| card? Base behavior. T1 doesn't try to be clever where cleverness isn't |
| needed. |
| - **Refusals are not added.** T1 inherits Behemoth-X-R1's lack of safety |
| alignment for creative fiction. We did not retrain that surface. |
|
|
| ## ⚠️ Limitations |
|
|
| - T1's improvement is **conditional on the prefill**. Without a prefilled |
| `<think>` block, the model behaves like base Behemoth-X-R1. The LoRA only |
| fires when seeded. |
| - Sequence length cap during training was 4096. The model still handles |
| longer contexts at inference (it's a 131k-context Mistral Large), but the |
| thinking style was learned on shorter conversations. |
| - The literary thinking style is opinionated. If you want sparse bullet |
| thinking, prefill `<think>\n` with no seed phrase and the model will fall |
| back to base behavior. |
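
Switching between seeded and bare thinking comes down to what goes in the trailing assistant message. A minimal sketch, assuming the same message format as the request example earlier in this card; `think_prefill` is a hypothetical helper name.

```python
from typing import Optional


def think_prefill(seed: Optional[str] = None) -> dict:
    """Build the trailing assistant message that opens the `<think>` block.

    With a seed phrase the model is steered into literary thinking; with
    seed=None the block is left bare, which this card describes as falling
    back to base behavior.
    """
    content = "<think>\n" if seed is None else f"<think>\n{seed}\n"
    return {"role": "assistant", "content": content}


if __name__ == "__main__":
    print(think_prefill())                       # bare block, base-style thinking
    print(think_prefill("Ok i need to think"))   # seeded block (seed shortened here)
```

Append the returned dict as the last entry in `messages` and keep `continue_final_message` enabled so the model continues inside that turn.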
|
|
| <img src="images/divider_palms.png" alt="" width="100%" /> |
|
|
| ## Citation |
|
|
| If T1 helps you ship something, a link back is appreciated. |
|
|
| ```bibtex |
| @misc{behemoth-t1-2026, |
| title = {Behemoth-T1-123B: Literary Thinking Distillation for RP}, |
| author = {tacodevs}, |
| year = {2026}, |
| url = {https://huggingface.co/tacodevs/Behemoth-T1-123B}, |
| } |
| ``` |
|
|
| <div align="center"> |
|
|
| <img src="images/06_footer_sunset.png" alt="" width="100%" /> |
|
|
| <i>The party doesn't end. We just go to bed.</i> |
|
|
| </div> |
|
|