---
license: other
license_name: mistral-ai-research-license
license_link: https://mistral.ai/licenses/MNPL-0.1.md
language:
  - en
base_model:
  - tacodevs/Behemoth-X-R1-123B
  - TheDrummer/Behemoth-X-123B-v2
  - TheDrummer/Behemoth-R1-123B-v2
tags:
  - mistral
  - mistral-large
  - 123b
  - roleplay
  - creative-writing
  - thinking
  - reasoning
  - lora
  - distillation
  - claude-opus
pipeline_tag: text-generation
library_name: transformers
---

# 🌴 Behemoth-T1-123B 🌴

The party where literary craft meets unhinged creative writing.


β˜€οΈ The pitch

Behemoth-T1 is a 123B Mistral Large roleplay model with one trick the others don't have: it thinks like an author before it writes like a storyteller.

Most RP models either reason in dry bullet-point lists (cold) or skip reasoning entirely and improvise (sloppy). T1 reasons in literary stream-of-consciousness — the way a working novelist talks to themselves while drafting — and then hands the scene off to a fully-preserved creative prose engine.

The result: scenes that hit harder on the hard cases. Long character cards, emotional complexity, multi-character beats, the moments where lesser models flatten out — those are exactly where T1 pulls ahead.

## 🎨 Three thinking modes, one model

T1 ships with three personality modes for the thinking phase. You pick which one fits the scene. Each one is a different angle on the same craft, like three friends hyping each other up at a beach party.

### 🧠 Analytical

The planner.
Reasons about what the character feels, what their experience pulls in, what they value, what they're trying to achieve. Cool, deliberate, surgical.

### 🌸 Creative

The storyteller.
Looks for the unexpected angle, the twist nobody saw coming, the line of escalation that feels earned instead of cheap. Curious, generative, narratively confident.

### 🔥 Unhinged

The troublemaker.
Raw, explicit, intense, fully in character with no holding back. Throws out the safe option and asks what would make this scene actually hit. Pure id energy with craft underneath.

## 🎤 How it works

T1 uses a prefill technique to enter thinking mode. You provide the model with the start of a `<think>` block containing one of three seed phrases, and the model continues from there with literary craft notes before producing the actual response.

```python
# vLLM OpenAI-compatible endpoint with prefill via continue_final_message
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="-",
)

PREFILLS = {
    "analytical": "Ok i need to think about how to respond — what does the character feel right now, what from their experience is relevant, what do they value, and what are they trying to achieve, so",
    "creative":   "Ok i need to think as a creative writer — what twist would surprise here? Let me find an engaging new direction nobody saw coming, so",
    "unhinged":   "Ok i need to think as an unhinged author — raw, explicit, intense, fully in character with no holding back, so",
}

# CHARACTER_CARD, conversation_history, and user_message are placeholders
# for your own system card, message history, and latest user turn.
response = client.chat.completions.create(
    model="tacodevs/Behemoth-T1-123B",
    messages=[
        {"role": "system", "content": CHARACTER_CARD},
        *conversation_history,
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": f"<think>\n{PREFILLS['creative']}\n"},
    ],
    extra_body={
        "continue_final_message": True,
        "add_generation_prompt": False,
    },
    temperature=0.6,
    max_tokens=2048,
    stop=["[INST]", "</s>"],
)
```

The model responds with the rest of the thinking block, closes `</think>`, and then writes the in-character prose response — all in one continuous stream.
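To consume that stream, you will typically want to separate the thinking from the prose. A minimal sketch (the helper name `split_thinking` is ours, not part of the card; note the API returns only the continuation, so the prefill must be re-attached):

```python
def split_thinking(completion: str, prefill: str = "") -> tuple[str, str]:
    """Return (thinking, prose) from a continued-assistant completion.

    The server returns only the continuation of the prefilled turn, so
    the seed phrase is re-attached before splitting on </think>.
    """
    full = prefill + completion
    if "</think>" in full:
        thinking, _, prose = full.partition("</think>")
        return thinking.strip(), prose.strip()
    return "", full.strip()  # no thinking block was produced

raw = "what does she want here? Tension first.\n</think>\nShe smiled slowly."
thinking, prose = split_thinking(raw, prefill="Ok i need to think, so ")
# thinking holds the craft notes; prose holds the in-character response
```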

## ⚡ Quantizations

Three flavors. Pick your VRAM budget.

| Variant | VRAM (8k ctx) | Quality | Repo |
|---|---|---|---|
| BF16 | ~246 GB (4×80 GB or 2×144 GB) | Reference | Behemoth-T1-123B |
| FP8 W8A8 | ~125 GB (2×80 GB) | ~99% of BF16 | Behemoth-T1-123B-FP8 |
| GPTQ W4A16 | ~62 GB (1×80 GB) | ~96% of BF16 | Behemoth-T1-123B-GPTQ |
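The VRAM figures are roughly weight size alone. A back-of-envelope check (our arithmetic, ignoring KV cache and activation overhead, which is why the table's numbers carry extra headroom):

```python
# Approximate weight memory for 123B parameters at each precision.
# Real serving needs additional room for KV cache and activations.
PARAMS = 123e9

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1e9

bf16_gb = weight_gb(2.0)  # BF16: 2 bytes/param   -> 246.0 GB
fp8_gb  = weight_gb(1.0)  # FP8 W8A8: 1 byte/param -> 123.0 GB
w4_gb   = weight_gb(0.5)  # GPTQ W4A16: ~0.5 bytes -> 61.5 GB
```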

All variants serve cleanly via vLLM with `--tokenizer-mode auto` (do not use `mistral` mode — it silently mis-templates merged-LoRA checkpoints).
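A minimal launch sketch for the FP8 variant, assuming a recent vLLM with the OpenAI-compatible server; adjust `--tensor-parallel-size` and `--max-model-len` to your hardware:

```shell
# Serve the FP8 variant across 2 GPUs; note --tokenizer-mode auto,
# per the warning above about mistral mode and merged-LoRA checkpoints.
vllm serve tacodevs/Behemoth-T1-123B-FP8 \
  --tensor-parallel-size 2 \
  --tokenizer-mode auto \
  --max-model-len 8192
```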

πŸ› οΈ Training details

T1 is a LoRA distillation of Claude Opus 4.5 literary thinking onto tacodevs/Behemoth-X-R1-123B (itself an SCE merge of Behemoth-X creative writing + Behemoth-R1 reasoning).

| Parameter | Value |
|---|---|
| Base | tacodevs/Behemoth-X-R1-123B (Mistral Large 123B arch) |
| Method | LoRA fine-tune, think-only loss masking |
| LoRA rank | 32 (alpha 64, dropout 0.05, all 7 projection modules) |
| Trainable params | 559M / 123B (0.45%) |
| Dataset | 1000 Claude Opus 4.5 thinking traces on real RP conversations |
| Sequence length | 4096 |
| Epochs | 2 |
| Effective batch | 32 (1 × 4 grad_accum × 8 GPUs) |
| Optimizer | DeepSpeed AdamW + WarmupDecayLR |
| Learning rate | 3e-5 with 3% warmup |
| Hardware | 8× NVIDIA H200 SXM 144GB |
| Training time | 32 minutes |
| Final train loss | 0.8165 |
| Final eval loss | 0.9898 (gap: 0.17 — healthy generalization) |
| Token accuracy | 69.4% on held-out validation |

### The think-only loss trick

Loss is computed only on the post-prefill thinking continuation, up through `</think>`. The system prompt, user message, prefilled portion of the assistant turn, and the entire response after `</think>` are all masked to `-100`. This means:

1. The base model's RP prose engine receives zero gradient updates — the underlying creative writing voice is structurally preserved.
2. The LoRA only learns the shape of literary thinking — what to surface, how to chain ideas, where to land the craft.
3. At inference, T1 thinks in the new Opus-style stream-of-consciousness, then hands off to the unmodified base prose engine for the actual response.

This is the only loss configuration that gives you new thinking without messing with the prose voice you wanted to preserve.
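A minimal sketch of that masking, assuming index-level access to the tokenized sequence (the function name and boundary indices are illustrative, not the actual training code):

```python
IGNORE_INDEX = -100  # the label value cross-entropy losses skip

def mask_think_only(token_ids, prefill_end, think_close_end):
    """Keep labels only for the thinking continuation.

    Tokens before `prefill_end` (system, user, and the prefilled seed)
    and from `think_close_end` onward (the prose response) become -100,
    so they contribute no gradient; only the span through </think>
    is supervised.
    """
    return [
        tok if prefill_end <= i < think_close_end else IGNORE_INDEX
        for i, tok in enumerate(token_ids)
    ]

toy_ids = list(range(10))                # stand-in token ids
labels = mask_think_only(toy_ids, 3, 7)  # supervise positions 3..6 only
```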

## 🌊 Lineage

T1 stands on the shoulders of three earlier models:

- TheDrummer/Behemoth-X-123B-v2: the creative-writing voice
- TheDrummer/Behemoth-R1-123B-v2: the reasoning backbone
- tacodevs/Behemoth-X-R1-123B: the SCE merge of the two, used as T1's base

T1 then distills literary thinking patterns from Claude Opus 4.5 on top of that merge, keeping the creative voice while replacing R1's bullet-point thinking with stream-of-consciousness craft notes.

## 🎭 What changes vs base

After training, T1 differs from base Behemoth-X-R1 in exactly one way: when given a `<think>` prefill, it produces literary author-craft notes instead of structured bullets.

The prose generation, character voice handling, NSFW handling, long context attention, system prompt comprehension — none of that changed. We specifically didn't touch those weights.

What you should notice:

- Hard scenes hit harder. Long character cards, emotionally complex beats, multi-character POV moments — these are where the literary thinking earns its compute. ~15-25% better scene quality on these cases in our internal evals.
- Easy scenes are unchanged. A simple horny prompt with a one-line card? Base behavior. T1 doesn't try to be clever where cleverness isn't needed.
- Refusals are not added. T1 inherits Behemoth-X-R1's lack of safety alignment for creative fiction. We did not retrain that surface.

## ⚠️ Limitations

- T1's improvement is conditional on the prefill. Without a prefilled `<think>` block, the model behaves like base Behemoth-X-R1. The LoRA only fires when seeded.
- Sequence length cap during training was 4096. The model still handles longer contexts at inference (it's a 131k-context Mistral Large), but the thinking style was learned on shorter conversations.
- The literary thinking style is opinionated. If you want sparse bullet thinking, prefill `<think>\n` with no seed phrase and the model will fall back to base behavior.

## 📜 Citation

If T1 helps you ship something, a link back is appreciated.

```bibtex
@misc{behemoth-t1-2026,
  title  = {Behemoth-T1-123B: Literary Thinking Distillation for RP},
  author = {tacodevs},
  year   = {2026},
  url    = {https://huggingface.co/tacodevs/Behemoth-T1-123B},
}
```

The party doesn't end. We just go to bed.