---
license: other
license_name: mistral-research-license
license_link: https://mistral.ai/licenses/MRL-0.1.md
base_model:
- TheDrummer/Behemoth-X-123B-v2
- TheDrummer/Behemoth-R1-123B-v2
base_model_relation: merge
tags:
- mergekit
- merge
- sce
- mistral
- mistral-large
- thinking
- reasoning
- roleplay
- creative-writing
language:
- en
pipeline_tag: text-generation
---
# 🧠 Behemoth-X-R1-123B

A thinking beast that writes like a poet.

An SCE merge of Behemoth-X and Behemoth-R1: prose voice meets reasoning mind in one 123B-parameter model.
## ⚡ What makes this different

Most "thinking" models sacrifice prose for reasoning. Most creative models can't think their way out of a scene. Behemoth-X-R1 doesn't compromise: it carries the distinctive voice and character depth of Behemoth-X into a model that can open a `<think>` tag and actually use it.

No LoRA. No retraining. Just principled weight arithmetic, using the same SCE merge recipe that FuseAI used to preserve reasoning in FuseO1.
The parents:

- Behemoth-X-123B-v2: the top-rated creative writer on the UGI Leaderboard. Character voice, prose density, the reason people run 123B at home.
- Behemoth-R1-123B-v2: Behemoth's reasoning sibling. Knows when to open `<think>`, and knows when to close it.

The child: both, at once.
## 🧬 How it was made

### Method: SCE (Select, Calculate, Erase)

Unlike TIES or DARE, SCE doesn't prune deltas by density. It uses variance-aware, matrix-level selection with sign consensus, meaning capability-bearing weight updates survive the merge even when they're small and diffuse. That matters here because reasoning is a behavioral trait encoded across many tiny parameter shifts, not a knowledge trait concentrated in a few big ones.
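The three stages can be sketched in a few lines of numpy. This is a toy illustration of the idea on a single weight matrix (the `sce_merge` helper and its coefficient scheme are ours for illustration, not mergekit's actual implementation):

```python
import numpy as np

def sce_merge(base, models, select_topk=1.0):
    """Toy SCE merge of one weight matrix (illustration only)."""
    stacked = np.stack([m - base for m in models])  # per-model deltas

    # Select: keep only the fraction of elements with the highest
    # variance across models; select_topk=1.0 keeps everything.
    if select_topk < 1.0:
        variance = stacked.var(axis=0)
        k = max(1, int(variance.size * select_topk))
        threshold = np.sort(variance.ravel())[-k]
        stacked = stacked * (variance >= threshold)

    # Calculate: per-model fusion coefficients from the energy
    # (sum of squares) of each model's selected delta.
    energy = (stacked ** 2).reshape(len(models), -1).sum(axis=1)
    coeffs = energy / energy.sum()

    # Erase: zero elements whose sign disagrees with the majority sign.
    majority = np.sign(stacked.sum(axis=0))
    stacked = stacked * (np.sign(stacked) == majority)

    # Fuse the surviving deltas back onto the base weights.
    return base + np.tensordot(coeffs, stacked, axes=1)
```

The Erase step is why diffuse-but-consistent reasoning deltas survive: elements where both parents pushed the same direction are kept, while conflicting updates cancel instead of averaging into noise.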
The recipe:

```yaml
models:
  - model: TheDrummer/Behemoth-X-123B-v2
    parameters:
      weight: 0.55
  - model: TheDrummer/Behemoth-R1-123B-v2
    parameters:
      weight: 0.45
merge_method: sce
base_model: mistralai/Mistral-Large-Instruct-2411
parameters:
  select_topk: 1.0
dtype: bfloat16
```
Why these numbers?

- 55/45: a slight lean toward X for prose quality, while giving R1 enough mass to keep its thinking circuit intact. Both parents share the same base, the same tokenizer (verified identical SHA-256), and the same training lineage: ideal merge conditions.
- `select_topk: 1.0`: keep all the deltas, and let variance + sign consensus do the work. This is the FuseO1 setting, validated empirically on reasoning merges.
## Prompt Format

Standard Mistral v7, same as both parents:

```
[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]{assistant}</s>
```
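For scripting, the template is simple enough to assemble by hand. A minimal helper, assuming the v7 layout above (the function name is ours, not part of any API):

```python
def build_prompt(user, system=None, prefill=""):
    """Assemble a Mistral v7 prompt; `prefill` seeds the assistant turn."""
    prompt = f"[INST]{user}[/INST]{prefill}"
    if system is not None:
        prompt = f"[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT]" + prompt
    return prompt
```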
### To trigger thinking

Prefill the assistant turn with a `<think>` block. The model will continue your prefill, close the tag, and drop into the narrative:

```
[INST]your message[/INST]<think>
{seed phrase}
```
Prefill examples that work well:

```
<think>
Ok i need to think about how to respond - what does the character feel right now,
what from their experience is relevant, what do they value, and what are they
trying to achieve, so
```

```
<think>
Ok i need to think as a creative writer - what twist would surprise here?
Let me find an engaging new direction nobody saw coming, so
```

```
<think>
Ok i need to think as an unhinged author - raw, explicit, intense, fully in
character with no holding back, so
```
The model inherits R1's thinking circuit but shares R1's preference for being prefilled rather than self-triggering. Seed the tag, let it cook.
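Because the reasoning and the narrative come back in one completion, it helps to split them before showing the reply to a user. A small sketch (hypothetical helper; assumes the model closed the tag it was seeded with):

```python
def split_thinking(completion, prefill=""):
    """Split a completion into (thinking, narrative).

    `completion` is raw model output continuing a prompt whose
    assistant turn was prefilled with `prefill` (e.g. "<think>\n...").
    """
    text = prefill + completion
    if text.startswith("<think>") and "</think>" in text:
        thinking, _, narrative = text.partition("</think>")
        return thinking[len("<think>"):].strip(), narrative.strip()
    # No think block: the whole completion is narrative.
    return "", text.strip()
```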
### Without thinking

Skip the prefill. It behaves close to pure Behemoth-X.
## Recommended Samplers

Start with Behemoth-X's recommended settings; the merge leans heavily on X's prose tuning.

For thinking mode, drop temperature to 0.6–0.8. The `<think>` block benefits from more deterministic reasoning; high temperature scrambles the structure.
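One way a frontend might apply that rule on top of whatever base preset you use (the 0.7 cap is just the middle of the suggested 0.6–0.8 band; the function name is ours):

```python
def sampler_settings(thinking, base=None):
    """Start from a base preset (e.g. Behemoth-X's recommended settings)
    and cap temperature for thinking mode."""
    settings = dict(base or {})
    if thinking:
        settings["temperature"] = min(settings.get("temperature", 0.8), 0.7)
    return settings
```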
## Usage

### vLLM

```bash
python -m vllm.entrypoints.openai.api_server \
  --model tacodevs/Behemoth-X-R1-123B \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --trust-remote-code
```
### Single-GPU inference

Grab one of the quantized variants:

- FP8: ~123 GB, fits on 1x H200, near-lossless quality
- AWQ / GPTQ W4A16: ~65 GB, fits on 1x H100, slight quality tradeoff
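Those footprints are roughly parameters × bits per weight; quantization overhead (scales, zero-points) and KV cache come on top. A quick back-of-envelope check:

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight-only memory in decimal GB: params * bits / 8 / 1e9."""
    return n_params * bits_per_weight / 8 / 1e9

# 123B parameters at 8 bits -> 123.0 GB (the FP8 figure above)
# 123B parameters at 4 bits -> 61.5 GB; W4A16 metadata brings it to ~65 GB
```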
## 🧱 Lineage

```
Mistral-Large-Instruct-2411 (Mistral AI)
├── Behemoth-X-123B-v2 (TheDrummer) - the voice
├── Behemoth-R1-123B-v2 (TheDrummer) - the mind
└── Behemoth-X-R1-123B - the merge
```
## Known behaviors

- `<think>` triggers on prefill, not spontaneously. Inherited from R1. Seed the tag.
- Thinking style is R1-derived: structured, character-aware, useful but not floaty. If you want Opus-style literary pre-writing, that's a follow-up fine-tune target, not something this merge gives you for free.
- Prose voice is mostly X. Most generations are indistinguishable from pure X on writing quality.
- Long character cards work natively. No fine-tuning means no overfitting on context length.
## Credits

- TheDrummer, for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative space.
- Mistral AI, for the foundation both parents are built on.
- Arcee AI, for mergekit and the SCE implementation.
- FuseAI, for proving SCE preserves reasoning.
## License

Inherited from base: Mistral Research License (non-commercial use only).