---
license: other
license_name: mistral-research-license
license_link: https://mistral.ai/licenses/MRL-0.1.md
base_model:
  - TheDrummer/Behemoth-X-123B-v2
  - TheDrummer/Behemoth-R1-123B-v2
base_model_relation: merge
tags:
  - mergekit
  - merge
  - sce
  - mistral
  - mistral-large
  - thinking
  - reasoning
  - roleplay
  - creative-writing
language:
  - en
pipeline_tag: text-generation
---

# 🧠 Behemoth-X-R1-123B

*A thinking beast that writes like a poet.*

An SCE merge of Behemoth-X and Behemoth-R1: prose voice meets reasoning mind in one 123B-parameter model.


## ⚡ What makes this different

Most "thinking" models sacrifice prose for reasoning. Most creative models can't think their way out of a scene. Behemoth-X-R1 doesn't compromise: it carries the distinctive voice and character depth of Behemoth-X into a model that can open a `<think>` tag and actually use it.

No LoRA. No retraining. Just principled weight arithmetic using the same SCE merge recipe that FuseAI used to preserve reasoning in FuseO1.

The parents:

- **Behemoth-X-123B-v2**: the voice
- **Behemoth-R1-123B-v2**: the mind

The child: both, at once.


## 🧬 How it was made

**Method:** SCE (Select, Calculate, Erase)

Unlike TIES or DARE, SCE doesn't prune deltas by density. It uses variance-aware matrix-level selection with sign consensus, meaning capability-bearing weight updates survive the merge even when they're small and diffuse. That matters here because reasoning is a behavioral trait encoded across many tiny parameter shifts, not a knowledge trait concentrated in a few big ones.
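
To make the three steps concrete, here is a simplified single-matrix sketch of the SCE idea in NumPy. This is illustrative only, not mergekit's actual implementation; the function name and the exact coefficient formula are assumptions:

```python
import numpy as np

def sce_merge(base, finetunes, select_topk=1.0):
    """Illustrative single-matrix SCE merge: Select, Calculate, Erase."""
    # Task vectors: what each fine-tune changed relative to the shared base
    deltas = np.stack([ft - base for ft in finetunes])

    # Select: keep only the top-k fraction of elements by cross-model
    # variance (select_topk=1.0, the FuseO1 setting, keeps everything)
    if select_topk < 1.0:
        var = deltas.var(axis=0)
        k = int(np.ceil(select_topk * var.size))
        thresh = np.sort(var.ravel())[-k]
        deltas = deltas * (var >= thresh)

    # Calculate: per-model coefficients from the energy of each delta
    energy = (deltas ** 2).sum(axis=tuple(range(1, deltas.ndim)))
    coef = energy / energy.sum()

    # Erase: elect a majority sign per element and zero out dissenters
    elected = np.sign(deltas.sum(axis=0))
    deltas = deltas * (np.sign(deltas) == elected)

    # Weighted recombination on top of the base weights
    return base + np.tensordot(coef, deltas, axes=1)
```

Where the two fine-tunes agree on a direction, the merge keeps it; where they pull opposite ways, the minority update is erased rather than averaged into mush.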

The recipe:

```yaml
models:
  - model: TheDrummer/Behemoth-X-123B-v2
    parameters:
      weight: 0.55
  - model: TheDrummer/Behemoth-R1-123B-v2
    parameters:
      weight: 0.45
merge_method: sce
base_model: mistralai/Mistral-Large-Instruct-2411
parameters:
  select_topk: 1.0
dtype: bfloat16
```

Why these numbers?

- **55/45**: a slight lean toward X for prose quality while giving R1 enough mass to keep its thinking circuit intact. Both parents share the same base, the same tokenizer (verified identical SHA256), and the same training lineage: ideal merge conditions.
- **`select_topk: 1.0`**: keep all the deltas and let variance plus sign consensus do the work. This is the FuseO1 setting, validated empirically on reasoning merges.

## 📜 Prompt Format

Standard Mistral v7, same as both parents:

```
[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]{assistant}</s>
```

### To trigger thinking

Prefill the assistant turn with a `<think>` block. The model will continue your prefill, close the tag, and drop into the narrative:

```
[INST]your message[/INST]<think>
{seed phrase}
```
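
As a concrete illustration, a small helper (hypothetical, not shipped with the model) that assembles the v7 prompt with an optional thinking prefill:

```python
def build_prompt(user, system="", think_seed=None):
    """Assemble a Mistral v7 prompt, optionally prefilling a <think> block."""
    prompt = ""
    if system:
        prompt += f"[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT]"
    prompt += f"[INST]{user}[/INST]"
    if think_seed is not None:
        # Open the tag and seed it; the model continues the thought,
        # closes </think> itself, then drops into the narrative.
        prompt += f"<think>\n{think_seed}"
    return prompt
```

Send the result as a raw completion (not a chat request), so the open `<think>` tag reaches the model verbatim.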

Prefill examples that work well:

```
<think>
Ok i need to think about how to respond: what does the character feel right now,
what from their experience is relevant, what do they value, and what are they
trying to achieve, so
```

```
<think>
Ok i need to think as a creative writer: what twist would surprise here?
Let me find an engaging new direction nobody saw coming, so
```

```
<think>
Ok i need to think as an unhinged author: raw, explicit, intense, fully in
character with no holding back, so
```

The model inherits R1's thinking circuit but shares R1's preference for being prefilled rather than self-triggering. Seed the tag, let it cook.

### Without thinking

Skip the prefill. It behaves close to pure Behemoth-X.


## 🎚️ Recommended Samplers

Start with Behemoth-X's recommended settings; the merge leans heavily on X's prose tuning.

For thinking mode, drop temperature to 0.6–0.8. The `<think>` block benefits from more deterministic reasoning; high temperature scrambles the structure.


## 🚀 Usage

### vLLM

```shell
python -m vllm.entrypoints.openai.api_server \
  --model tacodevs/Behemoth-X-R1-123B \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --trust-remote-code
```
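
Because thinking is triggered by prefilling, the raw `/v1/completions` endpoint is a natural fit once the server is up. A sketch of the request payload (the URL, prompt text, and sampling values are assumptions; the network call is commented out):

```python
import json

# Raw completion request for a vLLM OpenAI-compatible server, with the
# assistant turn prefilled so the model opens inside a <think> block.
payload = {
    "model": "tacodevs/Behemoth-X-R1-123B",
    "prompt": (
        "[INST]Continue the scene in the tavern.[/INST]<think>\n"
        "Ok i need to think about how to respond, so"
    ),
    "temperature": 0.7,   # lower range recommended for thinking mode
    "max_tokens": 1024,
}
body = json.dumps(payload)

# To actually send it (requires the server started above):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions", data=body.encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```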

### Single-GPU inference

Grab one of the quantized variants:

- **FP8**: ~123 GB, fits on 1x H200, near-lossless quality
- **AWQ / GPTQ W4A16**: ~65 GB, fits on 1x H100, slight quality tradeoff

## 🧱 Lineage

```
Mistral-Large-Instruct-2411 (Mistral AI)
  ├─ Behemoth-X-123B-v2   (TheDrummer)   ← the voice
  └─ Behemoth-R1-123B-v2  (TheDrummer)   ← the mind
       └─ Behemoth-X-R1-123B             ← the merge
```

πŸ” Known behaviors

  • <think> triggers on prefill, not spontaneously. Inherited from R1. Seed the tag.
  • Thinking style is R1-derived β€” structured, character-aware, useful but not floaty. If you want Opus-style literary pre-writing, that's a follow-up fine-tune target, not something this merge gives you for free.
  • Prose voice is mostly X. Most generations are indistinguishable from pure X on writing quality.
  • Long character cards work natively. No fine-tuning means no overfitting on context length.

πŸ™ Credits

  • TheDrummer β€” for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative space.
  • Mistral AI β€” for the foundation both parents are built on.
  • Arcee AI β€” for mergekit and the SCE implementation.
  • FuseAI β€” for proving SCE preserves reasoning.

## 📄 License

Inherited from base: Mistral Research License (non-commercial use only).