---
license: other
license_name: mistral-research-license
license_link: https://mistral.ai/licenses/MRL-0.1.md
base_model:
  - TheDrummer/Behemoth-X-123B-v2
  - TheDrummer/Behemoth-R1-123B-v2
base_model_relation: merge
tags:
  - mergekit
  - merge
  - sce
  - mistral
  - mistral-large
  - thinking
  - reasoning
  - roleplay
  - creative-writing
language:
  - en
pipeline_tag: text-generation
---

# 🧠 Behemoth-X-R1-123B

*A thinking beast that writes like a poet.*

An SCE merge of Behemoth-X and Behemoth-R1: prose voice meets reasoning mind in one 123B-parameter model.


## ⚡ What makes this different

Most "thinking" models sacrifice prose for reasoning. Most creative models can't think their way out of a scene. Behemoth-X-R1 doesn't compromise: it carries the distinctive voice and character depth of Behemoth-X into a model that can open a `<think>` tag and actually use it.

No LoRA. No retraining. Just principled weight arithmetic using the same SCE merge recipe that FuseAI used to preserve reasoning in FuseO1.

The parents:

- **Behemoth-X-123B-v2**: the voice
- **Behemoth-R1-123B-v2**: the mind

The child: both, at once.


## 🧬 How it was made

**Method:** SCE (Select, Calculate, Erase)

Unlike TIES or DARE, SCE doesn't prune deltas by density. It uses variance-aware matrix-level selection with sign consensus, meaning capability-bearing weight updates survive the merge even when they're small and diffuse. That matters here because reasoning is a behavioral trait encoded across many tiny parameter shifts, not a knowledge trait concentrated in a few big ones.
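
To make the three steps concrete, here is a simplified single-matrix sketch of the SCE idea in NumPy. This is illustrative only, not mergekit's actual implementation; the function name and the exact coefficient formula are assumptions:

```python
import numpy as np

def sce_merge(base, finetunes, select_topk=1.0):
    """Illustrative single-matrix SCE merge: Select, Calculate, Erase."""
    # Task vectors: what each fine-tune changed relative to the shared base
    deltas = np.stack([ft - base for ft in finetunes])

    # Select: keep only the top-k fraction of elements by cross-model
    # variance (select_topk=1.0, the FuseO1 setting, keeps everything)
    if select_topk < 1.0:
        var = deltas.var(axis=0)
        k = int(np.ceil(select_topk * var.size))
        thresh = np.sort(var.ravel())[-k]
        deltas = deltas * (var >= thresh)

    # Calculate: per-model coefficients from the energy of each delta
    energy = (deltas ** 2).sum(axis=tuple(range(1, deltas.ndim)))
    coef = energy / energy.sum()

    # Erase: elect a majority sign per element and zero out dissenters
    elected = np.sign(deltas.sum(axis=0))
    deltas = deltas * (np.sign(deltas) == elected)

    # Weighted recombination on top of the base weights
    return base + np.tensordot(coef, deltas, axes=1)
```

Where the two fine-tunes agree on a direction, the merge keeps it; where they pull opposite ways, the minority update is erased rather than averaged into mush.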

The recipe:

```yaml
models:
  - model: TheDrummer/Behemoth-X-123B-v2
    parameters:
      weight: 0.55
  - model: TheDrummer/Behemoth-R1-123B-v2
    parameters:
      weight: 0.45
merge_method: sce
base_model: mistralai/Mistral-Large-Instruct-2411
parameters:
  select_topk: 1.0
dtype: bfloat16
```

Why these numbers?

- **55/45**: a slight lean toward X for prose quality while giving R1 enough mass to keep its thinking circuit intact. Both parents share the same base, the same tokenizer (verified identical SHA256), and the same training lineage: ideal merge conditions.
- **`select_topk: 1.0`**: keep all the deltas and let variance plus sign consensus do the work. This is the FuseO1 setting, validated empirically on reasoning merges.

## 📜 Prompt Format

Standard Mistral v7, same as both parents:

```
[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]{assistant}</s>
```

### To trigger thinking

Prefill the assistant turn with a `<think>` block. The model will continue your prefill, close the tag, and drop into the narrative:

```
[INST]your message[/INST]<think>
{seed phrase}
```
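
As a concrete illustration, a small helper (hypothetical, not shipped with the model) that assembles the v7 prompt with an optional thinking prefill:

```python
def build_prompt(user, system="", think_seed=None):
    """Assemble a Mistral v7 prompt, optionally prefilling a <think> block."""
    prompt = ""
    if system:
        prompt += f"[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT]"
    prompt += f"[INST]{user}[/INST]"
    if think_seed is not None:
        # Open the tag and seed it; the model continues the thought,
        # closes </think> itself, then drops into the narrative.
        prompt += f"<think>\n{think_seed}"
    return prompt
```

Send the result as a raw completion (not a chat request), so the open `<think>` tag reaches the model verbatim.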

Prefill examples that work well:

```
<think>
Ok i need to think about how to respond: what does the character feel right now,
what from their experience is relevant, what do they value, and what are they
trying to achieve, so
```

```
<think>
Ok i need to think as a creative writer: what twist would surprise here?
Let me find an engaging new direction nobody saw coming, so
```

```
<think>
Ok i need to think as an unhinged author: raw, explicit, intense, fully in
character with no holding back, so
```

The model inherits R1's thinking circuit but shares R1's preference for being prefilled rather than self-triggering. Seed the tag, let it cook.

### Without thinking

Skip the prefill. It behaves close to pure Behemoth-X.


## 🎚️ Recommended Samplers

Start with Behemoth-X's recommended settings; the merge leans heavily on X's prose tuning.

For thinking mode, drop temperature to 0.6–0.8. The `<think>` block benefits from more deterministic reasoning; high temperature scrambles the structure.


## 🚀 Usage

### vLLM

```shell
python -m vllm.entrypoints.openai.api_server \
  --model tacodevs/Behemoth-X-R1-123B \
  --dtype bfloat16 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --trust-remote-code
```
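
Because thinking is triggered by prefilling, the raw `/v1/completions` endpoint is a natural fit once the server is up. A sketch of the request payload (the URL, prompt text, and sampling values are assumptions; the network call is commented out):

```python
import json

# Raw completion request for a vLLM OpenAI-compatible server, with the
# assistant turn prefilled so the model opens inside a <think> block.
payload = {
    "model": "tacodevs/Behemoth-X-R1-123B",
    "prompt": (
        "[INST]Continue the scene in the tavern.[/INST]<think>\n"
        "Ok i need to think about how to respond, so"
    ),
    "temperature": 0.7,   # lower range recommended for thinking mode
    "max_tokens": 1024,
}
body = json.dumps(payload)

# To actually send it (requires the server started above):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions", data=body.encode(),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```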

### Single-GPU inference

Grab one of the quantized variants:

- **FP8**: ~123 GB, fits on 1x H200, near-lossless quality
- **AWQ / GPTQ W4A16**: ~65 GB, fits on 1x H100, slight quality tradeoff

## 🧱 Lineage

```
Mistral-Large-Instruct-2411 (Mistral AI)
  ├─ Behemoth-X-123B-v2   (TheDrummer)   ← the voice
  └─ Behemoth-R1-123B-v2  (TheDrummer)   ← the mind
       └─ Behemoth-X-R1-123B             ← the merge
```

πŸ” Known behaviors

  • <think> triggers on prefill, not spontaneously. Inherited from R1. Seed the tag.
  • Thinking style is R1-derived β€” structured, character-aware, useful but not floaty. If you want Opus-style literary pre-writing, that's a follow-up fine-tune target, not something this merge gives you for free.
  • Prose voice is mostly X. Most generations are indistinguishable from pure X on writing quality.
  • Long character cards work natively. No fine-tuning means no overfitting on context length.

πŸ™ Credits

  • TheDrummer β€” for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative space.
  • Mistral AI β€” for the foundation both parents are built on.
  • Arcee AI β€” for mergekit and the SCE implementation.
  • FuseAI β€” for proving SCE preserves reasoning.

## 📄 License

Inherited from base: Mistral Research License (non-commercial use only).