tacodevs committed · Commit 6fa996b · verified · 1 Parent(s): bbdd938

Update model card

Files changed (1): README.md +84 -43

README.md CHANGED
@@ -23,30 +23,43 @@ pipeline_tag: text-generation
 <div align="center">

- # Behemoth-X-R1-123B

- ### Behemoth-X's prose voice meets Behemoth-R1's thinking mind.

- *An SCE merge of TheDrummer's two flagship 123B Mistral Large fine-tunes.*

 </div>

 ---

- ## What is this?

- Behemoth-X-R1-123B is a 55/45 SCE merge of:

- - **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)** — the top-rated creative writing model on the [UGI Leaderboard](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard), known for distinctive prose voice and deep character work.
- - **[TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)** — Behemoth-X's reasoning sibling, trained to emit structured `<think>` blocks before responding.

- The goal: a single model that writes like X and thinks like R1. No additional training, no LoRA — just principled weight arithmetic using the SCE merge method that FuseAI used to preserve reasoning in their [FuseO1-DeepSeekR1-Qwen2.5-Instruct-32B-Preview](https://huggingface.co/FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Instruct-32B-Preview).

- ## How it was made

- **Method:** [SCE (Select, Calculate, Erase)](https://arxiv.org/abs/2408.07990) — a variance-aware merge that uses matrix-level selection and sign consensus to preserve capability-bearing deltas across input models. Unlike TIES, SCE does not prune by density, which tends to preserve fragile behavioral traits like structured thinking.

- **Config:**
 ```yaml
 models:
   - model: TheDrummer/Behemoth-X-123B-v2
@@ -62,28 +75,31 @@ parameters:
 dtype: bfloat16
 ```

- **Why 55/45?** Slight lean toward X for prose quality while giving R1 enough weight to carry its thinking behavior across. Both models share the same base (`mistralai/Mistral-Large-Instruct-2411`), the same tokenizer (verified identical SHA256), and the same training lineage — ideal conditions for a merge.

- **Why `select_topk: 1.0`?** Keep all deltas. Let SCE's variance + sign consensus do the selection, following the FuseO1 precedent. Reasoning behavior is encoded in many small parameter shifts — aggressive pruning (density < 0.8) tends to dilute it.

- ## Prompt Format

- Uses Mistral v7 template (same as both parents):

 ```
- [SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT][INST]{user_message}[/INST]{assistant_response}</s>
 ```

 ### To trigger thinking

- Prefill the assistant turn with a `<think>` block. The model will continue the thinking, close the tag, and produce its response:

 ```
 [INST]your message[/INST]<think>
- {optional seed phrase}
 ```

- Example prefills from the [Telegai](https://telegai.com) edge function:

 ```
 <think>
@@ -98,17 +114,31 @@ Ok i need to think as a creative writer — what twist would surprise here?
 Let me find an engaging new direction nobody saw coming, so
 ```

- The model reads the prefill, continues in the same stream-of-consciousness style, closes `</think>`, and writes the narrative.

 ### Without thinking

- Skip the prefill and use it like any other Mistral-v7 model. It behaves close to pure Behemoth-X.

- ## Recommended Samplers

- Start with Behemoth-X's recommended settings — the merge inherits most of X's prose tuning. Lower temperature (0.6-0.8) works better when thinking is enabled, since the thinking block benefits from more deterministic reasoning.

- ## Usage with vLLM

 ```bash
 python -m vllm.entrypoints.openai.api_server \
@@ -119,32 +149,43 @@ python -m vllm.entrypoints.openai.api_server \
 --trust-remote-code
 ```

- For single-GPU inference, use one of the quantized variants (FP8 / AWQ / GPTQ) — see the collection.

- ## Lineage

 ```
- Mistral-Large-Instruct-2411 (123B, Mistral AI)
- ├─ TheDrummer/Behemoth-X-123B-v2 (creative writing)
- └─ TheDrummer/Behemoth-R1-123B-v2 (reasoning)
-    └─ tacodevs/Behemoth-X-R1-123B (SCE merge, this model)
 ```

- ## Known Behaviors

- - **`<think>` block triggers on prefill.** The merge inherits R1's thinking circuit, but like R1 it doesn't reliably self-inject the tag — you need to prefill it.
- - **Thinking style is R1-derived.** Structured, bullet-ish, character-aware. Not the flowing pre-writing style of Opus or Grok. If you want literary author-planning thinking, that's a follow-up fine-tune target.
- - **Prose voice leans X.** The 55% X weight dominates prose style; most generations are indistinguishable from pure X on writing quality.
- - **Long character cards work.** Unlike `Behemoth-OpusX-123B` (our earlier LoRA experiment, which broke on 4k+ token system prompts), the merge handles long prompts natively since no new behavior was taught via fine-tuning.

- ## Credits

- - **[TheDrummer](https://huggingface.co/TheDrummer)** — for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative/RP space.
- - **[Mistral AI](https://huggingface.co/mistralai)** — for Mistral-Large-Instruct-2411, the foundation both parents are built on.
- - **[Arcee AI / mergekit team](https://github.com/arcee-ai/mergekit)** — for the SCE implementation.
- - **[FuseAI](https://huggingface.co/FuseAI)** — for validating the SCE-reasoning-merge approach with FuseO1.
- - Merged by [tacodevs](https://huggingface.co/tacodevs) / [Telegai](https://telegai.com).

- ## License

- Inherited from base model: **[Mistral Research License](https://mistral.ai/licenses/MRL-0.1.md)** — non-commercial use only.
 
 <div align="center">

+ # 🧠 Behemoth-X-R1-123B

+ ### *A thinking beast that writes like a poet.*

+ **An SCE merge of Behemoth-X and Behemoth-R1 — prose voice meets reasoning mind in one 123B parameter model.**
+
+ [![Base](https://img.shields.io/badge/base-Mistral_Large_2411-orange)](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)
+ [![Method](https://img.shields.io/badge/merge-SCE-purple)](https://arxiv.org/abs/2408.07990)
+ [![Size](https://img.shields.io/badge/params-123B-red)]()
+ [![Context](https://img.shields.io/badge/ctx-131k-blue)]()

 </div>

 ---

+ ## ⚡ What makes this different
+
+ Most "thinking" models sacrifice prose for reasoning. Most creative models can't think their way out of a scene. **Behemoth-X-R1 doesn't compromise** — it carries the distinctive voice and character depth of Behemoth-X into a model that can open a `<think>` tag and actually use it.

+ No LoRA. No retraining. Just **principled weight arithmetic** using the same SCE merge recipe that FuseAI used to preserve reasoning in [FuseO1](https://huggingface.co/FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Instruct-32B-Preview).

+ **The parents:**
+ - **[Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)** — the top-rated creative writer on the [UGI Leaderboard](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard). Character voice, prose density, the reason people run 123B at home.
+ - **[Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)** — Behemoth's reasoning sibling. Knows when to open `<think>`, knows when to close it.

+ **The child:** both, at once.
+
+ ---

+ ## 🧬 How it was made

+ **Method:** [SCE — Select, Calculate, Erase](https://arxiv.org/abs/2408.07990)
+
+ Unlike TIES or DARE, SCE doesn't prune deltas by density. It uses **variance-aware matrix-level selection with sign consensus** — meaning capability-bearing weight updates survive the merge even when they're small and diffuse. That matters here because reasoning is a *behavioral* trait encoded across many tiny parameter shifts, not a knowledge trait concentrated in a few big ones.
+
+ **The recipe:**

 ```yaml
 models:
   - model: TheDrummer/Behemoth-X-123B-v2

 dtype: bfloat16
 ```

+ **Why these numbers?**
+
+ - **55/45** — Slight lean toward X for prose quality while giving R1 enough mass to keep its thinking circuit intact. Both parents share the same base, same tokenizer (verified identical SHA256), and the same training lineage — ideal merge conditions.
+ - **`select_topk: 1.0`** — Keep all the deltas. Let variance + sign consensus do the work. This is the FuseO1 setting, validated empirically on reasoning merges.
+
+ ---

+ ## 📜 Prompt Format

+ Standard **Mistral v7**, same as both parents:

 ```
+ [SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]{assistant}</s>
 ```

 ### To trigger thinking

+ Prefill the assistant turn with a `<think>` block. The model will continue your prefill, close the tag, and drop into the narrative:

 ```
 [INST]your message[/INST]<think>
+ {seed phrase}
 ```

+ **Prefill examples that work well:**

 ```
 <think>

 Let me find an engaging new direction nobody saw coming, so
 ```

+ ```
+ <think>
+ Ok i need to think as an unhinged author — raw, explicit, intense, fully in
+ character with no holding back, so
+ ```
+
+ The model inherits R1's thinking circuit but shares R1's preference for being prefilled rather than self-triggering. Seed the tag, let it cook.
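The prefill mechanics can be sketched as a small helper. This is illustrative only (`build_prompt` and its defaults are made up here); the template string itself comes from the format shown above:

```python
def build_prompt(user: str, system: str = "", think_seed: str = "") -> str:
    """Assemble a Mistral v7 turn, optionally prefilling an open <think> block."""
    prompt = ""
    if system:
        prompt += f"[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT]"
    prompt += f"[INST]{user}[/INST]"
    if think_seed:
        # Leave the tag open: the model continues the seed, then closes </think>.
        prompt += f"<think>\n{think_seed}\n"
    return prompt

# A thinking-mode prompt, seeded like the examples above:
p = build_prompt("your message", think_seed="Ok i need to think as a creative writer")
# p == "[INST]your message[/INST]<think>\nOk i need to think as a creative writer\n"
```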

 ### Without thinking

+ Skip the prefill. It behaves close to pure Behemoth-X.
+
+ ---

+ ## 🎚️ Recommended Samplers

+ Start with **Behemoth-X's** recommended settings — the merge leans heavily on X's prose tuning.

+ For thinking mode, drop temperature to **0.6–0.8**. The `<think>` block benefits from more deterministic reasoning; high temperature scrambles the structure.
+
+ ---
+
+ ## 🚀 Usage
+
+ ### vLLM

 ```bash
 python -m vllm.entrypoints.openai.api_server \

 --trust-remote-code
 ```
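Once the server is up, thinking mode is just an assistant prefill on the raw completions endpoint. A hedged sketch of the request body (the helper name, sampling values, and served model name are illustrative, not part of vLLM's API):

```python
def thinking_completion_body(user_message: str, seed: str) -> dict:
    """Request body for an OpenAI-compatible /v1/completions call with a
    prefilled <think> block; the model continues the seed before replying."""
    return {
        "model": "tacodevs/Behemoth-X-R1-123B",  # whatever name the server registered
        "prompt": f"[INST]{user_message}[/INST]<think>\n{seed}\n",
        "max_tokens": 1024,
        "temperature": 0.7,  # lower range recommended when thinking is on
        "stop": ["</s>"],
    }

body = thinking_completion_body("your message", "Let me find an engaging new direction")
```

POST this dict as JSON to the server's `/v1/completions` route with any HTTP client.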

+ ### Single-GPU inference
+
+ Grab one of the quantized variants:
+ - **FP8** — ~123 GB, fits on 1x H200, near-lossless quality
+ - **AWQ / GPTQ W4A16** — ~65 GB, fits on 1x H100, slight quality tradeoff
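Those sizes are consistent with back-of-envelope weight-only arithmetic (KV cache and activations come on top; the 4.25 effective bits per weight is an assumed estimate covering 4-bit weights plus group-wise quantization scales):

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N = 123e9  # 123B parameters
print(round(weight_gb(N, 8)))     # FP8: 123 GB
print(round(weight_gb(N, 4.25)))  # ~4-bit weights + scales: ~65 GB
```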
+
+ ---

+ ## 🧱 Lineage

 ```
+ Mistral-Large-Instruct-2411 (Mistral AI)
+ ├─ Behemoth-X-123B-v2 (TheDrummer)   ← the voice
+ └─ Behemoth-R1-123B-v2 (TheDrummer)  ← the mind
+    └─ Behemoth-X-R1-123B             ← the merge
 ```

+ ---
+
+ ## 🔍 Known behaviors
+
+ - **`<think>` triggers on prefill, not spontaneously.** Inherited from R1. Seed the tag.
+ - **Thinking style is R1-derived** — structured, character-aware, useful but not floaty. If you want Opus-style literary pre-writing, that's a follow-up fine-tune target, not something this merge gives you for free.
+ - **Prose voice is mostly X.** Most generations are indistinguishable from pure X on writing quality.
+ - **Long character cards work natively.** No fine-tuning means no overfitting on context length.

+ ---
+
+ ## 🙏 Credits

+ - **[TheDrummer](https://huggingface.co/TheDrummer)** — for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative space.
+ - **[Mistral AI](https://huggingface.co/mistralai)** — for the foundation both parents are built on.
+ - **[Arcee AI](https://github.com/arcee-ai/mergekit)** — for mergekit and the SCE implementation.
+ - **[FuseAI](https://huggingface.co/FuseAI)** — for proving SCE preserves reasoning.

+ ---

+ ## 📄 License

+ Inherited from base: **[Mistral Research License](https://mistral.ai/licenses/MRL-0.1.md)** — non-commercial use only.