tacodevs committed · Commit 6fa996b · verified · 1 Parent(s): bbdd938

Update model card

Files changed (1): README.md +84 -43

README.md CHANGED
@@ -23,30 +23,43 @@ pipeline_tag: text-generation
 <div align="center">

- # Behemoth-X-R1-123B

- ### Behemoth-X's prose voice meets Behemoth-R1's thinking mind.

- *An SCE merge of TheDrummer's two flagship 123B Mistral Large fine-tunes.*

 </div>

 ---

- ## What is this?

- Behemoth-X-R1-123B is a 55/45 SCE merge of:

- - **[TheDrummer/Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)** — the top-rated creative writing model on the [UGI Leaderboard](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard), known for distinctive prose voice and deep character work.
- - **[TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)** — Behemoth-X's reasoning sibling, trained to emit structured `<think>` blocks before responding.

- The goal: a single model that writes like X and thinks like R1. No additional training, no LoRA — just principled weight arithmetic using the SCE merge method that FuseAI used to preserve reasoning in their [FuseO1-DeepSeekR1-Qwen2.5-Instruct-32B-Preview](https://huggingface.co/FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Instruct-32B-Preview).

- ## How it was made

- **Method:** [SCE (Select, Calculate, Erase)](https://arxiv.org/abs/2408.07990) — a variance-aware merge that uses matrix-level selection and sign consensus to preserve capability-bearing deltas across input models. Unlike TIES, SCE does not prune by density, which tends to preserve fragile behavioral traits like structured thinking.

- **Config:**
 ```yaml
 models:
   - model: TheDrummer/Behemoth-X-123B-v2
@@ -62,28 +75,31 @@ parameters:
 dtype: bfloat16
 ```

- **Why 55/45?** Slight lean toward X for prose quality while giving R1 enough weight to carry its thinking behavior across. Both models share the same base (`mistralai/Mistral-Large-Instruct-2411`), the same tokenizer (verified identical SHA256), and the same training lineage — ideal conditions for a merge.

- **Why `select_topk: 1.0`?** Keep all deltas. Let SCE's variance + sign consensus do the selection, following the FuseO1 precedent. Reasoning behavior is encoded in many small parameter shifts — aggressive pruning (density < 0.8) tends to dilute it.

- ## Prompt Format

- Uses Mistral v7 template (same as both parents):

 ```
- [SYSTEM_PROMPT]{system_prompt}[/SYSTEM_PROMPT][INST]{user_message}[/INST]{assistant_response}</s>
 ```

 ### To trigger thinking

- Prefill the assistant turn with a `<think>` block. The model will continue the thinking, close the tag, and produce its response:

 ```
 [INST]your message[/INST]<think>
- {optional seed phrase}
 ```

- Example prefills from the [Telegai](https://telegai.com) edge function:

 ```
 <think>
@@ -98,17 +114,31 @@ Ok i need to think as a creative writer — what twist would surprise here?
 Let me find an engaging new direction nobody saw coming, so
 ```

- The model reads the prefill, continues in the same stream-of-consciousness style, closes `</think>`, and writes the narrative.

 ### Without thinking

- Skip the prefill and use it like any other Mistral-v7 model. It behaves close to pure Behemoth-X.

- ## Recommended Samplers

- Start with Behemoth-X's recommended settings — the merge inherits most of X's prose tuning. Lower temperature (0.6-0.8) works better when thinking is enabled, since the thinking block benefits from more deterministic reasoning.

- ## Usage with vLLM

 ```bash
 python -m vllm.entrypoints.openai.api_server \
@@ -119,32 +149,43 @@ python -m vllm.entrypoints.openai.api_server \
 --trust-remote-code
 ```

- For single-GPU inference, use one of the quantized variants (FP8 / AWQ / GPTQ) — see the collection.

- ## Lineage

 ```
- Mistral-Large-Instruct-2411 (123B, Mistral AI)
- ├─ TheDrummer/Behemoth-X-123B-v2 (creative writing)
- └─ TheDrummer/Behemoth-R1-123B-v2 (reasoning)
-    └─ tacodevs/Behemoth-X-R1-123B (SCE merge, this model)
 ```

- ## Known Behaviors

- - **`<think>` block triggers on prefill.** The merge inherits R1's thinking circuit, but like R1 it doesn't reliably self-inject the tag — you need to prefill it.
- - **Thinking style is R1-derived.** Structured, bullet-ish, character-aware. Not the flowing pre-writing style of Opus or Grok. If you want literary author-planning thinking, that's a follow-up fine-tune target.
- - **Prose voice leans X.** The 55% X weight dominates prose style; most generations are indistinguishable from pure X on writing quality.
- - **Long character cards work.** Unlike `Behemoth-OpusX-123B` (our earlier LoRA experiment, which broke on 4k+ token system prompts), the merge handles long prompts natively since no new behavior was taught via fine-tuning.

- ## Credits

- - **[TheDrummer](https://huggingface.co/TheDrummer)** — for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative/RP space.
- - **[Mistral AI](https://huggingface.co/mistralai)** — for Mistral-Large-Instruct-2411, the foundation both parents are built on.
- - **[Arcee AI / mergekit team](https://github.com/arcee-ai/mergekit)** — for the SCE implementation.
- - **[FuseAI](https://huggingface.co/FuseAI)** — for validating the SCE-reasoning-merge approach with FuseO1.
- - Merged by [tacodevs](https://huggingface.co/tacodevs) / [Telegai](https://telegai.com).

- ## License

- Inherited from base model: **[Mistral Research License](https://mistral.ai/licenses/MRL-0.1.md)** — non-commercial use only.
 
 <div align="center">

+ # 🧠 Behemoth-X-R1-123B

+ ### *A thinking beast that writes like a poet.*

+ **An SCE merge of Behemoth-X and Behemoth-R1 — prose voice meets reasoning mind in one 123B parameter model.**
+
+ [![Base](https://img.shields.io/badge/base-Mistral_Large_2411-orange)](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411)
+ [![Method](https://img.shields.io/badge/merge-SCE-purple)](https://arxiv.org/abs/2408.07990)
+ [![Size](https://img.shields.io/badge/params-123B-red)]()
+ [![Context](https://img.shields.io/badge/ctx-131k-blue)]()

 </div>

 ---

+ ## ⚡ What makes this different
+
+ Most "thinking" models sacrifice prose for reasoning. Most creative models can't think their way out of a scene. **Behemoth-X-R1 doesn't compromise** — it carries the distinctive voice and character depth of Behemoth-X into a model that can open a `<think>` tag and actually use it.

+ No LoRA. No retraining. Just **principled weight arithmetic** using the same SCE merge recipe that FuseAI used to preserve reasoning in [FuseO1](https://huggingface.co/FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Instruct-32B-Preview).

+ **The parents:**
+ - **[Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)** — the top-rated creative writer on the [UGI Leaderboard](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard). Character voice, prose density, the reason people run 123B at home.
+ - **[Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)** — Behemoth's reasoning sibling. Knows when to open `<think>`, knows when to close it.

+ **The child:** both, at once.
+
+ ---

+ ## 🧬 How it was made

+ **Method:** [SCE — Select, Calculate, Erase](https://arxiv.org/abs/2408.07990)
+
+ Unlike TIES or DARE, SCE doesn't prune deltas by density. It uses **variance-aware matrix-level selection with sign consensus** — meaning capability-bearing weight updates survive the merge even when they're small and diffuse. That matters here because reasoning is a *behavioral* trait encoded across many tiny parameter shifts, not a knowledge trait concentrated in a few big ones.
+
+ **The recipe:**

 ```yaml
 models:
   - model: TheDrummer/Behemoth-X-123B-v2

 dtype: bfloat16
 ```

+ **Why these numbers?**
+
+ - **55/45** — Slight lean toward X for prose quality while giving R1 enough mass to keep its thinking circuit intact. Both parents share the same base, same tokenizer (verified identical SHA256), and the same training lineage — ideal merge conditions.
+ - **`select_topk: 1.0`** — Keep all the deltas. Let variance + sign consensus do the work. This is the FuseO1 setting, validated empirically on reasoning merges.
+
+ ---

+ ## 📜 Prompt Format

+ Standard **Mistral v7**, same as both parents:

 ```
+ [SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]{assistant}</s>
 ```

 ### To trigger thinking

+ Prefill the assistant turn with a `<think>` block. The model will continue your prefill, close the tag, and drop into the narrative:

 ```
 [INST]your message[/INST]<think>
+ {seed phrase}
 ```

+ **Prefill examples that work well:**

 ```
 <think>

 Let me find an engaging new direction nobody saw coming, so
 ```

+ ```
+ <think>
+ Ok i need to think as an unhinged author — raw, explicit, intense, fully in
+ character with no holding back, so
+ ```
+
+ The model inherits R1's thinking circuit but shares R1's preference for being prefilled rather than self-triggering. Seed the tag, let it cook.
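The prefill mechanics can be sketched as a small helper. This is illustrative only (`build_prompt` and its defaults are made up here); the template string itself comes from the format shown above:

```python
def build_prompt(user: str, system: str = "", think_seed: str = "") -> str:
    """Assemble a Mistral v7 turn, optionally prefilling an open <think> block."""
    prompt = ""
    if system:
        prompt += f"[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT]"
    prompt += f"[INST]{user}[/INST]"
    if think_seed:
        # Leave the tag open: the model continues the seed, then closes </think>.
        prompt += f"<think>\n{think_seed}\n"
    return prompt

# A thinking-mode prompt, seeded like the examples above:
p = build_prompt("your message", think_seed="Ok i need to think as a creative writer")
# p == "[INST]your message[/INST]<think>\nOk i need to think as a creative writer\n"
```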

 ### Without thinking

+ Skip the prefill. It behaves close to pure Behemoth-X.
+
+ ---

+ ## 🎚️ Recommended Samplers

+ Start with **Behemoth-X's** recommended settings — the merge leans heavily on X's prose tuning.

+ For thinking mode, drop temperature to **0.6–0.8**. The `<think>` block benefits from more deterministic reasoning; high temperature scrambles the structure.
+
+ ---
+
+ ## 🚀 Usage
+
+ ### vLLM

 ```bash
 python -m vllm.entrypoints.openai.api_server \

 --trust-remote-code
 ```
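Once the server is up, thinking mode is just an assistant prefill on the raw completions endpoint. A hedged sketch of the request body (the helper name, sampling values, and served model name are illustrative, not part of vLLM's API):

```python
def thinking_completion_body(user_message: str, seed: str) -> dict:
    """Request body for an OpenAI-compatible /v1/completions call with a
    prefilled <think> block; the model continues the seed before replying."""
    return {
        "model": "tacodevs/Behemoth-X-R1-123B",  # whatever name the server registered
        "prompt": f"[INST]{user_message}[/INST]<think>\n{seed}\n",
        "max_tokens": 1024,
        "temperature": 0.7,  # lower range recommended when thinking is on
        "stop": ["</s>"],
    }

body = thinking_completion_body("your message", "Let me find an engaging new direction")
```

POST this dict as JSON to the server's `/v1/completions` route with any HTTP client.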

+ ### Single-GPU inference
+
+ Grab one of the quantized variants:
+ - **FP8** — ~123 GB, fits on 1x H200, near-lossless quality
+ - **AWQ / GPTQ W4A16** — ~65 GB, fits on 1x H100, slight quality tradeoff
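Those sizes are consistent with back-of-envelope weight-only arithmetic (KV cache and activations come on top; the 4.25 effective bits per weight is an assumed estimate covering 4-bit weights plus group-wise quantization scales):

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N = 123e9  # 123B parameters
print(round(weight_gb(N, 8)))     # FP8: 123 GB
print(round(weight_gb(N, 4.25)))  # ~4-bit weights + scales: ~65 GB
```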
+
+ ---

+ ## 🧱 Lineage

 ```
+ Mistral-Large-Instruct-2411 (Mistral AI)
+ ├─ Behemoth-X-123B-v2 (TheDrummer)   ← the voice
+ └─ Behemoth-R1-123B-v2 (TheDrummer)  ← the mind
+    └─ Behemoth-X-R1-123B             ← the merge
 ```

+ ---
+
+ ## 🔍 Known behaviors
+
+ - **`<think>` triggers on prefill, not spontaneously.** Inherited from R1. Seed the tag.
+ - **Thinking style is R1-derived** — structured, character-aware, useful but not floaty. If you want Opus-style literary pre-writing, that's a follow-up fine-tune target, not something this merge gives you for free.
+ - **Prose voice is mostly X.** Most generations are indistinguishable from pure X on writing quality.
+ - **Long character cards work natively.** No fine-tuning means no overfitting on context length.

+ ---
+
+ ## 🙏 Credits

+ - **[TheDrummer](https://huggingface.co/TheDrummer)** — for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative space.
+ - **[Mistral AI](https://huggingface.co/mistralai)** — for the foundation both parents are built on.
+ - **[Arcee AI](https://github.com/arcee-ai/mergekit)** — for mergekit and the SCE implementation.
+ - **[FuseAI](https://huggingface.co/FuseAI)** — for proving SCE preserves reasoning.

+ ---

+ ## 📄 License

+ Inherited from base: **[Mistral Research License](https://mistral.ai/licenses/MRL-0.1.md)** — non-commercial use only.