<div align="center">

# Behemoth-X-R1-123B

### *A thinking beast that writes like a poet.*

**An SCE merge of Behemoth-X and Behemoth-R1 – prose voice meets reasoning mind in one 123B-parameter model.**

[Base: Mistral-Large-Instruct-2411](https://huggingface.co/mistralai/Mistral-Large-Instruct-2411) · [SCE paper](https://arxiv.org/abs/2408.07990)

</div>

---

## What makes this different

Most "thinking" models sacrifice prose for reasoning. Most creative models can't think their way out of a scene. **Behemoth-X-R1 doesn't compromise** – it carries the distinctive voice and character depth of Behemoth-X into a model that can open a `<think>` tag and actually use it.

No LoRA. No retraining. Just **principled weight arithmetic**, using the same SCE merge recipe that FuseAI used to preserve reasoning in [FuseO1](https://huggingface.co/FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Instruct-32B-Preview).

**The parents:**

- **[Behemoth-X-123B-v2](https://huggingface.co/TheDrummer/Behemoth-X-123B-v2)** – the top-rated creative writer on the [UGI Leaderboard](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard). Character voice, prose density, the reason people run 123B at home.
- **[Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2)** – Behemoth's reasoning sibling. Knows when to open `<think>`, and when to close it.

**The child:** both, at once.

---

## How it was made

**Method:** [SCE (Select, Calculate, Erase)](https://arxiv.org/abs/2408.07990)

Unlike TIES or DARE, SCE doesn't prune deltas by density. It uses **variance-aware, matrix-level selection with sign consensus** – meaning capability-bearing weight updates survive the merge even when they're small and diffuse. That matters here because reasoning is a *behavioral* trait encoded across many tiny parameter shifts, not a knowledge trait concentrated in a few big ones.

**The recipe:**

```yaml
models:
  - model: TheDrummer/Behemoth-X-123B-v2
  # ... (second model and merge parameters omitted here)
dtype: bfloat16
```

**Why these numbers?**

- **55/45** – A slight lean toward X for prose quality, while giving R1 enough mass to keep its thinking circuit intact. Both parents share the same base, the same tokenizer (verified identical SHA-256), and the same training lineage – ideal merge conditions.
- **`select_topk: 1.0`** – Keep all the deltas and let variance plus sign consensus do the work. This is the FuseO1 setting, validated empirically on reasoning merges.
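
For intuition, the select / sign-consensus / erase steps can be sketched in a few lines of NumPy. This is a toy, element-wise illustration – a hypothetical `sce_merge` helper, not mergekit's actual implementation, which operates per weight matrix and handles many more cases:

```python
import numpy as np

def sce_merge(base, finetuned, weights, select_topk=1.0):
    """Toy element-wise sketch of SCE: Select entries by cross-model
    variance, Calculate a consensus sign, Erase minority-sign deltas."""
    deltas = np.stack([ft - base for ft in finetuned])   # task vectors

    # Select: keep the top-k fraction of entries by variance across models.
    var = deltas.var(axis=0)
    if select_topk < 1.0:
        k = max(1, int(np.ceil(select_topk * var.size)))
        thresh = np.sort(var.ravel())[-k]
        mask = var >= thresh
    else:
        mask = np.ones_like(var, dtype=bool)             # 1.0 keeps everything

    # Calculate: the weighted sum fixes a consensus sign per entry.
    w = np.asarray(weights, dtype=float).reshape(-1, *([1] * base.ndim))
    weighted = w * deltas
    consensus = np.sign(weighted.sum(axis=0))

    # Erase: drop contributions whose sign disagrees with the consensus,
    # then renormalize the survivors by their total weight.
    agree = np.sign(deltas) == consensus
    kept_w = np.clip((w * agree).sum(axis=0), 1e-12, None)
    merged_delta = (weighted * agree).sum(axis=0) / kept_w

    return base + mask * merged_delta

base = np.zeros((2, 2))
x = np.array([[1.0, -1.0], [0.5, 0.0]])    # stand-in for Behemoth-X weights
r1 = np.array([[2.0, 1.0], [0.5, 0.0]])    # stand-in for Behemoth-R1 weights
merged = sce_merge(base, [x, r1], weights=[0.55, 0.45])
```

Where the two parents agree in sign, the merge blends them 55/45; where they disagree, the minority contribution is erased rather than averaged away.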

---

## Prompt Format

Standard **Mistral v7**, same as both parents:

```
[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][INST]{user}[/INST]{assistant}</s>
```
### To trigger thinking

Prefill the assistant turn with a `<think>` block. The model will continue your prefill, close the tag, and drop into the narrative:

```
[INST]your message[/INST]<think>
{seed phrase}
```

**Prefill examples that work well:**

```
<think>
Ok i need to think as a creative writer – what twist would surprise here?
Let me find an engaging new direction nobody saw coming, so
```

```
<think>
Ok i need to think as an unhinged author – raw, explicit, intense, fully in character with no holding back, so
```

The model inherits R1's thinking circuit but shares R1's preference for being prefilled rather than self-triggering. Seed the tag, let it cook.
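
In code, the prefill is just string assembly. A minimal sketch (the helper name and example strings are illustrative, not part of the model card):

```python
def build_prompt(user, system=None, think_seed=None):
    """Assemble a Mistral v7 prompt, optionally prefilling a <think> block."""
    prompt = ""
    if system:
        prompt += f"[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT]"
    prompt += f"[INST]{user}[/INST]"
    if think_seed:
        # Open the tag and seed it; the model continues the thought,
        # closes </think>, and drops into the narrative.
        prompt += f"<think>\n{think_seed}"
    return prompt

p = build_prompt(
    "Continue the scene.",
    system="You are a novelist.",
    think_seed="Ok i need to think as a creative writer",
)
```

Send the resulting string to a raw completions endpoint rather than a chat endpoint, so the server's chat template doesn't overwrite the prefill.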
### Without thinking

Skip the prefill. It behaves close to pure Behemoth-X.

---

## Recommended Samplers

Start with **Behemoth-X's** recommended settings – the merge leans heavily on X's prose tuning.

For thinking mode, drop temperature to **0.6–0.8**. The `<think>` block benefits from more deterministic reasoning; high temperature scrambles the structure.
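
As a concrete starting point, the guidance above can be captured as a small preset helper. This is a sketch: the model name is a placeholder for your deployment, and the non-thinking temperature is a stand-in (use Behemoth-X's published settings there):

```python
def sampler_preset(thinking: bool) -> dict:
    """OpenAI-style request parameters following the guidance above."""
    return {
        "model": "Behemoth-X-R1-123B",   # placeholder deployment name
        "max_tokens": 1024,
        # Lower temperature keeps the <think> block structured; outside
        # thinking mode, fall back to Behemoth-X's recommended settings
        # (1.0 here is only a stand-in).
        "temperature": 0.7 if thinking else 1.0,
    }

preset = sampler_preset(thinking=True)
```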

---

## Usage

### vLLM
```bash
python -m vllm.entrypoints.openai.api_server \
    ...
    --trust-remote-code
```

### Single-GPU inference

Grab one of the quantized variants:

- **FP8** – ~123 GB, fits on 1x H200, near-lossless quality
- **AWQ / GPTQ W4A16** – ~65 GB, fits on 1x H100, slight quality tradeoff
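
The sizes above follow from simple arithmetic over 123B parameters (weights only; KV cache and activations come on top):

```python
PARAMS = 123e9  # parameter count of the merge

def weights_gb(bits_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS * bits_per_param / 8 / 1e9

fp8_gb = weights_gb(8)   # FP8: one byte per weight -> ~123 GB
w4_gb = weights_gb(4)    # W4A16: half a byte per weight -> ~61.5 GB
```

Quantization metadata (scales and zero-points) pushes the 4-bit figure from ~61.5 GB toward the ~65 GB quoted above.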

---

## Lineage

```
Mistral-Large-Instruct-2411 (Mistral AI)
├─ Behemoth-X-123B-v2 (TheDrummer) – the voice
├─ Behemoth-R1-123B-v2 (TheDrummer) – the mind
└─ Behemoth-X-R1-123B – the merge
```

---

## Known behaviors

- **`<think>` triggers on prefill, not spontaneously.** Inherited from R1. Seed the tag.
- **Thinking style is R1-derived** – structured, character-aware, useful but not floaty. If you want Opus-style literary pre-writing, that's a follow-up fine-tune target, not something this merge gives you for free.
- **Prose voice is mostly X.** Most generations are indistinguishable from pure X on writing quality.
- **Long character cards work natively.** No fine-tuning means no overfitting on context length.

---

## Credits

- **[TheDrummer](https://huggingface.co/TheDrummer)** – for Behemoth-X and Behemoth-R1, the two best Mistral Large fine-tunes in the creative space.
- **[Mistral AI](https://huggingface.co/mistralai)** – for the foundation both parents are built on.
- **[Arcee AI](https://github.com/arcee-ai/mergekit)** – for mergekit and the SCE implementation.
- **[FuseAI](https://huggingface.co/FuseAI)** – for proving SCE preserves reasoning.

---

## License

Inherited from the base model: **[Mistral Research License](https://mistral.ai/licenses/MRL-0.1.md)** – non-commercial use only.