Update README.md

randomize: 0.05 # Small randomization for exploratory strength, can lead to innovative fusions
dtype: bfloat16
```

## Background

This merge is an enhanced SLERP version of the DeepKarkhanis/NeuralPipe-7B-slerp repository.

Below is the step-by-step analysis, followed by the enhanced YAML configuration. The enhancements are based on best practices for SLERP in mergekit (e.g., refining the interpolation parameters for smoother fusion, adding stability options, and ensuring optimal blending that combines the efficiency of the base model with the reasoning strengths of the second model).

### Step-by-Step Analysis of the Original Configuration

1. Models involved (sources):
   - Base model: OpenPipe/mistral-ft-optimized-1218 (a fine-tuned Mistral-7B optimized for performance and efficiency).
   - Second model: mlabonne/NeuralHermes-2.5-Mistral-7B (a Hermes variant focused on advanced reasoning, instruction following, and natural-language capabilities).
   - Layer range: [0, 32] for both, which is correct for the Mistral-7B architecture (32 transformer layers) and allows full-model merging.

2. Merge method (slerp):
   - SLERP (Spherical Linear Interpolation) merges similar architectures by interpolating weights along a sphere rather than a straight line, preserving their norms and leading to more stable fusions than linear merges; see the formula sketch after this list.
   - base_model is set to the first source, so interpolation starts from it and moves toward the second model.

3. Parameters (t):
   - t controls the interpolation strength (0 = fully the base model, 1 = fully the second model).
   - Filter self_attn (attention layers): [0, 0.5, 0.3, 0.7, 1]. This creates a non-linear blend that starts conservative, dips, then ramps up toward the second model.
   - Filter mlp (feed-forward layers): [1, 0.5, 0.7, 0.3, 0]. The inverse pattern, starting heavy on the second model and then tapering off.
   - Default: 0.5 (an equal blend for all other parameters).
   - This setup allows layer-specific customization, which is good for targeting each model's strengths (e.g., attention for reasoning, MLP for computation); the full original configuration is sketched after this list.

4. dtype (bfloat16):
   - Uses brain floating-point 16 for efficiency and precision during merging; well suited to modern hardware.

5. Strengths and weaknesses:
   - Strengths: combines the optimization of the base model with the advanced reasoning of Hermes, potentially creating a strong hybrid for tasks like coding, chat, and analysis. The variable t schedule adds nuance and avoids uniform blending.
   - Weaknesses: the t lists are short (5 values), which might not provide smooth enough transitions across 32 layers and could lead to abrupt changes and suboptimal performance. There are no normalization or density parameters, which could improve stability, and no randomization for diversity. The goal of a "very advanced and strong" merge suggests finer control is needed for better benchmark scores (e.g., higher MMLU and HellaSwag).
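
For reference, this is the standard SLERP formula the merge method is built on, applied per weight tensor (a sketch of the underlying math only; mergekit's exact implementation may differ in details, e.g., it may fall back to linear interpolation when tensors are nearly parallel):

$$
\mathrm{slerp}(w_0, w_1; t) = \frac{\sin\big((1 - t)\,\theta\big)}{\sin\theta}\, w_0 + \frac{\sin(t\,\theta)}{\sin\theta}\, w_1,
\qquad
\cos\theta = \frac{w_0 \cdot w_1}{\lVert w_0 \rVert \, \lVert w_1 \rVert}
$$

where $w_0$ and $w_1$ are the corresponding weight tensors of the base and second model, $\theta$ is the angle between them, and $t$ is the interpolation factor described above.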
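
Based on the analysis above, the original NeuralPipe-7B-slerp configuration looks roughly as follows (reconstructed here for reference; the upstream repository remains the authoritative source):

```yaml
slices:
  - sources:
      - model: OpenPipe/mistral-ft-optimized-1218   # base: optimized for performance and efficiency
        layer_range: [0, 32]
      - model: mlabonne/NeuralHermes-2.5-Mistral-7B # second: advanced reasoning and instruction following
        layer_range: [0, 32]
merge_method: slerp
base_model: OpenPipe/mistral-ft-optimized-1218
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]   # attention layers
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]   # feed-forward layers
    - value: 0.5                     # default for everything else
dtype: bfloat16
```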

## Enhanced YAML Configuration

The configuration is enhanced by:

- Expanding the t lists to 7 values for smoother, more adaptive interpolation across layers (e.g., sinusoidal patterns to gradually emphasize strengths).
- Adding a normalize parameter (true) to stabilize weight norms post-merge, reducing artifacts.
- Introducing density parameters to control how much of each model's "mass" is retained, biased toward the stronger reasoning model.
- Adding a small randomization factor for exploratory merging, which can lead to unexpectedly strong variants in practice.
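
A minimal sketch of what such an enhanced parameters block might look like is shown below. The t lists, the density value, and the exact layout are illustrative only; the configuration shown earlier in this README is the one actually used.

```yaml
parameters:
  t:
    - filter: self_attn
      # 7 values for a smoother, roughly sinusoidal ramp toward the reasoning model (illustrative)
      value: [0, 0.3, 0.5, 0.7, 0.5, 0.8, 1]
    - filter: mlp
      # inverse pattern for the feed-forward layers (illustrative)
      value: [1, 0.7, 0.5, 0.3, 0.5, 0.2, 0]
    - value: 0.5
  normalize: true   # stabilize weight norms post-merge
  density: 0.6      # how much of each model's "mass" is retained (illustrative value)
  randomize: 0.05   # small randomization for exploratory strength
dtype: bfloat16
```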
## 💻 Usage