Update README.md
README.md
@@ -18,7 +18,11 @@

# Merge Details
### Merge Method

This merge uses mergekit's passthrough method to expand the blocks of the `CorticalStack/pastiche-crown-clown-7b-dare-dpo` model. Every 5th layer of the expanded model is a newly added layer, with its `o_proj` and `down_proj` parameters initialized to zero, mirroring the approach used in LLaMA Pro.
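
The exact configuration used for this model is reproduced further down in this card. Purely as an illustration of the shape such a config takes (the layer ranges below are placeholder assumptions, not the values used here), a LLaMA-Pro-style passthrough expansion in mergekit looks roughly like this:

```
# Illustrative sketch only - the layer ranges are placeholders, not the actual config.
slices:
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [0, 4]        # first group of original layers
  - sources:
      - model: CorticalStack/pastiche-crown-clown-7b-dare-dpo
        layer_range: [3, 4]        # duplicate of the preceding layer becomes the new 5th layer
        parameters:
          scale:
            - filter: o_proj       # zero the attention output projection of the copy
              value: 0.0
            - filter: down_proj    # zero the MLP down projection of the copy
              value: 0.0
            - value: 1.0           # every other tensor keeps its original scale
  # ...further slices repeat this pattern for each group of layers...
merge_method: passthrough
dtype: bfloat16
```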

### It's important to note that this configuration has not undergone fine-tuning. Therefore, when fine-tuning, ensure that only every 5th layer is trainable, while all other layers remain frozen.

A helper that implements this freezing pattern is provided below under "Function to freeze layers".

### Models Merged

@@ -181,7 +185,47 @@ The following YAML configuration was used to produce this model:

```
merge_method: passthrough
dtype: bfloat16
```

# Function to freeze layers

```
from transformers import AutoModelForCausalLM


def enable_grad_only_every_nth(model, n):
    """
    Enable gradient computation only for every nth decoder layer (the newly added
    blocks) and freeze everything else, including the token embeddings and the
    lm_head. This lets a fine-tuning run update only the inserted layers while the
    pre-trained components keep their original weights.
    """
    # Freeze embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze lm_head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Enable gradients only for every nth layer.
    layers = model.model.layers  # ModuleList containing the decoder layers
    for index, layer in enumerate(layers):
        if (index + 1) % n == 0:  # for n = 5: layers 4, 9, 14, ... (0-indexed)
            for param in layer.parameters():
                param.requires_grad = True
        else:
            for param in layer.parameters():
                param.requires_grad = False


model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Mistral-7B-Instruct-v0.2-expanded"
)
# Update layer gradients; set n to match the expansion interval of your model.
n = 5
enable_grad_only_every_nth(model, n)
```
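
As a quick sanity check (a minimal sketch, assuming `model` has already been loaded and frozen as above), you can confirm that only the inserted layers remain trainable before starting a fine-tuning run:

```
# Sanity check: list the layer indices that remain trainable and count trainable parameters.
trainable_layers = [
    i for i, layer in enumerate(model.model.layers)
    if any(p.requires_grad for p in layer.parameters())
]
print(trainable_layers)  # expected: [4, 9, 14, ...] for n = 5

num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_trainable:,} trainable parameters")
```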