AlekseyCalvin commited on
Commit
a20e358
·
verified ·
1 Parent(s): cb05474

Upload AGENT.md

  1. AGENT.md +228 -0
AGENT.md ADDED
@@ -0,0 +1,228 @@
1
+ Below are some implementation pointers, plus pseudocode directions for how to organize the UI for tabs 5-7 and which options to add. These should be processed by a mergekit config launcher and run in a Gradio Spaces app on a CPU machine. It is up to you to correctly route them to the mergekit CLI, per the provided examples (such as https://huggingface.co/spaces/AlekseyCalvin/mergekit-soonr/blob/main/app.py), and to make it all work. For tab 8, the processing should be routed thru the mergekit-moe CLI rather than the mergekit-yaml CLI. For tabs 5-7, use mergekit-yaml. Make sure to implement memory-efficient model sharding, caching, processing, and output saving similar to tabs 1-4. MAKE SURE EVERYTHING CORRECTLY CORRESPONDS TO MERGEKIT CONFIGS!
2
+
3
+
4
+
5
+ As you may easily gather from https://github.com/arcee-ai/mergekit/blob/main/docs/merge_methods.md , the "Task Vector" family begins with the foundational merge_method: “task_arithmetic”; extends it via the TrIm, Elect Sign & Merge (aka TIES Merging) method, aka merge_method: “ties”; further extends it with the actual sparsifying methods – merge_method: “dare_ties” and merge_method: “dare_linear”; then further extends these with the Drop and rEscaLe via sampLing with mAgnitude (DELLA-Merging) methods w/ pruning, aka merge_method: “della” and merge_method: “della_linear”, along with the Breadcrumbs methods (which sculpturally sparsify merges by pruning outliers to reinforce task vectors), aka merge_method: “breadcrumbs” and merge_method: “breadcrumbs_ties”; and finally the SCE method from FuseChat, aka merge_method: “sce” (with SCE constituting a form of adaptive matrix/tensor-level merging: running each group of merged tensors through a tripartite cross-optimization process of variance-based masking, followed by task-vector weighting-factor calculation, and then sign-consensus alignment).
6
+
7
+ Besides the Interpolation and the Task Vector families, you more or less neglect the last category of MergeKit’s methods: the Specialized methods (model_stock, nearswap, arcee_fusion, passthrough, and technically also linear), which belong neither to the Spherical Interpolation family (slerp, nuslerp, multislerp, karcher) nor to the Task Vector family (task_arithmetic, ties, dare_ties, dare_linear, della, della_linear, breadcrumbs, breadcrumbs_ties, and sce).
8
+
9
+ Just as glaringly, you neglect to sufficiently consider the necessary range of actual key parameters one needs to implement each method. You make a crude attempt to partition parameters in a per-family way, but you grossly misjudge the range of actual parameters required to support even the few methods you do list.
10
+
11
+ For clarity, I have outlined all of the parameters for the merge method families and methods from https://github.com/arcee-ai/mergekit/blob/main/docs/merge_methods.md , as well as how they should be represented within the included tabs, starting with tab 5. The tabs should be named “Amphinterpolative” (for tab 5), “Stir/Tie Bases” (for tab 6), “Specious” (for tab 7), “MoEr” (for tab 8, the MoE mergekit implementation, wherein you must also reimplement prompts), “Rawer” (for tab 9, the raw PyTorch implementation), and “Mario,DARE!” (for tab 10, with the custom DARE implementation from https://huggingface.co/spaces/AlekseyCalvin/DARE-MERGE-SOONR/blob/main/app.py, https://huggingface.co/spaces/AlekseyCalvin/DARE-MERGE-SOONR/blob/main/merge.py , and https://huggingface.co/spaces/AlekseyCalvin/DARE-MERGE-SOONR/blob/main/hf_merge.py, plus https://github.com/martyn/safetensors-merge-supermario ; refine this implementation also, and reference all these sources).
12
+
13
+ Here are the precise parameters and interface directions for tabs 5 thru 7. Implement the below in accordance with the exact logic from https://github.com/arcee-ai/mergekit/tree/main/mergekit . Everything has to be formatted to work with the library! Do not just guess or make things up. Align everything precisely and double-check every bit.
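The routing described above can be sketched roughly as follows, assuming tabs 5-7 go through `mergekit-yaml` and tab 8 through `mergekit-moe`. The names `TAB_CLI` and `build_merge_command`, the file paths, and the default shard size are hypothetical. The `--lazy-unpickle` and `--out-shard-size` options are assumed to be available in the installed mergekit version; check `mergekit-yaml --help` before relying on them.

```python
import shlex

# Hypothetical routing table: tab number -> (tab name, mergekit CLI entry point).
TAB_CLI = {
    5: ("Amphinterpolative", "mergekit-yaml"),
    6: ("Stir/Tie Bases", "mergekit-yaml"),
    7: ("Specious", "mergekit-yaml"),
    8: ("MoEr", "mergekit-moe"),
}


def build_merge_command(tab, config_path, out_dir, shard_size="1B"):
    """Return the argv list for the CLI that handles the given tab."""
    if tab not in TAB_CLI:
        raise ValueError(f"tab {tab} is not routed through a mergekit CLI")
    _, cli = TAB_CLI[tab]
    cmd = [cli, config_path, out_dir]
    if cli == "mergekit-yaml":
        # Assumed flags for low-memory CPU merging; confirm them with --help.
        cmd += ["--lazy-unpickle", "--out-shard-size", shard_size]
    return cmd


if __name__ == "__main__":
    print(shlex.join(build_merge_command(5, "config.yaml", "merged")))
```

Returning an argv list (rather than a shell string) lets the app pass it straight to `subprocess.run` without quoting issues in repo names or paths.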
14
+
15
+ 1) Spherical Interpolation-type methods family (or, per my new category syntax, “Amphinterpolative”):
16
+
17
+ “merge_method”:
18
+ a) “slerp” = Takes 2 models. One of them must be specified as “base_model”.
19
+
20
+ b) “nuslerp” = Requires exactly 2 models. A “base_model” can optionally be provided (it must be distinct from the two main models).
21
+
22
+ c) “multislerp” = Takes 2 or more models. A “base_model” can optionally be provided to operate in task vector space.
23
+
24
+ d) “karcher” = Takes 2 or more models. No “base_model” is used.
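The model-count and base_model rules in the list above could be enforced before a config is ever written. This is a minimal sketch; `AMPHINTERPOLATIVE_RULES` and `check_amphinterpolative` are hypothetical names, and the rules are transcribed from the list rather than read from mergekit itself.

```python
# (min models, max models, base_model policy), transcribed from the list above.
AMPHINTERPOLATIVE_RULES = {
    "slerp": (2, 2, "required"),
    "nuslerp": (2, 2, "optional"),
    "multislerp": (2, None, "optional"),
    "karcher": (2, None, "unused"),
}


def check_amphinterpolative(method, models, base_model=None):
    """Return a list of problems; an empty list means the inputs look valid."""
    if method not in AMPHINTERPOLATIVE_RULES:
        return [f"unknown method: {method}"]
    lo, hi, base_policy = AMPHINTERPOLATIVE_RULES[method]
    models = [m for m in models if m]  # ignore empty form fields
    problems = []
    if len(models) < lo:
        problems.append(f"{method} needs at least {lo} models")
    if hi is not None and len(models) > hi:
        problems.append(f"{method} takes at most {hi} models")
    if base_policy == "required" and base_model not in models:
        problems.append(f"{method} needs base_model set to one of its models")
    if base_policy == "unused" and base_model:
        problems.append(f"{method} does not use base_model")
    if method == "nuslerp" and base_model and base_model in models:
        problems.append("nuslerp base_model must differ from the two main models")
    return problems
```

Surfacing these messages in the UI before launching the CLI avoids spending a download on a merge that mergekit would reject.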
25
+
26
+
27
+ ## Amphinterpolative (Spherical Interpolation-type method family) Parameters Interface:
28
+
29
+ Amphinterpolative Tab GLOBAL PARAMETERS:
30
+ “base_model” (form designating the repo of the base model; mandatory for the “slerp” method, optional for “nuslerp” and “multislerp”, not used by “karcher”)
31
+
32
+ “t” (global, float between 0 and 1.0; this is the interpolation factor between the base model and the other model: t=0 yields the base_model, t=1 yields the other model)
33
+
34
+ "tokenizer_source" ["base", "union", "model:path"] default should be "base"
35
+
36
+ “normalize_weights” (boolean true/false checkmark switch, default is “true”/checked)
37
+
38
+ “int8_mask” (global, boolean true/false checkmark switch, default is “false”/unchecked)
39
+
40
+ “nuslerp_flatten” (global, boolean true/false checkmark switch, designated for the “nuslerp” method, default is “true”/checked; if “false”, performs row/column-wise interpolation)
41
+
42
+ “nuslerp_row_wise” (global, boolean true/false checkmark switch, designated for the “nuslerp” method, default is “false”/unchecked; if “true” and “nuslerp_flatten” is “false”, SLERPs row vectors instead of column vectors)
43
+
44
+ “eps” (global, designated for the “multislerp” method, a small numerical constant, default is "1e-8”)
45
+
46
+ “max_iter” (global, designated for the “karcher” method, an integer value, default is “10”, refers to maximum iterations for the Karcher mean algorithm)
47
+
48
+ “tol” (global, designated for the “karcher” method, a small numerical constant, default is “1e-5”, refers to the convergence tolerance for the Karcher mean)
49
+
50
+ Also add the same sharding logic from tabs 1-4.
51
+ As well as the float/precision selection per the mergekit-yaml config.
52
+ Also add option for chat templates from the mergekit config.
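One way to turn the global fields above into a config is to keep a per-method allow-list, so a value left over in the form for another method never reaches the config. This is a sketch under assumptions: `METHOD_GLOBALS` and `build_amphinterpolative_config` are hypothetical names, the `bfloat16` fallback is an arbitrary choice, and tying `normalize_weights` to `multislerp` is an assumption the list above does not state. The result is a plain dict meant to be serialized to YAML for `mergekit-yaml`.

```python
# Globals each method reads, per the list above. Assumption: "normalize_weights"
# belongs to "multislerp"; the list itself does not tie it to a method.
METHOD_GLOBALS = {
    "slerp": {"t"},
    "nuslerp": {"nuslerp_flatten", "nuslerp_row_wise"},
    "multislerp": {"normalize_weights", "eps"},
    "karcher": {"max_iter", "tol"},
}


def build_amphinterpolative_config(method, model_entries, form, base_model=None):
    """Assemble a config dict from the tab's form values.

    model_entries: list of {"model": repo, "parameters": {...}} dicts.
    form: dict holding every global field on the tab, keyed by parameter name.
    """
    # "int8_mask" is listed without a method restriction, so it is always kept.
    allowed = METHOD_GLOBALS[method] | {"int8_mask"}
    config = {
        "merge_method": method,
        "models": model_entries,
        "parameters": {k: v for k, v in form.items() if k in allowed},
        "dtype": form.get("dtype", "bfloat16"),  # assumed fallback precision
        "tokenizer_source": form.get("tokenizer_source", "base"),
    }
    if base_model and method != "karcher":  # karcher uses no base_model
        config["base_model"] = base_model
    if form.get("chat_template"):
        config["chat_template"] = form["chat_template"]
    return config
```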
53
+
54
+
55
+ Amphinterpolative Tab Per-Model Parameters:
56
+ MODEL 1:
57
+ "model 1” (form input for repo name of model #1)
58
+
59
+ "weight (model 1)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
60
+
61
+ MODEL 2:
62
+ "model 2” (form input for repo name of model #2)
63
+
64
+ "weight (model 2)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
65
+
66
+ MODEL 3 (only for ”multislerp” & “karcher”, which accept more than 2 models):
67
+ "model 3” (form input for repo name of model #3)
68
+
69
+ "weight (model 3)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
70
+
71
+ MODEL 4 (only for ”multislerp” & “karcher”, which accept more than 2 models):
72
+ "model 4” (form input for repo name of model #4)
73
+
74
+ "weight (model 4)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
75
+
76
+ MODEL 5 (only for ”multislerp” & “karcher”, which accept more than 2 models):
77
+ "model 5” (form input for repo name of model #5)
78
+
79
+ "weight (model 5)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
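Each weight field above accepts either a single float or a bracketed gradient list, so the form text needs parsing before it goes into the config. A minimal sketch, with `parse_scalar_or_gradient` as a hypothetical helper name and the 0 to 1.0 range check taken from the field descriptions:

```python
import json


def parse_scalar_or_gradient(text, default=1.0):
    """Parse a form field holding a float ("0.76") or a list ("[0, 0.3, 0.7, 1]")."""
    text = (text or "").strip()
    if not text:
        return default  # empty field falls back to the documented default
    try:
        values, is_list = [float(text)], False
    except ValueError:
        parsed = json.loads(text)  # raises ValueError on malformed input
        if not isinstance(parsed, list) or not parsed:
            raise ValueError("expected a number or a non-empty list of numbers")
        if not all(isinstance(v, (int, float)) for v in parsed):
            raise ValueError("gradient lists may contain only numbers")
        values, is_list = [float(v) for v in parsed], True
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("values must lie between 0 and 1.0")
    return values if is_list else values[0]
```

The same helper would also fit the “density” fields on the Task Vector tab, since they use the same float-or-list syntax.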
80
+
81
+
82
+ 2) Task Vector-type methods family (or, per my new category syntax, “Stir/Tie Bases”), further detailed in https://github.com/arcee-ai/mergekit/blob/main/mergekit/merge_methods/generalized_task_arithmetic.py and in https://github.com/arcee-ai/mergekit/blob/main/docs/merge_methods.md, with an example in https://github.com/arcee-ai/mergekit/blob/main/examples/ties.yml
83
+
84
+ “merge_method”:
85
+ a) “task_arithmetic” = Requires a “base_model” plus one or more other models.
86
+
87
+ b) “ties” = Requires 2 or more models. One of these must be specified as “base_model”.
88
+
89
+ c) “dare_ties” = Requires 2 or more models. One of these must be specified as “base_model”.
90
+
91
+ d) “dare_linear” = Requires 2 or more models. One of these must be specified as “base_model”.
92
+
93
+ e) “della” = Requires 2 or more models. One of these must be specified as “base_model”.
94
+
95
+ f) “della_linear” = Requires 2 or more models. One of these must be specified as “base_model”.
96
+
97
+ g) “breadcrumbs” = Requires 2 or more models. One of these must be specified as “base_model”.
98
+
99
+ h) “breadcrumbs_ties” = Requires 2 or more models. One of these must be specified as “base_model”. Parameters: “weight” (per-model, float from 0 to 1.0; either a single value like “0.76” or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]"), “gamma” (per-model, fraction of parameters to prune), “density” (per-model, float from 0 to 1.0; either a single value or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]"), and “lambda” (global).
100
+
101
+ i) “sce” = Requires at least 2 models. One of these must be specified as “base_model”. Parameters: “select_topk” (float from 0 to 1), default is “1.0”, which effectively neutralizes the parameter. If select_topk < 1.0, only the top select_topk fraction of parameter positions with the highest variance are kept active.
102
+
103
+ ## Task Vector Parameters Interface:
104
+ Stir/Tie Bases Tab GLOBAL PARAMETERS:
105
+ “base_model” (form designating the base model)
106
+
107
+ “normalize” (boolean true/false checkmark switch, default is “true”)
108
+
109
+ “int8_mask” (global, boolean true/false checkmark switch, default is “false”)
110
+
111
+ “lambda” (global, float value, default is "1.0")
112
+
113
+ “rescale” (global, boolean true/false checkmark switch, designated for “dare_linear”, default is “true”)
114
+
115
+ “select_topk” (float value from 0 to 1.0, default is “1.0”, designated for “sce”)
116
+
117
+ Also add the same sharding logic from tabs 1-4.
118
+ As well as the float/precision selection per the mergekit-yaml config.
119
+ Also add option for chat templates from the mergekit config.
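Since two of the global fields above are designated for a single method each, the launcher can filter them by method before writing the top-level `parameters` block. A sketch, assuming the remaining fields (“normalize”, “int8_mask”, “lambda”) are shared by the whole family; `RESTRICTED_GLOBALS`, `SHARED_GLOBALS`, and `task_vector_globals` are hypothetical names.

```python
# Fields the list above designates for specific methods only.
RESTRICTED_GLOBALS = {
    "rescale": {"dare_linear"},
    "select_topk": {"sce"},
}
# Fields listed without a method restriction (assumed shared by the family).
SHARED_GLOBALS = ("normalize", "int8_mask", "lambda")


def task_vector_globals(method, form):
    """Pick the global parameters that apply to the chosen method."""
    params = {name: form[name] for name in SHARED_GLOBALS if name in form}
    for name, methods in RESTRICTED_GLOBALS.items():
        if method in methods and name in form:
            params[name] = form[name]
    return params
```

For example, with a fully populated form, `task_vector_globals("ties", form)` keeps only the three shared fields, while `"sce"` additionally keeps `select_topk`.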
120
+
121
+
122
+ Task Vector Per-Model Parameters:
123
+ MODEL 1:
124
+ "model 1” (form input for repo name of model #1)
125
+
126
+ "weight (model 1)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
127
+
128
+ "density (model 1)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
129
+
130
+ "gamma (model 1)" (for “breadcrumbs” and “breadcrumbs_ties”, 0 to 1.0 float, default value is “0.01")
131
+
132
+ "epsilon (model 1)” (for “della” and “della_linear”, 0 to 1.0 float, default value is “0.15”)
133
+
134
+ MODEL 2:
135
+ "model 2” (form input for repo name of model #2)
136
+
137
+ "weight (model 2)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
138
+
139
+ "density (model 2)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
140
+
141
+ "gamma (model 2)" (for “breadcrumbs” and “breadcrumbs_ties”, 0 to 1.0 float, default value is “0.01")
142
+
143
+ "epsilon (model 2)” (for “della” and “della_linear”, 0 to 1.0 float, default value is “0.15”)
144
+
145
+ MODEL 3:
146
+ "model 3” (form input for repo name of model #3)
147
+
148
+ "weight (model 3)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
149
+
150
+ "density (model 3)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
151
+
152
+ "gamma (model 3)" (for “breadcrumbs” and “breadcrumbs_ties”, 0 to 1.0 float, default value is “0.01”) #(fraction of parameters to prune)
153
+
154
+ "epsilon (model 3)” (for “della” and “della_linear”, 0 to 1.0 float, default value is “0.15”)
155
+
156
+ MODEL 4:
157
+ "model 4” (form input for repo name of model #4)
158
+
159
+ "weight (model 4)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
160
+
161
+ "density (model 4)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
162
+
163
+ "gamma (model 4)" (for “breadcrumbs” and “breadcrumbs_ties”, 0 to 1.0 float, default value is “0.01”) #(fraction of parameters to prune)
164
+
165
+ "epsilon (model 4)” (for “della” and “della_linear”, 0 to 1.0 float, default value is “0.15”)
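The per-model fields above map onto the `models` list of a config, with “gamma” and “epsilon” emitted only for the methods they are designated for. A minimal sketch; `PER_MODEL_EXTRAS`, `EXTRA_DEFAULTS`, and `task_vector_models` are hypothetical names, and “density” is emitted for every method only because the list does not restrict it (whether each method actually reads it is not checked here).

```python
# Per-model fields that apply only to certain methods, with their listed defaults.
PER_MODEL_EXTRAS = {
    "gamma": {"breadcrumbs", "breadcrumbs_ties"},
    "epsilon": {"della", "della_linear"},
}
EXTRA_DEFAULTS = {"gamma": 0.01, "epsilon": 0.15}


def task_vector_models(method, slots):
    """Convert UI model slots into mergekit-style `models` entries.

    slots: list of dicts with keys "model", "weight", "density", "gamma", "epsilon".
    """
    entries = []
    for slot in slots:
        if not slot.get("model"):
            continue  # skip model slots the user left empty
        params = {
            "weight": slot.get("weight", 1.0),
            "density": slot.get("density", 1.0),
        }
        for name, methods in PER_MODEL_EXTRAS.items():
            if method in methods:
                params[name] = slot.get(name, EXTRA_DEFAULTS[name])
        entries.append({"model": slot["model"], "parameters": params})
    return entries
```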
166
+
167
+ 3) Specialized methods family (add new category called “Specious” to app/ui):
168
+
169
+ “merge_method”:
170
+ a) “model_stock” = Requires at least 3 models. One of these must be specified as “base_model”. Parameters: “filter_wise” (true/false boolean), default should be “false”.
171
+
172
+ b) “nearswap” = Requires exactly 2 models. One of the two must be specified as “base_model”. Parameters: “t”, inputted similarly to “slerp”, but processed via a distinct algorithm.
173
+
174
+ c) “arcee_fusion” = Requires exactly 2 models. One model must be specified as “base_model”. All other parameter inputs should be ignored.
175
+
176
+ d) “passthrough” = Takes exactly 1 model as input (model inputs beyond model 1 should be ignored). Parameters: takes a “filter” sub-argument that selects the model component to extract, paired with the “scale” value to apply to it, e.g., {"filter": "down_proj", "value": 0.5}
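For “passthrough”, the filter/value pair from the example above could be written into the config roughly like this. This is a sketch under assumptions: `passthrough_config` is a hypothetical helper, the filtered value is assumed to sit in a list under a per-model `scale` parameter with an unfiltered fallback entry, and a single-entry `models` list is assumed to be accepted for this method. Verify all three against the mergekit source before use.

```python
def passthrough_config(model, filter_name=None, scale_value=1.0, dtype="bfloat16"):
    """Build a single-model passthrough config, optionally scaling one component."""
    entry = {"model": model}
    if filter_name:
        # Assumed layout: tensors matching the filter get scale_value,
        # and the unfiltered fallback entry leaves everything else at 1.0.
        entry["parameters"] = {
            "scale": [
                {"filter": filter_name, "value": scale_value},
                {"value": 1.0},
            ]
        }
    return {"merge_method": "passthrough", "models": [entry], "dtype": dtype}


if __name__ == "__main__":
    print(passthrough_config("org/model", "down_proj", 0.5))
```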
177
+
178
+ Specious Tab GLOBAL PARAMETERS:
179
+ “base_model” (form designating the base model)
180
+
181
+ “normalize” (boolean true/false checkmark switch, default is “true”)
182
+
183
+ “int8_mask” (global, boolean true/false checkmark switch, default is “false”)
184
+
185
+ “t” (global, float between 0 and 1.0; this is the interpolation factor between the base model and the other model: t=0 yields the base_model, t=1 yields the other model)
188
+
189
+ “filter_wise” (global, boolean true/false checkmark switch, designated for the “model_stock” method, default is “false”/unchecked)
190
+
191
+ Also add the same sharding logic from tabs 1-4.
192
+ As well as the float/precision selection per the mergekit-yaml config.
193
+ Also add option for chat templates from the mergekit config.
194
+
195
+ Specious Tab Per-Model Parameters:
196
+ MODEL 1:
197
+ "model 1” (form input for repo name of model #1)
198
+
199
+ "weight (model 1)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
200
+
201
+ “filter (model 1)” (optional form input to specify the component of model 1 to isolate, crucial for “passthrough”)
202
+
203
+ MODEL 2:
204
+ "model 2” (form input for repo name of model #2)
205
+
206
+ "weight (model 2)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
207
+
208
+
209
+ MODEL 3 (only for ”model_stock”, which accepts more than 2 models):
210
+
211
+ "model 3” (form input for repo name of model #3)
212
+
213
+ "weight (model 3)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
214
+
215
+
216
+ MODEL 4 (only for ”model_stock”, which accepts more than 2 models):
217
+
218
+ "model 4” (form input for repo name of model #4)
219
+
220
+ "weight (model 4)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
221
+
222
+
223
+ MODEL 5 (only for ”model_stock”, which accepts more than 2 models):
224
+
225
+ "model 5” (form input for repo name of model #5)
226
+
227
+ "weight (model 5)” (0 to 1.0 float, or a list of values in gradient progression/regression like "[0, 0.3, 0.7, 1]" or "[1, 0.7, 0.1]”, default is 1.0)
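Putting the Specious tab together, each method keeps only the inputs the descriptions above assign to it: “filter_wise” for “model_stock”, “t” for “nearswap”, nothing extra for “arcee_fusion”, and a single model for “passthrough”. A minimal sketch with a hypothetical `build_specious_config` helper; per-model weights are left out for brevity, and the `bfloat16` fallback is an assumption.

```python
def build_specious_config(method, models, base_model, form):
    """Assemble a config dict for the Specious tab from the form values."""
    models = [m for m in models if m]  # ignore empty model slots
    params = {}
    if method == "model_stock":
        params["filter_wise"] = form.get("filter_wise", False)
    elif method == "nearswap":
        params["t"] = form["t"]
    elif method == "passthrough":
        # Only the first model is used; further inputs are ignored.
        models, base_model = models[:1], None
    # arcee_fusion: all other parameter inputs are deliberately ignored.
    config = {
        "merge_method": method,
        "models": [{"model": m} for m in models],
        "dtype": form.get("dtype", "bfloat16"),  # assumed fallback precision
    }
    if base_model:
        config["base_model"] = base_model
    if params:
        config["parameters"] = params
    return config
```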
228
+