mxguru1 committed
Commit 55f5f5e · verified · 1 Parent(s): 059bd12

Add KV interception hooks + generalised allocator + smoke tests (2/3: assignment_v2.py)

Files changed (1)
  1. assignment_v2.py +484 -0
assignment_v2.py ADDED
@@ -0,0 +1,484 @@
+ """
+ Sovereign Hive — Greedy resource allocator (v2)
+ ================================================
+
+ What changed in v2:
+   - The allocator no longer hardcodes "weights" as the cost dimension. It now
+     works over any (cost_per_unit, unit_count) pair, so KV-cache options
+     (cost = bytes_per_kv_token × max_seq_len) flow through the same algorithm
+     as weight options (cost = bytes_per_param × param_count).
+   - Existing LayerOption / LayerCandidate / assign_bit_widths names are
+     preserved as thin aliases over the generic core, so call sites that
+     haven't been ported yet keep working unchanged.
+   - assign_combined() runs two independent allocations (one per budget) and
+     returns a CombinedAssignmentResult. Weight budget and KV budget do NOT
+     fungibly trade — saving weight bytes can't pay for KV bytes — because
+     the two pools live in physically different VRAM regions at inference.
+     The right interface is "two budgets, both must fit," not one combined
+     pot.
+
+ Why two-budgets-not-one:
+   Weight VRAM is static across the run. KV VRAM scales with context length
+   at inference. You commit to a max ctx upfront (e.g. 4K, 8K), size the
+   KV reserve for that, and the weights get what's left. Letting the
+   allocator decide to spend "saved weight bytes" on extra KV precision is
+   unsafe: it produces a config that fits at low ctx but OOMs at high ctx.
+
+ Algorithm (unchanged in spirit, generalized in code):
+   1. Start: every candidate at its cheapest option.
+   2. While budget allows: globally pick the (candidate, upgrade) pair
+      with the highest drift-reduction-per-extra-byte; apply.
+   3. Stop: no upgrade fits or no upgrade reduces drift.
+
+ Complexity unchanged: O(C × O^2) per pass, converges in ≤ C × (O-1) passes,
+ where C = number of candidates and O = options per candidate. Milliseconds.
+ """
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+ from typing import Literal
+
+ # ---------------------------------------------------------------------------
+ # Generic option / candidate types
+ # ---------------------------------------------------------------------------
+
+
+ @dataclass(frozen=True)
+ class GenericOption:
+     """One option for one candidate.
+
+     cost_per_unit × unit_count = total bytes if chosen.
+     drift is the measured quality cost (lower is better).
+     label / tag carry arbitrary identification for the caller's reconstruction
+     (e.g. ('hqq', 4) for weights; ('hqq_g64', 4, 4) for K/V split).
+     """
+     cost_per_unit: float
+     drift: float
+     label: tuple = ()  # caller-defined identification of the option
+
+
+ @dataclass
+ class GenericCandidate:
+     """A single allocation site (e.g. one weight layer or one KV layer)."""
+     candidate_id: tuple  # (layer_idx, component) — caller-defined
+     unit_count: int  # params for weights, max_seq_len for KV, etc.
+     options: list[GenericOption]
+
+     def cheapest(self) -> GenericOption:
+         return min(self.options, key=lambda o: o.cost_per_unit)
+
+
+ @dataclass
+ class GenericAssignment:
+     candidate_id: tuple
+     chosen: GenericOption
+     bytes_used: float
+
+
+ @dataclass
+ class GenericAssignmentResult:
+     assignments: list[GenericAssignment]
+     total_drift: float
+     total_bytes: float
+     budget_bytes: float
+     saturated: bool
+
+     @property
+     def total_gb(self) -> float:
+         return self.total_bytes / 1e9
+
+     @property
+     def budget_gb(self) -> float:
+         return self.budget_bytes / 1e9
+
+     @property
+     def headroom_gb(self) -> float:
+         return (self.budget_bytes - self.total_bytes) / 1e9
+
+
+ class BudgetInfeasibleError(Exception):
+     def __init__(self, current_bytes: float, budget_bytes: float, label: str = "budget"):
+         super().__init__(
+             f"Even the cheapest assignment ({current_bytes / 1e9:.2f} GB) exceeds "
+             f"the {label} ({budget_bytes / 1e9:.2f} GB). Reduce candidate count, "
+             "increase aggressiveness of cheapest option, or relax the budget."
+         )
+         self.current_bytes = current_bytes
+         self.budget_bytes = budget_bytes
+
+
+ # ---------------------------------------------------------------------------
+ # Core algorithm (generic)
+ # ---------------------------------------------------------------------------
+
+
+ def assign_greedy(
+     candidates: list[GenericCandidate],
+     budget_bytes: float,
+     *,
+     budget_label: str = "budget",
+ ) -> GenericAssignmentResult:
+     """Greedy allocation by drift-reduction-per-byte ratio.
+
+     Raises BudgetInfeasibleError if even the cheapest assignment overshoots.
+     """
+     if not candidates:
+         raise ValueError("No candidates provided")
+     if budget_bytes <= 0:
+         raise ValueError(f"Non-positive budget: {budget_bytes}")
+
+     # Initialize at cheapest option per candidate.
+     current: dict[tuple, GenericOption] = {}
+     bytes_used: dict[tuple, float] = {}
+     cand_by_id: dict[tuple, GenericCandidate] = {}
+
+     for c in candidates:
+         key = c.candidate_id
+         cheapest = c.cheapest()
+         current[key] = cheapest
+         bytes_used[key] = cheapest.cost_per_unit * c.unit_count
+         cand_by_id[key] = c
+
+     total_bytes = sum(bytes_used.values())
+     if total_bytes > budget_bytes:
+         raise BudgetInfeasibleError(total_bytes, budget_bytes, budget_label)
+
+     def best_upgrade(key: tuple):
+         """Best (ratio, target_option, extra_bytes) for this candidate, or None."""
+         cand = cand_by_id[key]
+         cur = current[key]
+         best = None
+         for opt in cand.options:
+             if opt.cost_per_unit <= cur.cost_per_unit:
+                 continue
+             if opt.drift >= cur.drift:
+                 continue
+             drift_reduction = cur.drift - opt.drift
+             extra_bytes = (opt.cost_per_unit - cur.cost_per_unit) * cand.unit_count
+             if extra_bytes <= 0:
+                 continue
+             ratio = drift_reduction / extra_bytes
+             if best is None or ratio > best[0]:
+                 best = (ratio, opt, extra_bytes)
+         return best
+
+     saturated = False
+     while True:
+         winner_key = None
+         winner_ratio = -1.0
+         winner_opt = None
+         winner_extra = 0.0
+         any_available = False
+
+         for key in current:
+             up = best_upgrade(key)
+             if up is None:
+                 continue
+             any_available = True
+             ratio, target, extra = up
+             if total_bytes + extra > budget_bytes:
+                 continue
+             if ratio > winner_ratio:
+                 winner_ratio = ratio
+                 winner_key = key
+                 winner_opt = target
+                 winner_extra = extra
+
+         if winner_key is None:
+             saturated = any_available
+             break
+
+         bytes_used[winner_key] += winner_extra
+         total_bytes += winner_extra
+         current[winner_key] = winner_opt
+
+     assignments = [
+         GenericAssignment(
+             candidate_id=key,
+             chosen=current[key],
+             bytes_used=bytes_used[key],
+         )
+         for key in sorted(current.keys())
+     ]
+     total_drift = sum(a.chosen.drift for a in assignments)
+     return GenericAssignmentResult(
+         assignments=assignments,
+         total_drift=total_drift,
+         total_bytes=total_bytes,
+         budget_bytes=budget_bytes,
+         saturated=saturated,
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # Combined weight + KV allocation
+ # ---------------------------------------------------------------------------
+
+
+ @dataclass
+ class CombinedAssignmentResult:
+     """Result of running greedy allocation independently on two budgets."""
+     weights: GenericAssignmentResult
+     kv: GenericAssignmentResult | None  # None if no KV candidates provided
+
+     @property
+     def total_drift(self) -> float:
+         kv_drift = self.kv.total_drift if self.kv else 0.0
+         return self.weights.total_drift + kv_drift
+
+     @property
+     def total_gb(self) -> float:
+         kv_gb = self.kv.total_gb if self.kv else 0.0
+         return self.weights.total_gb + kv_gb
+
+
+ def assign_combined(
+     weight_candidates: list[GenericCandidate],
+     kv_candidates: list[GenericCandidate] | None,
+     weight_budget_bytes: float,
+     kv_budget_bytes: float,
+ ) -> CombinedAssignmentResult:
+     """Run two independent greedy allocations under their respective budgets.
+
+     The budgets do NOT trade — see module docstring. Saved weight bytes
+     cannot be reassigned to KV at inference because the two pools live in
+     different VRAM regions and the KV pool scales with context length.
+     """
+     weight_result = assign_greedy(
+         weight_candidates, weight_budget_bytes, budget_label="weight budget"
+     )
+     kv_result = None
+     if kv_candidates:
+         kv_result = assign_greedy(
+             kv_candidates, kv_budget_bytes, budget_label="KV budget"
+         )
+     return CombinedAssignmentResult(weights=weight_result, kv=kv_result)
+
+
+ # ---------------------------------------------------------------------------
+ # Back-compat: existing names that callers in pipeline.py / hunter use
+ # ---------------------------------------------------------------------------
+ # These keep the v1 public surface intact. New code should use the generic
+ # names above. The aliases construct GenericCandidate/Option under the hood
+ # and translate results back into the old shapes.
+
+ Quantizer = Literal["hqq", "awq", "gptq"]
+ BitWidth = Literal[2, 3, 4]
+
+
+ @dataclass(frozen=True)
+ class LayerOption:
+     """Weight-quantization option for one layer/component."""
+     bits: BitWidth
+     quantizer: Quantizer
+     drift: float
+     bytes_per_param: float
+
+     def to_generic(self) -> GenericOption:
+         return GenericOption(
+             cost_per_unit=self.bytes_per_param,
+             drift=self.drift,
+             label=(self.quantizer, self.bits),
+         )
+
+     @classmethod
+     def from_generic(cls, g: GenericOption) -> LayerOption:
+         # label = (quantizer, bits)
+         quantizer, bits = g.label
+         return cls(
+             bits=bits,
+             quantizer=quantizer,
+             drift=g.drift,
+             bytes_per_param=g.cost_per_unit,
+         )
+
+
+ @dataclass
+ class LayerCandidate:
+     layer_idx: int
+     component: str
+     param_count: int
+     options: list[LayerOption]
+
+     def cheapest(self) -> LayerOption:
+         return min(self.options, key=lambda o: o.bytes_per_param)
+
+     def to_generic(self) -> GenericCandidate:
+         return GenericCandidate(
+             candidate_id=(self.layer_idx, self.component),
+             unit_count=self.param_count,
+             options=[o.to_generic() for o in self.options],
+         )
+
+
+ @dataclass
+ class Assignment:
+     layer_idx: int
+     component: str
+     chosen: LayerOption
+     bytes_used: float
+
+
+ @dataclass
+ class AssignmentResult:
+     assignments: list[Assignment]
+     total_drift: float
+     total_weights_gb: float
+     budget_gb: float
+     headroom_gb: float
+     saturated: bool
+
+     @property
+     def by_layer(self) -> dict[tuple[int, str], Assignment]:
+         return {(a.layer_idx, a.component): a for a in self.assignments}
+
+
+ def assign_bit_widths(
+     candidates: list[LayerCandidate],
+     weight_budget_gb: float,
+ ) -> AssignmentResult:
+     """v1 API — preserved. Delegates to the generic allocator."""
+     generic_cands = [c.to_generic() for c in candidates]
+     gen_result = assign_greedy(
+         generic_cands,
+         budget_bytes=weight_budget_gb * 1e9,
+         budget_label="weight budget",
+     )
+
+     # Translate back to v1 shapes
+     assignments: list[Assignment] = []
+     for ga in gen_result.assignments:
+         layer_idx, component = ga.candidate_id
+         assignments.append(Assignment(
+             layer_idx=layer_idx,
+             component=component,
+             chosen=LayerOption.from_generic(ga.chosen),
+             bytes_used=ga.bytes_used,
+         ))
+     return AssignmentResult(
+         assignments=assignments,
+         total_drift=gen_result.total_drift,
+         total_weights_gb=gen_result.total_gb,
+         budget_gb=weight_budget_gb,
+         headroom_gb=weight_budget_gb - gen_result.total_gb,
+         saturated=gen_result.saturated,
+     )
+
+
+ def pareto_frontier(
+     candidates: list[LayerCandidate],
+     budgets_gb: list[float],
+ ) -> list[AssignmentResult]:
+     """v1 API — preserved."""
+     results: list[AssignmentResult] = []
+     for b in budgets_gb:
+         try:
+             results.append(assign_bit_widths(candidates, b))
+         except BudgetInfeasibleError:
+             continue
+     return results
+
+
+ # ---------------------------------------------------------------------------
+ # KV-specific convenience wrappers
+ # ---------------------------------------------------------------------------
+
+
+ @dataclass(frozen=True)
+ class KVOption:
+     """KV-cache quantization option for one attention layer."""
+     k_bits: int
+     v_bits: int
+     quantizer: str
+     drift: float
+     bytes_per_kv_token: float
+
+     def to_generic(self) -> GenericOption:
+         return GenericOption(
+             cost_per_unit=self.bytes_per_kv_token,
+             drift=self.drift,
+             label=(self.quantizer, self.k_bits, self.v_bits),
+         )
+
+     @classmethod
+     def from_generic(cls, g: GenericOption) -> KVOption:
+         quantizer, k_bits, v_bits = g.label
+         return cls(
+             k_bits=k_bits,
+             v_bits=v_bits,
+             quantizer=quantizer,
+             drift=g.drift,
+             bytes_per_kv_token=g.cost_per_unit,
+         )
+
+
+ @dataclass
+ class KVCandidate:
+     layer_idx: int
+     num_kv_heads: int
+     head_dim: int
+     options: list[KVOption]
+
+     def to_generic(self, max_seq_len: int) -> GenericCandidate:
+         # unit_count for KV is the number of tokens we're sizing the cache for.
+         return GenericCandidate(
+             candidate_id=(self.layer_idx, "kv"),
+             unit_count=max_seq_len,
+             options=[o.to_generic() for o in self.options],
+         )
+
+
+ @dataclass
+ class KVAssignment:
+     layer_idx: int
+     chosen: KVOption
+     bytes_used: float
+
+
+ @dataclass
+ class KVAssignmentResult:
+     assignments: list[KVAssignment]
+     total_drift: float
+     total_kv_gb: float
+     budget_gb: float
+     headroom_gb: float
+     saturated: bool
+     max_seq_len: int
+
+
+ def assign_kv_bits(
+     candidates: list[KVCandidate],
+     kv_budget_gb: float,
+     max_seq_len: int,
+ ) -> KVAssignmentResult:
+     """Allocate KV bit-widths across attention layers under a KV-cache budget.
+
+     max_seq_len is the context length you're sizing the cache for. The budget
+     must fit the worst case (full max_seq_len) because the cache cannot be
+     re-quantized mid-generation.
+     """
+     generic_cands = [c.to_generic(max_seq_len) for c in candidates]
+     gen_result = assign_greedy(
+         generic_cands,
+         budget_bytes=kv_budget_gb * 1e9,
+         budget_label="KV cache budget",
+     )
+
+     assignments: list[KVAssignment] = []
+     for ga in gen_result.assignments:
+         layer_idx, _component = ga.candidate_id
+         assignments.append(KVAssignment(
+             layer_idx=layer_idx,
+             chosen=KVOption.from_generic(ga.chosen),
+             bytes_used=ga.bytes_used,
+         ))
+     return KVAssignmentResult(
+         assignments=assignments,
+         total_drift=gen_result.total_drift,
+         total_kv_gb=gen_result.total_gb,
+         budget_gb=kv_budget_gb,
+         headroom_gb=kv_budget_gb - gen_result.total_gb,
+         saturated=gen_result.saturated,
+         max_seq_len=max_seq_len,
+     )
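
The greedy loop and the two-budget rule described in the module docstring can be tried out on toy data. The following is a minimal standalone sketch, not the committed module: `Option`, `best_upgrade`, and `greedy` are hypothetical stand-ins for `GenericOption` and `assign_greedy`, and every number is invented for illustration. It assumes the same selection rule as the diff above: each candidate proposes only its single best-ratio upgrade per pass, and that proposal is skipped if it does not fit the budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """Hypothetical stand-in for GenericOption (label field omitted)."""
    cost_per_unit: float  # bytes per param, or bytes per KV token
    drift: float          # measured quality cost, lower is better


def best_upgrade(cur: Option, unit_count: int, options: list[Option]):
    """Return (ratio, option, extra_bytes) for the best upgrade, or None."""
    best = None
    for opt in options:
        extra = (opt.cost_per_unit - cur.cost_per_unit) * unit_count
        gain = cur.drift - opt.drift
        if extra <= 0 or gain <= 0:
            continue  # not more expensive, or does not reduce drift
        ratio = gain / extra
        if best is None or ratio > best[0]:
            best = (ratio, opt, extra)
    return best


def greedy(candidates: dict[str, tuple[int, list[Option]]], budget: float):
    """candidates maps an id to (unit_count, options); returns (chosen, bytes_used)."""
    # Step 1: start every candidate at its cheapest option.
    chosen = {k: min(opts, key=lambda o: o.cost_per_unit)
              for k, (_, opts) in candidates.items()}
    used = sum(chosen[k].cost_per_unit * n for k, (n, _) in candidates.items())
    if used > budget:
        raise ValueError("even the cheapest assignment exceeds the budget")
    # Step 2: repeatedly apply the affordable upgrade with the best ratio.
    while True:
        winner = None  # (ratio, key, option, extra_bytes)
        for k, (n, opts) in candidates.items():
            up = best_upgrade(chosen[k], n, opts)
            if up is None or used + up[2] > budget:
                continue
            if winner is None or up[0] > winner[0]:
                winner = (up[0], k, up[1], up[2])
        # Step 3: stop when nothing fits or nothing reduces drift.
        if winner is None:
            return chosen, used
        _, key, opt, extra = winner
        chosen[key] = opt
        used += extra


# Toy numbers, invented for illustration only.
weights = {
    "attn": (1000, [Option(0.25, 0.8), Option(0.375, 0.5), Option(0.5, 0.1)]),
    "mlp": (3000, [Option(0.25, 0.4), Option(0.5, 0.3)]),
}
kv = {"layer0": (4096, [Option(512.0, 0.2), Option(1024.0, 0.05)])}

# Two independent budgets: leftover weight bytes are never handed to KV.
w_chosen, w_used = greedy(weights, budget=1300.0)
kv_chosen, kv_used = greedy(kv, budget=5_000_000.0)
print(w_chosen, w_used)
print(kv_chosen, kv_used)
```

With these toy numbers the weight pass upgrades `attn` to its most expensive option and leaves `mlp` at its cheapest, because the `mlp` upgrade would overshoot the 1300-byte budget. One property of this selection rule is worth knowing: if the weight budget were 1200 instead, `attn` would stay at its cheapest option even though its middle option fits, since only the best-ratio upgrade per candidate is ever tested against the budget.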