# Mergekit Low VRAM Graph Patch
## Merge models in minutes instead of hours on low VRAM
This is a significant and sophisticated modification to `mergekit/graph.py`. It transforms the standard `Executor` from an "optimistic" runner (one that assumes tensors fit in VRAM) into a **robust, adaptive execution engine** designed specifically to survive low-VRAM environments.
Here is a detailed analysis of the changes and how they achieve the goal of running on RTX 3060-class hardware.
### Core Strategy: "Fail Gracefully and Chunk"
The original `Executor` simply moved tensors to the GPU, executed, and moved them back. If VRAM ran out, the process crashed. This modified version implements a three-tier fallback strategy inside `_run`:
1. **Tier 1: Standard GPU Execution.** Try to run the task normally on the GPU.
2. **Tier 2: Adaptive Chunking.** If Tier 1 throws an OOM (`torch.OutOfMemoryError`), catch it, clear the cache, and attempt to split the operation into smaller batches (chunks).
3. **Tier 3: CPU Fallback.** If chunking fails (or isn't applicable), fall back to system RAM (CPU), which is much slower but usually has higher capacity.
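The three tiers can be sketched in hardware-free pure Python. Here a hypothetical `FakeOOM` exception stands in for `torch.OutOfMemoryError`, and a simple length budget stands in for VRAM capacity; all names are illustrative, not mergekit's actual API:

```python
class FakeOOM(Exception):
    """Stand-in for torch.OutOfMemoryError so the sketch runs without a GPU."""

def gpu_run(task_fn, batch, budget):
    # Simulate a GPU with limited "VRAM": batches over budget raise OOM.
    if len(batch) > budget:
        raise FakeOOM()
    return [task_fn(x) for x in batch]

def run_with_fallback(task_fn, data, gpu_budget, chunk_sizes=(8, 4, 2)):
    # Tier 1: try the whole input at once on the "GPU".
    try:
        return gpu_run(task_fn, data, gpu_budget)
    except FakeOOM:
        pass
    # Tier 2: retry with progressively smaller chunks.
    for size in chunk_sizes:
        try:
            out = []
            for i in range(0, len(data), size):
                out.extend(gpu_run(task_fn, data[i:i + size], gpu_budget))
            return out
        except FakeOOM:
            continue
    # Tier 3: plain "CPU" execution with no budget at all.
    return [task_fn(x) for x in data]
```

With `gpu_budget=4`, a 10-element input fails Tier 1, fails the size-8 chunking attempt, and succeeds at size 4; with `gpu_budget=0`, everything falls through to the CPU path.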
### Key Code Modifications
#### 1. Windows/NVIDIA Allocator Tuning
```python
import os
import sys

if sys.platform == "win32":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
```
**Analysis:** This is a crucial addition for consumer hardware, particularly on Windows, where PyTorch often suffers from memory fragmentation. `max_split_size_mb` stops the caching allocator from splitting cached blocks larger than the given size, which reduces "fragmentation OOMs": failures where plenty of free VRAM exists in total but no single contiguous block is large enough. Note that the variable must be set before CUDA is first initialized for it to take effect.
#### 2. The `_execute_chunked` Method
This is a new helper method that implements the logic for breaking a large tensor operation into smaller pieces.
* **Logic:** It identifies a reference tensor in the arguments, determines the total number of rows (dim 0), and iterates through the data in `chunk_size` increments.
* **Memory Efficiency:**
* It slices inputs on the CPU.
* Moves **only the current slice** to the GPU.
* Executes the task.
* Moves the result **immediately back to the CPU**.
* Deletes the GPU tensors and clears the cache.
* **Result:** The peak VRAM usage becomes proportional to `chunk_size` rather than the full model layer size.
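The memory pattern described above can be modeled in a few lines. This is a pure-Python abstraction of the idea, not mergekit's actual `_execute_chunked`: rows are list elements, the "move to GPU" is just a slice, and a `peak` counter tracks the largest slice that was ever resident at once:

```python
def execute_chunked(task_fn, rows, chunk_size):
    """Process `rows` in chunks; return (results, peak_resident_rows).

    Only one chunk is "on the GPU" at a time, so the simulated peak
    memory is bounded by chunk_size rather than len(rows).
    """
    results, peak = [], 0
    for i in range(0, len(rows), chunk_size):
        gpu_slice = rows[i:i + chunk_size]       # move only this slice "to GPU"
        peak = max(peak, len(gpu_slice))         # track the simulated VRAM peak
        results.extend(task_fn(x) for x in gpu_slice)  # execute on the slice
        del gpu_slice                            # free before the next chunk
    return results, peak
```

Running this over 10 rows with `chunk_size=4` produces the full result while never holding more than 4 rows at once, which is exactly the VRAM-bounding property the patch relies on.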
#### 3. The Adaptive Execution Loop (`_run`)
The `_run` method has been completely rewritten to handle the fallback logic.
**The Heuristic Filter:**
```python
is_io_task = task_type in ["LoadTensor", "GatherTensors", "SaveTensor", ...]
want_gpu = is_gpu_execution and (task.uses_accelerator() or not is_io_task)
```
**Analysis:** The code explicitly prevents I/O tasks (loading/saving) from clogging up the GPU. `PermutedEmbeddings` is also excluded, which is smart because embedding tables are massive (often 250MB+) and permuting them is memory-bandwidth bound, not compute bound.
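A standalone version of this heuristic might look like the following. The task-type names come from the snippet above; the function itself is a hypothetical reconstruction, not the patch's exact code:

```python
# Tasks that are I/O- or bandwidth-bound and should stay off the GPU.
IO_TASKS = {"LoadTensor", "GatherTensors", "SaveTensor", "PermutedEmbeddings"}

def pick_device(task_type, uses_accelerator, gpu_available):
    """Route compute-bound tasks to the GPU, keep I/O tasks on the CPU
    unless they explicitly request the accelerator."""
    is_io_task = task_type in IO_TASKS
    want_gpu = gpu_available and (uses_accelerator or not is_io_task)
    return "cuda" if want_gpu else "cpu"
```

So a `LoadTensor` task lands on the CPU even when a GPU is available, while a compute task like a merge kernel is routed to `cuda`.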
**The OOM Handler:**
```python
except torch.OutOfMemoryError:
    # ... cleanup ...
    chunk_sizes = [4096, 2048, 1024, 512, 256, 128, 64]
    for chunk_size in chunk_sizes:
        try:
            res = self._execute_chunked(task, arguments, chunk_size=chunk_size)
            # ... success ...
            break
```
**Analysis:** This is the "magic" that allows 3060s to work. If a layer is too big, it tries progressively smaller chunks until it finds a size that fits in the remaining VRAM.
**Aggressive Garbage Collection:**
```python
if is_gpu_execution:
    gc.collect()
    if accelerator:
        accelerator.empty_cache()
```
**Analysis:** This runs at the end of *every* iteration of the task execution loop.
* **Pros:** It ensures VRAM is absolutely as clean as possible for the next task.
* **Cons:** `torch.cuda.empty_cache()` forces a device synchronization and adds overhead on every call. This makes the merge process significantly slower than a standard run, but it trades speed for the ability to run at all.
### Potential Risks & Limitations
1. **Assumption of Row-Independence:**
The `_execute_chunked` method assumes that the `task.execute` method operates independently on rows (dimension 0).
* **Safe:** Linear merges, SLERP (usually), and element-wise operations.
* **Unsafe:** Operations that require global statistics across the batch dimension (e.g., `softmax` over dim 0, though rare in weight merging) or matrix multiplications where the split dimension is the reduction dimension. However, for standard LLM weight merging (which is usually element-wise weighted averaging), this assumption holds.
2. **Performance Overhead:**
The constant `gc.collect()` and `empty_cache()` calls, combined with moving data back and forth between CPU and GPU for every chunk, will result in low GPU utilization. The merge will take longer, but it will complete.
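The row-independence assumption from point 1 can be checked concretely. For an element-wise weighted average (the common case in weight merging), chunking along dim 0 is exactly equivalent to the full computation; for an operation that needs a global statistic, it is not. This toy example uses Python lists in place of tensors, with an illustrative weight `w`:

```python
def weighted_average(a, b, w=0.6):
    """Element-wise merge: each output row depends only on its own inputs,
    so splitting along dim 0 cannot change the result."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def chunked(fn, a, b, chunk_size):
    """Apply fn to aligned chunks of a and b, then concatenate."""
    out = []
    for i in range(0, len(a), chunk_size):
        out.extend(fn(a[i:i + chunk_size], b[i:i + chunk_size]))
    return out

a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
# Safe: chunked element-wise merge matches the unchunked result exactly.
assert chunked(weighted_average, a, b, 2) == weighted_average(a, b)

def normalize(a, _b=None):
    """Needs the global sum: each chunk would normalize by its own sum."""
    s = sum(a)
    return [x / s for x in a]

# Unsafe: chunking changes the answer for globally-coupled operations.
assert chunked(normalize, a, a, 2) != normalize(a, a)
```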
### Conclusion
This is a **highly effective patch for low-VRAM users**. It trades execution speed for memory safety.
* **For a 3090/4090 user:** This script might be slower than the original due to the aggressive GC.
* **For a 3060/3060 Ti user:** This script enables functionality that is otherwise impossible (merging 70B models or large 7B merges with `--cuda`).
The implementation is robust because it doesn't force chunking; it only attempts it when the standard approach fails.