# Mergekit Low VRAM Graph Patch

## Merge models in minutes instead of hours on low VRAM
This is a significant and sophisticated modification to `mergekit/graph.py`. It transforms the standard `Executor` from an "optimistic" runner (one that assumes tensors fit in VRAM) into a **robust, adaptive execution engine** designed specifically to survive low-VRAM environments.

Here is a detailed analysis of the changes and how they achieve the goal of running on RTX 3060-class hardware.
### Core Strategy: "Fail Gracefully and Chunk"

The original `Executor` simply moved tensors to the GPU, executed, and moved them back. If VRAM ran out, the process crashed. This modified version implements a three-tier fallback strategy inside `_run`:

1. **Tier 1: Standard GPU Execution.** Try to run the task normally on the GPU.
2. **Tier 2: Adaptive Chunking.** If Tier 1 throws an OOM (`torch.OutOfMemoryError`), catch it, clear the cache, and attempt to split the operation into smaller batches (chunks).
3. **Tier 3: CPU Fallback.** If chunking fails (or isn't applicable), fall back to system RAM (CPU), which is much slower but usually has higher capacity.
### Key Code Modifications

#### 1. Windows/NVIDIA Allocator Tuning

```python
if sys.platform == "win32":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
```
**Analysis:** This is a crucial addition for consumer hardware, particularly on Windows, where PyTorch often suffers from memory fragmentation. `max_split_size_mb:32` stops the allocator from splitting cached blocks larger than 32 MB, reducing "fragmentation OOMs" where plenty of free VRAM exists but no single contiguous block is large enough.
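As a hedged sketch of the one real constraint here (the helper name and dict injection are illustrative, not from the patch): the variable must be in place before `torch` first initializes CUDA, or it is silently ignored.

```python
import os


def tune_cuda_allocator(platform: str, env: dict) -> None:
    """Illustrative helper (not in the patch): cap allocator block
    splitting on Windows. The setting only takes effect if present
    before `import torch` first touches CUDA, so it belongs at the
    very top of the entry script."""
    if platform == "win32":
        env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:32")


env = {}
tune_cuda_allocator("win32", env)
# env now carries "max_split_size_mb:32"; other platforms are left untouched
```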
#### 2. The `_execute_chunked` Method

This is a new helper method that implements the logic for breaking a large tensor operation into smaller pieces.

* **Logic:** It identifies a reference tensor in the arguments, determines the total number of rows (dim 0), and iterates through the data in `chunk_size` increments.
* **Memory Efficiency:**
  * It slices inputs on the CPU.
  * Moves **only the current slice** to the GPU.
  * Executes the task.
  * Moves the result **immediately back to the CPU**.
  * Deletes the GPU tensors and clears the cache.
* **Result:** Peak VRAM usage becomes proportional to `chunk_size` rather than to the full model layer size.
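The flow above can be sketched with plain Python lists standing in for tensors (a simplification, not the patch's code; the real method also moves each slice to the accelerator and back):

```python
def execute_chunked(fn, rows, chunk_size):
    """Sketch of the chunking idea: apply `fn` to dim-0 slices and
    stitch the results back together. In the real patch each slice is
    copied to the GPU, processed there, copied back to the CPU, and the
    GPU copy freed (del + cache clear) before the next slice is touched.
    """
    out = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]  # slice on the "CPU"
        out.extend(fn(chunk))                   # per-slice "GPU" work
    return out


# Element-wise work is unaffected by the split:
double = lambda xs: [x * 2 for x in xs]
print(execute_chunked(double, list(range(10)), chunk_size=4))
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```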
#### 3. The Adaptive Execution Loop (`_run`)

The `_run` method has been completely rewritten to handle the fallback logic.

**The Heuristic Filter:**

```python
is_io_task = task_type in ["LoadTensor", "GatherTensors", "SaveTensor", ...]
want_gpu = is_gpu_execution and (task.uses_accelerator() or not is_io_task)
```

**Analysis:** The code explicitly prevents I/O tasks (loading/saving) from clogging up the GPU. `PermutedEmbeddings` is also excluded, which is smart because embedding tables are massive (often 250 MB+) and permuting them is memory-bandwidth bound, not compute bound.

**The OOM Handler:**

```python
except torch.OutOfMemoryError:
    # ... cleanup ...
    chunk_sizes = [4096, 2048, 1024, 512, 256, 128, 64]
    for chunk_size in chunk_sizes:
        try:
            res = self._execute_chunked(task, arguments, chunk_size=chunk_size)
            # ... success ...
            break
```
**Analysis:** This is the "magic" that allows 3060s to work. If a layer is too big, the loop tries progressively smaller chunks until it finds a size that fits in the remaining VRAM.
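Stripped of mergekit specifics, the three-tier control flow reduces to a cascade like the following (function names are illustrative stand-ins, and `MemoryError` stands in for `torch.OutOfMemoryError`):

```python
CHUNK_SIZES = [4096, 2048, 1024, 512, 256, 128, 64]


def run_with_fallback(task, run_gpu, run_chunked, run_cpu):
    """Illustrative three-tier strategy: full GPU -> shrinking chunks -> CPU."""
    try:
        return run_gpu(task)           # Tier 1: optimistic full-tensor GPU run
    except MemoryError:
        for chunk_size in CHUNK_SIZES:
            try:
                return run_chunked(task, chunk_size)  # Tier 2: smaller pieces
            except MemoryError:
                continue               # still too big; shrink and retry
        return run_cpu(task)           # Tier 3: slow but roomy system RAM


# Simulate a GPU that only has room for chunks of <= 512 rows:
def fake_gpu(task):
    raise MemoryError


def fake_chunked(task, chunk_size):
    if chunk_size > 512:
        raise MemoryError
    return ("chunked", chunk_size)


print(run_with_fallback("layer.0", fake_gpu, fake_chunked, lambda t: "cpu"))
# ('chunked', 512)
```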
**Aggressive Garbage Collection:**

```python
if is_gpu_execution:
    gc.collect()
    if accelerator: accelerator.empty_cache()
```

**Analysis:** This runs at the end of *every* task execution loop.

* **Pros:** It ensures VRAM is as clean as possible for the next task.
* **Cons:** `cuda.empty_cache()` forces a device synchronization and adds overhead. This makes the merge significantly slower than a standard run, trading speed for the ability to run at all.
### Potential Risks & Limitations

1. **Assumption of Row-Independence:**
   The `_execute_chunked` method assumes that `task.execute` operates independently on rows (dimension 0).
   * **Safe:** Linear merges, SLERP (usually), and element-wise operations.
   * **Unsafe:** Operations that require global statistics across the batch dimension (e.g., `softmax` over dim 0, though rare in weight merging), or matrix multiplications where the split dimension is the reduction dimension. For standard LLM weight merging, which is usually element-wise weighted averaging, the assumption holds.
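A tiny demonstration of the distinction, with Python lists standing in for tensors (the helper is hypothetical, not patch code): row-independent work survives the split unchanged, while anything that needs a batch-wide statistic does not.

```python
def chunked_apply(fn, rows, chunk_size):
    """Apply fn to dim-0 slices independently, as the chunked path does."""
    out = []
    for i in range(0, len(rows), chunk_size):
        out.extend(fn(rows[i:i + chunk_size]))
    return out


rows = [1.0, 2.0, 3.0, 4.0]

# Safe: scaling each row is row-independent, so chunking is invisible.
scale = lambda xs: [x * 0.5 for x in xs]
assert chunked_apply(scale, rows, 2) == scale(rows)

# Unsafe: normalizing by the batch sum needs a global statistic; each
# chunk only sees its own sum, so the chunked result diverges.
normalize = lambda xs: [x / sum(xs) for x in xs]
assert chunked_apply(normalize, rows, 2) != normalize(rows)
```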
2. **Performance Overhead:**
   The constant `gc.collect()` and `empty_cache()` calls, combined with moving data back and forth between CPU and GPU for every chunk, result in low GPU utilization. The merge will take longer, but it will complete.

### Conclusion

This is a **highly effective patch for low-VRAM users**. It trades execution speed for memory safety.

* **For a 3090/4090 user:** This script may be slower than the original due to the aggressive GC.
* **For a 3060/3060 Ti user:** This script enables merges that are otherwise impossible (70B models, or large 7B merges with `--cuda`).

The implementation is robust because it doesn't force chunking; it only attempts chunking when the standard approach fails.