# Mergekit Low VRAM Graph Patch

## Merge models in minutes instead of hours on low VRAM

This is a significant and sophisticated modification to `mergekit/graph.py`. It transforms the standard `Executor` from an "optimistic" runner (one that assumes tensors fit in VRAM) into a **robust, adaptive execution engine** designed specifically to survive low-VRAM environments. Here is a detailed analysis of the changes and how they achieve the goal of running on RTX 3060-class hardware.

### Core Strategy: "Fail Gracefully and Chunk"

The original `Executor` simply moved tensors to the GPU, executed, and moved them back. If VRAM ran out, the process crashed. This modified version implements a three-tier fallback strategy inside `_run`:

1. **Tier 1: Standard GPU execution.** Try to run the task normally on the GPU.
2. **Tier 2: Adaptive chunking.** If Tier 1 throws an OOM (`torch.OutOfMemoryError`), catch it, clear the cache, and attempt to split the operation into smaller batches (chunks).
3. **Tier 3: CPU fallback.** If chunking fails (or isn't applicable), fall back to system RAM (CPU), which is much slower but usually has far more capacity.

### Key Code Modifications

#### 1. Windows/NVIDIA Allocator Tuning

```python
if sys.platform == "win32":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
```

**Analysis:** This is a crucial addition for consumer hardware, particularly on Windows, where PyTorch often suffers from memory fragmentation. `max_split_size_mb` stops the caching allocator from splitting cached blocks larger than the given size, which reduces "fragmentation OOMs" where free memory exists but isn't contiguous.

#### 2. The `_execute_chunked` Method

This new helper method implements the logic for breaking a large tensor operation into smaller pieces.

* **Logic:** It identifies a reference tensor in the arguments, determines the total number of rows (dim 0), and iterates through the data in `chunk_size` increments.
* **Memory efficiency:**
    * It slices inputs on the CPU.
    * It moves **only the current slice** to the GPU.
    * It executes the task.
    * It moves the result **immediately back to the CPU**.
    * It deletes the GPU tensors and clears the cache.
* **Result:** Peak VRAM usage becomes proportional to `chunk_size` rather than to the full model layer size.

#### 3. The Adaptive Execution Loop (`_run`)

The `_run` method has been completely rewritten to handle the fallback logic.

**The heuristic filter:**

```python
is_io_task = task_type in ["LoadTensor", "GatherTensors", "SaveTensor", ...]
want_gpu = is_gpu_execution and (task.uses_accelerator() or not is_io_task)
```

**Analysis:** The code explicitly prevents I/O tasks (loading/saving) from clogging up the GPU. `PermutedEmbeddings` is also excluded, which is smart because embedding tables are massive (often 250 MB+) and permuting them is memory-bandwidth bound, not compute bound.

**The OOM handler:**

```python
except torch.OutOfMemoryError:
    # ... cleanup ...
    chunk_sizes = [4096, 2048, 1024, 512, 256, 128, 64]
    for chunk_size in chunk_sizes:
        try:
            res = self._execute_chunked(task, arguments, chunk_size=chunk_size)
            # ... success ...
            break
```

**Analysis:** This is the "magic" that allows 3060s to work. If a layer is too big, the executor tries progressively smaller chunks until it finds a size that fits in the remaining VRAM.

**Aggressive garbage collection:**

```python
if is_gpu_execution:
    gc.collect()
    if accelerator:
        accelerator.empty_cache()
```

**Analysis:** This runs at the end of *every* task execution loop.

* **Pros:** It ensures VRAM is as clean as possible for the next task.
* **Cons:** `cuda.empty_cache()` forces a device synchronization and adds overhead. This makes the merge significantly slower than a standard run, but it trades speed for the ability to run at all.

### Potential Risks & Limitations
1. **Assumption of row-independence:** The `_execute_chunked` method assumes that `task.execute` operates independently on rows (dimension 0).
    * **Safe:** Linear merges, SLERP (usually), and element-wise operations.
    * **Unsafe:** Operations that require global statistics across the batch dimension (e.g., `softmax` over dim 0, though this is rare in weight merging), or matrix multiplications where the split dimension is the reduction dimension.

    For standard LLM weight merging (which is usually element-wise weighted averaging), this assumption holds.
2. **Performance overhead:** The constant `gc.collect()` and `empty_cache()` calls, combined with moving data back and forth between CPU and GPU for every chunk, result in low GPU utilization. The merge will take longer, but it will complete.

### Conclusion

This is a **highly effective patch for low-VRAM users**. It trades execution speed for memory safety.

* **For a 3090/4090 user:** This script might be slower than the original due to the aggressive GC.
* **For a 3060/3060 Ti user:** This script enables functionality that is otherwise impossible (merging 70B models, or large 7B merges, with `--cuda`).

The implementation is robust because it doesn't force chunking; it only attempts chunking when the standard approach fails.
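The three-tier fallback can be sketched without PyTorch at all. The sketch below is illustrative only: `OOMError`, `gpu_row_budget`, and `run_adaptive` are stand-ins for `torch.OutOfMemoryError`, real VRAM capacity, and the patched `_run`; the actual patch slices torch tensors rather than Python lists.

```python
# Minimal sketch of the adaptive fallback, assuming a fake "GPU" that
# refuses batches larger than a fixed row budget (simulating an OOM).

class OOMError(Exception):
    """Stand-in for torch.OutOfMemoryError."""

def run_adaptive(execute, rows, gpu_row_budget):
    """Tier 1: run everything on the 'GPU'. Tier 2: retry with
    progressively smaller chunks. Tier 3: fall back to the 'CPU'."""
    def gpu_execute(batch):
        # Simulated VRAM limit: batches over the budget do not fit.
        if len(batch) > gpu_row_budget:
            raise OOMError(f"batch of {len(batch)} rows does not fit")
        return execute(batch)

    try:
        return gpu_execute(rows), "gpu"                # Tier 1
    except OOMError:
        pass
    for chunk_size in [4096, 2048, 1024, 512, 256, 128, 64]:  # Tier 2
        try:
            out = []
            for start in range(0, len(rows), chunk_size):
                # Slice on the "CPU", run only the slice on the "GPU",
                # accumulate results back on the "CPU".
                out.extend(gpu_execute(rows[start:start + chunk_size]))
            return out, f"chunked:{chunk_size}"
        except OOMError:
            continue
    return execute(rows), "cpu"                        # Tier 3

# A 10,000-row input against a 3,000-row budget fails Tier 1 and the
# 4096 chunk size, then succeeds with chunks of 2048.
result, mode = run_adaptive(lambda b: [x * 2 for x in b],
                            list(range(10000)), 3000)
```

Note that, as in the real patch, chunking is only attempted after the whole-input attempt raises; small workloads never pay the chunking overhead.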