# Mergekit Low VRAM Graph Patch

## Merge models in minutes instead of hours on low VRAM

This is a significant and sophisticated modification to `mergekit/graph.py`. It transforms the standard `Executor` from an "optimistic" runner (one that assumes tensors fit in VRAM) into a **robust, adaptive execution engine** designed specifically to survive low-VRAM environments. Here is a detailed analysis of the changes and how they achieve the goal of running on RTX 3060-class hardware.

### Core Strategy: "Fail Gracefully and Chunk"

The original `Executor` simply moved tensors to the GPU, executed, and moved them back. If VRAM ran out, the process crashed. This modified version implements a three-tier fallback strategy inside `_run`:

1. **Tier 1: Standard GPU execution.** Try to run the task normally on the GPU.
2. **Tier 2: Adaptive chunking.** If Tier 1 throws an OOM (`torch.OutOfMemoryError`), catch it, clear the cache, and attempt to split the operation into smaller batches (chunks).
3. **Tier 3: CPU fallback.** If chunking fails (or isn't applicable), fall back to system RAM (CPU), which is much slower but usually has far more capacity.

### Key Code Modifications

#### 1. Windows/NVIDIA Allocator Tuning

```python
if sys.platform == "win32":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
```

**Analysis:** This is a crucial addition for consumer hardware, particularly on Windows, where PyTorch often suffers from memory fragmentation. `max_split_size_mb` stops the caching allocator from splitting cached blocks larger than the given size, which reduces "fragmentation OOMs" where free memory exists but isn't contiguous.

#### 2. The `_execute_chunked` Method

This new helper method implements the logic for breaking a large tensor operation into smaller pieces.

* **Logic:** It identifies a reference tensor in the arguments, determines the total number of rows (dim 0), and iterates through the data in `chunk_size` increments.
* **Memory efficiency:**
    * It slices inputs on the CPU.
    * It moves **only the current slice** to the GPU.
    * It executes the task.
    * It moves the result **immediately back to the CPU**.
    * It deletes the GPU tensors and clears the cache.
* **Result:** Peak VRAM usage becomes proportional to `chunk_size` rather than to the full model layer size.

#### 3. The Adaptive Execution Loop (`_run`)

The `_run` method has been completely rewritten to handle the fallback logic.

**The heuristic filter:**

```python
is_io_task = task_type in ["LoadTensor", "GatherTensors", "SaveTensor", ...]
want_gpu = is_gpu_execution and (task.uses_accelerator() or not is_io_task)
```

**Analysis:** The code explicitly prevents I/O tasks (loading/saving) from clogging up the GPU. `PermutedEmbeddings` is also excluded, which is smart because embedding tables are massive (often 250 MB+) and permuting them is memory-bandwidth bound, not compute bound.

**The OOM handler:**

```python
except torch.OutOfMemoryError:
    # ... cleanup ...
    chunk_sizes = [4096, 2048, 1024, 512, 256, 128, 64]
    for chunk_size in chunk_sizes:
        try:
            res = self._execute_chunked(task, arguments, chunk_size=chunk_size)
            # ... success ...
            break
```

**Analysis:** This is the "magic" that allows 3060s to work. If a layer is too big, the executor tries progressively smaller chunks until it finds a size that fits in the remaining VRAM.

**Aggressive garbage collection:**

```python
if is_gpu_execution:
    gc.collect()
    if accelerator:
        accelerator.empty_cache()
```

**Analysis:** This runs at the end of *every* task execution loop.

* **Pros:** It ensures VRAM is as clean as possible for the next task.
* **Cons:** `cuda.empty_cache()` forces a device synchronization and adds overhead. This makes the merge significantly slower than a standard run, but it trades speed for the ability to run at all.

### Potential Risks & Limitations
1. **Assumption of row-independence:** The `_execute_chunked` method assumes that `task.execute` operates independently on rows (dimension 0).
    * **Safe:** Linear merges, SLERP (usually), and element-wise operations.
    * **Unsafe:** Operations that require global statistics across the batch dimension (e.g., `softmax` over dim 0, though this is rare in weight merging), or matrix multiplications where the split dimension is the reduction dimension.

    For standard LLM weight merging (which is usually element-wise weighted averaging), this assumption holds.
2. **Performance overhead:** The constant `gc.collect()` and `empty_cache()` calls, combined with moving data back and forth between CPU and GPU for every chunk, result in low GPU utilization. The merge will take longer, but it will complete.

### Conclusion

This is a **highly effective patch for low-VRAM users**. It trades execution speed for memory safety.

* **For a 3090/4090 user:** This script might be slower than the original due to the aggressive GC.
* **For a 3060/3060 Ti user:** This script enables functionality that is otherwise impossible (merging 70B models, or large 7B merges, with `--cuda`).

The implementation is robust because it doesn't force chunking; it only attempts chunking when the standard approach fails.
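The three-tier fallback can be sketched without PyTorch at all. The sketch below is illustrative only: `OOMError`, `gpu_row_budget`, and `run_adaptive` are stand-ins for `torch.OutOfMemoryError`, real VRAM capacity, and the patched `_run`; the actual patch slices torch tensors rather than Python lists.

```python
# Minimal sketch of the adaptive fallback, assuming a fake "GPU" that
# refuses batches larger than a fixed row budget (simulating an OOM).

class OOMError(Exception):
    """Stand-in for torch.OutOfMemoryError."""

def run_adaptive(execute, rows, gpu_row_budget):
    """Tier 1: run everything on the 'GPU'. Tier 2: retry with
    progressively smaller chunks. Tier 3: fall back to the 'CPU'."""
    def gpu_execute(batch):
        # Simulated VRAM limit: batches over the budget do not fit.
        if len(batch) > gpu_row_budget:
            raise OOMError(f"batch of {len(batch)} rows does not fit")
        return execute(batch)

    try:
        return gpu_execute(rows), "gpu"                # Tier 1
    except OOMError:
        pass
    for chunk_size in [4096, 2048, 1024, 512, 256, 128, 64]:  # Tier 2
        try:
            out = []
            for start in range(0, len(rows), chunk_size):
                # Slice on the "CPU", run only the slice on the "GPU",
                # accumulate results back on the "CPU".
                out.extend(gpu_execute(rows[start:start + chunk_size]))
            return out, f"chunked:{chunk_size}"
        except OOMError:
            continue
    return execute(rows), "cpu"                        # Tier 3

# A 10,000-row input against a 3,000-row budget fails Tier 1 and the
# 4096 chunk size, then succeeds with chunks of 2048.
result, mode = run_adaptive(lambda b: [x * 2 for x in b],
                            list(range(10000)), 3000)
```

Note that, as in the real patch, chunking is only attempted after the whole-input attempt raises; small workloads never pay the chunking overhead.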