# Mergekit Low VRAM Graph Patch
## Merge models in minutes instead of hours on low VRAM
This is a significant and sophisticated modification to `mergekit/graph.py`. It transforms the standard `Executor` from an "optimistic" runner (one that assumes tensors fit in VRAM) into a **robust, adaptive execution engine** designed specifically to survive low-VRAM environments.
Here is a detailed analysis of the changes and how they achieve the goal of running on RTX 3060-class hardware.
### Core Strategy: "Fail Gracefully and Chunk"
The original `Executor` simply moved tensors to the GPU, executed, and moved them back. If VRAM ran out, the process crashed. This modified version implements a three-tier fallback strategy inside `_run`:
1. **Tier 1: Standard GPU Execution.** Try to run the task normally on the GPU.
2. **Tier 2: Adaptive Chunking.** If Tier 1 throws an OOM (`torch.OutOfMemoryError`), catch it, clear the cache, and attempt to split the operation into smaller batches (chunks).
3. **Tier 3: CPU Fallback.** If chunking fails (or isn't applicable), fall back to system RAM (CPU), which is much slower but usually has higher capacity.
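The three tiers can be sketched in hardware-free pure Python. Here a hypothetical `FakeOOM` exception stands in for `torch.OutOfMemoryError`, and a simple length budget stands in for VRAM capacity; all names are illustrative, not mergekit's actual API:

```python
class FakeOOM(Exception):
    """Stand-in for torch.OutOfMemoryError so the sketch runs without a GPU."""

def gpu_run(task_fn, batch, budget):
    # Simulate a GPU with limited "VRAM": batches over budget raise OOM.
    if len(batch) > budget:
        raise FakeOOM()
    return [task_fn(x) for x in batch]

def run_with_fallback(task_fn, data, gpu_budget, chunk_sizes=(8, 4, 2)):
    # Tier 1: try the whole input at once on the "GPU".
    try:
        return gpu_run(task_fn, data, gpu_budget)
    except FakeOOM:
        pass
    # Tier 2: retry with progressively smaller chunks.
    for size in chunk_sizes:
        try:
            out = []
            for i in range(0, len(data), size):
                out.extend(gpu_run(task_fn, data[i:i + size], gpu_budget))
            return out
        except FakeOOM:
            continue
    # Tier 3: plain "CPU" execution with no budget at all.
    return [task_fn(x) for x in data]
```

With `gpu_budget=4`, a 10-element input fails Tier 1, fails the size-8 chunking attempt, and succeeds at size 4; with `gpu_budget=0`, everything falls through to the CPU path.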
### Key Code Modifications
#### 1. Windows/NVIDIA Allocator Tuning
```python
import os
import sys

if sys.platform == "win32":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
```
**Analysis:** This is a crucial addition for consumer hardware, particularly on Windows, where PyTorch often suffers from memory fragmentation. `max_split_size_mb` stops the caching allocator from splitting cached blocks larger than the given size, which reduces "fragmentation OOMs": failures where plenty of free VRAM exists in total but no single contiguous block is large enough. Note that the variable must be set before CUDA is first initialized for it to take effect.
#### 2. The `_execute_chunked` Method
This is a new helper method that implements the logic for breaking a large tensor operation into smaller pieces.
* **Logic:** It identifies a reference tensor in the arguments, determines the total number of rows (dim 0), and iterates through the data in `chunk_size` increments.
* **Memory Efficiency:**
* It slices inputs on the CPU.
* Moves **only the current slice** to the GPU.
* Executes the task.
* Moves the result **immediately back to the CPU**.
* Deletes the GPU tensors and clears the cache.
* **Result:** The peak VRAM usage becomes proportional to `chunk_size` rather than the full model layer size.
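The memory pattern described above can be modeled in a few lines. This is a pure-Python abstraction of the idea, not mergekit's actual `_execute_chunked`: rows are list elements, the "move to GPU" is just a slice, and a `peak` counter tracks the largest slice that was ever resident at once:

```python
def execute_chunked(task_fn, rows, chunk_size):
    """Process `rows` in chunks; return (results, peak_resident_rows).

    Only one chunk is "on the GPU" at a time, so the simulated peak
    memory is bounded by chunk_size rather than len(rows).
    """
    results, peak = [], 0
    for i in range(0, len(rows), chunk_size):
        gpu_slice = rows[i:i + chunk_size]       # move only this slice "to GPU"
        peak = max(peak, len(gpu_slice))         # track the simulated VRAM peak
        results.extend(task_fn(x) for x in gpu_slice)  # execute on the slice
        del gpu_slice                            # free before the next chunk
    return results, peak
```

Running this over 10 rows with `chunk_size=4` produces the full result while never holding more than 4 rows at once, which is exactly the VRAM-bounding property the patch relies on.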
#### 3. The Adaptive Execution Loop (`_run`)
The `_run` method has been completely rewritten to handle the fallback logic.
**The Heuristic Filter:**
```python
is_io_task = task_type in ["LoadTensor", "GatherTensors", "SaveTensor", ...]
want_gpu = is_gpu_execution and (task.uses_accelerator() or not is_io_task)
```
**Analysis:** The code explicitly prevents I/O tasks (loading/saving) from clogging up the GPU. `PermutedEmbeddings` is also excluded, which is smart because embedding tables are massive (often 250MB+) and permuting them is memory-bandwidth bound, not compute bound.
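A standalone version of this heuristic might look like the following. The task-type names come from the snippet above; the function itself is a hypothetical reconstruction, not the patch's exact code:

```python
# Tasks that are I/O- or bandwidth-bound and should stay off the GPU.
IO_TASKS = {"LoadTensor", "GatherTensors", "SaveTensor", "PermutedEmbeddings"}

def pick_device(task_type, uses_accelerator, gpu_available):
    """Route compute-bound tasks to the GPU, keep I/O tasks on the CPU
    unless they explicitly request the accelerator."""
    is_io_task = task_type in IO_TASKS
    want_gpu = gpu_available and (uses_accelerator or not is_io_task)
    return "cuda" if want_gpu else "cpu"
```

So a `LoadTensor` task lands on the CPU even when a GPU is available, while a compute task like a merge kernel is routed to `cuda`.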
**The OOM Handler:**
```python
except torch.OutOfMemoryError:
    # ... cleanup ...
    chunk_sizes = [4096, 2048, 1024, 512, 256, 128, 64]
    for chunk_size in chunk_sizes:
        try:
            res = self._execute_chunked(task, arguments, chunk_size=chunk_size)
            # ... success ...
            break
```
**Analysis:** This is the "magic" that allows 3060s to work. If a layer is too big, it tries progressively smaller chunks until it finds a size that fits in the remaining VRAM.
**Aggressive Garbage Collection:**
```python
if is_gpu_execution:
    gc.collect()
    if accelerator:
        accelerator.empty_cache()
```
**Analysis:** This runs at the end of *every* iteration of the task execution loop.
* **Pros:** It ensures VRAM is absolutely as clean as possible for the next task.
* **Cons:** `torch.cuda.empty_cache()` forces a device synchronization and adds overhead on every call. This makes the merge process significantly slower than a standard run, but it trades speed for the ability to run at all.
### Potential Risks & Limitations
1. **Assumption of Row-Independence:**
The `_execute_chunked` method assumes that the `task.execute` method operates independently on rows (dimension 0).
* **Safe:** Linear merges, SLERP (usually), and element-wise operations.
* **Unsafe:** Operations that require global statistics across the batch dimension (e.g., `softmax` over dim 0, though rare in weight merging) or matrix multiplications where the split dimension is the reduction dimension. However, for standard LLM weight merging (which is usually element-wise weighted averaging), this assumption holds.
2. **Performance Overhead:**
The constant `gc.collect()` and `empty_cache()` calls, combined with moving data back and forth between CPU and GPU for every chunk, will result in low GPU utilization. The merge will take longer, but it will complete.
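The row-independence assumption from point 1 can be checked concretely. For an element-wise weighted average (the common case in weight merging), chunking along dim 0 is exactly equivalent to the full computation; for an operation that needs a global statistic, it is not. This toy example uses Python lists in place of tensors, with an illustrative weight `w`:

```python
def weighted_average(a, b, w=0.6):
    """Element-wise merge: each output row depends only on its own inputs,
    so splitting along dim 0 cannot change the result."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def chunked(fn, a, b, chunk_size):
    """Apply fn to aligned chunks of a and b, then concatenate."""
    out = []
    for i in range(0, len(a), chunk_size):
        out.extend(fn(a[i:i + chunk_size], b[i:i + chunk_size]))
    return out

a, b = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]
# Safe: chunked element-wise merge matches the unchunked result exactly.
assert chunked(weighted_average, a, b, 2) == weighted_average(a, b)

def normalize(a, _b=None):
    """Needs the global sum: each chunk would normalize by its own sum."""
    s = sum(a)
    return [x / s for x in a]

# Unsafe: chunking changes the answer for globally-coupled operations.
assert chunked(normalize, a, a, 2) != normalize(a, a)
```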
### Conclusion
This is a **highly effective patch for low-VRAM users**. It trades execution speed for memory safety.
* **For a 3090/4090 user:** This script might be slower than the original due to the aggressive GC.
* **For a 3060/3060 Ti user:** This script enables functionality that is otherwise impossible (merging 70B models or large 7B merges with `--cuda`).
The implementation is robust because it doesn't force chunking; it only attempts it when the standard approach fails.