
Memory Optimization Module

Unified memory management system that integrates with the shared Qwen model, so modules reuse one loaded instance instead of each loading its own copy.

Module Structure

memory_optimization/
├── __init__.py              # Module exports and convenience functions
├── config.py                # MemoryOptimizationConfig
├── manager.py               # UnifiedMemoryManager
├── tensor_pool.py           # TensorPool
├── model_cache.py           # ModelCache (uses shared Qwen model)
├── cleanup.py               # MemoryCleanup
└── README.md                # This file

Features

✅ Shared Model Integration

  • ModelCache: Reuses the shared Qwen model, so no additional copy is loaded into memory
  • Automatic fallback to cached models if the shared model is unavailable
  • Prevents model duplication across modules

✅ CUDA Optimization

  • All operations run on CUDA when available
  • Efficient tensor pooling
  • Adaptive memory cleanup

✅ Self-Contained Modules

  • Each component is independent
  • Easy to test and benchmark
  • Clean separation of concerns

Usage

Basic Usage

import torch

from memory_optimization import (
    UnifiedMemoryManager,
    MemoryOptimizationConfig,
    get_unified_memory_manager
)

# Initialize with shared model
config = MemoryOptimizationConfig(
    use_shared_model=True,
    device="cuda"
)
manager = UnifiedMemoryManager(config)

# Get shared model (uses shared Qwen if available)
model = manager.get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get optimized tensor
tensor = manager.get_tensor((10, 1024), dtype=torch.float32)

# Return tensor to pool
manager.return_tensor(tensor)

Convenience Functions

from memory_optimization import (
    get_shared_model,
    get_tensor,
    return_tensor,
    clear_memory,
    get_memory_stats
)

# Get shared model
model = get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get tensor
tensor = get_tensor((10, 1024))

# Return tensor
return_tensor(tensor)

# Get stats
stats = get_memory_stats()

# Clear memory
clear_memory()

Integration with Shared Model

The module automatically detects and uses the shared Qwen model:

  1. ModelCache (model): Reuses the shared Qwen model for transformers, so no second copy is loaded
  2. ModelCache (tokenizer): Reuses the shared Qwen tokenizer in the same way
  3. Automatic Fallback: Falls back to cached models if the shared model is unavailable
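
The lookup order described above (shared instance first, cached instance second) can be sketched as follows. This is a minimal illustration, not the code in model_cache.py: the class name `ModelCacheSketch` and the `shared_provider` and `loader` callables are hypothetical stand-ins for however the module actually locates the shared Qwen model and loads a fallback.

```python
from typing import Any, Callable, Dict, Optional


class ModelCacheSketch:
    """Hypothetical 'shared model first, cached model second' lookup."""

    def __init__(
        self,
        shared_provider: Optional[Callable[[str], Any]],
        loader: Callable[[str], Any],
    ) -> None:
        self._shared_provider = shared_provider  # returns the shared model or None
        self._loader = loader                    # loads a model from scratch
        self._cache: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        # 1. Prefer the shared instance: nothing new is loaded.
        if self._shared_provider is not None:
            try:
                shared = self._shared_provider(name)
            except Exception:
                shared = None  # treat any provider failure as "unavailable"
            if shared is not None:
                return shared
        # 2. Fall back to a local cache so the model is loaded at most once.
        if name not in self._cache:
            self._cache[name] = self._loader(name)
        return self._cache[name]
```

Either way, repeated calls with the same name return the same object, which is what prevents duplication across modules.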

CUDA Compatibility

All components are CUDA-compatible:

  • Automatic device detection
  • Efficient GPU memory management
  • Adaptive cleanup based on memory pressure
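
As a rough illustration of device detection and pressure-based cleanup, the sketch below uses only standard PyTorch calls (`torch.cuda.is_available`, `torch.cuda.memory_reserved`, `torch.cuda.get_device_properties`, `torch.cuda.empty_cache`). The function names and the 0.80 / 0.95 thresholds are assumptions made for this example; the module's real values live in config.py and cleanup.py.

```python
import gc


def detect_device(preferred: str = "cuda") -> str:
    """Return "cuda" only if requested and usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    if preferred == "cuda" and torch.cuda.is_available():
        return "cuda"
    return "cpu"


def memory_pressure(device: str) -> float:
    """Fraction of GPU memory reserved by PyTorch; 0.0 on CPU."""
    if device != "cuda":
        return 0.0
    import torch
    total = torch.cuda.get_device_properties(0).total_memory
    return torch.cuda.memory_reserved(0) / total


def cleanup_level(pressure: float, threshold: float = 0.80,
                  critical: float = 0.95) -> str:
    """Map memory pressure to a cleanup level (thresholds are assumed)."""
    if pressure >= critical:
        return "emergency"
    if pressure >= threshold:
        return "adaptive"
    return "none"


def run_cleanup(device: str, level: str) -> None:
    """Collect Python garbage and release cached GPU blocks when needed."""
    if level == "none":
        return
    gc.collect()
    if device == "cuda":
        import torch
        torch.cuda.empty_cache()
```

A caller would typically chain these: `device = detect_device()`, then `run_cleanup(device, cleanup_level(memory_pressure(device)))`.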

Configuration

See config.py for all configuration options. Key settings:

  • use_shared_model: Enable shared Qwen model (default: True)
  • device: Device to use ("cuda" or "cpu")
  • memory_threshold: Memory usage threshold for cleanup
  • max_pool_size: Maximum tensor pool size
  • use_4bit_quantization: Enable 4-bit quantization
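
To make the settings above concrete, here is a hypothetical dataclass with the same field names. Only the `use_shared_model` default comes from this README; every other default, the interpretation of `memory_threshold` as a fraction, and the validation rules are assumptions for illustration, so check config.py for the actual definitions.

```python
from dataclasses import dataclass


@dataclass
class MemoryConfigSketch:
    """Illustrative stand-in for MemoryOptimizationConfig (not the real class)."""

    use_shared_model: bool = True        # default stated in this README
    device: str = "cuda"                 # assumed default
    memory_threshold: float = 0.85       # assumed: fraction of memory in use
    max_pool_size: int = 100             # assumed: max number of pooled tensors
    use_4bit_quantization: bool = False  # assumed default

    def __post_init__(self) -> None:
        # Fail early on values the rest of the system could not act on.
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"unsupported device: {self.device!r}")
        if not 0.0 < self.memory_threshold <= 1.0:
            raise ValueError("memory_threshold must be in (0, 1]")
        if self.max_pool_size < 0:
            raise ValueError("max_pool_size must be non-negative")
```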

Memory Management

The system provides:

  • Tensor Pooling: Reuse tensors to reduce allocations
  • Model Caching: Share model instances across modules
  • Adaptive Cleanup: Automatic memory cleanup based on pressure
  • Emergency Cleanup: Force cleanup when memory is critical
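
Tensor pooling in particular is easy to show in a few lines. The sketch below is one possible shape for such a pool, assuming buffers are bucketed by (shape, dtype) and that `max_pool_size` caps the number of idle tensors; `TensorPoolSketch` and its `allocate` parameter are hypothetical and not the API of tensor_pool.py.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List, Tuple

Key = Tuple[Tuple[int, ...], Any]


def _torch_empty(shape: Tuple[int, ...], dtype: Any) -> Any:
    import torch  # imported lazily so the sketch loads without PyTorch
    return torch.empty(shape, dtype=dtype)


class TensorPoolSketch:
    """Reuse returned tensors of the same shape and dtype."""

    def __init__(self, max_pool_size: int = 100,
                 allocate: Callable[[Tuple[int, ...], Any], Any] = _torch_empty) -> None:
        self.max_pool_size = max_pool_size
        self._allocate = allocate
        self._free: DefaultDict[Key, List[Any]] = defaultdict(list)
        self._count = 0  # idle tensors currently held

    def get(self, shape: Tuple[int, ...], dtype: Any) -> Any:
        bucket = self._free[(tuple(shape), dtype)]
        if bucket:
            self._count -= 1
            return bucket.pop()  # reuse; contents are stale, not zeroed
        return self._allocate(tuple(shape), dtype)

    def put(self, tensor: Any) -> bool:
        """Return a tensor to the pool; False means the pool was full."""
        if self._count >= self.max_pool_size:
            return False
        self._free[(tuple(tensor.shape), tensor.dtype)].append(tensor)
        self._count += 1
        return True

    def clear(self) -> None:
        """Drop all idle tensors, e.g. during an emergency cleanup."""
        self._free.clear()
        self._count = 0

    def __len__(self) -> int:
        return self._count
```

Because a reused tensor keeps whatever data it last held, callers should overwrite it (or call `zero_()` on a real tensor) before reading from it.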

Dependencies

  • PyTorch (CUDA support recommended)
  • Transformers
  • BitsAndBytes (for 4-bit quantization, optional)