# Memory Optimization Module

Unified memory management system with shared Qwen model integration, so that modules reuse one loaded model instead of each holding its own copy.

## Module Structure

```
memory_optimization/
├── __init__.py      # Module exports and convenience functions
├── config.py        # MemoryOptimizationConfig
├── manager.py       # UnifiedMemoryManager
├── tensor_pool.py   # TensorPool
├── model_cache.py   # ModelCache (uses shared Qwen model)
├── cleanup.py       # MemoryCleanup
└── README.md        # This file
```

## Features

### ✅ Shared Model Integration

- **ModelCache**: Uses the shared Qwen model, adding no memory overhead for a second copy
- Automatic fallback to cached models if the shared model is unavailable
- Prevents model duplication across modules

### ✅ CUDA Optimization

- All operations run on CUDA when available
- Efficient tensor pooling
- Adaptive memory cleanup

### ✅ Self-Contained Modules

- Each component is independent
- Easy to test and benchmark
- Clean separation of concerns

## Usage

### Basic Usage

```python
import torch

from memory_optimization import (
    UnifiedMemoryManager,
    MemoryOptimizationConfig,
    get_unified_memory_manager
)

# Initialize with shared model
config = MemoryOptimizationConfig(
    use_shared_model=True,
    device="cuda"
)
manager = UnifiedMemoryManager(config)

# Get shared model (uses shared Qwen if available)
model = manager.get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get optimized tensor
tensor = manager.get_tensor((10, 1024), dtype=torch.float32)

# Return tensor to pool
manager.return_tensor(tensor)
```

### Convenience Functions

```python
from memory_optimization import (
    get_shared_model,
    get_tensor,
    return_tensor,
    clear_memory,
    get_memory_stats
)

# Get shared model
model = get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get tensor
tensor = get_tensor((10, 1024))

# Return tensor
return_tensor(tensor)

# Get stats
stats = get_memory_stats()

# Clear memory
clear_memory()
```

## Integration with Shared Model

The module automatically detects and uses the shared Qwen model:

1. **ModelCache**: Uses the shared Qwen model for transformers (no extra copy in memory)
2. **ModelCache**: Uses the shared Qwen tokenizer (no extra copy in memory)
3. **Automatic Fallback**: Falls back to cached models if the shared model is unavailable

## CUDA Compatibility

All components are CUDA-compatible:

- Automatic device detection
- Efficient GPU memory management
- Adaptive cleanup based on memory pressure

## Configuration

See `config.py` for all configuration options. Key settings:

- `use_shared_model`: Enable shared Qwen model (default: True)
- `device`: Device to use ("cuda" or "cpu")
- `memory_threshold`: Memory usage threshold for cleanup
- `max_pool_size`: Maximum tensor pool size
- `use_4bit_quantization`: Enable 4-bit quantization

## Memory Management

The system provides:

- **Tensor Pooling**: Reuse tensors to reduce allocations
- **Model Caching**: Share model instances across modules
- **Adaptive Cleanup**: Automatic memory cleanup based on pressure
- **Emergency Cleanup**: Force cleanup when memory is critical

## Dependencies

- PyTorch (CUDA support recommended)
- Transformers
- BitsAndBytes (for 4-bit quantization, optional)
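
The dependencies listed above can be checked before importing the module. The following is a minimal sketch, not part of `memory_optimization` itself: `check_dependencies` is a hypothetical helper name, and the import names `torch`, `transformers`, and `bitsandbytes` are assumed to match the usual package names.

```python
import importlib.util


def check_dependencies():
    """Report which of the listed packages are importable here.

    Hypothetical helper for illustration; not part of memory_optimization.
    """
    # Assumed import names for PyTorch, Transformers and BitsAndBytes.
    names = ["torch", "transformers", "bitsandbytes"]
    report = {name: importlib.util.find_spec(name) is not None for name in names}

    # Only ask about CUDA when PyTorch is actually installed.
    cuda_available = False
    if report["torch"]:
        import torch

        cuda_available = torch.cuda.is_available()
    report["cuda"] = cuda_available
    return report


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name}: {'available' if available else 'missing'}")
```

If `bitsandbytes` or CUDA is reported missing, one option is to set `use_4bit_quantization=False` or `device="cpu"` in `MemoryOptimizationConfig`; whether the module already handles these cases on its own is not stated here.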