# Memory Optimization Module
A unified memory management system that reuses a single shared Qwen model across modules, so no module loads its own copy of the model.
## Module Structure
```
memory_optimization/
├── __init__.py      # Module exports and convenience functions
├── config.py        # MemoryOptimizationConfig
├── manager.py       # UnifiedMemoryManager
├── tensor_pool.py   # TensorPool
├── model_cache.py   # ModelCache (uses shared Qwen model)
├── cleanup.py       # MemoryCleanup
└── README.md        # This file
```
## Features
### ✅ Shared Model Integration
- **ModelCache**: Uses shared Qwen model for zero memory overhead
- Automatic fallback to cached models if shared model unavailable
- Prevents model duplication across modules
### ✅ CUDA Optimization
- All operations run on CUDA when available
- Efficient tensor pooling
- Adaptive memory cleanup
### ✅ Self-Contained Modules
- Each component is independent
- Easy to test and benchmark
- Clean separation of concerns
## Usage
### Basic Usage
```python
import torch

from memory_optimization import (
    UnifiedMemoryManager,
    MemoryOptimizationConfig,
    get_unified_memory_manager,
)

# Initialize with shared model; use the CPU when CUDA is not available
config = MemoryOptimizationConfig(
    use_shared_model=True,
    device="cuda" if torch.cuda.is_available() else "cpu",
)
manager = UnifiedMemoryManager(config)

# Get shared model (uses shared Qwen if available)
model = manager.get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get a tensor from the pool
tensor = manager.get_tensor((10, 1024), dtype=torch.float32)

# Return the tensor to the pool once it is no longer needed
manager.return_tensor(tensor)
```
### Convenience Functions
```python
from memory_optimization import (
    get_shared_model,
    get_tensor,
    return_tensor,
    clear_memory,
    get_memory_stats,
)

# Get shared model
model = get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get tensor
tensor = get_tensor((10, 1024))

# Return tensor
return_tensor(tensor)

# Get stats
stats = get_memory_stats()

# Clear memory
clear_memory()
```
## Integration with Shared Model
The module automatically detects and uses the shared Qwen model:
1. **Shared model**: `ModelCache` returns the shared Qwen model for transformer requests instead of loading a second copy.
2. **Shared tokenizer**: `ModelCache` reuses the shared Qwen tokenizer in the same way.
3. **Automatic fallback**: If the shared model is unavailable, `ModelCache` falls back to its own cached models.
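The lookup order above can be pictured with a small sketch. It is not the actual `ModelCache` code: the class name `ModelCacheSketch` and the `shared_lookup` and `loader` callables are hypothetical stand-ins, chosen so the example runs without downloading a model.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class ModelCacheSketch:
    """Hypothetical stand-in for ModelCache: prefer the shared model, else cache."""

    def __init__(
        self,
        shared_lookup: Callable[[str], Optional[Any]],
        loader: Callable[[str, str], Any],
    ) -> None:
        self._shared_lookup = shared_lookup  # returns the shared model or None
        self._loader = loader                # loads a private copy on demand
        self._cache: Dict[Tuple[str, str], Any] = {}

    def get(self, name: str, kind: str) -> Any:
        # 1. Prefer the shared instance so no second copy is loaded.
        shared = self._shared_lookup(name)
        if shared is not None:
            return shared
        # 2. Fall back to a cached private copy, loading it at most once.
        key = (name, kind)
        if key not in self._cache:
            self._cache[key] = self._loader(name, kind)
        return self._cache[key]
```

Because the fallback path is cached by `(name, kind)`, repeated requests for the same model still return one instance even when no shared model exists.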
## CUDA Compatibility
All components are CUDA-compatible:
- Automatic device detection
- Efficient GPU memory management
- Adaptive cleanup based on memory pressure
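As an illustration of automatic device detection, here is a minimal sketch. The helper name `resolve_device` is hypothetical and not part of this module; the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
def resolve_device(requested: str = "cuda") -> str:
    """Return `requested` if it is usable, otherwise fall back to "cpu"."""
    if requested == "cpu":
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    # Keep the requested CUDA device only when a GPU is actually present.
    return requested if torch.cuda.is_available() else "cpu"
```

The result can be passed as the `device` setting, so the same script runs on machines with and without a GPU.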
## Configuration
See `config.py` for all configuration options. Key settings:
- `use_shared_model`: Enable shared Qwen model (default: True)
- `device`: Device to use ("cuda" or "cpu")
- `memory_threshold`: Memory usage threshold for cleanup
- `max_pool_size`: Maximum tensor pool size
- `use_4bit_quantization`: Enable 4-bit quantization
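To make the settings concrete, the sketch below models them as a dataclass. It is an assumption about the shape of the config, not a copy of `config.py`: only the `use_shared_model` default comes from this README, and every other default, the unit of `memory_threshold`, and the validation rules are illustrative guesses.

```python
from dataclasses import dataclass


@dataclass
class MemoryOptimizationConfigSketch:
    """Hypothetical mirror of the key settings listed above."""

    use_shared_model: bool = True         # default stated in this README
    device: str = "cuda"                  # "cuda" or "cpu"
    memory_threshold: float = 0.85        # assumed: fraction of memory in use
    max_pool_size: int = 64               # assumed: maximum number of pooled tensors
    use_4bit_quantization: bool = False   # assumed default

    def __post_init__(self) -> None:
        # Reject obviously invalid values early (illustrative checks only).
        if self.device != "cpu" and not self.device.startswith("cuda"):
            raise ValueError(f"unsupported device: {self.device!r}")
        if not 0.0 < self.memory_threshold <= 1.0:
            raise ValueError("memory_threshold must be in (0, 1]")
        if self.max_pool_size < 0:
            raise ValueError("max_pool_size must be non-negative")
```

Check `config.py` for the real field names, types, and defaults before relying on any of these values.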
## Memory Management
The system provides:
- **Tensor Pooling**: Reuse tensors to reduce allocations
- **Model Caching**: Share model instances across modules
- **Adaptive Cleanup**: Automatic memory cleanup based on pressure
- **Emergency Cleanup**: Force cleanup when memory is critical
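The pooling and cleanup ideas can be sketched in a few lines. This is a simplified illustration rather than the module's `TensorPool` or `MemoryCleanup`: the names `BufferPoolSketch` and `cleanup_action`, the `allocate` callable, and the 0.85 and 0.95 thresholds are all assumptions.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List, Sequence, Tuple

Key = Tuple[Tuple[int, ...], str]


class BufferPoolSketch:
    """Reuse buffers keyed by (shape, dtype) instead of reallocating them."""

    def __init__(
        self,
        allocate: Callable[[Tuple[int, ...], str], Any],
        max_pool_size: int = 64,
    ) -> None:
        self._allocate = allocate
        self._max_pool_size = max_pool_size
        self._free: DefaultDict[Key, List[Any]] = defaultdict(list)
        self._size = 0  # number of buffers currently held by the pool

    def get(self, shape: Sequence[int], dtype: str = "float32") -> Any:
        bucket = self._free.get((tuple(shape), dtype))
        if bucket:
            self._size -= 1
            return bucket.pop()          # reuse a returned buffer
        return self._allocate(tuple(shape), dtype)  # pool miss: allocate

    def put(self, buf: Any, shape: Sequence[int], dtype: str = "float32") -> bool:
        if self._size >= self._max_pool_size:
            return False                 # pool is full: let the buffer be freed
        self._free[(tuple(shape), dtype)].append(buf)
        self._size += 1
        return True

    def clear(self) -> None:
        self._free.clear()
        self._size = 0


def cleanup_action(used_fraction: float, threshold: float = 0.85,
                   critical: float = 0.95) -> str:
    """Map memory pressure to a cleanup level (thresholds are assumed)."""
    if used_fraction >= critical:
        return "emergency"
    if used_fraction >= threshold:
        return "adaptive"
    return "none"
```

With PyTorch, `allocate` could be something like `lambda shape, dtype: torch.empty(shape, dtype=getattr(torch, dtype))`; an "emergency" result would then typically trigger `pool.clear()` followed by `torch.cuda.empty_cache()`.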
## Dependencies
- PyTorch (CUDA support recommended)
- Transformers
- BitsAndBytes (for 4-bit quantization, optional)