# Memory Optimization Module

Unified memory management system that integrates with the shared Qwen model, so modules reuse one loaded instance instead of each allocating memory for their own copy.

## Module Structure

```
memory_optimization/
├── __init__.py              # Module exports and convenience functions
├── config.py                # MemoryOptimizationConfig
├── manager.py               # UnifiedMemoryManager
├── tensor_pool.py           # TensorPool
├── model_cache.py           # ModelCache (uses shared Qwen model)
├── cleanup.py               # MemoryCleanup
└── README.md                # This file
```

## Features

### ✅ Shared Model Integration
- **ModelCache**: Reuses the shared Qwen model instead of loading a second copy, so no additional model memory is allocated
- Automatic fallback to cached models if the shared model is unavailable
- Prevents model duplication across modules

### ✅ CUDA Optimization
- Operations run on CUDA when a GPU is available
- Efficient tensor pooling
- Adaptive memory cleanup

### ✅ Self-Contained Modules
- Each component is independent
- Easy to test and benchmark
- Clean separation of concerns

## Usage

### Basic Usage

```python
import torch  # needed for torch.float32 below

from memory_optimization import (
    UnifiedMemoryManager,
    MemoryOptimizationConfig,
    get_unified_memory_manager
)

# Initialize with shared model
config = MemoryOptimizationConfig(
    use_shared_model=True,
    device="cuda"
)
manager = UnifiedMemoryManager(config)

# Get shared model (uses shared Qwen if available)
model = manager.get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get optimized tensor
tensor = manager.get_tensor((10, 1024), dtype=torch.float32)

# Return tensor to pool
manager.return_tensor(tensor)
```
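
The `get_tensor` / `return_tensor` pair above follows the usual object-pool pattern. The following is a minimal sketch of that pattern, not the code in `tensor_pool.py`: `SimpleTensorPool`, `FakeTensor`, the injected `allocate` callable, and the per-bucket reading of `max_pool_size` are all assumptions made for illustration. `FakeTensor` stands in for `torch.Tensor` so the example runs without PyTorch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor (only shape and dtype matter here)."""
    shape: tuple
    dtype: str


class SimpleTensorPool:
    """Illustrative pool: free buffers are bucketed by (shape, dtype)."""

    def __init__(self, allocate, max_pool_size=8):
        # With PyTorch, `allocate` could be something like:
        #   lambda shape, dtype: torch.empty(shape, dtype=dtype, device="cuda")
        self._allocate = allocate
        self._max_pool_size = max_pool_size  # assumed to be a per-bucket limit
        self._free = defaultdict(list)
        self.hits = 0
        self.misses = 0

    def get(self, shape, dtype):
        bucket = self._free[(tuple(shape), dtype)]
        if bucket:
            self.hits += 1
            return bucket.pop()  # reuse a previously returned buffer
        self.misses += 1
        return self._allocate(tuple(shape), dtype)

    def put(self, tensor):
        bucket = self._free[(tuple(tensor.shape), tensor.dtype)]
        if len(bucket) < self._max_pool_size:
            bucket.append(tensor)
        # Otherwise the buffer is dropped and left to the garbage collector.


pool = SimpleTensorPool(allocate=FakeTensor)
a = pool.get((10, 1024), "float32")  # miss: a new buffer is allocated
pool.put(a)
b = pool.get((10, 1024), "float32")  # hit: the same buffer comes back
```

A reused buffer still holds whatever the previous user wrote into it, so callers of a pool like this should treat returned tensors as uninitialized.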

### Convenience Functions

```python
from memory_optimization import (
    get_shared_model,
    get_tensor,
    return_tensor,
    clear_memory,
    get_memory_stats
)

# Get shared model
model = get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get tensor
tensor = get_tensor((10, 1024))

# Return tensor
return_tensor(tensor)

# Get stats
stats = get_memory_stats()

# Clear memory
clear_memory()
```
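
Module-level helpers like these are commonly thin wrappers around one lazily created manager instance. The sketch below shows that pattern under the assumption that this is how the convenience functions behave; `DemoManager`, `get_manager`, and the `cleared` counter are hypothetical names, not the module's API.

```python
import threading


class DemoManager:
    """Hypothetical stand-in for UnifiedMemoryManager."""

    def __init__(self):
        self.cleared = 0

    def clear_memory(self):
        self.cleared += 1


_manager = None
_lock = threading.Lock()


def get_manager():
    """Create the manager on first use and return the same instance afterwards."""
    global _manager
    if _manager is None:
        with _lock:
            # Re-check inside the lock so two threads cannot both create one.
            if _manager is None:
                _manager = DemoManager()
    return _manager


def clear_memory():
    """Convenience wrapper that forwards to the shared manager."""
    get_manager().clear_memory()
```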

## Integration with Shared Model

The module automatically detects and uses the shared Qwen model:

1. **Model**: `ModelCache` returns the shared Qwen model for transformer requests instead of loading a new copy
2. **Tokenizer**: `ModelCache` returns the shared Qwen tokenizer in the same way
3. **Automatic Fallback**: If the shared model is unavailable, `ModelCache` falls back to its own cached models
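
The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `model_cache.py`: `SketchModelCache`, the `shared_registry` dictionary, and the `loader` callable are hypothetical, and plain objects stand in for real models.

```python
class SketchModelCache:
    """Illustrative cache: prefer a shared instance, else load once and keep it."""

    def __init__(self, shared_registry, loader):
        self._shared = shared_registry  # models owned elsewhere (assumed interface)
        self._loader = loader           # called only when no shared model exists
        self._local = {}                # fallback cache owned by this object

    def get(self, name):
        # 1. Reuse the shared instance if one is registered: nothing new is loaded.
        if name in self._shared:
            return self._shared[name], "shared"
        # 2. Otherwise load the model once and serve it from the local cache.
        if name not in self._local:
            self._local[name] = self._loader(name)
        return self._local[name], "cached"


load_calls = []


def demo_loader(name):
    """Hypothetical loader; records each call so duplicate loads are visible."""
    load_calls.append(name)
    return {"name": name}


shared_models = {"Qwen/Qwen3-0.6B": object()}
cache = SketchModelCache(shared_models, demo_loader)

model, source = cache.get("Qwen/Qwen3-0.6B")  # served from the shared registry
other_1, _ = cache.get("some/other-model")    # loaded through the fallback path
other_2, _ = cache.get("some/other-model")    # second call reuses the cached copy
```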

## CUDA Compatibility

All components are CUDA-compatible:
- Automatic device detection
- Efficient GPU memory management
- Adaptive cleanup based on memory pressure
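
Automatic device detection usually comes down to a single check against `torch.cuda.is_available()`. A minimal sketch, assuming a helper named `resolve_device` (hypothetical, not part of this module); the guarded import is only there so the example also runs where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # keeps the sketch runnable without PyTorch
    torch = None


def resolve_device(requested: str = "cuda") -> str:
    """Return "cuda" only if it was requested and a CUDA device is usable."""
    if requested == "cuda" and torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


device = resolve_device("cuda")  # "cuda" on a GPU machine, "cpu" otherwise
```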

## Configuration

See `config.py` for all configuration options. Key settings:

- `use_shared_model`: Enable shared Qwen model (default: True)
- `device`: Device to use ("cuda" or "cpu")
- `memory_threshold`: Memory usage level at which cleanup is triggered
- `max_pool_size`: Maximum tensor pool size
- `use_4bit_quantization`: Enable 4-bit quantization
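
As a rough picture of how these settings fit together, here is a sketch of a config dataclass. Only the field names and the `use_shared_model` default come from the list above; the class name `SketchConfig`, every other default, and the validation rules are assumptions, so check `config.py` for the real values.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Illustrative stand-in for MemoryOptimizationConfig."""
    use_shared_model: bool = True        # default stated in this README
    device: str = "cuda"                 # assumed default
    memory_threshold: float = 0.8        # assumed: fraction of memory in use
    max_pool_size: int = 100             # assumed default
    use_4bit_quantization: bool = False  # assumed default

    def __post_init__(self):
        # Validation rules below are assumptions for the sketch.
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"unsupported device: {self.device!r}")
        if not 0.0 < self.memory_threshold <= 1.0:
            raise ValueError("memory_threshold must be in (0, 1]")
        if self.max_pool_size < 0:
            raise ValueError("max_pool_size must be non-negative")


cpu_config = SketchConfig(device="cpu", max_pool_size=16)
```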

## Memory Management

The system provides:
- **Tensor Pooling**: Reuse tensors to reduce allocations
- **Model Caching**: Share model instances across modules
- **Adaptive Cleanup**: Automatic memory cleanup based on pressure
- **Emergency Cleanup**: Force cleanup when memory is critical
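
The difference between adaptive and emergency cleanup can be sketched as a two-threshold policy. Everything here is an assumption for illustration: the function names, the `critical_threshold` parameter, both default values, and the "drop half the pool" rule are not taken from `cleanup.py`.

```python
import gc


def cleanup_level(used_bytes, total_bytes,
                  memory_threshold=0.8, critical_threshold=0.95):
    """Map current memory usage to "none", "adaptive", or "emergency"."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    usage = used_bytes / total_bytes
    if usage >= critical_threshold:
        return "emergency"
    if usage >= memory_threshold:
        return "adaptive"
    return "none"


def run_cleanup(level, pool_buckets):
    """Release pooled buffers according to the level; return how many were dropped.

    `pool_buckets` is a dict of lists of pooled buffers (hypothetical layout).
    """
    if level == "none":
        return 0
    released = 0
    if level == "emergency":
        # Memory is critical: drop every pooled buffer.
        released = sum(len(bucket) for bucket in pool_buckets.values())
        pool_buckets.clear()
    else:
        # Adaptive: drop half of each bucket and keep the rest for reuse.
        for bucket in pool_buckets.values():
            drop = len(bucket) // 2
            del bucket[:drop]
            released += drop
    gc.collect()
    # On CUDA, torch.cuda.empty_cache() would typically follow here.
    return released
```

On a GPU, the inputs could come from `torch.cuda.memory_allocated()` and `torch.cuda.get_device_properties(0).total_memory`.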

## Dependencies

- PyTorch (CUDA support recommended)
- Transformers
- bitsandbytes (optional, only needed for 4-bit quantization)