# Memory Optimization Module

Unified memory management system that integrates with a shared Qwen model, so modules reuse one loaded model instance instead of each holding a duplicate copy in memory.

## Module Structure

```
memory_optimization/
├── __init__.py              # Module exports and convenience functions
├── config.py                # MemoryOptimizationConfig
├── manager.py               # UnifiedMemoryManager
├── tensor_pool.py           # TensorPool
├── model_cache.py           # ModelCache (uses shared Qwen model)
├── cleanup.py               # MemoryCleanup
└── README.md                # This file
```

## Features

### ✅ Shared Model Integration
- **ModelCache**: Reuses the shared Qwen model, so no second copy is loaded
- Falls back automatically to cached models if the shared model is unavailable
- Prevents model duplication across modules

### ✅ CUDA Optimization
- All operations run on CUDA when available
- Efficient tensor pooling
- Adaptive memory cleanup

### ✅ Self-Contained Modules
- Each component is independent
- Easy to test and benchmark
- Clean separation of concerns

## Usage

### Basic Usage

```python
import torch  # needed for torch.float32 below

from memory_optimization import (
    UnifiedMemoryManager,
    MemoryOptimizationConfig,
    get_unified_memory_manager
)

# Initialize with shared model
config = MemoryOptimizationConfig(
    use_shared_model=True,
    device="cuda"
)
manager = UnifiedMemoryManager(config)

# Get shared model (uses shared Qwen if available)
model = manager.get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get optimized tensor
tensor = manager.get_tensor((10, 1024), dtype=torch.float32)

# Return tensor to pool
manager.return_tensor(tensor)
```

### Convenience Functions

```python
from memory_optimization import (
    get_shared_model,
    get_tensor,
    return_tensor,
    clear_memory,
    get_memory_stats
)

# Get shared model
model = get_shared_model("Qwen/Qwen3-0.6B", "transformer")

# Get tensor
tensor = get_tensor((10, 1024))

# Return tensor
return_tensor(tensor)

# Get stats
stats = get_memory_stats()

# Clear memory
clear_memory()
```

## Integration with Shared Model

The module automatically detects and uses the shared Qwen model:

1. **ModelCache**: Uses shared Qwen model for transformers (zero memory overhead)
2. **ModelCache**: Uses shared Qwen tokenizer (zero memory overhead)
3. **Automatic Fallback**: Falls back to cached models if shared model unavailable
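
The lookup order above (shared model first, cached model as fallback) could be sketched roughly as follows. This is not the module's actual `ModelCache` code: `SketchModelCache`, `shared_provider`, and `loader` are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class SketchModelCache:
    """Illustrative only: prefer a shared model, else load once and cache."""

    def __init__(
        self,
        shared_provider: Optional[Callable[[str], Optional[Any]]],
        loader: Callable[[str, str], Any],
    ) -> None:
        # Hypothetical hook that returns the shared model, or None if absent.
        self._shared_provider = shared_provider
        # Hypothetical function that actually loads a model from disk or hub.
        self._loader = loader
        self._cache: Dict[Tuple[str, str], Any] = {}

    def get(self, name: str, kind: str) -> Any:
        # 1. Use the shared instance when one is available.
        if self._shared_provider is not None:
            shared = self._shared_provider(name)
            if shared is not None:
                return shared
        # 2. Otherwise fall back to a local cache, loading at most once per key.
        key = (name, kind)
        if key not in self._cache:
            self._cache[key] = self._loader(name, kind)
        return self._cache[key]
```

In this sketch the shared instance is never stored in the local cache, so the fallback path only holds models that this cache loaded itself.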

## CUDA Compatibility

All components are CUDA-compatible:
- Automatic device detection
- Efficient GPU memory management
- Adaptive cleanup based on memory pressure
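
As a minimal sketch of what device detection and pressure-based cleanup might look like: `pick_device` and `cleanup_level` are hypothetical helpers, not part of this module, and the `critical` cutoff plus the reading of `memory_threshold` as a fraction of total memory are assumptions.

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return "cuda" only when PyTorch is installed and reports a usable GPU."""
    if preferred != "cuda":
        return preferred
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def cleanup_level(
    used_bytes: int,
    total_bytes: int,
    memory_threshold: float = 0.8,  # assumed: fraction of total memory
    critical: float = 0.95,         # assumed: cutoff for emergency cleanup
) -> str:
    """Map memory pressure to "none", "adaptive", or "emergency"."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pressure = used_bytes / total_bytes
    if pressure >= critical:
        return "emergency"
    if pressure >= memory_threshold:
        return "adaptive"
    return "none"
```

On a GPU, the two byte counts could come from `torch.cuda.memory_allocated()` and the device's total memory; the sketch takes them as plain arguments so the decision logic stays easy to test.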

## Configuration

See `config.py` for all configuration options. Key settings:

- `use_shared_model`: Enable shared Qwen model (default: True)
- `device`: Device to use ("cuda" or "cpu")
- `memory_threshold`: Memory usage threshold for cleanup
- `max_pool_size`: Maximum tensor pool size
- `use_4bit_quantization`: Enable 4-bit quantization
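
For orientation, the settings listed above might be grouped in a dataclass along these lines. This is a hedged sketch, not the contents of `config.py`: the class name `SketchMemoryConfig` is hypothetical, and every default except `use_shared_model=True` is an assumption chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SketchMemoryConfig:
    """Illustrative stand-in for MemoryOptimizationConfig (not the real class)."""

    use_shared_model: bool = True        # default stated in this README
    device: str = "cuda"                 # assumed default
    memory_threshold: float = 0.8        # assumed: fraction of total memory
    max_pool_size: int = 100             # assumed: maximum pooled tensors
    use_4bit_quantization: bool = False  # assumed default

    def __post_init__(self) -> None:
        # Reject values the rest of the sketch could not interpret.
        if self.device not in ("cuda", "cpu"):
            raise ValueError(f"unsupported device: {self.device!r}")
        if not 0.0 < self.memory_threshold <= 1.0:
            raise ValueError("memory_threshold must be in (0, 1]")
        if self.max_pool_size < 0:
            raise ValueError("max_pool_size must be non-negative")
```

Validating in `__post_init__` surfaces a bad setting when the config is constructed rather than later, during allocation or cleanup.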

## Memory Management

The system provides:
- **Tensor Pooling**: Reuse tensors to reduce allocations
- **Model Caching**: Share model instances across modules
- **Adaptive Cleanup**: Automatic memory cleanup based on pressure
- **Emergency Cleanup**: Force cleanup when memory is critical
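
The pooling idea can be illustrated with a small, framework-agnostic sketch. It is not the module's `TensorPool`: `SketchPool` and its `allocate` callback are hypothetical, and buffers are keyed by whatever hashable description the caller chooses.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Hashable, List


class SketchPool:
    """Illustrative pool: reuse returned buffers instead of reallocating."""

    def __init__(self, allocate: Callable[[Hashable], Any], max_pool_size: int = 8) -> None:
        self._allocate = allocate      # hypothetical factory, called on a pool miss
        self._max = max_pool_size      # cap on buffers held across all keys
        self._free: DefaultDict[Hashable, List[Any]] = defaultdict(list)
        self._count = 0                # buffers currently held in the pool
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Any:
        bucket = self._free[key]
        if bucket:
            # Reuse a previously returned buffer with the same key.
            self.hits += 1
            self._count -= 1
            return bucket.pop()
        self.misses += 1
        return self._allocate(key)

    def put(self, key: Hashable, buf: Any) -> bool:
        # Refuse the buffer when the pool is full; the caller just drops it.
        if self._count >= self._max:
            return False
        self._free[key].append(buf)
        self._count += 1
        return True

    def clear(self) -> None:
        # Drop every pooled buffer, e.g. as part of an emergency cleanup.
        self._free.clear()
        self._count = 0
```

With PyTorch, the key could be a `(shape, dtype)` pair and `allocate` something like `lambda k: torch.empty(k[0], dtype=k[1])`. Reused buffers keep their old contents, so callers should overwrite them before reading.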

## Dependencies

- PyTorch (CUDA support recommended)
- Transformers
- BitsAndBytes (for 4-bit quantization, optional)