Model Inference Flow on Virtual GPU
===================================

1. Storage and VRAM Setup
-------------------------
[HTTPGPUStorage]
      │      ╲
      │       ╲  Zero-Copy
      │        ╲ Memory Mapping
      ▼         ▼
[Local Storage]──>[Virtual VRAM]
(Memory Pages)    (Page Tables)
      │                │
      └───────┬────────┘
              ▼
        [vGPU Device]
              │
              ▼
2. Model Loading and Device Movement
------------------------------------
[Florence-2-Large] ──load──> [PyTorch Model]
        │                          │
        │                          ▼
        │                [to_vgpu() conversion]
        │                          │
        └────────────┬─────────────┘
                     ▼
          [Model on vGPU Device]
                     │
                     ▼
3. Input Processing and Inference
---------------------------------
[Input Text] ──> [Tokenizer] ──> [Tensor]
                                    │
                                    ▼
                        [to_vgpu() conversion]
                                    │
                                    ▼
                           [Tensor on vGPU]
                                    │
                                    ▼
4. Model Inference Flow
-----------------------
  [Model Forward Pass]
           │
           ▼
   [vGPU Computation]
           │
           ▼
[PyTorch Output Tensor]
           │
           ▼
  [Last Hidden State]
(Shape: [batch_size, seq_length, hidden_size])
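
The four stages above can be sketched end to end with stand-in classes. `Tokenizer`, `VGPUDevice.to_vgpu`, and `forward` here are illustrative placeholders for the real tokenizer, the project's `to_vgpu()` interface, and the Florence-2 forward pass, not its actual API:

```python
class Tokenizer:
    """Toy tokenizer: maps each word to an integer id."""
    def __init__(self):
        self.vocab = {}

    def __call__(self, text):
        ids = [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]
        return [ids]  # shape: [batch_size=1, seq_length]

class VGPUDevice:
    """Stand-in for the virtual GPU: records tensors placed on it."""
    def __init__(self):
        self.resident = []

    def to_vgpu(self, tensor):
        self.resident.append(tensor)  # zero-copy in spirit: same object, no duplicate
        return tensor

def forward(batch, hidden_size=4):
    """Fake forward pass: one hidden vector per input token."""
    return [[[0.0] * hidden_size for _ in seq] for seq in batch]

device = VGPUDevice()
tok = Tokenizer()
batch = device.to_vgpu(tok("a prompt for the model"))
last_hidden = forward(batch)
# Shape: [batch_size, seq_length, hidden_size] = [1, 5, 4]
```

The point of the sketch is the dataflow order: tokenize first, move the tensor to the vGPU, then run the forward pass and read the last hidden state.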
Data Flow and Memory Management:
--------------------------------
1. Storage Layer:
   - HTTPGPUStorage ──> Local Storage (Memory Pages)
   - Local Storage ──> Virtual VRAM (Zero-Copy)
   - Virtual VRAM manages page tables pointing to local storage

2. Memory Architecture:
   - Local Storage: physical memory pages
   - Virtual VRAM: page tables and memory mappings
   - Zero-copy between Local Storage and VRAM
   - Direct memory access for GPU operations

3. Processing Flow:
   - Model Layer: HF Model ──> PyTorch ──> vGPU
   - Input Layer: Text ──> Tokens ──> Tensor ──> vGPU
   - Output Layer: vGPU ──> PyTorch Tensor ──> Results
Key Components:
---------------
- HTTP Storage: HTTPGPUStorage (network interface)
- Local Store: memory pages (physical storage)
- Virtual VRAM: page tables (memory management)
- Device: vGPU (computation)
- Model: Florence-2-Large (transformer)
- Framework: PyTorch (ML operations)
- Interface: to_vgpu() (zero-copy transfer)
Memory Management Details:
--------------------------
1. Local Storage:
   - Manages physical memory pages
   - Direct mapping to virtual VRAM
   - Zero-copy access for GPU ops

2. Virtual VRAM:
   - Page table management
   - Memory mapping to local storage
   - No physical copying of data
   - Direct GPU access to memory
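
The "no physical copying" point can be demonstrated with plain Python `memoryview`s standing in for the page table. The 4 KiB page size and the shapes here are assumptions for illustration:

```python
PAGE_SIZE = 4096

# Local Storage: physical memory pages.
local_pages = [bytearray(PAGE_SIZE) for _ in range(4)]

# Virtual VRAM: a page table of zero-copy views into those pages.
page_table = {i: memoryview(p) for i, p in enumerate(local_pages)}

local_pages[2][10] = 0x7F          # write through local storage
assert page_table[2][10] == 0x7F   # visible via VRAM: same bytes, no copy

copied = bytes(local_pages[2])     # a real copy, by contrast
local_pages[2][10] = 0x00
assert copied[10] == 0x7F          # the copy did NOT follow the update
```

A view stays coherent with the backing pages for free; a copy goes stale the moment the pages change, which is exactly what the page-table design avoids.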
Model Load (.npy files)
    │
    └── AIAccelerator (Manages distribution)
           │
           ├── MultiGPUSystem (8 chips)
           │      │
           │      ├── Each GPUChip (108 SMs each)
           │      │      │
           │      │      └── Each SM (3000 tensor cores)
           │      │             │
           │      │             └── Individual Tensor Cores
           │      │                 (Direct hardware-level execution)
           │      │
           │      └── NVLink 4.0 between chips
           │
           └── LocalStorage (electron-speed data access)
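
The hardware hierarchy above can be sketched as plain dataclasses. The counts (8 chips, 108 SMs per chip, 3000 tensor cores per SM) come from the tree; the class and field names are illustrative, not the project's real ones:

```python
from dataclasses import dataclass, field

@dataclass
class StreamingMultiprocessor:
    tensor_cores: int = 3000  # per-SM core count from the diagram

@dataclass
class GPUChip:
    # 108 SMs per chip
    sms: list = field(default_factory=lambda: [StreamingMultiprocessor() for _ in range(108)])

@dataclass
class MultiGPUSystem:
    # 8 chips, linked by NVLink in the real design
    chips: list = field(default_factory=lambda: [GPUChip() for _ in range(8)])

    def total_tensor_cores(self):
        return sum(sm.tensor_cores for chip in self.chips for sm in chip.sms)

system = MultiGPUSystem()
# 8 chips x 108 SMs x 3000 cores = 2,592,000 tensor cores
```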
AI Model/Operation
    │
    └── AIAccelerator
           │
           ├── GPUParallelDistributor (Splits work)
           │      │
           │      └── Distributes across GPUs
           │
           └── MultiGPUSystem (Manages hardware)
                  │
                  ├── 8 GPU Chips
                  │      │
                  │      ├── 108 SMs each
                  │      │      │
                  │      │      └── 3000 tensor cores each
                  │      │
                  │      └── Local Storage
                  │
                  └── NVLink Connections
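
A minimal sketch of what `GPUParallelDistributor` does: split a batch of work items across the 8 chips. The round-robin policy is an assumption; the real class may balance work differently:

```python
def distribute(items, n_chips=8):
    """Return one work list per chip, assigned round-robin."""
    buckets = [[] for _ in range(n_chips)]
    for i, item in enumerate(items):
        buckets[i % n_chips].append(item)
    return buckets

# 20 work items spread over 8 chips: chips 0-3 get 3 items, chips 4-7 get 2.
work = distribute(list(range(20)))
```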
http_storage.py (LocalStorage)
        │
tensor_storage.py (TensorStorage)
        │
multithread_storage.py (MultithreadStorage)
        │
ai_http.py (AIAccelerator)
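
The layering above, where each module builds on the previous one, can be sketched as wrapper classes. The class names match the files; the `get`/`put` method names are assumptions:

```python
import threading

class LocalStorage:                      # http_storage.py
    """Bottom layer: raw key/value storage."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data[key]

class TensorStorage:                     # tensor_storage.py
    """Adds tensor awareness on top of LocalStorage (here a thin pass-through)."""
    def __init__(self, backend):
        self.backend = backend
    def put(self, key, tensor):
        self.backend.put(key, tensor)
    def get(self, key):
        return self.backend.get(key)

class MultithreadStorage:                # multithread_storage.py
    """Adds a lock so the store is safe to use from many threads."""
    def __init__(self, backend):
        self.backend = backend
        self._lock = threading.Lock()
    def put(self, key, tensor):
        with self._lock:
            self.backend.put(key, tensor)
    def get(self, key):
        with self._lock:
            return self.backend.get(key)

# ai_http.py's AIAccelerator would sit on top of the full stack:
store = MultithreadStorage(TensorStorage(LocalStorage()))
store.put("weights", [[1.0, 2.0], [3.0, 4.0]])
```

Composing the layers at construction time keeps each file testable on its own, which matches the one-directional dependency chain in the diagram.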
multi_gpu_system_http.py
    │
    ├──Uses──> LocalStorage (for state/tensor storage)
    └──Uses──> GPUChip (for individual GPU operations)
                  │
                  ├──Uses──> MultiCoreSystem (for computation)
                  └──Uses──> ThreadedCore (for threads)
gpu_arch.py (outdated)
    │
    ├──Uses──> MultiCoreSystem (old usage)
    ├──Uses──> CustomVRAM (outdated)
    ├──Uses──> GPUStateDB (outdated)
    └──Uses──> AIAccelerator (limited integration)