Model Inference Flow on Virtual GPU
===================================

1. Storage and VRAM Setup
-------------------------

[HTTPGPUStorage]
 │ ╲
 │  ╲ Zero-Copy
 │   ╲ Memory Mapping
 ▼    ▼
[Local Storage]──>[Virtual VRAM]
(Memory Pages)    (Page Tables)
 │                 │
 └──────────────┐  │
                ▼  ▼
          [vGPU Device]
                │
                ▼

2. Model Loading and Device Movement
------------------------------------

[Florence-2-Large] ---load---> [PyTorch Model]
        │                            │
        │                            ▼
        │                [to_vgpu() conversion]
        │                            │
        └──────────────┐             │
                       ▼             ▼
                [Model on vGPU Device]
                       │
                       ▼

3. Input Processing and Inference
---------------------------------

[Input Text] -----> [Tokenizer] -----> [Tensor]
                                          │
                                          ▼
                              [to_vgpu() conversion]
                                          │
                                          ▼
                                 [Tensor on vGPU]
                                          │
                                          ▼

4. Model Inference Flow
-----------------------

[Model Forward Pass]
         │
         ▼
[vGPU Computation]
         │
         ▼
[PyTorch Output Tensor]
         │
         ▼
[Last Hidden State]
(Shape: [batch_size, seq_length, hidden_size])

Data Flow and Memory Management:
--------------------------------
1. Storage Layer:
   - HTTPGPUStorage ──> Local Storage (memory pages)
   - Local Storage ──> Virtual VRAM (zero-copy)
   - Virtual VRAM manages page tables pointing into local storage

2. Memory Architecture:
   - Local Storage: physical memory pages
   - Virtual VRAM: page tables and memory mappings
   - Zero-copy between Local Storage and VRAM
   - Direct memory access for GPU operations

3. Processing Flow:
   - Model Layer:  HF Model ──> PyTorch ──> vGPU
   - Input Layer:  Text ──> Tokens ──> Tensor ──> vGPU
   - Output Layer: vGPU ──> PyTorch Tensor ──> Results
   (Sketches 2 and 3 at the end of this document show these flows in
   toy form.)

Key Components:
---------------
- HTTP Storage: HTTPGPUStorage (network interface)
- Local Store:  memory pages (physical storage)
- Virtual VRAM: page tables (memory management)
- Device:       vGPU (computation)
- Model:        Florence-2-Large (transformer)
- Framework:    PyTorch (ML operations)
- Interface:    to_vgpu() (zero-copy transfer)

Memory Management Details:
--------------------------
1. Local Storage:
   - Manages physical memory pages
   - Direct mapping to virtual VRAM
   - Zero-copy access for GPU ops

2. Virtual VRAM:
   - Page table management
   - Memory mapping to local storage
   - No physical copying of data
   - Direct GPU access to memory
   (Sketch 1 at the end of this document shows a toy version of this
   page-table mapping.)

Multi-GPU Execution Hierarchy:
------------------------------
Model Load (.npy files)
│
├── AIAccelerator (manages distribution)
│   │
│   ├── MultiGPUSystem (8 chips)
│   │   │
│   │   ├── Each GPUChip (108 SMs)
│   │   │   │
│   │   │   └── Each SM (3000 tensor cores)
│   │   │       │
│   │   │       └── Individual tensor cores
│   │   │           (direct hardware-level execution)
│   │   │
│   │   └── NVLink 4.0 between chips
│   │
│   └── LocalStorage (electron-speed data access)

(Sketch 4 at the end of this document shows the memory-mapped .npy
load.)

Work Distribution:
------------------
AI Model/Operation
│
├── AIAccelerator
│   │
│   ├── GPUParallelDistributor (splits work)
│   │   │
│   │   └── Distributes across GPUs
│   │
│   └── MultiGPUSystem (manages hardware)
│       │
│       ├── 8 GPU chips
│       │   │
│       │   ├── 108 SMs each
│       │   │   │
│       │   │   └── 3000 tensor cores each
│       │   │
│       │   └── Local Storage
│       │
│       └── NVLink connections

(Sketch 5 at the end of this document shows this split-and-gather
pattern.)

Module Dependencies:
--------------------
http_storage.py (LocalStorage)
        ↓
tensor_storage.py (TensorStorage)
        ↓
multithread_storage.py (MultithreadStorage)
        ↓
ai_http.py (AIAccelerator)

(Sketch 6 at the end of this document shows how these layers might wrap
one another.)

multi_gpu_system_http.py
│
├──Uses──> LocalStorage (for state/tensor storage)
└──Uses──> GPUChip (for individual GPU operations)
    │
    └──Uses──> MultiCoreSystem (for computation)
        │
        └──Uses──> ThreadedCore (for threads)

gpu_arch.py (outdated)
│
├──Uses──> MultiCoreSystem (old usage)
├──Uses──> CustomVRAM (outdated)
├──Uses──> GPUStateDB (outdated)
└──Uses──> AIAccelerator (limited integration)
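
Code Sketches (hypothetical):
-----------------------------
The sketches below are minimal, self-contained stand-ins for the flows
described above, written in plain PyTorch/NumPy. Class and file names
that appear in the diagrams (LocalStorage, AIAccelerator, GPUChip, ...)
are reused here, but every method body, signature, and constant is an
assumption made for illustration, not the project's actual API.

Sketch 1: Zero-copy virtual VRAM mapping. A page table maps virtual
pages to offsets in one flat local-storage buffer, and
torch.from_numpy() hands PyTorch a view of that buffer, so writes from
"GPU" ops land in local storage with no copying. PAGE_SIZE and the
VirtualVRAM class are assumptions.

    import numpy as np
    import torch

    PAGE_SIZE = 4096  # bytes per page (assumed)

    class VirtualVRAM:
        """Maps virtual pages onto a flat local-storage buffer."""

        def __init__(self, backing: np.ndarray):
            self.backing = backing                # local storage: memory pages
            self.page_table: dict[int, int] = {}  # virtual page -> offset
            self.next_free = 0

        def alloc(self, nbytes: int) -> int:
            """Reserve pages; return the first virtual page index."""
            n_pages = -(-nbytes // PAGE_SIZE)  # ceiling division
            first = len(self.page_table)
            for i in range(n_pages):
                self.page_table[first + i] = self.next_free
                self.next_free += PAGE_SIZE
            return first

        def tensor_view(self, page: int, shape, dtype=np.float32):
            """Zero-copy: a torch tensor sharing the backing buffer."""
            offset = self.page_table[page]
            nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
            raw = self.backing[offset:offset + nbytes]
            return torch.from_numpy(raw.view(dtype).reshape(shape))

    storage = np.zeros(16 * 2**20, dtype=np.uint8)  # 16 MiB local store
    vram = VirtualVRAM(storage)
    t = vram.tensor_view(vram.alloc(4096), (32, 32))
    t.fill_(1.0)                        # writes go straight to `storage`
    print(t.sum().item(), storage[:4])  # 1024.0 [  0   0 128  63]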
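
Sketch 2: The to_vgpu() conversion from sections 2-3. Assuming the
conversion's job is to register existing parameter storage with the
virtual device rather than copy it, the model's tensors stay where they
are and the device just gains references to them. VGPUDevice and this
to_vgpu() body are assumptions; only the function name comes from the
document.

    import torch
    import torch.nn as nn

    class VGPUDevice:
        """Stand-in for the vGPU device object (hypothetical)."""
        def __init__(self):
            self.registered: dict[str, torch.Tensor] = {}

        def register(self, name: str, tensor: torch.Tensor) -> torch.Tensor:
            # Zero-copy in spirit: keep a reference to the same storage
            # instead of cloning, mirroring the page tables of Sketch 1.
            self.registered[name] = tensor
            return tensor

    def to_vgpu(module: nn.Module, device: VGPUDevice) -> nn.Module:
        """Register every parameter and buffer without copying."""
        for name, param in module.named_parameters():
            device.register(name, param.data)
        for name, buf in module.named_buffers():
            device.register(name, buf)
        return module

    # A small stand-in model instead of Florence-2-Large.
    vgpu = VGPUDevice()
    model = to_vgpu(nn.Linear(8, 4), vgpu)
    print(sorted(vgpu.registered))  # ['bias', 'weight']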
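
Sketch 3: The inference flow of sections 3-4, shrunk to run anywhere:
text is tokenized, the tensor would pass through to_vgpu() at the marked
point, and the forward pass yields a last hidden state of shape
[batch_size, seq_length, hidden_size]. The whitespace tokenizer and the
tiny encoder are stand-ins for the real Florence-2-Large tokenizer and
model.

    import torch
    import torch.nn as nn

    VOCAB, HIDDEN = 100, 16  # toy sizes (assumed)

    def tokenize(text: str) -> torch.Tensor:
        # Toy tokenizer: hash each whitespace token into a fixed vocab.
        ids = [hash(w) % VOCAB for w in text.split()]
        return torch.tensor([ids])  # [batch_size=1, seq_length]

    class TinyEncoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, HIDDEN)
            layer = nn.TransformerEncoderLayer(
                d_model=HIDDEN, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, ids: torch.Tensor) -> torch.Tensor:
            return self.encoder(self.embed(ids))  # last hidden state

    model = TinyEncoder()
    inputs = tokenize("a caption for the vGPU demo")
    # In the real flow, model and inputs pass through to_vgpu() here.
    with torch.no_grad():
        last_hidden = model(inputs)
    print(last_hidden.shape)  # torch.Size([1, 6, 16])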
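
Sketch 4: The "Model Load (.npy files)" step. np.load with mmap_mode is
a real NumPy facility that gives the same zero-copy flavor as the page
tables above: weights are paged in from disk on demand instead of being
read whole. The file name and shape are made up for the demo.

    import numpy as np

    # Write a fake weight file, then map it instead of reading it whole.
    fake = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
    np.save("layer0_weight.npy", fake)

    weights = np.load("layer0_weight.npy", mmap_mode="r")  # lazy mapping
    print(type(weights).__name__, weights.shape)  # memmap (256, 256)
    print(weights[0, :4])  # only the touched pages are read from disk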
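
Sketch 5: How GPUParallelDistributor might split work across the 8
simulated chips. The batch is sharded, each chip computes its shard,
and the gather at the end stands in for NVLink traffic. Only the class
names and the 8-chip topology come from the document; the
split-by-batch strategy and every method here are assumptions.

    import torch

    N_CHIPS = 8

    class GPUChip:
        """Stand-in for one simulated chip (108 SMs in the diagrams)."""
        def run(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
            return x @ w  # per-chip compute

    class GPUParallelDistributor:
        def __init__(self, chips):
            self.chips = chips

        def matmul(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
            # Shard the batch across chips, compute, then gather.
            shards = torch.chunk(x, len(self.chips), dim=0)
            outs = [c.run(s, w) for c, s in zip(self.chips, shards)]
            return torch.cat(outs, dim=0)

    dist = GPUParallelDistributor([GPUChip() for _ in range(N_CHIPS)])
    x, w = torch.randn(64, 32), torch.randn(32, 16)
    out = dist.matmul(x, w)
    print(out.shape, torch.allclose(out, x @ w, atol=1e-6))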
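
Sketch 6: The module layering from http_storage.py up to ai_http.py,
read as "each layer wraps the one below it". The class names come from
the dependency chain above; every body is a placeholder assumption.

    import threading
    import numpy as np

    class LocalStorage:                      # http_storage.py
        def __init__(self):
            self._pages: dict[str, np.ndarray] = {}
        def put(self, key, arr): self._pages[key] = arr
        def get(self, key): return self._pages[key]

    class TensorStorage:                     # tensor_storage.py
        def __init__(self, backend: LocalStorage):
            self.backend = backend
        def save(self, key, arr):
            self.backend.put(key, np.ascontiguousarray(arr))
        def load(self, key):
            return self.backend.get(key)

    class MultithreadStorage:                # multithread_storage.py
        def __init__(self, tensors: TensorStorage):
            self.tensors = tensors
            self._lock = threading.Lock()    # serialize concurrent access
        def save(self, key, arr):
            with self._lock: self.tensors.save(key, arr)
        def load(self, key):
            with self._lock: return self.tensors.load(key)

    class AIAccelerator:                     # ai_http.py
        def __init__(self):
            self.storage = MultithreadStorage(TensorStorage(LocalStorage()))

    accel = AIAccelerator()
    accel.storage.save("weights", np.ones((4, 4), dtype=np.float32))
    print(accel.storage.load("weights").sum())  # 16.0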