Model Inference Flow on Virtual GPU
===================================

1. Storage and VRAM Setup
-------------------------

    [HTTPGPUStorage]
        │     ╲
        │      ╲  Zero-Copy
        │       ╲ Memory Mapping
        ▼        ▼
  [Local Storage]──>[Virtual VRAM]
  (Memory Pages)    (Page Tables)
        │                │
        └────────┐       │
                 ▼       ▼
             [vGPU Device]

2. Model Loading and Device Movement
------------------------------------

  [Florence-2-Large] ---load---> [PyTorch Model]
          │                            │
          │                            ▼
          │                 [to_vgpu() conversion]
          │                            │
          └──────────────┐             │
                         ▼             ▼
                  [Model on vGPU Device]

3. Input Processing and Inference
---------------------------------

  [Input Text] -----> [Tokenizer] -----> [Tensor]
                                            │
                                            ▼
                                 [to_vgpu() conversion]
                                            │
                                            ▼
                                    [Tensor on vGPU]

4. Model Inference Flow
-----------------------

  [Model Forward Pass]
          │
          ▼
  [vGPU Computation]
          │
          ▼
  [PyTorch Output Tensor]
          │
          ▼
  [Last Hidden State]
  (Shape: [batch_size, seq_length, hidden_size])

Data Flow and Memory Management:
--------------------------------
1. Storage Layer:
   - HTTPGPUStorage ──> Local Storage (Memory Pages)
   - Local Storage ──> Virtual VRAM (Zero-Copy)
   - Virtual VRAM manages page tables pointing to local storage

2. Memory Architecture:
   - Local Storage: physical memory pages
   - Virtual VRAM: page tables and memory mappings
   - Zero-copy between Local Storage and Virtual VRAM
   - Direct memory access for GPU operations

3. Processing Flow:
   - Model Layer: HF Model ──> PyTorch ──> vGPU
   - Input Layer: Text ──> Tokens ──> Tensor ──> vGPU
   - Output Layer: vGPU ──> PyTorch Tensor ──> Results

Key Components:
---------------
- HTTP Storage: HTTPGPUStorage (network interface)
- Local Store:  memory pages (physical storage)
- Virtual VRAM: page tables (memory management)
- Device:       vGPU (computation)
- Model:        Florence-2-Large (transformer)
- Framework:    PyTorch (ML operations)
- Interface:    to_vgpu() (zero-copy transfer)

Memory Management Details:
--------------------------
1. Local Storage:
   - Manages physical memory pages
   - Direct mapping to virtual VRAM
   - Zero-copy access for GPU ops
2. Virtual VRAM:
   - Page table management
   - Memory mapping to local storage
   - No physical copying of data
   - Direct GPU access to memory
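The zero-copy relationship between Local Storage and Virtual VRAM described above can be sketched in plain Python. Everything here (LocalStorage, VirtualVRAM, map_page, PAGE_SIZE) is a hypothetical illustration of the page-table idea, not the project's actual API; memoryview stands in for a shared physical mapping:

```python
PAGE_SIZE = 4096  # hypothetical page size for the sketch

class LocalStorage:
    """Physical memory pages backing the virtual VRAM (illustrative only)."""

    def __init__(self, num_pages: int):
        # One contiguous buffer, carved into fixed-size pages.
        self.buffer = bytearray(num_pages * PAGE_SIZE)

    def page(self, index: int) -> memoryview:
        # A memoryview is a zero-copy window into the backing buffer.
        start = index * PAGE_SIZE
        return memoryview(self.buffer)[start:start + PAGE_SIZE]

class VirtualVRAM:
    """Page tables mapping virtual pages onto local storage (illustrative only)."""

    def __init__(self, storage: LocalStorage):
        self.storage = storage
        self.page_table = {}  # virtual page number -> physical page number

    def map_page(self, vpage: int, ppage: int) -> None:
        # Record the mapping; no data moves, only the table entry changes.
        self.page_table[vpage] = ppage

    def read(self, vpage: int) -> memoryview:
        # No bytes are copied: the returned view aliases local storage directly.
        return self.storage.page(self.page_table[vpage])

storage = LocalStorage(num_pages=8)
vram = VirtualVRAM(storage)
vram.map_page(0, 3)               # virtual page 0 -> physical page 3

storage.page(3)[0] = 0xAB         # write through local storage...
assert vram.read(0)[0] == 0xAB    # ...immediately visible through VRAM, no copy
```

Because both sides alias the same buffer, a write through either path is visible through the other, which is the "no physical copying of data" property the page tables provide.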