Factor Studios
Model Inference Flow on Virtual GPU
================================
1. Storage and VRAM Setup
-------------------------
[HTTPGPUStorage]
       │          ╲
       │            ╲  Zero-Copy
       │              ╲  Memory Mapping
       ▼                ▼
[Local Storage]──>[Virtual VRAM]
(Memory Pages)    (Page Tables)
       │                │
       └──────────────┐ │
                      ▼ ▼
                 [vGPU Device]
                       │
                       ▼
2. Model Loading and Device Movement
----------------------------------
[Florence-2-Large] ---load---> [PyTorch Model]
        │                             │
        │                             ▼
        │                  [to_vgpu() conversion]
        │                             │
        └─────────────────┐           │
                          ▼           ▼
                     [Model on vGPU Device]
                                │
                                ▼
3. Input Processing and Inference
--------------------------------
[Input Text] -----> [Tokenizer] -----> [Tensor]
                                          │
                                          ▼
                                [to_vgpu() conversion]
                                          │
                                          ▼
                                   [Tensor on vGPU]
                                          │
                                          ▼
4. Model Inference Flow
----------------------
 [Model Forward Pass]
           │
           ▼
  [vGPU Computation]
           │
           ▼
[PyTorch Output Tensor]
           │
           ▼
  [Last Hidden State]
(Shape: [batch_size, seq_length, hidden_size])
Data Flow and Memory Management:
-----------------------------
1. Storage Layer:
   - HTTPGPUStorage ──> Local Storage (Memory Pages)
   - Local Storage ──> Virtual VRAM (Zero-Copy)
   - Virtual VRAM manages page tables pointing to Local Storage
2. Memory Architecture:
   - Local Storage: Physical memory pages
   - Virtual VRAM: Page tables and memory mappings
   - Zero-copy mapping between Local Storage and Virtual VRAM
   - Direct memory access for GPU operations
3. Processing Flow:
   - Model Layer: HF Model ──> PyTorch ──> vGPU
   - Input Layer: Text ──> Tokens ──> Tensor ──> vGPU
   - Output Layer: vGPU ──> PyTorch Tensor ──> Results
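
The zero-copy relationship between Local Storage and Virtual VRAM can be
illustrated with plain Python buffers. This is only a sketch of the idea, not
the project's implementation: PAGE_SIZE, local_storage and vram_view are
hypothetical names, and a bytearray stands in for the physical memory pages.

```python
# Hypothetical stand-in for "Local Storage": one contiguous block of pages.
PAGE_SIZE = 4096  # assumed page size, chosen for illustration
local_storage = bytearray(4 * PAGE_SIZE)

# A memoryview references the same bytes; slicing it does not copy anything.
# Here the view covers the second page only.
vram_view = memoryview(local_storage)[PAGE_SIZE:2 * PAGE_SIZE]

# A write made through the storage is visible through the view...
local_storage[PAGE_SIZE] = 0xAB
first_byte_seen_by_view = vram_view[0]

# ...and a write made through the view is visible in the storage.
vram_view[1] = 0xCD
second_byte_seen_by_storage = local_storage[PAGE_SIZE + 1]

# By contrast, bytes(...) takes an independent copy that stops tracking updates.
copied = bytes(vram_view)
local_storage[PAGE_SIZE] = 0x00
```

After the last line, vram_view[0] reflects the new value while copied[0] still
holds the old one, which is the practical difference between a mapping and a
copy.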
Key Components:
--------------
- HTTP Storage: HTTPGPUStorage (Network interface)
- Local Store: Memory pages (Physical storage)
- Virtual VRAM: Page tables (Memory management)
- Device: vGPU (Computation)
- Model: Florence-2-Large (transformer)
- Framework: PyTorch (ML operations)
- Interface: to_vgpu() (Zero-copy transfer)
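
The document does not give the signature of to_vgpu(), so the following is a
guess at its general shape rather than the actual interface. It assumes the
call returns a handle that tags the data with a device name and keeps a view
of the caller's memory instead of copying it. VGPUHandle, to_vgpu_sketch and
the "vgpu:0" device string are all hypothetical.

```python
from array import array
from dataclasses import dataclass


@dataclass
class VGPUHandle:
    """Hypothetical handle: a device tag plus a view of the caller's memory."""
    device: str
    data: memoryview


def to_vgpu_sketch(buffer, device="vgpu:0"):
    """Illustrative stand-in for to_vgpu(); wraps `buffer` without copying it."""
    return VGPUHandle(device=device, data=memoryview(buffer))


# A small float32 array stands in for a host-side tensor.
host_tensor = array("f", [1.0, 2.0, 3.0])
on_device = to_vgpu_sketch(host_tensor)

# Because the handle only holds a view, an update made through it
# lands in the original host buffer.
on_device.data[0] = 5.0
```

The same call shape would apply to both model weights and input tensors, which
matches the two to_vgpu() steps in the diagrams.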
Memory Management Details:
------------------------
1. Local Storage:
   - Manages physical memory pages
   - Direct mapping to Virtual VRAM
   - Zero-copy access for GPU operations
2. Virtual VRAM:
   - Page table management
   - Memory mapping to Local Storage
   - No physical copying of data
   - Direct GPU access to memory
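
The split of responsibilities above (Local Storage owns the pages, Virtual
VRAM owns the page table) can be sketched as a toy address translator. The
class names, method names and the 4096-byte page size are assumptions made for
illustration; the real components may be organised quite differently.

```python
PAGE_SIZE = 4096  # assumed page size


class LocalStorage:
    """Hypothetical pool of fixed-size physical pages."""

    def __init__(self, num_pages):
        self._buf = bytearray(num_pages * PAGE_SIZE)
        self._free = list(range(num_pages))

    def alloc_page(self):
        if not self._free:
            raise MemoryError("no free physical pages")
        return self._free.pop(0)

    def page(self, physical_page):
        # Returns a view of the page, not a copy.
        start = physical_page * PAGE_SIZE
        return memoryview(self._buf)[start:start + PAGE_SIZE]


class VirtualVRAM:
    """Hypothetical page table: virtual page number -> physical page number."""

    def __init__(self, storage):
        self._storage = storage
        self._page_table = {}

    def map_page(self, virtual_page):
        # Allocate a physical page on first use; later calls reuse the mapping.
        if virtual_page not in self._page_table:
            self._page_table[virtual_page] = self._storage.alloc_page()
        return self._page_table[virtual_page]

    def _resolve(self, address):
        virtual_page, offset = divmod(address, PAGE_SIZE)
        physical_page = self._page_table[virtual_page]  # KeyError if unmapped
        return self._storage.page(physical_page), offset

    def write(self, address, data):
        page, offset = self._resolve(address)
        if offset + len(data) > PAGE_SIZE:
            raise ValueError("write crosses a page boundary (not handled here)")
        page[offset:offset + len(data)] = data

    def read(self, address, length):
        page, offset = self._resolve(address)
        return bytes(page[offset:offset + length])


storage = LocalStorage(num_pages=4)
vram = VirtualVRAM(storage)
vram.map_page(7)  # virtual page 7 is backed by the first free physical page
vram.write(7 * PAGE_SIZE + 16, b"weights")
readback = vram.read(7 * PAGE_SIZE + 16, 7)
```

Writes go straight into the backing page through a memoryview, so the only
copy in this sketch is the bytes object that read() returns to the caller.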