Model Inference Flow on Virtual GPU
===================================

1. Storage and VRAM Setup
-------------------------

    [HTTPGPUStorage]
        │     ╲
        │      ╲  Zero-Copy
        │       ╲ Memory Mapping
        ▼        ▼
  [Local Storage]──>[Virtual VRAM]
  (Memory Pages)    (Page Tables)
        │                │
        └────────┐       │
                 ▼       ▼
             [vGPU Device]

2. Model Loading and Device Movement
------------------------------------

  [Florence-2-Large] ---load---> [PyTorch Model]
          │                            │
          │                            ▼
          │                 [to_vgpu() conversion]
          │                            │
          └──────────────┐             │
                         ▼             ▼
                  [Model on vGPU Device]

3. Input Processing and Inference
---------------------------------

  [Input Text] -----> [Tokenizer] -----> [Tensor]
                                            │
                                            ▼
                                 [to_vgpu() conversion]
                                            │
                                            ▼
                                    [Tensor on vGPU]

4. Model Inference Flow
-----------------------

  [Model Forward Pass]
          │
          ▼
  [vGPU Computation]
          │
          ▼
  [PyTorch Output Tensor]
          │
          ▼
  [Last Hidden State]
  (Shape: [batch_size, seq_length, hidden_size])

Data Flow and Memory Management:
--------------------------------
1. Storage Layer:
   - HTTPGPUStorage ──> Local Storage (Memory Pages)
   - Local Storage ──> Virtual VRAM (Zero-Copy)
   - Virtual VRAM manages page tables pointing to local storage

2. Memory Architecture:
   - Local Storage: physical memory pages
   - Virtual VRAM: page tables and memory mappings
   - Zero-copy between Local Storage and Virtual VRAM
   - Direct memory access for GPU operations

3. Processing Flow:
   - Model Layer: HF Model ──> PyTorch ──> vGPU
   - Input Layer: Text ──> Tokens ──> Tensor ──> vGPU
   - Output Layer: vGPU ──> PyTorch Tensor ──> Results

Key Components:
---------------
- HTTP Storage: HTTPGPUStorage (network interface)
- Local Store:  memory pages (physical storage)
- Virtual VRAM: page tables (memory management)
- Device:       vGPU (computation)
- Model:        Florence-2-Large (transformer)
- Framework:    PyTorch (ML operations)
- Interface:    to_vgpu() (zero-copy transfer)

Memory Management Details:
--------------------------
1. Local Storage:
   - Manages physical memory pages
   - Direct mapping to virtual VRAM
   - Zero-copy access for GPU ops
2. Virtual VRAM:
   - Page table management
   - Memory mapping to local storage
   - No physical copying of data
   - Direct GPU access to memory
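The zero-copy relationship between Local Storage and Virtual VRAM described above can be sketched in plain Python. Everything here (LocalStorage, VirtualVRAM, map_page, PAGE_SIZE) is a hypothetical illustration of the page-table idea, not the project's actual API; memoryview stands in for a shared physical mapping:

```python
PAGE_SIZE = 4096  # hypothetical page size for the sketch

class LocalStorage:
    """Physical memory pages backing the virtual VRAM (illustrative only)."""

    def __init__(self, num_pages: int):
        # One contiguous buffer, carved into fixed-size pages.
        self.buffer = bytearray(num_pages * PAGE_SIZE)

    def page(self, index: int) -> memoryview:
        # A memoryview is a zero-copy window into the backing buffer.
        start = index * PAGE_SIZE
        return memoryview(self.buffer)[start:start + PAGE_SIZE]

class VirtualVRAM:
    """Page tables mapping virtual pages onto local storage (illustrative only)."""

    def __init__(self, storage: LocalStorage):
        self.storage = storage
        self.page_table = {}  # virtual page number -> physical page number

    def map_page(self, vpage: int, ppage: int) -> None:
        # Record the mapping; no data moves, only the table entry changes.
        self.page_table[vpage] = ppage

    def read(self, vpage: int) -> memoryview:
        # No bytes are copied: the returned view aliases local storage directly.
        return self.storage.page(self.page_table[vpage])

storage = LocalStorage(num_pages=8)
vram = VirtualVRAM(storage)
vram.map_page(0, 3)               # virtual page 0 -> physical page 3

storage.page(3)[0] = 0xAB         # write through local storage...
assert vram.read(0)[0] == 0xAB    # ...immediately visible through VRAM, no copy
```

Because both sides alias the same buffer, a write through either path is visible through the other, which is the "no physical copying of data" property the page tables provide.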