jdesiree committed · Commit 4514554 · verified · 1 Parent(s): c775c71

Update README.md

Files changed (1): README.md (+64 -24)
README.md CHANGED
@@ -39,6 +39,7 @@ User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (
  **Core Technologies:**
 
  * **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
+ * **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
  * **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
  * **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
  * **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
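The new "Lazy Loading Strategy" bullet, taken together with the existing "thread-safe global state" bullet, implies a lock around the first load. A minimal sketch of that pattern in Python (class and method names here are illustrative, not taken from the repository):

```python
import threading

class LazyModelSingleton:
    """Illustrative thread-safe lazy loader: the expensive object is
    created on the first call and every later caller gets the cached one."""
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        if cls._instance is None:            # fast path: already loaded
            with cls._lock:                  # serialize the first load
                if cls._instance is None:    # double-checked locking
                    cls._instance = cls._load()
        return cls._instance

    @staticmethod
    def _load():
        # Stand-in for the real work (loading quantized Llama-3.2-3B).
        return object()
```

Concurrent first requests then block on the lock instead of loading the model twice.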
@@ -344,50 +345,88 @@ Create_Graph_Tool(
 
  ### Model Loading & Optimization Strategy
 
- #### Three-Stage Loading Pipeline
+ #### Two-Stage Lazy Loading Pipeline
 
- **Stage 1: Build Time (Docker)**
+ **Stage 1: Build Time (Docker) - Optional Pre-caching**
  ```yaml
  # preload_from_hub in README.md
- - Downloads Llama-3.2-3B during Docker build
- - Cached in ~/.cache/huggingface/hub/
- - No download time at runtime
+ preload_from_hub:
+   - meta-llama/Llama-3.2-3B-Instruct
  ```
+ * Downloads model weights during Docker build
+ * Cached in the HuggingFace hub cache directory
+ * Reduces first-request latency (no download needed)
+ * **Optional but recommended** for production deployments
 
- **Stage 2: Startup (compile_model.py)**
- ```python
- # Runs before Gradio launch
- - Load model from HF cache
- - Apply 4-bit NF4 quantization
- - Run warmup inference (CUDA kernel compilation)
- - Create singleton instance for reuse
- ```
-
- **Stage 3: Runtime (Lazy Loading)**
+ **Stage 2: Runtime (Lazy Loading with Automatic Caching)**
  ```python
- # First agent call triggers load
- def _ensure_loaded(self):
-     if self.model_loaded:
-         return  # Already loaded
-     # Load once, reuse forever
+ # model_manager.py - LazyLlamaModel class
+ # All agents share this single lazily loaded instance
+ @spaces.GPU(duration=120)  # GPU allocated for 120 seconds during first load
+ def _load_model(self):
+     """Load on the first generate() call."""
+     if self.model is not None:
+         return  # Already loaded - reuse cached instance
+
+     # First call: load with 4-bit quantization
+     self.model = AutoModelForCausalLM.from_pretrained(
+         "meta-llama/Llama-3.2-3B-Instruct",
+         quantization_config=quantization_config,
+         device_map="auto",
+     )
+     # Model stays in memory for all future calls
  ```
 
+ **Loading Flow**:
+ ```
+ App starts → Instant startup (no model loading)
+     ↓
+ First user request → Triggers model load (~30-60s)
+     ├─ Download from cache (if preloaded: instant)
+     ├─ Load with 4-bit quantization
+     ├─ Create pipeline
+     └─ Cache in memory
+     ↓
+ All subsequent requests → Use cached model (~1s)
+ ```
 
  **Memory Optimization**:
  - **4-bit NF4 Quantization**: 75% memory reduction
    - Llama-3.2-3B: ~6GB → ~1GB VRAM
  - **Shared Model Strategy**: ALL agents share one model instance
+ - **Singleton Pattern**: Thread-safe model caching
  - **Device Mapping**: Automatic distribution with ZeroGPU
  - **128K Context**: Long conversations without truncation
 
  **ZeroGPU Integration**:
  ```python
- @spaces.GPU(duration=120)  # Dynamic allocation
- def _ensure_loaded(self):
+ @spaces.GPU(duration=120)  # Dynamic allocation for first load
+ def _load_model(self):
      # GPU available for 120 seconds
-     # Loads model once, reuses across agents
-     # Automatically released after
+     # Loads model once on first request
+     # Cached instance reused across all agents
+     # Automatic GPU management by ZeroGPU
  ```
 
+ **Performance Characteristics**:
+ * **First Request**: 30-60 seconds (one-time model load)
+   - With `preload_from_hub`: 30-40s (just quantization)
+   - Without preload: 50-60s (download + quantization)
+ * **Subsequent Requests**: <1 second per agent
+ * **Memory Footprint**: ~1GB VRAM (persistent)
+ * **Cold Start**: Instant app startup (model loads on demand)
+
+ **Why Lazy Loading?**
+ * ✅ **Instant Startup**: App launches immediately
+ * ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
+ * ✅ **Memory Efficient**: Only loads when needed
+ * ✅ **Cache Persistent**: Stays loaded between requests
+ * ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces
+
  ---
 
  ### Analytics & Evaluation System
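The Stage 2 snippet above passes a `quantization_config` that the diff never defines. A plausible definition matching the "4-bit NF4" description, using transformers' `BitsAndBytesConfig` (the compute dtype and double-quantization settings are assumptions, not taken from the repository):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization: the source of the "~6GB -> ~1GB VRAM" reduction.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    bnb_4bit_use_double_quant=True,         # assumed; also quantizes the scales
)
```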
@@ -421,12 +460,13 @@ def _ensure_loaded(self):
 
  ### Performance Benchmarks
 
- **Model Performance:**
+ **Runtime Performance:**
  * **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
  * **Memory Usage**: ~1GB VRAM (4-bit quantization)
  * **Context Window**: 128K tokens
- * **Cold Start**: ~3-5 seconds (first load)
+ * **First Request**: ~30-60 seconds (one-time load)
  * **Warm Inference**: <1 second per agent
+ * **Startup Time**: Instant (lazy loading)
 
  **Llama 3.2 Quality Scores:**
  * MMLU: 63.4 (competitive with larger models)
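One way to sanity-check the "First Request ~30-60 seconds" vs. "Warm Inference <1 second" figures is to time two consecutive calls. A hypothetical harness; `generate` is a stand-in for whatever entry point the Space actually exposes:

```python
import time

def generate(prompt: str) -> str:
    # Stand-in: in the real app, the first call triggers the lazy load.
    return prompt.upper()

def timed(label: str, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

timed("cold request (triggers lazy load)", generate, "Hello")
timed("warm request (cached model)", generate, "Hello again")
```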
 
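The Loading Flow in the second hunk lists a "Create pipeline" step that none of the snippets show. A minimal sketch of that step using transformers' `pipeline` helper; in the Space, the already-quantized shared model would presumably be passed in rather than loaded here, and the meta-llama repository is gated, so loading by id requires an accepted license and auth token:

```python
from transformers import pipeline

# Build a text-generation pipeline around the model.
# Loading via model id is shown only for self-containment; it would
# re-download unless the weights are already in the hub cache.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
)
out = generator("Hello", max_new_tokens=16)
print(out[0]["generated_text"])
```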