jdesiree committed · Commit 4514554 · verified · 1 Parent(s): c775c71

Update README.md

Files changed (1): README.md (+64 -24)
README.md CHANGED
@@ -39,6 +39,7 @@ User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (
  **Core Technologies:**
 
  * **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
+ * **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
  * **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
  * **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
  * **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
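The new "Lazy Loading Strategy" bullet, taken together with the existing "thread-safe global state" bullet, implies a lock around the first load. A minimal sketch of that pattern in Python (class and method names here are illustrative, not taken from the repository):

```python
import threading

class LazyModelSingleton:
    """Illustrative thread-safe lazy loader: the expensive object is
    created on the first call and every later caller gets the cached one."""
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        if cls._instance is None:            # fast path: already loaded
            with cls._lock:                  # serialize the first load
                if cls._instance is None:    # double-checked locking
                    cls._instance = cls._load()
        return cls._instance

    @staticmethod
    def _load():
        # Stand-in for the real work (loading quantized Llama-3.2-3B).
        return object()
```

Concurrent first requests then block on the lock instead of loading the model twice.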
@@ -344,50 +345,88 @@ Create_Graph_Tool(
 
  ### Model Loading & Optimization Strategy
 
- #### Three-Stage Loading Pipeline
+ #### Two-Stage Lazy Loading Pipeline
 
- **Stage 1: Build Time (Docker)**
+ **Stage 1: Build Time (Docker) - Optional Pre-caching**
  ```yaml
  # preload_from_hub in README.md
- - Downloads Llama-3.2-3B during Docker build
- - Cached in ~/.cache/huggingface/hub/
- - No download time at runtime
+ preload_from_hub:
+   - meta-llama/Llama-3.2-3B-Instruct
  ```
+ * Downloads model weights during Docker build
+ * Cached in the HuggingFace hub cache directory
+ * Reduces first-request latency (no download needed)
+ * **Optional but recommended** for production deployments
 
- **Stage 2: Startup (compile_model.py)**
- ```python
- # Runs before Gradio launch
- - Load model from HF cache
- - Apply 4-bit NF4 quantization
- - Run warmup inference (CUDA kernel compilation)
- - Create singleton instance for reuse
- ```
-
- **Stage 3: Runtime (Lazy Loading)**
+ **Stage 2: Runtime (Lazy Loading with Automatic Caching)**
  ```python
- # First agent call triggers load
- def _ensure_loaded(self):
-     if self.model_loaded:
-         return  # Already loaded
-     # Load once, reuse forever
+ # model_manager.py - LazyLlamaModel class
+ # All agents share this single lazily loaded instance
+ @spaces.GPU(duration=120)  # GPU allocated for 120 seconds during first load
+ def _load_model(self):
+     """Load on the first generate() call."""
+     if self.model is not None:
+         return  # Already loaded - reuse cached instance
+
+     # First call: load with 4-bit quantization
+     self.model = AutoModelForCausalLM.from_pretrained(
+         "meta-llama/Llama-3.2-3B-Instruct",
+         quantization_config=quantization_config,
+         device_map="auto",
+     )
+     # Model stays in memory for all future calls
  ```
 
+ **Loading Flow**:
+ ```
+ App starts → Instant startup (no model loading)
+     ↓
+ First user request → Triggers model load (~30-60s)
+     ├─ Download from cache (if preloaded: instant)
+     ├─ Load with 4-bit quantization
+     ├─ Create pipeline
+     └─ Cache in memory
+     ↓
+ All subsequent requests → Use cached model (~1s)
+ ```
 
  **Memory Optimization**:
  - **4-bit NF4 Quantization**: 75% memory reduction
    - Llama-3.2-3B: ~6GB → ~1GB VRAM
  - **Shared Model Strategy**: ALL agents share one model instance
+ - **Singleton Pattern**: Thread-safe model caching
  - **Device Mapping**: Automatic distribution with ZeroGPU
  - **128K Context**: Long conversations without truncation
 
  **ZeroGPU Integration**:
  ```python
- @spaces.GPU(duration=120)  # Dynamic allocation
- def _ensure_loaded(self):
+ @spaces.GPU(duration=120)  # Dynamic allocation for first load
+ def _load_model(self):
      # GPU available for 120 seconds
-     # Loads model once, reuses across agents
-     # Automatically released after
+     # Loads model once on first request
+     # Cached instance reused across all agents
+     # Automatic GPU management by ZeroGPU
  ```
 
+ **Performance Characteristics**:
+ * **First Request**: 30-60 seconds (one-time model load)
+   - With `preload_from_hub`: 30-40s (just quantization)
+   - Without preload: 50-60s (download + quantization)
+ * **Subsequent Requests**: <1 second per agent
+ * **Memory Footprint**: ~1GB VRAM (persistent)
+ * **Cold Start**: Instant app startup (model loads on demand)
+
+ **Why Lazy Loading?**
+ * ✅ **Instant Startup**: App launches immediately
+ * ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
+ * ✅ **Memory Efficient**: Only loads when needed
+ * ✅ **Cache Persistent**: Stays loaded between requests
+ * ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces
+
  ---
 
  ### Analytics & Evaluation System
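The Stage 2 snippet above passes a `quantization_config` that the diff never defines. A plausible definition matching the "4-bit NF4" description, using transformers' `BitsAndBytesConfig` (the compute dtype and double-quantization settings are assumptions, not taken from the repository):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization: the source of the "~6GB -> ~1GB VRAM" reduction.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
    bnb_4bit_use_double_quant=True,         # assumed; also quantizes the scales
)
```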
@@ -421,12 +460,13 @@ def _ensure_loaded(self):
 
  ### Performance Benchmarks
 
- **Model Performance:**
+ **Runtime Performance:**
  * **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
  * **Memory Usage**: ~1GB VRAM (4-bit quantization)
  * **Context Window**: 128K tokens
- * **Cold Start**: ~3-5 seconds (first load)
+ * **First Request**: ~30-60 seconds (one-time load)
  * **Warm Inference**: <1 second per agent
+ * **Startup Time**: Instant (lazy loading)
 
  **Llama 3.2 Quality Scores:**
  * MMLU: 63.4 (competitive with larger models)
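One way to sanity-check the "First Request ~30-60 seconds" vs. "Warm Inference <1 second" figures is to time two consecutive calls. A hypothetical harness; `generate` is a stand-in for whatever entry point the Space actually exposes:

```python
import time

def generate(prompt: str) -> str:
    # Stand-in: in the real app, the first call triggers the lazy load.
    return prompt.upper()

def timed(label: str, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

timed("cold request (triggers lazy load)", generate, "Hello")
timed("warm request (cached model)", generate, "Hello again")
```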
 
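The Loading Flow in the second hunk lists a "Create pipeline" step that none of the snippets show. A minimal sketch of that step using transformers' `pipeline` helper; in the Space, the already-quantized shared model would presumably be passed in rather than loaded here, and the meta-llama repository is gated, so loading by id requires an accepted license and auth token:

```python
from transformers import pipeline

# Build a text-generation pipeline around the model.
# Loading via model id is shown only for self-containment; it would
# re-download unless the weights are already in the hub cache.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
)
out = generator("Hello", max_new_tokens=16)
print(out[0]["generated_text"])
```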