Update README.md

README.md CHANGED

@@ -39,6 +39,7 @@ User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (
**Core Technologies:**

* **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
* **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
* **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
* **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets; see the sketch after this list)
* **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
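The dual-persistence design above is not shown in code anywhere in this diff. A minimal sketch of the idea, assuming a `save_interaction()` helper and an `interactions` table (both names are illustrative, not from the repo):

```python
# Hypothetical sketch: write state to SQLite first, then mirror it to a
# HuggingFace Dataset so it survives Space restarts.
import sqlite3
import threading

from datasets import Dataset

_state_lock = threading.Lock()  # "thread-safe global state"

def save_interaction(db_path: str, user_input: str, response: str,
                     hub_repo: str | None = None) -> None:
    """Persist one interaction locally; optionally mirror to the Hub."""
    with _state_lock:  # serialize writers across agent threads
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS interactions "
                     "(user_input TEXT, response TEXT)")
        conn.execute("INSERT INTO interactions VALUES (?, ?)",
                     (user_input, response))
        conn.commit()
        rows = conn.execute("SELECT user_input, response "
                            "FROM interactions").fetchall()
        conn.close()

    if hub_repo is not None:  # second persistence layer: HF Datasets
        Dataset.from_dict({
            "user_input": [r[0] for r in rows],
            "response": [r[1] for r in rows],
        }).push_to_hub(hub_repo, private=True)
```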

@@ -344,50 +345,88 @@ Create_Graph_Tool(

### Model Loading & Optimization Strategy

#### Two-Stage Lazy Loading Pipeline

**Stage 1: Build Time (Docker) - Optional Pre-caching**
```yaml
# preload_from_hub in README.md
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct
```
* Downloads model weights during Docker build
* Cached in HuggingFace hub cache directory
* Reduces first-request latency (no download needed)
* **Optional but recommended** for production deployments
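For reference, `preload_from_hub` sits in the Space's README front matter alongside the usual metadata. A sketch with illustrative values for everything except the `preload_from_hub` entry; note that a gated repo like Llama typically also needs an access token configured as a Space secret:

```yaml
---
title: Multi-Agent Assistant   # illustrative
sdk: gradio                    # illustrative
app_file: app.py               # illustrative
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct
---
```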

**Stage 2: Runtime (Lazy Loading with Automatic Caching)**
```python
# model_manager.py - LazyLlamaModel class
import spaces
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


class LazyLlamaModel:
    def __init__(self):
        self.model = None  # nothing is loaded at startup

    # All agents share this single instance; ZeroGPU allocates a GPU for
    # up to 120 seconds during the first load, after which the cached
    # instance is reused without re-allocation.
    @spaces.GPU(duration=120)
    def _load_model(self):
        """Load on first generate() call."""
        if self.model is not None:
            return  # Already loaded - reuse cached instance

        # First call: load with 4-bit quantization
        # (NF4 settings below are assumed from the README's description)
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            "meta-llama/Llama-3.2-3B-Instruct",
            quantization_config=quantization_config,
            device_map="auto",
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            "meta-llama/Llama-3.2-3B-Instruct"
        )
        # Model and tokenizer stay in memory for all future calls
```
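As a usage note, constructing the manager at import time stays instant because nothing is loaded yet; a hedged sketch (the module-level name is illustrative):

```python
# app.py - importing/constructing is instant; no weights are touched
model_manager = LazyLlamaModel()

# The first user request later pays the one-time load cost inside
# generate(); every request after that reuses the cached model (<1s).
```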

**Loading Flow**:
```
App starts → Instant startup (no model loading)
        ↓
First user request → Triggers model load (~30-60s)
        ├── Download from cache (if preloaded: instant)
        ├── Load with 4-bit quantization
        ├── Create pipeline
        └── Cache in memory
        ↓
All subsequent requests → Use cached model (~1s)
```

**Memory Optimization**:
- **4-bit NF4 Quantization**: 75% memory reduction
  - Llama-3.2-3B: ~6GB → ~1GB VRAM
- **Shared Model Strategy**: ALL agents share one model instance
- **Singleton Pattern**: Thread-safe model caching (sketched below)
- **Device Mapping**: Automatic distribution with ZeroGPU
- **128K Context**: Long conversations without truncation
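The singleton bullet names the pattern without showing it; a minimal sketch using double-checked locking, assuming a module-level `get_model()` accessor (the name is illustrative):

```python
import threading

_model_lock = threading.Lock()
_model_instance = None  # one shared instance for every agent

def get_model() -> "LazyLlamaModel":
    """Return the shared model, creating it at most once."""
    global _model_instance
    if _model_instance is None:          # fast path: no locking
        with _model_lock:                # slow path: serialize creators
            if _model_instance is None:  # double-checked locking
                _model_instance = LazyLlamaModel()
    return _model_instance
```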

**ZeroGPU Integration**:
```python
@spaces.GPU(duration=120)  # Dynamic allocation for first load
def _load_model(self):
    # GPU available for 120 seconds
    # Loads model once on first request
    # Cached instance reused across all agents
    # Automatic GPU management by ZeroGPU
    ...
```
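Tying the decorator to the call path: a hedged sketch of the `generate()` method the docstring above refers to (the body is assumed, not taken from the repo):

```python
def generate(self, prompt: str, max_new_tokens: int = 256) -> str:
    self._load_model()  # ~30-60s on the first request, instant afterwards
    inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
    output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
    return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
```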

**Performance Characteristics**:
* **First Request**: 30-60 seconds (one-time model load)
  - With `preload_from_hub`: 30-40s (just quantization)
  - Without preload: 50-60s (download + quantization)
* **Subsequent Requests**: <1 second per agent
* **Memory Footprint**: ~1GB VRAM (persistent)
* **Cold Start**: Instant app startup (model loads on demand)

**Why Lazy Loading?**
* ✅ **Instant Startup**: App launches immediately
* ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
* ✅ **Memory Efficient**: Only loads when needed
* ✅ **Cache Persistent**: Stays loaded between requests
* ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces

---

### Analytics & Evaluation System

@@ -421,12 +460,13 @@ def _ensure_loaded(self):

### Performance Benchmarks

**Runtime Performance:**
* **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
* **Memory Usage**: ~1GB VRAM (4-bit quantization)
* **Context Window**: 128K tokens
* **First Request**: ~30-60 seconds (one-time load)
* **Warm Inference**: <1 second per agent
* **Startup Time**: Instant (lazy loading)

**Llama 3.2 Quality Scores:**
* MMLU: 63.4 (competitive with larger models)