---
title: Mimir
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---

# Mimir: Educational AI Assistant
## Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project

### Project Overview
Mimir demonstrates enterprise-grade AI system design through a multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Rather than relying on a single monolithic prompt, Mimir employs **four specialized agent types** working in concert: a tool decision engine, four routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring responses are tailored to each educational interaction.

***

### Technical Architecture

**Multi-Agent System:**
```
User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (3x) → Response Agent → Output
                     ↓                    ↓                      ↓                  ↓
              Llama-3.2-3B         Llama-3.2-3B (shared)    Llama-3.2-3B      Llama-3.2-3B
```

**Core Technologies:**

* **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
* **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
* **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
* **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
* **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
* **Gradio**: Multi-page interface (Chatbot + Analytics Dashboard)
* **Python**: Advanced backend with 4-bit quantization and streaming

**Key Frameworks & Libraries:**

* `transformers` & `accelerate` for model loading and inference optimization
* `bitsandbytes` for 4-bit NF4 quantization (75% memory reduction)
* `peft` for Parameter-Efficient Fine-Tuning support
* `spaces` for HuggingFace ZeroGPU integration
* `matplotlib` for dynamic visualization generation
* Custom state management system with SQLite and dataset backup

***

### Advanced Agent Architecture

#### Agent Pipeline Overview
The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.

#### Stage 1: Tool Decision Agent
**Purpose**: Determines if visualization tools enhance learning

**Model**: Llama-3.2-3B-Instruct (4-bit NF4 quantized)

**Prompt Engineering**:
* Highly constrained binary decision prompt (YES/NO only)
* Explicit INCLUDE/EXCLUDE criteria for educational contexts
* Zero-shot classification with educational domain knowledge

**Decision Criteria**:
```
INCLUDE: Mathematical functions, data analysis, chart interpretation, 
         trend visualization, proportional relationships

EXCLUDE: Greetings, definitions, explanations without data
```

**Output**: Boolean flag activating `TOOL_USE_ENHANCEMENT` prompt segment
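Because the output format is constrained to YES/NO, mapping it to the boolean flag needs no complex parsing. A minimal sketch (the function name is illustrative, not the system's actual code):

```python
def parse_tool_decision(raw: str) -> bool:
    """Map the agent's constrained YES/NO reply to the boolean tool flag.

    Anything other than a clear YES defaults to False (no visualization),
    so a malformed reply can never trigger an unwanted tool call.
    """
    return raw.strip().upper().startswith("YES")
```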

---

#### Stage 2: Prompt Routing Agents (4 Specialized Agents)
**Purpose**: Intelligent prompt segment selection through parallel analysis

**Model**: Shared Llama-3.2-3B-Instruct instance (memory efficient)

**Agent Specializations**:

1. **Agent 1 - Practice Question Detector**
   - Analyzes conversation context for practice question opportunities
   - Considers user's expressed understanding and learning progression
   - Activates: `STRUCTURE_PRACTICE_QUESTIONS`

2. **Agent 2 - Discovery Mode Classifier**
   - Dual-classification: vague input detection + understanding assessment
   - Returns: `VAUGE_INPUT`, `USER_UNDERSTANDING`, or neither
   - Enables guided discovery and clarification strategies

3. **Agent 3 - Follow-up Assessment Agent**
   - Detects if user is responding to previous practice questions
   - Analyzes conversation history for grading opportunities
   - Activates: `PRACTICE_QUESTION_FOLLOWUP` (triggers grading mode)

4. **Agent 4 - Teaching Mode Assessor**
   - Evaluates need for direct instruction vs. structured practice
   - Multi-output agent (can activate multiple prompts)
   - Activates: `GUIDING_TEACHING`, `STRUCTURE_PRACTICE_QUESTIONS`

**Prompt Engineering Innovation**:
* Each agent uses a specialized system prompt with clear decision criteria
* Structured output formats for reliable parsing
* Context-aware analysis incorporating full conversation history
* Sequential execution prevents decision conflicts
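A dual-classification output like Agent 2's might be parsed as follows (the function is a hypothetical sketch; the label strings are the system's own identifiers, including its `VAUGE_INPUT` spelling):

```python
def parse_discovery_labels(raw: str) -> list:
    """Return the discovery-mode segments Agent 2 activated, if any."""
    upper = raw.upper()
    # Substring match tolerates surrounding text in the structured output
    return [label for label in ("VAUGE_INPUT", "USER_UNDERSTANDING") if label in upper]
```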

---

#### Stage 3: Thinking Agents (Preprocessing Layer)
**Purpose**: Generate reasoning context before final response (CoT/ToT)

**Model**: Llama-3.2-3B-Instruct (shared instance)

**Agent Specializations**:

1. **Math Thinking Agent**
   - **Method**: Tree-of-Thought reasoning for mathematical problems
   - **Activation**: When `LATEX_FORMATTING` is active
   - **Output Structure**:
     ```
     Key Terms → Principles → Formulas → Step-by-Step Solution → Summary
     ```
   - **Complexity Routing**: Decision tree determines detail level (1A: basic, 1B: complex)

2. **Question/Answer Design Agent**
   - **Method**: Chain-of-Thought for practice question formulation
   - **Activation**: When `STRUCTURE_PRACTICE_QUESTIONS` is active
   - **Formatted Inputs**: Tool context, LaTeX guidelines, practice question templates
   - **Output**: Question design, data formatting, answer bank generation

3. **Reasoning Thinking Agent**
   - **Method**: General Chain-of-Thought preprocessing
   - **Activation**: When tools, follow-ups, or teaching mode active
   - **Output Structure**:
     ```
     User Knowledge Summary → Understanding Analysis →
     Previous Actions → Reference Fact Sheet
     ```

**Prompt Engineering Innovation**:
* Thinking agents produce **context for ResponseAgent**, not final output
* Outputs are invisible to user but inform response quality
* Tree-of-Thought (ToT) for math: explores multiple solution paths
* Chain-of-Thought (CoT) for others: step-by-step reasoning traces

---

#### Stage 4: Response Agent (Educational Response Generation)
**Purpose**: Generate pedagogically sound final response

**Model**: Llama-3.2-3B-Instruct (same shared instance)

**Configuration**:
* 4-bit NF4 quantization (BitsAndBytes)
* Mixed precision BF16 inference
* Accelerate integration for distributed computation
* 128K context window
* Multilingual support (8 languages)

**Prompt Assembly Process**:
1. **Core Identity**: Always included (defines Mimir persona)
2. **Logical Expressions**: Regex-triggered prompts (e.g., math keywords → `LATEX_FORMATTING`)
3. **Agent-Selected Prompts**: Dynamic assembly based on routing agent decisions
4. **Context Integration**: Tool outputs, thinking agent outputs, conversation history
5. **Complete Prompt**: All segments joined with proper formatting

**Dynamic Prompt Library** (11 segments):
```
Core:          CORE_IDENTITY (always)
Formatting:    GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery:     VAUGE_INPUT, USER_UNDERSTANDING
Teaching:      GUIDING_TEACHING
Practice:      STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool:          TOOL_USE_ENHANCEMENT
```
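The assembly step itself reduces to an ordered join over the active segments. A sketch, assuming a `segment_texts` mapping from the names above to their prompt text:

```python
def assemble_system_prompt(active_names, segment_texts):
    """Concatenate the active prompt segments, in order, into one system prompt."""
    return "\n\n".join(segment_texts[name] for name in active_names)
```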

**Response Post-Processing**:
* Artifact cleanup (remove special tokens)
* Intelligent truncation at logical breakpoints
* Sentence integrity preservation
* Quality validation gates
* Word-by-word streaming for UX
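The intelligent-truncation step above can be sketched as cutting at the last sentence boundary within a length budget (the limit and function name are illustrative):

```python
def truncate_at_breakpoint(text: str, max_chars: int = 600) -> str:
    """Truncate at the last sentence boundary within max_chars,
    falling back to a word boundary so no sentence is split mid-word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Latest sentence-ending punctuation inside the window
    idx = max(cut.rfind(". "), cut.rfind("! "), cut.rfind("? "))
    if idx != -1:
        return cut[: idx + 1]
    return cut.rsplit(" ", 1)[0]
```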

---

### Model Specifications

**Llama-3.2-3B-Instruct Details:**
* **Parameters**: 3.21 billion
* **Architecture**: Optimized transformer with Grouped-Query Attention (GQA)
* **Training Data**: 9 trillion tokens (December 2023 cutoff)
* **Context Length**: 128,000 tokens
* **Languages**: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
* **Quantization**: 4-bit NF4 (~1GB VRAM)
* **Training Method**: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF

**Why Single Model Architecture:**
* ✅ **Consistency**: Same reasoning style across all agents
* ✅ **Memory Efficient**: One model, shared instance (~1GB total)
* ✅ **Instruction-Tuned**: Optimized for educational dialogue
* ✅ **Fast Inference**: 3B parameters = quick responses
* ✅ **ZeroGPU Friendly**: Small enough for dynamic allocation
* ✅ **128K Context**: Can handle long educational conversations

---

### Prompt Engineering Techniques Demonstrated

#### 1. Hierarchical Prompt Architecture
**Three-Layer System**:
- **Agent System Prompts**: Specialized instructions for each agent type
- **Response Prompt Segments**: Modular components dynamically assembled
- **Thinking Prompts**: Preprocessing templates for reasoning generation

**Innovation**: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.

#### 2. Per-Turn Prompt State Management
**PromptStateManager**:
```python
# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts → False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING", 
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]
```

**Benefits**:
- No prompt pollution between turns
- Context-appropriate responses every time
- Traceable decision-making for debugging
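Given the interface shown above, the manager can stay very small. A minimal sketch, assuming always-on core segments and per-turn flags for the rest:

```python
class PromptStateManager:
    """Per-turn activation flags for the response prompt segments."""

    ALWAYS_ON = ("CORE_IDENTITY", "GENERAL_FORMATTING")

    def __init__(self, segment_names):
        self._flags = {name: False for name in segment_names}

    def reset(self):
        """Turn start: clean slate, all optional segments off."""
        for name in self._flags:
            self._flags[name] = False

    def update(self, name, active):
        self._flags[name] = active

    def get_active_response_prompts(self):
        """Always-on segments first, then activated ones in declaration order."""
        return list(self.ALWAYS_ON) + [
            n for n, on in self._flags.items() if on and n not in self.ALWAYS_ON
        ]
```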

#### 3. Logical Expression System
**Regex-Based Automatic Activation**:
```python
# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
    prompt_state.update("LATEX_FORMATTING", True)
```

**Hybrid Approach**: Combines rule-based triggers with LLM decision-making for optimal reliability.

#### 4. Constraint-Based Agent Prompting
**Tool Decision Example**:
```
System Prompt: Analyze query and determine if visualization needed.

Output Format: YES or NO (nothing else)

INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data
```

**Result**: Reliable, parseable outputs from agents without complex post-processing.

#### 5. Chain-of-Thought & Tree-of-Thought Preprocessing
**CoT for Sequential Reasoning**:
```
Step 1: Assess topic →
Step 2: Identify user understanding →
Step 3: Previous actions →
Step 4: Reference facts
```

**ToT for Mathematical Reasoning**:
```
Question Type Assessment →
  Branch 1A (Simple): Minimal steps
  Branch 1B (Complex): Full derivation with principles
```

**Innovation**: Thinking agents generate rich context that guides ResponseAgent to higher-quality outputs.
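For intuition, the 1A/1B routing could be approximated by a heuristic like the one below; the markers and threshold are invented for illustration, while the actual system uses an LLM-driven decision tree:

```python
def route_math_branch(question: str) -> str:
    """Pick the reasoning branch: '1A' (minimal steps) or '1B' (full derivation)."""
    complex_markers = ("derivative", "integral", "prove", "optimize", "limit")
    q = question.lower()
    if any(m in q for m in complex_markers) or len(question.split()) > 25:
        return "1B"  # complex: full derivation with principles
    return "1A"  # simple: minimal steps
```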

#### 6. Academic Integrity by Design
**Embedded in Core Prompts**:
* "Do not provide full solutions - guide through processes instead"
* "Break problems into conceptual components"
* "Ask clarifying questions about their understanding"
* Subject-specific guidelines (Math: explain concepts, not compute)

**Follow-up Grading**:
* Agent 3 detects practice question responses
* `PRACTICE_QUESTION_FOLLOWUP` prompt activates
* Automated assessment with constructive feedback

#### 7. Multi-Modal Response Generation
**Tool Integration**:
```python
# Tool decision → JSON generation → matplotlib rendering → base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155, ...},
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time"
)
```

**Result**: In-memory graph generation with educational context, embedded directly in response.
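The render-and-encode step might look like this with matplotlib's in-memory Agg backend; `render_graph_base64` is an illustrative name, not the system's actual tool:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless, in-memory rendering
import matplotlib.pyplot as plt

def render_graph_base64(data: dict, plot_type: str, title: str) -> str:
    """Render a chart to PNG in memory and return it base64-encoded."""
    fig, ax = plt.subplots()
    labels, values = list(data.keys()), list(data.values())
    if plot_type == "line":
        ax.plot(labels, values)
    else:
        ax.bar(labels, values)
    ax.set_title(title)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # free the figure; nothing touches disk
    return base64.b64encode(buf.getvalue()).decode("ascii")
```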

---

### State Management & Persistence

#### GlobalStateManager Architecture
**Dual-Layer Persistence**:
1. **SQLite Database**: Fast local access, immediate writes
2. **HuggingFace Dataset**: Cloud backup, hourly sync
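The SQLite side of this dual layer might be sketched as follows (the schema and function names are illustrative assumptions, not the actual GlobalStateManager API):

```python
import sqlite3
import time

def init_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the local store and ensure the conversation table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conversations "
        "(session_id TEXT, turn INTEGER, role TEXT, content TEXT, ts REAL)"
    )
    return conn

def save_turn(conn, session_id, turn, role, content):
    """Immediate local write; an hourly Dataset sync would read this table."""
    conn.execute(
        "INSERT INTO conversations VALUES (?, ?, ?, ?, ?)",
        (session_id, turn, role, content, time.time()),
    )
    conn.commit()
```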

**State Categories**:
```python
- Conversation State: Full chat history + agent context
- Prompt State: Per-turn activation (resets each interaction)
- Analytics State: Metrics, dashboard data, export history
- Evaluation State: Quality scores, classifier accuracy, user feedback
- ML Model Cache: Loaded model for reuse across sessions
```

**Thread Safety**: All state operations protected by `threading.Lock()`

**Cleanup Strategy**: 
- Automatic cleanup every 60 minutes
- Remove sessions older than 24 hours
- Prevents memory leaks in long-running deployments
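The cleanup pass can be sketched as a lock-protected sweep over last-activity timestamps (names are illustrative; `sessions` maps session id to epoch seconds):

```python
import threading
import time

_state_lock = threading.Lock()  # all state operations are lock-protected

def cleanup_sessions(sessions: dict, max_age_s: float = 24 * 3600, now: float = None) -> list:
    """Remove sessions idle longer than max_age_s; return the removed ids."""
    now = time.time() if now is None else now
    with _state_lock:
        stale = [sid for sid, ts in sessions.items() if now - ts > max_age_s]
        for sid in stale:
            del sessions[sid]
    return stale
```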

---

### Model Loading & Optimization Strategy

#### Two-Stage Lazy Loading Pipeline

**Stage 1: Build Time (Docker) - Optional Pre-caching**
```yaml
# preload_from_hub in README.md
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct
```
* Downloads model weights during Docker build
* Cached in HuggingFace hub cache directory
* Reduces first-request latency (no download needed)
* **Optional but recommended** for production deployments

**Stage 2: Runtime (Lazy Loading with Automatic Caching)**
```python
# model_manager.py - LazyLlamaModel class
@spaces.GPU(duration=120)  # ZeroGPU allocates a GPU for up to 120s on first load
def _load_model(self):
    """Load on the first generate() call; reuse the cached instance afterwards."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls;
    # all agents share this single instance
```

**Loading Flow**:
```
App starts → Instant startup (no model loading)
             ↓
First user request → Triggers model load (~30-60s)
                    ├─ Download from cache (if preloaded: instant)
                    ├─ Load with 4-bit quantization
                    ├─ Create pipeline
                    └─ Cache in memory
                    ↓
All subsequent requests → Use cached model (~1s)
```

**Memory Optimization**:
- **4-bit NF4 Quantization**: 75% memory reduction
  - Llama-3.2-3B: ~6GB → ~1GB VRAM
- **Shared Model Strategy**: ALL agents share one model instance
- **Singleton Pattern**: Thread-safe model caching
- **Device Mapping**: Automatic distribution with ZeroGPU
- **128K Context**: Long conversations without truncation
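The singleton sharing can be sketched with double-checked locking; `ModelSingleton` and `load_fn` are illustrative stand-ins for the actual manager and quantized-model loader:

```python
import threading

class ModelSingleton:
    """Thread-safe cache ensuring every agent shares one model instance."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, load_fn):
        """Return the shared model, invoking load_fn exactly once."""
        if cls._instance is None:          # fast path: already loaded
            with cls._lock:                # slow path: serialize loaders
                if cls._instance is None:  # double-checked locking
                    cls._instance = load_fn()
        return cls._instance
```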

**ZeroGPU Integration**:
```python
@spaces.GPU(duration=120)  # Dynamic allocation for first load
def _load_model(self):
    # GPU available for 120 seconds
    # Loads model once on first request
    # Cached instance reused across all agents
    # Automatic GPU management by ZeroGPU
```

**Performance Characteristics**:
* **First Request**: 30-60 seconds (one-time model load)
  - With `preload_from_hub`: 30-40s (just quantization)
  - Without preload: 50-60s (download + quantization)
* **Subsequent Requests**: <1 second per agent
* **Memory Footprint**: ~1GB VRAM (persistent)
* **Cold Start**: Instant app startup (model loads on demand)

**Why Lazy Loading?**
* ✅ **Instant Startup**: App launches immediately
* ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
* ✅ **Memory Efficient**: Only loads when needed
* ✅ **Cache Persistent**: Stays loaded between requests
* ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces

---

### Analytics & Evaluation System

#### Built-In Dashboard
**Real-Time Metrics**:
* Total conversations
* Average response time
* Success rate (quality score >3.5)
* Educational quality scores (ML-evaluated)
* Classifier accuracy rates
* Active sessions count

**LightEval Integration**:
* BertScore for semantic quality
* ROUGE for response completeness
* Custom educational quality indicators:
  - Has examples
  - Structured explanation
  - Appropriate length
  - Encourages learning
  - Uses LaTeX (for math)
  - Clear sections
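Rule-based indicators like these could be approximated as simple boolean checks; the patterns and thresholds below are invented for illustration:

```python
import re

def quality_indicators(response: str) -> dict:
    """Boolean educational-quality checks over a generated response."""
    lower = response.lower()
    return {
        "has_examples": "for example" in lower or "e.g." in lower,
        "structured_explanation": bool(re.search(r"^\s*([-*]|\d+\.)\s", response, re.M)),
        "appropriate_length": 50 <= len(response.split()) <= 500,
        "uses_latex": "$" in response or "\\(" in response,
        "encourages_learning": "try" in lower or "?" in response,
    }
```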

**Exportable Data**:
* JSON export with full metrics
* CSV export of interaction history
* Programmatic access via API

---

### Performance Benchmarks

**Runtime Performance:**
* **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
* **Memory Usage**: ~1GB VRAM (4-bit quantization)
* **Context Window**: 128K tokens
* **First Request**: ~30-60 seconds (one-time load)
* **Warm Inference**: <1 second per agent
* **Startup Time**: Instant (lazy loading)

**Llama 3.2 Quality Scores:**
* MMLU: 63.4 (competitive with larger models)
* GSM8K (Math): 73.9
* HumanEval (Coding): 59.3
* Multilingual: 8 languages supported
* Safety: RLHF-aligned for educational use