---
title: Mimir
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---

# Mimir: Educational AI Assistant
## Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project

### Project Overview
Mimir demonstrates enterprise-grade AI system design through a multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Rather than relying on a single monolithic prompt, Mimir employs **four specialized agent types** working in concert: a tool decision engine, four routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring responses are tailored to each educational interaction.

***

### Technical Architecture

**Multi-Agent System:**
```
User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (3x) → Response Agent → Output
                     ↓                    ↓                      ↓                  ↓
              Llama-3.2-3B         Llama-3.2-3B (shared)    Llama-3.2-3B      Llama-3.2-3B
```

**Core Technologies:**

* **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
* **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
* **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
* **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
* **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
* **Gradio**: Multi-page interface (Chatbot + Analytics Dashboard)
* **Python**: Advanced backend with 4-bit quantization and streaming

**Key Frameworks & Libraries:**

* `transformers` & `accelerate` for model loading and inference optimization
* `bitsandbytes` for 4-bit NF4 quantization (75% memory reduction)
* `peft` for Parameter-Efficient Fine-Tuning support
* `spaces` for HuggingFace ZeroGPU integration
* `matplotlib` for dynamic visualization generation
* Custom state management system with SQLite and dataset backup

***

### Advanced Agent Architecture

#### Agent Pipeline Overview
The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.

#### Stage 1: Tool Decision Agent
**Purpose**: Determines if visualization tools enhance learning

**Model**: Llama-3.2-3B-Instruct (4-bit NF4 quantized)

**Prompt Engineering**:
* Highly constrained binary decision prompt (YES/NO only)
* Explicit INCLUDE/EXCLUDE criteria for educational contexts
* Zero-shot classification with educational domain knowledge

**Decision Criteria**:
```
INCLUDE: Mathematical functions, data analysis, chart interpretation, 
         trend visualization, proportional relationships

EXCLUDE: Greetings, definitions, explanations without data
```

**Output**: Boolean flag activating `TOOL_USE_ENHANCEMENT` prompt segment
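Because the output format is constrained to YES/NO, mapping it to the boolean flag needs no complex parsing. A minimal sketch (the function name is illustrative, not the system's actual code):

```python
def parse_tool_decision(raw: str) -> bool:
    """Map the agent's constrained YES/NO reply to the boolean tool flag.

    Anything other than a clear YES defaults to False (no visualization),
    so a malformed reply can never trigger an unwanted tool call.
    """
    return raw.strip().upper().startswith("YES")
```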

---

#### Stage 2: Prompt Routing Agents (4 Specialized Agents)
**Purpose**: Intelligent prompt segment selection through parallel analysis

**Model**: Shared Llama-3.2-3B-Instruct instance (memory efficient)

**Agent Specializations**:

1. **Agent 1 - Practice Question Detector**
   - Analyzes conversation context for practice question opportunities
   - Considers user's expressed understanding and learning progression
   - Activates: `STRUCTURE_PRACTICE_QUESTIONS`

2. **Agent 2 - Discovery Mode Classifier**
   - Dual-classification: vague input detection + understanding assessment
   - Returns: `VAUGE_INPUT`, `USER_UNDERSTANDING`, or neither
   - Enables guided discovery and clarification strategies

3. **Agent 3 - Follow-up Assessment Agent**
   - Detects if user is responding to previous practice questions
   - Analyzes conversation history for grading opportunities
   - Activates: `PRACTICE_QUESTION_FOLLOWUP` (triggers grading mode)

4. **Agent 4 - Teaching Mode Assessor**
   - Evaluates need for direct instruction vs. structured practice
   - Multi-output agent (can activate multiple prompts)
   - Activates: `GUIDING_TEACHING`, `STRUCTURE_PRACTICE_QUESTIONS`

**Prompt Engineering Innovation**:
* Each agent uses a specialized system prompt with clear decision criteria
* Structured output formats for reliable parsing
* Context-aware analysis incorporating full conversation history
* Sequential execution prevents decision conflicts
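A dual-classification output like Agent 2's might be parsed as follows (the function is a hypothetical sketch; the label strings are the system's own identifiers, including its `VAUGE_INPUT` spelling):

```python
def parse_discovery_labels(raw: str) -> list:
    """Return the discovery-mode segments Agent 2 activated, if any."""
    upper = raw.upper()
    # Substring match tolerates surrounding text in the structured output
    return [label for label in ("VAUGE_INPUT", "USER_UNDERSTANDING") if label in upper]
```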

---

#### Stage 3: Thinking Agents (Preprocessing Layer)
**Purpose**: Generate reasoning context before final response (CoT/ToT)

**Model**: Llama-3.2-3B-Instruct (shared instance)

**Agent Specializations**:

1. **Math Thinking Agent**
   - **Method**: Tree-of-Thought reasoning for mathematical problems
   - **Activation**: When `LATEX_FORMATTING` is active
   - **Output Structure**:
     ```
     Key Terms → Principles → Formulas → Step-by-Step Solution → Summary
     ```
   - **Complexity Routing**: Decision tree determines detail level (1A: basic, 1B: complex)

2. **Question/Answer Design Agent**
   - **Method**: Chain-of-Thought for practice question formulation
   - **Activation**: When `STRUCTURE_PRACTICE_QUESTIONS` is active
   - **Formatted Inputs**: Tool context, LaTeX guidelines, practice question templates
   - **Output**: Question design, data formatting, answer bank generation

3. **Reasoning Thinking Agent**
   - **Method**: General Chain-of-Thought preprocessing
   - **Activation**: When tools, follow-ups, or teaching mode active
   - **Output Structure**:
     ```
     User Knowledge Summary → Understanding Analysis →
     Previous Actions → Reference Fact Sheet
     ```

**Prompt Engineering Innovation**:
* Thinking agents produce **context for ResponseAgent**, not final output
* Outputs are invisible to user but inform response quality
* Tree-of-Thought (ToT) for math: explores multiple solution paths
* Chain-of-Thought (CoT) for others: step-by-step reasoning traces

---

#### Stage 4: Response Agent (Educational Response Generation)
**Purpose**: Generate pedagogically sound final response

**Model**: Llama-3.2-3B-Instruct (same shared instance)

**Configuration**:
* 4-bit NF4 quantization (BitsAndBytes)
* Mixed precision BF16 inference
* Accelerate integration for distributed computation
* 128K context window
* Multilingual support (8 languages)

**Prompt Assembly Process**:
1. **Core Identity**: Always included (defines Mimir persona)
2. **Logical Expressions**: Regex-triggered prompts (e.g., math keywords → `LATEX_FORMATTING`)
3. **Agent-Selected Prompts**: Dynamic assembly based on routing agent decisions
4. **Context Integration**: Tool outputs, thinking agent outputs, conversation history
5. **Complete Prompt**: All segments joined with proper formatting

**Dynamic Prompt Library** (11 segments):
```
Core:          CORE_IDENTITY (always)
Formatting:    GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery:     VAUGE_INPUT, USER_UNDERSTANDING
Teaching:      GUIDING_TEACHING
Practice:      STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool:          TOOL_USE_ENHANCEMENT
```
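The assembly step itself reduces to an ordered join over the active segments. A sketch, assuming a `segment_texts` mapping from the names above to their prompt text:

```python
def assemble_system_prompt(active_names, segment_texts):
    """Concatenate the active prompt segments, in order, into one system prompt."""
    return "\n\n".join(segment_texts[name] for name in active_names)
```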

**Response Post-Processing**:
* Artifact cleanup (remove special tokens)
* Intelligent truncation at logical breakpoints
* Sentence integrity preservation
* Quality validation gates
* Word-by-word streaming for UX
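The intelligent-truncation step above can be sketched as cutting at the last sentence boundary within a length budget (the limit and function name are illustrative):

```python
def truncate_at_breakpoint(text: str, max_chars: int = 600) -> str:
    """Truncate at the last sentence boundary within max_chars,
    falling back to a word boundary so no sentence is split mid-word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Latest sentence-ending punctuation inside the window
    idx = max(cut.rfind(". "), cut.rfind("! "), cut.rfind("? "))
    if idx != -1:
        return cut[: idx + 1]
    return cut.rsplit(" ", 1)[0]
```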

---

### Model Specifications

**Llama-3.2-3B-Instruct Details:**
* **Parameters**: 3.21 billion
* **Architecture**: Optimized transformer with Grouped-Query Attention (GQA)
* **Training Data**: 9 trillion tokens (December 2023 cutoff)
* **Context Length**: 128,000 tokens
* **Languages**: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
* **Quantization**: 4-bit NF4 (~1GB VRAM)
* **Training Method**: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF

**Why Single Model Architecture:**
* ✅ **Consistency**: Same reasoning style across all agents
* ✅ **Memory Efficient**: One model, shared instance (~1GB total)
* ✅ **Instruction-Tuned**: Optimized for educational dialogue
* ✅ **Fast Inference**: 3B parameters = quick responses
* ✅ **ZeroGPU Friendly**: Small enough for dynamic allocation
* ✅ **128K Context**: Can handle long educational conversations

---

### Prompt Engineering Techniques Demonstrated

#### 1. Hierarchical Prompt Architecture
**Three-Layer System**:
- **Agent System Prompts**: Specialized instructions for each agent type
- **Response Prompt Segments**: Modular components dynamically assembled
- **Thinking Prompts**: Preprocessing templates for reasoning generation

**Innovation**: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.

#### 2. Per-Turn Prompt State Management
**PromptStateManager**:
```python
# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts → False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING", 
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]
```

**Benefits**:
- No prompt pollution between turns
- Context-appropriate responses every time
- Traceable decision-making for debugging
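Given the interface shown above, the manager can stay very small. A minimal sketch, assuming always-on core segments and per-turn flags for the rest:

```python
class PromptStateManager:
    """Per-turn activation flags for the response prompt segments."""

    ALWAYS_ON = ("CORE_IDENTITY", "GENERAL_FORMATTING")

    def __init__(self, segment_names):
        self._flags = {name: False for name in segment_names}

    def reset(self):
        """Turn start: clean slate, all optional segments off."""
        for name in self._flags:
            self._flags[name] = False

    def update(self, name, active):
        self._flags[name] = active

    def get_active_response_prompts(self):
        """Always-on segments first, then activated ones in declaration order."""
        return list(self.ALWAYS_ON) + [
            n for n, on in self._flags.items() if on and n not in self.ALWAYS_ON
        ]
```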

#### 3. Logical Expression System
**Regex-Based Automatic Activation**:
```python
# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
    prompt_state.update("LATEX_FORMATTING", True)
```

**Hybrid Approach**: Combines rule-based triggers with LLM decision-making for optimal reliability.

#### 4. Constraint-Based Agent Prompting
**Tool Decision Example**:
```
System Prompt: Analyze query and determine if visualization needed.

Output Format: YES or NO (nothing else)

INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data
```

**Result**: Reliable, parseable outputs from agents without complex post-processing.

#### 5. Chain-of-Thought & Tree-of-Thought Preprocessing
**CoT for Sequential Reasoning**:
```
Step 1: Assess topic →
Step 2: Identify user understanding →
Step 3: Previous actions →
Step 4: Reference facts
```

**ToT for Mathematical Reasoning**:
```
Question Type Assessment →
  Branch 1A (Simple): Minimal steps
  Branch 1B (Complex): Full derivation with principles
```

**Innovation**: Thinking agents generate rich context that guides ResponseAgent to higher-quality outputs.
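For intuition, the 1A/1B routing could be approximated by a heuristic like the one below; the markers and threshold are invented for illustration, while the actual system uses an LLM-driven decision tree:

```python
def route_math_branch(question: str) -> str:
    """Pick the reasoning branch: '1A' (minimal steps) or '1B' (full derivation)."""
    complex_markers = ("derivative", "integral", "prove", "optimize", "limit")
    q = question.lower()
    if any(m in q for m in complex_markers) or len(question.split()) > 25:
        return "1B"  # complex: full derivation with principles
    return "1A"  # simple: minimal steps
```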

#### 6. Academic Integrity by Design
**Embedded in Core Prompts**:
* "Do not provide full solutions - guide through processes instead"
* "Break problems into conceptual components"
* "Ask clarifying questions about their understanding"
* Subject-specific guidelines (Math: explain concepts, not compute)

**Follow-up Grading**:
* Agent 3 detects practice question responses
* `PRACTICE_QUESTION_FOLLOWUP` prompt activates
* Automated assessment with constructive feedback

#### 7. Multi-Modal Response Generation
**Tool Integration**:
```python
# Tool decision → JSON generation → matplotlib rendering → base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155, ...},
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time"
)
```

**Result**: In-memory graph generation with educational context, embedded directly in response.
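The render-and-encode step might look like this with matplotlib's in-memory Agg backend; `render_graph_base64` is an illustrative name, not the system's actual tool:

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless, in-memory rendering
import matplotlib.pyplot as plt

def render_graph_base64(data: dict, plot_type: str, title: str) -> str:
    """Render a chart to PNG in memory and return it base64-encoded."""
    fig, ax = plt.subplots()
    labels, values = list(data.keys()), list(data.values())
    if plot_type == "line":
        ax.plot(labels, values)
    else:
        ax.bar(labels, values)
    ax.set_title(title)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # free the figure; nothing touches disk
    return base64.b64encode(buf.getvalue()).decode("ascii")
```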

---

### State Management & Persistence

#### GlobalStateManager Architecture
**Dual-Layer Persistence**:
1. **SQLite Database**: Fast local access, immediate writes
2. **HuggingFace Dataset**: Cloud backup, hourly sync
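The SQLite side of this dual layer might be sketched as follows (the schema and function names are illustrative assumptions, not the actual GlobalStateManager API):

```python
import sqlite3
import time

def init_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the local store and ensure the conversation table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conversations "
        "(session_id TEXT, turn INTEGER, role TEXT, content TEXT, ts REAL)"
    )
    return conn

def save_turn(conn, session_id, turn, role, content):
    """Immediate local write; an hourly Dataset sync would read this table."""
    conn.execute(
        "INSERT INTO conversations VALUES (?, ?, ?, ?, ?)",
        (session_id, turn, role, content, time.time()),
    )
    conn.commit()
```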

**State Categories**:
```python
- Conversation State: Full chat history + agent context
- Prompt State: Per-turn activation (resets each interaction)
- Analytics State: Metrics, dashboard data, export history
- Evaluation State: Quality scores, classifier accuracy, user feedback
- ML Model Cache: Loaded model for reuse across sessions
```

**Thread Safety**: All state operations protected by `threading.Lock()`

**Cleanup Strategy**: 
- Automatic cleanup every 60 minutes
- Remove sessions older than 24 hours
- Prevents memory leaks in long-running deployments
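The cleanup pass can be sketched as a lock-protected sweep over last-activity timestamps (names are illustrative; `sessions` maps session id to epoch seconds):

```python
import threading
import time

_state_lock = threading.Lock()  # all state operations are lock-protected

def cleanup_sessions(sessions: dict, max_age_s: float = 24 * 3600, now: float = None) -> list:
    """Remove sessions idle longer than max_age_s; return the removed ids."""
    now = time.time() if now is None else now
    with _state_lock:
        stale = [sid for sid, ts in sessions.items() if now - ts > max_age_s]
        for sid in stale:
            del sessions[sid]
    return stale
```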

---

### Model Loading & Optimization Strategy

#### Two-Stage Lazy Loading Pipeline

**Stage 1: Build Time (Docker) - Optional Pre-caching**
```yaml
# preload_from_hub in README.md
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct
```
* Downloads model weights during Docker build
* Cached in HuggingFace hub cache directory
* Reduces first-request latency (no download needed)
* **Optional but recommended** for production deployments

**Stage 2: Runtime (Lazy Loading with Automatic Caching)**
```python
# model_manager.py - LazyLlamaModel class
@spaces.GPU(duration=120)  # ZeroGPU allocates a GPU for up to 120s on first load
def _load_model(self):
    """Load on the first generate() call; reuse the cached instance afterwards."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls;
    # all agents share this single instance
```

**Loading Flow**:
```
App starts → Instant startup (no model loading)
             ↓
First user request → Triggers model load (~30-60s)
                    ├─ Download from cache (if preloaded: instant)
                    ├─ Load with 4-bit quantization
                    ├─ Create pipeline
                    └─ Cache in memory
                    ↓
All subsequent requests → Use cached model (~1s)
```

**Memory Optimization**:
- **4-bit NF4 Quantization**: 75% memory reduction
  - Llama-3.2-3B: ~6GB → ~1GB VRAM
- **Shared Model Strategy**: ALL agents share one model instance
- **Singleton Pattern**: Thread-safe model caching
- **Device Mapping**: Automatic distribution with ZeroGPU
- **128K Context**: Long conversations without truncation
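The singleton sharing can be sketched with double-checked locking; `ModelSingleton` and `load_fn` are illustrative stand-ins for the actual manager and quantized-model loader:

```python
import threading

class ModelSingleton:
    """Thread-safe cache ensuring every agent shares one model instance."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, load_fn):
        """Return the shared model, invoking load_fn exactly once."""
        if cls._instance is None:          # fast path: already loaded
            with cls._lock:                # slow path: serialize loaders
                if cls._instance is None:  # double-checked locking
                    cls._instance = load_fn()
        return cls._instance
```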

**ZeroGPU Integration**:
```python
@spaces.GPU(duration=120)  # Dynamic allocation for first load
def _load_model(self):
    # GPU available for 120 seconds
    # Loads model once on first request
    # Cached instance reused across all agents
    # Automatic GPU management by ZeroGPU
```

**Performance Characteristics**:
* **First Request**: 30-60 seconds (one-time model load)
  - With `preload_from_hub`: 30-40s (just quantization)
  - Without preload: 50-60s (download + quantization)
* **Subsequent Requests**: <1 second per agent
* **Memory Footprint**: ~1GB VRAM (persistent)
* **Cold Start**: Instant app startup (model loads on demand)

**Why Lazy Loading?**
* ✅ **Instant Startup**: App launches immediately
* ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
* ✅ **Memory Efficient**: Only loads when needed
* ✅ **Cache Persistent**: Stays loaded between requests
* ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces

---

### Analytics & Evaluation System

#### Built-In Dashboard
**Real-Time Metrics**:
* Total conversations
* Average response time
* Success rate (quality score >3.5)
* Educational quality scores (ML-evaluated)
* Classifier accuracy rates
* Active sessions count

**LightEval Integration**:
* BertScore for semantic quality
* ROUGE for response completeness
* Custom educational quality indicators:
  - Has examples
  - Structured explanation
  - Appropriate length
  - Encourages learning
  - Uses LaTeX (for math)
  - Clear sections
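Rule-based indicators like these could be approximated as simple boolean checks; the patterns and thresholds below are invented for illustration:

```python
import re

def quality_indicators(response: str) -> dict:
    """Boolean educational-quality checks over a generated response."""
    lower = response.lower()
    return {
        "has_examples": "for example" in lower or "e.g." in lower,
        "structured_explanation": bool(re.search(r"^\s*([-*]|\d+\.)\s", response, re.M)),
        "appropriate_length": 50 <= len(response.split()) <= 500,
        "uses_latex": "$" in response or "\\(" in response,
        "encourages_learning": "try" in lower or "?" in response,
    }
```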

**Exportable Data**:
* JSON export with full metrics
* CSV export of interaction history
* Programmatic access via API

---

### Performance Benchmarks

**Runtime Performance:**
* **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
* **Memory Usage**: ~1GB VRAM (4-bit quantization)
* **Context Window**: 128K tokens
* **First Request**: ~30-60 seconds (one-time load)
* **Warm Inference**: <1 second per agent
* **Startup Time**: Instant (lazy loading)

**Llama 3.2 Quality Scores:**
* MMLU: 63.4 (competitive with larger models)
* GSM8K (Math): 73.9
* HumanEval (Coding): 59.3
* Multilingual: 8 languages supported
* Safety: RLHF-aligned for educational use