---
title: Mimir
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---
# Mimir: Educational AI Assistant
## Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project
### Project Overview
Mimir demonstrates enterprise-grade AI system design through a sophisticated multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Unlike simple single-model implementations, Mimir employs **four specialized agent types** working in concert: a tool decision engine, four parallel routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring responses are tailored to each unique educational interaction.
***
### Technical Architecture
**Multi-Agent System:**
```
User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (3x) → Response Agent → Output
                     ↓                       ↓                      ↓                    ↓
               Llama-3.2-3B        Llama-3.2-3B (shared)      Llama-3.2-3B         Llama-3.2-3B
```
**Core Technologies:**
* **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
* **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
* **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
* **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
* **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
* **Gradio**: Multi-page interface (Chatbot + Analytics Dashboard)
* **Python**: Advanced backend with 4-bit quantization and streaming
**Key Frameworks & Libraries:**
* `transformers` & `accelerate` for model loading and inference optimization
* `bitsandbytes` for 4-bit NF4 quantization (75% memory reduction)
* `peft` for Parameter-Efficient Fine-Tuning support
* `spaces` for HuggingFace ZeroGPU integration
* `matplotlib` for dynamic visualization generation
* Custom state management system with SQLite and dataset backup
***
### Advanced Agent Architecture
#### Agent Pipeline Overview
The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.
#### Stage 1: Tool Decision Agent
**Purpose**: Determines if visualization tools enhance learning
**Model**: Llama-3.2-3B-Instruct (4-bit NF4 quantized)
**Prompt Engineering**:
* Highly constrained binary decision prompt (YES/NO only)
* Explicit INCLUDE/EXCLUDE criteria for educational contexts
* Zero-shot classification with educational domain knowledge
**Decision Criteria**:
```
INCLUDE: Mathematical functions, data analysis, chart interpretation,
trend visualization, proportional relationships
EXCLUDE: Greetings, definitions, explanations without data
```
**Output**: Boolean flag activating `TOOL_USE_ENHANCEMENT` prompt segment
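The YES/NO-to-boolean gate can be sketched as a tiny parser over the constrained output (`parse_tool_decision` is a hypothetical helper name; the README does not show the actual parsing code):

```python
import re

def parse_tool_decision(raw_output: str) -> bool:
    """Map the agent's constrained YES/NO reply to a boolean flag.

    The prompt instructs the model to answer YES or NO only, but
    generation can still add whitespace or stray tokens, so the parser
    anchors on the first YES/NO token it finds and defaults to False.
    """
    match = re.search(r"\b(YES|NO)\b", raw_output.upper())
    return match is not None and match.group(1) == "YES"
```

Defaulting to False on unparseable output is the safe choice here: a missing chart degrades the answer less than an irrelevant one.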
---
#### Stage 2: Prompt Routing Agents (4 Specialized Agents)
**Purpose**: Intelligent prompt segment selection via four specialized analyses run in sequence
**Model**: Shared Llama-3.2-3B-Instruct instance (memory efficient)
**Agent Specializations**:
1. **Agent 1 - Practice Question Detector**
- Analyzes conversation context for practice question opportunities
- Considers user's expressed understanding and learning progression
- Activates: `STRUCTURE_PRACTICE_QUESTIONS`
2. **Agent 2 - Discovery Mode Classifier**
- Dual-classification: vague input detection + understanding assessment
- Returns: `VAUGE_INPUT`, `USER_UNDERSTANDING`, or neither
- Enables guided discovery and clarification strategies
3. **Agent 3 - Follow-up Assessment Agent**
- Detects if user is responding to previous practice questions
- Analyzes conversation history for grading opportunities
- Activates: `PRACTICE_QUESTION_FOLLOWUP` (triggers grading mode)
4. **Agent 4 - Teaching Mode Assessor**
- Evaluates need for direct instruction vs. structured practice
- Multi-output agent (can activate multiple prompts)
- Activates: `GUIDING_TEACHING`, `STRUCTURE_PRACTICE_QUESTIONS`
**Prompt Engineering Innovation**:
* Each agent uses a specialized system prompt with clear decision criteria
* Structured output formats for reliable parsing
* Context-aware analysis incorporating full conversation history
* Sequential execution prevents decision conflicts
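The sequential routing pass can be sketched as a simple loop: each agent is a callable that returns the prompt names it wants active, and running agents in a fixed order means a later agent sees earlier activations instead of silently contradicting them (names here are illustrative, not the repository's actual API):

```python
def run_routing_agents(agents, conversation, prompt_state):
    """Run routing agents one after another against a shared flag dict."""
    for agent in agents:  # sequential execution, one shared model
        for prompt_name in agent(conversation, prompt_state):
            prompt_state[prompt_name] = True  # activate selected segment
    return prompt_state
```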
---
#### Stage 3: Thinking Agents (Preprocessing Layer)
**Purpose**: Generate reasoning context before final response (CoT/ToT)
**Model**: Llama-3.2-3B-Instruct (shared instance)
**Agent Specializations**:
1. **Math Thinking Agent**
- **Method**: Tree-of-Thought reasoning for mathematical problems
- **Activation**: When `LATEX_FORMATTING` is active
- **Output Structure**:
```
Key Terms → Principles → Formulas → Step-by-Step Solution → Summary
```
- **Complexity Routing**: Decision tree determines detail level (1A: basic, 1B: complex)
2. **Question/Answer Design Agent**
- **Method**: Chain-of-Thought for practice question formulation
- **Activation**: When `STRUCTURE_PRACTICE_QUESTIONS` is active
- **Formatted Inputs**: Tool context, LaTeX guidelines, practice question templates
- **Output**: Question design, data formatting, answer bank generation
3. **Reasoning Thinking Agent**
- **Method**: General Chain-of-Thought preprocessing
- **Activation**: When tools, follow-ups, or teaching mode active
- **Output Structure**:
```
User Knowledge Summary → Understanding Analysis →
Previous Actions → Reference Fact Sheet
```
**Prompt Engineering Innovation**:
* Thinking agents produce **context for ResponseAgent**, not final output
* Outputs are invisible to user but inform response quality
* Tree-of-Thought (ToT) for math: explores multiple solution paths
* Chain-of-Thought (CoT) for others: step-by-step reasoning traces
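The handoff from thinking agents to the ResponseAgent can be sketched as prepending the hidden trace to the system context (an illustrative message shape, not the project's actual schema):

```python
def build_response_input(thinking_trace: str, user_msg: str) -> list:
    """Prepend the hidden reasoning trace as system context.

    The user never sees the trace, but it steers the final response.
    """
    return [
        {"role": "system",
         "content": "Internal reasoning notes:\n" + thinking_trace},
        {"role": "user", "content": user_msg},
    ]
```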
---
#### Stage 4: Response Agent (Educational Response Generation)
**Purpose**: Generate pedagogically sound final response
**Model**: Llama-3.2-3B-Instruct (same shared instance)
**Configuration**:
* 4-bit NF4 quantization (BitsAndBytes)
* Mixed precision BF16 inference
* Accelerate integration for distributed computation
* 128K context window
* Multilingual support (8 languages)
**Prompt Assembly Process**:
1. **Core Identity**: Always included (defines Mimir persona)
2. **Logical Expressions**: Regex-triggered prompts (e.g., math keywords → `LATEX_FORMATTING`)
3. **Agent-Selected Prompts**: Dynamic assembly based on routing agent decisions
4. **Context Integration**: Tool outputs, thinking agent outputs, conversation history
5. **Complete Prompt**: All segments joined with proper formatting
**Dynamic Prompt Library** (11 segments):
```
Core: CORE_IDENTITY (always)
Formatting: GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery: VAUGE_INPUT, USER_UNDERSTANDING
Teaching: GUIDING_TEACHING
Practice: STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool: TOOL_USE_ENHANCEMENT
```
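The assembly step above can be sketched as a dictionary lookup plus join: always-on segments come first, then whatever flags the agents switched on (the segment texts below are short placeholders, not the project's real prompts):

```python
# Placeholder segment texts standing in for the real prompt library.
PROMPT_LIBRARY = {
    "CORE_IDENTITY": "You are Mimir, an educational assistant.",
    "GENERAL_FORMATTING": "Use clear headings and short paragraphs.",
    "LATEX_FORMATTING": "Render all math in LaTeX.",
}

def assemble_prompt(active_flags: dict) -> str:
    """Join always-on segments plus agent-activated ones, in order."""
    always_on = ["CORE_IDENTITY", "GENERAL_FORMATTING"]
    selected = always_on + [
        name for name, on in active_flags.items()
        if on and name not in always_on and name in PROMPT_LIBRARY
    ]
    return "\n\n".join(PROMPT_LIBRARY[name] for name in selected)
```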
**Response Post-Processing**:
* Artifact cleanup (remove special tokens)
* Intelligent truncation at logical breakpoints
* Sentence integrity preservation
* Quality validation gates
* Word-by-word streaming for UX
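The "truncation at logical breakpoints" step can be sketched as cutting at the last full sentence inside the limit (a minimal sketch; the real post-processor is not shown in this README):

```python
def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Cut overlong output at the last complete sentence within the
    limit, preserving sentence integrity instead of clipping mid-word."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Last sentence-ending punctuation inside the clipped span.
    cut = max(clipped.rfind(c) for c in ".!?")
    return clipped[: cut + 1] if cut != -1 else clipped
```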
---
### Model Specifications
**Llama-3.2-3B-Instruct Details:**
* **Parameters**: 3.21 billion
* **Architecture**: Optimized transformer with Grouped-Query Attention (GQA)
* **Training Data**: 9 trillion tokens (December 2023 cutoff)
* **Context Length**: 128,000 tokens
* **Languages**: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
* **Quantization**: 4-bit NF4 (~1GB VRAM)
* **Training Method**: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF
**Why Single Model Architecture:**
* ✅ **Consistency**: Same reasoning style across all agents
* ✅ **Memory Efficient**: One model, shared instance (~1GB total)
* ✅ **Instruction-Tuned**: Optimized for educational dialogue
* ✅ **Fast Inference**: 3B parameters = quick responses
* ✅ **ZeroGPU Friendly**: Small enough for dynamic allocation
* ✅ **128K Context**: Can handle long educational conversations
---
### Prompt Engineering Techniques Demonstrated
#### 1. Hierarchical Prompt Architecture
**Three-Layer System**:
- **Agent System Prompts**: Specialized instructions for each agent type
- **Response Prompt Segments**: Modular components dynamically assembled
- **Thinking Prompts**: Preprocessing templates for reasoning generation
**Innovation**: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.
#### 2. Per-Turn Prompt State Management
**PromptStateManager**:
```python
# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts → False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING",
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]
```
**Benefits**:
- No prompt pollution between turns
- Context-appropriate responses every time
- Traceable decision-making for debugging
#### 3. Logical Expression System
**Regex-Based Automatic Activation**:
```python
# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
prompt_state.update("LATEX_FORMATTING", True)
```
**Hybrid Approach**: Combines rule-based triggers with LLM decision-making for optimal reliability.
#### 4. Constraint-Based Agent Prompting
**Tool Decision Example**:
```
System Prompt: Analyze query and determine if visualization needed.
Output Format: YES or NO (nothing else)
INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data
```
**Result**: Reliable, parseable outputs from agents without complex post-processing.
#### 5. Chain-of-Thought & Tree-of-Thought Preprocessing
**CoT for Sequential Reasoning**:
```
Step 1: Assess topic →
Step 2: Identify user understanding →
Step 3: Previous actions →
Step 4: Reference facts
```
**ToT for Mathematical Reasoning**:
```
Question Type Assessment →
Branch 1A (Simple): Minimal steps
Branch 1B (Complex): Full derivation with principles
```
**Innovation**: Thinking agents generate rich context that guides ResponseAgent to higher-quality outputs.
#### 6. Academic Integrity by Design
**Embedded in Core Prompts**:
* "Do not provide full solutions - guide through processes instead"
* "Break problems into conceptual components"
* "Ask clarifying questions about their understanding"
* Subject-specific guidelines (Math: explain concepts, not compute)
**Follow-up Grading**:
* Agent 3 detects practice question responses
* `PRACTICE_QUESTION_FOLLOWUP` prompt activates
* Automated assessment with constructive feedback
#### 7. Multi-Modal Response Generation
**Tool Integration**:
```python
# Tool decision → JSON generation → matplotlib rendering → base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155},  # further weeks elided
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time",
)
```
**Result**: In-memory graph generation with educational context, embedded directly in response.
---
### State Management & Persistence
#### GlobalStateManager Architecture
**Dual-Layer Persistence**:
1. **SQLite Database**: Fast local access, immediate writes
2. **HuggingFace Dataset**: Cloud backup, hourly sync
**State Categories**:
```
- Conversation State: Full chat history + agent context
- Prompt State: Per-turn activation (resets each interaction)
- Analytics State: Metrics, dashboard data, export history
- Evaluation State: Quality scores, classifier accuracy, user feedback
- ML Model Cache: Loaded model for reuse across sessions
```
**Thread Safety**: All state operations protected by `threading.Lock()`
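Lock-guarded access can be sketched as below (an assumed shape; the real GlobalStateManager also persists to SQLite and the dataset backup):

```python
import threading

class GlobalState:
    """Minimal lock-guarded key/value state, sketching the pattern."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        with self._lock:  # serialize writers
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:  # serialize readers against writers
            return self._data.get(key, default)
```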
**Cleanup Strategy**:
- Automatic cleanup every 60 minutes
- Remove sessions older than 24 hours
- Prevents memory leaks in long-running deployments
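The periodic sweep can be sketched as a TTL filter (hypothetical shapes: each session record is assumed to carry a `last_active` timestamp):

```python
SESSION_TTL = 24 * 3600  # seconds; sessions older than 24h are dropped

def cleanup_sessions(sessions: dict, now: float) -> dict:
    """Return only sessions active within the TTL window."""
    return {
        sid: session for sid, session in sessions.items()
        if now - session["last_active"] < SESSION_TTL
    }
```

Running this on a timer every 60 minutes bounds memory growth without touching live sessions.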
---
### Model Loading & Optimization Strategy
#### Two-Stage Lazy Loading Pipeline
**Stage 1: Build Time (Docker) - Optional Pre-caching**
```yaml
# preload_from_hub in README.md
preload_from_hub:
- meta-llama/Llama-3.2-3B-Instruct
```
* Downloads model weights during Docker build
* Cached in HuggingFace hub cache directory
* Reduces first-request latency (no download needed)
* **Optional but recommended** for production deployments
**Stage 2: Runtime (Lazy Loading with Automatic Caching)**
```python
# model_manager.py - LazyLlamaModel class (abridged)
@spaces.GPU(duration=120)  # GPU allocated for up to 120s during first load
def _load_model(self):
    """Load on first generate() call; later calls reuse the cache."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls;
    # all agents share this single instance.
```
**Loading Flow**:
```
App starts → Instant startup (no model loading)
        ↓
First user request → Triggers model load (~30-60s)
    ├─ Download from cache (if preloaded: instant)
    ├─ Load with 4-bit quantization
    ├─ Create pipeline
    └─ Cache in memory
        ↓
All subsequent requests → Use cached model (~1s)
```
**Memory Optimization**:
- **4-bit NF4 Quantization**: 75% memory reduction
- Llama-3.2-3B: ~6GB β†’ ~1GB VRAM
- **Shared Model Strategy**: ALL agents share one model instance
- **Singleton Pattern**: Thread-safe model caching
- **Device Mapping**: Automatic distribution with ZeroGPU
- **128K Context**: Long conversations without truncation
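The quantization settings above correspond to a standard `BitsAndBytesConfig`; a minimal sketch (the repository's exact configuration may differ):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 weights with BF16 compute, matching the settings above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # mixed-precision compute
)
```

This object is what gets passed as `quantization_config` in the `from_pretrained` call shown earlier.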
**ZeroGPU Integration**:
```python
@spaces.GPU(duration=120) # Dynamic allocation for first load
def _load_model(self):
# GPU available for 120 seconds
# Loads model once on first request
# Cached instance reused across all agents
# Automatic GPU management by ZeroGPU
```
**Performance Characteristics**:
* **First Request**: 30-60 seconds (one-time model load)
- With `preload_from_hub`: 30-40s (just quantization)
- Without preload: 50-60s (download + quantization)
* **Subsequent Requests**: <1 second per agent
* **Memory Footprint**: ~1GB VRAM (persistent)
* **Cold Start**: Instant app startup (model loads on demand)
**Why Lazy Loading?**
* ✅ **Instant Startup**: App launches immediately
* ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
* ✅ **Memory Efficient**: Only loads when needed
* ✅ **Cache Persistent**: Stays loaded between requests
* ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces
---
### Analytics & Evaluation System
#### Built-In Dashboard
**Real-Time Metrics**:
* Total conversations
* Average response time
* Success rate (quality score >3.5)
* Educational quality scores (ML-evaluated)
* Classifier accuracy rates
* Active sessions count
**LightEval Integration**:
* BERTScore for semantic quality
* ROUGE for response completeness
* Custom educational quality indicators:
- Has examples
- Structured explanation
- Appropriate length
- Encourages learning
- Uses LaTeX (for math)
- Clear sections
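The custom indicators can be sketched as cheap rule-based checks layered on top of the BERTScore/ROUGE metrics (hypothetical thresholds and heuristics, shown for shape only):

```python
def quality_indicators(response: str) -> dict:
    """Rule-based educational quality flags; each is a heuristic,
    not a model judgment."""
    return {
        "has_examples": "for example" in response.lower(),
        "structured": any(m in response for m in ("##", "1.", "- ")),
        "appropriate_length": 50 <= len(response.split()) <= 600,
        "uses_latex": "$" in response or "\\(" in response,
    }
```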
**Exportable Data**:
* JSON export with full metrics
* CSV export of interaction history
* Programmatic access via API
---
### Performance Benchmarks
**Runtime Performance:**
* **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
* **Memory Usage**: ~1GB VRAM (4-bit quantization)
* **Context Window**: 128K tokens
* **First Request**: ~30-60 seconds (one-time load)
* **Warm Inference**: <1 second per agent
* **Startup Time**: Instant (lazy loading)
**Llama 3.2 Quality Scores:**
* MMLU: 63.4 (competitive with larger models)
* GSM8K (Math): 73.9
* HumanEval (Coding): 59.3
* Multilingual: 8 languages supported
* Safety: RLHF-aligned for educational use