---
title: Mimir
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---
# Mimir: Educational AI Assistant
*Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project*
## Project Overview
Mimir demonstrates enterprise-grade AI system design through a sophisticated multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Unlike simple single-model implementations, Mimir employs four specialized agent types working in concert: a tool decision engine, four parallel routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring responses are tailored to each unique educational interaction.
## Technical Architecture
Multi-Agent System:
```
User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (3x) → Response Agent → Output
                   │                       │                       │                    │
             Llama-3.2-3B         Llama-3.2-3B (shared)      Llama-3.2-3B        Llama-3.2-3B
```
Core Technologies:
- Unified Model Architecture: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
- Lazy Loading Strategy: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
- Custom Orchestration: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
- State Management: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
- ZeroGPU Integration: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
- Gradio: Multi-page interface (Chatbot + Analytics Dashboard)
- Python: Advanced backend with 4-bit quantization and streaming
Key Frameworks & Libraries:
- `transformers` & `accelerate` for model loading and inference optimization
- `bitsandbytes` for 4-bit NF4 quantization (75% memory reduction)
- `peft` for Parameter-Efficient Fine-Tuning support
- `spaces` for HuggingFace ZeroGPU integration
- `matplotlib` for dynamic visualization generation
- Custom state management system with SQLite and dataset backup
## Advanced Agent Architecture
### Agent Pipeline Overview
The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.
### Stage 1: Tool Decision Agent
Purpose: Determines if visualization tools enhance learning
Model: Llama-3.2-3B-Instruct (4-bit NF4 quantized)
Prompt Engineering:
- Highly constrained binary decision prompt (YES/NO only)
- Explicit INCLUDE/EXCLUDE criteria for educational contexts
- Zero-shot classification with educational domain knowledge
Decision Criteria:
```
INCLUDE: Mathematical functions, data analysis, chart interpretation,
         trend visualization, proportional relationships
EXCLUDE: Greetings, definitions, explanations without data
```
Output: Boolean flag activating TOOL_USE_ENHANCEMENT prompt segment
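The constrained YES/NO pattern described above can be sketched as follows. This is an illustrative reconstruction, not the project's actual code: the `TOOL_DECISION_PROMPT` text, the `parse_tool_decision` helper, and the `generate` callable (standing in for a call to the shared Llama-3.2-3B instance) are all assumptions.

```python
# Hypothetical sketch: constrained binary tool decision with tolerant parsing.
TOOL_DECISION_PROMPT = """Analyze the query and decide if a visualization tool \
would enhance learning. Answer with exactly one word: YES or NO.

INCLUDE: mathematical functions, data analysis, chart interpretation,
trend visualization, proportional relationships
EXCLUDE: greetings, definitions, explanations without data

Query: {query}
Answer:"""

def parse_tool_decision(raw: str) -> bool:
    """Map the model's raw output to a boolean, tolerating extra tokens."""
    if not raw.strip():
        return False  # fail closed: no output means no tool
    first = raw.strip().split()[0].upper().rstrip(".,!")
    return first == "YES"

def decide_tool_use(query: str, generate) -> bool:
    """`generate` is any callable that takes a prompt and returns model text."""
    return parse_tool_decision(generate(TOOL_DECISION_PROMPT.format(query=query)))
```

Parsing only the first token keeps the decision robust even when the model appends an explanation despite the "nothing else" constraint.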
### Stage 2: Prompt Routing Agents (4 Specialized Agents)
Purpose: Intelligent prompt segment selection through parallel analysis
Model: Shared Llama-3.2-3B-Instruct instance (memory efficient)
Agent Specializations:
Agent 1 - Practice Question Detector
- Analyzes conversation context for practice question opportunities
- Considers user's expressed understanding and learning progression
- Activates: `STRUCTURE_PRACTICE_QUESTIONS`
Agent 2 - Discovery Mode Classifier
- Dual-classification: vague input detection + understanding assessment
- Returns: `VAUGE_INPUT`, `USER_UNDERSTANDING`, or neither
- Enables guided discovery and clarification strategies
Agent 3 - Follow-up Assessment Agent
- Detects if user is responding to previous practice questions
- Analyzes conversation history for grading opportunities
- Activates: `PRACTICE_QUESTION_FOLLOWUP` (triggers grading mode)
Agent 4 - Teaching Mode Assessor
- Evaluates need for direct instruction vs. structured practice
- Multi-output agent (can activate multiple prompts)
- Activates: `GUIDING_TEACHING`, `STRUCTURE_PRACTICE_QUESTIONS`
Prompt Engineering Innovation:
- Each agent uses a specialized system prompt with clear decision criteria
- Structured output formats for reliable parsing
- Context-aware analysis incorporating full conversation history
- Sequential execution prevents decision conflicts
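The sequential routing pass could be sketched like this. The agent names and labels mirror the descriptions above, but the `ROUTING_TABLE` layout and the `run_agent` callable (a stand-in for an LLM call with structured output) are hypothetical:

```python
# Hypothetical sketch: run the four routing agents in order, mapping each
# agent's structured label onto prompt-segment activations.
ROUTING_TABLE = {
    "practice_detector":    {"PRACTICE": ["STRUCTURE_PRACTICE_QUESTIONS"]},
    "discovery_classifier": {"VAUGE_INPUT": ["VAUGE_INPUT"],
                             "USER_UNDERSTANDING": ["USER_UNDERSTANDING"]},
    "followup_assessor":    {"FOLLOWUP": ["PRACTICE_QUESTION_FOLLOWUP"]},
    "teaching_assessor":    {"TEACH": ["GUIDING_TEACHING"],
                             "TEACH_AND_PRACTICE": ["GUIDING_TEACHING",
                                                    "STRUCTURE_PRACTICE_QUESTIONS"]},
}

def route_prompts(history: str, run_agent) -> set:
    """Run routing agents sequentially; collect the activated segments."""
    active = set()
    for agent_name, label_map in ROUTING_TABLE.items():
        label = run_agent(agent_name, history)   # structured single-label output
        active.update(label_map.get(label, []))  # unknown labels activate nothing
    return active
```

Running the agents sequentially (rather than in parallel threads) makes the activation order deterministic, which is what prevents the decision conflicts noted above.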
### Stage 3: Thinking Agents (Preprocessing Layer)
Purpose: Generate reasoning context before final response (CoT/ToT)
Model: Llama-3.2-3B-Instruct (shared instance)
Agent Specializations:
Math Thinking Agent
- Method: Tree-of-Thought reasoning for mathematical problems
- Activation: When `LATEX_FORMATTING` is active
- Output Structure: Key Terms → Principles → Formulas → Step-by-Step Solution → Summary
- Complexity Routing: Decision tree determines detail level (1A: basic, 1B: complex)
Question/Answer Design Agent
- Method: Chain-of-Thought for practice question formulation
- Activation: When `STRUCTURE_PRACTICE_QUESTIONS` is active
- Formatted Inputs: Tool context, LaTeX guidelines, practice question templates
- Output: Question design, data formatting, answer bank generation
Reasoning Thinking Agent
- Method: General Chain-of-Thought preprocessing
- Activation: When tools, follow-ups, or teaching mode active
- Output Structure: User Knowledge Summary → Understanding Analysis → Previous Actions → Reference Fact Sheet
Prompt Engineering Innovation:
- Thinking agents produce context for ResponseAgent, not final output
- Outputs are invisible to user but inform response quality
- Tree-of-Thought (ToT) for math: explores multiple solution paths
- Chain-of-Thought (CoT) for others: step-by-step reasoning traces
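The activation rules for the three thinking agents can be sketched as a simple dispatcher. The template strings are illustrative placeholders summarizing the output structures above, not the project's real preprocessing prompts:

```python
# Hypothetical sketch: select which thinking templates run this turn,
# based on the active prompt segments.
MATH_TOT = ("Key Terms -> Principles -> Formulas -> "
            "Step-by-Step Solution -> Summary")
QA_COT = "Question design -> data formatting -> answer bank"
REASONING_COT = ("User Knowledge Summary -> Understanding Analysis -> "
                 "Previous Actions -> Reference Fact Sheet")

def select_thinking_templates(active: set) -> list:
    """Return the preprocessing templates to run for this turn."""
    templates = []
    if "LATEX_FORMATTING" in active:              # Math Thinking Agent (ToT)
        templates.append(MATH_TOT)
    if "STRUCTURE_PRACTICE_QUESTIONS" in active:  # Q/A Design Agent (CoT)
        templates.append(QA_COT)
    if active & {"TOOL_USE_ENHANCEMENT", "PRACTICE_QUESTION_FOLLOWUP",
                 "GUIDING_TEACHING"}:             # Reasoning Agent (CoT)
        templates.append(REASONING_COT)
    return templates
```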
### Stage 4: Response Agent (Educational Response Generation)
Purpose: Generate pedagogically sound final response
Model: Llama-3.2-3B-Instruct (same shared instance)
Configuration:
- 4-bit NF4 quantization (BitsAndBytes)
- Mixed precision BF16 inference
- Accelerate integration for distributed computation
- 128K context window
- Multilingual support (8 languages)
Prompt Assembly Process:
- Core Identity: Always included (defines Mimir persona)
- Logical Expressions: Regex-triggered prompts (e.g., math keywords → `LATEX_FORMATTING`)
- Agent-Selected Prompts: Dynamic assembly based on routing agent decisions
- Context Integration: Tool outputs, thinking agent outputs, conversation history
- Complete Prompt: All segments joined with proper formatting
Dynamic Prompt Library (11 segments):
```
Core:       CORE_IDENTITY (always)
Formatting: GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery:  VAUGE_INPUT, USER_UNDERSTANDING
Teaching:   GUIDING_TEACHING
Practice:   STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool:       TOOL_USE_ENHANCEMENT
```
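The assembly step itself might look like this: always-on segments first, then activated segments in a fixed library order, joined into one system prompt. The `SEGMENT_ORDER` list and `assemble_prompt` helper are illustrative assumptions; only the segment names come from the library above.

```python
# Hypothetical sketch of dynamic prompt assembly from the segment library.
SEGMENT_ORDER = ["CORE_IDENTITY", "GENERAL_FORMATTING", "LATEX_FORMATTING",
                 "VAUGE_INPUT", "USER_UNDERSTANDING", "GUIDING_TEACHING",
                 "STRUCTURE_PRACTICE_QUESTIONS", "PRACTICE_QUESTION_FOLLOWUP",
                 "TOOL_USE_ENHANCEMENT"]
ALWAYS_ON = {"CORE_IDENTITY", "GENERAL_FORMATTING"}

def assemble_prompt(active: set, library: dict) -> str:
    """Join always-on plus activated segments in a deterministic order."""
    chosen = [name for name in SEGMENT_ORDER
              if name in ALWAYS_ON or name in active]
    return "\n\n".join(library[name] for name in chosen)
```

Keeping the ordering in one list (rather than iterating over the activation set) means the final prompt is reproducible for a given activation state, which helps the traceability goal mentioned later.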
Response Post-Processing:
- Artifact cleanup (remove special tokens)
- Intelligent truncation at logical breakpoints
- Sentence integrity preservation
- Quality validation gates
- Word-by-word streaming for UX
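The artifact cleanup and sentence-preserving truncation steps could be sketched as below. The special-token list is a Llama-style example and the `postprocess` function is an assumption, not the project's actual cleanup code:

```python
import re

# Hypothetical sketch: strip chat-template artifacts, then truncate at the
# last complete sentence inside a character budget.
SPECIAL_TOKENS = ["<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>"]

def postprocess(text: str, max_chars: int = 2000) -> str:
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    text = text.strip()
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Cut at the last sentence-ending punctuation to preserve integrity
    last = None
    for m in re.finditer(r"[.!?](?=\s|$)", clipped):
        last = m
    return clipped[:last.end()] if last else clipped
```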
## Model Specifications
Llama-3.2-3B-Instruct Details:
- Parameters: 3.21 billion
- Architecture: Optimized transformer with Grouped-Query Attention (GQA)
- Training Data: 9 trillion tokens (December 2023 cutoff)
- Context Length: 128,000 tokens
- Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
- Quantization: 4-bit NF4 (~1GB VRAM)
- Training Method: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF
Why Single Model Architecture:
- ✅ Consistency: Same reasoning style across all agents
- ✅ Memory Efficient: One model, shared instance (~1GB total)
- ✅ Instruction-Tuned: Optimized for educational dialogue
- ✅ Fast Inference: 3B parameters = quick responses
- ✅ ZeroGPU Friendly: Small enough for dynamic allocation
- ✅ 128K Context: Can handle long educational conversations
## Prompt Engineering Techniques Demonstrated
### 1. Hierarchical Prompt Architecture
Three-Layer System:
- Agent System Prompts: Specialized instructions for each agent type
- Response Prompt Segments: Modular components dynamically assembled
- Thinking Prompts: Preprocessing templates for reasoning generation
Innovation: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.
### 2. Per-Turn Prompt State Management
PromptStateManager:
```python
# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts → False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING",
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]
```
Benefits:
- No prompt pollution between turns
- Context-appropriate responses every time
- Traceable decision-making for debugging
### 3. Logical Expression System
Regex-Based Automatic Activation:
```python
# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
    prompt_state.update("LATEX_FORMATTING", True)
```
Hybrid Approach: Combines rule-based triggers with LLM decision-making for optimal reliability.
### 4. Constraint-Based Agent Prompting
Tool Decision Example:
```
System Prompt: Analyze query and determine if visualization needed.
Output Format: YES or NO (nothing else)
INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data
```
Result: Reliable, parseable outputs from agents without complex post-processing.
### 5. Chain-of-Thought & Tree-of-Thought Preprocessing
CoT for Sequential Reasoning:
```
Step 1: Assess topic →
Step 2: Identify user understanding →
Step 3: Previous actions →
Step 4: Reference facts
```
ToT for Mathematical Reasoning:
```
Question Type Assessment →
  Branch 1A (Simple): Minimal steps
  Branch 1B (Complex): Full derivation with principles
```
Innovation: Thinking agents generate rich context that guides ResponseAgent to higher-quality outputs.
### 6. Academic Integrity by Design
Embedded in Core Prompts:
- "Do not provide full solutions - guide through processes instead"
- "Break problems into conceptual components"
- "Ask clarifying questions about their understanding"
- Subject-specific guidelines (Math: explain concepts, not compute)
Follow-up Grading:
- Agent 3 detects practice question responses
- `PRACTICE_QUESTION_FOLLOWUP` prompt activates
- Automated assessment with constructive feedback
### 7. Multi-Modal Response Generation
Tool Integration:
```python
# Tool decision → JSON generation → matplotlib rendering → base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155, ...},
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time",
)
```
Result: In-memory graph generation with educational context, embedded directly in response.
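The in-memory rendering and encoding step could look like the sketch below. The function name and fallback behavior are illustrative assumptions; only the overall pipeline (matplotlib → PNG buffer → base64) comes from the description above.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display required
import matplotlib.pyplot as plt

# Hypothetical sketch: render the chart into an in-memory PNG and
# base64-encode it for direct embedding in the response.
def render_graph_base64(data: dict, plot_type: str, title: str) -> str:
    fig, ax = plt.subplots(figsize=(6, 4))
    if plot_type == "line":
        ax.plot(list(data.keys()), list(data.values()), marker="o")
    else:  # minimal fallback for other plot types
        ax.bar(list(data.keys()), list(data.values()))
    ax.set_title(title)
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)  # free the figure to avoid leaking memory per request
    return base64.b64encode(buf.getvalue()).decode("ascii")
```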
## State Management & Persistence
### GlobalStateManager Architecture
Dual-Layer Persistence:
- SQLite Database: Fast local access, immediate writes
- HuggingFace Dataset: Cloud backup, hourly sync
State Categories:
- Conversation State: Full chat history + agent context
- Prompt State: Per-turn activation (resets each interaction)
- Analytics State: Metrics, dashboard data, export history
- Evaluation State: Quality scores, classifier accuracy, user feedback
- ML Model Cache: Loaded model for reuse across sessions
Thread Safety: All state operations protected by threading.Lock()
Cleanup Strategy:
- Automatic cleanup every 60 minutes
- Remove sessions older than 24 hours
- Prevents memory leaks in long-running deployments
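A minimal sketch of the lock-guarded session store with the age-based cleanup described above; the class and method names are hypothetical, only the 24-hour window and `threading.Lock()` usage come from the text:

```python
import threading
import time

# Hypothetical sketch: thread-safe session state with stale-session cleanup.
class SessionStore:
    MAX_AGE_S = 24 * 60 * 60  # remove sessions older than 24 hours

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}  # session_id -> (last_seen_ts, state_dict)

    def touch(self, session_id: str, state: dict, now: float = None) -> None:
        ts = time.time() if now is None else now
        with self._lock:
            self._sessions[session_id] = (ts, state)

    def cleanup(self, now: float = None) -> int:
        """Drop stale sessions; returns how many were removed."""
        now = time.time() if now is None else now
        with self._lock:
            stale = [sid for sid, (ts, _) in self._sessions.items()
                     if now - ts > self.MAX_AGE_S]
            for sid in stale:
                del self._sessions[sid]
            return len(stale)
```

In the real system a background timer would call `cleanup()` every 60 minutes; holding the lock for both reads and writes is what makes the store safe across Gradio's worker threads.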
## Model Loading & Optimization Strategy
### Two-Stage Lazy Loading Pipeline
#### Stage 1: Build Time (Docker) - Optional Pre-caching
```yaml
# preload_from_hub in README.md
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct
```
- Downloads model weights during Docker build
- Cached in HuggingFace hub cache directory
- Reduces first-request latency (no download needed)
- Optional but recommended for production deployments
#### Stage 2: Runtime (Lazy Loading with Automatic Caching)
```python
# model_manager.py - LazyLlamaModel class
@spaces.GPU(duration=120)  # GPU allocated for 120 seconds during first load
def _load_model(self):
    """Load on first generate() call."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls; all agents share this
    # single instance, so the GPU allocation is not repeated afterwards.
```
Loading Flow:
```
App starts → instant startup (no model loading)
    ↓
First user request → triggers model load (~30-60s)
    ├─ Download from cache (if preloaded: instant)
    ├─ Load with 4-bit quantization
    ├─ Create pipeline
    └─ Cache in memory
    ↓
All subsequent requests → use cached model (~1s)
```
Memory Optimization:
- 4-bit NF4 Quantization: 75% memory reduction
- Llama-3.2-3B: ~6GB → ~1GB VRAM
- Shared Model Strategy: ALL agents share one model instance
- Singleton Pattern: Thread-safe model caching
- Device Mapping: Automatic distribution with ZeroGPU
- 128K Context: Long conversations without truncation
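The `quantization_config` referenced in the loading snippet above would typically be built like this with `transformers`. This is a sketch of a standard NF4 setup; the project's exact flags may differ:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization config (sketch), computing in BF16 as noted above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # mixed-precision BF16 inference
    bnb_4bit_use_double_quant=True,         # also quantize the quant constants
)
```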
ZeroGPU Integration:
```python
@spaces.GPU(duration=120)  # Dynamic allocation for first load
def _load_model(self):
    # GPU available for 120 seconds
    # Loads model once on first request
    # Cached instance reused across all agents
    # Automatic GPU management by ZeroGPU
    ...
```
Performance Characteristics:
- First Request: 30-60 seconds (one-time model load)
  - With `preload_from_hub`: 30-40s (just quantization)
  - Without preload: 50-60s (download + quantization)
- Subsequent Requests: <1 second per agent
- Memory Footprint: ~1GB VRAM (persistent)
- Cold Start: Instant app startup (model loads on demand)
Why Lazy Loading?
- ✅ Instant Startup: App launches immediately
- ✅ ZeroGPU Optimal: Perfect for dynamic GPU allocation
- ✅ Memory Efficient: Only loads when needed
- ✅ Cache Persistent: Stays loaded between requests
- ✅ Serverless Friendly: Ideal for HuggingFace Spaces
## Analytics & Evaluation System
### Built-In Dashboard
Real-Time Metrics:
- Total conversations
- Average response time
- Success rate (quality score >3.5)
- Educational quality scores (ML-evaluated)
- Classifier accuracy rates
- Active sessions count
LightEval Integration:
- BERTScore for semantic quality
- ROUGE for response completeness
- Custom educational quality indicators:
- Has examples
- Structured explanation
- Appropriate length
- Encourages learning
- Uses LaTeX (for math)
- Clear sections
Exportable Data:
- JSON export with full metrics
- CSV export of interaction history
- Programmatic access via API
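The two export paths could be sketched as below; the field names and helper functions are illustrative examples, not the dashboard's actual API:

```python
import csv
import io
import json

# Hypothetical sketch: metrics as JSON, interaction history as CSV.
def export_json(metrics: dict) -> str:
    return json.dumps(metrics, indent=2, default=str)

def export_csv(interactions: list) -> str:
    """interactions: list of dicts sharing the same keys."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(interactions[0].keys()))
    writer.writeheader()
    writer.writerows(interactions)
    return buf.getvalue()
```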
## Performance Benchmarks
Runtime Performance:
- Inference Speed: 25-40 tokens/second (with ZeroGPU)
- Memory Usage: ~1GB VRAM (4-bit quantization)
- Context Window: 128K tokens
- First Request: ~30-60 seconds (one-time load)
- Warm Inference: <1 second per agent
- Startup Time: Instant (lazy loading)
Llama 3.2 Quality Scores:
- MMLU: 63.4 (competitive with larger models)
- GSM8K (Math): 73.9
- HumanEval (Coding): 59.3
- Multilingual: 8 languages supported
- Safety: RLHF-aligned for educational use