---
title: Mimir
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---

# Mimir: Educational AI Assistant

## Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project

### Project Overview

Mimir demonstrates enterprise-grade AI system design through a multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management.

Unlike simple single-model implementations, Mimir employs **four specialized agent types** working in concert: a tool decision engine, four routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, tailoring each response to the specific educational interaction.
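The four-stage flow can be sketched in plain Python. This is a minimal sketch with agent internals stubbed out; names like `TurnContext` and `run_turn` are illustrative, not the project's actual module API:

```python
from dataclasses import dataclass, field

@dataclass
class TurnContext:
    """Accumulates decisions and reasoning as a turn moves through the pipeline."""
    user_input: str
    history: list = field(default_factory=list)
    active_prompts: set = field(default_factory=set)
    thinking_notes: list = field(default_factory=list)

def run_turn(ctx, tool_agent, routing_agents, thinking_agents, response_agent):
    # Stage 1: binary tool decision (YES/NO) gates the tool-use prompt segment.
    if tool_agent(ctx):
        ctx.active_prompts.add("TOOL_USE_ENHANCEMENT")
    # Stage 2: each routing agent proposes prompt segments to activate.
    for agent in routing_agents:
        ctx.active_prompts |= agent(ctx)
    # Stage 3: thinking agents add hidden reasoning context (CoT/ToT),
    # visible to the response agent but never shown to the user.
    for agent in thinking_agents:
        note = agent(ctx)
        if note:
            ctx.thinking_notes.append(note)
    # Stage 4: a single response agent assembles active prompts plus notes.
    return response_agent(ctx)
```

In the real system each callable wraps an LLM call to the shared Llama-3.2-3B-Instruct instance; here stubs suffice to show how decisions accumulate on the turn context.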
***

### Technical Architecture

**Multi-Agent System:**

```
User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (3x) → Response Agent → Output
                    ↓                       ↓                      ↓                   ↓
              Llama-3.2-3B          Llama-3.2-3B (shared)     Llama-3.2-3B       Llama-3.2-3B
```

**Core Technologies:**

* **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks: decision-making, reasoning, and response generation
* **Lazy Loading Strategy**: Model loads on first request and is cached for subsequent calls (optimal for ZeroGPU)
* **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
* **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
* **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
* **Gradio**: Multi-page interface (Chatbot + Analytics Dashboard)
* **Python**: Backend with 4-bit quantization and streaming

**Key Frameworks & Libraries:**

* `transformers` & `accelerate` for model loading and inference optimization
* `bitsandbytes` for 4-bit NF4 quantization (~75% memory reduction)
* `peft` for Parameter-Efficient Fine-Tuning support
* `spaces` for HuggingFace ZeroGPU integration
* `matplotlib` for dynamic visualization generation
* Custom state management system with SQLite and dataset backup

***

### Advanced Agent Architecture

#### Agent Pipeline Overview

The system processes each user interaction through a four-stage pipeline, with each stage making decisions that shape the final response.
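The lazy-loading strategy can be sketched as a thread-safe singleton. In this sketch, `load_fn` stands in for the actual `from_pretrained` call, and the class name is illustrative rather than the project's real API:

```python
import threading

class LazyModel:
    """Load an expensive resource on first use, then reuse the cached instance."""

    def __init__(self, load_fn):
        self._load_fn = load_fn      # e.g. would wrap AutoModelForCausalLM.from_pretrained
        self._model = None
        self._lock = threading.Lock()
        self.load_count = 0          # instrumentation: shows the load happens only once

    def get(self):
        # Double-checked locking: cheap fast path once the model is cached,
        # and concurrent first callers cannot trigger duplicate loads.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._load_fn()
                    self.load_count += 1
        return self._model
```

The same pattern lets every agent in the pipeline call `get()` freely: only the first call pays the load cost, which is what makes the design a good fit for ZeroGPU's on-demand allocation.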
#### Stage 1: Tool Decision Agent

**Purpose**: Determines whether visualization tools would enhance learning
**Model**: Llama-3.2-3B-Instruct (4-bit NF4 quantized)

**Prompt Engineering**:

* Highly constrained binary decision prompt (YES/NO only)
* Explicit INCLUDE/EXCLUDE criteria for educational contexts
* Zero-shot classification with educational domain knowledge

**Decision Criteria**:

```
INCLUDE: Mathematical functions, data analysis, chart interpretation, trend visualization, proportional relationships
EXCLUDE: Greetings, definitions, explanations without data
```

**Output**: Boolean flag activating the `TOOL_USE_ENHANCEMENT` prompt segment

---

#### Stage 2: Prompt Routing Agents (4 Specialized Agents)

**Purpose**: Intelligent prompt segment selection through coordinated analysis
**Model**: Shared Llama-3.2-3B-Instruct instance (memory efficient)

**Agent Specializations**:

1. **Agent 1 - Practice Question Detector**
   - Analyzes conversation context for practice question opportunities
   - Considers the user's expressed understanding and learning progression
   - Activates: `STRUCTURE_PRACTICE_QUESTIONS`
2. **Agent 2 - Discovery Mode Classifier**
   - Dual classification: vague input detection + understanding assessment
   - Returns: `VAUGE_INPUT`, `USER_UNDERSTANDING`, or neither
   - Enables guided discovery and clarification strategies
3. **Agent 3 - Follow-up Assessment Agent**
   - Detects whether the user is responding to previous practice questions
   - Analyzes conversation history for grading opportunities
   - Activates: `PRACTICE_QUESTION_FOLLOWUP` (triggers grading mode)
4. **Agent 4 - Teaching Mode Assessor**
   - Evaluates the need for direct instruction vs. structured practice
   - Multi-output agent (can activate multiple prompts)
   - Activates: `GUIDING_TEACHING`, `STRUCTURE_PRACTICE_QUESTIONS`

**Prompt Engineering Innovation**:

* Each agent uses a specialized system prompt with clear decision criteria
* Structured output formats for reliable parsing
* Context-aware analysis incorporating full conversation history
* Sequential execution prevents decision conflicts

---

#### Stage 3: Thinking Agents (Preprocessing Layer)

**Purpose**: Generate reasoning context before the final response (CoT/ToT)
**Model**: Llama-3.2-3B-Instruct (shared instance)

**Agent Specializations**:

1. **Math Thinking Agent**
   - **Method**: Tree-of-Thought reasoning for mathematical problems
   - **Activation**: When `LATEX_FORMATTING` is active
   - **Output Structure**: Key Terms → Principles → Formulas → Step-by-Step Solution → Summary
   - **Complexity Routing**: Decision tree determines detail level (1A: basic, 1B: complex)
2. **Question/Answer Design Agent**
   - **Method**: Chain-of-Thought for practice question formulation
   - **Activation**: When `STRUCTURE_PRACTICE_QUESTIONS` is active
   - **Formatted Inputs**: Tool context, LaTeX guidelines, practice question templates
   - **Output**: Question design, data formatting, answer bank generation
3. **Reasoning Thinking Agent**
   - **Method**: General Chain-of-Thought preprocessing
   - **Activation**: When tools, follow-ups, or teaching mode are active
   - **Output Structure**: User Knowledge Summary → Understanding Analysis → Previous Actions → Reference Fact Sheet

**Prompt Engineering Innovation**:

* Thinking agents produce **context for the ResponseAgent**, not final output
* Outputs are invisible to the user but inform response quality
* Tree-of-Thought (ToT) for math: explores multiple solution paths
* Chain-of-Thought (CoT) for others: step-by-step reasoning traces

---

#### Stage 4: Response Agent (Educational Response Generation)

**Purpose**: Generate the pedagogically sound final response
**Model**: Llama-3.2-3B-Instruct (same shared instance)

**Configuration**:

* 4-bit NF4 quantization (BitsAndBytes)
* Mixed-precision BF16 inference
* Accelerate integration for distributed computation
* 128K context window
* Multilingual support (8 languages)

**Prompt Assembly Process**:

1. **Core Identity**: Always included (defines the Mimir persona)
2. **Logical Expressions**: Regex-triggered prompts (e.g., math keywords → `LATEX_FORMATTING`)
3. **Agent-Selected Prompts**: Dynamic assembly based on routing agent decisions
4. **Context Integration**: Tool outputs, thinking agent outputs, conversation history
5.
**Complete Prompt**: All segments joined with proper formatting

**Dynamic Prompt Library** (11 segments):

```
Core:       CORE_IDENTITY (always)
Formatting: GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery:  VAUGE_INPUT, USER_UNDERSTANDING
Teaching:   GUIDING_TEACHING
Practice:   STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool:       TOOL_USE_ENHANCEMENT
```

**Response Post-Processing**:

* Artifact cleanup (removal of special tokens)
* Intelligent truncation at logical breakpoints
* Sentence integrity preservation
* Quality validation gates
* Word-by-word streaming for UX

---

### Model Specifications

**Llama-3.2-3B-Instruct Details:**

* **Parameters**: 3.21 billion
* **Architecture**: Optimized transformer with Grouped-Query Attention (GQA)
* **Training Data**: 9 trillion tokens (December 2023 cutoff)
* **Context Length**: 128,000 tokens
* **Languages**: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
* **Quantization**: 4-bit NF4 (~1GB VRAM)
* **Training Method**: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF

**Why a Single-Model Architecture:**

* ✅ **Consistency**: Same reasoning style across all agents
* ✅ **Memory Efficient**: One model, shared instance (~1GB total)
* ✅ **Instruction-Tuned**: Optimized for educational dialogue
* ✅ **Fast Inference**: 3B parameters = quick responses
* ✅ **ZeroGPU Friendly**: Small enough for dynamic allocation
* ✅ **128K Context**: Can handle long educational conversations

---

### Prompt Engineering Techniques Demonstrated

#### 1. Hierarchical Prompt Architecture

**Three-Layer System**:

- **Agent System Prompts**: Specialized instructions for each agent type
- **Response Prompt Segments**: Modular components dynamically assembled
- **Thinking Prompts**: Preprocessing templates for reasoning generation

**Innovation**: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.

#### 2. Per-Turn Prompt State Management

**PromptStateManager**:

```python
# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts → False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only the active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING",
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]
```

**Benefits**:

- No prompt pollution between turns
- Context-appropriate responses every time
- Traceable decision-making for debugging

#### 3. Logical Expression System

**Regex-Based Automatic Activation**:

```python
# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
    prompt_state.update("LATEX_FORMATTING", True)
```

**Hybrid Approach**: Combines rule-based triggers with LLM decision-making for optimal reliability.

#### 4. Constraint-Based Agent Prompting

**Tool Decision Example**:

```
System Prompt: Analyze query and determine if visualization needed.
Output Format: YES or NO (nothing else)
INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data
```

**Result**: Reliable, parseable outputs from agents without complex post-processing.

#### 5. Chain-of-Thought & Tree-of-Thought Preprocessing

**CoT for Sequential Reasoning**:

```
Step 1: Assess topic → Step 2: Identify user understanding →
Step 3: Previous actions → Step 4: Reference facts
```

**ToT for Mathematical Reasoning**:

```
Question Type Assessment →
  Branch 1A (Simple): Minimal steps
  Branch 1B (Complex): Full derivation with principles
```

**Innovation**: Thinking agents generate rich context that guides the ResponseAgent to higher-quality outputs.

#### 6. Academic Integrity by Design

**Embedded in Core Prompts**:

* "Do not provide full solutions - guide through processes instead"
* "Break problems into conceptual components"
* "Ask clarifying questions about their understanding"
* Subject-specific guidelines (Math: explain concepts, not compute)

**Follow-up Grading**:

* Agent 3 detects practice question responses
* The `PRACTICE_QUESTION_FOLLOWUP` prompt activates
* Automated assessment with constructive feedback

#### 7. Multi-Modal Response Generation

**Tool Integration**:

```python
# Tool decision → JSON generation → matplotlib rendering → base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155, ...},
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time",
)
```

**Result**: In-memory graph generation with educational context, embedded directly in the response.

---

### State Management & Persistence

#### GlobalStateManager Architecture

**Dual-Layer Persistence**:

1. **SQLite Database**: Fast local access, immediate writes
2.
**HuggingFace Dataset**: Cloud backup, hourly sync

**State Categories**:

* **Conversation State**: Full chat history + agent context
* **Prompt State**: Per-turn activation (resets each interaction)
* **Analytics State**: Metrics, dashboard data, export history
* **Evaluation State**: Quality scores, classifier accuracy, user feedback
* **ML Model Cache**: Loaded model reused across sessions

**Thread Safety**: All state operations protected by `threading.Lock()`

**Cleanup Strategy**:

- Automatic cleanup every 60 minutes
- Removes sessions older than 24 hours
- Prevents memory leaks in long-running deployments

---

### Model Loading & Optimization Strategy

#### Two-Stage Lazy Loading Pipeline

**Stage 1: Build Time (Docker) - Optional Pre-caching**

```yaml
# preload_from_hub in README.md
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct
```

* Downloads model weights during the Docker build
* Cached in the HuggingFace hub cache directory
* Reduces first-request latency (no download needed)
* **Optional but recommended** for production deployments

**Stage 2: Runtime (Lazy Loading with Automatic Caching)**

```python
# model_manager.py - LazyLlamaModel class
@spaces.GPU(duration=120)  # GPU allocated for up to 120s during the first load
def _load_model(self):
    """Load on first generate() call."""
    if self.model is not None:
        return  # Already loaded - reuse the cached instance

    # First call: load with 4-bit quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls;
    # all agents share this single instance
```

**Loading Flow**:

```
App starts → Instant startup (no model loading)
    ↓
First user request → Triggers model load (~30-60s)
  ├─ Download from cache (if preloaded: instant)
  ├─ Load with 4-bit quantization
  ├─ Create pipeline
  └─ Cache in memory
    ↓
All subsequent requests → Use cached model (~1s)
```

**Memory Optimization**:

- **4-bit NF4 Quantization**: ~75% memory reduction (Llama-3.2-3B: ~6GB → ~1GB VRAM)
- **Shared Model Strategy**: ALL agents share one model instance
- **Singleton Pattern**: Thread-safe model caching
- **Device Mapping**: Automatic distribution with ZeroGPU
- **128K Context**: Long conversations without truncation

**ZeroGPU Integration**: The `@spaces.GPU(duration=120)` decorator requests dynamic GPU allocation only for the first load; the cached instance is then reused across all agents without re-allocation, with GPU management handled automatically by ZeroGPU.

**Performance Characteristics**:

* **First Request**: 30-60 seconds (one-time model load)
  - With `preload_from_hub`: 30-40s (just quantization)
  - Without preload: 50-60s (download + quantization)
* **Subsequent Requests**: <1 second per agent
* **Memory Footprint**: ~1GB VRAM (persistent)
* **Cold Start**: Instant app startup (model loads on demand)

**Why Lazy Loading?**

* ✅ **Instant Startup**: App launches immediately
* ✅ **ZeroGPU Optimal**: Well suited to dynamic GPU allocation
* ✅ **Memory Efficient**: Only loads when needed
* ✅ **Cache Persistent**: Stays loaded between requests
* ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces

---

### Analytics & Evaluation System

#### Built-In Dashboard

**Real-Time Metrics**:

* Total conversations
* Average response time
* Success rate (quality score >3.5)
* Educational quality scores (ML-evaluated)
* Classifier accuracy rates
* Active session count

**LightEval Integration**:

* BERTScore for semantic quality
* ROUGE for response completeness
* Custom educational quality indicators:
  - Has examples
  - Structured explanation
  - Appropriate length
  - Encourages learning
  - Uses LaTeX (for math)
  - Clear sections

**Exportable Data**:

* JSON export with full metrics
* CSV export of interaction history
* Programmatic access via API

---

### Performance Benchmarks

**Runtime Performance:**

* **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
* **Memory Usage**: ~1GB VRAM (4-bit quantization)
* **Context Window**: 128K tokens
* **First Request**: ~30-60 seconds (one-time load)
* **Warm Inference**: <1 second per agent
* **Startup Time**: Instant (lazy loading)

**Llama 3.2 Quality Scores:**

* MMLU: 63.4 (competitive with larger models)
* GSM8K (Math): 73.9
* HumanEval (Coding): 59.3
* Multilingual: 8 languages supported
* Safety: RLHF-aligned for educational use
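As a closing illustration, the "intelligent truncation at logical breakpoints" step from the Response Agent's post-processing could be sketched like this. This is a hypothetical character-based version; the actual implementation may operate on tokens and use different limits:

```python
import re

def truncate_at_breakpoint(text: str, max_chars: int = 400) -> str:
    """Truncate to the last complete sentence within max_chars.

    Sketch of sentence-integrity-preserving truncation; the function
    name and character budget are illustrative.
    """
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Find the last sentence-ending punctuation mark in the clipped text.
    match = None
    for match in re.finditer(r'[.!?](?=\s|$)', clipped):
        pass
    if match:
        return clipped[:match.end()].rstrip()
    # Fall back to the last word boundary to avoid cutting a word in half.
    return clipped.rsplit(None, 1)[0]
```

Cutting at the last full sentence rather than a hard character limit is what keeps streamed responses from ending mid-thought.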