---
title: Mimir
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---
# Mimir: Educational AI Assistant
## Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project
### Project Overview
Mimir demonstrates enterprise-grade AI system design through a sophisticated multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Unlike simple single-pass implementations, Mimir employs **four specialized agent types** working in concert: a tool decision engine, four routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring responses are tailored to each unique educational interaction.
***
### Technical Architecture
**Multi-Agent System:**
| ``` | |
User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (3x) → Response Agent → Output
                  ↓                        ↓                       ↓                     ↓
            Llama-3.2-3B        Llama-3.2-3B (shared)        Llama-3.2-3B          Llama-3.2-3B
| ``` | |
**Core Technologies:**
* **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
* **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
* **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
* **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
* **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
* **Gradio**: Multi-page interface (Chatbot + Analytics Dashboard)
* **Python**: Advanced backend with 4-bit quantization and streaming

**Key Frameworks & Libraries:**
* `transformers` & `accelerate` for model loading and inference optimization
* `bitsandbytes` for 4-bit NF4 quantization (75% memory reduction)
* `peft` for Parameter-Efficient Fine-Tuning support
* `spaces` for HuggingFace ZeroGPU integration
* `matplotlib` for dynamic visualization generation
* Custom state management system with SQLite and dataset backup
***
### Advanced Agent Architecture
#### Agent Pipeline Overview
The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.
#### Stage 1: Tool Decision Agent
**Purpose**: Determines if visualization tools enhance learning
**Model**: Llama-3.2-3B-Instruct (4-bit NF4 quantized)
**Prompt Engineering**:
* Highly constrained binary decision prompt (YES/NO only)
* Explicit INCLUDE/EXCLUDE criteria for educational contexts
* Zero-shot classification with educational domain knowledge

**Decision Criteria**:
| ``` | |
| INCLUDE: Mathematical functions, data analysis, chart interpretation, | |
| trend visualization, proportional relationships | |
| EXCLUDE: Greetings, definitions, explanations without data | |
| ``` | |
**Output**: Boolean flag activating `TOOL_USE_ENHANCEMENT` prompt segment
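The binary contract makes downstream parsing trivial. A minimal sketch of that parsing step (the function name and the lenient matching are assumptions, not the project's actual code):

```python
def parse_tool_decision(raw_reply: str) -> bool:
    """Map the agent's constrained YES/NO reply onto the boolean tool flag."""
    answer = raw_reply.strip().upper()
    # Tolerate trailing punctuation or whitespace the model may still emit
    return answer.startswith("YES")
```

Because the agent prompt restricts output to a single token, no complex post-processing is needed here.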
| --- | |
#### Stage 2: Prompt Routing Agents (4 Specialized Agents)
**Purpose**: Intelligent prompt segment selection through coordinated analysis
**Model**: Shared Llama-3.2-3B-Instruct instance (memory efficient)
**Agent Specializations**:
1. **Agent 1 - Practice Question Detector**
   - Analyzes conversation context for practice question opportunities
   - Considers the user's expressed understanding and learning progression
   - Activates: `STRUCTURE_PRACTICE_QUESTIONS`
2. **Agent 2 - Discovery Mode Classifier**
   - Dual classification: vague input detection + understanding assessment
   - Returns: `VAUGE_INPUT`, `USER_UNDERSTANDING`, or neither
   - Enables guided discovery and clarification strategies
3. **Agent 3 - Follow-up Assessment Agent**
   - Detects if the user is responding to previous practice questions
   - Analyzes conversation history for grading opportunities
   - Activates: `PRACTICE_QUESTION_FOLLOWUP` (triggers grading mode)
4. **Agent 4 - Teaching Mode Assessor**
   - Evaluates the need for direct instruction vs. structured practice
   - Multi-output agent (can activate multiple prompts)
   - Activates: `GUIDING_TEACHING`, `STRUCTURE_PRACTICE_QUESTIONS`

**Prompt Engineering Innovation**:
* Each agent uses a specialized system prompt with clear decision criteria
* Structured output formats for reliable parsing
* Context-aware analysis incorporating full conversation history
* Sequential execution prevents decision conflicts
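A sketch of that sequential pass, with a simple keyword heuristic standing in for the actual LLM-backed router (all names here are illustrative, not taken from the codebase):

```python
def practice_question_detector(user_input, history):
    """Stand-in for Agent 1: the real router prompts the shared LLM."""
    if "practice" in user_input.lower():
        return ["STRUCTURE_PRACTICE_QUESTIONS"]
    return []

def run_routing_agents(user_input, history, agents):
    """Run routers one after another so activations occur in a fixed order."""
    active = []
    for agent in agents:
        for segment in agent(user_input, history):
            if segment not in active:  # avoid duplicate activations
                active.append(segment)
    return active
```

Running the routers in a fixed order (rather than truly in parallel) is what prevents two agents from issuing conflicting activations for the same turn.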
| --- | |
#### Stage 3: Thinking Agents (Preprocessing Layer)
**Purpose**: Generate reasoning context before the final response (CoT/ToT)
**Model**: Llama-3.2-3B-Instruct (shared instance)
**Agent Specializations**:
1. **Math Thinking Agent**
   - **Method**: Tree-of-Thought reasoning for mathematical problems
   - **Activation**: When `LATEX_FORMATTING` is active
   - **Output Structure**:
| ``` | |
Key Terms → Principles → Formulas → Step-by-Step Solution → Summary
| ``` | |
   - **Complexity Routing**: Decision tree determines detail level (1A: basic, 1B: complex)
2. **Question/Answer Design Agent**
   - **Method**: Chain-of-Thought for practice question formulation
   - **Activation**: When `STRUCTURE_PRACTICE_QUESTIONS` is active
   - **Formatted Inputs**: Tool context, LaTeX guidelines, practice question templates
   - **Output**: Question design, data formatting, answer bank generation
3. **Reasoning Thinking Agent**
   - **Method**: General Chain-of-Thought preprocessing
   - **Activation**: When tools, follow-ups, or teaching mode are active
   - **Output Structure**:
| ``` | |
User Knowledge Summary → Understanding Analysis →
Previous Actions → Reference Fact Sheet
| ``` | |
**Prompt Engineering Innovation**:
* Thinking agents produce **context for the ResponseAgent**, not final output
* Outputs are invisible to the user but inform response quality
* Tree-of-Thought (ToT) for math: explores multiple solution paths
* Chain-of-Thought (CoT) for others: step-by-step reasoning traces
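Because the traces are hidden context rather than user-facing text, they can simply be concatenated into the ResponseAgent's input. A sketch of that hand-off (the section-label format is an assumption):

```python
def build_thinking_context(traces):
    """Join (agent_name, trace) pairs into one hidden context block
    that is prepended to the ResponseAgent's prompt."""
    return "\n\n".join(f"[{name}]\n{trace}" for name, trace in traces)
```

The labeled sections let the response model attribute each reasoning trace to the agent that produced it.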
| --- | |
#### Stage 4: Response Agent (Educational Response Generation)
**Purpose**: Generate a pedagogically sound final response
**Model**: Llama-3.2-3B-Instruct (same shared instance)
**Configuration**:
* 4-bit NF4 quantization (BitsAndBytes)
* Mixed-precision BF16 inference
* Accelerate integration for distributed computation
* 128K context window
* Multilingual support (8 languages)

**Prompt Assembly Process**:
1. **Core Identity**: Always included (defines the Mimir persona)
2. **Logical Expressions**: Regex-triggered prompts (e.g., math keywords → `LATEX_FORMATTING`)
3. **Agent-Selected Prompts**: Dynamic assembly based on routing agent decisions
4. **Context Integration**: Tool outputs, thinking agent outputs, conversation history
5. **Complete Prompt**: All segments joined with proper formatting

**Dynamic Prompt Library** (11 segments):
| ``` | |
| Core: CORE_IDENTITY (always) | |
| Formatting: GENERAL_FORMATTING (always), LATEX_FORMATTING (math) | |
| Discovery: VAUGE_INPUT, USER_UNDERSTANDING | |
| Teaching: GUIDING_TEACHING | |
| Practice: STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP | |
| Tool: TOOL_USE_ENHANCEMENT | |
| ``` | |
**Response Post-Processing**:
* Artifact cleanup (removal of special tokens)
* Intelligent truncation at logical breakpoints
* Sentence integrity preservation
* Quality validation gates
* Word-by-word streaming for UX
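The truncation step can be sketched as a sentence-boundary search (the character budget and regex here are illustrative, not the project's actual values):

```python
import re

def truncate_at_breakpoint(text: str, max_chars: int = 400) -> str:
    """Cut at the last complete sentence that fits within the budget."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Find every sentence-ending punctuation mark followed by whitespace or EOS
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s|$)", clipped)]
    # Fall back to the hard cut only if no sentence boundary was found
    return clipped[:ends[-1]] if ends else clipped
```

This preserves sentence integrity: the reader never sees a response that stops mid-sentence.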
| --- | |
### Model Specifications
**Llama-3.2-3B-Instruct Details:**
* **Parameters**: 3.21 billion
* **Architecture**: Optimized transformer with Grouped-Query Attention (GQA)
* **Training Data**: 9 trillion tokens (December 2023 cutoff)
* **Context Length**: 128,000 tokens
* **Languages**: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
* **Quantization**: 4-bit NF4 (~1GB VRAM)
* **Training Method**: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF

**Why a Single-Model Architecture:**
* ✅ **Consistency**: Same reasoning style across all agents
* ✅ **Memory Efficient**: One model, shared instance (~1GB total)
* ✅ **Instruction-Tuned**: Optimized for educational dialogue
* ✅ **Fast Inference**: 3B parameters = quick responses
* ✅ **ZeroGPU Friendly**: Small enough for dynamic allocation
* ✅ **128K Context**: Can handle long educational conversations
| --- | |
### Prompt Engineering Techniques Demonstrated
#### 1. Hierarchical Prompt Architecture
**Three-Layer System**:
- **Agent System Prompts**: Specialized instructions for each agent type
- **Response Prompt Segments**: Modular components dynamically assembled
- **Thinking Prompts**: Preprocessing templates for reasoning generation

**Innovation**: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.
#### 2. Per-Turn Prompt State Management
**PromptStateManager**:
| ```python | |
| # Reset at turn start - clean slate | |
prompt_state.reset()  # All 11 prompts → False
| # Agents activate relevant prompts | |
| prompt_state.update("LATEX_FORMATTING", True) | |
| prompt_state.update("GUIDING_TEACHING", True) | |
| # Assemble only active prompts | |
| active_prompts = prompt_state.get_active_response_prompts() | |
| # Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING", | |
| # "LATEX_FORMATTING", "GUIDING_TEACHING"] | |
| ``` | |
**Benefits**:
- No prompt pollution between turns
- Context-appropriate responses every time
- Traceable decision-making for debugging
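A minimal sketch of a manager behind this usage, assuming the always-on/optional split described in the prompt library (the real class likely tracks additional state):

```python
class PromptStateManager:
    # Segments included on every turn vs. those agents must activate
    ALWAYS_ON = ("CORE_IDENTITY", "GENERAL_FORMATTING")
    OPTIONAL = ("LATEX_FORMATTING", "VAUGE_INPUT", "USER_UNDERSTANDING",
                "GUIDING_TEACHING", "STRUCTURE_PRACTICE_QUESTIONS",
                "PRACTICE_QUESTION_FOLLOWUP", "TOOL_USE_ENHANCEMENT")

    def __init__(self):
        self.reset()

    def reset(self):
        # Clean slate each turn: every optional segment starts inactive
        self.flags = {name: False for name in self.OPTIONAL}

    def update(self, name, active):
        if name in self.flags:
            self.flags[name] = active

    def get_active_response_prompts(self):
        # Always-on segments first, then active optional ones in fixed order
        return list(self.ALWAYS_ON) + [n for n in self.OPTIONAL if self.flags[n]]
```

Calling `reset()` at the start of every turn is what prevents prompt pollution: no activation survives into the next interaction.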
#### 3. Logical Expression System
**Regex-Based Automatic Activation**:
| ```python | |
| # Math keyword detection | |
| math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b' | |
| if re.search(math_regex, user_input, re.IGNORECASE): | |
| prompt_state.update("LATEX_FORMATTING", True) | |
| ``` | |
**Hybrid Approach**: Combines rule-based triggers with LLM decision-making for optimal reliability.
#### 4. Constraint-Based Agent Prompting
**Tool Decision Example**:
| ``` | |
| System Prompt: Analyze query and determine if visualization needed. | |
| Output Format: YES or NO (nothing else) | |
| INCLUDE if: mathematical functions, data analysis, trends | |
| EXCLUDE if: greetings, simple definitions, no data | |
| ``` | |
**Result**: Reliable, parseable outputs from agents without complex post-processing.
#### 5. Chain-of-Thought & Tree-of-Thought Preprocessing
**CoT for Sequential Reasoning**:
| ``` | |
Step 1: Assess topic →
Step 2: Identify user understanding →
Step 3: Previous actions →
Step 4: Reference facts
| ``` | |
**ToT for Mathematical Reasoning**:
| ``` | |
Question Type Assessment →
| Branch 1A (Simple): Minimal steps | |
| Branch 1B (Complex): Full derivation with principles | |
| ``` | |
**Innovation**: Thinking agents generate rich context that guides the ResponseAgent to higher-quality outputs.
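The 1A/1B branch selection can be approximated with a toy keyword selector (purely illustrative; in the actual system the thinking agent itself makes this call via the LLM):

```python
def route_math_complexity(question: str) -> str:
    """Toy branch chooser: 1B for multi-step topics, otherwise 1A."""
    # Hypothetical markers for problems needing a full derivation
    complex_markers = ("derivative", "integral", "proof", "limit", "optimize")
    q = question.lower()
    return "1B" if any(m in q for m in complex_markers) else "1A"
```
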
#### 6. Academic Integrity by Design
**Embedded in Core Prompts**:
* "Do not provide full solutions - guide through processes instead"
* "Break problems into conceptual components"
* "Ask clarifying questions about their understanding"
* Subject-specific guidelines (Math: explain concepts rather than just computing answers)

**Follow-up Grading**:
* Agent 3 detects practice question responses
* The `PRACTICE_QUESTION_FOLLOWUP` prompt activates
* Automated assessment with constructive feedback
#### 7. Multi-Modal Response Generation
**Tool Integration**:
| ```python | |
# Tool decision → JSON generation → matplotlib rendering → base64 encoding
| Create_Graph_Tool( | |
| data={"Week 1": 120, "Week 2": 155, ...}, | |
| plot_type="line", | |
| title="Crop Yield Analysis", | |
| educational_context="Visualizes growth trend over time" | |
| ) | |
| ``` | |
**Result**: In-memory graph generation with educational context, embedded directly in the response.
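The final embedding step is ordinary base64 encoding. A sketch of just that step (the markdown data-URI format is an assumption about how the rendered graph is inlined):

```python
import base64

def embed_png(png_bytes: bytes, alt_text: str) -> str:
    """Encode rendered PNG bytes as an inline markdown image."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"![{alt_text}](data:image/png;base64,{encoded})"
```

Because the bytes are encoded in memory, no temporary image files are written to disk.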
| --- | |
### State Management & Persistence
#### GlobalStateManager Architecture
**Dual-Layer Persistence**:
1. **SQLite Database**: Fast local access, immediate writes
2. **HuggingFace Dataset**: Cloud backup, hourly sync

**State Categories**:
```
| - Conversation State: Full chat history + agent context | |
| - Prompt State: Per-turn activation (resets each interaction) | |
| - Analytics State: Metrics, dashboard data, export history | |
| - Evaluation State: Quality scores, classifier accuracy, user feedback | |
| - ML Model Cache: Loaded model for reuse across sessions | |
| ``` | |
**Thread Safety**: All state operations are protected by `threading.Lock()`
**Cleanup Strategy**:
- Automatic cleanup every 60 minutes
- Removes sessions older than 24 hours
- Prevents memory leaks in long-running deployments
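The cleanup loop can be sketched as a timestamped session store (class and method names are illustrative; the injectable clock just makes the eviction testable):

```python
import threading
import time

class SessionStore:
    """Tracks last-seen times and evicts sessions older than 24 hours."""
    MAX_AGE_SECONDS = 24 * 3600

    def __init__(self):
        self._lock = threading.Lock()
        self._last_seen = {}  # session_id -> last activity timestamp

    def touch(self, session_id, now=None):
        with self._lock:
            self._last_seen[session_id] = time.time() if now is None else now

    def cleanup(self, now=None):
        """Remove stale sessions; returns how many were evicted."""
        now = time.time() if now is None else now
        with self._lock:
            stale = [s for s, t in self._last_seen.items()
                     if now - t > self.MAX_AGE_SECONDS]
            for s in stale:
                del self._last_seen[s]
        return len(stale)
```

Running `cleanup()` on a timer (every 60 minutes here) bounds memory growth in a long-running Space.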
| --- | |
### Model Loading & Optimization Strategy
#### Two-Stage Lazy Loading Pipeline
**Stage 1: Build Time (Docker) - Optional Pre-caching**
| ```yaml | |
| # preload_from_hub in README.md | |
| preload_from_hub: | |
| - meta-llama/Llama-3.2-3B-Instruct | |
| ``` | |
* Downloads model weights during the Docker build
* Cached in the HuggingFace hub cache directory
* Reduces first-request latency (no download needed)
* **Optional but recommended** for production deployments

**Stage 2: Runtime (Lazy Loading with Automatic Caching)**
| ```python | |
| # model_manager.py - LazyLlamaModel class | |
@spaces.GPU(duration=120)  # GPU allocated for up to 120s during the first load
def _load_model(self):
    """Load on first generate() call; later calls reuse the cached instance."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # The model stays in memory for all future calls;
    # all agents share this single instance
| ``` | |
**Loading Flow**:
| ``` | |
App starts → Instant startup (no model loading)
    ↓
First user request → Triggers model load (~30-60s)
    ├─ Download from cache (if preloaded: instant)
    ├─ Load with 4-bit quantization
    ├─ Create pipeline
    └─ Cache in memory
    ↓
All subsequent requests → Use cached model (~1s)
| ``` | |
**Memory Optimization**:
- **4-bit NF4 Quantization**: 75% memory reduction
  - Llama-3.2-3B: ~6GB → ~1GB VRAM
- **Shared Model Strategy**: ALL agents share one model instance
- **Singleton Pattern**: Thread-safe model caching
- **Device Mapping**: Automatic distribution with ZeroGPU
- **128K Context**: Long conversations without truncation
**ZeroGPU Integration**:
| ```python | |
| @spaces.GPU(duration=120) # Dynamic allocation for first load | |
def _load_model(self):
    # GPU available for 120 seconds:
    # loads the model once on the first request,
    # then the cached instance is reused across all agents
    # (GPU management handled automatically by ZeroGPU)
    ...
| ``` | |
**Performance Characteristics**:
* **First Request**: 30-60 seconds (one-time model load)
  - With `preload_from_hub`: 30-40s (quantization only)
  - Without preload: 50-60s (download + quantization)
* **Subsequent Requests**: <1 second per agent
* **Memory Footprint**: ~1GB VRAM (persistent)
* **Cold Start**: Instant app startup (model loads on demand)

**Why Lazy Loading?**
* ✅ **Instant Startup**: App launches immediately
* ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
* ✅ **Memory Efficient**: Only loads when needed
* ✅ **Cache Persistent**: Stays loaded between requests
* ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces
| --- | |
### Analytics & Evaluation System
#### Built-In Dashboard
**Real-Time Metrics**:
* Total conversations
* Average response time
* Success rate (quality score >3.5)
* Educational quality scores (ML-evaluated)
* Classifier accuracy rates
* Active session count

**LightEval Integration**:
* BERTScore for semantic quality
* ROUGE for response completeness
* Custom educational quality indicators:
  - Has examples
  - Structured explanation
  - Appropriate length
  - Encourages learning
  - Uses LaTeX (for math)
  - Clear sections

**Exportable Data**:
* JSON export with full metrics
* CSV export of interaction history
* Programmatic access via API
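The CSV export reduces to writing interaction rows under a fixed header. A sketch (the column names are assumptions, not the project's actual schema):

```python
import csv
import io

def export_history_csv(rows):
    """Serialize interaction-history dicts to CSV text."""
    fieldnames = ["timestamp", "user_input", "response", "quality_score"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing to an in-memory buffer keeps the export endpoint stateless: the text can be returned directly as a download.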
| --- | |
### Performance Benchmarks
**Runtime Performance:**
* **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
* **Memory Usage**: ~1GB VRAM (4-bit quantization)
* **Context Window**: 128K tokens
* **First Request**: ~30-60 seconds (one-time load)
* **Warm Inference**: <1 second per agent
* **Startup Time**: Instant (lazy loading)

**Llama 3.2 Quality Scores:**
* MMLU: 63.4 (competitive with larger models)
* GSM8K (Math): 73.9
* HumanEval (Coding): 59.3
* Multilingual: 8 languages supported
* Safety: RLHF-aligned for educational use