---
title: Mimir
emoji: πŸ“š
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---

Mimir: Educational AI Assistant

Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project

Project Overview

Mimir demonstrates enterprise-grade AI system design through a sophisticated multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Unlike simple single-model implementations, Mimir employs four specialized agent types working in concert: a tool decision engine, four routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring each response is tailored to the specific educational interaction.


Technical Architecture

Multi-Agent System:

User Input β†’ Tool Decision Agent β†’ Routing Agents (4x) β†’ Thinking Agents (3x) β†’ Response Agent β†’ Output
                     ↓                    ↓                      ↓                  ↓
              Llama-3.2-3B         Llama-3.2-3B (shared)    Llama-3.2-3B      Llama-3.2-3B
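The four-stage flow above can be sketched as a plain orchestration function. This is an illustrative outline only: `run_pipeline` and the `agents` dictionary are hypothetical names, not the actual Mimir API.

```python
def run_pipeline(user_input, history, agents):
    """Route one user turn through the four agent stages (sketch)."""
    # Stage 1: binary tool decision
    use_tools = agents["tool_decision"](user_input)

    # Stage 2: routing agents select prompt segments;
    # the always-on segments are included by default
    active_prompts = {"CORE_IDENTITY", "GENERAL_FORMATTING"}
    for route in agents["routing"]:
        active_prompts |= route(user_input, history)

    # Stage 3: thinking agents produce hidden reasoning context
    thinking_context = [think(user_input, active_prompts)
                        for think in agents["thinking"]]

    # Stage 4: unified response generation
    return agents["response"](user_input, active_prompts,
                              thinking_context, use_tools)
```

Each stage is an ordinary callable here, which keeps the coordination logic testable independently of the model.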

Core Technologies:

  • Unified Model Architecture: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
  • Lazy Loading Strategy: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
  • Custom Orchestration: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
  • State Management: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
  • ZeroGPU Integration: Dynamic GPU allocation with @spaces.GPU decorators for efficient resource usage
  • Gradio: Multi-page interface (Chatbot + Analytics Dashboard)
  • Python: Advanced backend with 4-bit quantization and streaming

Key Frameworks & Libraries:

  • transformers & accelerate for model loading and inference optimization
  • bitsandbytes for 4-bit NF4 quantization (75% memory reduction)
  • peft for Parameter-Efficient Fine-Tuning support
  • spaces for HuggingFace ZeroGPU integration
  • matplotlib for dynamic visualization generation
  • Custom state management system with SQLite and dataset backup

Advanced Agent Architecture

Agent Pipeline Overview

The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.

Stage 1: Tool Decision Agent

Purpose: Determines if visualization tools enhance learning

Model: Llama-3.2-3B-Instruct (4-bit NF4 quantized)

Prompt Engineering:

  • Highly constrained binary decision prompt (YES/NO only)
  • Explicit INCLUDE/EXCLUDE criteria for educational contexts
  • Zero-shot classification with educational domain knowledge

Decision Criteria:

INCLUDE: Mathematical functions, data analysis, chart interpretation, 
         trend visualization, proportional relationships

EXCLUDE: Greetings, definitions, explanations without data

Output: Boolean flag activating TOOL_USE_ENHANCEMENT prompt segment
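A minimal sketch of how such a constrained binary decision can be issued and parsed. The prompt text paraphrases the criteria above; `generate` is a stand-in for the shared Llama-3.2-3B inference call, and the default-to-NO fallback is an assumption about the implementation.

```python
TOOL_DECISION_PROMPT = """Analyze the student query and decide whether a
visualization would enhance learning.

INCLUDE if: mathematical functions, data analysis, chart interpretation,
trend visualization, proportional relationships.
EXCLUDE if: greetings, definitions, explanations without data.

Answer with exactly one word: YES or NO."""

def decide_tool_use(query, generate):
    """Return True when the model answers YES; anything else means NO."""
    raw = generate(TOOL_DECISION_PROMPT + "\n\nQuery: " + query)
    # Constrained output keeps parsing trivial; noisy replies fall back to NO.
    return raw.strip().upper().startswith("YES")
```

Constraining the output to a single token makes the agent's answer parseable without any post-processing model.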


Stage 2: Prompt Routing Agents (4 Specialized Agents)

Purpose: Intelligent prompt segment selection through four specialized analyses

Model: Shared Llama-3.2-3B-Instruct instance (memory efficient)

Agent Specializations:

  1. Agent 1 - Practice Question Detector

    • Analyzes conversation context for practice question opportunities
    • Considers user's expressed understanding and learning progression
    • Activates: STRUCTURE_PRACTICE_QUESTIONS
  2. Agent 2 - Discovery Mode Classifier

    • Dual-classification: vague input detection + understanding assessment
    • Returns: VAUGE_INPUT, USER_UNDERSTANDING, or neither
    • Enables guided discovery and clarification strategies
  3. Agent 3 - Follow-up Assessment Agent

    • Detects if user is responding to previous practice questions
    • Analyzes conversation history for grading opportunities
    • Activates: PRACTICE_QUESTION_FOLLOWUP (triggers grading mode)
  4. Agent 4 - Teaching Mode Assessor

    • Evaluates need for direct instruction vs. structured practice
    • Multi-output agent (can activate multiple prompts)
    • Activates: GUIDING_TEACHING, STRUCTURE_PRACTICE_QUESTIONS

Prompt Engineering Innovation:

  • Each agent uses a specialized system prompt with clear decision criteria
  • Structured output formats for reliable parsing
  • Context-aware analysis incorporating full conversation history
  • Sequential execution prevents decision conflicts
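The structured-output convention above makes routing replies easy to map onto prompt activations. A hedged sketch of such a parser (the matching logic is an assumption; label names come from the prompt library):

```python
def parse_routing_output(raw, allowed_labels):
    """Map a routing agent's reply onto known prompt-segment labels."""
    activated = set()
    for label in allowed_labels:
        # Only labels from the fixed library can be activated,
        # so a malformed reply can never enable an unknown prompt.
        if label in raw.upper():
            activated.add(label)
    return activated
```

Restricting activation to a closed label set is what makes the agents' free-text replies safe to consume downstream.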

Stage 3: Thinking Agents (Preprocessing Layer)

Purpose: Generate reasoning context before final response (CoT/ToT)

Model: Llama-3.2-3B-Instruct (shared instance)

Agent Specializations:

  1. Math Thinking Agent

    • Method: Tree-of-Thought reasoning for mathematical problems
    • Activation: When LATEX_FORMATTING is active
    • Output Structure:
      Key Terms β†’ Principles β†’ Formulas β†’ Step-by-Step Solution β†’ Summary
      
    • Complexity Routing: Decision tree determines detail level (1A: basic, 1B: complex)
  2. Question/Answer Design Agent

    • Method: Chain-of-Thought for practice question formulation
    • Activation: When STRUCTURE_PRACTICE_QUESTIONS is active
    • Formatted Inputs: Tool context, LaTeX guidelines, practice question templates
    • Output: Question design, data formatting, answer bank generation
  3. Reasoning Thinking Agent

    • Method: General Chain-of-Thought preprocessing
    • Activation: When tools, follow-ups, or teaching mode active
    • Output Structure:
      User Knowledge Summary β†’ Understanding Analysis β†’ 
      Previous Actions β†’ Reference Fact Sheet
      

Prompt Engineering Innovation:

  • Thinking agents produce context for ResponseAgent, not final output
  • Outputs are invisible to user but inform response quality
  • Tree-of-Thought (ToT) for math: explores multiple solution paths
  • Chain-of-Thought (CoT) for others: step-by-step reasoning traces

Stage 4: Response Agent (Educational Response Generation)

Purpose: Generate pedagogically sound final response

Model: Llama-3.2-3B-Instruct (same shared instance)

Configuration:

  • 4-bit NF4 quantization (BitsAndBytes)
  • Mixed precision BF16 inference
  • Accelerate integration for distributed computation
  • 128K context window
  • Multilingual support (8 languages)

Prompt Assembly Process:

  1. Core Identity: Always included (defines Mimir persona)
  2. Logical Expressions: Regex-triggered prompts (e.g., math keywords β†’ LATEX_FORMATTING)
  3. Agent-Selected Prompts: Dynamic assembly based on routing agent decisions
  4. Context Integration: Tool outputs, thinking agent outputs, conversation history
  5. Complete Prompt: All segments joined with proper formatting
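The five assembly steps above can be sketched as a simple join over active segments. The `PROMPT_LIBRARY` contents below are placeholders, not the real prompt text, and the segment ordering is an assumption.

```python
# Placeholder segment texts; the real library holds full prompt segments.
PROMPT_LIBRARY = {
    "CORE_IDENTITY": "You are Mimir, an educational AI assistant.",
    "GENERAL_FORMATTING": "Use clear sections and short paragraphs.",
    "LATEX_FORMATTING": "Render all mathematics in LaTeX.",
    "GUIDING_TEACHING": "Guide the student; do not give full solutions.",
}

def assemble_prompt(active, thinking_context="", tool_output=""):
    """Join active segments, tool output, and thinking context in order."""
    segments = [PROMPT_LIBRARY[name]
                for name in ("CORE_IDENTITY", "GENERAL_FORMATTING",
                             "LATEX_FORMATTING", "GUIDING_TEACHING")
                if name in active]
    if tool_output:
        segments.append("Tool output:\n" + tool_output)
    if thinking_context:
        segments.append("Reasoning context:\n" + thinking_context)
    return "\n\n".join(segments)
```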

Dynamic Prompt Library (11 segments):

Core:          CORE_IDENTITY (always)
Formatting:    GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery:     VAUGE_INPUT, USER_UNDERSTANDING
Teaching:      GUIDING_TEACHING
Practice:      STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool:          TOOL_USE_ENHANCEMENT

Response Post-Processing:

  • Artifact cleanup (remove special tokens)
  • Intelligent truncation at logical breakpoints
  • Sentence integrity preservation
  • Quality validation gates
  • Word-by-word streaming for UX
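A minimal sketch of two of these post-processing steps: artifact cleanup with sentence-boundary truncation, and word-by-word streaming. The token list and regex heuristic are illustrative assumptions, not the production rules.

```python
import re

def clean_response(text, special_tokens=("<|eot_id|>", "<|end_of_text|>")):
    """Strip special tokens and truncate at the last complete sentence."""
    for tok in special_tokens:
        text = text.replace(tok, "")
    text = text.strip()
    # Keep everything up to the last sentence-ending punctuation mark.
    match = re.search(r"^(.*[.!?])", text, re.DOTALL)
    return match.group(1) if match else text

def stream_words(text):
    """Yield progressively longer prefixes, one word at a time."""
    out = ""
    for word in text.split():
        out += word + " "
        yield out.rstrip()
```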

Model Specifications

Llama-3.2-3B-Instruct Details:

  • Parameters: 3.21 billion
  • Architecture: Optimized transformer with Grouped-Query Attention (GQA)
  • Training Data: 9 trillion tokens (December 2023 cutoff)
  • Context Length: 128,000 tokens
  • Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
  • Quantization: 4-bit NF4 (~1GB VRAM)
  • Training Method: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF

Why Single Model Architecture:

  • βœ… Consistency: Same reasoning style across all agents
  • βœ… Memory Efficient: One model, shared instance (~1GB total)
  • βœ… Instruction-Tuned: Optimized for educational dialogue
  • βœ… Fast Inference: 3B parameters = quick responses
  • βœ… ZeroGPU Friendly: Small enough for dynamic allocation
  • βœ… 128K Context: Can handle long educational conversations

Prompt Engineering Techniques Demonstrated

1. Hierarchical Prompt Architecture

Three-Layer System:

  • Agent System Prompts: Specialized instructions for each agent type
  • Response Prompt Segments: Modular components dynamically assembled
  • Thinking Prompts: Preprocessing templates for reasoning generation

Innovation: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.

2. Per-Turn Prompt State Management

PromptStateManager:

# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts β†’ False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING", 
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]

Benefits:

  • No prompt pollution between turns
  • Context-appropriate responses every time
  • Traceable decision-making for debugging
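A minimal `PromptStateManager` consistent with the usage shown above; the real implementation may differ, and the segment lists here are drawn from the prompt library section.

```python
class PromptStateManager:
    """Per-turn prompt activation flags with a clean-slate reset."""

    ALWAYS_ON = ("CORE_IDENTITY", "GENERAL_FORMATTING")
    OPTIONAL = ("LATEX_FORMATTING", "VAUGE_INPUT", "USER_UNDERSTANDING",
                "GUIDING_TEACHING", "STRUCTURE_PRACTICE_QUESTIONS",
                "PRACTICE_QUESTION_FOLLOWUP", "TOOL_USE_ENHANCEMENT")

    def __init__(self):
        self.reset()

    def reset(self):
        # Called at turn start: every optional prompt goes back to False.
        self.flags = {name: False for name in self.OPTIONAL}

    def update(self, name, active):
        if name not in self.flags:
            raise KeyError(f"Unknown prompt segment: {name}")
        self.flags[name] = active

    def get_active_response_prompts(self):
        # Always-on segments first, then actives in library order.
        return list(self.ALWAYS_ON) + [n for n in self.OPTIONAL
                                       if self.flags[n]]
```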

3. Logical Expression System

Regex-Based Automatic Activation:

import re

# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
    prompt_state.update("LATEX_FORMATTING", True)

Hybrid Approach: Combines rule-based triggers with LLM decision-making for optimal reliability.

4. Constraint-Based Agent Prompting

Tool Decision Example:

System Prompt: Analyze query and determine if visualization needed.

Output Format: YES or NO (nothing else)

INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data

Result: Reliable, parseable outputs from agents without complex post-processing.

5. Chain-of-Thought & Tree-of-Thought Preprocessing

CoT for Sequential Reasoning:

Step 1: Assess topic β†’ 
Step 2: Identify user understanding β†’ 
Step 3: Previous actions β†’ 
Step 4: Reference facts

ToT for Mathematical Reasoning:

Question Type Assessment β†’ 
  Branch 1A (Simple): Minimal steps
  Branch 1B (Complex): Full derivation with principles

Innovation: Thinking agents generate rich context that guides ResponseAgent to higher-quality outputs.
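The 1A/1B branch choice can be sketched as a small router. The keyword heuristic below is purely illustrative; the actual system makes this decision with the model.

```python
# Hypothetical complexity markers for the Math Thinking Agent's branches.
COMPLEX_MARKERS = ("derivative", "integral", "proof", "limit", "matrix")

def route_math_branch(question):
    """Return '1A' (minimal steps) or '1B' (full derivation)."""
    q = question.lower()
    return "1B" if any(m in q for m in COMPLEX_MARKERS) else "1A"
```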

6. Academic Integrity by Design

Embedded in Core Prompts:

  • "Do not provide full solutions - guide through processes instead"
  • "Break problems into conceptual components"
  • "Ask clarifying questions about their understanding"
  • Subject-specific guidelines (Math: explain concepts, not compute)

Follow-up Grading:

  • Agent 3 detects practice question responses
  • PRACTICE_QUESTION_FOLLOWUP prompt activates
  • Automated assessment with constructive feedback

7. Multi-Modal Response Generation

Tool Integration:

# Tool decision β†’ JSON generation β†’ matplotlib rendering β†’ base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155, ...},
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time"
)

Result: In-memory graph generation with educational context, embedded directly in response.
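The rendering step of the flow above can be sketched as follows; this assumes matplotlib is installed, and `render_graph_base64` is an illustrative name rather than the actual tool function.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: no display server needed
import matplotlib.pyplot as plt

def render_graph_base64(data, plot_type="line", title=""):
    """Render a dict of label -> value to a base64-encoded PNG in memory."""
    fig, ax = plt.subplots()
    labels, values = list(data.keys()), list(data.values())
    if plot_type == "line":
        ax.plot(labels, values)
    else:
        ax.bar(labels, values)
    ax.set_title(title)
    buf = io.BytesIO()          # no file is written to disk
    fig.savefig(buf, format="png")
    plt.close(fig)              # free the figure to avoid memory growth
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

The base64 string can then be embedded directly in the chat response markup.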


State Management & Persistence

GlobalStateManager Architecture

Dual-Layer Persistence:

  1. SQLite Database: Fast local access, immediate writes
  2. HuggingFace Dataset: Cloud backup, hourly sync

State Categories:

- Conversation State: Full chat history + agent context
- Prompt State: Per-turn activation (resets each interaction)
- Analytics State: Metrics, dashboard data, export history
- Evaluation State: Quality scores, classifier accuracy, user feedback
- ML Model Cache: Loaded model for reuse across sessions

Thread Safety: All state operations protected by threading.Lock()

Cleanup Strategy:

  • Automatic cleanup every 60 minutes
  • Remove sessions older than 24 hours
  • Prevents memory leaks in long-running deployments
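A sketch of the 24-hour cleanup against the SQLite layer, run under a lock. The `sessions` table name and `last_seen` column are assumptions about the schema, not the actual Mimir database layout.

```python
import sqlite3
import threading
import time

_state_lock = threading.Lock()  # all state operations share one lock

def cleanup_sessions(db_path, max_age_hours=24):
    """Delete sessions older than max_age_hours; return rows removed."""
    cutoff = time.time() - max_age_hours * 3600
    with _state_lock:
        conn = sqlite3.connect(db_path)
        try:
            cur = conn.execute(
                "DELETE FROM sessions WHERE last_seen < ?", (cutoff,))
            conn.commit()
            return cur.rowcount
        finally:
            conn.close()
```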

Model Loading & Optimization Strategy

Two-Stage Lazy Loading Pipeline

Stage 1: Build Time (Docker) - Optional Pre-caching

# preload_from_hub in README.md front matter
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct

  • Downloads model weights during the Docker build
  • Cached in the HuggingFace hub cache directory
  • Reduces first-request latency (no download needed)
  • Optional but recommended for production deployments

Stage 2: Runtime (Lazy Loading with Automatic Caching)

# model_manager.py - LazyLlamaModel class (simplified)
# All agents share this single instance.
@spaces.GPU(duration=120)  # GPU allocated for up to 120s during first load
def _load_model(self):
    """Load on first generate() call; later calls reuse the cache."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit NF4 quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls

Loading Flow:

App starts β†’ Instant startup (no model loading)
             ↓
First user request β†’ Triggers model load (~30-60s)
                    β”œβ”€ Download from cache (if preloaded: instant)
                    β”œβ”€ Load with 4-bit quantization
                    β”œβ”€ Create pipeline
                    └─ Cache in memory
                    ↓
All subsequent requests β†’ Use cached model (~1s)

Memory Optimization:

  • 4-bit NF4 Quantization: 75% memory reduction
    • Llama-3.2-3B: ~6GB β†’ ~1GB VRAM
  • Shared Model Strategy: ALL agents share one model instance
  • Singleton Pattern: Thread-safe model caching
  • Device Mapping: Automatic distribution with ZeroGPU
  • 128K Context: Long conversations without truncation

ZeroGPU Integration:

@spaces.GPU(duration=120)  # Dynamic allocation for first load
def _load_model(self):
    # GPU available for 120 seconds
    # Loads model once on first request
    # Cached instance reused across all agents
    # Automatic GPU management by ZeroGPU

Performance Characteristics:

  • First Request: 30-60 seconds (one-time model load)
    • With preload_from_hub: 30-40s (just quantization)
    • Without preload: 50-60s (download + quantization)
  • Subsequent Requests: <1 second per agent
  • Memory Footprint: ~1GB VRAM (persistent)
  • Cold Start: Instant app startup (model loads on demand)

Why Lazy Loading?

  • βœ… Instant Startup: App launches immediately
  • βœ… ZeroGPU Optimal: Perfect for dynamic GPU allocation
  • βœ… Memory Efficient: Only loads when needed
  • βœ… Cache Persistent: Stays loaded between requests
  • βœ… Serverless Friendly: Ideal for HuggingFace Spaces

Analytics & Evaluation System

Built-In Dashboard

Real-Time Metrics:

  • Total conversations
  • Average response time
  • Success rate (quality score >3.5)
  • Educational quality scores (ML-evaluated)
  • Classifier accuracy rates
  • Active sessions count

LightEval Integration:

  • BERTScore for semantic quality
  • ROUGE for response completeness
  • Custom educational quality indicators:
    • Has examples
    • Structured explanation
    • Appropriate length
    • Encourages learning
    • Uses LaTeX (for math)
    • Clear sections
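The rule-based indicators above might be computed along these lines; every threshold and keyword here is an illustrative assumption, and the BERTScore/ROUGE components are omitted.

```python
def quality_indicators(response, is_math=False):
    """Heuristic educational-quality checks on a generated response."""
    lower = response.lower()
    checks = {
        # Does the response ground the explanation in an example?
        "has_examples": "for example" in lower or "e.g." in lower,
        # Multiple line breaks as a proxy for structured sections.
        "structured": response.count("\n") >= 2,
        # Neither a one-liner nor an overwhelming wall of text.
        "appropriate_length": 50 <= len(response.split()) <= 500,
        # A question back to the student encourages active learning.
        "encourages_learning": "?" in response,
    }
    if is_math:
        checks["uses_latex"] = "\\(" in response or "$" in response
    return checks
```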

Exportable Data:

  • JSON export with full metrics
  • CSV export of interaction history
  • Programmatic access via API

Performance Benchmarks

Runtime Performance:

  • Inference Speed: 25-40 tokens/second (with ZeroGPU)
  • Memory Usage: ~1GB VRAM (4-bit quantization)
  • Context Window: 128K tokens
  • First Request: ~30-60 seconds (one-time load)
  • Warm Inference: <1 second per agent
  • Startup Time: Instant (lazy loading)

Llama 3.2 Quality Scores:

  • MMLU: 63.4 (competitive with larger models)
  • GSM8K (Math): 73.9
  • HumanEval (Coding): 59.3
  • Multilingual: 8 languages supported
  • Safety: RLHF-aligned for educational use