---
title: Mimir
emoji: πŸ“š
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---

Mimir: Educational AI Assistant

Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project

Project Overview

Mimir demonstrates enterprise-grade AI system design through a sophisticated multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Unlike simple single-model implementations, Mimir employs four specialized agent types working in concert: a tool decision engine, four routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring each response is tailored to the specific educational interaction.


Technical Architecture

Multi-Agent System:

User Input β†’ Tool Decision Agent β†’ Routing Agents (4x) β†’ Thinking Agents (3x) β†’ Response Agent β†’ Output
                     ↓                    ↓                      ↓                  ↓
              Llama-3.2-3B         Llama-3.2-3B (shared)    Llama-3.2-3B      Llama-3.2-3B
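The four-stage flow above can be sketched as a plain orchestration function. This is an illustrative outline only: `run_pipeline` and the `agents` dictionary are hypothetical names, not the actual Mimir API.

```python
def run_pipeline(user_input, history, agents):
    """Route one user turn through the four agent stages (sketch)."""
    # Stage 1: binary tool decision
    use_tools = agents["tool_decision"](user_input)

    # Stage 2: routing agents select prompt segments;
    # the always-on segments are included by default
    active_prompts = {"CORE_IDENTITY", "GENERAL_FORMATTING"}
    for route in agents["routing"]:
        active_prompts |= route(user_input, history)

    # Stage 3: thinking agents produce hidden reasoning context
    thinking_context = [think(user_input, active_prompts)
                        for think in agents["thinking"]]

    # Stage 4: unified response generation
    return agents["response"](user_input, active_prompts,
                              thinking_context, use_tools)
```

Each stage is an ordinary callable here, which keeps the coordination logic testable independently of the model.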

Core Technologies:

  • Unified Model Architecture: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
  • Lazy Loading Strategy: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
  • Custom Orchestration: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
  • State Management: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
  • ZeroGPU Integration: Dynamic GPU allocation with @spaces.GPU decorators for efficient resource usage
  • Gradio: Multi-page interface (Chatbot + Analytics Dashboard)
  • Python: Advanced backend with 4-bit quantization and streaming

Key Frameworks & Libraries:

  • transformers & accelerate for model loading and inference optimization
  • bitsandbytes for 4-bit NF4 quantization (75% memory reduction)
  • peft for Parameter-Efficient Fine-Tuning support
  • spaces for HuggingFace ZeroGPU integration
  • matplotlib for dynamic visualization generation
  • Custom state management system with SQLite and dataset backup

Advanced Agent Architecture

Agent Pipeline Overview

The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.

Stage 1: Tool Decision Agent

Purpose: Determines if visualization tools enhance learning

Model: Llama-3.2-3B-Instruct (4-bit NF4 quantized)

Prompt Engineering:

  • Highly constrained binary decision prompt (YES/NO only)
  • Explicit INCLUDE/EXCLUDE criteria for educational contexts
  • Zero-shot classification with educational domain knowledge

Decision Criteria:

INCLUDE: Mathematical functions, data analysis, chart interpretation, 
         trend visualization, proportional relationships

EXCLUDE: Greetings, definitions, explanations without data

Output: Boolean flag activating TOOL_USE_ENHANCEMENT prompt segment
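A minimal sketch of how such a constrained binary decision can be issued and parsed. The prompt text paraphrases the criteria above; `generate` is a stand-in for the shared Llama-3.2-3B inference call, and the default-to-NO fallback is an assumption about the implementation.

```python
TOOL_DECISION_PROMPT = """Analyze the student query and decide whether a
visualization would enhance learning.

INCLUDE if: mathematical functions, data analysis, chart interpretation,
trend visualization, proportional relationships.
EXCLUDE if: greetings, definitions, explanations without data.

Answer with exactly one word: YES or NO."""

def decide_tool_use(query, generate):
    """Return True when the model answers YES; anything else means NO."""
    raw = generate(TOOL_DECISION_PROMPT + "\n\nQuery: " + query)
    # Constrained output keeps parsing trivial; noisy replies fall back to NO.
    return raw.strip().upper().startswith("YES")
```

Constraining the output to a single token makes the agent's answer parseable without any post-processing model.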


Stage 2: Prompt Routing Agents (4 Specialized Agents)

Purpose: Intelligent prompt segment selection through four specialized analyses

Model: Shared Llama-3.2-3B-Instruct instance (memory efficient)

Agent Specializations:

  1. Agent 1 - Practice Question Detector

    • Analyzes conversation context for practice question opportunities
    • Considers user's expressed understanding and learning progression
    • Activates: STRUCTURE_PRACTICE_QUESTIONS
  2. Agent 2 - Discovery Mode Classifier

    • Dual-classification: vague input detection + understanding assessment
    • Returns: VAUGE_INPUT, USER_UNDERSTANDING, or neither
    • Enables guided discovery and clarification strategies
  3. Agent 3 - Follow-up Assessment Agent

    • Detects if user is responding to previous practice questions
    • Analyzes conversation history for grading opportunities
    • Activates: PRACTICE_QUESTION_FOLLOWUP (triggers grading mode)
  4. Agent 4 - Teaching Mode Assessor

    • Evaluates need for direct instruction vs. structured practice
    • Multi-output agent (can activate multiple prompts)
    • Activates: GUIDING_TEACHING, STRUCTURE_PRACTICE_QUESTIONS

Prompt Engineering Innovation:

  • Each agent uses a specialized system prompt with clear decision criteria
  • Structured output formats for reliable parsing
  • Context-aware analysis incorporating full conversation history
  • Sequential execution prevents decision conflicts
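The structured-output convention above makes routing replies easy to map onto prompt activations. A hedged sketch of such a parser (the matching logic is an assumption; label names come from the prompt library):

```python
def parse_routing_output(raw, allowed_labels):
    """Map a routing agent's reply onto known prompt-segment labels."""
    activated = set()
    for label in allowed_labels:
        # Only labels from the fixed library can be activated,
        # so a malformed reply can never enable an unknown prompt.
        if label in raw.upper():
            activated.add(label)
    return activated
```

Restricting activation to a closed label set is what makes the agents' free-text replies safe to consume downstream.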

Stage 3: Thinking Agents (Preprocessing Layer)

Purpose: Generate reasoning context before final response (CoT/ToT)

Model: Llama-3.2-3B-Instruct (shared instance)

Agent Specializations:

  1. Math Thinking Agent

    • Method: Tree-of-Thought reasoning for mathematical problems
    • Activation: When LATEX_FORMATTING is active
    • Output Structure:
      Key Terms β†’ Principles β†’ Formulas β†’ Step-by-Step Solution β†’ Summary
      
    • Complexity Routing: Decision tree determines detail level (1A: basic, 1B: complex)
  2. Question/Answer Design Agent

    • Method: Chain-of-Thought for practice question formulation
    • Activation: When STRUCTURE_PRACTICE_QUESTIONS is active
    • Formatted Inputs: Tool context, LaTeX guidelines, practice question templates
    • Output: Question design, data formatting, answer bank generation
  3. Reasoning Thinking Agent

    • Method: General Chain-of-Thought preprocessing
    • Activation: When tools, follow-ups, or teaching mode active
    • Output Structure:
      User Knowledge Summary β†’ Understanding Analysis β†’ 
      Previous Actions β†’ Reference Fact Sheet
      

Prompt Engineering Innovation:

  • Thinking agents produce context for ResponseAgent, not final output
  • Outputs are invisible to user but inform response quality
  • Tree-of-Thought (ToT) for math: explores multiple solution paths
  • Chain-of-Thought (CoT) for others: step-by-step reasoning traces

Stage 4: Response Agent (Educational Response Generation)

Purpose: Generate pedagogically sound final response

Model: Llama-3.2-3B-Instruct (same shared instance)

Configuration:

  • 4-bit NF4 quantization (BitsAndBytes)
  • Mixed precision BF16 inference
  • Accelerate integration for distributed computation
  • 128K context window
  • Multilingual support (8 languages)

Prompt Assembly Process:

  1. Core Identity: Always included (defines Mimir persona)
  2. Logical Expressions: Regex-triggered prompts (e.g., math keywords β†’ LATEX_FORMATTING)
  3. Agent-Selected Prompts: Dynamic assembly based on routing agent decisions
  4. Context Integration: Tool outputs, thinking agent outputs, conversation history
  5. Complete Prompt: All segments joined with proper formatting
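The five assembly steps above can be sketched as a simple join over active segments. The `PROMPT_LIBRARY` contents below are placeholders, not the real prompt text, and the segment ordering is an assumption.

```python
# Placeholder segment texts; the real library holds full prompt segments.
PROMPT_LIBRARY = {
    "CORE_IDENTITY": "You are Mimir, an educational AI assistant.",
    "GENERAL_FORMATTING": "Use clear sections and short paragraphs.",
    "LATEX_FORMATTING": "Render all mathematics in LaTeX.",
    "GUIDING_TEACHING": "Guide the student; do not give full solutions.",
}

def assemble_prompt(active, thinking_context="", tool_output=""):
    """Join active segments, tool output, and thinking context in order."""
    segments = [PROMPT_LIBRARY[name]
                for name in ("CORE_IDENTITY", "GENERAL_FORMATTING",
                             "LATEX_FORMATTING", "GUIDING_TEACHING")
                if name in active]
    if tool_output:
        segments.append("Tool output:\n" + tool_output)
    if thinking_context:
        segments.append("Reasoning context:\n" + thinking_context)
    return "\n\n".join(segments)
```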

Dynamic Prompt Library (11 segments):

Core:          CORE_IDENTITY (always)
Formatting:    GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery:     VAUGE_INPUT, USER_UNDERSTANDING
Teaching:      GUIDING_TEACHING
Practice:      STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool:          TOOL_USE_ENHANCEMENT

Response Post-Processing:

  • Artifact cleanup (remove special tokens)
  • Intelligent truncation at logical breakpoints
  • Sentence integrity preservation
  • Quality validation gates
  • Word-by-word streaming for UX
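A minimal sketch of two of these post-processing steps: artifact cleanup with sentence-boundary truncation, and word-by-word streaming. The token list and regex heuristic are illustrative assumptions, not the production rules.

```python
import re

def clean_response(text, special_tokens=("<|eot_id|>", "<|end_of_text|>")):
    """Strip special tokens and truncate at the last complete sentence."""
    for tok in special_tokens:
        text = text.replace(tok, "")
    text = text.strip()
    # Keep everything up to the last sentence-ending punctuation mark.
    match = re.search(r"^(.*[.!?])", text, re.DOTALL)
    return match.group(1) if match else text

def stream_words(text):
    """Yield progressively longer prefixes, one word at a time."""
    out = ""
    for word in text.split():
        out += word + " "
        yield out.rstrip()
```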

Model Specifications

Llama-3.2-3B-Instruct Details:

  • Parameters: 3.21 billion
  • Architecture: Optimized transformer with Grouped-Query Attention (GQA)
  • Training Data: 9 trillion tokens (December 2023 cutoff)
  • Context Length: 128,000 tokens
  • Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
  • Quantization: 4-bit NF4 (~1GB VRAM)
  • Training Method: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF

Why Single Model Architecture:

  • βœ… Consistency: Same reasoning style across all agents
  • βœ… Memory Efficient: One model, shared instance (~1GB total)
  • βœ… Instruction-Tuned: Optimized for educational dialogue
  • βœ… Fast Inference: 3B parameters = quick responses
  • βœ… ZeroGPU Friendly: Small enough for dynamic allocation
  • βœ… 128K Context: Can handle long educational conversations

Prompt Engineering Techniques Demonstrated

1. Hierarchical Prompt Architecture

Three-Layer System:

  • Agent System Prompts: Specialized instructions for each agent type
  • Response Prompt Segments: Modular components dynamically assembled
  • Thinking Prompts: Preprocessing templates for reasoning generation

Innovation: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.

2. Per-Turn Prompt State Management

PromptStateManager:

# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts β†’ False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING", 
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]

Benefits:

  • No prompt pollution between turns
  • Context-appropriate responses every time
  • Traceable decision-making for debugging
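A minimal `PromptStateManager` consistent with the usage shown above; the real implementation may differ, and the segment lists here are drawn from the prompt library section.

```python
class PromptStateManager:
    """Per-turn prompt activation flags with a clean-slate reset."""

    ALWAYS_ON = ("CORE_IDENTITY", "GENERAL_FORMATTING")
    OPTIONAL = ("LATEX_FORMATTING", "VAUGE_INPUT", "USER_UNDERSTANDING",
                "GUIDING_TEACHING", "STRUCTURE_PRACTICE_QUESTIONS",
                "PRACTICE_QUESTION_FOLLOWUP", "TOOL_USE_ENHANCEMENT")

    def __init__(self):
        self.reset()

    def reset(self):
        # Called at turn start: every optional prompt goes back to False.
        self.flags = {name: False for name in self.OPTIONAL}

    def update(self, name, active):
        if name not in self.flags:
            raise KeyError(f"Unknown prompt segment: {name}")
        self.flags[name] = active

    def get_active_response_prompts(self):
        # Always-on segments first, then actives in library order.
        return list(self.ALWAYS_ON) + [n for n in self.OPTIONAL
                                       if self.flags[n]]
```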

3. Logical Expression System

Regex-Based Automatic Activation:

import re

# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
    prompt_state.update("LATEX_FORMATTING", True)

Hybrid Approach: Combines rule-based triggers with LLM decision-making for optimal reliability.

4. Constraint-Based Agent Prompting

Tool Decision Example:

System Prompt: Analyze query and determine if visualization needed.

Output Format: YES or NO (nothing else)

INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data

Result: Reliable, parseable outputs from agents without complex post-processing.

5. Chain-of-Thought & Tree-of-Thought Preprocessing

CoT for Sequential Reasoning:

Step 1: Assess topic β†’ 
Step 2: Identify user understanding β†’ 
Step 3: Previous actions β†’ 
Step 4: Reference facts

ToT for Mathematical Reasoning:

Question Type Assessment β†’ 
  Branch 1A (Simple): Minimal steps
  Branch 1B (Complex): Full derivation with principles

Innovation: Thinking agents generate rich context that guides ResponseAgent to higher-quality outputs.
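The 1A/1B branch choice can be sketched as a small router. The keyword heuristic below is purely illustrative; the actual system makes this decision with the model.

```python
# Hypothetical complexity markers for the Math Thinking Agent's branches.
COMPLEX_MARKERS = ("derivative", "integral", "proof", "limit", "matrix")

def route_math_branch(question):
    """Return '1A' (minimal steps) or '1B' (full derivation)."""
    q = question.lower()
    return "1B" if any(m in q for m in COMPLEX_MARKERS) else "1A"
```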

6. Academic Integrity by Design

Embedded in Core Prompts:

  • "Do not provide full solutions - guide through processes instead"
  • "Break problems into conceptual components"
  • "Ask clarifying questions about their understanding"
  • Subject-specific guidelines (Math: explain concepts, not compute)

Follow-up Grading:

  • Agent 3 detects practice question responses
  • PRACTICE_QUESTION_FOLLOWUP prompt activates
  • Automated assessment with constructive feedback

7. Multi-Modal Response Generation

Tool Integration:

# Tool decision β†’ JSON generation β†’ matplotlib rendering β†’ base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155, ...},
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time"
)

Result: In-memory graph generation with educational context, embedded directly in response.
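The rendering step of the flow above can be sketched as follows; this assumes matplotlib is installed, and `render_graph_base64` is an illustrative name rather than the actual tool function.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # headless backend: no display server needed
import matplotlib.pyplot as plt

def render_graph_base64(data, plot_type="line", title=""):
    """Render a dict of label -> value to a base64-encoded PNG in memory."""
    fig, ax = plt.subplots()
    labels, values = list(data.keys()), list(data.values())
    if plot_type == "line":
        ax.plot(labels, values)
    else:
        ax.bar(labels, values)
    ax.set_title(title)
    buf = io.BytesIO()          # no file is written to disk
    fig.savefig(buf, format="png")
    plt.close(fig)              # free the figure to avoid memory growth
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

The base64 string can then be embedded directly in the chat response markup.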


State Management & Persistence

GlobalStateManager Architecture

Dual-Layer Persistence:

  1. SQLite Database: Fast local access, immediate writes
  2. HuggingFace Dataset: Cloud backup, hourly sync

State Categories:

- Conversation State: Full chat history + agent context
- Prompt State: Per-turn activation (resets each interaction)
- Analytics State: Metrics, dashboard data, export history
- Evaluation State: Quality scores, classifier accuracy, user feedback
- ML Model Cache: Loaded model for reuse across sessions

Thread Safety: All state operations protected by threading.Lock()

Cleanup Strategy:

  • Automatic cleanup every 60 minutes
  • Remove sessions older than 24 hours
  • Prevents memory leaks in long-running deployments
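A sketch of the 24-hour cleanup against the SQLite layer, run under a lock. The `sessions` table name and `last_seen` column are assumptions about the schema, not the actual Mimir database layout.

```python
import sqlite3
import threading
import time

_state_lock = threading.Lock()  # all state operations share one lock

def cleanup_sessions(db_path, max_age_hours=24):
    """Delete sessions older than max_age_hours; return rows removed."""
    cutoff = time.time() - max_age_hours * 3600
    with _state_lock:
        conn = sqlite3.connect(db_path)
        try:
            cur = conn.execute(
                "DELETE FROM sessions WHERE last_seen < ?", (cutoff,))
            conn.commit()
            return cur.rowcount
        finally:
            conn.close()
```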

Model Loading & Optimization Strategy

Two-Stage Lazy Loading Pipeline

Stage 1: Build Time (Docker) - Optional Pre-caching

# preload_from_hub in README.md front matter
preload_from_hub:
  - meta-llama/Llama-3.2-3B-Instruct

  • Downloads model weights during the Docker build
  • Cached in the HuggingFace hub cache directory
  • Reduces first-request latency (no download needed)
  • Optional but recommended for production deployments

Stage 2: Runtime (Lazy Loading with Automatic Caching)

# model_manager.py - LazyLlamaModel class (simplified)
# All agents share this single instance.
@spaces.GPU(duration=120)  # GPU allocated for up to 120s during first load
def _load_model(self):
    """Load on first generate() call; later calls reuse the cache."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit NF4 quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls

Loading Flow:

App starts β†’ Instant startup (no model loading)
             ↓
First user request β†’ Triggers model load (~30-60s)
                    β”œβ”€ Download from cache (if preloaded: instant)
                    β”œβ”€ Load with 4-bit quantization
                    β”œβ”€ Create pipeline
                    └─ Cache in memory
                    ↓
All subsequent requests β†’ Use cached model (~1s)

Memory Optimization:

  • 4-bit NF4 Quantization: 75% memory reduction
    • Llama-3.2-3B: ~6GB β†’ ~1GB VRAM
  • Shared Model Strategy: ALL agents share one model instance
  • Singleton Pattern: Thread-safe model caching
  • Device Mapping: Automatic distribution with ZeroGPU
  • 128K Context: Long conversations without truncation

ZeroGPU Integration:

@spaces.GPU(duration=120)  # Dynamic allocation for first load
def _load_model(self):
    # GPU available for 120 seconds
    # Loads model once on first request
    # Cached instance reused across all agents
    # Automatic GPU management by ZeroGPU

Performance Characteristics:

  • First Request: 30-60 seconds (one-time model load)
    • With preload_from_hub: 30-40s (just quantization)
    • Without preload: 50-60s (download + quantization)
  • Subsequent Requests: <1 second per agent
  • Memory Footprint: ~1GB VRAM (persistent)
  • Cold Start: Instant app startup (model loads on demand)

Why Lazy Loading?

  • βœ… Instant Startup: App launches immediately
  • βœ… ZeroGPU Optimal: Perfect for dynamic GPU allocation
  • βœ… Memory Efficient: Only loads when needed
  • βœ… Cache Persistent: Stays loaded between requests
  • βœ… Serverless Friendly: Ideal for HuggingFace Spaces

Analytics & Evaluation System

Built-In Dashboard

Real-Time Metrics:

  • Total conversations
  • Average response time
  • Success rate (quality score >3.5)
  • Educational quality scores (ML-evaluated)
  • Classifier accuracy rates
  • Active sessions count

LightEval Integration:

  • BERTScore for semantic quality
  • ROUGE for response completeness
  • Custom educational quality indicators:
    • Has examples
    • Structured explanation
    • Appropriate length
    • Encourages learning
    • Uses LaTeX (for math)
    • Clear sections
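The rule-based indicators above might be computed along these lines; every threshold and keyword here is an illustrative assumption, and the BERTScore/ROUGE components are omitted.

```python
def quality_indicators(response, is_math=False):
    """Heuristic educational-quality checks on a generated response."""
    lower = response.lower()
    checks = {
        # Does the response ground the explanation in an example?
        "has_examples": "for example" in lower or "e.g." in lower,
        # Multiple line breaks as a proxy for structured sections.
        "structured": response.count("\n") >= 2,
        # Neither a one-liner nor an overwhelming wall of text.
        "appropriate_length": 50 <= len(response.split()) <= 500,
        # A question back to the student encourages active learning.
        "encourages_learning": "?" in response,
    }
    if is_math:
        checks["uses_latex"] = "\\(" in response or "$" in response
    return checks
```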

Exportable Data:

  • JSON export with full metrics
  • CSV export of interaction history
  • Programmatic access via API

Performance Benchmarks

Runtime Performance:

  • Inference Speed: 25-40 tokens/second (with ZeroGPU)
  • Memory Usage: ~1GB VRAM (4-bit quantization)
  • Context Window: 128K tokens
  • First Request: ~30-60 seconds (one-time load)
  • Warm Inference: <1 second per agent
  • Startup Time: Instant (lazy loading)

Llama 3.2 Quality Scores:

  • MMLU: 63.4 (competitive with larger models)
  • GSM8K (Math): 73.9
  • HumanEval (Coding): 59.3
  • Multilingual: 8 languages supported
  • Safety: RLHF-aligned for educational use