---
title: Mimir
emoji: 📚
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: true
python_version: '3.10'
short_description: Advanced prompt engineering for educational AI systems.
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68700e7552b74a1dcbb2a87e/Z7P8DJ57rc5P1ozA5gwp3.png
hardware: zero-gpu-dynamic
startup_duration_timeout: 30m
---
# Mimir: Educational AI Assistant
## Advanced Multi-Agent Architecture & Prompt Engineering Portfolio Project
### Project Overview
Mimir demonstrates enterprise-grade AI system design through a sophisticated multi-agent architecture applied to educational technology. The system showcases advanced prompt engineering, intelligent decision-making pipelines, and state-persistent conversation management. Unlike simple single-model implementations, Mimir employs **four specialized agent types** working in concert: a tool decision engine, four parallel routing agents for prompt selection, three preprocessing thinking agents for complex reasoning, and a unified response generator. This architecture prioritizes pedagogical effectiveness through dynamic context assembly, ensuring responses are tailored to each unique educational interaction.
***
### Technical Architecture
**Multi-Agent System:**
```
User Input → Tool Decision Agent → Routing Agents (4x) → Thinking Agents (3x) → Response Agent → Output
                     ↓                       ↓                      ↓                    ↓
               Llama-3.2-3B        Llama-3.2-3B (shared)      Llama-3.2-3B         Llama-3.2-3B
```
**Core Technologies:**
* **Unified Model Architecture**: Llama-3.2-3B-Instruct (3.21B parameters) for all tasks - decision-making, reasoning, and response generation
* **Lazy Loading Strategy**: Model loads on first request and caches for subsequent calls (optimal for ZeroGPU)
* **Custom Orchestration**: Hand-built agent coordination replacing traditional frameworks for precise control and optimization
* **State Management**: Thread-safe global state with dual persistence (SQLite + HuggingFace Datasets)
* **ZeroGPU Integration**: Dynamic GPU allocation with `@spaces.GPU` decorators for efficient resource usage
* **Gradio**: Multi-page interface (Chatbot + Analytics Dashboard)
* **Python**: Advanced backend with 4-bit quantization and streaming
**Key Frameworks & Libraries:**
* `transformers` & `accelerate` for model loading and inference optimization
* `bitsandbytes` for 4-bit NF4 quantization (75% memory reduction)
* `peft` for Parameter-Efficient Fine-Tuning support
* `spaces` for HuggingFace ZeroGPU integration
* `matplotlib` for dynamic visualization generation
* Custom state management system with SQLite and dataset backup
***
### Advanced Agent Architecture
#### Agent Pipeline Overview
The system processes each user interaction through a sophisticated four-stage pipeline, with each stage making intelligent decisions that shape the final response.
#### Stage 1: Tool Decision Agent
**Purpose**: Determines if visualization tools enhance learning
**Model**: Llama-3.2-3B-Instruct (4-bit NF4 quantized)
**Prompt Engineering**:
* Highly constrained binary decision prompt (YES/NO only)
* Explicit INCLUDE/EXCLUDE criteria for educational contexts
* Zero-shot classification with educational domain knowledge
**Decision Criteria**:
```
INCLUDE: Mathematical functions, data analysis, chart interpretation,
trend visualization, proportional relationships
EXCLUDE: Greetings, definitions, explanations without data
```
**Output**: Boolean flag activating `TOOL_USE_ENHANCEMENT` prompt segment
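The YES/NO-to-boolean gate can be sketched as a tiny parser over the constrained output (`parse_tool_decision` is a hypothetical helper name; the README does not show the actual parsing code):

```python
import re

def parse_tool_decision(raw_output: str) -> bool:
    """Map the agent's constrained YES/NO reply to a boolean flag.

    The prompt instructs the model to answer YES or NO only, but
    generation can still add whitespace or stray tokens, so the parser
    anchors on the first YES/NO token it finds and defaults to False.
    """
    match = re.search(r"\b(YES|NO)\b", raw_output.upper())
    return match is not None and match.group(1) == "YES"
```

Defaulting to False on unparseable output is the safe choice here: a missing chart degrades the answer less than an irrelevant one.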
---
#### Stage 2: Prompt Routing Agents (4 Specialized Agents)
**Purpose**: Intelligent prompt segment selection via four specialized analyses run in sequence
**Model**: Shared Llama-3.2-3B-Instruct instance (memory efficient)
**Agent Specializations**:
1. **Agent 1 - Practice Question Detector**
- Analyzes conversation context for practice question opportunities
- Considers user's expressed understanding and learning progression
- Activates: `STRUCTURE_PRACTICE_QUESTIONS`
2. **Agent 2 - Discovery Mode Classifier**
- Dual-classification: vague input detection + understanding assessment
- Returns: `VAUGE_INPUT`, `USER_UNDERSTANDING`, or neither
- Enables guided discovery and clarification strategies
3. **Agent 3 - Follow-up Assessment Agent**
- Detects if user is responding to previous practice questions
- Analyzes conversation history for grading opportunities
- Activates: `PRACTICE_QUESTION_FOLLOWUP` (triggers grading mode)
4. **Agent 4 - Teaching Mode Assessor**
- Evaluates need for direct instruction vs. structured practice
- Multi-output agent (can activate multiple prompts)
- Activates: `GUIDING_TEACHING`, `STRUCTURE_PRACTICE_QUESTIONS`
**Prompt Engineering Innovation**:
* Each agent uses a specialized system prompt with clear decision criteria
* Structured output formats for reliable parsing
* Context-aware analysis incorporating full conversation history
* Sequential execution prevents decision conflicts
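The sequential routing pass can be sketched as a simple loop: each agent is a callable that returns the prompt names it wants active, and running agents in a fixed order means a later agent sees earlier activations instead of silently contradicting them (names here are illustrative, not the repository's actual API):

```python
def run_routing_agents(agents, conversation, prompt_state):
    """Run routing agents one after another against a shared flag dict."""
    for agent in agents:  # sequential execution, one shared model
        for prompt_name in agent(conversation, prompt_state):
            prompt_state[prompt_name] = True  # activate selected segment
    return prompt_state
```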
---
#### Stage 3: Thinking Agents (Preprocessing Layer)
**Purpose**: Generate reasoning context before final response (CoT/ToT)
**Model**: Llama-3.2-3B-Instruct (shared instance)
**Agent Specializations**:
1. **Math Thinking Agent**
- **Method**: Tree-of-Thought reasoning for mathematical problems
- **Activation**: When `LATEX_FORMATTING` is active
- **Output Structure**:
```
Key Terms → Principles → Formulas → Step-by-Step Solution → Summary
```
- **Complexity Routing**: Decision tree determines detail level (1A: basic, 1B: complex)
2. **Question/Answer Design Agent**
- **Method**: Chain-of-Thought for practice question formulation
- **Activation**: When `STRUCTURE_PRACTICE_QUESTIONS` is active
- **Formatted Inputs**: Tool context, LaTeX guidelines, practice question templates
- **Output**: Question design, data formatting, answer bank generation
3. **Reasoning Thinking Agent**
- **Method**: General Chain-of-Thought preprocessing
- **Activation**: When tools, follow-ups, or teaching mode active
- **Output Structure**:
```
User Knowledge Summary → Understanding Analysis →
Previous Actions → Reference Fact Sheet
```
**Prompt Engineering Innovation**:
* Thinking agents produce **context for ResponseAgent**, not final output
* Outputs are invisible to user but inform response quality
* Tree-of-Thought (ToT) for math: explores multiple solution paths
* Chain-of-Thought (CoT) for others: step-by-step reasoning traces
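The handoff from thinking agents to the ResponseAgent can be sketched as prepending the hidden trace to the system context (an illustrative message shape, not the project's actual schema):

```python
def build_response_input(thinking_trace: str, user_msg: str) -> list:
    """Prepend the hidden reasoning trace as system context.

    The user never sees the trace, but it steers the final response.
    """
    return [
        {"role": "system",
         "content": "Internal reasoning notes:\n" + thinking_trace},
        {"role": "user", "content": user_msg},
    ]
```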
---
#### Stage 4: Response Agent (Educational Response Generation)
**Purpose**: Generate pedagogically sound final response
**Model**: Llama-3.2-3B-Instruct (same shared instance)
**Configuration**:
* 4-bit NF4 quantization (BitsAndBytes)
* Mixed precision BF16 inference
* Accelerate integration for distributed computation
* 128K context window
* Multilingual support (8 languages)
**Prompt Assembly Process**:
1. **Core Identity**: Always included (defines Mimir persona)
2. **Logical Expressions**: Regex-triggered prompts (e.g., math keywords → `LATEX_FORMATTING`)
3. **Agent-Selected Prompts**: Dynamic assembly based on routing agent decisions
4. **Context Integration**: Tool outputs, thinking agent outputs, conversation history
5. **Complete Prompt**: All segments joined with proper formatting
**Dynamic Prompt Library** (11 segments):
```
Core: CORE_IDENTITY (always)
Formatting: GENERAL_FORMATTING (always), LATEX_FORMATTING (math)
Discovery: VAUGE_INPUT, USER_UNDERSTANDING
Teaching: GUIDING_TEACHING
Practice: STRUCTURE_PRACTICE_QUESTIONS, PRACTICE_QUESTION_FOLLOWUP
Tool: TOOL_USE_ENHANCEMENT
```
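The assembly step above can be sketched as a dictionary lookup plus join: always-on segments come first, then whatever flags the agents switched on (the segment texts below are short placeholders, not the project's real prompts):

```python
# Placeholder segment texts standing in for the real prompt library.
PROMPT_LIBRARY = {
    "CORE_IDENTITY": "You are Mimir, an educational assistant.",
    "GENERAL_FORMATTING": "Use clear headings and short paragraphs.",
    "LATEX_FORMATTING": "Render all math in LaTeX.",
}

def assemble_prompt(active_flags: dict) -> str:
    """Join always-on segments plus agent-activated ones, in order."""
    always_on = ["CORE_IDENTITY", "GENERAL_FORMATTING"]
    selected = always_on + [
        name for name, on in active_flags.items()
        if on and name not in always_on and name in PROMPT_LIBRARY
    ]
    return "\n\n".join(PROMPT_LIBRARY[name] for name in selected)
```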
**Response Post-Processing**:
* Artifact cleanup (remove special tokens)
* Intelligent truncation at logical breakpoints
* Sentence integrity preservation
* Quality validation gates
* Word-by-word streaming for UX
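The "truncation at logical breakpoints" step can be sketched as cutting at the last full sentence inside the limit (a minimal sketch; the real post-processor is not shown in this README):

```python
def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Cut overlong output at the last complete sentence within the
    limit, preserving sentence integrity instead of clipping mid-word."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Last sentence-ending punctuation inside the clipped span.
    cut = max(clipped.rfind(c) for c in ".!?")
    return clipped[: cut + 1] if cut != -1 else clipped
```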
---
### Model Specifications
**Llama-3.2-3B-Instruct Details:**
* **Parameters**: 3.21 billion
* **Architecture**: Optimized transformer with Grouped-Query Attention (GQA)
* **Training Data**: 9 trillion tokens (December 2023 cutoff)
* **Context Length**: 128,000 tokens
* **Languages**: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
* **Quantization**: 4-bit NF4 (~1GB VRAM)
* **Training Method**: Knowledge distillation from Llama 3.1 8B/70B + SFT + RLHF
**Why Single Model Architecture:**
* ✅ **Consistency**: Same reasoning style across all agents
* ✅ **Memory Efficient**: One model, shared instance (~1GB total)
* ✅ **Instruction-Tuned**: Optimized for educational dialogue
* ✅ **Fast Inference**: 3B parameters = quick responses
* ✅ **ZeroGPU Friendly**: Small enough for dynamic allocation
* ✅ **128K Context**: Can handle long educational conversations
---
### Prompt Engineering Techniques Demonstrated
#### 1. Hierarchical Prompt Architecture
**Three-Layer System**:
- **Agent System Prompts**: Specialized instructions for each agent type
- **Response Prompt Segments**: Modular components dynamically assembled
- **Thinking Prompts**: Preprocessing templates for reasoning generation
**Innovation**: Separates decision-making logic from response generation, enabling precise control over AI behavior at each pipeline stage.
#### 2. Per-Turn Prompt State Management
**PromptStateManager**:
```python
# Reset at turn start - clean slate
prompt_state.reset()  # All 11 prompts → False

# Agents activate relevant prompts
prompt_state.update("LATEX_FORMATTING", True)
prompt_state.update("GUIDING_TEACHING", True)

# Assemble only active prompts
active_prompts = prompt_state.get_active_response_prompts()
# Returns: ["CORE_IDENTITY", "GENERAL_FORMATTING",
#           "LATEX_FORMATTING", "GUIDING_TEACHING"]
```
**Benefits**:
- No prompt pollution between turns
- Context-appropriate responses every time
- Traceable decision-making for debugging
#### 3. Logical Expression System
**Regex-Based Automatic Activation**:
```python
# Math keyword detection
math_regex = r'\b(calculus|algebra|equation|solve|derivative)\b'
if re.search(math_regex, user_input, re.IGNORECASE):
prompt_state.update("LATEX_FORMATTING", True)
```
**Hybrid Approach**: Combines rule-based triggers with LLM decision-making for optimal reliability.
#### 4. Constraint-Based Agent Prompting
**Tool Decision Example**:
```
System Prompt: Analyze query and determine if visualization needed.
Output Format: YES or NO (nothing else)
INCLUDE if: mathematical functions, data analysis, trends
EXCLUDE if: greetings, simple definitions, no data
```
**Result**: Reliable, parseable outputs from agents without complex post-processing.
#### 5. Chain-of-Thought & Tree-of-Thought Preprocessing
**CoT for Sequential Reasoning**:
```
Step 1: Assess topic →
Step 2: Identify user understanding →
Step 3: Previous actions →
Step 4: Reference facts
```
**ToT for Mathematical Reasoning**:
```
Question Type Assessment →
Branch 1A (Simple): Minimal steps
Branch 1B (Complex): Full derivation with principles
```
**Innovation**: Thinking agents generate rich context that guides ResponseAgent to higher-quality outputs.
#### 6. Academic Integrity by Design
**Embedded in Core Prompts**:
* "Do not provide full solutions - guide through processes instead"
* "Break problems into conceptual components"
* "Ask clarifying questions about their understanding"
* Subject-specific guidelines (Math: explain concepts, not compute)
**Follow-up Grading**:
* Agent 3 detects practice question responses
* `PRACTICE_QUESTION_FOLLOWUP` prompt activates
* Automated assessment with constructive feedback
#### 7. Multi-Modal Response Generation
**Tool Integration**:
```python
# Tool decision → JSON generation → matplotlib rendering → base64 encoding
Create_Graph_Tool(
    data={"Week 1": 120, "Week 2": 155},  # further weeks elided
    plot_type="line",
    title="Crop Yield Analysis",
    educational_context="Visualizes growth trend over time",
)
```
**Result**: In-memory graph generation with educational context, embedded directly in response.
---
### State Management & Persistence
#### GlobalStateManager Architecture
**Dual-Layer Persistence**:
1. **SQLite Database**: Fast local access, immediate writes
2. **HuggingFace Dataset**: Cloud backup, hourly sync
**State Categories**:
```
- Conversation State: Full chat history + agent context
- Prompt State: Per-turn activation (resets each interaction)
- Analytics State: Metrics, dashboard data, export history
- Evaluation State: Quality scores, classifier accuracy, user feedback
- ML Model Cache: Loaded model for reuse across sessions
```
**Thread Safety**: All state operations protected by `threading.Lock()`
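Lock-guarded access can be sketched as below (an assumed shape; the real GlobalStateManager also persists to SQLite and the dataset backup):

```python
import threading

class GlobalState:
    """Minimal lock-guarded key/value state, sketching the pattern."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        with self._lock:  # serialize writers
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:  # serialize readers against writers
            return self._data.get(key, default)
```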
**Cleanup Strategy**:
- Automatic cleanup every 60 minutes
- Remove sessions older than 24 hours
- Prevents memory leaks in long-running deployments
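The periodic sweep can be sketched as a TTL filter (hypothetical shapes: each session record is assumed to carry a `last_active` timestamp):

```python
SESSION_TTL = 24 * 3600  # seconds; sessions older than 24h are dropped

def cleanup_sessions(sessions: dict, now: float) -> dict:
    """Return only sessions active within the TTL window."""
    return {
        sid: session for sid, session in sessions.items()
        if now - session["last_active"] < SESSION_TTL
    }
```

Running this on a timer every 60 minutes bounds memory growth without touching live sessions.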
---
### Model Loading & Optimization Strategy
#### Two-Stage Lazy Loading Pipeline
**Stage 1: Build Time (Docker) - Optional Pre-caching**
```yaml
# preload_from_hub in README.md
preload_from_hub:
- meta-llama/Llama-3.2-3B-Instruct
```
* Downloads model weights during Docker build
* Cached in HuggingFace hub cache directory
* Reduces first-request latency (no download needed)
* **Optional but recommended** for production deployments
**Stage 2: Runtime (Lazy Loading with Automatic Caching)**
```python
# model_manager.py - LazyLlamaModel class (abridged)
@spaces.GPU(duration=120)  # GPU allocated for up to 120s during first load
def _load_model(self):
    """Load on first generate() call; later calls reuse the cache."""
    if self.model is not None:
        return  # Already loaded - reuse cached instance

    # First call: load with 4-bit quantization
    self.model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        quantization_config=quantization_config,
        device_map="auto",
    )
    # Model stays in memory for all future calls;
    # all agents share this single instance.
```
**Loading Flow**:
```
App starts → Instant startup (no model loading)
        ↓
First user request → Triggers model load (~30-60s)
    ├─ Download from cache (if preloaded: instant)
    ├─ Load with 4-bit quantization
    ├─ Create pipeline
    └─ Cache in memory
        ↓
All subsequent requests → Use cached model (~1s)
```
**Memory Optimization**:
- **4-bit NF4 Quantization**: 75% memory reduction
- Llama-3.2-3B: ~6GB β†’ ~1GB VRAM
- **Shared Model Strategy**: ALL agents share one model instance
- **Singleton Pattern**: Thread-safe model caching
- **Device Mapping**: Automatic distribution with ZeroGPU
- **128K Context**: Long conversations without truncation
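The quantization settings above correspond to a standard `BitsAndBytesConfig`; a minimal sketch (the repository's exact configuration may differ):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 weights with BF16 compute, matching the settings above
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # mixed-precision compute
)
```

This object is what gets passed as `quantization_config` in the `from_pretrained` call shown earlier.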
**ZeroGPU Integration**:
```python
@spaces.GPU(duration=120) # Dynamic allocation for first load
def _load_model(self):
# GPU available for 120 seconds
# Loads model once on first request
# Cached instance reused across all agents
# Automatic GPU management by ZeroGPU
```
**Performance Characteristics**:
* **First Request**: 30-60 seconds (one-time model load)
- With `preload_from_hub`: 30-40s (just quantization)
- Without preload: 50-60s (download + quantization)
* **Subsequent Requests**: <1 second per agent
* **Memory Footprint**: ~1GB VRAM (persistent)
* **Cold Start**: Instant app startup (model loads on demand)
**Why Lazy Loading?**
* ✅ **Instant Startup**: App launches immediately
* ✅ **ZeroGPU Optimal**: Perfect for dynamic GPU allocation
* ✅ **Memory Efficient**: Only loads when needed
* ✅ **Cache Persistent**: Stays loaded between requests
* ✅ **Serverless Friendly**: Ideal for HuggingFace Spaces
---
### Analytics & Evaluation System
#### Built-In Dashboard
**Real-Time Metrics**:
* Total conversations
* Average response time
* Success rate (quality score >3.5)
* Educational quality scores (ML-evaluated)
* Classifier accuracy rates
* Active sessions count
**LightEval Integration**:
* BERTScore for semantic quality
* ROUGE for response completeness
* Custom educational quality indicators:
- Has examples
- Structured explanation
- Appropriate length
- Encourages learning
- Uses LaTeX (for math)
- Clear sections
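The custom indicators can be sketched as cheap rule-based checks layered on top of the BERTScore/ROUGE metrics (hypothetical thresholds and heuristics, shown for shape only):

```python
def quality_indicators(response: str) -> dict:
    """Rule-based educational quality flags; each is a heuristic,
    not a model judgment."""
    return {
        "has_examples": "for example" in response.lower(),
        "structured": any(m in response for m in ("##", "1.", "- ")),
        "appropriate_length": 50 <= len(response.split()) <= 600,
        "uses_latex": "$" in response or "\\(" in response,
    }
```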
**Exportable Data**:
* JSON export with full metrics
* CSV export of interaction history
* Programmatic access via API
---
### Performance Benchmarks
**Runtime Performance:**
* **Inference Speed**: 25-40 tokens/second (with ZeroGPU)
* **Memory Usage**: ~1GB VRAM (4-bit quantization)
* **Context Window**: 128K tokens
* **First Request**: ~30-60 seconds (one-time load)
* **Warm Inference**: <1 second per agent
* **Startup Time**: Instant (lazy loading)
**Llama 3.2 Quality Scores:**
* MMLU: 63.4 (competitive with larger models)
* GSM8K (Math): 73.9
* HumanEval (Coding): 59.3
* Multilingual: 8 languages supported
* Safety: RLHF-aligned for educational use