	# Advanced 2-Stage Meeting Summarization - Complete Implementation Plan

	Project: Tiny Scribe - Advanced Mode
	Date: 2026-02-04
	Status: Ready for Implementation
	Estimated Effort: 13-19 hours

	---

	## Table of Contents

	1. [Executive Summary](#executive-summary)
	2. [Design Decisions](#design-decisions)
	3. [Model Registries](#model-registries)
	4. [UI Implementation](#ui-implementation)
	5. [Model Management Infrastructure](#model-management-infrastructure)
	6. [Extraction Pipeline](#extraction-pipeline)
	7. [Implementation Checklist](#implementation-checklist)
	8. [Testing Strategy](#testing-strategy)
	9. [Implementation Priority](#implementation-priority)
	10. [Risk Assessment](#risk-assessment)

	---

	## Executive Summary

	This plan details the implementation of a 3-model Advanced Summarization Pipeline for Tiny Scribe, featuring:

	- ✅ 3 independent model registries (Extraction, Embedding, Synthesis)
	- ✅ User-configurable extraction context (2K-8K tokens, default 4K)
	- ✅ Reasoning/thinking model support with independent toggles per stage
	- ✅ Sequential model loading for memory efficiency
	- ✅ Bilingual support (English + Traditional Chinese zh-TW)
	- ✅ Fail-fast error handling with graceful UI feedback
	- ✅ Complete independence from Standard mode

	### Architecture

	```
	Stage 1: EXTRACTION → Parse transcript → Create windows → Extract JSON items
	Stage 2: DEDUPLICATION → Compute embeddings → Remove semantic duplicates
	Stage 3: SYNTHESIS → Generate executive summary from deduplicated items
	```
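
The stages run strictly one after another, so only one model needs to be resident at a time. A minimal sketch of that control flow follows; `run_pipeline`, `fake_load`, and `fake_unload` are hypothetical stand-ins for illustration and are not part of Tiny Scribe, and the three lambdas only mimic the shape of each stage.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: a "model" is just a string label here.
events: List[str] = []

def fake_load(name: str) -> str:
    events.append(f"load:{name}")
    return name

def fake_unload(model: str) -> None:
    events.append(f"unload:{model}")

def run_pipeline(
    transcript: str,
    stages: List[Tuple[str, Callable[[str, object], object]]],
) -> object:
    """Run stages in order, loading each model just before use and
    unloading it before the next stage starts."""
    data: object = transcript
    for model_name, stage_fn in stages:
        model = fake_load(model_name)
        try:
            data = stage_fn(model, data)
        finally:
            fake_unload(model)  # runs even if the stage raises
    return data

stages = [
    # Stage 1: split the transcript into "items" (toy extraction)
    ("extraction", lambda m, text: [s.strip() for s in text.split(".") if s.strip()]),
    # Stage 2: exact-match dedup only (the real stage uses embeddings)
    ("embedding", lambda m, items: list(dict.fromkeys(items))),
    # Stage 3: join the surviving items into a "summary"
    ("synthesis", lambda m, items: "; ".join(items)),
]
summary = run_pipeline("Ship v2 Friday. Ship v2 Friday. Ana owns QA.", stages)
```

The `try`/`finally` is the point of the sketch: a failing stage still releases its model before the error propagates.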

	### Key Metrics

| Metric | Value |
|--------|-------|
| New Code | ~1,800 lines |
| Modified Code | ~60 lines |
| Total Models | 33 registry entries (13 + 4 + 16); some keys appear in both the extraction and synthesis registries |
| Default Models | `qwen3_1.7b_q4` (extraction), `granite-107m` (embedding), `qwen3_1.7b_q4` (synthesis) |
| Memory Strategy | Sequential load/unload (safe for HF Spaces Free Tier) |

	---

	## Design Decisions

	### Q1: Extraction Model List Composition (REVISION)
	Decision: Option A - 11 models (≤1.7B), excluding LFM2-Extract models

Rationale: The specialized LFM2-Extract models were removed after testing showed an 85.7% failure rate caused by hallucination and schema non-compliance. Qwen3 models replace them because they support reasoning and handle Chinese content better.

	### Q1a: Synthesis Model Selection (NEW)
	Decision: Restrict to models ≤4GB (max 4B parameters)

Rationale: HF Spaces Free Tier has only 16GB RAM, so 7B+ models will OOM. Remove ernie_21b, glm_4_7_flash_reap_30b, qwen3_30b_thinking_q1, and qwen3_30b_instruct_q1.

	### Q2: Independence from Standard Mode
	Decision: Option B - Both Extraction AND Synthesis fully independent from `AVAILABLE_MODELS`

	Rationale: Full independence prevents parameter cross-contamination; synthesis models have their own optimized temperatures (0.7-0.9) separate from Standard mode

	### Q3: Extraction n_ctx UI Control
	Decision: Option A - Slider (2K-8K, step 1024, default 4K)

	Rationale: Maximum flexibility for users to balance precision vs speed
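
To make the precision/speed trade-off concrete, the sketch below packs speaker turns into windows under a token budget and shows that halving the budget roughly doubles the window count. It is an illustration under stated assumptions: `make_windows` and `count_tokens` are hypothetical names, whitespace word count stands in for the real tokenizer, and `token_budget` represents whatever share of `n_ctx` is left for transcript text after the prompt and output reserve.

```python
from typing import List

def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for the real tokenizer.
    return len(text.split())

def make_windows(turns: List[str], token_budget: int, overlap_turns: int = 2) -> List[List[str]]:
    """Greedily pack speaker turns into windows of at most `token_budget`
    tokens, repeating the last `overlap_turns` turns at the start of the
    next window."""
    windows: List[List[str]] = []
    start = 0
    while start < len(turns):
        end, used = start, 0
        # Always take at least one turn, then keep adding while the budget allows.
        while end < len(turns) and (end == start or used + count_tokens(turns[end]) <= token_budget):
            used += count_tokens(turns[end])
            end += 1
        windows.append(turns[start:end])
        if end >= len(turns):
            break
        # Step back for overlap, but always advance by at least one turn.
        start = max(end - overlap_turns, start + 1)
    return windows

turns = [f"Speaker{i % 2}: " + "word " * 9 for i in range(12)]  # 10 tokens per turn
small = make_windows(turns, token_budget=30, overlap_turns=1)
large = make_windows(turns, token_budget=60, overlap_turns=1)
print(len(small), len(large))
```

With these toy inputs the smaller budget yields six windows and the larger one three, which is the behavior the slider exposes.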

	### Q4: Default Models
	Decision:
	- Extraction: `qwen3_1.7b_q4` (supports reasoning, better Chinese understanding)
	- Embedding: `granite-107m` (fastest, good enough)
- Synthesis: `qwen3_1.7b_q4` (same model as extraction, run with synthesis-specific settings)

	Rationale: Balanced defaults optimized for quality and speed. Qwen3 1.7B chosen over LFM2-Extract based on empirical testing showing superior extraction success rate and schema compliance.

	### Q5: Model Key Naming
	Decision: Keep same keys (no prefix like `adv_synth_`)

	Rationale: Simpler, less duplication, clear role-based config resolution

	### Q6: Model Overlap Between Stages
	Decision: Allow overlap with independent settings per role

	Rationale: Same model can be extraction + synthesis with different parameters
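
A small sketch of what "independent settings per role" means in practice. The two dictionaries are trimmed, illustrative copies of the registries (temperatures taken from this plan); `settings_for` is a hypothetical helper, not an existing function.

```python
# Trimmed, illustrative copies of the two registries (only the fields needed here).
EXTRACTION = {"qwen3_1.7b_q4": {"inference_settings": {"temperature": 0.3, "top_k": 20}}}
SYNTHESIS = {"qwen3_1.7b_q4": {"inference_settings": {"temperature": 0.8}}}

REGISTRIES = {"extraction": EXTRACTION, "synthesis": SYNTHESIS}

def settings_for(model_key: str, role: str) -> dict:
    """Look the key up in the registry for the given role only."""
    return REGISTRIES[role][model_key]["inference_settings"]

extract_temp = settings_for("qwen3_1.7b_q4", "extraction")["temperature"]
synth_temp = settings_for("qwen3_1.7b_q4", "synthesis")["temperature"]
print(extract_temp, synth_temp)
```

Because the key is resolved against one registry per role, no prefix such as `adv_synth_` is needed to keep the two configurations apart.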

	### Q7: Reasoning Checkbox UI Flow
	Decision: Option B - Separate checkboxes for extraction and synthesis

	Rationale: Independent control per stage, clearer user intent

	### Q8: Thinking Block Display
	Decision: Option A - Reuse "MODEL THINKING PROCESS" field

	Rationale: Consistent with Standard mode, no UI layout changes needed

	### Q9: Window Token Counting with User n_ctx
	Decision: Option A - Strict adherence to user's slider value

Rationale: Respect the user's explicit choice; they may want larger or smaller windows.

	### Q10: Model Loading Error Handling
	Decision: Option C - Graceful failure with UI error message

	Rationale: Most user-friendly, allows retry with different model

	---

	## Model Registries

	### 1. EXTRACTION_MODELS (13 models - FINAL)

	Location: `/home/luigi/tiny-scribe/app.py`

	Features:
	- ✅ Independent from `AVAILABLE_MODELS`
	- ✅ User-adjustable `n_ctx` (2K-8K, default 4K)
	- ✅ Extraction-optimized settings (temp 0.1-0.3)
	- ✅ 2 hybrid models with reasoning toggle
	- ✅ All models verified on HuggingFace

Complete Registry (the two LFM2-Extract entries at the end are the ones Q1 marks as removed after testing; the other 11 entries are the retained set):

	```python
	EXTRACTION_MODELS = {
	"falcon_h1_100m": {
	"name": "Falcon-H1 100M",
	"repo_id": "mradermacher/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "100M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.2,
	"top_p": 0.9,
	"top_k": 30,
	"repeat_penalty": 1.0,
	},
	},
	"gemma3_270m": {
	"name": "Gemma-3 270M",
	"repo_id": "unsloth/gemma-3-270m-it-qat-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "270M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.3,
	"top_p": 0.9,
	"top_k": 40,
	"repeat_penalty": 1.0,
	},
	},
	"ernie_300m": {
	"name": "ERNIE-4.5 0.3B (131K Context)",
	"repo_id": "unsloth/ERNIE-4.5-0.3B-PT-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 131072,
	"default_n_ctx": 4096,
	"params_size": "300M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.2,
	"top_p": 0.9,
	"top_k": 30,
	"repeat_penalty": 1.0,
	},
	},
	"granite_350m": {
	"name": "Granite-4.0 350M",
	"repo_id": "unsloth/granite-4.0-h-350m-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "350M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.1,
	"top_p": 0.95,
	"top_k": 30,
	"repeat_penalty": 1.0,
	},
	},
	"lfm2_350m": {
	"name": "LFM2 350M",
	"repo_id": "LiquidAI/LFM2-350M-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "350M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.2,
	"top_p": 0.9,
	"top_k": 40,
	"repeat_penalty": 1.0,
	},
	},
	"bitcpm4_500m": {
	"name": "BitCPM4 0.5B (128K Context)",
	"repo_id": "openbmb/BitCPM4-0.5B-GGUF",
	"filename": "*q4_0.gguf",
	"max_context": 131072,
	"default_n_ctx": 4096,
	"params_size": "500M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.2,
	"top_p": 0.9,
	"top_k": 30,
	"repeat_penalty": 1.0,
	},
	},
	"hunyuan_500m": {
	"name": "Hunyuan 0.5B (256K Context)",
	"repo_id": "mradermacher/Hunyuan-0.5B-Instruct-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 262144,
	"default_n_ctx": 4096,
	"params_size": "500M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.2,
	"top_p": 0.9,
	"top_k": 30,
	"repeat_penalty": 1.0,
	},
	},
	"qwen3_600m_q4": {
	"name": "Qwen3 0.6B Q4 (32K Context)",
	"repo_id": "unsloth/Qwen3-0.6B-GGUF",
	"filename": "*Q4_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "600M",
	"supports_reasoning": True, # ← HYBRID MODEL
	"supports_toggle": True, # ← User can toggle reasoning
	"inference_settings": {
	"temperature": 0.3,
	"top_p": 0.9,
	"top_k": 20,
	"repeat_penalty": 1.0,
	},
	},
	"granite_3_1_1b_q8": {
	"name": "Granite 3.1 1B-A400M Instruct (128K Context)",
	"repo_id": "bartowski/granite-3.1-1b-a400m-instruct-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 131072,
	"default_n_ctx": 4096,
	"params_size": "1B",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.3,
	"top_p": 0.9,
	"top_k": 30,
	"repeat_penalty": 1.0,
	},
	},
	"falcon_h1_1.5b_q4": {
	"name": "Falcon-H1 1.5B Q4",
	"repo_id": "unsloth/Falcon-H1-1.5B-Deep-Instruct-GGUF",
	"filename": "*Q4_K_M.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "1.5B",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.2,
	"top_p": 0.9,
	"top_k": 30,
	"repeat_penalty": 1.0,
	},
	},
	"qwen3_1.7b_q4": {
	"name": "Qwen3 1.7B Q4 (32K Context)",
	"repo_id": "unsloth/Qwen3-1.7B-GGUF",
	"filename": "*Q4_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "1.7B",
	"supports_reasoning": True, # ← HYBRID MODEL
	"supports_toggle": True, # ← User can toggle reasoning
	"inference_settings": {
	"temperature": 0.3,
	"top_p": 0.9,
	"top_k": 20,
	"repeat_penalty": 1.0,
	},
	},
	"lfm2_extract_350m": {
	"name": "LFM2-Extract 350M (Specialized)",
	"repo_id": "LiquidAI/LFM2-350M-Extract-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "350M",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.0, # ← Greedy decoding per Liquid AI docs
	"top_p": 1.0,
	"top_k": 0,
	"repeat_penalty": 1.0,
	},
	},
	"lfm2_extract_1.2b": {
	"name": "LFM2-Extract 1.2B (Specialized)",
	"repo_id": "LiquidAI/LFM2-1.2B-Extract-GGUF",
	"filename": "*Q8_0.gguf",
	"max_context": 32768,
	"default_n_ctx": 4096,
	"params_size": "1.2B",
	"supports_reasoning": False,
	"supports_toggle": False,
	"inference_settings": {
	"temperature": 0.0, # ← Greedy decoding per Liquid AI docs
	"top_p": 1.0,
	"top_k": 0,
	"repeat_penalty": 1.0,
	},
	},
	}
	```

	Hybrid Models (Reasoning Support):
	- `qwen3_600m_q4` - 600M, user-toggleable reasoning
	- `qwen3_1.7b_q4` - 1.7B, user-toggleable reasoning

	---

	### 2. SYNTHESIS_MODELS (16 models)

	Location: `/home/luigi/tiny-scribe/app.py`

	Features:
	- ✅ Fully independent from `AVAILABLE_MODELS` (no shared references)
	- ✅ Synthesis-optimized temperatures (0.7-0.9, higher than extraction)
- ✅ 2 hybrid + 4 thinking-only models with reasoning support (listed below)
	- ✅ Default: `qwen3_1.7b_q4`

	Registry Definition:

	```python
	# FULLY INDEPENDENT from AVAILABLE_MODELS (no shared references)
	# Synthesis-optimized settings: higher temperatures (0.7-0.9) for creative summary generation
	SYNTHESIS_MODELS = {
	"granite_3_1_1b_q8": {..., "temperature": 0.8},
	"falcon_h1_1.5b_q4": {..., "temperature": 0.7},
	"qwen3_1.7b_q4": {..., "temperature": 0.8}, # DEFAULT
	"granite_3_3_2b_q4": {..., "temperature": 0.8},
	"youtu_llm_2b_q8": {..., "temperature": 0.8}, # reasoning toggle
	"lfm2_2_6b_transcript": {..., "temperature": 0.7},
	"breeze_3b_q4": {..., "temperature": 0.7},
	"granite_3_1_3b_q4": {..., "temperature": 0.8},
	"qwen3_4b_thinking_q3": {..., "temperature": 0.8}, # thinking-only
	"granite4_tiny_q3": {..., "temperature": 0.8},
	"ernie_21b_pt_q1": {..., "temperature": 0.8},
	"ernie_21b_thinking_q1": {..., "temperature": 0.9}, # thinking-only
	"glm_4_7_flash_reap_30b": {..., "temperature": 0.8}, # thinking-only
	"glm_4_7_flash_30b_iq2": {..., "temperature": 0.7},
	"qwen3_30b_thinking_q1": {..., "temperature": 0.8}, # thinking-only
	"qwen3_30b_instruct_q1": {..., "temperature": 0.7},
	}
	```

	Reasoning Models:
	- Hybrid (toggleable): `qwen3_1.7b_q4`, `youtu_llm_2b_q8`
	- Thinking-only: `qwen3_4b_thinking_q3`, `ernie_21b_thinking_q1`, `glm_4_7_flash_reap_30b`, `qwen3_30b_thinking_q1`
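
The hybrid / thinking-only distinction follows from the two boolean flags used in the extraction registry. A minimal sketch of that classification, assuming synthesis entries carry the same `supports_reasoning` and `supports_toggle` flags (the registry above elides them) and using a hypothetical helper name `reasoning_mode`:

```python
def reasoning_mode(config: dict) -> str:
    """Classify a registry entry by its reasoning flags."""
    if config.get("supports_toggle", False):
        return "hybrid"          # user can switch reasoning on or off
    if config.get("supports_reasoning", False):
        return "thinking-only"   # reasoning is always on
    return "none"

# Illustrative flag combinations (assumed, not copied from the real registry).
examples = {
    "qwen3_1.7b_q4": {"supports_reasoning": True, "supports_toggle": True},
    "qwen3_4b_thinking_q3": {"supports_reasoning": True, "supports_toggle": False},
    "breeze_3b_q4": {"supports_reasoning": False, "supports_toggle": False},
}
modes = {key: reasoning_mode(cfg) for key, cfg in examples.items()}
```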

	---

	### 3. EMBEDDING_MODELS (4 models)

	Location: `/home/luigi/tiny-scribe/meeting_summarizer/extraction.py`

	Features:
	- ✅ Dedicated embedding models (not in AVAILABLE_MODELS)
	- ✅ Used exclusively for deduplication phase
	- ✅ Range: 384-dim to 1024-dim
	- ✅ Default: `granite-107m`

	Registry:

	```python
	EMBEDDING_MODELS = {
	"granite-107m": {
	"name": "Granite 107M Multilingual (384-dim)",
	"repo_id": "ibm-granite/granite-embedding-107m-multilingual",
	"filename": "*Q8_0.gguf",
	"embedding_dim": 384,
	"max_context": 2048,
	"description": "Fastest, multilingual, good for quick deduplication",
	},
	"granite-278m": {
	"name": "Granite 278M Multilingual (768-dim)",
	"repo_id": "ibm-granite/granite-embedding-278m-multilingual",
	"filename": "*Q8_0.gguf",
	"embedding_dim": 768,
	"max_context": 2048,
	"description": "Balanced speed/quality, multilingual",
	},
	"gemma-300m": {
	"name": "Embedding Gemma 300M (768-dim)",
	"repo_id": "unsloth/embeddinggemma-300m-GGUF",
	"filename": "*Q8_0.gguf",
	"embedding_dim": 768,
	"max_context": 2048,
	"description": "Google embedding model, strong semantics",
	},
	"qwen-600m": {
	"name": "Qwen3 Embedding 600M (1024-dim)",
	"repo_id": "Qwen/Qwen3-Embedding-0.6B-GGUF",
	"filename": "*Q8_0.gguf",
	"embedding_dim": 1024,
	"max_context": 2048,
	"description": "Highest quality, best for critical dedup",
	},
	}
	```
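
The embeddings feed a similarity-based deduplication step. The following is a sketch of one simple way to do that (greedy, first occurrence wins), not necessarily how `deduplicate_items()` will be written; `dedup` and `cosine` are hypothetical names, and the 3-dimensional vectors are toy values standing in for real 384- to 1024-dimensional embeddings.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def dedup(items: List[str], vectors: List[Sequence[float]], threshold: float = 0.85) -> List[str]:
    """Keep an item only if its similarity to every already-kept item
    is at or below `threshold`."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) <= threshold for j in kept):
            kept.append(i)
    return [items[i] for i in kept]

items = ["Ship v2 on Friday", "Release v2 by Friday", "Hire a QA lead"]
vectors = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [0.0, 1.0, 0.0]]  # toy embeddings
unique = dedup(items, vectors)
```

The first two toy vectors have a cosine similarity of about 0.99, so the second item is dropped at the default 0.85 threshold. Raising the threshold makes the check stricter and keeps more near-duplicates.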

	---

	## UI Implementation

	### Advanced Mode Controls (Option B: Separate Reasoning Checkboxes)

	Location: `/home/luigi/tiny-scribe/app.py`, Gradio interface section

	```python
# ===== ADVANCED MODE CONTROLS =====
# Uses gr.TabItem inside gr.Tabs (not gr.Group with visibility toggle)
with gr.TabItem("🧠 Advanced Mode (3-Model Pipeline)"):

    # Model Selection Row
    with gr.Row():
        extraction_model = gr.Dropdown(
            choices=list(EXTRACTION_MODELS.keys()),
            value="qwen3_1.7b_q4",  # ⭐ DEFAULT
            label="🔍 Stage 1: Extraction Model (≤1.7B)",
            info="Extracts structured items (action_items, decisions, key_points, questions) from windows"
        )

        embedding_model = gr.Dropdown(
            choices=list(EMBEDDING_MODELS.keys()),
            value="granite-107m",  # ⭐ DEFAULT
            label="🧬 Stage 2: Embedding Model",
            info="Computes semantic embeddings for deduplication across categories"
        )

        synthesis_model = gr.Dropdown(
            choices=list(SYNTHESIS_MODELS.keys()),
            value="qwen3_1.7b_q4",  # ⭐ DEFAULT
            label="✨ Stage 3: Synthesis Model (1B-30B)",
            info="Generates final executive summary from deduplicated items"
        )

    # Extraction Parameters Row
    with gr.Row():
        extraction_n_ctx = gr.Slider(
            minimum=2048,
            maximum=8192,
            step=1024,
            value=4096,  # ⭐ DEFAULT 4K
            label="🪟 Extraction Context Window (n_ctx)",
            info="Smaller = more windows (higher precision), Larger = fewer windows (faster processing)"
        )

        overlap_turns = gr.Slider(
            minimum=1,
            maximum=5,
            step=1,
            value=2,
            label="🔄 Window Overlap (speaker turns)",
            info="Number of speaker turns shared between adjacent windows (reduces information loss)"
        )

    # Deduplication Parameters Row
    with gr.Row():
        similarity_threshold = gr.Slider(
            minimum=0.70,
            maximum=0.95,
            step=0.01,
            value=0.85,
            label="🎯 Deduplication Similarity Threshold",
            info="Items with cosine similarity above this are considered duplicates (higher = stricter)"
        )

    # SEPARATE REASONING CONTROLS (Q7: Option B)
    with gr.Row():
        enable_extraction_reasoning = gr.Checkbox(
            value=False,
            # The default extraction model (qwen3_1.7b_q4) is hybrid, so start
            # visible; visibility is then updated whenever the model changes
            visible=True,
            label="🧠 Enable Reasoning for Extraction",
            info="Use thinking process before JSON output (Qwen3 hybrid models only)"
        )

        enable_synthesis_reasoning = gr.Checkbox(
            value=True,
            visible=True,  # Conditional visibility based on synthesis model
            label="🧠 Enable Reasoning for Synthesis",
            info="Use thinking process for final summary generation"
        )

    # Output Settings Row
    with gr.Row():
        adv_output_language = gr.Radio(
            choices=["en", "zh-TW"],
            value="en",
            label="🌐 Output Language",
            info="Extraction auto-detects language from transcript, synthesis uses this setting"
        )

        adv_max_tokens = gr.Slider(
            minimum=512,
            maximum=4096,
            step=128,
            value=2048,
            label="📏 Max Synthesis Tokens",
            info="Maximum tokens for final executive summary"
        )

    # Logging Control
    enable_detailed_logging = gr.Checkbox(
        value=True,
        label="📝 Enable Detailed Trace Logging",
        info="Save JSONL trace file (embedded in download JSON) for debugging pipeline"
    )

    # Model Info Accordion
    with gr.Accordion("📋 Model Details & Settings", open=False):
        with gr.Row():
            with gr.Column():
                extraction_model_info = gr.Markdown("Extraction Model\n\nSelect a model to see details")
            with gr.Column():
                embedding_model_info = gr.Markdown("Embedding Model\n\nSelect a model to see details")
            with gr.Column():
                synthesis_model_info = gr.Markdown("Synthesis Model\n\nSelect a model to see details")
	```

	---

	### Conditional Reasoning Checkbox Visibility Logic

	```python
def update_extraction_reasoning_visibility(model_key):
    """Show/hide extraction reasoning checkbox based on model capabilities."""
    config = EXTRACTION_MODELS.get(model_key, {})
    supports_toggle = config.get("supports_toggle", False)

    if supports_toggle:
        # Hybrid model (qwen3_600m_q4, qwen3_1.7b_q4)
        return gr.update(
            visible=True,
            value=False,
            interactive=True,
            label="🧠 Enable Reasoning for Extraction"
        )
    elif config.get("supports_reasoning", False) and not supports_toggle:
        # Thinking-only model (none currently in extraction, but future-proof)
        return gr.update(
            visible=True,
            value=True,
            interactive=False,
            label="🧠 Reasoning Mode for Extraction (Always On)"
        )
    else:
        # Non-reasoning model
        return gr.update(visible=False, value=False)


def update_synthesis_reasoning_visibility(model_key):
    """Show/hide synthesis reasoning checkbox based on model capabilities."""
    # Reuse existing logic from Standard mode
    return update_reasoning_visibility(model_key)  # Existing function


# Wire up event handlers
extraction_model.change(
    fn=update_extraction_reasoning_visibility,
    inputs=[extraction_model],
    outputs=[enable_extraction_reasoning]
)

synthesis_model.change(
    fn=update_synthesis_reasoning_visibility,
    inputs=[synthesis_model],
    outputs=[enable_synthesis_reasoning]
)
	```

	---

	### Model Info Display Functions

	```python
def get_extraction_model_info(model_key):
    """Generate markdown info for extraction model."""
    config = EXTRACTION_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})

    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\nReasoning: Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\nReasoning: Thinking-only (always on)"

    return f"""{config.get('name', 'Unknown')}

Size: {config.get('params_size', 'N/A')}
Max Context: {config.get('max_context', 0):,} tokens
Default n_ctx: {config.get('default_n_ctx', 4096):,} tokens (user-adjustable via slider)
Repository: `{config.get('repo_id', 'N/A')}`{reasoning_support}

Extraction-Optimized Settings:
- Temperature: {settings.get('temperature', 'N/A')} (deterministic for JSON)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


def get_embedding_model_info(model_key):
    """Generate markdown info for embedding model."""
    from meeting_summarizer.extraction import EMBEDDING_MODELS
    config = EMBEDDING_MODELS.get(model_key, {})

    return f"""{config.get('name', 'Unknown')}

Embedding Dimension: {config.get('embedding_dim', 'N/A')}
Context: {config.get('max_context', 0):,} tokens
Repository: `{config.get('repo_id', 'N/A')}`

Description: {config.get('description', 'N/A')}
"""


def get_synthesis_model_info(model_key):
    """Generate markdown info for synthesis model."""
    config = SYNTHESIS_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})

    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\nReasoning: Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\nReasoning: Thinking-only (always on)"

    return f"""{config.get('name', 'Unknown')}

Max Context: {config.get('max_context', 0):,} tokens
Repository: `{config.get('repo_id', 'N/A')}`{reasoning_support}

Synthesis-Optimized Settings:
- Temperature: {settings.get('temperature', 'N/A')} (synthesis-specific, independent of Standard mode)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


# Wire up info update handlers
extraction_model.change(
    fn=get_extraction_model_info,
    inputs=[extraction_model],
    outputs=[extraction_model_info]
)

embedding_model.change(
    fn=get_embedding_model_info,
    inputs=[embedding_model],
    outputs=[embedding_model_info]
)

synthesis_model.change(
    fn=get_synthesis_model_info,
    inputs=[synthesis_model],
    outputs=[synthesis_model_info]
)
	```

	---

	## Model Management Infrastructure

	### Role-Aware Configuration Resolver

	```python
from typing import Any, Dict


def get_model_config(model_key: str, model_role: str) -> Dict[str, Any]:
    """
    Get model configuration based on role.

    Ensures same model (e.g., qwen3_1.7b_q4) uses DIFFERENT settings
    for extraction vs synthesis.

    Args:
        model_key: Model identifier (e.g., "qwen3_1.7b_q4")
        model_role: "extraction" or "synthesis"

    Returns:
        Model configuration dict with role-specific settings

    Raises:
        ValueError: If model_key not available for specified role
    """
    if model_role == "extraction":
        if model_key not in EXTRACTION_MODELS:
            available = ", ".join(list(EXTRACTION_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for extraction role. "
                f"Available: {available}"
            )
        return EXTRACTION_MODELS[model_key]

    elif model_role == "synthesis":
        if model_key not in SYNTHESIS_MODELS:
            available = ", ".join(list(SYNTHESIS_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for synthesis role. "
                f"Available: {available}"
            )
        return SYNTHESIS_MODELS[model_key]

    else:
        raise ValueError(
            f"Unknown model role: '{model_role}'. "
            f"Must be 'extraction' or 'synthesis'"
        )
	```

	---

	### Role-Aware Model Loader (Q9: Option A - Respect user's n_ctx choice)

	```python
import gc
import os
import time
from typing import Optional, Tuple

from llama_cpp import Llama

# `logger` and `MAX_USABLE_CTX` are expected to be defined elsewhere in app.py.


def load_model_for_role(
    model_key: str,
    model_role: str,
    n_threads: int = 2,
    user_n_ctx: Optional[int] = None  # For extraction, from slider
) -> Tuple[Llama, str]:
    """
    Load model with role-specific configuration.

    Args:
        model_key: Model identifier
        model_role: "extraction" or "synthesis"
        n_threads: CPU threads
        user_n_ctx: User-specified n_ctx (extraction only, from slider)

    Returns:
        (loaded_model, info_message)

    Raises:
        RuntimeError: If model loading fails (Q10: Option C - fail gracefully)
    """
    try:
        config = get_model_config(model_key, model_role)

        # Calculate n_ctx (Q9: Option A - user's choice, clamped to what the
        # model and the host can actually support)
        if model_role == "extraction" and user_n_ctx is not None:
            n_ctx = min(user_n_ctx, config["max_context"], MAX_USABLE_CTX)
        else:
            # Synthesis or default extraction
            n_ctx = min(config.get("max_context", 8192), MAX_USABLE_CTX)

        # Detect GPU support
        requested_ngl = int(os.environ.get("N_GPU_LAYERS", 0))
        n_gpu_layers = requested_ngl

        if requested_ngl != 0:
            try:
                from llama_cpp import llama_supports_gpu_offload
                gpu_available = llama_supports_gpu_offload()
                if not gpu_available:
                    logger.warning("GPU requested but not available. Using CPU.")
                    n_gpu_layers = 0
            except Exception as e:
                logger.warning(f"Could not detect GPU: {e}. Using CPU.")
                n_gpu_layers = 0

        # Load model
        logger.info(f"Loading {config['name']} for {model_role} role (n_ctx={n_ctx:,})")

        llm = Llama.from_pretrained(
            repo_id=config["repo_id"],
            filename=config["filename"],
            n_ctx=n_ctx,
            n_batch=min(2048, n_ctx),
            n_threads=n_threads,
            n_threads_batch=n_threads,
            n_gpu_layers=n_gpu_layers,
            verbose=False,
            seed=1337,
        )

        info_msg = (
            f"✅ Loaded: {config['name']} for {model_role} "
            f"(n_ctx={n_ctx:,}, threads={n_threads})"
        )
        logger.info(info_msg)

        return llm, info_msg

    except Exception as e:
        # Q10: Option C - Fail gracefully, let user select different model
        error_msg = (
            f"❌ Failed to load {model_key} for {model_role}: {str(e)}\n\n"
            f"Please select a different model and try again."
        )
        logger.error(error_msg, exc_info=True)
        # Chain the original exception so the root cause stays in the traceback
        raise RuntimeError(error_msg) from e


def unload_model(llm: Optional[Llama], model_name: str = "model") -> None:
    """Release a model and trigger garbage collection.

    Note: `del llm` only drops this function's local reference. The caller
    must also drop its own reference (e.g. set its variable to None) before
    the memory can actually be reclaimed.
    """
    if llm is not None:
        logger.info(f"Unloading {model_name}")
        del llm
        gc.collect()
        time.sleep(0.5)  # Allow OS to reclaim memory
	```
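
The `n_ctx` clamping inside the loader can be checked in isolation. The sketch below mirrors that rule as a pure function; `resolve_n_ctx` is a hypothetical name, and the value of `MAX_USABLE_CTX` here is an assumption for illustration only (the real constant lives in `app.py` and may differ).

```python
from typing import Optional

MAX_USABLE_CTX = 32768  # Assumed value for illustration; not taken from app.py.

def resolve_n_ctx(role: str, max_context: int, user_n_ctx: Optional[int] = None) -> int:
    """Mirror the loader's rule: the slider value applies to extraction only,
    and every result is capped by the model's and the host's limits."""
    if role == "extraction" and user_n_ctx is not None:
        return min(user_n_ctx, max_context, MAX_USABLE_CTX)
    return min(max_context, MAX_USABLE_CTX)

slider_respected = resolve_n_ctx("extraction", max_context=32768, user_n_ctx=4096)
model_cap_wins = resolve_n_ctx("extraction", max_context=2048, user_n_ctx=8192)
synthesis_ignores_slider = resolve_n_ctx("synthesis", max_context=131072, user_n_ctx=4096)
```

"Strict adherence" therefore holds only within the limits: a slider value larger than the model's `max_context` is silently reduced.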

	---

	## Extraction Pipeline

	### Extraction System Prompt Builder (Bilingual + Reasoning)

	```python
def build_extraction_system_prompt(
    output_language: str,
    supports_reasoning: bool,
    supports_toggle: bool,
    enable_reasoning: bool
) -> str:
    """
    Build extraction system prompt with optional reasoning mode.

    Args:
        output_language: "en" or "zh-TW" (auto-detected from transcript)
        supports_reasoning: Model has reasoning capability
        supports_toggle: User can toggle reasoning on/off
        enable_reasoning: User's choice (only applies if supports_toggle=True)

    Returns:
        System prompt string
    """
    # Determine reasoning mode
    if supports_toggle and enable_reasoning:
        # Hybrid model with reasoning enabled
        reasoning_instruction_en = """
Use your reasoning capabilities to analyze the content before extracting.

Your reasoning should:
1. Identify key decision points and action items
2. Distinguish explicit decisions from general discussion
3. Categorize information appropriately (action vs point vs question)

After reasoning, output ONLY valid JSON."""

        reasoning_instruction_zh = """
使用你的推理能力分析內容後再進行提取。

你的推理應該：
1. 識別關鍵決策點和行動項目
2. 區分明確決策與一般討論
3. 適當分類資訊（行動 vs 要點 vs 問題）

推理後，僅輸出 JSON。"""
    else:
        reasoning_instruction_en = ""
        reasoning_instruction_zh = ""

    # Build full prompt
    if output_language == "zh-TW":
        return f"""你是會議分析助手。從逐字稿中提取結構化資訊。
{reasoning_instruction_zh}

僅輸出有效的 JSON，使用此精確架構：
{{
"action_items": ["包含負責人和截止日期的任務", ...],
"decisions": ["包含理由的決策", ...],
"key_points": ["重要討論要點", ...],
"open_questions": ["未解決的問題或疑慮", ...]
}}

規則：
- 每個項目必須是完整、獨立的句子
- 在每個項目中包含上下文（誰、什麼、何時）
- 如果類別沒有項目，使用空陣列 []
- 僅輸出 JSON，無 markdown，無解釋"""

    else:  # English
        return f"""You are a meeting analysis assistant. Extract structured information from transcript.
{reasoning_instruction_en}

Output ONLY valid JSON with this exact schema:
{{
"action_items": ["Task with owner and deadline", ...],
"decisions": ["Decision made with rationale", ...],
"key_points": ["Important discussion point", ...],
"open_questions": ["Unresolved question or concern", ...]
}}

Rules:
- Each item must be a complete, standalone sentence
- Include context (who, what, when) in each item
- If a category has no items, use empty array []
- Output ONLY JSON, no markdown, no explanations"""
	```
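
Small models do not always honor "output ONLY JSON", so the response needs a tolerant parser that still enforces the four-category schema. The sketch below is one possible approach, not the project's `_try_parse_extraction_json`; the function name `parse_extraction_json` and its return-empty-dict-on-failure convention are assumptions made for illustration.

```python
import json
from typing import Dict, List

CATEGORIES = ("action_items", "decisions", "key_points", "open_questions")

def parse_extraction_json(text: str) -> Dict[str, List[str]]:
    """Return the four-category dict, or {} if no valid object is found."""
    # Tolerate chatter around the object by slicing from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    result: Dict[str, List[str]] = {}
    for key in CATEGORIES:
        value = data.get(key, [])  # a missing category counts as empty
        if not isinstance(value, list):
            return {}              # wrong type: treat as schema violation
        result[key] = [str(v).strip() for v in value if str(v).strip()]
    return result

raw = 'Here is the result: {"action_items": ["Ana ships v2 by Friday"], "decisions": []} Done.'
parsed = parse_extraction_json(raw)
```

Returning an empty dict for any schema violation keeps the caller's failure check to a single truthiness test.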

	---

	### Extraction Streaming with Reasoning Parsing (Q8: Option A - Show in "MODEL THINKING PROCESS")

	```python
import re
import time
from typing import Dict, Generator, List, Tuple


def stream_extract_from_window(
    extraction_llm: Llama,
    window: Window,
    window_id: int,
    total_windows: int,
    tracer: Tracer,
    tokenizer: NativeTokenizer,
    enable_reasoning: bool = False
) -> Generator[Tuple[str, str, Dict[str, List[str]], bool], None, None]:
    """
    Stream extraction from single window with live progress + optional reasoning.

    Yields:
        (ticker_text, thinking_text, partial_items, is_complete)
        - ticker_text: Progress ticker for UI
        - thinking_text: Reasoning/thinking blocks (if extraction model supports it)
        - partial_items: Current extracted items
        - is_complete: True on final yield
    """
    # Auto-detect language from window content
    has_cjk = bool(re.search(r'[\u4e00-\u9fff]', window.content))
    output_language = "zh-TW" if has_cjk else "en"

    # Build system prompt with reasoning support
    config = EXTRACTION_MODELS[window.model_key]  # Assuming we pass model_key in Window
    settings = config["inference_settings"]
    use_reasoning = enable_reasoning and config.get("supports_reasoning", False)
    system_prompt = build_extraction_system_prompt(
        output_language=output_language,
        supports_reasoning=config.get("supports_reasoning", False),
        supports_toggle=config.get("supports_toggle", False),
        enable_reasoning=enable_reasoning
    )

    user_prompt = f"Transcript:\n\n{window.content}"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    # Stream extraction
    full_response = ""
    thinking_content = ""
    start_time = time.time()
    first_token_time = None
    token_count = 0
    # Count input tokens once, before streaming, so the final ticker can use
    # the value even when the stream yields no content
    input_tokens = tokenizer.count(window.content)

    try:
        stream = extraction_llm.create_chat_completion(
            messages=messages,
            max_tokens=1024,
            temperature=settings["temperature"],
            top_p=settings["top_p"],
            top_k=settings["top_k"],
            repeat_penalty=settings["repeat_penalty"],
            stream=True,
        )

        for chunk in stream:
            if 'choices' in chunk and len(chunk['choices']) > 0:
                delta = chunk['choices'][0].get('delta', {})
                content = delta.get('content', '')

                if content:
                    if first_token_time is None:
                        first_token_time = time.time()

                    token_count += 1
                    full_response += content

                    # Parse thinking blocks if reasoning enabled
                    if use_reasoning:
                        thinking, remaining = parse_thinking_blocks(full_response, streaming=True)
                        thinking_content = thinking or ""
                        json_text = remaining
                    else:
                        json_text = full_response

                    # Try to parse JSON
                    partial_items = _try_parse_extraction_json(json_text)

                    # Calculate progress metrics
                    elapsed = time.time() - start_time
                    tps = token_count / elapsed if elapsed > 0 else 0
                    remaining_tokens = 1024 - token_count
                    eta = int(remaining_tokens / tps) if tps > 0 else 0

                    # Get item counts for ticker
                    items_count = {
                        "action_items": len(partial_items.get("action_items", [])),
                        "decisions": len(partial_items.get("decisions", [])),
                        "key_points": len(partial_items.get("key_points", [])),
                        "open_questions": len(partial_items.get("open_questions", []))
                    }

                    # Use the last item of the first non-empty category as snippet
                    last_item = ""
                    for category in ["action_items", "decisions", "key_points", "open_questions"]:
                        if partial_items.get(category):
                            last_item = partial_items[category][-1]
                            break

                    # Format progress ticker
                    ticker = format_progress_ticker(
                        current_window=window_id,
                        total_windows=total_windows,
                        window_tokens=input_tokens,
                        max_tokens=4096,  # Reference max for percentage
                        items_found=items_count,
                        tokens_per_sec=tps,
                        eta_seconds=eta,
                        current_snippet=last_item
                    )

                    # Q8: Option A - Show in "MODEL THINKING PROCESS" field
                    yield (ticker, thinking_content, partial_items, False)

        # Final parse
        if use_reasoning:
            thinking, remaining = parse_thinking_blocks(full_response)
            thinking_content = thinking or ""
            json_text = remaining
        else:
            json_text = full_response

        final_items = _try_parse_extraction_json(json_text)

        if not final_items:
            # JSON parsing failed - FAIL ENTIRE PIPELINE (strict mode).
            # The except block below logs the failure once; logging here as
            # well would record the same error twice.
            raise ValueError(
                f"Failed to parse JSON from window {window_id}. Response: {json_text[:200]}"
            )

        # Log successful extraction
        tracer.log_extraction(
            window_id=window_id,
            extraction=final_items,
            llm_response=_sample_llm_response(full_response),
            thinking=_sample_llm_response(thinking_content) if thinking_content else None,
            error=None
        )

        # Final ticker
        elapsed = time.time() - start_time
        tps = token_count / elapsed if elapsed > 0 else 0
        items_count = {k: len(v) for k, v in final_items.items()}

        ticker = format_progress_ticker(
            current_window=window_id,
            total_windows=total_windows,
            window_tokens=input_tokens,
            max_tokens=4096,
            items_found=items_count,
            tokens_per_sec=tps,
            eta_seconds=0,
            current_snippet="✅ Extraction complete"
        )

        yield (ticker, thinking_content, final_items, True)

    except Exception as e:
        # Log error and re-raise to fail entire pipeline
        tracer.log_extraction(
            window_id=window_id,
            extraction=None,
            llm_response=_sample_llm_response(full_response) if full_response else "",
            error=str(e)
        )
        raise
	```
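
The final-parse step above calls `parse_thinking_blocks`, which is not shown in this section. A minimal sketch of such a helper follows, assuming the model wraps its reasoning in `<think>...</think>` tags (the convention Qwen3 uses). The project's own helper may behave differently; the handling of an unterminated tag in particular is an assumption.

```python
import re
from typing import Optional, Tuple

# Assumption: reasoning is delimited by <think>...</think> tags.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_thinking_blocks(text: str) -> Tuple[Optional[str], str]:
    """Split a model response into (thinking, remaining).

    `thinking` is None when no reasoning block is present, which matches
    the `thinking or ""` fallback used by the caller.
    """
    blocks = [block.strip() for block in _THINK_RE.findall(text)]
    remaining = _THINK_RE.sub("", text)

    # An opening tag without a closing tag (e.g. generation cut off
    # mid-thought): treat everything after it as thinking.
    open_idx = remaining.find("<think>")
    if open_idx != -1:
        blocks.append(remaining[open_idx + len("<think>"):].strip())
        remaining = remaining[:open_idx]

    thinking = "\n\n".join(block for block in blocks if block) or None
    return thinking, remaining.strip()
```

Stripping the reasoning before JSON parsing keeps braces or quotes inside the model's scratch text from confusing the parser.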

	---

	## Implementation Checklist

	### Files to Create

	- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/extraction.py` (~900 lines)
	  - [ ] `NativeTokenizer` class
	  - [ ] `EmbeddingModel` class + `EMBEDDING_MODELS` registry
	  - [ ] `format_progress_ticker()` function
	  - [ ] `stream_extract_from_window()` function (with reasoning support)
	  - [ ] `deduplicate_items()` function
	  - [ ] `stream_synthesize_executive_summary()` function
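
As an illustration of the `format_progress_ticker()` item, here is a minimal sketch whose keyword arguments mirror the call sites in `stream_extract_from_window()`. The output layout, the 60-character snippet limit, and the category names used in the example are assumptions, not the project's actual format.

```python
from typing import Dict


def format_progress_ticker(
    current_window: int,
    total_windows: int,
    window_tokens: int,
    max_tokens: int,
    items_found: Dict[str, int],
    tokens_per_sec: float,
    eta_seconds: float,
    current_snippet: str,
) -> str:
    """Render a one-line progress summary for the current window."""
    # Guard against a zero reference size to avoid ZeroDivisionError.
    pct = 100.0 * window_tokens / max_tokens if max_tokens > 0 else 0.0
    total_items = sum(items_found.values())
    breakdown = ", ".join(f"{k}={v}" for k, v in items_found.items()) or "none"
    # Keep the ticker to one line by shortening long snippets (assumed limit).
    snippet = current_snippet
    if len(snippet) > 60:
        snippet = snippet[:57] + "..."
    return (
        f"Window {current_window}/{total_windows} | "
        f"{window_tokens} tokens ({pct:.0f}% of {max_tokens}) | "
        f"{total_items} items ({breakdown}) | "
        f"{tokens_per_sec:.1f} tok/s | ETA {int(eta_seconds)}s | {snippet}"
    )


print(format_progress_ticker(
    current_window=1, total_windows=3, window_tokens=2048, max_tokens=4096,
    items_found={"action_items": 2, "decisions": 1},  # hypothetical categories
    tokens_per_sec=12.34, eta_seconds=12.7, current_snippet="Review budget",
))
```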

	### Files to Modify

	- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/__init__.py`
	  - [ ] Remove `filter_validated_items` import/export

	- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/trace.py`
	  - [ ] Add `log_extraction()` method
	  - [ ] Add `log_deduplication()` method
	  - [ ] Add `log_synthesis()` method

	- [ ] `/home/luigi/tiny-scribe/app.py` (~800 lines added/modified)
	  - [ ] Add `EXTRACTION_MODELS` registry (13 models)
	  - [ ] Add `SYNTHESIS_MODELS` reference
	  - [ ] Add `get_model_config()` function
	  - [ ] Add `load_model_for_role()` function
	  - [ ] Add `unload_model()` function
	  - [ ] Add `build_extraction_system_prompt()` function
	  - [ ] Add `summarize_advanced()` generator function
	  - [ ] Add Advanced mode UI controls
	  - [ ] Add reasoning visibility logic
	  - [ ] Add model info display functions
	  - [ ] Update `download_summary_json()` for trace embedding

	### Code Statistics

	| Metric | Count |
	|--------|-------|
	| New Lines | ~1,800 |
	| Modified Lines | ~60 |
	| Removed Lines | ~2 |
	| New Functions | 12 |
	| New Classes | 2 |
	| UI Controls | 11 |

	---

	## Testing Strategy

	### Phase 1: Model Registry Validation

	```bash
	python -c "
	from app import EXTRACTION_MODELS, SYNTHESIS_MODELS
	from meeting_summarizer.extraction import EMBEDDING_MODELS

	assert len(EXTRACTION_MODELS) == 13, 'Extraction models count mismatch'
	assert len(EMBEDDING_MODELS) == 4, 'Embedding models count mismatch'
	assert len(SYNTHESIS_MODELS) == 16, 'Synthesis models count mismatch'

	# Verify independent settings
	ext_qwen = EXTRACTION_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
	syn_qwen = SYNTHESIS_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
	assert ext_qwen == 0.3, f'Extraction temp wrong: {ext_qwen}'
	assert syn_qwen == 0.8, f'Synthesis temp wrong: {syn_qwen}'

	print('✅ All model registries validated!')
	"
	```

	### Phase 2: UI Control Validation

	Manual Checks:
	1. Select "Advanced" mode
	2. Verify 3 dropdowns show correct counts (13, 4, 16)
	3. Verify default models selected
	4. Adjust the `extraction_n_ctx` slider (2K → 8K)
	5. Select `qwen3_600m_q4` for extraction → reasoning checkbox appears
	6. Select `qwen3_1.7b_q4` for extraction → reasoning checkbox visible (Qwen3 supports reasoning)
	7. Select `qwen3_4b_thinking_q3` for synthesis → reasoning locked ON
	8. Verify model info panels update on selection
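
The checkbox behaviour in checks 5-7 can be sketched as a pure function that maps a model's reasoning capability to a checkbox state. The `reasoning_mode` field, its three values, and the choice to leave hybrid models unchecked by default are assumptions made for this sketch; the real registries may encode this differently (the extraction code reads a `supports_reasoning` key).

```python
from typing import Dict

# Hypothetical registry excerpt: "reasoning_mode" is an assumed field name,
# with values mirroring the No / Hybrid / Thinking-only column in the appendix.
MODELS: Dict[str, dict] = {
    "granite_350m": {"reasoning_mode": "none"},
    "qwen3_1.7b_q4": {"reasoning_mode": "hybrid"},
    "qwen3_4b_thinking_q3": {"reasoning_mode": "thinking_only"},
}


def reasoning_checkbox_state(model_key: str, models: Dict[str, dict] = MODELS) -> Dict[str, bool]:
    """Return visibility, value and interactivity for the reasoning checkbox."""
    mode = models[model_key].get("reasoning_mode", "none")
    if mode == "thinking_only":
        # Always reasons: show the checkbox but lock it ON.
        return {"visible": True, "value": True, "interactive": False}
    if mode == "hybrid":
        # User may toggle reasoning; starting unchecked is an assumption.
        return {"visible": True, "value": False, "interactive": True}
    # No reasoning support: hide the control entirely.
    return {"visible": False, "value": False, "interactive": False}


print(reasoning_checkbox_state("qwen3_4b_thinking_q3"))
```

In a Gradio dropdown change handler, the returned dict could be passed on as `gr.update(**state)`, which keeps the decision logic testable without starting the UI.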

	### Phase 3: Pipeline Test - min.txt (Quick)

	Configuration:
	- Extraction: `qwen3_1.7b_q4` (default)
	- Extraction n_ctx: 4096 (default)
	- Embedding: `granite-107m` (default)
	- Synthesis: `qwen3_1.7b_q4` (default)
	- Similarity threshold: 0.85 (default)

	Expected:
	- 1 window created
	- ~2-4 items extracted
	- 0-1 duplicates removed
	- Final summary generated
	- Total time: ~30-60s
	- Download JSON contains trace
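
To make the similarity threshold concrete, the following sketch shows one simple way semantic deduplication can work: keep an item only if its embedding is not too close to any item already kept. It is a greedy, pure-Python illustration with toy 2-D vectors; `deduplicate_by_similarity` is a hypothetical name and not necessarily how the project's `deduplicate_items()` is written.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate_by_similarity(
    items: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.85,
) -> Tuple[List[str], int]:
    """Greedy dedup: the first occurrence wins, later near-duplicates are dropped."""
    kept_items: List[str] = []
    kept_vectors: List[Sequence[float]] = []
    for item, vector in zip(items, embeddings):
        if any(cosine_similarity(vector, kept) >= threshold for kept in kept_vectors):
            continue  # too similar to something already kept
        kept_items.append(item)
        kept_vectors.append(vector)
    return kept_items, len(items) - len(kept_items)


# Toy vectors stand in for real embedding-model output.
items = ["Ship v2 by Friday", "Release v2 on Friday", "Hire a designer"]
vectors = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept, removed = deduplicate_by_similarity(items, vectors)
print(kept, removed)
```

A higher threshold removes fewer items; lowering it below 0.85 makes the pass more aggressive and risks merging distinct points.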

	### Phase 4: Pipeline Test - Reasoning Models

	Configuration:
	- Extraction: `qwen3_600m_q4`
	- ☑ Enable Reasoning for Extraction (test hybrid model)
	- Extraction n_ctx: 2048 (smaller windows)
	- Embedding: `granite-278m` (test balanced embedding)
	- Synthesis: `qwen3_1.7b_q4`
	- ☑ Enable Reasoning for Synthesis

	Expected:
	- More windows (~4-6 with 2K context)
	- "MODEL THINKING PROCESS" shows extraction thinking + ticker
	- ~10-15 items extracted
	- ~2-4 duplicates removed
	- Final summary with thinking blocks
	- Total time: ~2-3 min
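
The expectation of more windows at a smaller context follows from how windows are packed. The sketch below greedily packs transcript lines into windows under a per-window token budget. The whitespace token counter is a stand-in for `NativeTokenizer.count`, and how much of `n_ctx` the project reserves for the prompt and output is not specified here, so `max_window_tokens` is simply the transcript budget per window.

```python
from typing import Callable, List


def count_tokens_approx(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words (assumption)."""
    return len(text.split())


def create_windows(
    lines: List[str],
    max_window_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_approx,
) -> List[str]:
    """Pack consecutive lines into windows without exceeding the token budget.

    A single line larger than the budget becomes its own oversized window.
    """
    windows: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for line in lines:
        n = count_tokens(line)
        # Close the current window if adding this line would overflow it.
        if current and current_tokens + n > max_window_tokens:
            windows.append("\n".join(current))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        windows.append("\n".join(current))
    return windows


transcript = [" ".join(["word"] * 10) for _ in range(10)]  # 10 lines, 10 tokens each
print(len(create_windows(transcript, max_window_tokens=40)))
print(len(create_windows(transcript, max_window_tokens=20)))
```

Halving the budget roughly doubles the window count, which is why Phase 4 expects more windows than the 4K default.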

	### Phase 5: Pipeline Test - full.txt (Production)

	Configuration:
	- Extraction: `qwen3_1.7b_q4` (high quality, reasoning enabled)
	- Extraction n_ctx: 4096 (default)
	- Embedding: `qwen-600m` (highest quality)
	- Synthesis: `qwen3_4b_thinking_q3` (4B thinking model)
	- Output language: zh-TW (test Chinese)

	Expected:
	- ~3-5 windows (4K context)
	- ~40-60 items extracted
	- ~10-15 duplicates removed
	- Final summary in Traditional Chinese
	- Total time: ~5-8 min
	- Download JSON with embedded trace (~1-2MB)

	### Phase 6: Error Handling Test (Q10: Option C)

	Scenarios:
	1. Disconnect internet during model download
	2. Manually corrupt model cache
	3. Use an invalid model `repo_id` in `EXTRACTION_MODELS`

	Expected behavior:
	- Error message displayed in UI: "❌ Failed to load lfm2_extract_1.2b..."
	- Pipeline stops (doesn't try fallback)
	- User can select different model and retry
	- Trace file saved with error details
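
The fail-fast behaviour described above can be sketched as follows: a load failure is turned into a UI message and the generator stops, with no fallback model. `try_load_model`, `run_pipeline`, and the `loader` callable are hypothetical names for this illustration; the real pipeline would call `load_model_for_role()` and also write the trace.

```python
from typing import Callable, Iterator, Optional, Tuple


def try_load_model(
    model_key: str, loader: Callable[[str], object]
) -> Tuple[Optional[object], Optional[str]]:
    """Return (model, None) on success or (None, error_message) on failure."""
    try:
        return loader(model_key), None
    except Exception as exc:
        return None, f"❌ Failed to load {model_key}: {exc}"


def run_pipeline(model_key: str, loader: Callable[[str], object]) -> Iterator[str]:
    """Yield status messages; stop at the first load failure (no fallback)."""
    model, error = try_load_model(model_key, loader)
    if error is not None:
        yield error
        return  # fail fast: the user picks another model and retries
    yield f"Loaded {model_key}"
    # ... extraction, deduplication and synthesis would follow here ...


def offline_loader(model_key: str) -> object:
    """Simulates scenario 1: the network drops during download."""
    raise ConnectionError("network unreachable")


print(list(run_pipeline("lfm2_extract_1.2b", offline_loader)))
```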

	---

	## Implementation Priority

	### Suggested Implementation Sequence (13-19 hours total)

	1. Model Registries (1-2 hours)
	   - [ ] Add `EXTRACTION_MODELS` to `app.py`
	   - [ ] Add `SYNTHESIS_MODELS` reference
	   - [ ] Add `EMBEDDING_MODELS` to `extraction.py`
	   - [ ] Validate with smoke test

	2. Core Infrastructure (2-3 hours)
	   - [ ] Implement `get_model_config()`
	   - [ ] Implement `load_model_for_role()` with `user_n_ctx` support
	   - [ ] Implement `unload_model()`
	   - [ ] Implement `build_extraction_system_prompt()` with reasoning support
	   - [ ] Update `trace.py` with 3 new logging methods
	   - [ ] Update `__init__.py`

	3. Extraction Module (3-4 hours)
	   - [ ] Implement `NativeTokenizer` class
	   - [ ] Implement `EmbeddingModel` class
	   - [ ] Implement `format_progress_ticker()`
	   - [ ] Implement `stream_extract_from_window()` with reasoning parsing
	   - [ ] Implement `deduplicate_items()`
	   - [ ] Implement `stream_synthesize_executive_summary()`

	4. UI Integration (2-3 hours)
	   - [ ] Add Advanced mode controls to Gradio interface
	   - [ ] Implement reasoning checkbox visibility logic
	   - [ ] Implement model info display functions
	   - [ ] Wire up all event handlers
	   - [ ] Test UI responsiveness

	5. Pipeline Orchestration (3-4 hours)
	   - [ ] Implement `summarize_advanced()` generator function
	   - [ ] Sequential model loading/unloading logic
	   - [ ] Error handling with graceful failures
	   - [ ] Progress ticker updates
	   - [ ] Trace embedding in download JSON

	6. Testing & Validation (2-3 hours)
	   - [ ] Run all test phases (min.txt → full.txt)
	   - [ ] Validate reasoning models behavior
	   - [ ] Test error handling scenarios
	   - [ ] Performance optimization (if needed)
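
The sequential loading/unloading item in step 5 can be sketched as a context manager that guarantees each stage's model is released before the next one loads, even when a stage fails. `loaded_model` and the `loader`/`unloader` callables are hypothetical stand-ins for `load_model_for_role()` and `unload_model()`.

```python
import gc
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def loaded_model(
    role: str,
    loader: Callable[[str], object],
    unloader: Optional[Callable[[object], None]] = None,
) -> Iterator[object]:
    """Load a model for one stage and always release it afterwards."""
    model = loader(role)
    try:
        yield model
    finally:
        if unloader is not None:
            unloader(model)
        # Drops this function's reference only; the caller should not keep
        # its own reference alive after the block if memory is to be reclaimed.
        del model
        gc.collect()


events = []


def demo_loader(role: str) -> dict:
    events.append(f"load:{role}")
    return {"role": role}


def demo_unloader(model: dict) -> None:
    events.append(f"unload:{model['role']}")


for stage in ("extraction", "embedding", "synthesis"):
    with loaded_model(stage, demo_loader, demo_unloader) as model:
        pass  # run the stage with `model` here

print(events)
```

Because each `unload` is recorded before the next `load`, at most one model is resident at a time, which is the property the memory-overflow mitigation relies on.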

	---

	## Risk Assessment

	| Risk | Probability | Impact | Mitigation |
	|------|-------------|--------|------------|
	| Memory overflow on HF Spaces Free Tier | Low | High | Sequential loading/unloading tested; add memory monitoring |
	| Reasoning output breaks JSON parsing | Medium | Medium | Robust thinking block parsing with fallback; strict error handling |
	| User n_ctx slider causes OOM | Low | Medium | Cap at MAX_USABLE_CTX (32K); show warning if user sets too high |
	| Embedding models slow down pipeline | Medium | Low | Default to granite-107m (fastest); user can upgrade if needed |
	| Trace file too large | Low | Low | Response sampling (400 chars) already implemented; compress if >5MB |
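
The response-sampling mitigation in the last row can be illustrated with a small helper that keeps the start and end of a long response and notes how much was dropped. This is a sketch under assumptions: `sample_response` is a hypothetical name, and the project's `_sample_llm_response` may sample differently (for example, keeping only the head).

```python
def sample_response(text: str, max_chars: int = 400) -> str:
    """Keep about `max_chars` characters of a long response for the trace.

    Short responses are returned unchanged. Long ones keep the first and
    last halves of the budget, separated by a marker stating how many
    characters were omitted.
    """
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    omitted = len(text) - 2 * half
    return f"{text[:half]}\n...[{omitted} chars omitted]...\n{text[-half:]}"


long_response = "a" * 1000
sampled = sample_response(long_response)
print(len(long_response), len(sampled))
```

Keeping the tail as well as the head is useful for traces because JSON truncation and parse errors often show up at the end of a response.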

	---

	## Appendix: Model Comparison Tables

	### Extraction Models (11 of 13 listed)

	| Model | Size | Context | Reasoning | Settings |
	|-------|------|---------|-----------|----------|
	| falcon_h1_100m | 100M | 32K | No | temp=0.2 |
	| gemma3_270m | 270M | 32K | No | temp=0.3 |
	| ernie_300m | 300M | 131K | No | temp=0.2 |
	| granite_350m | 350M | 32K | No | temp=0.1 |
	| bitcpm4_500m | 500M | 128K | No | temp=0.2 |
	| hunyuan_500m | 500M | 256K | No | temp=0.2 |
	| qwen3_600m_q4 | 600M | 32K | Hybrid | temp=0.3 |
	| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.3 |
	| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.2 |
	| qwen3_1.7b_q4 | 1.7B | 32K | Hybrid | temp=0.3 |
	| lfm2_extract_1.2b | 1.2B | 32K | No | temp=0.2 |

	### Synthesis Models (16)

	| Model | Size | Context | Reasoning | Settings |
	|-------|------|---------|-----------|----------|
	| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.7 |
	| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.1 |
	| qwen3_1.7b_q4 | 1.7B | 32K | Hybrid | temp=0.8 |
	| granite_3_3_2b_q4 | 2B | 128K | No | temp=0.8 |
	| youtu_llm_2b_q8 | 2B | 128K | Hybrid | temp=0.8 |
	| lfm2_2_6b_transcript | 2.6B | 32K | No | temp=0.7 |
	| breeze_3b_q4 | 3B | 32K | No | temp=0.7 |
	| granite_3_1_3b_q4 | 3B | 128K | No | temp=0.8 |
	| qwen3_4b_thinking_q3 | 4B | 256K | Thinking-only | temp=0.8 |
	| granite4_tiny_q3 | 7B | 128K | No | temp=0.8 |
	| ernie_21b_pt_q1 | 21B | 128K | No | temp=0.8 |
	| ernie_21b_thinking_q1 | 21B | 128K | Thinking-only | temp=0.9 |
	| glm_4_7_flash_reap_30b | 30B | 128K | Thinking-only | temp=0.8 |
	| glm_4_7_flash_30b_iq2 | 30B | 128K | No | temp=0.7 |
	| qwen3_30b_thinking_q1 | 30B | 256K | Thinking-only | temp=0.8 |
	| qwen3_30b_instruct_q1 | 30B | 256K | No | temp=0.7 |

	### Embedding Models (4)

	| Model | Size | Dimension | Speed | Quality |
	|-------|------|-----------|-------|---------|
	| granite-107m | 107M | 384 | Fastest | Good |
	| granite-278m | 278M | 768 | Balanced | Better |
	| gemma-300m | 300M | 768 | Fast | Good |
	| qwen-600m | 600M | 1024 | Slower | Best |

	---

	## Next Steps

	Once approved, implementation will proceed in the order outlined in the Implementation Priority section. All code will be committed with descriptive messages that reference this plan document.

	Ready for implementation approval.

	---

	Document Version: 1.1
	Last Updated: 2026-02-05
	Author: Claude (Anthropic)
	Reviewer: Updated post-implementation to match actual code