# Advanced 3-Stage Meeting Summarization - Complete Implementation Plan

**Project:** Tiny Scribe - Advanced Mode
**Date:** 2026-02-04
**Status:** Ready for Implementation
**Estimated Effort:** 13-19 hours

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Design Decisions](#design-decisions)
3. [Model Registries](#model-registries)
4. [UI Implementation](#ui-implementation)
5. [Model Management Infrastructure](#model-management-infrastructure)
6. [Extraction Pipeline](#extraction-pipeline)
7. [Implementation Checklist](#implementation-checklist)
8. [Testing Strategy](#testing-strategy)
9. [Implementation Priority](#implementation-priority)
10. [Risk Assessment](#risk-assessment)

---

## Executive Summary

This plan details the implementation of a **3-model Advanced Summarization Pipeline** for Tiny Scribe, featuring:

- ✅ **3 independent model registries** (Extraction, Embedding, Synthesis)
- ✅ **User-configurable extraction context** (2K-8K tokens, default 4K)
- ✅ **Reasoning/thinking model support** with independent toggles per stage
- ✅ **Sequential model loading** for memory efficiency
- ✅ **Bilingual support** (English + Traditional Chinese zh-TW)
- ✅ **Fail-fast error handling** with graceful UI feedback
- ✅ **Complete independence** from Standard mode

### Architecture

```
Stage 1: EXTRACTION → Parse transcript → Create windows → Extract JSON items
Stage 2: DEDUPLICATION → Compute embeddings → Remove semantic duplicates
Stage 3: SYNTHESIS → Generate executive summary from deduplicated items
```

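The three stages never share memory: each model is loaded, used, and unloaded before the next stage begins. As a rough illustration of the intended control flow (not the final `summarize_advanced()`, which is specified in the Implementation Checklist; `cfg`, `build_windows`, `tracer`, and `tokenizer` are placeholder names, and signatures are assumptions where this plan does not pin them down):

```python
def summarize_advanced(transcript, cfg):
    """Sketch: orchestrate the 3-stage pipeline with sequential load/unload."""
    # Stage 1: extraction (only one model in memory at a time)
    ext_llm, msg = load_model_for_role(cfg["extraction_model"], "extraction",
                                       user_n_ctx=cfg["extraction_n_ctx"])
    yield f"{msg}\n"
    windows = build_windows(transcript, cfg["extraction_n_ctx"], cfg["overlap_turns"])
    all_items = []
    for i, window in enumerate(windows, 1):
        for ticker, thinking, items, done in stream_extract_from_window(
                ext_llm, window, i, len(windows), tracer, tokenizer,
                cfg["enable_extraction_reasoning"]):
            yield ticker
            if done:
                all_items.append(items)
    unload_model(ext_llm, "extraction model")  # free RAM before stage 2

    # Stage 2: deduplication with a dedicated embedding model
    emb = EmbeddingModel(cfg["embedding_model"])
    unique_items = deduplicate_items(all_items, emb, cfg["similarity_threshold"])
    unload_model(emb.llm, "embedding model")   # same unload discipline

    # Stage 3: synthesis with the (usually larger) summary model
    syn_llm, msg = load_model_for_role(cfg["synthesis_model"], "synthesis")
    yield from stream_synthesize_executive_summary(syn_llm, unique_items, cfg)
    unload_model(syn_llm, "synthesis model")
```

This sequential pattern keeps peak memory near the largest single model rather than the sum of all three, which is what makes the pipeline viable on the HF Spaces Free Tier.
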
### Key Metrics

| Metric | Value |
|--------|-------|
| **New Code** | ~1,800 lines |
| **Modified Code** | ~60 lines |
| **Total Models** | 33 unique (13 + 4 + 16) |
| **Default Models** | `lfm2_extract_1.2b`, `granite-107m`, `qwen3_1.7b_q4` |
| **Memory Strategy** | Sequential load/unload (safe for HF Spaces Free Tier) |

---

## Design Decisions

### Q1: Extraction Model List Composition
**Decision:** Option A - 13 models (≤1.7B + 2 LFM2-Extract)

**Rationale:** Maximum flexibility for users; includes specialized extraction models

### Q2: Independence from Standard Mode
**Decision:** Option A - Extraction fully independent, Synthesis references `AVAILABLE_MODELS`

**Rationale:** Avoid duplication while maintaining clear separation of concerns

### Q3: Extraction n_ctx UI Control
**Decision:** Option A - Slider (2K-8K, step 1024, default 4K)

**Rationale:** Maximum flexibility for users to balance precision vs speed

### Q4: Default Models
**Decision:**
- Extraction: `lfm2_extract_1.2b` (specialized, high quality)
- Embedding: `granite-107m` (fastest, good enough)
- Synthesis: `qwen3_1.7b_q4` (larger than extraction, better quality)

**Rationale:** Balanced defaults optimized for quality and speed

### Q5: Model Key Naming
**Decision:** Keep same keys (no prefix like `adv_synth_`)

**Rationale:** Simpler, less duplication, clear role-based config resolution

### Q6: Model Overlap Between Stages
**Decision:** Allow overlap with independent settings per role

**Rationale:** The same model can serve both extraction and synthesis with different parameters

### Q7: Reasoning Checkbox UI Flow
**Decision:** Option B - Separate checkboxes for extraction and synthesis

**Rationale:** Independent control per stage, clearer user intent

### Q8: Thinking Block Display
**Decision:** Option A - Reuse "MODEL THINKING PROCESS" field

**Rationale:** Consistent with Standard mode, no UI layout changes needed

### Q9: Window Token Counting with User n_ctx
**Decision:** Option A - Strict adherence to user's slider value

**Rationale:** Respect the user's explicit choice; they may want larger or smaller windows

### Q10: Model Loading Error Handling
**Decision:** Option C - Graceful failure with UI error message

**Rationale:** Most user-friendly; allows retry with a different model

---

## Model Registries

### 1. EXTRACTION_MODELS (13 models)

**Location:** `/home/luigi/tiny-scribe/app.py`

**Features:**
- ✅ Independent from `AVAILABLE_MODELS`
- ✅ User-adjustable `n_ctx` (2K-8K, default 4K)
- ✅ Extraction-optimized settings (temp 0.1-0.3)
- ✅ 2 hybrid models with reasoning toggle

**Complete Registry:**

```python
EXTRACTION_MODELS = {
    "falcon_h1_100m": {
        "name": "Falcon-H1 100M",
        "repo_id": "mradermacher/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "100M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "gemma3_270m": {
        "name": "Gemma-3 270M",
        "repo_id": "unsloth/gemma-3-270m-it-qat-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "270M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 40,
            "repeat_penalty": 1.0,
        },
    },
    "ernie_300m": {
        "name": "ERNIE-4.5 0.3B (131K Context)",
        "repo_id": "unsloth/ERNIE-4.5-0.3B-PT-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "300M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "granite_350m": {
        "name": "Granite-4.0 350M",
        "repo_id": "unsloth/granite-4.0-h-350m-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.1,
            "top_p": 0.95,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_350m": {
        "name": "LFM2 350M",
        "repo_id": "LiquidAI/LFM2-350M-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 40,
            "repeat_penalty": 1.0,
        },
    },
    "bitcpm4_500m": {
        "name": "BitCPM4 0.5B (128K Context)",
        "repo_id": "openbmb/BitCPM4-0.5B-GGUF",
        "filename": "*q4_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "500M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "hunyuan_500m": {
        "name": "Hunyuan 0.5B (256K Context)",
        "repo_id": "mradermacher/Hunyuan-0.5B-Instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 262144,
        "default_n_ctx": 4096,
        "params_size": "500M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "qwen3_600m_q4": {
        "name": "Qwen3 0.6B Q4 (32K Context)",
        "repo_id": "unsloth/Qwen3-0.6B-GGUF",
        "filename": "*Q4_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "600M",
        "supports_reasoning": True,   # ← HYBRID MODEL
        "supports_toggle": True,      # ← User can toggle reasoning
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 20,
            "repeat_penalty": 1.0,
        },
    },
    "granite_3_1_1b_q8": {
        "name": "Granite 3.1 1B-A400M Instruct (128K Context)",
        "repo_id": "bartowski/granite-3.1-1b-a400m-instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "1B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "falcon_h1_1.5b_q4": {
        "name": "Falcon-H1 1.5B Q4",
        "repo_id": "unsloth/Falcon-H1-1.5B-Deep-Instruct-GGUF",
        "filename": "*Q4_K_M.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.5B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "qwen3_1.7b_q4": {
        "name": "Qwen3 1.7B Q4 (32K Context)",
        "repo_id": "unsloth/Qwen3-1.7B-GGUF",
        "filename": "*Q4_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.7B",
        "supports_reasoning": True,   # ← HYBRID MODEL
        "supports_toggle": True,      # ← User can toggle reasoning
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 20,
            "repeat_penalty": 1.0,
        },
    },
    # ===== SPECIALIZED EXTRACTION MODELS =====
    "lfm2_extract_350m": {
        "name": "🎯 LFM2-Extract 350M (Specialized)",
        "repo_id": "LiquidAI/LFM2-350M-Extract-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "description": "Optimized for extraction tasks",
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_extract_1.2b": {
        "name": "🎯 LFM2-Extract 1.2B (High Quality)",
        "repo_id": "LiquidAI/LFM2-1.2B-Extract-GGUF",
        "filename": "*Q4_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.2B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "description": "Higher quality extraction for complex meetings",
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
}
```

**Hybrid Models (Reasoning Support):**
- `qwen3_600m_q4` - 600M, user-toggleable reasoning
- `qwen3_1.7b_q4` - 1.7B, user-toggleable reasoning

---

### 2. SYNTHESIS_MODELS (16 models)

**Location:** `/home/luigi/tiny-scribe/app.py`

**Features:**
- ✅ References models #9-24 from `AVAILABLE_MODELS`
- ✅ Inherits generation settings from Standard mode
- ✅ 2 hybrid + 4 thinking-only models with reasoning support
- ✅ Default: `qwen3_1.7b_q4`

**Registry Definition:**

```python
# Synthesis models reference existing AVAILABLE_MODELS (Standard mode)
SYNTHESIS_MODELS = {
    k: AVAILABLE_MODELS[k] for k in [
        "granite_3_1_1b_q8",       # #9  - 1B, 128K ctx
        "falcon_h1_1.5b_q4",       # #10 - 1.5B, 32K ctx
        "qwen3_1.7b_q4",           # #11 - 1.7B, 32K ctx, reasoning toggle (DEFAULT)
        "granite_3_3_2b_q4",       # #12 - 2B, 128K ctx
        "youtu_llm_2b_q8",         # #13 - 2B, 128K ctx, reasoning toggle
        "lfm2_2_6b_transcript",    # #14 - 2.6B, 32K ctx, transcript-optimized
        "breeze_3b_q4",            # #15 - 3B, 32K ctx
        "granite_3_1_3b_q4",       # #16 - 3B, 128K ctx
        "qwen3_4b_thinking_q3",    # #17 - 4B, 256K ctx, thinking-only
        "granite4_tiny_q3",        # #18 - 7B, 128K ctx
        "ernie_21b_pt_q1",         # #19 - 21B, 128K ctx
        "ernie_21b_thinking_q1",   # #20 - 21B, 128K ctx, thinking-only
        "glm_4_7_flash_reap_30b",  # #21 - 30B, 128K ctx, thinking-only
        "glm_4_7_flash_30b_iq2",   # #22 - 30B, 128K ctx
        "qwen3_30b_thinking_q1",   # #23 - 30B, 256K ctx, thinking-only
        "qwen3_30b_instruct_q1",   # #24 - 30B, 256K ctx
    ]
}
```

**Reasoning Models:**
- Hybrid (toggleable): `qwen3_1.7b_q4`, `youtu_llm_2b_q8`
- Thinking-only: `qwen3_4b_thinking_q3`, `ernie_21b_thinking_q1`, `glm_4_7_flash_reap_30b`, `qwen3_30b_thinking_q1`

---

### 3. EMBEDDING_MODELS (4 models)

**Location:** `/home/luigi/tiny-scribe/meeting_summarizer/extraction.py`

**Features:**
- ✅ Dedicated embedding models (not in `AVAILABLE_MODELS`)
- ✅ Used exclusively for the deduplication phase
- ✅ Range: 384-dim to 1024-dim
- ✅ Default: `granite-107m`

**Registry:**

```python
EMBEDDING_MODELS = {
    "granite-107m": {
        "name": "Granite 107M Multilingual (384-dim)",
        "repo_id": "ibm-granite/granite-embedding-107m-multilingual",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 384,
        "max_context": 2048,
        "description": "Fastest, multilingual, good for quick deduplication",
    },
    "granite-278m": {
        "name": "Granite 278M Multilingual (768-dim)",
        "repo_id": "ibm-granite/granite-embedding-278m-multilingual",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 768,
        "max_context": 2048,
        "description": "Balanced speed/quality, multilingual",
    },
    "gemma-300m": {
        "name": "Embedding Gemma 300M (768-dim)",
        "repo_id": "unsloth/embeddinggemma-300m-GGUF",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 768,
        "max_context": 2048,
        "description": "Google embedding model, strong semantics",
    },
    "qwen-600m": {
        "name": "Qwen3 Embedding 600M (1024-dim)",
        "repo_id": "Qwen/Qwen3-Embedding-0.6B-GGUF",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 1024,
        "max_context": 2048,
        "description": "Highest quality, best for critical dedup",
    },
}
```

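The checklist later in this plan names an `EmbeddingModel` class and a `deduplicate_items()` function without spelling them out. A minimal sketch of the intended behavior, assuming llama-cpp-python's `embedding=True` mode with pooled embeddings and a greedy cosine-similarity dedup against the UI threshold (all helper names here are placeholders, not a final API):

```python
import numpy as np
from llama_cpp import Llama

class EmbeddingModel:
    """Thin wrapper over a GGUF embedding model (sketch)."""

    def __init__(self, model_key: str):
        cfg = EMBEDDING_MODELS[model_key]
        self.llm = Llama.from_pretrained(
            repo_id=cfg["repo_id"],
            filename=cfg["filename"],
            embedding=True,              # llama.cpp embedding mode
            n_ctx=cfg["max_context"],
            verbose=False,
        )

    def embed(self, text: str) -> np.ndarray:
        vec = np.asarray(self.llm.embed(text), dtype=np.float32)
        return vec / (np.linalg.norm(vec) + 1e-8)  # unit-normalize for cosine


def deduplicate_items(windows_items, emb, threshold=0.85):
    """Greedy per-category dedup: keep an item only if its cosine
    similarity to every already-kept item stays below the threshold."""
    merged = {}
    for items in windows_items:                    # one dict per window
        for category, texts in items.items():
            kept = merged.setdefault(category, [])  # list of (text, vector)
            for text in texts:
                v = emb.embed(text)
                if all(float(v @ kv) < threshold for _, kv in kept):
                    kept.append((text, v))
    return {cat: [t for t, _ in pairs] for cat, pairs in merged.items()}
```

Because vectors are unit-normalized, the dot product `v @ kv` is exactly the cosine similarity the threshold slider refers to; raising the threshold keeps more near-duplicates, lowering it merges more aggressively.
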
---

## UI Implementation

### Advanced Mode Controls (Option B: Separate Reasoning Checkboxes)

**Location:** `/home/luigi/tiny-scribe/app.py`, Gradio interface section

```python
# ===== ADVANCED MODE CONTROLS =====
with gr.Group(visible=False) as advanced_controls:
    gr.Markdown("### 🧠 Advanced 3-Model Pipeline Configuration")

    # Model Selection Row
    with gr.Row():
        extraction_model = gr.Dropdown(
            choices=list(EXTRACTION_MODELS.keys()),
            value="lfm2_extract_1.2b",  # ⭐ DEFAULT
            label="🔍 Stage 1: Extraction Model (≤1.7B)",
            info="Extracts structured items (action_items, decisions, key_points, questions) from windows"
        )

        embedding_model = gr.Dropdown(
            choices=list(EMBEDDING_MODELS.keys()),
            value="granite-107m",  # ⭐ DEFAULT
            label="🧬 Stage 2: Embedding Model",
            info="Computes semantic embeddings for deduplication across categories"
        )

        synthesis_model = gr.Dropdown(
            choices=list(SYNTHESIS_MODELS.keys()),
            value="qwen3_1.7b_q4",  # ⭐ DEFAULT
            label="✨ Stage 3: Synthesis Model (1B-30B)",
            info="Generates final executive summary from deduplicated items"
        )

    # Extraction Parameters Row
    with gr.Row():
        extraction_n_ctx = gr.Slider(
            minimum=2048,
            maximum=8192,
            step=1024,
            value=4096,  # ⭐ DEFAULT 4K
            label="🪟 Extraction Context Window (n_ctx)",
            info="Smaller = more windows (higher precision), Larger = fewer windows (faster processing)"
        )

        overlap_turns = gr.Slider(
            minimum=1,
            maximum=5,
            step=1,
            value=2,
            label="🔄 Window Overlap (speaker turns)",
            info="Number of speaker turns shared between adjacent windows (reduces information loss)"
        )

    # Deduplication Parameters Row
    with gr.Row():
        similarity_threshold = gr.Slider(
            minimum=0.70,
            maximum=0.95,
            step=0.01,
            value=0.85,
            label="🎯 Deduplication Similarity Threshold",
            info="Items with cosine similarity above this are considered duplicates (higher = stricter)"
        )

    # SEPARATE REASONING CONTROLS (Q7: Option B)
    with gr.Row():
        enable_extraction_reasoning = gr.Checkbox(
            value=False,
            visible=False,  # Conditional visibility based on extraction model
            label="🧠 Enable Reasoning for Extraction",
            info="Use thinking process before JSON output (Qwen3 hybrid models only)"
        )

        enable_synthesis_reasoning = gr.Checkbox(
            value=True,
            visible=True,  # Conditional visibility based on synthesis model
            label="🧠 Enable Reasoning for Synthesis",
            info="Use thinking process for final summary generation"
        )

    # Output Settings Row
    with gr.Row():
        adv_output_language = gr.Radio(
            choices=["en", "zh-TW"],
            value="en",
            label="🌐 Output Language",
            info="Extraction auto-detects language from transcript, synthesis uses this setting"
        )

        adv_max_tokens = gr.Slider(
            minimum=512,
            maximum=4096,
            step=128,
            value=2048,
            label="📏 Max Synthesis Tokens",
            info="Maximum tokens for final executive summary"
        )

    # Logging Control
    enable_detailed_logging = gr.Checkbox(
        value=True,
        label="📝 Enable Detailed Trace Logging",
        info="Save JSONL trace file (embedded in download JSON) for debugging pipeline"
    )

    # Model Info Accordion
    with gr.Accordion("📋 Model Details & Settings", open=False):
        with gr.Row():
            with gr.Column():
                extraction_model_info = gr.Markdown("**Extraction Model**\n\nSelect a model to see details")
            with gr.Column():
                embedding_model_info = gr.Markdown("**Embedding Model**\n\nSelect a model to see details")
            with gr.Column():
                synthesis_model_info = gr.Markdown("**Synthesis Model**\n\nSelect a model to see details")
```

---

### Conditional Reasoning Checkbox Visibility Logic

```python
def update_extraction_reasoning_visibility(model_key):
    """Show/hide extraction reasoning checkbox based on model capabilities."""
    config = EXTRACTION_MODELS.get(model_key, {})
    supports_toggle = config.get("supports_toggle", False)

    if supports_toggle:
        # Hybrid model (qwen3_600m_q4, qwen3_1.7b_q4)
        return gr.update(
            visible=True,
            value=False,
            interactive=True,
            label="🧠 Enable Reasoning for Extraction"
        )
    elif config.get("supports_reasoning", False) and not supports_toggle:
        # Thinking-only model (none currently in extraction, but future-proof)
        return gr.update(
            visible=True,
            value=True,
            interactive=False,
            label="🧠 Reasoning Mode for Extraction (Always On)"
        )
    else:
        # Non-reasoning model
        return gr.update(visible=False, value=False)


def update_synthesis_reasoning_visibility(model_key):
    """Show/hide synthesis reasoning checkbox based on model capabilities."""
    # Reuse existing logic from Standard mode
    return update_reasoning_visibility(model_key)  # Existing function


# Wire up event handlers
extraction_model.change(
    fn=update_extraction_reasoning_visibility,
    inputs=[extraction_model],
    outputs=[enable_extraction_reasoning]
)

synthesis_model.change(
    fn=update_synthesis_reasoning_visibility,
    inputs=[synthesis_model],
    outputs=[enable_synthesis_reasoning]
)
```

---

### Model Info Display Functions

```python
def get_extraction_model_info(model_key):
    """Generate markdown info for extraction model."""
    config = EXTRACTION_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})

    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\n**Reasoning:** Thinking-only (always on)"

    return f"""**{config.get('name', 'Unknown')}**

**Size:** {config.get('params_size', 'N/A')}
**Max Context:** {config.get('max_context', 0):,} tokens
**Default n_ctx:** {config.get('default_n_ctx', 4096):,} tokens (user-adjustable via slider)
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}

**Extraction-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (deterministic for JSON)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


def get_embedding_model_info(model_key):
    """Generate markdown info for embedding model."""
    from meeting_summarizer.extraction import EMBEDDING_MODELS
    config = EMBEDDING_MODELS.get(model_key, {})

    return f"""**{config.get('name', 'Unknown')}**

**Embedding Dimension:** {config.get('embedding_dim', 'N/A')}
**Context:** {config.get('max_context', 0):,} tokens
**Repository:** `{config.get('repo_id', 'N/A')}`

**Description:** {config.get('description', 'N/A')}
"""


def get_synthesis_model_info(model_key):
    """Generate markdown info for synthesis model."""
    config = SYNTHESIS_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})

    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\n**Reasoning:** Thinking-only (always on)"

    return f"""**{config.get('name', 'Unknown')}**

**Max Context:** {config.get('max_context', 0):,} tokens
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}

**Synthesis-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (from Standard mode)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


# Wire up info update handlers
extraction_model.change(
    fn=get_extraction_model_info,
    inputs=[extraction_model],
    outputs=[extraction_model_info]
)

embedding_model.change(
    fn=get_embedding_model_info,
    inputs=[embedding_model],
    outputs=[embedding_model_info]
)

synthesis_model.change(
    fn=get_synthesis_model_info,
    inputs=[synthesis_model],
    outputs=[synthesis_model_info]
)
```

---

## Model Management Infrastructure

### Role-Aware Configuration Resolver

```python
def get_model_config(model_key: str, model_role: str) -> Dict[str, Any]:
    """
    Get model configuration based on role.

    Ensures same model (e.g., qwen3_1.7b_q4) uses DIFFERENT settings
    for extraction vs synthesis.

    Args:
        model_key: Model identifier (e.g., "qwen3_1.7b_q4")
        model_role: "extraction" or "synthesis"

    Returns:
        Model configuration dict with role-specific settings

    Raises:
        ValueError: If model_key not available for specified role
    """
    if model_role == "extraction":
        if model_key not in EXTRACTION_MODELS:
            available = ", ".join(list(EXTRACTION_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for extraction role. "
                f"Available: {available}"
            )
        return EXTRACTION_MODELS[model_key]

    elif model_role == "synthesis":
        if model_key not in SYNTHESIS_MODELS:
            available = ", ".join(list(SYNTHESIS_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for synthesis role. "
                f"Available: {available}"
            )
        return SYNTHESIS_MODELS[model_key]

    else:
        raise ValueError(
            f"Unknown model role: '{model_role}'. "
            f"Must be 'extraction' or 'synthesis'"
        )
```

---

### Role-Aware Model Loader (Q9: Option A - Respect user's n_ctx choice)

```python
def load_model_for_role(
    model_key: str,
    model_role: str,
    n_threads: int = 2,
    user_n_ctx: Optional[int] = None  # For extraction, from slider
) -> Tuple[Llama, str]:
    """
    Load model with role-specific configuration.

    Args:
        model_key: Model identifier
        model_role: "extraction" or "synthesis"
        n_threads: CPU threads
        user_n_ctx: User-specified n_ctx (extraction only, from slider)

    Returns:
        (loaded_model, info_message)

    Raises:
        Exception: If model loading fails (Q10: Option C - fail gracefully)
    """
    try:
        config = get_model_config(model_key, model_role)

        # Calculate n_ctx (Q9: Option A - Strict adherence to user's choice)
        if model_role == "extraction" and user_n_ctx is not None:
            n_ctx = min(user_n_ctx, config["max_context"], MAX_USABLE_CTX)
        else:
            # Synthesis or default extraction
            n_ctx = min(config.get("max_context", 8192), MAX_USABLE_CTX)

        # Detect GPU support
        requested_ngl = int(os.environ.get("N_GPU_LAYERS", 0))
        n_gpu_layers = requested_ngl

        if requested_ngl != 0:
            try:
                from llama_cpp import llama_supports_gpu_offload
                gpu_available = llama_supports_gpu_offload()
                if not gpu_available:
                    logger.warning("GPU requested but not available. Using CPU.")
                    n_gpu_layers = 0
            except Exception as e:
                logger.warning(f"Could not detect GPU: {e}. Using CPU.")
                n_gpu_layers = 0

        # Load model
        logger.info(f"Loading {config['name']} for {model_role} role (n_ctx={n_ctx:,})")

        llm = Llama.from_pretrained(
            repo_id=config["repo_id"],
            filename=config["filename"],
            n_ctx=n_ctx,
            n_batch=min(2048, n_ctx),
            n_threads=n_threads,
            n_threads_batch=n_threads,
            n_gpu_layers=n_gpu_layers,
            verbose=False,
            seed=1337,
        )

        info_msg = (
            f"✅ Loaded: {config['name']} for {model_role} "
            f"(n_ctx={n_ctx:,}, threads={n_threads})"
        )
        logger.info(info_msg)

        return llm, info_msg

    except Exception as e:
        # Q10: Option C - Fail gracefully, let user select different model
        error_msg = (
            f"❌ Failed to load {model_key} for {model_role}: {str(e)}\n\n"
            f"Please select a different model and try again."
        )
        logger.error(error_msg, exc_info=True)
        raise Exception(error_msg)


def unload_model(llm: Llama, model_name: str = "model") -> None:
    """Explicitly unload model and trigger garbage collection.

    Note: `del` only removes the local name; the caller must also drop
    its own reference(s) for the memory to actually be reclaimed.
    """
    if llm:
        logger.info(f"Unloading {model_name}")
        del llm
        gc.collect()
        time.sleep(0.5)  # Allow OS to reclaim memory
```

---

## Extraction Pipeline

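### Window Construction

Stage 1 first packs speaker turns into windows sized to the user's n_ctx slider (Q9), re-using `overlap_turns` turns between neighbors so items spanning a boundary are not lost. This plan leaves the builder itself to the implementation; a minimal sketch, assuming the `NativeTokenizer.count()` helper from the checklist, a simple `Window` container, and a reserve (the 1536-token figure is an assumption, not a settled value) for the system prompt and the 1024-token generation budget:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Window:
    content: str
    turn_range: tuple  # (first_turn_idx, last_turn_idx), inclusive

def build_windows(turns: List[str], tokenizer, n_ctx: int,
                  overlap_turns: int = 2, reserve: int = 1536) -> List[Window]:
    """Pack speaker turns into windows of at most (n_ctx - reserve) tokens."""
    budget = n_ctx - reserve  # leave room for system prompt + JSON output
    windows, start = [], 0
    while start < len(turns):
        end, used = start, 0
        while end < len(turns):
            t = tokenizer.count(turns[end])
            if used + t > budget and end > start:  # always take at least 1 turn
                break
            used += t
            end += 1
        windows.append(Window("\n".join(turns[start:end]), (start, end - 1)))
        if end >= len(turns):
            break
        start = max(start + 1, end - overlap_turns)  # overlap, but guarantee progress
    return windows
```

The `max(start + 1, ...)` guard matters: if a user sets overlap close to the window length, the next window must still advance by at least one turn or packing would loop forever.
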
### Extraction System Prompt Builder (Bilingual + Reasoning)

```python
def build_extraction_system_prompt(
    output_language: str,
    supports_reasoning: bool,
    supports_toggle: bool,
    enable_reasoning: bool
) -> str:
    """
    Build extraction system prompt with optional reasoning mode.

    Args:
        output_language: "en" or "zh-TW" (auto-detected from transcript)
        supports_reasoning: Model has reasoning capability
        supports_toggle: User can toggle reasoning on/off
        enable_reasoning: User's choice (only applies if supports_toggle=True)

    Returns:
        System prompt string
    """
    # Determine reasoning mode
    if supports_toggle and enable_reasoning:
        # Hybrid model with reasoning enabled
        reasoning_instruction_en = """
Use your reasoning capabilities to analyze the content before extracting.

Your reasoning should:
1. Identify key decision points and action items
2. Distinguish explicit decisions from general discussion
3. Categorize information appropriately (action vs point vs question)

After reasoning, output ONLY valid JSON."""

        reasoning_instruction_zh = """
使用你的推理能力分析內容後再進行提取。

你的推理應該:
1. 識別關鍵決策點和行動項目
2. 區分明確決策與一般討論
3. 適當分類資訊(行動 vs 要點 vs 問題)

推理後,僅輸出 JSON。"""
    else:
        reasoning_instruction_en = ""
        reasoning_instruction_zh = ""

    # Build full prompt
    if output_language == "zh-TW":
        return f"""你是會議分析助手。從逐字稿中提取結構化資訊。
{reasoning_instruction_zh}

僅輸出有效的 JSON,使用此精確架構:
{{
  "action_items": ["包含負責人和截止日期的任務", ...],
  "decisions": ["包含理由的決策", ...],
  "key_points": ["重要討論要點", ...],
  "open_questions": ["未解決的問題或疑慮", ...]
}}

規則:
- 每個項目必須是完整、獨立的句子
- 在每個項目中包含上下文(誰、什麼、何時)
- 如果類別沒有項目,使用空陣列 []
- 僅輸出 JSON,無 markdown,無解釋"""

    else:  # English
        return f"""You are a meeting analysis assistant. Extract structured information from transcript.
{reasoning_instruction_en}

Output ONLY valid JSON with this exact schema:
{{
  "action_items": ["Task with owner and deadline", ...],
  "decisions": ["Decision made with rationale", ...],
  "key_points": ["Important discussion point", ...],
  "open_questions": ["Unresolved question or concern", ...]
}}

Rules:
- Each item must be a complete, standalone sentence
- Include context (who, what, when) in each item
- If a category has no items, use empty array []
- Output ONLY JSON, no markdown, no explanations"""
```

---

### Extraction Streaming with Reasoning Parsing (Q8: Option A - Show in "MODEL THINKING PROCESS")

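The streaming loop below relies on `parse_thinking_blocks()` to split `<think>…</think>` reasoning from the JSON payload. That helper already exists for Standard mode; purely for reference, a minimal sketch of the contract assumed here (the exact tag format varies by model family, so treat this as an approximation rather than the shipped implementation):

```python
import re
from typing import Optional, Tuple

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_thinking_blocks(text: str, streaming: bool = False) -> Tuple[Optional[str], str]:
    """Return (thinking, remaining). With streaming=True, an unterminated
    <think> block counts as all-thinking so the UI can display it live."""
    blocks = THINK_RE.findall(text)
    remaining = THINK_RE.sub("", text)
    if streaming and "<think>" in remaining:
        # Open block still being generated: everything after the tag is thinking
        head, _, tail = remaining.partition("<think>")
        blocks.append(tail)
        remaining = head
    thinking = "\n\n".join(b.strip() for b in blocks) if blocks else None
    return thinking, remaining.strip()
```
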
```python
def stream_extract_from_window(
    extraction_llm: Llama,
    window: Window,
    window_id: int,
    total_windows: int,
    tracer: Tracer,
    tokenizer: NativeTokenizer,
    enable_reasoning: bool = False
) -> Generator[Tuple[str, str, Dict[str, List[str]], bool], None, None]:
    """
    Stream extraction from single window with live progress + optional reasoning.

    Yields:
        (ticker_text, thinking_text, partial_items, is_complete)
        - ticker_text: Progress ticker for UI
        - thinking_text: Reasoning/thinking blocks (if extraction model supports it)
        - partial_items: Current extracted items
        - is_complete: True on final yield
    """
    # Auto-detect language from window content
    has_cjk = bool(re.search(r'[\u4e00-\u9fff]', window.content))
    output_language = "zh-TW" if has_cjk else "en"

    # Build system prompt with reasoning support
    config = EXTRACTION_MODELS[window.model_key]  # Assuming we pass model_key in Window
    system_prompt = build_extraction_system_prompt(
        output_language=output_language,
        supports_reasoning=config.get("supports_reasoning", False),
        supports_toggle=config.get("supports_toggle", False),
        enable_reasoning=enable_reasoning
    )

    user_prompt = f"Transcript:\n\n{window.content}"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    # Stream extraction
    full_response = ""
    thinking_content = ""
    input_tokens = tokenizer.count(window.content)  # window size; computed once, used by both tickers
    start_time = time.time()
    first_token_time = None
    token_count = 0

    try:
        stream = extraction_llm.create_chat_completion(
            messages=messages,
            max_tokens=1024,
            temperature=config["inference_settings"]["temperature"],
            top_p=config["inference_settings"]["top_p"],
            top_k=config["inference_settings"]["top_k"],
            repeat_penalty=config["inference_settings"]["repeat_penalty"],
            stream=True,
        )

        for chunk in stream:
            if 'choices' in chunk and len(chunk['choices']) > 0:
                delta = chunk['choices'][0].get('delta', {})
                content = delta.get('content', '')

                if content:
                    if first_token_time is None:
                        first_token_time = time.time()

                    token_count += 1
                    full_response += content

                    # Parse thinking blocks if reasoning enabled
                    if enable_reasoning and config.get("supports_reasoning"):
                        thinking, remaining = parse_thinking_blocks(full_response, streaming=True)
                        thinking_content = thinking or ""
                        json_text = remaining
                    else:
                        json_text = full_response

                    # Try to parse JSON
                    partial_items = _try_parse_extraction_json(json_text)

                    # Calculate progress metrics
                    elapsed = time.time() - start_time
                    tps = token_count / elapsed if elapsed > 0 else 0
                    remaining_tokens = 1024 - token_count
                    eta = int(remaining_tokens / tps) if tps > 0 else 0

                    # Get item counts for ticker
                    items_count = {
                        "action_items": len(partial_items.get("action_items", [])),
                        "decisions": len(partial_items.get("decisions", [])),
                        "key_points": len(partial_items.get("key_points", [])),
                        "open_questions": len(partial_items.get("open_questions", []))
                    }

                    # Get last extracted item as snippet
                    last_item = ""
                    for category in ["action_items", "decisions", "key_points", "open_questions"]:
                        if partial_items.get(category):
                            last_item = partial_items[category][-1]
                            break

                    # Format progress ticker
                    ticker = format_progress_ticker(
                        current_window=window_id,
                        total_windows=total_windows,
                        window_tokens=input_tokens,
                        max_tokens=4096,  # Reference max for percentage
                        items_found=items_count,
                        tokens_per_sec=tps,
                        eta_seconds=eta,
                        current_snippet=last_item
                    )

                    # Q8: Option A - Show in "MODEL THINKING PROCESS" field
                    yield (ticker, thinking_content, partial_items, False)

        # Final parse
        if enable_reasoning and config.get("supports_reasoning"):
            thinking, remaining = parse_thinking_blocks(full_response)
            thinking_content = thinking or ""
            json_text = remaining
        else:
            json_text = full_response

        final_items = _try_parse_extraction_json(json_text)

        if not final_items:
            # JSON parsing failed - FAIL ENTIRE PIPELINE (strict mode)
            error_msg = f"Failed to parse JSON from window {window_id}. Response: {json_text[:200]}"
            tracer.log_extraction(
                window_id=window_id,
                extraction=None,
                llm_response=_sample_llm_response(full_response),
                error=error_msg
            )
            raise ValueError(error_msg)

        # Log successful extraction
        tracer.log_extraction(
            window_id=window_id,
            extraction=final_items,
            llm_response=_sample_llm_response(full_response),
            thinking=_sample_llm_response(thinking_content) if thinking_content else None,
            error=None
        )

        # Final ticker
        elapsed = time.time() - start_time
        tps = token_count / elapsed if elapsed > 0 else 0
        items_count = {k: len(v) for k, v in final_items.items()}

        ticker = format_progress_ticker(
            current_window=window_id,
            total_windows=total_windows,
            window_tokens=input_tokens,
            max_tokens=4096,
            items_found=items_count,
            tokens_per_sec=tps,
            eta_seconds=0,
            current_snippet="✅ Extraction complete"
        )

        yield (ticker, thinking_content, final_items, True)

    except Exception as e:
        # Log error and re-raise to fail entire pipeline
        tracer.log_extraction(
            window_id=window_id,
            extraction=None,
            llm_response=_sample_llm_response(full_response) if full_response else "",
            error=str(e)
        )
        raise
```
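
`_try_parse_extraction_json()` is called above but not specified in this plan. A minimal sketch of the intended tolerant parser: strip any markdown fences the model emits despite instructions, parse the outermost JSON object, and return `{}` while the stream is still mid-object (the exact salvage strategy is an implementation choice):

```python
import json
import re

EXPECTED_KEYS = ("action_items", "decisions", "key_points", "open_questions")

def _try_parse_extraction_json(text: str) -> dict:
    """Best-effort parse of (possibly partial) extraction output.
    Returns {} until a complete, well-formed JSON object is available."""
    text = re.sub(r"```(?:json)?", "", text)  # drop stray markdown fences
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    # Keep only the expected schema, coercing each category to list[str]
    out = {}
    for k in EXPECTED_KEYS:
        v = data.get(k, [])
        if isinstance(v, str):
            v = [v]
        out[k] = [str(x) for x in v if x]
    return out
```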
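`format_progress_ticker()` likewise appears only at call sites; a sketch matching the keyword arguments used above (the exact layout of the ticker string is a design choice this plan does not settle):

```python
def format_progress_ticker(current_window, total_windows, window_tokens,
                           max_tokens, items_found, tokens_per_sec,
                           eta_seconds, current_snippet=""):
    """One-line progress ticker for the Advanced mode UI (sketch)."""
    fill = min(100, int(100 * window_tokens / max_tokens)) if max_tokens else 0
    counts = " ".join(f"{name}:{n}" for name, n in items_found.items())
    snippet = current_snippet[:80] + "..." if len(current_snippet) > 80 else current_snippet
    return (
        f"🪟 Window {current_window}/{total_windows} "
        f"({window_tokens:,} tok, {fill}% of ref) | {counts} | "
        f"{tokens_per_sec:.1f} tok/s | ETA {eta_seconds}s | {snippet}"
    )
```
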
---

## Implementation Checklist

### Files to Create

- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/extraction.py` (~900 lines)
  - [ ] `NativeTokenizer` class
  - [ ] `EmbeddingModel` class + `EMBEDDING_MODELS` registry
  - [ ] `format_progress_ticker()` function
  - [ ] `stream_extract_from_window()` function (with reasoning support)
  - [ ] `deduplicate_items()` function
  - [ ] `stream_synthesize_executive_summary()` function

### Files to Modify

- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/__init__.py`
  - [ ] Remove `filter_validated_items` import/export

- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/trace.py` (see the sketch after this list)
  - [ ] Add `log_extraction()` method
  - [ ] Add `log_deduplication()` method
  - [ ] Add `log_synthesis()` method

- [ ] `/home/luigi/tiny-scribe/app.py` (~800 lines added/modified)
  - [ ] Add `EXTRACTION_MODELS` registry (13 models)
  - [ ] Add `SYNTHESIS_MODELS` reference
  - [ ] Add `get_model_config()` function
  - [ ] Add `load_model_for_role()` function
  - [ ] Add `unload_model()` function
  - [ ] Add `build_extraction_system_prompt()` function
  - [ ] Add `summarize_advanced()` generator function
  - [ ] Add Advanced mode UI controls
  - [ ] Add reasoning visibility logic
  - [ ] Add model info display functions
  - [ ] Update `download_summary_json()` for trace embedding

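The three new `Tracer` methods share one shape: append a typed JSONL record. A minimal sketch, assuming the existing `Tracer` holds an open file handle (the `_fh` attribute and `_write` helper are assumptions about its internals; the field names follow this plan's call sites):

```python
import json
import time

class Tracer:  # sketch of the three additions only; the class itself already exists
    def _write(self, record: dict) -> None:
        record["ts"] = time.time()
        self._fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        self._fh.flush()

    def log_extraction(self, window_id, extraction, llm_response,
                       thinking=None, error=None):
        self._write({"stage": "extraction", "window_id": window_id,
                     "items": extraction, "response_sample": llm_response,
                     "thinking_sample": thinking, "error": error})

    def log_deduplication(self, before_counts, after_counts, threshold):
        self._write({"stage": "deduplication", "before": before_counts,
                     "after": after_counts, "threshold": threshold})

    def log_synthesis(self, model_key, summary_sample, error=None):
        self._write({"stage": "synthesis", "model": model_key,
                     "summary_sample": summary_sample, "error": error})
```
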
### Code Statistics

| Metric | Count |
|--------|-------|
| **New Lines** | ~1,800 |
| **Modified Lines** | ~60 |
| **Removed Lines** | ~2 |
| **New Functions** | 12 |
| **New Classes** | 2 |
| **UI Controls** | 11 |

---

## Testing Strategy

### Phase 1: Model Registry Validation

```bash
python -c "
from app import EXTRACTION_MODELS, SYNTHESIS_MODELS
from meeting_summarizer.extraction import EMBEDDING_MODELS

assert len(EXTRACTION_MODELS) == 13, 'Extraction models count mismatch'
assert len(EMBEDDING_MODELS) == 4, 'Embedding models count mismatch'
assert len(SYNTHESIS_MODELS) == 16, 'Synthesis models count mismatch'

# Verify independent settings
ext_qwen = EXTRACTION_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
syn_qwen = SYNTHESIS_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
assert ext_qwen == 0.3, f'Extraction temp wrong: {ext_qwen}'
assert syn_qwen == 0.6, f'Synthesis temp wrong: {syn_qwen}'

print('✅ All model registries validated!')
"
```

### Phase 2: UI Control Validation

**Manual Checks:**
1. Select "Advanced" mode
2. Verify 3 dropdowns show correct counts (13, 4, 16)
3. Verify default models selected
4. Adjust extraction_n_ctx slider (2K → 8K)
5. Select qwen3_600m_q4 for extraction → reasoning checkbox appears
6. Select lfm2_extract_1.2b for extraction → reasoning checkbox hidden
7. Select qwen3_4b_thinking_q3 for synthesis → reasoning locked ON
8. Verify model info panels update on selection

### Phase 3: Pipeline Test - min.txt (Quick)

**Configuration:**
- Extraction: `lfm2_extract_1.2b` (default)
- Extraction n_ctx: 4096 (default)
- Embedding: `granite-107m` (default)
- Synthesis: `qwen3_1.7b_q4` (default)
- Similarity threshold: 0.85 (default)

**Expected:**
- 1 window created
- ~2-4 items extracted
- 0-1 duplicates removed
- Final summary generated
- Total time: ~30-60s
- Download JSON contains trace

### Phase 4: Pipeline Test - Reasoning Models

**Configuration:**
- Extraction: `qwen3_600m_q4`
- ☑ Enable Reasoning for Extraction (test hybrid model)
- Extraction n_ctx: 2048 (smaller windows)
- Embedding: `granite-278m` (test balanced embedding)
- Synthesis: `qwen3_1.7b_q4`
- ☑ Enable Reasoning for Synthesis

**Expected:**
- More windows (~4-6 with 2K context)
- "MODEL THINKING PROCESS" shows extraction thinking + ticker
- ~10-15 items extracted
- ~2-4 duplicates removed
- Final summary with thinking blocks
- Total time: ~2-3 min

### Phase 5: Pipeline Test - full.txt (Production)

**Configuration:**
- Extraction: `lfm2_extract_1.2b` (high quality)
- Extraction n_ctx: 4096 (default)
- Embedding: `qwen-600m` (highest quality)
- Synthesis: `qwen3_4b_thinking_q3` (4B thinking model)
- Output language: zh-TW (test Chinese)

**Expected:**
- ~3-5 windows (4K context)
- ~40-60 items extracted
- ~10-15 duplicates removed
- Final summary in Traditional Chinese
- Total time: ~5-8 min
- Download JSON with embedded trace (~1-2MB)

### Phase 6: Error Handling Test (Q10: Option C)

**Scenarios:**
1. Disconnect internet during model download
2. Manually corrupt model cache
3. Use invalid model repo_id in EXTRACTION_MODELS

**Expected behavior:**
- Error message displayed in UI: "❌ Failed to load lfm2_extract_1.2b..."
- Pipeline stops (doesn't try fallback)
- User can select different model and retry
- Trace file saved with error details

---

## Implementation Priority

### Suggested Implementation Sequence (13-19 hours total)

**1. Model Registries (1-2 hours)**
- [ ] Add `EXTRACTION_MODELS` to `app.py`
- [ ] Add `SYNTHESIS_MODELS` reference
- [ ] Add `EMBEDDING_MODELS` to `extraction.py`
- [ ] Validate with smoke test

**2. Core Infrastructure (2-3 hours)**
- [ ] Implement `get_model_config()`
- [ ] Implement `load_model_for_role()` with user_n_ctx support
- [ ] Implement `unload_model()`
- [ ] Implement `build_extraction_system_prompt()` with reasoning support
- [ ] Update `trace.py` with 3 new logging methods
- [ ] Update `__init__.py`

**3. Extraction Module (3-4 hours)**
- [ ] Implement `NativeTokenizer` class
- [ ] Implement `EmbeddingModel` class
- [ ] Implement `format_progress_ticker()`
- [ ] Implement `stream_extract_from_window()` with reasoning parsing
- [ ] Implement `deduplicate_items()`
- [ ] Implement `stream_synthesize_executive_summary()`

**4. UI Integration (2-3 hours)**
- [ ] Add Advanced mode controls to Gradio interface
- [ ] Implement reasoning checkbox visibility logic
- [ ] Implement model info display functions
- [ ] Wire up all event handlers
- [ ] Test UI responsiveness

**5. Pipeline Orchestration (3-4 hours)**
- [ ] Implement `summarize_advanced()` generator function
- [ ] Sequential model loading/unloading logic
- [ ] Error handling with graceful failures
- [ ] Progress ticker updates
- [ ] Trace embedding in download JSON

**6. Testing & Validation (2-3 hours)**
- [ ] Run all test phases (min.txt → full.txt)
- [ ] Validate reasoning models behavior
- [ ] Test error handling scenarios
- [ ] Performance optimization (if needed)

---

## Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| **LFM2-Extract models don't exist on HuggingFace** | Medium | High | Verify repo availability before implementation; prepare fallback to qwen3_600m_q4 |
| **Memory overflow on HF Spaces Free Tier** | Low | High | Sequential loading/unloading tested; add memory monitoring |
| **Reasoning output breaks JSON parsing** | Medium | Medium | Robust thinking block parsing with fallback; strict error handling |
| **User n_ctx slider causes OOM** | Low | Medium | Cap at MAX_USABLE_CTX (32K); show warning if user sets too high |
| **Embedding models slow down pipeline** | Medium | Low | Default to granite-107m (fastest); user can upgrade if needed |
| **Trace file too large** | Low | Low | Response sampling (400 chars) already implemented; compress if >5MB |

---

## Appendix: Model Comparison Tables

### Extraction Models (13)

| Model | Size | Context | Reasoning | Settings |
|-------|------|---------|-----------|----------|
| falcon_h1_100m | 100M | 32K | No | temp=0.2 |
| gemma3_270m | 270M | 32K | No | temp=0.3 |
| ernie_300m | 300M | 131K | No | temp=0.2 |
| granite_350m | 350M | 32K | No | temp=0.1 |
| lfm2_350m | 350M | 32K | No | temp=0.2 |
| bitcpm4_500m | 500M | 128K | No | temp=0.2 |
| hunyuan_500m | 500M | 256K | No | temp=0.2 |
| qwen3_600m_q4 | 600M | 32K | **Hybrid** | temp=0.3 |
| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.3 |
| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.2 |
| qwen3_1.7b_q4 | 1.7B | 32K | **Hybrid** | temp=0.3 |
| lfm2_extract_350m | 350M | 32K | No | temp=0.2 |
| lfm2_extract_1.2b | 1.2B | 32K | No | temp=0.2 |

### Synthesis Models (16)

| Model | Size | Context | Reasoning | Settings |
|-------|------|---------|-----------|----------|
| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.7 |
| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.1 |
| qwen3_1.7b_q4 | 1.7B | 32K | Hybrid | temp=0.6 |
| granite_3_3_2b_q4 | 2B | 128K | No | temp=0.7 |
| youtu_llm_2b_q8 | 2B | 128K | Hybrid | temp=0.7 |
| lfm2_2_6b_transcript | 2.6B | 32K | No | temp=0.6 |
| breeze_3b_q4 | 3B | 32K | No | temp=0.6 |
| granite_3_1_3b_q4 | 3B | 128K | No | temp=0.7 |
| qwen3_4b_thinking_q3 | 4B | 256K | **Thinking-only** | temp=0.6 |
| granite4_tiny_q3 | 7B | 128K | No | temp=0.7 |
| ernie_21b_pt_q1 | 21B | 128K | No | temp=0.7 |
| ernie_21b_thinking_q1 | 21B | 128K | **Thinking-only** | temp=0.8 |
| glm_4_7_flash_reap_30b | 30B | 128K | **Thinking-only** | temp=0.6 |
| glm_4_7_flash_30b_iq2 | 30B | 128K | No | temp=0.6 |
| qwen3_30b_thinking_q1 | 30B | 256K | **Thinking-only** | temp=0.6 |
| qwen3_30b_instruct_q1 | 30B | 256K | No | temp=0.6 |

### Embedding Models (4)

| Model | Size | Dimension | Speed | Quality |
|-------|------|-----------|-------|---------|
| granite-107m | 107M | 384 | Fastest | Good |
| granite-278m | 278M | 768 | Balanced | Better |
| gemma-300m | 300M | 768 | Fast | Good |
| qwen-600m | 600M | 1024 | Slower | Best |

---

## Next Steps

Once approved, implementation will proceed in the order outlined in the Priority section. All code will be committed with descriptive messages referencing this plan document.

**Ready for implementation approval.**

---

**Document Version:** 1.0
**Last Updated:** 2026-02-04
**Author:** Claude (Anthropic)
**Reviewer:** [Pending]