# Advanced 2-Stage Meeting Summarization - Complete Implementation Plan

**Project:** Tiny Scribe - Advanced Mode
**Date:** 2026-02-04
**Status:** Ready for Implementation
**Estimated Effort:** 13-19 hours

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [Design Decisions](#design-decisions)
3. [Model Registries](#model-registries)
4. [UI Implementation](#ui-implementation)
5. [Model Management Infrastructure](#model-management-infrastructure)
6. [Extraction Pipeline](#extraction-pipeline)
7. [Implementation Checklist](#implementation-checklist)
8. [Testing Strategy](#testing-strategy)
9. [Implementation Priority](#implementation-priority)
10. [Risk Assessment](#risk-assessment)

---
## Executive Summary

This plan details the implementation of a **3-model Advanced Summarization Pipeline** for Tiny Scribe, featuring:

- ✅ **3 independent model registries** (Extraction, Embedding, Synthesis)
- ✅ **User-configurable extraction context** (2K-8K tokens, default 4K)
- ✅ **Reasoning/thinking model support** with independent toggles per stage
- ✅ **Sequential model loading** for memory efficiency
- ✅ **Bilingual support** (English + Traditional Chinese zh-TW)
- ✅ **Fail-fast error handling** with graceful UI feedback
- ✅ **Complete independence** from Standard mode

### Architecture

```
Stage 1: EXTRACTION    → Parse transcript → Create windows → Extract JSON items
Stage 2: DEDUPLICATION → Compute embeddings → Remove semantic duplicates
Stage 3: SYNTHESIS     → Generate executive summary from deduplicated items
```
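As a data-flow illustration only (toy stand-ins for the real models; these function names are hypothetical and not from the codebase), the three stages above compose as a simple extract → dedup → synthesize pipeline:

```python
from typing import Dict, List

def extract(windows: List[str]) -> List[Dict[str, List[str]]]:
    # Stage 1 stand-in: each window yields a dict of categorized items.
    return [{"key_points": [w.strip()]} for w in windows]

def deduplicate(items: List[str]) -> List[str]:
    # Stage 2 stand-in: exact-match dedup in place of embedding similarity.
    seen, unique = set(), []
    for item in items:
        key = item.lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

def synthesize(items: List[str]) -> str:
    # Stage 3 stand-in: join items instead of LLM summary generation.
    return "Summary: " + "; ".join(items)

def run_pipeline(windows: List[str]) -> str:
    extracted = extract(windows)
    flat = [item for d in extracted for values in d.values() for item in values]
    return synthesize(deduplicate(flat))
```

In the real pipeline each stand-in is replaced by a model call, and each stage loads and unloads its model sequentially (see Model Management Infrastructure).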
### Key Metrics

| Metric | Value |
|--------|-------|
| **New Code** | ~1,800 lines |
| **Modified Code** | ~60 lines |
| **Total Models** | 33 unique (13 + 4 + 16) |
| **Default Models** | `qwen3_1.7b_q4`, `granite-107m`, `qwen3_1.7b_q4` |
| **Memory Strategy** | Sequential load/unload (safe for HF Spaces Free Tier) |
---

## Design Decisions
### Q1: Extraction Model List Composition (REVISION)

**Decision:** Option A - 11 models (≤1.7B), excluding the LFM2-Extract models

**Rationale:** The LFM2-Extract specialized models were dropped after testing showed an 85.7% failure rate due to hallucination and schema non-compliance. They were replaced with Qwen3 models, which support reasoning and handle Chinese content better.
### Q1a: Synthesis Model Selection (NEW)

**Decision:** Restrict to models ≤4GB (max 4B parameters)

**Rationale:** The HF Spaces Free Tier has only 16GB RAM; 7B+ models will OOM. Remove `ernie_21b`, `glm_4_7_flash_reap_30b`, `qwen3_30b_thinking_q1`, and `qwen3_30b_instruct_q1`.
### Q2: Independence from Standard Mode

**Decision:** Option B - Both Extraction AND Synthesis fully independent from `AVAILABLE_MODELS`

**Rationale:** Full independence prevents parameter cross-contamination; synthesis models have their own optimized temperatures (0.7-0.9), separate from Standard mode.

### Q3: Extraction n_ctx UI Control

**Decision:** Option A - Slider (2K-8K, step 1024, default 4K)

**Rationale:** Maximum flexibility for users to balance precision vs speed.

### Q4: Default Models

**Decision:**
- Extraction: `qwen3_1.7b_q4` (supports reasoning, better Chinese understanding)
- Embedding: `granite-107m` (fastest, good enough)
- Synthesis: `qwen3_1.7b_q4` (same model as extraction, run with synthesis-tuned settings)

**Rationale:** Balanced defaults optimized for quality and speed. Qwen3 1.7B was chosen over LFM2-Extract based on empirical testing showing a superior extraction success rate and schema compliance.

### Q5: Model Key Naming

**Decision:** Keep the same keys (no prefix like `adv_synth_`)

**Rationale:** Simpler, less duplication, clear role-based config resolution.

### Q6: Model Overlap Between Stages

**Decision:** Allow overlap, with independent settings per role

**Rationale:** The same model can serve as both extraction and synthesis with different parameters.

### Q7: Reasoning Checkbox UI Flow

**Decision:** Option B - Separate checkboxes for extraction and synthesis

**Rationale:** Independent control per stage, clearer user intent.

### Q8: Thinking Block Display

**Decision:** Option A - Reuse the "MODEL THINKING PROCESS" field

**Rationale:** Consistent with Standard mode; no UI layout changes needed.

### Q9: Window Token Counting with User n_ctx

**Decision:** Option A - Strict adherence to the user's slider value

**Rationale:** Respect the user's explicit choice; they may want larger or smaller windows.

### Q10: Model Loading Error Handling

**Decision:** Option C - Graceful failure with a UI error message

**Rationale:** Most user-friendly; allows retry with a different model.

---
## Model Registries

### 1. EXTRACTION_MODELS (13 models - FINAL)

**Location:** `/home/luigi/tiny-scribe/app.py`

**Features:**
- ✅ Independent from `AVAILABLE_MODELS`
- ✅ User-adjustable `n_ctx` (2K-8K, default 4K)
- ✅ Extraction-optimized settings (temp 0.1-0.3)
- ✅ 2 hybrid models with reasoning toggle
- ✅ All models verified on HuggingFace

**Complete Registry** (note: the two LFM2-Extract entries below are slated for removal per Q1, which trims the list to 11 models):
```python
EXTRACTION_MODELS = {
    "falcon_h1_100m": {
        "name": "Falcon-H1 100M",
        "repo_id": "mradermacher/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "100M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "gemma3_270m": {
        "name": "Gemma-3 270M",
        "repo_id": "unsloth/gemma-3-270m-it-qat-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "270M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 40,
            "repeat_penalty": 1.0,
        },
    },
    "ernie_300m": {
        "name": "ERNIE-4.5 0.3B (131K Context)",
        "repo_id": "unsloth/ERNIE-4.5-0.3B-PT-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "300M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "granite_350m": {
        "name": "Granite-4.0 350M",
        "repo_id": "unsloth/granite-4.0-h-350m-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.1,
            "top_p": 0.95,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_350m": {
        "name": "LFM2 350M",
        "repo_id": "LiquidAI/LFM2-350M-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 40,
            "repeat_penalty": 1.0,
        },
    },
    "bitcpm4_500m": {
        "name": "BitCPM4 0.5B (128K Context)",
        "repo_id": "openbmb/BitCPM4-0.5B-GGUF",
        "filename": "*q4_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "500M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "hunyuan_500m": {
        "name": "Hunyuan 0.5B (256K Context)",
        "repo_id": "mradermacher/Hunyuan-0.5B-Instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 262144,
        "default_n_ctx": 4096,
        "params_size": "500M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "qwen3_600m_q4": {
        "name": "Qwen3 0.6B Q4 (32K Context)",
        "repo_id": "unsloth/Qwen3-0.6B-GGUF",
        "filename": "*Q4_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "600M",
        "supports_reasoning": True,  # ← HYBRID MODEL
        "supports_toggle": True,     # ← User can toggle reasoning
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 20,
            "repeat_penalty": 1.0,
        },
    },
    "granite_3_1_1b_q8": {
        "name": "Granite 3.1 1B-A400M Instruct (128K Context)",
        "repo_id": "bartowski/granite-3.1-1b-a400m-instruct-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 131072,
        "default_n_ctx": 4096,
        "params_size": "1B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "falcon_h1_1.5b_q4": {
        "name": "Falcon-H1 1.5B Q4",
        "repo_id": "unsloth/Falcon-H1-1.5B-Deep-Instruct-GGUF",
        "filename": "*Q4_K_M.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.5B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.2,
            "top_p": 0.9,
            "top_k": 30,
            "repeat_penalty": 1.0,
        },
    },
    "qwen3_1.7b_q4": {
        "name": "Qwen3 1.7B Q4 (32K Context)",
        "repo_id": "unsloth/Qwen3-1.7B-GGUF",
        "filename": "*Q4_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.7B",
        "supports_reasoning": True,  # ← HYBRID MODEL
        "supports_toggle": True,     # ← User can toggle reasoning
        "inference_settings": {
            "temperature": 0.3,
            "top_p": 0.9,
            "top_k": 20,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_extract_350m": {
        "name": "LFM2-Extract 350M (Specialized)",
        "repo_id": "LiquidAI/LFM2-350M-Extract-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "350M",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.0,  # ← Greedy decoding per Liquid AI docs
            "top_p": 1.0,
            "top_k": 0,
            "repeat_penalty": 1.0,
        },
    },
    "lfm2_extract_1.2b": {
        "name": "LFM2-Extract 1.2B (Specialized)",
        "repo_id": "LiquidAI/LFM2-1.2B-Extract-GGUF",
        "filename": "*Q8_0.gguf",
        "max_context": 32768,
        "default_n_ctx": 4096,
        "params_size": "1.2B",
        "supports_reasoning": False,
        "supports_toggle": False,
        "inference_settings": {
            "temperature": 0.0,  # ← Greedy decoding per Liquid AI docs
            "top_p": 1.0,
            "top_k": 0,
            "repeat_penalty": 1.0,
        },
    },
}
```
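A registry this size is easy to break with a typo. A small invariant check (a sketch, not part of the planned codebase) could run at import time to catch malformed entries before they reach the UI:

```python
def validate_registry(models: dict) -> list:
    """Return a list of human-readable problems found in a model registry."""
    required = {"name", "repo_id", "filename", "max_context",
                "default_n_ctx", "inference_settings"}
    problems = []
    for key, cfg in models.items():
        missing = required - cfg.keys()
        if missing:
            problems.append(f"{key}: missing fields {sorted(missing)}")
            continue
        if cfg["default_n_ctx"] > cfg["max_context"]:
            problems.append(f"{key}: default_n_ctx exceeds max_context")
        if cfg.get("supports_toggle") and not cfg.get("supports_reasoning"):
            problems.append(f"{key}: toggle enabled without reasoning support")
    return problems
```

Running `validate_registry(EXTRACTION_MODELS)` (and the other two registries) in a unit test keeps all three registries consistent as models are added or removed.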
**Hybrid Models (Reasoning Support):**
- `qwen3_600m_q4` - 600M, user-toggleable reasoning
- `qwen3_1.7b_q4` - 1.7B, user-toggleable reasoning

---
### 2. SYNTHESIS_MODELS (16 models)

**Location:** `/home/luigi/tiny-scribe/app.py`

**Features:**
- ✅ Fully independent from `AVAILABLE_MODELS` (no shared references)
- ✅ Synthesis-optimized temperatures (0.7-0.9, higher than extraction)
- ✅ 2 hybrid + 4 thinking-only models with reasoning support
- ✅ Default: `qwen3_1.7b_q4`

**Registry Definition** (note: per Q1a, the four >4GB entries below are slated for removal):
```python
# FULLY INDEPENDENT from AVAILABLE_MODELS (no shared references)
# Synthesis-optimized settings: higher temperatures (0.7-0.9) for creative summary generation
SYNTHESIS_MODELS = {
    "granite_3_1_1b_q8": {..., "temperature": 0.8},
    "falcon_h1_1.5b_q4": {..., "temperature": 0.7},
    "qwen3_1.7b_q4": {..., "temperature": 0.8},        # DEFAULT
    "granite_3_3_2b_q4": {..., "temperature": 0.8},
    "youtu_llm_2b_q8": {..., "temperature": 0.8},      # reasoning toggle
    "lfm2_2_6b_transcript": {..., "temperature": 0.7},
    "breeze_3b_q4": {..., "temperature": 0.7},
    "granite_3_1_3b_q4": {..., "temperature": 0.8},
    "qwen3_4b_thinking_q3": {..., "temperature": 0.8},  # thinking-only
    "granite4_tiny_q3": {..., "temperature": 0.8},
    "ernie_21b_pt_q1": {..., "temperature": 0.8},
    "ernie_21b_thinking_q1": {..., "temperature": 0.9},  # thinking-only
    "glm_4_7_flash_reap_30b": {..., "temperature": 0.8},  # thinking-only
    "glm_4_7_flash_30b_iq2": {..., "temperature": 0.7},
    "qwen3_30b_thinking_q1": {..., "temperature": 0.8},  # thinking-only
    "qwen3_30b_instruct_q1": {..., "temperature": 0.7},
}
```

**Reasoning Models:**
- Hybrid (toggleable): `qwen3_1.7b_q4`, `youtu_llm_2b_q8`
- Thinking-only: `qwen3_4b_thinking_q3`, `ernie_21b_thinking_q1`, `glm_4_7_flash_reap_30b`, `qwen3_30b_thinking_q1`

---
### 3. EMBEDDING_MODELS (4 models)

**Location:** `/home/luigi/tiny-scribe/meeting_summarizer/extraction.py`

**Features:**
- ✅ Dedicated embedding models (not in `AVAILABLE_MODELS`)
- ✅ Used exclusively for the deduplication phase
- ✅ Range: 384-dim to 1024-dim
- ✅ Default: `granite-107m`

**Registry:**

```python
EMBEDDING_MODELS = {
    "granite-107m": {
        "name": "Granite 107M Multilingual (384-dim)",
        "repo_id": "ibm-granite/granite-embedding-107m-multilingual",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 384,
        "max_context": 2048,
        "description": "Fastest, multilingual, good for quick deduplication",
    },
    "granite-278m": {
        "name": "Granite 278M Multilingual (768-dim)",
        "repo_id": "ibm-granite/granite-embedding-278m-multilingual",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 768,
        "max_context": 2048,
        "description": "Balanced speed/quality, multilingual",
    },
    "gemma-300m": {
        "name": "Embedding Gemma 300M (768-dim)",
        "repo_id": "unsloth/embeddinggemma-300m-GGUF",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 768,
        "max_context": 2048,
        "description": "Google embedding model, strong semantics",
    },
    "qwen-600m": {
        "name": "Qwen3 Embedding 600M (1024-dim)",
        "repo_id": "Qwen/Qwen3-Embedding-0.6B-GGUF",
        "filename": "*Q8_0.gguf",
        "embedding_dim": 1024,
        "max_context": 2048,
        "description": "Highest quality, best for critical dedup",
    },
}
```
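Stage 2's semantic deduplication reduces to pairwise cosine similarity against the user's threshold. A minimal sketch (plain Python in place of real GGUF embeddings, which are assumed to arrive as float vectors paired with their item text):

```python
import math
from typing import List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dedup_by_embedding(items: List[Tuple[str, List[float]]],
                       threshold: float = 0.85) -> List[str]:
    """Keep the first occurrence; drop later items too similar to any kept one."""
    kept: List[Tuple[str, List[float]]] = []
    for text, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]
```

This is O(n²) in the number of items, which is fine at meeting scale (tens to low hundreds of extracted items); the real implementation would batch embedding computation through the selected GGUF model first.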
---

## UI Implementation

### Advanced Mode Controls (Option B: Separate Reasoning Checkboxes)

**Location:** `/home/luigi/tiny-scribe/app.py`, Gradio interface section
```python
# ===== ADVANCED MODE CONTROLS =====
# Uses gr.TabItem inside gr.Tabs (not gr.Group with visibility toggle)
with gr.TabItem("🧠 Advanced Mode (3-Model Pipeline)"):
    # Model Selection Row
    with gr.Row():
        extraction_model = gr.Dropdown(
            choices=list(EXTRACTION_MODELS.keys()),
            value="qwen3_1.7b_q4",  # ⭐ DEFAULT
            label="🔍 Stage 1: Extraction Model (≤1.7B)",
            info="Extracts structured items (action_items, decisions, key_points, questions) from windows"
        )
        embedding_model = gr.Dropdown(
            choices=list(EMBEDDING_MODELS.keys()),
            value="granite-107m",  # ⭐ DEFAULT
            label="🧬 Stage 2: Embedding Model",
            info="Computes semantic embeddings for deduplication across categories"
        )
        synthesis_model = gr.Dropdown(
            choices=list(SYNTHESIS_MODELS.keys()),
            value="qwen3_1.7b_q4",  # ⭐ DEFAULT
            label="✨ Stage 3: Synthesis Model (1B-30B)",
            info="Generates final executive summary from deduplicated items"
        )

    # Extraction Parameters Row
    with gr.Row():
        extraction_n_ctx = gr.Slider(
            minimum=2048,
            maximum=8192,
            step=1024,
            value=4096,  # ⭐ DEFAULT 4K
            label="🪟 Extraction Context Window (n_ctx)",
            info="Smaller = more windows (higher precision), Larger = fewer windows (faster processing)"
        )
        overlap_turns = gr.Slider(
            minimum=1,
            maximum=5,
            step=1,
            value=2,
            label="🔄 Window Overlap (speaker turns)",
            info="Number of speaker turns shared between adjacent windows (reduces information loss)"
        )

    # Deduplication Parameters Row
    with gr.Row():
        similarity_threshold = gr.Slider(
            minimum=0.70,
            maximum=0.95,
            step=0.01,
            value=0.85,
            label="🎯 Deduplication Similarity Threshold",
            info="Items with cosine similarity above this are considered duplicates (higher = stricter)"
        )

    # SEPARATE REASONING CONTROLS (Q7: Option B)
    with gr.Row():
        enable_extraction_reasoning = gr.Checkbox(
            value=False,
            visible=False,  # Conditional visibility based on extraction model
            label="🧠 Enable Reasoning for Extraction",
            info="Use thinking process before JSON output (Qwen3 hybrid models only)"
        )
        enable_synthesis_reasoning = gr.Checkbox(
            value=True,
            visible=True,  # Conditional visibility based on synthesis model
            label="🧠 Enable Reasoning for Synthesis",
            info="Use thinking process for final summary generation"
        )

    # Output Settings Row
    with gr.Row():
        adv_output_language = gr.Radio(
            choices=["en", "zh-TW"],
            value="en",
            label="🌐 Output Language",
            info="Extraction auto-detects language from transcript, synthesis uses this setting"
        )
        adv_max_tokens = gr.Slider(
            minimum=512,
            maximum=4096,
            step=128,
            value=2048,
            label="📏 Max Synthesis Tokens",
            info="Maximum tokens for final executive summary"
        )

    # Logging Control
    enable_detailed_logging = gr.Checkbox(
        value=True,
        label="📝 Enable Detailed Trace Logging",
        info="Save JSONL trace file (embedded in download JSON) for debugging pipeline"
    )

    # Model Info Accordion
    with gr.Accordion("📋 Model Details & Settings", open=False):
        with gr.Row():
            with gr.Column():
                extraction_model_info = gr.Markdown("**Extraction Model**\n\nSelect a model to see details")
            with gr.Column():
                embedding_model_info = gr.Markdown("**Embedding Model**\n\nSelect a model to see details")
            with gr.Column():
                synthesis_model_info = gr.Markdown("**Synthesis Model**\n\nSelect a model to see details")
```
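The `extraction_n_ctx` and `overlap_turns` sliders drive the windowing step of Stage 1. A sketch of turn-based windowing (word count stands in for tokens here; the real code is assumed to use the model's own tokenizer):

```python
from typing import List

def build_windows(turns: List[str], budget_tokens: int,
                  overlap_turns: int = 2) -> List[List[str]]:
    """Greedily pack speaker turns into windows of roughly budget_tokens,
    re-seeding each new window with the last overlap_turns turns."""
    windows: List[List[str]] = []
    current: List[str] = []
    used = 0
    for turn in turns:
        cost = len(turn.split())  # crude token estimate for illustration
        if current and used + cost > budget_tokens:
            windows.append(current)
            current = current[-overlap_turns:]  # carry overlap forward
            used = sum(len(t.split()) for t in current)
        current.append(turn)
        used += cost
    if current:
        windows.append(current)
    return windows
```

A smaller budget yields more, tighter windows (the "higher precision" end of the slider); a larger budget yields fewer windows and fewer model calls.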
---

### Conditional Reasoning Checkbox Visibility Logic

```python
def update_extraction_reasoning_visibility(model_key):
    """Show/hide extraction reasoning checkbox based on model capabilities."""
    config = EXTRACTION_MODELS.get(model_key, {})
    supports_toggle = config.get("supports_toggle", False)
    if supports_toggle:
        # Hybrid model (qwen3_600m_q4, qwen3_1.7b_q4)
        return gr.update(
            visible=True,
            value=False,
            interactive=True,
            label="🧠 Enable Reasoning for Extraction"
        )
    elif config.get("supports_reasoning", False) and not supports_toggle:
        # Thinking-only model (none currently in extraction, but future-proof)
        return gr.update(
            visible=True,
            value=True,
            interactive=False,
            label="🧠 Reasoning Mode for Extraction (Always On)"
        )
    else:
        # Non-reasoning model
        return gr.update(visible=False, value=False)


def update_synthesis_reasoning_visibility(model_key):
    """Show/hide synthesis reasoning checkbox based on model capabilities."""
    # Reuse existing logic from Standard mode
    return update_reasoning_visibility(model_key)  # Existing function


# Wire up event handlers
extraction_model.change(
    fn=update_extraction_reasoning_visibility,
    inputs=[extraction_model],
    outputs=[enable_extraction_reasoning]
)
synthesis_model.change(
    fn=update_synthesis_reasoning_visibility,
    inputs=[synthesis_model],
    outputs=[enable_synthesis_reasoning]
)
```
---

### Model Info Display Functions

```python
def get_extraction_model_info(model_key):
    """Generate markdown info for extraction model."""
    config = EXTRACTION_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})
    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\n**Reasoning:** Thinking-only (always on)"
    return f"""**{config.get('name', 'Unknown')}**
**Size:** {config.get('params_size', 'N/A')}
**Max Context:** {config.get('max_context', 0):,} tokens
**Default n_ctx:** {config.get('default_n_ctx', 4096):,} tokens (user-adjustable via slider)
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}

**Extraction-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (deterministic for JSON)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


def get_embedding_model_info(model_key):
    """Generate markdown info for embedding model."""
    from meeting_summarizer.extraction import EMBEDDING_MODELS
    config = EMBEDDING_MODELS.get(model_key, {})
    return f"""**{config.get('name', 'Unknown')}**
**Embedding Dimension:** {config.get('embedding_dim', 'N/A')}
**Context:** {config.get('max_context', 0):,} tokens
**Repository:** `{config.get('repo_id', 'N/A')}`
**Description:** {config.get('description', 'N/A')}
"""


def get_synthesis_model_info(model_key):
    """Generate markdown info for synthesis model."""
    config = SYNTHESIS_MODELS.get(model_key, {})
    settings = config.get("inference_settings", {})
    reasoning_support = ""
    if config.get("supports_toggle"):
        reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
    elif config.get("supports_reasoning"):
        reasoning_support = "\n**Reasoning:** Thinking-only (always on)"
    return f"""**{config.get('name', 'Unknown')}**
**Max Context:** {config.get('max_context', 0):,} tokens
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}

**Synthesis-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (synthesis-optimized, independent of Standard mode)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""


# Wire up info update handlers
extraction_model.change(
    fn=get_extraction_model_info,
    inputs=[extraction_model],
    outputs=[extraction_model_info]
)
embedding_model.change(
    fn=get_embedding_model_info,
    inputs=[embedding_model],
    outputs=[embedding_model_info]
)
synthesis_model.change(
    fn=get_synthesis_model_info,
    inputs=[synthesis_model],
    outputs=[synthesis_model_info]
)
```
---

## Model Management Infrastructure

### Role-Aware Configuration Resolver

```python
def get_model_config(model_key: str, model_role: str) -> Dict[str, Any]:
    """
    Get model configuration based on role.

    Ensures the same model (e.g., qwen3_1.7b_q4) uses DIFFERENT settings
    for extraction vs synthesis.

    Args:
        model_key: Model identifier (e.g., "qwen3_1.7b_q4")
        model_role: "extraction" or "synthesis"

    Returns:
        Model configuration dict with role-specific settings

    Raises:
        ValueError: If model_key is not available for the specified role
    """
    if model_role == "extraction":
        if model_key not in EXTRACTION_MODELS:
            available = ", ".join(list(EXTRACTION_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for extraction role. "
                f"Available: {available}"
            )
        return EXTRACTION_MODELS[model_key]
    elif model_role == "synthesis":
        if model_key not in SYNTHESIS_MODELS:
            available = ", ".join(list(SYNTHESIS_MODELS.keys())[:3]) + "..."
            raise ValueError(
                f"Model '{model_key}' not available for synthesis role. "
                f"Available: {available}"
            )
        return SYNTHESIS_MODELS[model_key]
    else:
        raise ValueError(
            f"Unknown model role: '{model_role}'. "
            f"Must be 'extraction' or 'synthesis'"
        )
```
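Q6's "same model, independent settings" behavior falls directly out of this resolver: the same key can appear in both registries with different parameters. A toy demonstration (miniature stand-in registries, not the real ones):

```python
# Hypothetical miniature registries for illustration only
EXTRACTION_TOY = {"qwen": {"inference_settings": {"temperature": 0.3}}}
SYNTHESIS_TOY = {"qwen": {"inference_settings": {"temperature": 0.8}}}

def resolve(model_key: str, role: str) -> dict:
    """Resolve a model key to its role-specific config, as in get_model_config."""
    registry = {"extraction": EXTRACTION_TOY, "synthesis": SYNTHESIS_TOY}.get(role)
    if registry is None:
        raise ValueError(f"Unknown model role: {role!r}")
    if model_key not in registry:
        raise ValueError(f"{model_key!r} not available for {role}")
    return registry[model_key]
```

The same key `"qwen"` resolves to temperature 0.3 for extraction and 0.8 for synthesis, with no state shared between the two roles.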
---

### Role-Aware Model Loader (Q9: Option A - Respect user's n_ctx choice)

```python
def load_model_for_role(
    model_key: str,
    model_role: str,
    n_threads: int = 2,
    user_n_ctx: Optional[int] = None,  # For extraction, from slider
) -> Tuple[Llama, str]:
    """
    Load a model with role-specific configuration.

    Args:
        model_key: Model identifier
        model_role: "extraction" or "synthesis"
        n_threads: CPU threads
        user_n_ctx: User-specified n_ctx (extraction only, from slider)

    Returns:
        (loaded_model, info_message)

    Raises:
        Exception: If model loading fails (Q10: Option C - fail gracefully)
    """
    try:
        config = get_model_config(model_key, model_role)

        # Calculate n_ctx (Q9: Option A - strict adherence to the user's choice)
        if model_role == "extraction" and user_n_ctx is not None:
            n_ctx = min(user_n_ctx, config["max_context"], MAX_USABLE_CTX)
        else:
            # Synthesis or default extraction
            n_ctx = min(config.get("max_context", 8192), MAX_USABLE_CTX)

        # Detect GPU support
        requested_ngl = int(os.environ.get("N_GPU_LAYERS", 0))
        n_gpu_layers = requested_ngl
        if requested_ngl != 0:
            try:
                from llama_cpp import llama_supports_gpu_offload
                gpu_available = llama_supports_gpu_offload()
                if not gpu_available:
                    logger.warning("GPU requested but not available. Using CPU.")
                    n_gpu_layers = 0
            except Exception as e:
                logger.warning(f"Could not detect GPU: {e}. Using CPU.")
                n_gpu_layers = 0

        # Load model
        logger.info(f"Loading {config['name']} for {model_role} role (n_ctx={n_ctx:,})")
        llm = Llama.from_pretrained(
            repo_id=config["repo_id"],
            filename=config["filename"],
            n_ctx=n_ctx,
            n_batch=min(2048, n_ctx),
            n_threads=n_threads,
            n_threads_batch=n_threads,
            n_gpu_layers=n_gpu_layers,
            verbose=False,
            seed=1337,
        )
        info_msg = (
            f"✅ Loaded: {config['name']} for {model_role} "
            f"(n_ctx={n_ctx:,}, threads={n_threads})"
        )
        logger.info(info_msg)
        return llm, info_msg
    except Exception as e:
        # Q10: Option C - fail gracefully, let the user select a different model
        error_msg = (
            f"❌ Failed to load {model_key} for {model_role}: {str(e)}\n\n"
            f"Please select a different model and try again."
        )
        logger.error(error_msg, exc_info=True)
        raise Exception(error_msg)


def unload_model(llm: Llama, model_name: str = "model") -> None:
    """Explicitly unload a model and trigger garbage collection."""
    if llm:
        logger.info(f"Unloading {model_name}")
        del llm
        gc.collect()
        time.sleep(0.5)  # Allow the OS to reclaim memory
```
---

## Extraction Pipeline

### Extraction System Prompt Builder (Bilingual + Reasoning)

```python
def build_extraction_system_prompt(
    output_language: str,
    supports_reasoning: bool,
    supports_toggle: bool,
    enable_reasoning: bool,
) -> str:
    """
    Build the extraction system prompt with optional reasoning mode.

    Args:
        output_language: "en" or "zh-TW" (auto-detected from transcript)
        supports_reasoning: Model has reasoning capability
        supports_toggle: User can toggle reasoning on/off
        enable_reasoning: User's choice (only applies if supports_toggle=True)

    Returns:
        System prompt string
    """
    # Determine reasoning mode
    if supports_toggle and enable_reasoning:
        # Hybrid model with reasoning enabled
        reasoning_instruction_en = """
Use your reasoning capabilities to analyze the content before extracting.
Your reasoning should:
1. Identify key decision points and action items
2. Distinguish explicit decisions from general discussion
3. Categorize information appropriately (action vs point vs question)
After reasoning, output ONLY valid JSON."""
        reasoning_instruction_zh = """
使用你的推理能力分析內容後再進行提取。
你的推理應該:
1. 識別關鍵決策點和行動項目
2. 區分明確決策與一般討論
3. 適當分類資訊(行動 vs 要點 vs 問題)
推理後,僅輸出 JSON。"""
    else:
        reasoning_instruction_en = ""
        reasoning_instruction_zh = ""

    # Build the full prompt
    if output_language == "zh-TW":
        return f"""你是會議分析助手。從逐字稿中提取結構化資訊。
{reasoning_instruction_zh}
僅輸出有效的 JSON,使用此精確架構:
{{
  "action_items": ["包含負責人和截止日期的任務", ...],
  "decisions": ["包含理由的決策", ...],
  "key_points": ["重要討論要點", ...],
  "open_questions": ["未解決的問題或疑慮", ...]
}}
規則:
- 每個項目必須是完整、獨立的句子
- 在每個項目中包含上下文(誰、什麼、何時)
- 如果類別沒有項目,使用空陣列 []
- 僅輸出 JSON,無 markdown,無解釋"""
    else:  # English
        return f"""You are a meeting analysis assistant. Extract structured information from the transcript.
{reasoning_instruction_en}
Output ONLY valid JSON with this exact schema:
{{
  "action_items": ["Task with owner and deadline", ...],
  "decisions": ["Decision made with rationale", ...],
  "key_points": ["Important discussion point", ...],
  "open_questions": ["Unresolved question or concern", ...]
}}
Rules:
- Each item must be a complete, standalone sentence
- Include context (who, what, when) in each item
- If a category has no items, use an empty array []
- Output ONLY JSON, no markdown, no explanations"""
```
| --- | |
### Extraction Streaming with Reasoning Parsing (Q8: Option A - Show in "MODEL THINKING PROCESS")

```python
import re
import time
from typing import Dict, Generator, List, Tuple


def stream_extract_from_window(
    extraction_llm: Llama,
    window: Window,
    window_id: int,
    total_windows: int,
    tracer: Tracer,
    tokenizer: NativeTokenizer,
    enable_reasoning: bool = False,
) -> Generator[Tuple[str, str, Dict[str, List[str]], bool], None, None]:
    """
    Stream extraction from a single window with live progress + optional reasoning.

    Yields:
        (ticker_text, thinking_text, partial_items, is_complete)
        - ticker_text: Progress ticker for UI
        - thinking_text: Reasoning/thinking blocks (if the extraction model supports them)
        - partial_items: Current extracted items
        - is_complete: True on final yield
    """
    # Auto-detect language from window content
    has_cjk = bool(re.search(r'[\u4e00-\u9fff]', window.content))
    output_language = "zh-TW" if has_cjk else "en"

    # Build system prompt with reasoning support
    config = EXTRACTION_MODELS[window.model_key]  # Assuming we pass model_key in Window
    system_prompt = build_extraction_system_prompt(
        output_language=output_language,
        supports_reasoning=config.get("supports_reasoning", False),
        supports_toggle=config.get("supports_toggle", False),
        enable_reasoning=enable_reasoning,
    )
    user_prompt = f"Transcript:\n\n{window.content}"
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

    # Stream extraction
    full_response = ""
    thinking_content = ""
    input_tokens = tokenizer.count(window.content)  # Hoisted: constant per window
    start_time = time.time()
    first_token_time = None  # Reserved for time-to-first-token metrics
    token_count = 0

    try:
        stream = extraction_llm.create_chat_completion(
            messages=messages,
            max_tokens=1024,
            temperature=config["inference_settings"]["temperature"],
            top_p=config["inference_settings"]["top_p"],
            top_k=config["inference_settings"]["top_k"],
            repeat_penalty=config["inference_settings"]["repeat_penalty"],
            stream=True,
        )

        for chunk in stream:
            if 'choices' in chunk and len(chunk['choices']) > 0:
                delta = chunk['choices'][0].get('delta', {})
                content = delta.get('content', '')
                if content:
                    if first_token_time is None:
                        first_token_time = time.time()
                    token_count += 1
                    full_response += content

                    # Parse thinking blocks if reasoning enabled
                    if enable_reasoning and config.get("supports_reasoning"):
                        thinking, remaining = parse_thinking_blocks(full_response, streaming=True)
                        thinking_content = thinking or ""
                        json_text = remaining
                    else:
                        json_text = full_response

                    # Try to parse JSON
                    partial_items = _try_parse_extraction_json(json_text)

                    # Calculate progress metrics
                    elapsed = time.time() - start_time
                    tps = token_count / elapsed if elapsed > 0 else 0
                    remaining_tokens = 1024 - token_count
                    eta = int(remaining_tokens / tps) if tps > 0 else 0

                    # Get item counts for ticker
                    items_count = {
                        "action_items": len(partial_items.get("action_items", [])),
                        "decisions": len(partial_items.get("decisions", [])),
                        "key_points": len(partial_items.get("key_points", [])),
                        "open_questions": len(partial_items.get("open_questions", [])),
                    }

                    # Get last extracted item as snippet
                    last_item = ""
                    for category in ["action_items", "decisions", "key_points", "open_questions"]:
                        if partial_items.get(category):
                            last_item = partial_items[category][-1]
                            break

                    # Format progress ticker
                    ticker = format_progress_ticker(
                        current_window=window_id,
                        total_windows=total_windows,
                        window_tokens=input_tokens,
                        max_tokens=4096,  # Reference max for percentage
                        items_found=items_count,
                        tokens_per_sec=tps,
                        eta_seconds=eta,
                        current_snippet=last_item,
                    )

                    # Q8: Option A - Show in "MODEL THINKING PROCESS" field
                    yield (ticker, thinking_content, partial_items, False)

        # Final parse
        if enable_reasoning and config.get("supports_reasoning"):
            thinking, remaining = parse_thinking_blocks(full_response)
            thinking_content = thinking or ""
            json_text = remaining
        else:
            json_text = full_response

        final_items = _try_parse_extraction_json(json_text)
        if not final_items:
            # JSON parsing failed - fail the entire pipeline (strict mode)
            error_msg = f"Failed to parse JSON from window {window_id}. Response: {json_text[:200]}"
            tracer.log_extraction(
                window_id=window_id,
                extraction=None,
                llm_response=_sample_llm_response(full_response),
                error=error_msg,
            )
            raise ValueError(error_msg)

        # Log successful extraction
        tracer.log_extraction(
            window_id=window_id,
            extraction=final_items,
            llm_response=_sample_llm_response(full_response),
            thinking=_sample_llm_response(thinking_content) if thinking_content else None,
            error=None,
        )

        # Final ticker
        elapsed = time.time() - start_time
        tps = token_count / elapsed if elapsed > 0 else 0
        items_count = {k: len(v) for k, v in final_items.items()}
        ticker = format_progress_ticker(
            current_window=window_id,
            total_windows=total_windows,
            window_tokens=input_tokens,
            max_tokens=4096,
            items_found=items_count,
            tokens_per_sec=tps,
            eta_seconds=0,
            current_snippet="✅ Extraction complete",
        )
        yield (ticker, thinking_content, final_items, True)

    except Exception as e:
        # Log error and re-raise to fail the entire pipeline
        tracer.log_extraction(
            window_id=window_id,
            extraction=None,
            llm_response=_sample_llm_response(full_response) if full_response else "",
            error=str(e),
        )
        raise
```
---

## Implementation Checklist

### Files to Create

- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/extraction.py` (~900 lines)
  - [ ] `NativeTokenizer` class
  - [ ] `EmbeddingModel` class + `EMBEDDING_MODELS` registry
  - [ ] `format_progress_ticker()` function
  - [ ] `stream_extract_from_window()` function (with reasoning support)
  - [ ] `deduplicate_items()` function
  - [ ] `stream_synthesize_executive_summary()` function

### Files to Modify

- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/__init__.py`
  - [ ] Remove `filter_validated_items` import/export
- [ ] `/home/luigi/tiny-scribe/meeting_summarizer/trace.py`
  - [ ] Add `log_extraction()` method
  - [ ] Add `log_deduplication()` method
  - [ ] Add `log_synthesis()` method
- [ ] `/home/luigi/tiny-scribe/app.py` (~800 lines added/modified)
  - [ ] Add `EXTRACTION_MODELS` registry (13 models)
  - [ ] Add `SYNTHESIS_MODELS` reference
  - [ ] Add `get_model_config()` function
  - [ ] Add `load_model_for_role()` function
  - [ ] Add `unload_model()` function
  - [ ] Add `build_extraction_system_prompt()` function
  - [ ] Add `summarize_advanced()` generator function
  - [ ] Add Advanced mode UI controls
  - [ ] Add reasoning visibility logic
  - [ ] Add model info display functions
  - [ ] Update `download_summary_json()` for trace embedding

### Code Statistics

| Metric | Count |
|--------|-------|
| **New Lines** | ~1,800 |
| **Modified Lines** | ~60 |
| **Removed Lines** | ~2 |
| **New Functions** | 12 |
| **New Classes** | 2 |
| **UI Controls** | 11 |
---

## Testing Strategy

### Phase 1: Model Registry Validation

```bash
python -c "
from app import EXTRACTION_MODELS, SYNTHESIS_MODELS
from meeting_summarizer.extraction import EMBEDDING_MODELS

assert len(EXTRACTION_MODELS) == 13, 'Extraction models count mismatch'
assert len(EMBEDDING_MODELS) == 4, 'Embedding models count mismatch'
assert len(SYNTHESIS_MODELS) == 16, 'Synthesis models count mismatch'

# Verify independent settings
ext_qwen = EXTRACTION_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
syn_qwen = SYNTHESIS_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
assert ext_qwen == 0.3, f'Extraction temp wrong: {ext_qwen}'
assert syn_qwen == 0.8, f'Synthesis temp wrong: {syn_qwen}'
print('✅ All model registries validated!')
"
```
### Phase 2: UI Control Validation

**Manual Checks:**

1. Select "Advanced" mode
2. Verify the 3 dropdowns show correct counts (13, 4, 16)
3. Verify the default models are selected
4. Adjust the extraction_n_ctx slider (2K → 8K)
5. Select qwen3_600m_q4 for extraction → reasoning checkbox appears
6. Select qwen3_1.7b_q4 for extraction → reasoning checkbox visible (Qwen3 supports reasoning)
7. Select qwen3_4b_thinking_q3 for synthesis → reasoning locked ON
8. Verify model info panels update on selection
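The visibility rules exercised in checks 5-7 reduce to a pure function over the model config flags. A hedged sketch follows; the `DEMO_MODELS` entries are illustrative stand-ins for the real registries, covering the three reasoning classes (none, hybrid, thinking-only):

```python
# Illustrative registry entries; the real EXTRACTION_MODELS / SYNTHESIS_MODELS in
# app.py carry more fields (repo_id, inference_settings, ...).
DEMO_MODELS = {
    "granite_350m": {"supports_reasoning": False, "supports_toggle": False},
    "qwen3_1.7b_q4": {"supports_reasoning": True, "supports_toggle": True},
    "qwen3_4b_thinking_q3": {"supports_reasoning": True, "supports_toggle": False},
}


def reasoning_checkbox_state(model_key: str) -> dict:
    """Map a selected model to the reasoning checkbox's (visible, interactive, value)."""
    cfg = DEMO_MODELS[model_key]
    if not cfg["supports_reasoning"]:
        # Non-reasoning model: hide the checkbox entirely
        return {"visible": False, "interactive": False, "value": False}
    if not cfg["supports_toggle"]:
        # Thinking-only model: checkbox shown but locked ON
        return {"visible": True, "interactive": False, "value": True}
    # Hybrid model: the user may toggle reasoning on or off
    return {"visible": True, "interactive": True, "value": False}
```

In the Gradio handler, this dict would feed something like `gr.update(**reasoning_checkbox_state(key))` on the dropdown's change event.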
### Phase 3: Pipeline Test - min.txt (Quick)

**Configuration:**

- Extraction: `qwen3_1.7b_q4` (default)
- Extraction n_ctx: 4096 (default)
- Embedding: `granite-107m` (default)
- Synthesis: `qwen3_1.7b_q4` (default)
- Similarity threshold: 0.85 (default)

**Expected:**

- 1 window created
- ~2-4 items extracted
- 0-1 duplicates removed
- Final summary generated
- Total time: ~30-60s
- Download JSON contains trace
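The deduplication stage these phases exercise can be sketched as greedy cosine-similarity filtering per category, keeping first occurrences. This is a minimal sketch, not the planned implementation: `embed` is any callable returning a float vector, standing in for the `EmbeddingModel` class:

```python
import math
from typing import Callable, Dict, List


def deduplicate_items(
    items: Dict[str, List[str]],
    embed: Callable[[str], List[float]],
    threshold: float = 0.85,
) -> Dict[str, List[str]]:
    """Greedy per-category dedup: drop an item if it is too similar to any kept item."""

    def cosine(a: List[float], b: List[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    result: Dict[str, List[str]] = {}
    for category, texts in items.items():
        kept: List[str] = []
        kept_vecs: List[List[float]] = []
        for text in texts:
            vec = embed(text)
            # First occurrence wins; later near-duplicates are dropped
            if all(cosine(vec, kv) < threshold for kv in kept_vecs):
                kept.append(text)
                kept_vecs.append(vec)
        result[category] = kept
    return result
```

At the default 0.85 threshold, two items count as duplicates when their embeddings' cosine similarity reaches 0.85 or higher.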
### Phase 4: Pipeline Test - Reasoning Models

**Configuration:**

- Extraction: `qwen3_600m_q4`
- ☑ Enable Reasoning for Extraction (test hybrid model)
- Extraction n_ctx: 2048 (smaller windows)
- Embedding: `granite-278m` (test balanced embedding)
- Synthesis: `qwen3_1.7b_q4`
- ☑ Enable Reasoning for Synthesis

**Expected:**

- More windows (~4-6 with 2K context)
- "MODEL THINKING PROCESS" shows extraction thinking + ticker
- ~10-15 items extracted
- ~2-4 duplicates removed
- Final summary with thinking blocks
- Total time: ~2-3 min
### Phase 5: Pipeline Test - full.txt (Production)

**Configuration:**

- Extraction: `qwen3_1.7b_q4` (high quality, reasoning enabled)
- Extraction n_ctx: 4096 (default)
- Embedding: `qwen-600m` (highest quality)
- Synthesis: `qwen3_4b_thinking_q3` (4B thinking model)
- Output language: zh-TW (test Chinese)

**Expected:**

- ~3-5 windows (4K context)
- ~40-60 items extracted
- ~10-15 duplicates removed
- Final summary in Traditional Chinese
- Total time: ~5-8 min
- Download JSON with embedded trace (~1-2MB)
### Phase 6: Error Handling Test (Q10: Option C)

**Scenarios:**

1. Disconnect internet during model download
2. Manually corrupt the model cache
3. Use an invalid model repo_id in EXTRACTION_MODELS

**Expected behavior:**

- Error message displayed in the UI: "❌ Failed to load lfm2_extract_1.2b..."
- Pipeline stops (no fallback model is attempted)
- User can select a different model and retry
- Trace file saved with error details
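The expected fail-fast behavior can be sketched as a thin wrapper around model loading. `try_load_or_fail` and its message format are illustrative assumptions, not the actual app.py API:

```python
from typing import Any, Callable


def try_load_or_fail(role: str, model_key: str, loader: Callable[[str], Any]) -> Any:
    """Load a model for a pipeline role; on any failure, surface a UI-ready error.

    Option C behavior: no fallback model is attempted - the pipeline stops so the
    user can select a different model and retry.
    """
    try:
        return loader(model_key)
    except Exception as exc:
        raise RuntimeError(
            f"❌ Failed to load {model_key} for {role} stage: {exc}. "
            "Select a different model and retry."
        ) from exc
```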
---

## Implementation Priority

### Suggested Implementation Sequence (13-19 hours total)

**1. Model Registries (1-2 hours)**

- [ ] Add `EXTRACTION_MODELS` to `app.py`
- [ ] Add `SYNTHESIS_MODELS` reference
- [ ] Add `EMBEDDING_MODELS` to `extraction.py`
- [ ] Validate with smoke test

**2. Core Infrastructure (2-3 hours)**

- [ ] Implement `get_model_config()`
- [ ] Implement `load_model_for_role()` with user_n_ctx support
- [ ] Implement `unload_model()`
- [ ] Implement `build_extraction_system_prompt()` with reasoning support
- [ ] Update `trace.py` with 3 new logging methods
- [ ] Update `__init__.py`

**3. Extraction Module (3-4 hours)**

- [ ] Implement `NativeTokenizer` class
- [ ] Implement `EmbeddingModel` class
- [ ] Implement `format_progress_ticker()`
- [ ] Implement `stream_extract_from_window()` with reasoning parsing
- [ ] Implement `deduplicate_items()`
- [ ] Implement `stream_synthesize_executive_summary()`

**4. UI Integration (2-3 hours)**

- [ ] Add Advanced mode controls to the Gradio interface
- [ ] Implement reasoning checkbox visibility logic
- [ ] Implement model info display functions
- [ ] Wire up all event handlers
- [ ] Test UI responsiveness

**5. Pipeline Orchestration (3-4 hours)**

- [ ] Implement `summarize_advanced()` generator function
- [ ] Sequential model loading/unloading logic
- [ ] Error handling with graceful failures
- [ ] Progress ticker updates
- [ ] Trace embedding in download JSON
**6. Testing & Validation (2-3 hours)**

- [ ] Run all test phases (min.txt → full.txt)
- [ ] Validate reasoning model behavior
- [ ] Test error handling scenarios
- [ ] Performance optimization (if needed)

---
## Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| **Memory overflow on HF Spaces Free Tier** | Low | High | Sequential loading/unloading tested; add memory monitoring |
| **Reasoning output breaks JSON parsing** | Medium | Medium | Robust thinking-block parsing with fallback; strict error handling |
| **User n_ctx slider causes OOM** | Low | Medium | Cap at MAX_USABLE_CTX (32K); warn if the user sets it too high |
| **Embedding models slow down pipeline** | Medium | Low | Default to granite-107m (fastest); user can upgrade if needed |
| **Trace file too large** | Low | Low | Response sampling (400 chars) already implemented; compress if >5MB |
---

## Appendix: Model Comparison Tables

### Extraction Models (11)

| Model | Size | Context | Reasoning | Settings |
|-------|------|---------|-----------|----------|
| falcon_h1_100m | 100M | 32K | No | temp=0.2 |
| gemma3_270m | 270M | 32K | No | temp=0.3 |
| ernie_300m | 300M | 131K | No | temp=0.2 |
| granite_350m | 350M | 32K | No | temp=0.1 |
| bitcpm4_500m | 500M | 128K | No | temp=0.2 |
| hunyuan_500m | 500M | 256K | No | temp=0.2 |
| qwen3_600m_q4 | 600M | 32K | **Hybrid** | temp=0.3 |
| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.3 |
| lfm2_extract_1.2b | 1.2B | 32K | No | temp=0.2 |
| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.2 |
| qwen3_1.7b_q4 | 1.7B | 32K | **Hybrid** | temp=0.3 |
### Synthesis Models (16)

| Model | Size | Context | Reasoning | Settings |
|-------|------|---------|-----------|----------|
| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.7 |
| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.1 |
| qwen3_1.7b_q4 | 1.7B | 32K | Hybrid | temp=0.8 |
| granite_3_3_2b_q4 | 2B | 128K | No | temp=0.8 |
| youtu_llm_2b_q8 | 2B | 128K | Hybrid | temp=0.8 |
| lfm2_2_6b_transcript | 2.6B | 32K | No | temp=0.7 |
| breeze_3b_q4 | 3B | 32K | No | temp=0.7 |
| granite_3_1_3b_q4 | 3B | 128K | No | temp=0.8 |
| qwen3_4b_thinking_q3 | 4B | 256K | **Thinking-only** | temp=0.8 |
| granite4_tiny_q3 | 7B | 128K | No | temp=0.8 |
| ernie_21b_pt_q1 | 21B | 128K | No | temp=0.8 |
| ernie_21b_thinking_q1 | 21B | 128K | **Thinking-only** | temp=0.9 |
| glm_4_7_flash_reap_30b | 30B | 128K | **Thinking-only** | temp=0.8 |
| glm_4_7_flash_30b_iq2 | 30B | 128K | No | temp=0.7 |
| qwen3_30b_thinking_q1 | 30B | 256K | **Thinking-only** | temp=0.8 |
| qwen3_30b_instruct_q1 | 30B | 256K | No | temp=0.7 |
### Embedding Models (4)

| Model | Size | Dimension | Speed | Quality |
|-------|------|-----------|-------|---------|
| granite-107m | 107M | 384 | Fastest | Good |
| granite-278m | 278M | 768 | Balanced | Better |
| gemma-300m | 300M | 768 | Fast | Good |
| qwen-600m | 600M | 1024 | Slower | Best |
---

## Next Steps

Once approved, implementation will proceed in the order outlined in the Implementation Priority section. All code will be committed with descriptive messages referencing this plan document.

**Ready for implementation approval.**

---

**Document Version:** 1.1
**Last Updated:** 2026-02-05
**Author:** Claude (Anthropic)
**Revision Note:** Updated post-implementation to match actual code