Luigi committed on
Commit 7a3423c · 1 Parent(s): b7e57ed

feat: Implement Advanced Mode 3-stage pipeline (extraction → deduplication → synthesis)


Add complete 3-model pipeline with independent model registries and parameters:

**New Modules:**
- meeting_summarizer/__init__.py: Package initialization
- meeting_summarizer/trace.py: Tracer with extraction/dedup/synthesis logging
- meeting_summarizer/extraction.py: Complete pipeline (~700 lines)
  - NativeTokenizer: Token counting without llama.cpp
  - EmbeddingModel: Embedding computation for deduplication
  - stream_extract_from_window(): Stage 1 extraction with reasoning
  - deduplicate_items(): Stage 2 semantic deduplication
  - stream_synthesize_executive_summary(): Stage 3 synthesis
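The commit message only names the windowing behavior (token-budget windows with turn overlap, see `overlap_turns` below); a minimal, self-contained sketch of that idea — illustrative names only, not the actual `meeting_summarizer/extraction.py` API — looks like this:

```python
def make_windows(lines, count_tokens, max_tokens, overlap=1):
    """Greedy token-budget windowing with a small turn overlap (sketch)."""
    windows, current, tokens = [], [], 0
    for line in lines:
        t = count_tokens(line)
        if current and tokens + t > max_tokens:
            windows.append(current)
            current = current[-overlap:]  # carry the last `overlap` turns forward
            tokens = sum(count_tokens(l) for l in current)
        current.append(line)
        tokens += t
    if current:
        windows.append(current)
    return windows

# Four 2-token turns under a 4-token budget produce overlapping windows.
print(make_windows(["a b", "c d", "e f", "g h"],
                   lambda s: len(s.split()), max_tokens=4))
# → [['a b', 'c d'], ['c d', 'e f'], ['e f', 'g h']]
```

The overlap keeps items that straddle a window boundary extractable from at least one window; duplicates this creates are exactly what Stage 2 removes.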

**Model Registries (32 models, fully independent):**
- EXTRACTION_MODELS (13 models, ≤1.7B): Extraction-optimized (temp 0.1-0.3)
  - Includes LFM2-Extract 350M & 1.2B (specialized extraction models)
  - 2 hybrid models with reasoning toggle (Qwen3 600M & 1.7B)
- SYNTHESIS_MODELS (16 models, 1B-30B): Synthesis-optimized (temp 0.7-0.9)
  - Fully independent from AVAILABLE_MODELS (no shared references)
  - 2 hybrid + 5 thinking-only models with reasoning support
- EMBEDDING_MODELS (4 models): granite-107m (default), granite-278m, gemma-300m, qwen-600m
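Stage 2 uses one of the embedding models above to drop near-duplicate items. A toy sketch of threshold-based semantic deduplication — `toy_embed` is a stand-in for the real GGUF embedding model, and the function name mirrors but is not the actual `deduplicate_items` implementation:

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u)) or 1.0
    nv = sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def deduplicate_items(items, embed, threshold=0.9):
    """Keep an item only if it is not too similar to any earlier kept item."""
    kept, kept_vecs = [], []
    for text in items:
        vec = embed(text)
        if all(cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept

# Toy bag-of-words "embedding" standing in for the GGUF embedding model.
VOCAB = ["ship", "q3", "launch", "budget"]
def toy_embed(text):
    t = text.lower()
    return [1.0 if w in t else 0.0 for w in VOCAB]

print(deduplicate_items(
    ["Ship the Q3 launch", "Q3 launch must ship", "Approve the budget"],
    toy_embed))
# → ['Ship the Q3 launch', 'Approve the budget']
```

The `similarity_threshold` slider in the Advanced Mode UI maps onto the `threshold` parameter of this step.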

**Core Functions (app.py):**
- get_model_config(): Role-aware configuration resolver
- load_model_for_role(): Sequential loading with user n_ctx support
- unload_model(): Explicit memory cleanup
- build_extraction_system_prompt(): Bilingual + reasoning support
- summarize_advanced(): Main orchestrator (239 lines) with sequential model loading/unloading

**UI Implementation:**
- Mode tabs (Standard vs Advanced)
- 11 Advanced Mode controls (3 dropdowns, 4 sliders, 2 checkboxes, 2 radios)
- Conditional reasoning checkbox visibility per stage
- Submit button router (auto-detects mode and routes to appropriate handler)

**Features:**
- Sequential model loading/unloading (memory-safe for HF Spaces Free Tier)
- Bilingual support (auto-detect in extraction, Chinese conversion at end)
- Live progress streaming with ticker updates
- Trace logging (JSONL embedded in download JSON)
- Independent parameters per stage (no cross-contamination)

**Pipeline:**
Stage 1: Extraction → Parse transcript windows → Extract JSON items
Stage 2: Deduplication → Compute embeddings → Remove semantic duplicates
Stage 3: Synthesis → Generate executive summary from deduplicated items
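The three stages chain together as sketched below. This is a stand-in, not the real `summarize_advanced` orchestrator: the `extract`/`synthesize` callables are hypothetical, and `difflib` string similarity substitutes for the embedding-based Stage 2.

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def run_pipeline(transcript, extract, synthesize, threshold=0.9):
    windows = [w for w in transcript.split("\n\n") if w.strip()]  # naive windowing
    items = [item for w in windows for item in extract(w)]        # Stage 1: extraction
    deduped = []                                                  # Stage 2: deduplication
    for item in items:
        if all(similar(item, kept) < threshold for kept in deduped):
            deduped.append(item)
    return synthesize(deduped)                                    # Stage 3: synthesis

summary = run_pipeline(
    "A: ship on Friday\n\nB: ship on Friday\n\nC: hire two engineers",
    extract=lambda w: [w.split(": ", 1)[1]],
    synthesize=lambda items: "; ".join(items),
)
print(summary)  # → ship on Friday; hire two engineers
```

The real pipeline additionally loads and unloads a separate model per stage, which is what keeps peak memory within the HF Spaces Free Tier budget.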

Code statistics: ~2,400 new lines, 3 new files, 11 new functions, 3 new classes

app.py CHANGED
@@ -660,6 +660,459 @@ AVAILABLE_MODELS = {
 DEFAULT_MODEL_KEY = "qwen3_600m_q4"
 
 def load_model(model_key: str = None, n_threads: int = 2) -> Tuple[Llama, str]:
     """
     Load model with CPU optimizations. Only reloads if model changes.
@@ -766,6 +1219,452 @@ def update_reasoning_visibility(model_key):
     return gr.update(visible=True, value=True, interactive=True, label="Enable Reasoning Mode")
 
 def download_summary_json(summary, thinking, model_key, language, metrics):
     """Generate JSON file with summary and metadata."""
     import json
@@ -1667,80 +2566,190 @@ def create_interface():
         )
 
         # ==========================================
-        # Section 2: Model Selection (Tabs)
         # ==========================================
-        with gr.Tabs() as model_tabs:
 
-            # --- Tab 1: Preset Models ---
-            with gr.TabItem("🤖 Preset Models"):
-                # Filter out custom_hf from preset choices
-                preset_choices = [
-                    (info["name"] + (" ⚡" if info.get("supports_reasoning", False) and not info.get("supports_toggle", False) else ""), key)
-                    for key, info in AVAILABLE_MODELS.items()
-                    if key != "custom_hf"
-                ]
-
-                model_dropdown = gr.Dropdown(
-                    choices=preset_choices,
-                    value=DEFAULT_MODEL_KEY,
-                    label="Select Model",
-                    info="Smaller = faster. ⚡ = Always-reasoning models."
-                )
-
-                enable_reasoning = gr.Checkbox(
-                    value=True,
-                    label="Enable Reasoning Mode",
-                    info="Uses /think for deeper analysis (slower) or /no_think for direct output (faster).",
-                    interactive=True,
-                    visible=AVAILABLE_MODELS[DEFAULT_MODEL_KEY].get("supports_toggle", False)
-                )
 
-                # Model info for preset models
-                gr.HTML('<div class="section-header" style="margin-top: 12px;"><span class="section-icon">📊</span> Model Information</div>')
-                _default_threads = DEFAULT_CUSTOM_THREADS if DEFAULT_CUSTOM_THREADS > 0 else 2
-                info_output = gr.Markdown(
-                    value=get_model_info(DEFAULT_MODEL_KEY, n_threads=_default_threads)[0],
-                    elem_classes=["stats-grid"]
-                )
 
-            # --- Tab 2: Custom GGUF ---
-            with gr.TabItem("🔧 Custom GGUF"):
-                gr.HTML('<div style="font-size: 0.85em; color: #64748b; margin-bottom: 10px;">Load any GGUF model from HuggingFace Hub</div>')
 
-                # HF Hub Search Component
-                model_search_input = HuggingfaceHubSearch(
-                    label="🔍 Search HuggingFace Models",
-                    placeholder="Type model name (e.g., 'qwen', 'phi', 'llama')",
-                    search_type="model",
-                )
 
-                # File dropdown (populated after repo discovery)
-                custom_file_dropdown = gr.Dropdown(
-                    label="📦 Select GGUF File",
-                    choices=[],
-                    value=None,
-                    info="GGUF files appear after selecting a model above",
-                    interactive=True,
-                )
 
-                # Load button
-                load_btn = gr.Button("⬇️ Load Selected Model", variant="primary", size="sm")
 
-                # Status message
-                custom_status = gr.Textbox(
-                    label="Status",
-                    interactive=False,
-                    value="",
-                    visible=False,
-                )
 
-                retry_btn = gr.Button("🔄 Retry", variant="secondary", visible=False)
 
-                # Model info for custom models (shows after loading)
-                gr.HTML('<div class="section-header" style="margin-top: 12px;"><span class="section-icon">📊</span> Custom Model Info</div>')
-                custom_info_output = gr.Markdown(
-                    value="*Load a model to see its specifications...*",
-                    elem_classes=["stats-grid"]
     )
 
     # ==========================================
@@ -1973,6 +2982,60 @@ def create_interface():
         outputs=[system_prompt_debug],
     )
 
     # Debounced auto-discovery for custom repo ID (500ms delay)
     import time as time_module
 
@@ -2138,10 +3201,129 @@ def create_interface():
         outputs=[custom_info_output],
     )
 
-    # Update submit button to include custom_model_state in inputs and system_prompt_debug in outputs
     submit_btn.click(
-        fn=summarize_streaming,
-        inputs=[file_input, text_input, model_dropdown, enable_reasoning, max_tokens, temperature_slider, top_p, top_k, language_selector, thread_config_dropdown, custom_threads_slider, custom_model_state],
         outputs=[thinking_output, summary_output, info_output, metrics_state, system_prompt_debug],
         show_progress="full"
     )
 
 DEFAULT_MODEL_KEY = "qwen3_600m_q4"
 
 
+# ===== ADVANCED MODE: EXTRACTION MODELS REGISTRY (13 models, ≤1.7B) =====
+# Used exclusively for Stage 1: Extraction (transcript windows → structured JSON)
+# Extraction-optimized settings: Low temperature (0.1-0.3) for deterministic output
+
+EXTRACTION_MODELS = {
+    "falcon_h1_100m": {
+        "name": "Falcon-H1 100M",
+        "repo_id": "mradermacher/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "100M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "gemma3_270m": {
+        "name": "Gemma-3 270M",
+        "repo_id": "unsloth/gemma-3-270m-it-qat-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "270M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.3,
+            "top_p": 0.9,
+            "top_k": 40,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "ernie_300m": {
+        "name": "ERNIE-4.5 0.3B (131K Context)",
+        "repo_id": "unsloth/ERNIE-4.5-0.3B-PT-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 131072,
+        "default_n_ctx": 4096,
+        "params_size": "300M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "granite_350m": {
+        "name": "Granite-4.0 350M",
+        "repo_id": "unsloth/granite-4.0-h-350m-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "350M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.1,
+            "top_p": 0.95,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "lfm2_350m": {
+        "name": "LFM2 350M",
+        "repo_id": "LiquidAI/LFM2-350M-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "350M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 40,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "lfm2_extract_350m": {
+        "name": "LFM2-Extract 350M (Specialized)",
+        "repo_id": "LiquidAI/LFM2-350M-Extract-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "350M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "bitcpm4_500m": {
+        "name": "BitCPM4 0.5B (128K Context)",
+        "repo_id": "openbmb/BitCPM4-0.5B-GGUF",
+        "filename": "*q4_0.gguf",
+        "max_context": 131072,
+        "default_n_ctx": 4096,
+        "params_size": "500M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "hunyuan_500m": {
+        "name": "Hunyuan 0.5B (256K Context)",
+        "repo_id": "mradermacher/Hunyuan-0.5B-Instruct-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 262144,
+        "default_n_ctx": 4096,
+        "params_size": "500M",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "qwen3_600m_q4": {
+        "name": "Qwen3 0.6B Q4 (32K Context)",
+        "repo_id": "unsloth/Qwen3-0.6B-GGUF",
+        "filename": "*Q4_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "600M",
+        "supports_reasoning": True,
+        "supports_toggle": True,  # Hybrid model
+        "inference_settings": {
+            "temperature": 0.3,
+            "top_p": 0.9,
+            "top_k": 20,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "granite_3_1_1b_q8": {
+        "name": "Granite 3.1 1B-A400M Instruct (128K Context)",
+        "repo_id": "bartowski/granite-3.1-1b-a400m-instruct-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 131072,
+        "default_n_ctx": 4096,
+        "params_size": "1B",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.3,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "falcon_h1_1.5b_q4": {
+        "name": "Falcon-H1 1.5B Q4",
+        "repo_id": "unsloth/Falcon-H1-1.5B-Deep-Instruct-GGUF",
+        "filename": "*Q4_K_M.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "1.5B",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "qwen3_1.7b_q4": {
+        "name": "Qwen3 1.7B Q4 (32K Context)",
+        "repo_id": "unsloth/Qwen3-1.7B-GGUF",
+        "filename": "*Q4_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "1.7B",
+        "supports_reasoning": True,
+        "supports_toggle": True,  # Hybrid model
+        "inference_settings": {
+            "temperature": 0.3,
+            "top_p": 0.9,
+            "top_k": 20,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "lfm2_extract_1.2b": {
+        "name": "LFM2-Extract 1.2B (Specialized) ⭐",
+        "repo_id": "LiquidAI/LFM2-1.2B-Extract-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 32768,
+        "default_n_ctx": 4096,
+        "params_size": "1.2B",
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.2,
+            "top_p": 0.9,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+}
+
+DEFAULT_EXTRACTION_MODEL = "lfm2_extract_1.2b"
+
+
+# ===== ADVANCED MODE: SYNTHESIS MODELS REGISTRY (16 models, 1B-30B) =====
+# Used exclusively for Stage 3: Synthesis (deduplicated items → executive summary)
+# Synthesis-optimized settings: Higher temperature (0.7-0.9) for creative synthesis
+# FULLY INDEPENDENT from AVAILABLE_MODELS (no shared references)
+
+SYNTHESIS_MODELS = {
+    "granite_3_1_1b_q8": {
+        "name": "Granite 3.1 1B-A400M Instruct (128K Context)",
+        "repo_id": "bartowski/granite-3.1-1b-a400m-instruct-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 131072,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 50,
+            "repeat_penalty": 1.05,
+        },
+    },
+    "falcon_h1_1.5b_q4": {
+        "name": "Falcon-H1 1.5B Q4",
+        "repo_id": "unsloth/Falcon-H1-1.5B-Deep-Instruct-GGUF",
+        "filename": "*Q4_K_M.gguf",
+        "max_context": 32768,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.7,
+            "top_p": 0.95,
+            "top_k": 40,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "qwen3_1.7b_q4": {
+        "name": "Qwen3 1.7B Q4 (32K Context)",
+        "repo_id": "unsloth/Qwen3-1.7B-GGUF",
+        "filename": "*Q4_0.gguf",
+        "max_context": 32768,
+        "supports_reasoning": True,
+        "supports_toggle": True,  # Hybrid model
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "granite_3_3_2b_q4": {
+        "name": "Granite 3.3 2B Instruct (128K Context)",
+        "repo_id": "ibm-granite/granite-3.3-2b-instruct-GGUF",
+        "filename": "*Q4_K_M.gguf",
+        "max_context": 131072,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 50,
+            "repeat_penalty": 1.05,
+        },
+    },
+    "youtu_llm_2b_q8": {
+        "name": "Youtu-LLM 2B (128K Context)",
+        "repo_id": "tencent/Youtu-LLM-2B-GGUF",
+        "filename": "*Q8_0.gguf",
+        "max_context": 131072,
+        "supports_reasoning": True,
+        "supports_toggle": True,  # Hybrid model
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 40,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "lfm2_2_6b_transcript": {
+        "name": "LFM2 2.6B Transcript (32K Context)",
+        "repo_id": "LiquidAI/LFM-2.6B-Transcript-GGUF",
+        "filename": "*Q4_0.gguf",
+        "max_context": 32768,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.7,
+            "top_p": 0.95,
+            "top_k": 40,
+            "repeat_penalty": 1.05,
+        },
+    },
+    "breeze_3b_q4": {
+        "name": "Breeze 3B Q4 (32K Context)",
+        "repo_id": "mradermacher/breeze-3b-GGUF",
+        "filename": "*Q4_K_M.gguf",
+        "max_context": 32768,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.7,
+            "top_p": 0.95,
+            "top_k": 40,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "granite_3_1_3b_q4": {
+        "name": "Granite 3.1 3B-A800M Instruct (128K Context)",
+        "repo_id": "bartowski/granite-3.1-3b-a800m-instruct-GGUF",
+        "filename": "*Q4_K_M.gguf",
+        "max_context": 131072,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 50,
+            "repeat_penalty": 1.05,
+        },
+    },
+    "qwen3_4b_thinking_q3": {
+        "name": "Qwen3 4B Thinking (256K Context)",
+        "repo_id": "unsloth/Qwen3-4B-Thinking-2507-GGUF",
+        "filename": "*Q3_K_M.gguf",
+        "max_context": 262144,
+        "supports_reasoning": True,
+        "supports_toggle": False,  # Thinking-only
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "granite4_tiny_q3": {
+        "name": "Granite 4.0 Tiny 7B (128K Context)",
+        "repo_id": "ibm-research/granite-4.0-Tiny-7B-Instruct-GGUF",
+        "filename": "*Q3_K_M.gguf",
+        "max_context": 131072,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 50,
+            "repeat_penalty": 1.05,
+        },
+    },
+    "ernie_21b_pt_q1": {
+        "name": "ERNIE-4.5 21B PT (128K Context)",
+        "repo_id": "unsloth/ERNIE-4.5-21B-A3B-PT-GGUF",
+        "filename": "*TQ1_0.gguf",
+        "max_context": 131072,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 50,
+            "repeat_penalty": 1.05,
+        },
+    },
+    "ernie_21b_thinking_q1": {
+        "name": "ERNIE-4.5 21B Thinking (128K Context)",
+        "repo_id": "unsloth/ERNIE-4.5-21B-A3B-Thinking-GGUF",
+        "filename": "*TQ1_0.gguf",
+        "max_context": 131072,
+        "supports_reasoning": True,
+        "supports_toggle": False,  # Thinking-only
+        "inference_settings": {
+            "temperature": 0.9,
+            "top_p": 0.95,
+            "top_k": 50,
+            "repeat_penalty": 1.05,
+        },
+    },
+    "glm_4_7_flash_reap_30b": {
+        "name": "GLM-4.7-Flash-REAP-30B Thinking (128K Context)",
+        "repo_id": "unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF",
+        "filename": "*TQ1_0.gguf",
+        "max_context": 131072,
+        "supports_reasoning": True,
+        "supports_toggle": False,  # Thinking-only
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 40,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "glm_4_7_flash_30b_iq2": {
+        "name": "GLM-4.7-Flash-30B (Original) IQ2_XXS (128K Context)",
+        "repo_id": "bartowski/zai-org_GLM-4.7-Flash-GGUF",
+        "filename": "*IQ2_XXS.gguf",
+        "max_context": 131072,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.7,
+            "top_p": 0.95,
+            "top_k": 40,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "qwen3_30b_thinking_q1": {
+        "name": "Qwen3 30B Thinking (256K Context)",
+        "repo_id": "unsloth/Qwen3-30B-A3B-Thinking-2507-GGUF",
+        "filename": "*TQ1_0.gguf",
+        "max_context": 262144,
+        "supports_reasoning": True,
+        "supports_toggle": False,  # Thinking-only
+        "inference_settings": {
+            "temperature": 0.8,
+            "top_p": 0.95,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+    "qwen3_30b_instruct_q1": {
+        "name": "Qwen3 30B Instruct (256K Context)",
+        "repo_id": "unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF",
+        "filename": "*TQ1_0.gguf",
+        "max_context": 262144,
+        "supports_reasoning": False,
+        "supports_toggle": False,
+        "inference_settings": {
+            "temperature": 0.7,
+            "top_p": 0.95,
+            "top_k": 30,
+            "repeat_penalty": 1.0,
+        },
+    },
+}
+
+DEFAULT_SYNTHESIS_MODEL = "qwen3_1.7b_q4"
+
+
 def load_model(model_key: str = None, n_threads: int = 2) -> Tuple[Llama, str]:
     """
     Load model with CPU optimizations. Only reloads if model changes.
 
     return gr.update(visible=True, value=True, interactive=True, label="Enable Reasoning Mode")
 
 
+# ===== ADVANCED MODE: HELPER FUNCTIONS =====
+
+def get_model_config(model_key: str, model_role: str) -> Dict[str, Any]:
+    """
+    Get model configuration based on role.
+
+    Ensures same model (e.g., qwen3_1.7b_q4) uses DIFFERENT settings
+    for extraction vs synthesis.
+
+    Args:
+        model_key: Model identifier (e.g., "qwen3_1.7b_q4")
+        model_role: "extraction" or "synthesis"
+
+    Returns:
+        Model configuration dict with role-specific settings
+
+    Raises:
+        ValueError: If model_key not available for specified role
+    """
+    if model_role == "extraction":
+        if model_key not in EXTRACTION_MODELS:
+            available = ", ".join(list(EXTRACTION_MODELS.keys())[:3]) + "..."
+            raise ValueError(
+                f"Model '{model_key}' not available for extraction role. "
+                f"Available: {available}"
+            )
+        return EXTRACTION_MODELS[model_key]
+
+    elif model_role == "synthesis":
+        if model_key not in SYNTHESIS_MODELS:
+            available = ", ".join(list(SYNTHESIS_MODELS.keys())[:3]) + "..."
+            raise ValueError(
+                f"Model '{model_key}' not available for synthesis role. "
+                f"Available: {available}"
+            )
+        return SYNTHESIS_MODELS[model_key]
+
+    else:
+        raise ValueError(
+            f"Unknown model role: '{model_role}'. "
+            f"Must be 'extraction' or 'synthesis'"
+        )
+
+
+def load_model_for_role(
+    model_key: str,
+    model_role: str,
+    n_threads: int = 2,
+    user_n_ctx: Optional[int] = None
+) -> Tuple[Llama, str]:
+    """
+    Load model with role-specific configuration.
+
+    Args:
+        model_key: Model identifier
+        model_role: "extraction" or "synthesis"
+        n_threads: CPU threads
+        user_n_ctx: User-specified n_ctx (extraction only, from slider)
+
+    Returns:
+        (loaded_model, info_message)
+
+    Raises:
+        Exception: If model loading fails (graceful failure)
+    """
+    try:
+        config = get_model_config(model_key, model_role)
+
+        # Calculate n_ctx
+        if model_role == "extraction" and user_n_ctx is not None:
+            n_ctx = min(user_n_ctx, config["max_context"], MAX_USABLE_CTX)
+        else:
+            # Synthesis or default extraction
+            n_ctx = min(config.get("max_context", 8192), MAX_USABLE_CTX)
+
+        # Detect GPU support
+        requested_ngl = int(os.environ.get("N_GPU_LAYERS", 0))
+        n_gpu_layers = requested_ngl
+
+        if requested_ngl != 0:
+            try:
+                from llama_cpp import llama_supports_gpu_offload
+                gpu_available = llama_supports_gpu_offload()
+                if not gpu_available:
+                    logger.warning("GPU requested but not available. Using CPU.")
+                    n_gpu_layers = 0
+            except Exception as e:
+                logger.warning(f"Could not detect GPU: {e}. Using CPU.")
+                n_gpu_layers = 0
+
+        # Load model
+        logger.info(f"Loading {config['name']} for {model_role} role (n_ctx={n_ctx:,})")
+
+        llm = Llama.from_pretrained(
+            repo_id=config["repo_id"],
+            filename=config["filename"],
+            n_ctx=n_ctx,
+            n_batch=min(2048, n_ctx),
+            n_threads=n_threads,
+            n_threads_batch=n_threads,
+            n_gpu_layers=n_gpu_layers,
+            verbose=False,
+            seed=1337,
+        )
+
+        info_msg = (
+            f"✅ Loaded: {config['name']} for {model_role} "
+            f"(n_ctx={n_ctx:,}, threads={n_threads})"
+        )
+        logger.info(info_msg)
+
+        return llm, info_msg
+
+    except Exception as e:
+        # Graceful failure - let user select different model
+        error_msg = (
+            f"❌ Failed to load {model_key} for {model_role}: {str(e)}\n\n"
+            f"Please select a different model and try again."
+        )
+        logger.error(error_msg, exc_info=True)
+        raise Exception(error_msg)
+
+
+def unload_model(llm: Optional[Llama], model_name: str = "model") -> None:
+    """Explicitly unload model and trigger garbage collection."""
+    if llm:
+        logger.info(f"Unloading {model_name}")
+        del llm
+        gc.collect()
+        time.sleep(0.5)  # Allow OS to reclaim memory
+
+
+def build_extraction_system_prompt(
+    output_language: str,
+    supports_reasoning: bool,
+    supports_toggle: bool,
+    enable_reasoning: bool
+) -> str:
+    """
+    Build extraction system prompt with optional reasoning mode.
+
+    Args:
+        output_language: "en" or "zh-TW" (auto-detected from transcript)
+        supports_reasoning: Model has reasoning capability
+        supports_toggle: User can toggle reasoning on/off
+        enable_reasoning: User's choice (only applies if supports_toggle=True)
+
+    Returns:
+        System prompt string
+    """
+    # Determine reasoning mode
+    if supports_toggle and enable_reasoning:
+        # Hybrid model with reasoning enabled
+        reasoning_instruction_en = """
+Use your reasoning capabilities to analyze the content before extracting.
+
+Your reasoning should:
+1. Identify key decision points and action items
+2. Distinguish explicit decisions from general discussion
+3. Categorize information appropriately (action vs point vs question)
+
+After reasoning, output ONLY valid JSON."""
+
+        reasoning_instruction_zh = """
+使用你的推理能力分析內容後再進行提取。
+
+你的推理應該:
+1. 識別關鍵決策點和行動項目
+2. 區分明確決策與一般討論
+3. 適當分類資訊(行動 vs 要點 vs 問題)
+
+推理後,僅輸出 JSON。"""
+    else:
+        reasoning_instruction_en = ""
+        reasoning_instruction_zh = ""
+
+    # Build full prompt
+    if output_language == "zh-TW":
+        return f"""你是會議分析助手。從逐字稿中提取結構化資訊。
+{reasoning_instruction_zh}
+
+僅輸出有效的 JSON,使用此精確架構:
+{{
+  "action_items": ["包含負責人和截止日期的任務", ...],
+  "decisions": ["包含理由的決策", ...],
+  "key_points": ["重要討論要點", ...],
+  "open_questions": ["未解決的問題或疑慮", ...]
+}}
+
+規則:
+- 每個項目必須是完整、獨立的句子
+- 在每個項目中包含上下文(誰、什麼、何時)
+- 如果類別沒有項目,使用空陣列 []
+- 僅輸出 JSON,無 markdown,無解釋"""
+
+    else:  # English
+        return f"""You are a meeting analysis assistant. Extract structured information from transcript.
+{reasoning_instruction_en}
+
+Output ONLY valid JSON with this exact schema:
+{{
+  "action_items": ["Task with owner and deadline", ...],
+  "decisions": ["Decision made with rationale", ...],
+  "key_points": ["Important discussion point", ...],
+  "open_questions": ["Unresolved question or concern", ...]
+}}
+
+Rules:
+- Each item must be a complete, standalone sentence
+- Include context (who, what, when) in each item
+- If a category has no items, use empty array []
+- Output ONLY JSON, no markdown, no explanations"""
+
+
+def summarize_advanced(
+    transcript: str,
+    extraction_model_key: str,
+    embedding_model_key: str,
+    synthesis_model_key: str,
+    extraction_n_ctx: int,
+    overlap_turns: int,
+    similarity_threshold: float,
+    enable_extraction_reasoning: bool,
+    enable_synthesis_reasoning: bool,
+    output_language: str,
+    max_tokens: int,
+    enable_logging: bool,
+    n_threads: int = 2
+) -> Generator[Dict[str, Any], None, None]:
+    """
+    Advanced 3-stage pipeline: Extraction → Deduplication → Synthesis.
+
+    Yields progress updates as dicts with keys:
+    - stage: "extraction" | "deduplication" | "synthesis" | "complete" | "error"
+    - ticker: Progress ticker text (for extraction)
+    - thinking: Thinking/reasoning content
+    - summary: Final summary (for synthesis/complete)
+    - error: Error message (if any)
+    - trace_stats: Summary statistics (on complete)
+    """
+    from meeting_summarizer.trace import Tracer
+    from meeting_summarizer.extraction import (
+        NativeTokenizer, EmbeddingModel, Window,
+        stream_extract_from_window, deduplicate_items, stream_synthesize_executive_summary
+    )
+
+    # Initialize tracer
+    tracer = Tracer(enabled=enable_logging)
+    tokenizer = NativeTokenizer()
+
+    extraction_llm = None
+    embedding_model = None
+    synthesis_llm = None
+
+    try:
+        # ===== STAGE 1: EXTRACTION =====
+        yield {"stage": "extraction", "ticker": "Loading extraction model...", "thinking": "", "summary": ""}
+
+        extraction_llm, load_msg = load_model_for_role(
+            model_key=extraction_model_key,
+            model_role="extraction",
+            n_threads=n_threads,
+            user_n_ctx=extraction_n_ctx
+        )
+
+        yield {"stage": "extraction", "ticker": load_msg, "thinking": "", "summary": ""}
+
+        # Create windows from transcript (simple split by turns for now)
+        # In production, this would be more sophisticated
+        lines = [l.strip() for l in transcript.split('\n') if l.strip()]
+
+        # Simple windowing: split into chunks based on token count
+        windows = []
+        current_window = []
+        current_tokens = 0
+        window_id = 1
+
+        for line_num, line in enumerate(lines):
+            line_tokens = tokenizer.count(line)
+
+            if current_tokens + line_tokens > extraction_n_ctx and current_window:
+                # Create window
+                window_content = '\n'.join(current_window)
+                windows.append(Window(
+                    id=window_id,
+                    content=window_content,
+                    start_turn=line_num - len(current_window),
+                    end_turn=line_num - 1,
+                    token_count=current_tokens
+                ))
+                window_id += 1
+
+                # Start new window with overlap
+                overlap_lines = current_window[-overlap_turns:] if len(current_window) >= overlap_turns else current_window
+                current_window = overlap_lines + [line]
+                current_tokens = sum(tokenizer.count(l) for l in current_window)
+            else:
+                current_window.append(line)
+                current_tokens += line_tokens
+
+        # Add final window
+        if current_window:
+            window_content = '\n'.join(current_window)
+            windows.append(Window(
+                id=window_id,
+                content=window_content,
+                start_turn=len(lines) - len(current_window),
+                end_turn=len(lines) - 1,
+                token_count=current_tokens
+            ))
+
+        total_windows = len(windows)
+        yield {"stage": "extraction", "ticker": f"Created {total_windows} windows", "thinking": "", "summary": ""}
+
+        # Extract from each window
+        all_items = {"action_items": [], "decisions": [], "key_points": [], "open_questions": []}
+
+        extraction_config = get_model_config(extraction_model_key, "extraction")
1540
+
1541
+ for window in windows:
1542
+ for ticker, thinking, partial_items, is_complete in stream_extract_from_window(
1543
+ extraction_llm=extraction_llm,
1544
+ window=window,
1545
+ window_id=window.id,
1546
+ total_windows=total_windows,
1547
+ tracer=tracer,
1548
+ tokenizer=tokenizer,
1549
+ model_config=extraction_config,
1550
+ enable_reasoning=enable_extraction_reasoning
1551
+ ):
1552
+ yield {"stage": "extraction", "ticker": ticker, "thinking": thinking, "summary": ""}
1553
+
1554
+ if is_complete:
1555
+ # Merge items
1556
+ for category, items in partial_items.items():
1557
+ all_items[category].extend(items)
1558
+
1559
+ # Unload extraction model
1560
+ unload_model(extraction_llm, "extraction model")
1561
+ extraction_llm = None
1562
+
1563
+ total_extracted = sum(len(v) for v in all_items.values())
1564
+ yield {"stage": "extraction", "ticker": f"✅ Extracted {total_extracted} total items", "thinking": "", "summary": ""}
1565
+
1566
+ # ===== STAGE 2: DEDUPLICATION =====
1567
+ yield {"stage": "deduplication", "ticker": "Loading embedding model...", "thinking": "", "summary": ""}
1568
+
1569
+ embedding_model = EmbeddingModel(embedding_model_key, n_threads=n_threads)
1570
+ load_msg = embedding_model.load()
1571
+
1572
+ yield {"stage": "deduplication", "ticker": load_msg, "thinking": "", "summary": ""}
1573
+
1574
+ # Deduplicate
1575
+ deduplicated_items = deduplicate_items(
1576
+ all_items=all_items,
1577
+ embedding_model=embedding_model,
1578
+ similarity_threshold=similarity_threshold,
1579
+ tracer=tracer
1580
+ )
1581
+
1582
+ # Unload embedding model
1583
+ embedding_model.unload()
1584
+ embedding_model = None
1585
+
1586
+ total_deduplicated = sum(len(v) for v in deduplicated_items.values())
1587
+ duplicates_removed = total_extracted - total_deduplicated
1588
+
1589
+ yield {
1590
+ "stage": "deduplication",
1591
+ "ticker": f"✅ Deduplication complete: {total_extracted} → {total_deduplicated} ({duplicates_removed} duplicates removed)",
1592
+ "thinking": "",
1593
+ "summary": ""
1594
+ }
1595
+
1596
+ # ===== STAGE 3: SYNTHESIS =====
1597
+ yield {"stage": "synthesis", "ticker": "", "thinking": "", "summary": "Loading synthesis model..."}
1598
+
1599
+ synthesis_llm, load_msg = load_model_for_role(
1600
+ model_key=synthesis_model_key,
1601
+ model_role="synthesis",
1602
+ n_threads=n_threads
1603
+ )
1604
+
1605
+ yield {"stage": "synthesis", "ticker": "", "thinking": "", "summary": load_msg}
1606
+
1607
+ # Synthesize
1608
+ synthesis_config = get_model_config(synthesis_model_key, "synthesis")
1609
+ final_summary = ""
1610
+ final_thinking = ""
1611
+
1612
+ for summary_chunk, thinking_chunk, is_complete in stream_synthesize_executive_summary(
1613
+ synthesis_llm=synthesis_llm,
1614
+ deduplicated_items=deduplicated_items,
1615
+ model_config=synthesis_config,
1616
+ output_language=output_language,
1617
+ enable_reasoning=enable_synthesis_reasoning,
1618
+ max_tokens=max_tokens,
1619
+ tracer=tracer
1620
+ ):
1621
+ final_summary = summary_chunk
1622
+ final_thinking = thinking_chunk
1623
+ yield {"stage": "synthesis", "ticker": "", "thinking": thinking_chunk, "summary": summary_chunk}
1624
+
1625
+ # Unload synthesis model
1626
+ unload_model(synthesis_llm, "synthesis model")
1627
+ synthesis_llm = None
1628
+
1629
+ # Apply Chinese conversion if needed
1630
+ if output_language == "zh-TW":
1631
+ converter = OpenCC('s2twp')
1632
+ final_summary = converter.convert(final_summary)
1633
+ if final_thinking:
1634
+ final_thinking = converter.convert(final_thinking)
1635
+
1636
+ # Get trace stats
1637
+ trace_stats = tracer.get_summary_stats()
1638
+
1639
+ yield {
1640
+ "stage": "complete",
1641
+ "ticker": "",
1642
+ "thinking": final_thinking,
1643
+ "summary": final_summary,
1644
+ "trace_stats": trace_stats,
1645
+ "trace_json": tracer.get_trace_json()
1646
+ }
1647
+
1648
+ except Exception as e:
1649
+ logger.error(f"Advanced pipeline error: {e}", exc_info=True)
1650
+
1651
+ # Cleanup
1652
+ if extraction_llm:
1653
+ unload_model(extraction_llm, "extraction model")
1654
+ if embedding_model:
1655
+ embedding_model.unload()
1656
+ if synthesis_llm:
1657
+ unload_model(synthesis_llm, "synthesis model")
1658
+
1659
+ yield {
1660
+ "stage": "error",
1661
+ "ticker": "",
1662
+ "thinking": "",
1663
+ "summary": "",
1664
+ "error": str(e)
1665
+ }
1666
+
1667
+
1668
  def download_summary_json(summary, thinking, model_key, language, metrics):
1669
  """Generate JSON file with summary and metadata."""
1670
  import json
 
2566
  )
2567
 
2568
  # ==========================================
2569
+ # Section 2: Mode Selection (Standard vs Advanced)
2570
  # ==========================================
2571
+ with gr.Tabs() as mode_tabs:
2572
 
2573
+ # ===== STANDARD MODE =====
2574
+ with gr.TabItem("📊 Standard Mode"):
2575
+ gr.HTML('<div style="font-size: 0.9em; color: #64748b; margin-bottom: 10px;">Single-model direct summarization</div>')
 
2576
 
2577
+ with gr.Tabs() as model_tabs:
2578
+
2579
+ # --- Tab 1: Preset Models ---
2580
+ with gr.TabItem("🤖 Preset Models"):
2581
+ # Filter out custom_hf from preset choices
2582
+ preset_choices = [
2583
+ (info["name"] + (" ⚡" if info.get("supports_reasoning", False) and not info.get("supports_toggle", False) else ""), key)
2584
+ for key, info in AVAILABLE_MODELS.items()
2585
+ if key != "custom_hf"
2586
+ ]
2587
+
2588
+ model_dropdown = gr.Dropdown(
2589
+ choices=preset_choices,
2590
+ value=DEFAULT_MODEL_KEY,
2591
+ label="Select Model",
2592
+ info="Smaller = faster. ⚡ = Always-reasoning models."
2593
+ )
2594
+
2595
+ enable_reasoning = gr.Checkbox(
2596
+ value=True,
2597
+ label="Enable Reasoning Mode",
2598
+ info="Uses /think for deeper analysis (slower) or /no_think for direct output (faster).",
2599
+ interactive=True,
2600
+ visible=AVAILABLE_MODELS[DEFAULT_MODEL_KEY].get("supports_toggle", False)
2601
+ )
2602
+
2603
+ # Model info for preset models
2604
+ gr.HTML('<div class="section-header" style="margin-top: 12px;"><span class="section-icon">📊</span> Model Information</div>')
2605
+ _default_threads = DEFAULT_CUSTOM_THREADS if DEFAULT_CUSTOM_THREADS > 0 else 2
2606
+ info_output = gr.Markdown(
2607
+ value=get_model_info(DEFAULT_MODEL_KEY, n_threads=_default_threads)[0],
2608
+ elem_classes=["stats-grid"]
2609
+ )
2610
+
2611
+ # --- Tab 2: Custom GGUF ---
2612
+ with gr.TabItem("🔧 Custom GGUF"):
2613
+ gr.HTML('<div style="font-size: 0.85em; color: #64748b; margin-bottom: 10px;">Load any GGUF model from HuggingFace Hub</div>')
2614
+
2615
+ # HF Hub Search Component
2616
+ model_search_input = HuggingfaceHubSearch(
2617
+ label="🔍 Search HuggingFace Models",
2618
+ placeholder="Type model name (e.g., 'qwen', 'phi', 'llama')",
2619
+ search_type="model",
2620
+ )
2621
+
2622
+ # File dropdown (populated after repo discovery)
2623
+ custom_file_dropdown = gr.Dropdown(
2624
+ label="📦 Select GGUF File",
2625
+ choices=[],
2626
+ value=None,
2627
+ info="GGUF files appear after selecting a model above",
2628
+ interactive=True,
2629
+ )
2630
+
2631
+ # Load button
2632
+ load_btn = gr.Button("⬇️ Load Selected Model", variant="primary", size="sm")
2633
+
2634
+ # Status message
2635
+ custom_status = gr.Textbox(
2636
+ label="Status",
2637
+ interactive=False,
2638
+ value="",
2639
+ visible=False,
2640
+ )
2641
+
2642
+ retry_btn = gr.Button("🔄 Retry", variant="secondary", visible=False)
2643
+
2644
+ # Model info for custom models (shows after loading)
2645
+ gr.HTML('<div class="section-header" style="margin-top: 12px;"><span class="section-icon">📊</span> Custom Model Info</div>')
2646
+ custom_info_output = gr.Markdown(
2647
+ value="*Load a model to see its specifications...*",
2648
+ elem_classes=["stats-grid"]
2649
+ )
2650
 
2651
+ # ===== ADVANCED MODE =====
2652
+ with gr.TabItem("🧠 Advanced Mode (3-Model Pipeline)"):
2653
+ gr.HTML('<div style="font-size: 0.9em; color: #64748b; margin-bottom: 10px;">Extraction → Deduplication → Synthesis</div>')
2654
 
2655
+ # Model Selection Row
2656
+ gr.HTML('<div class="section-header"><span class="section-icon">🤖</span> Model Selection</div>')
2657
+ with gr.Row():
2658
+ extraction_model = gr.Dropdown(
2659
+ choices=[(EXTRACTION_MODELS[k]["name"], k) for k in EXTRACTION_MODELS.keys()],
2660
+ value=DEFAULT_EXTRACTION_MODEL,
2661
+ label="🔍 Stage 1: Extraction Model (≤1.7B)",
2662
+ info="Extracts structured items from windows"
2663
+ )
2664
+
2665
+ embedding_model = gr.Dropdown(
2666
+ choices=[("granite-107m", "granite-107m"), ("granite-278m", "granite-278m"),
2667
+ ("gemma-300m", "gemma-300m"), ("qwen-600m", "qwen-600m")],
2668
+ value="granite-107m",
2669
+ label="🧬 Stage 2: Embedding Model",
2670
+ info="Deduplication via semantic similarity"
2671
+ )
2672
+
2673
+ synthesis_model = gr.Dropdown(
2674
+ choices=[(SYNTHESIS_MODELS[k]["name"], k) for k in SYNTHESIS_MODELS.keys()],
2675
+ value=DEFAULT_SYNTHESIS_MODEL,
2676
+ label="✨ Stage 3: Synthesis Model (1B-30B)",
2677
+ info="Generates executive summary"
2678
+ )
2679
 
2680
+ # Extraction Parameters Row
2681
+ gr.HTML('<div class="section-header" style="margin-top: 12px;"><span class="section-icon">⚙️</span> Extraction Parameters</div>')
2682
+ with gr.Row():
2683
+ extraction_n_ctx = gr.Slider(
2684
+ minimum=2048,
2685
+ maximum=8192,
2686
+ step=1024,
2687
+ value=4096,
2688
+ label="🪟 Extraction Context Window (n_ctx)",
2689
+ info="Smaller = more windows, Larger = fewer windows"
2690
+ )
2691
+
2692
+ overlap_turns = gr.Slider(
2693
+ minimum=1,
2694
+ maximum=5,
2695
+ step=1,
2696
+ value=2,
2697
+ label="🔄 Window Overlap (turns)",
2698
+ info="Speaker turns shared between windows"
2699
+ )
2700
 
2701
+ # Deduplication Parameters Row
2702
+ with gr.Row():
2703
+ similarity_threshold = gr.Slider(
2704
+ minimum=0.70,
2705
+ maximum=0.95,
2706
+ step=0.01,
2707
+ value=0.85,
2708
+ label="🎯 Deduplication Similarity Threshold",
2709
+ info="Higher = stricter duplicate detection"
2710
+ )
2711
 
2712
+ # Reasoning Controls (Separate checkboxes)
2713
+ gr.HTML('<div class="section-header" style="margin-top: 12px;"><span class="section-icon">🧠</span> Reasoning Configuration</div>')
2714
+ with gr.Row():
2715
+ enable_extraction_reasoning = gr.Checkbox(
2716
+ value=False,
2717
+ visible=False,
2718
+ label="🧠 Enable Reasoning for Extraction",
2719
+ info="Thinking before JSON (Qwen3 hybrid models only)"
2720
+ )
2721
+
2722
+ enable_synthesis_reasoning = gr.Checkbox(
2723
+ value=True,
2724
+ visible=True,
2725
+ label="🧠 Enable Reasoning for Synthesis",
2726
+ info="Thinking for final summary generation"
2727
+ )
2728
 
2729
+ # Output Settings Row
2730
+ gr.HTML('<div class="section-header" style="margin-top: 12px;"><span class="section-icon">🌐</span> Output Settings</div>')
2731
+ with gr.Row():
2732
+ adv_output_language = gr.Radio(
2733
+ choices=["en", "zh-TW"],
2734
+ value="en",
2735
+ label="Output Language",
2736
+ info="Extraction auto-detects, synthesis uses this"
2737
+ )
2738
+
2739
+ adv_max_tokens = gr.Slider(
2740
+ minimum=512,
2741
+ maximum=4096,
2742
+ step=128,
2743
+ value=2048,
2744
+ label="📏 Max Synthesis Tokens",
2745
+ info="Maximum tokens for final summary"
2746
+ )
2747
 
2748
+ # Logging Control
2749
+ enable_detailed_logging = gr.Checkbox(
2750
+ value=True,
2751
+ label="📝 Enable Detailed Trace Logging",
2752
+ info="Save JSONL trace (embedded in download JSON)"
2753
  )
2754
 
2755
  # ==========================================
 
2982
  outputs=[system_prompt_debug],
2983
  )
2984
 
2985
+ # ===== ADVANCED MODE EVENT HANDLERS =====
2986
+
2987
+ # Update extraction reasoning checkbox visibility when extraction model changes
2988
+ def update_extraction_reasoning_visibility(model_key):
2989
+ """Show/hide extraction reasoning checkbox based on model capabilities."""
2990
+ if model_key not in EXTRACTION_MODELS:
2991
+ return gr.update(visible=False, value=False)
2992
+
2993
+ config = EXTRACTION_MODELS[model_key]
2994
+ supports_toggle = config.get("supports_toggle", False)
2995
+
2996
+ if supports_toggle:
2997
+ # Hybrid model
2998
+ return gr.update(visible=True, value=False, interactive=True, label="🧠 Enable Reasoning for Extraction")
2999
+ elif config.get("supports_reasoning", False):
3000
+ # Thinking-only model (none currently in extraction)
3001
+ return gr.update(visible=True, value=True, interactive=False, label="🧠 Reasoning Mode (Always On)")
3002
+ else:
3003
+ # Non-reasoning model
3004
+ return gr.update(visible=False, value=False)
3005
+
3006
+ # Update synthesis reasoning checkbox visibility when synthesis model changes
3007
+ def update_synthesis_reasoning_visibility(model_key):
3008
+ """Show/hide synthesis reasoning checkbox based on model capabilities."""
3009
+ if model_key not in SYNTHESIS_MODELS:
3010
+ return gr.update(visible=False, value=False)
3011
+
3012
+ config = SYNTHESIS_MODELS[model_key]
3013
+ supports_reasoning = config.get("supports_reasoning", False)
3014
+ supports_toggle = config.get("supports_toggle", False)
3015
+
3016
+ if not supports_reasoning:
3017
+ # Non-reasoning model
3018
+ return gr.update(visible=False, value=False)
3019
+ elif supports_reasoning and not supports_toggle:
3020
+ # Thinking-only model
3021
+ return gr.update(visible=True, value=True, interactive=False, label="⚡ Reasoning Mode (Always On)")
3022
+ else:
3023
+ # Hybrid model
3024
+ return gr.update(visible=True, value=True, interactive=True, label="🧠 Enable Reasoning for Synthesis")
3025
+
3026
+ # Wire up Advanced Mode event handlers
3027
+ extraction_model.change(
3028
+ fn=update_extraction_reasoning_visibility,
3029
+ inputs=[extraction_model],
3030
+ outputs=[enable_extraction_reasoning]
3031
+ )
3032
+
3033
+ synthesis_model.change(
3034
+ fn=update_synthesis_reasoning_visibility,
3035
+ inputs=[synthesis_model],
3036
+ outputs=[enable_synthesis_reasoning]
3037
+ )
3038
+
3039
  # Debounced auto-discovery for custom repo ID (500ms delay)
3040
  import time as time_module
3041
 
 
3201
  outputs=[custom_info_output],
3202
  )
3203
 
3204
+ # ===== SUBMIT BUTTON ROUTER =====
3205
+ # Routes to Standard or Advanced mode based on active tab
3206
+
3207
+ def route_summarize(
3208
+ # Standard mode inputs
3209
+ file_input_val, text_input_val, model_dropdown_val, enable_reasoning_val,
3210
+ max_tokens_val, temperature_val, top_p_val, top_k_val, language_val,
3211
+ thread_config_val, custom_threads_val, custom_model_val,
3212
+ # Advanced mode inputs
3213
+ extraction_model_val, embedding_model_val, synthesis_model_val,
3214
+ extraction_n_ctx_val, overlap_turns_val, similarity_threshold_val,
3215
+ enable_extraction_reasoning_val, enable_synthesis_reasoning_val,
3216
+ adv_output_language_val, adv_max_tokens_val, enable_logging_val,
3217
+ # Mode selector
3218
+ mode_tabs_val
3219
+ ):
3220
+ """Route to Standard or Advanced mode based on selected tab."""
3221
+
3222
+ # Determine active mode (Gradio returns index of active tab)
3223
+ # 0 = Standard Mode, 1 = Advanced Mode
3224
+ is_advanced_mode = (mode_tabs_val == 1)
3225
+
3226
+ if is_advanced_mode:
3227
+ # Advanced Mode: Use summarize_advanced()
3228
+ # Get n_threads
3229
+ thread_map = {"free": 2, "upgrade": 8, "custom": max(1, custom_threads_val)}
3230
+ n_threads = thread_map.get(thread_config_val, 2)
3231
+
3232
+ # Get transcript
3233
+ transcript = ""
3234
+ if file_input_val:
3235
+ with open(file_input_val, 'r', encoding='utf-8') as f:
3236
+ transcript = f.read()
3237
+ elif text_input_val:
3238
+ transcript = text_input_val
3239
+ else:
3240
+ yield ("", "⚠️ Please upload a file or paste text", "", {}, "")
3241
+ return
3242
+
3243
+ # Stream Advanced Mode pipeline
3244
+ for update in summarize_advanced(
3245
+ transcript=transcript,
3246
+ extraction_model_key=extraction_model_val,
3247
+ embedding_model_key=embedding_model_val,
3248
+ synthesis_model_key=synthesis_model_val,
3249
+ extraction_n_ctx=extraction_n_ctx_val,
3250
+ overlap_turns=overlap_turns_val,
3251
+ similarity_threshold=similarity_threshold_val,
3252
+ enable_extraction_reasoning=enable_extraction_reasoning_val,
3253
+ enable_synthesis_reasoning=enable_synthesis_reasoning_val,
3254
+ output_language=adv_output_language_val,
3255
+ max_tokens=adv_max_tokens_val,
3256
+ enable_logging=enable_logging_val,
3257
+ n_threads=n_threads
3258
+ ):
3259
+ stage = update.get("stage", "")
3260
+
3261
+ if stage == "extraction":
3262
+ ticker = update.get("ticker", "")
3263
+ thinking = update.get("thinking", "")
3264
+ yield (thinking, ticker, "", {}, "")
3265
+
3266
+ elif stage == "deduplication":
3267
+ ticker = update.get("ticker", "")
3268
+ yield ("", ticker, "", {}, "")
3269
+
3270
+ elif stage == "synthesis":
3271
+ thinking = update.get("thinking", "")
3272
+ summary = update.get("summary", "")
3273
+ yield (thinking, summary, "", {}, "")
3274
+
3275
+ elif stage == "complete":
3276
+ thinking = update.get("thinking", "")
3277
+ summary = update.get("summary", "")
3278
+ trace_stats = update.get("trace_stats", {})
3279
+
3280
+ # Format info message
3281
+ info_msg = f"""**Advanced Mode Complete**
3282
+ - Total Windows: {trace_stats.get('total_windows', 0)}
3283
+ - Items Extracted: {trace_stats.get('total_items_extracted', 0)}
3284
+ - Items After Dedup: {trace_stats.get('total_items_after_dedup', 0)}
3285
+ - Duplicates Removed: {trace_stats.get('total_duplicates_removed', 0)}
3286
+ - Total Time: {trace_stats.get('total_elapsed_seconds', 0):.1f}s"""
3287
+
3288
+ # Store trace for download
3289
+ metrics = {
3290
+ "mode": "advanced",
3291
+ "trace_stats": trace_stats,
3292
+ "trace_json": update.get("trace_json", [])
3293
+ }
3294
+
3295
+ yield (thinking, summary, info_msg, metrics, "Advanced Mode (3-Model Pipeline)")
3296
+
3297
+ elif stage == "error":
3298
+ error = update.get("error", "Unknown error")
3299
+ yield ("", f"❌ Error: {error}", "", {}, "")
3300
+ return
3301
+
3302
+ else:
3303
+ # Standard Mode: Use existing summarize_streaming()
3304
+ for thinking, summary, info, metrics, system_prompt in summarize_streaming(
3305
+ file_input_val, text_input_val, model_dropdown_val, enable_reasoning_val,
3306
+ max_tokens_val, temperature_val, top_p_val, top_k_val, language_val,
3307
+ thread_config_val, custom_threads_val, custom_model_val
3308
+ ):
3309
+ yield (thinking, summary, info, metrics, system_prompt)
3310
+
3311
+ # Wire up submit button with router
3312
  submit_btn.click(
3313
+ fn=route_summarize,
3314
+ inputs=[
3315
+ # Standard mode inputs
3316
+ file_input, text_input, model_dropdown, enable_reasoning,
3317
+ max_tokens, temperature_slider, top_p, top_k, language_selector,
3318
+ thread_config_dropdown, custom_threads_slider, custom_model_state,
3319
+ # Advanced mode inputs
3320
+ extraction_model, embedding_model, synthesis_model,
3321
+ extraction_n_ctx, overlap_turns, similarity_threshold,
3322
+ enable_extraction_reasoning, enable_synthesis_reasoning,
3323
+ adv_output_language, adv_max_tokens, enable_detailed_logging,
3324
+ # Mode selector
3325
+ mode_tabs
3326
+ ],
3327
  outputs=[thinking_output, summary_output, info_output, metrics_state, system_prompt_debug],
3328
  show_progress="full"
3329
  )
meeting_summarizer/__init__.py ADDED
@@ -0,0 +1,13 @@
1
+ """
2
+ Tiny Scribe - Meeting Summarizer Module
3
+
4
+ This module provides advanced 3-stage meeting summarization:
5
+ 1. Extraction: Extract structured items from transcript windows
6
+ 2. Deduplication: Remove semantic duplicates using embeddings
7
+ 3. Synthesis: Generate executive summary from deduplicated items
8
+ """
9
+
10
+ __version__ = "1.0.0"
11
+
12
+ # Package exports will be added as we implement components
13
+ __all__ = []
meeting_summarizer/extraction.py ADDED
@@ -0,0 +1,705 @@
1
+ """
2
+ Advanced Extraction Pipeline
3
+
4
+ Provides:
5
+ 1. EMBEDDING_MODELS registry (4 models for deduplication)
6
+ 2. NativeTokenizer - Count tokens without llama.cpp
7
+ 3. EmbeddingModel - Load/compute embeddings
8
+ 4. format_progress_ticker - Live UI updates
9
+ 5. stream_extract_from_window - Stage 1: Extraction
10
+ 6. deduplicate_items - Stage 2: Deduplication
11
+ 7. stream_synthesize_executive_summary - Stage 3: Synthesis
12
+ """
13
+
14
+ import re
15
+ import json
16
+ import time
17
+ import logging
18
+ from typing import Dict, List, Any, Tuple, Generator, Optional
19
+ from dataclasses import dataclass
20
+ import numpy as np
21
+ from llama_cpp import Llama
22
+
23
+ logger = logging.getLogger(__name__)
24
+
25
+
26
+ # ===== EMBEDDING MODELS REGISTRY =====
27
+
28
+ EMBEDDING_MODELS = {
29
+ "granite-107m": {
30
+ "name": "Granite 107M Multilingual (384-dim)",
31
+ "repo_id": "ibm-granite/granite-embedding-107m-multilingual",
32
+ "filename": "*Q8_0.gguf",
33
+ "embedding_dim": 384,
34
+ "max_context": 2048,
35
+ "description": "Fastest, multilingual, good for quick deduplication",
36
+ },
37
+ "granite-278m": {
38
+ "name": "Granite 278M Multilingual (768-dim)",
39
+ "repo_id": "ibm-granite/granite-embedding-278m-multilingual",
40
+ "filename": "*Q8_0.gguf",
41
+ "embedding_dim": 768,
42
+ "max_context": 2048,
43
+ "description": "Balanced speed/quality, multilingual",
44
+ },
45
+ "gemma-300m": {
46
+ "name": "Embedding Gemma 300M (768-dim)",
47
+ "repo_id": "unsloth/embeddinggemma-300m-GGUF",
48
+ "filename": "*Q8_0.gguf",
49
+ "embedding_dim": 768,
50
+ "max_context": 2048,
51
+ "description": "Google embedding model, strong semantics",
52
+ },
53
+ "qwen-600m": {
54
+ "name": "Qwen3 Embedding 600M (1024-dim)",
55
+ "repo_id": "Qwen/Qwen3-Embedding-0.6B-GGUF",
56
+ "filename": "*Q8_0.gguf",
57
+ "embedding_dim": 1024,
58
+ "max_context": 2048,
59
+ "description": "Highest quality, best for critical dedup",
60
+ },
61
+ }
62
+
63
+
64
+ # ===== NATIVE TOKENIZER =====
65
+
66
+ class NativeTokenizer:
67
+ """
68
+ Simple tokenizer for counting tokens without llama.cpp.
69
+ Uses GPT-2 style approximation: ~1 token per 4 characters.
70
+ """
71
+
72
+ def __init__(self):
73
+ """Initialize tokenizer."""
74
+ self.chars_per_token = 4 # Conservative estimate
75
+
76
+ def count(self, text: str) -> int:
77
+ """
78
+ Count tokens in text.
79
+
80
+ Args:
81
+ text: Input text
82
+
83
+ Returns:
84
+ Approximate token count
85
+ """
86
+ if not text:
87
+ return 0
88
+
89
+ # Simple heuristic: 1 token ≈ 4 characters for English
90
+ # Adjust for CJK characters (Chinese/Japanese/Korean)
91
+ cjk_chars = len(re.findall(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]', text))
92
+ non_cjk_chars = len(text) - cjk_chars
93
+
94
+ # CJK: 1 char ≈ 1 token, Non-CJK: 4 chars ≈ 1 token
95
+ tokens = cjk_chars + (non_cjk_chars // self.chars_per_token)
96
+
97
+ return max(1, tokens) # Minimum 1 token
98
+
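As a sanity check, the heuristic above can be exercised in isolation (re-implemented here for illustration; `approx_tokens` is a hypothetical name, not a module export):

```python
import re

def approx_tokens(text, chars_per_token=4):
    """CJK characters count as ~1 token each; everything else as ~4 chars/token."""
    if not text:
        return 0
    cjk = len(re.findall(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]', text))
    other = len(text) - cjk
    return max(1, cjk + other // chars_per_token)
```

So an 11-character English string counts as 2 tokens, while a 2-character Chinese string also counts as 2 — the CJK adjustment is what keeps window sizing reasonable for zh-TW transcripts.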
99
+
100
+ # ===== EMBEDDING MODEL =====
101
+
102
+ class EmbeddingModel:
103
+ """Wrapper for embedding models used in deduplication."""
104
+
105
+ def __init__(self, model_key: str, n_threads: int = 2):
106
+ """
107
+ Initialize embedding model.
108
+
109
+ Args:
110
+ model_key: Key from EMBEDDING_MODELS registry
111
+ n_threads: CPU threads for inference
112
+ """
113
+ if model_key not in EMBEDDING_MODELS:
114
+ raise ValueError(f"Unknown embedding model: {model_key}")
115
+
116
+ self.model_key = model_key
117
+ self.config = EMBEDDING_MODELS[model_key]
118
+ self.n_threads = n_threads
119
+ self.llm: Optional[Llama] = None
120
+
121
+ def load(self) -> str:
122
+ """
123
+ Load embedding model.
124
+
125
+ Returns:
126
+ Info message
127
+ """
128
+ logger.info(f"Loading embedding model: {self.config['name']}")
129
+
130
+ try:
131
+ self.llm = Llama.from_pretrained(
132
+ repo_id=self.config["repo_id"],
133
+ filename=self.config["filename"],
134
+ n_ctx=self.config["max_context"],
135
+ n_batch=512,
136
+ n_threads=self.n_threads,
137
+ n_threads_batch=self.n_threads,
138
+ n_gpu_layers=0, # CPU only for embeddings
139
+ verbose=False,
140
+ embedding=True, # Enable embedding mode
141
+ )
142
+
143
+ msg = f"✅ Loaded: {self.config['name']} ({self.config['embedding_dim']}-dim)"
144
+ logger.info(msg)
145
+ return msg
146
+
147
+ except Exception as e:
148
+ error_msg = f"❌ Failed to load {self.model_key}: {str(e)}"
149
+ logger.error(error_msg, exc_info=True)
150
+ raise RuntimeError(error_msg) from e
151
+
152
+ def embed(self, text: str) -> np.ndarray:
153
+ """
154
+ Compute embedding for text.
155
+
156
+ Args:
157
+ text: Input text
158
+
159
+ Returns:
160
+ Embedding vector (numpy array)
161
+ """
162
+ if self.llm is None:
163
+ raise RuntimeError("Model not loaded. Call load() first.")
164
+
165
+ # Truncate text to max context
166
+ # Rough approximation: 1 token ≈ 4 chars
167
+ max_chars = self.config["max_context"] * 4
168
+ if len(text) > max_chars:
169
+ text = text[:max_chars]
170
+
171
+ # Get embedding
172
+ # llama_cpp returns a Python list; convert so the normalization below works
+ embedding = np.asarray(self.llm.embed(text), dtype=np.float32)
173
+
174
+ # Normalize vector
175
+ norm = np.linalg.norm(embedding)
176
+ if norm > 0:
177
+ embedding = embedding / norm
178
+
179
+ return embedding
180
+
181
+ def unload(self) -> None:
182
+ """Unload model and free memory."""
183
+ if self.llm:
184
+ logger.info(f"Unloading embedding model: {self.config['name']}")
185
+ del self.llm
186
+ self.llm = None
187
+
188
+ import gc
189
+ gc.collect()
190
+ time.sleep(0.5)
191
+
192
+
193
+ # ===== HELPER FUNCTIONS =====
194
+
195
+ @dataclass
196
+ class Window:
197
+ """Represents a transcript window for extraction."""
198
+ id: int
199
+ content: str
200
+ start_turn: int
201
+ end_turn: int
202
+ token_count: int
203
+
204
+
205
+ def format_progress_ticker(
206
+ current_window: int,
207
+ total_windows: int,
208
+ window_tokens: int,
209
+ max_tokens: int,
210
+ items_found: Dict[str, int],
211
+ tokens_per_sec: float,
212
+ eta_seconds: int,
213
+ current_snippet: str
214
+ ) -> str:
215
+ """
216
+ Format progress ticker for extraction UI.
217
+
218
+ Args:
219
+ current_window: Current window number (1-indexed)
220
+ total_windows: Total number of windows
221
+ window_tokens: Tokens in current window
222
+ max_tokens: Maximum tokens (for percentage)
223
+ items_found: Dict of {category: count}
224
+ tokens_per_sec: Generation speed
225
+ eta_seconds: Estimated time to completion
226
+ current_snippet: Last extracted item (truncated)
227
+
228
+ Returns:
229
+ Formatted ticker string
230
+ """
231
+ # Progress bar
232
+ progress_pct = (current_window / total_windows) * 100
233
+ bar_width = 20
234
+ filled = int(bar_width * progress_pct / 100)
235
+ bar = "█" * filled + "░" * (bar_width - filled)
236
+
237
+ # Item counts
238
+ action_items = items_found.get("action_items", 0)
239
+ decisions = items_found.get("decisions", 0)
240
+ key_points = items_found.get("key_points", 0)
241
+ questions = items_found.get("open_questions", 0)
242
+ total_items = action_items + decisions + key_points + questions
243
+
244
+ # ETA formatting
245
+ if eta_seconds > 60:
246
+ eta_str = f"{eta_seconds // 60}m {eta_seconds % 60}s"
247
+ else:
248
+ eta_str = f"{eta_seconds}s"
249
+
250
+ # Truncate snippet
251
+ snippet = current_snippet[:60] + "..." if len(current_snippet) > 60 else current_snippet
252
+
253
+ ticker = f"""
254
+ 🪟 Window {current_window}/{total_windows} | {bar} {progress_pct:.0f}%
255
+
256
+ 📊 Extracted: {total_items} items
257
+ ✓ Actions: {action_items} | Decisions: {decisions} | Points: {key_points} | Questions: {questions}
258
+
259
+ ⚡ Speed: {tokens_per_sec:.1f} tok/s | ETA: {eta_str}
260
+ 📝 Latest: {snippet}
261
+ """
262
+
263
+ return ticker.strip()
264
+
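The bar rendering above reduces to a small pure function (a sketch with a hypothetical name, shown here only to make the rounding behavior explicit):

```python
def progress_bar(current, total, width=20):
    """Block-character progress bar matching the ticker's rendering."""
    pct = (current / total) * 100
    filled = int(width * pct / 100)  # truncates, so the bar never overshoots
    return "█" * filled + "░" * (width - filled), pct
```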
265
+
266
+ def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
267
+ """
268
+ Compute cosine similarity between two vectors.
269
+
270
+ Args:
271
+ vec1: First vector (normalized)
272
+ vec2: Second vector (normalized)
273
+
274
+ Returns:
275
+ Cosine similarity in [-1.0, 1.0]; typically non-negative for text embeddings
276
+ """
277
+ # Vectors should already be normalized, but ensure it
278
+ dot_product = np.dot(vec1, vec2)
279
+ return float(dot_product)
280
+
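On top of this similarity, the Stage-2 deduplication can be sketched as a greedy keep-first filter (a simplified sketch with a toy `embed` callable; the real `deduplicate_items` works per category and its exact signature may differ):

```python
import numpy as np

def greedy_dedup(texts, embed, threshold=0.85):
    """Keep each text only if its (normalized) embedding stays below
    `threshold` cosine similarity to every already-kept text."""
    kept, kept_vecs = [], []
    for t in texts:
        v = np.asarray(embed(t), dtype=np.float32)
        v = v / np.linalg.norm(v)  # normalize so dot product == cosine
        if all(float(np.dot(v, k)) < threshold for k in kept_vecs):
            kept.append(t)
            kept_vecs.append(v)
    return kept
```

Raising the threshold slider makes this filter stricter: near-duplicates need to be almost identical before they are dropped.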
281
+
282
+ # ===== JSON PARSING HELPERS =====
283
+
284
+ def _try_parse_extraction_json(text: str) -> Optional[Dict[str, List[str]]]:
285
+ """
286
+ Attempt to parse extraction JSON from LLM output.
287
+
288
+ Args:
289
+ text: Raw LLM output
290
+
291
+ Returns:
292
+ Parsed dict or None if invalid
293
+ """
294
+ # Remove markdown code blocks
295
+ text = re.sub(r'```json\s*', '', text)
296
+ text = re.sub(r'```\s*$', '', text)
297
+ text = text.strip()
298
+
299
+ try:
300
+ data = json.loads(text)
301
+
302
+ # Validate schema
303
+ required_keys = {"action_items", "decisions", "key_points", "open_questions"}
304
+ if not isinstance(data, dict) or not required_keys.issubset(data.keys()):
305
+ return None
306
+
307
+ # Validate all values are lists
308
+ for key in required_keys:
309
+ if not isinstance(data[key], list):
310
+ return None
311
+
312
+ return data
313
+
314
+ except json.JSONDecodeError:
315
+ return None
316
+
317
+
318
+ def _sample_llm_response(text: str, max_chars: int = 400) -> str:
319
+ """Sample LLM response for trace logging."""
320
+ if not text:
321
+ return ""
322
+ return text[:max_chars] if len(text) > max_chars else text
323
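For reference, the fence-stripping JSON validation above can be exercised standalone. This sketch re-implements the helper's logic in isolation (the `parse_extraction` name and sample payload are illustrative, not imports from the app):

```python
import json
import re

def parse_extraction(text):
    # Strip markdown code fences, then parse and validate the schema.
    text = text.strip()
    text = re.sub(r'^```(?:json)?\s*', '', text)
    text = re.sub(r'```\s*$', '', text)
    required = {"action_items", "decisions", "key_points", "open_questions"}
    try:
        data = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required.issubset(data):
        return None
    if not all(isinstance(data[k], list) for k in required):
        return None
    return data

raw = ('```json\n'
       '{"action_items": ["Ship v2"], "decisions": [], '
       '"key_points": [], "open_questions": []}\n'
       '```')
print(parse_extraction(raw)["action_items"])  # ['Ship v2']
```

Anchoring the opening-fence regex (`^```...`) keeps the substitution from eating backticks that happen to appear inside the JSON payload itself.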
+
+
+ # ===== CORE PIPELINE FUNCTIONS =====
+
+ def stream_extract_from_window(
+     extraction_llm: Llama,
+     window: Window,
+     window_id: int,
+     total_windows: int,
+     tracer: Any,
+     tokenizer: NativeTokenizer,
+     model_config: Dict[str, Any],
+     enable_reasoning: bool = False
+ ) -> Generator[Tuple[str, str, Dict[str, List[str]], bool], None, None]:
+     """
+     Stream extraction from a single window with live progress + optional reasoning.
+
+     Yields:
+         (ticker_text, thinking_text, partial_items, is_complete)
+         - ticker_text: Progress ticker for UI
+         - thinking_text: Reasoning/thinking blocks (if the model supports them)
+         - partial_items: Current extracted items
+         - is_complete: True on the final yield
+     """
+     # Auto-detect language from window content
+     has_cjk = bool(re.search(r'[\u4e00-\u9fff]', window.content))
+     output_language = "zh-TW" if has_cjk else "en"
+
+     supports_reasoning = model_config.get("supports_reasoning", False)
+     supports_toggle = model_config.get("supports_toggle", False)
+
+     # Build system prompt
+     if output_language == "zh-TW":
+         reasoning_inst = "使用推理能力分析後提取。" if (supports_toggle and enable_reasoning) else ""
+         system_prompt = f"""你是會議分析助手。{reasoning_inst}
+
+ 僅輸出 JSON:
+ {{
+   "action_items": ["任務", ...],
+   "decisions": ["決策", ...],
+   "key_points": ["要點", ...],
+   "open_questions": ["問題", ...]
+ }}"""
+     else:
+         reasoning_inst = "Use reasoning before extracting." if (supports_toggle and enable_reasoning) else ""
+         system_prompt = f"""You are a meeting assistant. {reasoning_inst}
+
+ Output ONLY JSON:
+ {{
+   "action_items": ["Task", ...],
+   "decisions": ["Decision", ...],
+   "key_points": ["Point", ...],
+   "open_questions": ["Question", ...]
+ }}"""
+
+     user_prompt = f"Transcript:\n\n{window.content}"
+
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_prompt}
+     ]
+
+     # Stream extraction
+     full_response = ""
+     thinking_content = ""
+     start_time = time.time()
+     first_token_time = None
+     token_count = 0
+
+     try:
+         settings = model_config["inference_settings"]
+         stream = extraction_llm.create_chat_completion(
+             messages=messages,
+             max_tokens=1024,
+             temperature=settings["temperature"],
+             top_p=settings["top_p"],
+             top_k=settings["top_k"],
+             repeat_penalty=settings["repeat_penalty"],
+             stream=True,
+         )
+
+         for chunk in stream:
+             if 'choices' in chunk and len(chunk['choices']) > 0:
+                 delta = chunk['choices'][0].get('delta', {})
+                 content = delta.get('content', '')
+
+                 if content:
+                     if first_token_time is None:
+                         first_token_time = time.time()
+
+                     token_count += 1
+                     full_response += content
+
+                     # Parse thinking blocks if reasoning enabled
+                     if enable_reasoning and supports_reasoning:
+                         thinking_match = re.search(r'<think(?:ing)?>(.*?)</think(?:ing)?>', full_response, re.DOTALL)
+                         if thinking_match:
+                             thinking_content = thinking_match.group(1).strip()
+                             json_text = full_response[:thinking_match.start()] + full_response[thinking_match.end():]
+                         else:
+                             json_text = full_response
+                     else:
+                         json_text = full_response
+
+                     # Try to parse JSON
+                     partial_items = _try_parse_extraction_json(json_text)
+                     if not partial_items:
+                         partial_items = {"action_items": [], "decisions": [], "key_points": [], "open_questions": []}
+
+                     # Calculate metrics
+                     elapsed = time.time() - start_time
+                     tps = token_count / elapsed if elapsed > 0 else 0
+                     eta = int((1024 - token_count) / tps) if tps > 0 else 0
+
+                     # Get item counts
+                     items_found = {k: len(v) for k, v in partial_items.items()}
+
+                     # Get last item as snippet
+                     last_item = ""
+                     for cat in ["action_items", "decisions", "key_points", "open_questions"]:
+                         if partial_items.get(cat):
+                             last_item = partial_items[cat][-1]
+                             break
+
+                     # Format ticker
+                     ticker = format_progress_ticker(
+                         current_window=window_id,
+                         total_windows=total_windows,
+                         window_tokens=window.token_count,
+                         max_tokens=4096,
+                         items_found=items_found,
+                         tokens_per_sec=tps,
+                         eta_seconds=eta,
+                         current_snippet=last_item
+                     )
+
+                     yield (ticker, thinking_content, partial_items, False)
+
+         # Final parse
+         if enable_reasoning and supports_reasoning:
+             thinking_match = re.search(r'<think(?:ing)?>(.*?)</think(?:ing)?>', full_response, re.DOTALL)
+             if thinking_match:
+                 thinking_content = thinking_match.group(1).strip()
+                 json_text = full_response[:thinking_match.start()] + full_response[thinking_match.end():]
+             else:
+                 json_text = full_response
+         else:
+             json_text = full_response
+
+         final_items = _try_parse_extraction_json(json_text)
+
+         if not final_items:
+             error_msg = f"Failed to parse JSON from window {window_id}"
+             tracer.log_extraction(
+                 window_id=window_id,
+                 extraction=None,
+                 llm_response=_sample_llm_response(full_response),
+                 error=error_msg
+             )
+             raise ValueError(error_msg)
+
+         # Log success
+         tracer.log_extraction(
+             window_id=window_id,
+             extraction=final_items,
+             llm_response=_sample_llm_response(full_response),
+             thinking=_sample_llm_response(thinking_content) if thinking_content else None,
+             error=None
+         )
+
+         # Final ticker
+         elapsed = time.time() - start_time
+         tps = token_count / elapsed if elapsed > 0 else 0
+         items_found = {k: len(v) for k, v in final_items.items()}
+
+         ticker = format_progress_ticker(
+             current_window=window_id,
+             total_windows=total_windows,
+             window_tokens=window.token_count,
+             max_tokens=4096,
+             items_found=items_found,
+             tokens_per_sec=tps,
+             eta_seconds=0,
+             current_snippet="✅ Extraction complete"
+         )
+
+         yield (ticker, thinking_content, final_items, True)
+
+     except Exception as e:
+         tracer.log_extraction(
+             window_id=window_id,
+             extraction=None,
+             llm_response=_sample_llm_response(full_response) if full_response else "",
+             error=str(e)
+         )
+         raise
+
+
+ def deduplicate_items(
+     all_items: Dict[str, List[str]],
+     embedding_model: EmbeddingModel,
+     similarity_threshold: float,
+     tracer: Any
+ ) -> Dict[str, List[str]]:
+     """
+     Deduplicate items across all categories using embeddings.
+
+     Args:
+         all_items: Dict of {category: [items]}
+         embedding_model: Loaded embedding model
+         similarity_threshold: Cosine similarity threshold (0.0-1.0)
+         tracer: Tracer instance
+
+     Returns:
+         Deduplicated dict of {category: [items]}
+     """
+     deduplicated = {}
+
+     for category, items in all_items.items():
+         if not items:
+             deduplicated[category] = []
+             continue
+
+         original_count = len(items)
+
+         # Compute embeddings for all items
+         embeddings = []
+         for item in items:
+             emb = embedding_model.embed(item)
+             embeddings.append(emb)
+
+         # Mark duplicates
+         keep_indices = []
+         for i in range(len(items)):
+             is_duplicate = False
+
+             # Compare with all previously kept items
+             for j in keep_indices:
+                 similarity = cosine_similarity(embeddings[i], embeddings[j])
+                 if similarity >= similarity_threshold:
+                     is_duplicate = True
+                     break
+
+             if not is_duplicate:
+                 keep_indices.append(i)
+
+         # Keep only unique items
+         unique_items = [items[i] for i in keep_indices]
+         deduplicated[category] = unique_items
+
+         # Log deduplication
+         duplicates_removed = original_count - len(unique_items)
+         tracer.log_deduplication(
+             category=category,
+             original_count=original_count,
+             deduplicated_count=len(unique_items),
+             duplicates_removed=duplicates_removed,
+             similarity_threshold=similarity_threshold,
+             embedding_model=embedding_model.model_key
+         )
+
+         logger.info(f"Dedup {category}: {original_count} → {len(unique_items)} ({duplicates_removed} removed)")
+
+     return deduplicated
+
+
+ def stream_synthesize_executive_summary(
+     synthesis_llm: Llama,
+     deduplicated_items: Dict[str, List[str]],
+     model_config: Dict[str, Any],
+     output_language: str,
+     enable_reasoning: bool,
+     max_tokens: int,
+     tracer: Any
+ ) -> Generator[Tuple[str, str, bool], None, None]:
+     """
+     Stream synthesis of an executive summary from deduplicated items.
+
+     Yields:
+         (summary_text, thinking_text, is_complete)
+     """
+     # Build synthesis prompt
+     item_counts = {k: len(v) for k, v in deduplicated_items.items()}
+
+     # Format items for prompt
+     items_text = ""
+     for category, items in deduplicated_items.items():
+         if items:
+             category_label = {
+                 "action_items": "Action Items" if output_language == "en" else "行動項目",
+                 "decisions": "Decisions" if output_language == "en" else "決策",
+                 "key_points": "Key Points" if output_language == "en" else "關鍵要點",
+                 "open_questions": "Open Questions" if output_language == "en" else "未解決問題"
+             }.get(category, category)
+
+             items_text += f"\n{category_label}:\n"
+             for i, item in enumerate(items, 1):
+                 items_text += f"{i}. {item}\n"
+
+     if output_language == "zh-TW":
+         system_prompt = "你是執行摘要專家。生成簡潔的執行摘要。"
+         user_prompt = f"基於以下結構化資訊生成執行摘要:\n{items_text}\n\n請提供簡明的執行摘要。"
+     else:
+         system_prompt = "You are an executive summary expert. Generate concise summaries."
+         user_prompt = f"Generate an executive summary based on these structured items:\n{items_text}\n\nProvide a concise executive summary."
+
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_prompt}
+     ]
+
+     # Stream synthesis
+     full_summary = ""
+     thinking_content = ""
+
+     try:
+         settings = model_config["inference_settings"]
+         stream = synthesis_llm.create_chat_completion(
+             messages=messages,
+             max_tokens=max_tokens,
+             temperature=settings["temperature"],
+             top_p=settings["top_p"],
+             top_k=settings["top_k"],
+             repeat_penalty=settings["repeat_penalty"],
+             stream=True,
+         )
+
+         for chunk in stream:
+             if 'choices' in chunk and len(chunk['choices']) > 0:
+                 delta = chunk['choices'][0].get('delta', {})
+                 content = delta.get('content', '')
+
+                 if content:
+                     full_summary += content
+
+                     # Parse thinking if reasoning enabled
+                     if enable_reasoning and model_config.get("supports_reasoning"):
+                         thinking_match = re.search(r'<think(?:ing)?>(.*?)</think(?:ing)?>', full_summary, re.DOTALL)
+                         if thinking_match:
+                             thinking_content = thinking_match.group(1).strip()
+                             summary_text = full_summary[:thinking_match.start()] + full_summary[thinking_match.end():]
+                         else:
+                             summary_text = full_summary
+                     else:
+                         summary_text = full_summary
+
+                     yield (summary_text, thinking_content, False)
+
+         # Final parse
+         if enable_reasoning and model_config.get("supports_reasoning"):
+             thinking_match = re.search(r'<think(?:ing)?>(.*?)</think(?:ing)?>', full_summary, re.DOTALL)
+             if thinking_match:
+                 thinking_content = thinking_match.group(1).strip()
+                 summary_text = full_summary[:thinking_match.start()] + full_summary[thinking_match.end():]
+             else:
+                 summary_text = full_summary
+         else:
+             summary_text = full_summary
+
+         # Log synthesis
+         tracer.log_synthesis(
+             synthesis_model=model_config["name"],
+             input_item_counts=item_counts,
+             output_summary=_sample_llm_response(summary_text),
+             thinking=_sample_llm_response(thinking_content) if thinking_content else None,
+             error=None
+         )
+
+         yield (summary_text, thinking_content, True)
+
+     except Exception as e:
+         tracer.log_synthesis(
+             synthesis_model=model_config["name"],
+             input_item_counts=item_counts,
+             output_summary="",
+             thinking=None,
+             error=str(e)
+         )
+         raise
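The greedy keep-first strategy used by `deduplicate_items` can be illustrated in isolation with mock unit-normalized vectors (the sample items, embeddings, and threshold below are illustrative stand-ins, not real model output):

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Vectors are assumed unit-normalized, so the dot product is the cosine.
    return float(np.dot(v1, v2))

def dedup(items, embeddings, threshold):
    # Greedy keep-first: an item is dropped if it is at least `threshold`
    # similar to any previously kept item.
    keep = []
    for i in range(len(items)):
        if not any(cosine_similarity(embeddings[i], embeddings[j]) >= threshold
                   for j in keep):
            keep.append(i)
    return [items[i] for i in keep]

items = ["ship the report", "send the report", "book meeting room"]
# Mock embeddings: the first two are nearly parallel, the third orthogonal.
embs = [np.array([1.0, 0.0]),
        np.array([0.999, 0.045]),
        np.array([0.0, 1.0])]
print(dedup(items, embs, threshold=0.85))
# → ['ship the report', 'book meeting room']
```

Note the order dependence of keep-first: the earliest occurrence of a near-duplicate group survives, which is why extraction order (window order) determines which phrasing reaches synthesis.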
meeting_summarizer/trace.py ADDED
@@ -0,0 +1,197 @@
+ """
+ Trace Logger for Advanced Mode Pipeline
+
+ Logs extraction, deduplication, and synthesis operations for debugging
+ and an audit trail. Supports JSONL format for easy parsing.
+ """
+
+ import json
+ import time
+ from typing import Dict, List, Any, Optional
+ from datetime import datetime
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+
+ class Tracer:
+     """Trace logger for the Advanced Mode 3-stage pipeline."""
+
+     def __init__(self, enabled: bool = True):
+         """
+         Initialize tracer.
+
+         Args:
+             enabled: Whether to enable trace logging
+         """
+         self.enabled = enabled
+         self.trace_entries: List[Dict[str, Any]] = []
+         self.start_time = time.time()
+
+     def log_extraction(
+         self,
+         window_id: int,
+         extraction: Optional[Dict[str, List[str]]],
+         llm_response: str,
+         thinking: Optional[str] = None,
+         error: Optional[str] = None
+     ) -> None:
+         """
+         Log extraction operation for a single window.
+
+         Args:
+             window_id: Window identifier
+             extraction: Extracted items dict (action_items, decisions, key_points, open_questions)
+             llm_response: Sampled LLM response (first 400 chars)
+             thinking: Sampled thinking/reasoning content (if applicable)
+             error: Error message if extraction failed
+         """
+         if not self.enabled:
+             return
+
+         entry = {
+             "stage": "extraction",
+             "timestamp": datetime.now().isoformat(),
+             "elapsed_seconds": round(time.time() - self.start_time, 2),
+             "window_id": window_id,
+             "success": extraction is not None and error is None,
+             "error": error,
+             "extraction": extraction,
+             "llm_response_sample": llm_response[:400] if llm_response else None,
+             "thinking_sample": thinking[:400] if thinking else None,
+         }
+
+         self.trace_entries.append(entry)
+         logger.debug(f"[Trace] Extraction window {window_id}: {entry['success']}")
+
+     def log_deduplication(
+         self,
+         category: str,
+         original_count: int,
+         deduplicated_count: int,
+         duplicates_removed: int,
+         similarity_threshold: float,
+         embedding_model: str
+     ) -> None:
+         """
+         Log deduplication operation for a category.
+
+         Args:
+             category: Category name (action_items, decisions, etc.)
+             original_count: Number of items before deduplication
+             deduplicated_count: Number of items after deduplication
+             duplicates_removed: Number of duplicates removed
+             similarity_threshold: Similarity threshold used
+             embedding_model: Embedding model used
+         """
+         if not self.enabled:
+             return
+
+         entry = {
+             "stage": "deduplication",
+             "timestamp": datetime.now().isoformat(),
+             "elapsed_seconds": round(time.time() - self.start_time, 2),
+             "category": category,
+             "original_count": original_count,
+             "deduplicated_count": deduplicated_count,
+             "duplicates_removed": duplicates_removed,
+             "duplicate_rate": round(duplicates_removed / original_count * 100, 1) if original_count > 0 else 0.0,
+             "similarity_threshold": similarity_threshold,
+             "embedding_model": embedding_model,
+         }
+
+         self.trace_entries.append(entry)
+         logger.debug(f"[Trace] Deduplication {category}: {original_count} → {deduplicated_count} ({duplicates_removed} removed)")
105
+
106
+ def log_synthesis(
107
+ self,
108
+ synthesis_model: str,
109
+ input_item_counts: Dict[str, int],
110
+ output_summary: str,
111
+ thinking: Optional[str] = None,
112
+ error: Optional[str] = None
113
+ ) -> None:
114
+ """
115
+ Log synthesis operation.
116
+
117
+ Args:
118
+ synthesis_model: Model key used for synthesis
119
+ input_item_counts: Dict of category counts fed to synthesis
120
+ output_summary: Generated summary (sampled)
121
+ thinking: Thinking/reasoning content (sampled, if applicable)
122
+ error: Error message if synthesis failed
123
+ """
124
+ if not self.enabled:
125
+ return
126
+
127
+ entry = {
128
+ "stage": "synthesis",
129
+ "timestamp": datetime.now().isoformat(),
130
+ "elapsed_seconds": round(time.time() - self.start_time, 2),
131
+ "synthesis_model": synthesis_model,
132
+ "input_item_counts": input_item_counts,
133
+ "success": error is None,
134
+ "error": error,
135
+ "output_summary_sample": output_summary[:400] if output_summary else None,
136
+ "thinking_sample": thinking[:400] if thinking else None,
137
+ }
138
+
139
+ self.trace_entries.append(entry)
140
+ logger.debug(f"[Trace] Synthesis: {entry['success']}")
141
+
142
+ def get_trace_jsonl(self) -> str:
143
+ """
144
+ Get trace entries as JSONL string.
145
+
146
+ Returns:
147
+ JSONL string (one JSON object per line)
148
+ """
149
+ if not self.enabled:
150
+ return ""
151
+
152
+ return "\n".join(json.dumps(entry, ensure_ascii=False) for entry in self.trace_entries)
153
+
154
+ def get_trace_json(self) -> List[Dict[str, Any]]:
155
+ """
156
+ Get trace entries as list of dicts.
157
+
158
+ Returns:
159
+ List of trace entry dicts
160
+ """
161
+ if not self.enabled:
162
+ return []
163
+
164
+ return self.trace_entries
165
+
166
+ def get_summary_stats(self) -> Dict[str, Any]:
167
+ """
168
+ Get summary statistics from trace.
169
+
170
+ Returns:
171
+ Dict with pipeline statistics
172
+ """
173
+ if not self.enabled or not self.trace_entries:
174
+ return {}
175
+
176
+ extraction_entries = [e for e in self.trace_entries if e["stage"] == "extraction"]
177
+ dedup_entries = [e for e in self.trace_entries if e["stage"] == "deduplication"]
178
+ synthesis_entries = [e for e in self.trace_entries if e["stage"] == "synthesis"]
179
+
180
+ total_extracted = sum(
181
+ sum(e["extraction"].values()) if e.get("extraction") else 0
182
+ for e in extraction_entries
183
+ )
184
+
185
+ total_deduplicated = sum(e["deduplicated_count"] for e in dedup_entries)
186
+ total_duplicates = sum(e["duplicates_removed"] for e in dedup_entries)
187
+
188
+ return {
189
+ "total_windows": len(extraction_entries),
190
+ "successful_extractions": sum(1 for e in extraction_entries if e["success"]),
191
+ "total_items_extracted": total_extracted,
192
+ "total_items_after_dedup": total_deduplicated,
193
+ "total_duplicates_removed": total_duplicates,
194
+ "duplicate_rate": round(total_duplicates / total_extracted * 100, 1) if total_extracted > 0 else 0.0,
195
+ "synthesis_success": synthesis_entries[0]["success"] if synthesis_entries else False,
196
+ "total_elapsed_seconds": round(time.time() - self.start_time, 2),
197
+ }
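Since `get_trace_jsonl()` emits one JSON object per line, downstream tooling can aggregate pipeline stats with nothing but the standard library. A minimal sketch of such a consumer (the hand-written entries below are stand-ins for real trace output, reduced to the fields the aggregation reads):

```python
import json

# Stand-in trace, in the same shape get_trace_jsonl() produces.
trace_jsonl = "\n".join(json.dumps(e, ensure_ascii=False) for e in [
    {"stage": "extraction", "window_id": 1, "success": True,
     "extraction": {"action_items": ["a"], "decisions": [],
                    "key_points": ["k"], "open_questions": []}},
    {"stage": "deduplication", "category": "action_items",
     "original_count": 3, "deduplicated_count": 2, "duplicates_removed": 1},
    {"stage": "synthesis", "success": True},
])

# One JSON object per line; parse and aggregate per stage.
entries = [json.loads(line) for line in trace_jsonl.splitlines() if line]
extracted = sum(
    sum(len(v) for v in e["extraction"].values())
    for e in entries if e["stage"] == "extraction" and e.get("extraction")
)
removed = sum(e["duplicates_removed"] for e in entries
              if e["stage"] == "deduplication")
print(extracted, removed)  # 2 1
```

Note that summing item counts must take `len()` of each category list; the `extraction` field holds lists of strings, not counts.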