Advanced 2-Stage Meeting Summarization - Complete Implementation Plan
Project: Tiny Scribe - Advanced Mode
Date: 2026-02-04
Status: Ready for Implementation
Estimated Effort: 13-19 hours
Table of Contents
- Executive Summary
- Design Decisions
- Model Registries
- UI Implementation
- Model Management Infrastructure
- Extraction Pipeline
- Implementation Checklist
- Testing Strategy
- Implementation Priority
- Risk Assessment
Executive Summary
This plan details the implementation of a 3-model Advanced Summarization Pipeline for Tiny Scribe, featuring:
- ✅ 3 independent model registries (Extraction, Embedding, Synthesis)
- ✅ User-configurable extraction context (2K-8K tokens, default 4K)
- ✅ Reasoning/thinking model support with independent toggles per stage
- ✅ Sequential model loading for memory efficiency
- ✅ Bilingual support (English + Traditional Chinese zh-TW)
- ✅ Fail-fast error handling with graceful UI feedback
- ✅ Complete independence from Standard mode
Architecture
Stage 1: EXTRACTION → Parse transcript → Create windows → Extract JSON items
Stage 2: DEDUPLICATION → Compute embeddings → Remove semantic duplicates
Stage 3: SYNTHESIS → Generate executive summary from deduplicated items
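The plan's later code refers to a `Window` object (`window.content`, `window.model_key`) and a four-category JSON schema flowing between the stages. A minimal sketch of that data contract (any field beyond those two, and the helper names, are illustrative assumptions, not the final implementation):

```python
from dataclasses import dataclass
from typing import Dict, List

# Categories produced by Stage 1 and consumed by Stages 2-3
EXTRACTION_CATEGORIES = ("action_items", "decisions", "key_points", "open_questions")

@dataclass
class Window:
    """One extraction window: a transcript slice that fits the extraction n_ctx."""
    content: str     # transcript text for this window
    model_key: str   # extraction model key, e.g. "qwen3_1.7b_q4"

def empty_extraction() -> Dict[str, List[str]]:
    """The exact JSON schema every extraction window must return."""
    return {category: [] for category in EXTRACTION_CATEGORIES}
```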
Key Metrics
| Metric | Value |
|---|---|
| New Code | ~1,800 lines |
| Modified Code | ~60 lines |
| Total Models | 31 unique (11 + 4 + 16) |
| Default Models | qwen3_1.7b_q4, granite-107m, qwen3_1.7b_q4 |
| Memory Strategy | Sequential load/unload (safe for HF Spaces Free Tier) |
Design Decisions
Q1: Extraction Model List Composition (REVISION)
Decision: Option A - 11 models (≤1.7B), excluding the LFM2-Extract models
Rationale: The LFM2-Extract specialized models were removed after testing showed an 85.7% failure rate due to hallucination and schema non-compliance. They were replaced with Qwen3 models, which support reasoning and handle Chinese content better.
Q1a: Synthesis Model Selection (NEW)
Decision: Restrict to models ≤4GB (max 4B parameters)
Rationale: HF Spaces Free Tier only has 16GB RAM; 7B+ models will OOM. Remove ernie_21b, glm_4_7_flash_reap_30b, qwen3_30b_thinking_q1, qwen3_30b_instruct_q1
Q2: Independence from Standard Mode
Decision: Option B - Both Extraction AND Synthesis fully independent from AVAILABLE_MODELS
Rationale: Full independence prevents parameter cross-contamination; synthesis models have their own optimized temperatures (0.7-0.9) separate from Standard mode
Q3: Extraction n_ctx UI Control
Decision: Option A - Slider (2K-8K, step 1024, default 4K)
Rationale: Maximum flexibility for users to balance precision vs speed
Q4: Default Models
Decision:
- Extraction: qwen3_1.7b_q4 (supports reasoning, better Chinese understanding)
- Embedding: granite-107m (fastest, good enough)
- Synthesis: qwen3_1.7b_q4 (same model as extraction, with synthesis-optimized settings)
Rationale: Balanced defaults optimized for quality and speed. Qwen3 1.7B chosen over LFM2-Extract based on empirical testing showing superior extraction success rate and schema compliance.
Q5: Model Key Naming
Decision: Keep same keys (no prefix like adv_synth_)
Rationale: Simpler, less duplication, clear role-based config resolution
Q6: Model Overlap Between Stages
Decision: Allow overlap with independent settings per role
Rationale: Same model can be extraction + synthesis with different parameters
Q7: Reasoning Checkbox UI Flow
Decision: Option B - Separate checkboxes for extraction and synthesis
Rationale: Independent control per stage, clearer user intent
Q8: Thinking Block Display
Decision: Option A - Reuse "MODEL THINKING PROCESS" field
Rationale: Consistent with Standard mode, no UI layout changes needed
Q9: Window Token Counting with User n_ctx
Decision: Option A - Strict adherence to user's slider value
Rationale: Respect user's explicit choice, they may want larger/smaller windows
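The window-budget arithmetic implied by Q3/Q9 can be sketched as follows. The reserve sizes (`prompt_reserve`, `output_reserve`) and function names are assumptions for illustration; the point is that each window's transcript budget is the user's slider value minus fixed reserves, so a smaller slider directly yields more windows:

```python
def window_budget(user_n_ctx: int,
                  prompt_reserve: int = 512,
                  output_reserve: int = 1024) -> int:
    """Tokens available for transcript text per window (illustrative reserves)."""
    budget = user_n_ctx - prompt_reserve - output_reserve
    if budget <= 0:
        raise ValueError(f"n_ctx={user_n_ctx} is too small for the reserves")
    return budget

def estimate_window_count(transcript_tokens: int,
                          user_n_ctx: int,
                          overlap_tokens: int = 0) -> int:
    """Rough window count; overlapping turns are re-counted in each window,
    so the effective advance per window shrinks by the overlap."""
    step = max(1, window_budget(user_n_ctx) - overlap_tokens)
    return max(1, -(-transcript_tokens // step))  # ceiling division
```

With the default 4K slider and these illustrative reserves, a 10,000-token transcript yields about 4 windows; lowering the slider shrinks the per-window budget and increases the window count, matching the slider's precision-vs-speed tradeoff.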
Q10: Model Loading Error Handling
Decision: Option C - Graceful failure with UI error message
Rationale: Most user-friendly, allows retry with different model
Model Registries
1. EXTRACTION_MODELS (11 models - FINAL)
Location: /home/luigi/tiny-scribe/app.py
Features:
- ✅ Independent from AVAILABLE_MODELS
- ✅ User-adjustable n_ctx (2K-8K, default 4K)
- ✅ Extraction-optimized settings (temp 0.1-0.3)
- ✅ 2 hybrid models with reasoning toggle
- ✅ All models verified on HuggingFace
Complete Registry (LFM2-Extract models removed after testing):
EXTRACTION_MODELS = {
"falcon_h1_100m": {
"name": "Falcon-H1 100M",
"repo_id": "mradermacher/Falcon-H1-Tiny-Multilingual-100M-Instruct-GGUF",
"filename": "*Q8_0.gguf",
"max_context": 32768,
"default_n_ctx": 4096,
"params_size": "100M",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.2,
"top_p": 0.9,
"top_k": 30,
"repeat_penalty": 1.0,
},
},
"gemma3_270m": {
"name": "Gemma-3 270M",
"repo_id": "unsloth/gemma-3-270m-it-qat-GGUF",
"filename": "*Q8_0.gguf",
"max_context": 32768,
"default_n_ctx": 4096,
"params_size": "270M",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.3,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.0,
},
},
"ernie_300m": {
"name": "ERNIE-4.5 0.3B (131K Context)",
"repo_id": "unsloth/ERNIE-4.5-0.3B-PT-GGUF",
"filename": "*Q8_0.gguf",
"max_context": 131072,
"default_n_ctx": 4096,
"params_size": "300M",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.2,
"top_p": 0.9,
"top_k": 30,
"repeat_penalty": 1.0,
},
},
"granite_350m": {
"name": "Granite-4.0 350M",
"repo_id": "unsloth/granite-4.0-h-350m-GGUF",
"filename": "*Q8_0.gguf",
"max_context": 32768,
"default_n_ctx": 4096,
"params_size": "350M",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.1,
"top_p": 0.95,
"top_k": 30,
"repeat_penalty": 1.0,
},
},
"lfm2_350m": {
"name": "LFM2 350M",
"repo_id": "LiquidAI/LFM2-350M-GGUF",
"filename": "*Q8_0.gguf",
"max_context": 32768,
"default_n_ctx": 4096,
"params_size": "350M",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.2,
"top_p": 0.9,
"top_k": 40,
"repeat_penalty": 1.0,
},
},
"bitcpm4_500m": {
"name": "BitCPM4 0.5B (128K Context)",
"repo_id": "openbmb/BitCPM4-0.5B-GGUF",
"filename": "*q4_0.gguf",
"max_context": 131072,
"default_n_ctx": 4096,
"params_size": "500M",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.2,
"top_p": 0.9,
"top_k": 30,
"repeat_penalty": 1.0,
},
},
"hunyuan_500m": {
"name": "Hunyuan 0.5B (256K Context)",
"repo_id": "mradermacher/Hunyuan-0.5B-Instruct-GGUF",
"filename": "*Q8_0.gguf",
"max_context": 262144,
"default_n_ctx": 4096,
"params_size": "500M",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.2,
"top_p": 0.9,
"top_k": 30,
"repeat_penalty": 1.0,
},
},
"qwen3_600m_q4": {
"name": "Qwen3 0.6B Q4 (32K Context)",
"repo_id": "unsloth/Qwen3-0.6B-GGUF",
"filename": "*Q4_0.gguf",
"max_context": 32768,
"default_n_ctx": 4096,
"params_size": "600M",
"supports_reasoning": True, # ← HYBRID MODEL
"supports_toggle": True, # ← User can toggle reasoning
"inference_settings": {
"temperature": 0.3,
"top_p": 0.9,
"top_k": 20,
"repeat_penalty": 1.0,
},
},
"granite_3_1_1b_q8": {
"name": "Granite 3.1 1B-A400M Instruct (128K Context)",
"repo_id": "bartowski/granite-3.1-1b-a400m-instruct-GGUF",
"filename": "*Q8_0.gguf",
"max_context": 131072,
"default_n_ctx": 4096,
"params_size": "1B",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.3,
"top_p": 0.9,
"top_k": 30,
"repeat_penalty": 1.0,
},
},
"falcon_h1_1.5b_q4": {
"name": "Falcon-H1 1.5B Q4",
"repo_id": "unsloth/Falcon-H1-1.5B-Deep-Instruct-GGUF",
"filename": "*Q4_K_M.gguf",
"max_context": 32768,
"default_n_ctx": 4096,
"params_size": "1.5B",
"supports_reasoning": False,
"supports_toggle": False,
"inference_settings": {
"temperature": 0.2,
"top_p": 0.9,
"top_k": 30,
"repeat_penalty": 1.0,
},
},
"qwen3_1.7b_q4": {
"name": "Qwen3 1.7B Q4 (32K Context)",
"repo_id": "unsloth/Qwen3-1.7B-GGUF",
"filename": "*Q4_0.gguf",
"max_context": 32768,
"default_n_ctx": 4096,
"params_size": "1.7B",
"supports_reasoning": True, # ← HYBRID MODEL
"supports_toggle": True, # ← User can toggle reasoning
"inference_settings": {
"temperature": 0.3,
"top_p": 0.9,
"top_k": 20,
"repeat_penalty": 1.0,
},
},
}
Hybrid Models (Reasoning Support):
- qwen3_600m_q4 - 600M, user-toggleable reasoning
- qwen3_1.7b_q4 - 1.7B, user-toggleable reasoning
2. SYNTHESIS_MODELS (16 models)
Location: /home/luigi/tiny-scribe/app.py
Features:
- ✅ Fully independent from AVAILABLE_MODELS (no shared references)
- ✅ Synthesis-optimized temperatures (0.7-0.9, higher than extraction)
- ✅ 2 hybrid + 4 thinking-only models with reasoning support
- ✅ Default: qwen3_1.7b_q4
Registry Definition:
# FULLY INDEPENDENT from AVAILABLE_MODELS (no shared references)
# Synthesis-optimized settings: higher temperatures (0.7-0.9) for creative summary generation
SYNTHESIS_MODELS = {
"granite_3_1_1b_q8": {..., "temperature": 0.8},
"falcon_h1_1.5b_q4": {..., "temperature": 0.7},
"qwen3_1.7b_q4": {..., "temperature": 0.8}, # DEFAULT
"granite_3_3_2b_q4": {..., "temperature": 0.8},
"youtu_llm_2b_q8": {..., "temperature": 0.8}, # reasoning toggle
"lfm2_2_6b_transcript": {..., "temperature": 0.7},
"breeze_3b_q4": {..., "temperature": 0.7},
"granite_3_1_3b_q4": {..., "temperature": 0.8},
"qwen3_4b_thinking_q3": {..., "temperature": 0.8}, # thinking-only
"granite4_tiny_q3": {..., "temperature": 0.8},
"ernie_21b_pt_q1": {..., "temperature": 0.8},
"ernie_21b_thinking_q1": {..., "temperature": 0.9}, # thinking-only
"glm_4_7_flash_reap_30b": {..., "temperature": 0.8}, # thinking-only
"glm_4_7_flash_30b_iq2": {..., "temperature": 0.7},
"qwen3_30b_thinking_q1": {..., "temperature": 0.8}, # thinking-only
"qwen3_30b_instruct_q1": {..., "temperature": 0.7},
}
Reasoning Models:
- Hybrid (toggleable): qwen3_1.7b_q4, youtu_llm_2b_q8
- Thinking-only: qwen3_4b_thinking_q3, ernie_21b_thinking_q1, glm_4_7_flash_reap_30b, qwen3_30b_thinking_q1
3. EMBEDDING_MODELS (4 models)
Location: /home/luigi/tiny-scribe/meeting_summarizer/extraction.py
Features:
- ✅ Dedicated embedding models (not in AVAILABLE_MODELS)
- ✅ Used exclusively for deduplication phase
- ✅ Range: 384-dim to 1024-dim
- ✅ Default: granite-107m
Registry:
EMBEDDING_MODELS = {
"granite-107m": {
"name": "Granite 107M Multilingual (384-dim)",
"repo_id": "ibm-granite/granite-embedding-107m-multilingual",
"filename": "*Q8_0.gguf",
"embedding_dim": 384,
"max_context": 2048,
"description": "Fastest, multilingual, good for quick deduplication",
},
"granite-278m": {
"name": "Granite 278M Multilingual (768-dim)",
"repo_id": "ibm-granite/granite-embedding-278m-multilingual",
"filename": "*Q8_0.gguf",
"embedding_dim": 768,
"max_context": 2048,
"description": "Balanced speed/quality, multilingual",
},
"gemma-300m": {
"name": "Embedding Gemma 300M (768-dim)",
"repo_id": "unsloth/embeddinggemma-300m-GGUF",
"filename": "*Q8_0.gguf",
"embedding_dim": 768,
"max_context": 2048,
"description": "Google embedding model, strong semantics",
},
"qwen-600m": {
"name": "Qwen3 Embedding 600M (1024-dim)",
"repo_id": "Qwen/Qwen3-Embedding-0.6B-GGUF",
"filename": "*Q8_0.gguf",
"embedding_dim": 1024,
"max_context": 2048,
"description": "Highest quality, best for critical dedup",
},
}
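Stage 2's `deduplicate_items()` is only named in the checklist. A minimal sketch of the intended cosine-similarity deduplication (the function name, and the assumption that embeddings for all items are already computed as a matrix, are illustrative):

```python
import numpy as np
from typing import List

def deduplicate(items: List[str],
                embeddings: np.ndarray,
                threshold: float = 0.85) -> List[str]:
    """Greedy dedup: keep an item only if its cosine similarity to every
    already-kept item is below the threshold. Rows are L2-normalized first
    so the dot product equals cosine similarity."""
    if not items:
        return []
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    kept_idx: List[int] = [0]            # first occurrence always survives
    for i in range(1, len(items)):
        sims = unit[kept_idx] @ unit[i]  # similarities to all kept items
        if float(np.max(sims)) < threshold:
            kept_idx.append(i)
    return [items[i] for i in kept_idx]
```

With the UI default of 0.85, near-identical phrasings collapse to one item; moving the slider toward 0.95 is stricter about what counts as a duplicate, so more near-duplicates survive.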
UI Implementation
Advanced Mode Controls (Option B: Separate Reasoning Checkboxes)
Location: /home/luigi/tiny-scribe/app.py, Gradio interface section
# ===== ADVANCED MODE CONTROLS =====
# Uses gr.TabItem inside gr.Tabs (not gr.Group with visibility toggle)
with gr.TabItem("🧠 Advanced Mode (3-Model Pipeline)"):
# Model Selection Row
with gr.Row():
extraction_model = gr.Dropdown(
choices=list(EXTRACTION_MODELS.keys()),
value="qwen3_1.7b_q4", # ⭐ DEFAULT
label="🔍 Stage 1: Extraction Model (≤1.7B)",
info="Extracts structured items (action_items, decisions, key_points, questions) from windows"
)
embedding_model = gr.Dropdown(
choices=list(EMBEDDING_MODELS.keys()),
value="granite-107m", # ⭐ DEFAULT
label="🧬 Stage 2: Embedding Model",
info="Computes semantic embeddings for deduplication across categories"
)
synthesis_model = gr.Dropdown(
choices=list(SYNTHESIS_MODELS.keys()),
value="qwen3_1.7b_q4", # ⭐ DEFAULT
label="✨ Stage 3: Synthesis Model (1B-30B)",
info="Generates final executive summary from deduplicated items"
)
# Extraction Parameters Row
with gr.Row():
extraction_n_ctx = gr.Slider(
minimum=2048,
maximum=8192,
step=1024,
value=4096, # ⭐ DEFAULT 4K
label="🪟 Extraction Context Window (n_ctx)",
info="Smaller = more windows (higher precision), Larger = fewer windows (faster processing)"
)
overlap_turns = gr.Slider(
minimum=1,
maximum=5,
step=1,
value=2,
label="🔄 Window Overlap (speaker turns)",
info="Number of speaker turns shared between adjacent windows (reduces information loss)"
)
# Deduplication Parameters Row
with gr.Row():
similarity_threshold = gr.Slider(
minimum=0.70,
maximum=0.95,
step=0.01,
value=0.85,
label="🎯 Deduplication Similarity Threshold",
info="Items with cosine similarity above this are considered duplicates (higher = stricter)"
)
# SEPARATE REASONING CONTROLS (Q7: Option B)
with gr.Row():
enable_extraction_reasoning = gr.Checkbox(
value=False,
visible=False, # Conditional visibility based on extraction model
label="🧠 Enable Reasoning for Extraction",
info="Use thinking process before JSON output (Qwen3 hybrid models only)"
)
enable_synthesis_reasoning = gr.Checkbox(
value=True,
visible=True, # Conditional visibility based on synthesis model
label="🧠 Enable Reasoning for Synthesis",
info="Use thinking process for final summary generation"
)
# Output Settings Row
with gr.Row():
adv_output_language = gr.Radio(
choices=["en", "zh-TW"],
value="en",
label="🌐 Output Language",
info="Extraction auto-detects language from transcript, synthesis uses this setting"
)
adv_max_tokens = gr.Slider(
minimum=512,
maximum=4096,
step=128,
value=2048,
label="📏 Max Synthesis Tokens",
info="Maximum tokens for final executive summary"
)
# Logging Control
enable_detailed_logging = gr.Checkbox(
value=True,
label="📝 Enable Detailed Trace Logging",
info="Save JSONL trace file (embedded in download JSON) for debugging pipeline"
)
# Model Info Accordion
with gr.Accordion("📋 Model Details & Settings", open=False):
with gr.Row():
with gr.Column():
extraction_model_info = gr.Markdown("**Extraction Model**\n\nSelect a model to see details")
with gr.Column():
embedding_model_info = gr.Markdown("**Embedding Model**\n\nSelect a model to see details")
with gr.Column():
synthesis_model_info = gr.Markdown("**Synthesis Model**\n\nSelect a model to see details")
Conditional Reasoning Checkbox Visibility Logic
def update_extraction_reasoning_visibility(model_key):
"""Show/hide extraction reasoning checkbox based on model capabilities."""
config = EXTRACTION_MODELS.get(model_key, {})
supports_toggle = config.get("supports_toggle", False)
if supports_toggle:
# Hybrid model (qwen3_600m_q4, qwen3_1.7b_q4)
return gr.update(
visible=True,
value=False,
interactive=True,
label="🧠 Enable Reasoning for Extraction"
)
elif config.get("supports_reasoning", False) and not supports_toggle:
# Thinking-only model (none currently in extraction, but future-proof)
return gr.update(
visible=True,
value=True,
interactive=False,
label="🧠 Reasoning Mode for Extraction (Always On)"
)
else:
# Non-reasoning model
return gr.update(visible=False, value=False)
def update_synthesis_reasoning_visibility(model_key):
"""Show/hide synthesis reasoning checkbox based on model capabilities."""
# Reuse existing logic from Standard mode
return update_reasoning_visibility(model_key) # Existing function
# Wire up event handlers
extraction_model.change(
fn=update_extraction_reasoning_visibility,
inputs=[extraction_model],
outputs=[enable_extraction_reasoning]
)
synthesis_model.change(
fn=update_synthesis_reasoning_visibility,
inputs=[synthesis_model],
outputs=[enable_synthesis_reasoning]
)
Model Info Display Functions
def get_extraction_model_info(model_key):
"""Generate markdown info for extraction model."""
config = EXTRACTION_MODELS.get(model_key, {})
settings = config.get("inference_settings", {})
reasoning_support = ""
if config.get("supports_toggle"):
reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
elif config.get("supports_reasoning"):
reasoning_support = "\n**Reasoning:** Thinking-only (always on)"
return f"""**{config.get('name', 'Unknown')}**
**Size:** {config.get('params_size', 'N/A')}
**Max Context:** {config.get('max_context', 0):,} tokens
**Default n_ctx:** {config.get('default_n_ctx', 4096):,} tokens (user-adjustable via slider)
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}
**Extraction-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (deterministic for JSON)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""
def get_embedding_model_info(model_key):
"""Generate markdown info for embedding model."""
from meeting_summarizer.extraction import EMBEDDING_MODELS
config = EMBEDDING_MODELS.get(model_key, {})
return f"""**{config.get('name', 'Unknown')}**
**Embedding Dimension:** {config.get('embedding_dim', 'N/A')}
**Context:** {config.get('max_context', 0):,} tokens
**Repository:** `{config.get('repo_id', 'N/A')}`
**Description:** {config.get('description', 'N/A')}
"""
def get_synthesis_model_info(model_key):
"""Generate markdown info for synthesis model."""
config = SYNTHESIS_MODELS.get(model_key, {})
settings = config.get("inference_settings", {})
reasoning_support = ""
if config.get("supports_toggle"):
reasoning_support = "\n**Reasoning:** Hybrid (user-toggleable)"
elif config.get("supports_reasoning"):
reasoning_support = "\n**Reasoning:** Thinking-only (always on)"
return f"""**{config.get('name', 'Unknown')}**
**Max Context:** {config.get('max_context', 0):,} tokens
**Repository:** `{config.get('repo_id', 'N/A')}`{reasoning_support}
**Synthesis-Optimized Settings:**
- Temperature: {settings.get('temperature', 'N/A')} (synthesis-optimized, independent of Standard mode)
- Top P: {settings.get('top_p', 'N/A')}
- Top K: {settings.get('top_k', 'N/A')}
- Repeat Penalty: {settings.get('repeat_penalty', 'N/A')}
"""
# Wire up info update handlers
extraction_model.change(
fn=get_extraction_model_info,
inputs=[extraction_model],
outputs=[extraction_model_info]
)
embedding_model.change(
fn=get_embedding_model_info,
inputs=[embedding_model],
outputs=[embedding_model_info]
)
synthesis_model.change(
fn=get_synthesis_model_info,
inputs=[synthesis_model],
outputs=[synthesis_model_info]
)
Model Management Infrastructure
Role-Aware Configuration Resolver
def get_model_config(model_key: str, model_role: str) -> Dict[str, Any]:
"""
Get model configuration based on role.
Ensures same model (e.g., qwen3_1.7b_q4) uses DIFFERENT settings
for extraction vs synthesis.
Args:
model_key: Model identifier (e.g., "qwen3_1.7b_q4")
model_role: "extraction" or "synthesis"
Returns:
Model configuration dict with role-specific settings
Raises:
ValueError: If model_key not available for specified role
"""
if model_role == "extraction":
if model_key not in EXTRACTION_MODELS:
available = ", ".join(list(EXTRACTION_MODELS.keys())[:3]) + "..."
raise ValueError(
f"Model '{model_key}' not available for extraction role. "
f"Available: {available}"
)
return EXTRACTION_MODELS[model_key]
elif model_role == "synthesis":
if model_key not in SYNTHESIS_MODELS:
available = ", ".join(list(SYNTHESIS_MODELS.keys())[:3]) + "..."
raise ValueError(
f"Model '{model_key}' not available for synthesis role. "
f"Available: {available}"
)
return SYNTHESIS_MODELS[model_key]
else:
raise ValueError(
f"Unknown model role: '{model_role}'. "
f"Must be 'extraction' or 'synthesis'"
)
Role-Aware Model Loader (Q9: Option A - Respect user's n_ctx choice)
def load_model_for_role(
model_key: str,
model_role: str,
n_threads: int = 2,
user_n_ctx: Optional[int] = None # For extraction, from slider
) -> Tuple[Llama, str]:
"""
Load model with role-specific configuration.
Args:
model_key: Model identifier
model_role: "extraction" or "synthesis"
n_threads: CPU threads
user_n_ctx: User-specified n_ctx (extraction only, from slider)
Returns:
(loaded_model, info_message)
Raises:
Exception: If model loading fails (Q10: Option C - fail gracefully)
"""
try:
config = get_model_config(model_key, model_role)
# Calculate n_ctx (Q9: Option A - Strict adherence to user's choice)
if model_role == "extraction" and user_n_ctx is not None:
n_ctx = min(user_n_ctx, config["max_context"], MAX_USABLE_CTX)
else:
# Synthesis or default extraction
n_ctx = min(config.get("max_context", 8192), MAX_USABLE_CTX)
# Detect GPU support
requested_ngl = int(os.environ.get("N_GPU_LAYERS", 0))
n_gpu_layers = requested_ngl
if requested_ngl != 0:
try:
from llama_cpp import llama_supports_gpu_offload
gpu_available = llama_supports_gpu_offload()
if not gpu_available:
logger.warning("GPU requested but not available. Using CPU.")
n_gpu_layers = 0
except Exception as e:
logger.warning(f"Could not detect GPU: {e}. Using CPU.")
n_gpu_layers = 0
# Load model
logger.info(f"Loading {config['name']} for {model_role} role (n_ctx={n_ctx:,})")
llm = Llama.from_pretrained(
repo_id=config["repo_id"],
filename=config["filename"],
n_ctx=n_ctx,
n_batch=min(2048, n_ctx),
n_threads=n_threads,
n_threads_batch=n_threads,
n_gpu_layers=n_gpu_layers,
verbose=False,
seed=1337,
)
info_msg = (
f"✅ Loaded: {config['name']} for {model_role} "
f"(n_ctx={n_ctx:,}, threads={n_threads})"
)
logger.info(info_msg)
return llm, info_msg
except Exception as e:
# Q10: Option C - Fail gracefully, let user select different model
error_msg = (
f"❌ Failed to load {model_key} for {model_role}: {str(e)}\n\n"
f"Please select a different model and try again."
)
logger.error(error_msg, exc_info=True)
raise RuntimeError(error_msg) from e
def unload_model(llm: Llama, model_name: str = "model") -> None:
"""Drop the local model reference and trigger garbage collection.
Note: `del llm` only removes this function's reference; the caller must also
clear its own variable (e.g. set it to None) for the memory to be reclaimed."""
if llm:
logger.info(f"Unloading {model_name}")
del llm
gc.collect()
time.sleep(0.5) # Allow OS to reclaim memory
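Putting the helpers together, the sequential load/unload strategy from the Key Metrics table could be driven by an orchestrator of roughly this shape. This is a sketch: `model_for_stage`, `run_pipeline`, and the stub loaders are illustrative stand-ins, not the plan's `summarize_advanced()` generator.

```python
import gc
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List

@contextmanager
def model_for_stage(load: Callable[[], object], stage: str) -> Iterator[object]:
    """Load a model, hand it to one pipeline stage, then drop the reference
    and collect, so only one model is resident at a time (the plan's
    sequential strategy for the 16GB Free Tier)."""
    model = load()
    try:
        yield model
    finally:
        del model      # drop the only strong reference...
        gc.collect()   # ...and reclaim memory before the next stage loads

def run_pipeline(loaders: Dict[str, Callable[[], object]]) -> List[str]:
    """Run the three stages strictly in sequence; each model is unloaded
    before the next one is loaded."""
    completed = []
    for role in ("extraction", "embedding", "synthesis"):
        with model_for_stage(loaders[role], role):
            completed.append(role)  # real stage work would run here
    return completed
```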
Extraction Pipeline
Extraction System Prompt Builder (Bilingual + Reasoning)
def build_extraction_system_prompt(
output_language: str,
supports_reasoning: bool,
supports_toggle: bool,
enable_reasoning: bool
) -> str:
"""
Build extraction system prompt with optional reasoning mode.
Args:
output_language: "en" or "zh-TW" (auto-detected from transcript)
supports_reasoning: Model has reasoning capability
supports_toggle: User can toggle reasoning on/off
enable_reasoning: User's choice (only applies if supports_toggle=True)
Returns:
System prompt string
"""
# Determine reasoning mode
if supports_toggle and enable_reasoning:
# Hybrid model with reasoning enabled
reasoning_instruction_en = """
Use your reasoning capabilities to analyze the content before extracting.
Your reasoning should:
1. Identify key decision points and action items
2. Distinguish explicit decisions from general discussion
3. Categorize information appropriately (action vs point vs question)
After reasoning, output ONLY valid JSON."""
reasoning_instruction_zh = """
使用你的推理能力分析內容後再進行提取。
你的推理應該:
1. 識別關鍵決策點和行動項目
2. 區分明確決策與一般討論
3. 適當分類資訊(行動 vs 要點 vs 問題)
推理後,僅輸出 JSON。"""
else:
reasoning_instruction_en = ""
reasoning_instruction_zh = ""
# Build full prompt
if output_language == "zh-TW":
return f"""你是會議分析助手。從逐字稿中提取結構化資訊。
{reasoning_instruction_zh}
僅輸出有效的 JSON,使用此精確架構:
{{
"action_items": ["包含負責人和截止日期的任務", ...],
"decisions": ["包含理由的決策", ...],
"key_points": ["重要討論要點", ...],
"open_questions": ["未解決的問題或疑慮", ...]
}}
規則:
- 每個項目必須是完整、獨立的句子
- 在每個項目中包含上下文(誰、什麼、何時)
- 如果類別沒有項目,使用空陣列 []
- 僅輸出 JSON,無 markdown,無解釋"""
else: # English
return f"""You are a meeting analysis assistant. Extract structured information from transcript.
{reasoning_instruction_en}
Output ONLY valid JSON with this exact schema:
{{
"action_items": ["Task with owner and deadline", ...],
"decisions": ["Decision made with rationale", ...],
"key_points": ["Important discussion point", ...],
"open_questions": ["Unresolved question or concern", ...]
}}
Rules:
- Each item must be a complete, standalone sentence
- Include context (who, what, when) in each item
- If a category has no items, use empty array []
- Output ONLY JSON, no markdown, no explanations"""
Extraction Streaming with Reasoning Parsing (Q8: Option A - Show in "MODEL THINKING PROCESS")
def stream_extract_from_window(
extraction_llm: Llama,
window: Window,
window_id: int,
total_windows: int,
tracer: Tracer,
tokenizer: NativeTokenizer,
enable_reasoning: bool = False
) -> Generator[Tuple[str, str, Dict[str, List[str]], bool], None, None]:
"""
Stream extraction from single window with live progress + optional reasoning.
Yields:
(ticker_text, thinking_text, partial_items, is_complete)
- ticker_text: Progress ticker for UI
- thinking_text: Reasoning/thinking blocks (if extraction model supports it)
- partial_items: Current extracted items
- is_complete: True on final yield
"""
# Auto-detect language from window content
has_cjk = bool(re.search(r'[\u4e00-\u9fff]', window.content))
output_language = "zh-TW" if has_cjk else "en"
# Build system prompt with reasoning support
config = EXTRACTION_MODELS[window.model_key] # Assuming we pass model_key in Window
system_prompt = build_extraction_system_prompt(
output_language=output_language,
supports_reasoning=config.get("supports_reasoning", False),
supports_toggle=config.get("supports_toggle", False),
enable_reasoning=enable_reasoning
)
user_prompt = f"Transcript:\n\n{window.content}"
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
]
# Stream extraction
full_response = ""
thinking_content = ""
input_tokens = tokenizer.count(window.content) # hoisted so the final ticker works even if the stream yields no content
start_time = time.time()
first_token_time = None
token_count = 0
try:
stream = extraction_llm.create_chat_completion(
messages=messages,
max_tokens=1024,
temperature=config["inference_settings"]["temperature"],
top_p=config["inference_settings"]["top_p"],
top_k=config["inference_settings"]["top_k"],
repeat_penalty=config["inference_settings"]["repeat_penalty"],
stream=True,
)
for chunk in stream:
if 'choices' in chunk and len(chunk['choices']) > 0:
delta = chunk['choices'][0].get('delta', {})
content = delta.get('content', '')
if content:
if first_token_time is None:
first_token_time = time.time()
token_count += 1
full_response += content
# Parse thinking blocks if reasoning enabled
if enable_reasoning and config.get("supports_reasoning"):
thinking, remaining = parse_thinking_blocks(full_response, streaming=True)
thinking_content = thinking or ""
json_text = remaining
else:
json_text = full_response
# Try to parse JSON
partial_items = _try_parse_extraction_json(json_text)
# Calculate progress metrics
elapsed = time.time() - start_time
tps = token_count / elapsed if elapsed > 0 else 0
remaining_tokens = 1024 - token_count
eta = int(remaining_tokens / tps) if tps > 0 else 0
# Get item counts for ticker
items_count = {
"action_items": len(partial_items.get("action_items", [])),
"decisions": len(partial_items.get("decisions", [])),
"key_points": len(partial_items.get("key_points", [])),
"open_questions": len(partial_items.get("open_questions", []))
}
# Get last extracted item as snippet
last_item = ""
for category in ["action_items", "decisions", "key_points", "open_questions"]:
if partial_items.get(category):
last_item = partial_items[category][-1]
break
# Format progress ticker
input_tokens = tokenizer.count(window.content)
ticker = format_progress_ticker(
current_window=window_id,
total_windows=total_windows,
window_tokens=input_tokens,
max_tokens=4096, # Reference max for percentage
items_found=items_count,
tokens_per_sec=tps,
eta_seconds=eta,
current_snippet=last_item
)
# Q8: Option A - Show in "MODEL THINKING PROCESS" field
yield (ticker, thinking_content, partial_items, False)
# Final parse
if enable_reasoning and config.get("supports_reasoning"):
thinking, remaining = parse_thinking_blocks(full_response)
thinking_content = thinking or ""
json_text = remaining
else:
json_text = full_response
final_items = _try_parse_extraction_json(json_text)
if not final_items:
# JSON parsing failed - FAIL ENTIRE PIPELINE (strict mode)
error_msg = f"Failed to parse JSON from window {window_id}. Response: {json_text[:200]}"
tracer.log_extraction(
window_id=window_id,
extraction=None,
llm_response=_sample_llm_response(full_response),
error=error_msg
)
raise ValueError(error_msg)
# Log successful extraction
tracer.log_extraction(
window_id=window_id,
extraction=final_items,
llm_response=_sample_llm_response(full_response),
thinking=_sample_llm_response(thinking_content) if thinking_content else None,
error=None
)
# Final ticker
elapsed = time.time() - start_time
tps = token_count / elapsed if elapsed > 0 else 0
items_count = {k: len(v) for k, v in final_items.items()}
ticker = format_progress_ticker(
current_window=window_id,
total_windows=total_windows,
window_tokens=input_tokens,
max_tokens=4096,
items_found=items_count,
tokens_per_sec=tps,
eta_seconds=0,
current_snippet="✅ Extraction complete"
)
yield (ticker, thinking_content, final_items, True)
except Exception as e:
# Log error and re-raise to fail entire pipeline
tracer.log_extraction(
window_id=window_id,
extraction=None,
llm_response=_sample_llm_response(full_response) if full_response else "",
error=str(e)
)
raise
Implementation Checklist
Files to Create
- /home/luigi/tiny-scribe/meeting_summarizer/extraction.py (~900 lines)
  - NativeTokenizer class
  - EmbeddingModel class + EMBEDDING_MODELS registry
  - format_progress_ticker() function
  - stream_extract_from_window() function (with reasoning support)
  - deduplicate_items() function
  - stream_synthesize_executive_summary() function
Files to Modify
- /home/luigi/tiny-scribe/meeting_summarizer/__init__.py
  - Remove filter_validated_items import/export
- /home/luigi/tiny-scribe/meeting_summarizer/trace.py
  - Add log_extraction() method
  - Add log_deduplication() method
  - Add log_synthesis() method
- /home/luigi/tiny-scribe/app.py (~800 lines added/modified)
  - Add EXTRACTION_MODELS registry (11 models)
  - Add SYNTHESIS_MODELS registry
  - Add get_model_config() function
  - Add load_model_for_role() function
  - Add unload_model() function
  - Add build_extraction_system_prompt() function
  - Add summarize_advanced() generator function
  - Add Advanced mode UI controls
  - Add reasoning visibility logic
  - Add model info display functions
  - Update download_summary_json() for trace embedding
Code Statistics
| Metric | Count |
|---|---|
| New Lines | ~1,800 |
| Modified Lines | ~60 |
| Removed Lines | ~2 |
| New Functions | 12 |
| New Classes | 2 |
| UI Controls | 11 |
Testing Strategy
Phase 1: Model Registry Validation
python -c "
from app import EXTRACTION_MODELS, SYNTHESIS_MODELS
from meeting_summarizer.extraction import EMBEDDING_MODELS
assert len(EXTRACTION_MODELS) == 11, 'Extraction models count mismatch'
assert len(EMBEDDING_MODELS) == 4, 'Embedding models count mismatch'
assert len(SYNTHESIS_MODELS) == 16, 'Synthesis models count mismatch'
# Verify independent settings
ext_qwen = EXTRACTION_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
syn_qwen = SYNTHESIS_MODELS['qwen3_1.7b_q4']['inference_settings']['temperature']
assert ext_qwen == 0.3, f'Extraction temp wrong: {ext_qwen}'
assert syn_qwen == 0.8, f'Synthesis temp wrong: {syn_qwen}'
print('✅ All model registries validated!')
"
Phase 2: UI Control Validation
Manual Checks:
- Select "Advanced" mode
- Verify 3 dropdowns show correct counts (13, 4, 16)
- Verify default models selected
- Adjust extraction_n_ctx slider (2K → 8K)
- Select `qwen3_600m_q4` for extraction → reasoning checkbox appears
- Select `qwen3_1.7b_q4` for extraction → reasoning checkbox visible (Qwen3 supports reasoning)
- Select `qwen3_4b_thinking_q3` for synthesis → reasoning locked ON
- Verify model info panels update on selection
Phase 3: Pipeline Test - min.txt (Quick)
Configuration:
- Extraction: `qwen3_1.7b_q4` (default)
- Extraction n_ctx: 4096 (default)
- Embedding: `granite-107m` (default)
- Synthesis: `qwen3_1.7b_q4` (default)
- Similarity threshold: 0.85 (default)
Expected:
- 1 window created
- ~2-4 items extracted
- 0-1 duplicates removed
- Final summary generated
- Total time: ~30-60s
- Download JSON contains trace
Phase 4: Pipeline Test - Reasoning Models
Configuration:
- Extraction: `qwen3_600m_q4`
- ☑ Enable Reasoning for Extraction (test hybrid model)
- Extraction n_ctx: 2048 (smaller windows)
- Embedding: `granite-278m` (test balanced embedding)
- Synthesis: `qwen3_1.7b_q4`
- ☑ Enable Reasoning for Synthesis
Expected:
- More windows (~4-6 with 2K context)
- "MODEL THINKING PROCESS" shows extraction thinking + ticker
- ~10-15 items extracted
- ~2-4 duplicates removed
- Final summary with thinking blocks
- Total time: ~2-3 min
Phase 5: Pipeline Test - full.txt (Production)
Configuration:
- Extraction: `qwen3_1.7b_q4` (high quality, reasoning enabled)
- Extraction n_ctx: 4096 (default)
- Embedding: `qwen-600m` (highest quality)
- Synthesis: `qwen3_4b_thinking_q3` (4B thinking model)
- Output language: zh-TW (test Chinese)
Expected:
- ~3-5 windows (4K context)
- ~40-60 items extracted
- ~10-15 duplicates removed
- Final summary in Traditional Chinese
- Total time: ~5-8 min
- Download JSON with embedded trace (~1-2MB)
Phase 6: Error Handling Test (Q10: Option C)
Scenarios:
- Disconnect internet during model download
- Manually corrupt model cache
- Use invalid model repo_id in EXTRACTION_MODELS
Expected behavior:
- Error message displayed in UI: "❌ Failed to load lfm2_extract_1.2b..."
- Pipeline stops (doesn't try fallback)
- User can select different model and retry
- Trace file saved with error details
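Fail-fast here means a single load attempt: the failure is surfaced as the "❌ Failed to load …" message and no fallback model is tried. A minimal sketch of that contract (`try_load` and its `loader` argument are hypothetical names, not the actual `app.py` API):

```python
def try_load(model_id, loader):
    """Attempt exactly one model load; on failure, return the UI error and stop.

    `loader` is a stand-in for the real download/instantiation call. Per the
    fail-fast decision (Q10: Option C), no fallback model is attempted — the
    user selects a different model and retries manually."""
    try:
        return loader(model_id), None
    except Exception as e:
        return None, f"❌ Failed to load {model_id}: {e}"
```

The `(result, error)` pair lets the caller both render the error in the UI and write it into the trace file before stopping the pipeline.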
Implementation Priority
Suggested Implementation Sequence (13-19 hours total)
1. Model Registries (1-2 hours)
- Add `EXTRACTION_MODELS` to `app.py`
- Add `SYNTHESIS_MODELS` reference
- Add `EMBEDDING_MODELS` to `extraction.py`
- Validate with smoke test
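The registries are plain dicts keyed by model id. A sketch of the assumed entry shape — field names are illustrative, chosen to match what the Phase 1 smoke test reads (per-registry `inference_settings` with independent temperatures); the real entries also carry repo ids and file names elided here:

```python
# Illustrative registry entries; only the fields the smoke test touches are
# shown, and the structure itself is an assumption about app.py.
EXTRACTION_MODELS = {
    "qwen3_1.7b_q4": {
        "n_ctx": 32768,
        "reasoning": "Hybrid",   # No | Hybrid | Thinking-only
        "inference_settings": {"temperature": 0.3},
    },
}

SYNTHESIS_MODELS = {
    "qwen3_1.7b_q4": {
        "n_ctx": 32768,
        "reasoning": "Hybrid",
        "inference_settings": {"temperature": 0.8},  # same model, hotter for prose
    },
}

def get_model_config(registry: dict, model_id: str) -> dict:
    """Fail fast on an unknown model id instead of silently falling back."""
    if model_id not in registry:
        raise KeyError(f"Unknown model id: {model_id}")
    return registry[model_id]
```

Keeping the same model id in two registries with different settings is what makes the independence assertion in Phase 1 meaningful.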
2. Core Infrastructure (2-3 hours)
- Implement `get_model_config()`
- Implement `load_model_for_role()` with user_n_ctx support
- Implement `unload_model()`
- Implement `build_extraction_system_prompt()` with reasoning support
- Update `trace.py` with 3 new logging methods
- Update `__init__.py`
3. Extraction Module (3-4 hours)
- Implement `NativeTokenizer` class
- Implement `EmbeddingModel` class
- Implement `format_progress_ticker()`
- Implement `stream_extract_from_window()` with reasoning parsing
- Implement `deduplicate_items()`
- Implement `stream_synthesize_executive_summary()`
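`deduplicate_items()` amounts to a greedy pass: keep an item only if its embedding is below the similarity threshold against everything already kept. A minimal sketch with precomputed vectors (the real function would call `EmbeddingModel` to produce them; the greedy strategy is an assumption about the implementation):

```python
import math

def _cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def deduplicate_items(items, vectors, threshold=0.85):
    """Greedy semantic dedup: drop an item if any kept item is >= threshold similar.

    The 0.85 default mirrors the similarity-threshold default in the test plan."""
    kept, kept_vecs = [], []
    for item, vec in zip(items, vectors):
        if all(_cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(item)
            kept_vecs.append(vec)
    return kept
```

Greedy first-wins ordering means earlier windows take precedence, which is usually the desired behavior for meeting transcripts (the first statement of a decision survives).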
4. UI Integration (2-3 hours)
- Add Advanced mode controls to Gradio interface
- Implement reasoning checkbox visibility logic
- Implement model info display functions
- Wire up all event handlers
- Test UI responsiveness
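The reasoning-checkbox visibility rules follow directly from the Reasoning column in the model comparison tables. Keeping the decision in a pure helper makes it testable outside Gradio; the dropdown's change handler would wrap its result in a `gr.update(...)` call (that wiring is assumed, not shown):

```python
def reasoning_checkbox_state(reasoning: str) -> tuple:
    """Map a model's reasoning capability to (visible, value, interactive)
    for the per-stage reasoning checkbox, matching Phase 2's manual checks:
    'No' hides it, 'Hybrid' offers an editable toggle, 'Thinking-only'
    shows it locked ON."""
    if reasoning == "Thinking-only":
        return True, True, False    # visible, forced on, not editable
    if reasoning == "Hybrid":
        return True, False, True    # visible, off by default, editable
    return False, False, False      # hidden for non-reasoning models
```

Each stage (extraction, synthesis) calls this independently, which is what gives the plan its per-stage reasoning toggles.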
5. Pipeline Orchestration (3-4 hours)
- Implement `summarize_advanced()` generator function
- Sequential model loading/unloading logic
- Error handling with graceful failures
- Progress ticker updates
- Trace embedding in download JSON
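The orchestrator is a generator so Gradio can stream each stage's progress as it happens. A skeleton showing the three-stage sequencing and the fail-fast yield (the `stages` callables are stand-ins; the real function would also handle model load/unload between stages):

```python
def summarize_advanced(transcript, cfg, stages):
    """Generator skeleton: yield progress strings, then the final summary.

    `stages` maps stage name -> callable; the real stage functions
    (extraction, dedup, synthesis) are assumed. On any failure the error
    is yielded once and the generator returns — no fallback is attempted."""
    try:
        yield "⏳ Stage 1/3: extraction..."
        items = stages["extract"](transcript, cfg)
        yield f"⏳ Stage 2/3: deduplication ({len(items)} items)..."
        items = stages["dedup"](items, cfg)
        yield f"⏳ Stage 3/3: synthesis ({len(items)} unique items)..."
        yield stages["synthesize"](items, cfg)
    except Exception as e:  # fail fast: surface the error, stop the pipeline
        yield f"❌ {e}"
```

Yield-per-stage is also where the progress ticker and trace logging calls would hang off in the full implementation.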
6. Testing & Validation (2-3 hours)
- Run all test phases (min.txt → full.txt)
- Validate reasoning models behavior
- Test error handling scenarios
- Performance optimization (if needed)
Risk Assessment
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Memory overflow on HF Spaces Free Tier | Low | High | Sequential loading/unloading tested; add memory monitoring |
| Reasoning output breaks JSON parsing | Medium | Medium | Robust thinking block parsing with fallback; strict error handling |
| User n_ctx slider causes OOM | Low | Medium | Cap at MAX_USABLE_CTX (32K); show warning if user sets too high |
| Embedding models slow down pipeline | Medium | Low | Default to granite-107m (fastest); user can upgrade if needed |
| Trace file too large | Low | Low | Response sampling (400 chars) already implemented; compress if >5MB |
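The "reasoning output breaks JSON parsing" mitigation comes down to stripping `<think>…</think>` blocks before locating the JSON payload. A sketch (the tag name matches Qwen3-style output; the bracket-scan fallback for prose-wrapped arrays is an assumption about the robust-parsing approach):

```python
import json
import re

# Qwen3-style thinking blocks; DOTALL so multi-line reasoning is matched.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def parse_items(raw: str):
    """Strip thinking blocks, then parse JSON; fall back to the outermost
    JSON array if the model wrapped it in prose. Re-raises on total failure
    so the caller can log the window and fail fast."""
    cleaned = THINK_RE.sub("", raw).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        start, end = cleaned.find("["), cleaned.rfind("]")
        if start != -1 and end > start:
            return json.loads(cleaned[start:end + 1])
        raise
```

Non-greedy matching matters here: a greedy `.*` would swallow everything between the first `<think>` and the last `</think>` when a response contains multiple thinking blocks.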
Appendix: Model Comparison Tables
Extraction Models (11)
| Model | Size | Context | Reasoning | Settings |
|---|---|---|---|---|
| falcon_h1_100m | 100M | 32K | No | temp=0.2 |
| gemma3_270m | 270M | 32K | No | temp=0.3 |
| ernie_300m | 300M | 131K | No | temp=0.2 |
| granite_350m | 350M | 32K | No | temp=0.1 |
| bitcpm4_500m | 500M | 128K | No | temp=0.2 |
| hunyuan_500m | 500M | 256K | No | temp=0.2 |
| qwen3_600m_q4 | 600M | 32K | Hybrid | temp=0.3 |
| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.3 |
| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.2 |
| qwen3_1.7b_q4 | 1.7B | 32K | Hybrid | temp=0.3 |
| lfm2_extract_1.2b | 1.2B | 32K | No | temp=0.2 |
Synthesis Models (16)
| Model | Size | Context | Reasoning | Settings |
|---|---|---|---|---|
| granite_3_1_1b_q8 | 1B | 128K | No | temp=0.7 |
| falcon_h1_1.5b_q4 | 1.5B | 32K | No | temp=0.1 |
| qwen3_1.7b_q4 | 1.7B | 32K | Hybrid | temp=0.8 |
| granite_3_3_2b_q4 | 2B | 128K | No | temp=0.8 |
| youtu_llm_2b_q8 | 2B | 128K | Hybrid | temp=0.8 |
| lfm2_2_6b_transcript | 2.6B | 32K | No | temp=0.7 |
| breeze_3b_q4 | 3B | 32K | No | temp=0.7 |
| granite_3_1_3b_q4 | 3B | 128K | No | temp=0.8 |
| qwen3_4b_thinking_q3 | 4B | 256K | Thinking-only | temp=0.8 |
| granite4_tiny_q3 | 7B | 128K | No | temp=0.8 |
| ernie_21b_pt_q1 | 21B | 128K | No | temp=0.8 |
| ernie_21b_thinking_q1 | 21B | 128K | Thinking-only | temp=0.9 |
| glm_4_7_flash_reap_30b | 30B | 128K | Thinking-only | temp=0.8 |
| glm_4_7_flash_30b_iq2 | 30B | 128K | No | temp=0.7 |
| qwen3_30b_thinking_q1 | 30B | 256K | Thinking-only | temp=0.8 |
| qwen3_30b_instruct_q1 | 30B | 256K | No | temp=0.7 |
Embedding Models (4)
| Model | Size | Dimension | Speed | Quality |
|---|---|---|---|---|
| granite-107m | 107M | 384 | Fastest | Good |
| granite-278m | 278M | 768 | Balanced | Better |
| gemma-300m | 300M | 768 | Fast | Good |
| qwen-600m | 600M | 1024 | Slower | Best |
Next Steps
Once approved, implementation will proceed in the order outlined in the Priority section. All code will be committed with descriptive messages referencing this plan document.
Ready for implementation approval.
Document Version: 1.1
Last Updated: 2026-02-05
Author: Claude (Anthropic)
Reviewer: Updated post-implementation to match actual code