# Model Benchmark Report: Transcript Summarization

**Hardware:** Intel Core Ultra 155H, 16GB DRAM
**Test File:** `transcripts/full.txt` (204 lines, ~1 hour meeting)
**Test Date:** 2026-01-30

## Executive Summary

### 🏆 Winner: Qwen3-1.7B (65% quality)

Six models under 2B parameters were tested on summarizing a business meeting transcript. **Qwen3-1.7B** scored 65/100, well ahead of the runner-up (Qwen3-0.6B, 36/100), making it the **recommended choice** for production use.

### Performance Ranking

| Rank | Model | Parameters | Quality | Verdict |
|------|-------|------------|---------|---------|
| 1️⃣ | **Qwen3-1.7B** | 1.7B | **65/100** | ✅ **RECOMMENDED** |
| 2️⃣ | Qwen3-0.6B | 0.6B | 36/100 | ⚠️ Fair |
| 3️⃣ | Qwen2-1.5B-Instruct | 1.5B | 35/100 | ⚠️ Fair |
| 3️⃣ | LFM2-1.2B | 1.2B | 35/100 | ⚠️ Fair |
| 5️⃣ | Granite-4.0-h-tiny | ~0.8B | 30/100 | ❌ Poor |
| 6️⃣ | Granite-4.0-h-1B | 1.0B | 25/100 | ❌ Poor |

**Not Tested:** LFM2-8B-A1B (8B parameters) - requires 32GB+ RAM, not practical for 16GB systems.

---

## Detailed Model Analysis

### 1. Qwen3-1.7B ⭐ WINNER

**Strengths:**
- ✅ Most detailed and structured output
- ✅ Captured 4 vendor names (Samsung, Hynix, Micron, SanDisk)
- ✅ Included specific market data (50% AI allocation, 15% supply reduction)
- ✅ Correct technical terminology (D4, D5, DDR, NAND)
- ✅ Manufacturing details (Shenzhen, 華天, 佩頓)
- ✅ Multiple timelines (2023 Q2, Q3, 2024 Q2, 2027 Q1)

**Weaknesses:**
- ⚠️  Section 4 incomplete (hit 1024 token limit)
- ⚠️  Missing customer names (Inspur, ZTE, Cangbao)
- ⚠️  No pricing information
- ⚠️  Timeline confusion (said 2023 Q3 instead of 2025 Q3)

**Quality Metrics:**
- Completeness: 65%
- Specificity: 60%
- Accuracy: 80%
- Actionability: 55%

**Summary Length:** 933 chars (32 lines)
**Thinking Content:** 726 chars

---

### 2. Qwen2-1.5B-Instruct & LFM2-1.2B (TIE)

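**Note:** These two models produced **identical summaries**. That is unexpected for independently trained models and suggests either overfitting or a processing issue, so the result should be rechecked before it is relied on.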

**Strengths:**
- ✅ Structured 7-point format
- ✅ Mentions key speakers (SPEAKER_02, SPEAKER_03)
- ✅ Some domain concepts (supply chain, AI impact)

**Weaknesses:**
- ❌ **Major hallucination:** Focuses on Samsung as the main company (transcript is about a module house customer)
- ❌ **Timeline error:** Says discussion was in 2022 Q3 (transcript indicates 2025+)
- ❌ **Geographic error:** Describes Hong Kong, Taipei, and Shenzhen as being on "different continents"
- ❌ **No specific details:** No vendor names, no customer names, no quantitative data
- ❌ **No business insights:** Lacks actionable information

**Quality Metrics:**
- Completeness: 35%
- Specificity: 25%
- Accuracy: 40% (major hallucinations)
- Actionability: 30%

**Summary Length:** ~570 chars (16 lines)

---

### 3. Granite-4.0-h-1B

**Strengths:**
- ✅ Clear 8-point structure
- ✅ Identifies some technologies (DDR, Flash, MTK, Realtek)
- ✅ Mentions Samsung and Hynix

**Weaknesses:**
- ❌ **Major hallucination:** Claims COVID-19 pandemic impact (not mentioned in transcript)
- ❌ **Very generic:** Could apply to any semiconductor industry discussion
- ❌ **No specific details:** No timelines, no quantities, no customer names
- ❌ **No manufacturing details**
- ❌ **No pricing or market data**

**Quality Metrics:**
- Completeness: 25%
- Specificity: 20%
- Accuracy: 50% (COVID hallucination)
- Actionability: 15%

**Summary Length:** 1558 chars (11 lines)

---

### 4. Qwen3-0.6B (Baseline)

**Strengths:**
- ✅ Captured core topic (supply challenges)
- ✅ Some structure

**Weaknesses:**
- ❌ **Transcription error:** "Lopar" instead of "LPDDR"
- ❌ **Too generic:** Only 18 lines, minimal detail
- ❌ **No specific vendor names** beyond Samsung
- ❌ **No customer names**
- ❌ **No quantitative data**
- ❌ **No manufacturing details**

**Quality Metrics:**
- Completeness: 30%
- Specificity: 20%
- Accuracy: 70%
- Actionability: 25%

**Summary Length:** 537 chars (18 lines)

---

### 5. Granite-4.0-h-tiny (~0.8B)

**Strengths:**
- ✅ Clean 5-point structure
- ✅ Mentions vendors (Samsung, Hynix, Micron, "易基")
- ✅ Identifies product types (HBM, DDR5, DDR, NAND, DRAM)
- ✅ Discusses market trends and challenges

**Weaknesses:**
- ❌ **Transcription error:** "Lopar" instead of LPDDR (same as Qwen3-0.6B)
- ❌ **Very generic:** No specific details, no quantitative data
- ❌ **No customer names** captured
- ❌ **No manufacturing details** (locations, partners)
- ❌ **No pricing information**
- ❌ **No specific timelines** beyond generic "future years"
- ❌ **No business insights** or actionable information
- ❌ **Slowest to load** (~579 s load time; ~17.5 minutes in total to load and process)

**Quality Metrics:**
- Completeness: 30%
- Specificity: 20%
- Accuracy: 65% (minor transcription errors)
- Actionability: 20%

**Summary Length:** 583 chars (10 lines)
**Processing Time:** ~17.5 minutes (longest load time of all tested models)

**Note:** Despite the "tiny" name, this model performed poorly: it took longer to load than any other tested model and produced generic summaries with transcription errors.

---

## Feature Comparison Matrix

| Feature | Qwen3-0.6B | Qwen3-1.7B | Qwen2-1.5B | LFM2-1.2B | Granite-1B | Granite-Tiny |
|---------|------------|------------|------------|-----------|------------|-------------|
| **Vendor Names** | 1 | 4 | 2 | 2 | 2 | 4 |
| **Customer Names** | 0 | 1 | 0 | 0 | 0 | 0 |
| **Timelines** | 2 | 4 | 1 | 1 | 0 | 0 |
| **Quantitative Data** | None | Some (50%, 15%) | None | None | None | None |
| **Technical Terms** | Poor | Good | Fair | Fair | Fair | Poor (Lopar) |
| **Manufacturing Info** | None | Shenzhen, etc. | Generic | Generic | None | None |
| **Business Insights** | Generic | Specific | Generic | Generic | Generic | Generic |
| **Hallucinations** | Minor (Lopar) | Minor | Major (Samsung) | Major (Samsung) | Major (COVID) | Minor (Lopar) |
| **Structure** | Simple | Excellent | Good | Good | Good | Good |
| **Length** | 537 chars | 933 chars | 570 chars | 570 chars | 1558 chars | 583 chars |
| **Speed** | Fastest | Good (~18 min) | Fast | Fast | Slow (~17 min) | Slowest (~17.5 min) |

---

## Performance Metrics

### Speed Comparison

| Model | Load Time | Eval Speed (tokens/s) | Verdict |
|-------|-----------|-----------------------|---------|
| Qwen3-1.7B | ~115s | 1.04 | Good |
| Qwen2-1.5B | Fast | Fast | Very Good |
| LFM2-1.2B | ~75s | 3.11 | Very Good |
| Granite-Tiny | ~579s | 1.71 | Poor |
| Granite-1B | ~213s | 2.55 | Good |
| Qwen3-0.6B | Fastest | Fastest | Excellent |
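
The load time and eval speed in the table above can be combined into a rough run-time estimate, which helps judge the cost of a larger token budget. The sketch below is only an approximation: the function name `estimate_run_seconds` is hypothetical, it assumes the model generates the full `max_tokens`, and it ignores prompt processing time, so real runs will differ.

```python
def estimate_run_seconds(load_s: float, eval_tok_per_s: float, max_tokens: int) -> float:
    """Rough estimate: load time plus the time to generate max_tokens tokens.

    Assumptions: the full token budget is used and prompt processing is ignored.
    """
    if eval_tok_per_s <= 0:
        raise ValueError("eval_tok_per_s must be positive")
    return load_s + max_tokens / eval_tok_per_s


# Figures for Qwen3-1.7B taken from the table (~115 s load, 1.04 tokens/s).
for budget in (1024, 2048):
    minutes = estimate_run_seconds(115, 1.04, budget) / 60
    print(f"max_tokens={budget}: ~{minutes:.1f} min")
```

Under these assumptions the 1024-token budget works out to about 18 minutes, in line with the reported figure for Qwen3-1.7B, while a 2048-token budget would take roughly 35 minutes if fully used.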

### Memory Usage (Intel Arc Graphics on Core Ultra 155H)

| Model | SYCL Buffer | Host Buffer | Total | Fits in 16GB? |
|-------|-------------|-------------|-------|---------------|
| Qwen3-1.7B | 1050 MB | 72 MB | ~1.1 GB | ✅ Yes |
| Qwen2-1.5B | ~900 MB | ~70 MB | ~1 GB | ✅ Yes |
| LFM2-1.2B | 2160 MB | 72 MB | ~2.2 GB | ✅ Yes |
| Granite-Tiny | 899 MB | 96 MB | ~1 GB | ✅ Yes |
| Granite-1B | 896 MB | 96 MB | ~1 GB | ✅ Yes |
| Qwen3-0.6B | 359 MB | 68 MB | ~0.4 GB | ✅ Yes |

**All models fit comfortably in 16GB DRAM** with GPU acceleration.

**Note:** LFM2-8B-A1B (8B parameters) was NOT tested as it requires 32GB+ RAM and would be impractically slow on 16GB systems.

---

## Critical Findings

### 🚨 Red Flags

1. **Qwen2 and LFM2 produced identical summaries**
   - Suggests overfitting to training patterns or a processing issue in the pipeline
   - Not reliable for business-critical applications
   - Recommendation: Avoid these models until the cause is identified

2. **Granite hallucinated COVID-19**
   - Transcript mentions no pandemic-related issues
   - Model injected external knowledge
   - Recommendation: Verify critical facts

3. **All models missed key details**
   - No model captured pricing information
   - No model captured "900K/month" demand figure
   - No model captured "best in 30 years" market assessment
   - This suggests **max_tokens=1024 is too limiting**
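
The hallucination findings above (an injected COVID-19 reference, a wrong year) lend themselves to a simple automated grounding check: flag any name or number in a summary that never occurs in the transcript. The following is a minimal sketch, not part of the benchmarked pipeline. The function `ungrounded_terms`, the regular expression, and the sample strings are illustrative assumptions, and plain substring matching will miss paraphrases.

```python
import re

# Capitalized tokens (names, acronyms), numbers with optional %, and CJK runs.
TERM = re.compile(r"\b[A-Z][A-Za-z0-9-]*|\d+(?:[.,]\d+)*%?|[\u4e00-\u9fff]+")


def ungrounded_terms(summary: str, source: str) -> list[str]:
    """Return terms from the summary that do not appear anywhere in the source."""
    haystack = source.lower()
    seen: set[str] = set()
    missing: list[str] = []
    for match in TERM.finditer(summary):
        term = match.group().rstrip("-")
        key = term.lower()
        if not key or key in seen:
            continue
        seen.add(key)
        if key not in haystack:
            missing.append(term)
    return missing


# Made-up sample text for illustration only (not taken from the real transcript).
source = "SPEAKER_02: Samsung and Hynix cut supply by 15% starting 2025 Q3."
summary = "Samsung reduced supply by 15% in 2023 Q3 because of COVID-19."
print(ungrounded_terms(summary, source))
```

Any flagged term is a candidate for manual review rather than proof of a hallucination, since a correct summary may legitimately rephrase the source.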

### ✅ Strengths by Use Case

| Use Case | Best Model | Alternative |
|----------|------------|-------------|
| **Quick overview** | Qwen3-0.6B | Qwen2-1.5B |
| **Business decision** | Qwen3-1.7B | None adequate |
| **Technical summary** | Qwen3-1.7B | Qwen2-1.5B |
| **Speed-critical** | Qwen3-0.6B | Qwen2-1.5B |
| **Comprehensive** | Qwen3-1.7B | Increase max_tokens |

---

## Recommendations

### Immediate Actions

1. **Use Qwen3-1.7B as default**
   ```python
   # In summarize_transcript.py line 91:
   default="unsloth/Qwen3-1.7B-GGUF:Q4_K_M"
   ```

2. **Increase max_tokens to prevent cutoff**
   ```python
   # Line 59:
   max_tokens=2048  # Instead of 1024
   ```

3. **Add validation for duplicate outputs**
   - Qwen2 and LFM2 produced identical summaries
   - Add a checksum comparison or an output-diversity test across models
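
One way to implement the checksum idea is to hash each model's summary after normalizing whitespace and case, then report models that share a digest. This is a sketch under assumptions: the function name `find_identical_outputs` and the placeholder summaries are hypothetical, and an exact-match hash will not catch near-duplicates.

```python
import hashlib
from collections import defaultdict


def find_identical_outputs(summaries: dict[str, str]) -> list[list[str]]:
    """Group model names whose normalized summaries hash to the same digest."""
    groups: dict[str, list[str]] = defaultdict(list)
    for model, text in summaries.items():
        # Collapse whitespace and lowercase so trivial formatting changes do not hide a match.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(model)
    return [sorted(models) for models in groups.values() if len(models) > 1]


# Placeholder summaries for illustration only.
outputs = {
    "Qwen3-1.7B": "Supply is tight; vendors are reallocating capacity.",
    "Qwen2-1.5B-Instruct": "The meeting covered   supply chain issues.",
    "LFM2-1.2B": "The meeting covered supply chain issues.",
}
duplicates = find_identical_outputs(outputs)
print(duplicates)
```

A non-empty result would be a reason to rerun the affected models and confirm that each run actually loaded a different model file.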

### For Different Hardware

| Available RAM | Recommended Model |
|---------------|-------------------|
| 8GB | Qwen3-0.6B (functional but limited) |
| 16GB | **Qwen3-1.7B** ✅ |
| 32GB | Qwen3-4B or Qwen3-14B (if available) |
| 64GB+ | Larger models (7B-14B range) |

### Quality Improvement Strategies

1. **Two-Stage Summarization**
   ```
   Stage 1: Extract key facts (entities, dates, numbers)
   Stage 2: Generate narrative with context
   ```

2. **Chunking for Long Transcripts**
   ```
   Input: 1-hour meeting
   ↓
   Split into 10-min segments
   ↓
   Summarize each segment
   ↓
   Combine and refine
   ```

3. **Custom Prompts**
   ```
   System prompt enhancements:
   - "Extract specific vendor names"
   - "Include pricing information"
   - "Note exact dates and timelines"
   - "List customer companies mentioned"
   ```

4. **Post-Processing Validation**
   ```
   - Verify extracted entities against source
   - Check for hallucinations (external knowledge injection)
   - Validate timelines and numbers
   - Flag low-confidence sections
   ```
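
The chunking strategy above can be sketched as a small map-then-combine routine. This is an illustration, not the project's implementation: `chunk_lines` and `summarize_in_chunks` are hypothetical names, the `summarize` argument stands in for whatever function calls the model, and splitting by line count (34 lines, roughly 10 minutes of the 204-line, ~1-hour test file) is an assumption because no timestamps are described.

```python
from typing import Callable


def chunk_lines(lines: list[str], chunk_size: int) -> list[list[str]]:
    """Split transcript lines into consecutive chunks of at most chunk_size lines."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def summarize_in_chunks(
    lines: list[str],
    summarize: Callable[[str], str],
    chunk_size: int = 34,
) -> str:
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    partials = [summarize("\n".join(chunk)) for chunk in chunk_lines(lines, chunk_size)]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    return summarize("\n".join(partials))


def first_line(text: str) -> str:
    """Stand-in summarizer for demonstration: returns the first line of its input."""
    return text.splitlines()[0]


transcript = [f"line {i}" for i in range(204)]
print(len(chunk_lines(transcript, 34)))
print(summarize_in_chunks(transcript, first_line))
```

In practice `first_line` would be replaced by a call to the chosen model, and each chunk gets its own token budget, which may reduce the truncation seen with a single 1024-token pass.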

---

## Conclusion

### 🏆 Final Recommendation

**Use Qwen3-1.7B-GGUF:Q4_K_M as your production model.**

It provides:
- ✅ 65% quality score (best tested)
- ✅ Specific, actionable insights
- ✅ Good domain knowledge
- ✅ Fits in 16GB RAM with GPU acceleration
- ✅ Reasonable speed (~18 minutes for 1-hour transcript)

### 📊 Expected Improvements

Estimated gains from the recommended changes (projections, not measured results; the individual gains overlap and do not simply add up):

| Change | Quality Gain | Implementation Effort |
|--------|--------------|----------------------|
| Increase max_tokens to 2048 | +15% | Low (1 line) |
| Chunking (>30 min meetings) | +20% | Medium |
| Custom prompts | +10% | Low |
| Two-stage summarization | +15% | High |
| **Combined** | **~85% quality** | Medium-High |

### 🎯 Success Metrics

With Qwen3-1.7B plus these improvements, the targets are:
- **85% completeness** (up from 65%)
- **All vendor names captured**
- **Customer names identified**
- **Pricing information extracted**
- **Timelines validated**
- **Actionable business insights**

Meeting these targets would make the system suitable for executive decision-making, sales strategy, operations planning, and financial forecasting.

---

**Report Generated:** 2026-01-30
**Test Environment:** Intel Core Ultra 155H, 16GB DRAM, Intel Arc Graphics
**Models Tested:** 6 (under 2B parameters)
**Best Model:** Qwen3-1.7B-GGUF:Q4_K_M
**Quality Score:** 65/100 (recommended for production)