# ShortSmith v2 - Testing & Evaluation Guide

## Overview
This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score and store findings.

---

## 1. What We're Testing

### 1.1 Core Quality Metrics

| Component | What It Does | Key Output |
|-----------|--------------|------------|
| **Visual Analysis** | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| **Audio Analysis** | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| **Motion Detection** | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| **Viral Hooks** | Finds optimal clip start points | hook_type, confidence, intensity |
| **Hype Scoring** | Combines all signals | combined_score (0-1), rank |

### 1.2 End-to-End Quality

- **Clip Relevance**: Are the extracted clips actually highlights?
- **Hook Effectiveness**: Do clips start at engaging moments?
- **Domain Accuracy**: Does sports mode pick different clips than podcast mode?
- **Person Filtering**: When enabled, does the target person appear in the clips?

---

## 2. Test Dataset Requirements

### 2.1 Video Categories (Minimum 3 videos per category)

| Domain | Video Type | Duration | Source Examples |
|--------|-----------|----------|-----------------|
| **Sports** | Football/Basketball highlights | 5-15 min | YouTube sports channels |
| **Music** | Music videos, concerts | 3-8 min | Official MVs, live performances |
| **Gaming** | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| **Vlogs** | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| **Podcasts** | Interview/discussion clips | 15-30 min | Podcast video episodes |
| **General** | Mixed content | 5-15 min | Random viral videos |

### 2.2 Test Video Naming Convention
```
{domain}_{video_id}_{duration_mins}min.mp4

Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
```
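A minimal sketch for checking that a test video follows the naming convention, based only on the examples above. The regular expression, `VideoName`, and `parse_video_name` are illustrative names and not part of ShortSmith. The pattern assumes a lowercase domain, a numeric ID, and a whole number of minutes followed by `min`.

```python
import re
from typing import NamedTuple

# Assumed pattern: lowercase domain, numeric id, whole minutes with a "min" suffix.
NAME_RE = re.compile(r"^(?P<domain>[a-z]+)_(?P<video_id>\d+)_(?P<minutes>\d+)min\.mp4$")


class VideoName(NamedTuple):
    domain: str
    video_id: str
    duration_mins: int


def parse_video_name(filename: str) -> VideoName:
    """Split a test video filename into its parts, or raise ValueError."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Not a valid test video name: {filename!r}")
    return VideoName(match["domain"], match["video_id"], int(match["minutes"]))


print(parse_video_name("sports_001_10min.mp4"))
```

Keeping `video_id` as a string preserves the zero padding, so it can be reused unchanged in run names and spreadsheet rows.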

---

## 3. Testing Methodology

### 3.1 Automated Metrics (System-Generated)

For each test run, capture these from the pipeline output:

```
Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected

Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
```
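One way to hold these fields in code is a pair of dataclasses. This is a sketch only: the field names follow the list above, but the class names, the types, and the sample values are assumptions, not the pipeline's actual output format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClipMetrics:
    """Per-clip fields from the list above (types are assumed)."""
    clip_id: int
    start_time: float  # seconds
    end_time: float    # seconds
    hype_score: float
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None          # None when no hook was detected
    hook_confidence: Optional[float] = None

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


@dataclass
class RunMetrics:
    """Processing-level fields for one test run."""
    processing_time_seconds: float
    frames_analyzed: int
    scenes_detected: int
    hooks_detected: int
    clips: List[ClipMetrics] = field(default_factory=list)


# Placeholder values for illustration only.
clip = ClipMetrics(1, 45.2, 60.2, 0.82, 0.75, 0.88, 0.71, "GOAL_MOMENT", 0.9)
run = RunMetrics(processing_time_seconds=240.0, frames_analyzed=120,
                 scenes_detected=18, hooks_detected=5, clips=[clip])
print(run.clips[0].duration)
```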

### 3.2 Human Evaluation Criteria

Each clip needs human scoring on these dimensions:

#### A. Highlight Quality (1-5 scale)
```
1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)
```

#### B. Hook Effectiveness (1-5 scale)
```
1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)
```

#### C. Domain Appropriateness (Yes/No/Partial)
```
Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?
```

### 3.3 A/B Testing (Optional)

Compare outputs with different settings:
- Viral hooks ON vs OFF
- Different domain presets for same video
- Custom prompt vs no prompt

---

## 4. Test Execution Process

### 4.1 Per-Video Test Run

```
1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in spreadsheet
```
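Steps 7 and 8 can be sketched as follows. The function name and the input layout (one `(highlight, hook)` tuple per team member) are assumptions for illustration. The checks enforce the 2+ scorer rule and the 1-5 scales from Section 3.2.

```python
from statistics import mean
from typing import Dict, List, Tuple

Rating = Tuple[int, int]  # (highlight 1-5, hook 1-5) from one team member


def average_clip_scores(ratings: Dict[int, List[Rating]]) -> Dict[int, Dict[str, float]]:
    """Average the human scores per clip; requires at least 2 scorers per clip."""
    averaged = {}
    for clip_id, clip_ratings in ratings.items():
        if len(clip_ratings) < 2:
            raise ValueError(f"Clip {clip_id} needs scores from at least 2 team members")
        for highlight, hook in clip_ratings:
            if not (1 <= highlight <= 5 and 1 <= hook <= 5):
                raise ValueError(f"Clip {clip_id} has a score outside the 1-5 scale")
        averaged[clip_id] = {
            "highlight": mean(h for h, _ in clip_ratings),
            "hook": mean(k for _, k in clip_ratings),
        }
    return averaged


# Two scorers, three clips (made-up scores).
scores = average_clip_scores({1: [(5, 4), (4, 4)], 2: [(3, 2), (4, 3)], 3: [(2, 2), (2, 3)]})
print(scores)
```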

### 4.2 Test Run Naming
```
{date}_{domain}_{video_id}_{tester_initials}

Example: 2024-01-15_sports_001_CM
```

---

## 5. Scoring & Storage

### 5.1 Results Spreadsheet Structure

Create a Google Sheet or Excel workbook with these columns:

| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
|------|----------|--------|-------|-------|-----|------------|--------|-------|--------|-----------|----------------------|------------------|---------------------|-------|

### 5.2 Aggregate Metrics to Track

Calculate weekly/monthly:

```
Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)

Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts/general

Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation
```
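The aggregates above can be computed from the spreadsheet rows with the standard library. This sketch makes two assumptions that the guide does not settle: only `Y` counts as a domain match (so `P` counts as a miss), and "correlation" means the Pearson coefficient. The row keys follow the column names in Appendix C; the function names are illustrative.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two sequences of equal length with at least 2 values")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


def aggregate(rows):
    """Overall quality metrics for a list of per-clip result rows."""
    matches = [r["domain_match"] for r in rows]
    return {
        "avg_highlight": mean(r["human_highlight"] for r in rows),
        "avg_hook": mean(r["human_hook"] for r in rows),
        # Assumption: only "Y" counts as a match; "P" (partial) does not.
        "domain_match_rate": matches.count("Y") / len(matches),
        "hype_human_corr": pearson([r["hype_score"] for r in rows],
                                   [r["human_highlight"] for r in rows]),
    }


rows = [
    {"hype_score": 0.82, "human_highlight": 5, "human_hook": 5, "domain_match": "Y"},
    {"hype_score": 0.71, "human_highlight": 4, "human_hook": 4, "domain_match": "Y"},
    {"hype_score": 0.65, "human_highlight": 3, "human_hook": 3, "domain_match": "P"},
]
print(aggregate(rows))
```

With only a handful of clips the correlation is very noisy, so it is more meaningful on the weekly or monthly batch than on a single video. The per-domain breakdown is the same call applied to rows filtered by domain.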

### 5.3 Quality Thresholds

| Metric | Poor | Acceptable | Good | Excellent |
|--------|------|------------|------|-----------|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
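
A small helper can map a measured value to the bands in this table. The table leaves the shared edges ambiguous (for example, 3.5 appears in both Acceptable and Good), so this sketch assumes a value on the lower edge of a band belongs to that band, except that Excellent requires strictly more than its edge, as the `>` in the table states. The names are illustrative, and Domain Match is expressed in percent.

```python
# (poor_below, good_from, excellent_above) per metric, copied from the table above.
THRESHOLDS = {
    "highlight_score": (2.5, 3.5, 4.2),
    "hook_score": (2.5, 3.5, 4.2),
    "domain_match": (60, 75, 90),  # percent
    "hype_human_correlation": (0.3, 0.5, 0.7),
}


def rate(metric: str, value: float) -> str:
    """Return Poor / Acceptable / Good / Excellent for a metric value."""
    poor_below, good_from, excellent_above = THRESHOLDS[metric]
    if value < poor_below:
        return "Poor"
    if value < good_from:
        return "Acceptable"
    if value <= excellent_above:  # assumption: the edge value itself is still "Good"
        return "Good"
    return "Excellent"


print(rate("highlight_score", 3.8))
print(rate("domain_match", 55))
```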

---

## 6. Issue Tracking

### 6.1 Common Issues to Watch For

| Issue | How to Identify | Priority |
|-------|-----------------|----------|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% of clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |
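
The first three rows of this table can be checked automatically once human scores are logged. A minimal sketch, assuming rows keyed like the Appendix C columns and treating "good highlight" as a human highlight score of 4 or higher (the guide does not define the cutoff). Function names are illustrative.

```python
def flag_clip_issues(clip: dict) -> list:
    """Return the issue labels from the table that apply to one clip."""
    issues = []
    if clip["human_highlight"] < 2 and clip["hype_score"] > 0.7:
        issues.append("Wrong clip selected")
    # Assumption: "good highlight" means a human highlight score of 4 or more.
    if clip["human_highlight"] >= 4 and clip["human_hook"] < 2:
        issues.append("Bad hook timing")
    return issues


def has_domain_mismatch(clips: list) -> bool:
    """True when more than 30% of clips were scored Domain_Match = N."""
    misses = sum(1 for c in clips if c["domain_match"] == "N")
    return misses / len(clips) > 0.30


batch = [
    {"hype_score": 0.80, "human_highlight": 1, "human_hook": 1, "domain_match": "N"},
    {"hype_score": 0.75, "human_highlight": 4, "human_hook": 1, "domain_match": "Y"},
    {"hype_score": 0.60, "human_highlight": 4, "human_hook": 4, "domain_match": "N"},
]
for clip in batch:
    print(flag_clip_issues(clip))
print(has_domain_mismatch(batch))
```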

### 6.2 Bug Report Template

```
Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}
```

---

## 7. Test Schedule

Tentative. Section 11.4 gives the recommended week-by-week experiment order.

## 8. Success Criteria

### MVP Launch Requirements
- [ ] Avg Highlight Score > 3.5 across all domains
- [ ] Avg Hook Score > 3.5 across all domains
- [ ] Domain Match Rate > 75%
- [ ] No critical processing errors
- [ ] Processing time < 10 min for 10-min video on A10G
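
The checklist above can be evaluated from an aggregate summary. This is a sketch under assumptions: the summary keys are hypothetical, the domain match rate is given in percent, and the processing time is the measured minutes for a 10-minute video on an A10G.

```python
def check_mvp(summary: dict) -> dict:
    """Map each MVP launch requirement to True (met) or False (not met)."""
    return {
        "Avg Highlight Score > 3.5": summary["avg_highlight"] > 3.5,
        "Avg Hook Score > 3.5": summary["avg_hook"] > 3.5,
        "Domain Match Rate > 75%": summary["domain_match_pct"] > 75,
        "No critical processing errors": summary["critical_errors"] == 0,
        "Processing time < 10 min": summary["processing_minutes"] < 10,
    }


# Made-up numbers for illustration.
result = check_mvp({
    "avg_highlight": 3.7,
    "avg_hook": 3.4,
    "domain_match_pct": 82,
    "critical_errors": 0,
    "processing_minutes": 7.5,
})
for requirement, met in result.items():
    print(f"[{'x' if met else ' '}] {requirement}")
print("Ready for MVP:", all(result.values()))
```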

### Stretch Goals
- [ ] Avg Highlight Score > 4.0
- [ ] Avg Hook Score > 4.0
- [ ] Domain Match Rate > 90%
- [ ] Hype-Human correlation > 0.6

---

## 11. Experimentation Guide

This section explains how to systematically test different settings to find optimal configurations.

### 11.1 What Settings Can Be Changed

| Setting | Location | Default | Range | Impact |
|---------|----------|---------|-------|--------|
| **Coarse Sample Interval** | config.py | 5.0 sec | 2-10 sec | More frames = better accuracy, slower processing |
| **Clip Duration** | UI Slider | 15 sec | 5-30 sec | Longer clips = more context, but harder to hold attention |
| **Number of Clips** | UI Slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| **Domain Selection** | UI Dropdown | General | 6 options | Changes weight distribution for scoring |
| **Hype Threshold** | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| **Min Gap Between Clips** | config.py | 30 sec | 10-60 sec | Higher = more spread out clips |
| **Scene Threshold** | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
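
Before starting an experiment, it helps to confirm that every changed setting stays inside the ranges in this table. The sketch below encodes the defaults and ranges as data. The key names are hypothetical and do not reflect the actual variable names in `config.py`.

```python
# name: (default, minimum, maximum), copied from the table above.
# Key names are illustrative, not the real config.py identifiers.
SETTINGS = {
    "coarse_sample_interval": (5.0, 2.0, 10.0),  # seconds
    "clip_duration": (15, 5, 30),                # seconds
    "num_clips": (3, 1, 3),
    "hype_threshold": (0.3, 0.1, 0.6),
    "min_gap_between_clips": (30, 10, 60),       # seconds
    "scene_threshold": (27.0, 20.0, 35.0),
}


def build_experiment_config(overrides: dict) -> dict:
    """Return defaults with overrides applied, rejecting out-of-range values."""
    config = {name: default for name, (default, _, _) in SETTINGS.items()}
    for name, value in overrides.items():
        if name not in SETTINGS:
            raise ValueError(f"Unknown setting: {name}")
        _, low, high = SETTINGS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the range {low}-{high}")
        config[name] = value
    return config


print(build_experiment_config({"clip_duration": 25, "hype_threshold": 0.4}))
```

Changing one setting per run and leaving the rest at their defaults keeps each experiment comparable to the baseline.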

### 11.2 Experiment Types

#### Experiment A: Domain Preset Comparison
**Goal**: Verify that domain selection actually affects results

**Method**:
1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
2. Run ShortSmith 3 times with different domain settings:
   - Run 1: Gaming
   - Run 2: Music
   - Run 3: General
3. Compare the 3 clips from each run

**What to Record**:
```
Video: {video_id}
Experiment: Domain Comparison

| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1   | Gaming |             |             |             |            |       |
| 2   | Music  |             |             |             |            |       |
| 3   | General|             |             |             |            |       |

Question: Did different domains pick different moments?  Y / N
Question: Which domain worked best for this video? ____________
```
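
To answer the first question without relying on memory, the clip start times from two runs can be compared directly. This sketch treats two clips as "the same moment" when their start times are within 5 seconds of each other. That tolerance is an assumption to tune, not a value from the pipeline.

```python
def shared_moments(run_a, run_b, tolerance=5.0):
    """Count clips in run_a whose start is within `tolerance` seconds of a clip in run_b."""
    return sum(
        any(abs(start_a - start_b) <= tolerance for start_b in run_b)
        for start_a in run_a
    )


# Made-up start times (seconds) for the same video under two domain settings.
gaming_starts = [45.2, 120.5, 200.0]
music_starts = [47.0, 300.0, 410.0]

overlap = shared_moments(gaming_starts, music_starts)
print(f"{overlap} of {len(gaming_starts)} clips overlap")
print("Different moments picked:", "Y" if overlap < len(gaming_starts) else "N")
```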

#### Experiment B: Clip Duration Impact
**Goal**: Find optimal clip length for engagement

**Method**:
1. Pick 1 video
2. Run ShortSmith 3 times with different durations:
   - Run 1: 10 seconds
   - Run 2: 15 seconds
   - Run 3: 25 seconds
3. Watch all clips and score

**What to Record**:
```
Video: {video_id}
Experiment: Duration Comparison

| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec   |                  | Y/N                     | Y/N               |
| 15 sec   |                  | Y/N                     | Y/N               |
| 25 sec   |                  | Y/N                     | Y/N               |

Optimal duration for this content type: ___ seconds
```

#### Experiment C: Custom Prompt Effectiveness
**Goal**: Test if custom prompts improve clip selection

**Method**:
1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
2. Run twice:
   - Run 1: No custom prompt
   - Run 2: Custom prompt = "Focus on crowd reactions"
3. Compare results

**What to Record**:
```
Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test

| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1   | No          | Y/N                       | Y/N               | Y/N               |
| 2   | Yes         | Y/N                       | Y/N               | Y/N               |

Did custom prompt help? Y / N / Unclear
```

#### Experiment D: Person Filter Accuracy
**Goal**: Test if person filtering actually prioritizes target person

**Method**:
1. Pick a video with 2+ people clearly visible
2. Get a clear reference photo of 1 person
3. Run twice:
   - Run 1: No reference image
   - Run 2: With reference image
4. Count target person screen time in each clip

**What to Record**:
```
Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test

| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1   | No         |                |                |                |       |
| 2   | Yes        |                |                |                |       |

Did filter increase target person screen time? Y / N
By how much? ___% improvement
```
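
"% improvement" can mean either a difference in percentage points or a relative increase, so it is worth recording both. A minimal sketch with an illustrative function name; the inputs are the per-clip screen-time percentages from the table above.

```python
from statistics import mean


def person_filter_gain(filter_off, filter_on):
    """Compare average target-person screen time (%) with the filter off and on."""
    avg_off = mean(filter_off)
    avg_on = mean(filter_on)
    point_gain = avg_on - avg_off
    # Relative gain is undefined when the person never appears without the filter.
    relative_gain = (point_gain / avg_off * 100) if avg_off else None
    return {
        "avg_off": avg_off,
        "avg_on": avg_on,
        "point_gain": point_gain,
        "relative_gain_pct": relative_gain,
    }


# Made-up screen-time percentages for Clip1-Clip3.
gain = person_filter_gain([20, 40, 30], [60, 50, 70])
print(gain)
```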

### 11.3 How to Document Experiments

#### Step 1: Create Experiment Log
Make a new sheet/tab called "Experiments" with columns:
```
| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |
```

#### Step 2: Record Before & After
Always record:
- What you changed (the variable)
- What stayed the same (controls)
- The measurable outcome

#### Step 3: Take Screenshots
For each experiment:
- Screenshot the settings used
- Screenshot the processing log
- Save the output clips with clear naming:
  ```
  exp_{type}_{video}_{setting}.mp4

  Examples:
  exp_domain_sports001_gaming.mp4
  exp_domain_sports001_music.mp4
  exp_duration_vlog002_10sec.mp4
  exp_duration_vlog002_25sec.mp4
  ```

### 11.4 Recommended Experiment Order

**Week 1: Baseline + Domain Testing**
```
Day 1-2: Run all 6 domains on 2 "neutral" videos
         Record which domain performs best

Day 3-4: Run domain-matched tests
         (Sports video with Sports setting, Music with Music, etc.)
         Record scores

Day 5: Analyze - Do domain presets actually help vs General?
```

**Week 2: Duration & Clip Count Testing**
```
Day 1-2: Test 10s vs 15s vs 20s vs 25s clips
         On 3 different video types

Day 3-4: Test 1 clip vs 2 clips vs 3 clips
         Does quality drop with more clips?

Day 5: Analyze - Find optimal duration per domain
```

**Week 3: Advanced Features Testing**
```
Day 1-2: Custom prompt experiments
         Try 5 different prompts on same video

Day 3-4: Person filter experiments
         Test accuracy on 3 multi-person videos

Day 5: Analyze - Document which features add value
```

### 11.5 Experiment Scoring Template

For each experiment, fill out:

```
=== EXPERIMENT REPORT ===

Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]

VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________

WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________

RESULTS

Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

WINNER: Setting ___

WHY?
________________________________
________________________________

RECOMMENDATION:
________________________________
________________________________
```

### 11.6 Key Questions to Answer Through Experiments

After completing experiments, you should be able to answer:

**Domain Settings:**
- [ ] Does Sports mode actually pick action moments better than General?
- [ ] Does Music mode find beat drops better?
- [ ] Does Podcast mode find speaking highlights?
- [ ] Which domain works best for "general" content?

**Clip Settings:**
- [ ] What's the ideal clip duration for TikTok/Reels (vertical short-form)?
- [ ] What's the ideal clip duration for YouTube Shorts?
- [ ] Does requesting 1 clip give higher quality than requesting 3?

**Features:**
- [ ] Do custom prompts actually improve results?
- [ ] How accurate is person filtering? (% of clips with target person)
- [ ] What types of custom prompts work best?

**Quality Patterns:**
- [ ] Which content types get the best results?
- [ ] Which content types struggle?
- [ ] Are there patterns in "bad" clips? (e.g., always too early, always misses peak)

---

## Appendix A: Quick Evaluation Form

```
=== ShortSmith Clip Evaluation ===

Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________

CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

Overall Comments:
________________________________
________________________________
```

---

## Appendix B: Domain-Specific Hook Types Reference

### Sports
| Hook Type | What to Look For |
|-----------|-----------------|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |

### Music
| Hook Type | What to Look For |
|-----------|-----------------|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |

### Gaming
| Hook Type | What to Look For |
|-----------|-----------------|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |

### Vlogs
| Hook Type | What to Look For |
|-----------|-----------------|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |

### Podcasts
| Hook Type | What to Look For |
|-----------|-----------------|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |

---

## Appendix C: Sample Test Results Format

```csv
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
```
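
A results file in this format can be loaded with Python's standard `csv` module. The sketch below reads the sample rows from a string so it runs on its own; for a real file, pass the file's contents instead. The function name and the choice of which columns to convert to numbers are assumptions.

```python
import csv
import io

FLOAT_FIELDS = ("start_time", "end_time", "hype_score",
                "visual_score", "audio_score", "motion_score")
INT_FIELDS = ("clip_num", "human_highlight", "human_hook")

SAMPLE = """\
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
"""


def load_results(text: str) -> list:
    """Parse results CSV text into dicts with numeric columns converted."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in FLOAT_FIELDS:
            row[name] = float(row[name])
        for name in INT_FIELDS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


results = load_results(SAMPLE)
print(len(results), "rows loaded; first clip hype score:", results[0]["hype_score"])
```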