# ShortSmith v2 - Testing & Evaluation Guide
## Overview
This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score and store findings.
---
## 1. What We're Testing
### 1.1 Core Quality Metrics
| Component | What It Does | Key Output |
|-----------|--------------|------------|
| **Visual Analysis** | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| **Audio Analysis** | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| **Motion Detection** | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| **Viral Hooks** | Finds optimal clip start points | hook_type, confidence, intensity |
| **Hype Scoring** | Combines all signals | combined_score (0-1), rank |
### 1.2 End-to-End Quality
- **Clip Relevance**: Are the extracted clips actually highlights?
- **Hook Effectiveness**: Do clips start at engaging moments?
- **Domain Accuracy**: Does sports mode pick different clips than podcast mode?
- **Person Filtering**: When enabled, does target person appear in clips?
---
## 2. Test Dataset Requirements
### 2.1 Video Categories (Minimum 3 videos per category)
| Domain | Video Type | Duration | Source Examples |
|--------|-----------|----------|-----------------|
| **Sports** | Football/Basketball highlights | 5-15 min | YouTube sports channels |
| **Music** | Music videos, concerts | 3-8 min | Official MVs, live performances |
| **Gaming** | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| **Vlogs** | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| **Podcasts** | Interview/discussion clips | 15-30 min | Podcast video episodes |
| **General** | Mixed content | 5-15 min | Random viral videos |
### 2.2 Test Video Naming Convention
```
{domain}_{video_id}_{duration_mins}.mp4
Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
```
---
## 3. Testing Methodology
### 3.1 Automated Metrics (System-Generated)
For each test run, capture these from the pipeline output:
```
Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected
Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
```
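For capture scripts, the fields above map naturally onto a pair of dataclasses. This is an illustrative sketch; ShortSmith's pipeline does not necessarily expose these types, and the field names simply mirror the list above:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClipMetrics:
    """Per-clip fields from the pipeline output (names mirror Section 3.1)."""
    clip_id: int                          # 1, 2, or 3
    start_time: float
    end_time: float
    hype_score: float
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None       # present only if a hook was detected
    hook_confidence: Optional[float] = None

@dataclass
class RunMetrics:
    """Processing-level fields for one test run."""
    processing_time_seconds: float
    frames_analyzed: int
    scenes_detected: int
    hooks_detected: int
    clips: list[ClipMetrics] = field(default_factory=list)
```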
### 3.2 Human Evaluation Criteria
Each clip needs human scoring on these dimensions:
#### A. Highlight Quality (1-5 scale)
```
1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)
```
#### B. Hook Effectiveness (1-5 scale)
```
1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)
```
#### C. Domain Appropriateness (Yes/No/Partial)
```
Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?
```
### 3.3 A/B Testing (Optional)
Compare outputs with different settings:
- Viral hooks ON vs OFF
- Different domain presets for same video
- Custom prompt vs no prompt
---
## 4. Test Execution Process
### 4.1 Per-Video Test Run
```
1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in spreadsheet
```
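Steps 7 and 8 (scoring by 2+ team members, then averaging) can be sketched as a small helper; the function and its dict shape are assumptions for illustration, not part of any existing tooling:

```python
from statistics import mean

def average_scores(ratings: list[dict]) -> dict:
    """Average each scoring dimension across raters for one clip.

    ratings: one dict per rater, e.g.
        [{"highlight": 4, "hook": 5}, {"highlight": 3, "hook": 4}]
    """
    keys = ratings[0].keys()
    return {k: mean(r[k] for r in ratings) for k in keys}
```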
### 4.2 Test Run Naming
```
{date}_{domain}_{video_id}_{tester_initials}
Example: 2024-01-15_sports_001_CM
```
---
## 5. Scoring & Storage
### 5.1 Results Spreadsheet Structure
Create a Google Sheet / Excel with these columns:
| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
|------|----------|--------|-------|-------|-----|------------|--------|-------|--------|-----------|----------------------|------------------|---------------------|-------|
### 5.2 Aggregate Metrics to Track
Calculate weekly/monthly:
```
Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)
Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts
Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation
```
### 5.3 Quality Thresholds
| Metric | Poor | Acceptable | Good | Excellent |
|--------|------|------------|------|-----------|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
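The thresholds above can be applied mechanically with a banding helper (illustrative; the behavior exactly at each cut-off is an assumption, since the table leaves boundaries ambiguous):

```python
def quality_band(score: float, bounds: tuple[float, float, float]) -> str:
    """Map a metric to a Section 5.3 band.

    bounds = (poor_below, acceptable_below, good_upper), e.g.
    (2.5, 3.5, 4.2) for highlight/hook scores, (0.6, 0.75, 0.9) for
    domain match expressed as a fraction.
    """
    poor, acceptable, good = bounds
    if score < poor:
        return "Poor"
    if score < acceptable:
        return "Acceptable"
    if score <= good:
        return "Good"
    return "Excellent"
```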
---
## 6. Issue Tracking
### 6.1 Common Issues to Watch For
| Issue | How to Identify | Priority |
|-------|-----------------|----------|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |
### 6.2 Bug Report Template
```
Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}
```
---
## 7. Test Schedule
Tentative; to be finalized.
## 8. Success Criteria
### MVP Launch Requirements
- [ ] Avg Highlight Score > 3.5 across all domains
- [ ] Avg Hook Score > 3.5 across all domains
- [ ] Domain Match Rate > 75%
- [ ] No critical processing errors
- [ ] Processing time < 10 min for 10-min video on A10G
### Stretch Goals
- [ ] Avg Highlight Score > 4.0
- [ ] Avg Hook Score > 4.0
- [ ] Domain Match Rate > 90%
- [ ] Hype-Human correlation > 0.6
---
## 9. Experimentation Guide
This section explains how to systematically test different settings to find optimal configurations.
### 9.1 What Settings Can Be Changed
| Setting | Location | Default | Range | Impact |
|---------|----------|---------|-------|--------|
| **Coarse Sample Interval** | config.py | 5.0 sec | 2-10 sec | More frames = better accuracy, slower processing |
| **Clip Duration** | UI Slider | 15 sec | 5-30 sec | Longer clips = more context, harder to keep attention |
| **Number of Clips** | UI Slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| **Domain Selection** | UI Dropdown | General | 6 options | Changes weight distribution for scoring |
| **Hype Threshold** | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| **Min Gap Between Clips** | config.py | 30 sec | 10-60 sec | Higher = more spread out clips |
| **Scene Threshold** | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
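As a sketch, the config.py tunables above might look like the following. The variable names are assumptions; only the defaults and ranges come from the table:

```python
# Illustrative config.py fragment -- ShortSmith's actual names may differ.
COARSE_SAMPLE_INTERVAL = 5.0   # sec between sampled frames (range 2-10)
HYPE_THRESHOLD = 0.3           # min combined score for a candidate (0.1-0.6)
MIN_GAP_BETWEEN_CLIPS = 30     # sec separating selected clips (10-60)
SCENE_THRESHOLD = 27.0         # scene-cut sensitivity; lower = more cuts (20-35)
```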
### 9.2 Experiment Types
#### Experiment A: Domain Preset Comparison
**Goal**: Verify that domain selection actually affects results
**Method**:
1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
2. Run ShortSmith 3 times with different domain settings:
- Run 1: Gaming
- Run 2: Music
- Run 3: General
3. Compare the 3 clips from each run
**What to Record**:
```
Video: {video_id}
Experiment: Domain Comparison
| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1 | Gaming | | | | | |
| 2 | Music | | | | | |
| 3 | General| | | | | |
Question: Did different domains pick different moments? Y / N
Question: Which domain worked best for this video? ____________
```
#### Experiment B: Clip Duration Impact
**Goal**: Find optimal clip length for engagement
**Method**:
1. Pick 1 video
2. Run ShortSmith 3 times with different durations:
- Run 1: 10 seconds
- Run 2: 15 seconds
- Run 3: 25 seconds
3. Watch all clips and score
**What to Record**:
```
Video: {video_id}
Experiment: Duration Comparison
| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec | | Y/N | Y/N |
| 15 sec | | Y/N | Y/N |
| 25 sec | | Y/N | Y/N |
Optimal duration for this content type: ___ seconds
```
#### Experiment C: Custom Prompt Effectiveness
**Goal**: Test if custom prompts improve clip selection
**Method**:
1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
2. Run twice:
- Run 1: No custom prompt
- Run 2: Custom prompt = "Focus on crowd reactions"
3. Compare results
**What to Record**:
```
Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test
| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1 | No | Y/N | Y/N | Y/N |
| 2 | Yes | Y/N | Y/N | Y/N |
Did custom prompt help? Y / N / Unclear
```
#### Experiment D: Person Filter Accuracy
**Goal**: Test if person filtering actually prioritizes target person
**Method**:
1. Pick a video with 2+ people clearly visible
2. Get a clear reference photo of 1 person
3. Run twice:
- Run 1: No reference image
- Run 2: With reference image
4. Count target person screen time in each clip
**What to Record**:
```
Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test
| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1 | No | | | | |
| 2 | Yes | | | | |
Did filter increase target person screen time? Y / N
By how much? ___% improvement
```
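The final two questions can be answered directly from the recorded percentages. This helper (illustrative, not existing tooling) reports the percentage-point improvement in average target-person screen time:

```python
from statistics import mean

def filter_improvement(baseline_pct: list[float], filtered_pct: list[float]) -> float:
    """Percentage-point change in mean target-person screen time
    between the no-filter run and the filtered run (one value per clip)."""
    return mean(filtered_pct) - mean(baseline_pct)
```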
### 9.3 How to Document Experiments
#### Step 1: Create Experiment Log
Make a new sheet/tab called "Experiments" with columns:
```
| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |
```
#### Step 2: Record Before & After
Always record:
- What you changed (the variable)
- What stayed the same (controls)
- The measurable outcome
#### Step 3: Take Screenshots
For each experiment:
- Screenshot the settings used
- Screenshot the processing log
- Save the output clips with clear naming:
```
exp_{type}_{video}_{setting}.mp4
Examples:
exp_domain_sports001_gaming.mp4
exp_domain_sports001_music.mp4
exp_duration_vlog002_10sec.mp4
exp_duration_vlog002_25sec.mp4
```
### 9.4 Recommended Experiment Order
**Week 1: Baseline + Domain Testing**
```
Day 1-2: Run all 6 domains on 2 "neutral" videos
Record which domain performs best
Day 3-4: Run domain-matched tests
(Sports video with Sports setting, Music with Music, etc.)
Record scores
Day 5: Analyze - Do domain presets actually help vs General?
```
**Week 2: Duration & Clip Count Testing**
```
Day 1-2: Test 10s vs 15s vs 20s vs 25s clips
On 3 different video types
Day 3-4: Test 1 clip vs 2 clips vs 3 clips
Does quality drop with more clips?
Day 5: Analyze - Find optimal duration per domain
```
**Week 3: Advanced Features Testing**
```
Day 1-2: Custom prompt experiments
Try 5 different prompts on same video
Day 3-4: Person filter experiments
Test accuracy on 3 multi-person videos
Day 5: Analyze - Document which features add value
```
### 9.5 Experiment Scoring Template
For each experiment, fill out:
```
=== EXPERIMENT REPORT ===
Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]
VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________
WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________
RESULTS
Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___
Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___
WINNER: Setting ___
WHY?
________________________________
________________________________
RECOMMENDATION:
________________________________
________________________________
```
### 9.6 Key Questions to Answer Through Experiments
After completing experiments, you should be able to answer:
**Domain Settings:**
- [ ] Does Sports mode actually pick action moments better than General?
- [ ] Does Music mode find beat drops better?
- [ ] Does Podcast mode find speaking highlights?
- [ ] Which domain works best for "general" content?
**Clip Settings:**
- [ ] What's the ideal clip duration for TikTok/Reels (vertical short-form)?
- [ ] What's the ideal clip duration for YouTube Shorts?
- [ ] Does requesting 1 clip give higher quality than requesting 3?
**Features:**
- [ ] Do custom prompts actually improve results?
- [ ] How accurate is person filtering? (% of clips with target person)
- [ ] What types of custom prompts work best?
**Quality Patterns:**
- [ ] Which content types get the best results?
- [ ] Which content types struggle?
- [ ] Are there patterns in "bad" clips? (e.g., always too early, always misses peak)
---
## Appendix A: Quick Evaluation Form
```
=== ShortSmith Clip Evaluation ===
Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________
CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
Overall Comments:
________________________________
________________________________
```
---
## Appendix B: Domain-Specific Hook Types Reference
### Sports
| Hook Type | What to Look For |
|-----------|-----------------|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |
### Music
| Hook Type | What to Look For |
|-----------|-----------------|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |
### Gaming
| Hook Type | What to Look For |
|-----------|-----------------|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |
### Vlogs
| Hook Type | What to Look For |
|-----------|-----------------|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |
### Podcasts
| Hook Type | What to Look For |
|-----------|-----------------|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |
---
## Appendix C: Sample Test Results Format
```csv
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
```