ShortSmith v2 - Testing & Evaluation Guide
Overview
This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score/store findings.
1. What We're Testing
1.1 Core Quality Metrics
| Component | What It Does | Key Output |
|---|---|---|
| Visual Analysis | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| Audio Analysis | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| Motion Detection | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| Viral Hooks | Finds optimal clip start points | hook_type, confidence, intensity |
| Hype Scoring | Combines all signals | combined_score (0-1), rank |
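As a sketch of how these signals might be blended into `combined_score`, here is an illustrative weighted sum; the weights and function name are assumptions for this example, not ShortSmith's actual values.

```python
def combined_score(visual: float, audio: float, motion: float,
                   hook_confidence: float = 0.0,
                   weights: tuple = (0.4, 0.3, 0.2, 0.1)) -> float:
    """Weighted blend of per-signal scores; each input is in [0, 1].

    The weights here are illustrative, not ShortSmith's real values.
    """
    wv, wa, wm, wh = weights
    score = wv * visual + wa * audio + wm * motion + wh * hook_confidence
    return max(0.0, min(1.0, score))  # clamp to [0, 1]

print(combined_score(0.8, 0.9, 0.6, hook_confidence=0.7))
```

Candidates would then be ranked by this score, highest first.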
1.2 End-to-End Quality
- Clip Relevance: Are the extracted clips actually highlights?
- Hook Effectiveness: Do clips start at engaging moments?
- Domain Accuracy: Does sports mode pick different clips than podcast mode?
- Person Filtering: When enabled, does target person appear in clips?
2. Test Dataset Requirements
2.1 Video Categories (Minimum 3 videos per category)
| Domain | Video Type | Duration | Source Examples |
|---|---|---|---|
| Sports | Football/Basketball highlights | 5-15 min | YouTube sports channels |
| Music | Music videos, concerts | 3-8 min | Official MVs, live performances |
| Gaming | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| Vlogs | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| Podcasts | Interview/discussion clips | 15-30 min | Podcast video episodes |
| General | Mixed content | 5-15 min | Random viral videos |
2.2 Test Video Naming Convention
{domain}_{video_id}_{duration_mins}.mp4
Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
3. Testing Methodology
3.1 Automated Metrics (System-Generated)
For each test run, capture these from the pipeline output:
Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected
Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
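These per-clip fields can be captured in a simple record for logging; the dataclass below is illustrative, mirroring the field names listed above rather than any actual ShortSmith type.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ClipMetrics:
    """One row of automated per-clip metrics (illustrative)."""
    clip_id: int                 # 1, 2, or 3
    start_time: float            # seconds
    end_time: float              # seconds
    hype_score: float            # combined score, 0-1
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None
    hook_confidence: Optional[float] = None

clip = ClipMetrics(1, 45.2, 60.2, 0.82, 0.75, 0.88, 0.71, "GOAL_MOMENT", 0.9)
print(asdict(clip))  # ready to append to a CSV/spreadsheet row
```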
3.2 Human Evaluation Criteria
Each clip needs human scoring on these dimensions:
A. Highlight Quality (1-5 scale)
1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)
B. Hook Effectiveness (1-5 scale)
1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)
C. Domain Appropriateness (Yes/No/Partial)
Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?
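With two or more raters per clip, the per-clip human scores are averaged before logging. A minimal sketch (the rating data is made up):

```python
from statistics import mean

def average_scores(ratings: list[dict]) -> dict:
    """Average multiple raters; each rating is {'highlight': 1-5, 'hook': 1-5}."""
    return {
        "highlight": mean(r["highlight"] for r in ratings),
        "hook": mean(r["hook"] for r in ratings),
    }

print(average_scores([{"highlight": 4, "hook": 5},
                      {"highlight": 5, "hook": 4}]))
# {'highlight': 4.5, 'hook': 4.5}
```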
3.3 A/B Testing (Optional)
Compare outputs with different settings:
- Viral hooks ON vs OFF
- Different domain presets for same video
- Custom prompt vs no prompt
4. Test Execution Process
4.1 Per-Video Test Run
1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in spreadsheet
4.2 Test Run Naming
{date}_{domain}_{video_id}_{tester_initials}
Example: 2024-01-15_sports_001_CM
5. Scoring & Storage
5.1 Results Spreadsheet Structure
Create a Google Sheet / Excel with these columns:
| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5.2 Aggregate Metrics to Track
Calculate weekly/monthly:
Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)
Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts
Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation
5.3 Quality Thresholds
| Metric | Poor | Acceptable | Good | Excellent |
|---|---|---|---|---|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
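For automated reporting, the score bands above can be encoded as a small lookup. One assumption made here: the table's boundary values (e.g. exactly 3.5) are ambiguous, so this sketch assigns them to the lower band.

```python
def quality_band(score: float) -> str:
    """Map an average 1-5 score to the Highlight/Hook Score bands above.

    Boundary values go to the lower band (an assumption; the table
    leaves 2.5/3.5/4.2 ambiguous).
    """
    if score < 2.5:
        return "Poor"
    if score <= 3.5:
        return "Acceptable"
    if score <= 4.2:
        return "Good"
    return "Excellent"

print(quality_band(3.8))  # Good
```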
6. Issue Tracking
6.1 Common Issues to Watch For
| Issue | How to Identify | Priority |
|---|---|---|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |
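The first pattern ("wrong clip selected") can be flagged automatically from the results sheet. The row format below is illustrative:

```python
def flag_wrong_clips(rows: list[dict]) -> list[dict]:
    """Return rows where the system was confident but humans disagreed:
    human score < 2 with hype_score > 0.7 (thresholds from the table above)."""
    return [r for r in rows if r["human_highlight"] < 2 and r["hype_score"] > 0.7]

rows = [
    {"clip": 1, "hype_score": 0.82, "human_highlight": 5},
    {"clip": 2, "hype_score": 0.75, "human_highlight": 1},  # mismatch
]
print(flag_wrong_clips(rows))  # the clip-2 row only
```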
6.2 Bug Report Template
Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}
7. Test Schedule
Tentative; to be finalized. See section 11.4 for a recommended week-by-week experiment order.
8. Success Criteria
MVP Launch Requirements
- Avg Highlight Score > 3.5 across all domains
- Avg Hook Score > 3.5 across all domains
- Domain Match Rate > 75%
- No critical processing errors
- Processing time < 10 min for 10-min video on A10G
Stretch Goals
- Avg Highlight Score > 4.0
- Avg Hook Score > 4.0
- Domain Match Rate > 90%
- Hype-Human correlation > 0.6
11. Experimentation Guide
This section explains how to systematically test different settings to find optimal configurations.
11.1 What Settings Can Be Changed
| Setting | Location | Default | Range | Impact |
|---|---|---|---|---|
| Coarse Sample Interval | config.py | 5.0 sec | 2-10 sec | More frames = better accuracy, slower processing |
| Clip Duration | UI Slider | 15 sec | 5-30 sec | Longer clips = more context, but harder to hold attention |
| Number of Clips | UI Slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| Domain Selection | UI Dropdown | General | 6 options | Changes weight distribution for scoring |
| Hype Threshold | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| Min Gap Between Clips | config.py | 30 sec | 10-60 sec | Higher = more spread out clips |
| Scene Threshold | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
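A hypothetical sketch of how the `config.py` knobs above might look; the names and defaults mirror the table, but the real file may differ.

```python
# Illustrative config.py fragment -- names/defaults follow the table above,
# not necessarily the actual ShortSmith source.
COARSE_SAMPLE_INTERVAL = 5.0   # seconds between sampled frames (range 2-10)
HYPE_THRESHOLD = 0.3           # min combined score to keep a candidate (0.1-0.6)
MIN_GAP_BETWEEN_CLIPS = 30     # seconds required between selected clips (10-60)
SCENE_THRESHOLD = 27.0         # scene-cut sensitivity; lower = more cuts (20-35)
```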
11.2 Experiment Types
Experiment A: Domain Preset Comparison
Goal: Verify that domain selection actually affects results
Method:
- Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
- Run ShortSmith 3 times with different domain settings:
- Run 1: Gaming
- Run 2: Music
- Run 3: General
- Compare the 3 clips from each run
What to Record:
Video: {video_id}
Experiment: Domain Comparison
| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1 | Gaming | | | | | |
| 2 | Music | | | | | |
| 3 | General| | | | | |
Question: Did different domains pick different moments? Y / N
Question: Which domain worked best for this video? ____________
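One way to answer the "different moments?" question objectively is to count overlapping clip intervals between two runs. The start/end times here are illustrative:

```python
def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if intervals (start, end) a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]

def shared_moments(run_a: list, run_b: list) -> int:
    """Count clips in run_a that overlap any clip in run_b."""
    return sum(any(overlaps(a, b) for b in run_b) for a in run_a)

gaming = [(45.0, 60.0), (120.0, 135.0), (200.0, 215.0)]   # Run 1 clips
music = [(48.0, 63.0), (300.0, 315.0), (400.0, 415.0)]    # Run 2 clips
print(shared_moments(gaming, music))  # 1 -> the runs mostly diverge
```

A low count means the domain presets genuinely picked different moments; a count of 3 means they chose essentially the same clips.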
Experiment B: Clip Duration Impact
Goal: Find optimal clip length for engagement
Method:
- Pick 1 video
- Run ShortSmith 3 times with different durations:
- Run 1: 10 seconds
- Run 2: 15 seconds
- Run 3: 25 seconds
- Watch all clips and score
What to Record:
Video: {video_id}
Experiment: Duration Comparison
| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec | | Y/N | Y/N |
| 15 sec | | Y/N | Y/N |
| 25 sec | | Y/N | Y/N |
Optimal duration for this content type: ___ seconds
Experiment C: Custom Prompt Effectiveness
Goal: Test if custom prompts improve clip selection
Method:
- Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
- Run twice:
- Run 1: No custom prompt
- Run 2: Custom prompt = "Focus on crowd reactions"
- Compare results
What to Record:
Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test
| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1 | No | Y/N | Y/N | Y/N |
| 2 | Yes | Y/N | Y/N | Y/N |
Did custom prompt help? Y / N / Unclear
Experiment D: Person Filter Accuracy
Goal: Test if person filtering actually prioritizes target person
Method:
- Pick a video with 2+ people clearly visible
- Get a clear reference photo of 1 person
- Run twice:
- Run 1: No reference image
- Run 2: With reference image
- Count target person screen time in each clip
What to Record:
Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test
| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1 | No | | | | |
| 2 | Yes | | | | |
Did filter increase target person screen time? Y / N
By how much? ___% improvement
11.3 How to Document Experiments
Step 1: Create Experiment Log
Make a new sheet/tab called "Experiments" with columns:
| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |
Step 2: Record Before & After
Always record:
- What you changed (the variable)
- What stayed the same (controls)
- The measurable outcome
Step 3: Take Screenshots
For each experiment:
- Screenshot the settings used
- Screenshot the processing log
- Save the output clips with clear naming:
exp_{type}_{video}_{setting}.mp4
Examples:
exp_domain_sports001_gaming.mp4
exp_domain_sports001_music.mp4
exp_duration_vlog002_10sec.mp4
exp_duration_vlog002_25sec.mp4
11.4 Recommended Experiment Order
Week 1: Baseline + Domain Testing
- Day 1-2: Run all 6 domains on 2 "neutral" videos; record which domain performs best
- Day 3-4: Run domain-matched tests (sports video with Sports setting, music with Music, etc.); record scores
- Day 5: Analyze: do domain presets actually help vs General?
Week 2: Duration & Clip Count Testing
- Day 1-2: Test 10s vs 15s vs 20s vs 25s clips on 3 different video types
- Day 3-4: Test 1 clip vs 2 clips vs 3 clips; does quality drop with more clips?
- Day 5: Analyze: find the optimal duration per domain
Week 3: Advanced Features Testing
- Day 1-2: Custom prompt experiments; try 5 different prompts on the same video
- Day 3-4: Person filter experiments; test accuracy on 3 multi-person videos
- Day 5: Analyze: document which features add value
11.5 Experiment Scoring Template
For each experiment, fill out:
=== EXPERIMENT REPORT ===
Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]
VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________
WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________
RESULTS
Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___
Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___
WINNER: Setting ___
WHY?
________________________________
________________________________
RECOMMENDATION:
________________________________
________________________________
11.6 Key Questions to Answer Through Experiments
After completing experiments, you should be able to answer:
Domain Settings:
- Does Sports mode actually pick action moments better than General?
- Does Music mode find beat drops better?
- Does Podcast mode find speaking highlights?
- Which domain works best for "general" content?
Clip Settings:
- What's the ideal clip duration for TikTok/Reels (vertical short-form)?
- What's the ideal clip duration for YouTube Shorts?
- Does requesting 1 clip give higher quality than requesting 3?
Features:
- Do custom prompts actually improve results?
- How accurate is person filtering? (% of clips with target person)
- What types of custom prompts work best?
Quality Patterns:
- Which content types get the best results?
- Which content types struggle?
- Are there patterns in "bad" clips? (e.g., always too early, always misses peak)
Appendix A: Quick Evaluation Form
=== ShortSmith Clip Evaluation ===
Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________
CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
Overall Comments:
________________________________
________________________________
Appendix B: Domain-Specific Hook Types Reference
Sports
| Hook Type | What to Look For |
|---|---|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |
Music
| Hook Type | What to Look For |
|---|---|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |
Gaming
| Hook Type | What to Look For |
|---|---|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |
Vlogs
| Hook Type | What to Look For |
|---|---|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |
Podcasts
| Hook Type | What to Look For |
|---|---|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |
Appendix C: Sample Test Results Format
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
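A short script (illustrative) can parse this CSV format and compute the section 5.2 aggregates; the sample text is the three rows above.

```python
import csv
import io
from statistics import mean

SAMPLE = """\
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
avg_highlight = mean(int(r["human_highlight"]) for r in rows)
avg_hook = mean(int(r["human_hook"]) for r in rows)
match_rate = sum(r["domain_match"] == "Y" for r in rows) / len(rows)
print(f"highlight={avg_highlight}, hook={avg_hook}, match={match_rate:.0%}")
```

Grouping by the `domain` column gives the per-domain breakdown the same way.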