
ShortSmith v2 - Testing & Evaluation Guide

Overview

This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score/store findings.


1. What We're Testing

1.1 Core Quality Metrics

| Component | What It Does | Key Output |
|-----------|--------------|------------|
| Visual Analysis | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| Audio Analysis | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| Motion Detection | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| Viral Hooks | Finds optimal clip start points | hook_type, confidence, intensity |
| Hype Scoring | Combines all signals | combined_score (0-1), rank |
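As a sketch of the final step, the combined score can be read as a weighted blend of the three signal scores. The weights and the exact blending formula below are illustrative assumptions, not the pipeline's documented values:

```python
# Illustrative sketch only: the real weights live inside the pipeline and
# are not documented here. Shows how a combined_score could be formed as a
# weighted blend of the three 0-1 signal scores.
def combined_score(visual: float, audio: float, motion: float,
                   weights: dict) -> float:
    total = (weights["visual"] * visual
             + weights["audio"] * audio
             + weights["motion"] * motion)
    # Normalize so the result stays in 0-1 even if weights don't sum to 1.
    return max(0.0, min(1.0, total / sum(weights.values())))

# Hypothetical sports-style weighting that favors audio and motion.
sports_weights = {"visual": 0.30, "audio": 0.35, "motion": 0.35}
score = combined_score(0.75, 0.88, 0.71, sports_weights)
```

Domain selection (section 11.1) changes this weight distribution, which is why the same video can yield different clips under different presets.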

1.2 End-to-End Quality

  • Clip Relevance: Are the extracted clips actually highlights?
  • Hook Effectiveness: Do clips start at engaging moments?
  • Domain Accuracy: Does sports mode pick different clips than podcast mode?
  • Person Filtering: When enabled, does target person appear in clips?

2. Test Dataset Requirements

2.1 Video Categories (Minimum 3 videos per category)

| Domain | Video Type | Duration | Source Examples |
|--------|-----------|----------|-----------------|
| Sports | Football/basketball highlights | 5-15 min | YouTube sports channels |
| Music | Music videos, concerts | 3-8 min | Official MVs, live performances |
| Gaming | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| Vlogs | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| Podcasts | Interview/discussion clips | 15-30 min | Podcast video episodes |
| General | Mixed content | 5-15 min | Random viral videos |

2.2 Test Video Naming Convention

{domain}_{video_id}_{duration_mins}.mp4

Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
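A small validation check keeps test filenames consistent across testers. The domain token set and the three-digit video ID below are assumptions inferred from section 2.1 and the examples above:

```python
import re

# Assumed naming rules: domain token from section 2.1's categories,
# three-digit video ID as in the examples, integer duration in minutes.
NAME_RE = re.compile(
    r"^(sports|music|gaming|vlogs|podcasts|general)_\d{3}_\d+min\.mp4$"
)

def is_valid_test_name(name: str) -> bool:
    """True if `name` follows {domain}_{video_id}_{duration_mins}.mp4."""
    return NAME_RE.fullmatch(name) is not None
```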

3. Testing Methodology

3.1 Automated Metrics (System-Generated)

For each test run, capture these from the pipeline output:

Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected

Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
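The per-clip fields above can be captured in a small record like the following. This class is just a convenience shape for logging results, not part of the pipeline itself:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Convenience record mirroring the per-clip metric list above; field
# names follow the guide, the class itself is an illustrative helper.
@dataclass
class ClipMetrics:
    clip_id: int                      # 1, 2, or 3
    start_time: float                 # seconds into the source video
    end_time: float
    hype_score: float                 # all scores are 0-1
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None   # None when no hook was detected
    hook_confidence: Optional[float] = None

row = ClipMetrics(1, 45.2, 60.2, 0.82, 0.75, 0.88, 0.71, "GOAL_MOMENT", 0.9)
record = asdict(row)   # ready to append to a CSV or spreadsheet row
```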

3.2 Human Evaluation Criteria

Each clip needs human scoring on these dimensions:

A. Highlight Quality (1-5 scale)

1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)

B. Hook Effectiveness (1-5 scale)

1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)

C. Domain Appropriateness (Yes/No/Partial)

Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?

3.3 A/B Testing (Optional)

Compare outputs with different settings:

  • Viral hooks ON vs OFF
  • Different domain presets for same video
  • Custom prompt vs no prompt

4. Test Execution Process

4.1 Per-Video Test Run

1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in spreadsheet

4.2 Test Run Naming

{date}_{domain}_{video_id}_{tester_initials}

Example: 2024-01-15_sports_001_CM

5. Scoring & Storage

5.1 Results Spreadsheet Structure

Create a Google Sheet / Excel with these columns:

| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
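For teams exporting to CSV instead of a shared sheet, writing the header once with Python's csv module keeps every tester's file aligned (column names mirror the structure above; the filename is arbitrary):

```python
import csv

# Header matching the spreadsheet structure above.
COLUMNS = ["Date", "Video_ID", "Domain", "Clip#", "Start", "End",
           "Hype_Score", "Visual", "Audio", "Motion", "Hook_Type",
           "Human_Highlight", "Human_Hook", "Domain_Match", "Notes"]

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    # One example row, taken from the sample results in Appendix C.
    writer.writerow(["2024-01-15", "sports_001", "sports", 1, 45.2, 60.2,
                     0.82, 0.75, 0.88, 0.71, "GOAL_MOMENT", 5, 5, "Y",
                     "Perfect goal celebration"])
```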

5.2 Aggregate Metrics to Track

Calculate weekly/monthly:

Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)

Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts

Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation

5.3 Quality Thresholds

| Metric | Poor | Acceptable | Good | Excellent |
|--------|------|------------|------|-----------|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
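Mapping an average score onto these bands can be automated. One detail the table leaves open is which band a boundary value falls into; the sketch below assigns 2.5 and 3.5 to the higher band and requires strictly more than 4.2 for Excellent, which is an interpretation, not a spec:

```python
# Maps an average human score onto the quality bands above.
# Assumption: 2.5 and 3.5 count toward the higher band; Excellent
# requires strictly more than 4.2.
def quality_band(avg_score: float) -> str:
    if avg_score < 2.5:
        return "Poor"
    if avg_score < 3.5:
        return "Acceptable"
    if avg_score <= 4.2:
        return "Good"
    return "Excellent"
```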

6. Issue Tracking

6.1 Common Issues to Watch For

| Issue | How to Identify | Priority |
|-------|-----------------|----------|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% of clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |
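The first two rows of the table are simple score comparisons and can be flagged automatically per clip. Reading "good highlight" as a human score of 3 or more is an assumption made for this sketch:

```python
# Auto-flags the "wrong clip selected" and "bad hook timing" issues from
# the table above for one clip. Assumption: "good highlight" means a
# human highlight score of 3 or more.
def flag_issues(hype_score: float, human_highlight: int,
                human_hook: int) -> list:
    issues = []
    if human_highlight < 2 and hype_score > 0.7:
        issues.append("wrong_clip_selected")   # system confident, humans disagree
    if human_highlight >= 3 and human_hook < 2:
        issues.append("bad_hook_timing")       # strong moment, weak start point
    return issues
```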

6.2 Bug Report Template

Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}

7. Test Schedule

Tentative.

8. Success Criteria

MVP Launch Requirements

  • Avg Highlight Score > 3.5 across all domains
  • Avg Hook Score > 3.5 across all domains
  • Domain Match Rate > 75%
  • No critical processing errors
  • Processing time < 10 min for 10-min video on A10G

Stretch Goals

  • Avg Highlight Score > 4.0
  • Avg Hook Score > 4.0
  • Domain Match Rate > 90%
  • Hype-Human correlation > 0.6

11. Experimentation Guide

This section explains how to systematically test different settings to find optimal configurations.

11.1 What Settings Can Be Changed

| Setting | Location | Default | Range | Impact |
|---------|----------|---------|-------|--------|
| Coarse Sample Interval | config.py | 5.0 sec | 2-10 sec | Lower interval = more frames = better accuracy, slower processing |
| Clip Duration | UI slider | 15 sec | 5-30 sec | Longer clips = more context, harder to hold attention |
| Number of Clips | UI slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| Domain Selection | UI dropdown | General | 6 options | Changes weight distribution for scoring |
| Hype Threshold | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| Min Gap Between Clips | config.py | 30 sec | 10-60 sec | Higher = more spread-out clips |
| Scene Threshold | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
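The config.py entries might look roughly like the following. The variable names here are hypothetical; check the actual file before editing, since only the defaults and ranges come from the table above:

```python
# Hypothetical sketch of the tunable values in config.py, using the
# defaults and ranges from the table above. The real variable names
# in config.py may differ.
COARSE_SAMPLE_INTERVAL = 5.0   # seconds between sampled frames (2-10)
HYPE_THRESHOLD = 0.3           # minimum combined_score to keep a candidate (0.1-0.6)
MIN_GAP_BETWEEN_CLIPS = 30     # seconds separating selected clips (10-60)
SCENE_THRESHOLD = 27.0         # cut-detection sensitivity; lower = more cuts (20-35)
```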

11.2 Experiment Types

Experiment A: Domain Preset Comparison

Goal: Verify that domain selection actually affects results

Method:

  1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
  2. Run ShortSmith 3 times with different domain settings:
    • Run 1: Gaming
    • Run 2: Music
    • Run 3: General
  3. Compare the 3 clips from each run

What to Record:

Video: {video_id}
Experiment: Domain Comparison

| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1   | Gaming |             |             |             |            |       |
| 2   | Music  |             |             |             |            |       |
| 3   | General|             |             |             |            |       |

Question: Did different domains pick different moments?  Y / N
Question: Which domain worked best for this video? ____________

Experiment B: Clip Duration Impact

Goal: Find optimal clip length for engagement

Method:

  1. Pick 1 video
  2. Run ShortSmith 3 times with different durations:
    • Run 1: 10 seconds
    • Run 2: 15 seconds
    • Run 3: 25 seconds
  3. Watch all clips and score

What to Record:

Video: {video_id}
Experiment: Duration Comparison

| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec   |                  | Y/N                     | Y/N               |
| 15 sec   |                  | Y/N                     | Y/N               |
| 25 sec   |                  | Y/N                     | Y/N               |

Optimal duration for this content type: ___ seconds

Experiment C: Custom Prompt Effectiveness

Goal: Test if custom prompts improve clip selection

Method:

  1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
  2. Run twice:
    • Run 1: No custom prompt
    • Run 2: Custom prompt = "Focus on crowd reactions"
  3. Compare results

What to Record:

Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test

| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1   | No          | Y/N                       | Y/N               | Y/N               |
| 2   | Yes         | Y/N                       | Y/N               | Y/N               |

Did custom prompt help? Y / N / Unclear

Experiment D: Person Filter Accuracy

Goal: Test if person filtering actually prioritizes target person

Method:

  1. Pick a video with 2+ people clearly visible
  2. Get a clear reference photo of 1 person
  3. Run twice:
    • Run 1: No reference image
    • Run 2: With reference image
  4. Count target person screen time in each clip

What to Record:

Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test

| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1   | No         |                |                |                |       |
| 2   | Yes        |                |                |                |       |

Did filter increase target person screen time? Y / N
By how much? ___% improvement
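The improvement line is just the difference between the two runs' average percentages. A worked example with illustrative numbers:

```python
# Fills in the "By how much?" line: difference in average target-person
# screen time between the filtered and unfiltered runs. Numbers are
# illustrative, not real test data.
run_off = [40, 35, 30]   # Clip 1-3 person % without the reference image
run_on = [80, 75, 70]    # Clip 1-3 person % with the reference image

avg_off = sum(run_off) / len(run_off)   # 35.0
avg_on = sum(run_on) / len(run_on)      # 75.0
improvement = avg_on - avg_off          # 40 percentage points
```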

11.3 How to Document Experiments

Step 1: Create Experiment Log

Make a new sheet/tab called "Experiments" with columns:

| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |

Step 2: Record Before & After

Always record:

  • What you changed (the variable)
  • What stayed the same (controls)
  • The measurable outcome

Step 3: Take Screenshots

For each experiment:

  • Screenshot the settings used
  • Screenshot the processing log
  • Save the output clips with clear naming:
    exp_{type}_{video}_{setting}.mp4
    
    Examples:
    exp_domain_sports001_gaming.mp4
    exp_domain_sports001_music.mp4
    exp_duration_vlog002_10sec.mp4
    exp_duration_vlog002_25sec.mp4
    

11.4 Recommended Experiment Order

Week 1: Baseline + Domain Testing

Day 1-2: Run all 6 domains on 2 "neutral" videos
         Record which domain performs best

Day 3-4: Run domain-matched tests
         (Sports video with Sports setting, Music with Music, etc.)
         Record scores

Day 5: Analyze - Do domain presets actually help vs General?

Week 2: Duration & Clip Count Testing

Day 1-2: Test 10s vs 15s vs 20s vs 25s clips
         On 3 different video types

Day 3-4: Test 1 clip vs 2 clips vs 3 clips
         Does quality drop with more clips?

Day 5: Analyze - Find optimal duration per domain

Week 3: Advanced Features Testing

Day 1-2: Custom prompt experiments
         Try 5 different prompts on same video

Day 3-4: Person filter experiments
         Test accuracy on 3 multi-person videos

Day 5: Analyze - Document which features add value

11.5 Experiment Scoring Template

For each experiment, fill out:

=== EXPERIMENT REPORT ===

Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]

VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________

WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________

RESULTS

Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

WINNER: Setting ___

WHY?
________________________________
________________________________

RECOMMENDATION:
________________________________
________________________________

11.6 Key Questions to Answer Through Experiments

After completing experiments, you should be able to answer:

Domain Settings:

  • Does Sports mode actually pick action moments better than General?
  • Does Music mode find beat drops better?
  • Does Podcast mode find speaking highlights?
  • Which domain works best for "general" content?

Clip Settings:

  • What's the ideal clip duration for TikTok/Reels (vertical short-form)?
  • What's the ideal clip duration for YouTube Shorts?
  • Does requesting 1 clip give higher quality than requesting 3?

Features:

  • Do custom prompts actually improve results?
  • How accurate is person filtering? (% of clips with target person)
  • What types of custom prompts work best?

Quality Patterns:

  • Which content types get the best results?
  • Which content types struggle?
  • Are there patterns in "bad" clips? (e.g., always too early, always misses peak)

Appendix A: Quick Evaluation Form

=== ShortSmith Clip Evaluation ===

Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________

CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

Overall Comments:
________________________________
________________________________

Appendix B: Domain-Specific Hook Types Reference

Sports

| Hook Type | What to Look For |
|-----------|------------------|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |

Music

| Hook Type | What to Look For |
|-----------|------------------|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |

Gaming

| Hook Type | What to Look For |
|-----------|------------------|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |

Vlogs

| Hook Type | What to Look For |
|-----------|------------------|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |

Podcasts

| Hook Type | What to Look For |
|-----------|------------------|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |

Appendix C: Sample Test Results Format

date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
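Rows in this format can be aggregated directly for the section 5.2 metrics. The snippet below uses a trimmed subset of the full column list, with values copied from the sample rows above:

```python
import csv
import io
from statistics import mean

# Trimmed subset of the full column list, with the sample values above,
# to show how the section 5.2 aggregates are computed.
SAMPLE = """\
video_id,domain,hype_score,human_highlight,human_hook,domain_match
sports_001,sports,0.82,5,5,Y
sports_001,sports,0.71,4,4,Y
sports_001,sports,0.65,3,3,P
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
avg_highlight = mean(int(r["human_highlight"]) for r in rows)
avg_hook = mean(int(r["human_hook"]) for r in rows)
match_rate = sum(r["domain_match"] == "Y" for r in rows) / len(rows)
# Compare avg_highlight and avg_hook against the >3.5 targets, and
# match_rate against the >80% target, from section 5.2.
```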