
ShortSmith v2 - Testing & Evaluation Guide

Overview

This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score/store findings.


1. What We're Testing

1.1 Core Quality Metrics

| Component | What It Does | Key Output |
|-----------|--------------|------------|
| Visual Analysis | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| Audio Analysis | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| Motion Detection | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| Viral Hooks | Finds optimal clip start points | hook_type, confidence, intensity |
| Hype Scoring | Combines all signals | combined_score (0-1), rank |
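As a sketch of the final step, the combined score can be read as a weighted blend of the three signal scores. The weights and the exact blending formula below are illustrative assumptions, not the pipeline's documented values:

```python
# Illustrative sketch only: the real weights live inside the pipeline and
# are not documented here. Shows how a combined_score could be formed as a
# weighted blend of the three 0-1 signal scores.
def combined_score(visual: float, audio: float, motion: float,
                   weights: dict) -> float:
    total = (weights["visual"] * visual
             + weights["audio"] * audio
             + weights["motion"] * motion)
    # Normalize so the result stays in 0-1 even if weights don't sum to 1.
    return max(0.0, min(1.0, total / sum(weights.values())))

# Hypothetical sports-style weighting that favors audio and motion.
sports_weights = {"visual": 0.30, "audio": 0.35, "motion": 0.35}
score = combined_score(0.75, 0.88, 0.71, sports_weights)
```

Domain selection (section 11.1) changes this weight distribution, which is why the same video can yield different clips under different presets.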

1.2 End-to-End Quality

  • Clip Relevance: Are the extracted clips actually highlights?
  • Hook Effectiveness: Do clips start at engaging moments?
  • Domain Accuracy: Does sports mode pick different clips than podcast mode?
  • Person Filtering: When enabled, does target person appear in clips?

2. Test Dataset Requirements

2.1 Video Categories (Minimum 3 videos per category)

| Domain | Video Type | Duration | Source Examples |
|--------|-----------|----------|-----------------|
| Sports | Football/basketball highlights | 5-15 min | YouTube sports channels |
| Music | Music videos, concerts | 3-8 min | Official MVs, live performances |
| Gaming | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| Vlogs | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| Podcasts | Interview/discussion clips | 15-30 min | Podcast video episodes |
| General | Mixed content | 5-15 min | Random viral videos |

2.2 Test Video Naming Convention

{domain}_{video_id}_{duration_mins}.mp4

Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
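A small validation check keeps test filenames consistent across testers. The domain token set and the three-digit video ID below are assumptions inferred from section 2.1 and the examples above:

```python
import re

# Assumed naming rules: domain token from section 2.1's categories,
# three-digit video ID as in the examples, integer duration in minutes.
NAME_RE = re.compile(
    r"^(sports|music|gaming|vlogs|podcasts|general)_\d{3}_\d+min\.mp4$"
)

def is_valid_test_name(name: str) -> bool:
    """True if `name` follows {domain}_{video_id}_{duration_mins}.mp4."""
    return NAME_RE.fullmatch(name) is not None
```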

3. Testing Methodology

3.1 Automated Metrics (System-Generated)

For each test run, capture these from the pipeline output:

Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected

Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
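The per-clip fields above can be captured in a small record like the following. This class is just a convenience shape for logging results, not part of the pipeline itself:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Convenience record mirroring the per-clip metric list above; field
# names follow the guide, the class itself is an illustrative helper.
@dataclass
class ClipMetrics:
    clip_id: int                      # 1, 2, or 3
    start_time: float                 # seconds into the source video
    end_time: float
    hype_score: float                 # all scores are 0-1
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None   # None when no hook was detected
    hook_confidence: Optional[float] = None

row = ClipMetrics(1, 45.2, 60.2, 0.82, 0.75, 0.88, 0.71, "GOAL_MOMENT", 0.9)
record = asdict(row)   # ready to append to a CSV or spreadsheet row
```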

3.2 Human Evaluation Criteria

Each clip needs human scoring on these dimensions:

A. Highlight Quality (1-5 scale)

1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)

B. Hook Effectiveness (1-5 scale)

1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)

C. Domain Appropriateness (Yes/No/Partial)

Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?

3.3 A/B Testing (Optional)

Compare outputs with different settings:

  • Viral hooks ON vs OFF
  • Different domain presets for same video
  • Custom prompt vs no prompt

4. Test Execution Process

4.1 Per-Video Test Run

1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in spreadsheet

4.2 Test Run Naming

{date}_{domain}_{video_id}_{tester_initials}

Example: 2024-01-15_sports_001_CM

5. Scoring & Storage

5.1 Results Spreadsheet Structure

Create a Google Sheet / Excel with these columns:

| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
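For teams exporting to CSV instead of a shared sheet, writing the header once with Python's csv module keeps every tester's file aligned (column names mirror the structure above; the filename is arbitrary):

```python
import csv

# Header matching the spreadsheet structure above.
COLUMNS = ["Date", "Video_ID", "Domain", "Clip#", "Start", "End",
           "Hype_Score", "Visual", "Audio", "Motion", "Hook_Type",
           "Human_Highlight", "Human_Hook", "Domain_Match", "Notes"]

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(COLUMNS)
    # One example row, taken from the sample results in Appendix C.
    writer.writerow(["2024-01-15", "sports_001", "sports", 1, 45.2, 60.2,
                     0.82, 0.75, 0.88, 0.71, "GOAL_MOMENT", 5, 5, "Y",
                     "Perfect goal celebration"])
```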

5.2 Aggregate Metrics to Track

Calculate weekly/monthly:

Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)

Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts

Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation

5.3 Quality Thresholds

| Metric | Poor | Acceptable | Good | Excellent |
|--------|------|------------|------|-----------|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
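Mapping an average score onto these bands can be automated. One detail the table leaves open is which band a boundary value falls into; the sketch below assigns 2.5 and 3.5 to the higher band and requires strictly more than 4.2 for Excellent, which is an interpretation, not a spec:

```python
# Maps an average human score onto the quality bands above.
# Assumption: 2.5 and 3.5 count toward the higher band; Excellent
# requires strictly more than 4.2.
def quality_band(avg_score: float) -> str:
    if avg_score < 2.5:
        return "Poor"
    if avg_score < 3.5:
        return "Acceptable"
    if avg_score <= 4.2:
        return "Good"
    return "Excellent"
```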

6. Issue Tracking

6.1 Common Issues to Watch For

| Issue | How to Identify | Priority |
|-------|-----------------|----------|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% of clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |
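The first two rows of the table are simple score comparisons and can be flagged automatically per clip. Reading "good highlight" as a human score of 3 or more is an assumption made for this sketch:

```python
# Auto-flags the "wrong clip selected" and "bad hook timing" issues from
# the table above for one clip. Assumption: "good highlight" means a
# human highlight score of 3 or more.
def flag_issues(hype_score: float, human_highlight: int,
                human_hook: int) -> list:
    issues = []
    if human_highlight < 2 and hype_score > 0.7:
        issues.append("wrong_clip_selected")   # system confident, humans disagree
    if human_highlight >= 3 and human_hook < 2:
        issues.append("bad_hook_timing")       # strong moment, weak start point
    return issues
```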

6.2 Bug Report Template

Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}

7. Test Schedule

Tentative.

8. Success Criteria

MVP Launch Requirements

  • Avg Highlight Score > 3.5 across all domains
  • Avg Hook Score > 3.5 across all domains
  • Domain Match Rate > 75%
  • No critical processing errors
  • Processing time < 10 min for 10-min video on A10G

Stretch Goals

  • Avg Highlight Score > 4.0
  • Avg Hook Score > 4.0
  • Domain Match Rate > 90%
  • Hype-Human correlation > 0.6

11. Experimentation Guide

This section explains how to systematically test different settings to find optimal configurations.

11.1 What Settings Can Be Changed

| Setting | Location | Default | Range | Impact |
|---------|----------|---------|-------|--------|
| Coarse Sample Interval | config.py | 5.0 sec | 2-10 sec | Lower interval = more frames = better accuracy, slower processing |
| Clip Duration | UI slider | 15 sec | 5-30 sec | Longer clips = more context, harder to hold attention |
| Number of Clips | UI slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| Domain Selection | UI dropdown | General | 6 options | Changes weight distribution for scoring |
| Hype Threshold | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| Min Gap Between Clips | config.py | 30 sec | 10-60 sec | Higher = more spread-out clips |
| Scene Threshold | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
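The config.py entries might look roughly like the following. The variable names here are hypothetical; check the actual file before editing, since only the defaults and ranges come from the table above:

```python
# Hypothetical sketch of the tunable values in config.py, using the
# defaults and ranges from the table above. The real variable names
# in config.py may differ.
COARSE_SAMPLE_INTERVAL = 5.0   # seconds between sampled frames (2-10)
HYPE_THRESHOLD = 0.3           # minimum combined_score to keep a candidate (0.1-0.6)
MIN_GAP_BETWEEN_CLIPS = 30     # seconds separating selected clips (10-60)
SCENE_THRESHOLD = 27.0         # cut-detection sensitivity; lower = more cuts (20-35)
```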

11.2 Experiment Types

Experiment A: Domain Preset Comparison

Goal: Verify that domain selection actually affects results

Method:

  1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
  2. Run ShortSmith 3 times with different domain settings:
    • Run 1: Gaming
    • Run 2: Music
    • Run 3: General
  3. Compare the 3 clips from each run

What to Record:

Video: {video_id}
Experiment: Domain Comparison

| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1   | Gaming |             |             |             |            |       |
| 2   | Music  |             |             |             |            |       |
| 3   | General|             |             |             |            |       |

Question: Did different domains pick different moments?  Y / N
Question: Which domain worked best for this video? ____________

Experiment B: Clip Duration Impact

Goal: Find optimal clip length for engagement

Method:

  1. Pick 1 video
  2. Run ShortSmith 3 times with different durations:
    • Run 1: 10 seconds
    • Run 2: 15 seconds
    • Run 3: 25 seconds
  3. Watch all clips and score

What to Record:

Video: {video_id}
Experiment: Duration Comparison

| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec   |                  | Y/N                     | Y/N               |
| 15 sec   |                  | Y/N                     | Y/N               |
| 25 sec   |                  | Y/N                     | Y/N               |

Optimal duration for this content type: ___ seconds

Experiment C: Custom Prompt Effectiveness

Goal: Test if custom prompts improve clip selection

Method:

  1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
  2. Run twice:
    • Run 1: No custom prompt
    • Run 2: Custom prompt = "Focus on crowd reactions"
  3. Compare results

What to Record:

Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test

| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1   | No          | Y/N                       | Y/N               | Y/N               |
| 2   | Yes         | Y/N                       | Y/N               | Y/N               |

Did custom prompt help? Y / N / Unclear

Experiment D: Person Filter Accuracy

Goal: Test if person filtering actually prioritizes target person

Method:

  1. Pick a video with 2+ people clearly visible
  2. Get a clear reference photo of 1 person
  3. Run twice:
    • Run 1: No reference image
    • Run 2: With reference image
  4. Count target person screen time in each clip

What to Record:

Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test

| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1   | No         |                |                |                |       |
| 2   | Yes        |                |                |                |       |

Did filter increase target person screen time? Y / N
By how much? ___% improvement
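The improvement line is just the difference between the two runs' average percentages. A worked example with illustrative numbers:

```python
# Fills in the "By how much?" line: difference in average target-person
# screen time between the filtered and unfiltered runs. Numbers are
# illustrative, not real test data.
run_off = [40, 35, 30]   # Clip 1-3 person % without the reference image
run_on = [80, 75, 70]    # Clip 1-3 person % with the reference image

avg_off = sum(run_off) / len(run_off)   # 35.0
avg_on = sum(run_on) / len(run_on)      # 75.0
improvement = avg_on - avg_off          # 40 percentage points
```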

11.3 How to Document Experiments

Step 1: Create Experiment Log

Make a new sheet/tab called "Experiments" with columns:

| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |

Step 2: Record Before & After

Always record:

  • What you changed (the variable)
  • What stayed the same (controls)
  • The measurable outcome

Step 3: Take Screenshots

For each experiment:

  • Screenshot the settings used
  • Screenshot the processing log
  • Save the output clips with clear naming:
    exp_{type}_{video}_{setting}.mp4
    
    Examples:
    exp_domain_sports001_gaming.mp4
    exp_domain_sports001_music.mp4
    exp_duration_vlog002_10sec.mp4
    exp_duration_vlog002_25sec.mp4
    

11.4 Recommended Experiment Order

Week 1: Baseline + Domain Testing

Day 1-2: Run all 6 domains on 2 "neutral" videos
         Record which domain performs best

Day 3-4: Run domain-matched tests
         (Sports video with Sports setting, Music with Music, etc.)
         Record scores

Day 5: Analyze - Do domain presets actually help vs General?

Week 2: Duration & Clip Count Testing

Day 1-2: Test 10s vs 15s vs 20s vs 25s clips
         On 3 different video types

Day 3-4: Test 1 clip vs 2 clips vs 3 clips
         Does quality drop with more clips?

Day 5: Analyze - Find optimal duration per domain

Week 3: Advanced Features Testing

Day 1-2: Custom prompt experiments
         Try 5 different prompts on same video

Day 3-4: Person filter experiments
         Test accuracy on 3 multi-person videos

Day 5: Analyze - Document which features add value

11.5 Experiment Scoring Template

For each experiment, fill out:

=== EXPERIMENT REPORT ===

Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]

VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________

WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________

RESULTS

Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

WINNER: Setting ___

WHY?
________________________________
________________________________

RECOMMENDATION:
________________________________
________________________________

11.6 Key Questions to Answer Through Experiments

After completing experiments, you should be able to answer:

Domain Settings:

  • Does Sports mode actually pick action moments better than General?
  • Does Music mode find beat drops better?
  • Does Podcast mode find speaking highlights?
  • Which domain works best for "general" content?

Clip Settings:

  • What's the ideal clip duration for TikTok/Reels (vertical short-form)?
  • What's the ideal clip duration for YouTube Shorts?
  • Does requesting 1 clip give higher quality than requesting 3?

Features:

  • Do custom prompts actually improve results?
  • How accurate is person filtering? (% of clips with target person)
  • What types of custom prompts work best?

Quality Patterns:

  • Which content types get the best results?
  • Which content types struggle?
  • Are there patterns in "bad" clips? (e.g., always too early, always misses peak)

Appendix A: Quick Evaluation Form

=== ShortSmith Clip Evaluation ===

Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________

CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

Overall Comments:
________________________________
________________________________

Appendix B: Domain-Specific Hook Types Reference

Sports

| Hook Type | What to Look For |
|-----------|------------------|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |

Music

| Hook Type | What to Look For |
|-----------|------------------|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |

Gaming

| Hook Type | What to Look For |
|-----------|------------------|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |

Vlogs

| Hook Type | What to Look For |
|-----------|------------------|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |

Podcasts

| Hook Type | What to Look For |
|-----------|------------------|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |

Appendix C: Sample Test Results Format

date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
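Rows in this format can be aggregated directly for the section 5.2 metrics. The snippet below uses a trimmed subset of the full column list, with values copied from the sample rows above:

```python
import csv
import io
from statistics import mean

# Trimmed subset of the full column list, with the sample values above,
# to show how the section 5.2 aggregates are computed.
SAMPLE = """\
video_id,domain,hype_score,human_highlight,human_hook,domain_match
sports_001,sports,0.82,5,5,Y
sports_001,sports,0.71,4,4,Y
sports_001,sports,0.65,3,3,P
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
avg_highlight = mean(int(r["human_highlight"]) for r in rows)
avg_hook = mean(int(r["human_hook"]) for r in rows)
match_rate = sum(r["domain_match"] == "Y" for r in rows) / len(rows)
# Compare avg_highlight and avg_hook against the >3.5 targets, and
# match_rate against the >80% target, from section 5.2.
```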