# ShortSmith v2 - Testing & Evaluation Guide

## Overview

This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score/store findings.

---

## 1. What We're Testing

### 1.1 Core Quality Metrics

| Component | What It Does | Key Output |
|-----------|--------------|------------|
| **Visual Analysis** | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| **Audio Analysis** | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| **Motion Detection** | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| **Viral Hooks** | Finds optimal clip start points | hook_type, confidence, intensity |
| **Hype Scoring** | Combines all signals | combined_score (0-1), rank |

### 1.2 End-to-End Quality

- **Clip Relevance**: Are the extracted clips actually highlights?
- **Hook Effectiveness**: Do clips start at engaging moments?
- **Domain Accuracy**: Does sports mode pick different clips than podcast mode?
- **Person Filtering**: When enabled, does target person appear in clips?

---

## 2. Test Dataset Requirements

### 2.1 Video Categories (Minimum 3 videos per category)

| Domain | Video Type | Duration | Source Examples |
|--------|-----------|----------|-----------------|
| **Sports** | Football/Basketball highlights | 5-15 min | YouTube sports channels |
| **Music** | Music videos, concerts | 3-8 min | Official MVs, live performances |
| **Gaming** | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| **Vlogs** | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| **Podcasts** | Interview/discussion clips | 15-30 min | Podcast video episodes |
| **General** | Mixed content | 5-15 min | Random viral videos |

### 2.2 Test Video Naming Convention

```
{domain}_{video_id}_{duration_mins}.mp4

Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
```

---

## 3. Testing Methodology

### 3.1 Automated Metrics (System-Generated)

For each test run, capture these from the pipeline output:

```
Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected

Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
```

### 3.2 Human Evaluation Criteria

Each clip needs human scoring on these dimensions:

#### A. Highlight Quality (1-5 scale)

```
1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)
```

#### B. Hook Effectiveness (1-5 scale)

```
1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)
```

#### C. Domain Appropriateness (Yes/No/Partial)

```
Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?
```

### 3.3 A/B Testing (Optional)

Compare outputs with different settings:

- Viral hooks ON vs OFF
- Different domain presets for same video
- Custom prompt vs no prompt

---

## 4. Test Execution Process

### 4.1 Per-Video Test Run

```
1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in spreadsheet
```

### 4.2 Test Run Naming

```
{date}_{domain}_{video_id}_{tester_initials}

Example: 2024-01-15_sports_001_CM
```

---

## 5. Scoring & Storage

### 5.1 Results Spreadsheet Structure

Create a Google Sheet / Excel with these columns:

| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
|------|----------|--------|-------|-------|-----|------------|--------|-------|--------|-----------|----------------------|------------------|---------------------|-------|

### 5.2 Aggregate Metrics to Track

Calculate weekly/monthly:

```
Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)

Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts

Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation
```

### 5.3 Quality Thresholds

| Metric | Poor | Acceptable | Good | Excellent |
|--------|------|------------|------|-----------|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |

---

## 6. Issue Tracking

### 6.1 Common Issues to Watch For

| Issue | How to Identify | Priority |
|-------|-----------------|----------|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |

### 6.2 Bug Report Template

```
Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}
```

---

## 7. Test Schedule

Tentative

## 8. Success Criteria

### MVP Launch Requirements

- [ ] Avg Highlight Score > 3.5 across all domains
- [ ] Avg Hook Score > 3.5 across all domains
- [ ] Domain Match Rate > 75%
- [ ] No critical processing errors
- [ ] Processing time < 10 min for 10-min video on A10G

### Stretch Goals

- [ ] Avg Highlight Score > 4.0
- [ ] Avg Hook Score > 4.0
- [ ] Domain Match Rate > 90%
- [ ] Hype-Human correlation > 0.6

---

## 11. Experimentation Guide

This section explains how to systematically test different settings to find optimal configurations.

### 11.1 What Settings Can Be Changed

| Setting | Location | Default | Range | Impact |
|---------|----------|---------|-------|--------|
| **Coarse Sample Interval** | config.py | 5.0 sec | 2-10 sec | More frames = better accuracy, slower processing |
| **Clip Duration** | UI Slider | 15 sec | 5-30 sec | Longer clips = more context, harder to keep attention |
| **Number of Clips** | UI Slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| **Domain Selection** | UI Dropdown | General | 6 options | Changes weight distribution for scoring |
| **Hype Threshold** | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| **Min Gap Between Clips** | config.py | 30 sec | 10-60 sec | Higher = more spread out clips |
| **Scene Threshold** | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |

### 11.2 Experiment Types

#### Experiment A: Domain Preset Comparison

**Goal**: Verify that domain selection actually affects results

**Method**:

1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
2. Run ShortSmith 3 times with different domain settings:
   - Run 1: Gaming
   - Run 2: Music
   - Run 3: General
3. Compare the 3 clips from each run

**What to Record**:

```
Video: {video_id}
Experiment: Domain Comparison

| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1   | Gaming |             |             |             |            |       |
| 2   | Music  |             |             |             |            |       |
| 3   | General|             |             |             |            |       |

Question: Did different domains pick different moments? Y / N
Question: Which domain worked best for this video? ____________
```

#### Experiment B: Clip Duration Impact

**Goal**: Find optimal clip length for engagement

**Method**:

1. Pick 1 video
2. Run ShortSmith 3 times with different durations:
   - Run 1: 10 seconds
   - Run 2: 15 seconds
   - Run 3: 25 seconds
3. Watch all clips and score

**What to Record**:

```
Video: {video_id}
Experiment: Duration Comparison

| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec   |                  | Y/N                     | Y/N               |
| 15 sec   |                  | Y/N                     | Y/N               |
| 25 sec   |                  | Y/N                     | Y/N               |

Optimal duration for this content type: ___ seconds
```

#### Experiment C: Custom Prompt Effectiveness

**Goal**: Test if custom prompts improve clip selection

**Method**:

1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
2. Run twice:
   - Run 1: No custom prompt
   - Run 2: Custom prompt = "Focus on crowd reactions"
3. Compare results

**What to Record**:

```
Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test

| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1   | No          | Y/N                       | Y/N               | Y/N               |
| 2   | Yes         | Y/N                       | Y/N               | Y/N               |

Did custom prompt help? Y / N / Unclear
```

#### Experiment D: Person Filter Accuracy

**Goal**: Test if person filtering actually prioritizes target person

**Method**:

1. Pick a video with 2+ people clearly visible
2. Get a clear reference photo of 1 person
3. Run twice:
   - Run 1: No reference image
   - Run 2: With reference image
4. Count target person screen time in each clip

**What to Record**:

```
Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test

| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1   | No         |                |                |                |       |
| 2   | Yes        |                |                |                |       |

Did filter increase target person screen time? Y / N
By how much? ___% improvement
```

### 11.3 How to Document Experiments

#### Step 1: Create Experiment Log

Make a new sheet/tab called "Experiments" with columns:

```
| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |
```

#### Step 2: Record Before & After

Always record:

- What you changed (the variable)
- What stayed the same (controls)
- The measurable outcome

#### Step 3: Take Screenshots

For each experiment:

- Screenshot the settings used
- Screenshot the processing log
- Save the output clips with clear naming:

```
exp_{type}_{video}_{setting}.mp4

Examples:
exp_domain_sports001_gaming.mp4
exp_domain_sports001_music.mp4
exp_duration_vlog002_10sec.mp4
exp_duration_vlog002_25sec.mp4
```

### 11.4 Recommended Experiment Order

**Week 1: Baseline + Domain Testing**

```
Day 1-2: Run all 6 domains on 2 "neutral" videos
         Record which domain performs best
Day 3-4: Run domain-matched tests
         (Sports video with Sports setting, Music with Music, etc.)
         Record scores
Day 5:   Analyze - Do domain presets actually help vs General?
```

**Week 2: Duration & Clip Count Testing**

```
Day 1-2: Test 10s vs 15s vs 20s vs 25s clips
         On 3 different video types
Day 3-4: Test 1 clip vs 2 clips vs 3 clips
         Does quality drop with more clips?
Day 5:   Analyze - Find optimal duration per domain
```

**Week 3: Advanced Features Testing**

```
Day 1-2: Custom prompt experiments
         Try 5 different prompts on same video
Day 3-4: Person filter experiments
         Test accuracy on 3 multi-person videos
Day 5:   Analyze - Document which features add value
```

### 11.5 Experiment Scoring Template

For each experiment, fill out:

```
=== EXPERIMENT REPORT ===

Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]

VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________

WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________

RESULTS

Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

WINNER: Setting ___

WHY?
________________________________
________________________________

RECOMMENDATION:
________________________________
________________________________
```

### 11.6 Key Questions to Answer Through Experiments

After completing experiments, you should be able to answer:

**Domain Settings:**

- [ ] Does Sports mode actually pick action moments better than General?
- [ ] Does Music mode find beat drops better?
- [ ] Does Podcast mode find speaking highlights?
- [ ] Which domain works best for "general" content?

**Clip Settings:**

- [ ] What's the ideal clip duration for TikTok/Reels (vertical short-form)?
- [ ] What's the ideal clip duration for YouTube Shorts?
- [ ] Does requesting 1 clip give higher quality than requesting 3?

**Features:**

- [ ] Do custom prompts actually improve results?
- [ ] How accurate is person filtering? (% of clips with target person)
- [ ] What types of custom prompts work best?

**Quality Patterns:**

- [ ] Which content types get the best results?
- [ ] Which content types struggle?
- [ ] Are there patterns in "bad" clips? (e.g., always too early, always misses peak)

---

## Appendix A: Quick Evaluation Form

```
=== ShortSmith Clip Evaluation ===

Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________

CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

Overall Comments:
________________________________
________________________________
```

---

## Appendix B: Domain-Specific Hook Types Reference

### Sports

| Hook Type | What to Look For |
|-----------|-----------------|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |

### Music

| Hook Type | What to Look For |
|-----------|-----------------|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |

### Gaming

| Hook Type | What to Look For |
|-----------|-----------------|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |

### Vlogs

| Hook Type | What to Look For |
|-----------|-----------------|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |

### Podcasts

| Hook Type | What to Look For |
|-----------|-----------------|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |

---

## Appendix C: Sample Test Results Format

```csv
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
```
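
The aggregate metrics listed in Section 5.2 can be computed directly from a results file in the format above. The following is a minimal sketch, not part of ShortSmith itself. It assumes the CSV columns shown in this appendix, counts only `Y` toward the domain match rate (treating `P` as a non-match is an assumption; adjust if your team scores partial matches differently), and uses hypothetical helper names (`load_rows`, `pearson`, `summarize`).

```python
import csv
import io
import math

# Sample rows copied from the Appendix C example; in practice, read your exported sheet.
SAMPLE_CSV = """\
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
"""


def load_rows(text):
    """Parse results CSV text into dicts, converting the fields used below to numbers."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["hype_score"] = float(row["hype_score"])
        row["human_highlight"] = int(row["human_highlight"])
        row["human_hook"] = int(row["human_hook"])
        rows.append(row)
    return rows


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # one of the series is constant
    return cov / math.sqrt(var_x * var_y)


def summarize(rows):
    """Compute the overall-quality and correlation metrics from Section 5.2."""
    n = len(rows)
    return {
        "avg_highlight": sum(r["human_highlight"] for r in rows) / n,
        "avg_hook": sum(r["human_hook"] for r in rows) / n,
        # Assumption: only "Y" counts as a match; "P" and "N" do not.
        "domain_match_rate": sum(r["domain_match"] == "Y" for r in rows) / n,
        "hype_human_corr": pearson(
            [r["hype_score"] for r in rows],
            [r["human_highlight"] for r in rows],
        ),
    }


if __name__ == "__main__":
    for name, value in summarize(load_rows(SAMPLE_CSV)).items():
        print(f"{name}: {value:.2f}")
```

To get the per-domain breakdown, group the rows by the `domain` column and call `summarize` on each group. With only a handful of clips the correlation is not meaningful, so compare it against the Section 5.3 thresholds only once a full test batch has been scored.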