# ShortSmith v2 - Testing & Evaluation Guide
## Overview
This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score and store findings.
---
## 1. What We're Testing
### 1.1 Core Quality Metrics
| Component | What It Does | Key Output |
|-----------|--------------|------------|
| **Visual Analysis** | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| **Audio Analysis** | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| **Motion Detection** | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| **Viral Hooks** | Finds optimal clip start points | hook_type, confidence, intensity |
| **Hype Scoring** | Combines all signals | combined_score (0-1), rank |
### 1.2 End-to-End Quality
- **Clip Relevance**: Are the extracted clips actually highlights?
- **Hook Effectiveness**: Do clips start at engaging moments?
- **Domain Accuracy**: Does sports mode pick different clips than podcast mode?
- **Person Filtering**: When enabled, does target person appear in clips?
---
## 2. Test Dataset Requirements
### 2.1 Video Categories (Minimum 3 videos per category)
| Domain | Video Type | Duration | Source Examples |
|--------|-----------|----------|-----------------|
| **Sports** | Football/Basketball highlights | 5-15 min | YouTube sports channels |
| **Music** | Music videos, concerts | 3-8 min | Official MVs, live performances |
| **Gaming** | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| **Vlogs** | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| **Podcasts** | Interview/discussion clips | 15-30 min | Podcast video episodes |
| **General** | Mixed content | 5-15 min | Random viral videos |
### 2.2 Test Video Naming Convention
```
{domain}_{video_id}_{duration_mins}.mp4
Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
```
---
## 3. Testing Methodology
### 3.1 Automated Metrics (System-Generated)
For each test run, capture these from the pipeline output:
```
Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected
Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
```
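For capture scripts, the fields above map naturally onto a pair of dataclasses. This is an illustrative sketch; ShortSmith's pipeline does not necessarily expose these types, and the field names simply mirror the list above:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClipMetrics:
    """Per-clip fields from the pipeline output (names mirror Section 3.1)."""
    clip_id: int                          # 1, 2, or 3
    start_time: float
    end_time: float
    hype_score: float
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None       # present only if a hook was detected
    hook_confidence: Optional[float] = None

@dataclass
class RunMetrics:
    """Processing-level fields for one test run."""
    processing_time_seconds: float
    frames_analyzed: int
    scenes_detected: int
    hooks_detected: int
    clips: list[ClipMetrics] = field(default_factory=list)
```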
### 3.2 Human Evaluation Criteria
Each clip needs human scoring on these dimensions:
#### A. Highlight Quality (1-5 scale)
```
1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)
```
#### B. Hook Effectiveness (1-5 scale)
```
1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)
```
#### C. Domain Appropriateness (Yes/No/Partial)
```
Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?
```
### 3.3 A/B Testing (Optional)
Compare outputs with different settings:
- Viral hooks ON vs OFF
- Different domain presets for same video
- Custom prompt vs no prompt
---
## 4. Test Execution Process
### 4.1 Per-Video Test Run
```
1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in spreadsheet
```
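Steps 7 and 8 (scoring by 2+ team members, then averaging) can be sketched as a small helper; the function and its dict shape are assumptions for illustration, not part of any existing tooling:

```python
from statistics import mean

def average_scores(ratings: list[dict]) -> dict:
    """Average each scoring dimension across raters for one clip.

    ratings: one dict per rater, e.g.
        [{"highlight": 4, "hook": 5}, {"highlight": 3, "hook": 4}]
    """
    keys = ratings[0].keys()
    return {k: mean(r[k] for r in ratings) for k in keys}
```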
### 4.2 Test Run Naming
```
{date}_{domain}_{video_id}_{tester_initials}
Example: 2024-01-15_sports_001_CM
```
---
## 5. Scoring & Storage
### 5.1 Results Spreadsheet Structure
Create a Google Sheet / Excel with these columns:
| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
|------|----------|--------|-------|-------|-----|------------|--------|-------|--------|-----------|----------------------|------------------|---------------------|-------|
### 5.2 Aggregate Metrics to Track
Calculate weekly/monthly:
```
Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)
Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts
Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation
```
### 5.3 Quality Thresholds
| Metric | Poor | Acceptable | Good | Excellent |
|--------|------|------------|------|-----------|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
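The thresholds above can be applied mechanically with a banding helper (illustrative; the behavior exactly at each cut-off is an assumption, since the table leaves boundaries ambiguous):

```python
def quality_band(score: float, bounds: tuple[float, float, float]) -> str:
    """Map a metric to a Section 5.3 band.

    bounds = (poor_below, acceptable_below, good_upper), e.g.
    (2.5, 3.5, 4.2) for highlight/hook scores, (0.6, 0.75, 0.9) for
    domain match expressed as a fraction.
    """
    poor, acceptable, good = bounds
    if score < poor:
        return "Poor"
    if score < acceptable:
        return "Acceptable"
    if score <= good:
        return "Good"
    return "Excellent"
```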
---
## 6. Issue Tracking
### 6.1 Common Issues to Watch For
| Issue | How to Identify | Priority |
|-------|-----------------|----------|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |
### 6.2 Bug Report Template
```
Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}
```
---
## 7. Test Schedule
Tentative; to be finalized.
## 8. Success Criteria
### MVP Launch Requirements
- [ ] Avg Highlight Score > 3.5 across all domains
- [ ] Avg Hook Score > 3.5 across all domains
- [ ] Domain Match Rate > 75%
- [ ] No critical processing errors
- [ ] Processing time < 10 min for 10-min video on A10G
### Stretch Goals
- [ ] Avg Highlight Score > 4.0
- [ ] Avg Hook Score > 4.0
- [ ] Domain Match Rate > 90%
- [ ] Hype-Human correlation > 0.6
---
## 9. Experimentation Guide
This section explains how to systematically test different settings to find optimal configurations.
### 9.1 What Settings Can Be Changed
| Setting | Location | Default | Range | Impact |
|---------|----------|---------|-------|--------|
| **Coarse Sample Interval** | config.py | 5.0 sec | 2-10 sec | More frames = better accuracy, slower processing |
| **Clip Duration** | UI Slider | 15 sec | 5-30 sec | Longer clips = more context, harder to keep attention |
| **Number of Clips** | UI Slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| **Domain Selection** | UI Dropdown | General | 6 options | Changes weight distribution for scoring |
| **Hype Threshold** | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| **Min Gap Between Clips** | config.py | 30 sec | 10-60 sec | Higher = more spread out clips |
| **Scene Threshold** | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
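As a sketch, the config.py tunables above might look like the following. The variable names are assumptions; only the defaults and ranges come from the table:

```python
# Illustrative config.py fragment -- ShortSmith's actual names may differ.
COARSE_SAMPLE_INTERVAL = 5.0   # sec between sampled frames (range 2-10)
HYPE_THRESHOLD = 0.3           # min combined score for a candidate (0.1-0.6)
MIN_GAP_BETWEEN_CLIPS = 30     # sec separating selected clips (10-60)
SCENE_THRESHOLD = 27.0         # scene-cut sensitivity; lower = more cuts (20-35)
```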
### 9.2 Experiment Types
#### Experiment A: Domain Preset Comparison
**Goal**: Verify that domain selection actually affects results
**Method**:
1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
2. Run ShortSmith 3 times with different domain settings:
- Run 1: Gaming
- Run 2: Music
- Run 3: General
3. Compare the 3 clips from each run
**What to Record**:
```
Video: {video_id}
Experiment: Domain Comparison
| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1 | Gaming | | | | | |
| 2 | Music | | | | | |
| 3 | General| | | | | |
Question: Did different domains pick different moments? Y / N
Question: Which domain worked best for this video? ____________
```
#### Experiment B: Clip Duration Impact
**Goal**: Find optimal clip length for engagement
**Method**:
1. Pick 1 video
2. Run ShortSmith 3 times with different durations:
- Run 1: 10 seconds
- Run 2: 15 seconds
- Run 3: 25 seconds
3. Watch all clips and score
**What to Record**:
```
Video: {video_id}
Experiment: Duration Comparison
| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec | | Y/N | Y/N |
| 15 sec | | Y/N | Y/N |
| 25 sec | | Y/N | Y/N |
Optimal duration for this content type: ___ seconds
```
#### Experiment C: Custom Prompt Effectiveness
**Goal**: Test if custom prompts improve clip selection
**Method**:
1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
2. Run twice:
- Run 1: No custom prompt
- Run 2: Custom prompt = "Focus on crowd reactions"
3. Compare results
**What to Record**:
```
Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test
| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1 | No | Y/N | Y/N | Y/N |
| 2 | Yes | Y/N | Y/N | Y/N |
Did custom prompt help? Y / N / Unclear
```
#### Experiment D: Person Filter Accuracy
**Goal**: Test if person filtering actually prioritizes target person
**Method**:
1. Pick a video with 2+ people clearly visible
2. Get a clear reference photo of 1 person
3. Run twice:
- Run 1: No reference image
- Run 2: With reference image
4. Count target person screen time in each clip
**What to Record**:
```
Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test
| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1 | No | | | | |
| 2 | Yes | | | | |
Did filter increase target person screen time? Y / N
By how much? ___% improvement
```
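The final two questions can be answered directly from the recorded percentages. This helper (illustrative, not existing tooling) reports the percentage-point improvement in average target-person screen time:

```python
from statistics import mean

def filter_improvement(baseline_pct: list[float], filtered_pct: list[float]) -> float:
    """Percentage-point change in mean target-person screen time
    between the no-filter run and the filtered run (one value per clip)."""
    return mean(filtered_pct) - mean(baseline_pct)
```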
### 9.3 How to Document Experiments
#### Step 1: Create Experiment Log
Make a new sheet/tab called "Experiments" with columns:
```
| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |
```
#### Step 2: Record Before & After
Always record:
- What you changed (the variable)
- What stayed the same (controls)
- The measurable outcome
#### Step 3: Take Screenshots
For each experiment:
- Screenshot the settings used
- Screenshot the processing log
- Save the output clips with clear naming:
```
exp_{type}_{video}_{setting}.mp4
Examples:
exp_domain_sports001_gaming.mp4
exp_domain_sports001_music.mp4
exp_duration_vlog002_10sec.mp4
exp_duration_vlog002_25sec.mp4
```
### 9.4 Recommended Experiment Order
**Week 1: Baseline + Domain Testing**
```
Day 1-2: Run all 6 domains on 2 "neutral" videos
Record which domain performs best
Day 3-4: Run domain-matched tests
(Sports video with Sports setting, Music with Music, etc.)
Record scores
Day 5: Analyze - Do domain presets actually help vs General?
```
**Week 2: Duration & Clip Count Testing**
```
Day 1-2: Test 10s vs 15s vs 20s vs 25s clips
On 3 different video types
Day 3-4: Test 1 clip vs 2 clips vs 3 clips
Does quality drop with more clips?
Day 5: Analyze - Find optimal duration per domain
```
**Week 3: Advanced Features Testing**
```
Day 1-2: Custom prompt experiments
Try 5 different prompts on same video
Day 3-4: Person filter experiments
Test accuracy on 3 multi-person videos
Day 5: Analyze - Document which features add value
```
### 9.5 Experiment Scoring Template
For each experiment, fill out:
```
=== EXPERIMENT REPORT ===
Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]
VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________
WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________
RESULTS
Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___
Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___
WINNER: Setting ___
WHY?
________________________________
________________________________
RECOMMENDATION:
________________________________
________________________________
```
### 9.6 Key Questions to Answer Through Experiments
After completing experiments, you should be able to answer:
**Domain Settings:**
- [ ] Does Sports mode actually pick action moments better than General?
- [ ] Does Music mode find beat drops better?
- [ ] Does Podcast mode find speaking highlights?
- [ ] Which domain works best for "general" content?
**Clip Settings:**
- [ ] What's the ideal clip duration for TikTok/Reels (vertical short-form)?
- [ ] What's the ideal clip duration for YouTube Shorts?
- [ ] Does requesting 1 clip give higher quality than requesting 3?
**Features:**
- [ ] Do custom prompts actually improve results?
- [ ] How accurate is person filtering? (% of clips with target person)
- [ ] What types of custom prompts work best?
**Quality Patterns:**
- [ ] Which content types get the best results?
- [ ] Which content types struggle?
- [ ] Are there patterns in "bad" clips? (e.g., always too early, always misses peak)
---
## Appendix A: Quick Evaluation Form
```
=== ShortSmith Clip Evaluation ===
Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________
CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________
Overall Comments:
________________________________
________________________________
```
---
## Appendix B: Domain-Specific Hook Types Reference
### Sports
| Hook Type | What to Look For |
|-----------|-----------------|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |
### Music
| Hook Type | What to Look For |
|-----------|-----------------|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |
### Gaming
| Hook Type | What to Look For |
|-----------|-----------------|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |
### Vlogs
| Hook Type | What to Look For |
|-----------|-----------------|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |
### Podcasts
| Hook Type | What to Look For |
|-----------|-----------------|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |
---
## Appendix C: Sample Test Results Format
```csv
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
```