Spaces:

AI-Talent-Force
/

dev_caio

Chaitanya-aitf committed on Dec 22, 2025

Commit

b64816d

verified ·

1 Parent(s): 374f92b

Create testing_guide.md

Files changed (1)

testing_guide.md +625 -0

testing_guide.md ADDED

	@@ -0,0 +1,625 @@

+# ShortSmith v2 - Testing & Evaluation Guide
+## Overview
+This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score and store findings.
+---
+## 1. What We're Testing
+### 1.1 Core Quality Metrics
+| Component | What It Does | Key Output |
+|-----------|--------------|------------|
+| **Visual Analysis** | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
+| **Audio Analysis** | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
+| **Motion Detection** | RAFT optical flow measures action | magnitude (0-1), is_action flag |
+| **Viral Hooks** | Finds optimal clip start points | hook_type, confidence, intensity |
+| **Hype Scoring** | Combines all signals | combined_score (0-1), rank |
+### 1.2 End-to-End Quality
+- **Clip Relevance**: Are the extracted clips actually highlights?
+- **Hook Effectiveness**: Do clips start at engaging moments?
+- **Domain Accuracy**: Does sports mode pick different clips than podcast mode?
+- **Person Filtering**: When enabled, does the target person appear in clips?
+---
+## 2. Test Dataset Requirements
+### 2.1 Video Categories (Minimum 3 videos per category)
+| Domain | Video Type | Duration | Source Examples |
+|--------|-----------|----------|-----------------|
+| **Sports** | Football/Basketball highlights | 5-15 min | YouTube sports channels |
+| **Music** | Music videos, concerts | 3-8 min | Official MVs, live performances |
+| **Gaming** | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
+| **Vlogs** | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
+| **Podcasts** | Interview/discussion clips | 15-30 min | Podcast video episodes |
+| **General** | Mixed content | 5-15 min | Random viral videos |
+### 2.2 Test Video Naming Convention
+```
+{domain}_{video_id}_{duration_mins}min.mp4
+Examples:
+sports_001_10min.mp4
+music_002_5min.mp4
+gaming_003_15min.mp4
+```
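A small parser can catch misnamed files before a test run starts. The sketch below follows the three examples above; the lowercase domain names for vlogs, podcasts, and general, and the three-digit `video_id`, are assumptions inferred from those examples rather than rules stated in this guide. `VideoName` and `parse_test_video_name` are illustrative names.

```python
import re
from typing import NamedTuple

# Domains come from the table in section 2.1. Only "sports", "music" and
# "gaming" appear in the examples; the other spellings are assumed.
_NAME_RE = re.compile(
    r"^(?P<domain>sports|music|gaming|vlogs|podcasts|general)"
    r"_(?P<video_id>\d{3})"      # assumed: zero-padded three-digit id
    r"_(?P<minutes>\d+)min\.mp4$"
)


class VideoName(NamedTuple):
    domain: str
    video_id: str
    duration_mins: int


def parse_test_video_name(filename: str) -> VideoName:
    """Split a test video filename into its parts, or raise ValueError."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a valid test video name: {filename!r}")
    return VideoName(match["domain"], match["video_id"], int(match["minutes"]))
```

For example, `parse_test_video_name("sports_001_10min.mp4")` yields the domain `sports`, the id `001`, and a duration of 10 minutes.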
+---
+## 3. Testing Methodology
+### 3.1 Automated Metrics (System-Generated)
+For each test run, capture these from the pipeline output:
+```
+Processing Metrics:
+- processing_time_seconds
+- frames_analyzed
+- scenes_detected
+- hooks_detected
+Per Clip:
+- clip_id (1, 2, 3)
+- start_time
+- end_time
+- hype_score
+- visual_score
+- audio_score
+- motion_score
+- hook_type (if any)
+- hook_confidence
+```
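One way to keep these values consistent between runs is to store them in a typed record. The field names below are copied from the list above; the class names `ClipMetrics` and `RunMetrics`, the choice of types, and the `duration` helper are assumptions for illustration, not the pipeline's actual output schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClipMetrics:
    clip_id: int                 # 1, 2 or 3
    start_time: float            # seconds from the start of the video
    end_time: float
    hype_score: float
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None         # None when no hook was found
    hook_confidence: Optional[float] = None

    @property
    def duration(self) -> float:
        """Clip length in seconds."""
        return self.end_time - self.start_time


@dataclass
class RunMetrics:
    processing_time_seconds: float
    frames_analyzed: int
    scenes_detected: int
    hooks_detected: int
    clips: List[ClipMetrics] = field(default_factory=list)
```

`dataclasses.asdict` can then turn a `RunMetrics` instance into a plain dictionary for logging or export.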
+### 3.2 Human Evaluation Criteria
+Each clip needs human scoring on these dimensions:
+#### A. Highlight Quality (1-5 scale)
+```
+1 = Not a highlight at all (boring/irrelevant)
+2 = Weak highlight (somewhat interesting)
+3 = Decent highlight (would watch)
+4 = Good highlight (engaging)
+5 = Perfect highlight (would share/rewatch)
+```
+#### B. Hook Effectiveness (1-5 scale)
+```
+1 = Starts at wrong moment (confusing/boring start)
+2 = Starts too early/late (misses the peak)
+3 = Acceptable start (gets the point across)
+4 = Good start (grabs attention)
+5 = Perfect start (immediately engaging, viral potential)
+```
+#### C. Domain Appropriateness (Yes/No/Partial)
+```
+Does this clip match what you'd expect for this content type?
+- Sports: Action/celebration/crowd moment?
+- Music: Beat drop/chorus/dance peak?
+- Gaming: Clutch play/reaction/funny moment?
+- Vlogs: Emotional/funny/reveal moment?
+- Podcasts: Hot take/laugh/interesting point?
+```
+### 3.3 A/B Testing (Optional)
+Compare outputs with different settings:
+- Viral hooks ON vs OFF
+- Different domain presets for same video
+- Custom prompt vs no prompt
+---
+## 4. Test Execution Process
+### 4.1 Per-Video Test Run
+```
+1. Upload video to ShortSmith
+2. Select appropriate domain
+3. Set: 3 clips, 15 seconds each
+4. Run extraction
+5. Download all 3 clips
+6. Record automated metrics from log
+7. Have 2+ team members score each clip
+8. Calculate average scores
+9. Log results in spreadsheet
+```
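Steps 7 and 8 can be sketched as a short helper. The function name `average_ratings` and the input shape (clip number mapped to a list of rater scores) are assumptions; the checks mirror the "2+ team members" rule and the 1-5 scale from section 3.2.

```python
from statistics import mean
from typing import Dict, List


def average_ratings(ratings: Dict[int, List[int]]) -> Dict[int, float]:
    """Average the human scores for each clip.

    `ratings` maps a clip number to the scores given by each rater.
    """
    averages: Dict[int, float] = {}
    for clip_num, scores in ratings.items():
        if len(scores) < 2:
            raise ValueError(f"clip {clip_num}: need scores from 2+ raters")
        if any(not 1 <= score <= 5 for score in scores):
            raise ValueError(f"clip {clip_num}: scores must be on the 1-5 scale")
        averages[clip_num] = float(mean(scores))
    return averages
```

The same helper works for both Highlight Quality and Hook Effectiveness, since they share the 1-5 scale.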
+### 4.2 Test Run Naming
+```
+{date}_{domain}_{video_id}_{tester_initials}
+Example: 2024-01-15_sports_001_CM
+```
+---
+## 5. Scoring & Storage
+### 5.1 Results Spreadsheet Structure
+Create a Google Sheet / Excel with these columns:
+| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
+|------|----------|--------|-------|-------|-----|------------|--------|-------|--------|-----------|----------------------|------------------|---------------------|-------|
+### 5.2 Aggregate Metrics to Track
+Calculate weekly/monthly:
+```
+Overall Quality:
+- Avg Human Highlight Score (target: >3.5)
+- Avg Human Hook Score (target: >3.5)
+- Domain Match Rate (target: >80%)
+Per Domain:
+- Avg scores broken down by sports/music/gaming/vlogs/podcasts/general
+Correlation Analysis:
+- System hype_score vs Human rating correlation
+- Hook confidence vs Human hook rating correlation
+```
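The domain match rate and the two correlations above can be computed with the standard library alone. This is a sketch under two assumptions that this guide does not specify: the correlation is Pearson's r, and a "P" (Partial) rating counts as half a match. The weight is exposed as `partial_weight` so the team can pick a different convention.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with 2+ points")
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def domain_match_rate(matches: Sequence[str], partial_weight: float = 0.5) -> float:
    """Share of clips that match the domain, as a fraction between 0 and 1.

    Assumption: "P" counts as `partial_weight` of a match (0.5 by default).
    """
    weights = {"Y": 1.0, "P": partial_weight, "N": 0.0}
    return mean([weights[m] for m in matches])
```

Call `pearson(hype_scores, human_highlight_scores)` for the hype-versus-human check and `pearson(hook_confidences, human_hook_scores)` for the hook check. With only a handful of clips the coefficient is noisy, so it is more meaningful over a full week or month of results.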
+### 5.3 Quality Thresholds
+| Metric | Poor | Acceptable | Good | Excellent |
+|--------|------|------------|------|-----------|
+| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
+| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
+| Domain Match | <60% | 60-75% | 75-90% | >90% |
+| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
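The table can be applied mechanically with a small lookup. The table leaves the shared boundaries open (3.5 appears in both Acceptable and Good), so this sketch assumes that a value equal to a lower cutoff belongs to the higher band, while Excellent stays strictly greater than its cutoff as written. The metric keys and the function name `rate_metric` are illustrative.

```python
from typing import Dict, Tuple

# (poor_below, good_from, excellent_above) for each metric in the table.
# Domain match is expressed in percent.
_CUTOFFS: Dict[str, Tuple[float, float, float]] = {
    "highlight": (2.5, 3.5, 4.2),
    "hook": (2.5, 3.5, 4.2),
    "domain_match": (60.0, 75.0, 90.0),
    "correlation": (0.3, 0.5, 0.7),
}


def rate_metric(metric: str, value: float) -> str:
    """Map a metric value to Poor / Acceptable / Good / Excellent."""
    poor_below, good_from, excellent_above = _CUTOFFS[metric]
    if value < poor_below:
        return "Poor"
    if value < good_from:          # assumed: the boundary itself counts as Good
        return "Acceptable"
    if value <= excellent_above:   # Excellent is strictly above the cutoff
        return "Good"
    return "Excellent"
```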
+---
+## 6. Issue Tracking
+### 6.1 Common Issues to Watch For
+| Issue | How to Identify | Priority |
+|-------|-----------------|----------|
+| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
+| Bad hook timing | Good highlight but hook score <2 | High |
+| Domain mismatch | Domain_Match = N for >30% clips | Medium |
+| Person filter fail | Target person not in clips when filter enabled | Medium |
+| Processing errors | Pipeline fails/crashes | Critical |
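The first three rows of the table can be checked automatically against the results spreadsheet. In this sketch the row keys follow the CSV column names in Appendix C, and a "good highlight" is assumed to mean a human highlight score of 4 or more, which the table does not define. The function names are illustrative.

```python
from typing import Dict, List, Sequence


def flag_clip_issues(row: Dict[str, float]) -> List[str]:
    """Return the issue labels that apply to one scored clip."""
    issues: List[str] = []
    if row["human_highlight"] < 2 and row["hype_score"] > 0.7:
        issues.append("Wrong clip selected")
    # Assumption: "good highlight" means a human highlight score >= 4.
    if row["human_highlight"] >= 4 and row["human_hook"] < 2:
        issues.append("Bad hook timing")
    return issues


def has_domain_mismatch(domain_matches: Sequence[str]) -> bool:
    """True when more than 30% of clips were rated Domain_Match = N."""
    if not domain_matches:
        return False
    share_no = list(domain_matches).count("N") / len(domain_matches)
    return share_no > 0.30
```

Person filter failures and processing errors still need manual review, since they depend on watching the clips and reading the pipeline log.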
+### 6.2 Bug Report Template
+```
+Video: {video_id}
+Domain: {domain}
+Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
+Clip #: {1/2/3}
+Timestamp: {start}-{end}
+Expected: {what should have been selected}
+Actual: {what was selected}
+System Scores: hype={}, visual={}, audio={}, motion={}
+Human Scores: highlight={}, hook={}
+Notes: {additional context}
+```
+---
+## 7. Test Schedule
+### Phase 1: Baseline Testing (Week 1)
+- Test 2 videos per domain (12 total)
+- Establish baseline scores
+- Identify major issues
+### Phase 2: Domain Deep Dive (Week 2-3)
+- Focus on underperforming domains
+- Test 5+ videos per weak domain
+- Tune domain presets if needed
+### Phase 3: Edge Cases (Week 4)
+- Very short videos (<3 min)
+- Very long videos (>30 min)
+- Videos with low-quality picture or audio
+- Multilingual content
+### Phase 4: Regression Testing (Ongoing)
+- After any code changes
+- Run standard test set (1 video per domain)
+- Compare to baseline
+---
+## 8. Success Criteria
+### MVP Launch Requirements
+- [ ] Avg Highlight Score > 3.5 across all domains
+- [ ] Avg Hook Score > 3.5 across all domains
+- [ ] Domain Match Rate > 75%
+- [ ] No critical processing errors
+- [ ] Processing time < 10 min for 10-min video on A10G
+### Stretch Goals
+- [ ] Avg Highlight Score > 4.0
+- [ ] Avg Hook Score > 4.0
+- [ ] Domain Match Rate > 90%
+- [ ] Hype-Human correlation > 0.6
+---
+## 9. Team Responsibilities
+| Role | Responsibilities |
+|------|------------------|
+| **Tester 1** | Sports, Gaming domains |
+| **Tester 2** | Music, Vlogs domains |
+| **Tester 3** | Podcasts, General domains |
+| **Lead** | Aggregate results, track issues, coordinate fixes |
+---
+## 10. Tools Needed
+- [ ] Test video dataset (18+ videos minimum)
+- [ ] Google Sheet for results tracking
+- [ ] Video player for clip review
+- [ ] Screen recording for bug reports (optional)
+- [ ] Stopwatch for processing time measurement
+---
+## 11. Experimentation Guide
+This section explains how to systematically test different settings to find optimal configurations.
+### 11.1 What Settings Can Be Changed
+| Setting | Location | Default | Range | Impact |
+|---------|----------|---------|-------|--------|
+| **Coarse Sample Interval** | config.py | 5.0 sec | 2-10 sec | Shorter interval = more frames analyzed: better accuracy, slower processing |
+| **Clip Duration** | UI Slider | 15 sec | 5-30 sec | Longer clips = more context, harder to keep attention |
+| **Number of Clips** | UI Slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
+| **Domain Selection** | UI Dropdown | General | 6 options | Changes weight distribution for scoring |
+| **Hype Threshold** | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
+| **Min Gap Between Clips** | config.py | 30 sec | 10-60 sec | Higher = more spread out clips |
+| **Scene Threshold** | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
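Before an experiment run, it can help to check that every changed setting stays inside the range listed in the table. The setting keys below are hypothetical names chosen for this sketch; the actual names in `config.py` and the UI are not shown in this guide. Defaults and ranges are copied from the table.

```python
from typing import Dict, Tuple, Union

Number = Union[int, float]

# Hypothetical keys; defaults and ranges come from the table in 11.1.
DEFAULTS: Dict[str, Number] = {
    "coarse_sample_interval": 5.0,   # seconds
    "clip_duration": 15,             # seconds
    "num_clips": 3,
    "hype_threshold": 0.3,
    "min_gap_between_clips": 30,     # seconds
    "scene_threshold": 27.0,
}
RANGES: Dict[str, Tuple[Number, Number]] = {
    "coarse_sample_interval": (2, 10),
    "clip_duration": (5, 30),
    "num_clips": (1, 3),
    "hype_threshold": (0.1, 0.6),
    "min_gap_between_clips": (10, 60),
    "scene_threshold": (20, 35),
}


def validate_settings(overrides: Dict[str, Number]) -> Dict[str, Number]:
    """Merge overrides into the defaults and reject out-of-range values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    for name, value in settings.items():
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}-{high}")
    return settings
```

Recording the dictionary returned by `validate_settings` alongside each run also documents exactly which settings were in effect.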
+### 11.2 Experiment Types
+#### Experiment A: Domain Preset Comparison
+**Goal**: Verify that domain selection actually affects results
+**Method**:
+1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
+2. Run ShortSmith 3 times with different domain settings:
+   - Run 1: Gaming
+   - Run 2: Music
+   - Run 3: General
+3. Compare the 3 clips from each run
+**What to Record**:
+```
+Video: {video_id}
+Experiment: Domain Comparison
+| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
+|-----|--------|-------------|-------------|-------------|------------|-------|
+| 1   | Gaming |             |             |             |            |       |
+| 2   | Music  |             |             |             |            |       |
+| 3   | General|             |             |             |            |       |
+Question: Did different domains pick different moments?  Y / N
+Question: Which domain worked best for this video? ____________
+```
+#### Experiment B: Clip Duration Impact
+**Goal**: Find optimal clip length for engagement
+**Method**:
+1. Pick 1 video
+2. Run ShortSmith 3 times with different durations:
+   - Run 1: 10 seconds
+   - Run 2: 15 seconds
+   - Run 3: 25 seconds
+3. Watch all clips and score
+**What to Record**:
+```
+Video: {video_id}
+Experiment: Duration Comparison
+| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
+|----------|------------------|-------------------------|-------------------|
+| 10 sec   |                  | Y/N                     | Y/N               |
+| 15 sec   |                  | Y/N                     | Y/N               |
+| 25 sec   |                  | Y/N                     | Y/N               |
+Optimal duration for this content type: ___ seconds
+```
+#### Experiment C: Custom Prompt Effectiveness
+**Goal**: Test if custom prompts improve clip selection
+**Method**:
+1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
+2. Run twice:
+   - Run 1: No custom prompt
+   - Run 2: Custom prompt = "Focus on crowd reactions"
+3. Compare results
+**What to Record**:
+```
+Video: {video_id}
+Custom Prompt: "________________________"
+Experiment: Custom Prompt Test
+| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
+|-----|-------------|---------------------------|-------------------|-------------------|
+| 1   | No          | Y/N                       | Y/N               | Y/N               |
+| 2   | Yes         | Y/N                       | Y/N               | Y/N               |
+Did custom prompt help? Y / N / Unclear
+```
+#### Experiment D: Person Filter Accuracy
+**Goal**: Test whether person filtering actually prioritizes the target person
+**Method**:
+1. Pick a video with 2+ people clearly visible
+2. Get a clear reference photo of 1 person
+3. Run twice:
+   - Run 1: No reference image
+   - Run 2: With reference image
+4. Measure the target person's screen time in each clip
+**What to Record**:
+```
+Video: {video_id}
+Target Person: {description}
+Experiment: Person Filter Test
+| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
+|-----|------------|----------------|----------------|----------------|-------|
+| 1   | No         |                |                |                |       |
+| 2   | Yes        |                |                |                |       |
+Did filter increase target person screen time? Y / N
+By how much? ___ percentage points
+```
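The "Avg %" column and the improvement figure can be computed as follows. This sketch reports the change in percentage points (average with the filter minus average without it), which is an assumption about how the improvement should be expressed; the function name is illustrative.

```python
from statistics import mean
from typing import Sequence


def person_filter_gain(
    without_filter_pcts: Sequence[float],
    with_filter_pcts: Sequence[float],
) -> float:
    """Change in average target-person screen time, in percentage points.

    Each sequence holds the per-clip screen-time percentages for one run.
    A positive result means the filter increased the target's screen time.
    """
    for pcts in (without_filter_pcts, with_filter_pcts):
        if not pcts or any(not 0 <= p <= 100 for p in pcts):
            raise ValueError("each run needs per-clip percentages in 0-100")
    return float(mean(with_filter_pcts) - mean(without_filter_pcts))
```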
+### 11.3 How to Document Experiments
+#### Step 1: Create Experiment Log
+Make a new sheet/tab called "Experiments" with columns:
+```
+| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |
+```
+#### Step 2: Record Before & After
+Always record:
+- What you changed (the variable)
+- What stayed the same (controls)
+- The measurable outcome
+#### Step 3: Take Screenshots
+For each experiment:
+- Screenshot the settings used
+- Screenshot the processing log
+- Save the output clips with clear naming:
+  ```
+  exp_{type}_{video}_{setting}.mp4
+  Examples:
+  exp_domain_sports001_gaming.mp4
+  exp_domain_sports001_music.mp4
+  exp_duration_vlog002_10sec.mp4
+  exp_duration_vlog002_25sec.mp4
+  ```
+### 11.4 Recommended Experiment Order
+**Week 1: Baseline + Domain Testing**
+```
+Day 1-2: Run all 6 domains on 2 "neutral" videos
+         Record which domain performs best
+Day 3-4: Run domain-matched tests
+         (Sports video with Sports setting, Music with Music, etc.)
+         Record scores
+Day 5: Analyze - Do domain presets actually help vs General?
+```
+**Week 2: Duration & Clip Count Testing**
+```
+Day 1-2: Test 10s vs 15s vs 20s vs 25s clips
+         On 3 different video types
+Day 3-4: Test 1 clip vs 2 clips vs 3 clips
+         Does quality drop with more clips?
+Day 5: Analyze - Find optimal duration per domain
+```
+**Week 3: Advanced Features Testing**
+```
+Day 1-2: Custom prompt experiments
+         Try 5 different prompts on same video
+Day 3-4: Person filter experiments
+         Test accuracy on 3 multi-person videos
+Day 5: Analyze - Document which features add value
+```
+### 11.5 Experiment Scoring Template
+For each experiment, fill out:
+```
+=== EXPERIMENT REPORT ===
+Date: ____________
+Tester: ____________
+Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]
+VIDEO DETAILS
+- Video ID: ____________
+- Domain: ____________
+- Duration: ____________
+- Content Description: ____________
+WHAT I TESTED
+- Variable Changed: ____________
+- Setting A: ____________
+- Setting B: ____________
+- Setting C (if any): ____________
+RESULTS
+Setting A Output:
+- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
+- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
+- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
+- Overall Quality (1-5): ___
+Setting B Output:
+- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
+- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
+- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
+- Overall Quality (1-5): ___
+WINNER: Setting ___
+WHY?
+________________________________
+________________________________
+RECOMMENDATION:
+________________________________
+________________________________
+```
+### 11.6 Key Questions to Answer Through Experiments
+After completing experiments, you should be able to answer:
+**Domain Settings:**
+- [ ] Does Sports mode actually pick action moments better than General?
+- [ ] Does Music mode find beat drops better?
+- [ ] Does Podcast mode find speaking highlights?
+- [ ] Which domain works best for "general" content?
+**Clip Settings:**
+- [ ] What's the ideal clip duration for TikTok/Reels (vertical short-form)?
+- [ ] What's the ideal clip duration for YouTube Shorts?
+- [ ] Does requesting 1 clip give higher quality than requesting 3?
+**Features:**
+- [ ] Do custom prompts actually improve results?
+- [ ] How accurate is person filtering? (% of clips with target person)
+- [ ] What types of custom prompts work best?
+**Quality Patterns:**
+- [ ] Which content types get the best results?
+- [ ] Which content types struggle?
+- [ ] Are there patterns in "bad" clips? (e.g., always too early, always misses peak)
+---
+## Appendix A: Quick Evaluation Form
+```
+=== ShortSmith Clip Evaluation ===
+Video ID: ____________
+Domain: ____________
+Date: ____________
+Evaluator: ____________
+CLIP 1 (___:___ - ___:___)
+System Hype Score: ___
+Highlight Quality (1-5): ___
+Hook Effectiveness (1-5): ___
+Domain Match (Y/N/P): ___
+Notes: ________________________________
+CLIP 2 (___:___ - ___:___)
+System Hype Score: ___
+Highlight Quality (1-5): ___
+Hook Effectiveness (1-5): ___
+Domain Match (Y/N/P): ___
+Notes: ________________________________
+CLIP 3 (___:___ - ___:___)
+System Hype Score: ___
+Highlight Quality (1-5): ___
+Hook Effectiveness (1-5): ___
+Domain Match (Y/N/P): ___
+Notes: ________________________________
+Overall Comments:
+________________________________
+________________________________
+```
+---
+## Appendix B: Domain-Specific Hook Types Reference
+### Sports
+| Hook Type | What to Look For |
+|-----------|-----------------|
+| GOAL_MOMENT | Scoring play, basket, touchdown |
+| CROWD_ERUPTION | Audience going wild |
+| COMMENTATOR_HYPE | Excited commentary voice |
+| REPLAY_WORTHY | Impressive athletic move |
+### Music
+| Hook Type | What to Look For |
+|-----------|-----------------|
+| BEAT_DROP | Bass drop, beat switch |
+| CHORUS_HIT | Chorus/hook starts |
+| DANCE_PEAK | Peak choreography moment |
+| VISUAL_CLIMAX | Visual spectacle |
+### Gaming
+| Hook Type | What to Look For |
+|-----------|-----------------|
+| CLUTCH_PLAY | Skillful play under pressure |
+| ELIMINATION | Kill/win moment |
+| RAGE_REACTION | Streamer emotional reaction |
+| UNEXPECTED | Surprise/plot twist |
+### Vlogs
+| Hook Type | What to Look For |
+|-----------|-----------------|
+| REVEAL | Surprise reveal |
+| PUNCHLINE | Joke landing |
+| EMOTIONAL_MOMENT | Tears/joy/shock |
+| CONFRONTATION | Drama/tension |
+### Podcasts
+| Hook Type | What to Look For |
+|-----------|-----------------|
+| HOT_TAKE | Controversial opinion |
+| BIG_LAUGH | Group laughter |
+| REVELATION | Surprising information |
+| HEATED_DEBATE | Passionate argument |
+---
+## Appendix C: Sample Test Results Format
+```csv
+date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
+2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
+2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
+2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
+```
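A results file in this format can be loaded with the standard `csv` module. The sketch below converts the numeric columns so they can be averaged or correlated directly; the function name `load_results` is illustrative, and the column names are taken from the header row above.

```python
import csv
from typing import Dict, Iterable, List

FLOAT_FIELDS = (
    "start_time", "end_time", "hype_score",
    "visual_score", "audio_score", "motion_score",
)
INT_FIELDS = ("clip_num", "human_highlight", "human_hook")


def load_results(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse result rows from an open CSV file or any iterable of lines."""
    rows: List[Dict[str, object]] = []
    for raw in csv.DictReader(lines):
        row: Dict[str, object] = dict(raw)
        for name in FLOAT_FIELDS:
            row[name] = float(raw[name])
        for name in INT_FIELDS:
            row[name] = int(raw[name])
        rows.append(row)
    return rows
```

Typical usage is `with open(path, newline="") as handle: rows = load_results(handle)`, where `path` points at the exported spreadsheet.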