# ShortSmith v2 - Testing & Evaluation Guide

## Overview
This document outlines our testing strategy for evaluating ShortSmith v2's highlight extraction quality. It covers what to test, how to measure results, and how to score and store findings.

---

## 1. What We're Testing

### 1.1 Core Quality Metrics

| Component | What It Does | Key Output |
|-----------|--------------|------------|
| **Visual Analysis** | Qwen2-VL rates frame excitement | hype_score (0-1), action, emotion |
| **Audio Analysis** | Librosa detects energy/beats | energy_score, excitement_score (0-1) |
| **Motion Detection** | RAFT optical flow measures action | magnitude (0-1), is_action flag |
| **Viral Hooks** | Finds optimal clip start points | hook_type, confidence, intensity |
| **Hype Scoring** | Combines all signals | combined_score (0-1), rank |

### 1.2 End-to-End Quality

- **Clip Relevance**: Are the extracted clips actually highlights?
- **Hook Effectiveness**: Do clips start at engaging moments?
- **Domain Accuracy**: Does sports mode pick different clips than podcast mode?
- **Person Filtering**: When enabled, does the target person appear in the clips?

---

## 2. Test Dataset Requirements

### 2.1 Video Categories (Minimum 3 videos per category)

| Domain | Video Type | Duration | Source Examples |
|--------|-----------|----------|-----------------|
| **Sports** | Football/basketball highlights | 5-15 min | YouTube sports channels |
| **Music** | Music videos, concerts | 3-8 min | Official MVs, live performances |
| **Gaming** | Gameplay with commentary | 10-20 min | Twitch clips, YouTube gaming |
| **Vlogs** | Personal vlogs with reactions | 8-15 min | YouTube vlogs |
| **Podcasts** | Interview/discussion clips | 15-30 min | Podcast video episodes |
| **General** | Mixed content | 5-15 min | Random viral videos |

### 2.2 Test Video Naming Convention
```
{domain}_{video_id}_{duration_mins}.mp4

Examples:
sports_001_10min.mp4
music_002_5min.mp4
gaming_003_15min.mp4
```
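A convention like this is easy to enforce before a test run with a small helper. The sketch below is hypothetical (not part of ShortSmith); it assumes the six domains and the `{domain}_{video_id}_{duration_mins}.mp4` pattern above.

```python
import re

# Domains from the test dataset table; the regex mirrors the
# {domain}_{video_id}_{duration_mins}.mp4 convention.
DOMAINS = {"sports", "music", "gaming", "vlogs", "podcasts", "general"}
PATTERN = re.compile(r"^(?P<domain>[a-z]+)_(?P<video_id>\d{3})_(?P<mins>\d+)min\.mp4$")

def validate_test_filename(name: str) -> bool:
    """Return True if a test video filename follows the convention."""
    m = PATTERN.match(name)
    return bool(m) and m.group("domain") in DOMAINS
```

Running this over the dataset folder before Week 1 catches misnamed files early, e.g. `validate_test_filename("sports_001_10min.mp4")` passes while `sports_1.mp4` does not.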

---

## 3. Testing Methodology

### 3.1 Automated Metrics (System-Generated)

For each test run, capture these from the pipeline output:

```
Processing Metrics:
- processing_time_seconds
- frames_analyzed
- scenes_detected
- hooks_detected

Per Clip:
- clip_id (1, 2, 3)
- start_time
- end_time
- hype_score
- visual_score
- audio_score
- motion_score
- hook_type (if any)
- hook_confidence
```
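To keep runs comparable, the per-clip fields above can be captured in a single record. The dataclass below is just one way to log them; the field names follow the list above, but the class itself is a hypothetical sketch, not ShortSmith code.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ClipMetrics:
    """Per-clip metrics captured from one pipeline run (fields from 3.1)."""
    clip_id: int                        # 1, 2, or 3
    start_time: float                   # seconds into the source video
    end_time: float
    hype_score: float                   # 0-1 combined score
    visual_score: float
    audio_score: float
    motion_score: float
    hook_type: Optional[str] = None     # e.g. "GOAL_MOMENT"; None if no hook
    hook_confidence: Optional[float] = None

# Example record, ready to dump as JSON or a spreadsheet row:
clip = ClipMetrics(1, 45.2, 60.2, 0.82, 0.75, 0.88, 0.71, "GOAL_MOMENT", 0.9)
record = asdict(clip)
```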

### 3.2 Human Evaluation Criteria

Each clip needs human scoring on these dimensions:

#### A. Highlight Quality (1-5 scale)
```
1 = Not a highlight at all (boring/irrelevant)
2 = Weak highlight (somewhat interesting)
3 = Decent highlight (would watch)
4 = Good highlight (engaging)
5 = Perfect highlight (would share/rewatch)
```

#### B. Hook Effectiveness (1-5 scale)
```
1 = Starts at wrong moment (confusing/boring start)
2 = Starts too early/late (misses the peak)
3 = Acceptable start (gets the point across)
4 = Good start (grabs attention)
5 = Perfect start (immediately engaging, viral potential)
```

#### C. Domain Appropriateness (Yes/No/Partial)
```
Does this clip match what you'd expect for this content type?
- Sports: Action/celebration/crowd moment?
- Music: Beat drop/chorus/dance peak?
- Gaming: Clutch play/reaction/funny moment?
- Vlogs: Emotional/funny/reveal moment?
- Podcasts: Hot take/laugh/interesting point?
```

### 3.3 A/B Testing (Optional)

Compare outputs with different settings:
- Viral hooks ON vs OFF
- Different domain presets for the same video
- Custom prompt vs no prompt

---

## 4. Test Execution Process

### 4.1 Per-Video Test Run

```
1. Upload video to ShortSmith
2. Select appropriate domain
3. Set: 3 clips, 15 seconds each
4. Run extraction
5. Download all 3 clips
6. Record automated metrics from the log
7. Have 2+ team members score each clip
8. Calculate average scores
9. Log results in the spreadsheet
```

### 4.2 Test Run Naming
```
{date}_{domain}_{video_id}_{tester_initials}

Example: 2024-01-15_sports_001_CM
```

---

## 5. Scoring & Storage

### 5.1 Results Spreadsheet Structure

Create a Google Sheet / Excel file with these columns:

| Date | Video_ID | Domain | Clip# | Start | End | Hype_Score | Visual | Audio | Motion | Hook_Type | Human_Highlight (1-5) | Human_Hook (1-5) | Domain_Match (Y/N/P) | Notes |
|------|----------|--------|-------|-------|-----|------------|--------|-------|--------|-----------|----------------------|------------------|---------------------|-------|

### 5.2 Aggregate Metrics to Track

Calculate weekly/monthly:

```
Overall Quality:
- Avg Human Highlight Score (target: >3.5)
- Avg Human Hook Score (target: >3.5)
- Domain Match Rate (target: >80%)

Per Domain:
- Avg scores broken down by sports/music/gaming/vlogs/podcasts

Correlation Analysis:
- System hype_score vs Human rating correlation
- Hook confidence vs Human hook rating correlation
```
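The correlation figures can be computed directly from the spreadsheet columns with plain Pearson correlation; a minimal sketch (the sample numbers are made up for illustration):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Example: system hype_score vs human highlight ratings from the sheet
hype = [0.82, 0.71, 0.65, 0.40]
human = [5, 4, 3, 2]
r = pearson(hype, human)  # compare against the 5.3 thresholds
```

For real runs, pull the two columns from the exported sheet; a value above 0.7 lands in the "Excellent" band of section 5.3.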

### 5.3 Quality Thresholds

| Metric | Poor | Acceptable | Good | Excellent |
|--------|------|------------|------|-----------|
| Highlight Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Hook Score | <2.5 | 2.5-3.5 | 3.5-4.2 | >4.2 |
| Domain Match | <60% | 60-75% | 75-90% | >90% |
| Hype-Human Correlation | <0.3 | 0.3-0.5 | 0.5-0.7 | >0.7 |
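Banding a value against this table can be automated when aggregating results. A small sketch; the half-open boundaries (a value exactly at 3.5 counts as "Good") are an assumption, since the table's ranges share endpoints.

```python
def quality_band(metric: str, value: float) -> str:
    """Map a metric value to the Poor/Acceptable/Good/Excellent bands
    from the thresholds table. Boundary handling (upper bounds are
    exclusive of the lower band) is an assumption."""
    # (poor_upper, acceptable_upper, good_upper) per metric;
    # domain_match is expressed as a fraction, not a percentage.
    cuts = {
        "highlight": (2.5, 3.5, 4.2),
        "hook": (2.5, 3.5, 4.2),
        "domain_match": (0.60, 0.75, 0.90),
        "correlation": (0.3, 0.5, 0.7),
    }
    poor, acceptable, good = cuts[metric]
    if value < poor:
        return "Poor"
    if value < acceptable:
        return "Acceptable"
    if value < good:
        return "Good"
    return "Excellent"
```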

---

## 6. Issue Tracking

### 6.1 Common Issues to Watch For

| Issue | How to Identify | Priority |
|-------|-----------------|----------|
| Wrong clip selected | Human score <2, but hype_score >0.7 | High |
| Bad hook timing | Good highlight but hook score <2 | High |
| Domain mismatch | Domain_Match = N for >30% of clips | Medium |
| Person filter fail | Target person not in clips when filter enabled | Medium |
| Processing errors | Pipeline fails/crashes | Critical |

### 6.2 Bug Report Template

```
Video: {video_id}
Domain: {domain}
Issue Type: [Wrong Clip / Bad Hook / Domain Mismatch / Other]
Clip #: {1/2/3}
Timestamp: {start}-{end}
Expected: {what should have been selected}
Actual: {what was selected}
System Scores: hype={}, visual={}, audio={}, motion={}
Human Scores: highlight={}, hook={}
Notes: {additional context}
```

---

## 7. Test Schedule

### Phase 1: Baseline Testing (Week 1)
- Test 2 videos per domain (12 total)
- Establish baseline scores
- Identify major issues

### Phase 2: Domain Deep Dive (Weeks 2-3)
- Focus on underperforming domains
- Test 5+ videos per weak domain
- Tune domain presets if needed

### Phase 3: Edge Cases (Week 4)
- Very short videos (<3 min)
- Very long videos (>30 min)
- Low-quality video or audio
- Multilingual content

### Phase 4: Regression Testing (Ongoing)
- After any code change, run the standard test set (1 video per domain)
- Compare results to the baseline

---

## 8. Success Criteria

### MVP Launch Requirements
- [ ] Avg Highlight Score > 3.5 across all domains
- [ ] Avg Hook Score > 3.5 across all domains
- [ ] Domain Match Rate > 75%
- [ ] No critical processing errors
- [ ] Processing time < 10 min for a 10-min video on an A10G

### Stretch Goals
- [ ] Avg Highlight Score > 4.0
- [ ] Avg Hook Score > 4.0
- [ ] Domain Match Rate > 90%
- [ ] Hype-Human correlation > 0.6

---

## 9. Team Responsibilities

| Role | Responsibilities |
|------|------------------|
| **Tester 1** | Sports, Gaming domains |
| **Tester 2** | Music, Vlogs domains |
| **Tester 3** | Podcasts, General domains |
| **Lead** | Aggregate results, track issues, coordinate fixes |

---

## 10. Tools Needed

- [ ] Test video dataset (18+ videos minimum)
- [ ] Google Sheet for results tracking
- [ ] Video player for clip review
- [ ] Screen recording for bug reports (optional)
- [ ] Stopwatch for processing time measurement

---

## 11. Experimentation Guide

This section explains how to systematically test different settings to find optimal configurations.

### 11.1 What Settings Can Be Changed

| Setting | Location | Default | Range | Impact |
|---------|----------|---------|-------|--------|
| **Coarse Sample Interval** | config.py | 5.0 sec | 2-10 sec | More frames = better accuracy, slower processing |
| **Clip Duration** | UI slider | 15 sec | 5-30 sec | Longer clips = more context, harder to hold attention |
| **Number of Clips** | UI slider | 3 | 1-3 | More clips = more variety, possibly lower avg quality |
| **Domain Selection** | UI dropdown | General | 6 options | Changes weight distribution for scoring |
| **Hype Threshold** | config.py | 0.3 | 0.1-0.6 | Higher = stricter filtering, fewer candidates |
| **Min Gap Between Clips** | config.py | 30 sec | 10-60 sec | Higher = more spread-out clips |
| **Scene Threshold** | config.py | 27.0 | 20-35 | Lower = more scene cuts detected |
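Grouped together, the tunables above might look like the following. The attribute names here are illustrative only, not ShortSmith's actual config.py identifiers; the defaults and ranges come from the table.

```python
from dataclasses import dataclass

@dataclass
class ExtractionConfig:
    """Tunable settings from the table above (defaults shown).
    Attribute names are illustrative, not ShortSmith's real config.py."""
    coarse_sample_interval: float = 5.0   # sec between sampled frames (2-10)
    clip_duration: int = 15               # sec per clip (5-30)
    num_clips: int = 3                    # clips to extract (1-3)
    domain: str = "general"               # one of the 6 domain presets
    hype_threshold: float = 0.3           # min combined score (0.1-0.6)
    min_gap_between_clips: float = 30.0   # sec between chosen clips (10-60)
    scene_threshold: float = 27.0         # scene-cut sensitivity (20-35)

# An experiment run then overrides only the variable under test:
strict = ExtractionConfig(hype_threshold=0.5)
```

Keeping one override per run makes the "what stayed the same" controls from section 11.3 explicit in code.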

### 11.2 Experiment Types

#### Experiment A: Domain Preset Comparison
**Goal**: Verify that domain selection actually affects results

**Method**:
1. Pick 1 video that could fit multiple domains (e.g., a gaming video with music)
2. Run ShortSmith 3 times with different domain settings:
   - Run 1: Gaming
   - Run 2: Music
   - Run 3: General
3. Compare the 3 clips from each run

**What to Record**:
```
Video: {video_id}
Experiment: Domain Comparison

| Run | Domain | Clip1 Start | Clip2 Start | Clip3 Start | Best Clip# | Notes |
|-----|--------|-------------|-------------|-------------|------------|-------|
| 1 | Gaming | | | | | |
| 2 | Music | | | | | |
| 3 | General| | | | | |

Question: Did different domains pick different moments? Y / N
Question: Which domain worked best for this video? ____________
```

#### Experiment B: Clip Duration Impact
**Goal**: Find the optimal clip length for engagement

**Method**:
1. Pick 1 video
2. Run ShortSmith 3 times with different durations:
   - Run 1: 10 seconds
   - Run 2: 15 seconds
   - Run 3: 25 seconds
3. Watch all clips and score them

**What to Record**:
```
Video: {video_id}
Experiment: Duration Comparison

| Duration | Clip1 Hook Score | Clip1 "Feels Complete?" | Clip1 "Too Long?" |
|----------|------------------|-------------------------|-------------------|
| 10 sec | | Y/N | Y/N |
| 15 sec | | Y/N | Y/N |
| 25 sec | | Y/N | Y/N |

Optimal duration for this content type: ___ seconds
```

#### Experiment C: Custom Prompt Effectiveness
**Goal**: Test whether custom prompts improve clip selection

**Method**:
1. Pick 1 video with a specific element you want to capture (e.g., "crowd reactions")
2. Run twice:
   - Run 1: No custom prompt
   - Run 2: Custom prompt = "Focus on crowd reactions"
3. Compare results

**What to Record**:
```
Video: {video_id}
Custom Prompt: "________________________"
Experiment: Custom Prompt Test

| Run | Has Prompt? | Clip1 Has Target Element? | Clip2 Has Target? | Clip3 Has Target? |
|-----|-------------|---------------------------|-------------------|-------------------|
| 1 | No | Y/N | Y/N | Y/N |
| 2 | Yes | Y/N | Y/N | Y/N |

Did the custom prompt help? Y / N / Unclear
```

#### Experiment D: Person Filter Accuracy
**Goal**: Test whether person filtering actually prioritizes the target person

**Method**:
1. Pick a video with 2+ people clearly visible
2. Get a clear reference photo of 1 person
3. Run twice:
   - Run 1: No reference image
   - Run 2: With reference image
4. Count the target person's screen time in each clip

**What to Record**:
```
Video: {video_id}
Target Person: {description}
Experiment: Person Filter Test

| Run | Filter On? | Clip1 Person % | Clip2 Person % | Clip3 Person % | Avg % |
|-----|------------|----------------|----------------|----------------|-------|
| 1 | No | | | | |
| 2 | Yes | | | | |

Did the filter increase target person screen time? Y / N
By how much? ___% improvement
```

### 11.3 How to Document Experiments

#### Step 1: Create an Experiment Log
Make a new sheet/tab called "Experiments" with columns:
```
| Date | Experiment_Type | Video_ID | Variable_Tested | Setting_A | Setting_B | Winner | Notes |
```

#### Step 2: Record Before & After
Always record:
- What you changed (the variable)
- What stayed the same (the controls)
- The measurable outcome

#### Step 3: Take Screenshots
For each experiment:
- Screenshot the settings used
- Screenshot the processing log
- Save the output clips with clear naming:
```
exp_{type}_{video}_{setting}.mp4

Examples:
exp_domain_sports001_gaming.mp4
exp_domain_sports001_music.mp4
exp_duration_vlog002_10sec.mp4
exp_duration_vlog002_25sec.mp4
```

### 11.4 Recommended Experiment Order

**Week 1: Baseline + Domain Testing**
```
Days 1-2: Run all 6 domains on 2 "neutral" videos
          Record which domain performs best

Days 3-4: Run domain-matched tests
          (Sports video with Sports setting, Music with Music, etc.)
          Record scores

Day 5:    Analyze - do domain presets actually help vs General?
```

**Week 2: Duration & Clip Count Testing**
```
Days 1-2: Test 10s vs 15s vs 20s vs 25s clips
          on 3 different video types

Days 3-4: Test 1 clip vs 2 clips vs 3 clips
          Does quality drop with more clips?

Day 5:    Analyze - find the optimal duration per domain
```

**Week 3: Advanced Features Testing**
```
Days 1-2: Custom prompt experiments
          Try 5 different prompts on the same video

Days 3-4: Person filter experiments
          Test accuracy on 3 multi-person videos

Day 5:    Analyze - document which features add value
```

### 11.5 Experiment Scoring Template

For each experiment, fill out:

```
=== EXPERIMENT REPORT ===

Date: ____________
Tester: ____________
Experiment Type: [Domain / Duration / Custom Prompt / Person Filter / Other]

VIDEO DETAILS
- Video ID: ____________
- Domain: ____________
- Duration: ____________
- Content Description: ____________

WHAT I TESTED
- Variable Changed: ____________
- Setting A: ____________
- Setting B: ____________
- Setting C (if any): ____________

RESULTS

Setting A Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

Setting B Output:
- Clip 1: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 2: Start=___ End=___ Highlight Score=___ Hook Score=___
- Clip 3: Start=___ End=___ Highlight Score=___ Hook Score=___
- Overall Quality (1-5): ___

WINNER: Setting ___

WHY?
________________________________
________________________________

RECOMMENDATION:
________________________________
________________________________
```

### 11.6 Key Questions to Answer Through Experiments

After completing the experiments, you should be able to answer:

**Domain Settings:**
- [ ] Does Sports mode actually pick action moments better than General?
- [ ] Does Music mode find beat drops better?
- [ ] Does Podcast mode find speaking highlights?
- [ ] Which domain works best for "general" content?

**Clip Settings:**
- [ ] What's the ideal clip duration for TikTok/Reels (vertical short-form)?
- [ ] What's the ideal clip duration for YouTube Shorts?
- [ ] Does requesting 1 clip give higher quality than requesting 3?

**Features:**
- [ ] Do custom prompts actually improve results?
- [ ] How accurate is person filtering? (% of clips with target person)
- [ ] What types of custom prompts work best?

**Quality Patterns:**
- [ ] Which content types get the best results?
- [ ] Which content types struggle?
- [ ] Are there patterns in "bad" clips? (e.g., always too early, always misses the peak)

---

## Appendix A: Quick Evaluation Form

```
=== ShortSmith Clip Evaluation ===

Video ID: ____________
Domain: ____________
Date: ____________
Evaluator: ____________

CLIP 1 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 2 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

CLIP 3 (___:___ - ___:___)
System Hype Score: ___
Highlight Quality (1-5): ___
Hook Effectiveness (1-5): ___
Domain Match (Y/N/P): ___
Notes: ________________________________

Overall Comments:
________________________________
________________________________
```

---

## Appendix B: Domain-Specific Hook Types Reference

### Sports
| Hook Type | What to Look For |
|-----------|------------------|
| GOAL_MOMENT | Scoring play, basket, touchdown |
| CROWD_ERUPTION | Audience going wild |
| COMMENTATOR_HYPE | Excited commentary voice |
| REPLAY_WORTHY | Impressive athletic move |

### Music
| Hook Type | What to Look For |
|-----------|------------------|
| BEAT_DROP | Bass drop, beat switch |
| CHORUS_HIT | Chorus/hook starts |
| DANCE_PEAK | Peak choreography moment |
| VISUAL_CLIMAX | Visual spectacle |

### Gaming
| Hook Type | What to Look For |
|-----------|------------------|
| CLUTCH_PLAY | Skillful play under pressure |
| ELIMINATION | Kill/win moment |
| RAGE_REACTION | Streamer emotional reaction |
| UNEXPECTED | Surprise/plot twist |

### Vlogs
| Hook Type | What to Look For |
|-----------|------------------|
| REVEAL | Surprise reveal |
| PUNCHLINE | Joke landing |
| EMOTIONAL_MOMENT | Tears/joy/shock |
| CONFRONTATION | Drama/tension |

### Podcasts
| Hook Type | What to Look For |
|-----------|------------------|
| HOT_TAKE | Controversial opinion |
| BIG_LAUGH | Group laughter |
| REVELATION | Surprising information |
| HEATED_DEBATE | Passionate argument |

---

## Appendix C: Sample Test Results Format

```csv
date,video_id,domain,clip_num,start_time,end_time,hype_score,visual_score,audio_score,motion_score,hook_type,human_highlight,human_hook,domain_match,notes
2024-01-15,sports_001,sports,1,45.2,60.2,0.82,0.75,0.88,0.71,GOAL_MOMENT,5,5,Y,Perfect goal celebration
2024-01-15,sports_001,sports,2,120.5,135.5,0.71,0.68,0.74,0.65,CROWD_ERUPTION,4,4,Y,Good crowd reaction
2024-01-15,sports_001,sports,3,200.0,215.0,0.65,0.60,0.70,0.55,PEAK_ENERGY,3,3,P,Decent but not the best moment
```
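
Rows in this format aggregate cleanly with the stdlib `csv` module. A sketch using a trimmed-down subset of the columns (the inline sample data mirrors the rows above):

```python
import csv
import io

# A trimmed subset of the Appendix C columns, inlined for the example.
SAMPLE = """date,video_id,domain,clip_num,hype_score,human_highlight,domain_match
2024-01-15,sports_001,sports,1,0.82,5,Y
2024-01-15,sports_001,sports,2,0.71,4,Y
2024-01-15,sports_001,sports,3,0.65,3,P
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
avg_highlight = sum(int(r["human_highlight"]) for r in rows) / len(rows)
match_rate = sum(r["domain_match"] == "Y" for r in rows) / len(rows)
# avg_highlight is 4.0 here; match_rate counts only full "Y" matches
```

For real use, replace `io.StringIO(SAMPLE)` with an `open(...)` on the exported results file; the same two lines then produce the section 5.2 aggregates.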
+ ```