# ShortSmith v2 - Implementation Plan

## Overview
Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.

## Project Structure
```
shortsmith-v2/
β”œβ”€β”€ app.py                    # Gradio UI (Hugging Face interface)
β”œβ”€β”€ requirements.txt          # Dependencies
β”œβ”€β”€ config.py                 # Configuration and constants
β”œβ”€β”€ utils/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ logger.py             # Centralized logging
β”‚   └── helpers.py            # Utility functions
β”œβ”€β”€ core/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ video_processor.py    # FFmpeg video/audio extraction
β”‚   β”œβ”€β”€ scene_detector.py     # PySceneDetect integration
β”‚   β”œβ”€β”€ frame_sampler.py      # Hierarchical sampling logic
β”‚   └── clip_extractor.py     # Final clip cutting
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ visual_analyzer.py    # Qwen2-VL integration
β”‚   β”œβ”€β”€ audio_analyzer.py     # Wav2Vec 2.0 + Librosa
β”‚   β”œβ”€β”€ face_recognizer.py    # InsightFace (SCRFD + ArcFace)
β”‚   β”œβ”€β”€ body_recognizer.py    # OSNet for body recognition
β”‚   β”œβ”€β”€ motion_detector.py    # RAFT optical flow
β”‚   └── tracker.py            # ByteTrack integration
β”œβ”€β”€ scoring/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ hype_scorer.py        # Hype scoring logic
β”‚   └── domain_presets.py     # Domain-specific weights
└── pipeline/
    β”œβ”€β”€ __init__.py
    └── orchestrator.py       # Main pipeline coordinator
```

## Implementation Phases

### Phase 1: Core Infrastructure
1. **config.py** - Configuration management
   - Model paths, thresholds, domain presets
   - HuggingFace API key handling

2. **utils/logger.py** - Centralized logging
   - File and console handlers
   - Different log levels per module
   - Timing decorators for performance tracking
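
A timing decorator of the kind described above can be a thin stdlib-only wrapper; this is a minimal sketch (the logger name `shortsmith` is an assumption):

```python
import functools
import logging
import time

logger = logging.getLogger("shortsmith")  # assumed project-wide logger name

def timed(func):
    """Log how long the wrapped function takes (illustrative sketch)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # log even when the stage raises, so failures still report timing
            logger.info("%s took %.2fs", func.__name__, time.perf_counter() - start)
    return wrapper

@timed
def example_stage():
    return "done"
```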

3. **utils/helpers.py** - Common utilities
   - File validation
   - Temporary file management
   - Error formatting

### Phase 2: Video Processing Layer
4. **core/video_processor.py** - FFmpeg operations
   - Extract frames at specified FPS
   - Extract audio track
   - Get video metadata (duration, resolution, fps)
   - Cut clips at timestamps
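
The FFmpeg operations above reduce to building argument lists; a sketch using the stdlib `subprocess` (assuming `ffmpeg`/`ffprobe` are on PATH — the plan's `ffmpeg-python` dependency wraps the same calls, and `build_clip_cmd` / `probe_metadata` are hypothetical helper names):

```python
import json
import subprocess

def build_clip_cmd(src: str, start: float, duration: float, out: str) -> list[str]:
    """ffmpeg arguments to cut a clip; stream copy avoids re-encoding."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",   # seeking before -i gives fast keyframe-aligned cuts
        "-i", src,
        "-t", f"{duration:.3f}",
        "-c", "copy",
        out,
    ]

def probe_metadata(src: str) -> dict:
    """Parsed ffprobe JSON (duration, streams); requires ffprobe on PATH."""
    result = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", src],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```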

5. **core/scene_detector.py** - Scene boundary detection
   - PySceneDetect integration
   - Content-aware detection
   - Return scene timestamps
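
PySceneDetect's `detect()` returns `(start, end)` FrameTimecode pairs; the module mainly needs to convert those to seconds and drop degenerate scenes. A sketch (the 1-second minimum length is an illustrative threshold, not a tuned value):

```python
def scenes_to_seconds(scene_list, min_len: float = 1.0) -> list[tuple[float, float]]:
    """Convert (start, end) FrameTimecode pairs to float-second spans,
    dropping scenes shorter than min_len."""
    spans = []
    for start, end in scene_list:
        s, e = start.get_seconds(), end.get_seconds()
        if e - s >= min_len:
            spans.append((s, e))
    return spans

# Typical usage (requires `scenedetect[opencv]`):
# from scenedetect import detect, ContentDetector
# spans = scenes_to_seconds(detect("input.mp4", ContentDetector()))
```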

6. **core/frame_sampler.py** - Hierarchical sampling
   - First pass: 1 frame per 5-10 seconds
   - Second pass: Dense sampling on candidates
   - Dynamic FPS based on motion

### Phase 3: AI Models
7. **models/visual_analyzer.py** - Qwen2-VL-2B
   - Load quantized model
   - Process frame batches
   - Extract visual embeddings/scores

8. **models/audio_analyzer.py** - Audio analysis
   - Librosa for basic features (RMS, spectral flux, centroid)
   - Optional Wav2Vec 2.0 for advanced understanding
   - Return audio hype signals per segment
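
Librosa computes these features directly (`librosa.feature.rms`, `librosa.onset.onset_strength`); a NumPy-only sketch of the per-segment RMS signal shows the shape of the output the scorer consumes (the 1-second segment length is an assumption):

```python
import numpy as np

def rms_per_segment(y: np.ndarray, sr: int, seg_len: float = 1.0) -> np.ndarray:
    """RMS energy per fixed-length segment; louder segments score higher.
    (librosa.feature.rms gives a frame-level version of the same signal.)"""
    hop = int(sr * seg_len)
    n = len(y) // hop
    segs = y[: n * hop].reshape(n, hop)  # drop the trailing partial segment
    return np.sqrt((segs ** 2).mean(axis=1))
```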

9. **models/face_recognizer.py** - Face detection/recognition
   - InsightFace SCRFD for detection
   - ArcFace for embeddings
   - Reference image matching
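
With InsightFace, each detected face carries a `normed_embedding`, so reference matching reduces to cosine similarity; a sketch (the 0.4 threshold is an illustrative starting point, not a tuned ArcFace value):

```python
import numpy as np

def is_same_person(ref_emb: np.ndarray, face_emb: np.ndarray,
                   threshold: float = 0.4) -> bool:
    """Cosine similarity between L2-normalized ArcFace embeddings;
    the dot product suffices because both vectors are unit-norm."""
    return float(np.dot(ref_emb, face_emb)) >= threshold

# Typical usage (requires `insightface`):
# from insightface.app import FaceAnalysis
# app = FaceAnalysis(); app.prepare(ctx_id=0)
# ref_emb = app.get(ref_image)[0].normed_embedding
```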

10. **models/body_recognizer.py** - Body recognition
    - OSNet for full-body embeddings
    - Handle non-frontal views

11. **models/motion_detector.py** - Motion analysis
    - RAFT optical flow
    - Motion magnitude scoring

12. **models/tracker.py** - Multi-object tracking
    - ByteTrack integration
    - Maintain identity across frames

### Phase 4: Scoring & Selection
13. **scoring/domain_presets.py** - Domain configurations
    - Sports, Vlogs, Music, Podcasts presets
    - Custom weight definitions
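
The presets can be plain dicts of signal weights that sum to 1; these values are assumed placeholders to be tuned empirically, not measured defaults:

```python
# Illustrative preset weights (assumed values, to be tuned empirically).
# Each preset weights the visual, audio, and motion hype signals.
DOMAIN_PRESETS = {
    "sports":   {"visual": 0.35, "audio": 0.30, "motion": 0.35},
    "vlogs":    {"visual": 0.45, "audio": 0.40, "motion": 0.15},
    "music":    {"visual": 0.25, "audio": 0.60, "motion": 0.15},
    "podcasts": {"visual": 0.20, "audio": 0.70, "motion": 0.10},
}
```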

14. **scoring/hype_scorer.py** - Hype calculation
    - Combine visual + audio scores
    - Apply domain weights
    - Normalize and rank segments
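
The combine-weight-rank step above can be sketched in a few lines, assuming each segment dict carries normalized [0, 1] scores per signal (`score_segments` is a hypothetical helper name):

```python
def score_segments(segments: list[dict], weights: dict[str, float],
                   top_k: int = 3) -> list[dict]:
    """Weighted sum of per-signal scores, ranked descending; missing
    signals (e.g. motion skipped in the MVP) default to 0."""
    for seg in segments:
        seg["hype"] = sum(weights[k] * seg.get(k, 0.0) for k in weights)
    return sorted(segments, key=lambda s: s["hype"], reverse=True)[:top_k]
```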

### Phase 5: Pipeline & UI
15. **pipeline/orchestrator.py** - Main coordinator
    - Coordinate all components
    - Handle errors gracefully
    - Progress reporting
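
The graceful-degradation behavior can be expressed as a stage runner that skips failed optional stages and re-raises failed required ones; a sketch with hypothetical helper names:

```python
import logging
from typing import Callable

logger = logging.getLogger("shortsmith")  # assumed project-wide logger name

def run_stages(stages: list[tuple[str, Callable]], required: set[str]) -> dict:
    """Run pipeline stages in order; optional stages that fail are logged
    and skipped (result None), required ones re-raise with a stack trace."""
    results = {}
    for name, fn in stages:
        try:
            results[name] = fn()
        except Exception:
            if name in required:
                logger.exception("required stage %s failed", name)
                raise
            logger.warning("optional stage %s failed; continuing", name)
            results[name] = None
    return results
```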

16. **app.py** - Gradio interface
    - Video upload
    - API key input (secure)
    - Prompt/instructions input
    - Domain selection
    - Reference image upload (for person filtering)
    - Progress bar
    - Output video gallery

## Key Design Decisions

### Error Handling Strategy
- Each module has try/except with specific exception types
- Errors bubble up with context
- Pipeline continues with degraded functionality when possible
- User-friendly error messages in UI

### Logging Strategy
- DEBUG: Model loading, frame processing details
- INFO: Pipeline stages, timing, results
- WARNING: Fallback triggers, degraded mode
- ERROR: Failures with stack traces

### Memory Management
- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Temporary file cleanup
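
The batch-plus-generator approach above is a small stdlib pattern; between stages, dropping model references and calling `torch.cuda.empty_cache()` handles the GPU side. A sketch of the lazy batcher:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield fixed-size batches lazily so frames never sit in memory all at once."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch
```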

### HuggingFace Space Considerations
- Use `gr.State` for session data
- Respect ZeroGPU time limits (if the Space runs on ZeroGPU hardware)
- Cache models in `/tmp` or HF cache
- Handle timeouts gracefully

## API Key Usage
The API key field exists for future extensibility (e.g., calling external services).
In the MVP, all processing runs locally on open-weight models.

## Gradio UI Layout
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  ShortSmith v2 - AI Video Highlight Extractor               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Upload Video        β”‚  β”‚ Settings                    β”‚   β”‚
β”‚  β”‚ [Drop zone]         β”‚  β”‚ Domain: [Dropdown]          β”‚   β”‚
β”‚  β”‚                     β”‚  β”‚ Clip Duration: [Slider]     β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ Num Clips: [Slider]         β”‚   β”‚
β”‚                           β”‚ API Key: [Password field]   β”‚   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚  β”‚ Reference Image     β”‚                                    β”‚
β”‚  β”‚ (Optional)          β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ [Drop zone]         β”‚  β”‚ Additional Instructions     β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ [Textbox]                   β”‚   β”‚
β”‚                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  [πŸš€ Extract Highlights]                                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Progress: [β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‘β–‘β–‘β–‘β–‘β–‘β–‘β–‘] 60%                       β”‚
β”‚  Status: Analyzing audio...                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Results                                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”‚
β”‚  β”‚ Clip 1   β”‚ β”‚ Clip 2   β”‚ β”‚ Clip 3   β”‚                    β”‚
β”‚  β”‚ [Video]  β”‚ β”‚ [Video]  β”‚ β”‚ [Video]  β”‚                    β”‚
β”‚  β”‚ Score:85 β”‚ β”‚ Score:78 β”‚ β”‚ Score:72 β”‚                    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
β”‚  [Download All]                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Dependencies (requirements.txt)
```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```

## Implementation Order
1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler, Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)

## Notes
- Start with Librosa-only audio for the MVP; add Wav2Vec 2.0 later
- Face/body recognition is optional (triggered by reference image)
- Motion detection can be skipped in MVP for speed
- ByteTrack only needed when person filtering is enabled