# ShortSmith v2 - Implementation Plan
## Overview
Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.
## Project Structure
```
shortsmith-v2/
├── app.py                 # Gradio UI (Hugging Face interface)
├── requirements.txt       # Dependencies
├── config.py              # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py          # Centralized logging
│   └── helpers.py         # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py # FFmpeg video/audio extraction
│   ├── scene_detector.py  # PySceneDetect integration
│   ├── frame_sampler.py   # Hierarchical sampling logic
│   └── clip_extractor.py  # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py # Qwen2-VL integration
│   ├── audio_analyzer.py  # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py # OSNet for body recognition
│   ├── motion_detector.py # RAFT optical flow
│   └── tracker.py         # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py     # Hype scoring logic
│   └── domain_presets.py  # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py    # Main pipeline coordinator
```
## Implementation Phases
### Phase 1: Core Infrastructure
1. **config.py** - Configuration management
- Model paths, thresholds, domain presets
- HuggingFace API key handling
2. **utils/logger.py** - Centralized logging
- File and console handlers
- Different log levels per module
- Timing decorators for performance tracking
3. **utils/helpers.py** - Common utilities
- File validation
- Temporary file management
- Error formatting
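The timing decorator mentioned for `utils/logger.py` could be sketched as follows (a minimal version; the decorator and logger names are assumptions, not fixed API):

```python
import functools
import logging
import time

logger = logging.getLogger("shortsmith")

def timed(func):
    """Log how long the wrapped function takes, at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.debug("%s took %.3fs", func.__name__, elapsed)
        return result
    return wrapper

@timed
def slow_add(a, b):
    """Toy function to demonstrate the decorator."""
    time.sleep(0.01)
    return a + b
```

Because it preserves the wrapped function's name via `functools.wraps`, the timing lines in the log stay attributable to specific pipeline stages.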
### Phase 2: Video Processing Layer
4. **core/video_processor.py** - FFmpeg operations
- Extract frames at specified FPS
- Extract audio track
- Get video metadata (duration, resolution, fps)
- Cut clips at timestamps
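Clip cutting can be done by shelling out to FFmpeg; a sketch of a command builder (the helper name and flag choices are assumptions, and the real module may use `ffmpeg-python` instead):

```python
def build_clip_cmd(src, start, duration, dst):
    """Build an ffmpeg command that cuts [start, start+duration) from src.

    Placing -ss before -i makes ffmpeg seek on the input (fast), and
    -c copy avoids re-encoding, at the cost of cutting on the nearest
    keyframe rather than the exact timestamp.
    """
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",
        "-i", src,
        "-t", f"{duration:.3f}",
        "-c", "copy",
        dst,
    ]
```

The command list would then be passed to `subprocess.run`.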
5. **core/scene_detector.py** - Scene boundary detection
- PySceneDetect integration
- Content-aware detection
- Return scene timestamps
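PySceneDetect's `detect()` returns a list of scene start/end timecodes; a post-processing sketch that drops very short scenes before they reach the sampler (the minimum length is an assumed, tunable threshold):

```python
MIN_SCENE_SEC = 2.0  # assumed threshold; tune per domain

def filter_scenes(scenes, min_len=MIN_SCENE_SEC):
    """Drop scenes shorter than min_len seconds.

    `scenes` is a list of (start_sec, end_sec) pairs, e.g. converted
    from the timecode pairs PySceneDetect returns.
    """
    return [(s, e) for s, e in scenes if e - s >= min_len]
```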
6. **core/frame_sampler.py** - Hierarchical sampling
- First pass: 1 frame per 5-10 seconds
- Second pass: Dense sampling on candidates
- Dynamic FPS based on motion
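The two-pass sampling logic above can be sketched as timestamp generators (function names are placeholders for illustration):

```python
def coarse_timestamps(duration, step=5.0):
    """First pass: one sample every `step` seconds across the whole video."""
    t, out = 0.0, []
    while t < duration:
        out.append(round(t, 3))
        t += step
    return out

def dense_timestamps(start, end, fps=2.0):
    """Second pass: denser sampling inside a candidate segment."""
    step = 1.0 / fps
    t, out = start, []
    while t < end:
        out.append(round(t, 3))
        t += step
    return out
```

For the dynamic-FPS variant, `fps` would be scaled by the motion score of the segment rather than fixed at 2.0.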
### Phase 3: AI Models
7. **models/visual_analyzer.py** - Qwen2-VL-2B
- Load quantized model
- Process frame batches
- Extract visual embeddings/scores
8. **models/audio_analyzer.py** - Audio analysis
- Librosa for basic features (RMS, spectral flux, centroid)
- Optional Wav2Vec 2.0 for advanced understanding
- Return audio hype signals per segment
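Librosa provides these features directly (e.g. `librosa.feature.rms`); to make the signals concrete, a NumPy-only sketch of the two simplest ones:

```python
import numpy as np

def frame_rms(y, frame_len=2048, hop=512):
    """Root-mean-square energy per frame (loudness proxy)."""
    frames = [y[i:i + frame_len]
              for i in range(0, len(y) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def spectral_flux(spec):
    """Positive frame-to-frame change in magnitude spectra (onset proxy).

    `spec` is a (frames, bins) magnitude spectrogram.
    """
    diff = np.diff(spec, axis=0)
    return np.sum(np.clip(diff, 0, None), axis=1)
```

Spikes in either signal (crowd roars, beat drops, laughter) are the raw material for the audio hype score.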
9. **models/face_recognizer.py** - Face detection/recognition
- InsightFace SCRFD for detection
- ArcFace for embeddings
- Reference image matching
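Reference matching reduces to comparing embeddings; a sketch using cosine similarity (the 0.4 threshold is an assumed starting point, not a tuned value):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_reference(face_emb, ref_embs, threshold=0.4):
    """True if a detected face embedding matches any reference embedding.

    ArcFace embeddings are typically compared with cosine similarity;
    the threshold would be tuned on real data.
    """
    return any(cosine_sim(face_emb, r) >= threshold for r in ref_embs)
```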
10. **models/body_recognizer.py** - Body recognition
- OSNet for full-body embeddings
- Handle non-frontal views
11. **models/motion_detector.py** - Motion analysis
- RAFT optical flow
- Motion magnitude scoring
12. **models/tracker.py** - Multi-object tracking
- ByteTrack integration
- Maintain identity across frames
### Phase 4: Scoring & Selection
13. **scoring/domain_presets.py** - Domain configurations
- Sports, Vlogs, Music, Podcasts presets
- Custom weight definitions
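A preset could be a plain dict of signal weights; the signal names and values below are illustrative assumptions, to be tuned empirically:

```python
# Assumed signal names and weights; real values need empirical tuning.
DOMAIN_PRESETS = {
    "sports":   {"visual": 0.30, "audio": 0.30, "motion": 0.40},
    "vlogs":    {"visual": 0.40, "audio": 0.40, "motion": 0.20},
    "music":    {"visual": 0.25, "audio": 0.60, "motion": 0.15},
    "podcasts": {"visual": 0.20, "audio": 0.70, "motion": 0.10},
}
```

Keeping presets as data (rather than code) makes custom weight definitions a matter of adding one dict entry.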
14. **scoring/hype_scorer.py** - Hype calculation
- Combine visual + audio scores
- Apply domain weights
- Normalize and rank segments
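The combine/weight/normalize/rank steps above can be sketched in a few lines (function name and data shapes are assumptions for illustration):

```python
def rank_segments(segments, weights):
    """Score segments by a weighted sum of signals, scale to 0-100, rank.

    `segments` maps segment id -> {signal_name: raw score in [0, 1]};
    `weights` maps signal_name -> weight (e.g. a domain preset).
    """
    raw = {
        seg: sum(weights.get(sig, 0.0) * val for sig, val in signals.items())
        for seg, signals in segments.items()
    }
    top = max(raw.values()) or 1.0  # avoid division by zero
    scored = {seg: round(100 * v / top) for seg, v in raw.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```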
### Phase 5: Pipeline & UI
15. **pipeline/orchestrator.py** - Main coordinator
- Coordinate all components
- Handle errors gracefully
- Progress reporting
16. **app.py** - Gradio interface
- Video upload
- API key input (secure)
- Prompt/instructions input
- Domain selection
- Reference image upload (for person filtering)
- Progress bar
- Output video gallery
## Key Design Decisions
### Error Handling Strategy
- Each module has try/except with specific exception types
- Errors bubble up with context
- Pipeline continues with degraded functionality when possible
- User-friendly error messages in UI
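The "continue with degraded functionality" rule could be centralized in one wrapper (a sketch; the function name and signature are assumptions):

```python
import logging

logger = logging.getLogger("shortsmith")

def run_optional_stage(stage_fn, fallback, stage_name):
    """Run a non-essential pipeline stage; fall back instead of failing.

    E.g. if motion detection fails, return neutral motion scores so the
    rest of the pipeline can continue in degraded mode.
    """
    try:
        return stage_fn()
    except Exception:
        logger.warning("%s failed, using fallback", stage_name, exc_info=True)
        return fallback
```

Essential stages (video decoding, scoring) would not go through this wrapper; their errors should bubble up to the UI with context.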
### Logging Strategy
- DEBUG: Model loading, frame processing details
- INFO: Pipeline stages, timing, results
- WARNING: Fallback triggers, degraded mode
- ERROR: Failures with stack traces
### Memory Management
- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Temporary file cleanup
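Batching plus generators keeps only one batch of frames in memory at a time; a minimal sketch:

```python
def batched(items, batch_size):
    """Yield fixed-size batches so frames never all sit in memory at once."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

Between stages, GPU memory would additionally be released (e.g. dropping model references and calling `torch.cuda.empty_cache()`).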
### HuggingFace Space Considerations
- Use `gr.State` for session data
- Respect ZeroGPU limits (if using)
- Cache models in `/tmp` or HF cache
- Handle timeouts gracefully
## API Key Usage
The API key input is for future extensibility (e.g., external services).
For MVP, all processing is local using open-weight models.
## Gradio UI Layout
```
┌───────────────────────────────────────────────────────────────┐
│  ShortSmith v2 - AI Video Highlight Extractor                 │
├───────────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐   ┌─────────────────────────────┐    │
│  │ Upload Video        │   │ Settings                    │    │
│  │ [Drop zone]         │   │ Domain: [Dropdown]          │    │
│  │                     │   │ Clip Duration: [Slider]     │    │
│  └─────────────────────┘   │ Num Clips: [Slider]         │    │
│                            │ API Key: [Password field]   │    │
│  ┌─────────────────────┐   └─────────────────────────────┘    │
│  │ Reference Image     │                                      │
│  │ (Optional)          │   ┌─────────────────────────────┐    │
│  │ [Drop zone]         │   │ Additional Instructions     │    │
│  └─────────────────────┘   │ [Textbox]                   │    │
│                            └─────────────────────────────┘    │
├───────────────────────────────────────────────────────────────┤
│                  [🚀 Extract Highlights]                      │
├───────────────────────────────────────────────────────────────┤
│  Progress: [████████████░░░░░░░░] 60%                         │
│  Status: Analyzing audio...                                   │
├───────────────────────────────────────────────────────────────┤
│  Results                                                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                     │
│  │ Clip 1   │  │ Clip 2   │  │ Clip 3   │                     │
│  │ [Video]  │  │ [Video]  │  │ [Video]  │                     │
│  │ Score:85 │  │ Score:78 │  │ Score:72 │                     │
│  └──────────┘  └──────────┘  └──────────┘                     │
│  [Download All]                                               │
└───────────────────────────────────────────────────────────────┘
```
## Dependencies (requirements.txt)
```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```
## Implementation Order
1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler, Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)
## Notes
- Start with Librosa-only audio (MVP), add Wav2Vec later
- Face/body recognition is optional (triggered by reference image)
- Motion detection can be skipped in MVP for speed
- ByteTrack only needed when person filtering is enabled