# ShortSmith v2 - Implementation Plan

## Overview

Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.

## Project Structure
```
shortsmith-v2/
├── app.py                   # Gradio UI (Hugging Face interface)
├── requirements.txt         # Dependencies
├── config.py                # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py            # Centralized logging
│   └── helpers.py           # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py   # FFmpeg video/audio extraction
│   ├── scene_detector.py    # PySceneDetect integration
│   ├── frame_sampler.py     # Hierarchical sampling logic
│   └── clip_extractor.py    # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py   # Qwen2-VL integration
│   ├── audio_analyzer.py    # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py   # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py   # OSNet for body recognition
│   ├── motion_detector.py   # RAFT optical flow
│   └── tracker.py           # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py       # Hype scoring logic
│   └── domain_presets.py    # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py      # Main pipeline coordinator
```
## Implementation Phases

### Phase 1: Core Infrastructure

1. **config.py** - Configuration management
   - Model paths, thresholds, domain presets
   - HuggingFace API key handling
2. **utils/logger.py** - Centralized logging
   - File and console handlers
   - Different log levels per module
   - Timing decorators for performance tracking
3. **utils/helpers.py** - Common utilities
   - File validation
   - Temporary file management
   - Error formatting
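The logger-plus-timing-decorator combination in `utils/logger.py` could look like the sketch below. The names `get_logger` and `timed` are placeholders, not a settled API:

```python
import functools
import logging
import time

def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with a console handler attached exactly once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

def timed(func):
    """Decorator that logs the wall-clock duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logging.getLogger(func.__module__).info(
            "%s took %.2fs", func.__qualname__, elapsed)
        return result
    return wrapper
```

A file handler for persistent logs would be added the same way as the console handler.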
### Phase 2: Video Processing Layer

4. **core/video_processor.py** - FFmpeg operations
   - Extract frames at a specified FPS
   - Extract the audio track
   - Get video metadata (duration, resolution, fps)
   - Cut clips at timestamps
5. **core/scene_detector.py** - Scene boundary detection
   - PySceneDetect integration
   - Content-aware detection
   - Return scene timestamps
6. **core/frame_sampler.py** - Hierarchical sampling
   - First pass: 1 frame per 5-10 seconds
   - Second pass: dense sampling on candidates
   - Dynamic FPS based on motion
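The two-pass sampling strategy reduces to timestamp arithmetic. A plain-Python sketch, where the function names and the window/FPS defaults are illustrative assumptions:

```python
def coarse_timestamps(duration: float, interval: float = 5.0) -> list[float]:
    """First pass: one sample every `interval` seconds across the video."""
    return [t * interval for t in range(int(duration // interval) + 1)]

def dense_timestamps(center: float, duration: float,
                     window: float = 4.0, fps: float = 2.0) -> list[float]:
    """Second pass: sample at `fps` inside a window around a candidate
    moment, clamped to the video bounds."""
    start = max(0.0, center - window / 2)
    end = min(duration, center + window / 2)
    step = 1.0 / fps
    n = int((end - start) / step) + 1
    return [round(start + i * step, 3) for i in range(n)]
```

The coarse timestamps feed the cheap scoring pass; only candidates that score well get the dense second pass, which is what keeps total frame count manageable.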
### Phase 3: AI Models

7. **models/visual_analyzer.py** - Qwen2-VL-2B
   - Load the quantized model
   - Process frame batches
   - Extract visual embeddings/scores
8. **models/audio_analyzer.py** - Audio analysis
   - Librosa for basic features (RMS, spectral flux, centroid)
   - Optional Wav2Vec 2.0 for advanced understanding
   - Return audio hype signals per segment
9. **models/face_recognizer.py** - Face detection/recognition
   - InsightFace SCRFD for detection
   - ArcFace for embeddings
   - Reference image matching
10. **models/body_recognizer.py** - Body recognition
    - OSNet for full-body embeddings
    - Handle non-frontal views
11. **models/motion_detector.py** - Motion analysis
    - RAFT optical flow
    - Motion magnitude scoring
12. **models/tracker.py** - Multi-object tracking
    - ByteTrack integration
    - Maintain identity across frames
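The basic audio features in item 8 boil down to short-time energy and spectral flux. A NumPy-only sketch of what `audio_analyzer.py` computes (in practice Librosa's equivalents would be used; the frame/hop sizes here are common defaults, not requirements):

```python
import numpy as np

def rms_per_frame(signal: np.ndarray, frame_len: int = 2048,
                  hop: int = 512) -> np.ndarray:
    """Root-mean-square energy per frame: a loudness proxy for hype."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return np.sqrt((frames ** 2).mean(axis=1))

def spectral_flux(signal: np.ndarray, frame_len: int = 2048,
                  hop: int = 512) -> np.ndarray:
    """Positive change in magnitude spectrum between successive frames:
    an onset-like signal that spikes on sudden sounds (cheers, hits)."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    mags = np.stack([np.abs(np.fft.rfft(signal[i * hop: i * hop + frame_len]))
                     for i in range(n_frames)])
    diff = np.diff(mags, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)
```

Per-segment hype signals would then be aggregates (mean, max) of these curves over each segment's time range.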
### Phase 4: Scoring & Selection

13. **scoring/domain_presets.py** - Domain configurations
    - Sports, Vlogs, Music, Podcasts presets
    - Custom weight definitions
14. **scoring/hype_scorer.py** - Hype calculation
    - Combine visual + audio scores
    - Apply domain weights
    - Normalize and rank segments
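A minimal sketch of how `domain_presets.py` and `hype_scorer.py` could fit together. The weight values and signal names below are illustrative assumptions, not final tuning:

```python
# Hypothetical preset weights; the real values live in scoring/domain_presets.py
# and would be tuned per domain.
DOMAIN_PRESETS = {
    "sports":   {"visual": 0.4, "audio": 0.4, "motion": 0.2},
    "podcasts": {"visual": 0.2, "audio": 0.7, "motion": 0.1},
}

def hype_score(signals: dict[str, float], domain: str = "sports") -> float:
    """Weighted combination of per-modality scores, each assumed in [0, 1]."""
    weights = DOMAIN_PRESETS[domain]
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

def rank_segments(segments: list[dict], domain: str, top_k: int = 3) -> list[dict]:
    """Score each candidate segment and return the top-k by descending hype."""
    scored = [{**s, "score": hype_score(s["signals"], domain)} for s in segments]
    return sorted(scored, key=lambda s: s["score"], reverse=True)[:top_k]
```

Because the weights in each preset sum to 1 and the inputs are normalized, the combined score stays in [0, 1] without a separate normalization step.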
### Phase 5: Pipeline & UI

15. **pipeline/orchestrator.py** - Main coordinator
    - Coordinate all components
    - Handle errors gracefully
    - Progress reporting
16. **app.py** - Gradio interface
    - Video upload
    - API key input (secure)
    - Prompt/instructions input
    - Domain selection
    - Reference image upload (for person filtering)
    - Progress bar
    - Output video gallery
## Key Design Decisions

### Error Handling Strategy

- Each module has try/except with specific exception types
- Errors bubble up with context
- The pipeline continues with degraded functionality when possible
- User-friendly error messages in the UI
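The "degrade when possible" rule can be captured in a small stage runner inside the orchestrator. `run_stage` and its signature are a sketch, not a settled API:

```python
import logging

def run_stage(name, func, *args, optional=False, fallback=None):
    """Run one pipeline stage. Optional stages (e.g. motion detection)
    degrade to a fallback value instead of aborting the whole run;
    required stages re-raise with added context."""
    try:
        return func(*args)
    except Exception as exc:
        if optional:
            logging.warning("Stage %s failed (%s); continuing in degraded mode.",
                            name, exc)
            return fallback
        raise RuntimeError(f"Stage {name} failed") from exc
```

Wrapping the original exception with `raise ... from exc` preserves the stack trace for the ERROR-level log while letting the UI show only the friendly message.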
### Logging Strategy

- DEBUG: model loading, frame-processing details
- INFO: pipeline stages, timing, results
- WARNING: fallback triggers, degraded mode
- ERROR: failures with stack traces

### Memory Management

- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Clean up temporary files
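The batches-plus-generators pattern keeps only one batch of frames resident at a time. A sketch (between stages one would additionally call `torch.cuda.empty_cache()`, omitted here to keep the example dependency-free):

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield fixed-size batches lazily, so upstream frame decoding
    and downstream model inference never hold the full video in memory."""
    batch: list = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch
        yield batch
```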
### HuggingFace Space Considerations

- Use `gr.State` for session data
- Respect ZeroGPU limits (if using)
- Cache models in `/tmp` or the HF cache
- Handle timeouts gracefully

## API Key Usage

The API key input is for future extensibility (e.g., external services).
For the MVP, all processing is local using open-weight models.

## Gradio UI Layout
```
┌───────────────────────────────────────────────────────────────┐
│ ShortSmith v2 - AI Video Highlight Extractor                  │
├───────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐  ┌─────────────────────────────┐      │
│ │ Upload Video        │  │ Settings                    │      │
│ │ [Drop zone]         │  │ Domain: [Dropdown]          │      │
│ │                     │  │ Clip Duration: [Slider]     │      │
│ └─────────────────────┘  │ Num Clips: [Slider]         │      │
│                          │ API Key: [Password field]   │      │
│ ┌─────────────────────┐  └─────────────────────────────┘      │
│ │ Reference Image     │                                       │
│ │ (Optional)          │  ┌─────────────────────────────┐      │
│ │ [Drop zone]         │  │ Additional Instructions     │      │
│ └─────────────────────┘  │ [Textbox]                   │      │
│                          └─────────────────────────────┘      │
├───────────────────────────────────────────────────────────────┤
│                   [Extract Highlights]                        │
├───────────────────────────────────────────────────────────────┤
│ Progress: [████████████░░░░░░░░] 60%                          │
│ Status: Analyzing audio...                                    │
├───────────────────────────────────────────────────────────────┤
│ Results                                                       │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│ │  Clip 1  │ │  Clip 2  │ │  Clip 3  │                        │
│ │ [Video]  │ │ [Video]  │ │ [Video]  │                        │
│ │ Score:85 │ │ Score:78 │ │ Score:72 │                        │
│ └──────────┘ └──────────┘ └──────────┘                        │
│ [Download All]                                                │
└───────────────────────────────────────────────────────────────┘
```
## Dependencies (requirements.txt)

```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```
## Implementation Order

1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler; Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)

## Notes

- Start with Librosa-only audio for the MVP; add Wav2Vec 2.0 later
- Face/body recognition is optional (triggered by a reference image)
- Motion detection can be skipped in the MVP for speed
- ByteTrack is only needed when person filtering is enabled
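When the optional person filter is active, deciding whether a detected face or body belongs to the person in the reference image reduces to a cosine-similarity check between embeddings. A sketch; the 0.35 threshold is a placeholder and would need tuning against the actual ArcFace/OSNet models:

```python
import numpy as np

def cosine_match(query: np.ndarray, reference: np.ndarray,
                 threshold: float = 0.35) -> bool:
    """Return True when the query embedding is close enough (by cosine
    similarity) to the reference image's embedding to count as a match."""
    sim = float(np.dot(query, reference) /
                (np.linalg.norm(query) * np.linalg.norm(reference)))
    return sim >= threshold
```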