# ShortSmith v2 - Implementation Plan

## Overview

Build a Hugging Face Space that extracts "hype" moments from videos, with optional person-specific filtering.

## Project Structure

```
shortsmith-v2/
├── app.py                    # Gradio UI (Hugging Face interface)
├── requirements.txt          # Dependencies
├── config.py                 # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py             # Centralized logging
│   └── helpers.py            # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py    # FFmpeg video/audio extraction
│   ├── scene_detector.py     # PySceneDetect integration
│   ├── frame_sampler.py      # Hierarchical sampling logic
│   └── clip_extractor.py     # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py    # Qwen2-VL integration
│   ├── audio_analyzer.py     # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py    # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py    # OSNet for body recognition
│   ├── motion_detector.py    # RAFT optical flow
│   └── tracker.py            # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py        # Hype scoring logic
│   └── domain_presets.py     # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py       # Main pipeline coordinator
```

## Implementation Phases

### Phase 1: Core Infrastructure

1. **config.py** - Configuration management
   - Model paths, thresholds, domain presets
   - HuggingFace API key handling
2. **utils/logger.py** - Centralized logging
   - File and console handlers
   - Different log levels per module
   - Timing decorators for performance tracking
3. **utils/helpers.py** - Common utilities
   - File validation
   - Temporary file management
   - Error formatting

### Phase 2: Video Processing Layer

4. **core/video_processor.py** - FFmpeg operations
   - Extract frames at a specified FPS
   - Extract the audio track
   - Get video metadata (duration, resolution, fps)
   - Cut clips at timestamps
5. **core/scene_detector.py** - Scene boundary detection
   - PySceneDetect integration
   - Content-aware detection
   - Return scene timestamps
6. **core/frame_sampler.py** - Hierarchical sampling
   - First pass: 1 frame per 5-10 seconds
   - Second pass: dense sampling on candidates
   - Dynamic FPS based on motion

### Phase 3: AI Models

7. **models/visual_analyzer.py** - Qwen2-VL-2B
   - Load the quantized model
   - Process frame batches
   - Extract visual embeddings/scores
8. **models/audio_analyzer.py** - Audio analysis
   - Librosa for basic features (RMS, spectral flux, centroid)
   - Optional Wav2Vec 2.0 for advanced understanding
   - Return audio hype signals per segment
9. **models/face_recognizer.py** - Face detection/recognition
   - InsightFace SCRFD for detection
   - ArcFace for embeddings
   - Reference image matching
10. **models/body_recognizer.py** - Body recognition
    - OSNet for full-body embeddings
    - Handle non-frontal views
11. **models/motion_detector.py** - Motion analysis
    - RAFT optical flow
    - Motion magnitude scoring
12. **models/tracker.py** - Multi-object tracking
    - ByteTrack integration
    - Maintain identities across frames

### Phase 4: Scoring & Selection

13. **scoring/domain_presets.py** - Domain configurations
    - Sports, Vlogs, Music, Podcasts presets
    - Custom weight definitions
14. **scoring/hype_scorer.py** - Hype calculation
    - Combine visual + audio scores
    - Apply domain weights
    - Normalize and rank segments

### Phase 5: Pipeline & UI

15. **pipeline/orchestrator.py** - Main coordinator
    - Coordinate all components
    - Handle errors gracefully
    - Progress reporting
16. **app.py** - Gradio interface
    - Video upload
    - API key input (secure)
    - Prompt/instructions input
    - Domain selection
    - Reference image upload (for person filtering)
    - Progress bar
    - Output video gallery

## Key Design Decisions

### Error Handling Strategy

- Each module uses try/except with specific exception types
- Errors bubble up with context
- The pipeline continues with degraded functionality when possible
- User-friendly error messages in the UI

### Logging Strategy

- DEBUG: model loading, frame processing details
- INFO: pipeline stages, timing, results
- WARNING: fallback triggers, degraded mode
- ERROR: failures with stack traces

### Memory Management

- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Clean up temporary files

### HuggingFace Space Considerations

- Use `gr.State` for session data
- Respect ZeroGPU limits (if used)
- Cache models in `/tmp` or the HF cache
- Handle timeouts gracefully

## API Key Usage

The API key input is for future extensibility (e.g., external services). For the MVP, all processing is local using open-weight models.
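The logging strategy above could be sketched in `utils/logger.py` roughly as follows. This is a minimal stdlib-only sketch, not settled API: the factory name `get_logger` and the `timed` decorator are assumptions.

```python
# Hypothetical sketch of utils/logger.py (names are assumptions).
import functools
import logging
import time


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with a console handler, attached once per module."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(level)
    return logger


def timed(logger: logging.Logger):
    """Decorator that logs the wall-clock duration of the wrapped function at INFO."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                logger.info("%s took %.2fs", fn.__name__, time.perf_counter() - start)
        return wrapper
    return decorator
```

A module would use it as `log = get_logger("core.video_processor")` and decorate expensive stages with `@timed(log)`, giving the per-stage timing described under "Logging Strategy" without touching the stage's return value.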
## Gradio UI Layout

```
┌─────────────────────────────────────────────────────────────┐
│ ShortSmith v2 - AI Video Highlight Extractor                │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐   ┌─────────────────────────────┐   │
│ │ Upload Video        │   │ Settings                    │   │
│ │ [Drop zone]         │   │ Domain: [Dropdown]          │   │
│ │                     │   │ Clip Duration: [Slider]     │   │
│ └─────────────────────┘   │ Num Clips: [Slider]         │   │
│                           │ API Key: [Password field]   │   │
│ ┌─────────────────────┐   └─────────────────────────────┘   │
│ │ Reference Image     │                                     │
│ │ (Optional)          │   ┌─────────────────────────────┐   │
│ │ [Drop zone]         │   │ Additional Instructions     │   │
│ └─────────────────────┘   │ [Textbox]                   │   │
│                           └─────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│ [🚀 Extract Highlights]                                     │
├─────────────────────────────────────────────────────────────┤
│ Progress: [████████████░░░░░░░░] 60%                        │
│ Status: Analyzing audio...                                  │
├─────────────────────────────────────────────────────────────┤
│ Results                                                     │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐                      │
│ │ Clip 1   │ │ Clip 2   │ │ Clip 3   │                      │
│ │ [Video]  │ │ [Video]  │ │ [Video]  │                      │
│ │ Score:85 │ │ Score:78 │ │ Score:72 │                      │
│ └──────────┘ └──────────┘ └──────────┘                      │
│ [Download All]                                              │
└─────────────────────────────────────────────────────────────┘
```

## Dependencies (requirements.txt)

```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```

## Implementation Order

1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler; Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)

## Notes

- Start with Librosa-only audio (MVP); add Wav2Vec 2.0 later
- Face/body recognition is optional (triggered by a reference image)
- Motion detection can be skipped in the MVP for speed
- ByteTrack is only needed when person filtering is enabled
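The FFmpeg operations listed for `core/video_processor.py` in Phase 2 could be sketched as plain command builders, to be executed via `subprocess` (or wrapped with `ffmpeg-python`). The function names, default values, and output patterns below are assumptions, not the final interface:

```python
# Hypothetical sketch for core/video_processor.py: build FFmpeg argument
# lists for frame and audio extraction. The real module would run these
# with subprocess.run(cmd, check=True).

def frame_extract_cmd(video_path, out_pattern, fps=1.0):
    """FFmpeg command to dump frames at a fixed rate,
    e.g. out_pattern='frames/%06d.jpg'."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",      # sample at the requested frame rate
        "-q:v", "2",              # high-quality JPEG output
        out_pattern,
    ]


def audio_extract_cmd(video_path, wav_path, sample_rate=16000):
    """FFmpeg command to extract a mono WAV track
    (16 kHz is a typical input rate for speech/audio models)."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        wav_path,
    ]
```

Keeping command construction separate from execution makes the module easy to unit-test without a real video file.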
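The Phase 4 scoring step (combine visual + audio scores, apply domain weights, normalize, rank) can be sketched as below. The preset weights and the `rank_segments` signature are illustrative assumptions; the real `scoring/domain_presets.py` would define them per domain:

```python
# Hypothetical sketch of scoring/hype_scorer.py. Example weights only;
# real presets would be tuned per domain in scoring/domain_presets.py.

DOMAIN_PRESETS = {
    "sports":   {"visual": 0.6, "audio": 0.4},  # action is mostly visual
    "podcasts": {"visual": 0.2, "audio": 0.8},  # energy is mostly vocal
}


def _normalize(values):
    """Min-max normalize to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def rank_segments(segments, domain="sports", top_k=3):
    """segments: dicts with 'start', 'end', 'visual', 'audio' keys.
    Returns the top_k segments by weighted hype score, descending."""
    w = DOMAIN_PRESETS[domain]
    visual = _normalize([s["visual"] for s in segments])
    audio = _normalize([s["audio"] for s in segments])
    scored = [
        {**s, "hype": w["visual"] * v + w["audio"] * a}
        for s, v, a in zip(segments, visual, audio)
    ]
    scored.sort(key=lambda s: s["hype"], reverse=True)
    return scored[:top_k]
```

Normalizing each signal before weighting keeps a loud-but-static podcast segment and a quiet-but-frantic sports play on comparable scales, so the domain weights alone decide which dominates.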