ShortSmith v2 - Implementation Plan
Overview
Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.
Project Structure
shortsmith-v2/
├── app.py                 # Gradio UI (Hugging Face interface)
├── requirements.txt       # Dependencies
├── config.py              # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py          # Centralized logging
│   └── helpers.py         # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py # FFmpeg video/audio extraction
│   ├── scene_detector.py  # PySceneDetect integration
│   ├── frame_sampler.py   # Hierarchical sampling logic
│   └── clip_extractor.py  # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py # Qwen2-VL integration
│   ├── audio_analyzer.py  # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py # OSNet for body recognition
│   ├── motion_detector.py # RAFT optical flow
│   └── tracker.py         # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py     # Hype scoring logic
│   └── domain_presets.py  # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py    # Main pipeline coordinator
Implementation Phases
Phase 1: Core Infrastructure
config.py - Configuration management
- Model paths, thresholds, domain presets
- HuggingFace API key handling
utils/logger.py - Centralized logging
- File and console handlers
- Different log levels per module
- Timing decorators for performance tracking
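The timing decorator can be a small stdlib-only helper. A sketch (the logger name and the decorated function are illustrative, not part of the plan):

```python
import functools
import logging
import time

logger = logging.getLogger("shortsmith.utils")  # illustrative logger name

def timed(func):
    """Log the wrapped function's wall-clock duration at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        logger.debug("%s took %.2fs", func.__name__, time.perf_counter() - start)
        return result
    return wrapper

@timed
def extract_audio():  # hypothetical stage, only here to demo the decorator
    return "audio.wav"
```

functools.wraps preserves the wrapped function's name and docstring, so the log lines and tracebacks stay readable.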
utils/helpers.py - Common utilities
- File validation
- Temporary file management
- Error formatting
Phase 2: Video Processing Layer
core/video_processor.py - FFmpeg operations
- Extract frames at specified FPS
- Extract audio track
- Get video metadata (duration, resolution, fps)
- Cut clips at timestamps
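One way to wrap these FFmpeg operations is to build argument lists and hand them to subprocess. A sketch (the function names are placeholders; the flags are standard FFmpeg options):

```python
def frame_extract_cmd(video_path: str, fps: float, out_pattern: str) -> list:
    """ffmpeg command that samples frames at a fixed rate.

    out_pattern is an image sequence pattern, e.g. 'frames/%06d.jpg'."""
    return ["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}", out_pattern]

def cut_clip_cmd(video_path: str, start_s: float, duration_s: float,
                 out_path: str) -> list:
    """ffmpeg command that cuts a clip at a timestamp.

    Placing -ss before -i seeks quickly; re-encoding with libx264/aac keeps
    the cut frame-accurate (stream-copy cuts snap to keyframes)."""
    return ["ffmpeg", "-y", "-ss", f"{start_s:.3f}", "-i", video_path,
            "-t", f"{duration_s:.3f}", "-c:v", "libx264", "-c:a", "aac",
            out_path]
```

At runtime each list would be executed with something like `subprocess.run(cut_clip_cmd(...), check=True)`.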
core/scene_detector.py - Scene boundary detection
- PySceneDetect integration
- Content-aware detection
- Return scene timestamps
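With PySceneDetect's 0.6 API this module can be a thin wrapper; a sketch (the threshold and minimum scene length are illustrative defaults, not tuned values):

```python
def detect_scenes(video_path: str, threshold: float = 27.0,
                  min_len_s: float = 1.0):
    """Return (start_s, end_s) spans from content-aware scene detection."""
    from scenedetect import detect, ContentDetector  # deferred: heavy import
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    spans = [(s.get_seconds(), e.get_seconds()) for s, e in scenes]
    return drop_short(spans, min_len_s)

def drop_short(spans, min_len_s: float):
    """Discard scenes shorter than min_len_s; they rarely hold a usable clip."""
    return [(a, b) for a, b in spans if b - a >= min_len_s]
```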
core/frame_sampler.py - Hierarchical sampling
- First pass: 1 frame per 5-10 seconds
- Second pass: Dense sampling on candidates
- Dynamic FPS based on motion
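The two-pass logic reduces to timestamp arithmetic. A sketch (the step, window, and FPS defaults follow the ranges above but are otherwise illustrative):

```python
def coarse_timestamps(duration_s: float, step_s: float = 7.5):
    """First pass: one probe frame every step_s seconds (the 5-10 s range)."""
    t, out = 0.0, []
    while t < duration_s:
        out.append(round(t, 3))
        t += step_s
    return out

def dense_timestamps(center_s: float, duration_s: float,
                     window_s: float = 4.0, fps: float = 4.0):
    """Second pass: sample densely in a window around a candidate moment."""
    start = max(0.0, center_s - window_s / 2)
    end = min(duration_s, center_s + window_s / 2)
    n = int((end - start) * fps)
    return [round(start + i / fps, 3) for i in range(n + 1)]
```

Dynamic FPS would then just mean passing a higher `fps` to the second pass when the motion score is high.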
Phase 3: AI Models
models/visual_analyzer.py - Qwen2-VL-2B
- Load quantized model
- Process frame batches
- Extract visual embeddings/scores
models/audio_analyzer.py - Audio analysis
- Librosa for basic features (RMS, spectral flux, centroid)
- Optional Wav2Vec 2.0 for advanced understanding
- Return audio hype signals per segment
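Librosa exposes these features directly (`librosa.feature.rms`, `librosa.feature.spectral_centroid`, and `librosa.onset.onset_strength` for a flux-style envelope). A NumPy-only sketch of the two cheapest signals, to show what the module computes:

```python
import numpy as np

def frame_signal(y: np.ndarray, frame: int = 2048, hop: int = 512) -> np.ndarray:
    """Slice a mono waveform into overlapping analysis frames."""
    n = 1 + max(0, len(y) - frame) // hop
    return np.stack([y[i * hop : i * hop + frame] for i in range(n)])

def rms_energy(y, frame=2048, hop=512):
    """Per-frame RMS energy: a loudness proxy (crowd roars, shouting)."""
    return np.sqrt((frame_signal(y, frame, hop) ** 2).mean(axis=1))

def spectral_flux(y, frame=2048, hop=512):
    """Positive change in the magnitude spectrum between frames:
    a proxy for sudden sonic events (hits, cheers, drops)."""
    mags = np.abs(np.fft.rfft(frame_signal(y, frame, hop) * np.hanning(frame),
                              axis=1))
    return np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)
```

Per-segment hype signals would then be aggregates (mean, peak) of these arrays over each scene's time span.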
models/face_recognizer.py - Face detection/recognition
- InsightFace SCRFD for detection
- ArcFace for embeddings
- Reference image matching
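Once SCRFD has detected faces and ArcFace has produced embeddings, matching against the reference image is a cosine-similarity check. A sketch (the 0.35 threshold is an illustrative placeholder, not a tuned value):

```python
import numpy as np

def match_reference(face_embs, ref_emb, threshold: float = 0.35):
    """Cosine similarity of each detected-face embedding to the reference.

    Returns (similarities, boolean mask of faces that match the reference)."""
    ref = np.asarray(ref_emb, dtype=float)
    ref = ref / np.linalg.norm(ref)
    sims = np.array([np.dot(e, ref) / np.linalg.norm(e) for e in face_embs])
    return sims, sims >= threshold
```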
models/body_recognizer.py - Body recognition
- OSNet for full-body embeddings
- Handle non-frontal views
models/motion_detector.py - Motion analysis
- RAFT optical flow
- Motion magnitude scoring
models/tracker.py - Multi-object tracking
- ByteTrack integration
- Maintain identity across frames
Phase 4: Scoring & Selection
scoring/domain_presets.py - Domain configurations
- Sports, Vlogs, Music, Podcasts presets
- Custom weight definitions
scoring/hype_scorer.py - Hype calculation
- Combine visual + audio scores
- Apply domain weights
- Normalize and rank segments
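Presets and scoring combine into a few lines. A sketch (the weights below are illustrative placeholders, not tuned presets):

```python
DOMAIN_PRESETS = {  # illustrative weights; real presets would be tuned
    "sports":   {"visual": 0.4, "audio": 0.4, "motion": 0.2},
    "podcasts": {"visual": 0.2, "audio": 0.7, "motion": 0.1},
}

def rank_segments(segments, domain="sports", presets=DOMAIN_PRESETS):
    """Weighted-sum each segment's signal dict, min-max normalise, rank.

    segments: list of dicts like {"visual": 0.8, "audio": 0.3, ...}.
    Returns (normalised scores in [0, 1], segment indices best-first)."""
    w = presets[domain]
    raw = [sum(w[k] * seg.get(k, 0.0) for k in w) for seg in segments]
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0            # avoid divide-by-zero on flat scores
    norm = [(r - lo) / span for r in raw]
    order = sorted(range(len(norm)), key=lambda i: -norm[i])
    return norm, order
```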
Phase 5: Pipeline & UI
pipeline/orchestrator.py - Main coordinator
- Coordinate all components
- Handle errors gracefully
- Progress reporting
app.py - Gradio interface
- Video upload
- API key input (secure)
- Prompt/instructions input
- Domain selection
- Reference image upload (for person filtering)
- Progress bar
- Output video gallery
Key Design Decisions
Error Handling Strategy
- Each module has try/except with specific exception types
- Errors bubble up with context
- Pipeline continues with degraded functionality when possible
- User-friendly error messages in UI
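The bubble-up-with-fallback pattern can be captured in one helper; a sketch (`run_stage` is a hypothetical name):

```python
import logging

logger = logging.getLogger("shortsmith.pipeline")  # illustrative logger name

def run_stage(name, fn, fallback=None):
    """Run one pipeline stage; on failure, log a warning with the stack
    trace and return a fallback so later stages continue in degraded mode."""
    try:
        return fn()
    except Exception:
        logger.warning("stage %r failed; continuing with fallback",
                       name, exc_info=True)
        return fallback

# Usage: motion scores are optional, so a failure degrades rather than aborts.
# motion = run_stage("motion", lambda: compute_motion(frames), fallback=[])
```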
Logging Strategy
- DEBUG: Model loading, frame processing details
- INFO: Pipeline stages, timing, results
- WARNING: Fallback triggers, degraded mode
- ERROR: Failures with stack traces
Memory Management
- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Temporary file cleanup
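A lazy batch generator plus a guarded cache clear covers most of this; a sketch (the torch call only runs when CUDA is actually available):

```python
def batched(items, batch_size: int):
    """Yield fixed-size batches lazily so frames never all sit in memory."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def clear_gpu():
    """Release cached CUDA memory between pipeline stages, if torch is installed."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only environment: nothing to clear
```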
HuggingFace Space Considerations
- Use gr.State for session data
- Respect ZeroGPU limits (if using)
- Cache models in /tmp or the HF cache
- Handle timeouts gracefully
API Key Usage
The API key input is for future extensibility (e.g., external services). For MVP, all processing is local using open-weight models.
Gradio UI Layout
┌───────────────────────────────────────────────────────────────┐
│         ShortSmith v2 - AI Video Highlight Extractor          │
├───────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐ ┌─────────────────────────────┐       │
│ │  Upload Video       │ │  Settings                   │       │
│ │  [Drop zone]        │ │  Domain: [Dropdown]         │       │
│ │                     │ │  Clip Duration: [Slider]    │       │
│ └─────────────────────┘ │  Num Clips: [Slider]        │       │
│                         │  API Key: [Password field]  │       │
│ ┌─────────────────────┐ └─────────────────────────────┘       │
│ │  Reference Image    │                                       │
│ │  (Optional)         │ ┌─────────────────────────────┐       │
│ │  [Drop zone]        │ │  Additional Instructions    │       │
│ └─────────────────────┘ │  [Textbox]                  │       │
│                         └─────────────────────────────┘       │
├───────────────────────────────────────────────────────────────┤
│                     [Extract Highlights]                      │
├───────────────────────────────────────────────────────────────┤
│ Progress: [████████████░░░░░░░░] 60%                          │
│ Status: Analyzing audio...                                    │
├───────────────────────────────────────────────────────────────┤
│ Results                                                       │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│ │  Clip 1  │ │  Clip 2  │ │  Clip 3  │                        │
│ │  [Video] │ │  [Video] │ │  [Video] │                        │
│ │ Score:85 │ │ Score:78 │ │ Score:72 │                        │
│ └──────────┘ └──────────┘ └──────────┘                        │
│ [Download All]                                                │
└───────────────────────────────────────────────────────────────┘
Dependencies (requirements.txt)
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
Implementation Order
- config.py, utils/ (foundation)
- core/video_processor.py (essential)
- models/audio_analyzer.py (simpler, Librosa first)
- core/scene_detector.py
- core/frame_sampler.py
- scoring/ modules
- models/visual_analyzer.py (Qwen2-VL)
- models/face_recognizer.py, body_recognizer.py
- models/tracker.py, motion_detector.py
- pipeline/orchestrator.py
- app.py (Gradio UI)
Notes
- Start with Librosa-only audio (MVP), add Wav2Vec later
- Face/body recognition is optional (triggered by reference image)
- Motion detection can be skipped in MVP for speed
- ByteTrack only needed when person filtering is enabled