# ShortSmith v2 - Implementation Plan
## Overview
Build a Hugging Face Space that extracts "hype" (high-energy highlight) moments from videos, with optional person-specific filtering.
## Project Structure
```
shortsmith-v2/
├── app.py                 # Gradio UI (Hugging Face interface)
├── requirements.txt       # Dependencies
├── config.py              # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py          # Centralized logging
│   └── helpers.py         # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py # FFmpeg video/audio extraction
│   ├── scene_detector.py  # PySceneDetect integration
│   ├── frame_sampler.py   # Hierarchical sampling logic
│   └── clip_extractor.py  # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py # Qwen2-VL integration
│   ├── audio_analyzer.py  # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py # OSNet for body recognition
│   ├── motion_detector.py # RAFT optical flow
│   └── tracker.py         # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py     # Hype scoring logic
│   └── domain_presets.py  # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py    # Main pipeline coordinator
```
## Implementation Phases
### Phase 1: Core Infrastructure
1. **config.py** - Configuration management
   - Model paths, thresholds, domain presets
   - HuggingFace API key handling
2. **utils/logger.py** - Centralized logging
   - File and console handlers
   - Different log levels per module
   - Timing decorators for performance tracking
3. **utils/helpers.py** - Common utilities
   - File validation
   - Temporary file management
   - Error formatting
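The timing decorator from item 2 can be sketched as follows. This is a minimal stdlib-only version; the names `timed` and the `shortsmith` logger name are assumptions, not final API:

```python
import functools
import logging
import time

logger = logging.getLogger("shortsmith")

def timed(func):
    """Log how long the wrapped function took (the Phase 1 timing decorator)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.info("%s took %.2fs", func.__name__, elapsed)
        return result
    return wrapper

@timed
def slow_step():
    """Example workload; any pipeline function would be decorated the same way."""
    time.sleep(0.1)
    return "done"
```

Because `functools.wraps` is used, decorated functions keep their names in logs and stack traces.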
### Phase 2: Video Processing Layer
4. **core/video_processor.py** - FFmpeg operations
   - Extract frames at specified FPS
   - Extract audio track
   - Get video metadata (duration, resolution, fps)
   - Cut clips at timestamps
5. **core/scene_detector.py** - Scene boundary detection
   - PySceneDetect integration
   - Content-aware detection
   - Return scene timestamps
6. **core/frame_sampler.py** - Hierarchical sampling
   - First pass: 1 frame per 5-10 seconds
   - Second pass: Dense sampling on candidates
   - Dynamic FPS based on motion
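The two-pass sampling in item 6 reduces to generating timestamp lists; a dependency-free sketch (function names and the 7.5 s default are illustrative, within the 5-10 s range stated above):

```python
def coarse_timestamps(duration, interval=7.5):
    """First pass: one sample every `interval` seconds across the video."""
    t, out = 0.0, []
    while t < duration:
        out.append(round(t, 2))
        t += interval
    return out

def dense_timestamps(center, window=2.0, fps=4.0):
    """Second pass: sample at `fps` inside a +/- `window` second
    neighbourhood around a candidate moment, clamped at 0."""
    step = 1.0 / fps
    t = max(0.0, center - window)
    out = []
    while t <= center + window:
        out.append(round(t, 2))
        t += step
    return out
```

The orchestrator would run `coarse_timestamps` once, score those frames, then call `dense_timestamps` only around high-scoring candidates.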
### Phase 3: AI Models
7. **models/visual_analyzer.py** - Qwen2-VL-2B
   - Load quantized model
   - Process frame batches
   - Extract visual embeddings/scores
8. **models/audio_analyzer.py** - Audio analysis
   - Librosa for basic features (RMS, spectral flux, centroid)
   - Optional Wav2Vec 2.0 for advanced understanding
   - Return audio hype signals per segment
9. **models/face_recognizer.py** - Face detection/recognition
   - InsightFace SCRFD for detection
   - ArcFace for embeddings
   - Reference image matching
10. **models/body_recognizer.py** - Body recognition
    - OSNet for full-body embeddings
    - Handle non-frontal views
11. **models/motion_detector.py** - Motion analysis
    - RAFT optical flow
    - Motion magnitude scoring
12. **models/tracker.py** - Multi-object tracking
    - ByteTrack integration
    - Maintain identity across frames
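The basic audio features in item 8 (RMS energy and an energy-flux proxy for spectral flux) are simple enough to illustrate without Librosa. In the real module these would come from Librosa's feature functions; this pure-Python stand-in just shows the math on a raw sample list:

```python
import math

def frame_rms(samples, frame_len=1024):
    """Root-mean-square energy per non-overlapping frame (a stand-in
    for the Librosa RMS feature named in the plan)."""
    rms = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return rms

def energy_flux(rms):
    """Positive frame-to-frame energy increase; sudden loud onsets
    (crowd roars, beat drops) score high."""
    return [max(0.0, b - a) for a, b in zip(rms, rms[1:])]
```

Per-segment hype signals are then just these curves averaged over each segment's frames.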
### Phase 4: Scoring & Selection
13. **scoring/domain_presets.py** - Domain configurations
    - Sports, Vlogs, Music, Podcasts presets
    - Custom weight definitions
14. **scoring/hype_scorer.py** - Hype calculation
    - Combine visual + audio scores
    - Apply domain weights
    - Normalize and rank segments
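Items 13 and 14 amount to a weighted sum over per-signal scores, with weights chosen per domain. A minimal sketch; the preset weight values here are hypothetical placeholders, not the final tuned numbers:

```python
# Hypothetical weights for two of the planned presets; the real values
# live in scoring/domain_presets.py and need empirical tuning.
DOMAIN_PRESETS = {
    "sports":  {"visual": 0.35, "audio": 0.35, "motion": 0.30},
    "podcast": {"visual": 0.20, "audio": 0.70, "motion": 0.10},
}

def hype_score(signals, domain="sports"):
    """Weighted sum of per-signal scores (each assumed normalized to [0, 1])."""
    weights = DOMAIN_PRESETS[domain]
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

def rank_segments(segments, domain="sports", top_k=3):
    """Return the top_k (start, end, score) tuples, highest hype first."""
    scored = [(s["start"], s["end"], hype_score(s["signals"], domain))
              for s in segments]
    return sorted(scored, key=lambda x: x[2], reverse=True)[:top_k]
```

Missing signals default to 0.0, so the scorer degrades gracefully when an optional model (e.g. motion) was skipped.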
### Phase 5: Pipeline & UI
15. **pipeline/orchestrator.py** - Main coordinator
    - Coordinate all components
    - Handle errors gracefully
    - Progress reporting
16. **app.py** - Gradio interface
    - Video upload
    - API key input (secure)
    - Prompt/instructions input
    - Domain selection
    - Reference image upload (for person filtering)
    - Progress bar
    - Output video gallery
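The orchestrator's "handle errors gracefully" and "progress reporting" duties (item 15) can be sketched as a stage runner. This is an assumed shape, not the final interface: stages are `(name, fn, optional)` tuples, and an optional stage may fail without aborting the run, matching the degraded-mode strategy below:

```python
import logging

logger = logging.getLogger("shortsmith.pipeline")

def run_pipeline(stages, on_progress=None):
    """Run (name, fn, optional) stages in order, passing prior results along.

    Optional stages that raise are recorded as None and skipped, so the
    pipeline continues with degraded functionality; required stages re-raise.
    """
    results = {}
    for i, (name, fn, optional) in enumerate(stages):
        try:
            results[name] = fn(results)
        except Exception:
            if not optional:
                raise
            logger.warning("optional stage %s failed; continuing degraded", name)
            results[name] = None
        if on_progress:
            on_progress((i + 1) / len(stages), name)
    return results
```

In the Space, `on_progress` would drive the Gradio progress bar; in tests it can just collect stage names.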
## Key Design Decisions
### Error Handling Strategy
- Each module has try/except with specific exception types
- Errors bubble up with context
- Pipeline continues with degraded functionality when possible
- User-friendly error messages in UI
### Logging Strategy
- DEBUG: Model loading, frame processing details
- INFO: Pipeline stages, timing, results
- WARNING: Fallback triggers, degraded mode
- ERROR: Failures with stack traces
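A stdlib-only sketch of how `utils/logger.py` could wire up this strategy (handler split and log-file name are assumptions): the console handler stays at INFO while the file handler keeps full DEBUG detail, and per-module levels are set on child loggers.

```python
import logging

def setup_logging(log_file="shortsmith.log"):
    """File + console handlers with the split described in the plan:
    console at INFO (stages, timing, results), file at DEBUG (details)."""
    root = logging.getLogger("shortsmith")
    root.setLevel(logging.DEBUG)
    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.DEBUG)
    fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(fmt)
        root.addHandler(handler)
    # Per-module levels: child loggers like "shortsmith.models" inherit
    # from the root but can be dialed up or down individually.
    logging.getLogger("shortsmith.models").setLevel(logging.DEBUG)
    return root
```

Modules then call `logging.getLogger("shortsmith.<module>")` and inherit the handlers automatically.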
### Memory Management
- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Temporary file cleanup
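The batching and GPU-clearing points above can be sketched as two small helpers (names are illustrative). `batched` is a generator, so only one batch of decoded frames is resident at a time, and `free_gpu` is a no-op when PyTorch is absent:

```python
def batched(frames, batch_size=16):
    """Yield lists of at most batch_size frames from any iterable,
    keeping memory bounded regardless of video length."""
    batch = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def free_gpu():
    """Release cached GPU memory between pipeline stages.
    Safe no-op when torch is not installed or no GPU is present."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
```

A stage would typically loop `for batch in batched(frame_iter): ...` and call `free_gpu()` before handing off to the next model.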
### HuggingFace Space Considerations
- Use `gr.State` for session data
- Respect ZeroGPU limits (if using)
- Cache models in `/tmp` or HF cache
- Handle timeouts gracefully
## API Key Usage
The API key input is for future extensibility (e.g., external services).
For MVP, all processing is local using open-weight models.
## Gradio UI Layout
```
┌────────────────────────────────────────────────────────────┐
│ ShortSmith v2 - AI Video Highlight Extractor               │
├────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐  ┌─────────────────────────────┐   │
│ │ Upload Video        │  │ Settings                    │   │
│ │ [Drop zone]         │  │ Domain: [Dropdown]          │   │
│ │                     │  │ Clip Duration: [Slider]     │   │
│ └─────────────────────┘  │ Num Clips: [Slider]         │   │
│                          │ API Key: [Password field]   │   │
│ ┌─────────────────────┐  └─────────────────────────────┘   │
│ │ Reference Image     │                                    │
│ │ (Optional)          │  ┌─────────────────────────────┐   │
│ │ [Drop zone]         │  │ Additional Instructions     │   │
│ └─────────────────────┘  │ [Textbox]                   │   │
│                          └─────────────────────────────┘   │
├────────────────────────────────────────────────────────────┤
│                  [🚀 Extract Highlights]                   │
├────────────────────────────────────────────────────────────┤
│ Progress: [████████████░░░░░░░░] 60%                       │
│ Status: Analyzing audio...                                 │
├────────────────────────────────────────────────────────────┤
│ Results                                                    │
│ ┌──────────┐  ┌──────────┐  ┌──────────┐                   │
│ │ Clip 1   │  │ Clip 2   │  │ Clip 3   │                   │
│ │ [Video]  │  │ [Video]  │  │ [Video]  │                   │
│ │ Score:85 │  │ Score:78 │  │ Score:72 │                   │
│ └──────────┘  └──────────┘  └──────────┘                   │
│ [Download All]                                             │
└────────────────────────────────────────────────────────────┘
```
## Dependencies (requirements.txt)
```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```
## Implementation Order
1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler, Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)
## Notes
- Start with Librosa-only audio (MVP), add Wav2Vec later
- Face/body recognition is optional (triggered by reference image)
- Motion detection can be skipped in MVP for speed
- ByteTrack only needed when person filtering is enabled