# ShortSmith v2 - Implementation Plan

## Overview

Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.

## Project Structure
```
shortsmith-v2/
├── app.py                   # Gradio UI (Hugging Face interface)
├── requirements.txt         # Dependencies
├── config.py                # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py            # Centralized logging
│   └── helpers.py           # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py   # FFmpeg video/audio extraction
│   ├── scene_detector.py    # PySceneDetect integration
│   ├── frame_sampler.py     # Hierarchical sampling logic
│   └── clip_extractor.py    # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py   # Qwen2-VL integration
│   ├── audio_analyzer.py    # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py   # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py   # OSNet for body recognition
│   ├── motion_detector.py   # RAFT optical flow
│   └── tracker.py           # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py       # Hype scoring logic
│   └── domain_presets.py    # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py      # Main pipeline coordinator
```
## Implementation Phases

### Phase 1: Core Infrastructure

1. **config.py** - Configuration management
   - Model paths, thresholds, domain presets
   - HuggingFace API key handling
2. **utils/logger.py** - Centralized logging
   - File and console handlers
   - Different log levels per module
   - Timing decorators for performance tracking
3. **utils/helpers.py** - Common utilities
   - File validation
   - Temporary file management
   - Error formatting
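The logger-plus-timing-decorator combination in `utils/logger.py` could look like the sketch below. The names `get_logger` and `timed` are placeholders, not a settled API:

```python
import functools
import logging
import time

def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with a console handler attached exactly once."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

def timed(func):
    """Decorator that logs the wall-clock duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logging.getLogger(func.__module__).info(
            "%s took %.2fs", func.__qualname__, elapsed)
        return result
    return wrapper
```

A file handler for persistent logs would be added the same way as the console handler.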
### Phase 2: Video Processing Layer

4. **core/video_processor.py** - FFmpeg operations
   - Extract frames at a specified FPS
   - Extract the audio track
   - Get video metadata (duration, resolution, fps)
   - Cut clips at timestamps
5. **core/scene_detector.py** - Scene boundary detection
   - PySceneDetect integration
   - Content-aware detection
   - Return scene timestamps
6. **core/frame_sampler.py** - Hierarchical sampling
   - First pass: 1 frame per 5-10 seconds
   - Second pass: dense sampling on candidates
   - Dynamic FPS based on motion
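The two-pass sampling strategy reduces to timestamp arithmetic. A plain-Python sketch, where the function names and the window/FPS defaults are illustrative assumptions:

```python
def coarse_timestamps(duration: float, interval: float = 5.0) -> list[float]:
    """First pass: one sample every `interval` seconds across the video."""
    return [t * interval for t in range(int(duration // interval) + 1)]

def dense_timestamps(center: float, duration: float,
                     window: float = 4.0, fps: float = 2.0) -> list[float]:
    """Second pass: sample at `fps` inside a window around a candidate
    moment, clamped to the video bounds."""
    start = max(0.0, center - window / 2)
    end = min(duration, center + window / 2)
    step = 1.0 / fps
    n = int((end - start) / step) + 1
    return [round(start + i * step, 3) for i in range(n)]
```

The coarse timestamps feed the cheap scoring pass; only candidates that score well get the dense second pass, which is what keeps total frame count manageable.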
### Phase 3: AI Models

7. **models/visual_analyzer.py** - Qwen2-VL-2B
   - Load the quantized model
   - Process frame batches
   - Extract visual embeddings/scores
8. **models/audio_analyzer.py** - Audio analysis
   - Librosa for basic features (RMS, spectral flux, centroid)
   - Optional Wav2Vec 2.0 for advanced understanding
   - Return audio hype signals per segment
9. **models/face_recognizer.py** - Face detection/recognition
   - InsightFace SCRFD for detection
   - ArcFace for embeddings
   - Reference image matching
10. **models/body_recognizer.py** - Body recognition
    - OSNet for full-body embeddings
    - Handle non-frontal views
11. **models/motion_detector.py** - Motion analysis
    - RAFT optical flow
    - Motion magnitude scoring
12. **models/tracker.py** - Multi-object tracking
    - ByteTrack integration
    - Maintain identity across frames
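The basic audio features in item 8 boil down to short-time energy and spectral flux. A NumPy-only sketch of what `audio_analyzer.py` computes (in practice Librosa's equivalents would be used; the frame/hop sizes here are common defaults, not requirements):

```python
import numpy as np

def rms_per_frame(signal: np.ndarray, frame_len: int = 2048,
                  hop: int = 512) -> np.ndarray:
    """Root-mean-square energy per frame: a loudness proxy for hype."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return np.sqrt((frames ** 2).mean(axis=1))

def spectral_flux(signal: np.ndarray, frame_len: int = 2048,
                  hop: int = 512) -> np.ndarray:
    """Positive change in magnitude spectrum between successive frames:
    an onset-like signal that spikes on sudden sounds (cheers, hits)."""
    n_frames = 1 + max(0, len(signal) - frame_len) // hop
    mags = np.stack([np.abs(np.fft.rfft(signal[i * hop: i * hop + frame_len]))
                     for i in range(n_frames)])
    diff = np.diff(mags, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)
```

Per-segment hype signals would then be aggregates (mean, max) of these curves over each segment's time range.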
### Phase 4: Scoring & Selection

13. **scoring/domain_presets.py** - Domain configurations
    - Sports, Vlogs, Music, Podcasts presets
    - Custom weight definitions
14. **scoring/hype_scorer.py** - Hype calculation
    - Combine visual + audio scores
    - Apply domain weights
    - Normalize and rank segments
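A minimal sketch of how `domain_presets.py` and `hype_scorer.py` could fit together. The weight values and signal names below are illustrative assumptions, not final tuning:

```python
# Hypothetical preset weights; the real values live in scoring/domain_presets.py
# and would be tuned per domain.
DOMAIN_PRESETS = {
    "sports":   {"visual": 0.4, "audio": 0.4, "motion": 0.2},
    "podcasts": {"visual": 0.2, "audio": 0.7, "motion": 0.1},
}

def hype_score(signals: dict[str, float], domain: str = "sports") -> float:
    """Weighted combination of per-modality scores, each assumed in [0, 1]."""
    weights = DOMAIN_PRESETS[domain]
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

def rank_segments(segments: list[dict], domain: str, top_k: int = 3) -> list[dict]:
    """Score each candidate segment and return the top-k by descending hype."""
    scored = [{**s, "score": hype_score(s["signals"], domain)} for s in segments]
    return sorted(scored, key=lambda s: s["score"], reverse=True)[:top_k]
```

Because the weights in each preset sum to 1 and the inputs are normalized, the combined score stays in [0, 1] without a separate normalization step.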
### Phase 5: Pipeline & UI

15. **pipeline/orchestrator.py** - Main coordinator
    - Coordinate all components
    - Handle errors gracefully
    - Progress reporting
16. **app.py** - Gradio interface
    - Video upload
    - API key input (secure)
    - Prompt/instructions input
    - Domain selection
    - Reference image upload (for person filtering)
    - Progress bar
    - Output video gallery
## Key Design Decisions

### Error Handling Strategy

- Each module has try/except with specific exception types
- Errors bubble up with context
- The pipeline continues with degraded functionality when possible
- User-friendly error messages in the UI
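The "degrade when possible" rule can be captured in a small stage runner inside the orchestrator. `run_stage` and its signature are a sketch, not a settled API:

```python
import logging

def run_stage(name, func, *args, optional=False, fallback=None):
    """Run one pipeline stage. Optional stages (e.g. motion detection)
    degrade to a fallback value instead of aborting the whole run;
    required stages re-raise with added context."""
    try:
        return func(*args)
    except Exception as exc:
        if optional:
            logging.warning("Stage %s failed (%s); continuing in degraded mode.",
                            name, exc)
            return fallback
        raise RuntimeError(f"Stage {name} failed") from exc
```

Wrapping the original exception with `raise ... from exc` preserves the stack trace for the ERROR-level log while letting the UI show only the friendly message.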
### Logging Strategy

- DEBUG: model loading, frame-processing details
- INFO: pipeline stages, timing, results
- WARNING: fallback triggers, degraded mode
- ERROR: failures with stack traces

### Memory Management

- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Clean up temporary files
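The batches-plus-generators pattern keeps only one batch of frames resident at a time. A sketch (between stages one would additionally call `torch.cuda.empty_cache()`, omitted here to keep the example dependency-free):

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield fixed-size batches lazily, so upstream frame decoding
    and downstream model inference never hold the full video in memory."""
    batch: list = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch
        yield batch
```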
### HuggingFace Space Considerations

- Use `gr.State` for session data
- Respect ZeroGPU limits (if using)
- Cache models in `/tmp` or the HF cache
- Handle timeouts gracefully

## API Key Usage

The API key input is for future extensibility (e.g., external services).
For the MVP, all processing is local using open-weight models.

## Gradio UI Layout
```
┌───────────────────────────────────────────────────────────────┐
│ ShortSmith v2 - AI Video Highlight Extractor                  │
├───────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐  ┌─────────────────────────────┐      │
│ │ Upload Video        │  │ Settings                    │      │
│ │ [Drop zone]         │  │ Domain: [Dropdown]          │      │
│ │                     │  │ Clip Duration: [Slider]     │      │
│ └─────────────────────┘  │ Num Clips: [Slider]         │      │
│                          │ API Key: [Password field]   │      │
│ ┌─────────────────────┐  └─────────────────────────────┘      │
│ │ Reference Image     │                                       │
│ │ (Optional)          │  ┌─────────────────────────────┐      │
│ │ [Drop zone]         │  │ Additional Instructions     │      │
│ └─────────────────────┘  │ [Textbox]                   │      │
│                          └─────────────────────────────┘      │
├───────────────────────────────────────────────────────────────┤
│                   [Extract Highlights]                        │
├───────────────────────────────────────────────────────────────┤
│ Progress: [████████████░░░░░░░░] 60%                          │
│ Status: Analyzing audio...                                    │
├───────────────────────────────────────────────────────────────┤
│ Results                                                       │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│ │  Clip 1  │ │  Clip 2  │ │  Clip 3  │                        │
│ │ [Video]  │ │ [Video]  │ │ [Video]  │                        │
│ │ Score:85 │ │ Score:78 │ │ Score:72 │                        │
│ └──────────┘ └──────────┘ └──────────┘                        │
│ [Download All]                                                │
└───────────────────────────────────────────────────────────────┘
```
## Dependencies (requirements.txt)

```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```
## Implementation Order

1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler; Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)

## Notes

- Start with Librosa-only audio for the MVP; add Wav2Vec 2.0 later
- Face/body recognition is optional (triggered by a reference image)
- Motion detection can be skipped in the MVP for speed
- ByteTrack is only needed when person filtering is enabled
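When the optional person filter is active, deciding whether a detected face or body belongs to the person in the reference image reduces to a cosine-similarity check between embeddings. A sketch; the 0.35 threshold is a placeholder and would need tuning against the actual ArcFace/OSNet models:

```python
import numpy as np

def cosine_match(query: np.ndarray, reference: np.ndarray,
                 threshold: float = 0.35) -> bool:
    """Return True when the query embedding is close enough (by cosine
    similarity) to the reference image's embedding to count as a match."""
    sim = float(np.dot(query, reference) /
                (np.linalg.norm(query) * np.linalg.norm(reference)))
    return sim >= threshold
```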