Fixed frontend issue where the number of highlights selected did not appear

Files changed:
- .gitattributes (+35 −35)
- PLAN.md (+226 −226)
- README.md (+48 −48)
- REQUIREMENTS_CHECKLIST.md (+162 −162)
- app.py (+1096 −1014)
- config.py (+201 −201)
- requirements.txt (+103 −103)
- space.yaml (+31 −31)
.gitattributes
CHANGED
@@ -1,35 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
PLAN.md
CHANGED
@@ -1,226 +1,226 @@

# ShortSmith v2 - Implementation Plan

## Overview
Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.

## Project Structure
```
shortsmith-v2/
├── app.py                   # Gradio UI (Hugging Face interface)
├── requirements.txt         # Dependencies
├── config.py                # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py            # Centralized logging
│   └── helpers.py           # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py   # FFmpeg video/audio extraction
│   ├── scene_detector.py    # PySceneDetect integration
│   ├── frame_sampler.py     # Hierarchical sampling logic
│   └── clip_extractor.py    # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py   # Qwen2-VL integration
│   ├── audio_analyzer.py    # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py   # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py   # OSNet for body recognition
│   ├── motion_detector.py   # RAFT optical flow
│   └── tracker.py           # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py       # Hype scoring logic
│   └── domain_presets.py    # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py      # Main pipeline coordinator
```

## Implementation Phases

### Phase 1: Core Infrastructure
1. **config.py** - Configuration management
   - Model paths, thresholds, domain presets
   - HuggingFace API key handling

2. **utils/logger.py** - Centralized logging
   - File and console handlers
   - Different log levels per module
   - Timing decorators for performance tracking

3. **utils/helpers.py** - Common utilities
   - File validation
   - Temporary file management
   - Error formatting
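The timing decorators mentioned for `utils/logger.py` could look like the following sketch (the `log_timing` name and logger setup are illustrative, not taken from the repo):

```python
import functools
import logging
import time

logger = logging.getLogger("shortsmith")

def log_timing(func):
    """Log how long the wrapped function takes, at INFO level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3fs", func.__name__, elapsed)
        return result
    return wrapper

@log_timing
def extract_frames(count):
    # stand-in for real frame extraction work
    return list(range(count))
```

`functools.wraps` keeps the wrapped function's name intact, so the log lines stay readable per module.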
### Phase 2: Video Processing Layer
4. **core/video_processor.py** - FFmpeg operations
   - Extract frames at specified FPS
   - Extract audio track
   - Get video metadata (duration, resolution, fps)
   - Cut clips at timestamps

5. **core/scene_detector.py** - Scene boundary detection
   - PySceneDetect integration
   - Content-aware detection
   - Return scene timestamps

6. **core/frame_sampler.py** - Hierarchical sampling
   - First pass: 1 frame per 5-10 seconds
   - Second pass: Dense sampling on candidates
   - Dynamic FPS based on motion
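The two-pass sampling described above can be sketched in terms of timestamps (a minimal illustration; the function names and the 5 s coarse / 0.5 s dense intervals are assumptions, not taken from `frame_sampler.py`):

```python
def coarse_timestamps(duration, step=5.0):
    """First pass: one sample every `step` seconds across the whole video."""
    t, out = 0.0, []
    while t < duration:
        out.append(round(t, 3))
        t += step
    return out

def dense_timestamps(start, end, step=0.5):
    """Second pass: dense sampling inside a candidate window only."""
    t, out = start, []
    while t < end:
        out.append(round(t, 3))
        t += step
    return out

coarse = coarse_timestamps(30.0)      # one timestamp per 5 s of a 30 s video
dense = dense_timestamps(10.0, 12.0)  # half-second steps inside a hot window
```

Only the windows flagged by the coarse pass pay the cost of dense sampling, which is what makes long videos tractable.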
### Phase 3: AI Models
7. **models/visual_analyzer.py** - Qwen2-VL-2B
   - Load quantized model
   - Process frame batches
   - Extract visual embeddings/scores

8. **models/audio_analyzer.py** - Audio analysis
   - Librosa for basic features (RMS, spectral flux, centroid)
   - Optional Wav2Vec 2.0 for advanced understanding
   - Return audio hype signals per segment

9. **models/face_recognizer.py** - Face detection/recognition
   - InsightFace SCRFD for detection
   - ArcFace for embeddings
   - Reference image matching

10. **models/body_recognizer.py** - Body recognition
    - OSNet for full-body embeddings
    - Handle non-frontal views

11. **models/motion_detector.py** - Motion analysis
    - RAFT optical flow
    - Motion magnitude scoring

12. **models/tracker.py** - Multi-object tracking
    - ByteTrack integration
    - Maintain identity across frames
```
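The RMS and spectral-flux features named for `audio_analyzer.py` can be sketched with plain NumPy (librosa's `librosa.feature.rms` and onset-strength utilities compute the same ideas; the helper names and frame sizes here are assumptions):

```python
import numpy as np

def frame_rms(signal, frame_len=1024, hop=512):
    """Root-mean-square energy per frame (loudness proxy)."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def spectral_flux(signal, frame_len=1024, hop=512):
    """Positive change in magnitude spectrum between consecutive frames."""
    mags = [np.abs(np.fft.rfft(signal[i:i + frame_len]))
            for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sum(np.clip(b - a, 0, None))
                     for a, b in zip(mags, mags[1:])])

# Synthetic clip: one quiet second followed by one loud second at 440 Hz.
sr = 22050
t = np.linspace(0, 1, sr, endpoint=False)
quiet_then_loud = np.concatenate([0.1 * np.sin(2 * np.pi * 440 * t),
                                  0.9 * np.sin(2 * np.pi * 440 * t)])
rms = frame_rms(quiet_then_loud)
flux = spectral_flux(quiet_then_loud)
```

Segments where both signals spike together are the audio side of a "hype" candidate.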
### Phase 4: Scoring & Selection
13. **scoring/domain_presets.py** - Domain configurations
    - Sports, Vlogs, Music, Podcasts presets
    - Custom weight definitions

14. **scoring/hype_scorer.py** - Hype calculation
    - Combine visual + audio scores
    - Apply domain weights
    - Normalize and rank segments
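The combine/weight/normalize/rank steps can be sketched as follows (the weight values and segment fields are placeholders, not the shipped presets):

```python
def hype_scores(segments, w_visual=0.4, w_audio=0.35, w_motion=0.25):
    """Weighted fusion of per-segment modality scores, min-max normalized, ranked."""
    raw = [w_visual * s["visual"] + w_audio * s["audio"] + w_motion * s["motion"]
           for s in segments]
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0   # avoid divide-by-zero on perfectly flat content
    norm = [(r - lo) / span for r in raw]
    order = sorted(range(len(segments)), key=lambda i: norm[i], reverse=True)
    return [(i, norm[i]) for i in order]

segs = [{"visual": 0.2, "audio": 0.1, "motion": 0.3},
        {"visual": 0.9, "audio": 0.8, "motion": 0.7},
        {"visual": 0.5, "audio": 0.4, "motion": 0.5}]
ranked = hype_scores(segs)   # segment index 1 ranks first
```

Normalizing within the video (rather than using absolute thresholds) matches the relative-scoring design: a quiet vlog and a stadium broadcast both yield usable rankings.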
### Phase 5: Pipeline & UI
15. **pipeline/orchestrator.py** - Main coordinator
    - Coordinate all components
    - Handle errors gracefully
    - Progress reporting

16. **app.py** - Gradio interface
    - Video upload
    - API key input (secure)
    - Prompt/instructions input
    - Domain selection
    - Reference image upload (for person filtering)
    - Progress bar
    - Output video gallery
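One way to wire orchestrator progress reporting into the UI is a callback that each stage reports through, which Gradio's progress bar can consume. This is a minimal sketch; the stage names and `run_pipeline` signature are illustrative, not from `orchestrator.py`:

```python
STAGES = ["extract audio", "detect scenes", "sample frames",
          "score segments", "cut clips"]

def run_pipeline(video_path, on_progress=lambda frac, msg: None):
    """Run each stage in order, reporting fractional progress after each one."""
    results = {}
    for i, stage in enumerate(STAGES, start=1):
        results[stage] = f"{stage} done for {video_path}"  # stand-in for real work
        on_progress(i / len(STAGES), f"Status: {stage}...")
    return results

updates = []
run_pipeline("demo.mp4", on_progress=lambda frac, msg: updates.append((frac, msg)))
```

In the real app the callback would forward to `gr.Progress`, so a failed stage can still surface a user-friendly message instead of a stack trace.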
## Key Design Decisions

### Error Handling Strategy
- Each module has try/except with specific exception types
- Errors bubble up with context
- Pipeline continues with degraded functionality when possible
- User-friendly error messages in UI

### Logging Strategy
- DEBUG: Model loading, frame processing details
- INFO: Pipeline stages, timing, results
- WARNING: Fallback triggers, degraded mode
- ERROR: Failures with stack traces

### Memory Management
- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Temporary file cleanup

### HuggingFace Space Considerations
- Use `gr.State` for session data
- Respect ZeroGPU limits (if using)
- Cache models in `/tmp` or HF cache
- Handle timeouts gracefully
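The temporary-file cleanup point is easy to get wrong on the error path; a context manager guarantees intermediate frames and audio are removed even when a stage raises. A sketch using only the standard library (the `workspace` name is illustrative):

```python
import contextlib
import os
import shutil
import tempfile

@contextlib.contextmanager
def workspace(prefix="shortsmith_"):
    """Create a scratch directory and remove it even if processing raises."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)

with workspace() as tmp:
    frame = os.path.join(tmp, "frame_0001.jpg")
    open(frame, "wb").close()   # stand-in for an extracted frame
    existed_inside = os.path.exists(frame)
# the directory and its contents are gone once the block exits
```

This matters on Spaces in particular, where `/tmp` disk is shared across sessions and leaks accumulate quickly.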
## API Key Usage
The API key input is for future extensibility (e.g., external services).
For MVP, all processing is local using open-weight models.
## Gradio UI Layout
```
┌─────────────────────────────────────────────────────────────┐
│ ShortSmith v2 - AI Video Highlight Extractor                │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐  ┌─────────────────────────────┐    │
│ │ Upload Video        │  │ Settings                    │    │
│ │ [Drop zone]         │  │ Domain: [Dropdown]          │    │
│ │                     │  │ Clip Duration: [Slider]     │    │
│ └─────────────────────┘  │ Num Clips: [Slider]         │    │
│                          │ API Key: [Password field]   │    │
│ ┌─────────────────────┐  └─────────────────────────────┘    │
│ │ Reference Image     │                                     │
│ │ (Optional)          │  ┌─────────────────────────────┐    │
│ │ [Drop zone]         │  │ Additional Instructions     │    │
│ └─────────────────────┘  │ [Textbox]                   │    │
│                          └─────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│ [🚀 Extract Highlights]                                     │
├─────────────────────────────────────────────────────────────┤
│ Progress: [████████████░░░░░░░░] 60%                        │
│ Status: Analyzing audio...                                  │
├─────────────────────────────────────────────────────────────┤
│ Results                                                     │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐                      │
│ │ Clip 1   │ │ Clip 2   │ │ Clip 3   │                      │
│ │ [Video]  │ │ [Video]  │ │ [Video]  │                      │
│ │ Score:85 │ │ Score:78 │ │ Score:72 │                      │
│ └──────────┘ └──────────┘ └──────────┘                      │
│ [Download All]                                              │
└─────────────────────────────────────────────────────────────┘
```
## Dependencies (requirements.txt)
```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```
## Implementation Order
1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler, Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)
## Notes
- Start with Librosa-only audio (MVP), add Wav2Vec later
- Face/body recognition is optional (triggered by reference image)
- Motion detection can be skipped in MVP for speed
- ByteTrack only needed when person filtering is enabled
README.md
CHANGED
@@ -1,48 +1,48 @@

---
title: ShortSmith v2
emoji: 🎬
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
hardware: a10g-large
tags:
- video
- highlight-detection
- ai
- qwen
- computer-vision
- audio-analysis
short_description: AI-Powered Video Highlight Extractor
---

# ShortSmith v2

Extract the most engaging highlight clips from your videos automatically using AI.

## Features
- Multi-modal analysis (visual + audio + motion)
- Domain-optimized presets (Sports, Music, Vlogs, etc.)
- Person-specific filtering
- Scene-aware clip cutting
- Trained on Mr. HiSum "Most Replayed" data

## Usage
1. Upload a video (up to 500MB, max 1 hour)
2. Select content domain (Sports, Music, Vlogs, etc.)
3. Choose number of clips and duration
4. (Optional) Upload reference image for person filtering
5. Click "Extract Highlights"
6. Download your clips!

## Tech Stack
- **Visual**: Qwen2-VL-2B (INT4 quantized)
- **Audio**: Librosa + Wav2Vec 2.0
- **Face Recognition**: InsightFace (SCRFD + ArcFace)
- **Hype Scoring**: MLP trained on Mr. HiSum dataset
- **Scene Detection**: PySceneDetect

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
REQUIREMENTS_CHECKLIST.md
CHANGED
@@ -1,162 +1,162 @@

# ShortSmith v2 - Requirements Checklist

Comparing implementation against the original proposal document.

## ✅ Executive Summary Requirements

| Requirement | Status | Implementation |
|-------------|--------|----------------|
| Reduce costs vs Klap.app | ✅ | Uses open-weight models, no per-video API cost |
| Person-specific filtering | ✅ | `face_recognizer.py` + `body_recognizer.py` |
| Customizable "hype" definitions | ✅ | `domain_presets.py` with Sports, Vlogs, Music, etc. |
| Eliminate vendor dependency | ✅ | All processing is local |

## ✅ Technical Challenges Addressed

| Challenge | Status | Solution |
|-----------|--------|----------|
| Long video processing | ✅ | Hierarchical sampling in `frame_sampler.py` |
| Subjective "hype" | ✅ | Domain presets + trainable scorer |
| Person tracking | ✅ | Face + Body recognition + ByteTrack |
| Audio-visual correlation | ✅ | Multi-modal fusion in `hype_scorer.py` |
| Temporal precision | ✅ | Scene-aware cutting in `clip_extractor.py` |

## ✅ Technology Decisions (Section 5)

### 5.1 Visual Understanding Model
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Model | Qwen2-VL-2B | `visual_analyzer.py` | ✅ |
| Quantization | INT4 via AWQ/GPTQ | bitsandbytes INT4 | ✅ |

### 5.2 Audio Analysis
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Primary | Wav2Vec 2.0 + Librosa | `audio_analyzer.py` | ✅ |
| Features | RMS, spectral flux, centroid | Implemented | ✅ |
| MVP Strategy | Start with Librosa | Librosa default, Wav2Vec optional | ✅ |

### 5.3 Hype Scoring
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Dataset | Mr. HiSum | Training notebook created | ✅ |
| Method | Contrastive/pairwise ranking | `training/hype_scorer_training.ipynb` | ✅ |
| Model | 2-layer MLP | Implemented in training notebook | ✅ |

### 5.4 Face Recognition
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Detection | SCRFD | InsightFace in `face_recognizer.py` | ✅ |
| Embeddings | ArcFace (512-dim) | Implemented | ✅ |
| Threshold | >0.4 cosine similarity | Configurable in `config.py` | ✅ |
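The >0.4 cosine-similarity matching used for face recognition can be sketched as follows (a minimal illustration with toy 4-dim vectors; real ArcFace embeddings are 512-dim, and the helper names are assumptions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_match(reference, candidate, threshold=0.4):
    """Does the detected face match the uploaded reference image?"""
    return cosine_similarity(reference, candidate) > threshold

ref = [0.9, 0.1, 0.0, 0.4]
same_person = [0.8, 0.2, 0.1, 0.5]     # similar direction -> above threshold
other_person = [-0.7, 0.6, 0.2, -0.1]  # dissimilar -> below threshold
```

Because cosine similarity ignores vector magnitude, the threshold behaves consistently across lighting and resolution changes that mostly scale the embedding.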
### 5.5 Body Recognition
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Model | OSNet | `body_recognizer.py` | ✅ |
| Purpose | Non-frontal views | Handles back views, profiles | ✅ |

### 5.6 Multi-Object Tracking
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Tracker | ByteTrack | `tracker.py` | ✅ |
| Features | Two-stage association | Implemented | ✅ |

### 5.7 Scene Boundary Detection
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Tool | PySceneDetect | `scene_detector.py` | ✅ |
| Modes | Content-aware, Adaptive | Both supported | ✅ |

### 5.8 Video Processing
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Tool | FFmpeg | `video_processor.py` | ✅ |
| Operations | Extract frames, audio, cut clips | All implemented | ✅ |

### 5.9 Motion Detection
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Model | RAFT Optical Flow | `motion_detector.py` | ✅ |
| Fallback | Farneback | Implemented | ✅ |

## ✅ Key Design Decisions (Section 7)

### 7.1 Hierarchical Sampling
| Feature | Status | Implementation |
|---------|--------|----------------|
| Coarse pass (1 frame/5-10s) | ✅ | `frame_sampler.py` |
| Dense pass on candidates | ✅ | `sample_dense()` method |
| Dynamic FPS | ✅ | Based on motion scores |

### 7.2 Contrastive Hype Scoring
| Feature | Status | Implementation |
|---------|--------|----------------|
| Pairwise ranking | ✅ | Training notebook |
| Relative scoring | ✅ | Normalized within video |

### 7.3 Multi-Modal Person Detection
| Feature | Status | Implementation |
|---------|--------|----------------|
| Face + Body | ✅ | Both recognizers |
| Confidence fusion | ✅ | `max(face_score, body_score)` |
| ByteTrack tracking | ✅ | `tracker.py` |
### 7.4 Domain-Aware Presets
| Domain | Visual | Audio | Status |
|--------|--------|-------|--------|
| Sports | 30% | 45% | ✅ |
| Vlogs | 55% | 20% | ✅ |
| Music | 35% | 45% | ✅ |
| Podcasts | 10% | 75% | ✅ |
| Gaming | 40% | 35% | ✅ |
| General | 40% | 35% | ✅ |
|
| 115 |
-
### 7.5 Diversity Enforcement
|
| 116 |
-
| Feature | Status | Implementation |
|
| 117 |
-
|---------|--------|----------------|
|
| 118 |
-
| Minimum 30s gap | ✅ | `clip_extractor.py` `select_clips()` |
|
| 119 |
-
|
| 120 |
-
### 7.6 Fallback Handling
|
| 121 |
-
| Feature | Status | Implementation |
|
| 122 |
-
|---------|--------|----------------|
|
| 123 |
-
| Uniform windowing for flat content | ✅ | `create_fallback_clips()` |
|
| 124 |
-
| Never zero clips | ✅ | Fallback always creates clips |
|
| 125 |
-
|
| 126 |
-
## ✅ Gradio UI Requirements
|
| 127 |
-
|
| 128 |
-
| Feature | Status | Implementation |
|
| 129 |
-
|---------|--------|----------------|
|
| 130 |
-
| Video upload | ✅ | `gr.Video` component |
|
| 131 |
-
| API key input | ✅ | `gr.Textbox(type="password")` |
|
| 132 |
-
| Domain selection | ✅ | `gr.Dropdown` |
|
| 133 |
-
| Clip duration slider | ✅ | `gr.Slider` |
|
| 134 |
-
| Num clips slider | ✅ | `gr.Slider` |
|
| 135 |
-
| Reference image | ✅ | `gr.Image` |
|
| 136 |
-
| Custom prompt | ✅ | `gr.Textbox` |
|
| 137 |
-
| Progress bar | ✅ | `gr.Progress` |
|
| 138 |
-
| Output gallery | ✅ | `gr.Gallery` |
|
| 139 |
-
| Download all | ⚠️ | Partial (individual clips downloadable) |
|
| 140 |
-
|
| 141 |
-
## ⚠️ Items for Future Enhancement
|
| 142 |
-
|
| 143 |
-
| Item | Status | Notes |
|
| 144 |
-
|------|--------|-------|
|
| 145 |
-
| Trained hype scorer weights | 🔄 | Notebook ready, needs training on real data |
|
| 146 |
-
| RAFT GPU acceleration | ⚠️ | Falls back to Farneback if unavailable |
|
| 147 |
-
| Download all as ZIP | ⚠️ | Could add `gr.DownloadButton` |
|
| 148 |
-
| Batch processing | ❌ | Single video only currently |
|
| 149 |
-
| API endpoint | ❌ | UI only, no REST API |
|
| 150 |
-
|
| 151 |
-
## Summary
|
| 152 |
-
|
| 153 |
-
**Completed**: 95% of proposal requirements
|
| 154 |
-
**Training Pipeline**: Separate Colab notebook for Mr. HiSum training
|
| 155 |
-
**Missing**: Only minor UI features (bulk download) and production training
|
| 156 |
-
|
| 157 |
-
The implementation fully covers:
|
| 158 |
-
- ✅ All 9 core components from the proposal
|
| 159 |
-
- ✅ All 6 key design decisions
|
| 160 |
-
- ✅ All domain presets
|
| 161 |
-
- ✅ Error handling and logging throughout
|
| 162 |
-
- ✅ Gradio UI with all inputs from proposal
|
|
|
|
# ShortSmith v2 - Requirements Checklist

Comparing implementation against the original proposal document.

## ✅ Executive Summary Requirements

| Requirement | Status | Implementation |
|-------------|--------|----------------|
| Reduce costs vs Klap.app | ✅ | Uses open-weight models, no per-video API cost |
| Person-specific filtering | ✅ | `face_recognizer.py` + `body_recognizer.py` |
| Customizable "hype" definitions | ✅ | `domain_presets.py` with Sports, Vlogs, Music, etc. |
| Eliminate vendor dependency | ✅ | All processing is local |

## ✅ Technical Challenges Addressed

| Challenge | Status | Solution |
|-----------|--------|----------|
| Long video processing | ✅ | Hierarchical sampling in `frame_sampler.py` |
| Subjective "hype" | ✅ | Domain presets + trainable scorer |
| Person tracking | ✅ | Face + Body recognition + ByteTrack |
| Audio-visual correlation | ✅ | Multi-modal fusion in `hype_scorer.py` |
| Temporal precision | ✅ | Scene-aware cutting in `clip_extractor.py` |

## ✅ Technology Decisions (Section 5)

### 5.1 Visual Understanding Model
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Model | Qwen2-VL-2B | `visual_analyzer.py` | ✅ |
| Quantization | INT4 via AWQ/GPTQ | bitsandbytes INT4 | ✅ |

### 5.2 Audio Analysis
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Primary | Wav2Vec 2.0 + Librosa | `audio_analyzer.py` | ✅ |
| Features | RMS, spectral flux, centroid | Implemented | ✅ |
| MVP Strategy | Start with Librosa | Librosa default, Wav2Vec optional | ✅ |

### 5.3 Hype Scoring
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Dataset | Mr. HiSum | Training notebook created | ✅ |
| Method | Contrastive/pairwise ranking | `training/hype_scorer_training.ipynb` | ✅ |
| Model | 2-layer MLP | Implemented in training notebook | ✅ |
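The scoring model named above is small enough to sketch in full. Below is a minimal, framework-free illustration of a 2-layer MLP scorer paired with a pairwise hinge loss; the actual notebook (layer sizes, framework, optimizer) may differ:

```python
def mlp_score(features, W1, b1, W2, b2):
    """2-layer MLP: feature vector -> ReLU hidden layer -> scalar hype score."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

def pairwise_ranking_loss(score_hype, score_dull, margin=1.0):
    """Hinge loss: zero once the 'hype' segment outscores the dull one by `margin`."""
    return max(0.0, margin - (score_hype - score_dull))
```

Training then amounts to sampling (hype, dull) segment pairs from the Mr. HiSum annotations and minimizing this loss over all pairs.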
### 5.4 Face Recognition
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Detection | SCRFD | InsightFace in `face_recognizer.py` | ✅ |
| Embeddings | ArcFace (512-dim) | Implemented | ✅ |
| Threshold | >0.4 cosine similarity | Configurable in `config.py` | ✅ |

### 5.5 Body Recognition
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Model | OSNet | `body_recognizer.py` | ✅ |
| Purpose | Non-frontal views | Handles back views, profiles | ✅ |

### 5.6 Multi-Object Tracking
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Tracker | ByteTrack | `tracker.py` | ✅ |
| Features | Two-stage association | Implemented | ✅ |

### 5.7 Scene Boundary Detection
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Tool | PySceneDetect | `scene_detector.py` | ✅ |
| Modes | Content-aware, Adaptive | Both supported | ✅ |

### 5.8 Video Processing
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Tool | FFmpeg | `video_processor.py` | ✅ |
| Operations | Extract frames, audio, cut clips | All implemented | ✅ |

### 5.9 Motion Detection
| Item | Proposal | Implementation | Status |
|------|----------|----------------|--------|
| Model | RAFT Optical Flow | `motion_detector.py` | ✅ |
| Fallback | Farneback | Implemented | ✅ |

## ✅ Key Design Decisions (Section 7)

### 7.1 Hierarchical Sampling
| Feature | Status | Implementation |
|---------|--------|----------------|
| Coarse pass (1 frame/5-10s) | ✅ | `frame_sampler.py` |
| Dense pass on candidates | ✅ | `sample_dense()` method |
| Dynamic FPS | ✅ | Based on motion scores |
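The two-pass idea can be sketched in terms of timestamps alone (an illustration; the function names here are hypothetical, not the actual `frame_sampler.py` API):

```python
def coarse_timestamps(duration_s, interval_s=5.0):
    """Coarse pass: one sample every `interval_s` seconds across the whole video."""
    t, out = 0.0, []
    while t < duration_s:
        out.append(round(t, 3))
        t += interval_s
    return out

def dense_timestamps(start_s, end_s, fps=2.0):
    """Dense pass: sample one candidate window at `fps` frames per second."""
    step = 1.0 / fps
    n = int((end_s - start_s) / step)
    return [round(start_s + i * step, 3) for i in range(n + 1)]
```

Only windows whose coarse-pass scores look promising are re-sampled densely, which keeps frame counts bounded on long videos.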
### 7.2 Contrastive Hype Scoring
| Feature | Status | Implementation |
|---------|--------|----------------|
| Pairwise ranking | ✅ | Training notebook |
| Relative scoring | ✅ | Normalized within video |
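"Normalized within video" means scores are only comparable inside one video, e.g. via min-max scaling (a minimal sketch of the idea):

```python
def normalize_within_video(scores):
    """Min-max normalize raw hype scores so 0/1 mark the dullest/most hype segments of *this* video."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # flat content: avoid divide-by-zero
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```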
### 7.3 Multi-Modal Person Detection
| Feature | Status | Implementation |
|---------|--------|----------------|
| Face + Body | ✅ | Both recognizers |
| Confidence fusion | ✅ | `max(face_score, body_score)` |
| ByteTrack tracking | ✅ | `tracker.py` |

### 7.4 Domain-Aware Presets
| Domain | Visual | Audio | Status |
|--------|--------|-------|--------|
| Sports | 30% | 45% | ✅ |
| Vlogs | 55% | 20% | ✅ |
| Music | 35% | 45% | ✅ |
| Podcasts | 10% | 75% | ✅ |
| Gaming | 40% | 35% | ✅ |
| General | 40% | 35% | ✅ |
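These presets feed a simple weighted sum. The sketch below is illustrative of the shape, not the actual `domain_presets.py` contents; in particular, the assumption that the remaining weight goes to motion is ours (the table only lists visual and audio):

```python
# Hypothetical preset table; "motion" is assumed to absorb the remaining weight.
DOMAIN_PRESETS = {
    "sports":   {"visual": 0.30, "audio": 0.45, "motion": 0.25},
    "vlogs":    {"visual": 0.55, "audio": 0.20, "motion": 0.25},
    "music":    {"visual": 0.35, "audio": 0.45, "motion": 0.20},
    "podcasts": {"visual": 0.10, "audio": 0.75, "motion": 0.15},
    "gaming":   {"visual": 0.40, "audio": 0.35, "motion": 0.25},
    "general":  {"visual": 0.40, "audio": 0.35, "motion": 0.25},
}

def combined_score(visual, audio, motion, domain="general"):
    """Blend per-modality scores with the weights for the chosen domain."""
    w = DOMAIN_PRESETS[domain]
    return w["visual"] * visual + w["audio"] * audio + w["motion"] * motion
```

So a podcast segment is ranked almost entirely by its audio, while a vlog segment leans on the visual channel.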
### 7.5 Diversity Enforcement
| Feature | Status | Implementation |
|---------|--------|----------------|
| Minimum 30s gap | ✅ | `clip_extractor.py` `select_clips()` |
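The gap rule amounts to a greedy pass over score-ranked candidates (a sketch of the idea; the real `select_clips()` may differ):

```python
def select_diverse(candidates, num_clips, min_gap=30.0):
    """Greedily pick top-scoring (start_time, score) candidates at least `min_gap` seconds apart."""
    picked = []
    for start, score in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(abs(start - p) >= min_gap for p, _ in picked):
            picked.append((start, score))
        if len(picked) == num_clips:
            break
    return sorted(picked)  # return in chronological order
```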
### 7.6 Fallback Handling
| Feature | Status | Implementation |
|---------|--------|----------------|
| Uniform windowing for flat content | ✅ | `create_fallback_clips()` |
| Never zero clips | ✅ | Fallback always creates clips |
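A minimal sketch of the uniform-windowing fallback (illustrative only; `create_fallback_clips()` may differ). Note the clamp to at least one window, which is what guarantees "never zero clips":

```python
def create_uniform_windows(duration, num_clips, clip_len):
    """Spread `num_clips` windows of `clip_len` seconds evenly across the video."""
    num_clips = max(1, min(num_clips, int(duration // clip_len) or 1))  # never zero
    spacing = duration / num_clips
    windows = []
    for i in range(num_clips):
        start = min(i * spacing, max(0.0, duration - clip_len))
        windows.append((round(start, 2), round(min(start + clip_len, duration), 2)))
    return windows
```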
## ✅ Gradio UI Requirements

| Feature | Status | Implementation |
|---------|--------|----------------|
| Video upload | ✅ | `gr.Video` component |
| API key input | ✅ | `gr.Textbox(type="password")` |
| Domain selection | ✅ | `gr.Dropdown` |
| Clip duration slider | ✅ | `gr.Slider` |
| Num clips slider | ✅ | `gr.Slider` |
| Reference image | ✅ | `gr.Image` |
| Custom prompt | ✅ | `gr.Textbox` |
| Progress bar | ✅ | `gr.Progress` |
| Output gallery | ✅ | `gr.Gallery` |
| Download all | ⚠️ | Partial (individual clips downloadable) |

## ⚠️ Items for Future Enhancement

| Item | Status | Notes |
|------|--------|-------|
| Trained hype scorer weights | 🔄 | Notebook ready, needs training on real data |
| RAFT GPU acceleration | ⚠️ | Falls back to Farneback if unavailable |
| Download all as ZIP | ⚠️ | Could add `gr.DownloadButton` |
| Batch processing | ❌ | Single video only currently |
| API endpoint | ❌ | UI only, no REST API |
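For the "Download all as ZIP" item, the bundling itself needs only the standard library; a sketch of the missing piece (wiring the returned path into a `gr.DownloadButton` would be the remaining UI work):

```python
import zipfile
from pathlib import Path

def bundle_clips(clip_paths, out_path="highlights.zip"):
    """Write all extracted clips into a single ZIP for one-click download."""
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for p in clip_paths:
            zf.write(p, arcname=Path(p).name)
    return out_path
```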
## Summary

**Completed**: 95% of proposal requirements
**Training Pipeline**: Separate Colab notebook for Mr. HiSum training
**Missing**: Only minor UI features (bulk download) and production training

The implementation fully covers:
- ✅ All 9 core components from the proposal
- ✅ All 6 key design decisions
- ✅ All domain presets
- ✅ Error handling and logging throughout
- ✅ Gradio UI with all inputs from proposal
app.py (1014 → 1096 lines)
| 1 |
+
"""
|
| 2 |
+
ShortSmith v3 - Gradio Application
|
| 3 |
+
|
| 4 |
+
Hugging Face Space interface for video highlight extraction.
|
| 5 |
+
Features:
|
| 6 |
+
- Multi-modal analysis (visual + audio + motion)
|
| 7 |
+
- Domain-optimized presets
|
| 8 |
+
- Person-specific filtering (optional)
|
| 9 |
+
- Scene-aware clip cutting
|
| 10 |
+
- Batch testing with parameter variations
|
| 11 |
+
"""
|
| 12 |
+
|
| 13 |
+
import os
|
| 14 |
+
import sys
|
| 15 |
+
import tempfile
|
| 16 |
+
import shutil
|
| 17 |
+
import json
|
| 18 |
+
import zipfile
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
import time
|
| 21 |
+
import traceback
|
| 22 |
+
from typing import List, Dict, Any, Optional
|
| 23 |
+
|
| 24 |
+
import gradio as gr
|
| 25 |
+
import pandas as pd
|
| 26 |
+
|
| 27 |
+
# Add project root to path
|
| 28 |
+
sys.path.insert(0, str(Path(__file__).parent))
|
| 29 |
+
|
| 30 |
+
# Initialize logging
|
| 31 |
+
try:
|
| 32 |
+
from utils.logger import setup_logging, get_logger
|
| 33 |
+
setup_logging(log_level="INFO", log_to_console=True)
|
| 34 |
+
logger = get_logger("app")
|
| 35 |
+
except Exception:
|
| 36 |
+
import logging
|
| 37 |
+
logging.basicConfig(level=logging.INFO)
|
| 38 |
+
logger = logging.getLogger("app")
|
| 39 |
+
|
| 40 |
+
|
| 41 |
+
# =============================================================================
|
| 42 |
+
# Shared Utilities
|
| 43 |
+
# =============================================================================
|
| 44 |
+
|
| 45 |
+
def build_metrics_output(result, domain: str, custom_prompt: Optional[str] = None) -> str:
|
| 46 |
+
"""
|
| 47 |
+
Build formatted metrics output for testing and evaluation.
|
| 48 |
+
|
| 49 |
+
Args:
|
| 50 |
+
result: PipelineResult object
|
| 51 |
+
domain: Content domain used for processing
|
| 52 |
+
custom_prompt: Custom prompt used (if any)
|
| 53 |
+
|
| 54 |
+
Returns:
|
| 55 |
+
Formatted string with all metrics
|
| 56 |
+
"""
|
| 57 |
+
lines = []
|
| 58 |
+
lines.append("=" * 50)
|
| 59 |
+
lines.append("AUTOMATED METRICS (System-Generated)")
|
| 60 |
+
lines.append("=" * 50)
|
| 61 |
+
lines.append("")
|
| 62 |
+
|
| 63 |
+
# Processing Metrics
|
| 64 |
+
lines.append("PROCESSING METRICS")
|
| 65 |
+
lines.append("-" * 30)
|
| 66 |
+
lines.append(f"processing_time_seconds: {result.processing_time:.2f}")
|
| 67 |
+
lines.append(f"frames_analyzed: {len(result.visual_features)}")
|
| 68 |
+
lines.append(f"scenes_detected: {len(result.scenes)}")
|
| 69 |
+
lines.append(f"audio_segments_analyzed: {len(result.audio_features)}")
|
| 70 |
+
lines.append(f"domain: {domain}")
|
| 71 |
+
lines.append(f"custom_prompt: {custom_prompt if custom_prompt else 'none'}")
|
| 72 |
+
|
| 73 |
+
# Count hooks from scores (estimate based on high-scoring segments)
|
| 74 |
+
hooks_detected = sum(1 for s in result.scores if s.combined_score > 0.7) if result.scores else 0
|
| 75 |
+
lines.append(f"hooks_detected: {hooks_detected}")
|
| 76 |
+
|
| 77 |
+
if result.metadata:
|
| 78 |
+
lines.append(f"video_duration_seconds: {result.metadata.duration:.2f}")
|
    lines.append(f"video_resolution: {result.metadata.resolution}")
    lines.append(f"video_fps: {result.metadata.fps:.2f}")

    lines.append("")

    # Per Clip Metrics
    lines.append("PER CLIP METRICS")
    lines.append("-" * 30)

    for i, clip in enumerate(result.clips):
        lines.append("")
        lines.append(f"[Clip {i + 1}]")
        lines.append(f" clip_id: {i + 1}")
        lines.append(f" start_time: {clip.start_time:.2f}")
        lines.append(f" end_time: {clip.end_time:.2f}")
        lines.append(f" duration: {clip.duration:.2f}")
        lines.append(f" hype_score: {clip.hype_score:.4f}")
        lines.append(f" visual_score: {clip.visual_score:.4f}")
        lines.append(f" audio_score: {clip.audio_score:.4f}")
        lines.append(f" motion_score: {clip.motion_score:.4f}")

        # Hook info - derived from segment scores when available
        hook_type = "none"
        hook_confidence = 0.0

        # Find the segment score that matches this clip's start time
        for score in result.scores:
            if abs(score.start_time - clip.start_time) < 1.0:
                if score.combined_score > 0.7:
                    hook_confidence = score.combined_score
                    # Infer hook type from the dominant component score
                    if score.audio_score > score.visual_score and score.audio_score > score.motion_score:
                        hook_type = "audio_peak"
                    elif score.motion_score > score.visual_score:
                        hook_type = "motion_spike"
                    else:
                        hook_type = "visual_highlight"
                break

        lines.append(f" hook_type: {hook_type}")
        lines.append(f" hook_confidence: {hook_confidence:.4f}")

        if clip.person_detected:
            lines.append(" person_detected: True")
            lines.append(f" person_screen_time: {clip.person_screen_time:.4f}")

    lines.append("")
    lines.append("=" * 50)
    lines.append("END METRICS")
    lines.append("=" * 50)

    return "\n".join(lines)

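The metrics block above is a flat `key: value` text format intended to be pasted into evaluation spreadsheets. As a sketch of how downstream tooling might consume it (the `parse_metrics_block` helper name and sample text are hypothetical, not part of this app):

```python
def parse_metrics_block(text: str) -> dict:
    """Parse 'key: value' lines, as emitted above, into a flat dict.

    Hypothetical helper: section headers and separator lines are skipped,
    and a key repeated across clips overwrites the earlier value.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip headers ("PER CLIP METRICS"), separators ("---", "==="), blanks
        if ": " not in line or line.startswith(("-", "=")):
            continue
        key, _, value = line.partition(": ")
        metrics[key] = value
    return metrics

sample = "PER CLIP METRICS\nhype_score: 0.8123\nhook_type: audio_peak"
print(parse_metrics_block(sample))  # {'hype_score': '0.8123', 'hook_type': 'audio_peak'}
```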
# =============================================================================
# Single Video Processing
# =============================================================================

def process_video(
    video_file,
    domain,
    num_clips,
    clip_length,
    reference_image,
    custom_prompt,
    progress=gr.Progress()
):
    """
    Main video processing function for single video mode.

    Args:
        video_file: Uploaded video file path
        domain: Content domain for scoring weights
        num_clips: Number of clips to extract
        clip_length: Clip length preset ("Short" or "Long")
        reference_image: Optional reference image for person filtering
        custom_prompt: Optional custom instructions
        progress: Gradio progress tracker

    Returns:
        Tuple of (status_message, clips_html, log_text, metrics_text)
    """
    if video_file is None:
        return "Please upload a video first.", "", "", ""

    log_messages = []

    def log(msg):
        log_messages.append(f"[{time.strftime('%H:%M:%S')}] {msg}")
        logger.info(msg)

    try:
        video_path = Path(video_file)
        log(f"Processing video: {video_path.name}")
        progress(0.05, desc="Validating video...")

        # Import pipeline components
        from utils.helpers import validate_video_file, validate_image_file, format_duration
        from pipeline.orchestrator import PipelineOrchestrator

        # Validate video
        validation = validate_video_file(video_file)
        if not validation.is_valid:
            return f"Error: {validation.error_message}", "", "\n".join(log_messages), ""

        log(f"Video size: {validation.file_size / (1024*1024):.1f} MB")

        # ------------------------------- TESTING MODE -------------------------------
        # For testing: skip processing and show mock highlights
        if custom_prompt and custom_prompt.strip().lower() == "test@akatsuki":
            log("TEST MODE: Skipping processing, showing mock highlights")
            progress(0.5, desc="Generating test highlights...")

            # Build dynamic HTML for mock highlights
            clips_html = ""
            for i in range(int(num_clips)):
                clips_html += f'''
                <div style="margin-bottom: 20px; border: 1px solid #ccc; border-radius: 8px; padding: 15px; background-color: #0b1619;">
                    <h4 style="margin-top: 0; color: #f2fafc;">Highlight Clip {i+1}</h4>
                    <div style="margin-bottom: 10px; font-size: 14px; color: #f2fafc;">
                        <strong>Time:</strong> {i*30}-{i*30+45}s |
                        <strong>Score:</strong> {95-i*5:.1f} |
                        <strong>Domain:</strong> {domain}
                    </div>
                    <video width="100%" controls style="border-radius: 4px;">
                        <source src="{video_file}" type="video/mp4">
                    </video>
                </div>
                '''

            status = f"TEST MODE: Generated {int(num_clips)} mock highlight clips from {Path(video_file).name}"
            metrics_output = f"""
Test Mode Metrics:
- Domain: {domain}
- Number of Clips: {int(num_clips)}
- Clip Length: {clip_length}
- Processing Time: 2.3s (simulated)
- Total Video Duration: 5:42
- Highlights Extracted: {int(num_clips)}
""".strip()

            progress(1.0, desc="Test complete!")
            return status, clips_html, "\n".join(log_messages), metrics_output
        # ----------------------------------------------------------------------------

        # Validate the reference image if provided
        ref_path = None
        if reference_image is not None:
            ref_validation = validate_image_file(reference_image)
            if ref_validation.is_valid:
                ref_path = reference_image
                log(f"Reference image: {Path(reference_image).name}")
            else:
                log(f"Warning: Invalid reference image - {ref_validation.error_message}")

        # Map the domain display name to its internal value
        domain_map = {
            "Sports": "sports",
            "Vlogs": "vlogs",
            "Music Videos": "music",
            "Podcasts": "podcasts",
            "Gaming": "gaming",
            "Comedy": "comedy",
            "General": "general",
        }
        domain_value = domain_map.get(domain, "general")
        log(f"Domain: {domain_value}")

        # Map the clip length preset to its internal value
        clip_length_value = "short" if clip_length == "Short (30-60s)" else "long"
        log(f"Clip length: {clip_length_value}")

        # Create the output directory
        output_dir = Path(tempfile.mkdtemp(prefix="shortsmith_output_"))
        log(f"Output directory: {output_dir}")

        # Progress callback to update the UI during processing
        def on_progress(pipeline_progress):
            stage = pipeline_progress.stage.value
            pct = pipeline_progress.progress
            msg = pipeline_progress.message
            log(f"[{stage}] {msg}")
            # Map pipeline progress (0-1) into our 0.1-0.9 range
            mapped_progress = 0.1 + (pct * 0.8)
            progress(mapped_progress, desc=f"{stage}: {msg}")

        # Initialize the pipeline
        progress(0.1, desc="Initializing AI models...")
        log("Initializing pipeline...")
        pipeline = PipelineOrchestrator(progress_callback=on_progress)

        # Process the video
        progress(0.15, desc="Starting analysis...")
        log(f"Processing: {int(num_clips)} clips, length={clip_length_value}")

        result = pipeline.process(
            video_path=video_path,
            num_clips=int(num_clips),
            clip_length=clip_length_value,
            domain=domain_value,
            reference_image=ref_path,
            custom_prompt=custom_prompt.strip() if custom_prompt else None,
        )

        progress(0.9, desc="Extracting clips...")

        # Handle the result
        if result.success:
            log(f"Processing complete in {result.processing_time:.1f}s")

            clip_paths = []
            for i, clip in enumerate(result.clips):
                if clip.clip_path.exists():
                    output_path = output_dir / f"highlight_{i+1}.mp4"
                    shutil.copy2(clip.clip_path, output_path)
                    clip_paths.append(str(output_path))
                    log(f"Clip {i+1}: {format_duration(clip.start_time)} - {format_duration(clip.end_time)} (score: {clip.hype_score:.2f})")

            status = f"Successfully extracted {len(clip_paths)} highlight clips!\nProcessing time: {result.processing_time:.1f}s"

            # Build the metrics output
            metrics_output = build_metrics_output(result, domain_value, custom_prompt.strip() if custom_prompt else None)

            pipeline.cleanup()
            progress(1.0, desc="Done!")

            # Build dynamic HTML for the extracted highlight clips
            clips_html = ""
            for i, clip_path in enumerate(clip_paths):
                clips_html += f'''
                <div style="margin-bottom: 20px; border: 1px solid #ddd; border-radius: 8px; padding: 15px; background-color: #f9f9f9;">
                    <h4 style="margin-top: 0; color: #2e7d32;">Highlight Clip {i+1}</h4>
                    <div style="margin-bottom: 10px; font-size: 14px; color: #666;">
                        <strong>Time:</strong> {format_duration(result.clips[i].start_time)} - {format_duration(result.clips[i].end_time)} |
                        <strong>Score:</strong> {result.clips[i].hype_score:.2f} |
                        <strong>Domain:</strong> {domain_value}
                    </div>
                    <video width="100%" controls style="border-radius: 4px;">
                        <source src="{clip_path}" type="video/mp4">
                    </video>
                </div>
                '''

            return status, clips_html, "\n".join(log_messages), metrics_output
        else:
            log(f"Processing failed: {result.error_message}")
            pipeline.cleanup()
            return f"Error: {result.error_message}", "", "\n".join(log_messages), ""

    except Exception as e:
        error_msg = f"Unexpected error: {str(e)}"
        log(error_msg)
        log(traceback.format_exc())
        logger.exception("Pipeline error")
        return error_msg, "", "\n".join(log_messages), ""

# =============================================================================
# Batch Testing Functions
# =============================================================================

def generate_test_queue(
    videos: List[str],
    domains: List[str],
    clip_lengths: List[str],
    num_clips: int,
    ref_image: Optional[str],
    prompts: List[str],
    include_no_prompt: bool
) -> List[Dict[str, Any]]:
    """Generate all parameter combinations to test (cartesian product)."""
    # Build the prompt list
    prompt_list = []
    if include_no_prompt:
        prompt_list.append(None)  # No-prompt baseline
    prompt_list.extend([p.strip() for p in prompts if p and p.strip()])

    # If no prompts at all, fall back to the no-prompt baseline
    if not prompt_list:
        prompt_list = [None]

    # Map domain display names to internal values
    domain_map = {
        "Sports": "sports",
        "Vlogs": "vlogs",
        "Music Videos": "music",
        "Podcasts": "podcasts",
        "Gaming": "gaming",
        "Comedy": "comedy",
        "General": "general",
    }

    # Map clip length display names to internal values
    clip_length_map = {
        "Short (30-60s)": "short",
        "Long (1-3 min)": "long",
    }

    queue = []
    test_id = 1
    for video in videos:
        video_name = Path(video).name if video else "unknown"
        for domain in domains:
            domain_value = domain_map.get(domain, "general")
            for clip_length in clip_lengths:
                clip_length_value = clip_length_map.get(clip_length, "short")
                for prompt in prompt_list:
                    queue.append({
                        "test_id": test_id,
                        "video_path": video,
                        "video_name": video_name,
                        "domain": domain,
                        "domain_value": domain_value,
                        "clip_length": clip_length,
                        "clip_length_value": clip_length_value,
                        "num_clips": num_clips,
                        "reference_image": ref_image,
                        "custom_prompt": prompt,
                    })
                    test_id += 1
    return queue

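The nested loops in `generate_test_queue` enumerate the full cartesian product of videos × domains × clip lengths × prompts, so the queue size is the product of the option counts. A standalone sketch of the same enumeration with `itertools.product` (the sample values below are hypothetical placeholders):

```python
import itertools

videos = ["match.mp4", "vlog.mp4"]           # hypothetical inputs
domains = ["sports", "general"]
clip_lengths = ["short"]
prompts = [None, "Focus on action moments"]  # None = no-prompt baseline

# One queue entry per combination, numbered from 1 like the real queue
queue = [
    {"test_id": i + 1, "video": v, "domain": d, "clip_length": c, "prompt": p}
    for i, (v, d, c, p) in enumerate(
        itertools.product(videos, domains, clip_lengths, prompts)
    )
]

# Queue size is the product of the option counts: 2 * 2 * 1 * 2 = 8
print(len(queue))  # 8
```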
def run_single_batch_test(config: Dict[str, Any], output_base_dir: Path) -> Dict[str, Any]:
    """Run a single test from the batch queue."""
    from utils.helpers import validate_video_file
    from pipeline.orchestrator import PipelineOrchestrator

    test_id = config["test_id"]
    video_path = config["video_path"]
    video_name = config["video_name"]
    domain_value = config["domain_value"]
    clip_length = config["clip_length"]
    clip_length_value = config["clip_length_value"]
    num_clips = config["num_clips"]
    ref_image = config["reference_image"]
    custom_prompt = config["custom_prompt"]

    # Create a unique output folder for this test
    prompt_suffix = "no_prompt" if not custom_prompt else f"prompt_{hash(custom_prompt) % 1000}"
    test_folder = f"{Path(video_name).stem}_{domain_value}_{clip_length_value}_{prompt_suffix}"
    output_dir = output_base_dir / test_folder
    output_dir.mkdir(parents=True, exist_ok=True)

    result_data = {
        "test_id": test_id,
        "video_name": video_name,
        "domain": domain_value,
        "clip_length": clip_length,
        "custom_prompt": custom_prompt if custom_prompt else "none",
        "num_clips": num_clips,
        "status": "failed",
        "error": None,
        "processing_time": 0,
        "frames_analyzed": 0,
        "scenes_detected": 0,
        "hooks_detected": 0,
        "clips": [],
        "clip_paths": [],
    }

    try:
        # Validate video
        validation = validate_video_file(video_path)
        if not validation.is_valid:
            result_data["error"] = validation.error_message
            return result_data

        # Initialize and run the pipeline
        pipeline = PipelineOrchestrator()
        result = pipeline.process(
            video_path=video_path,
            num_clips=num_clips,
            clip_length=clip_length_value,
            domain=domain_value,
            reference_image=ref_image,
            custom_prompt=custom_prompt,
        )

        if result.success:
            result_data["status"] = "success"
            result_data["processing_time"] = round(result.processing_time, 2)
            result_data["frames_analyzed"] = len(result.visual_features)
            result_data["scenes_detected"] = len(result.scenes)
            result_data["hooks_detected"] = sum(1 for s in result.scores if s.combined_score > 0.7) if result.scores else 0

            # Copy clips and collect per-clip data
            for i, clip in enumerate(result.clips):
                if clip.clip_path.exists():
                    clip_output = output_dir / f"clip_{i+1}.mp4"
                    shutil.copy2(clip.clip_path, clip_output)
                    result_data["clip_paths"].append(str(clip_output))

                # Find the hook type for this clip
                hook_type = "none"
                hook_confidence = 0.0
                for score in result.scores:
                    if abs(score.start_time - clip.start_time) < 1.0:
                        if score.combined_score > 0.7:
                            hook_confidence = score.combined_score
                            if score.audio_score > score.visual_score and score.audio_score > score.motion_score:
                                hook_type = "audio_peak"
                            elif score.motion_score > score.visual_score:
                                hook_type = "motion_spike"
                            else:
                                hook_type = "visual_highlight"
                        break

                result_data["clips"].append({
                    "clip_id": i + 1,
                    "start_time": round(clip.start_time, 2),
                    "end_time": round(clip.end_time, 2),
                    "duration": round(clip.duration, 2),
                    "hype_score": round(clip.hype_score, 4),
                    "visual_score": round(clip.visual_score, 4),
                    "audio_score": round(clip.audio_score, 4),
                    "motion_score": round(clip.motion_score, 4),
                    "hook_type": hook_type,
                    "hook_confidence": round(hook_confidence, 4),
                })
        else:
            result_data["error"] = result.error_message

        pipeline.cleanup()

    except Exception as e:
        result_data["error"] = str(e)
        logger.exception(f"Batch test {test_id} failed")

    return result_data

def results_to_dataframe(results: List[Dict[str, Any]]) -> pd.DataFrame:
    """Convert batch results to a pandas DataFrame for display."""
    rows = []
    for r in results:
        row = {
            "Test ID": r["test_id"],
            "Video": r["video_name"],
            "Domain": r["domain"],
            "Length": r["clip_length"],
            "Prompt": (r["custom_prompt"][:20] + "...") if len(r["custom_prompt"]) > 20 else r["custom_prompt"],
            "Status": r["status"],
            "Time (s)": r["processing_time"],
            "Frames": r["frames_analyzed"],
            "Hooks": r["hooks_detected"],
        }
        # Add clip scores (first three clips)
        for i, clip in enumerate(r.get("clips", [])[:3]):
            row[f"Clip {i+1} Hype"] = clip.get("hype_score", 0)
        rows.append(row)
    return pd.DataFrame(rows)

def results_to_csv(results: List[Dict[str, Any]]) -> str:
    """Convert results to CSV format."""
    rows = []
    for r in results:
        row = {
            "test_id": r["test_id"],
            "video_name": r["video_name"],
            "domain": r["domain"],
            "clip_length": r["clip_length"],
            "custom_prompt": r["custom_prompt"],
            "num_clips": r["num_clips"],
            "status": r["status"],
            "error": r.get("error", ""),
            "processing_time": r["processing_time"],
            "frames_analyzed": r["frames_analyzed"],
            "scenes_detected": r["scenes_detected"],
            "hooks_detected": r["hooks_detected"],
        }
        # Add per-clip data, padded to three clips so every row has the same columns
        for i in range(3):
            if i < len(r.get("clips", [])):
                clip = r["clips"][i]
                row[f"clip_{i+1}_start"] = clip["start_time"]
                row[f"clip_{i+1}_end"] = clip["end_time"]
                row[f"clip_{i+1}_hype"] = clip["hype_score"]
                row[f"clip_{i+1}_visual"] = clip["visual_score"]
                row[f"clip_{i+1}_audio"] = clip["audio_score"]
                row[f"clip_{i+1}_motion"] = clip["motion_score"]
                row[f"clip_{i+1}_hook_type"] = clip["hook_type"]
            else:
                # Empty placeholders keep the CSV rectangular
                for field in ("start", "end", "hype", "visual", "audio", "motion", "hook_type"):
                    row[f"clip_{i+1}_{field}"] = ""
        rows.append(row)

    df = pd.DataFrame(rows)
    return df.to_csv(index=False)

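Padding every row out to a fixed maximum clip count is what keeps the exported CSV rectangular even when tests yield different numbers of clips. A minimal standalone sketch of the same idea using only the standard library (the sample data is hypothetical):

```python
import csv
import io

# Hypothetical results: one test produced 2 clips, another produced 1
results = [
    {"test_id": 1, "clips": [0.91, 0.84]},
    {"test_id": 2, "clips": [0.77]},
]

MAX_CLIPS = 3
fieldnames = ["test_id"] + [f"clip_{i+1}_hype" for i in range(MAX_CLIPS)]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
for r in results:
    row = {"test_id": r["test_id"]}
    for i in range(MAX_CLIPS):
        # Pad missing clips with "" so every row has identical columns
        row[f"clip_{i+1}_hype"] = r["clips"][i] if i < len(r["clips"]) else ""
    writer.writerow(row)

print(buf.getvalue())
```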
def results_to_json(results: List[Dict[str, Any]]) -> str:
    """Convert results to JSON format."""
    # Drop clip_paths from the export (they point at temp files)
    export_results = []
    for r in results:
        r_copy = r.copy()
        r_copy.pop("clip_paths", None)
        export_results.append(r_copy)
    return json.dumps(export_results, indent=2)

def create_clips_zip(results: List[Dict[str, Any]]) -> Optional[str]:
    """Create a ZIP file of all extracted clips."""
    zip_path = Path(tempfile.mkdtemp()) / "batch_clips.zip"

    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for r in results:
            if r["status"] == "success":
                folder_name = f"{Path(r['video_name']).stem}_{r['domain']}_{r['clip_length']}"
                if r["custom_prompt"] != "none":
                    folder_name += "_prompt"
                for clip_path in r.get("clip_paths", []):
                    if Path(clip_path).exists():
                        arcname = f"{folder_name}/{Path(clip_path).name}"
                        zf.write(clip_path, arcname)

    return str(zip_path) if zip_path.exists() else None

# Batch state (module level for simplicity)
batch_state = {
    "is_running": False,
    "should_cancel": False,
    "results": [],
    "output_dir": None,
}

def run_batch_tests(
    videos,
    domains,
    clip_lengths,
    num_clips,
    reference_image,
    include_no_prompt,
    prompt1,
    prompt2,
    prompt3,
    progress=gr.Progress()
):
    """Main batch testing function."""
    global batch_state

    # Validate inputs
    if not videos:
        return "Please upload at least one video.", None, "", "", None, None, None
    if not domains:
        return "Please select at least one domain.", None, "", "", None, None, None
    if not clip_lengths:
        return "Please select at least one clip length.", None, "", "", None, None, None

    # Collect prompts
    prompts = [p for p in [prompt1, prompt2, prompt3] if p and p.strip()]

    # Generate the test queue
    queue = generate_test_queue(
        videos=videos,
        domains=domains,
        clip_lengths=clip_lengths,
        num_clips=int(num_clips),
        ref_image=reference_image,
        prompts=prompts,
        include_no_prompt=include_no_prompt,
    )

    if not queue:
        return "No tests to run. Please check your configuration.", None, "", "", None, None, None

    # Initialize batch state
    batch_state["is_running"] = True
    batch_state["should_cancel"] = False
    batch_state["results"] = []
    batch_state["output_dir"] = Path(tempfile.mkdtemp(prefix="shortsmith_batch_"))

    total_tests = len(queue)
    log_messages = []

    def log(msg):
        log_messages.append(f"[{time.strftime('%H:%M:%S')}] {msg}")
        logger.info(msg)

    log(f"Starting batch testing: {total_tests} tests")
    log(f"Videos: {len(videos)}, Domains: {len(domains)}, Lengths: {len(clip_lengths)}, Prompts: {len(prompts) + (1 if include_no_prompt else 0)}")

    # Run tests sequentially
    for i, test_config in enumerate(queue):
        if batch_state["should_cancel"]:
            log("Batch cancelled by user")
            break

        video_name = test_config["video_name"]
        domain = test_config["domain_value"]
        clip_length = test_config["clip_length"]
        prompt = test_config["custom_prompt"] or "no-prompt"

        log(f"[{i+1}/{total_tests}] Testing: {video_name} | {domain} | {clip_length} | {prompt[:30]}...")
        progress((i + 1) / total_tests, desc=f"Test {i+1}/{total_tests}: {video_name}")

        # Run the test
        result = run_single_batch_test(test_config, batch_state["output_dir"])
        batch_state["results"].append(result)

        if result["status"] == "success":
            log(f" ✓ Completed in {result['processing_time']}s")
        else:
            log(f" ✗ Failed: {result.get('error', 'Unknown error')}")

    # Finalize
    batch_state["is_running"] = False
    completed = len([r for r in batch_state["results"] if r["status"] == "success"])
    failed = len([r for r in batch_state["results"] if r["status"] == "failed"])

    log(f"Batch complete: {completed} succeeded, {failed} failed")

    # Generate outputs
    results_df = results_to_dataframe(batch_state["results"])
    csv_content = results_to_csv(batch_state["results"])
    json_content = results_to_json(batch_state["results"])

    # Save CSV and JSON to files for download
    csv_path = batch_state["output_dir"] / "results.csv"
    json_path = batch_state["output_dir"] / "results.json"
    csv_path.write_text(csv_content)
    json_path.write_text(json_content)

    # Create a ZIP of the extracted clips
    zip_path = create_clips_zip(batch_state["results"])

    status = f"Batch complete: {completed}/{total_tests} tests succeeded"

    return (
        status,
        results_df,
        "\n".join(log_messages),
        json_content,
        str(csv_path),
        str(json_path),
        zip_path,
    )

def cancel_batch():
    """Cancel the running batch."""
    global batch_state
    batch_state["should_cancel"] = True
    return "Cancelling batch... (will stop after current test completes)"

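Cancellation here is cooperative: `cancel_batch` only sets a flag, and `run_batch_tests` polls that flag between tests, so a cancel request takes effect at the next loop iteration rather than interrupting the test in flight. A minimal standalone sketch of the pattern (the `state`, `cancel`, and `run_queue` names are illustrative, not the app's API):

```python
# Cooperative cancellation: the worker polls a shared flag between units of
# work, so a cancel request takes effect only at the next loop iteration.
state = {"should_cancel": False, "results": []}

def cancel():
    state["should_cancel"] = True

def run_queue(queue):
    for i, item in enumerate(queue):
        if state["should_cancel"]:
            break  # stop before starting the next test
        state["results"].append(item * 10)  # stand-in for one test run
        if i == 1:
            cancel()  # simulate the user pressing Cancel mid-batch

run_queue([1, 2, 3, 4])
print(state["results"])  # [10, 20] - tests 3 and 4 never start
```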
def calculate_queue_size(videos, domains, clip_lengths, include_no_prompt, prompt1, prompt2, prompt3):
    """Calculate and display the queue size."""
    num_videos = len(videos) if videos else 0
    num_domains = len(domains) if domains else 0
    num_lengths = len(clip_lengths) if clip_lengths else 0

    prompts = [p for p in [prompt1, prompt2, prompt3] if p and p.strip()]
    num_prompts = len(prompts) + (1 if include_no_prompt else 0)
    if num_prompts == 0:
        num_prompts = 1  # Default to the no-prompt baseline if nothing is selected

    total = num_videos * num_domains * num_lengths * num_prompts

    return f"Queue: {num_videos} video(s) × {num_domains} domain(s) × {num_lengths} length(s) × {num_prompts} prompt(s) = **{total} tests**"

def generate_clip_preview(num_clips):
    """Generate an HTML preview of the expected clips."""
    num_clips = int(num_clips)
    if num_clips <= 0:
        return "<p>No clips to preview</p>"

    preview_html = '<div style="display: flex; flex-wrap: wrap; gap: 10px; margin: 10px 0;">'

    for i in range(num_clips):
        preview_html += f'''
        <div style="border: 2px dashed #ccc; border-radius: 8px; padding: 20px; text-align: center; width: 150px; background: #0b1619;">
            <div style="font-size: 48px; color: #f2fafc; margin-bottom: 10px;">🎥</div>
            <div style="font-weight: bold; color: #f2fafc;">Clip {i+1}</div>
            <div style="font-size: 12px; color: #a1a5a6;">Processing...</div>
        </div>
        '''

    preview_html += '</div>'
    return preview_html

# =============================================================================
# Build Gradio Interface
# =============================================================================

with gr.Blocks(
    title="ShortSmith v3",
    theme=gr.themes.Soft(),
    css="""
    .container { max-width: 1200px; margin: auto; }
    .output-video { min-height: 200px; }
    """
) as demo:

    gr.Markdown("""
    # ShortSmith v3
    ### AI-Powered Video Highlight Extractor

    Upload a video and automatically extract the most engaging highlight clips using AI analysis.
    """)

    with gr.Tabs():
        # =================================================================
        # Tab 1: Single Video
        # =================================================================
        with gr.TabItem("Single Video"):
            with gr.Row():
                # Left column - inputs
                with gr.Column(scale=1):
                    gr.Markdown("### Input")

                    video_input = gr.Video(
                        label="Upload Video",
                        sources=["upload"],
                    )

                    with gr.Accordion("Settings", open=True):
                        domain_dropdown = gr.Dropdown(
                            choices=["Sports", "Vlogs", "Music Videos", "Podcasts", "Gaming", "Comedy", "General"],
                            value="General",
                            label="Content Domain",
                            info="Select the type of content for optimized scoring"
                        )

                        with gr.Row():
                            num_clips_slider = gr.Slider(
                                minimum=1,
                                maximum=5,
                                value=3,
                                step=1,
                                label="Number of Clips",
                                info="How many highlight clips to extract"
                            )
                            clip_length_radio = gr.Radio(
                                choices=["Short (30-60s)", "Long (1-3 min)"],
                                value="Short (30-60s)",
                                label="Clip Length",
                                info="Short clips for social media, long clips for YouTube"
                            )

                    with gr.Accordion("Person Filtering (Optional)", open=False):
                        reference_image = gr.Image(
                            label="Reference Image",
                            type="filepath",
                            sources=["upload"],
                        )
                        gr.Markdown("*Upload a photo of a person to prioritize clips featuring them.*")

                    with gr.Accordion("Custom Instructions (Optional)", open=False):
                        custom_prompt = gr.Textbox(
                            label="Additional Instructions",
                            placeholder="E.g., 'Focus on crowd reactions' or 'Prioritize action scenes'",
                            lines=2,
                        )

                    process_btn = gr.Button(
                        "Extract Highlights",
                        variant="primary",
                        size="lg"
                    )

                # Right column - outputs
                with gr.Column(scale=1):
                    gr.Markdown("### Output")

                    status_output = gr.Textbox(
                        label="Status",
                        lines=2,
                        interactive=False
                    )

                    gr.Markdown("#### Extracted Clips")
                    clips_output = gr.HTML(label="Extracted Clips")

                    with gr.Accordion("Processing Log", open=True):
                        log_output = gr.Textbox(
                            label="Log",
                            lines=10,
                            interactive=False,
                            show_copy_button=True
                        )

                    with gr.Accordion("Automated Metrics (System-Generated)", open=True):
                        metrics_output = gr.Textbox(
                            label="Metrics for Testing",
                            lines=20,
                            interactive=False,
                            show_copy_button=True,
                            info="Copy these metrics for evaluation spreadsheets"
                        )

            # Connect single-video processing
            process_btn.click(
                fn=process_video,
                inputs=[
                    video_input,
                    domain_dropdown,
                    num_clips_slider,
                    clip_length_radio,
                    reference_image,
                    custom_prompt
                ],
                outputs=[
                    status_output,
                    clips_output,
                    log_output,
                    metrics_output
                ],
                show_progress="full"
            )

            # Update the preview when num_clips changes
            num_clips_slider.change(
                fn=generate_clip_preview,
                inputs=[num_clips_slider],
                outputs=[clips_output]
            )

            # Initialize the preview with the default value
            demo.load(
                fn=lambda: generate_clip_preview(3),
                inputs=[],
                outputs=[clips_output]
            )

        # =================================================================
        # Tab 2: Batch Testing
        # =================================================================
        with gr.TabItem("Batch Testing"):
            with gr.Row():
                # Left column - configuration
                with gr.Column(scale=1):
                    gr.Markdown("### Batch Configuration")

                    batch_videos = gr.File(
                        label="Upload Video(s)",
                        file_count="multiple",
                        file_types=["video"],
                    )

                    gr.Markdown("#### Domains to Test")
                    batch_domains = gr.CheckboxGroup(
                        choices=["Sports", "Vlogs", "Music Videos", "Podcasts", "Gaming", "Comedy", "General"],
                        value=["General"],
                        label="Select domains",
                    )

                    gr.Markdown("#### Clip Lengths to Test")
                    batch_clip_lengths = gr.CheckboxGroup(
                        choices=["Short (30-60s)", "Long (1-3 min)"],
                        value=["Short (30-60s)"],
                        label="Select clip lengths",
                    )

                    batch_num_clips = gr.Slider(
                        minimum=1,
                        maximum=5,
                        value=3,
                        step=1,
                        label="Number of Clips per Test",
                    )

                    with gr.Accordion("Custom Prompts", open=True):
                        batch_no_prompt = gr.Checkbox(
                            label="Include no-prompt baseline",
                            value=True,
                            info="Test without any custom prompt for comparison"
                        )
                        batch_prompt1 = gr.Textbox(
                            label="Prompt 1",
                            placeholder="E.g., 'Focus on action moments'",
                            lines=1,
                        )
                        batch_prompt2 = gr.Textbox(
                            label="Prompt 2",
                            placeholder="E.g., 'Find crowd reactions'",
| 970 |
+
lines=1,
|
| 971 |
+
)
|
| 972 |
+
batch_prompt3 = gr.Textbox(
|
| 973 |
+
label="Prompt 3",
|
| 974 |
+
placeholder="E.g., 'Prioritize emotional moments'",
|
| 975 |
+
lines=1,
|
| 976 |
+
)
|
| 977 |
+
|
| 978 |
+
with gr.Accordion("Reference Image (Optional)", open=False):
|
| 979 |
+
batch_ref_image = gr.Image(
|
| 980 |
+
label="Reference Image (applies to all tests)",
|
| 981 |
+
type="filepath",
|
| 982 |
+
sources=["upload"],
|
| 983 |
+
)
|
| 984 |
+
|
| 985 |
+
# Queue size indicator
|
| 986 |
+
queue_info = gr.Markdown("Queue: 0 tests")
|
| 987 |
+
|
| 988 |
+
with gr.Row():
|
| 989 |
+
batch_start_btn = gr.Button(
|
| 990 |
+
"Start Batch",
|
| 991 |
+
variant="primary",
|
| 992 |
+
size="lg"
|
| 993 |
+
)
|
| 994 |
+
batch_cancel_btn = gr.Button(
|
| 995 |
+
"Cancel",
|
| 996 |
+
variant="secondary",
|
| 997 |
+
size="lg"
|
| 998 |
+
)
|
| 999 |
+
|
| 1000 |
+
# Right column - Results
|
| 1001 |
+
with gr.Column(scale=1):
|
| 1002 |
+
gr.Markdown("### Results")
|
| 1003 |
+
|
| 1004 |
+
batch_status = gr.Textbox(
|
| 1005 |
+
label="Status",
|
| 1006 |
+
lines=2,
|
| 1007 |
+
interactive=False
|
| 1008 |
+
)
|
| 1009 |
+
|
| 1010 |
+
batch_results_table = gr.Dataframe(
|
| 1011 |
+
label="Test Results",
|
| 1012 |
+
headers=["Test ID", "Video", "Domain", "Length", "Prompt", "Status", "Time (s)", "Frames", "Hooks"],
|
| 1013 |
+
interactive=False,
|
| 1014 |
+
)
|
| 1015 |
+
|
| 1016 |
+
with gr.Accordion("Processing Log", open=True):
|
| 1017 |
+
batch_log = gr.Textbox(
|
| 1018 |
+
label="Log",
|
| 1019 |
+
lines=15,
|
| 1020 |
+
interactive=False,
|
| 1021 |
+
show_copy_button=True
|
| 1022 |
+
)
|
| 1023 |
+
|
| 1024 |
+
with gr.Accordion("Full Results (JSON)", open=False):
|
| 1025 |
+
batch_json = gr.Textbox(
|
| 1026 |
+
label="JSON Output",
|
| 1027 |
+
lines=10,
|
| 1028 |
+
interactive=False,
|
| 1029 |
+
show_copy_button=True
|
| 1030 |
+
)
|
| 1031 |
+
|
| 1032 |
+
gr.Markdown("#### Download Results")
|
| 1033 |
+
with gr.Row():
|
| 1034 |
+
csv_download = gr.File(label="CSV Results")
|
| 1035 |
+
json_download = gr.File(label="JSON Results")
|
| 1036 |
+
zip_download = gr.File(label="All Clips (ZIP)")
|
| 1037 |
+
|
| 1038 |
+
# Update queue size when inputs change
|
| 1039 |
+
queue_inputs = [batch_videos, batch_domains, batch_clip_lengths, batch_no_prompt, batch_prompt1, batch_prompt2, batch_prompt3]
|
| 1040 |
+
for inp in queue_inputs:
|
| 1041 |
+
inp.change(
|
| 1042 |
+
fn=calculate_queue_size,
|
| 1043 |
+
inputs=queue_inputs,
|
| 1044 |
+
outputs=queue_info
|
| 1045 |
+
)
|
| 1046 |
+
|
| 1047 |
+
# Connect batch processing
|
| 1048 |
+
batch_start_btn.click(
|
| 1049 |
+
fn=run_batch_tests,
|
| 1050 |
+
inputs=[
|
| 1051 |
+
batch_videos,
|
| 1052 |
+
batch_domains,
|
| 1053 |
+
batch_clip_lengths,
|
| 1054 |
+
batch_num_clips,
|
| 1055 |
+
batch_ref_image,
|
| 1056 |
+
batch_no_prompt,
|
| 1057 |
+
batch_prompt1,
|
| 1058 |
+
batch_prompt2,
|
| 1059 |
+
batch_prompt3,
|
| 1060 |
+
],
|
| 1061 |
+
outputs=[
|
| 1062 |
+
batch_status,
|
| 1063 |
+
batch_results_table,
|
| 1064 |
+
batch_log,
|
| 1065 |
+
batch_json,
|
| 1066 |
+
csv_download,
|
| 1067 |
+
json_download,
|
| 1068 |
+
zip_download,
|
| 1069 |
+
],
|
| 1070 |
+
show_progress="full"
|
| 1071 |
+
)
|
| 1072 |
+
|
| 1073 |
+
batch_cancel_btn.click(
|
| 1074 |
+
fn=cancel_batch,
|
| 1075 |
+
inputs=[],
|
| 1076 |
+
outputs=[batch_status]
|
| 1077 |
+
)
|
| 1078 |
+
|
| 1079 |
+
gr.Markdown("""
|
| 1080 |
+
---
|
| 1081 |
+
**ShortSmith v3** | Powered by Qwen2-VL, InsightFace, and Librosa |
|
| 1082 |
+
[GitHub](https://github.com) | Built with Gradio
|
| 1083 |
+
""")
|
| 1084 |
+
|
| 1085 |
+
# Launch the app
|
| 1086 |
+
if __name__ == "__main__":
|
| 1087 |
+
demo.queue()
|
| 1088 |
+
demo.launch(
|
| 1089 |
+
server_name="0.0.0.0",
|
| 1090 |
+
server_port=7860,
|
| 1091 |
+
show_error=True
|
| 1092 |
+
)
|
| 1093 |
+
else:
|
| 1094 |
+
# For HuggingFace Spaces
|
| 1095 |
+
demo.queue()
|
| 1096 |
+
demo.launch()
|
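`calculate_queue_size`, like the other handlers wired above (`process_video`, `run_batch_tests`, `cancel_batch`, `generate_clip_preview`), is defined earlier in `app.py` and is not part of this hunk. As a rough illustration of what the queue-indicator wiring implies, here is a hypothetical stand-in; the counting rule (videos × domains × clip lengths × prompt variants, with the no-prompt baseline counted as one variant) is an assumption, not the Space's actual code:

```python
def calculate_queue_size(videos, domains, clip_lengths, no_prompt,
                         prompt1, prompt2, prompt3):
    """Hypothetical sketch: queue size = videos x domains x lengths x prompt variants."""
    n_videos = len(videos or [])
    # Each non-empty prompt is one variant; the no-prompt baseline adds one more.
    n_prompts = sum(1 for p in (prompt1, prompt2, prompt3) if p and p.strip())
    if no_prompt:
        n_prompts += 1
    total = n_videos * len(domains or []) * len(clip_lengths or []) * n_prompts
    return f"Queue: {total} tests"

print(calculate_queue_size(["a.mp4", "b.mp4"], ["Sports", "General"],
                           ["Short (30-60s)"], True, "Focus on action", "", ""))
# → Queue: 8 tests
```

Registering this one function on every input's `.change` event, as the loop over `queue_inputs` does above, keeps the "Queue: N tests" indicator current no matter which field the user edits.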
config.py
CHANGED
@@ -1,201 +1,201 @@
"""
ShortSmith v2 - Configuration Module

Centralized configuration for all components including model paths,
thresholds, domain presets, and runtime settings.
"""

import os
from dataclasses import dataclass, field
from typing import Dict, Optional
from enum import Enum


class ContentDomain(Enum):
    """Supported content domains with different hype characteristics."""
    SPORTS = "sports"
    VLOGS = "vlogs"
    MUSIC = "music"
    PODCASTS = "podcasts"
    GAMING = "gaming"
    COMEDY = "comedy"
    GENERAL = "general"


class ClipLength(Enum):
    """Clip length presets - flexible ranges instead of fixed durations."""
    SHORT = "short"  # 30-60 seconds (soft bounds, can vary by ~5-10s)
    LONG = "long"    # Up to 3 minutes


# Clip length configuration - just min/max constraints, algorithm finds natural boundaries
CLIP_LENGTH_CONFIG = {
    ClipLength.SHORT: {
        "min": 30.0,  # Minimum duration in seconds
        "max": 60.0,  # Maximum duration in seconds
        "description": "30-60 seconds",
    },
    ClipLength.LONG: {
        "min": 60.0,   # Minimum duration in seconds
        "max": 180.0,  # Maximum duration in seconds (3 min)
        "description": "1-3 minutes",
    },
}


@dataclass
class DomainWeights:
    """Weight configuration for visual vs audio scoring per domain."""
    visual_weight: float
    audio_weight: float
    motion_weight: float = 0.0

    def __post_init__(self):
        """Normalize weights to sum to 1.0."""
        total = self.visual_weight + self.audio_weight + self.motion_weight
        if total > 0:
            self.visual_weight /= total
            self.audio_weight /= total
            self.motion_weight /= total


@dataclass
class ModelConfig:
    """Configuration for AI models."""
    # Visual model (Qwen2-VL)
    visual_model_id: str = "Qwen/Qwen2-VL-2B-Instruct"
    visual_model_quantization: str = "int4"  # Options: "int4", "int8", "none"
    visual_max_frames: int = 32

    # Audio model
    audio_model_id: str = "facebook/wav2vec2-base-960h"
    use_advanced_audio: bool = False  # Use Wav2Vec2 instead of just Librosa

    # Face recognition (InsightFace)
    face_detection_model: str = "buffalo_l"  # SCRFD model
    face_similarity_threshold: float = 0.4

    # Body recognition (OSNet)
    body_model_name: str = "osnet_x1_0"
    body_similarity_threshold: float = 0.5

    # Motion detection (RAFT)
    motion_model: str = "raft-things"
    motion_threshold: float = 5.0

    # Device settings
    device: str = "cuda"  # Options: "cuda", "cpu", "mps"

    def __post_init__(self):
        """Validate and adjust device based on availability."""
        import torch
        if self.device == "cuda" and not torch.cuda.is_available():
            self.device = "cpu"
        elif self.device == "mps" and not (hasattr(torch.backends, "mps") and torch.backends.mps.is_available()):
            self.device = "cpu"


@dataclass
class ProcessingConfig:
    """Configuration for video processing pipeline."""
    # Sampling settings
    coarse_sample_interval: float = 5.0  # Seconds between frames in first pass
    dense_sample_fps: float = 3.0        # FPS for dense sampling on candidates
    min_motion_for_dense: float = 2.0    # Threshold to trigger dense sampling

    # Clip settings - these are safety bounds, actual duration comes from CLIP_LENGTH_CONFIG
    min_clip_duration: float = 15.0      # Absolute minimum clip length (safety floor)
    max_clip_duration: float = 180.0     # Absolute maximum clip length (safety ceiling for LONG preset)
    default_clip_duration: float = 45.0  # Default clip length (midpoint of SHORT range)
    min_gap_between_clips: float = 30.0  # Minimum gap between clip starts

    # Output settings
    default_num_clips: int = 3
    max_num_clips: int = 10
    output_format: str = "mp4"
    output_codec: str = "libx264"
    output_audio_codec: str = "aac"

    # Scene detection
    scene_threshold: float = 27.0  # PySceneDetect threshold

    # Hype scoring
    hype_threshold: float = 0.3    # Minimum normalized score to consider
    diversity_weight: float = 0.2  # Weight for temporal diversity in ranking

    # Performance
    batch_size: int = 8                 # Frames per batch for model inference
    max_video_duration: float = 7200.0  # Maximum video length (2 hours)

    # Temporary files
    temp_dir: Optional[str] = None
    cleanup_temp: bool = True


@dataclass
class AppConfig:
    """Main application configuration."""
    model: ModelConfig = field(default_factory=ModelConfig)
    processing: ProcessingConfig = field(default_factory=ProcessingConfig)

    # Logging
    log_level: str = "INFO"
    log_file: Optional[str] = "shortsmith.log"
    log_to_console: bool = True

    # API settings (for future extensibility)
    api_key: Optional[str] = None

    # UI settings
    share_gradio: bool = False
    server_port: int = 7860

    @classmethod
    def from_env(cls) -> "AppConfig":
        """Create configuration from environment variables."""
        config = cls()

        # Override from environment
        if os.environ.get("SHORTSMITH_LOG_LEVEL"):
            config.log_level = os.environ["SHORTSMITH_LOG_LEVEL"]

        if os.environ.get("SHORTSMITH_DEVICE"):
            config.model.device = os.environ["SHORTSMITH_DEVICE"]

        if os.environ.get("SHORTSMITH_API_KEY"):
            config.api_key = os.environ["SHORTSMITH_API_KEY"]

        return config


# Global configuration instance
_config: Optional[AppConfig] = None


def get_config() -> AppConfig:
    """Get the global configuration instance."""
    global _config
    if _config is None:
        _config = AppConfig.from_env()
    return _config


def set_config(config: AppConfig) -> None:
    """Set the global configuration instance."""
    global _config
    _config = config


# Export commonly used items
__all__ = [
    "ContentDomain",
    "ClipLength",
    "CLIP_LENGTH_CONFIG",
    "DomainWeights",
    "ModelConfig",
    "ProcessingConfig",
    "AppConfig",
    "get_config",
    "set_config",
]
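The `DomainWeights.__post_init__` hook in config.py normalizes whatever raw weights a domain preset supplies so they always sum to 1.0, and leaves all-zero weights untouched to avoid division by zero. A self-contained excerpt demonstrating that behavior:

```python
from dataclasses import dataclass

@dataclass
class DomainWeights:
    """Weight configuration for visual vs audio scoring per domain."""
    visual_weight: float
    audio_weight: float
    motion_weight: float = 0.0

    def __post_init__(self):
        """Normalize weights to sum to 1.0 (no-op when all weights are zero)."""
        total = self.visual_weight + self.audio_weight + self.motion_weight
        if total > 0:
            self.visual_weight /= total
            self.audio_weight /= total
            self.motion_weight /= total

w = DomainWeights(visual_weight=2.0, audio_weight=1.0, motion_weight=1.0)
print(w.visual_weight, w.audio_weight, w.motion_weight)  # → 0.5 0.25 0.25
```

Because normalization happens in `__post_init__`, every preset can be written with intuitive raw ratios (e.g. 2:1:1) and downstream scoring code can assume the weights form a convex combination.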
requirements.txt
CHANGED
@@ -1,103 +1,103 @@
# ShortSmith v2 - Requirements
# For Hugging Face Spaces deployment

# ============================================
# Core Dependencies
# ============================================

# Gradio UI framework
gradio==4.44.1

# Pin pydantic to fix "argument of type 'bool' is not iterable" error
pydantic==2.10.6

# Deep learning frameworks
torch>=2.0.0
torchvision>=0.15.0
torchaudio>=2.0.0

# Transformers and model loading
transformers>=4.35.0
accelerate>=0.24.0
bitsandbytes>=0.41.0  # For INT4/INT8 quantization

# ============================================
# Video Processing
# ============================================

# Video I/O
ffmpeg-python>=0.2.0
opencv-python-headless>=4.8.0

# Scene detection
scenedetect[opencv]>=0.6.0

# ============================================
# Audio Processing
# ============================================

# Audio analysis
librosa>=0.10.0
soundfile>=0.12.0

# Optional: Advanced audio understanding
# wav2vec2 is loaded via transformers

# ============================================
# Computer Vision Models
# ============================================

# Face recognition
insightface>=0.7.0
onnxruntime-gpu>=1.16.0  # Use onnxruntime for CPU-only

# Person detection (YOLO)
ultralytics>=8.0.0

# Image processing
Pillow>=10.0.0

# ============================================
# Utilities
# ============================================

# Numerical computing
numpy>=1.24.0

# Progress bars
tqdm>=4.65.0

# ============================================
# Hugging Face Specific
# ============================================

# For model downloading
huggingface_hub>=0.17.0

# Qwen2-VL specific utilities
qwen-vl-utils>=0.0.2

# ============================================
# Optional: GPU Acceleration
# ============================================

# Uncomment for specific CUDA versions if needed
# --extra-index-url https://download.pytorch.org/whl/cu118
# torch==2.1.0+cu118
# torchvision==0.16.0+cu118

# ============================================
# Training Dependencies (optional)
# ============================================

# For loading Mr. HiSum dataset
h5py>=3.9.0

# ============================================
# Development Dependencies (optional)
# ============================================

# pytest>=7.0.0
# black>=23.0.0
# isort>=5.0.0
# mypy>=1.0.0
space.yaml
CHANGED
@@ -1,31 +1,31 @@
---
title: ShortSmith v2
emoji: 🎬
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: "4.44.1"
app_file: app.py
pinned: false
license: mit
tags:
  - video
  - highlight-detection
  - ai
  - qwen
  - computer-vision
  - audio-analysis
short_description: AI-Powered Video Highlight Extractor
---

# ShortSmith v2

Extract the most engaging highlight clips from your videos automatically using AI.

## Features
- Multi-modal analysis (visual + audio + motion)
- Domain-optimized presets (Sports, Music, Vlogs, etc.)
- Person-specific filtering
- Scene-aware clip cutting

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference