# ShortSmith v2 - Implementation Plan
## Overview
Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.
## Project Structure
```
shortsmith-v2/
├── app.py                 # Gradio UI (Hugging Face interface)
├── requirements.txt       # Dependencies
├── config.py              # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py          # Centralized logging
│   └── helpers.py         # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py # FFmpeg video/audio extraction
│   ├── scene_detector.py  # PySceneDetect integration
│   ├── frame_sampler.py   # Hierarchical sampling logic
│   └── clip_extractor.py  # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py # Qwen2-VL integration
│   ├── audio_analyzer.py  # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py # OSNet for body recognition
│   ├── motion_detector.py # RAFT optical flow
│   └── tracker.py         # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py     # Hype scoring logic
│   └── domain_presets.py  # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py    # Main pipeline coordinator
```
## Implementation Phases
### Phase 1: Core Infrastructure
1. **config.py** - Configuration management
- Model paths, thresholds, domain presets
- HuggingFace API key handling
2. **utils/logger.py** - Centralized logging
- File and console handlers
- Different log levels per module
- Timing decorators for performance tracking
3. **utils/helpers.py** - Common utilities
- File validation
- Temporary file management
- Error formatting
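The timing decorator mentioned for `utils/logger.py` could be sketched as follows (a minimal version; the decorator and logger names are assumptions, not fixed API):

```python
import functools
import logging
import time

logger = logging.getLogger("shortsmith")

def timed(func):
    """Log how long the wrapped function takes, at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        logger.debug("%s took %.3fs", func.__name__, elapsed)
        return result
    return wrapper

@timed
def slow_add(a, b):
    """Toy function to demonstrate the decorator."""
    time.sleep(0.01)
    return a + b
```

Because it preserves the wrapped function's name via `functools.wraps`, the timing lines in the log stay attributable to specific pipeline stages.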
### Phase 2: Video Processing Layer
4. **core/video_processor.py** - FFmpeg operations
- Extract frames at specified FPS
- Extract audio track
- Get video metadata (duration, resolution, fps)
- Cut clips at timestamps
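Clip cutting can be done by shelling out to FFmpeg; a sketch of a command builder (the helper name and flag choices are assumptions, and the real module may use `ffmpeg-python` instead):

```python
def build_clip_cmd(src, start, duration, dst):
    """Build an ffmpeg command that cuts [start, start+duration) from src.

    Placing -ss before -i makes ffmpeg seek on the input (fast), and
    -c copy avoids re-encoding, at the cost of cutting on the nearest
    keyframe rather than the exact timestamp.
    """
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",
        "-i", src,
        "-t", f"{duration:.3f}",
        "-c", "copy",
        dst,
    ]
```

The command list would then be passed to `subprocess.run`.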
5. **core/scene_detector.py** - Scene boundary detection
- PySceneDetect integration
- Content-aware detection
- Return scene timestamps
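PySceneDetect's `detect()` returns a list of scene start/end timecodes; a post-processing sketch that drops very short scenes before they reach the sampler (the minimum length is an assumed, tunable threshold):

```python
MIN_SCENE_SEC = 2.0  # assumed threshold; tune per domain

def filter_scenes(scenes, min_len=MIN_SCENE_SEC):
    """Drop scenes shorter than min_len seconds.

    `scenes` is a list of (start_sec, end_sec) pairs, e.g. converted
    from the timecode pairs PySceneDetect returns.
    """
    return [(s, e) for s, e in scenes if e - s >= min_len]
```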
6. **core/frame_sampler.py** - Hierarchical sampling
- First pass: 1 frame per 5-10 seconds
- Second pass: Dense sampling on candidates
- Dynamic FPS based on motion
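The two-pass sampling logic above can be sketched as timestamp generators (function names are placeholders for illustration):

```python
def coarse_timestamps(duration, step=5.0):
    """First pass: one sample every `step` seconds across the whole video."""
    t, out = 0.0, []
    while t < duration:
        out.append(round(t, 3))
        t += step
    return out

def dense_timestamps(start, end, fps=2.0):
    """Second pass: denser sampling inside a candidate segment."""
    step = 1.0 / fps
    t, out = start, []
    while t < end:
        out.append(round(t, 3))
        t += step
    return out
```

For the dynamic-FPS variant, `fps` would be scaled by the motion score of the segment rather than fixed at 2.0.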
### Phase 3: AI Models
7. **models/visual_analyzer.py** - Qwen2-VL-2B
- Load quantized model
- Process frame batches
- Extract visual embeddings/scores
8. **models/audio_analyzer.py** - Audio analysis
- Librosa for basic features (RMS, spectral flux, centroid)
- Optional Wav2Vec 2.0 for advanced understanding
- Return audio hype signals per segment
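Librosa provides these features directly (e.g. `librosa.feature.rms`); to make the signals concrete, a NumPy-only sketch of the two simplest ones:

```python
import numpy as np

def frame_rms(y, frame_len=2048, hop=512):
    """Root-mean-square energy per frame (loudness proxy)."""
    frames = [y[i:i + frame_len]
              for i in range(0, len(y) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

def spectral_flux(spec):
    """Positive frame-to-frame change in magnitude spectra (onset proxy).

    `spec` is a (frames, bins) magnitude spectrogram.
    """
    diff = np.diff(spec, axis=0)
    return np.sum(np.clip(diff, 0, None), axis=1)
```

Spikes in either signal (crowd roars, beat drops, laughter) are the raw material for the audio hype score.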
9. **models/face_recognizer.py** - Face detection/recognition
- InsightFace SCRFD for detection
- ArcFace for embeddings
- Reference image matching
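Reference matching reduces to comparing embeddings; a sketch using cosine similarity (the 0.4 threshold is an assumed starting point, not a tuned value):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def matches_reference(face_emb, ref_embs, threshold=0.4):
    """True if a detected face embedding matches any reference embedding.

    ArcFace embeddings are typically compared with cosine similarity;
    the threshold would be tuned on real data.
    """
    return any(cosine_sim(face_emb, r) >= threshold for r in ref_embs)
```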
10. **models/body_recognizer.py** - Body recognition
- OSNet for full-body embeddings
- Handle non-frontal views
11. **models/motion_detector.py** - Motion analysis
- RAFT optical flow
- Motion magnitude scoring
12. **models/tracker.py** - Multi-object tracking
- ByteTrack integration
- Maintain identity across frames
### Phase 4: Scoring & Selection
13. **scoring/domain_presets.py** - Domain configurations
- Sports, Vlogs, Music, Podcasts presets
- Custom weight definitions
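A preset could be a plain dict of signal weights; the signal names and values below are illustrative assumptions, to be tuned empirically:

```python
# Assumed signal names and weights; real values need empirical tuning.
DOMAIN_PRESETS = {
    "sports":   {"visual": 0.30, "audio": 0.30, "motion": 0.40},
    "vlogs":    {"visual": 0.40, "audio": 0.40, "motion": 0.20},
    "music":    {"visual": 0.25, "audio": 0.60, "motion": 0.15},
    "podcasts": {"visual": 0.20, "audio": 0.70, "motion": 0.10},
}
```

Keeping presets as data (rather than code) makes custom weight definitions a matter of adding one dict entry.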
14. **scoring/hype_scorer.py** - Hype calculation
- Combine visual + audio scores
- Apply domain weights
- Normalize and rank segments
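The combine/weight/normalize/rank steps above can be sketched in a few lines (function name and data shapes are assumptions for illustration):

```python
def rank_segments(segments, weights):
    """Score segments by a weighted sum of signals, scale to 0-100, rank.

    `segments` maps segment id -> {signal_name: raw score in [0, 1]};
    `weights` maps signal_name -> weight (e.g. a domain preset).
    """
    raw = {
        seg: sum(weights.get(sig, 0.0) * val for sig, val in signals.items())
        for seg, signals in segments.items()
    }
    top = max(raw.values()) or 1.0  # avoid division by zero
    scored = {seg: round(100 * v / top) for seg, v in raw.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```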
### Phase 5: Pipeline & UI
15. **pipeline/orchestrator.py** - Main coordinator
- Coordinate all components
- Handle errors gracefully
- Progress reporting
16. **app.py** - Gradio interface
- Video upload
- API key input (secure)
- Prompt/instructions input
- Domain selection
- Reference image upload (for person filtering)
- Progress bar
- Output video gallery
## Key Design Decisions
### Error Handling Strategy
- Each module has try/except with specific exception types
- Errors bubble up with context
- Pipeline continues with degraded functionality when possible
- User-friendly error messages in UI
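The "continue with degraded functionality" rule could be centralized in one wrapper (a sketch; the function name and signature are assumptions):

```python
import logging

logger = logging.getLogger("shortsmith")

def run_optional_stage(stage_fn, fallback, stage_name):
    """Run a non-essential pipeline stage; fall back instead of failing.

    E.g. if motion detection fails, return neutral motion scores so the
    rest of the pipeline can continue in degraded mode.
    """
    try:
        return stage_fn()
    except Exception:
        logger.warning("%s failed, using fallback", stage_name, exc_info=True)
        return fallback
```

Essential stages (video decoding, scoring) would not go through this wrapper; their errors should bubble up to the UI with context.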
### Logging Strategy
- DEBUG: Model loading, frame processing details
- INFO: Pipeline stages, timing, results
- WARNING: Fallback triggers, degraded mode
- ERROR: Failures with stack traces
### Memory Management
- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Temporary file cleanup
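Batching plus generators keeps only one batch of frames in memory at a time; a minimal sketch:

```python
def batched(items, batch_size):
    """Yield fixed-size batches so frames never all sit in memory at once."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch
```

Between stages, GPU memory would additionally be released (e.g. dropping model references and calling `torch.cuda.empty_cache()`).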
### HuggingFace Space Considerations
- Use `gr.State` for session data
- Respect ZeroGPU limits (if using)
- Cache models in `/tmp` or HF cache
- Handle timeouts gracefully
## API Key Usage
The API key input is for future extensibility (e.g., external services).
For MVP, all processing is local using open-weight models.
## Gradio UI Layout
```
┌───────────────────────────────────────────────────────────────┐
│  ShortSmith v2 - AI Video Highlight Extractor                 │
├───────────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐   ┌─────────────────────────────┐    │
│  │ Upload Video        │   │ Settings                    │    │
│  │ [Drop zone]         │   │ Domain: [Dropdown]          │    │
│  │                     │   │ Clip Duration: [Slider]     │    │
│  └─────────────────────┘   │ Num Clips: [Slider]         │    │
│                            │ API Key: [Password field]   │    │
│  ┌─────────────────────┐   └─────────────────────────────┘    │
│  │ Reference Image     │                                      │
│  │ (Optional)          │   ┌─────────────────────────────┐    │
│  │ [Drop zone]         │   │ Additional Instructions     │    │
│  └─────────────────────┘   │ [Textbox]                   │    │
│                            └─────────────────────────────┘    │
├───────────────────────────────────────────────────────────────┤
│                  [🚀 Extract Highlights]                      │
├───────────────────────────────────────────────────────────────┤
│  Progress: [████████████░░░░░░░░] 60%                         │
│  Status: Analyzing audio...                                   │
├───────────────────────────────────────────────────────────────┤
│  Results                                                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                     │
│  │ Clip 1   │  │ Clip 2   │  │ Clip 3   │                     │
│  │ [Video]  │  │ [Video]  │  │ [Video]  │                     │
│  │ Score:85 │  │ Score:78 │  │ Score:72 │                     │
│  └──────────┘  └──────────┘  └──────────┘                     │
│  [Download All]                                               │
└───────────────────────────────────────────────────────────────┘
```
## Dependencies (requirements.txt)
```
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
```
## Implementation Order
1. config.py, utils/ (foundation)
2. core/video_processor.py (essential)
3. models/audio_analyzer.py (simpler, Librosa first)
4. core/scene_detector.py
5. core/frame_sampler.py
6. scoring/ modules
7. models/visual_analyzer.py (Qwen2-VL)
8. models/face_recognizer.py, body_recognizer.py
9. models/tracker.py, motion_detector.py
10. pipeline/orchestrator.py
11. app.py (Gradio UI)
## Notes
- Start with Librosa-only audio (MVP), add Wav2Vec later
- Face/body recognition is optional (triggered by reference image)
- Motion detection can be skipped in MVP for speed
- ByteTrack only needed when person filtering is enabled