ShortSmith v2 - Implementation Plan
Overview
Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.
Project Structure
shortsmith-v2/
├── app.py                 # Gradio UI (Hugging Face interface)
├── requirements.txt       # Dependencies
├── config.py              # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py          # Centralized logging
│   └── helpers.py         # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py # FFmpeg video/audio extraction
│   ├── scene_detector.py  # PySceneDetect integration
│   ├── frame_sampler.py   # Hierarchical sampling logic
│   └── clip_extractor.py  # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py # Qwen2-VL integration
│   ├── audio_analyzer.py  # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py # OSNet for body recognition
│   ├── motion_detector.py # RAFT optical flow
│   └── tracker.py         # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py     # Hype scoring logic
│   └── domain_presets.py  # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py    # Main pipeline coordinator
Implementation Phases
Phase 1: Core Infrastructure
config.py - Configuration management
- Model paths, thresholds, domain presets
- HuggingFace API key handling
utils/logger.py - Centralized logging
- File and console handlers
- Different log levels per module
- Timing decorators for performance tracking
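The timing decorator can be a small stdlib-only helper. A sketch (the logger name and the decorated function are illustrative, not part of the plan):

```python
import functools
import logging
import time

logger = logging.getLogger("shortsmith.utils")  # illustrative logger name

def timed(func):
    """Log the wrapped function's wall-clock duration at DEBUG level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        logger.debug("%s took %.2fs", func.__name__, time.perf_counter() - start)
        return result
    return wrapper

@timed
def extract_audio():  # hypothetical stage, only here to demo the decorator
    return "audio.wav"
```

functools.wraps preserves the wrapped function's name and docstring, so the log lines and tracebacks stay readable.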
utils/helpers.py - Common utilities
- File validation
- Temporary file management
- Error formatting
Phase 2: Video Processing Layer
core/video_processor.py - FFmpeg operations
- Extract frames at specified FPS
- Extract audio track
- Get video metadata (duration, resolution, fps)
- Cut clips at timestamps
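One way to wrap these FFmpeg operations is to build argument lists and hand them to subprocess. A sketch (the function names are placeholders; the flags are standard FFmpeg options):

```python
def frame_extract_cmd(video_path: str, fps: float, out_pattern: str) -> list:
    """ffmpeg command that samples frames at a fixed rate.

    out_pattern is an image sequence pattern, e.g. 'frames/%06d.jpg'."""
    return ["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}", out_pattern]

def cut_clip_cmd(video_path: str, start_s: float, duration_s: float,
                 out_path: str) -> list:
    """ffmpeg command that cuts a clip at a timestamp.

    Placing -ss before -i seeks quickly; re-encoding with libx264/aac keeps
    the cut frame-accurate (stream-copy cuts snap to keyframes)."""
    return ["ffmpeg", "-y", "-ss", f"{start_s:.3f}", "-i", video_path,
            "-t", f"{duration_s:.3f}", "-c:v", "libx264", "-c:a", "aac",
            out_path]
```

At runtime each list would be executed with something like `subprocess.run(cut_clip_cmd(...), check=True)`.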
core/scene_detector.py - Scene boundary detection
- PySceneDetect integration
- Content-aware detection
- Return scene timestamps
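With PySceneDetect's 0.6 API this module can be a thin wrapper; a sketch (the threshold and minimum scene length are illustrative defaults, not tuned values):

```python
def detect_scenes(video_path: str, threshold: float = 27.0,
                  min_len_s: float = 1.0):
    """Return (start_s, end_s) spans from content-aware scene detection."""
    from scenedetect import detect, ContentDetector  # deferred: heavy import
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    spans = [(s.get_seconds(), e.get_seconds()) for s, e in scenes]
    return drop_short(spans, min_len_s)

def drop_short(spans, min_len_s: float):
    """Discard scenes shorter than min_len_s; they rarely hold a usable clip."""
    return [(a, b) for a, b in spans if b - a >= min_len_s]
```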
core/frame_sampler.py - Hierarchical sampling
- First pass: 1 frame per 5-10 seconds
- Second pass: Dense sampling on candidates
- Dynamic FPS based on motion
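The two-pass logic reduces to timestamp arithmetic. A sketch (the step, window, and FPS defaults follow the ranges above but are otherwise illustrative):

```python
def coarse_timestamps(duration_s: float, step_s: float = 7.5):
    """First pass: one probe frame every step_s seconds (the 5-10 s range)."""
    t, out = 0.0, []
    while t < duration_s:
        out.append(round(t, 3))
        t += step_s
    return out

def dense_timestamps(center_s: float, duration_s: float,
                     window_s: float = 4.0, fps: float = 4.0):
    """Second pass: sample densely in a window around a candidate moment."""
    start = max(0.0, center_s - window_s / 2)
    end = min(duration_s, center_s + window_s / 2)
    n = int((end - start) * fps)
    return [round(start + i / fps, 3) for i in range(n + 1)]
```

Dynamic FPS would then just mean passing a higher `fps` to the second pass when the motion score is high.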
Phase 3: AI Models
models/visual_analyzer.py - Qwen2-VL-2B
- Load quantized model
- Process frame batches
- Extract visual embeddings/scores
models/audio_analyzer.py - Audio analysis
- Librosa for basic features (RMS, spectral flux, centroid)
- Optional Wav2Vec 2.0 for advanced understanding
- Return audio hype signals per segment
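Librosa exposes these features directly (`librosa.feature.rms`, `librosa.feature.spectral_centroid`, and `librosa.onset.onset_strength` for a flux-style envelope). A NumPy-only sketch of the two cheapest signals, to show what the module computes:

```python
import numpy as np

def frame_signal(y: np.ndarray, frame: int = 2048, hop: int = 512) -> np.ndarray:
    """Slice a mono waveform into overlapping analysis frames."""
    n = 1 + max(0, len(y) - frame) // hop
    return np.stack([y[i * hop : i * hop + frame] for i in range(n)])

def rms_energy(y, frame=2048, hop=512):
    """Per-frame RMS energy: a loudness proxy (crowd roars, shouting)."""
    return np.sqrt((frame_signal(y, frame, hop) ** 2).mean(axis=1))

def spectral_flux(y, frame=2048, hop=512):
    """Positive change in the magnitude spectrum between frames:
    a proxy for sudden sonic events (hits, cheers, drops)."""
    mags = np.abs(np.fft.rfft(frame_signal(y, frame, hop) * np.hanning(frame),
                              axis=1))
    return np.maximum(np.diff(mags, axis=0), 0.0).sum(axis=1)
```

Per-segment hype signals would then be aggregates (mean, peak) of these arrays over each scene's time span.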
models/face_recognizer.py - Face detection/recognition
- InsightFace SCRFD for detection
- ArcFace for embeddings
- Reference image matching
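Once SCRFD has detected faces and ArcFace has produced embeddings, matching against the reference image is a cosine-similarity check. A sketch (the 0.35 threshold is an illustrative placeholder, not a tuned value):

```python
import numpy as np

def match_reference(face_embs, ref_emb, threshold: float = 0.35):
    """Cosine similarity of each detected-face embedding to the reference.

    Returns (similarities, boolean mask of faces that match the reference)."""
    ref = np.asarray(ref_emb, dtype=float)
    ref = ref / np.linalg.norm(ref)
    sims = np.array([np.dot(e, ref) / np.linalg.norm(e) for e in face_embs])
    return sims, sims >= threshold
```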
models/body_recognizer.py - Body recognition
- OSNet for full-body embeddings
- Handle non-frontal views
models/motion_detector.py - Motion analysis
- RAFT optical flow
- Motion magnitude scoring
models/tracker.py - Multi-object tracking
- ByteTrack integration
- Maintain identity across frames
Phase 4: Scoring & Selection
scoring/domain_presets.py - Domain configurations
- Sports, Vlogs, Music, Podcasts presets
- Custom weight definitions
scoring/hype_scorer.py - Hype calculation
- Combine visual + audio scores
- Apply domain weights
- Normalize and rank segments
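Presets and scoring combine into a few lines. A sketch (the weights below are illustrative placeholders, not tuned presets):

```python
DOMAIN_PRESETS = {  # illustrative weights; real presets would be tuned
    "sports":   {"visual": 0.4, "audio": 0.4, "motion": 0.2},
    "podcasts": {"visual": 0.2, "audio": 0.7, "motion": 0.1},
}

def rank_segments(segments, domain="sports", presets=DOMAIN_PRESETS):
    """Weighted-sum each segment's signal dict, min-max normalise, rank.

    segments: list of dicts like {"visual": 0.8, "audio": 0.3, ...}.
    Returns (normalised scores in [0, 1], segment indices best-first)."""
    w = presets[domain]
    raw = [sum(w[k] * seg.get(k, 0.0) for k in w) for seg in segments]
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0            # avoid divide-by-zero on flat scores
    norm = [(r - lo) / span for r in raw]
    order = sorted(range(len(norm)), key=lambda i: -norm[i])
    return norm, order
```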
Phase 5: Pipeline & UI
pipeline/orchestrator.py - Main coordinator
- Coordinate all components
- Handle errors gracefully
- Progress reporting
app.py - Gradio interface
- Video upload
- API key input (secure)
- Prompt/instructions input
- Domain selection
- Reference image upload (for person filtering)
- Progress bar
- Output video gallery
Key Design Decisions
Error Handling Strategy
- Each module has try/except with specific exception types
- Errors bubble up with context
- Pipeline continues with degraded functionality when possible
- User-friendly error messages in UI
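The bubble-up-with-fallback pattern can be captured in one helper; a sketch (`run_stage` is a hypothetical name):

```python
import logging

logger = logging.getLogger("shortsmith.pipeline")  # illustrative logger name

def run_stage(name, fn, fallback=None):
    """Run one pipeline stage; on failure, log a warning with the stack
    trace and return a fallback so later stages continue in degraded mode."""
    try:
        return fn()
    except Exception:
        logger.warning("stage %r failed; continuing with fallback",
                       name, exc_info=True)
        return fallback

# Usage: motion scores are optional, so a failure degrades rather than aborts.
# motion = run_stage("motion", lambda: compute_motion(frames), fallback=[])
```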
Logging Strategy
- DEBUG: Model loading, frame processing details
- INFO: Pipeline stages, timing, results
- WARNING: Fallback triggers, degraded mode
- ERROR: Failures with stack traces
Memory Management
- Process frames in batches
- Clear GPU memory between stages
- Use generators where possible
- Temporary file cleanup
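A lazy batch generator plus a guarded cache clear covers most of this; a sketch (the torch call only runs when CUDA is actually available):

```python
def batched(items, batch_size: int):
    """Yield fixed-size batches lazily so frames never all sit in memory."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def clear_gpu():
    """Release cached CUDA memory between pipeline stages, if torch is installed."""
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only environment: nothing to clear
```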
HuggingFace Space Considerations
- Use gr.State for session data
- Respect ZeroGPU limits (if using)
- Cache models in /tmp or the HF cache
- Handle timeouts gracefully
API Key Usage
The API key input is for future extensibility (e.g., external services). For MVP, all processing is local using open-weight models.
Gradio UI Layout
┌───────────────────────────────────────────────────────────────┐
│         ShortSmith v2 - AI Video Highlight Extractor          │
├───────────────────────────────────────────────────────────────┤
│ ┌─────────────────────┐ ┌─────────────────────────────┐       │
│ │  Upload Video       │ │  Settings                   │       │
│ │  [Drop zone]        │ │  Domain: [Dropdown]         │       │
│ │                     │ │  Clip Duration: [Slider]    │       │
│ └─────────────────────┘ │  Num Clips: [Slider]        │       │
│                         │  API Key: [Password field]  │       │
│ ┌─────────────────────┐ └─────────────────────────────┘       │
│ │  Reference Image    │                                       │
│ │  (Optional)         │ ┌─────────────────────────────┐       │
│ │  [Drop zone]        │ │  Additional Instructions    │       │
│ └─────────────────────┘ │  [Textbox]                  │       │
│                         └─────────────────────────────┘       │
├───────────────────────────────────────────────────────────────┤
│                     [Extract Highlights]                      │
├───────────────────────────────────────────────────────────────┤
│ Progress: [████████████░░░░░░░░] 60%                          │
│ Status: Analyzing audio...                                    │
├───────────────────────────────────────────────────────────────┤
│ Results                                                       │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│ │  Clip 1  │ │  Clip 2  │ │  Clip 3  │                        │
│ │  [Video] │ │  [Video] │ │  [Video] │                        │
│ │ Score:85 │ │ Score:78 │ │ Score:72 │                        │
│ └──────────┘ └──────────┘ └──────────┘                        │
│ [Download All]                                                │
└───────────────────────────────────────────────────────────────┘
Dependencies (requirements.txt)
gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python
Implementation Order
- config.py, utils/ (foundation)
- core/video_processor.py (essential)
- models/audio_analyzer.py (simpler, Librosa first)
- core/scene_detector.py
- core/frame_sampler.py
- scoring/ modules
- models/visual_analyzer.py (Qwen2-VL)
- models/face_recognizer.py, body_recognizer.py
- models/tracker.py, motion_detector.py
- pipeline/orchestrator.py
- app.py (Gradio UI)
Notes
- Start with Librosa-only audio (MVP), add Wav2Vec later
- Face/body recognition is optional (triggered by reference image)
- Motion detection can be skipped in MVP for speed
- ByteTrack only needed when person filtering is enabled