ShortSmith v2 - Implementation Plan

Overview

Build a Hugging Face Space that extracts "hype" moments from videos with optional person-specific filtering.

Project Structure

shortsmith-v2/
├── app.py                    # Gradio UI (Hugging Face interface)
├── requirements.txt          # Dependencies
├── config.py                 # Configuration and constants
├── utils/
│   ├── __init__.py
│   ├── logger.py             # Centralized logging
│   └── helpers.py            # Utility functions
├── core/
│   ├── __init__.py
│   ├── video_processor.py    # FFmpeg video/audio extraction
│   ├── scene_detector.py     # PySceneDetect integration
│   ├── frame_sampler.py      # Hierarchical sampling logic
│   └── clip_extractor.py     # Final clip cutting
├── models/
│   ├── __init__.py
│   ├── visual_analyzer.py    # Qwen2-VL integration
│   ├── audio_analyzer.py     # Wav2Vec 2.0 + Librosa
│   ├── face_recognizer.py    # InsightFace (SCRFD + ArcFace)
│   ├── body_recognizer.py    # OSNet for body recognition
│   ├── motion_detector.py    # RAFT optical flow
│   └── tracker.py            # ByteTrack integration
├── scoring/
│   ├── __init__.py
│   ├── hype_scorer.py        # Hype scoring logic
│   └── domain_presets.py     # Domain-specific weights
└── pipeline/
    ├── __init__.py
    └── orchestrator.py       # Main pipeline coordinator

Implementation Phases

Phase 1: Core Infrastructure

  1. config.py - Configuration management

    • Model paths, thresholds, domain presets
    • HuggingFace API key handling
  2. utils/logger.py - Centralized logging

    • File and console handlers
    • Different log levels per module
    • Timing decorators for performance tracking
  3. utils/helpers.py - Common utilities

    • File validation
    • Temporary file management
    • Error formatting
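
The timing decorator in utils/logger.py could look like the following minimal sketch, using only the stdlib logging module (`get_logger` and `timed` are assumed names, not finalized APIs):

```python
import functools
import logging
import time

def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a named logger with a single console handler attached."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger

def timed(func):
    """Decorator that logs how long the wrapped call took."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        get_logger(func.__module__).info("%s took %.3fs", func.__qualname__, elapsed)
        return result
    return wrapper

@timed
def slow_step():
    """Toy stand-in for a pipeline stage."""
    time.sleep(0.01)
    return "done"
```

Per-module levels then reduce to calling `get_logger(__name__, logging.DEBUG)` in chatty modules.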

Phase 2: Video Processing Layer

  1. core/video_processor.py - FFmpeg operations

    • Extract frames at specified FPS
    • Extract audio track
    • Get video metadata (duration, resolution, fps)
    • Cut clips at timestamps
  2. core/scene_detector.py - Scene boundary detection

    • PySceneDetect integration
    • Content-aware detection
    • Return scene timestamps
  3. core/frame_sampler.py - Hierarchical sampling

    • First pass: 1 frame per 5-10 seconds
    • Second pass: Dense sampling on candidates
    • Dynamic FPS based on motion
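
The two-pass sampling above can be sketched as pure timestamp math (`coarse_pass` and `dense_pass` are hypothetical names; the real frame_sampler.py would hand these timestamps to the FFmpeg layer):

```python
def coarse_pass(duration_s: float, interval_s: float = 5.0) -> list[float]:
    """First pass: one timestamp every interval_s seconds over the video."""
    n = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(n)]

def dense_pass(candidates: list[float], window_s: float = 2.0,
               fps: float = 4.0) -> list[float]:
    """Second pass: sample at `fps` inside a window around each candidate."""
    step = 1.0 / fps
    times: set[float] = set()
    for c in candidates:
        t = max(0.0, c - window_s)
        while t <= c + window_s:
            times.add(round(t, 3))  # dedupe overlapping windows
            t += step
    return sorted(times)
```

Dynamic FPS would then mean passing a motion-dependent `fps` into `dense_pass` per candidate.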

Phase 3: AI Models

  1. models/visual_analyzer.py - Qwen2-VL-2B

    • Load quantized model
    • Process frame batches
    • Extract visual embeddings/scores
  2. models/audio_analyzer.py - Audio analysis

    • Librosa for basic features (RMS, spectral flux, centroid)
    • Optional Wav2Vec 2.0 for advanced understanding
    • Return audio hype signals per segment
  3. models/face_recognizer.py - Face detection/recognition

    • InsightFace SCRFD for detection
    • ArcFace for embeddings
    • Reference image matching
  4. models/body_recognizer.py - Body recognition

    • OSNet for full-body embeddings
    • Handle non-frontal views
  5. models/motion_detector.py - Motion analysis

    • RAFT optical flow
    • Motion magnitude scoring
  6. models/tracker.py - Multi-object tracking

    • ByteTrack integration
    • Maintain identity across frames
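
The "audio hype signals per segment" idea can be sketched dependency-free; the real audio_analyzer.py would use Librosa's RMS and spectral-flux features rather than this hand-rolled RMS:

```python
import math

def segment_rms(samples: list[float], sr: int, segment_s: float = 1.0) -> list[float]:
    """RMS energy per fixed-length segment (a rough loudness proxy)."""
    seg_len = int(sr * segment_s)
    out = []
    for start in range(0, len(samples), seg_len):
        seg = samples[start:start + seg_len]
        out.append(math.sqrt(sum(x * x for x in seg) / len(seg)))
    return out

def audio_hype_signal(samples: list[float], sr: int,
                      segment_s: float = 1.0) -> list[float]:
    """Normalize per-segment RMS to [0, 1] as a crude audio 'hype' score."""
    rms = segment_rms(samples, sr, segment_s)
    peak = max(rms) or 1.0  # guard against all-silent input
    return [r / peak for r in rms]
```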

Phase 4: Scoring & Selection

  1. scoring/domain_presets.py - Domain configurations

    • Sports, Vlogs, Music, Podcasts presets
    • Custom weight definitions
  2. scoring/hype_scorer.py - Hype calculation

    • Combine visual + audio scores
    • Apply domain weights
    • Normalize and rank segments
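
The scoring flow above reduces to a weighted sum over per-modality signals; the preset weights below are illustrative placeholders, not the values that will ship in domain_presets.py:

```python
DOMAIN_PRESETS = {
    # Illustrative weights only; real values live in scoring/domain_presets.py.
    "sports":   {"visual": 0.4, "audio": 0.4, "motion": 0.2},
    "podcasts": {"visual": 0.2, "audio": 0.7, "motion": 0.1},
}

def hype_score(signals: dict[str, float], domain: str = "sports") -> float:
    """Weighted sum of per-modality scores, each assumed in [0, 1]."""
    weights = DOMAIN_PRESETS[domain]
    return sum(weights[k] * signals.get(k, 0.0) for k in weights)

def rank_segments(segments: list[dict], domain: str = "sports",
                  top_k: int = 3) -> list[dict]:
    """Attach scores and return the top_k highest-scoring segments."""
    scored = [dict(seg, score=hype_score(seg["signals"], domain))
              for seg in segments]
    return sorted(scored, key=lambda s: s["score"], reverse=True)[:top_k]
```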

Phase 5: Pipeline & UI

  1. pipeline/orchestrator.py - Main coordinator

    • Coordinate all components
    • Handle errors gracefully
    • Progress reporting
  2. app.py - Gradio interface

    • Video upload
    • API key input (secure)
    • Prompt/instructions input
    • Domain selection
    • Reference image upload (for person filtering)
    • Progress bar
    • Output video gallery
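
The orchestrator's coordinate-and-degrade behavior might look like this sketch; the `(name, callable, required)` stage-tuple shape is an assumption, and the `progress` callback would map onto Gradio's progress reporting in app.py:

```python
def run_pipeline(stages, progress=None):
    """Run named stage callables in order.

    Required stages abort the pipeline on failure; optional stages are
    skipped so the pipeline continues with degraded functionality.
    """
    results, total = {}, len(stages)
    for i, (name, fn, required) in enumerate(stages, 1):
        if progress:
            progress(i / total, f"Running {name}...")
        try:
            results[name] = fn(results)  # each stage sees earlier results
        except Exception as exc:
            if required:
                raise RuntimeError(f"Pipeline failed at stage '{name}'") from exc
            results[name] = None  # degraded mode: optional stage skipped
    return results
```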

Key Design Decisions

Error Handling Strategy

  • Each module has try/except with specific exception types
  • Errors bubble up with context
  • Pipeline continues with degraded functionality when possible
  • User-friendly error messages in UI
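
The "errors bubble up with context" rule maps onto Python's exception chaining (`raise ... from`); `VideoProcessingError` and `load_video_metadata` below are illustrative names, not finalized APIs:

```python
class VideoProcessingError(Exception):
    """Domain-specific error carrying pipeline context."""

def load_video_metadata(path: str) -> bytes:
    """Re-raise low-level I/O errors with pipeline-level context attached."""
    try:
        with open(path, "rb") as f:
            return f.read(16)  # stand-in for a real ffprobe/FFmpeg call
    except OSError as exc:
        raise VideoProcessingError(f"could not read video at {path}") from exc
```

The UI layer can then show `str(err)` to the user while logging `err.__cause__` with its full stack trace.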

Logging Strategy

  • DEBUG: Model loading, frame processing details
  • INFO: Pipeline stages, timing, results
  • WARNING: Fallback triggers, degraded mode
  • ERROR: Failures with stack traces

Memory Management

  • Process frames in batches
  • Clear GPU memory between stages
  • Use generators where possible
  • Temporary file cleanup
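
Batch processing can lean on a small generator so only one batch of frames is resident at a time (in the real pipeline, a `torch.cuda.empty_cache()` call between stages would accompany this):

```python
from itertools import islice

def batched(frames, batch_size):
    """Yield lists of at most batch_size items, consuming the source lazily.

    Works on any iterable (including a frame-decoding generator), so the
    full frame sequence is never materialized in memory at once.
    """
    it = iter(frames)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```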

HuggingFace Space Considerations

  • Use gr.State for session data
  • Respect ZeroGPU limits (if using)
  • Cache models in /tmp or HF cache
  • Handle timeouts gracefully

API Key Usage

The API key input exists for future extensibility (e.g., calling external services). For the MVP, all processing runs locally on open-weight models.

Gradio UI Layout

┌─────────────────────────────────────────────────────────────┐
│  ShortSmith v2 - AI Video Highlight Extractor               │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────────────┐  ┌─────────────────────────────┐   │
│  │ Upload Video        │  │ Settings                    │   │
│  │ [Drop zone]         │  │ Domain: [Dropdown]          │   │
│  │                     │  │ Clip Duration: [Slider]     │   │
│  └─────────────────────┘  │ Num Clips: [Slider]         │   │
│                           │ API Key: [Password field]   │   │
│  ┌─────────────────────┐  └─────────────────────────────┘   │
│  │ Reference Image     │                                    │
│  │ (Optional)          │  ┌─────────────────────────────┐   │
│  │ [Drop zone]         │  │ Additional Instructions     │   │
│  └─────────────────────┘  │ [Textbox]                   │   │
│                           └─────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│  [🚀 Extract Highlights]                                    │
├─────────────────────────────────────────────────────────────┤
│  Progress: [████████████░░░░░░░░] 60%                       │
│  Status: Analyzing audio...                                 │
├─────────────────────────────────────────────────────────────┤
│  Results                                                    │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                     │
│  │ Clip 1   │ │ Clip 2   │ │ Clip 3   │                     │
│  │ [Video]  │ │ [Video]  │ │ [Video]  │                     │
│  │ Score:85 │ │ Score:78 │ │ Score:72 │                     │
│  └──────────┘ └──────────┘ └──────────┘                     │
│  [Download All]                                             │
└─────────────────────────────────────────────────────────────┘

Dependencies (requirements.txt)

gradio>=4.0.0
torch>=2.0.0
transformers>=4.35.0
accelerate
bitsandbytes
qwen-vl-utils
librosa>=0.10.0
soundfile
insightface
onnxruntime-gpu
opencv-python-headless
scenedetect[opencv]
numpy
pillow
tqdm
ffmpeg-python

Implementation Order

  1. config.py, utils/ (foundation)
  2. core/video_processor.py (essential)
  3. models/audio_analyzer.py (simpler, Librosa first)
  4. core/scene_detector.py
  5. core/frame_sampler.py
  6. scoring/ modules
  7. models/visual_analyzer.py (Qwen2-VL)
  8. models/face_recognizer.py, body_recognizer.py
  9. models/tracker.py, motion_detector.py
  10. pipeline/orchestrator.py
  11. app.py (Gradio UI)

Notes

  • Start with Librosa-only audio (MVP); add Wav2Vec 2.0 later
  • Face/body recognition is optional (triggered by reference image)
  • Motion detection can be skipped in MVP for speed
  • ByteTrack only needed when person filtering is enabled