Audio-to-Image Video Generator - Conversion Plan

Overview

Convert the Jupyter notebook into a proper Python CLI tool with optional Gradio UI.

Architecture

audio_video_generator/
├── pyproject.toml
├── requirements.txt
├── README.md
├── src/
│   └── audio_video_generator/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── config.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── audio.py          # Audio loading, Whisper transcription
│       │   ├── alignment.py      # CSV-to-audio alignment logic
│       │   ├── images.py         # Image processing, ZIP handling
│       │   ├── video.py          # Video composition, animations
│       │   ├── text_overlay.py   # Text overlay rendering
│       │   └── pipeline.py       # Pipeline orchestration (see Task 10)
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── files.py          # File utilities, checkpoints
│       │   └── text.py           # Text normalization, tokenization
│       └── web/
│           ├── __init__.py
│           └── gradio_ui.py      # Gradio interface
└── tests/
    └── ...

Tasks

Task 1: Project Structure and Packaging

Create the package structure with pyproject.toml, requirements.txt, and basic module setup.

Files to create:

pyproject.toml - Package metadata, dependencies, entry points
requirements.txt - Runtime dependencies
src/audio_video_generator/__init__.py - Version info
src/audio_video_generator/config.py - Configuration constants (RESOLUTION_MAP, ANIMATION_OPTIONS, etc.)

Key specs:

Package name: audio-video-generator
CLI entry point: avg command
Version: 1.0.0
Include all dependencies: openai-whisper (imported as whisper), moviepy, gradio, torch, pillow, pandas, numpy, click
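As a rough illustration of what config.py could hold, the sketch below collects the constants in one module. The animation and transition names are copied from the Task 6 specs in this plan; the resolution labels and sizes, TRANSITION_OPTIONS as a constant name, and the resolve_resolution() helper are assumptions, since the notebook's actual values are not listed here.

```python
"""Sketch of config.py: shared constants with no Colab-specific paths."""
from typing import Dict, Tuple

__version__ = "1.0.0"

# Assumed labels and sizes; the notebook's real RESOLUTION_MAP may differ.
RESOLUTION_MAP: Dict[str, Tuple[int, int]] = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "vertical_1080": (1080, 1920),
}

# Names taken from the Task 6 specs in this plan.
ANIMATION_OPTIONS: Tuple[str, ...] = (
    "none", "zoom_in", "zoom_out", "fade_in", "blink", "pulse", "fade_zoom_in",
)
TRANSITION_OPTIONS: Tuple[str, ...] = (
    "none", "fade", "crossfade", "slide_left", "slide_right", "dip_to_black", "flash",
)


def resolve_resolution(label: str) -> Tuple[int, int]:
    """Return (width, height) for a label, or raise ValueError listing valid labels."""
    try:
        return RESOLUTION_MAP[label]
    except KeyError:
        valid = ", ".join(sorted(RESOLUTION_MAP))
        raise ValueError(f"Unknown resolution {label!r}; expected one of: {valid}") from None
```

Keeping the option names in tuples lets the CLI and the Gradio UI build their choices from the same source.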

Task 2: Utility Modules

Extract utility functions from the notebook into reusable modules.

Files to create:

src/audio_video_generator/utils/text.py - normalize_text(), tokenize_text(), extract_number(), get_fuzzy_threshold(), clamp01(), safe_int(), apply_case_style()
src/audio_video_generator/utils/files.py - ensure_dir(), make_run_dir(), safe_output_name(), write_json_file(), write_text_file(), extract_zip(), collect_images_recursive(), sort_images()

Key specs:

No function may depend on module-level global state; text helpers must be pure, and file helpers must take every path as an explicit argument
Add type hints
Add docstrings
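A minimal sketch of a few utils/text.py helpers, written to the specs above (pure, typed, documented). The exact normalization rules in the notebook are not given in this plan, so the choices here (lowercasing, accent stripping, keeping only letters, digits, and apostrophes) are assumptions.

```python
import re
import unicodedata
from typing import Any, List

# Assumed token shape: runs of ASCII letters, digits, and apostrophes.
_WORD_RE = re.compile(r"[a-z0-9']+")


def normalize_text(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(_WORD_RE.findall(no_accents.lower()))


def tokenize_text(text: str) -> List[str]:
    """Split normalized text into word tokens."""
    return normalize_text(text).split()


def clamp01(value: float) -> float:
    """Clamp a number into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, float(value)))


def safe_int(value: Any, default: int = 0) -> int:
    """Convert to int, returning `default` when conversion is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default
```

For example, normalize_text("Héllo,  World!") yields "hello world", and safe_int("x", 5) falls back to 5.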

Task 3: Audio and Transcription Module

Extract Whisper-related functionality.

Files to create:

src/audio_video_generator/core/audio.py - transcribe_with_words(), extract_word_timeline(), get_whisper_model(), get_device()

Key specs:

Use singleton pattern for Whisper model (lazy loading)
Support CPU and CUDA
Handle model caching properly
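The lazy singleton could look roughly like the sketch below: one cached model per (name, device) pair, loaded on first use. get_device() and get_whisper_model() are the names from this task; the loader parameter, the "base" default, and the module-level cache are assumptions added so the cache can be exercised without downloading a model.

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple

# One cached model per (model name, device); filled on first request.
_MODEL_CACHE: Dict[Tuple[str, str], Any] = {}
_MODEL_LOCK = threading.Lock()


def get_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so this module imports without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def _load_whisper(name: str, device: str) -> Any:
    """Default loader; requires the openai-whisper package."""
    import whisper
    return whisper.load_model(name, device=device)


def get_whisper_model(
    name: str = "base",
    device: Optional[str] = None,
    loader: Callable[[str, str], Any] = _load_whisper,  # hypothetical hook for tests
) -> Any:
    """Return the cached model for (name, device), loading it on first use."""
    device = device or get_device()
    key = (name, device)
    with _MODEL_LOCK:
        if key not in _MODEL_CACHE:
            _MODEL_CACHE[key] = loader(name, device)
        return _MODEL_CACHE[key]
```

The lock matters mainly for the Gradio UI, where two requests could otherwise trigger two loads of the same model.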

Task 4: Image Processing Module

Extract image handling functionality.

Files to create:

src/audio_video_generator/core/images.py - prepare_image_inputs(), verify_and_filter_images(), build_image_indexes(), image_preflight_report(), resolve_image_reference(), resize_with_padding()

Key specs:

Support both ZIP and manual image inputs
Image caching for performance
Proper error handling for corrupt images
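The geometry behind resize_with_padding() can be sketched without any imaging library: scale the image to fit inside the target frame while keeping its aspect ratio, then center it. The helper name fit_with_padding() is hypothetical, and the notebook's function may round or align differently.

```python
from typing import Tuple

Size = Tuple[int, int]


def fit_with_padding(src_size: Size, dst_size: Size) -> Tuple[Size, Size]:
    """Return ((new_w, new_h), (offset_x, offset_y)) for a centered, padded fit.

    The image is scaled by the largest factor that keeps it inside the target
    frame; the offsets give the top-left corner where it should be pasted.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("All dimensions must be positive")
    scale = min(dst_w / src_w, dst_h / src_h)
    # Clamp so rounding can never exceed the frame or collapse to zero.
    new_w = max(1, min(dst_w, round(src_w * scale)))
    new_h = max(1, min(dst_h, round(src_h * scale)))
    offset = ((dst_w - new_w) // 2, (dst_h - new_h) // 2)
    return (new_w, new_h), offset
```

With Pillow, the result would typically feed Image.resize() for the scaled copy and Image.paste() onto a blank Image.new() canvas of the target size. A square 1000x1000 image fitted into 1920x1080 becomes 1080x1080 with a 420-pixel bar on each side.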

Task 5: CSV Alignment Module

Extract CSV loading and alignment logic.

Files to create:

src/audio_video_generator/core/alignment.py - load_csv(), preprocess_csv(), mapping_preflight_report(), find_sentence_match(), build_timeline()

Key specs:

CSV must have exactly 2 columns: text, image
Fuzzy matching with configurable thresholds
Duplicate row handling
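The specs above can be sketched with the standard library alone. This swaps in csv and difflib for illustration; the plan lists pandas as a dependency, and the notebook's matcher may use a different similarity measure. read_mapping() and find_best_window() are hypothetical names, and dropping exact duplicate rows while keeping the first occurrence is an assumed policy.

```python
import csv
from difflib import SequenceMatcher
from typing import Iterable, List, Optional, Sequence, Tuple


def read_mapping(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse CSV lines with the header `text,image` into unique (text, image) rows."""
    reader = csv.reader(lines)
    header = [cell.strip().lower() for cell in next(reader, [])]
    if header != ["text", "image"]:
        raise ValueError(f"Expected header 'text,image', got {header!r}")
    rows: List[Tuple[str, str]] = []
    seen = set()
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"Line {reader.line_num}: expected 2 columns, got {len(row)}")
        pair = (row[0].strip(), row[1].strip())
        if pair not in seen:  # assumed policy: keep the first copy of a duplicate
            seen.add(pair)
            rows.append(pair)
    return rows


def find_best_window(
    phrase: Sequence[str],
    words: Sequence[str],
    threshold: float = 0.8,
    start: int = 0,
) -> Optional[Tuple[int, int, float]]:
    """Slide a len(phrase)-word window over `words` from `start`.

    Returns (first_index, end_index, score) for the best-scoring window whose
    similarity ratio is at least `threshold`, or None when nothing qualifies.
    """
    n = len(phrase)
    if n == 0:
        return None
    target = " ".join(phrase)
    best: Optional[Tuple[int, int, float]] = None
    for i in range(start, max(start, len(words) - n) + 1):
        score = SequenceMatcher(None, target, " ".join(words[i:i + n])).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (i, min(i + n, len(words)), score)
    return best
```

The returned indexes point into the transcript word list, so the matching start and end times can be read from the word timeline produced in Task 3. Passing the previous match's end index as start keeps the timeline monotonic.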

Task 6: Video and Animation Module

Extract video composition and animation logic.

Files to create:

src/audio_video_generator/core/video.py - resolve_effect_sequence(), get_transition_duration(), apply_animation_to_clip(), apply_slide_position(), build_transition_overlay(), build_render_clips()

Key specs:

Support all animation types: none, zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
Support all transitions: none, fade, crossfade, slide_left, slide_right, dip_to_black, flash
Use moviepy for video processing
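Zoom animations usually reduce to a time-dependent scale factor. The sketch below shows one way to build such a function for zoom_in and zoom_out; make_scale_fn() is a hypothetical helper, and the linear easing and the 10% default amount are assumptions rather than the notebook's values.

```python
from typing import Callable


def make_scale_fn(kind: str, duration: float, amount: float = 0.1) -> Callable[[float], float]:
    """Return scale(t) for a clip of `duration` seconds.

    zoom_in grows linearly from 1.0 to 1.0 + amount; zoom_out does the reverse;
    none keeps the scale fixed at 1.0.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")

    def progress(t: float) -> float:
        # Fraction of the clip elapsed, clamped to [0, 1].
        return max(0.0, min(1.0, t / duration))

    if kind == "zoom_in":
        return lambda t: 1.0 + amount * progress(t)
    if kind == "zoom_out":
        return lambda t: 1.0 + amount * (1.0 - progress(t))
    if kind == "none":
        return lambda t: 1.0
    raise ValueError(f"Unsupported zoom kind: {kind!r}")
```

In MoviePy 1.x such a function can be passed to a clip's resize(); MoviePy 2.x renames that method to resized(), so apply_animation_to_clip() should target whichever major version gets pinned in pyproject.toml.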

Task 7: Text Overlay Module

Extract text overlay functionality.

Files to create:

src/audio_video_generator/core/text_overlay.py - load_overlay_txt(), find_all_phrase_matches(), build_text_overlay_events(), render_text_rgba(), resolve_overlay_position(), build_text_overlay_clips(), get_available_font_map(), get_pil_font(), make_text_style_config()

Key specs:

Support font selection, colors, patterns (solid, boxed, highlighted)
Support text animations: fade_in, pop_in, pulse, slide_up, glow_pop, typewriter, zoom_in
Typewriter effect builds character-by-character clips
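The character-by-character approach can be planned as a list of timed text prefixes before any clip is rendered. The sketch below computes that schedule; typewriter_segments() and the reveal_fraction parameter (the share of the overlay's duration spent typing) are hypothetical, not names from the notebook.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (visible text, start time, end time)


def typewriter_segments(
    text: str,
    start: float,
    duration: float,
    reveal_fraction: float = 0.6,
) -> List[Segment]:
    """Return one segment per revealed character.

    Characters appear at equal intervals during the first `reveal_fraction`
    of the overlay; the final segment holds the full text until the end.
    """
    if not text or duration <= 0:
        return []
    step = duration * reveal_fraction / len(text)
    end = start + duration
    segments: List[Segment] = []
    for i in range(1, len(text) + 1):
        seg_start = start + (i - 1) * step
        seg_end = end if i == len(text) else start + i * step
        segments.append((text[:i], seg_start, seg_end))
    return segments
```

Each segment would then be rendered with render_text_rgba() and turned into one short image clip. Long overlays produce many clips, so a cap on the number of segments may be worth considering.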

Task 8: CLI Interface

Create command-line interface using Click.

Files to create:

src/audio_video_generator/cli.py - Main CLI with commands and options
src/audio_video_generator/__main__.py - Entry point for python -m

Key specs:

Command: avg generate; running avg with no subcommand should behave the same way
Options for all major settings: audio, csv, images, resolution, output, animations, transitions
Progress reporting
Checkpoint saving
Proper error handling with exit codes
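A minimal Click skeleton for cli.py might look like the following. The option names follow the settings listed above, but their exact spelling, the defaults, the RESOLUTIONS tuple, and the run_pipeline() stand-in are all assumptions; animation and transition options are omitted for brevity.

```python
import click

RESOLUTIONS = ("720p", "1080p")  # assumed labels; the real list would come from config.py


def run_pipeline(audio: str, csv_path: str, images: str, resolution: str, output: str) -> str:
    """Stand-in for the real pipeline (Task 10); returns the output path."""
    return output


@click.group()
def main() -> None:
    """Audio-to-image video generator."""


@main.command()
@click.option("--audio", required=True, type=click.Path(exists=True, dir_okay=False),
              help="Narration audio file.")
@click.option("--csv", "csv_path", required=True, type=click.Path(exists=True, dir_okay=False),
              help="CSV with text,image columns.")
@click.option("--images", required=True, type=click.Path(exists=True),
              help="Image folder or ZIP archive.")
@click.option("--resolution", type=click.Choice(RESOLUTIONS), default="1080p", show_default=True)
@click.option("--output", type=click.Path(dir_okay=False), default="output.mp4", show_default=True)
def generate(audio: str, csv_path: str, images: str, resolution: str, output: str) -> None:
    """Render a video from an audio file, a CSV mapping, and images."""
    try:
        result = run_pipeline(audio, csv_path, images, resolution, output)
    except Exception as exc:
        # ClickException prints "Error: ..." and exits with status 1.
        raise click.ClickException(str(exc)) from exc
    click.echo(f"Wrote {result}")
```

Click already exits with status 2 for usage errors such as a missing file or an invalid choice, so pipeline failures (status 1) stay distinguishable from bad invocations. The avg command itself would come from a [project.scripts] entry in pyproject.toml pointing at this main function.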

Task 9: Gradio Web UI

Extract and clean up the Gradio interface.

Files to create:

src/audio_video_generator/web/gradio_ui.py - Full Gradio UI implementation
src/audio_video_generator/web/__init__.py

Key specs:

Command: avg web to launch UI
Include all features from notebook: file uploads, path inputs, live preview, text overlay editor
Drive integration optional (Colab-specific code made conditional)
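One way to make the Colab-specific code conditional is to detect the google.colab package at runtime and fall back to a portable working directory everywhere else. The helper names and the AVG_WORK_DIR environment variable below are hypothetical; drive.mount() is the standard Colab call and is only reached inside Colab.

```python
import importlib.util
import os
import tempfile
from pathlib import Path
from typing import Optional


def running_in_colab() -> bool:
    """Return True when the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent `google` package is missing or malformed.
        return False


def mount_drive_if_available(mount_point: str = "/content/drive") -> Optional[Path]:
    """Mount Google Drive inside Colab; do nothing and return None elsewhere."""
    if not running_in_colab():
        return None
    from google.colab import drive  # only importable inside Colab
    drive.mount(mount_point)
    return Path(mount_point)


def default_work_dir() -> Path:
    """Pick a writable working directory without assuming /content exists."""
    override = os.environ.get("AVG_WORK_DIR")  # hypothetical override variable
    root = Path(override) if override else Path(tempfile.gettempdir()) / "audio_video_generator"
    root.mkdir(parents=True, exist_ok=True)
    return root
```

The Gradio UI could then show its Drive controls only when running_in_colab() is true, leaving the local experience unchanged.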

Task 10: Main Pipeline Integration

Create the main orchestration pipeline.

Files to create:

src/audio_video_generator/core/pipeline.py - create_video_pipeline() function that orchestrates all components

Key specs:

Error handling with cleanup
Progress callbacks
Memory management (gc.collect(), torch.cuda.empty_cache() when CUDA is in use)
Report generation
Checkpoint saving
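The orchestration pattern described above (progress callbacks, cleanup on failure, memory release) can be sketched independently of the actual steps. run_steps(), release_memory(), and the state dictionary layout are hypothetical; create_video_pipeline() would wrap something of this shape around the real transcription, alignment, and rendering calls.

```python
import gc
from typing import Any, Callable, Dict, Optional, Sequence, Tuple

ProgressFn = Callable[[float, str], None]
Step = Tuple[str, Callable[[Dict[str, Any]], None]]


def release_memory() -> None:
    """Run the garbage collector and, if torch with CUDA is present, free cached GPU memory."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_steps(steps: Sequence[Step], progress: Optional[ProgressFn] = None) -> Dict[str, Any]:
    """Run named steps in order, reporting progress and always cleaning up.

    Steps share a `state` dict. Anything that needs closing (for example a
    moviepy clip) registers a zero-argument callable in state["closers"].
    """
    state: Dict[str, Any] = {"closers": [], "report": []}
    notify = progress or (lambda fraction, message: None)
    try:
        for index, (name, step) in enumerate(steps):
            notify(index / len(steps), name)
            step(state)
            state["report"].append(name)
        notify(1.0, "done")
        return state
    finally:
        # Close resources in reverse order, even when a step raised.
        for close in reversed(state["closers"]):
            try:
                close()
            except Exception:
                pass  # a failing close must not hide the original error
        release_memory()
```

The same progress callback signature can serve both front ends: the CLI can print the message, and the Gradio UI can forward the fraction to its progress bar.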

Task 11: Documentation

Create README and usage documentation.

Files to create:

README.md - Installation, usage examples, CSV format, CLI reference

Key specs:

Installation instructions
CSV format specification
CLI examples
Web UI usage

Execution Notes

All code must work outside Colab (no hardcoded /content paths)
Drive integration should be optional/conditional
Keep checkpoint functionality for debugging
Preserve all animation and transition options
Ensure proper resource cleanup (moviepy clips, torch CUDA)
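For the checkpoint requirement, a per-run directory plus an atomic JSON writer is usually enough. make_run_dir() and write_json_file() are names from Task 2, but the signatures, the timestamp format, and the write-then-rename strategy below are assumptions.

```python
import json
import os
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Any


def make_run_dir(base: Path, prefix: str = "run") -> Path:
    """Create and return a fresh timestamped directory under `base`."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    run_dir = Path(base) / f"{prefix}_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def write_json_file(path: Path, payload: Any) -> Path:
    """Write JSON to a temp file in the target folder, then rename it into place.

    The rename means a crash mid-write never leaves a half-written checkpoint.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # default=str keeps non-JSON values (such as Path objects) from raising.
            json.dump(payload, handle, indent=2, ensure_ascii=False, default=str)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return path
```

Because the base directory is always passed in, nothing here depends on /content or any other Colab path.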