Audio-to-Image Video Generator - Conversion Plan

Overview

Convert the Jupyter notebook into a proper Python CLI tool with an optional Gradio UI.

Architecture

audio_video_generator/
├── pyproject.toml
├── requirements.txt
├── README.md
├── src/
│   └── audio_video_generator/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── config.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── audio.py          # Audio loading, Whisper transcription
│       │   ├── alignment.py      # CSV-to-audio alignment logic
│       │   ├── images.py         # Image processing, ZIP handling
│       │   ├── video.py          # Video composition, animations
│       │   ├── text_overlay.py   # Text overlay rendering
│       │   └── pipeline.py       # Pipeline orchestration (Task 10)
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── files.py          # File utilities, checkpoints
│       │   └── text.py           # Text normalization, tokenization
│       └── web/
│           ├── __init__.py
│           └── gradio_ui.py      # Gradio interface
└── tests/
    └── ...

Tasks

Task 1: Project Structure and Packaging

Create the package structure with pyproject.toml, requirements.txt, and basic module setup.

Files to create:

  • pyproject.toml - Package metadata, dependencies, entry points
  • requirements.txt - Runtime dependencies
  • src/audio_video_generator/__init__.py - Version info
  • src/audio_video_generator/config.py - Configuration constants (RESOLUTION_MAP, ANIMATION_OPTIONS, etc.)

Key specs:

  • Package name: audio-video-generator
  • CLI entry point: avg command
  • Version: 1.0.0
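  • Include all dependencies: whisper (published on PyPI as openai-whisper, assuming OpenAI's reference implementation is the one in use), moviepy, gradio, torch, pillow, pandas, numpy, and click (required by the CLI in Task 8)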
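
As a rough illustration of what config.py could hold, the sketch below defines the constants named above. The animation and transition names are copied from Task 6 of this plan. The resolution labels and pixel sizes, the TRANSITION_OPTIONS name, and the resolve_resolution() helper are hypothetical, since the plan does not specify them.

```python
"""Sketch of config.py: shared constants for the generator."""
from typing import Dict, Tuple

# Hypothetical labels and sizes; the notebook's actual values may differ.
RESOLUTION_MAP: Dict[str, Tuple[int, int]] = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}

# Animation names as listed under Task 6 of this plan.
ANIMATION_OPTIONS: Tuple[str, ...] = (
    "none", "zoom_in", "zoom_out", "fade_in", "blink", "pulse", "fade_zoom_in",
)

# Assumed constant name; the values are the transitions listed under Task 6.
TRANSITION_OPTIONS: Tuple[str, ...] = (
    "none", "fade", "crossfade", "slide_left", "slide_right",
    "dip_to_black", "flash",
)


def resolve_resolution(label: str) -> Tuple[int, int]:
    """Return (width, height) for a label, or raise ValueError (hypothetical helper)."""
    try:
        return RESOLUTION_MAP[label]
    except KeyError:
        valid = ", ".join(sorted(RESOLUTION_MAP))
        raise ValueError(
            f"unknown resolution {label!r}; expected one of: {valid}"
        ) from None
```

Keeping the option names as tuples makes them easy to reuse as the allowed choices for CLI options and UI dropdowns.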

Task 2: Utility Modules

Extract utility functions from the notebook into reusable modules.

Files to create:

  • src/audio_video_generator/utils/text.py - normalize_text(), tokenize_text(), extract_number(), get_fuzzy_threshold(), clamp01(), safe_int(), apply_case_style()
  • src/audio_video_generator/utils/files.py - ensure_dir(), make_run_dir(), safe_output_name(), write_json_file(), write_text_file(), extract_zip(), collect_images_recursive(), sort_images()

Key specs:

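  • Functions must not depend on global state; text helpers should be pure, and file helpers should limit their side effects to the paths they are given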
  • Add type hints
  • Add docstrings
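
A minimal sketch of a few of the text helpers follows, to show the intended style (pure, typed, documented). The function names come from the list above, but the signatures and exact normalization rules are assumptions, since the notebook's behavior is not described here.

```python
import re
import unicodedata
from typing import List, Optional


def normalize_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize_text(text: str) -> List[str]:
    """Split normalized text into whitespace-separated tokens."""
    return normalize_text(text).split()


def extract_number(text: str) -> Optional[int]:
    """Return the first run of digits in text as an int, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def clamp01(value: float) -> float:
    """Clamp a number to the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, float(value)))


def safe_int(value: object, default: int = 0) -> int:
    """Convert value to int, returning default when conversion fails."""
    try:
        return int(value)  # type: ignore[call-overload]
    except (TypeError, ValueError):
        return default
```

Because none of these touch module-level state, they can be unit-tested in isolation without loading audio, images, or models.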

Task 3: Audio and Transcription Module

Extract Whisper-related functionality.

Files to create:

  • src/audio_video_generator/core/audio.py - transcribe_with_words(), extract_word_timeline(), get_whisper_model(), get_device()

Key specs:

  • Use singleton pattern for Whisper model (lazy loading)
  • Support CPU and CUDA
  • Handle model caching properly
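
One way to get a lazily loaded, cached model is a module-level dictionary keyed by model name and device, as sketched below. This assumes the openai-whisper package, whose whisper.load_model(name, device=...) call is used by the default loader. The loader parameter and the "base" default are hypothetical additions that make the cache testable without downloading a model.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Cache of loaded models, keyed by (model name, device).
_MODEL_CACHE: Dict[Tuple[str, str], Any] = {}


def get_device() -> str:
    """Return "cuda" when torch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def _default_loader(name: str, device: str) -> Any:
    """Load a model with openai-whisper (assumed to be the package in use)."""
    import whisper  # imported lazily so importing this module stays cheap

    return whisper.load_model(name, device=device)


def get_whisper_model(
    name: str = "base",
    device: Optional[str] = None,
    loader: Callable[[str, str], Any] = _default_loader,
) -> Any:
    """Return a cached model, loading it on first use only."""
    device = device or get_device()
    key = (name, device)
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader(name, device)
    return _MODEL_CACHE[key]
```

Repeated calls with the same name and device return the same object, so the CLI and the web UI can share one model within a process.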

Task 4: Image Processing Module

Extract image handling functionality.

Files to create:

  • src/audio_video_generator/core/images.py - prepare_image_inputs(), verify_and_filter_images(), build_image_indexes(), image_preflight_report(), resolve_image_reference(), resize_with_padding()

Key specs:

  • Support both ZIP and manual image inputs
  • Image caching for performance
  • Proper error handling for corrupt images
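
The geometry behind resize_with_padding() can be sketched without any imaging library: scale the image to fit inside the target frame while keeping its aspect ratio, then center it. The helper name fit_with_padding() and its tuple-based signature are hypothetical and are not taken from the notebook.

```python
from typing import Tuple

Size = Tuple[int, int]


def fit_with_padding(src_size: Size, dst_size: Size) -> Tuple[Size, Size]:
    """Return (new_size, offset) for letterboxing src inside dst.

    new_size is the scaled image size that preserves the aspect ratio;
    offset is the top-left corner that centers it on the target canvas.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("all dimensions must be positive")

    # The smaller ratio is the largest scale that still fits both axes.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))

    # Split the leftover space evenly on each side.
    offset = ((dst_w - new_w) // 2, (dst_h - new_h) // 2)
    return (new_w, new_h), offset
```

With Pillow, the returned values would typically feed Image.resize(new_size) followed by pasting the result at offset onto a blank canvas of the target size.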

Task 5: CSV Alignment Module

Extract CSV loading and alignment logic.

Files to create:

  • src/audio_video_generator/core/alignment.py - load_csv(), preprocess_csv(), mapping_preflight_report(), find_sentence_match(), build_timeline()

Key specs:

  • CSV must have exactly 2 columns: text, image
  • Fuzzy matching with configurable thresholds
  • Duplicate row handling
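
The sketch below illustrates the three specs above under stated assumptions: the CSV has a header row named text,image; duplicate rows are dropped while keeping the first occurrence; and fuzzy matching uses difflib.SequenceMatcher with a threshold of 0.8. It reads from a text stream with the standard csv module rather than pandas, purely to stay dependency-free. The notebook's actual matching algorithm and thresholds may differ.

```python
import csv
from difflib import SequenceMatcher
from typing import Iterable, List, Optional, Sequence, Tuple

REQUIRED_COLUMNS = ["text", "image"]


def load_csv(stream: Iterable[str]) -> List[Tuple[str, str]]:
    """Read (text, image) rows, validating the header and column count."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None or [h.strip().lower() for h in header] != REQUIRED_COLUMNS:
        raise ValueError("CSV must have exactly 2 columns: text, image")

    rows: List[Tuple[str, str]] = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        if not row:  # csv.reader yields [] for blank lines
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        pair = (row[0].strip(), row[1].strip())
        if pair in seen:  # assumed policy: keep the first copy of a duplicate
            continue
        seen.add(pair)
        rows.append(pair)
    return rows


def find_sentence_match(
    sentence: str, candidates: Sequence[str], threshold: float = 0.8
) -> Optional[int]:
    """Return the index of the best fuzzy match, or None below threshold."""
    best_index: Optional[int] = None
    best_score = 0.0
    for index, candidate in enumerate(candidates):
        score = SequenceMatcher(None, sentence.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= threshold else None
```

Exposing the threshold as a parameter lets get_fuzzy_threshold() from the utilities supply a value that varies with sentence length, if that is how the notebook tunes it.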

Task 6: Video and Animation Module

Extract video composition and animation logic.

Files to create:

  • src/audio_video_generator/core/video.py - resolve_effect_sequence(), get_transition_duration(), apply_animation_to_clip(), apply_slide_position(), build_transition_overlay(), build_render_clips()

Key specs:

  • Support all animation types: none, zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
  • Support all transitions: none, fade, crossfade, slide_left, slide_right, dip_to_black, flash
  • Use moviepy for video processing
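
To make the animation names concrete, the sketch below maps each one to a scale factor and an opacity at time t. Every curve and constant here (zoom amount, fade length, pulse and blink rates) is an assumption chosen for illustration, and animation_state() is a hypothetical helper. Wiring the values into moviepy clips is deliberately left out.

```python
import math
from typing import Tuple

# Illustrative constants; the notebook's values are not specified in this plan.
ZOOM_AMOUNT = 0.15   # extra scale reached by a full zoom
FADE_SECONDS = 0.5   # length of the fade-in ramp
PULSE_AMOUNT = 0.05  # peak scale deviation of a pulse
BLINK_PERIOD = 0.5   # seconds per visible/hidden cycle


def animation_state(name: str, t: float, duration: float) -> Tuple[float, float]:
    """Return (scale, opacity) for an animation at time t within a clip."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    progress = min(max(t / duration, 0.0), 1.0)
    fade = min(max(t / min(FADE_SECONDS, duration), 0.0), 1.0)

    if name == "none":
        return 1.0, 1.0
    if name == "zoom_in":
        return 1.0 + ZOOM_AMOUNT * progress, 1.0
    if name == "zoom_out":
        return 1.0 + ZOOM_AMOUNT * (1.0 - progress), 1.0
    if name == "fade_in":
        return 1.0, fade
    if name == "fade_zoom_in":
        return 1.0 + ZOOM_AMOUNT * progress, fade
    if name == "pulse":
        # One full sine cycle per second around the base scale.
        return 1.0 + PULSE_AMOUNT * math.sin(2.0 * math.pi * t), 1.0
    if name == "blink":
        # Visible for the first half of each period, hidden for the second.
        visible = int(t / (BLINK_PERIOD / 2.0)) % 2 == 0
        return 1.0, 1.0 if visible else 0.0
    raise ValueError(f"unknown animation: {name!r}")
```

Separating the timing math from the rendering calls keeps apply_animation_to_clip() thin and lets the curves be tested without rendering any video.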

Task 7: Text Overlay Module

Extract text overlay functionality.

Files to create:

  • src/audio_video_generator/core/text_overlay.py - load_overlay_txt(), find_all_phrase_matches(), build_text_overlay_events(), render_text_rgba(), resolve_overlay_position(), build_text_overlay_clips(), get_available_font_map(), get_pil_font(), make_text_style_config()

Key specs:

  • Support font selection, colors, patterns (solid, boxed, highlighted)
  • Support text animations: fade_in, pop_in, pulse, slide_up, glow_pop, typewriter, zoom_in
  • Typewriter effect builds character-by-character clips
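
The character-by-character idea behind the typewriter effect can be sketched as a list of timed text prefixes, each of which would become one rendered clip. The typewriter_frames() helper, its parameters, and the even spacing of characters are assumptions made for illustration.

```python
from typing import List, Tuple

# (start time, end time, visible text) for one typewriter frame.
Frame = Tuple[float, float, str]


def typewriter_frames(
    text: str, start: float, end: float, reveal_seconds: float
) -> List[Frame]:
    """Split an overlay into growing prefixes revealed at an even pace.

    Characters appear over reveal_seconds (capped at the overlay length);
    the full text then stays visible until end.
    """
    if not text:
        return []
    if end <= start:
        raise ValueError("end must be later than start")
    if reveal_seconds <= 0:
        raise ValueError("reveal_seconds must be positive")

    reveal = min(reveal_seconds, end - start)
    step = reveal / len(text)

    frames: List[Frame] = []
    for i in range(len(text)):
        frame_start = start + i * step
        # The last prefix holds until the overlay ends.
        frame_end = end if i == len(text) - 1 else start + (i + 1) * step
        frames.append((frame_start, frame_end, text[: i + 1]))
    return frames
```

For example, typewriter_frames("abc", 0.0, 3.0, 1.5) yields three frames: "a" for the first half second, "ab" for the next, and "abc" from 1.0 s until the overlay ends at 3.0 s.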

Task 8: CLI Interface

Create command-line interface using Click.

Files to create:

  • src/audio_video_generator/cli.py - Main CLI with commands and options
  • src/audio_video_generator/__main__.py - Entry point for python -m

Key specs:

  • Command: avg generate, with bare avg accepted as shorthand for the same command
  • Options for all major settings: audio, csv, images, resolution, output, animations, transitions
  • Progress reporting
  • Checkpoint saving
  • Proper error handling with exit codes
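
The sketch below shows one way to treat bare avg as avg generate and to map failures to exit codes. It uses the standard-library argparse as a stand-in for Click so that it runs without third-party packages; the same structure carries over to a Click group. The option names follow the list above, while the default values, the exit-code numbers, and the run_generate hook are hypothetical.

```python
import argparse
import sys
from typing import Callable, List, Optional

EXIT_OK = 0
EXIT_RUNTIME_ERROR = 1  # argparse itself exits with 2 on usage errors
COMMANDS = ("generate", "web")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="avg")
    sub = parser.add_subparsers(dest="command", required=True)

    gen = sub.add_parser("generate", help="render a video from audio, CSV and images")
    gen.add_argument("--audio", required=True)
    gen.add_argument("--csv", required=True)
    gen.add_argument("--images", required=True)
    gen.add_argument("--resolution", default="1080p")  # hypothetical default
    gen.add_argument("--output", default="output.mp4")  # hypothetical default

    sub.add_parser("web", help="launch the Gradio UI")
    return parser


def _placeholder(args: argparse.Namespace) -> None:
    """Stand-in for the real pipeline call."""
    print(f"would render {args.output} from {args.audio}")


def main(
    argv: Optional[List[str]] = None,
    run_generate: Optional[Callable[[argparse.Namespace], None]] = None,
) -> int:
    argv = list(sys.argv[1:] if argv is None else argv)
    # Treat "avg --audio ..." as "avg generate --audio ...".
    if not argv or (argv[0] not in COMMANDS and argv[0] not in ("-h", "--help")):
        argv.insert(0, "generate")

    args = build_parser().parse_args(argv)
    if args.command == "web":
        print("would launch the Gradio UI here")
        return EXIT_OK
    try:
        (run_generate or _placeholder)(args)
    except Exception as exc:  # report the failure and return a non-zero code
        print(f"error: {exc}", file=sys.stderr)
        return EXIT_RUNTIME_ERROR
    return EXIT_OK
```

A __main__.py entry point would then reduce to calling sys.exit(main()), so the process exit status reflects the returned code.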

Task 9: Gradio Web UI

Extract and clean up the Gradio interface.

Files to create:

  • src/audio_video_generator/web/gradio_ui.py - Full Gradio UI implementation
  • src/audio_video_generator/web/__init__.py

Key specs:

  • Command: avg web to launch UI
  • Include all features from notebook: file uploads, path inputs, live preview, text overlay editor
  • Google Drive integration is optional: Colab-specific code runs only when a Colab environment is detected
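
Making the Colab-specific code conditional could look like the sketch below, which detects Colab by attempting to import google.colab and mounts Drive only when that succeeds. The helper names are hypothetical, and the mount point is passed in by the caller rather than hardcoded.

```python
from typing import Optional


def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only present inside Colab)
    except ImportError:
        return False
    return True


def mount_drive_if_available(mount_point: str) -> Optional[str]:
    """Mount Google Drive in Colab; do nothing and return None elsewhere."""
    if not in_colab():
        return None
    from google.colab import drive

    drive.mount(mount_point)
    return mount_point
```

Outside Colab the function returns None, so the UI can hide or disable its Drive controls based on that result.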

Task 10: Main Pipeline Integration

Create the main orchestration pipeline.

Files to create:

  • src/audio_video_generator/core/pipeline.py - create_video_pipeline() function that orchestrates all components

Key specs:

  • Error handling with cleanup
  • Progress callbacks
  • Memory management (gc.collect(), torch.cuda.empty_cache())
  • Report generation
  • Checkpoint saving
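
A minimal sketch of the orchestration pattern follows: run named steps in order, report progress after each, always release memory, and return a small report. The step-list interface, the report shape, and the release_memory() helper are assumptions; the real create_video_pipeline() will take the audio, CSV, and image inputs directly.

```python
import gc
from typing import Any, Callable, Dict, List, Optional, Tuple

# A step is a name plus a callable that reads and updates shared state.
Step = Tuple[str, Callable[[Dict[str, Any]], None]]
ProgressCallback = Callable[[float, str], None]


def release_memory() -> None:
    """Run the garbage collector and clear the CUDA cache when torch is present."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def create_video_pipeline(
    steps: List[Step], progress: Optional[ProgressCallback] = None
) -> Dict[str, Any]:
    """Run steps in order and return a report of what completed."""
    state: Dict[str, Any] = {}
    report: Dict[str, Any] = {"completed": [], "error": None}
    name = "<setup>"
    try:
        for index, (name, step) in enumerate(steps, start=1):
            step(state)
            report["completed"].append(name)
            if progress is not None:
                progress(index / len(steps), name)
    except Exception as exc:
        # Record which step failed instead of losing the partial report.
        report["error"] = f"{name}: {exc}"
    finally:
        release_memory()  # cleanup runs on success and on failure
    return report
```

The same progress callback signature can serve both front ends: the CLI can print a percentage, and the Gradio UI can forward the fraction to its progress bar.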

Task 11: Documentation

Create README and usage documentation.

Files to create:

  • README.md - Installation, usage examples, CSV format, CLI reference

Key specs:

  • Installation instructions
  • CSV format specification
  • CLI examples
  • Web UI usage

Execution Notes

  • All code must work outside Colab (no hardcoded /content paths)
  • Drive integration should be optional/conditional
  • Keep checkpoint functionality for debugging
  • Preserve all animation and transition options
  • Ensure proper resource cleanup (close moviepy clips, clear the torch CUDA cache)
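
As an illustration of checkpointing without Colab paths, the sketch below creates a timestamped run directory under a caller-supplied base (falling back to the system temp directory) and writes each stage's data as JSON. ensure_dir(), make_run_dir(), and write_json_file() are named in Task 2, but their signatures here are assumptions, and save_checkpoint() is a hypothetical helper.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Optional, Union

PathLike = Union[str, Path]


def ensure_dir(path: PathLike) -> Path:
    """Create the directory (and parents) if needed and return it as a Path."""
    path = Path(path)
    path.mkdir(parents=True, exist_ok=True)
    return path


def make_run_dir(base_dir: Optional[PathLike] = None, prefix: str = "run") -> Path:
    """Create a timestamped run directory under base_dir or the temp dir."""
    if base_dir is not None:
        base = Path(base_dir)
    else:
        base = Path(tempfile.gettempdir()) / "audio_video_generator"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return ensure_dir(base / f"{prefix}-{stamp}")


def write_json_file(path: PathLike, data: Any) -> Path:
    """Write data as UTF-8 JSON, creating parent directories as needed."""
    path = Path(path)
    ensure_dir(path.parent)
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def save_checkpoint(run_dir: PathLike, stage: str, data: Any) -> Path:
    """Store one stage's intermediate data under <run_dir>/checkpoints/."""
    return write_json_file(Path(run_dir) / "checkpoints" / f"{stage}.json", data)
```

Writing one file per stage (for example, transcript or timeline) makes it possible to inspect where a failed run went wrong without re-running the earlier stages.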