Audio-to-Image Video Generator - Conversion Plan
Overview
Convert the Jupyter notebook into a proper Python CLI tool with optional Gradio UI.
Architecture
audio_video_generator/
├── pyproject.toml
├── requirements.txt
├── README.md
├── src/
│   └── audio_video_generator/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── config.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── audio.py          # Audio loading, Whisper transcription
│       │   ├── alignment.py      # CSV-to-audio alignment logic
│       │   ├── images.py         # Image processing, ZIP handling
│       │   ├── video.py          # Video composition, animations
│       │   └── text_overlay.py   # Text overlay rendering
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── files.py          # File utilities, checkpoints
│       │   └── text.py           # Text normalization, tokenization
│       └── web/
│           ├── __init__.py
│           └── gradio_ui.py      # Gradio interface
└── tests/
    └── ...
Tasks
Task 1: Project Structure and Packaging
Create the package structure with pyproject.toml, requirements.txt, and basic module setup.
Files to create:
- pyproject.toml - Package metadata, dependencies, entry points
- requirements.txt - Runtime dependencies
- src/audio_video_generator/__init__.py - Version info
- src/audio_video_generator/config.py - Configuration constants (RESOLUTION_MAP, ANIMATION_OPTIONS, etc.)
Key specs:
- Package name: audio-video-generator
- CLI entry point: avg command
- Version: 1.0.0
- Include all dependencies: whisper (published on PyPI as openai-whisper), moviepy, gradio, torch, pillow, pandas, numpy, plus click for the Task 8 CLI
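A minimal sketch of what config.py might hold, assuming the constants are plain module-level mappings and tuples. The animation and transition names are the ones this plan lists under Task 6; the RESOLUTION_MAP keys and sizes, TRANSITION_OPTIONS as a constant name, and the resolve_resolution() helper are hypothetical, since the notebook's actual values are not given here.

```python
"""Sketch of config.py: shared constants for the package."""
from typing import Dict, Tuple

# Hypothetical presets: the notebook's real resolution keys and sizes
# are not specified in this plan.
RESOLUTION_MAP: Dict[str, Tuple[int, int]] = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}

# Names taken from the Task 6 key specs.
ANIMATION_OPTIONS: Tuple[str, ...] = (
    "none", "zoom_in", "zoom_out", "fade_in", "blink", "pulse", "fade_zoom_in",
)
TRANSITION_OPTIONS: Tuple[str, ...] = (
    "none", "fade", "crossfade", "slide_left", "slide_right", "dip_to_black", "flash",
)


def resolve_resolution(name: str) -> Tuple[int, int]:
    """Return (width, height) for a preset name, or raise ValueError."""
    try:
        return RESOLUTION_MAP[name]
    except KeyError:
        known = ", ".join(sorted(RESOLUTION_MAP))
        raise ValueError(f"Unknown resolution {name!r}; expected one of: {known}") from None
```

Keeping these in one module lets the CLI and the Gradio UI build their choice lists from the same source.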
Task 2: Utility Modules
Extract utility functions from the notebook into reusable modules.
Files to create:
- src/audio_video_generator/utils/text.py - normalize_text(), tokenize_text(), extract_number(), get_fuzzy_threshold(), clamp01(), safe_int(), apply_case_style()
- src/audio_video_generator/utils/files.py - ensure_dir(), make_run_dir(), safe_output_name(), write_json_file(), write_text_file(), extract_zip(), collect_images_recursive(), sort_images()
Key specs:
- No function may depend on global state: text helpers must be pure, and file helpers may touch only the paths passed to them
- Add type hints
- Add docstrings
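As an illustration of these specs, here is a sketch of a few of the text helpers with type hints and docstrings. The function names come from the list above, but the exact normalization rules (accent stripping, punctuation removal) are assumptions, not the notebook's confirmed behavior.

```python
import re
import unicodedata
from typing import Any, List


def normalize_text(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", without_marks.lower())
    return " ".join(cleaned.split())


def tokenize_text(text: str) -> List[str]:
    """Split normalized text into whitespace-separated tokens."""
    return normalize_text(text).split()


def clamp01(value: float) -> float:
    """Clamp a number into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, float(value)))


def safe_int(value: Any, default: int = 0) -> int:
    """Convert to int, returning `default` when conversion is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        return default
```

Each function depends only on its arguments, which makes it straightforward to unit test in isolation.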
Task 3: Audio and Transcription Module
Extract Whisper-related functionality.
Files to create:
- src/audio_video_generator/core/audio.py - transcribe_with_words(), extract_word_timeline(), get_whisper_model(), get_device()
Key specs:
- Use singleton pattern for Whisper model (lazy loading)
- Support CPU and CUDA
- Handle model caching properly
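One way to satisfy the lazy singleton and CPU/CUDA requirements is a small keyed cache, sketched below. get_device() and get_whisper_model() are named in this plan, but their signatures here are assumptions; the `loader` parameter and `_default_loader` are hypothetical additions that keep the heavy imports lazy and allow testing without downloading a model. The default loader assumes the openai-whisper package, whose load_model() accepts a model name and a device.

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple

_MODEL_CACHE: Dict[Tuple[str, str], Any] = {}
_MODEL_LOCK = threading.Lock()


def get_device() -> str:
    """Return "cuda" when torch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so CPU-only tooling can import this module
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def _default_loader(name: str, device: str) -> Any:
    """Load a Whisper model (assumes the openai-whisper package is installed)."""
    import whisper

    return whisper.load_model(name, device=device)


def get_whisper_model(
    name: str = "base",
    device: Optional[str] = None,
    loader: Callable[[str, str], Any] = _default_loader,  # hypothetical test hook
) -> Any:
    """Return a cached model, loading it on first use for each (name, device)."""
    resolved = device or get_device()
    key = (name, resolved)
    with _MODEL_LOCK:
        if key not in _MODEL_CACHE:
            _MODEL_CACHE[key] = loader(name, resolved)
        return _MODEL_CACHE[key]
```

Keying the cache on both model name and device avoids handing a CPU-loaded model to a caller that asked for CUDA.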
Task 4: Image Processing Module
Extract image handling functionality.
Files to create:
- src/audio_video_generator/core/images.py - prepare_image_inputs(), verify_and_filter_images(), build_image_indexes(), image_preflight_report(), resolve_image_reference(), resize_with_padding()
Key specs:
- Support both ZIP and manual image inputs
- Image caching for performance
- Proper error handling for corrupt images
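The geometry behind a function like resize_with_padding() can be sketched without any imaging library. The helper name fit_with_padding() is hypothetical, and the assumption is a letterbox-style fit: scale the image to fit inside the target frame while preserving aspect ratio, then center it.

```python
from typing import Tuple


def fit_with_padding(src_w: int, src_h: int, dst_w: int, dst_h: int) -> Tuple[int, int, int, int]:
    """Return (new_w, new_h, offset_x, offset_y) for an aspect-preserving fit.

    The image is scaled by the smaller of the two axis ratios so it fits
    entirely inside the target, then centered; the rest is padding.
    """
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("all dimensions must be positive")
    scale = min(dst_w / src_w, dst_h / src_h)
    # Guard against rounding pushing a side to 0 or past the target size.
    new_w = max(1, min(dst_w, round(src_w * scale)))
    new_h = max(1, min(dst_h, round(src_h * scale)))
    return new_w, new_h, (dst_w - new_w) // 2, (dst_h - new_h) // 2
```

With Pillow, the returned size would feed Image.resize() and the offsets would feed Image.paste() onto a canvas created with Image.new() at the target resolution.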
Task 5: CSV Alignment Module
Extract CSV loading and alignment logic.
Files to create:
- src/audio_video_generator/core/alignment.py - load_csv(), preprocess_csv(), mapping_preflight_report(), find_sentence_match(), build_timeline()
Key specs:
- CSV must have exactly 2 columns: text, image
- Fuzzy matching with configurable thresholds
- Duplicate row handling
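The three specs above can be sketched with the standard library alone. This is an illustration, not the notebook's logic: it assumes the CSV has a header row naming the two columns, that duplicate handling means dropping exact repeats, and it swaps in difflib.SequenceMatcher for whatever fuzzy matcher the notebook uses. The names load_csv_rows() and find_best_match() are hypothetical.

```python
import csv
from difflib import SequenceMatcher
from typing import IO, List, Optional, Sequence, Tuple


def load_csv_rows(handle: IO[str]) -> List[Tuple[str, str]]:
    """Read (text, image) pairs, enforcing exactly two columns."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None or [h.strip().lower() for h in header] != ["text", "image"]:
        raise ValueError("CSV must have exactly 2 columns: text, image")
    rows: List[Tuple[str, str]] = []
    seen = set()
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {reader.line_num}: expected 2 columns, got {len(row)}")
        pair = (row[0].strip(), row[1].strip())
        if pair in seen:
            continue  # assumption: exact duplicate rows are dropped
        seen.add(pair)
        rows.append(pair)
    return rows


def find_best_match(sentence: str, candidates: Sequence[str], threshold: float = 0.8) -> Optional[int]:
    """Return the index of the most similar candidate, or None below threshold."""
    best_index: Optional[int] = None
    best_score = 0.0
    for index, candidate in enumerate(candidates):
        score = SequenceMatcher(None, sentence.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= threshold else None
```

Exposing the threshold as a parameter keeps it configurable per call, which is what a helper like get_fuzzy_threshold() would presumably feed.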
Task 6: Video and Animation Module
Extract video composition and animation logic.
Files to create:
- src/audio_video_generator/core/video.py - resolve_effect_sequence(), get_transition_duration(), apply_animation_to_clip(), apply_slide_position(), build_transition_overlay(), build_render_clips()
Key specs:
- Support all animation types: none, zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
- Support all transitions: none, fade, crossfade, slide_left, slide_right, dip_to_black, flash
- Use moviepy for video processing
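Several of the animations reduce to a scale or opacity value that varies over the clip's duration. The sketch below shows one possible set of curves as pure functions of time, independent of moviepy; the function names, the `strength` and `fade_fraction` parameters, and the curve shapes are all assumptions. The blink effect is not covered.

```python
import math


def _progress(t: float, duration: float) -> float:
    """Fraction of the clip elapsed, clamped to [0, 1]."""
    if duration <= 0:
        return 0.0
    return max(0.0, min(1.0, t / duration))


def animation_scale(kind: str, t: float, duration: float, strength: float = 0.1) -> float:
    """Scale factor at time t; 1.0 means the image's base size."""
    p = _progress(t, duration)
    if kind in ("zoom_in", "fade_zoom_in"):
        return 1.0 + strength * p
    if kind == "zoom_out":
        return 1.0 + strength * (1.0 - p)
    if kind == "pulse":
        # One smooth swell and return over the clip.
        return 1.0 + strength * 0.5 * (1.0 - math.cos(2.0 * math.pi * p))
    return 1.0


def animation_opacity(kind: str, t: float, duration: float, fade_fraction: float = 0.25) -> float:
    """Opacity at time t; fades in over the first `fade_fraction` of the clip."""
    fade_time = duration * fade_fraction
    if kind not in ("fade_in", "fade_zoom_in") or fade_time <= 0:
        return 1.0
    return max(0.0, min(1.0, t / fade_time))
```

Functions of this shape can be passed to a time-dependent resize or opacity effect in the video library, which keeps the curves testable without rendering frames.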
Task 7: Text Overlay Module
Extract text overlay functionality.
Files to create:
- src/audio_video_generator/core/text_overlay.py - load_overlay_txt(), find_all_phrase_matches(), build_text_overlay_events(), render_text_rgba(), resolve_overlay_position(), build_text_overlay_clips(), get_available_font_map(), get_pil_font(), make_text_style_config()
Key specs:
- Support font selection, colors, patterns (solid, boxed, highlighted)
- Support text animations: fade_in, pop_in, pulse, slide_up, glow_pop, typewriter, zoom_in
- Typewriter effect builds character-by-character clips
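The character-by-character approach can be illustrated by computing the timing of each partial string before any clip is rendered. This is a sketch under assumptions: the name typewriter_events() and the `type_fraction` parameter (the share of the overlay's duration spent typing) are hypothetical.

```python
from typing import List, Tuple


def typewriter_events(
    text: str, start: float, duration: float, type_fraction: float = 0.6
) -> List[Tuple[float, float, str]]:
    """Return (start, end, visible_text) segments, one per revealed character.

    Characters appear at a constant rate during the typing phase; the final
    segment holds the full text until the overlay ends.
    """
    if not text or duration <= 0:
        return []
    end = start + duration
    step = (duration * type_fraction) / len(text)
    events: List[Tuple[float, float, str]] = []
    for count in range(1, len(text) + 1):
        seg_start = start + (count - 1) * step
        seg_end = end if count == len(text) else start + count * step
        events.append((seg_start, seg_end, text[:count]))
    return events
```

Each segment would then map to one rendered text clip with the given start time and length, so a long phrase produces as many clips as it has characters.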
Task 8: CLI Interface
Create command-line interface using Click.
Files to create:
- src/audio_video_generator/cli.py - Main CLI with commands and options
- src/audio_video_generator/__main__.py - Entry point for python -m
Key specs:
- Command: avg generate or just avg
- Options for all major settings: audio, csv, images, resolution, output, animations, transitions
- Progress reporting
- Checkpoint saving
- Proper error handling with exit codes
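A possible skeleton for the generate command using Click is sketched below. The option flag names, defaults, and help text are assumptions; the command body is a placeholder that only echoes its settings where the real tool would call the pipeline. The sketch does not implement the bare avg form, which would need a default-command arrangement on the group.

```python
import click

ANIMATIONS = ["none", "zoom_in", "zoom_out", "fade_in", "blink", "pulse", "fade_zoom_in"]
TRANSITIONS = ["none", "fade", "crossfade", "slide_left", "slide_right", "dip_to_black", "flash"]


@click.group()
def cli() -> None:
    """Audio-to-image video generator."""


@cli.command()
@click.option("--audio", required=True, type=click.Path(dir_okay=False), help="Input audio file.")
@click.option("--csv", "csv_path", required=True, type=click.Path(dir_okay=False), help="Text-to-image CSV.")
@click.option("--images", required=True, type=click.Path(), help="Image folder or ZIP archive.")
@click.option("--resolution", default="1080p", show_default=True, help="Output resolution preset.")
@click.option("--output", default="output.mp4", show_default=True, type=click.Path(dir_okay=False))
@click.option("--animation", default="none", show_default=True, type=click.Choice(ANIMATIONS))
@click.option("--transition", default="none", show_default=True, type=click.Choice(TRANSITIONS))
def generate(audio, csv_path, images, resolution, output, animation, transition):
    """Render a video from an audio file, a CSV mapping, and images."""
    click.echo(f"audio={audio} csv={csv_path} images={images}")
    click.echo(f"resolution={resolution} animation={animation} transition={transition}")
    # Placeholder: the real command would run the pipeline here and raise
    # click.ClickException on failure, which Click reports with exit code 1.
    click.echo(f"would write {output}")
```

Click already exits with code 2 on usage errors such as a missing required option, so the command only needs to translate pipeline failures into click.ClickException to get consistent non-zero exit codes.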
Task 9: Gradio Web UI
Extract and clean up the Gradio interface.
Files to create:
- src/audio_video_generator/web/gradio_ui.py - Full Gradio UI implementation
- src/audio_video_generator/web/__init__.py
Key specs:
- Command: avg web to launch UI
- Include all features from notebook: file uploads, path inputs, live preview, text overlay editor
- Drive integration optional (Colab-specific code made conditional)
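Making the Colab-specific code conditional could look like the sketch below, which checks for the google.colab module before attempting anything Drive-related. The helper names in_colab() and mount_drive_if_available() are hypothetical; the mount call assumes Colab's google.colab.drive.mount(), and the /content/drive default applies only inside Colab.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def in_colab() -> bool:
    """Return True when the google.colab package is importable."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent "google" package is missing or malformed.
        return False


def mount_drive_if_available(mount_point: str = "/content/drive") -> Optional[Path]:
    """Mount Google Drive inside Colab; do nothing and return None elsewhere."""
    if not in_colab():
        return None
    from google.colab import drive  # only importable inside Colab

    drive.mount(mount_point)
    return Path(mount_point)
```

The UI can then show Drive controls only when in_colab() is true, so a local launch never touches Colab paths.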
Task 10: Main Pipeline Integration
Create the main orchestration pipeline.
Files to create:
- src/audio_video_generator/core/pipeline.py - create_video_pipeline() function that orchestrates all components
Key specs:
- Error handling with cleanup
- Progress callbacks
- Memory management (gc.collect, CUDA cache clear)
- Report generation
- Checkpoint saving
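The error handling, progress, and memory management specs can be combined in a small step runner, sketched below. This is not create_video_pipeline() itself: run_steps(), release_memory(), and the convention of registering closable resources under state["closeables"] are hypothetical. The CUDA call assumes torch is installed and is skipped otherwise.

```python
import gc
from typing import Any, Callable, Dict, Optional, Sequence, Tuple

ProgressFn = Callable[[float, str], None]
Step = Tuple[str, Callable[[Dict[str, Any]], None]]


def release_memory() -> None:
    """Run the garbage collector and, if torch with CUDA is present, free its cache."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_steps(steps: Sequence[Step], progress: Optional[ProgressFn] = None) -> Dict[str, Any]:
    """Run labeled steps in order, reporting progress and always cleaning up.

    Steps share a `state` dict and append anything with a close() method
    (for example video clips) to state["closeables"].
    """
    state: Dict[str, Any] = {"closeables": []}
    total = len(steps)
    try:
        for index, (label, step) in enumerate(steps):
            if progress is not None:
                progress(index / total, label)
            step(state)
        if progress is not None:
            progress(1.0, "done")
        return state
    finally:
        # Close in reverse order of creation, even when a step raised.
        for resource in reversed(state["closeables"]):
            try:
                resource.close()
            except Exception:
                pass  # cleanup should not mask the original error
        release_memory()
```

Because cleanup lives in the finally block, a failure in any step still closes open clips and frees GPU memory before the exception reaches the CLI or the UI.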
Task 11: Documentation
Create README and usage documentation.
Files to create:
README.md- Installation, usage examples, CSV format, CLI reference
Key specs:
- Installation instructions
- CSV format specification
- CLI examples
- Web UI usage
Execution Notes
- All code must work outside Colab (no hardcoded /content paths)
- Drive integration should be optional/conditional
- Keep checkpoint functionality for debugging
- Preserve all animation and transition options
- Ensure proper resource cleanup (moviepy clips, torch CUDA)