# Audio-to-Image Video Generator - Conversion Plan

## Overview

Convert the Jupyter notebook into a proper Python CLI tool with optional Gradio UI.

## Architecture

```
audio_video_generator/
├── pyproject.toml
├── requirements.txt
├── README.md
├── src/
│   └── audio_video_generator/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── config.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── audio.py         # Audio loading, Whisper transcription
│       │   ├── alignment.py     # CSV-to-audio alignment logic
│       │   ├── images.py        # Image processing, ZIP handling
│       │   ├── pipeline.py      # Pipeline orchestration (Task 10)
│       │   ├── video.py         # Video composition, animations
│       │   └── text_overlay.py  # Text overlay rendering
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── files.py         # File utilities, checkpoints
│       │   └── text.py          # Text normalization, tokenization
│       └── web/
│           ├── __init__.py
│           └── gradio_ui.py     # Gradio interface
└── tests/
    └── ...
```

## Tasks

### Task 1: Project Structure and Packaging

Create the package structure with pyproject.toml, requirements.txt, and basic module setup.

**Files to create:**

- `pyproject.toml` - Package metadata, dependencies, entry points
- `requirements.txt` - Runtime dependencies
- `src/audio_video_generator/__init__.py` - Version info
- `src/audio_video_generator/config.py` - Configuration constants (RESOLUTION_MAP, ANIMATION_OPTIONS, etc.)

**Key specs:**

- Package name: `audio-video-generator`
- CLI entry point: `avg` command
- Version: 1.0.0
- Include all dependencies: whisper (published on PyPI as `openai-whisper`), moviepy, gradio, torch, pillow, pandas, numpy, and click (used by the CLI in Task 8)

### Task 2: Utility Modules

Extract utility functions from the notebook into reusable modules.

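As an illustration of what the extraction could produce, here is a minimal sketch of three helpers named in this task's file list (`clamp01()`, `safe_int()`, `normalize_text()`). Only the names come from the plan. The bodies, the `default` parameter, and the exact normalization rules are assumptions, since the notebook's code is not reproduced here.

```python
import re
import unicodedata


def clamp01(value: float) -> float:
    """Clamp a number to the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, float(value)))


def safe_int(value: object, default: int = 0) -> int:
    """Convert value to int, returning default when conversion fails."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def normalize_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace.

    Assumed behavior: the notebook's real normalization may differ.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```

Each function depends only on its arguments, which keeps it easy to unit-test and safe to call from both the CLI and the web UI.
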
**Files to create:**

- `src/audio_video_generator/utils/text.py` - `normalize_text()`, `tokenize_text()`, `extract_number()`, `get_fuzzy_threshold()`, `clamp01()`, `safe_int()`, `apply_case_style()`
- `src/audio_video_generator/utils/files.py` - `ensure_dir()`, `make_run_dir()`, `safe_output_name()`, `write_json_file()`, `write_text_file()`, `extract_zip()`, `collect_images_recursive()`, `sort_images()`

**Key specs:**

- All functions must be pure (no global state)
- Add type hints
- Add docstrings

### Task 3: Audio and Transcription Module

Extract Whisper-related functionality.

**Files to create:**

- `src/audio_video_generator/core/audio.py` - `transcribe_with_words()`, `extract_word_timeline()`, `get_whisper_model()`, `get_device()`

**Key specs:**

- Use singleton pattern for Whisper model (lazy loading)
- Support CPU and CUDA
- Handle model caching properly

### Task 4: Image Processing Module

Extract image handling functionality.

**Files to create:**

- `src/audio_video_generator/core/images.py` - `prepare_image_inputs()`, `verify_and_filter_images()`, `build_image_indexes()`, `image_preflight_report()`, `resolve_image_reference()`, `resize_with_padding()`

**Key specs:**

- Support both ZIP and manual image inputs
- Image caching for performance
- Proper error handling for corrupt images

### Task 5: CSV Alignment Module

Extract CSV loading and alignment logic.

**Files to create:**

- `src/audio_video_generator/core/alignment.py` - `load_csv()`, `preprocess_csv()`, `mapping_preflight_report()`, `find_sentence_match()`, `build_timeline()`

**Key specs:**

- CSV must have exactly 2 columns: text, image
- Fuzzy matching with configurable thresholds
- Duplicate row handling

### Task 6: Video and Animation Module

Extract video composition and animation logic.

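One way to keep the animation logic testable without rendering video is to compute each effect as a pure function of time. The sketch below is an assumption, not the notebook's code: `animation_scale()`, its `amount` parameter, and the formulas are hypothetical, and only the animation names are taken from the list this plan gives for the module.

```python
import math


def animation_scale(effect: str, t: float, duration: float, amount: float = 0.1) -> float:
    """Return the image scale factor at time t (seconds) for one clip.

    Hypothetical helper: a scale of 1.0 means the image is shown at its base size.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Fraction of the clip that has elapsed, limited to [0, 1].
    progress = min(max(t / duration, 0.0), 1.0)
    if effect in ("none", "fade_in", "blink"):
        # Assumed to change opacity only, so the size stays constant.
        return 1.0
    if effect in ("zoom_in", "fade_zoom_in"):
        return 1.0 + amount * progress
    if effect == "zoom_out":
        return 1.0 + amount * (1.0 - progress)
    if effect == "pulse":
        # Swell to the peak at mid-clip, then return to base size.
        return 1.0 + amount * math.sin(math.pi * progress)
    raise ValueError(f"unknown animation: {effect!r}")
```

The returned factor would then be handed to whichever time-dependent resize call the installed moviepy version provides, so the moviepy-specific code stays in one place.
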
**Files to create:**

- `src/audio_video_generator/core/video.py` - `resolve_effect_sequence()`, `get_transition_duration()`, `apply_animation_to_clip()`, `apply_slide_position()`, `build_transition_overlay()`, `build_render_clips()`

**Key specs:**

- Support all animation types: none, zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
- Support all transitions: none, fade, crossfade, slide_left, slide_right, dip_to_black, flash
- Use moviepy for video processing

### Task 7: Text Overlay Module

Extract text overlay functionality.

**Files to create:**

- `src/audio_video_generator/core/text_overlay.py` - `load_overlay_txt()`, `find_all_phrase_matches()`, `build_text_overlay_events()`, `render_text_rgba()`, `resolve_overlay_position()`, `build_text_overlay_clips()`, `get_available_font_map()`, `get_pil_font()`, `make_text_style_config()`

**Key specs:**

- Support font selection, colors, patterns (solid, boxed, highlighted)
- Support text animations: fade_in, pop_in, pulse, slide_up, glow_pop, typewriter, zoom_in
- Typewriter effect builds character-by-character clips

### Task 8: CLI Interface

Create command-line interface using Click.

**Files to create:**

- `src/audio_video_generator/cli.py` - Main CLI with commands and options
- `src/audio_video_generator/__main__.py` - Entry point for `python -m`

**Key specs:**

- Command: `avg generate` or just `avg`
- Options for all major settings: audio, csv, images, resolution, output, animations, transitions
- Progress reporting
- Checkpoint saving
- Proper error handling with exit codes

### Task 9: Gradio Web UI

Extract and clean up the Gradio interface.

**Files to create:**

- `src/audio_video_generator/web/gradio_ui.py` - Full Gradio UI implementation
- `src/audio_video_generator/web/__init__.py`

**Key specs:**

- Command: `avg web` to launch UI
- Include all features from notebook: file uploads, path inputs, live preview, text overlay editor
- Drive integration optional (Colab-specific code made conditional)

### Task 10: Main Pipeline Integration

Create the main orchestration pipeline.

**Files to create:**

- `src/audio_video_generator/core/pipeline.py` - `create_video_pipeline()` function that orchestrates all components

**Key specs:**

- Error handling with cleanup
- Progress callbacks
- Memory management (gc.collect, CUDA cache clear)
- Report generation
- Checkpoint saving

### Task 11: Documentation

Create README and usage documentation.

**Files to create:**

- `README.md` - Installation, usage examples, CSV format, CLI reference

**Key specs:**

- Installation instructions
- CSV format specification
- CLI examples
- Web UI usage

## Execution Notes

- All code must work outside Colab (no hardcoded `/content` paths)
- Drive integration should be optional/conditional
- Keep checkpoint functionality for debugging
- Preserve all animation and transition options
- Ensure proper resource cleanup (moviepy clips, torch CUDA)
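
The resource-cleanup note can be sketched as a single best-effort helper that the pipeline calls from a `finally` block. This is an assumption about how cleanup might be organized: `release_resources()` is a hypothetical name, and the clips are only assumed to expose a `close()` method, as moviepy clips do. `torch` is imported lazily so the helper also works where it is not installed.

```python
import gc
from typing import Any, Iterable


def release_resources(clips: Iterable[Any]) -> int:
    """Close clip-like objects, then free Python and CUDA memory.

    Returns the number of clips that were closed successfully.
    """
    closed = 0
    for clip in clips:
        close = getattr(clip, "close", None)
        if callable(close):
            try:
                close()
                closed += 1
            except Exception:
                # Best effort: one failing clip should not block the rest.
                pass
    gc.collect()
    try:
        import torch  # optional dependency at cleanup time
    except ImportError:
        return closed
    if torch.cuda.is_available():
        # Return cached GPU memory held by the allocator to the driver.
        torch.cuda.empty_cache()
    return closed
```

Calling it as `release_resources(all_clips)` in the `finally` clause of `create_video_pipeline()` would cover both the success path and the error path with the same code.
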