| # Audio-to-Image Video Generator - Conversion Plan | |
| ## Overview | |
| Convert the Jupyter notebook into a Python CLI tool with an optional Gradio UI. | |
| ## Architecture | |
| ``` | |
| audio_video_generator/ | |
| ├── pyproject.toml | |
| ├── requirements.txt | |
| ├── README.md | |
| ├── src/ | |
| │   └── audio_video_generator/ | |
| │       ├── __init__.py | |
| │       ├── __main__.py | |
| │       ├── cli.py | |
| │       ├── config.py | |
| │       ├── core/ | |
| │       │   ├── __init__.py | |
| │       │   ├── audio.py # Audio loading, Whisper transcription | |
| │       │   ├── alignment.py # CSV-to-audio alignment logic | |
| │       │   ├── images.py # Image processing, ZIP handling | |
| │       │   ├── video.py # Video composition, animations | |
| │       │   ├── text_overlay.py # Text overlay rendering | |
| │       │   └── pipeline.py # Main orchestration pipeline (Task 10) | |
| │       ├── utils/ | |
| │       │   ├── __init__.py | |
| │       │   ├── files.py # File utilities, checkpoints | |
| │       │   └── text.py # Text normalization, tokenization | |
| │       └── web/ | |
| │           ├── __init__.py | |
| │           └── gradio_ui.py # Gradio interface | |
| └── tests/ | |
|     └── ... | |
| ``` | |
| ## Tasks | |
| ### Task 1: Project Structure and Packaging | |
| Create the package structure with pyproject.toml, requirements.txt, and basic module setup. | |
| **Files to create:** | |
| - `pyproject.toml` - Package metadata, dependencies, entry points | |
| - `requirements.txt` - Runtime dependencies | |
| - `src/audio_video_generator/__init__.py` - Version info | |
| - `src/audio_video_generator/config.py` - Configuration constants (RESOLUTION_MAP, ANIMATION_OPTIONS, etc.) | |
| **Key specs:** | |
| - Package name: `audio-video-generator` | |
| - CLI entry point: `avg` command | |
| - Version: 1.0.0 | |
| - Include all dependencies: openai-whisper (imported as `whisper`), moviepy, gradio, torch, pillow, pandas, numpy, click | |
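
A minimal sketch of what `config.py` could hold. The animation, transition, and text-animation names are taken from Tasks 6 and 7 of this plan. The `RESOLUTION_MAP` keys and pixel sizes, the names `TRANSITION_OPTIONS` and `TEXT_ANIMATION_OPTIONS`, and the helper `resolve_resolution()` are illustrative assumptions, since the notebook's actual values are not shown here.

```python
"""Configuration constants (sketch, not the notebook's actual values)."""
from typing import Dict, Tuple

# Assumed labels and sizes; replace with the values used in the notebook.
RESOLUTION_MAP: Dict[str, Tuple[int, int]] = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}

# Option names as listed in Tasks 6 and 7 of this plan.
ANIMATION_OPTIONS = (
    "none", "zoom_in", "zoom_out", "fade_in", "blink", "pulse", "fade_zoom_in",
)
TRANSITION_OPTIONS = (
    "none", "fade", "crossfade", "slide_left", "slide_right", "dip_to_black", "flash",
)
TEXT_ANIMATION_OPTIONS = (
    "fade_in", "pop_in", "pulse", "slide_up", "glow_pop", "typewriter", "zoom_in",
)


def resolve_resolution(label: str) -> Tuple[int, int]:
    """Return (width, height) for a resolution label (hypothetical helper)."""
    try:
        return RESOLUTION_MAP[label]
    except KeyError:
        known = ", ".join(sorted(RESOLUTION_MAP))
        raise ValueError(f"Unknown resolution {label!r}; expected one of: {known}") from None
```

Keeping the option tuples in one module lets both the CLI and the Gradio UI build their choices from the same source.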
| ### Task 2: Utility Modules | |
| Extract utility functions from the notebook into reusable modules. | |
| **Files to create:** | |
| - `src/audio_video_generator/utils/text.py` - `normalize_text()`, `tokenize_text()`, `extract_number()`, `get_fuzzy_threshold()`, `clamp01()`, `safe_int()`, `apply_case_style()` | |
| - `src/audio_video_generator/utils/files.py` - `ensure_dir()`, `make_run_dir()`, `safe_output_name()`, `write_json_file()`, `write_text_file()`, `extract_zip()`, `collect_images_recursive()`, `sort_images()` | |
| **Key specs:** | |
| - No function may depend on global state; text helpers must be pure, and file helpers may only touch the paths passed to them | |
| - Add type hints | |
| - Add docstrings | |
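
The following is a sketch of how a few of the `utils/text.py` helpers might look with type hints and docstrings. The function names come from the list above; their exact behavior in the notebook is not shown here, so the normalization rules and default values below are assumptions.

```python
import re
import unicodedata
from typing import List, Optional


def normalize_text(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", without_accents.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def tokenize_text(text: str) -> List[str]:
    """Split normalized text into word tokens."""
    normalized = normalize_text(text)
    return normalized.split() if normalized else []


def clamp01(value: float) -> float:
    """Clamp a number into the closed range [0.0, 1.0]."""
    return max(0.0, min(1.0, float(value)))


def safe_int(value: object, default: int = 0) -> int:
    """Convert a value to int, returning `default` when conversion fails."""
    try:
        return int(float(str(value).strip()))
    except (TypeError, ValueError, OverflowError):
        return default


def extract_number(text: str) -> Optional[int]:
    """Return the first run of digits in `text` as an int, or None."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

Because none of these read or write module-level state, each can be unit-tested in isolation.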
| ### Task 3: Audio and Transcription Module | |
| Extract Whisper-related functionality. | |
| **Files to create:** | |
| - `src/audio_video_generator/core/audio.py` - `transcribe_with_words()`, `extract_word_timeline()`, `get_whisper_model()`, `get_device()` | |
| **Key specs:** | |
| - Use singleton pattern for Whisper model (lazy loading) | |
| - Support CPU and CUDA | |
| - Cache the loaded model so repeated transcriptions reuse it instead of reloading the weights | |
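
One way to sketch the lazy singleton, assuming the `openai-whisper` package (imported as `whisper`) and its `load_model()` and `transcribe(..., word_timestamps=True)` calls. The default model name `"base"` and the shape of the returned timeline entries are assumptions, not the notebook's confirmed behavior.

```python
from functools import lru_cache
from typing import List, Optional


def get_device() -> str:
    """Return "cuda" when torch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


@lru_cache(maxsize=1)
def get_whisper_model(name: str = "base", device: Optional[str] = None):
    """Load the Whisper model on first use and reuse it afterwards."""
    import whisper  # deferred so importing this module stays cheap

    return whisper.load_model(name, device=device or get_device())


def extract_word_timeline(result: dict) -> List[dict]:
    """Flatten a Whisper result into a list of {"word", "start", "end"} dicts."""
    timeline = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            timeline.append(
                {
                    "word": word["word"].strip(),
                    "start": float(word["start"]),
                    "end": float(word["end"]),
                }
            )
    return timeline


def transcribe_with_words(audio_path: str, model_name: str = "base") -> List[dict]:
    """Transcribe an audio file and return its word-level timeline."""
    model = get_whisper_model(model_name)
    result = model.transcribe(audio_path, word_timestamps=True)
    return extract_word_timeline(result)
```

With `maxsize=1`, asking for a different model name replaces the cached model, so only one set of weights stays in memory at a time.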
| ### Task 4: Image Processing Module | |
| Extract image handling functionality. | |
| **Files to create:** | |
| - `src/audio_video_generator/core/images.py` - `prepare_image_inputs()`, `verify_and_filter_images()`, `build_image_indexes()`, `image_preflight_report()`, `resolve_image_reference()`, `resize_with_padding()` | |
| **Key specs:** | |
| - Support both ZIP and manual image inputs | |
| - Image caching for performance | |
| - Skip corrupt or unreadable images and report them instead of aborting the run | |
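
The geometry behind `resize_with_padding()` can be sketched without any imaging library. The helper name `fit_with_padding()` and its return shape are hypothetical; it computes the scaled size and the offsets needed to center an image on a canvas while preserving aspect ratio.

```python
from typing import Tuple


def fit_with_padding(
    src_size: Tuple[int, int], dst_size: Tuple[int, int]
) -> Tuple[int, int, int, int]:
    """Return (new_w, new_h, offset_x, offset_y) for a letterboxed fit.

    The source is scaled by a single factor so it fits entirely inside the
    destination, then centered; the remaining area is padding.
    """
    src_w, src_h = src_size
    dst_w, dst_h = dst_size
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("All dimensions must be positive")

    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = max(1, min(dst_w, round(src_w * scale)))
    new_h = max(1, min(dst_h, round(src_h * scale)))
    return new_w, new_h, (dst_w - new_w) // 2, (dst_h - new_h) // 2
```

With Pillow, the result would typically feed `image.resize((new_w, new_h))` followed by `canvas.paste(resized, (offset_x, offset_y))` on a canvas of the target size.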
| ### Task 5: CSV Alignment Module | |
| Extract CSV loading and alignment logic. | |
| **Files to create:** | |
| - `src/audio_video_generator/core/alignment.py` - `load_csv()`, `preprocess_csv()`, `mapping_preflight_report()`, `find_sentence_match()`, `build_timeline()` | |
| **Key specs:** | |
| - CSV must have exactly 2 columns: text, image | |
| - Fuzzy matching with configurable thresholds | |
| - Duplicate row handling | |
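
Fuzzy matching with a configurable threshold could look roughly like the sketch below. It uses the standard library's `difflib.SequenceMatcher` as a stand-in; the notebook may use a different similarity measure, and the threshold values, the signatures, and the `(start, end, score)` return shape are assumptions.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple


def get_fuzzy_threshold(token_count: int) -> float:
    """Shorter sentences need a stricter match (assumed cut-offs)."""
    if token_count <= 3:
        return 0.90
    if token_count <= 8:
        return 0.80
    return 0.70


def find_sentence_match(
    sentence_tokens: Sequence[str],
    spoken_tokens: Sequence[str],
    start: int = 0,
) -> Optional[Tuple[int, int, float]]:
    """Find the best window of spoken tokens matching the sentence.

    Returns (start_index, end_index_exclusive, score), or None when no
    window reaches the threshold for this sentence length.
    """
    size = len(sentence_tokens)
    if size == 0 or size > len(spoken_tokens) - start:
        return None

    target = " ".join(sentence_tokens)
    best: Optional[Tuple[int, int, float]] = None
    for index in range(start, len(spoken_tokens) - size + 1):
        window = " ".join(spoken_tokens[index:index + size])
        score = SequenceMatcher(None, target, window).ratio()
        if best is None or score > best[2]:
            best = (index, index + size, score)

    if best is not None and best[2] >= get_fuzzy_threshold(size):
        return best
    return None
```

Passing the previous match's end index as `start` keeps later CSV rows from matching earlier audio, which is one possible way to handle duplicate rows in order.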
| ### Task 6: Video and Animation Module | |
| Extract video composition and animation logic. | |
| **Files to create:** | |
| - `src/audio_video_generator/core/video.py` - `resolve_effect_sequence()`, `get_transition_duration()`, `apply_animation_to_clip()`, `apply_slide_position()`, `build_transition_overlay()`, `build_render_clips()` | |
| **Key specs:** | |
| - Support all animation types: none, zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in | |
| - Support all transitions: none, fade, crossfade, slide_left, slide_right, dip_to_black, flash | |
| - Use moviepy for video processing | |
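
Several of the animations reduce to a scale or opacity value that changes over time. The sketch below shows that idea as plain functions of `t`; the helper names, the 10% zoom amount, and the 0.5 s fade length are assumptions, and `blink` is left out because its intended timing is not described here.

```python
import math


def _progress(t: float, duration: float) -> float:
    """Fraction of the clip elapsed, clamped to [0, 1]."""
    if duration <= 0:
        return 1.0
    return min(max(t / duration, 0.0), 1.0)


def animation_scale(name: str, t: float, duration: float, amount: float = 0.1) -> float:
    """Scale factor at time `t` for the zoom-style animations."""
    progress = _progress(t, duration)
    if name in ("zoom_in", "fade_zoom_in"):
        return 1.0 + amount * progress
    if name == "zoom_out":
        return 1.0 + amount * (1.0 - progress)
    if name == "pulse":
        # One gentle pulse per second around the original size.
        return 1.0 + 0.5 * amount * math.sin(2.0 * math.pi * t)
    return 1.0


def animation_opacity(name: str, t: float, fade_seconds: float = 0.5) -> float:
    """Opacity at time `t` for the fade-style animations."""
    if name in ("fade_in", "fade_zoom_in"):
        return _progress(t, fade_seconds)
    return 1.0
```

moviepy's resize effect accepts a function of time as the scale factor, so a function like `lambda t: animation_scale("zoom_in", t, clip.duration)` could be passed to it; the method name differs between moviepy 1.x and 2.x, so check the installed version.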
| ### Task 7: Text Overlay Module | |
| Extract text overlay functionality. | |
| **Files to create:** | |
| - `src/audio_video_generator/core/text_overlay.py` - `load_overlay_txt()`, `find_all_phrase_matches()`, `build_text_overlay_events()`, `render_text_rgba()`, `resolve_overlay_position()`, `build_text_overlay_clips()`, `get_available_font_map()`, `get_pil_font()`, `make_text_style_config()` | |
| **Key specs:** | |
| - Support font selection, colors, patterns (solid, boxed, highlighted) | |
| - Support text animations: fade_in, pop_in, pulse, slide_up, glow_pop, typewriter, zoom_in | |
| - Typewriter effect builds character-by-character clips | |
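
The character-by-character timing for the typewriter effect can be sketched as a list of partial strings with start and end times, one per clip. The helper name `typewriter_segments()` and the even spacing of characters are assumptions.

```python
from typing import List, Tuple


def typewriter_segments(text: str, start: float, end: float) -> List[Tuple[str, float, float]]:
    """Split [start, end] evenly so each segment reveals one more character.

    Each tuple is (visible_text, segment_start, segment_end); rendering one
    text clip per tuple yields the typewriter effect.
    """
    if not text or end <= start:
        return []

    step = (end - start) / len(text)
    segments = []
    for count in range(1, len(text) + 1):
        segment_start = start + (count - 1) * step
        # Pin the last segment to `end` to avoid floating-point drift.
        segment_end = end if count == len(text) else start + count * step
        segments.append((text[:count], segment_start, segment_end))
    return segments
```

Each tuple would then be rendered with `render_text_rgba()` and placed on the timeline for its interval.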
| ### Task 8: CLI Interface | |
| Create command-line interface using Click. | |
| **Files to create:** | |
| - `src/audio_video_generator/cli.py` - Main CLI with commands and options | |
| - `src/audio_video_generator/__main__.py` - Entry point for `python -m` | |
| **Key specs:** | |
| - Command: `avg generate` or just `avg` | |
| - Options for all major settings: audio, csv, images, resolution, output, animations, transitions | |
| - Progress reporting | |
| - Checkpoint saving | |
| - Proper error handling with exit codes | |
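
A sketch of how the `avg generate` / bare `avg` behavior might be wired up with Click. The group subclass `DefaultToGenerate`, the option names, the `output.mp4` default, and the `run_job()` placeholder are all assumptions for illustration; only a subset of the planned options is shown.

```python
import click

ANIMATIONS = ["none", "zoom_in", "zoom_out", "fade_in", "blink", "pulse", "fade_zoom_in"]
TRANSITIONS = ["none", "fade", "crossfade", "slide_left", "slide_right", "dip_to_black", "flash"]


class DefaultToGenerate(click.Group):
    """Group that treats `avg <options>` as `avg generate <options>`."""

    def parse_args(self, ctx, args):
        # Prepend the default subcommand unless one was named or help was requested.
        if not args or (args[0] not in self.commands and args[0] != "--help"):
            args = ["generate", *args]
        return super().parse_args(ctx, args)


def run_job(**settings) -> str:
    """Placeholder for the real pipeline call; returns the output path."""
    return settings["output"]


@click.group(cls=DefaultToGenerate)
def cli():
    """Audio-to-image video generator."""


@cli.command()
@click.option("--audio", required=True, type=click.Path(), help="Input audio file.")
@click.option("--csv", "csv_path", required=True, type=click.Path(), help="Text-to-image CSV.")
@click.option("--images", required=True, type=click.Path(), help="Image folder or ZIP.")
@click.option("--output", default="output.mp4", show_default=True, type=click.Path())
@click.option("--animation", default="none", type=click.Choice(ANIMATIONS))
@click.option("--transition", default="none", type=click.Choice(TRANSITIONS))
def generate(audio, csv_path, images, output, animation, transition):
    """Render a video from audio, a CSV mapping, and images."""
    try:
        result = run_job(
            audio=audio, csv=csv_path, images=images, output=output,
            animation=animation, transition=transition,
        )
    except Exception as exc:
        # ClickException prints the message and exits with code 1.
        raise click.ClickException(str(exc)) from exc
    click.echo(f"Wrote {result}")
```

Click itself exits with code 2 for usage errors such as a missing required option, so pipeline failures (code 1) and bad invocations stay distinguishable. Pointing the `avg` entry point in `pyproject.toml` at `cli` would expose this as the `avg` command.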
| ### Task 9: Gradio Web UI | |
| Extract and clean up the Gradio interface. | |
| **Files to create:** | |
| - `src/audio_video_generator/web/gradio_ui.py` - Full Gradio UI implementation | |
| - `src/audio_video_generator/web/__init__.py` | |
| **Key specs:** | |
| - Command: `avg web` to launch UI | |
| - Include all features from the notebook: file uploads, path inputs, live preview, text overlay editor | |
| - Google Drive integration is optional; Colab-specific code runs only when Colab is detected | |
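
Making the Colab-specific code conditional could be as simple as the sketch below. It assumes the `google.colab` package is only importable inside Colab and uses its `drive.mount()` call; the helper names `in_colab()` and `mount_drive_if_available()` are hypothetical, and the mount point is passed in by the caller rather than hardcoded.

```python
from typing import Optional


def in_colab() -> bool:
    """Return True only when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401
    except ImportError:
        return False
    return True


def mount_drive_if_available(mount_point: str) -> Optional[str]:
    """Mount Google Drive in Colab; do nothing and return None elsewhere."""
    if not in_colab():
        return None

    from google.colab import drive

    drive.mount(mount_point)
    return mount_point
```

The Gradio UI can then show Drive-related inputs only when `in_colab()` is true.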
| ### Task 10: Main Pipeline Integration | |
| Create the main orchestration pipeline. | |
| **Files to create:** | |
| - `src/audio_video_generator/core/pipeline.py` - `create_video_pipeline()` function that orchestrates all components | |
| **Key specs:** | |
| - Error handling with cleanup | |
| - Progress callbacks | |
| - Memory management (`gc.collect()`, `torch.cuda.empty_cache()`) | |
| - Report generation | |
| - Checkpoint saving | |
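
The orchestration pattern (progress callbacks, cleanup on error, memory release) might be sketched as follows. This is not the planned `create_video_pipeline()` itself; `run_steps()`, `free_memory()`, and the shared `state` dict are hypothetical names used to show the control flow.

```python
import gc
from typing import Callable, Dict, Optional, Sequence, Tuple

Step = Tuple[str, Callable[[Dict], None]]


def free_memory() -> None:
    """Run the garbage collector and clear the CUDA cache when torch is present."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_steps(
    steps: Sequence[Step],
    progress: Optional[Callable[[float, str], None]] = None,
    cleanup: Optional[Callable[[Dict], None]] = None,
) -> Dict:
    """Run named steps in order, sharing a state dict between them.

    `progress` receives (fraction_done, step_name) after each step.
    `cleanup` always runs, including when a step raises.
    """
    state: Dict = {}
    total = len(steps)
    try:
        for index, (name, step) in enumerate(steps, start=1):
            step(state)
            if progress is not None:
                progress(index / total, name)
    finally:
        if cleanup is not None:
            cleanup(state)
        free_memory()
    return state
```

The same `progress` callable could be backed by a Click progress bar in the CLI or by Gradio's progress tracker in the web UI.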
| ### Task 11: Documentation | |
| Create README and usage documentation. | |
| **Files to create:** | |
| - `README.md` - Installation, usage examples, CSV format, CLI reference | |
| **Key specs:** | |
| - Installation instructions | |
| - CSV format specification | |
| - CLI examples | |
| - Web UI usage | |
| ## Execution Notes | |
| - All code must work outside Colab (no hardcoded `/content` paths) | |
| - Drive integration should be optional/conditional | |
| - Keep checkpoint functionality for debugging | |
| - Preserve all animation and transition options | |
| - Ensure proper resource cleanup (close moviepy clips, clear the torch CUDA cache) | |
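
To keep checkpoints while avoiding hardcoded `/content` paths, the file helpers can take every location as an argument. The sketch below shows one possible shape for `ensure_dir()`, `make_run_dir()`, and `write_json_file()`; the timestamped directory naming and the `prefix` parameter are assumptions.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Union

PathLike = Union[str, Path]


def ensure_dir(path: PathLike) -> Path:
    """Create the directory (and parents) if needed and return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def make_run_dir(base_dir: PathLike, prefix: str = "run") -> Path:
    """Create a timestamped run directory under a caller-supplied base."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return ensure_dir(Path(base_dir) / f"{prefix}_{stamp}")


def write_json_file(path: PathLike, data: Any) -> Path:
    """Write `data` as UTF-8 JSON, creating parent directories as needed."""
    target = Path(path)
    ensure_dir(target.parent)
    target.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return target
```

A checkpoint could then be saved with a call such as `write_json_file(run_dir / "checkpoints" / "timeline.json", timeline)`, where the file name is only an example.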