# Audio-to-Image Video Generator - Conversion Plan
## Overview
Convert the Jupyter notebook into a proper Python CLI tool with optional Gradio UI.
## Architecture
```
audio_video_generator/
├── pyproject.toml
├── requirements.txt
├── README.md
├── src/
│   └── audio_video_generator/
│       ├── __init__.py
│       ├── __main__.py
│       ├── cli.py
│       ├── config.py
│       ├── core/
│       │   ├── __init__.py
│       │   ├── audio.py          # Audio loading, Whisper transcription
│       │   ├── alignment.py      # CSV-to-audio alignment logic
│       │   ├── images.py         # Image processing, ZIP handling
│       │   ├── video.py          # Video composition, animations
│       │   ├── text_overlay.py   # Text overlay rendering
│       │   └── pipeline.py       # Pipeline orchestration (Task 10)
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── files.py          # File utilities, checkpoints
│       │   └── text.py           # Text normalization, tokenization
│       └── web/
│           ├── __init__.py
│           └── gradio_ui.py      # Gradio interface
└── tests/
    └── ...
```
## Tasks
### Task 1: Project Structure and Packaging
Create the package structure with pyproject.toml, requirements.txt, and basic module setup.
**Files to create:**
- `pyproject.toml` - Package metadata, dependencies, entry points
- `requirements.txt` - Runtime dependencies
- `src/audio_video_generator/__init__.py` - Version info
- `src/audio_video_generator/config.py` - Configuration constants (RESOLUTION_MAP, ANIMATION_OPTIONS, etc.)
**Key specs:**
- Package name: `audio-video-generator`
- CLI entry point: `avg` command
- Version: 1.0.0
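- Include all dependencies: openai-whisper (the PyPI distribution that provides the `whisper` module), moviepy, gradio, torch, pillow, pandas, numpy, and click (needed by the Task 8 CLI)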
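
As a rough illustration of what `config.py` might hold, the sketch below defines the constants named above. The resolution presets and the `TRANSITION_OPTIONS` and `resolve_resolution` names are assumptions made for the example. The animation and transition values are copied from Task 6.

```python
from typing import Final

# Assumed presets: the plan names RESOLUTION_MAP but does not list its contents.
RESOLUTION_MAP: Final[dict[str, tuple[int, int]]] = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
}

# Values taken from the animation list in Task 6.
ANIMATION_OPTIONS: Final[tuple[str, ...]] = (
    "none", "zoom_in", "zoom_out", "fade_in", "blink", "pulse", "fade_zoom_in",
)

# Values taken from the transition list in Task 6; the constant name is assumed.
TRANSITION_OPTIONS: Final[tuple[str, ...]] = (
    "none", "fade", "crossfade", "slide_left", "slide_right", "dip_to_black", "flash",
)


def resolve_resolution(name: str) -> tuple[int, int]:
    """Return (width, height) for a preset name, failing loudly on unknown names."""
    try:
        return RESOLUTION_MAP[name]
    except KeyError:
        valid = ", ".join(sorted(RESOLUTION_MAP))
        raise ValueError(f"Unknown resolution {name!r}; expected one of: {valid}") from None
```

Keeping these as immutable module-level constants lets the CLI and the Gradio UI offer the same choices without duplicating the lists.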
### Task 2: Utility Modules
Extract utility functions from the notebook into reusable modules.
**Files to create:**
- `src/audio_video_generator/utils/text.py` - `normalize_text()`, `tokenize_text()`, `extract_number()`, `get_fuzzy_threshold()`, `clamp01()`, `safe_int()`, `apply_case_style()`
- `src/audio_video_generator/utils/files.py` - `ensure_dir()`, `make_run_dir()`, `safe_output_name()`, `write_json_file()`, `write_text_file()`, `extract_zip()`, `collect_images_recursive()`, `sort_images()`
**Key specs:**
- Functions must not read or mutate module-level global state; the helpers in `files.py` may touch only the paths passed to them
- Add type hints
- Add docstrings
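
A minimal sketch of how a few of the `utils/text.py` helpers could look once they carry type hints and docstrings. The notebook's actual normalization rules are not shown in this plan, so the behaviour here (NFKC, lowercasing, punctuation stripping) is an assumption.

```python
import re
import unicodedata
from typing import Any, List

_NON_WORD = re.compile(r"[^\w\s]")
_WHITESPACE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = _NON_WORD.sub(" ", text)
    return _WHITESPACE.sub(" ", text).strip()


def tokenize_text(text: str) -> List[str]:
    """Split normalized text into word tokens."""
    return normalize_text(text).split()


def clamp01(value: float) -> float:
    """Clamp a number into the closed interval [0.0, 1.0]."""
    return max(0.0, min(1.0, float(value)))


def safe_int(value: Any, default: int = 0) -> int:
    """Convert to int, returning `default` when conversion is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        return default
```

None of these helpers touches module-level mutable state, so they can be unit-tested in isolation.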
### Task 3: Audio and Transcription Module
Extract Whisper-related functionality.
**Files to create:**
- `src/audio_video_generator/core/audio.py` - `transcribe_with_words()`, `extract_word_timeline()`, `get_whisper_model()`, `get_device()`
**Key specs:**
- Use singleton pattern for Whisper model (lazy loading)
- Support CPU and CUDA
- Handle model caching properly
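
One possible shape for the lazy, cached model loader is sketched below. The `loader` parameter and the `_MODEL_CACHE` name are hypothetical additions that let the caching logic be tested without downloading weights. By default the sketch falls back to `whisper.load_model`.

```python
from typing import Any, Callable, Dict, Optional, Tuple

# Cache keyed on (model name, device) so each combination is loaded once.
_MODEL_CACHE: Dict[Tuple[str, str], Any] = {}


def get_device() -> str:
    """Return "cuda" when torch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so this module imports without torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def get_whisper_model(
    name: str = "base",
    device: Optional[str] = None,
    loader: Optional[Callable[..., Any]] = None,  # hypothetical test hook
) -> Any:
    """Load a Whisper model on first use and reuse it on later calls."""
    resolved = device or get_device()
    key = (name, resolved)
    if key not in _MODEL_CACHE:
        if loader is None:
            import whisper  # deferred: importing whisper is slow

            loader = whisper.load_model
        _MODEL_CACHE[key] = loader(name, device=resolved)
    return _MODEL_CACHE[key]
```

Resolving the device before building the cache key means `get_whisper_model("base")` and `get_whisper_model("base", device="cpu")` share one entry on a CPU-only machine.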
### Task 4: Image Processing Module
Extract image handling functionality.
**Files to create:**
- `src/audio_video_generator/core/images.py` - `prepare_image_inputs()`, `verify_and_filter_images()`, `build_image_indexes()`, `image_preflight_report()`, `resolve_image_reference()`, `resize_with_padding()`
**Key specs:**
- Support both ZIP and manual image inputs
- Image caching for performance
- Proper error handling for corrupt images
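
The geometry behind a function like `resize_with_padding()` can be sketched without any imaging library. The helper name `fit_with_padding` and its return shape are assumptions made for illustration. The notebook's own implementation may differ.

```python
from typing import Tuple


def fit_with_padding(src_w: int, src_h: int, dst_w: int, dst_h: int) -> Tuple[int, int, int, int]:
    """Compute letterbox geometry that preserves the source aspect ratio.

    Returns (new_w, new_h, offset_x, offset_y): the scaled image size and
    the top-left position that centres it on a dst_w x dst_h canvas.
    """
    if min(src_w, src_h, dst_w, dst_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # Use the smaller scale factor so the image fits in both directions.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = min(dst_w, max(1, round(src_w * scale)))
    new_h = min(dst_h, max(1, round(src_h * scale)))
    return new_w, new_h, (dst_w - new_w) // 2, (dst_h - new_h) // 2
```

With Pillow, the scaled size would typically go to `Image.resize` and the offsets to `Image.paste` on a blank canvas of the target resolution.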
### Task 5: CSV Alignment Module
Extract CSV loading and alignment logic.
**Files to create:**
- `src/audio_video_generator/core/alignment.py` - `load_csv()`, `preprocess_csv()`, `mapping_preflight_report()`, `find_sentence_match()`, `build_timeline()`
**Key specs:**
- CSV must have exactly 2 columns: text, image
- Fuzzy matching with configurable thresholds
- Duplicate row handling
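
To make "fuzzy matching with configurable thresholds" concrete, here is a small sketch that slides a window over the transcribed words and scores each window with the standard library's `difflib.SequenceMatcher`. The signature and return value are assumptions. The notebook's `find_sentence_match()` may use a different scorer or search strategy.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple


def find_sentence_match(
    words: Sequence[str],
    phrase_tokens: Sequence[str],
    start: int = 0,
    threshold: float = 0.8,
) -> Optional[Tuple[int, int, float]]:
    """Find the best window in words[start:] that resembles the phrase.

    Returns (begin, end, score) with `end` exclusive, or None when no
    window reaches `threshold`.
    """
    n = len(phrase_tokens)
    if n == 0:
        return None
    target = " ".join(phrase_tokens)
    best: Optional[Tuple[int, int, float]] = None
    for begin in range(start, len(words) - n + 1):
        window = " ".join(words[begin:begin + n])
        score = SequenceMatcher(None, target, window).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (begin, begin + n, score)
    return best
```

Passing the end index of the previous match as `start` keeps matches in audio order, which is one way to disambiguate duplicate CSV rows.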
### Task 6: Video and Animation Module
Extract video composition and animation logic.
**Files to create:**
- `src/audio_video_generator/core/video.py` - `resolve_effect_sequence()`, `get_transition_duration()`, `apply_animation_to_clip()`, `apply_slide_position()`, `build_transition_overlay()`, `build_render_clips()`
**Key specs:**
- Support all animation types: none, zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
- Support all transitions: none, fade, crossfade, slide_left, slide_right, dip_to_black, flash
- Use moviepy for video processing
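
Several of the listed animations reduce to a scale factor that varies over the clip's duration. The sketch below shows that idea for `zoom_in`, `zoom_out`, and `pulse`. The function name, the `strength` parameter, and the easing curves are assumptions, not the notebook's values.

```python
import math


def animation_scale(kind: str, t: float, duration: float, strength: float = 0.1) -> float:
    """Return the scale factor for an animation at time `t` (seconds)."""
    if duration <= 0:
        return 1.0
    progress = max(0.0, min(1.0, t / duration))
    if kind == "zoom_in":
        # Grow linearly from 1.0 to 1.0 + strength.
        return 1.0 + strength * progress
    if kind == "zoom_out":
        # Start enlarged and shrink back to the original size.
        return 1.0 + strength * (1.0 - progress)
    if kind == "pulse":
        # One smooth swell that peaks at the midpoint of the clip.
        return 1.0 + strength * 0.5 * (1.0 - math.cos(2.0 * math.pi * progress))
    # "none" and opacity-only effects leave the size unchanged.
    return 1.0
```

The returned factor is meant to drive a time-dependent resize. How that is wired into moviepy depends on the pinned version, so it is left out here.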
### Task 7: Text Overlay Module
Extract text overlay functionality.
**Files to create:**
- `src/audio_video_generator/core/text_overlay.py` - `load_overlay_txt()`, `find_all_phrase_matches()`, `build_text_overlay_events()`, `render_text_rgba()`, `resolve_overlay_position()`, `build_text_overlay_clips()`, `get_available_font_map()`, `get_pil_font()`, `make_text_style_config()`
**Key specs:**
- Support font selection, colors, patterns (solid, boxed, highlighted)
- Support text animations: fade_in, pop_in, pulse, slide_up, glow_pop, typewriter, zoom_in
- Typewriter effect builds character-by-character clips
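
The character-by-character approach to the typewriter effect can be sketched as a list of timed text prefixes, each of which would later become its own clip. `TypewriterFrame`, `build_typewriter_frames`, and `reveal_fraction` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TypewriterFrame:
    text: str     # prefix of the phrase visible during this frame
    start: float  # seconds
    end: float    # seconds


def build_typewriter_frames(
    text: str, start: float, end: float, reveal_fraction: float = 0.6
) -> List[TypewriterFrame]:
    """Split [start, end] into one frame per revealed character.

    Characters appear at a constant rate during the first
    `reveal_fraction` of the interval; the full text then holds until `end`.
    """
    if not text or end <= start:
        return []
    reveal = (end - start) * max(0.0, min(1.0, reveal_fraction))
    step = reveal / len(text)
    frames = []
    for i in range(1, len(text) + 1):
        frame_start = start + (i - 1) * step
        # The last frame holds the complete text until the event ends.
        frame_end = end if i == len(text) else start + i * step
        frames.append(TypewriterFrame(text[:i], frame_start, frame_end))
    return frames
```

Because each frame ends exactly where the next begins, the rendered clips can be concatenated without gaps or overlaps.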
### Task 8: CLI Interface
Create command-line interface using Click.
**Files to create:**
- `src/audio_video_generator/cli.py` - Main CLI with commands and options
- `src/audio_video_generator/__main__.py` - Entry point for `python -m`
**Key specs:**
- Command: `avg generate`; invoking `avg` with options but no subcommand runs the same command
- Options for all major settings: audio, csv, images, resolution, output, animations, transitions
- Progress reporting
- Checkpoint saving
- Proper error handling with exit codes
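
A minimal Click sketch of the command layout follows. The `DefaultToGenerate` class, the option names, and the resolution choices are assumptions for illustration. The body of `generate` is a placeholder rather than the real pipeline call, and the `web` subcommand is omitted.

```python
import click


class DefaultToGenerate(click.Group):
    """Hypothetical group that routes bare `avg --opts` to `avg generate --opts`."""

    def parse_args(self, ctx, args):
        if args and args[0] not in self.commands and args[0] != "--help":
            args = ["generate", *args]
        return super().parse_args(ctx, args)


@click.group(cls=DefaultToGenerate)
def main():
    """Audio-to-image video generator."""


@main.command()
@click.option("--audio", type=click.Path(exists=True, dir_okay=False), required=True)
@click.option("--csv", "csv_path", type=click.Path(exists=True, dir_okay=False), required=True)
@click.option("--images", type=click.Path(exists=True), required=True)
@click.option("--resolution", type=click.Choice(["720p", "1080p"]), default="1080p", show_default=True)
@click.option("--output", type=click.Path(dir_okay=False), default="output.mp4", show_default=True)
def generate(audio, csv_path, images, resolution, output):
    """Render a video from an audio file, a CSV mapping, and images."""
    try:
        # Placeholder: the real pipeline call is not part of this sketch.
        click.echo(f"Would render {output} at {resolution} from {audio}")
    except OSError as exc:
        # ClickException prints the message and exits with status 1.
        raise click.ClickException(str(exc)) from exc


if __name__ == "__main__":
    main()
```

Click reports usage errors such as a missing option or a nonexistent input file with exit status 2, so only runtime failures need explicit handling in the command body.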
### Task 9: Gradio Web UI
Extract and clean up the Gradio interface.
**Files to create:**
- `src/audio_video_generator/web/gradio_ui.py` - Full Gradio UI implementation
- `src/audio_video_generator/web/__init__.py`
**Key specs:**
- Command: `avg web` to launch UI
- Include all features from notebook: file uploads, path inputs, live preview, text overlay editor
- Drive integration optional (Colab-specific code made conditional)
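
Making the Colab-specific code conditional could look roughly like the sketch below. The helper names are hypothetical. `google.colab.drive.mount` is only imported when the Colab runtime is actually present, and the mount point is passed in rather than hardcoded.

```python
import importlib.util


def in_colab() -> bool:
    """Return True when the google.colab package is importable."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # find_spec raises when the parent `google` package is missing.
        return False


def mount_drive_if_available(mount_point: str) -> bool:
    """Mount Google Drive inside Colab; do nothing elsewhere.

    Returns True only when a mount was attempted.
    """
    if not in_colab():
        return False
    from google.colab import drive  # only importable inside Colab

    drive.mount(mount_point)
    return True
```

The Gradio UI can then show or hide its Drive controls based on `in_colab()` instead of assuming a Colab filesystem layout.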
### Task 10: Main Pipeline Integration
Create the main orchestration pipeline.
**Files to create:**
- `src/audio_video_generator/core/pipeline.py` - `create_video_pipeline()` function that orchestrates all components
**Key specs:**
- Error handling with cleanup
- Progress callbacks
- Memory management (gc.collect, CUDA cache clear)
- Report generation
- Checkpoint saving
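
The error-handling, progress, and memory-management requirements can be illustrated with a small stage runner. `run_stages`, `release_memory`, and the `(name, callable)` stage shape are assumptions for this sketch. The real `create_video_pipeline()` would plug the actual transcription, alignment, and render steps into something like it.

```python
import gc
from typing import Any, Callable, Dict, Optional, Sequence, Tuple

ProgressCallback = Callable[[str, float], None]
Stage = Tuple[str, Callable[[Dict[str, Any]], None]]


def release_memory() -> None:
    """Run the garbage collector and clear the CUDA cache when torch is present."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_stages(
    stages: Sequence[Stage],
    on_progress: Optional[ProgressCallback] = None,
    cleanup: Callable[[], None] = release_memory,
) -> Dict[str, Any]:
    """Run stages in order, sharing a state dict between them.

    `cleanup` always runs, whether the stages succeed or one of them raises.
    """
    state: Dict[str, Any] = {}
    try:
        for index, (name, stage) in enumerate(stages, start=1):
            stage(state)
            if on_progress is not None:
                on_progress(name, index / len(stages))
    finally:
        cleanup()
    return state
```

The same `on_progress` hook can feed a CLI progress bar or a Gradio progress widget, which keeps the orchestration code independent of either front end.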
### Task 11: Documentation
Create README and usage documentation.
**Files to create:**
- `README.md` - Installation, usage examples, CSV format, CLI reference
**Key specs:**
- Installation instructions
- CSV format specification
- CLI examples
- Web UI usage
## Execution Notes
- All code must work outside Colab (no hardcoded `/content` paths)
- Drive integration should be optional/conditional
- Keep checkpoint functionality for debugging
- Preserve all animation and transition options
- Ensure proper resource cleanup (moviepy clips, torch CUDA)