---
title: Audio Video Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: false
license: mit
---

# Audio Video Generator

Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays.

## Features

- **Audio Transcription**: Uses OpenAI Whisper for accurate word-level timestamps
- **CSV Alignment**: Maps text phrases to images with fuzzy matching
- **Image Animations**: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
- **Transitions**: fade, crossfade, slide_left, slide_right, dip_to_black, flash
- **Text Overlays**: Synchronized text with multiple animations and styles
- **Flexible Input**: Accept ZIP files or directories of images
- **Multiple Resolutions**: Landscape (1280x720), Portrait (720x1280), Square (1080x1080)
- **CLI & Web UI**: Use command-line or Gradio web interface

## Installation

```bash
# Clone or download the repository
cd audio-video-generator

# Install dependencies
pip install -e .

# Or install from requirements.txt
pip install -r requirements.txt
```

### System Requirements

- Python 3.9+
- FFmpeg (required by moviepy)
- CUDA (optional, for GPU acceleration)

#### Installing FFmpeg

**macOS:**

```bash
brew install ffmpeg
```

**Ubuntu/Debian:**

```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

**Windows:**

Download from https://ffmpeg.org/download.html and add to PATH

## Quick Start

### 1. Prepare Your Files

**Audio file**: Any common format (MP3, WAV, M4A, etc.)

**CSV mapping file** (`storyboard.csv`):

```csv
text,image
"Welcome to our presentation",1_intro.png
"First we discuss the basics",2_basics.jpg
"Then we explore advanced topics",3_advanced.png
"Thank you for watching",4_thanks.png
```

**Images**: Name files with numbers for automatic ordering (e.g., `1_intro.png`, `2_basics.jpg`)

### 2. Generate Video (CLI)

```bash
# Basic usage with ZIP file
avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4

# With manual image directory
avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL

# With custom animations
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
    --animation-mode single --animation zoom_in \
    --transition-mode single --transition fade

# With text overlay
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
    --txt-overlay phrases.txt --text-color "#FFFFFF"
```

### 3. Launch Web UI

```bash
# Start local web interface
avg web

# With public link (for sharing)
avg web --share

# Custom port
avg web --port 8080
```

## CLI Reference

### `avg generate`

Generate a synchronized video.

**Required Arguments:**

- `-a, --audio`: Path to audio file
- `-c, --csv`: Path to CSV mapping file
- `-i, --images`: Path to images (ZIP or directory)

**Optional Arguments:**

| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output` | Output filename | `output.mp4` |
| `-r, --resolution` | Resolution preset (`landscape`, `portrait`, `square`) | `landscape` |
| `--animation-mode` | Animation selection (`random`, `single`, `custom`) | `random` |
| `--animation` | Single animation to use | - |
| `--animations` | Custom animation list (can specify multiple) | - |
| `--transition-mode` | Transition selection (`random`, `single`, `custom`) | `random` |
| `--transition` | Single transition to use | - |
| `--transitions` | Custom transition list (can specify multiple) | - |
| `--txt-overlay` | Path to text overlay file | - |
| `--font-size` | Text overlay font size | `56` |
| `--text-color` | Text color (hex) | `#FFFFFF` |
| `--text-pos-x` | Horizontal position (0.0-1.0) | `0.5` |
| `--text-pos-y` | Vertical position (0.0-1.0) | `0.5` |
| `--whisper-model` | Whisper model (`tiny`, `base`, `small`, `medium`, `large`) | `base` |
| `--fps` | Output framerate | `24` |

### `avg web`

Launch Gradio web interface.

| Option | Description | Default |
|--------|-------------|---------|
| `--host` | Host to bind to | `127.0.0.1` |
| `--port` | Port to listen on | `7860` |
| `--share` | Create public shareable link | `False` |

### `avg models`

List available Whisper models with size and speed info.

### `avg animations`

List available animations and transitions.

## CSV File Format

The CSV file must have exactly **2 columns**:

1. `text`: The text content that appears in the audio
2. `image`: Reference to the image file

### Image Reference Resolution

Image references are resolved in this order:

1. **Exact filename match**: `image.png` matches `image.png`
2. **Stem match**: `image` matches `image.png`
3. **Number match**: `1` matches `1_xxx.png`, `2` matches `2_yyy.jpg`
4. **Fallback**: Last numbered image in the collection

### Example CSV

```csv
text,image
"Introduction",1_intro.jpg
"Main topic one",2_topic1.png
"Main topic two",3_topic2.jpg
"Conclusion",4_outro.jpg
```

## Text Overlay File Format

Create a text file with one phrase per line:

```
Welcome
First Point
Second Point
Key Takeaway
Thank You
```

Each phrase will be matched to the audio and displayed with the selected animation.
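
The four-step lookup order listed under Image Reference Resolution can be illustrated with a short sketch. This is not the project's actual code: `resolve_image` and `leading_number` are hypothetical helper names, and reading "last numbered image" as the file with the highest numeric prefix is an assumption.

```python
import re
from pathlib import Path
from typing import Optional, Sequence


def leading_number(filename: str) -> Optional[int]:
    """Return the integer prefix of a name such as '2_basics.jpg', or None."""
    match = re.match(r"(\d+)", Path(filename).stem)
    return int(match.group(1)) if match else None


def resolve_image(reference: str, filenames: Sequence[str]) -> Optional[str]:
    """Hypothetical helper: resolve a CSV image reference against known files."""
    ref = reference.strip()

    # 1. Exact filename match: 'image.png' matches 'image.png'
    if ref in filenames:
        return ref

    # 2. Stem match: 'image' matches 'image.png'
    for name in filenames:
        if Path(name).stem == ref:
            return name

    # 3. Number match: '1' matches '1_xxx.png'
    if ref.isdigit():
        for name in filenames:
            if leading_number(name) == int(ref):
                return name

    # 4. Fallback (assumption): the numbered image with the highest prefix
    numbered = [name for name in filenames if leading_number(name) is not None]
    if numbered:
        return max(numbered, key=leading_number)
    return None


if __name__ == "__main__":
    files = ["1_intro.png", "2_basics.jpg", "10_end.png", "logo.png"]
    for ref in ("2_basics.jpg", "logo", "1", "missing"):
        print(ref, "->", resolve_image(ref, files))
```

Under these assumptions, a reference that matches nothing falls through to the highest-numbered file, which is why the Troubleshooting section recommends numeric filename prefixes.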
## Animations

### Image Animations

| Animation | Description |
|-----------|-------------|
| `none` | Static image |
| `zoom_in` | Slow zoom in (1.00x → 1.06x) |
| `zoom_out` | Slow zoom out (1.06x → 1.00x) |
| `fade_in` | Fade in from black |
| `blink` | Subtle brightness pulsing |
| `pulse` | Scale pulsing with sine wave |
| `fade_zoom_in` | Fade in + slow zoom |

### Transitions

| Transition | Description |
|------------|-------------|
| `none` | Cut (no transition) |
| `fade` | Crossfade between images |
| `crossfade` | Longer crossfade |
| `slide_left` | Slide in from right |
| `slide_right` | Slide in from left |
| `dip_to_black` | Brief black screen between |
| `flash` | Brief white flash between |

### Text Animations

| Animation | Description |
|-----------|-------------|
| `zoom_in` | Scale up from 0.72x |
| `fade_in` | Fade in |
| `pop_in` | Scale + fade pop effect |
| `pulse` | Continuous pulse |
| `slide_up` | Slide up + fade |
| `glow_pop` | Pop with glow effect |
| `typewriter` | Character-by-character reveal |

## Python API

```python
from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig

# Configure pipeline
config = VideoPipelineConfig(
    audio_path="audio.mp3",
    csv_path="mapping.csv",
    input_mode="ZIP",
    zip_path="images.zip",
    output_filename="output.mp4",
    resolution="landscape",
    animation_mode="random",
    transition_mode="random",
    whisper_model="base"
)

# Run pipeline
pipeline = VideoPipeline(config)
result = pipeline.run()

print(f"Video saved to: {result['output_path']}")
```

## Project Structure

```
audio-video-generator/
├── pyproject.toml          # Package configuration
├── requirements.txt        # Dependencies
├── README.md               # This file
└── src/
    └── audio_video_generator/
        ├── __init__.py
        ├── __main__.py     # Module entry point
        ├── cli.py          # CLI implementation
        ├── config.py       # Constants and defaults
        ├── core/
        │   ├── audio.py        # Whisper transcription
        │   ├── alignment.py    # CSV-audio alignment
        │   ├── images.py       # Image processing
        │   ├── pipeline.py     # Main orchestration
        │   ├── text_overlay.py # Text rendering
        │   └── video.py        # Animation/transitions
        ├── utils/
        │   ├── files.py        # File utilities
        │   └── text.py         # Text processing
        └── web/
            └── gradio_ui.py    # Web interface
```

## Troubleshooting

### CUDA Out of Memory

Use a smaller Whisper model:

```bash
avg generate --whisper-model tiny ...
```

Or force CPU:

```bash
CUDA_VISIBLE_DEVICES="" avg generate ...
```

### Images Not Found

Ensure image filenames start with numbers (e.g., `1_image.png`). The tool uses numbers for fallback resolution.

### CSV Alignment Failing

Check that:

1. CSV has exactly 2 columns: `text` and `image`
2. Text content matches what's spoken in the audio
3. File encoding is UTF-8

### FFmpeg Errors

Ensure FFmpeg is installed and available in PATH:

```bash
ffmpeg -version
```

## License

MIT License

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for transcription
- [MoviePy](https://zulko.github.io/moviepy/) for video processing
- [Gradio](https://gradio.app/) for web interface