| title: Audio Video Generator | |
| emoji: 🎬 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 6.13.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| # Audio Video Generator | |
| Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays. | |
| ## Features | |
| - **Audio Transcription**: Uses OpenAI Whisper for accurate word-level timestamps | |
| - **CSV Alignment**: Maps text phrases to images with fuzzy matching | |
| - **Image Animations**: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in | |
| - **Transitions**: fade, crossfade, slide_left, slide_right, dip_to_black, flash | |
| - **Text Overlays**: Synchronized text with multiple animations and styles | |
| - **Flexible Input**: Accept ZIP files or directories of images | |
| - **Multiple Resolutions**: Landscape (1280x720), Portrait (720x1280), Square (1080x1080) | |
| - **CLI & Web UI**: Use the command line or the Gradio web interface | |
| ## Installation | |
| ```bash | |
| # Clone or download the repository | |
| cd audio-video-generator | |
| # Install dependencies | |
| pip install -e . | |
| # Or install from requirements.txt | |
| pip install -r requirements.txt | |
| ``` | |
| ### System Requirements | |
| - Python 3.9+ | |
| - FFmpeg (required by MoviePy) | |
| - CUDA (optional, for GPU acceleration) | |
| #### Installing FFmpeg | |
| **macOS:** | |
| ```bash | |
| brew install ffmpeg | |
| ``` | |
| **Ubuntu/Debian:** | |
| ```bash | |
| sudo apt-get update | |
| sudo apt-get install ffmpeg | |
| ``` | |
| **Windows:** | |
| Download a build from https://ffmpeg.org/download.html and add the folder containing `ffmpeg.exe` to your `PATH`. | |
| ## Quick Start | |
| ### 1. Prepare Your Files | |
| **Audio file**: Any common format (MP3, WAV, M4A, etc.) | |
| **CSV mapping file** (`storyboard.csv`): | |
| ```csv | |
| text,image | |
| "Welcome to our presentation",1_intro.png | |
| "First we discuss the basics",2_basics.jpg | |
| "Then we explore advanced topics",3_advanced.png | |
| "Thank you for watching",4_thanks.png | |
| ``` | |
| **Images**: Name files with numbers for automatic ordering (e.g., `1_intro.png`, `2_basics.jpg`) | |
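The numeric ordering mentioned above can be sketched in a few lines. This is an illustration only, not the tool's own code: `leading_number` and `order_images` are hypothetical helpers, and placing unnumbered files last is an assumption.

```python
import re
from pathlib import Path
from typing import List, Optional

_LEADING_NUMBER = re.compile(r"^(\d+)")


def leading_number(filename: str) -> Optional[int]:
    """Return the integer prefix of a filename, or None if it has none."""
    match = _LEADING_NUMBER.match(Path(filename).name)
    return int(match.group(1)) if match else None


def order_images(filenames: List[str]) -> List[str]:
    """Sort numbered files numerically (so 10 comes after 2).

    Files without a leading number are placed last, alphabetically.
    That placement is an assumption made for this sketch.
    """
    def sort_key(name: str):
        number = leading_number(name)
        return (number is None, number if number is not None else 0, name)

    return sorted(filenames, key=sort_key)


if __name__ == "__main__":
    print(order_images(["10_end.png", "2_basics.jpg", "1_intro.png", "cover.png"]))
```

Sorting on the parsed integer rather than the raw string is what keeps `10_end.png` from landing between `1_intro.png` and `2_basics.jpg`.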
| ### 2. Generate Video (CLI) | |
| ```bash | |
| # Basic usage with ZIP file | |
| avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4 | |
| # With manual image directory | |
| avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL | |
| # With custom animations | |
| avg generate -a audio.mp3 -c storyboard.csv -i images.zip \ | |
| --animation-mode single --animation zoom_in \ | |
| --transition-mode single --transition fade | |
| # With text overlay | |
| avg generate -a audio.mp3 -c storyboard.csv -i images.zip \ | |
| --txt-overlay phrases.txt --text-color "#FFFFFF" | |
| ``` | |
| ### 3. Launch Web UI | |
| ```bash | |
| # Start local web interface | |
| avg web | |
| # With public link (for sharing) | |
| avg web --share | |
| # Custom port | |
| avg web --port 8080 | |
| ``` | |
| ## CLI Reference | |
| ### `avg generate` | |
| Generate a synchronized video. | |
| **Required Arguments:** | |
| - `-a, --audio`: Path to audio file | |
| - `-c, --csv`: Path to CSV mapping file | |
| - `-i, --images`: Path to images (ZIP or directory) | |
| **Optional Arguments:** | |
| | Option | Description | Default | | |
| |--------|-------------|---------| | |
| | `-o, --output` | Output filename | `output.mp4` | | |
| | `-r, --resolution` | Resolution preset (`landscape`, `portrait`, `square`) | `landscape` | | |
| | `--animation-mode` | Animation selection (`random`, `single`, `custom`) | `random` | | |
| | `--animation` | Single animation to use | - | | |
| | `--animations` | Custom animation list (can specify multiple) | - | | |
| | `--transition-mode` | Transition selection (`random`, `single`, `custom`) | `random` | | |
| | `--transition` | Single transition to use | - | | |
| | `--transitions` | Custom transition list (can specify multiple) | - | | |
| | `--txt-overlay` | Path to text overlay file | - | | |
| | `--font-size` | Text overlay font size | `56` | | |
| | `--text-color` | Text color (hex) | `#FFFFFF` | | |
| | `--text-pos-x` | Horizontal position (0.0-1.0) | `0.5` | | |
| | `--text-pos-y` | Vertical position (0.0-1.0) | `0.5` | | |
| | `--whisper-model` | Whisper model (`tiny`, `base`, `small`, `medium`, `large`) | `base` | | |
| | `--fps` | Output framerate | `24` | | |
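To make the fractional `--text-pos-x` / `--text-pos-y` values concrete, the sketch below converts them to pixel coordinates for each resolution preset listed under Features. `overlay_anchor` is a hypothetical helper, and measuring from the top-left corner is an assumption; the tool's own convention may differ.

```python
from typing import Dict, Tuple

# Preset sizes as listed in the Features section (width, height).
RESOLUTIONS: Dict[str, Tuple[int, int]] = {
    "landscape": (1280, 720),
    "portrait": (720, 1280),
    "square": (1080, 1080),
}


def overlay_anchor(resolution: str, pos_x: float = 0.5, pos_y: float = 0.5) -> Tuple[int, int]:
    """Map fractional positions (0.0-1.0) to pixel coordinates.

    Assumes the origin is the top-left corner of the frame.
    """
    if not (0.0 <= pos_x <= 1.0 and 0.0 <= pos_y <= 1.0):
        raise ValueError("positions must be between 0.0 and 1.0")
    width, height = RESOLUTIONS[resolution]
    return round(pos_x * width), round(pos_y * height)


if __name__ == "__main__":
    for preset in RESOLUTIONS:
        print(preset, overlay_anchor(preset))
```

With the defaults of `0.5` and `0.5`, the anchor is the center of the frame for every preset.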
| ### `avg web` | |
| Launch Gradio web interface. | |
| | Option | Description | Default | | |
| |--------|-------------|---------| | |
| | `--host` | Host to bind to | `127.0.0.1` | | |
| | `--port` | Port to listen on | `7860` | | |
| | `--share` | Create public shareable link | `False` | | |
| ### `avg models` | |
| List available Whisper models with size and speed info. | |
| ### `avg animations` | |
| List available animations and transitions. | |
| ## CSV File Format | |
| The CSV file must have exactly **2 columns**: | |
| 1. `text`: The text content that appears in the audio | |
| 2. `image`: Reference to the image file | |
| ### Image Reference Resolution | |
| Image references are resolved in this order: | |
| 1. **Exact filename match**: `image.png` matches `image.png` | |
| 2. **Stem match**: `image` matches `image.png` | |
| 3. **Number match**: `1` matches `1_xxx.png`, `2` matches `2_yyy.jpg` | |
| 4. **Fallback**: Last numbered image in the collection | |
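The resolution order above can be sketched as a single function. This is a minimal illustration, not the project's implementation: `resolve_image` is a hypothetical name, and reading "last numbered image" as the file with the highest leading number is an assumption.

```python
import re
from pathlib import Path
from typing import List, Optional

_LEADING_NUMBER = re.compile(r"^(\d+)")


def resolve_image(reference: str, filenames: List[str]) -> Optional[str]:
    """Resolve a CSV image reference against the available filenames."""
    reference = reference.strip()

    # 1. Exact filename match.
    for name in filenames:
        if name == reference:
            return name

    # 2. Stem match: "image" matches "image.png".
    for name in filenames:
        if Path(name).stem == reference:
            return name

    # 3. Number match: "1" matches "1_xxx.png".
    if reference.isdigit():
        for name in filenames:
            match = _LEADING_NUMBER.match(name)
            if match and int(match.group(1)) == int(reference):
                return name

    # 4. Fallback: the numbered image with the highest number (assumption).
    numbered = []
    for name in filenames:
        match = _LEADING_NUMBER.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return max(numbered)[1] if numbered else None


if __name__ == "__main__":
    files = ["1_intro.png", "2_basics.jpg", "logo.png"]
    for ref in ["logo.png", "logo", "2", "missing"]:
        print(ref, "->", resolve_image(ref, files))
```

Each step only runs if the previous ones found nothing, so an exact filename always wins over a stem or number match.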
| ### Example CSV | |
| ```csv | |
| text,image | |
| "Introduction",1_intro.jpg | |
| "Main topic one",2_topic1.png | |
| "Main topic two",3_topic2.jpg | |
| "Conclusion",4_outro.jpg | |
| ``` | |
| ## Text Overlay File Format | |
| Create a text file with one phrase per line: | |
| ``` | |
| Welcome | |
| First Point | |
| Second Point | |
| Key Takeaway | |
| Thank You | |
| ``` | |
| Each phrase will be matched to the audio and displayed with the selected animation. | |
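One way to match a phrase against word-level timestamps is a sliding-window fuzzy comparison. The sketch below uses the standard library's `difflib`; the tool's actual matching algorithm is not documented here, so `locate_phrase`, the `(word, start, end)` tuple layout, and the `0.6` threshold are all assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (word, start seconds, end seconds)


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]+", "", text.lower()).split()


def locate_phrase(phrase: str, words: List[Word], threshold: float = 0.6) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the best-matching word window, or None."""
    target = normalize(phrase)
    tokens = [" ".join(normalize(word)) for word, _, _ in words]
    size = len(target)
    if size == 0 or size > len(tokens):
        return None

    best_score, best_index = 0.0, None
    for i in range(len(tokens) - size + 1):
        window = " ".join(tokens[i:i + size])
        score = SequenceMatcher(None, " ".join(target), window).ratio()
        if score > best_score:  # strict: the earliest best window wins ties
            best_score, best_index = score, i

    if best_index is None or best_score < threshold:
        return None
    return words[best_index][1], words[best_index + size - 1][2]


if __name__ == "__main__":
    transcript = [
        ("Welcome", 0.0, 0.4), ("to", 0.4, 0.5), ("our", 0.5, 0.7),
        ("presentation.", 0.7, 1.5), ("First", 1.8, 2.1), ("point", 2.1, 2.6),
    ]
    print(locate_phrase("First Point", transcript))
```

A similarity threshold lets small transcription differences through while still rejecting phrases that never occur in the audio.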
| ## Animations | |
| ### Image Animations | |
| | Animation | Description | | |
| |-----------|-------------| | |
| | `none` | Static image | | |
| | `zoom_in` | Slow zoom in (1.00x → 1.06x) | | |
| | `zoom_out` | Slow zoom out (1.06x → 1.00x) | | |
| | `fade_in` | Fade in from black | | |
| | `blink` | Subtle brightness pulsing | | |
| | `pulse` | Scale pulsing with sine wave | | |
| | `fade_zoom_in` | Fade in + slow zoom | | |
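The zoom ranges in the table can be read as a scale factor that changes over the life of each clip. The sketch below assumes linear interpolation, which the table does not specify. `zoom_scale` is a hypothetical helper, and the pulse amplitude and frequency are made-up values for illustration only.

```python
import math

ZOOM_START, ZOOM_END = 1.00, 1.06  # range taken from the table above

# Hypothetical pulse parameters; the real values are not documented here.
PULSE_AMPLITUDE = 0.02
PULSE_HZ = 1.0


def zoom_scale(animation: str, t: float, duration: float) -> float:
    """Return the image scale factor at time t (seconds) within a clip."""
    progress = min(max(t / duration, 0.0), 1.0) if duration > 0 else 1.0
    if animation == "zoom_in":
        return ZOOM_START + (ZOOM_END - ZOOM_START) * progress
    if animation == "zoom_out":
        return ZOOM_END - (ZOOM_END - ZOOM_START) * progress
    if animation == "pulse":
        return 1.0 + PULSE_AMPLITUDE * math.sin(2 * math.pi * PULSE_HZ * t)
    return 1.0  # "none" and any non-scaling animation


if __name__ == "__main__":
    for second in range(5):
        print(second, round(zoom_scale("zoom_in", second, 4.0), 3))
```

A per-frame scale function like this can be evaluated at `t = frame_index / fps` to drive a resize step.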
| ### Transitions | |
| | Transition | Description | | |
| |------------|-------------| | |
| | `none` | Cut (no transition) | | |
| | `fade` | Crossfade between images | | |
| | `crossfade` | Longer crossfade | | |
| | `slide_left` | Slide in from right | | |
| | `slide_right` | Slide in from left | | |
| | `dip_to_black` | Brief black screen between images | | |
| | `flash` | Brief white flash between images | | |
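Conceptually, `fade` and `dip_to_black` are both weighted blends of pixel values over the transition window. The sketch below works on single RGB tuples to keep it dependency-free. The function names are hypothetical, and splitting the dip evenly into a fade-out half and a fade-in half is an assumption.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]
BLACK: Pixel = (0, 0, 0)


def _clamp01(value: float) -> float:
    return min(max(value, 0.0), 1.0)


def crossfade_pixel(a: Pixel, b: Pixel, t: float, duration: float) -> Pixel:
    """Blend from pixel a to pixel b as t goes from 0 to duration."""
    alpha = _clamp01(t / duration) if duration > 0 else 1.0
    return tuple(round((1 - alpha) * x + alpha * y) for x, y in zip(a, b))


def dip_to_black_pixel(a: Pixel, b: Pixel, t: float, duration: float) -> Pixel:
    """Fade a to black in the first half, then black to b in the second half."""
    progress = _clamp01(t / duration) if duration > 0 else 1.0
    if progress < 0.5:
        return crossfade_pixel(a, BLACK, progress, 0.5)
    return crossfade_pixel(BLACK, b, progress - 0.5, 0.5)


if __name__ == "__main__":
    red, blue = (255, 0, 0), (0, 0, 255)
    for step in range(5):
        t = step / 4
        print(t, crossfade_pixel(red, blue, t, 1.0), dip_to_black_pixel(red, blue, t, 1.0))
```

The same weights apply unchanged to whole frames when the pixel tuples are replaced by image arrays.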
| ### Text Animations | |
| | Animation | Description | | |
| |-----------|-------------| | |
| | `zoom_in` | Scale up from 0.72x | | |
| | `fade_in` | Fade in | | |
| | `pop_in` | Scale + fade pop effect | | |
| | `pulse` | Continuous pulse | | |
| | `slide_up` | Slide up + fade | | |
| | `glow_pop` | Pop with glow effect | | |
| | `typewriter` | Character-by-character reveal | | |
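Two of the text animations above reduce to simple functions of elapsed time. The sketch below shows a character-by-character reveal and a scale-up from 0.72x. The linear timing and the helper names `typewriter_text` and `text_zoom_scale` are assumptions, not the tool's actual code.

```python
TEXT_ZOOM_START = 0.72  # starting scale from the table above


def _progress(t: float, duration: float) -> float:
    """Fraction of the animation completed, clamped to 0.0-1.0."""
    if duration <= 0:
        return 1.0
    return min(max(t / duration, 0.0), 1.0)


def typewriter_text(text: str, t: float, duration: float) -> str:
    """Return the portion of text revealed at time t."""
    return text[: int(len(text) * _progress(t, duration))]


def text_zoom_scale(t: float, duration: float) -> float:
    """Scale factor for zoom_in text, growing from 0.72x to 1.0x."""
    return TEXT_ZOOM_START + (1.0 - TEXT_ZOOM_START) * _progress(t, duration)


if __name__ == "__main__":
    for step in range(5):
        t = step / 4
        print(repr(typewriter_text("Thank You", t, 1.0)), round(text_zoom_scale(t, 1.0), 2))
```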
| ## Python API | |
| ```python | |
| from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig | |
| # Configure pipeline | |
| config = VideoPipelineConfig( | |
|     audio_path="audio.mp3", | |
|     csv_path="mapping.csv", | |
|     input_mode="ZIP", | |
|     zip_path="images.zip", | |
|     output_filename="output.mp4", | |
|     resolution="landscape", | |
|     animation_mode="random", | |
|     transition_mode="random", | |
|     whisper_model="base", | |
| ) | |
| # Run pipeline | |
| pipeline = VideoPipeline(config) | |
| result = pipeline.run() | |
| print(f"Video saved to: {result['output_path']}") | |
| ``` | |
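For context on the word-level timestamps the pipeline depends on, the sketch below calls the `openai-whisper` package directly and flattens its result into a simple list. It is not necessarily how `core/audio.py` is written; `flatten_words` and `transcribe_words` are hypothetical helpers.

```python
from typing import Any, Dict, List


def flatten_words(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the per-word entries from a Whisper transcription result."""
    words = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            words.append({
                "word": word["word"].strip(),
                "start": float(word["start"]),
                "end": float(word["end"]),
            })
    return words


def transcribe_words(audio_path: str, model_name: str = "base") -> List[Dict[str, Any]]:
    """Transcribe an audio file and return one entry per spoken word."""
    # Imported lazily so flatten_words stays usable without openai-whisper.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(result)


if __name__ == "__main__":
    sample = {"segments": [{"words": [{"word": " Welcome", "start": 0.0, "end": 0.4}]}]}
    print(flatten_words(sample))
```

Calling `transcribe_words("audio.mp3")` requires `openai-whisper` and FFmpeg to be installed; smaller models such as `tiny` trade accuracy for speed and memory.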
| ## Project Structure | |
| ``` | |
| audio-video-generator/ | |
| ├── pyproject.toml  # Package configuration | |
| ├── requirements.txt  # Dependencies | |
| ├── README.md  # This file | |
| └── src/ | |
|     └── audio_video_generator/ | |
|         ├── __init__.py | |
|         ├── __main__.py  # Module entry point | |
|         ├── cli.py  # CLI implementation | |
|         ├── config.py  # Constants and defaults | |
|         ├── core/ | |
|         │   ├── audio.py  # Whisper transcription | |
|         │   ├── alignment.py  # CSV-audio alignment | |
|         │   ├── images.py  # Image processing | |
|         │   ├── pipeline.py  # Main orchestration | |
|         │   ├── text_overlay.py  # Text rendering | |
|         │   └── video.py  # Animation/transitions | |
|         ├── utils/ | |
|         │   ├── files.py  # File utilities | |
|         │   └── text.py  # Text processing | |
|         └── web/ | |
|             └── gradio_ui.py  # Web interface | |
| ``` | |
| ## Troubleshooting | |
| ### CUDA Out of Memory | |
| Use a smaller Whisper model: | |
| ```bash | |
| avg generate --whisper-model tiny ... | |
| ``` | |
| Or force CPU: | |
| ```bash | |
| CUDA_VISIBLE_DEVICES="" avg generate ... | |
| ``` | |
| ### Images Not Found | |
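| Check that each `image` value in the CSV matches a filename, a filename stem, or a leading number (e.g., `1` for `1_image.png`). Numbering the files also gives unmatched references a fallback: the last numbered image. | |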
| ### CSV Alignment Failing | |
| Check that: | |
| 1. CSV has exactly 2 columns: `text` and `image` | |
| 2. Text content matches what's spoken in the audio | |
| 3. File encoding is UTF-8 | |
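The structural checks above (items 1 and 3) can be automated before running the pipeline. This is a minimal sketch using only the standard library; `validate_storyboard` is a hypothetical helper, not part of the package. Whether the text matches the spoken audio (item 2) still has to be checked by ear or against a transcript.

```python
import csv
import io
from typing import List


def validate_storyboard(raw: bytes) -> List[str]:
    """Return a list of problems found in a storyboard CSV (empty if none)."""
    try:
        text = raw.decode("utf-8-sig")  # also accepts UTF-8 with a BOM
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ["file is empty"]

    problems = []
    header = [cell.strip().lower() for cell in rows[0]]
    if header != ["text", "image"]:
        problems.append(f"header should be 'text,image', found {rows[0]}")

    for row_number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            problems.append(f"row {row_number}: expected 2 columns, found {len(row)}")
        elif not row[0].strip() or not row[1].strip():
            problems.append(f"row {row_number}: empty text or image value")
    return problems


if __name__ == "__main__":
    sample = b'text,image\n"Welcome, everyone",1_intro.png\nHello,2.png,extra\n'
    for problem in validate_storyboard(sample):
        print(problem)
```

To check a real file, pass its bytes, for example `validate_storyboard(open("storyboard.csv", "rb").read())`. Text containing commas must be quoted, as in the sample row, or it will be counted as extra columns.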
| ### FFmpeg Errors | |
| Ensure FFmpeg is installed and available in PATH: | |
| ```bash | |
| ffmpeg -version | |
| ``` | |
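The same check can be run from Python, which is handy when debugging inside a virtual environment or a notebook where `PATH` may differ from your shell. The sketch uses only the standard library; `ffmpeg_version` is a hypothetical helper.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version(executable: str = "ffmpeg") -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None  # not on PATH for this Python process
    try:
        completed = subprocess.run(
            [path, "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    if completed.returncode != 0:
        return None
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version if version else "FFmpeg not found on PATH")
```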
| ## License | |
| MIT License | |
| ## Acknowledgments | |
| - [OpenAI Whisper](https://github.com/openai/whisper) for transcription | |
| - [MoviePy](https://zulko.github.io/moviepy/) for video processing | |
| - [Gradio](https://gradio.app/) for web interface |