title: Audio Video Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: false
license: mit
Audio Video Generator
Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays.
Features
- Audio Transcription: Uses OpenAI Whisper for accurate word-level timestamps
- CSV Alignment: Maps text phrases to images with fuzzy matching
- Image Animations: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
- Transitions: fade, crossfade, slide_left, slide_right, dip_to_black, flash
- Text Overlays: Synchronized text with multiple animations and styles
- Flexible Input: Accept ZIP files or directories of images
- Multiple Resolutions: Landscape (1280x720), Portrait (720x1280), Square (1080x1080)
- CLI & Web UI: Use command-line or Gradio web interface
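As a rough illustration of what "word-level timestamps" means in practice, the sketch below flattens a Whisper-style result into a plain list of timed words. The `segments` / `words` / `start` / `end` layout follows what the `openai-whisper` package returns when called with `word_timestamps=True`; the helper name `flatten_words` and the sample data are hypothetical and are not taken from this project's code.

```python
from typing import Any


def flatten_words(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect every timed word from a Whisper-style transcription result."""
    words = []
    for segment in result.get("segments", []):
        # Segments only carry a "words" list when word timestamps were requested.
        for entry in segment.get("words", []):
            words.append(
                {
                    "word": entry["word"].strip(),  # Whisper prefixes words with a space
                    "start": float(entry["start"]),
                    "end": float(entry["end"]),
                }
            )
    return words


# With the openai-whisper package installed, a real result could be obtained with:
#   import whisper
#   result = whisper.load_model("base").transcribe("audio.mp3", word_timestamps=True)
# The hand-written sample below stands in for that result.
sample_result = {
    "segments": [
        {"words": [{"word": " Welcome", "start": 0.0, "end": 0.42},
                   {"word": " to", "start": 0.42, "end": 0.55}]},
        {"words": [{"word": " Thanks", "start": 3.1, "end": 3.6}]},
        {"text": "segment without word timings"},
    ]
}

for item in flatten_words(sample_result):
    print(f"{item['start']:6.2f} to {item['end']:6.2f}  {item['word']}")
```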
Installation
# Clone or download the repository
cd audio-video-generator
# Install dependencies
pip install -e .
# Or install from requirements.txt
pip install -r requirements.txt
System Requirements
- Python 3.9+
- FFmpeg (required by moviepy)
- CUDA (optional, for GPU acceleration)
Installing FFmpeg
macOS:
brew install ffmpeg
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install ffmpeg
Windows: Download from https://ffmpeg.org/download.html and add to PATH
Quick Start
1. Prepare Your Files
Audio file: Any common format (MP3, WAV, M4A, etc.)
CSV mapping file (storyboard.csv):
text,image
"Welcome to our presentation",1_intro.png
"First we discuss the basics",2_basics.jpg
"Then we explore advanced topics",3_advanced.png
"Thank you for watching",4_thanks.png
Images: Name files with numbers for automatic ordering (e.g., 1_intro.png, 2_basics.jpg)
2. Generate Video (CLI)
# Basic usage with ZIP file
avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4
# With manual image directory
avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL
# With custom animations
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
--animation-mode single --animation zoom_in \
--transition-mode single --transition fade
# With text overlay
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
--txt-overlay phrases.txt --text-color "#FFFFFF"
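When the CLI is driven from a script (for example to render several storyboards in a row), the same commands can be assembled as argument lists. This is only a sketch: the helper `build_generate_command` is hypothetical, and it uses nothing beyond the flags shown in the examples above.

```python
import shlex


def build_generate_command(audio, csv_path, images, output="output.mp4",
                           animation=None, transition=None):
    """Assemble an `avg generate` argument list from the documented flags."""
    cmd = ["avg", "generate", "-a", audio, "-c", csv_path, "-i", images, "-o", output]
    if animation:
        # A single fixed animation requires switching the animation mode to "single".
        cmd += ["--animation-mode", "single", "--animation", animation]
    if transition:
        cmd += ["--transition-mode", "single", "--transition", transition]
    return cmd


cmd = build_generate_command("audio.mp3", "storyboard.csv", "images.zip",
                             animation="zoom_in", transition="fade")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with filenames that contain spaces.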
3. Launch Web UI
# Start local web interface
avg web
# With public link (for sharing)
avg web --share
# Custom port
avg web --port 8080
CLI Reference
avg generate
Generate a synchronized video.
Required Arguments:
- `-a, --audio`: Path to audio file
- `-c, --csv`: Path to CSV mapping file
- `-i, --images`: Path to images (ZIP or directory)
Optional Arguments:
| Option | Description | Default |
|---|---|---|
| `-o, --output` | Output filename | output.mp4 |
| `-r, --resolution` | Resolution preset (landscape, portrait, square) | `landscape` |
| `--animation-mode` | Animation selection (random, single, custom) | `random` |
| `--animation` | Single animation to use | - |
| `--animations` | Custom animation list (can specify multiple) | - |
| `--transition-mode` | Transition selection (random, single, custom) | `random` |
| `--transition` | Single transition to use | - |
| `--transitions` | Custom transition list (can specify multiple) | - |
| `--txt-overlay` | Path to text overlay file | - |
| `--font-size` | Text overlay font size | 56 |
| `--text-color` | Text color (hex) | #FFFFFF |
| `--text-pos-x` | Horizontal position (0.0-1.0) | 0.5 |
| `--text-pos-y` | Vertical position (0.0-1.0) | 0.5 |
| `--whisper-model` | Whisper model (tiny, base, small, medium, large) | `base` |
| `--fps` | Output framerate | 24 |
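The text position options are fractions of the frame rather than pixel values. A minimal sketch of how such fractions could map to pixel coordinates is shown below. It assumes the fraction is applied directly to the frame width and height of the chosen resolution preset; the function name `text_anchor_pixels` is hypothetical, and the project may anchor or clamp text differently.

```python
# Frame sizes for the documented resolution presets.
RESOLUTIONS = {
    "landscape": (1280, 720),
    "portrait": (720, 1280),
    "square": (1080, 1080),
}


def text_anchor_pixels(resolution, pos_x=0.5, pos_y=0.5):
    """Convert normalized (0.0-1.0) text positions into pixel coordinates."""
    if not (0.0 <= pos_x <= 1.0 and 0.0 <= pos_y <= 1.0):
        raise ValueError("positions must be between 0.0 and 1.0")
    width, height = RESOLUTIONS[resolution]
    return round(pos_x * width), round(pos_y * height)


for preset in RESOLUTIONS:
    print(preset, text_anchor_pixels(preset))
```

Under this assumption the defaults of 0.5 and 0.5 place the text anchor at the center of the frame for every preset.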
avg web
Launch Gradio web interface.
| Option | Description | Default |
|---|---|---|
| `--host` | Host to bind to | 127.0.0.1 |
| `--port` | Port to listen on | 7860 |
| `--share` | Create public shareable link | False |
avg models
List available Whisper models with size and speed info.
avg animations
List available animations and transitions.
CSV File Format
The CSV file must have exactly 2 columns:
- `text`: The text content that appears in the audio
- `image`: Reference to the image file
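Before running a long render it can help to check the storyboard up front. The sketch below validates the two-column layout with the standard `csv` module; the function name `read_storyboard` and its error messages are hypothetical, and the tool's own validation may be stricter or more lenient.

```python
import csv
import io


def read_storyboard(handle):
    """Parse a storyboard CSV and return a list of (text, image) pairs."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None or [h.strip().lower() for h in header] != ["text", "image"]:
        raise ValueError("expected header row 'text,image'")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if not row:  # csv.reader yields an empty list for blank lines
            continue
        if len(row) != 2:
            raise ValueError(f"row {line_no}: expected 2 columns, got {len(row)}")
        text, image = (cell.strip() for cell in row)
        if not text or not image:
            raise ValueError(f"row {line_no}: empty text or image reference")
        rows.append((text, image))
    return rows


# For a real file: open("storyboard.csv", newline="", encoding="utf-8")
sample = io.StringIO('text,image\n"Introduction",1_intro.jpg\n"Conclusion",4_outro.jpg\n')
print(read_storyboard(sample))
```

Text that contains commas must be quoted, as in the examples in this README; otherwise the row is read as having more than two columns.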
Image Reference Resolution
Image references are resolved in this order:
- Exact filename match: `image.png` matches `image.png`
- Stem match: `image` matches `image.png`
- Number match: `1` matches `1_xxx.png`, `2` matches `2_yyy.jpg`
- Fallback: Last numbered image in the collection
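The resolution order above can be sketched in a few lines. This is an illustration only, not the project's implementation: `resolve_image` and `leading_number` are hypothetical names, and "last numbered image" is assumed here to mean the file with the highest leading number.

```python
import re
from pathlib import Path
from typing import Optional


def leading_number(name: str) -> Optional[int]:
    """Return the integer a filename starts with, or None if it has none."""
    match = re.match(r"(\d+)", name)
    return int(match.group(1)) if match else None


def resolve_image(ref: str, filenames: list[str]) -> Optional[str]:
    """Resolve a CSV image reference: exact, then stem, then number, then fallback."""
    ref = ref.strip()
    # 1. Exact filename match.
    if ref in filenames:
        return ref
    # 2. Stem match (filename without its extension).
    for name in filenames:
        if Path(name).stem == ref:
            return name
    # 3. Number match: "2" resolves to a file whose name starts with 2.
    if ref.isdigit():
        for name in filenames:
            if leading_number(name) == int(ref):
                return name
    # 4. Fallback: the numbered image with the highest leading number (assumption).
    numbered = [name for name in filenames if leading_number(name) is not None]
    return max(numbered, key=leading_number) if numbered else None


files = ["1_intro.png", "2_basics.jpg", "10_end.png", "logo.png"]
for reference in ("logo.png", "logo", "2", "missing"):
    print(reference, "->", resolve_image(reference, files))
```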
Example CSV
text,image
"Introduction",1_intro.jpg
"Main topic one",2_topic1.png
"Main topic two",3_topic2.jpg
"Conclusion",4_outro.jpg
Text Overlay File Format
Create a text file with one phrase per line:
Welcome
First Point
Second Point
Key Takeaway
Thank You
Each phrase will be matched to the audio and displayed with the selected animation.
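One plausible way to match a phrase against the transcript is to slide a window over the transcribed words and score each window with a string-similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function names, the fixed window size, and the 0.6 threshold are all assumptions made for illustration; the project's actual fuzzy-matching strategy is not documented here.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def find_phrase(phrase, words, threshold=0.6):
    """Return (start, end, score) of the best-matching word window, or None."""
    target = normalize(phrase)
    size = len(target.split())
    if size == 0:
        return None
    best = None
    # Compare the phrase with every run of `size` consecutive transcript words.
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        candidate = normalize(" ".join(w["word"] for w in window))
        score = SequenceMatcher(None, target, candidate).ratio()
        if best is None or score > best[2]:
            best = (window[0]["start"], window[-1]["end"], score)
    return best if best is not None and best[2] >= threshold else None


transcript = [
    {"word": "Welcome", "start": 0.0, "end": 0.4},
    {"word": "to", "start": 0.4, "end": 0.5},
    {"word": "our", "start": 0.5, "end": 0.7},
    {"word": "presentation.", "start": 0.7, "end": 1.5},
    {"word": "First", "start": 1.6, "end": 1.9},
    {"word": "point", "start": 1.9, "end": 2.3},
]
print(find_phrase("First Point", transcript))
```

A threshold below 1.0 tolerates small transcription differences, such as punctuation or a misheard word, at the cost of occasional false matches.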
Animations
Image Animations
| Animation | Description |
|---|---|
| `none` | Static image |
| `zoom_in` | Slow zoom in (1.00x → 1.06x) |
| `zoom_out` | Slow zoom out (1.06x → 1.00x) |
| `fade_in` | Fade in from black |
| `blink` | Subtle brightness pulsing |
| `pulse` | Scale pulsing with sine wave |
| `fade_zoom_in` | Fade in + slow zoom |
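The zoom and pulse animations boil down to a scale factor that changes over the lifetime of a clip. The sketch below shows that arithmetic only. The 1.00x and 1.06x endpoints come from the table above; the linear easing, the pulse amplitude, and the pulse period are assumptions, and none of the function names are taken from the project.

```python
import math


def zoom_in_scale(t, duration, start=1.00, end=1.06):
    """Linearly interpolate the scale factor from `start` to `end` over the clip."""
    progress = min(max(t / duration, 0.0), 1.0) if duration > 0 else 1.0
    return start + (end - start) * progress


def zoom_out_scale(t, duration):
    """Same interpolation as zoom_in_scale, running from 1.06x down to 1.00x."""
    return zoom_in_scale(t, duration, start=1.06, end=1.00)


def pulse_scale(t, amplitude=0.02, period=1.0):
    """Sine-wave scale pulsing around 1.0 (amplitude and period are assumed values)."""
    return 1.0 + amplitude * math.sin(2 * math.pi * t / period)


for t in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"t={t:.1f}s  zoom_in={zoom_in_scale(t, 4.0):.3f}  zoom_out={zoom_out_scale(t, 4.0):.3f}")
```

A function of this shape could be handed to a video library's per-frame resize hook, with `t` being the time in seconds since the clip started.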
Transitions
| Transition | Description |
|---|---|
| `none` | Cut (no transition) |
| `fade` | Crossfade between images |
| `crossfade` | Longer crossfade |
| `slide_left` | Slide in from right |
| `slide_right` | Slide in from left |
| `dip_to_black` | Brief black screen between |
| `flash` | Brief white flash between |
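For readers unfamiliar with these effects, the difference between a crossfade and a dip to black is easiest to see as per-frame weights. The following sketch shows the underlying blending math on single RGB pixels. It is not MoviePy code and not the project's implementation; the function names and the linear ramps are assumptions.

```python
def _progress(t, duration):
    """Clamp elapsed time into a 0.0-1.0 progress value."""
    return min(max(t / duration, 0.0), 1.0) if duration > 0 else 1.0


def crossfade_weights(t, duration):
    """Weights for (outgoing, incoming) images; they always sum to 1."""
    p = _progress(t, duration)
    return 1.0 - p, p


def dip_to_black_brightness(t, duration):
    """Brightness multiplier: 1 at the start, 0 at the midpoint, 1 at the end."""
    return abs(1.0 - 2.0 * _progress(t, duration))


def blend_pixel(outgoing, incoming, t, duration):
    """Crossfade two RGB pixels at time t of the transition."""
    w_out, w_in = crossfade_weights(t, duration)
    return tuple(round(w_out * a + w_in * b) for a, b in zip(outgoing, incoming))


# Halfway through a 0.5 s crossfade, both images contribute equally.
print(blend_pixel((200, 0, 0), (0, 0, 100), t=0.25, duration=0.5))
```

In a crossfade both images are visible at once, while in a dip to black the outgoing image darkens fully before the incoming one brightens.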
Text Animations
| Animation | Description |
|---|---|
| `zoom_in` | Scale up from 0.72x |
| `fade_in` | Fade in |
| `pop_in` | Scale + fade pop effect |
| `pulse` | Continuous pulse |
| `slide_up` | Slide up + fade |
| `glow_pop` | Pop with glow effect |
| `typewriter` | Character-by-character reveal |
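As an example of how a text animation can be driven by time, the sketch below computes the visible portion of a phrase for a typewriter reveal and the scale factor for a zoom that starts at 0.72x. The function names, the linear pacing, and the assumption that the zoom ends at 1.0x are illustrative and not taken from the project.

```python
def typewriter_text(text, t, duration):
    """Return the part of `text` revealed at time t of a typewriter animation."""
    if duration <= 0:
        return text
    progress = min(max(t / duration, 0.0), 1.0)
    return text[: int(len(text) * progress)]


def text_zoom_scale(t, duration, start=0.72, end=1.0):
    """Scale factor for a text zoom-in from 0.72x (end value of 1.0x is assumed)."""
    progress = min(max(t / duration, 0.0), 1.0) if duration > 0 else 1.0
    return start + (end - start) * progress


for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  {typewriter_text('Welcome', t, 1.0)!r}  scale={text_zoom_scale(t, 1.0):.2f}")
```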
Python API
from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig
# Configure pipeline
config = VideoPipelineConfig(
    audio_path="audio.mp3",
    csv_path="mapping.csv",
    input_mode="ZIP",
    zip_path="images.zip",
    output_filename="output.mp4",
    resolution="landscape",
    animation_mode="random",
    transition_mode="fade",
    whisper_model="base"
)
# Run pipeline
pipeline = VideoPipeline(config)
result = pipeline.run()
print(f"Video saved to: {result['output_path']}")
Project Structure
audio-video-generator/
├── pyproject.toml              # Package configuration
├── requirements.txt            # Dependencies
├── README.md                   # This file
└── src/
    └── audio_video_generator/
        ├── __init__.py
        ├── __main__.py         # Module entry point
        ├── cli.py              # CLI implementation
        ├── config.py           # Constants and defaults
        ├── core/
        │   ├── audio.py        # Whisper transcription
        │   ├── alignment.py    # CSV-audio alignment
        │   ├── images.py       # Image processing
        │   ├── pipeline.py     # Main orchestration
        │   ├── text_overlay.py # Text rendering
        │   └── video.py        # Animation/transitions
        ├── utils/
        │   ├── files.py        # File utilities
        │   └── text.py         # Text processing
        └── web/
            └── gradio_ui.py    # Web interface
Troubleshooting
CUDA Out of Memory
Use a smaller Whisper model:
avg generate --whisper-model tiny ...
Or force CPU:
CUDA_VISIBLE_DEVICES="" avg generate ...
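The inline `VAR="" command` syntax above is specific to POSIX shells. When launching the CLI from Python, or on Windows, the same effect can be had by passing a modified environment to the child process. The helper name `cpu_only_env` is hypothetical.

```python
import os


def cpu_only_env(base=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base is None else base)
    # An empty value hides every GPU from CUDA, so the child process runs on CPU.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


print(cpu_only_env({"PATH": "/usr/bin"}))
# Usage with the CLI:
#   import subprocess
#   subprocess.run(["avg", "generate", "-a", "audio.mp3", "-c", "storyboard.csv",
#                   "-i", "images.zip"], env=cpu_only_env(), check=True)
```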
Images Not Found
Ensure image filenames start with numbers (e.g., 1_image.png). The tool uses numbers for fallback resolution.
CSV Alignment Failing
Check that:
- CSV has exactly 2 columns: `text` and `image`
- Text content matches what's spoken in the audio
- File encoding is UTF-8
FFmpeg Errors
Ensure FFmpeg is installed and available in PATH:
ffmpeg -version
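The same check can be scripted, which is handy at the start of a batch job. This sketch uses only the standard library; the function name `ffmpeg_version` is hypothetical, and it returns `None` whenever FFmpeg cannot be found or run.

```python
import shutil
import subprocess


def ffmpeg_version(executable="ffmpeg"):
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is unavailable."""
    path = shutil.which(executable)  # searches PATH like the shell does
    if path is None:
        return None
    try:
        completed = subprocess.run([path, "-version"], capture_output=True,
                                   text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if completed.returncode != 0:
        return None
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None


print(ffmpeg_version() or "FFmpeg not found on PATH")
```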
License
MIT License
Acknowledgments
- OpenAI Whisper for transcription
- MoviePy for video processing
- Gradio for web interface