---
title: Audio Video Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: false
license: mit
---
# Audio Video Generator
Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays.
## Features
- **Audio Transcription**: Uses OpenAI Whisper for accurate word-level timestamps
- **CSV Alignment**: Maps text phrases to images with fuzzy matching
- **Image Animations**: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
- **Transitions**: fade, crossfade, slide_left, slide_right, dip_to_black, flash
- **Text Overlays**: Synchronized text with multiple animations and styles
- **Flexible Input**: Accept ZIP files or directories of images
- **Multiple Resolutions**: Landscape (1280x720), Portrait (720x1280), Square (1080x1080)
- **CLI & Web UI**: Use command-line or Gradio web interface
## Installation
```bash
# Clone or download the repository
cd audio-video-generator
# Install dependencies
pip install -e .
# Or install from requirements.txt
pip install -r requirements.txt
```
### System Requirements
- Python 3.9+
- FFmpeg (required by moviepy)
- CUDA (optional, for GPU acceleration)
#### Installing FFmpeg
**macOS:**
```bash
brew install ffmpeg
```
**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install ffmpeg
```
**Windows:**
Download a build from https://ffmpeg.org/download.html and add the folder containing `ffmpeg.exe` to your PATH.
## Quick Start
### 1. Prepare Your Files
**Audio file**: Any common format (MP3, WAV, M4A, etc.)
**CSV mapping file** (`storyboard.csv`):
```csv
text,image
"Welcome to our presentation",1_intro.png
"First we discuss the basics",2_basics.jpg
"Then we explore advanced topics",3_advanced.png
"Thank you for watching",4_thanks.png
```
**Images**: Name files with numbers for automatic ordering (e.g., `1_intro.png`, `2_basics.jpg`)
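
The numeric prefix matters because a plain alphabetical sort would place `10_end.png` before `2_basics.jpg`. The sketch below shows one way to order files by their leading number; `leading_number` and `order_images` are hypothetical helpers written for illustration, not functions from this package, and the tool's own ordering logic may differ.

```python
import re

def leading_number(filename: str) -> int:
    """Return the integer prefix of a filename, or a large sentinel if there is none."""
    match = re.match(r"(\d+)", filename)
    return int(match.group(1)) if match else 10**9

def order_images(filenames):
    # Sort numerically first, then by name, so "10_end.png" follows "2_basics.jpg"
    # and unnumbered files land at the end.
    return sorted(filenames, key=lambda name: (leading_number(name), name))

ordered = order_images(["10_end.png", "2_basics.jpg", "1_intro.png", "cover.png"])
print(ordered)
```

Placing unnumbered files last is an assumption made here for the example.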
### 2. Generate Video (CLI)
```bash
# Basic usage with ZIP file
avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4
# With manual image directory
avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL
# With custom animations
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --animation-mode single --animation zoom_in \
  --transition-mode single --transition fade
# With text overlay
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --txt-overlay phrases.txt --text-color "#FFFFFF"
```
### 3. Launch Web UI
```bash
# Start local web interface
avg web
# With public link (for sharing)
avg web --share
# Custom port
avg web --port 8080
```
## CLI Reference
### `avg generate`
Generate a synchronized video.
**Required Arguments:**
- `-a, --audio`: Path to audio file
- `-c, --csv`: Path to CSV mapping file
- `-i, --images`: Path to images (ZIP or directory)
**Optional Arguments:**
| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output` | Output filename | `output.mp4` |
| `-r, --resolution` | Resolution preset (`landscape`, `portrait`, `square`) | `landscape` |
| `--animation-mode` | Animation selection (`random`, `single`, `custom`) | `random` |
| `--animation` | Single animation to use | - |
| `--animations` | Custom animation list (can specify multiple) | - |
| `--transition-mode` | Transition selection (`random`, `single`, `custom`) | `random` |
| `--transition` | Single transition to use | - |
| `--transitions` | Custom transition list (can specify multiple) | - |
| `--txt-overlay` | Path to text overlay file | - |
| `--font-size` | Text overlay font size | `56` |
| `--text-color` | Text color (hex) | `#FFFFFF` |
| `--text-pos-x` | Horizontal position (0.0-1.0) | `0.5` |
| `--text-pos-y` | Vertical position (0.0-1.0) | `0.5` |
| `--whisper-model` | Whisper model (`tiny`, `base`, `small`, `medium`, `large`) | `base` |
| `--fps` | Output framerate | `24` |
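
The `--text-pos-x` and `--text-pos-y` values are fractions of the frame, so the same setting lands on different pixels in each resolution preset. The following sketch converts a normalized position to pixel coordinates using the preset sizes listed under Features. `text_anchor_px` is a hypothetical helper, and whether the tool treats the point as the centre of the text or as a corner is an assumption not confirmed by this README.

```python
# Preset sizes as listed in the Features section (width, height).
RESOLUTIONS = {
    "landscape": (1280, 720),
    "portrait": (720, 1280),
    "square": (1080, 1080),
}

def text_anchor_px(resolution: str, pos_x: float = 0.5, pos_y: float = 0.5):
    """Map normalized (0.0-1.0) coordinates to pixel coordinates for a preset."""
    if not (0.0 <= pos_x <= 1.0 and 0.0 <= pos_y <= 1.0):
        raise ValueError("positions must be within 0.0-1.0")
    width, height = RESOLUTIONS[resolution]
    return round(pos_x * width), round(pos_y * height)

print(text_anchor_px("landscape"))           # defaults: middle of the frame
print(text_anchor_px("portrait", 0.5, 0.9))  # lower part of a portrait frame
```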
### `avg web`
Launch Gradio web interface.
| Option | Description | Default |
|--------|-------------|---------|
| `--host` | Host to bind to | `127.0.0.1` |
| `--port` | Port to listen on | `7860` |
| `--share` | Create public shareable link | `False` |
### `avg models`
List available Whisper models with size and speed info.
### `avg animations`
List available animations and transitions.
## CSV File Format
The CSV file must have exactly **2 columns**:
1. `text`: The text content that appears in the audio
2. `image`: Reference to the image file
### Image Reference Resolution
Image references are resolved in this order:
1. **Exact filename match**: `image.png` matches `image.png`
2. **Stem match**: `image` matches `image.png`
3. **Number match**: `1` matches `1_xxx.png`, `2` matches `2_yyy.jpg`
4. **Fallback**: Last numbered image in the collection
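
The resolution order above can be sketched roughly as follows. `resolve_image` is a hypothetical function written to illustrate the four steps, not the package's actual implementation, and it assumes that "last numbered image" means the file with the highest leading number.

```python
import re
from pathlib import Path

def _leading_number(name):
    """Return the leading integer of a filename, or None if it has none."""
    match = re.match(r"(\d+)", name)
    return int(match.group(1)) if match else None

def resolve_image(reference, filenames):
    ref = reference.strip()
    # 1. Exact filename match
    if ref in filenames:
        return ref
    # 2. Stem match: "image" matches "image.png"
    for name in filenames:
        if Path(name).stem == ref:
            return name
    # 3. Number match: "1" matches "1_xxx.png"
    if ref.isdigit():
        for name in filenames:
            if _leading_number(name) == int(ref):
                return name
    # 4. Fallback: the numbered image with the highest number (assumption)
    numbered = [(n, name) for name in filenames
                if (n := _leading_number(name)) is not None]
    return max(numbered)[1] if numbered else None

files = ["1_intro.png", "2_basics.jpg", "image.png"]
for ref in ("image.png", "image", "2", "missing"):
    print(ref, "->", resolve_image(ref, files))
```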
### Example CSV
```csv
text,image
"Introduction",1_intro.jpg
"Main topic one",2_topic1.png
"Main topic two",3_topic2.jpg
"Conclusion",4_outro.jpg
```
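
Before running a long transcription, it can help to check the storyboard file yourself. The sketch below reads a CSV with Python's standard `csv` module and rejects anything that is not exactly two columns named `text` and `image`. `load_storyboard` is a hypothetical helper, and the package may validate its input differently.

```python
import csv
import io

def load_storyboard(stream):
    """Return a list of (text, image) pairs, raising ValueError on a malformed file."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None or [h.strip().lower() for h in header] != ["text", "image"]:
        raise ValueError("expected header: text,image")
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {reader.line_num}: expected 2 columns, got {len(row)}"
            )
        rows.append((row[0].strip(), row[1].strip()))
    return rows

sample = io.StringIO('text,image\n"Introduction",1_intro.jpg\n"Conclusion",4_outro.jpg\n')
rows = load_storyboard(sample)
print(rows)
```

For a file on disk, open it with `open(path, encoding="utf-8", newline="")` so that a wrong encoding fails early and quoted line breaks are parsed correctly.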
## Text Overlay File Format
Create a text file with one phrase per line:
```
Welcome
First Point
Second Point
Key Takeaway
Thank You
```
Each phrase will be matched to the audio and displayed with the selected animation.
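
To give an intuition for how a phrase might be located in the audio, here is a minimal sketch that slides a window over word-level timestamps and keeps the best fuzzy match. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; this README does not say which matcher the tool uses. `find_phrase`, the `(word, start, end)` tuple layout, and the `min_ratio` threshold are all assumptions.

```python
from difflib import SequenceMatcher

def find_phrase(phrase, words, min_ratio=0.6):
    """Locate `phrase` in `words`, a list of (word, start_sec, end_sec) tuples.

    Returns (start_sec, end_sec, ratio) for the best window, or None if no
    window reaches `min_ratio`.
    """
    target = phrase.lower().split()
    n = len(target)
    if n == 0 or len(words) < n:
        return None
    target_text = " ".join(target)
    best = None
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        window_text = " ".join(w.lower() for w, _, _ in window)
        ratio = SequenceMatcher(None, target_text, window_text).ratio()
        if best is None or ratio > best[2]:
            best = (window[0][1], window[-1][2], ratio)
    return best if best[2] >= min_ratio else None

words = [
    ("welcome", 0.0, 0.4), ("to", 0.4, 0.5), ("our", 0.5, 0.7),
    ("presentation", 0.7, 1.5), ("first", 1.8, 2.1), ("point", 2.1, 2.5),
]
print(find_phrase("First Point", words))
```

A fixed-size window is the simplest choice. It does not handle words that the transcription dropped or split, which a production matcher would need to tolerate.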
## Animations
### Image Animations
| Animation | Description |
|-----------|-------------|
| `none` | Static image |
| `zoom_in` | Slow zoom in (1.00x → 1.06x) |
| `zoom_out` | Slow zoom out (1.06x → 1.00x) |
| `fade_in` | Fade in from black |
| `blink` | Subtle brightness pulsing |
| `pulse` | Scale pulsing with sine wave |
| `fade_zoom_in` | Fade in + slow zoom |
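
As an illustration of the zoom range in the table, the sketch below computes a scale factor for a given moment within a clip. `zoom_scale` is a hypothetical helper, and linear interpolation is an assumption; the tool may apply a different easing curve.

```python
def zoom_scale(t, duration, start=1.00, end=1.06):
    """Scale factor at time `t` (seconds) for a clip lasting `duration` seconds."""
    if duration <= 0:
        return end
    # Clamp progress to [0, 1] so times outside the clip hold the end values.
    progress = min(max(t / duration, 0.0), 1.0)
    return start + (end - start) * progress

# zoom_in over a 4-second clip
print([round(zoom_scale(t, 4.0), 3) for t in (0.0, 2.0, 4.0)])
# zoom_out is the same ramp with the endpoints swapped
print([round(zoom_scale(t, 4.0, start=1.06, end=1.00), 3) for t in (0.0, 2.0, 4.0)])
```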
### Transitions
| Transition | Description |
|------------|-------------|
| `none` | Cut (no transition) |
| `fade` | Crossfade between images |
| `crossfade` | Longer crossfade |
| `slide_left` | Slide in from right |
| `slide_right` | Slide in from left |
| `dip_to_black` | Brief black screen between |
| `flash` | Brief white flash between |
### Text Animations
| Animation | Description |
|-----------|-------------|
| `zoom_in` | Scale up from 0.72x |
| `fade_in` | Fade in |
| `pop_in` | Scale + fade pop effect |
| `pulse` | Continuous pulse |
| `slide_up` | Slide up + fade |
| `glow_pop` | Pop with glow effect |
| `typewriter` | Character-by-character reveal |
## Python API
```python
from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig
# Configure pipeline
config = VideoPipelineConfig(
    audio_path="audio.mp3",
    csv_path="mapping.csv",
    input_mode="ZIP",
    zip_path="images.zip",
    output_filename="output.mp4",
    resolution="landscape",
    animation_mode="random",
    transition_mode="random",  # a mode, as in the CLI: random, single, or custom
    whisper_model="base",
)
# Run pipeline
pipeline = VideoPipeline(config)
result = pipeline.run()
print(f"Video saved to: {result['output_path']}")
```
## Project Structure
```
audio-video-generator/
├── pyproject.toml          # Package configuration
├── requirements.txt        # Dependencies
├── README.md               # This file
└── src/
    └── audio_video_generator/
        ├── __init__.py
        ├── __main__.py     # Module entry point
        ├── cli.py          # CLI implementation
        ├── config.py       # Constants and defaults
        ├── core/
        │   ├── audio.py        # Whisper transcription
        │   ├── alignment.py    # CSV-audio alignment
        │   ├── images.py       # Image processing
        │   ├── pipeline.py     # Main orchestration
        │   ├── text_overlay.py # Text rendering
        │   └── video.py        # Animation/transitions
        ├── utils/
        │   ├── files.py        # File utilities
        │   └── text.py         # Text processing
        └── web/
            └── gradio_ui.py    # Web interface
```
## Troubleshooting
### CUDA Out of Memory
Use a smaller Whisper model:
```bash
avg generate --whisper-model tiny ...
```
Or force CPU:
```bash
CUDA_VISIBLE_DEVICES="" avg generate ...
```
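
When calling the Python API instead of the CLI, the same effect can usually be had by setting the variable from Python. This is a minimal sketch; it relies on the general behaviour of `CUDA_VISIBLE_DEVICES`, which must be set before CUDA is initialised, so the safest place is before any import that pulls in PyTorch.

```python
import os

# Hide all GPUs from CUDA. Set this before importing torch, whisper, or the
# pipeline module, so that CUDA has not been initialised yet.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Only now import and run the pipeline, for example:
# from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig
print("CUDA_VISIBLE_DEVICES =", repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```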
### Images Not Found
Ensure image filenames start with numbers (e.g., `1_image.png`). When a CSV reference matches no filename or stem, the tool falls back to these leading numbers to pick an image.
### CSV Alignment Failing
Check that:
1. CSV has exactly 2 columns: `text` and `image`
2. Text content matches what's spoken in the audio
3. File encoding is UTF-8
### FFmpeg Errors
Ensure FFmpeg is installed and available in PATH:
```bash
ffmpeg -version
```
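
The same check can be done from Python, for instance at the start of a script that uses the API. `ffmpeg_version` is a hypothetical helper built only on the standard library; it reports what is on PATH and does not inspect how MoviePy itself locates FFmpeg.

```python
import shutil
import subprocess

def ffmpeg_version():
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is not on PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None

version = ffmpeg_version()
print(version if version else "ffmpeg not found on PATH")
```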
## License
MIT License
## Acknowledgments
- [OpenAI Whisper](https://github.com/openai/whisper) for transcription
- [MoviePy](https://zulko.github.io/moviepy/) for video processing
- [Gradio](https://gradio.app/) for web interface