---
title: Audio Video Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: false
license: mit
---

# Audio Video Generator

Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays.

## Features

- **Audio Transcription**: Uses OpenAI Whisper for accurate word-level timestamps
- **CSV Alignment**: Maps text phrases to images with fuzzy matching
- **Image Animations**: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
- **Transitions**: fade, crossfade, slide_left, slide_right, dip_to_black, flash
- **Text Overlays**: Synchronized text with multiple animations and styles
- **Flexible Input**: Accept ZIP files or directories of images
- **Multiple Resolutions**: Landscape (1280x720), Portrait (720x1280), Square (1080x1080)
- **CLI & Web UI**: Use command-line or Gradio web interface

## Installation

```bash
# Clone or download the repository
cd audio-video-generator

# Install dependencies
pip install -e .

# Or install from requirements.txt
pip install -r requirements.txt
```

### System Requirements

- Python 3.9+
- FFmpeg (required by MoviePy)
- CUDA (optional, for GPU acceleration)

#### Installing FFmpeg

**macOS:**
```bash
brew install ffmpeg
```

**Ubuntu/Debian:**
```bash
sudo apt-get update
sudo apt-get install ffmpeg
```

**Windows:**
Download a build from https://ffmpeg.org/download.html and add the folder containing `ffmpeg.exe` to your `PATH`.

## Quick Start

### 1. Prepare Your Files

**Audio file**: Any common format (MP3, WAV, M4A, etc.)

**CSV mapping file** (`storyboard.csv`):
```csv
text,image
"Welcome to our presentation",1_intro.png
"First we discuss the basics",2_basics.jpg
"Then we explore advanced topics",3_advanced.png
"Thank you for watching",4_thanks.png
```

**Images**: Name files with numbers for automatic ordering (e.g., `1_intro.png`, `2_basics.jpg`)
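
As an illustration of why numeric prefixes matter, here is a minimal sketch of ordering files by their leading number rather than alphabetically. The helper names `leading_number` and `order_images` are hypothetical and are not part of the tool; the tool's real ordering logic may differ.

```python
import re
from typing import List, Optional


def leading_number(filename: str) -> Optional[int]:
    """Return the integer prefix of a filename, or None if it has none."""
    match = re.match(r"(\d+)", filename)
    return int(match.group(1)) if match else None


def order_images(filenames: List[str]) -> List[str]:
    """Sort numbered files numerically; unnumbered files go last, by name."""
    def sort_key(name: str):
        number = leading_number(name)
        return (number is None, number if number is not None else 0, name)

    return sorted(filenames, key=sort_key)


print(order_images(["10_end.png", "2_basics.jpg", "1_intro.png", "cover.png"]))
# → ['1_intro.png', '2_basics.jpg', '10_end.png', 'cover.png']
```

A plain alphabetical sort would place `10_end.png` before `2_basics.jpg`, which is why the sketch compares the prefixes as integers.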

### 2. Generate Video (CLI)

```bash
# Basic usage with ZIP file
avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4

# With an image directory instead of a ZIP (MANUAL input mode)
avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL

# With custom animations
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --animation-mode single --animation zoom_in \
  --transition-mode single --transition fade

# With text overlay
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --txt-overlay phrases.txt --text-color "#FFFFFF"
```

### 3. Launch Web UI

```bash
# Start local web interface
avg web

# With public link (for sharing)
avg web --share

# Custom port
avg web --port 8080
```

## CLI Reference

### `avg generate`

Generate a synchronized video.

**Required Arguments:**
- `-a, --audio`: Path to audio file
- `-c, --csv`: Path to CSV mapping file
- `-i, --images`: Path to images (ZIP or directory)

**Optional Arguments:**

| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output` | Output filename | `output.mp4` |
| `-r, --resolution` | Resolution preset (`landscape`, `portrait`, `square`) | `landscape` |
| `--animation-mode` | Animation selection (`random`, `single`, `custom`) | `random` |
| `--animation` | Animation to use when `--animation-mode` is `single` | - |
| `--animations` | Animation list to use when `--animation-mode` is `custom` (can specify multiple) | - |
| `--transition-mode` | Transition selection (`random`, `single`, `custom`) | `random` |
| `--transition` | Transition to use when `--transition-mode` is `single` | - |
| `--transitions` | Transition list to use when `--transition-mode` is `custom` (can specify multiple) | - |
| `--txt-overlay` | Path to text overlay file | - |
| `--font-size` | Text overlay font size | `56` |
| `--text-color` | Text color (hex) | `#FFFFFF` |
| `--text-pos-x` | Horizontal position (0.0-1.0) | `0.5` |
| `--text-pos-y` | Vertical position (0.0-1.0) | `0.5` |
| `--whisper-model` | Whisper model (`tiny`, `base`, `small`, `medium`, `large`) | `base` |
| `--fps` | Output framerate | `24` |
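
To make the fractional `--text-pos-x` / `--text-pos-y` values concrete, the sketch below converts them to pixel coordinates for each resolution preset. The preset sizes come from the Features list. The function name `text_anchor` is hypothetical, and it assumes the fractions locate the overlay's anchor point relative to the frame width and height, which this README does not specify.

```python
from typing import Dict, Tuple

# Preset sizes as listed under Features (width, height).
RESOLUTIONS: Dict[str, Tuple[int, int]] = {
    "landscape": (1280, 720),
    "portrait": (720, 1280),
    "square": (1080, 1080),
}


def text_anchor(resolution: str, pos_x: float = 0.5, pos_y: float = 0.5) -> Tuple[int, int]:
    """Map fractional positions (0.0-1.0) to pixel coordinates.

    Assumption: the fractions are relative to the full frame size.
    """
    if not (0.0 <= pos_x <= 1.0 and 0.0 <= pos_y <= 1.0):
        raise ValueError("positions must be between 0.0 and 1.0")
    width, height = RESOLUTIONS[resolution]
    return round(pos_x * width), round(pos_y * height)


print(text_anchor("landscape"))           # → (640, 360)
print(text_anchor("portrait", 0.5, 0.9))  # → (360, 1152)
```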

### `avg web`

Launch Gradio web interface.

| Option | Description | Default |
|--------|-------------|---------|
| `--host` | Host to bind to | `127.0.0.1` |
| `--port` | Port to listen on | `7860` |
| `--share` | Create public shareable link | `False` |

### `avg models`

List available Whisper models with size and speed info.

### `avg animations`

List available animations and transitions.

## CSV File Format

The CSV file must have exactly **2 columns**:
1. `text`: The text content that appears in the audio
2. `image`: Reference to the image file
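
The two-column rule can be checked before running the tool. The following is a minimal sketch using Python's standard `csv` module; `load_storyboard` is a hypothetical helper, not a function exported by this package, and the tool's own validation may behave differently.

```python
import csv
import io
from typing import List, Tuple


def load_storyboard(csv_text: str) -> List[Tuple[str, str]]:
    """Parse storyboard CSV text and enforce the `text,image` layout."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None or [h.strip().lower() for h in header] != ["text", "image"]:
        raise ValueError("header must be exactly: text,image")

    rows: List[Tuple[str, str]] = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {reader.line_num}: expected 2 columns, got {len(row)}"
            )
        rows.append((row[0].strip(), row[1].strip()))
    return rows


sample = 'text,image\n"Welcome, everyone",1_intro.png\n'
print(load_storyboard(sample))
# → [('Welcome, everyone', '1_intro.png')]
```

A phrase that contains a comma must be quoted, as in the sample; otherwise it is parsed as three columns and rejected. When reading from disk, open the file with `encoding="utf-8"` and `newline=""` and pass its contents to the helper.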

### Image Reference Resolution

Image references are resolved in this order:
1. **Exact filename match**: `image.png` matches `image.png`
2. **Stem match**: `image` matches `image.png`
3. **Number match**: `1` matches `1_xxx.png`, `2` matches `2_yyy.jpg`
4. **Fallback**: If nothing else matches, the last numbered image in the collection is used
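
The resolution order above can be sketched roughly as follows. This is an illustrative reimplementation, not the tool's actual code: `resolve_image` is a hypothetical name, and "last numbered image" is assumed here to mean the file with the highest leading number.

```python
import re
from pathlib import Path
from typing import List, Optional


def _leading_number(name: str) -> Optional[int]:
    match = re.match(r"(\d+)", name)
    return int(match.group(1)) if match else None


def resolve_image(ref: str, filenames: List[str]) -> Optional[str]:
    """Resolve a CSV image reference against the available filenames."""
    # 1. Exact filename match
    if ref in filenames:
        return ref
    # 2. Stem match (reference given without extension)
    for name in filenames:
        if Path(name).stem == ref:
            return name
    # 3. Number match (reference is just the numeric prefix)
    if ref.isdigit():
        for name in filenames:
            if _leading_number(name) == int(ref):
                return name
    # 4. Fallback (assumption: highest-numbered file counts as "last")
    numbered = [n for n in filenames if _leading_number(n) is not None]
    if numbered:
        return max(numbered, key=_leading_number)
    return None


files = ["1_intro.png", "2_basics.jpg", "logo.png"]
print(resolve_image("logo", files))     # → logo.png
print(resolve_image("2", files))        # → 2_basics.jpg
print(resolve_image("missing", files))  # → 2_basics.jpg
```

Note that in this sketch a mistyped reference silently falls through to step 4, so an unexpected image in the output is often a sign of a reference that matched nothing.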

### Example CSV

```csv
text,image
"Introduction",1_intro.jpg
"Main topic one",2_topic1.png
"Main topic two",3_topic2.jpg
"Conclusion",4_outro.jpg
```

## Text Overlay File Format

Create a text file with one phrase per line:

```
Welcome
First Point
Second Point
Key Takeaway
Thank You
```

Each phrase is matched against the transcribed audio and displayed in sync with it, using the selected text animation.
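
To give an idea of how a phrase might be located in word-level timestamps, here is a small sketch that slides a window over the transcript and scores each window with `difflib.SequenceMatcher` from the standard library. Everything here is an assumption for illustration: the `(word, start, end)` tuple layout, the `find_phrase` helper, and the 0.8 threshold are hypothetical, and the tool's actual matching algorithm may differ.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (word, start_seconds, end_seconds), assumed layout


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()


def find_phrase(
    phrase: str, words: List[Word], threshold: float = 0.8
) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the best-matching window, or None if too dissimilar."""
    target = normalize(phrase)
    size = len(target)
    if size == 0 or len(words) < size:
        return None
    target_text = " ".join(target)

    best_score, best_span = 0.0, None
    for i in range(len(words) - size + 1):
        window = words[i : i + size]
        candidate = " ".join(normalize(" ".join(w[0] for w in window)))
        score = SequenceMatcher(None, target_text, candidate).ratio()
        if score > best_score:
            best_score = score
            best_span = (window[0][1], window[-1][2])
    return best_span if best_score >= threshold else None


transcript = [
    ("Welcome", 0.0, 0.4), ("to", 0.4, 0.5), ("our", 0.5, 0.7),
    ("presentation.", 0.7, 1.5), ("First", 1.8, 2.1), ("point", 2.1, 2.5),
]
print(find_phrase("First Point", transcript))  # → (1.8, 2.5)
print(find_phrase("Thank You", transcript))    # → None
```

A fixed-size window is the simplest choice; it tolerates small spelling differences but not inserted or dropped words, so keeping overlay phrases close to the spoken wording helps.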

## Animations

### Image Animations

| Animation | Description |
|-----------|-------------|
| `none` | Static image |
| `zoom_in` | Slow zoom in (1.00x → 1.06x) |
| `zoom_out` | Slow zoom out (1.06x → 1.00x) |
| `fade_in` | Fade in from black |
| `blink` | Subtle brightness pulsing |
| `pulse` | Scale pulsing with sine wave |
| `fade_zoom_in` | Fade in + slow zoom |
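
The zoom ranges in the table can be read as a scale factor that changes over the time an image is on screen. The sketch below assumes linear interpolation between the start and end scale; the easing curve the tool actually applies is not documented here, and `zoom_scale` is a hypothetical helper.

```python
def zoom_scale(t: float, duration: float, start: float = 1.00, end: float = 1.06) -> float:
    """Scale factor at time t for a clip of the given duration (linear, assumed)."""
    if duration <= 0:
        return end
    progress = min(max(t / duration, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * progress


# zoom_in over a 4-second image: 1.00 at the start, about 1.03 halfway, about 1.06 at the end
for t in (0.0, 2.0, 4.0):
    print(t, round(zoom_scale(t, 4.0), 3))

# zoom_out is the same ramp with the endpoints swapped
print(round(zoom_scale(2.0, 4.0, start=1.06, end=1.00), 3))
```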

### Transitions

| Transition | Description |
|------------|-------------|
| `none` | Cut (no transition) |
| `fade` | Short crossfade between images |
| `crossfade` | Longer crossfade |
| `slide_left` | Slide in from right |
| `slide_right` | Slide in from left |
| `dip_to_black` | Brief black screen between |
| `flash` | Brief white flash between |

### Text Animations

| Animation | Description |
|-----------|-------------|
| `zoom_in` | Scale up from 0.72x |
| `fade_in` | Fade in |
| `pop_in` | Scale + fade pop effect |
| `pulse` | Continuous pulse |
| `slide_up` | Slide up + fade |
| `glow_pop` | Pop with glow effect |
| `typewriter` | Character-by-character reveal |

## Python API

```python
from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig

# Configure pipeline
config = VideoPipelineConfig(
    audio_path="audio.mp3",
    csv_path="mapping.csv",
    input_mode="ZIP",
    zip_path="images.zip",
    output_filename="output.mp4",
    resolution="landscape",
    animation_mode="random",
    transition_mode="random",  # a mode ("random", "single", "custom"), not a transition name
    whisper_model="base"
)

# Run pipeline
pipeline = VideoPipeline(config)
result = pipeline.run()

print(f"Video saved to: {result['output_path']}")
```

## Project Structure

```
audio-video-generator/
├── pyproject.toml          # Package configuration
├── requirements.txt        # Dependencies
├── README.md              # This file
└── src/
    └── audio_video_generator/
        ├── __init__.py
        ├── __main__.py      # Module entry point
        ├── cli.py           # CLI implementation
        ├── config.py        # Constants and defaults
        ├── core/
        │   ├── audio.py     # Whisper transcription
        │   ├── alignment.py # CSV-audio alignment
        │   ├── images.py    # Image processing
        │   ├── pipeline.py  # Main orchestration
        │   ├── text_overlay.py  # Text rendering
        │   └── video.py     # Animation/transitions
        ├── utils/
        │   ├── files.py     # File utilities
        │   └── text.py      # Text processing
        └── web/
            └── gradio_ui.py # Web interface
```

## Troubleshooting

### CUDA Out of Memory

Use a smaller Whisper model:
```bash
avg generate --whisper-model tiny ...
```

Or force CPU:
```bash
CUDA_VISIBLE_DEVICES="" avg generate ...
```

### Images Not Found

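Check that each `image` value in the CSV matches a file by exact name, stem, or leading number (see Image Reference Resolution). Starting filenames with a number (e.g., `1_image.png`) lets number matching and the fallback work.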

### CSV Alignment Failing

Check that:
1. CSV has exactly 2 columns: `text` and `image`
2. Text content matches what's spoken in the audio
3. File encoding is UTF-8

### FFmpeg Errors

Ensure FFmpeg is installed and available in PATH:
```bash
ffmpeg -version
```
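
The same check can be done from Python, for example before starting a long job. This is a minimal sketch using only the standard library; `ffmpeg_version` is a hypothetical helper and not part of this package.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version(executable: str = "ffmpeg") -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print(ffmpeg_version() or "FFmpeg not found on PATH")
```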

## License

MIT License

## Acknowledgments

- [OpenAI Whisper](https://github.com/openai/whisper) for transcription
- [MoviePy](https://zulko.github.io/moviepy/) for video processing
- [Gradio](https://gradio.app/) for web interface