metadata
title: Audio Video Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: false
license: mit

Audio Video Generator

Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays.

Features

  • Audio Transcription: Uses OpenAI Whisper for accurate word-level timestamps
  • CSV Alignment: Maps text phrases to images with fuzzy matching
  • Image Animations: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
  • Transitions: fade, crossfade, slide_left, slide_right, dip_to_black, flash
  • Text Overlays: Synchronized text with multiple animations and styles
  • Flexible Input: Accepts ZIP files or directories of images
  • Multiple Resolutions: Landscape (1280x720), Portrait (720x1280), Square (1080x1080)
  • CLI & Web UI: Use the command line or the Gradio web interface

Installation

# Clone or download the repository
cd audio-video-generator

# Install dependencies
pip install -e .

# Or install from requirements.txt
pip install -r requirements.txt

System Requirements

  • Python 3.9+
  • FFmpeg (required by moviepy)
  • CUDA (optional, for GPU acceleration)

Installing FFmpeg

macOS:

brew install ffmpeg

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install ffmpeg

Windows: Download a build from https://ffmpeg.org/download.html and add the folder containing ffmpeg.exe to PATH

Quick Start

1. Prepare Your Files

Audio file: Any common format (MP3, WAV, M4A, etc.)

CSV mapping file (storyboard.csv):

text,image
"Welcome to our presentation",1_intro.png
"First we discuss the basics",2_basics.jpg
"Then we explore advanced topics",3_advanced.png
"Thank you for watching",4_thanks.png

Images: Name files with numbers for automatic ordering (e.g., 1_intro.png, 2_basics.jpg)

2. Generate Video (CLI)

# Basic usage with ZIP file
avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4

# With manual image directory
avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL

# With custom animations
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --animation-mode single --animation zoom_in \
  --transition-mode single --transition fade

# With text overlay
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --txt-overlay phrases.txt --text-color "#FFFFFF"

3. Launch Web UI

# Start local web interface
avg web

# With public link (for sharing)
avg web --share

# Custom port
avg web --port 8080

CLI Reference

avg generate

Generate a synchronized video.

Required Arguments:

  • -a, --audio: Path to audio file
  • -c, --csv: Path to CSV mapping file
  • -i, --images: Path to images (ZIP or directory)

Optional Arguments:

| Option | Description | Default |
| --- | --- | --- |
| -o, --output | Output filename | output.mp4 |
| -r, --resolution | Resolution preset (landscape, portrait, square) | landscape |
| --animation-mode | Animation selection (random, single, custom) | random |
| --animation | Single animation to use | - |
| --animations | Custom animation list (can specify multiple) | - |
| --transition-mode | Transition selection (random, single, custom) | random |
| --transition | Single transition to use | - |
| --transitions | Custom transition list (can specify multiple) | - |
| --txt-overlay | Path to text overlay file | - |
| --font-size | Text overlay font size | 56 |
| --text-color | Text color (hex) | #FFFFFF |
| --text-pos-x | Horizontal position (0.0-1.0) | 0.5 |
| --text-pos-y | Vertical position (0.0-1.0) | 0.5 |
| --whisper-model | Whisper model (tiny, base, small, medium, large) | base |
| --fps | Output framerate | 24 |
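
The --text-pos-x and --text-pos-y values are fractions of the frame size. As a rough illustration of how such values could map to pixels, the sketch below multiplies them by the frame dimensions of each resolution preset. The helper name text_anchor_px is hypothetical, and whether the tool treats the resulting point as the center or a corner of the text is an assumption this README does not settle.

```python
# Frame sizes of the resolution presets listed under Features.
RESOLUTIONS = {
    "landscape": (1280, 720),
    "portrait": (720, 1280),
    "square": (1080, 1080),
}


def text_anchor_px(resolution: str, pos_x: float = 0.5, pos_y: float = 0.5) -> tuple:
    """Convert normalized (0.0-1.0) positions to pixel coordinates.

    Hypothetical helper for illustration; not part of the package API.
    """
    if not (0.0 <= pos_x <= 1.0 and 0.0 <= pos_y <= 1.0):
        raise ValueError("positions must be between 0.0 and 1.0")
    width, height = RESOLUTIONS[resolution]
    return round(pos_x * width), round(pos_y * height)


if __name__ == "__main__":
    print(text_anchor_px("landscape"))           # defaults: middle of the frame
    print(text_anchor_px("portrait", 0.5, 0.9))  # near the bottom edge
```

With the defaults of 0.5 and 0.5, the point lands in the middle of the frame for every preset.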

avg web

Launch Gradio web interface.

| Option | Description | Default |
| --- | --- | --- |
| --host | Host to bind to | 127.0.0.1 |
| --port | Port to listen on | 7860 |
| --share | Create public shareable link | False |

avg models

List available Whisper models with size and speed info.

avg animations

List available animations and transitions.

CSV File Format

The CSV file must have exactly 2 columns:

  1. text: The text content that appears in the audio
  2. image: Reference to the image file

Image Reference Resolution

Image references are resolved in this order:

  1. Exact filename match: image.png matches image.png
  2. Stem match: image matches image.png
  3. Number match: 1 matches 1_xxx.png, 2 matches 2_yyy.jpg
  4. Fallback: Last numbered image in the collection
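
The resolution order above can be sketched as a small function. This is an illustration under stated assumptions, not the package's actual code: the names resolve_image and leading_number are hypothetical, and "last numbered image" is interpreted here as the file with the highest numeric prefix.

```python
import re
from pathlib import Path
from typing import Optional, Sequence


def leading_number(name: str) -> Optional[int]:
    """Return the integer prefix of a name such as '2_basics.jpg', or None."""
    match = re.match(r"(\d+)", name)
    return int(match.group(1)) if match else None


def resolve_image(ref: str, filenames: Sequence[str]) -> Optional[str]:
    """Resolve a CSV image reference against the available filenames."""
    ref = ref.strip()
    # 1. Exact filename match
    if ref in filenames:
        return ref
    # 2. Stem match: 'image' matches 'image.png'
    for name in filenames:
        if Path(name).stem == ref:
            return name
    # 3. Number match: '2' matches '2_yyy.jpg'
    if ref.isdigit():
        for name in filenames:
            if leading_number(name) == int(ref):
                return name
    # 4. Fallback (assumption): the numbered image with the highest prefix
    numbered = [name for name in filenames if leading_number(name) is not None]
    if numbered:
        return max(numbered, key=leading_number)
    return None


if __name__ == "__main__":
    files = ["1_intro.png", "2_basics.jpg", "logo.png"]
    for ref in ("logo.png", "logo", "2", "missing"):
        print(ref, "->", resolve_image(ref, files))
```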

Example CSV

text,image
"Introduction",1_intro.jpg
"Main topic one",2_topic1.png
"Main topic two",3_topic2.jpg
"Conclusion",4_outro.jpg
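
Before running a long render, it can help to check the CSV against the two-column rule. The sketch below uses only the standard library. The function name read_storyboard is hypothetical, and skipping blank lines is an assumption rather than documented behavior of the tool.

```python
import csv
import io
from typing import List, Tuple


def read_storyboard(csv_text: str) -> List[Tuple[str, str]]:
    """Parse storyboard CSV text and enforce the 'text,image' layout."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None or [h.strip().lower() for h in header] != ["text", "image"]:
        raise ValueError("header must be exactly: text,image")
    rows = []
    for line_no, row in enumerate(reader, start=2):
        if not row:  # blank line (assumption: ignored)
            continue
        if len(row) != 2:
            raise ValueError(f"row {line_no}: expected 2 columns, got {len(row)}")
        text, image = (cell.strip() for cell in row)
        if not text or not image:
            raise ValueError(f"row {line_no}: empty text or image")
        rows.append((text, image))
    return rows


if __name__ == "__main__":
    sample = 'text,image\n"Introduction",1_intro.jpg\n"Conclusion",4_outro.jpg\n'
    for text, image in read_storyboard(sample):
        print(f"{image}: {text}")
```

To check a file on disk, read it with UTF-8 encoding and pass the contents to the function.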

Text Overlay File Format

Create a text file with one phrase per line:

Welcome
First Point
Second Point
Key Takeaway
Thank You

Each phrase will be matched to the audio and displayed with the selected animation.
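
A minimal sketch of reading such a file follows, assuming UTF-8 encoding and that blank lines are ignored. Both are assumptions, and parse_phrases and load_phrases are hypothetical names rather than package functions.

```python
from pathlib import Path
from typing import List


def parse_phrases(text: str) -> List[str]:
    """Split overlay text into phrases, one per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_phrases(path: str) -> List[str]:
    """Read an overlay file from disk (assumed UTF-8) and parse it."""
    return parse_phrases(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    print(parse_phrases("Welcome\nFirst Point\n\nThank You\n"))
```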

Animations

Image Animations

| Animation | Description |
| --- | --- |
| none | Static image |
| zoom_in | Slow zoom in (1.00x → 1.06x) |
| zoom_out | Slow zoom out (1.06x → 1.00x) |
| fade_in | Fade in from black |
| blink | Subtle brightness pulsing |
| pulse | Scale pulsing with sine wave |
| fade_zoom_in | Fade in + slow zoom |
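
To make the zoom ranges concrete, the sketch below computes a scale factor over the lifetime of a clip. Linear interpolation is an assumption, since the easing the tool actually uses is not documented here, and zoom_scale is a hypothetical name.

```python
def zoom_scale(t: float, duration: float, start: float = 1.00, end: float = 1.06) -> float:
    """Scale factor at time t (seconds) for a clip lasting `duration` seconds.

    Interpolates linearly from `start` to `end` (assumed easing).
    """
    if duration <= 0:
        return end
    progress = min(max(t / duration, 0.0), 1.0)  # clamp to the range 0..1
    return start + (end - start) * progress


if __name__ == "__main__":
    # zoom_in over a 4-second image, sampled once per second
    print([round(zoom_scale(t, 4.0), 3) for t in range(5)])
    # zoom_out is the same ramp with the endpoints swapped
    print([round(zoom_scale(t, 4.0, start=1.06, end=1.00), 3) for t in range(5)])
```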

Transitions

| Transition | Description |
| --- | --- |
| none | Cut (no transition) |
| fade | Crossfade between images |
| crossfade | Longer crossfade |
| slide_left | Slide in from right |
| slide_right | Slide in from left |
| dip_to_black | Brief black screen between |
| flash | Brief white flash between |

Text Animations

| Animation | Description |
| --- | --- |
| zoom_in | Scale up from 0.72x |
| fade_in | Fade in |
| pop_in | Scale + fade pop effect |
| pulse | Continuous pulse |
| slide_up | Slide up + fade |
| glow_pop | Pop with glow effect |
| typewriter | Character-by-character reveal |

Python API

from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig

# Configure pipeline
config = VideoPipelineConfig(
    audio_path="audio.mp3",
    csv_path="mapping.csv",
    input_mode="ZIP",
    zip_path="images.zip",
    output_filename="output.mp4",
    resolution="landscape",
    animation_mode="random",
    transition_mode="random",  # mode values per the CLI reference: random, single, custom
    whisper_model="base"
)

# Run pipeline
pipeline = VideoPipeline(config)
result = pipeline.run()

print(f"Video saved to: {result['output_path']}")

Project Structure

audio-video-generator/
├── pyproject.toml          # Package configuration
├── requirements.txt        # Dependencies
├── README.md               # This file
└── src/
    └── audio_video_generator/
        ├── __init__.py
        ├── __main__.py      # Module entry point
        ├── cli.py           # CLI implementation
        ├── config.py        # Constants and defaults
        ├── core/
        │   ├── audio.py     # Whisper transcription
        │   ├── alignment.py # CSV-audio alignment
        │   ├── images.py    # Image processing
        │   ├── pipeline.py  # Main orchestration
        │   ├── text_overlay.py  # Text rendering
        │   └── video.py     # Animation/transitions
        ├── utils/
        │   ├── files.py     # File utilities
        │   └── text.py      # Text processing
        └── web/
            └── gradio_ui.py # Web interface

Troubleshooting

CUDA Out of Memory

Use a smaller Whisper model:

avg generate --whisper-model tiny ...

Or force CPU:

CUDA_VISIBLE_DEVICES="" avg generate ...

Images Not Found

Check that every image reference in the CSV resolves to a file in the ZIP or directory. Give image filenames a numeric prefix (e.g., 1_image.png): number matching and the final fallback both rely on it.

CSV Alignment Failing

Check that:

  1. CSV has exactly 2 columns: text and image
  2. Text content matches what's spoken in the audio
  3. File encoding is UTF-8
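
To see which CSV phrases are likely to misalign, one option is to score each phrase against a transcript of the audio. The sketch below uses difflib from the standard library as a stand-in. It is not the tool's matcher, the function names are hypothetical, and the scores are only a rough diagnostic.

```python
import difflib
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def best_match_ratio(phrase: str, transcript: str) -> float:
    """Best similarity (0.0-1.0) between the phrase and any transcript window."""
    words = normalize(phrase)
    spoken = normalize(transcript)
    if not words or not spoken:
        return 0.0
    target = " ".join(words)
    window = len(words)
    best = 0.0
    # Slide a window of the phrase's word count across the transcript.
    for start in range(max(1, len(spoken) - window + 1)):
        candidate = " ".join(spoken[start:start + window])
        ratio = difflib.SequenceMatcher(None, target, candidate).ratio()
        best = max(best, ratio)
    return best


if __name__ == "__main__":
    transcript = "Hello. Welcome to our presentation, everyone!"
    for phrase in ("Welcome to our presentation", "Thank you for watching"):
        print(f"{best_match_ratio(phrase, transcript):.2f}  {phrase}")
```

Phrases that score well below the others are good candidates for rewording so they match the spoken audio more closely.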

FFmpeg Errors

Ensure FFmpeg is installed and available in PATH:

ffmpeg -version
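
The same check can be made from Python before starting a render. This sketch uses only the standard library. ffmpeg_available is a hypothetical helper, and it only confirms that an executable is on PATH, not that the build works.

```python
import shutil


def ffmpeg_available(executable: str = "ffmpeg") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("FFmpeg found at:", shutil.which("ffmpeg"))
    else:
        print("FFmpeg not found; install it and add it to PATH")
```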

License

MIT License

Acknowledgments