metadata

title: Audio Video Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 6.13.0
app_file: app.py
pinned: false
license: mit

Audio Video Generator

Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays.

Features

Audio Transcription: Uses OpenAI Whisper for accurate word-level timestamps
CSV Alignment: Maps text phrases to images with fuzzy matching
Image Animations: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
Transitions: fade, crossfade, slide_left, slide_right, dip_to_black, flash
Text Overlays: Synchronized text with multiple animations and styles
Flexible Input: Accept ZIP files or directories of images
Multiple Resolutions: Landscape (1280x720), Portrait (720x1280), Square (1080x1080)
CLI & Web UI: Use command-line or Gradio web interface

Installation

# Clone or download the repository
cd audio-video-generator

# Install dependencies
pip install -e .

# Or install from requirements.txt
pip install -r requirements.txt

System Requirements

Python 3.9+
FFmpeg (required by moviepy)
CUDA (optional, for GPU acceleration)

Installing FFmpeg

macOS:

brew install ffmpeg

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install ffmpeg

Windows: Download a build from https://ffmpeg.org/download.html and add the folder containing ffmpeg.exe to PATH

Quick Start

1. Prepare Your Files

Audio file: Any common format (MP3, WAV, M4A, etc.)

CSV mapping file (storyboard.csv):

text,image
"Welcome to our presentation",1_intro.png
"First we discuss the basics",2_basics.jpg
"Then we explore advanced topics",3_advanced.png
"Thank you for watching",4_thanks.png

Images: Start each filename with a number so the images can be ordered automatically (e.g., 1_intro.png, 2_basics.jpg)
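
Plain alphabetical sorting places 10_outro.png before 2_basics.jpg, so ordering by number needs a numeric sort key. The sketch below shows one way to do that. The function names are hypothetical and not taken from this package, and putting unnumbered files last is an assumption.

```python
import re
from pathlib import Path
from typing import List, Optional


def leading_number(filename: str) -> Optional[int]:
    """Return the integer prefix of a filename, or None if it has none."""
    match = re.match(r"(\d+)", Path(filename).name)
    return int(match.group(1)) if match else None


def sort_by_leading_number(filenames: List[str]) -> List[str]:
    """Order numbered files numerically; unnumbered files go last, alphabetically."""
    def key(name: str):
        number = leading_number(name)
        return (number is None, number if number is not None else 0, name)
    return sorted(filenames, key=key)


files = ["10_outro.png", "2_basics.jpg", "1_intro.png", "cover.png"]
print(sort_by_leading_number(files))
# ['1_intro.png', '2_basics.jpg', '10_outro.png', 'cover.png']
```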

2. Generate Video (CLI)

# Basic usage with ZIP file
avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4

# With manual image directory
avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL

# With custom animations
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --animation-mode single --animation zoom_in \
  --transition-mode single --transition fade

# With text overlay
avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
  --txt-overlay phrases.txt --text-color "#FFFFFF"

3. Launch Web UI

# Start local web interface
avg web

# With public link (for sharing)
avg web --share

# Custom port
avg web --port 8080

CLI Reference

`avg generate`

Generate a synchronized video.

Required Arguments:

-a, --audio: Path to audio file
-c, --csv: Path to CSV mapping file
-i, --images: Path to images (ZIP or directory)

Optional Arguments:

Option	Description	Default
`-o, --output`	Output filename	`output.mp4`
`-r, --resolution`	Resolution preset (`landscape`, `portrait`, `square`)	`landscape`
`--animation-mode`	Animation selection (`random`, `single`, `custom`)	`random`
`--animation`	Single animation to use	-
`--animations`	Custom animation list (can specify multiple)	-
`--transition-mode`	Transition selection (`random`, `single`, `custom`)	`random`
`--transition`	Single transition to use	-
`--transitions`	Custom transition list (can specify multiple)	-
`--txt-overlay`	Path to text overlay file	-
`--font-size`	Text overlay font size	`56`
`--text-color`	Text color (hex)	`#FFFFFF`
`--text-pos-x`	Horizontal position (0.0-1.0)	`0.5`
`--text-pos-y`	Vertical position (0.0-1.0)	`0.5`
`--whisper-model`	Whisper model (`tiny`, `base`, `small`, `medium`, `large`)	`base`
`--fps`	Output framerate	`24`
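
The `--text-pos-x` and `--text-pos-y` values are fractions of the frame size. The sketch below shows how they could map to pixel coordinates. The preset sizes come from the feature list above; the function name and the assumption that the fraction marks the center of the text are illustrative only.

```python
from typing import Tuple

# Frame sizes (width, height) as listed under Features.
RESOLUTIONS = {
    "landscape": (1280, 720),
    "portrait": (720, 1280),
    "square": (1080, 1080),
}


def text_center_px(resolution: str, pos_x: float = 0.5, pos_y: float = 0.5) -> Tuple[int, int]:
    """Convert fractional positions (0.0-1.0) into pixel coordinates."""
    if not (0.0 <= pos_x <= 1.0 and 0.0 <= pos_y <= 1.0):
        raise ValueError("positions must be between 0.0 and 1.0")
    width, height = RESOLUTIONS[resolution]
    return round(pos_x * width), round(pos_y * height)


print(text_center_px("landscape"))           # (640, 360)
print(text_center_px("portrait", 0.5, 0.9))  # (360, 1152)
```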

`avg web`

Launch the Gradio web interface.

Option	Description	Default
`--host`	Host to bind to	`127.0.0.1`
`--port`	Port to listen on	`7860`
`--share`	Create public shareable link	`False`

`avg models`

List available Whisper models with size and speed info.

`avg animations`

List available animations and transitions.

CSV File Format

The CSV file must have exactly 2 columns:

text: The text content that appears in the audio
image: Reference to the image file
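
A storyboard can be checked against this format before a long transcription run starts. The sketch below uses only the standard `csv` module. The function name `read_storyboard` is hypothetical, and the package's own validation may differ.

```python
import csv
import io
from typing import List, Tuple


def read_storyboard(csv_text: str) -> List[Tuple[str, str]]:
    """Parse storyboard CSV text into (text, image) pairs, checking the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None or [h.strip().lower() for h in header] != ["text", "image"]:
        raise ValueError(f"expected header 'text,image', got {header!r}")
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {reader.line_num}: expected 2 columns, got {len(row)}")
        text, image = row[0].strip(), row[1].strip()
        if not text or not image:
            raise ValueError(f"line {reader.line_num}: empty text or image")
        rows.append((text, image))
    return rows


sample = 'text,image\n"Welcome to our presentation",1_intro.png\n"Thank you for watching",4_thanks.png\n'
print(read_storyboard(sample))
```

To check a file on disk, read it with `open(path, encoding="utf-8", newline="")` and pass the contents to the function.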

Image Reference Resolution

Image references are resolved in this order:

1. Exact filename match: image.png matches image.png
2. Stem match: image matches image.png
3. Number match: 1 matches 1_xxx.png, 2 matches 2_yyy.jpg
4. Fallback: Last numbered image in the collection
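
The resolution order above can be sketched as a single lookup function. This is an illustration, not the package's implementation: `resolve_image` is a hypothetical name, and "last numbered image" is read here as the file with the highest leading number.

```python
import re
from pathlib import Path
from typing import List, Optional


def resolve_image(reference: str, filenames: List[str]) -> Optional[str]:
    """Resolve a CSV image reference against the available filenames."""
    ref = reference.strip()
    # 1. Exact filename match
    if ref in filenames:
        return ref
    # 2. Stem match ("image" -> "image.png")
    for name in filenames:
        if Path(name).stem == ref:
            return name
    # Collect files whose names start with digits, sorted by that number
    numbered = []
    for name in filenames:
        match = re.match(r"(\d+)", name)
        if match:
            numbered.append((int(match.group(1)), name))
    numbered.sort()
    # 3. Number match ("1" -> "1_xxx.png")
    if re.fullmatch(r"\d+", ref):
        for number, name in numbered:
            if number == int(ref):
                return name
    # 4. Fallback: highest-numbered image (assumed meaning of "last")
    return numbered[-1][1] if numbered else None


images = ["1_intro.png", "2_basics.jpg", "3_advanced.png"]
print(resolve_image("1_intro", images))      # 1_intro.png
print(resolve_image("3", images))            # 3_advanced.png
print(resolve_image("missing.png", images))  # 3_advanced.png
```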

Example CSV

text,image
"Introduction",1_intro.jpg
"Main topic one",2_topic1.png
"Main topic two",3_topic2.jpg
"Conclusion",4_outro.jpg

Text Overlay File Format

Create a text file with one phrase per line:

Welcome
First Point
Second Point
Key Takeaway
Thank You

Each phrase will be matched to the audio and displayed with the selected animation.
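
One way to match a phrase against word-level timestamps is to slide a window over the transcript and keep the best-scoring span. The sketch below does this with the standard library's `difflib`. The function names, the `(word, start, end)` tuple layout, and the 0.6 threshold are all assumptions, and the package's actual matching logic may work differently.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (word, start_seconds, end_seconds)


def normalize(text: str) -> List[str]:
    """Lowercase and drop punctuation so 'Thank You!' compares equal to 'thank you'."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_phrase(phrase: str, words: List[Word], threshold: float = 0.6) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the best-matching window, or None if below threshold."""
    target = normalize(phrase)
    if not target or not words:
        return None
    size = min(len(target), len(words))
    target_text = " ".join(target)
    best_score, best_span = 0.0, None
    for i in range(len(words) - size + 1):
        window = words[i:i + size]
        candidate = " ".join(" ".join(normalize(w[0])) for w in window)
        score = SequenceMatcher(None, target_text, candidate).ratio()
        if score > best_score:
            best_score, best_span = score, (window[0][1], window[-1][2])
    return best_span if best_score >= threshold else None


transcript = [("Welcome", 0.0, 0.4), ("everyone", 0.4, 0.9), ("thank", 5.0, 5.3), ("you", 5.3, 5.5)]
print(find_phrase("Thank You", transcript))  # (5.0, 5.5)
```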

Animations

Image Animations

Animation	Description
`none`	Static image
`zoom_in`	Slow zoom in (1.00x → 1.06x)
`zoom_out`	Slow zoom out (1.06x → 1.00x)
`fade_in`	Fade in from black
`blink`	Subtle brightness pulsing
`pulse`	Scale pulsing with sine wave
`fade_zoom_in`	Fade in + slow zoom
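
The zoom and pulse animations reduce to a scale factor that depends on time. The sketch below uses the 1.00x and 1.06x endpoints from the table. Linear easing, the pulse amplitude, and the pulse period are assumptions, since the table does not specify them.

```python
import math


def zoom_in_scale(t: float, duration: float, start: float = 1.00, end: float = 1.06) -> float:
    """Linear zoom from `start` to `end` over the clip duration (easing is assumed)."""
    progress = min(max(t / duration, 0.0), 1.0) if duration > 0 else 1.0
    return start + (end - start) * progress


def zoom_out_scale(t: float, duration: float) -> float:
    """Reverse of zoom_in: 1.06x at the start, 1.00x at the end."""
    return zoom_in_scale(t, duration, start=1.06, end=1.00)


def pulse_scale(t: float, amplitude: float = 0.03, period: float = 2.0) -> float:
    """Sine-wave scale around 1.0; amplitude and period are illustrative guesses."""
    return 1.0 + amplitude * math.sin(2 * math.pi * t / period)


for t in (0.0, 2.0, 4.0):
    print(t, round(zoom_in_scale(t, 4.0), 3), round(zoom_out_scale(t, 4.0), 3))
```

A renderer would evaluate the scale once per frame, at `t = frame_index / fps`, and resize the image by that factor before cropping to the output resolution.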

Transitions

Transition	Description
`none`	Cut (no transition)
`fade`	Crossfade between images
`crossfade`	Longer crossfade
`slide_left`	Slide in from right
`slide_right`	Slide in from left
`dip_to_black`	Brief black screen between
`flash`	Brief white flash between
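
A crossfade is a weighted blend of the outgoing and incoming frames, and a dip to black can be built from two such blends. The sketch below works on single RGB pixels to stay dependency-free. The function names and the linear blend curve are assumptions, not the package's code.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]
BLACK: Pixel = (0, 0, 0)


def crossfade_pixel(outgoing: Pixel, incoming: Pixel, progress: float) -> Pixel:
    """Blend two RGB pixels; progress 0.0 shows outgoing, 1.0 shows incoming."""
    p = min(max(progress, 0.0), 1.0)
    blended = tuple(round((1 - p) * a + p * b) for a, b in zip(outgoing, incoming))
    return (blended[0], blended[1], blended[2])


def dip_to_black_pixel(outgoing: Pixel, incoming: Pixel, progress: float) -> Pixel:
    """Fade outgoing to black in the first half, then fade incoming up from black."""
    p = min(max(progress, 0.0), 1.0)
    if p < 0.5:
        return crossfade_pixel(outgoing, BLACK, p * 2)
    return crossfade_pixel(BLACK, incoming, (p - 0.5) * 2)


red, blue = (255, 0, 0), (0, 0, 255)
print(crossfade_pixel(red, blue, 0.0))     # (255, 0, 0)
print(crossfade_pixel(red, blue, 1.0))     # (0, 0, 255)
print(dip_to_black_pixel(red, blue, 0.5))  # (0, 0, 0)
```

Applied to whole frames, the same weights would be used for every pixel, with `progress` running from 0.0 to 1.0 over the transition's duration.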

Text Animations

Animation	Description
`zoom_in`	Scale up from 0.72x
`fade_in`	Fade in
`pop_in`	Scale + fade pop effect
`pulse`	Continuous pulse
`slide_up`	Slide up + fade
`glow_pop`	Pop with glow effect
`typewriter`	Character-by-character reveal
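
The typewriter effect can be thought of as slicing the phrase by elapsed time. The sketch below shows that idea. `typewriter_text` is a hypothetical helper, and a constant reveal rate is an assumption.

```python
def typewriter_text(text: str, t: float, reveal_duration: float) -> str:
    """Return the part of `text` visible t seconds after the overlay starts."""
    if reveal_duration <= 0 or t >= reveal_duration:
        return text  # fully revealed
    if t <= 0:
        return ""
    visible = int(len(text) * t / reveal_duration)
    return text[:visible]


for t in (0.0, 0.5, 1.0):
    print(repr(typewriter_text("Welcome", t, 1.0)))
# ''
# 'Wel'
# 'Welcome'
```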

Python API

from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig

# Configure pipeline
config = VideoPipelineConfig(
    audio_path="audio.mp3",
    csv_path="mapping.csv",
    input_mode="ZIP",
    zip_path="images.zip",
    output_filename="output.mp4",
    resolution="landscape",
    animation_mode="random",
    transition_mode="random",  # the CLI documents the modes random, single, custom
    whisper_model="base"
)

# Run pipeline
pipeline = VideoPipeline(config)
result = pipeline.run()

print(f"Video saved to: {result['output_path']}")

Project Structure

audio-video-generator/
├── pyproject.toml          # Package configuration
├── requirements.txt        # Dependencies
├── README.md               # This file
└── src/
    └── audio_video_generator/
        ├── __init__.py
        ├── __main__.py      # Module entry point
        ├── cli.py           # CLI implementation
        ├── config.py        # Constants and defaults
        ├── core/
        │   ├── audio.py     # Whisper transcription
        │   ├── alignment.py # CSV-audio alignment
        │   ├── images.py    # Image processing
        │   ├── pipeline.py  # Main orchestration
        │   ├── text_overlay.py  # Text rendering
        │   └── video.py     # Animation/transitions
        ├── utils/
        │   ├── files.py     # File utilities
        │   └── text.py      # Text processing
        └── web/
            └── gradio_ui.py # Web interface

Troubleshooting

CUDA Out of Memory

Use a smaller Whisper model:

avg generate --whisper-model tiny ...

Or force CPU:

CUDA_VISIBLE_DEVICES="" avg generate ...

Images Not Found

Ensure image filenames start with a number (e.g., 1_image.png). When a CSV reference has no exact or stem match, the tool falls back to matching by number.

CSV Alignment Failing

Check that:

CSV has exactly 2 columns: text and image
Text content matches what's spoken in the audio
File encoding is UTF-8

FFmpeg Errors

Ensure FFmpeg is installed and available in PATH:

ffmpeg -version
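
The same check can be run from Python, for example at the top of a script that calls the pipeline. This is a small sketch using only the standard library; `ffmpeg_version` is a hypothetical helper and not part of this package.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version() -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is not on PATH."""
    executable = shutil.which("ffmpeg")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


version = ffmpeg_version()
print(version or "FFmpeg not found on PATH")
```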

License

MIT License

Acknowledgments

OpenAI Whisper for transcription
MoviePy for video processing
Gradio for web interface