	---
	title: Audio Video Generator
	emoji: 🎬
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 6.13.0
	app_file: app.py
	pinned: false
	license: mit
	---

	# Audio Video Generator

	Synchronize images to audio using OpenAI Whisper transcription and CSV mapping. Create engaging videos with automated image transitions, animations, and text overlays.

	## Features

	- Audio Transcription: Uses OpenAI Whisper for accurate word-level timestamps
	- CSV Alignment: Maps text phrases to images with fuzzy matching
	- Image Animations: zoom_in, zoom_out, fade_in, blink, pulse, fade_zoom_in
	- Transitions: fade, crossfade, slide_left, slide_right, dip_to_black, flash
	- Text Overlays: Synchronized text with multiple animations and styles
	- Flexible Input: Accept ZIP files or directories of images
	- Multiple Resolutions: Landscape (1280x720), Portrait (720x1280), Square (1080x1080)
	- CLI & Web UI: Use the command line or the Gradio web interface

	## Installation

	```bash
	# Clone or download the repository
	cd audio-video-generator

	# Install dependencies
	pip install -e .

	# Or install from requirements.txt
	pip install -r requirements.txt
	```

	### System Requirements

	- Python 3.9+
	- FFmpeg (required by moviepy)
	- CUDA (optional, for GPU acceleration)

	#### Installing FFmpeg

	macOS:
	```bash
	brew install ffmpeg
	```

	Ubuntu/Debian:
	```bash
	sudo apt-get update
	sudo apt-get install ffmpeg
	```

	Windows:
	Download FFmpeg from https://ffmpeg.org/download.html and add it to your PATH.

	## Quick Start

	### 1. Prepare Your Files

	Audio file: Any common format (MP3, WAV, M4A, etc.)

	CSV mapping file (`storyboard.csv`):
	```csv
	text,image
	"Welcome to our presentation",1_intro.png
	"First we discuss the basics",2_basics.jpg
	"Then we explore advanced topics",3_advanced.png
	"Thank you for watching",4_thanks.png
	```

	Images: Name files with numbers for automatic ordering (e.g., `1_intro.png`, `2_basics.jpg`)
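If your images live in a directory and you want to pass a ZIP instead, the following is a minimal sketch of packaging them in numeric order. The helper names (`leading_number`, `zip_images`) and the set of accepted extensions are assumptions made for illustration, not part of this package, and placing the files at the root of the archive is likewise an assumption about what the tool expects.

```python
import re
import zipfile
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your project uses.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def leading_number(path: Path) -> int:
    """Return the integer prefix of a filename, or a large value if there is none."""
    match = re.match(r"(\d+)", path.name)
    return int(match.group(1)) if match else 10**9


def zip_images(image_dir: str, zip_path: str) -> list[str]:
    """Write every image in image_dir into zip_path, sorted by numeric prefix."""
    images = sorted(
        (p for p in Path(image_dir).iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS),
        key=lambda p: (leading_number(p), p.name),
    )
    with zipfile.ZipFile(zip_path, "w") as archive:
        for image in images:
            # Store files at the archive root (an assumption about the expected layout).
            archive.write(image, arcname=image.name)
    return [image.name for image in images]
```

Sorting on the integer prefix rather than the raw name keeps `10_end.png` after `2_basics.jpg`, which a plain alphabetical sort would not.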

	### 2. Generate Video (CLI)

	```bash
	# Basic usage with ZIP file
	avg generate -a audio.mp3 -c storyboard.csv -i images.zip -o output.mp4

	# With manual image directory
	avg generate -a audio.mp3 -c storyboard.csv -i ./images/ --input-mode MANUAL

	# With custom animations
	avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
	--animation-mode single --animation zoom_in \
	--transition-mode single --transition fade

	# With text overlay
	avg generate -a audio.mp3 -c storyboard.csv -i images.zip \
	--txt-overlay phrases.txt --text-color "#FFFFFF"
	```

	### 3. Launch Web UI

	```bash
	# Start local web interface
	avg web

	# With public link (for sharing)
	avg web --share

	# Custom port
	avg web --port 8080
	```

	## CLI Reference

	### `avg generate`

	Generate a synchronized video.

	Required Arguments:
	- `-a, --audio`: Path to audio file
	- `-c, --csv`: Path to CSV mapping file
	- `-i, --images`: Path to images (ZIP or directory)

	Optional Arguments:

	| Option | Description | Default |
	|--------|-------------|---------|
	| `-o, --output` | Output filename | `output.mp4` |
	| `-r, --resolution` | Resolution preset (`landscape`, `portrait`, `square`) | `landscape` |
	| `--animation-mode` | Animation selection (`random`, `single`, `custom`) | `random` |
	| `--animation` | Single animation to use | - |
	| `--animations` | Custom animation list (can specify multiple) | - |
	| `--transition-mode` | Transition selection (`random`, `single`, `custom`) | `random` |
	| `--transition` | Single transition to use | - |
	| `--transitions` | Custom transition list (can specify multiple) | - |
	| `--txt-overlay` | Path to text overlay file | - |
	| `--font-size` | Text overlay font size | `56` |
	| `--text-color` | Text color (hex) | `#FFFFFF` |
	| `--text-pos-x` | Horizontal position (0.0-1.0) | `0.5` |
	| `--text-pos-y` | Vertical position (0.0-1.0) | `0.5` |
	| `--whisper-model` | Whisper model (`tiny`, `base`, `small`, `medium`, `large`) | `base` |
	| `--fps` | Output framerate | `24` |
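The `--text-pos-x` and `--text-pos-y` values are fractions of the frame, so their pixel meaning depends on the resolution preset. The sketch below shows one plausible mapping, assuming the fractions locate the center of the text and using the preset sizes listed under Features. The `text_anchor` function and `RESOLUTIONS` table are hypothetical names, not part of the package.

```python
# Preset sizes as listed in the Features section (width, height).
RESOLUTIONS = {
    "landscape": (1280, 720),
    "portrait": (720, 1280),
    "square": (1080, 1080),
}


def text_anchor(resolution: str, pos_x: float = 0.5, pos_y: float = 0.5) -> tuple[int, int]:
    """Convert fractional text positions into pixel coordinates for a preset."""
    if not (0.0 <= pos_x <= 1.0 and 0.0 <= pos_y <= 1.0):
        raise ValueError("positions must be between 0.0 and 1.0")
    width, height = RESOLUTIONS[resolution]
    return round(pos_x * width), round(pos_y * height)


print(text_anchor("landscape"))           # (640, 360)
print(text_anchor("portrait", 0.5, 0.9))  # (360, 1152)
```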

	### `avg web`

	Launch Gradio web interface.

	| Option | Description | Default |
	|--------|-------------|---------|
	| `--host` | Host to bind to | `127.0.0.1` |
	| `--port` | Port to listen on | `7860` |
	| `--share` | Create public shareable link | `False` |

	### `avg models`

	List available Whisper models with size and speed info.

	### `avg animations`

	List available animations and transitions.

	## CSV File Format

	The CSV file must have exactly 2 columns:
	1. `text`: The text content that appears in the audio
	2. `image`: Reference to the image file

	### Image Reference Resolution

	Image references are resolved in this order:
	1. Exact filename match: `image.png` matches `image.png`
	2. Stem match: `image` matches `image.png`
	3. Number match: `1` matches `1_xxx.png`, `2` matches `2_yyy.jpg`
	4. Fallback: Last numbered image in the collection
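The resolution order above can be sketched as a single lookup function. This is an illustration of the documented order, not the package's actual code: `resolve_image` and `leading_number` are hypothetical names, and "last numbered image" is interpreted here as the image with the highest leading number.

```python
import re
from pathlib import Path
from typing import Optional


def leading_number(filename: str) -> Optional[int]:
    """Return the integer a filename starts with, or None if it has no numeric prefix."""
    match = re.match(r"(\d+)", filename)
    return int(match.group(1)) if match else None


def resolve_image(reference: str, filenames: list[str]) -> Optional[str]:
    """Resolve a CSV image reference against the available filenames."""
    ref = reference.strip()
    # 1. Exact filename match.
    if ref in filenames:
        return ref
    # 2. Stem match: "image" matches "image.png".
    for name in filenames:
        if Path(name).stem == ref:
            return name
    # 3. Number match: "1" matches "1_xxx.png".
    if ref.isdigit():
        for name in filenames:
            if leading_number(name) == int(ref):
                return name
    # 4. Fallback: the numbered image with the highest number (assumed meaning of "last").
    numbered = [name for name in filenames if leading_number(name) is not None]
    if numbered:
        return max(numbered, key=leading_number)
    return None
```

For example, with `["1_intro.png", "2_basics.jpg", "cover.png"]`, the reference `cover` resolves by stem, `2` resolves by number, and an unknown reference falls back to `2_basics.jpg`.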

	### Example CSV

	```csv
	text,image
	"Introduction",1_intro.jpg
	"Main topic one",2_topic1.png
	"Main topic two",3_topic2.jpg
	"Conclusion",4_outro.jpg
	```

	## Text Overlay File Format

	Create a text file with one phrase per line:

	```
	Welcome
	First Point
	Second Point
	Key Takeaway
	Thank You
	```

	Each phrase will be matched to the audio and displayed with the selected animation.
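To give a feel for how a phrase might be matched to the audio, here is a minimal sketch of fuzzy matching against word-level timestamps using the standard library's `difflib`. It is not the package's matching algorithm. The word format (dicts with `word`, `start`, `end`), the `find_phrase` helper, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]+", "", text.lower()).split()


def find_phrase(phrase: str, words: list[dict], threshold: float = 0.7) -> Optional[tuple]:
    """Return (start, end, score) for the best-matching run of words, or None."""
    target = normalize(phrase)
    if not target or not words:
        return None
    tokens = [" ".join(normalize(w["word"])) for w in words]
    target_text = " ".join(target)
    size = len(target)
    best_score, best_span = 0.0, None
    # Slide a window with as many words as the phrase over the transcript.
    for start in range(max(1, len(tokens) - size + 1)):
        window = tokens[start:start + size]
        score = SequenceMatcher(None, target_text, " ".join(window)).ratio()
        if score > best_score:
            best_score, best_span = score, (start, start + len(window) - 1)
    if best_span is None or best_score < threshold:
        return None
    first, last = best_span
    return words[first]["start"], words[last]["end"], best_score
```

Because the comparison is similarity-based rather than exact, a small transcription slip such as "discus" for "discuss" still yields a match, while unrelated text falls below the threshold and returns `None`.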

	## Animations

	### Image Animations

	| Animation | Description |
	|-----------|-------------|
	| `none` | Static image |
	| `zoom_in` | Slow zoom in (1.00x → 1.06x) |
	| `zoom_out` | Slow zoom out (1.06x → 1.00x) |
	| `fade_in` | Fade in from black |
	| `blink` | Subtle brightness pulsing |
	| `pulse` | Scale pulsing with sine wave |
	| `fade_zoom_in` | Fade in + slow zoom |
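The zoom and pulse effects in the table can be thought of as a scale factor that depends on time. The sketch below uses the 1.00x and 1.06x endpoints from the table. The linear easing, the pulse amplitude and period, and the function names are assumptions for illustration rather than the package's actual curves.

```python
import math

ZOOM_START, ZOOM_END = 1.00, 1.06  # endpoints from the table above


def zoom_in_scale(t: float, duration: float) -> float:
    """Scale factor at time t for a linear zoom from 1.00x to 1.06x."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    progress = min(max(t / duration, 0.0), 1.0)  # clamp to [0, 1]
    return ZOOM_START + (ZOOM_END - ZOOM_START) * progress


def zoom_out_scale(t: float, duration: float) -> float:
    """Scale factor at time t for the reverse zoom, 1.06x down to 1.00x."""
    return zoom_in_scale(duration - t, duration)


def pulse_scale(t: float, amplitude: float = 0.02, period: float = 1.0) -> float:
    """Sine-wave scale pulsing around 1.0x (amplitude and period are assumed values)."""
    return 1.0 + amplitude * math.sin(2 * math.pi * t / period)
```

Halfway through a 4-second clip, `zoom_in_scale(2, 4)` gives roughly 1.03, which shows how subtle the zoom is.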

	### Transitions

	| Transition | Description |
	|------------|-------------|
	| `none` | Cut (no transition) |
	| `fade` | Crossfade between images |
	| `crossfade` | Longer crossfade |
	| `slide_left` | Slide in from right |
	| `slide_right` | Slide in from left |
	| `dip_to_black` | Brief black screen between |
	| `flash` | Brief white flash between |

	### Text Animations

	| Animation | Description |
	|-----------|-------------|
	| `zoom_in` | Scale up from 0.72x |
	| `fade_in` | Fade in |
	| `pop_in` | Scale + fade pop effect |
	| `pulse` | Continuous pulse |
	| `slide_up` | Slide up + fade |
	| `glow_pop` | Pop with glow effect |
	| `typewriter` | Character-by-character reveal |
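As an illustration of two entries in the table, the sketch below computes what a `typewriter` reveal and a text `zoom_in` might show at a given moment. Only the 0.72x starting scale comes from the table. The even character pacing, the 1.0x end scale, and the function names are assumptions.

```python
def typewriter_text(text: str, t: float, duration: float) -> str:
    """Return the part of text visible at time t, revealing characters evenly."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    progress = min(max(t / duration, 0.0), 1.0)  # clamp to [0, 1]
    return text[: int(progress * len(text))]


def text_zoom_scale(t: float, duration: float, start_scale: float = 0.72) -> float:
    """Scale factor growing linearly from 0.72x to an assumed final 1.0x."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    progress = min(max(t / duration, 0.0), 1.0)
    return start_scale + (1.0 - start_scale) * progress


print(typewriter_text("Welcome", 0.5, 1.0))  # Wel
```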

	## Python API

	```python
	from audio_video_generator.core.pipeline import VideoPipeline, VideoPipelineConfig

	# Configure pipeline
	config = VideoPipelineConfig(
	    audio_path="audio.mp3",
	    csv_path="mapping.csv",
	    input_mode="ZIP",
	    zip_path="images.zip",
	    output_filename="output.mp4",
	    resolution="landscape",
	    animation_mode="random",
	    transition_mode="random",
	    whisper_model="base",
	)

	# Run pipeline
	pipeline = VideoPipeline(config)
	result = pipeline.run()

	print(f"Video saved to: {result['output_path']}")
	```

	## Project Structure

	```
	audio-video-generator/
	├── pyproject.toml              # Package configuration
	├── requirements.txt            # Dependencies
	├── README.md                   # This file
	└── src/
	    └── audio_video_generator/
	        ├── __init__.py
	        ├── __main__.py         # Module entry point
	        ├── cli.py              # CLI implementation
	        ├── config.py           # Constants and defaults
	        ├── core/
	        │   ├── audio.py        # Whisper transcription
	        │   ├── alignment.py    # CSV-audio alignment
	        │   ├── images.py       # Image processing
	        │   ├── pipeline.py     # Main orchestration
	        │   ├── text_overlay.py # Text rendering
	        │   └── video.py        # Animation/transitions
	        ├── utils/
	        │   ├── files.py        # File utilities
	        │   └── text.py         # Text processing
	        └── web/
	            └── gradio_ui.py    # Web interface
	```

	## Troubleshooting

	### CUDA Out of Memory

	Use a smaller Whisper model:
	```bash
	avg generate --whisper-model tiny ...
	```

	Or force CPU:
	```bash
	CUDA_VISIBLE_DEVICES="" avg generate ...
	```

	### Images Not Found

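	Check that each `image` value in the CSV matches a filename, a filename stem, or a leading number. Numbered filenames (e.g., `1_image.png`) are recommended because the tool falls back to numbers when no name matches.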

	### CSV Alignment Failing

	Check that:
	1. CSV has exactly 2 columns: `text` and `image`
	2. Text content matches what's spoken in the audio
	3. File encoding is UTF-8
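The first and third checks can be automated before running the tool. The following is a minimal sketch using only the standard library. `validate_storyboard` is a hypothetical helper, not a command shipped with the package, and treating blank lines as harmless is an assumption.

```python
import csv


def validate_storyboard(csv_path: str) -> list[str]:
    """Return a list of problems found in a storyboard CSV (empty if none)."""
    problems = []
    try:
        # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte-order mark.
        with open(csv_path, newline="", encoding="utf-8-sig") as handle:
            rows = list(csv.reader(handle))
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    if not rows:
        return ["file is empty"]
    header = [cell.strip().lower() for cell in rows[0]]
    if header != ["text", "image"]:
        problems.append(f"header should be text,image but is {rows[0]}")
    for number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # skip blank lines (assumed to be harmless)
        if len(row) != 2:
            problems.append(f"row {number}: expected 2 columns, found {len(row)}")
        elif not row[0].strip() or not row[1].strip():
            problems.append(f"row {number}: empty text or image value")
    return problems
```

A common cause of extra columns is an unquoted comma inside the `text` value, which is why the examples in this README quote every phrase.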

	### FFmpeg Errors

	Ensure FFmpeg is installed and available in PATH:
	```bash
	ffmpeg -version
	```
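The same check can be done from Python, for example at the start of a script that calls the pipeline. This is a small sketch using only the standard library. `ffmpeg_version` is a hypothetical helper name.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version() -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if FFmpeg is not on PATH."""
    executable = shutil.which("ffmpeg")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "-version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = ffmpeg_version()
    print(version if version else "FFmpeg not found on PATH")
```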

	## License

	MIT License

	## Acknowledgments

	- [OpenAI Whisper](https://github.com/openai/whisper) for transcription
	- [MoviePy](https://zulko.github.io/moviepy/) for video processing
	- [Gradio](https://gradio.app/) for web interface