title: SRT Caption Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
SRT Caption Generator for CapCut
Tunisian Arabic Forced Alignment Tool
A local CLI tool for generating accurate SRT caption files from audio and script text, specifically designed for CapCut import compatibility and Tunisian Arabic dialect (Arabic + French code-switching).
Features
- ✅ Forced Alignment: Precisely align existing scripts with audio (no transcription)
- ✅ CapCut Compatible: UTF-8, CRLF line endings, perfect import every time
- ✅ Tunisian Arabic: Handles Arabic + French code-switching seamlessly
- ✅ CPU Only: Runs on MacBook without GPU requirements
- ✅ Batch Processing: Process 20+ videos with one command
- ✅ Quality Features: Word-level alignment, auto-split long captions, smart gap correction
Quick Start
1. Install Dependencies
# Install FFmpeg (required for audio processing)
brew install ffmpeg
# Install Python packages
pip install ctc-forced-aligner torch torchaudio
2. Basic Usage
# Single file processing (uses word-level by default for optimal results)
python3 align.py --audio input/video.mp3 --script input/script.txt
# Batch processing (auto-matches filenames)
python3 align.py --batch --input-dir input/ --output-dir output/
# Force sentence-level alignment (if needed)
python3 align.py --audio input/video.wav --script input/script.txt --sentence-level
# Quality analysis of results
python3 quality_analyzer.py output/video.srt
3. Test Installation
python3 test_basic.py
Project Structure
caption-tool/
├── align.py ← Main CLI entrypoint
├── aligner.py ← Forced alignment core (sentence + word-level)
├── srt_writer.py ← SRT formatting, group_words(), timing logic
├── normalize.py ← Audio → 16kHz mono WAV via ffmpeg
├── validator.py ← Input validation (file existence, WPM check)
├── batch.py ← Batch processing (sentence-level)
├── config.py ← All constants (ARABIC_PARTICLES, timings, etc.)
├── diff_check.py ← Quality checker: compare output vs reference SRT
├── test_word_level.py ← Quick alignment test on first N sentences
├── download_model.py ← Resume-capable downloader for ONNX model
├── demo_align.py ← Demo mode with synthetic data
├── test_basic.py ← Basic module functionality tests
├── input/ ← Drop audio + txt files here
├── output/ ← SRT files generated here
└── docs/ ← Detailed documentation
    ├── CLI.md
    ├── ALIGNER.md
    ├── SRT_WRITER.md
    ├── CONFIG.md
    ├── BATCH.md
    ├── NORMALIZE.md
    ├── VALIDATOR.md
    └── TROUBLESHOOTING.md
Usage Examples
Single File Processing
# Basic alignment
python3 align.py --audio input/video_01.mp3 --script input/video_01.txt
# Custom output path
python3 align.py --audio input/video.wav --script input/script.txt --output custom.srt
# Adjust timing (shift captions earlier)
python3 align.py --audio input/video.m4a --script input/script.txt --offset -200
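To illustrate what a negative offset does to caption timing, here is a minimal Python sketch. It assumes the `--offset` value is in milliseconds, which this README does not state explicitly. The helper names `shift_ms` and `format_timestamp` are hypothetical and are not taken from the tool's source.

```python
def shift_ms(start_ms: int, end_ms: int, offset_ms: int) -> tuple[int, int]:
    """Shift one caption by offset_ms, clamping so times never go below zero."""
    new_start = max(0, start_ms + offset_ms)
    # Keep the end at or after the start even if clamping moved the start.
    new_end = max(new_start, end_ms + offset_ms)
    return new_start, new_end


def format_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


# A caption at 1.5 s to 3.2 s shifted 200 ms earlier (assumed unit: ms).
start, end = shift_ms(1_500, 3_200, -200)
print(format_timestamp(start), "-->", format_timestamp(end))
```

Clamping at zero is a design choice in this sketch: a caption near the start of the audio keeps a valid timestamp instead of becoming negative.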
Quality Options
# Default word-level alignment (optimal for Tunisian Arabic)
python3 align.py --audio input/video.wav --script input/script.txt
# Force sentence-level alignment (for very long captions)
python3 align.py --audio input/video.wav --script input/script.txt --sentence-level
# Custom caption length limit
python3 align.py --audio input/video.mp3 --script input/script.txt --max-chars 30
# Quality analysis with improvement suggestions
python3 quality_analyzer.py output/video.srt
# Verbose output for debugging
python3 align.py --audio input/video.wav --script input/script.txt --verbose
Batch Processing
# Process all matched pairs in directory
python3 align.py --batch --input-dir input/ --output-dir output/
# Input structure:
# input/
# ├── video_01.mp3 ↔ video_01.txt
# ├── video_02.wav ↔ video_02.txt
# └── video_03.m4a ↔ video_03.txt
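The filename matching described above can be sketched in Python as follows. This is an illustration only and may differ from the logic in batch.py. The function name `find_pairs` and the `AUDIO_EXTENSIONS` constant are hypothetical, and the extension list is taken from the Supported Formats section.

```python
from pathlib import Path

# Assumed from the "Supported Formats" list; not read from config.py.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac"}


def find_pairs(input_dir: Path) -> list[tuple[Path, Path]]:
    """Return (audio, script) pairs whose filenames share the same stem."""
    pairs = []
    for audio in sorted(input_dir.iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        script = audio.with_suffix(".txt")  # video_01.mp3 -> video_01.txt
        if script.is_file():
            pairs.append((audio, script))
    return pairs
```

Audio files without a matching script are skipped silently in this sketch. A real batch run would probably want to report them.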
Expected Workflow
- Record voiceover from your written script
- Export audio as MP3/WAV from your video editor
- Save script as UTF-8 text file with same filename
- Run alignment: python3 align.py --audio video.mp3 --script video.txt
- Import the SRT directly into CapCut; captions appear accurately timed
No manual timestamping. No CapCut caption editing. Just perfect alignment.
Supported Formats
- Audio: MP3, WAV, M4A, AAC
- Text: UTF-8 encoded plain text
- Output: CapCut-compatible SRT files
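Whatever the input format, the audio is converted to 16kHz mono WAV with FFmpeg before alignment. The sketch below shows one way to build such a command in Python. The function names are hypothetical, and the 16-bit PCM codec is an assumption, since this README only specifies the sample rate and channel count.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that converts src to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite dst if it already exists
        "-i", str(src),       # input file (MP3, WAV, M4A, AAC, ...)
        "-ac", "1",           # one audio channel (mono)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # assumption: 16-bit PCM samples
        str(dst),
    ]


def normalize(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst), check=True, capture_output=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with filenames that contain spaces or Arabic characters.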
Quality Features
- Gap Elimination: Closes gaps between consecutive captions for seamless playback
- Character Limits: Auto-split long captions at word boundaries
- Word-Level Mode: More precise timing for fast-speaking segments
- Confidence Warnings: Alerts for low-quality alignments
- Validation Checks: Prevents common audio/script mismatches
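Two of these features, splitting at word boundaries and closing gaps, can be sketched in a few lines of Python. This is a simplified illustration, not the logic in srt_writer.py. The function names and the `max_gap_ms` threshold are hypothetical, and timings are assumed to be in milliseconds.

```python
def split_caption(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole words into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole on its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def close_gaps(cues: list[tuple[int, int]], max_gap_ms: int) -> list[tuple[int, int]]:
    """Extend each cue's end to the next cue's start when the gap is small."""
    closed = []
    for i, (start, end) in enumerate(cues):
        if i + 1 < len(cues):
            next_start = cues[i + 1][0]
            if 0 < next_start - end <= max_gap_ms:
                end = next_start
        closed.append((start, end))
    return closed
```

Gaps larger than the threshold are left alone in this sketch, on the assumption that a long silence is an intentional pause rather than an alignment artifact.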
Documentation
- CLI.md: Complete command-line reference
- ALIGNER.md: Alignment engine details
- TROUBLESHOOTING.md: Common issues and fixes
- SRT_WRITER.md: SRT formatting specifics
- BATCH.md: Batch processing guide
Troubleshooting
First Run
- Model download (~1GB) happens automatically on first use
- Requires stable internet connection and ~2GB free disk space
Common Issues
- Arabic text garbled: Ensure script file is UTF-8 encoded
- Captions too early/late: Use the --offset flag to adjust timing
- Low alignment confidence: Try --word-level mode for better precision
- Import fails in CapCut: Generated SRT uses correct formatting automatically
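For the garbled-text case, a quick way to check a script file before running alignment is to decode it strictly as UTF-8. The Python sketch below is an independent check, not part of validator.py, and the function name `read_script_utf8` is hypothetical.

```python
from pathlib import Path


def read_script_utf8(path: Path) -> str:
    """Read a script file strictly as UTF-8, raising a clear error otherwise."""
    raw = path.read_bytes()
    try:
        # "utf-8-sig" also strips a leading byte-order mark if one is present.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"{path} is not valid UTF-8 (first bad byte at offset {exc.start}); "
            "re-save the file as UTF-8"
        ) from exc
```

Files saved in a legacy Arabic encoding such as Windows-1256 fail this check immediately, which is easier to diagnose than garbled captions after import.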
See TROUBLESHOOTING.md for detailed solutions.
Technical Details
- Alignment Model: facebook/mms-300m (HuggingFace)
- Alignment Library: ctc-forced-aligner
- Audio Processing: FFmpeg normalization to 16kHz mono WAV
- Language Support: Arabic (ara) + multilingual text
- Platform: macOS (CPU-only, no CUDA required)
- Line Endings: CRLF for CapCut compatibility
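The UTF-8 and CRLF requirements can be illustrated with a minimal SRT writer in Python. This is a sketch under stated assumptions, not the code in srt_writer.py: the function names are hypothetical, timestamps are passed in as pre-formatted strings, and no byte-order mark is written.

```python
def render_srt(cues: list[tuple[str, str, str]]) -> str:
    """Render (start, end, text) cues as SRT text with CRLF line endings."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{index}\r\n{start} --> {end}\r\n{text}\r\n")
    # A blank line (one extra CRLF) separates consecutive cue blocks.
    return "\r\n".join(blocks)


def write_srt(path: str, cues: list[tuple[str, str, str]]) -> None:
    """Write cues to path as UTF-8 without altering the CRLF sequences."""
    # newline="" stops Python from translating the explicit "\r\n" on any OS.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(render_srt(cues))
```

Without `newline=""`, Python on Windows would expand each `\n` again and produce `\r\r\n`, so the explicit setting keeps the output identical across platforms.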
License
This tool is designed specifically for content creators producing Tunisian Arabic videos. Use responsibly and ensure you have rights to the audio content you're processing.
Contributing
This is a specialized tool built to exact specifications. For issues or feature requests related to CapCut compatibility or Tunisian Arabic handling, please document them with specific test cases.