title: SRT Caption Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

SRT Caption Generator for CapCut

Tunisian Arabic Forced Alignment Tool

A local CLI tool for generating accurate SRT caption files from audio and script text, specifically designed for CapCut import compatibility and Tunisian Arabic dialect (Arabic + French code-switching).


Features

✅ Forced Alignment: Precisely align existing scripts with audio (no transcription)
✅ CapCut Compatible: UTF-8, CRLF line endings, perfect import every time
✅ Tunisian Arabic: Handles Arabic + French code-switching seamlessly
✅ CPU Only: Runs on a MacBook with no GPU required
✅ Batch Processing: Process 20+ videos with one command
✅ Quality Features: Word-level alignment, auto-split long captions, smart gap correction


Quick Start

1. Install Dependencies

# Install FFmpeg (required for audio processing)
brew install ffmpeg

# Install Python packages
pip install ctc-forced-aligner torch torchaudio

2. Basic Usage

# Single file processing (uses word-level by default for optimal results)
python3 align.py --audio input/video.mp3 --script input/script.txt

# Batch processing (auto-matches filenames)
python3 align.py --batch --input-dir input/ --output-dir output/

# Force sentence-level alignment (if needed)
python3 align.py --audio input/video.wav --script input/script.txt --sentence-level

# Quality analysis of results
python3 quality_analyzer.py output/video.srt

3. Test Installation

python3 test_basic.py

Project Structure

caption-tool/
├── align.py              ← Main CLI entrypoint
├── aligner.py            ← Forced alignment core (sentence + word-level)
├── srt_writer.py         ← SRT formatting, group_words(), timing logic
├── normalize.py          ← Audio → 16kHz mono WAV via ffmpeg
├── validator.py          ← Input validation (file existence, WPM check)
├── batch.py              ← Batch processing (sentence-level)
├── config.py             ← All constants (ARABIC_PARTICLES, timings, etc.)
├── diff_check.py         ← Quality checker: compare output vs reference SRT
├── test_word_level.py    ← Quick alignment test on first N sentences
├── download_model.py     ← Resume-capable downloader for ONNX model
├── demo_align.py         ← Demo mode with synthetic data
├── test_basic.py         ← Basic module functionality tests
├── input/                ← Drop audio + txt files here
├── output/               ← SRT files generated here
└── docs/                 ← Detailed documentation
    ├── CLI.md
    ├── ALIGNER.md
    ├── SRT_WRITER.md
    ├── CONFIG.md
    ├── BATCH.md
    ├── NORMALIZE.md
    ├── VALIDATOR.md
    └── TROUBLESHOOTING.md

Usage Examples

Single File Processing

# Basic alignment
python3 align.py --audio input/video_01.mp3 --script input/video_01.txt

# Custom output path
python3 align.py --audio input/video.wav --script input/script.txt --output custom.srt

# Adjust timing (shift captions earlier)
python3 align.py --audio input/video.m4a --script input/script.txt --offset -200
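
The effect of --offset can be pictured as a uniform shift applied to every caption. The sketch below is only an illustration, not the code in align.py: the (start_ms, end_ms, text) tuple layout, the apply_offset name, and the choice to clamp at zero and drop captions that collapse are all assumptions.

```python
from typing import List, Tuple

# Hypothetical caption representation: (start_ms, end_ms, text)
Caption = Tuple[int, int, str]


def apply_offset(captions: List[Caption], offset_ms: int) -> List[Caption]:
    """Shift every caption by offset_ms; negative values move captions earlier."""
    shifted: List[Caption] = []
    for start, end, text in captions:
        # Clamp at zero so a negative offset never produces negative timestamps
        new_start = max(0, start + offset_ms)
        new_end = max(0, end + offset_ms)
        # Assumption: a caption that collapses to zero length is dropped
        if new_end > new_start:
            shifted.append((new_start, new_end, text))
    return shifted
```

With --offset -200, a caption spanning 500 ms to 1200 ms would move to 300 ms to 1000 ms under this sketch.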

Quality Options

# Default word-level alignment (optimal for Tunisian Arabic)
python3 align.py --audio input/video.wav --script input/script.txt

# Force sentence-level alignment (for very long captions)
python3 align.py --audio input/video.wav --script input/script.txt --sentence-level

# Custom caption length limit
python3 align.py --audio input/video.mp3 --script input/script.txt --max-chars 30

# Quality analysis with improvement suggestions
python3 quality_analyzer.py output/video.srt

# Verbose output for debugging
python3 align.py --audio input/video.wav --script input/script.txt --verbose
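
As a rough picture of what a character limit such as --max-chars 30 implies, the sketch below greedily packs whole words into chunks. It is an assumption-laden illustration: split_caption is a hypothetical helper, and the real logic in srt_writer.py (for example group_words() and any handling of ARABIC_PARTICLES) may differ. It also ignores how timing is redistributed across the resulting chunks.

```python
from typing import List


def split_caption(text: str, max_chars: int = 30) -> List[str]:
    """Greedily pack whole words into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # A single word longer than max_chars is kept whole rather than cut
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Because len() counts code points, Arabic and Latin words are treated the same way here; combining diacritics would each count as one character.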

Batch Processing

# Process all matched pairs in directory
python3 align.py --batch --input-dir input/ --output-dir output/

# Input structure:
# input/
# ├── video_01.mp3 ↔ video_01.txt
# ├── video_02.wav ↔ video_02.txt
# └── video_03.m4a ↔ video_03.txt
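
The filename matching shown above can be sketched as pairing each audio file with a .txt file that shares its stem. This is a minimal sketch, not the code in batch.py: match_pairs and AUDIO_EXTENSIONS are hypothetical names, and the extension set is taken from the Supported Formats list.

```python
from pathlib import Path
from typing import List, Tuple

# Assumed from the Supported Formats section of this README
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".aac"}


def match_pairs(input_dir: str) -> List[Tuple[Path, Path]]:
    """Return (audio, script) pairs whose filenames share the same stem."""
    root = Path(input_dir)
    pairs: List[Tuple[Path, Path]] = []
    for audio in sorted(root.iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        script = audio.with_suffix(".txt")
        # Audio files without a matching script are skipped silently here
        if script.is_file():
            pairs.append((audio, script))
    return pairs
```

A real batch runner would probably also report unmatched files instead of skipping them.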

Expected Workflow

  1. Record voiceover from your written script
  2. Export audio as MP3/WAV from your video editor
  3. Save script as UTF-8 text file with same filename
  4. Run alignment: python3 align.py --audio video.mp3 --script video.txt
  5. Import SRT directly into CapCut - captions appear accurately timed

No manual timestamping. No CapCut caption editing. Just perfect alignment.


Supported Formats

  • Audio: MP3, WAV, M4A, AAC
  • Text: UTF-8 encoded plain text
  • Output: CapCut-compatible SRT files
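
For readers who want to see what UTF-8 output with CRLF line endings looks like in practice, here is a minimal sketch of an SRT writer. The function names and the (start_ms, end_ms, text) tuple layout are assumptions, not the API of srt_writer.py, and writing without a byte-order mark is also an assumption.

```python
from typing import List, Tuple

Caption = Tuple[int, int, str]  # hypothetical: (start_ms, end_ms, text)


def format_timestamp(ms: int) -> str:
    """Convert milliseconds to the SRT HH:MM:SS,mmm form."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def render_srt(captions: List[Caption]) -> str:
    """Build the SRT text with explicit CRLF line endings."""
    blocks = []
    for index, (start, end, text) in enumerate(captions, start=1):
        blocks.append(
            f"{index}\r\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\r\n"
            f"{text}\r\n"
        )
    # A blank CRLF line separates consecutive caption blocks
    return "\r\n".join(blocks)


def write_srt(path: str, captions: List[Caption]) -> None:
    # newline="" stops Python from translating the explicit \r\n sequences
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(render_srt(captions))
```

Passing newline="" matters on every platform: without it, Python's text mode could rewrite the line endings on write.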

Quality Features

  • Gap Elimination: Closes gaps between consecutive captions for seamless playback
  • Character Limits: Auto-splits long captions at word boundaries
  • Word-Level Mode: More precise timing for fast-spoken segments
  • Confidence Warnings: Alerts for low-quality alignments
  • Validation Checks: Prevents common audio/script mismatches
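
One plausible way to close gaps is to extend each caption's end time to the start of the next caption. The sketch below assumes that approach; close_gaps and the max_gap_ms threshold are hypothetical, and the tool's actual gap correction may use different rules.

```python
from typing import List, Tuple

Caption = Tuple[int, int, str]  # hypothetical: (start_ms, end_ms, text)


def close_gaps(captions: List[Caption], max_gap_ms: int = 500) -> List[Caption]:
    """Extend a caption's end to the next caption's start when the gap is small."""
    closed: List[Caption] = []
    for i, (start, end, text) in enumerate(captions):
        if i + 1 < len(captions):
            next_start = captions[i + 1][0]
            gap = next_start - end
            # Long pauses are left alone so captions do not linger over silence
            if 0 < gap <= max_gap_ms:
                end = next_start
        closed.append((start, end, text))
    return closed
```

The threshold is a design choice: closing every gap regardless of length would keep a caption on screen through long silences.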

Documentation

Detailed guides for each module are in the docs/ directory: CLI.md, ALIGNER.md, SRT_WRITER.md, CONFIG.md, BATCH.md, NORMALIZE.md, VALIDATOR.md, and TROUBLESHOOTING.md.


Troubleshooting

First Run

  • Model download (~1GB) happens automatically on first use
  • Requires stable internet connection and ~2GB free disk space

Common Issues

  • Arabic text garbled: Ensure script file is UTF-8 encoded
  • Captions too early/late: Use --offset flag to adjust timing
  • Low alignment confidence: Word-level alignment is the default and usually the most precise; if you ran with --sentence-level, rerun without it
  • Import fails in CapCut: Generated SRT files use the correct formatting (UTF-8, CRLF) automatically

See TROUBLESHOOTING.md for detailed solutions.
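
To check the UTF-8 requirement before running an alignment, a script file can be decoded strictly and rejected with a clear message. This is a sketch under assumptions: read_script_utf8 is a hypothetical helper, not part of validator.py, and accepting a leading byte-order mark is a choice made here.

```python
from pathlib import Path


def read_script_utf8(path: str) -> str:
    """Read a script file, raising ValueError if it is not valid UTF-8."""
    raw = Path(path).read_bytes()
    try:
        # utf-8-sig also strips a leading byte-order mark if an editor added one
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"{path} is not valid UTF-8 (first bad byte at offset {exc.start})"
        ) from exc
```

Files saved in a legacy Arabic encoding such as Windows-1256 fail this check and need to be re-saved as UTF-8.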


Technical Details

  • Alignment Model: facebook/mms-300m (HuggingFace)
  • Alignment Library: ctc-forced-aligner
  • Audio Processing: FFmpeg normalization to 16kHz mono WAV
  • Language Support: Arabic (ara) + multilingual text
  • Platform: macOS (CPU-only, no CUDA required)
  • Line Endings: CRLF for CapCut compatibility
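
The FFmpeg normalization step can be approximated with a single ffmpeg call. The sketch below is not the code in normalize.py: the function names are hypothetical, and the 16-bit PCM sample format is an assumption. Only the 16 kHz mono WAV target comes from this README.

```python
import subprocess
from typing import List


def build_normalize_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that converts src to 16 kHz mono WAV at dst."""
    return [
        "ffmpeg",
        "-y",                  # overwrite dst if it already exists
        "-i", src,
        "-ac", "1",            # downmix to mono
        "-ar", "16000",        # resample to 16 kHz
        "-c:a", "pcm_s16le",   # assumed sample format: 16-bit PCM
        dst,
    ]


def normalize_audio(src: str, dst: str) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_normalize_command(src, dst), check=True, capture_output=True)
```

Building the command as a list, rather than a shell string, avoids quoting problems with filenames that contain spaces or Arabic characters.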

License

This tool is designed specifically for content creators producing Tunisian Arabic videos. Use responsibly and ensure you have rights to the audio content you're processing.


Contributing

This is a specialized tool built to exact specifications. For issues or feature requests related to CapCut compatibility or Tunisian Arabic handling, please document them with specific test cases.