Spaces:
Running
Running
VALIDATOR
Last updated: 2026-03-09
Purpose
Performs comprehensive pre-flight validation of audio and script files before forced alignment processing. Ensures files exist, are properly formatted, and have realistic word count to duration ratios for Tunisian Arabic content.
Function Signature
def validate_inputs(audio_path: Union[str, Path], script_path: Union[str, Path]) -> Dict:
Parameters
| Param | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_path | Union[str, Path] | Yes | - | Path to audio file for validation |
| script_path | Union[str, Path] | Yes | - | Path to script text file for validation |
Returns
Dictionary with validation results and warnings:
{
"audio_duration_sec": 23.5,
"sentence_count": 4,
"word_count": 58,
"warnings": ["Script may be too short for audio duration..."]
}
Error Handling
| Exception | Condition |
|---|---|
| FileNotFoundError | Audio or script file doesn't exist |
| ValueError | File is empty, script not UTF-8, or no valid content |
| RuntimeError | ffprobe fails or can't analyze audio duration |
Usage Example
from validator import validate_inputs
result = validate_inputs("input/video.mp3", "input/video.txt")
print(f"Duration: {result['audio_duration_sec']}s")
print(f"Sentences: {result['sentence_count']}")
for warning in result['warnings']:
print(f"⚠️ {warning}")
Known Edge Cases
- Mixed Arabic/French script: Word counting handles code-switching by splitting on whitespace
- Empty lines in script: Automatically filtered out, only non-empty lines count as sentences
- Special characters: Preserved as-is, no normalization or filtering applied
- Very short audio: Duration validation may trigger false positives for audio < 5 seconds
- Corrupted audio: ffprobe will fail with descriptive error message
- Non-UTF8 script: Explicit check prevents garbled Arabic text processing
Dependencies
- ffprobe (part of ffmpeg): System requirement for audio duration analysis
- pathlib: Built-in Python module
- subprocess: Built-in Python module
- re: Built-in Python module for text processing