fine v.1.0
f5bce42
# VALIDATOR
> Last updated: 2026-03-09
## Purpose
Runs pre-flight validation on the audio and script files before forced alignment. It checks that both files exist, that they are properly formatted, and that the ratio of word count to audio duration is realistic for Tunisian Arabic content.
## Function Signature
```python
def validate_inputs(audio_path: Union[str, Path], script_path: Union[str, Path]) -> Dict:
```
## Parameters
| Param | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_path | Union[str, Path] | Yes | - | Path to audio file for validation |
| script_path | Union[str, Path] | Yes | - | Path to script text file for validation |
## Returns
Dictionary with validation results and warnings:
```python
{
    "audio_duration_sec": 23.5,
    "sentence_count": 4,
    "word_count": 58,
    "warnings": ["Script may be too short for audio duration..."]
}
```
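As a rough illustration of how a `warnings` entry might be produced, the sketch below compares the speaking rate (words per second) against a lower and an upper bound. The function name `ratio_warnings` and both thresholds are hypothetical, chosen only for the example; the bounds that `validate_inputs` actually uses are not documented here.

```python
from typing import List

# Hypothetical bounds, for illustration only. The real thresholds used by
# validate_inputs are not stated in this document.
MIN_WORDS_PER_SEC = 1.0
MAX_WORDS_PER_SEC = 4.0


def ratio_warnings(word_count: int, duration_sec: float) -> List[str]:
    """Return warnings when the speaking rate falls outside the assumed bounds."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    rate = word_count / duration_sec
    warnings: List[str] = []
    if rate < MIN_WORDS_PER_SEC:
        warnings.append(
            f"Script may be too short for audio duration ({rate:.2f} words/sec)"
        )
    elif rate > MAX_WORDS_PER_SEC:
        warnings.append(
            f"Script may be too long for audio duration ({rate:.2f} words/sec)"
        )
    return warnings


print(ratio_warnings(120, 60.0))  # 2.00 words/sec, inside the assumed bounds
print(ratio_warnings(20, 60.0))   # 0.33 words/sec, below the assumed minimum
```

Returning warnings instead of raising keeps a borderline ratio from blocking alignment, which matches the way the result dictionary separates warnings from hard errors.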
## Error Handling
| Exception | Condition |
|---|---|
| FileNotFoundError | The audio or script file does not exist |
| ValueError | A file is empty, the script is not valid UTF-8, or the script has no valid content |
| RuntimeError | ffprobe fails or the audio duration cannot be determined |
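The audio-side checks could map onto these exceptions roughly as follows. This is a minimal sketch, not the module's actual code: `probe_duration` and `parse_duration` are hypothetical helper names, and the ffprobe invocation shown is one common way to print a container's duration, assumed here for illustration.

```python
import math
import subprocess
from pathlib import Path
from typing import Union


def parse_duration(ffprobe_stdout: str) -> float:
    """Convert ffprobe's printed duration into seconds, or raise RuntimeError."""
    text = ffprobe_stdout.strip()
    try:
        duration = float(text)
    except ValueError:
        # ffprobe prints "N/A" (or nothing) when it cannot read a duration
        raise RuntimeError(f"ffprobe returned no usable duration: {text!r}") from None
    if not math.isfinite(duration) or duration <= 0:
        raise RuntimeError(f"ffprobe reported an invalid duration: {duration}")
    return duration


def probe_duration(audio_path: Union[str, Path]) -> float:
    """Hypothetical helper: return the audio duration in seconds via ffprobe."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"Audio file is empty: {path}")
    cmd = [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        str(path),
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except FileNotFoundError as exc:
        # Raised by subprocess when the ffprobe binary itself is missing
        raise RuntimeError("ffprobe is not installed or not on PATH") from exc
    except subprocess.CalledProcessError as exc:
        raise RuntimeError(f"ffprobe failed: {exc.stderr.strip()}") from exc
    return parse_duration(proc.stdout)
```

Keeping the output parsing in its own function makes the failure path testable without ffprobe installed.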
## Usage Example
```python
from validator import validate_inputs
result = validate_inputs("input/video.mp3", "input/video.txt")
print(f"Duration: {result['audio_duration_sec']}s")
print(f"Sentences: {result['sentence_count']}")
for warning in result['warnings']:
    print(f"⚠️ {warning}")
```
## Known Edge Cases
- **Mixed Arabic/French script**: Word counting handles code-switching by splitting on whitespace
- **Empty lines in script**: Filtered out automatically; only non-empty lines count as sentences
- **Special characters**: Preserved as-is, no normalization or filtering applied
- **Very short audio**: Duration validation may trigger false positives for audio shorter than 5 seconds
- **Corrupted audio**: ffprobe fails with a descriptive error message
- **Non-UTF-8 script**: An explicit encoding check prevents garbled Arabic text from being processed
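
The script-side behavior described in this list can be sketched as below. `read_script_stats` is a hypothetical helper written for illustration, assuming one sentence per non-empty line and plain whitespace splitting for words; the actual implementation may differ in its details.

```python
import tempfile
from pathlib import Path
from typing import Dict, Union


def read_script_stats(script_path: Union[str, Path]) -> Dict[str, int]:
    """Hypothetical helper: count sentences and words in a UTF-8 script file."""
    path = Path(script_path)
    if not path.is_file():
        raise FileNotFoundError(f"Script file not found: {path}")
    raw = path.read_bytes()
    if not raw:
        raise ValueError(f"Script file is empty: {path}")
    try:
        # Decode strictly so non-UTF-8 input fails instead of producing garbage
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"Script is not valid UTF-8: {path}") from exc
    # Assumption: each non-empty line is one sentence; blank lines are dropped
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if not sentences:
        raise ValueError(f"Script has no valid content: {path}")
    # Whitespace splitting treats Arabic and French tokens the same way,
    # and special characters are left untouched
    word_count = sum(len(line.split()) for line in sentences)
    return {"sentence_count": len(sentences), "word_count": word_count}


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text("شنوة أحوالك؟ ça va bien\n\nنمشيو للسوق demain\n", encoding="utf-8")
    print(read_script_stats(demo))  # {'sentence_count': 2, 'word_count': 8}
```

Reading bytes first and decoding explicitly avoids depending on the platform's default encoding, which matters for Arabic text on systems where that default is not UTF-8.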
## Dependencies
- **ffprobe** (part of ffmpeg): System requirement for audio duration analysis
- **pathlib**: Built-in Python module
- **subprocess**: Built-in Python module
- **re**: Built-in Python module for text processing
- **typing**: Built-in Python module, provides `Union` and `Dict` for the function signature