# VALIDATOR
> Last updated: 2026-03-09
## Purpose
Runs pre-flight validation on the audio and script files before forced alignment. It checks that both files exist, that the script is non-empty UTF-8 text, that ffprobe can read the audio duration, and that the word-count-to-duration ratio is realistic for Tunisian Arabic content.
## Function Signature
```python
def validate_inputs(audio_path: Union[str, Path], script_path: Union[str, Path]) -> Dict:
```
## Parameters
| Param | Type | Required | Default | Description |
|---|---|---|---|---|
| audio_path | Union[str, Path] | Yes | - | Path to audio file for validation |
| script_path | Union[str, Path] | Yes | - | Path to script text file for validation |
## Returns
Dictionary with validation results and warnings:
```python
{
    "audio_duration_sec": 23.5,
    "sentence_count": 4,
    "word_count": 58,
    "warnings": ["Script may be too short for audio duration..."]
}
```
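The `warnings` list appears to come from the word-count-to-duration check described under Purpose. The sketch below shows one way such a check could work. The function name `ratio_warnings` and both thresholds are hypothetical, chosen only for illustration; the validator's actual bounds are not documented here.

```python
from typing import List

# Hypothetical bounds in words per second (assumptions, not the validator's real values).
MIN_WORDS_PER_SEC = 1.0
MAX_WORDS_PER_SEC = 4.0


def ratio_warnings(word_count: int, audio_duration_sec: float) -> List[str]:
    """Return warnings when the speaking rate falls outside the assumed bounds."""
    if audio_duration_sec <= 0:
        raise ValueError("audio duration must be positive")
    rate = word_count / audio_duration_sec
    warnings: List[str] = []
    if rate < MIN_WORDS_PER_SEC:
        warnings.append(
            f"Script may be too short for audio duration ({rate:.2f} words/sec)"
        )
    elif rate > MAX_WORDS_PER_SEC:
        warnings.append(
            f"Script may be too long for audio duration ({rate:.2f} words/sec)"
        )
    return warnings
```

Returning warnings instead of raising keeps a mismatch non-fatal, which matches the result dictionary carrying a `warnings` list alongside the counts.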
## Error Handling
| Exception | Condition |
|---|---|
| FileNotFoundError | Audio or script file doesn't exist |
| ValueError | File is empty, script not UTF-8, or no valid content |
| RuntimeError | ffprobe fails or can't analyze audio duration |
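A caller may want to tell these three failure classes apart. The following is a minimal sketch, assuming a hypothetical wrapper named `try_validate` that receives the validation function as an argument so the example runs without the `validator` module installed.

```python
from typing import Callable, Dict, Optional


def try_validate(
    validate: Callable[[str, str], Dict],
    audio_path: str,
    script_path: str,
) -> Optional[Dict]:
    """Call a validate_inputs-style function and report failures by category."""
    try:
        return validate(audio_path, script_path)
    except FileNotFoundError as exc:
        # Audio or script file does not exist.
        print(f"Missing file: {exc}")
    except ValueError as exc:
        # Empty file, non-UTF-8 script, or no valid content.
        print(f"Invalid content: {exc}")
    except RuntimeError as exc:
        # ffprobe failed or the duration could not be determined.
        print(f"Audio analysis failed: {exc}")
    return None
```

In real use, `validate` would be `validate_inputs` itself, for example `try_validate(validate_inputs, "input/video.mp3", "input/video.txt")`.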
## Usage Example
```python
from validator import validate_inputs
result = validate_inputs("input/video.mp3", "input/video.txt")
print(f"Duration: {result['audio_duration_sec']}s")
print(f"Sentences: {result['sentence_count']}")
for warning in result['warnings']:
    print(f"⚠️ {warning}")
```
## Known Edge Cases
- **Mixed Arabic/French script**: Word counting handles code-switching by splitting on whitespace
- **Empty lines in script**: Automatically filtered out, only non-empty lines count as sentences
- **Special characters**: Preserved as-is, no normalization or filtering applied
- **Very short audio**: The duration check may raise spurious warnings for audio shorter than 5 seconds
- **Corrupted audio**: ffprobe will fail with descriptive error message
- **Non-UTF8 script**: Explicit check prevents garbled Arabic text processing
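The script-related cases above (whitespace splitting, empty-line filtering, and the explicit UTF-8 check) can be sketched as follows. The helper name `count_script` is hypothetical, and this is an illustration of the described behavior, not the validator's actual code.

```python
import re
from pathlib import Path
from typing import Dict, Union


def count_script(script_path: Union[str, Path]) -> Dict[str, int]:
    """Count non-empty lines as sentences and whitespace-separated tokens as words."""
    raw = Path(script_path).read_bytes()
    try:
        # Decode explicitly so a wrong encoding fails loudly instead of garbling Arabic text.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"Script is not valid UTF-8: {script_path}") from exc

    # Keep only lines that contain something other than whitespace.
    sentences = [line.strip() for line in text.splitlines() if line.strip()]
    if not sentences:
        raise ValueError(f"Script has no valid content: {script_path}")

    # Splitting on runs of whitespace treats Arabic and French tokens alike;
    # special characters are left untouched.
    word_count = sum(len(re.split(r"\s+", s)) for s in sentences)
    return {"sentence_count": len(sentences), "word_count": word_count}
```

Because tokens are split only on whitespace, a punctuation mark attached to a word stays part of that word, while a free-standing one counts as a token of its own.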
## Dependencies
- **ffprobe** (part of ffmpeg): System requirement for audio duration analysis
- **pathlib**: Built-in Python module
- **subprocess**: Built-in Python module
- **re**: Built-in Python module for text processing
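
For reference, one common way to read a duration with ffprobe through `subprocess` is sketched below. The function names are hypothetical and the validator may invoke ffprobe differently; the flags shown are standard ffprobe options that print only the container duration in seconds.

```python
import subprocess
from pathlib import Path
from typing import List, Union


def ffprobe_command(audio_path: Union[str, Path]) -> List[str]:
    """Build an ffprobe invocation that prints only the duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        str(audio_path),
    ]


def parse_duration(stdout: str) -> float:
    """Convert ffprobe's text output to a float, rejecting non-numeric output."""
    value = stdout.strip()
    try:
        return float(value)
    except ValueError as exc:
        raise RuntimeError(f"Could not parse audio duration from {value!r}") from exc


def probe_duration(audio_path: Union[str, Path]) -> float:
    """Run ffprobe and return the duration; any failure becomes a RuntimeError."""
    try:
        proc = subprocess.run(
            ffprobe_command(audio_path),
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        # OSError covers a missing ffprobe binary; CalledProcessError covers
        # a non-zero exit, for example on corrupted audio.
        raise RuntimeError(f"ffprobe failed for {audio_path}: {exc}") from exc
    return parse_duration(proc.stdout)
```

Keeping command construction and output parsing in separate functions lets both be tested without ffprobe installed.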