Visual Narrator CLI
The visual-narrator package installs the vn command, which generates audio description text from local video files, image frames, or YouTube URLs using the live Visual Narrator Frame Description API.
Install
pip install visual-narrator
For local development from this repository:
cd cli
pip install -e .
The CLI uses ffmpeg for video frame extraction. Install it separately if it is not already available, for example with Homebrew on macOS:
brew install ffmpeg
Authentication
Pass an API key with --api-key or set VN_API_KEY:
export VN_API_KEY=vn_live_your_key
Create a free-tier key:
vn keys create dev@example.com
For education-focused gap narration, vn edu calls GPT-4o directly and does not use VN_API_KEY. Set a Deepgram key (used for gap detection) and an OpenAI key instead:
export DEEPGRAM_API_KEY=dg_your_key
export OPENAI_API_KEY=sk-your-key
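A wrapper script can check that the expected keys are set before it invokes the CLI. The following is a minimal sketch: the helper name missing_keys and the command-to-variable mapping are illustrative, reflect only what this README states, and are not part of the vn package.

```python
import os

# Variables each command is described as needing in this README.
# This mapping is an illustration, not the CLI's own configuration.
REQUIRED_KEYS = {
    "describe": ["VN_API_KEY"],
    "edu": ["DEEPGRAM_API_KEY", "OPENAI_API_KEY"],
}


def missing_keys(command, env=None):
    """Return the required variables that are unset or empty for a command."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS.get(command, []) if not env.get(name)]


for command in REQUIRED_KEYS:
    missing = missing_keys(command)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"vn {command}: {status}")
```

Passing a plain dictionary as env makes the check easy to exercise without touching the real environment.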
Describe a Video
Local video:
vn describe ./demo.mp4 --api-key "$VN_API_KEY" --format json
YouTube URL:
vn describe "https://youtube.com/watch?v=VIDEO_ID" --api-key "$VN_API_KEY" --format srt
Single image frame:
vn describe /tmp/vn-test.jpg --api-key "$VN_API_KEY" --format text
Sampling defaults to one frame every three seconds. Configure it with --fps:
vn describe ./demo.mp4 --fps 1 --format json
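To estimate how many frames a given --fps value produces, the sampling rule can be sketched as below. This assumes --fps means frames per second, that the default of one frame every three seconds corresponds to a rate of 1/3, and that sampling starts at time zero; the CLI's exact frame selection may differ.

```python
def sample_timestamps(duration_sec, fps=1 / 3):
    """Timestamps (in seconds) of evenly spaced frames, starting at 0.

    Assumption: frames are taken at 0, 1/fps, 2/fps, ... up to the duration.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    interval = 1.0 / fps
    # Small epsilon guards against floating-point error at exact multiples.
    count = int(duration_sec * fps + 1e-9) + 1
    return [round(i * interval, 3) for i in range(count)]


print(sample_timestamps(10))          # default rate: one frame every 3 s
print(len(sample_timestamps(60, 1)))  # one frame per second over a minute
```

Each sampled frame is sent to the API, so a higher --fps increases the number of requests roughly in proportion.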
Override the API base URL:
vn describe ./demo.mp4 --api-url http://localhost:3000 --api-key "$VN_API_KEY"
Detect Narration Gaps
Find silence windows and dialogue breaks where narration can fit:
vn gaps ./demo.mp4 --format json
YouTube URLs use the same download path as vn describe:
vn gaps "https://youtube.com/watch?v=VIDEO_ID" --format text --min-gap 2.5
Choose the Whisper model with --whisper-model when you want a faster or more accurate pass:
vn gaps ./demo.mp4 --whisper-model base --format srt
Output formats:
vn gaps ./demo.mp4 --format json
vn gaps ./demo.mp4 --format text
vn gaps ./demo.mp4 --format srt
JSON output returns objects with start_sec, end_sec, duration_sec, and gap_type.
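The gap objects can be post-processed with a few lines of Python. This sketch assumes the JSON output is a top-level array of gap objects; the sample data and the helper name summarize_gaps are hypothetical.

```python
import json

# Hypothetical sample shaped like the fields listed above.
SAMPLE = """[
  {"start_sec": 12.4, "end_sec": 16.1, "duration_sec": 3.7, "gap_type": "silence"},
  {"start_sec": 42.0, "end_sec": 46.5, "duration_sec": 4.5, "gap_type": "music_only"},
  {"start_sec": 60.0, "end_sec": 61.0, "duration_sec": 1.0, "gap_type": "silence"}
]"""


def summarize_gaps(gaps_json, min_duration=0.0):
    """Keep gaps of at least min_duration seconds and total them per gap_type."""
    gaps = json.loads(gaps_json)
    kept = [g for g in gaps if g["duration_sec"] >= min_duration]
    totals = {}
    for gap in kept:
        totals[gap["gap_type"]] = totals.get(gap["gap_type"], 0.0) + gap["duration_sec"]
    return kept, totals


kept, totals = summarize_gaps(SAMPLE, min_duration=2.0)
print(len(kept), totals)
```

In practice the input would come from a file written with vn gaps ./demo.mp4 --format json rather than from an inline string.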
Compliance Reports
Score a video against WCAG/CVAA audio-description checks using the same gap detector as vn gaps:
vn compliance ./demo.mp4 --format json
Text output is available for quick terminal review:
vn compliance ./demo.mp4 --format text --min-gap 3.0
Choose the Whisper model the same way as vn gaps:
vn compliance ./demo.mp4 --whisper-model base --format json
JSON output returns:
{
  "score": 67,
  "wcag_level": "A",
  "criteria": {
    "wcag_1_2_3": {"passed": true},
    "wcag_1_2_5": {"passed": false},
    "cvaa_audio_description": {"passed": true}
  },
  "gaps": [],
  "recommendations": []
}
Coverage is the combined duration of silence and music-only gaps divided by the total video duration. Recommendations are capped at 10 narration opportunities.
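The coverage ratio and the recommendation cap can be sketched as follows. This is an illustration under stated assumptions, not the CLI's implementation: the gap_type strings "silence" and "music_only" are taken from the sample outputs in this README, "dialogue_break" is a hypothetical third type, and ranking recommendations by longest gap first is a guess.

```python
def coverage(gaps, total_duration_sec, counted_types=("silence", "music_only")):
    """Fraction of the video covered by silence and music-only gaps."""
    if total_duration_sec <= 0:
        raise ValueError("total_duration_sec must be positive")
    gap_time = sum(g["duration_sec"] for g in gaps if g["gap_type"] in counted_types)
    return gap_time / total_duration_sec


def top_recommendations(gaps, limit=10):
    """Return at most `limit` gaps, longest first (the ordering is an assumption)."""
    return sorted(gaps, key=lambda g: g["duration_sec"], reverse=True)[:limit]


gaps = [
    {"gap_type": "silence", "duration_sec": 30.0},
    {"gap_type": "music_only", "duration_sec": 15.0},
    {"gap_type": "dialogue_break", "duration_sec": 5.0},  # hypothetical type, not counted
]
print(coverage(gaps, 180.0))  # 45 s of counted gaps in a 180 s video
```

How this ratio maps to the reported score and wcag_level is not specified here, so the sketch stops at the ratio itself.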
Festival Film Accessibility Kit
Generate a complete gap-targeted narration package in one command:
vn kit ./demo.mp4 --api-key "$VN_API_KEY" --format json
The kit command:
- detects narration gaps with Deepgram + ffmpeg silence detection
- extracts one frame at each gap midpoint
- sends those frames to the live Visual Narrator API
- returns narration text, SRT timing, compliance scoring, and cost totals
Output formats:
vn kit ./demo.mp4 --format json
vn kit ./demo.mp4 --format srt
vn kit ./demo.mp4 --format text
Tune gap sensitivity with --min-gap:
vn kit ./demo.mp4 --min-gap 3.0 --format text
YouTube URLs use the same download path as vn describe:
vn kit "https://youtube.com/watch?v=VIDEO_ID" --format srt --api-key "$VN_API_KEY"
JSON output includes:
{
  "source": "./demo.mp4",
  "duration_seconds": 5421.4,
  "gaps_found": 6,
  "narrations": [
    {
      "start_sec": 12.4,
      "end_sec": 16.1,
      "gap_duration_sec": 3.7,
      "gap_type": "silence",
      "frame_timestamp_sec": 14.25,
      "description": "A wide shot shows the ship drifting past Saturn.",
      "cost_estimate": 0.0012,
      "srt_index": 1
    }
  ],
  "compliance": {
    "score": 67,
    "wcag_level": "A"
  },
  "model_version": "visual-narrator-gpt4o-v1",
  "cost_estimate": 0.0072
}
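The narrations array carries everything needed to build subtitle cues. As a rough sketch, the fields above can be turned into SRT text like this; the helper names are hypothetical, and the CLI's own --format srt output may differ in detail.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def narrations_to_srt(narrations):
    """Render narration objects as SRT cues separated by blank lines."""
    cues = []
    for n in narrations:
        timing = f"{srt_timestamp(n['start_sec'])} --> {srt_timestamp(n['end_sec'])}"
        cues.append(f"{n['srt_index']}\n{timing}\n{n['description']}\n")
    return "\n".join(cues)


narrations = [{
    "start_sec": 12.4,
    "end_sec": 16.1,
    "description": "A wide shot shows the ship drifting past Saturn.",
    "srt_index": 1,
}]
print(narrations_to_srt(narrations))
```

Working in integer milliseconds avoids rounding drift when seconds values such as 12.4 are not exactly representable as floats.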
Educational Video Describer
Generate accessibility narration only for frames that add educational value, such as slides, diagrams, equations, code, and charts:
vn edu ./lecture.mp4 --format json
The edu command:
- detects narration gaps with Deepgram Nova-3 plus ffmpeg silence detection
- extracts one frame at each gap midpoint
- sends each frame directly to GPT-4o with an education-specific prompt
- filters out talking-head frames when the model returns NO_VISUAL_AID
- keeps WCAG/CVAA compliance scoring based on the full detected gap list
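The midpoint and filtering steps listed above can be sketched in a few lines. This is an illustration, not the edu command's code: describe_frame is a hypothetical callable standing in for the GPT-4o request, and treating the sentinel as an exact match after trimming whitespace is an assumption.

```python
NO_VISUAL_AID = "NO_VISUAL_AID"


def gap_midpoint(gap):
    """Timestamp halfway through a gap, where one frame is extracted."""
    return (gap["start_sec"] + gap["end_sec"]) / 2


def select_visual_moments(gaps, describe_frame):
    """Describe each gap's midpoint frame and drop talking-head frames.

    describe_frame(timestamp_sec) -> str is a hypothetical stand-in for
    the model call; it returns a description or the NO_VISUAL_AID sentinel.
    """
    narrations, skipped = [], 0
    for gap in gaps:
        timestamp = gap_midpoint(gap)
        text = describe_frame(timestamp).strip()
        if text == NO_VISUAL_AID:
            skipped += 1
            continue
        narrations.append({**gap, "frame_timestamp_sec": timestamp, "description": text})
    return narrations, skipped


def fake_describe(timestamp_sec):
    # Stub used only for this example: early frames count as talking heads.
    return NO_VISUAL_AID if timestamp_sec < 20 else "A slide shows a bar chart."


gaps = [{"start_sec": 12.4, "end_sec": 16.1}, {"start_sec": 42.0, "end_sec": 46.5}]
print(select_visual_moments(gaps, fake_describe))
```

Because compliance scoring uses the full gap list, the skipped gaps would still be passed to the scorer even though they produce no narration.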
Output formats:
vn edu ./lecture.mp4 --format json
vn edu ./lecture.mp4 --format srt
vn edu ./lecture.mp4 --format text
Tune gap sensitivity with --min-gap:
vn edu ./lecture.mp4 --min-gap 3.0 --format text
YouTube URLs are supported:
vn edu "https://youtube.com/watch?v=VIDEO_ID" --format srt
JSON output includes:
{
  "source": "./lecture.mp4",
  "duration_seconds": 1800.0,
  "gaps_analyzed": 12,
  "visual_moments": 7,
  "skipped_talking_head": 5,
  "narrations": [
    {
      "start_sec": 42.0,
      "end_sec": 46.5,
      "gap_duration_sec": 4.5,
      "gap_type": "music_only",
      "frame_timestamp_sec": 44.25,
      "description": "A slide compares precision and recall with a 2x2 confusion matrix and labels true positives, false positives, false negatives, and true negatives.",
      "cost_estimate": 0.0013,
      "srt_index": 1,
      "visual_moment": true
    }
  ],
  "compliance": {
    "score": 100,
    "wcag_level": "AA"
  },
  "model_version": "gpt-4o",
  "cost_estimate": 0.0091
}
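A script that consumes this report may want to sanity-check it first. The checks below are inferred from the sample values only (12 = 7 + 5, and 7 narrations at 0.0013 each give 0.0091); they are assumptions about the report, not documented guarantees, and check_edu_report is a hypothetical helper. The sample above shows a single narration for brevity, so it would not pass the count check as printed.

```python
import math


def check_edu_report(report, tol=1e-6):
    """Return a list of inconsistencies found in an edu JSON report."""
    problems = []
    expected = report["visual_moments"] + report["skipped_talking_head"]
    if report["gaps_analyzed"] != expected:
        problems.append("gaps_analyzed != visual_moments + skipped_talking_head")
    if len(report["narrations"]) != report["visual_moments"]:
        problems.append("narration count differs from visual_moments")
    total_cost = sum(n["cost_estimate"] for n in report["narrations"])
    if not math.isclose(total_cost, report["cost_estimate"], abs_tol=tol):
        problems.append("per-narration costs do not sum to cost_estimate")
    return problems


report = {
    "gaps_analyzed": 3,
    "visual_moments": 2,
    "skipped_talking_head": 1,
    "narrations": [{"cost_estimate": 0.0013}, {"cost_estimate": 0.0013}],
    "cost_estimate": 0.0026,
}
print(check_edu_report(report))
```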
Output Formats
JSON:
vn describe ./demo.mp4 --format json
Returns:
[
  {
    "timecode": "00:00:00.000",
    "description": "A person walks through a city street.",
    "objects_detected": [{"label": "Person", "confidence": 98.7}],
    "latency_ms": 241
  }
]
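For downstream processing, the timecode and detection fields can be handled as in this sketch. It assumes timecodes always use the HH:MM:SS.mmm shape shown above and that confidence is on a 0 to 100 scale, which is inferred from the sample value 98.7; both helper names are hypothetical.

```python
def timecode_to_seconds(timecode):
    """Convert 'HH:MM:SS.mmm' into seconds as a float."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def confident_labels(frame, threshold=90.0):
    """Labels of detected objects at or above the confidence threshold."""
    return [
        obj["label"]
        for obj in frame["objects_detected"]
        if obj["confidence"] >= threshold
    ]


frame = {
    "timecode": "00:00:00.000",
    "description": "A person walks through a city street.",
    "objects_detected": [{"label": "Person", "confidence": 98.7}],
    "latency_ms": 241,
}
print(timecode_to_seconds(frame["timecode"]), confident_labels(frame))
```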
SRT:
vn describe ./demo.mp4 --format srt
Text:
vn describe ./demo.mp4 --format text
Benchmark
Run a single-frame benchmark comparing Visual Narrator, GPT-4o, and Gemini:
vn benchmark /tmp/vn-test.jpg
Use a custom API deployment:
vn benchmark /tmp/vn-test.jpg --api-url http://localhost:3000
Commands
vn --help
vn describe --help
vn kit --help
vn edu --help
vn gaps --help
vn compliance --help
vn benchmark --help
vn keys create --help