# Visual Narrator CLI

`visual-narrator` installs the `vn` command for generating audio description text from local video files, image frames, or YouTube URLs using the live Visual Narrator Frame Description API.

## Install

```bash
pip install visual-narrator
```

For local development from this repository:

```bash
cd cli
pip install -e .
```

The CLI uses `ffmpeg` for video frame extraction. If it is not already available, install it separately, for example with Homebrew on macOS:

```bash
brew install ffmpeg
```

## Authentication

Pass an API key with `--api-key` or set `VN_API_KEY`:

```bash
export VN_API_KEY=vn_live_your_key
```

Create a free-tier key:

```bash
vn keys create dev@example.com
```

For education-focused gap narration, `vn edu` calls GPT-4o directly and does not use `VN_API_KEY`. Instead, set a Deepgram key for gap detection and an OpenAI key for GPT-4o:

```bash
export DEEPGRAM_API_KEY=dg_your_key
export OPENAI_API_KEY=sk-your-key
```

## Describe A Video

Local video:

```bash
vn describe ./demo.mp4 --api-key "$VN_API_KEY" --format json
```

YouTube URL:

```bash
vn describe "https://youtube.com/watch?v=VIDEO_ID" --api-key "$VN_API_KEY" --format srt
```

Single image frame:

```bash
vn describe /tmp/vn-test.jpg --api-key "$VN_API_KEY" --format text
```

Sampling defaults to one frame every three seconds. Change the sampling rate with `--fps`:

```bash
vn describe ./demo.mp4 --fps 1 --format json
```
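
To make the sampling rate concrete, the sketch below lists the timestamps a given rate would sample. It is an illustration only: `frame_timestamps` is a hypothetical helper, not part of the CLI, and it assumes that `--fps` means frames per second and that sampling starts at 0 seconds.

```python
import math


def frame_timestamps(duration_sec: float, fps: float) -> list[float]:
    """Return sample times in seconds, starting at 0 and spaced 1/fps apart.

    Hypothetical helper: the CLI's real sampling logic may differ.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    # Number of samples that fall strictly before the end of the video.
    count = math.ceil(duration_sec * fps)
    return [round(i / fps, 3) for i in range(count)]


# One frame every two seconds over a 5-second clip.
print(frame_timestamps(5, 0.5))  # [0.0, 2.0, 4.0]
# One frame per second over a 3-second clip.
print(frame_timestamps(3, 1))  # [0.0, 1.0, 2.0]
```

Under the same assumption, the default of one frame every three seconds corresponds to a rate of roughly 0.33 frames per second.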

Override the API base URL:

```bash
vn describe ./demo.mp4 --api-url http://localhost:3000 --api-key "$VN_API_KEY"
```

## Detect Narration Gaps

Find silence windows and dialogue breaks where narration can fit:

```bash
vn gaps ./demo.mp4 --format json
```

YouTube URLs use the same download path as `vn describe`:

```bash
vn gaps "https://youtube.com/watch?v=VIDEO_ID" --format text --min-gap 2.5
```

Choose the Whisper model with `--whisper-model` when you want a faster or more accurate pass:

```bash
vn gaps ./demo.mp4 --whisper-model base --format srt
```

Output formats:

```bash
vn gaps ./demo.mp4 --format json
vn gaps ./demo.mp4 --format text
vn gaps ./demo.mp4 --format srt
```

JSON output returns objects with `start_sec`, `end_sec`, `duration_sec`, and `gap_type`.
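
As a rough sketch of how a script might consume this output, the example below filters gap objects by length. It assumes the JSON is an array of objects with the four documented fields. The sample data is made up for illustration, and `usable_gaps` is a hypothetical helper rather than part of the CLI.

```python
import json

# Illustrative sample only; real data would come from `vn gaps ... --format json`.
sample = """
[
  {"start_sec": 42.0, "end_sec": 46.5, "duration_sec": 4.5, "gap_type": "music_only"},
  {"start_sec": 12.4, "end_sec": 16.1, "duration_sec": 3.7, "gap_type": "silence"},
  {"start_sec": 60.0, "end_sec": 61.5, "duration_sec": 1.5, "gap_type": "silence"}
]
"""


def usable_gaps(gaps: list[dict], min_duration_sec: float) -> list[dict]:
    """Keep gaps at least min_duration_sec long, ordered by start time."""
    kept = [g for g in gaps if g["duration_sec"] >= min_duration_sec]
    return sorted(kept, key=lambda g: g["start_sec"])


gaps = json.loads(sample)
for gap in usable_gaps(gaps, min_duration_sec=2.5):
    print(f'{gap["start_sec"]:>7.2f}s  {gap["duration_sec"]:.1f}s  {gap["gap_type"]}')
```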

## Compliance Reports

Score a video against WCAG/CVAA audio-description checks using the same gap detector:

```bash
vn compliance ./demo.mp4 --format json
```

Text output is available for quick terminal review:

```bash
vn compliance ./demo.mp4 --format text --min-gap 3.0
```

Choose the Whisper model the same way as `vn gaps`:

```bash
vn compliance ./demo.mp4 --whisper-model base --format json
```

JSON output returns:

```json
{
  "score": 67,
  "wcag_level": "A",
  "criteria": {
    "wcag_1_2_3": {"passed": true},
    "wcag_1_2_5": {"passed": false},
    "cvaa_audio_description": {"passed": true}
  },
  "gaps": [],
  "recommendations": []
}
```

Coverage is the combined duration of silence and music-only gaps divided by the total video duration. Recommendations are capped at 10 narration opportunities.
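
The coverage ratio can be sketched in a few lines. This is an illustration under stated assumptions, not the CLI's implementation: `coverage_ratio` and `cap_recommendations` are hypothetical names, the `"other"` gap type exists only to show exclusion, and because the source does not say how the 10 recommendations are chosen, this sketch simply keeps the first 10.

```python
COUNTED_GAP_TYPES = {"silence", "music_only"}
MAX_RECOMMENDATIONS = 10


def coverage_ratio(gaps: list[dict], video_duration_sec: float) -> float:
    """Silence plus music-only gap time divided by total video duration."""
    if video_duration_sec <= 0:
        raise ValueError("video_duration_sec must be positive")
    counted = sum(
        g["duration_sec"] for g in gaps if g["gap_type"] in COUNTED_GAP_TYPES
    )
    return counted / video_duration_sec


def cap_recommendations(gaps: list[dict]) -> list[dict]:
    """Keep at most 10 gaps (assumption: in their existing order)."""
    return gaps[:MAX_RECOMMENDATIONS]


example_gaps = [
    {"duration_sec": 3.0, "gap_type": "silence"},
    {"duration_sec": 2.0, "gap_type": "music_only"},
    {"duration_sec": 5.0, "gap_type": "other"},  # hypothetical type, not counted
]
print(coverage_ratio(example_gaps, video_duration_sec=100.0))  # 0.05
```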

## Festival Film Accessibility Kit

Generate a complete gap-targeted narration package in one command:

```bash
vn kit ./demo.mp4 --api-key "$VN_API_KEY" --format json
```

The `kit` command:

- detects narration gaps with Deepgram + ffmpeg silence detection
- extracts one frame at each gap midpoint
- sends those frames to the live Visual Narrator API
- returns narration text, SRT timing, compliance scoring, and cost totals
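
The midpoint step in the list above can be sketched as follows. `plan_frames` is a hypothetical helper that only computes timing fields; it does not extract frames or call any API, and numbering `srt_index` from 1 in list order is an assumption.

```python
def plan_frames(gaps: list[dict]) -> list[dict]:
    """For each gap, compute its duration and the midpoint to sample a frame at."""
    plan = []
    for index, gap in enumerate(gaps, start=1):
        start, end = gap["start_sec"], gap["end_sec"]
        plan.append(
            {
                "start_sec": start,
                "end_sec": end,
                "gap_duration_sec": round(end - start, 3),
                # The frame is taken halfway through the gap.
                "frame_timestamp_sec": round((start + end) / 2, 3),
                "srt_index": index,  # assumption: numbered from 1 in order
            }
        )
    return plan


plan = plan_frames([{"start_sec": 12.4, "end_sec": 16.1}])
print(plan[0]["frame_timestamp_sec"])  # 14.25
print(plan[0]["gap_duration_sec"])  # 3.7
```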

Output formats:

```bash
vn kit ./demo.mp4 --format json
vn kit ./demo.mp4 --format srt
vn kit ./demo.mp4 --format text
```

Tune gap sensitivity with `--min-gap`:

```bash
vn kit ./demo.mp4 --min-gap 3.0 --format text
```

YouTube URLs use the same download path as `vn describe`:

```bash
vn kit "https://youtube.com/watch?v=VIDEO_ID" --format srt --api-key "$VN_API_KEY"
```

JSON output includes:

```json
{
  "source": "./demo.mp4",
  "duration_seconds": 5421.4,
  "gaps_found": 6,
  "narrations": [
    {
      "start_sec": 12.4,
      "end_sec": 16.1,
      "gap_duration_sec": 3.7,
      "gap_type": "silence",
      "frame_timestamp_sec": 14.25,
      "description": "A wide shot shows the ship drifting past Saturn.",
      "cost_estimate": 0.0012,
      "srt_index": 1
    }
  ],
  "compliance": {
    "score": 67,
    "wcag_level": "A"
  },
  "model_version": "visual-narrator-gpt4o-v1",
  "cost_estimate": 0.0072
}
```
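
For readers who want to post-process the JSON themselves, here is a minimal sketch that turns `narrations` entries into SubRip (SRT) cues. `srt_timestamp` and `narrations_to_srt` are hypothetical helpers, and the output of `--format srt` may differ in detail from what they produce.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def narrations_to_srt(narrations: list[dict]) -> str:
    """Render each narration as an SRT cue: index, time range, text."""
    cues = []
    for n in narrations:
        time_range = f'{srt_timestamp(n["start_sec"])} --> {srt_timestamp(n["end_sec"])}'
        cues.append(f'{n["srt_index"]}\n{time_range}\n{n["description"]}\n')
    # Cues are separated by a blank line.
    return "\n".join(cues)


narrations = [
    {
        "srt_index": 1,
        "start_sec": 12.4,
        "end_sec": 16.1,
        "description": "A wide shot shows the ship drifting past Saturn.",
    }
]
print(narrations_to_srt(narrations))
```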

## Educational Video Describer

Generate accessibility narration only for frames that add educational value, such as slides, diagrams, equations, code, and charts:

```bash
vn edu ./lecture.mp4 --format json
```

The `edu` command:

- detects narration gaps with Deepgram Nova-3 plus ffmpeg silence detection
- extracts one frame at each gap midpoint
- sends each frame directly to GPT-4o with an education-specific prompt
- filters out talking-head frames when the model returns `NO_VISUAL_AID`
- keeps WCAG/CVAA compliance scoring based on the full detected gap list
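
The filtering step in the list above can be illustrated with a small sketch. It makes no model calls: the responses are passed in as plain strings, and `split_visual_moments` is a hypothetical helper whose summary keys mirror the field names in the `edu` JSON output.

```python
NO_VISUAL_AID = "NO_VISUAL_AID"


def split_visual_moments(responses: list[tuple[dict, str]]) -> dict:
    """Drop gaps whose model response is the NO_VISUAL_AID sentinel.

    Each item pairs a gap dict with the text returned for its midpoint frame.
    """
    narrations = []
    skipped = 0
    for gap, text in responses:
        if text.strip() == NO_VISUAL_AID:
            skipped += 1  # talking-head frame, nothing to narrate
            continue
        narrations.append(
            {**gap, "description": text.strip(), "visual_moment": True}
        )
    return {
        "gaps_analyzed": len(responses),
        "visual_moments": len(narrations),
        "skipped_talking_head": skipped,
        "narrations": narrations,
    }


# Made-up responses for illustration.
responses = [
    ({"start_sec": 42.0, "end_sec": 46.5}, "A slide compares precision and recall."),
    ({"start_sec": 90.0, "end_sec": 93.0}, "NO_VISUAL_AID"),
]
summary = split_visual_moments(responses)
print(summary["visual_moments"], summary["skipped_talking_head"])  # 1 1
```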

Output formats:

```bash
vn edu ./lecture.mp4 --format json
vn edu ./lecture.mp4 --format srt
vn edu ./lecture.mp4 --format text
```

Tune gap sensitivity with `--min-gap`:

```bash
vn edu ./lecture.mp4 --min-gap 3.0 --format text
```

YouTube URLs are supported:

```bash
vn edu "https://youtube.com/watch?v=VIDEO_ID" --format srt
```

JSON output includes:

```json
{
  "source": "./lecture.mp4",
  "duration_seconds": 1800.0,
  "gaps_analyzed": 12,
  "visual_moments": 7,
  "skipped_talking_head": 5,
  "narrations": [
    {
      "start_sec": 42.0,
      "end_sec": 46.5,
      "gap_duration_sec": 4.5,
      "gap_type": "music_only",
      "frame_timestamp_sec": 44.25,
      "description": "A slide compares precision and recall with a 2x2 confusion matrix and labels true positives, false positives, false negatives, and true negatives.",
      "cost_estimate": 0.0013,
      "srt_index": 1,
      "visual_moment": true
    }
  ],
  "compliance": {
    "score": 100,
    "wcag_level": "AA"
  },
  "model_version": "gpt-4o",
  "cost_estimate": 0.0091
}
```

## Output Formats

JSON:

```bash
vn describe ./demo.mp4 --format json
```

Returns:

```json
[
  {
    "timecode": "00:00:00.000",
    "description": "A person walks through a city street.",
    "objects_detected": [{"label": "Person", "confidence": 98.7}],
    "latency_ms": 241
  }
]
```
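
A script reading this array might convert timecodes to seconds and keep only high-confidence labels, as in the sketch below. Both helpers are hypothetical. The sketch assumes `timecode` is always `HH:MM:SS.mmm` and that `confidence` is a percentage, which is what the sample suggests but the source does not state.

```python
def timecode_to_seconds(timecode: str) -> float:
    """Convert an HH:MM:SS.mmm timecode to seconds."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def confident_labels(frame: dict, min_confidence: float) -> list[str]:
    """Return labels whose confidence is at least min_confidence."""
    return [
        obj["label"]
        for obj in frame["objects_detected"]
        if obj["confidence"] >= min_confidence
    ]


frame = {
    "timecode": "00:00:00.000",
    "description": "A person walks through a city street.",
    "objects_detected": [{"label": "Person", "confidence": 98.7}],
    "latency_ms": 241,
}
print(timecode_to_seconds(frame["timecode"]))  # 0.0
print(confident_labels(frame, min_confidence=90.0))  # ['Person']
```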

SRT:

```bash
vn describe ./demo.mp4 --format srt
```

Text:

```bash
vn describe ./demo.mp4 --format text
```

## Benchmark

Run a single-frame benchmark comparing Visual Narrator, GPT-4o, and Gemini:

```bash
vn benchmark /tmp/vn-test.jpg
```

Use a custom API deployment:

```bash
vn benchmark /tmp/vn-test.jpg --api-url http://localhost:3000
```

## Commands

```bash
vn --help
vn describe --help
vn kit --help
vn edu --help
vn gaps --help
vn compliance --help
vn benchmark --help
vn keys create --help
```