feat(VN-008): Educational Video Describer — education-specific GPT-4o prompt with NO_VISUAL_AID filtering

5179c8a 25 days ago

6.73 kB

	# Visual Narrator CLI

	`visual-narrator` installs the `vn` command for generating audio description text from local video files, image frames, or YouTube URLs using the live Visual Narrator Frame Description API.

	## Install

	```bash
	pip install visual-narrator
	```

	For local development from this repository:

	```bash
	cd cli
	pip install -e .
	```

	The CLI uses `ffmpeg` for video frame extraction. Install it separately if it is not already available:

	```bash
	brew install ffmpeg
	```

	## Authentication

	Pass an API key with `--api-key` or set `VN_API_KEY`:

	```bash
	export VN_API_KEY=vn_live_your_key
	```

	Create a free-tier key:

	```bash
	vn keys create dev@example.com
	```

	For education-focused gap narration, `vn edu` calls GPT-4o directly and does not use `VN_API_KEY`. Set:

	```bash
	export DEEPGRAM_API_KEY=dg_your_key
	export OPENAI_API_KEY=sk-your-key
	```

	## Describe A Video

	Local video:

	```bash
	vn describe ./demo.mp4 --api-key "$VN_API_KEY" --format json
	```

	YouTube URL:

	```bash
	vn describe "https://youtube.com/watch?v=VIDEO_ID" --api-key "$VN_API_KEY" --format srt
	```

	Single image frame:

	```bash
	vn describe /tmp/vn-test.jpg --api-key "$VN_API_KEY" --format text
	```

	Sampling defaults to one frame every three seconds. Configure it with `--fps`:

	```bash
	vn describe ./demo.mp4 --fps 1 --format json
	```

	Override the API base URL:

	```bash
	vn describe ./demo.mp4 --api-url http://localhost:3000 --api-key "$VN_API_KEY"
	```

	## Detect Narration Gaps

	Find silence windows and dialogue breaks where narration can fit:

	```bash
	vn gaps ./demo.mp4 --format json
	```

	YouTube URLs use the same download path as `vn describe`:

	```bash
	vn gaps "https://youtube.com/watch?v=VIDEO_ID" --format text --min-gap 2.5
	```

	Choose the Whisper model with `--whisper-model` when you want a faster or more accurate pass:

	```bash
	vn gaps ./demo.mp4 --whisper-model base --format srt
	```

	Output formats:

	```bash
	vn gaps ./demo.mp4 --format json
	vn gaps ./demo.mp4 --format text
	vn gaps ./demo.mp4 --format srt
	```

	JSON output returns objects with `start_sec`, `end_sec`, `duration_sec`, and `gap_type`.

	## Compliance Reports

	Score a video against WCAG/CVAA audio-description checks using the same gap detector:

	```bash
	vn compliance ./demo.mp4 --format json
	```

	Text output is available for quick terminal review:

	```bash
	vn compliance ./demo.mp4 --format text --min-gap 3.0
	```

	Choose the Whisper model the same way as `vn gaps`:

	```bash
	vn compliance ./demo.mp4 --whisper-model base --format json
	```

	JSON output returns:

	```json
	{
	"score": 67,
	"wcag_level": "A",
	"criteria": {
	"wcag_1_2_3": {"passed": true},
	"wcag_1_2_5": {"passed": false},
	"cvaa_audio_description": {"passed": true}
	},
	"gaps": [],
	"recommendations": []
	}
	```

	Coverage is calculated from silence and music-only gap duration divided by total video duration. Recommendations are capped at 10 narration opportunities.

	## Festival Film Accessibility Kit

	Generate a complete gap-targeted narration package in one command:

	```bash
	vn kit ./demo.mp4 --api-key "$VN_API_KEY" --format json
	```

	The `kit` command:

	- detects narration gaps with Deepgram + ffmpeg silence detection
	- extracts one frame at each gap midpoint
	- sends those frames to the live Visual Narrator API
	- returns narration text, SRT timing, compliance scoring, and cost totals

	Output formats:

	```bash
	vn kit ./demo.mp4 --format json
	vn kit ./demo.mp4 --format srt
	vn kit ./demo.mp4 --format text
	```

	Tune gap sensitivity with `--min-gap`:

	```bash
	vn kit ./demo.mp4 --min-gap 3.0 --format text
	```

	YouTube URLs use the same download path as `vn describe`:

	```bash
	vn kit "https://youtube.com/watch?v=VIDEO_ID" --format srt --api-key "$VN_API_KEY"
	```

	JSON output includes:

	```json
	{
	"source": "./demo.mp4",
	"duration_seconds": 5421.4,
	"gaps_found": 6,
	"narrations": [
	{
	"start_sec": 12.4,
	"end_sec": 16.1,
	"gap_duration_sec": 3.7,
	"gap_type": "silence",
	"frame_timestamp_sec": 14.25,
	"description": "A wide shot shows the ship drifting past Saturn.",
	"cost_estimate": 0.0012,
	"srt_index": 1
	}
	],
	"compliance": {
	"score": 67,
	"wcag_level": "A"
	},
	"model_version": "visual-narrator-gpt4o-v1",
	"cost_estimate": 0.0072
	}
	```

	## Educational Video Describer

	Generate accessibility narration only for frames that add educational value such as slides, diagrams, equations, code, and charts:

	```bash
	vn edu ./lecture.mp4 --format json
	```

	The `edu` command:

	- detects narration gaps with Deepgram Nova-3 plus ffmpeg silence detection
	- extracts one frame at each gap midpoint
	- sends each frame directly to GPT-4o with an education-specific prompt
	- filters out talking-head frames when the model returns `NO_VISUAL_AID`
	- keeps WCAG/CVAA compliance scoring based on the full detected gap list

	Output formats:

	```bash
	vn edu ./lecture.mp4 --format json
	vn edu ./lecture.mp4 --format srt
	vn edu ./lecture.mp4 --format text
	```

	Tune gap sensitivity with `--min-gap`:

	```bash
	vn edu ./lecture.mp4 --min-gap 3.0 --format text
	```

	YouTube URLs are supported:

	```bash
	vn edu "https://youtube.com/watch?v=VIDEO_ID" --format srt
	```

	JSON output includes:

	```json
	{
	"source": "./lecture.mp4",
	"duration_seconds": 1800.0,
	"gaps_analyzed": 12,
	"visual_moments": 7,
	"skipped_talking_head": 5,
	"narrations": [
	{
	"start_sec": 42.0,
	"end_sec": 46.5,
	"gap_duration_sec": 4.5,
	"gap_type": "music_only",
	"frame_timestamp_sec": 44.25,
	"description": "A slide compares precision and recall with a 2x2 confusion matrix and labels true positives, false positives, false negatives, and true negatives.",
	"cost_estimate": 0.0013,
	"srt_index": 1,
	"visual_moment": true
	}
	],
	"compliance": {
	"score": 100,
	"wcag_level": "AA"
	},
	"model_version": "gpt-4o",
	"cost_estimate": 0.0091
	}
	```

	## Output Formats

	JSON:

	```bash
	vn describe ./demo.mp4 --format json
	```

	Returns:

	```json
	[
	{
	"timecode": "00:00:00.000",
	"description": "A person walks through a city street.",
	"objects_detected": [{"label": "Person", "confidence": 98.7}],
	"latency_ms": 241
	}
	]
	```

	SRT:

	```bash
	vn describe ./demo.mp4 --format srt
	```

	Text:

	```bash
	vn describe ./demo.mp4 --format text
	```

	## Benchmark

	Run a single-frame benchmark comparing Visual Narrator, GPT-4o, and Gemini:

	```bash
	vn benchmark /tmp/vn-test.jpg
	```

	Use a custom API deployment:

	```bash
	vn benchmark /tmp/vn-test.jpg --api-url http://localhost:3000
	```

	## Commands

	```bash
	vn --help
	vn describe --help
	vn kit --help
	vn edu --help
	vn gaps --help
	vn compliance --help
	vn benchmark --help
	vn keys create --help
	```