Visual Narrator 3B - Real-Time Video Narration
Matching Premium Quality at Real-Time Speed
A specialized 3B parameter model that matches Claude-quality descriptions while enabling real-time video narration that API-based models cannot achieve.
Performance Summary
| Capability | Visual Narrator | Competitors |
|---|---|---|
| Frame Processing | 2.4ms | 2,300-3,500ms |
| Speed Advantage | โ | 976-1,449x slower |
| Descriptive Quality | 2.0 adj/desc | 2.0 adj/desc (parity) |
| Model Size | 3B parameters | 70-200B+ |
| Real-Time Capable | Yes | No |
Two Benchmark Types (Important Distinction)
Video-to-Text: Speed Benchmark
Measures how fast we process video frames into narration.
| Model | Latency | Real-Time? |
|---|---|---|
| Visual Narrator 3B | 2.4ms | Yes (400+ FPS) |
| GPT-4 Turbo | 2,344ms | No |
| Claude Opus | 3,536ms | No |
What this proves: We can narrate live video. Competitors cannot.
Text-to-Text: Quality Benchmark
Measures descriptive language richness.
| Model | Adjectives/Description |
|---|---|
| Visual Narrator 3B | 2.0 |
| Claude Sonnet 4.5 | 2.0 |
What this proves: Our language quality matches premium APIs.
Live API Demo Results (January 2026)
We built a live demo that races Visual Narrator against frontier models using real API callsโno simulation, no cherry-picking.
| Model | Live Latency | vs Visual Narrator |
|---|---|---|
| Visual Narrator | 429ms | โ |
| Claude Sonnet 4 | 4,559ms | 10.6x slower |
| Gemini 2.0 Flash | 8,048ms | 18.8x slower |
| GPT-4o | 11,873ms | 27.7x slower |
Try it yourself: Live Comparison Demo
Results from parallel API calls at the same millisecond. WebSocket endpoint available for verification.
The Unlock
We're not claiming to beat Claude on language quality. We're claiming to match their quality while running 10x+ faster in real-world API conditions.
That enables:
- Live broadcasting with real-time audio description
- Streaming accessibility at scale
- Real-time content creation
- Markets that API latency makes impossible
Sample Output
Input: Video frame of urban night scene
Visual Narrator Output:
"A sleek automobile navigates the urban landscape at night, neon lights reflecting off wet pavement as pedestrians move through crosswalks beneath glowing storefronts."
Technical Details
Model: Visual Narrator 3B - Phase 10
Parameters: 3 billion
Architecture: Vision-Language Model (VLM)
Specialization: Real-time cinematic scene description
Inference: 2.4ms on standard GPU hardware
Deployment: Local / Edge / Serverless
Verified Metrics
| Metric | Value | Source |
|---|---|---|
| Processing Speed | 2.4ms/frame | Benchmark suite |
| Semantic Accuracy | 71.6% | Evaluation protocol |
| Descriptive Quality | 2.0 adj/desc | Text-to-text benchmark |
| Real-time Capability | 400+ FPS | Calculated |
Cost Comparison (At Scale)
| Provider | Cost for 1M Videos/Month |
|---|---|
| Visual Narrator | $900 (fixed infrastructure) |
| GPT-4 Vision | ~$83,000 |
| Claude Vision | ~$252,000 |
Result: 90-280x cost advantage at scale.
Quick Start
# Clone repository
git clone https://huggingface.co/Ytgetahun/visual-narrator-llm
# Run inference
python visual_narrator_api.py --input video.mp4
Links
Methodology Notes
Speed Benchmark:
- Visual Narrator: Local GPU inference (2.4ms)
- Competitors: Cloud API round-trip (includes network latency)
- This reflects real-world deployment conditions
Quality Benchmark:
- Both models given identical text prompts
- Measured adjective density per description
- Visual Narrator tuned to match Claude's 2.0 adj/desc (optimal quality level)
Historical Context
Early benchmarks showed our model could achieve 3.62 adj/desc (+81% vs Claude's 2.0). We intentionally reduced to 2.0 after determining higher density produced "fluff" rather than quality. Claude's output level was the correct target, not something to exceed.
License
Apache 2.0 - See LICENSE file for details.
Last updated: January 2026 Replaces previous model card with verified, accurate claims