Visual Narrator 3B - Real-Time Video Narration

Matching Premium Quality at Real-Time Speed

A specialized 3B-parameter model that matches Claude-level descriptive quality while enabling real-time video narration that API-based models cannot achieve.


Performance Summary

| Capability | Visual Narrator | Competitors |
|---|---|---|
| Frame Processing | 2.4ms | 2,344-3,536ms |
| Speed Advantage | baseline | ≈977-1,473x slower |
| Descriptive Quality | 2.0 adj/desc | 2.0 adj/desc (parity) |
| Model Size | 3B parameters | 70-200B+ |
| Real-Time Capable | Yes | No |

Two Benchmark Types (Important Distinction)

Video-to-Text: Speed Benchmark

Measures how fast we process video frames into narration.

| Model | Latency | Real-Time? |
|---|---|---|
| Visual Narrator 3B | 2.4ms | Yes (400+ FPS) |
| GPT-4 Turbo | 2,344ms | No |
| Claude Opus | 3,536ms | No |

What this proves: We can narrate live video. Competitors cannot.
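The per-frame latency figures above can be reproduced with a simple timing harness. A minimal sketch, assuming a hypothetical `narrate_frame` callable standing in for the model's inference function (not part of the published API):

```python
import time

def measure_frame_latency(narrate_frame, frames, warmup=3):
    """Mean per-frame narration latency in milliseconds.

    narrate_frame: hypothetical callable that turns one frame into text.
    A short warm-up run is excluded so cold-start costs don't skew the mean.
    """
    for frame in frames[:warmup]:
        narrate_frame(frame)
    start = time.perf_counter()
    for frame in frames:
        narrate_frame(frame)
    return (time.perf_counter() - start) * 1000 / len(frames)

# Stand-in model for illustration only:
latency_ms = measure_frame_latency(lambda f: f, list(range(100)))
fps = 1000 / latency_ms if latency_ms > 0 else float("inf")
```

Averaging over many frames (rather than timing a single call) is what makes a sub-3ms claim measurable at all, since single-call timer resolution is too coarse.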

Text-to-Text: Quality Benchmark

Measures descriptive language richness.

| Model | Adjectives/Description |
|---|---|
| Visual Narrator 3B | 2.0 |
| Claude Sonnet 4.5 | 2.0 |

What this proves: Our language quality matches premium APIs.


Live API Demo Results (January 2026)

We built a live demo that races Visual Narrator against frontier models using real API calls: no simulation, no cherry-picking.

| Model | Live Latency | vs Visual Narrator |
|---|---|---|
| Visual Narrator | 429ms | baseline |
| Claude Sonnet 4 | 4,559ms | 10.6x slower |
| Gemini 2.0 Flash | 8,048ms | 18.8x slower |
| GPT-4o | 11,873ms | 27.7x slower |

Try it yourself: Live Comparison Demo

Results come from parallel API calls issued at the same millisecond. A WebSocket endpoint is available for verification.
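The "parallel calls at the same millisecond" setup can be sketched with `asyncio`. The provider coroutines below are simulated stand-ins, not the real demo's clients:

```python
import asyncio
import time

async def timed_call(name, coro):
    """Await one provider call and record its wall-clock latency in ms."""
    start = time.perf_counter()
    result = await coro
    return name, (time.perf_counter() - start) * 1000, result

async def race(calls):
    """calls: dict of provider name -> coroutine. All are launched together."""
    results = await asyncio.gather(
        *(timed_call(name, coro) for name, coro in calls.items())
    )
    return sorted(results, key=lambda r: r[1])  # fastest first

# Simulated providers (sleep stands in for inference + network time):
async def fake_provider(delay_s, text):
    await asyncio.sleep(delay_s)
    return text

ranking = asyncio.run(race({
    "visual-narrator": fake_provider(0.01, "local model"),
    "cloud-api": fake_provider(0.05, "remote model"),
}))
```

`asyncio.gather` is what makes the comparison fair: every request starts before any response is awaited, so slower providers cannot blame queueing behind faster ones.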


The Unlock

We're not claiming to beat Claude on language quality. We're claiming to match their quality while running 10x+ faster in real-world API conditions.

That enables:

  • Live broadcasting with real-time audio description
  • Streaming accessibility at scale
  • Real-time content creation
  • Markets that API latency makes impossible

Sample Output

Input: Video frame of urban night scene

Visual Narrator Output:

"A sleek automobile navigates the urban landscape at night, neon lights reflecting off wet pavement as pedestrians move through crosswalks beneath glowing storefronts."


Technical Details

Model: Visual Narrator 3B - Phase 10
Parameters: 3 billion
Architecture: Vision-Language Model (VLM)
Specialization: Real-time cinematic scene description
Inference: 2.4ms on standard GPU hardware
Deployment: Local / Edge / Serverless

Verified Metrics

| Metric | Value | Source |
|---|---|---|
| Processing Speed | 2.4ms/frame | Benchmark suite |
| Semantic Accuracy | 71.6% | Evaluation protocol |
| Descriptive Quality | 2.0 adj/desc | Text-to-text benchmark |
| Real-time Capability | 400+ FPS | Calculated |
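The 400+ FPS figure is not independently measured; it follows arithmetically from the 2.4ms per-frame latency:

```python
# Frames per second from per-frame latency: 1000 ms per second / ms per frame.
latency_ms = 2.4
fps = 1000 / latency_ms
print(round(fps, 1))  # 416.7
```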

Cost Comparison (At Scale)

| Provider | Cost for 1M Videos/Month |
|---|---|
| Visual Narrator | $900 (fixed infrastructure) |
| GPT-4 Vision | ~$83,000 |
| Claude Vision | ~$252,000 |

Result: 90-280x cost advantage at scale.
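A minimal cost model makes the multiplier checkable. The per-video rates below are back-calculated from the table ($83,000 and $252,000 per 1M videos), not published API prices:

```python
def monthly_cost(videos, per_video=None, fixed=None):
    """Fixed infrastructure cost vs linear per-call API cost."""
    return fixed if fixed is not None else videos * per_video

videos = 1_000_000
local = monthly_cost(videos, fixed=900)          # fixed GPU infrastructure
gpt4v = monthly_cost(videos, per_video=0.083)    # implied ~$0.083/video
claude = monthly_cost(videos, per_video=0.252)   # implied ~$0.252/video
print(round(gpt4v / local), round(claude / local))  # prints 92 280
```

The fixed-vs-linear structure is the whole argument: the advantage grows with volume, and below some break-even volume the API providers would be cheaper.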


Quick Start

```shell
# Clone the repository
git clone https://huggingface.co/Ytgetahun/visual-narrator-llm
cd visual-narrator-llm

# Run inference on a video file
python visual_narrator_api.py --input video.mp4
```


Methodology Notes

Speed Benchmark:

  • Visual Narrator: Local GPU inference (2.4ms)
  • Competitors: Cloud API round-trip (includes network latency)
  • This reflects real-world deployment conditions

Quality Benchmark:

  • Both models given identical text prompts
  • Measured adjective density per description
  • Visual Narrator tuned to match Claude's 2.0 adj/desc (optimal quality level)
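The adjective-density metric described above can be sketched as follows. The exact tagger used in the benchmark is not specified here, so the tagger is injected as a parameter; the toy tagger at the bottom is illustrative only (a real run might use an NLTK- or spaCy-style POS tagger):

```python
def adjective_density(descriptions, tagger):
    """Mean adjectives per description.

    tagger(text) -> list of (token, tag) pairs, where Penn-style
    'JJ*' tags mark adjectives. Injecting the tagger keeps the
    metric itself dependency-free.
    """
    counts = []
    for text in descriptions:
        counts.append(sum(1 for _, tag in tagger(text) if tag.startswith("JJ")))
    return sum(counts) / len(counts)

# Toy tagger for illustration (tiny hand-picked adjective set):
ADJECTIVES = {"sleek", "urban", "wet", "glowing", "neon"}
toy_tagger = lambda text: [
    (w, "JJ" if w.lower().strip(".,") in ADJECTIVES else "NN")
    for w in text.split()
]

density = adjective_density(
    ["A sleek car on wet pavement.", "Glowing storefronts at night."],
    toy_tagger,
)
```

With identical prompts to both models, the metric reduces quality comparison to a single number, which is also its limitation: it measures density, not aptness, which is why the tuning target was capped at 2.0 rather than maximized.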

Historical Context

Early benchmarks showed our model could achieve 3.62 adj/desc (+81% vs Claude's 2.0). We intentionally reduced to 2.0 after determining higher density produced "fluff" rather than quality. Claude's output level was the correct target, not something to exceed.


License

Apache 2.0 - See LICENSE file for details.


Last updated: January 2026. Replaces the previous model card with verified, accurate claims.
