Leum-VL-8B-preview0320
Timeline-Grounded Structural Understanding for Internet-Native Short Video
Leum-VL-8B is an 8B open-weight video-language model for timeline-grounded structural understanding of internet-native short video. It parses short videos into timeline-aligned, evidence-linked, machine-usable structure across six dimensions: subject, aesthetics, camera language, editing, narrative, and observable dissemination strategy.
In this model card, observable dissemination strategy refers to retention-oriented design, platform-native packaging, and engagement context around the video when such context is provided, including title, hashtags, cover, and comments.
Updates
- [2026-03-24] Local preview version refreshed for the public Hugging Face repo, aligned with the current technical report table and assets.
- [TBD] Inference API examples and deployment commands will be added after engineering validation.
- [TBD] The formal YAML / JSON schema and task-specific prompt templates will be added in a later update.
Highlights
- Understands edits, not just scene content.
- Built for internet-native short-video structure.
- Produces timeline-grounded, machine-usable outputs.
- Trained with full-parameter updates across the vision encoder, projector, and language model.
Benchmark Snapshot
The table below mirrors the current benchmark table in the technical report.
| Category | Benchmark | Leum-VL-8B | Qwen3-VL-8B¹ | Keye-VL-8B Thinking² | GLM-4.1V-9B Thinking³ | MiniCPM-V-4.5-8B⁴ |
|---|---|---|---|---|---|---|
| General VQA | MMBench-EN test | 84.8 | 84.5 | 92.0 | 85.8 | 84.2 |
| | MMBench-CN test | 83.9 | 84.7 | - | 84.7 | - |
| | HallusionBench | 56.5 | 61.1 | 62.7 | 63.2 | 61.2 |
| | RealWorldQA | 73.2 | 71.5 | 73.5 | - | 72.1 |
| | MMStar | 67.5 | 70.9 | 80.5 | 72.9 | 72.1 |
| | BLINK | 65.2 | 69.1 | 54.9‡ | 65.1 | 42.0‡ |
| Document & OCR | OCRBench | 85.4 | 89.6 | 86.6 | 84.2 | 89.0 |
| | DocVQA test | 95.7 | 96.1 | 93.4‡ | 93.3‡ | 94.7 |
| | TextVQA val | 85.0 | 82.8‡ | 81.5‡ | 79.6‡ | 82.2 |
| | ChartQA test | 85.3 | 89.6 | 94.1‡ | 70.0‡ | 87.4 |
| Video Understanding | Video-MME w/o sub. | 70.8 | 71.4 | 73.0 | 68.2 | 67.9 |
| | MVBench | 70.0 | 68.7 | 56.9‡ | 68.4 | 60.5‡ |
| | TempCompass | 74.3 | 74.3‡ | 75.5 | 72.3‡ | 72.7‡ |
| | MotionBench | 61.6 | 56.9‡ | 55.1‡ | 59.0 | 59.7 |
| | FAVOR-Bench | 58.9 | 54.1 | - | - | 56.0 |
| | LongVideoBench | 64.6 | 62.4‡ | 66.0 | 65.7‡ | 63.9 |
| | Tomato | 36.7 | 35.7‡ | 33.0‡ | 30.0‡ | 29.8‡ |
| | Charades-STA mIoU | 59.4 | 56.0 | - | - | - |
¹ Qwen3-VL report. ² Keye-VL report. ³ GLM-4.1V report. ⁴ MiniCPM-V report.

‡ Reported values reproduced from referenced public reports as cited in the technical report.
Eval setting: vLLM v0.17.1-cu130, FPS=4, max 768 frames, max 50K tokens/video.
We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. [TBD] The public benchmark page and dataset link will be added after release.
What Makes It Different
Conventional video-language models can often describe scenes, answer event-centric questions, or read on-screen text. They are typically less reliable at explaining why a cut happens, what narrative role a segment serves, what retention tactics a short video uses, or how video content aligns with platform-native packaging and audience-response context. Leum-VL-8B addresses this gap by treating video understanding as timeline-grounded structural parsing that can incorporate linked platform context.
| Paradigm | Can capture | Often misses |
|---|---|---|
| Dense captioning | Scene and event semantics | Cut rationale, hook design, narrative role |
| VQA | Event-centric reasoning | Timeline-level structural grammar |
| OCR / subtitle parsing | On-screen text | How text interacts with pacing and packaging |
| Shot boundary detection | Physical cuts | Why the cut matters |
| Leum-VL-8B (SV6D) | Multi-layer timeline-grounded structure | - |
SV6D Representation
SV6D decomposes video into six complementary dimensions, each grounded on the timeline and tied to observable evidence.
| Dimension | What it captures | Observable evidence or linked context | Downstream value |
|---|---|---|---|
| Subject | Who or what is present and how it changes | People, objects, actions, scene changes | Retrieval, indexing |
| Aesthetics | Visual style and perceptual tone | Color, lighting, composition, overlay style | Style analysis, generation control |
| Camera language | Shot size, angle, and movement | Close-up, pan, zoom, handheld, tracking | Cinematography parsing |
| Editing | Transitions and pacing logic | Hard cuts, jump cuts, rhythm changes, tension release | Edit analysis, editing assistance |
| Narrative | Timeline-grounded function of segments | Hook, setup, progression, reveal, payoff | Story analysis, clip structuring |
| Observable dissemination strategy | Retention-oriented design, packaging, and engagement context around the video | Topic framing, first-3-second hook design, subtitle emphasis, cover-title-hashtag consistency, video-comment alignment | Retention analysis, packaging analysis, comment analysis, creator tooling |
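For downstream tooling, the six dimensions above can be held in a single per-segment record. This is a minimal sketch; the field names are illustrative and not the official schema (which is marked TBD above):

```python
from dataclasses import dataclass, field

# Hypothetical container for one timeline segment across the six SV6D
# dimensions. Field names are illustrative assumptions, not the official
# Leum-VL schema.
@dataclass
class SV6DSegment:
    span: tuple[float, float]                          # (start, end) in seconds
    subject: str = ""                                  # who/what is present
    aesthetics: dict = field(default_factory=dict)     # color, lighting, composition
    camera_language: dict = field(default_factory=dict)  # shot size, angle, movement
    editing: str = ""                                  # transition / pacing note
    narrative_role: str = ""                           # hook, setup, reveal, payoff
    dissemination_cues: list = field(default_factory=list)  # retention / packaging cues

seg = SV6DSegment(span=(0.0, 7.0), narrative_role="Opening Hook")
```

Keeping all six dimensions on one timeline-keyed record makes it straightforward to join segments against covers, titles, or comments when that context is available.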
Leum-VL-8B realizes the SV6D objective through full-parameter training spanning continual pretraining, supervised fine-tuning, and RLHF alignment. Initialized from a multimodal instruct checkpoint, it is trained to produce temporally aligned, evidence-linked, machine-usable outputs for short-video structural parsing.
Model Snapshot
| Item | Value |
|---|---|
| Model name | Leum-VL-8B |
| Model type | Video-language model |
| Parameters | 8B |
| Base model | Qwen3-VL-8B-Instruct |
| Training scope | Full-parameter training over vision encoder, projector, and language model |
| Training stages | Continual pretraining -> SFT -> RLHF alignment |
| Primary task | Timeline-grounded structural parsing for short video |
| Output | Structured YAML or text reports aligned to the timeline |
| Release type | Open weights |
| License | MIT |
Output Format
Leum-VL-8B is designed to produce outputs that are directly usable by downstream systems. The following is an illustrative example. [TBD] Field names and schema are not final until the official release schema is published.
```yaml
timeline:
  - span: "<00:00.0-00:07.0>"
    content_structure: "Opening Hook"
    structure_description: |
      At the center of the frame stands a multi-tiered pagoda-style structure against the night sky. The eaves on each level are lined with alternating cyan and golden light strips. The tower continuously emits golden linear fireworks outward, forming parabolic trajectories that spread to both sides. Gradually, red light beams begin to dominate, and the atmosphere of the fireworks shifts from tranquil to intense, completing the emotional setup of the opening.
    shots:
      - span: "<00:00.0-00:07.0>"
        subject_analysis: |
          A multi-tiered pagoda-style building stands at the center of the frame, tapering upward level by level. Each tier is decorated with surrounding light strips that alternate between cyan and golden illumination, with a pointed ornament at the top. From each level of the structure, golden linear fireworks continuously shoot outward, forming parabolic trajectories that spread to the sides and upward. As the scene progresses, denser red light beams shoot from the tower, and the lighting color of the structure also changes, with red gradually becoming dominant. The background is a black night sky. At the bottom of the frame, there is a horizontal railing, beyond which a large crowd appears as black silhouettes. Among them are dense rectangular lights from handheld electronic device screens.
        shot_breakdown:
          shot_size: "Wide shot"
          camera_position: "Low angle position"
          camera_angle: "Upward angle"
          focal_length: "Wide-angle"
          camera_movement: "Handheld shake"
          depth_of_field: "Deep depth of field"
        aesthetic_analysis:
          light_source_type: "Artificial light"
          lighting_direction: "Backlight / rim light"
          light_hardness: "Hard light"
          contrast: "High contrast"
          saturation: "High"
          color_temperature_tone: "Mixed warm and cool tones"
          base_tone: "Low-key"
          composition: "Symmetrical composition"
    emotional_curve:
      - time: "<00:00.0>"
        emotion_level: "0 (Calm)"
        description: |
          The video begins with the Wunüzhou pagoda standing under the night sky, presenting a gorgeous and tranquil scene as a buildup.
      - time: "<00:02.1>"
        emotion_level: "1 (Engaged)"
        description: |
          The first wave of fireworks bursts on both sides of the tower, breaking the stillness with visual motion and raising emotional engagement.
  - span: "<00:07.0-00:14.9>"
    content_structure: "Core Content"
    structure_description: |
      The fireworks display enters its visual climax phase. Slender white firework streaks and dense colorful bursts appear alternately. Thick smoke spreads, gradually obscuring the tower's outline. The brightness and colors shift dramatically from white to emerald green to pink. Strong light penetrates the smoke, forming a hazy glow that delivers continuous visual impact and immersion, sustaining the emotional peak through the ending.
    shots:
      - span: "<00:07.0-00:09.5>"
        subject_analysis: |
          The multi-tiered pagoda stands at the center of the frame. The lights on each level form red circular rings, with blue light visible in the gaps. The building's outline is faintly visible through the smoke. Slender bright white firework streaks continuously shoot diagonally upward from the tower, leaving straight trails in the air. Dense gray-white smoke forms around the structure and rises upward, partially obscuring architectural details. At the bottom of the frame, a bridge railing emits a cyan-blue glow, with silhouettes of spectators holding recording devices in front of it. The surrounding background is a dark night sky, showing slight gray variations under the illumination of fireworks.
        shot_breakdown:
          shot_size: "Wide shot"
          camera_position: "Low angle position"
          camera_angle: "Upward angle"
          focal_length: "Wide-angle"
          camera_movement: "Handheld shake"
          depth_of_field: "Deep depth of field"
        aesthetic_analysis:
          light_source_type: "Artificial light"
          lighting_direction: "Backlight / rim light"
          light_hardness: "Hard light"
          contrast: "High contrast"
          saturation: "High"
          color_temperature_tone: "Mixed warm and cool tones"
          base_tone: "Low-key"
          composition: "Symmetrical composition"
      - span: "<00:09.5-00:14.9>"
        subject_analysis: |
          The lower part of the frame shows the top structure of a building emitting bright golden light, with a square, block-like appearance and a smooth surface. Dense firework streaks spray upward from the top in a huge fan shape. The fireworks initially appear bright white, then shift to emerald green, and finally turn pink. As the fireworks erupt, large amounts of smoke spread outward, and light passing through the smoke creates a soft halo effect. The black night sky is illuminated by the fireworks. Along the bottom edge of the frame, silhouettes of raised rectangular objects occasionally appear in extremely low light conditions.
        shot_breakdown:
          shot_size: "Wide shot"
          camera_position: "Low angle position"
          camera_angle: "Upward angle"
          focal_length: "Telephoto"
          camera_movement: "Handheld shake"
          depth_of_field: "Medium depth of field"
        aesthetic_analysis:
          light_source_type: "Artificial light"
          lighting_direction: "Backlight / rim light"
          light_hardness: "Hard light"
          contrast: "High contrast"
          saturation: "Medium"
          color_temperature_tone: "Warm"
          base_tone: "Mid-tone"
          composition: "No obvious compositional intent"
    emotional_curve:
      - time: "<00:08.1>"
        emotion_level: "2 (Climax)"
        description: |
          Fireworks of varying forms and colors alternate continuously, maintaining a high density of visual output and immersing the audience in a sense of exhilaration.
```
Depending on the task mode, observable dissemination strategy may be expressed either as segment-level retention cues or as clip-level packaging or comment alignment outputs.
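For downstream consumers, the timestamp spans in the illustrative example above (e.g. `"<00:07.0-00:14.9>"`) can be converted to seconds with a small parser. This is a sketch based solely on the example output; the span format is not final until the official schema is published:

```python
import re

# Matches spans like "<00:07.0-00:14.9>" (MM:SS.s-MM:SS.s), as seen in the
# illustrative YAML output above. The format is an assumption from that
# example, not a published schema.
SPAN_RE = re.compile(r"<(\d+):(\d+\.\d+)-(\d+):(\d+\.\d+)>")

def parse_span(span: str) -> tuple[float, float]:
    """Return (start_seconds, end_seconds) for a timeline span string."""
    m = SPAN_RE.fullmatch(span.strip())
    if m is None:
        raise ValueError(f"unrecognized span: {span!r}")
    m1, s1, m2, s2 = m.groups()
    return int(m1) * 60 + float(s1), int(m2) * 60 + float(s2)

start, end = parse_span("<00:07.0-00:14.9>")
```

A tolerant parser like this lets indexing or clipping pipelines consume the reports without depending on exact field layout.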
Get Started
- Primary input: MP4.
- Optional associated context for supported task modes: title, hashtags, cover images, and comments.
- Recommended deployment framework: vLLM.
- Planned task modes: `sv6d_parse`, `summary`, `edit_suggestions`, `retention_analysis`, `comment_analysis`, `packaging_alignment`.
[TBD] Official inference examples and validated deployment recipes will be added to this repo after release.
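Until the validated recipes land, a request to an OpenAI-compatible vLLM endpoint might be shaped as below. This is a sketch only: the `video_url` content part, the task-mode prompt string, and the endpoint behavior are all assumptions pending the official examples.

```python
import json

# Hypothetical chat-completions payload for an OpenAI-compatible vLLM server
# hosting the model. The "video_url" part and the task-mode prompt wording
# are assumptions, not the official interface.
payload = {
    "model": "leum-team/Leum-VL-8B-preview0320",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": "https://example.com/clip.mp4"}},
            {"type": "text",
             "text": "Task: sv6d_parse. Return a timeline-grounded structural report."},
        ],
    }],
}
body = json.dumps(payload)  # send as the POST body to /v1/chat/completions
```

Optional context (title, hashtags, comments) would be appended as additional text parts once the official prompt templates are published.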
Use Cases
- Short-video structural parsing: identify hooks, setup, progression, reveal, and payoff.
- Edit analysis: reason about cuts, pacing changes, and shot-level transitions.
- Retention analysis: identify topic promise, first-3-second hook design, curiosity gaps, juxtaposition cues, multi-hook chaining, and payoff timing.
- Subtitle-heavy internet video understanding: analyze overlays, stickers, and UI-like layouts as structural signals.
- Packaging alignment: assess whether cover image, title, and hashtags match the video's actual content and structure.
- Comment analysis: align videos with associated comments and support comment-aware summarization.
- Retrieval and indexing: search videos by structural patterns rather than only objects or events.
- Creator tooling: support edit review, packaging review, and minimal revision suggestions.
Limitations
- Observable dissemination strategy refers to observable retention, packaging, and engagement-related context signals around the video, not causal prediction of virality, reach, CTR, or platform distribution outcomes.
- Comment-related outputs depend on the availability, freshness, and platform specificity of associated comments.
- Cover-title-hashtag alignment reflects semantic and strategic consistency, not guaranteed performance.
- Performance may degrade on corrupted videos, visually cluttered inputs, poor OCR readability, or editing conventions far outside the training distribution.
- Timestamp boundaries and segment labels are approximate rather than frame-perfect; downstream systems should tolerate small temporal drift.
- Narrative and dissemination labels are interpretive and may vary across cultures, languages, platforms, or annotators.
- Structural outputs should be used as assistive analysis, not as the sole decision-maker in high-stakes settings.
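The temporal-drift caveat above suggests matching predicted segments to references by overlap rather than exact boundaries. A minimal sketch, with a hypothetical IoU threshold (not a value from the report):

```python
def span_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def spans_match(pred: tuple[float, float],
                ref: tuple[float, float],
                min_iou: float = 0.8) -> bool:
    # Accept approximate boundaries: require high overlap instead of
    # frame-exact agreement. The 0.8 default is an illustrative choice.
    return span_iou(pred, ref) >= min_iou
```

Downstream evaluators and clipping tools that adopt this kind of tolerance are robust to the small temporal drift the model card warns about.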
Out-of-Scope Use
- Causal claims about content performance without external validation.
- Fully automated moderation or enforcement decisions.
- High-stakes judgment without human review.
Citation
```bibtex
@article{LeumVL,
  title={Leum-VL Technical Report},
  author={Yuxuan He and Chaiming Huang and Yifan Wu and Hongjun Wang and Chenkui Shen and Jifan Zhang and Long Li},
  journal={arXiv preprint arXiv:2603.20354},
  year={2026}
}
```
License
This model is released under the MIT License.