Leum-VL-8B-preview0320
Timeline-Grounded Structural Understanding for Internet-Native Short Video
Leum-VL-8B is an 8B open-weight video-language model for timeline-grounded structural understanding of internet-native short video. It parses short videos into timeline-aligned, evidence-linked, machine-usable structure across six dimensions: subject, aesthetics, camera language, editing, narrative, and observable dissemination strategy.
In this model card, observable dissemination strategy refers to retention-oriented design, platform-native packaging, and engagement context around the video when such context is provided, including title, hashtags, cover, and comments.
Updates
- [2026-03-24] Local preview version refreshed for the public Hugging Face repo, aligned with the current technical report table and assets.
- [TBD] Inference API examples and deployment commands will be added after engineering validation.
- [TBD] The formal YAML / JSON schema and task-specific prompt templates will be added in a later update.
Highlights
- Understands edits, not just scene content.
- Built for internet-native short-video structure.
- Produces timeline-grounded, machine-usable outputs.
- Trained with full-parameter updates across the vision encoder, projector, and language model.
Benchmark Snapshot
The table below mirrors the current benchmark table in the technical report.
| Category | Benchmark | Leum-VL-8B | Qwen3-VL-8B¹ | Keye-VL-8B Thinking² | GLM-4.1V-9B Thinking³ | MiniCPM-V-4.5-8B⁴ |
|---|---|---|---|---|---|---|
| General VQA | MMBench-EN test | 84.8 | 84.5 | 92.0 | 85.8 | 84.2 |
| | MMBench-CN test | 83.9 | 84.7 | - | 84.7 | - |
| | HallusionBench | 56.5 | 61.1 | 62.7 | 63.2 | 61.2 |
| | RealWorldQA | 73.2 | 71.5 | 73.5 | - | 72.1 |
| | MMStar | 67.5 | 70.9 | 80.5 | 72.9 | 72.1 |
| | BLINK | 65.2 | 69.1 | 54.9‡ | 65.1 | 42.0‡ |
| Document & OCR | OCRBench | 85.4 | 89.6 | 86.6 | 84.2 | 89.0 |
| | DocVQA test | 95.7 | 96.1 | 93.4‡ | 93.3‡ | 94.7 |
| | TextVQA val | 85.0 | 82.8‡ | 81.5‡ | 79.6‡ | 82.2 |
| | ChartQA test | 85.3 | 89.6 | 94.1‡ | 70.0‡ | 87.4 |
| Video Understanding | Video-MME w/o sub. | 70.8 | 71.4 | 73.0 | 68.2 | 67.9 |
| | MVBench | 70.0 | 68.7 | 56.9‡ | 68.4 | 60.5‡ |
| | TempCompass | 74.3 | 74.3‡ | 75.5 | 72.3‡ | 72.7‡ |
| | MotionBench | 61.6 | 56.9‡ | 55.1‡ | 59.0 | 59.7 |
| | FAVOR-Bench | 58.9 | 54.1 | - | - | 56.0 |
| | LongVideoBench | 64.6 | 62.4‡ | 66.0 | 65.7‡ | 63.9 |
| | Tomato | 36.7 | 35.7‡ | 33.0‡ | 30.0‡ | 29.8‡ |
| | Charades-STA mIoU | 59.4 | 56.0 | - | - | - |
¹ Qwen3-VL report. ² Keye-VL report. ³ GLM-4.1V report. ⁴ MiniCPM-V report.

‡ Reported values reproduced from referenced public reports as cited in the technical report.
Eval setting: vLLM v0.17.1-cu130, FPS=4, max 768 frames, max 50K tokens/video.
We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. [TBD] The public benchmark page and dataset link will be added after release.
What Makes It Different
Conventional video-language models can often describe scenes, answer event-centric questions, or read on-screen text. They are typically less reliable at explaining why a cut happens, what narrative role a segment serves, what retention tactics a short video uses, or how video content aligns with platform-native packaging and audience-response context. Leum-VL-8B addresses this gap by treating video understanding as timeline-grounded structural parsing that can incorporate linked platform context.
| Paradigm | Can capture | Often misses |
|---|---|---|
| Dense captioning | Scene and event semantics | Cut rationale, hook design, narrative role |
| VQA | Event-centric reasoning | Timeline-level structural grammar |
| OCR / subtitle parsing | On-screen text | How text interacts with pacing and packaging |
| Shot boundary detection | Physical cuts | Why the cut matters |
| Leum-VL-8B (SV6D) | Multi-layer timeline-grounded structure | - |
SV6D Representation
SV6D decomposes video into six complementary dimensions, each grounded on the timeline and tied to observable evidence.
| Dimension | What it captures | Observable evidence or linked context | Downstream value |
|---|---|---|---|
| Subject | Who or what is present and how it changes | People, objects, actions, scene changes | Retrieval, indexing |
| Aesthetics | Visual style and perceptual tone | Color, lighting, composition, overlay style | Style analysis, generation control |
| Camera language | Shot size, angle, and movement | Close-up, pan, zoom, handheld, tracking | Cinematography parsing |
| Editing | Transitions and pacing logic | Hard cuts, jump cuts, rhythm changes, tension release | Edit analysis, editing assistance |
| Narrative | Timeline-grounded function of segments | Hook, setup, progression, reveal, payoff | Story analysis, clip structuring |
| Observable dissemination strategy | Retention-oriented design, packaging, and engagement context around the video | Topic framing, first-3-second hook design, subtitle emphasis, cover-title-hashtag consistency, video-comment alignment | Retention analysis, packaging analysis, comment analysis, creator tooling |
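For downstream tooling, the six dimensions above can be held in a single per-segment record. This is a minimal sketch; the field names are illustrative and not the official schema (which is marked TBD above):

```python
from dataclasses import dataclass, field

# Hypothetical container for one timeline segment across the six SV6D
# dimensions. Field names are illustrative assumptions, not the official
# Leum-VL schema.
@dataclass
class SV6DSegment:
    span: tuple[float, float]                          # (start, end) in seconds
    subject: str = ""                                  # who/what is present
    aesthetics: dict = field(default_factory=dict)     # color, lighting, composition
    camera_language: dict = field(default_factory=dict)  # shot size, angle, movement
    editing: str = ""                                  # transition / pacing note
    narrative_role: str = ""                           # hook, setup, reveal, payoff
    dissemination_cues: list = field(default_factory=list)  # retention / packaging cues

seg = SV6DSegment(span=(0.0, 7.0), narrative_role="Opening Hook")
```

Keeping all six dimensions on one timeline-keyed record makes it straightforward to join segments against covers, titles, or comments when that context is available.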
Leum-VL-8B realizes the SV6D objective through full-parameter training spanning continual pretraining, supervised fine-tuning, and RLHF alignment. Initialized from a multimodal instruct checkpoint, it is trained to produce temporally aligned, evidence-linked, machine-usable outputs for short-video structural parsing.
Model Snapshot
| Item | Value |
|---|---|
| Model name | Leum-VL-8B |
| Model type | Video-language model |
| Parameters | 8B |
| Base model | Qwen3-VL-8B-Instruct |
| Training scope | Full-parameter training over vision encoder, projector, and language model |
| Training stages | Continual pretraining -> SFT -> RLHF alignment |
| Primary task | Timeline-grounded structural parsing for short video |
| Output | Structured YAML or text reports aligned to the timeline |
| Release type | Open weights |
| License | MIT |
Output Format
Leum-VL-8B is designed to produce outputs that are directly usable by downstream systems. The following is an illustrative example. [TBD] Field names and schema are not final until the official release schema is published.
```yaml
timeline:
  - span: "<00:00.0-00:07.0>"
    content_structure: "Opening Hook"
    structure_description: |
      At the center of the frame stands a multi-tiered pagoda-style structure against the night sky. The eaves on each level are lined with alternating cyan and golden light strips. The tower continuously emits golden linear fireworks outward, forming parabolic trajectories that spread to both sides. Gradually, red light beams begin to dominate, and the atmosphere of the fireworks shifts from tranquil to intense, completing the emotional setup of the opening.
    shots:
      - span: "<00:00.0-00:07.0>"
        subject_analysis: |
          A multi-tiered pagoda-style building stands at the center of the frame, tapering upward level by level. Each tier is decorated with surrounding light strips that alternate between cyan and golden illumination, with a pointed ornament at the top. From each level of the structure, golden linear fireworks continuously shoot outward, forming parabolic trajectories that spread to the sides and upward. As the scene progresses, denser red light beams shoot from the tower, and the lighting color of the structure also changes, with red gradually becoming dominant. The background is a black night sky. At the bottom of the frame, there is a horizontal railing, beyond which a large crowd appears as black silhouettes. Among them are dense rectangular lights from handheld electronic device screens.
        shot_breakdown:
          shot_size: "Wide shot"
          camera_position: "Low angle position"
          camera_angle: "Upward angle"
          focal_length: "Wide-angle"
          camera_movement: "Handheld shake"
          depth_of_field: "Deep depth of field"
        aesthetic_analysis:
          light_source_type: "Artificial light"
          lighting_direction: "Backlight / rim light"
          light_hardness: "Hard light"
          contrast: "High contrast"
          saturation: "High"
          color_temperature_tone: "Mixed warm and cool tones"
          base_tone: "Low-key"
          composition: "Symmetrical composition"
    emotional_curve:
      - time: "<00:00.0>"
        emotion_level: "0 (Calm)"
        description: |
          The video begins with the Wunüzhou pagoda standing under the night sky, presenting a gorgeous and tranquil scene as a buildup.
      - time: "<00:02.1>"
        emotion_level: "1 (Engaged)"
        description: |
          The first wave of fireworks bursts on both sides of the tower, breaking the stillness with visual motion and raising emotional engagement.
  - span: "<00:07.0-00:14.9>"
    content_structure: "Core Content"
    structure_description: |
      The fireworks display enters its visual climax phase. Slender white firework streaks and dense colorful bursts appear alternately. Thick smoke spreads, gradually obscuring the tower's outline. The brightness and colors shift dramatically from white to emerald green to pink. Strong light penetrates the smoke, forming a hazy glow that delivers continuous visual impact and immersion, sustaining the emotional peak through the ending.
    shots:
      - span: "<00:07.0-00:09.5>"
        subject_analysis: |
          The multi-tiered pagoda stands at the center of the frame. The lights on each level form red circular rings, with blue light visible in the gaps. The building's outline is faintly visible through the smoke. Slender bright white firework streaks continuously shoot diagonally upward from the tower, leaving straight trails in the air. Dense gray-white smoke forms around the structure and rises upward, partially obscuring architectural details. At the bottom of the frame, a bridge railing emits a cyan-blue glow, with silhouettes of spectators holding recording devices in front of it. The surrounding background is a dark night sky, showing slight gray variations under the illumination of fireworks.
        shot_breakdown:
          shot_size: "Wide shot"
          camera_position: "Low angle position"
          camera_angle: "Upward angle"
          focal_length: "Wide-angle"
          camera_movement: "Handheld shake"
          depth_of_field: "Deep depth of field"
        aesthetic_analysis:
          light_source_type: "Artificial light"
          lighting_direction: "Backlight / rim light"
          light_hardness: "Hard light"
          contrast: "High contrast"
          saturation: "High"
          color_temperature_tone: "Mixed warm and cool tones"
          base_tone: "Low-key"
          composition: "Symmetrical composition"
      - span: "<00:09.5-00:14.9>"
        subject_analysis: |
          The lower part of the frame shows the top structure of a building emitting bright golden light, with a square, block-like appearance and a smooth surface. Dense firework streaks spray upward from the top in a huge fan shape. The fireworks initially appear bright white, then shift to emerald green, and finally turn pink. As the fireworks erupt, large amounts of smoke spread outward, and light passing through the smoke creates a soft halo effect. The black night sky is illuminated by the fireworks. Along the bottom edge of the frame, silhouettes of raised rectangular objects occasionally appear in extremely low light conditions.
        shot_breakdown:
          shot_size: "Wide shot"
          camera_position: "Low angle position"
          camera_angle: "Upward angle"
          focal_length: "Telephoto"
          camera_movement: "Handheld shake"
          depth_of_field: "Medium depth of field"
        aesthetic_analysis:
          light_source_type: "Artificial light"
          lighting_direction: "Backlight / rim light"
          light_hardness: "Hard light"
          contrast: "High contrast"
          saturation: "Medium"
          color_temperature_tone: "Warm"
          base_tone: "Mid-tone"
          composition: "No obvious compositional intent"
    emotional_curve:
      - time: "<00:08.1>"
        emotion_level: "2 (Climax)"
        description: |
          Fireworks of varying forms and colors alternate continuously, maintaining a high density of visual output and immersing the audience in a sense of exhilaration.
```
Depending on the task mode, observable dissemination strategy may be expressed either as segment-level retention cues or as clip-level packaging or comment alignment outputs.
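For downstream consumers, the timestamp spans in the illustrative example above (e.g. `"<00:07.0-00:14.9>"`) can be converted to seconds with a small parser. This is a sketch based solely on the example output; the span format is not final until the official schema is published:

```python
import re

# Matches spans like "<00:07.0-00:14.9>" (MM:SS.s-MM:SS.s), as seen in the
# illustrative YAML output above. The format is an assumption from that
# example, not a published schema.
SPAN_RE = re.compile(r"<(\d+):(\d+\.\d+)-(\d+):(\d+\.\d+)>")

def parse_span(span: str) -> tuple[float, float]:
    """Return (start_seconds, end_seconds) for a timeline span string."""
    m = SPAN_RE.fullmatch(span.strip())
    if m is None:
        raise ValueError(f"unrecognized span: {span!r}")
    m1, s1, m2, s2 = m.groups()
    return int(m1) * 60 + float(s1), int(m2) * 60 + float(s2)

start, end = parse_span("<00:07.0-00:14.9>")
```

A tolerant parser like this lets indexing or clipping pipelines consume the reports without depending on exact field layout.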
Get Started
- Primary input: MP4.
- Optional associated context for supported task modes: title, hashtags, cover images, and comments.
- Recommended deployment framework: vLLM.
- Planned task modes: `sv6d_parse`, `summary`, `edit_suggestions`, `retention_analysis`, `comment_analysis`, `packaging_alignment`.
[TBD] Official inference examples and validated deployment recipes will be added to this repo after release.
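Until the validated recipes land, a request to an OpenAI-compatible vLLM endpoint might be shaped as below. This is a sketch only: the `video_url` content part, the task-mode prompt string, and the endpoint behavior are all assumptions pending the official examples.

```python
import json

# Hypothetical chat-completions payload for an OpenAI-compatible vLLM server
# hosting the model. The "video_url" part and the task-mode prompt wording
# are assumptions, not the official interface.
payload = {
    "model": "leum-team/Leum-VL-8B-preview0320",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "video_url",
             "video_url": {"url": "https://example.com/clip.mp4"}},
            {"type": "text",
             "text": "Task: sv6d_parse. Return a timeline-grounded structural report."},
        ],
    }],
}
body = json.dumps(payload)  # send as the POST body to /v1/chat/completions
```

Optional context (title, hashtags, comments) would be appended as additional text parts once the official prompt templates are published.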
Use Cases
- Short-video structural parsing: identify hooks, setup, progression, reveal, and payoff.
- Edit analysis: reason about cuts, pacing changes, and shot-level transitions.
- Retention analysis: identify topic promise, first-3-second hook design, curiosity gaps, juxtaposition cues, multi-hook chaining, and payoff timing.
- Subtitle-heavy internet video understanding: analyze overlays, stickers, and UI-like layouts as structural signals.
- Packaging alignment: assess whether cover image, title, and hashtags match the video's actual content and structure.
- Comment analysis: align videos with associated comments and support comment-aware summarization.
- Retrieval and indexing: search videos by structural patterns rather than only objects or events.
- Creator tooling: support edit review, packaging review, and minimal revision suggestions.
Limitations
- Observable dissemination strategy refers to observable retention, packaging, and engagement-related context signals around the video, not causal prediction of virality, reach, CTR, or platform distribution outcomes.
- Comment-related outputs depend on the availability, freshness, and platform specificity of associated comments.
- Cover-title-hashtag alignment reflects semantic and strategic consistency, not guaranteed performance.
- Performance may degrade on corrupted videos, visually cluttered inputs, poor OCR readability, or editing conventions far outside the training distribution.
- Timestamp boundaries and segment labels are approximate rather than frame-perfect; downstream systems should tolerate small temporal drift.
- Narrative and dissemination labels are interpretive and may vary across cultures, languages, platforms, or annotators.
- Structural outputs should be used as assistive analysis, not as the sole decision-maker in high-stakes settings.
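The temporal-drift caveat above suggests matching predicted segments to references by overlap rather than exact boundaries. A minimal sketch, with a hypothetical IoU threshold (not a value from the report):

```python
def span_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Temporal IoU between two (start, end) spans in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def spans_match(pred: tuple[float, float],
                ref: tuple[float, float],
                min_iou: float = 0.8) -> bool:
    # Accept approximate boundaries: require high overlap instead of
    # frame-exact agreement. The 0.8 default is an illustrative choice.
    return span_iou(pred, ref) >= min_iou
```

Downstream evaluators and clipping tools that adopt this kind of tolerance are robust to the small temporal drift the model card warns about.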
Out-of-Scope Use
- Causal claims about content performance without external validation.
- Fully automated moderation or enforcement decisions.
- High-stakes judgment without human review.
Citation
```bibtex
@article{LeumVL,
  title={Leum-VL Technical Report},
  author={Yuxuan He and Chaiming Huang and Yifan Wu and Hongjun Wang and Chenkui Shen and Jifan Zhang and Long Li},
  journal={arXiv preprint arXiv:2603.20354},
  year={2026}
}
```
License
This model is released under the MIT License.