---
license: other
license_name: bsl-1.1
license_link: LICENSE
language:
- en
base_model: Qwen/Qwen3.5-2B
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- multimodal
- video-captioning
- temporal-grounding
- qwen
- text-generation
- VLM
---

# video-scan

A 2B-parameter video vision-language model (VLM) for dense captioning and natural-language temporal grounding. Given a video, it produces structured Scene + Event captions with timestamps in seconds, and resolves natural-language queries to `(start, end)` time spans in the video.

This repository is a redistribution of [`NemoStation/Marlin-2B`](https://huggingface.co/NemoStation/Marlin-2B) packaged for internal use. The weights are unmodified. The model is licensed under the Business Source License 1.1; see [`LICENSE`](LICENSE) and [`NOTICE`](NOTICE) for terms and attribution. The internal Python module name (`modeling_marlin.py`) and class name (`MarlinForConditionalGeneration`) are preserved verbatim so that `trust_remote_code=True` loading via `auto_map` continues to work without modification.

## Capabilities

- **Caption mode**: returns `Scene: <paragraph>` followed by `Events: <X.X - Y.Y> <description>` lines.
- **Find mode**: given a natural-language event description, returns the matching time span as `From X.X to Y.Y.`.
- **Multichunk reasoning** (limited): `<think>`-style chunked-video reasoning with chunk-time to source-time arithmetic. It is not exposed through the `.caption()` / `.find()` helpers; use a raw prompt to access it.
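
When calling the model with a raw prompt, the two output shapes above have to be parsed by hand. The following is a minimal sketch of such a parser, assuming the text follows the `Scene:` / `Events:` / `From X.X to Y.Y.` shapes exactly as described. The names `parse_caption` and `parse_span` are illustrative and are not the helpers' actual implementation.

```python
import re
from typing import Optional

# Assumed shapes, taken from the format descriptions in this card:
#   Scene: <paragraph>
#   <X.X - Y.Y> <description>
#   From X.X to Y.Y.
EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*([^<\n]+)")
SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\.?")


def parse_caption(text: str) -> dict:
    """Split caption-mode text into a scene paragraph and a list of timed events."""
    scene_part, _, events_part = text.partition("Events:")
    scene = scene_part.replace("Scene:", "", 1).strip()
    events = [
        {
            "start": float(m.group(1)),
            "end": float(m.group(2)),
            "description": m.group(3).strip(),
        }
        for m in EVENT_RE.finditer(events_part)
    ]
    return {"scene": scene, "events": events}


def parse_span(text: str) -> Optional[tuple[float, float]]:
    """Return (start, end) in seconds from find-mode text, or None if it does not parse."""
    m = SPAN_RE.search(text)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    # Treat an inverted span as a parse failure rather than guessing the intent.
    return (start, end) if end >= start else None
```

Returning `None` instead of raising keeps the caller in control of malformed generations, in the same spirit as the `span` / `format_ok` fields described under Find mode.
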
## Architecture

Fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. Custom modeling code in `modeling_marlin.py` exposes two convenience methods (`.caption()` and `.find()`) that wrap a single canonical training prompt per mode and parse the structured output into typed Python dicts. Raw `.generate()` is also available for custom prompts.

| Component | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| Parameters | 2.21B (text + vision combined) |
| Precision | bfloat16 |
| Storage on disk | ~5.5 GB |
| Architecture string | `MarlinForConditionalGeneration` |
| `model_type` | `qwen3_5` |
| Context length | 262144 tokens |

## Training (upstream)

The following describes the upstream training pipeline as documented by NemoStation. We have not retrained or modified the weights.

- **Data**: ~400K clip-level annotations assembled from ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, and TimeLens, with dense re-annotations distilled from Gemini-3-Flash and targeted human review on the highest-impact splits.
- **Stage 1**: supervised fine-tuning on the curated corpus with a fixed canonical prompt per mode and Tarsier-schema output formatting.
- **Stage 2**: SimPO (Simple Preference Optimization) on a teacher-distilled preference set, scored against Gemini-3-Flash on factual accuracy, completeness, and temporal alignment.
- **Hardware**: single H100.

## Evaluation (upstream-reported)

Upstream benchmarks Marlin-2B on three suites:

- **CaReBench**: [arXiv:2501.00513](https://arxiv.org/abs/2501.00513)
- **DREAM-1K**: [arXiv:2407.00634](https://arxiv.org/abs/2407.00634)
- **TimeLens-Bench**: [arXiv:2512.14698](https://arxiv.org/abs/2512.14698)

Headline results reported by NemoStation: Marlin-2B tops the CaReBench leaderboard at the 2B scale and scores +6.4 mIoU over Qwen2.5-VL-7B on TimeLens-Bench (Charades / ActivityNet / QVHighlights). These numbers have not been independently re-verified in this repository.
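
For readers unfamiliar with the mIoU figure, the sketch below shows the standard temporal intersection-over-union between a predicted span and a reference span, averaged over queries. It is only an illustration of the metric's usual definition. The exact scoring used by TimeLens-Bench may differ, and counting an unparseable prediction as 0 is an assumption made here.

```python
from typing import Optional, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Optional[Span]], gts: Sequence[Span]) -> float:
    """Average IoU over queries; a missing prediction (None) scores 0 (assumption)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not gts:
        return 0.0
    scores = [temporal_iou(p, g) if p is not None else 0.0 for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)
```

A predicted span such as the `result["span"]` value from find mode can be passed directly as `pred`.
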
## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
model.compile()  # optional: wraps torch.compile; the first call compiles, later calls are faster
```

### Caption mode

```python
result = model.caption("video.mp4")
print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
```

Optional kwargs:

- `max_new_tokens=2048`: generation token cap (default).
- `prompt=None`: overrides the canonical training prompt. Almost always leave as `None`.
- `do_sample=False`, `temperature=1.0`, `top_p=1.0`: sampling controls.

### Find mode

```python
result = model.find("video.mp4", event="a person enters the room")
print(result["raw"])        # "From 14.3 to 18.2." raw model output
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
```

## Raw inference

To bypass the helper methods and call `generate()` directly:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained(
    "cudabenchmarktest/video-scan", trust_remote_code=True
)

# One user turn containing the video and the text prompt.
messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated text is decoded.
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)
```

## Output format notes

The model emits a `<think>` token at the start of every response (an artifact of training with `add_non_thinking_prefix=True`). The `.caption()` and `.find()` helpers strip this automatically. When calling `generate()` directly, strip any leading `<think>...</think>` block (with or without closing tag) from the output before parsing.
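
One way to do that stripping is a single anchored regular expression. This is a sketch rather than the helpers' own code: the function name `strip_think` is illustrative, and when the closing tag is missing it removes only the opening `<think>` tag, on the assumption that the rest of the text is the answer.

```python
import re

# Leading <think>...</think> block, or a bare leading <think> when no closing tag exists.
_THINK_PREFIX = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove a leading think block (closed or unclosed) from decoded model output."""
    return _THINK_PREFIX.sub("", text, count=1)
```

Apply it to the decoded string from `batch_decode` before running any format-specific parsing. Text without a leading `<think>` is returned unchanged.
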
## Requirements

- `transformers >= 5.7.0` (for native `qwen3_5` architecture)
- `torch >= 2.11.0`
- `torchcodec` (video decoding)
- `qwen-vl-utils >= 0.0.14`
- `av` (PyAV; listed as a torchcodec dependency)
- `pillow`

```bash
pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow
```
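
To confirm that an existing environment meets the minimum versions listed above, a small check along these lines may help. It is a sketch that uses only the standard library and compares the leading numeric components of each version. Pre-release and local suffixes such as `.dev0` or `+cu128` are ignored, which is a simplification. The names `MINIMUMS`, `version_tuple`, and `check_requirements` are illustrative.

```python
import re
from importlib import metadata

# Minimum versions copied from the requirements list in this card.
MINIMUMS = {"transformers": "5.7.0", "torch": "2.11.0", "qwen-vl-utils": "0.0.14"}


def version_tuple(version: str) -> tuple:
    """Leading numeric release components, e.g. '2.11.0+cu128' -> (2, 11, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def check_requirements(minimums: dict = MINIMUMS) -> list:
    """Return human-readable problems; an empty list means every minimum is met."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need >= {minimum})")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name}: found {installed}, need >= {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```
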
## Video preprocessing

The custom modeling code sets these environment variables internally to match the training-time setup. Override them in your shell **before** importing transformers if needed.

| Env var | Default | Purpose |
|---|---|---|
| `FORCE_QWENVL_VIDEO_READER` | `torchcodec` | Video decoder backend |
| `VIDEO_MAX_PIXELS` | `200704` | Max pixels per frame (~448x448) |
| `FPS` | `2.0` | Frame sampling rate |
| `FPS_MAX_FRAMES` | `240` | Cap on total frames (~2 min at 2 FPS) |
| `FPS_MIN_FRAMES` | `4` | Floor for very short videos |
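
The same overrides can be applied from Python, provided they run before `transformers` or the model code is imported. The sketch below also estimates how many frames a clip of a given length would be sampled to under those settings. It assumes the modeling code only fills in variables that are not already set, and it models sampling as a plain clamp of `duration * FPS`. The real sampler may round differently, and `estimated_frames` is an illustrative helper rather than part of the repository.

```python
import os

# Overrides must be in place before transformers / the custom model code is imported.
os.environ["FPS"] = "1.0"             # sample 1 frame per second instead of 2
os.environ["FPS_MAX_FRAMES"] = "120"  # lower the cap on total frames


def estimated_frames(duration_s: float) -> int:
    """Approximate sampled frame count: duration * FPS clamped to the min/max settings."""
    fps = float(os.environ.get("FPS", "2.0"))
    min_frames = int(os.environ.get("FPS_MIN_FRAMES", "4"))
    max_frames = int(os.environ.get("FPS_MAX_FRAMES", "240"))
    return int(min(max(duration_s * fps, min_frames), max_frames))
```

With the defaults from the table, the cap of 240 frames is reached at 120 seconds of video, so longer clips are sampled more sparsely than 2 FPS.
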
## License and attribution

This redistribution is licensed under the **Business Source License 1.1**. The full license text is in [`LICENSE`](LICENSE). The Qwen3.5-2B base weights remain under Apache License 2.0; see [`LICENSE-QWEN-BASE`](LICENSE-QWEN-BASE) and [`NOTICE`](NOTICE).

Key terms of BSL 1.1 as applied here:

- Copy, modify, redistribute, and non-production use are permitted.
- Production use is permitted **except** for offering this work to third parties on a hosted or embedded basis in a way that competes with NemoStation's paid version(s).
- Internal organizational use is explicitly not a competitive offering.
- On the **Change Date** (two years after upstream public release), the license converts to Apache License 2.0.

The "Marlin" name and any logos are trademarks of NemoStation and are not granted by this license. The class identifier `MarlinForConditionalGeneration` and the module name `modeling_marlin.py` are preserved only because `auto_map` requires them for `trust_remote_code` loading; they do not imply trademark use beyond technical interoperability.

Upstream source: <https://huggingface.co/NemoStation/Marlin-2B>