---
license: other
license_name: bsl-1.1
license_link: LICENSE
language:
- en
base_model: Qwen/Qwen3.5-2B
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- multimodal
- video-captioning
- temporal-grounding
- qwen
- text-generation
- VLM
---
# video-scan
A 2B-parameter video VLM for dense captioning and natural-language temporal grounding. Given a video, it produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to `(start, end)` time spans in the video.
This repository is a redistribution of [`NemoStation/Marlin-2B`](https://huggingface.co/NemoStation/Marlin-2B) packaged for internal use. Weights are unmodified. It is licensed under the Business Source License 1.1; see [`LICENSE`](LICENSE) and [`NOTICE`](NOTICE) for terms and attribution. The internal Python module name (`modeling_marlin.py`) and class name (`MarlinForConditionalGeneration`) are preserved verbatim so that `trust_remote_code=True` loading via `auto_map` continues to work without modification.
## Capabilities
- **Caption mode**: returns `Scene: <paragraph>` followed by `Events: <X.X - Y.Y> <description>` lines.
- **Find mode**: given a natural-language event description, returns the matching time span as `From X.X to Y.Y.`.
- **Multichunk reasoning** (limited): `<think>`-style chunked-video reasoning with chunk-time to source-time arithmetic. Not exposed through the `.caption()` / `.find()` helpers; use a raw prompt to access it.
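When working with raw `generate()` output rather than the helpers, the Caption and Find formats listed above can be parsed with a few regular expressions. The following is a minimal sketch, not the parser shipped in `modeling_marlin.py`: the function names `parse_caption` and `parse_find` are hypothetical, and it assumes timestamps are decimal seconds wrapped in literal angle brackets, as in the format strings above.

```python
import re

# Assumed layout: "Scene: ..." followed by "Events:" and "<start - end> description" entries.
SCENE_RE = re.compile(r"Scene:\s*(.*?)\s*(?=Events:|\Z)", re.DOTALL)
EVENT_RE = re.compile(r"<\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*>\s*([^<\n]+)")
FIND_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\.?")


def parse_caption(text):
    """Split caption-mode text into a scene paragraph and a list of timed events."""
    scene_match = SCENE_RE.search(text)
    scene = scene_match.group(1) if scene_match else None
    events = [
        {"start": float(start), "end": float(end), "description": desc.strip()}
        for start, end, desc in EVENT_RE.findall(text)
    ]
    return {"scene": scene, "events": events}


def parse_find(text):
    """Return (start, end) in seconds, or None if the text is not in Find format."""
    match = FIND_RE.search(text)
    return (float(match.group(1)), float(match.group(2))) if match else None
```

Returning `None` on a failed match keeps malformed generations distinguishable from a legitimate zero-length span.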
## Architecture
Fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. Custom modeling code in `modeling_marlin.py` exposes two convenience methods (`.caption()` and `.find()`) that wrap a single canonical training prompt per mode and parse the structured output into typed Python dicts. Raw `.generate()` is also available for custom prompts.
| Component | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| Parameters | 2.21B (text + vision combined) |
| Precision | bfloat16 |
| Storage on disk | ~5.5 GB |
| Architecture string | `MarlinForConditionalGeneration` |
| `model_type` | `qwen3_5` |
| Context length | 262144 tokens |
## Training (upstream)
The following describes the upstream training pipeline as documented by NemoStation. We have not retrained or modified the weights.
- **Data**: ~400K clip-level annotations assembled from ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, and TimeLens, with dense re-annotations distilled from Gemini-3-Flash and targeted human review on the highest-impact splits.
- **Stage 1**: supervised fine-tuning on the curated corpus with a fixed canonical prompt per mode and Tarsier-schema output formatting.
- **Stage 2**: SimPO (Simple Preference Optimization) on a teacher-distilled preference set, scored against Gemini-3-Flash on factual accuracy, completeness, and temporal alignment.
- **Hardware**: single H100.
## Evaluation (upstream-reported)
Upstream benchmarks Marlin-2B on three suites:
- **CaReBench**: [arXiv:2501.00513](https://arxiv.org/abs/2501.00513)
- **DREAM-1K**: [arXiv:2407.00634](https://arxiv.org/abs/2407.00634)
- **TimeLens-Bench**: [arXiv:2512.14698](https://arxiv.org/abs/2512.14698)
Headline results reported by NemoStation: Marlin-2B tops the CaReBench leaderboard at the 2B scale and scores +6.4 mIoU over Qwen2.5-VL-7B on TimeLens-Bench (Charades / ActivityNet / QVHighlights). These numbers have not been independently re-verified in this repository.
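For readers unfamiliar with the mIoU metric, it is the mean temporal intersection-over-union between predicted and ground-truth spans. The sketch below shows the standard definition only; the exact TimeLens-Bench protocol (for example, how unparseable predictions are scored) is not documented here, so treating a missing prediction as 0 is an assumption, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) spans in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average IoU over paired spans; a None prediction scores 0 (assumption)."""
    scores = [
        temporal_iou(pred, gt) if pred is not None else 0.0
        for pred, gt in zip(preds, gts)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

A predicted span of `(14.0, 18.0)` against a ground truth of `(16.0, 20.0)` overlaps for 2 seconds out of a 6-second union, giving an IoU of one third.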
## Quickstart
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
model.compile()  # optional: wraps torch.compile, faster after first call
```
### Caption mode
```python
result = model.caption("video.mp4")
print(result["caption"]) # full raw caption text (Scene: ... Events: ...)
print(result["scene"]) # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
```
Optional kwargs:
- `max_new_tokens=2048`: generation token cap (default).
- `prompt=None`: override the canonical training prompt. Almost always leave as `None`.
- `do_sample=False`, `temperature=1.0`, `top_p=1.0`: sampling controls.
### Find mode
```python
result = model.find("video.mp4", event="a person enters the room")
print(result["raw"]) # "From 14.3 to 18.2." raw model output
print(result["span"]) # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"]) # True if output matched the trained format
```
## Raw inference
To bypass the helper methods and call `generate()` directly:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained(
    "cudabenchmarktest/video-scan", trust_remote_code=True
)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated text is decoded.
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)
```
## Output format notes
The model emits a `<think>` token at the start of every response (an artifact of training with `add_non_thinking_prefix=True`). The `.caption()` and `.find()` helpers strip this automatically. When calling `generate()` directly, strip any leading `<think>...</think>` block (with or without closing tag) from the output before parsing.
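One way to do that stripping on decoded text is a single anchored regular expression. This is a minimal sketch, not the logic used inside the helpers: the name `strip_think` is hypothetical, and it assumes the `<think>` marker survives decoding as literal text at the very start of the output.

```python
import re

# Matches a leading "<think>", plus everything up to "</think>" when a closing tag exists.
_THINK_PREFIX = re.compile(r"^\s*<think>(?:.*?</think>)?\s*", re.DOTALL)


def strip_think(text):
    """Remove a leading <think> block; if it is unclosed, remove only the opening tag."""
    return _THINK_PREFIX.sub("", text, count=1)
```

When the closing tag is missing, only the opening tag and the whitespace after it are removed, so the model's actual answer is never discarded.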
## Requirements
- `transformers >= 5.7.0` (for native `qwen3_5` architecture)
- `torch >= 2.11.0`
- `torchcodec` (video decoding)
- `qwen-vl-utils >= 0.0.14`
- `av` (torchcodec system dependency)
- `pillow`
```bash
pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow
```
## Video preprocessing
The custom modeling code sets these environment variables internally to match the training-time setup. Override them in your shell **before** importing transformers if needed.
| Env var | Default | Purpose |
|---|---|---|
| `FORCE_QWENVL_VIDEO_READER` | `torchcodec` | Video decoder backend |
| `VIDEO_MAX_PIXELS` | `200704` | Max pixels per frame (~448x448) |
| `FPS` | `2.0` | Frame sampling rate |
| `FPS_MAX_FRAMES` | `240` | Cap on total frames (~2 min at 2 FPS) |
| `FPS_MIN_FRAMES` | `4` | Floor for very short videos |
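The interaction between `FPS`, `FPS_MIN_FRAMES`, and `FPS_MAX_FRAMES` amounts to a clamp on the number of sampled frames. The sketch below approximates that arithmetic using the defaults from the table; the function names are hypothetical, and the real video reader may round the count further (for example to an even number of frames) and cannot sample more frames than the video contains.

```python
import os


def sampled_frames(duration_s, fps=2.0, min_frames=4, max_frames=240):
    """Approximate frame count for a clip: duration * fps, clamped to [min, max]."""
    wanted = int(duration_s * fps)
    return max(min_frames, min(max_frames, wanted))


def sampled_frames_from_env(duration_s):
    """Same estimate, reading overrides from the environment variables in the table."""
    return sampled_frames(
        duration_s,
        fps=float(os.environ.get("FPS", "2.0")),
        min_frames=int(os.environ.get("FPS_MIN_FRAMES", "4")),
        max_frames=int(os.environ.get("FPS_MAX_FRAMES", "240")),
    )
```

With the defaults, a 60-second clip yields 120 frames, while any clip longer than 240 / 2.0 = 120 seconds hits the cap and is presumably sampled more sparsely than 2 FPS.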
## License and attribution
This redistribution is licensed under the **Business Source License 1.1**. The full license text is in [`LICENSE`](LICENSE). The Qwen3.5-2B base weights remain under Apache License 2.0; see [`LICENSE-QWEN-BASE`](LICENSE-QWEN-BASE) and [`NOTICE`](NOTICE).
Key terms of BSL 1.1 as applied here:
- Copy, modify, redistribute, and non-production use are permitted.
- Production use is permitted **except** for offering this work to third parties on a hosted or embedded basis in a way that competes with NemoStation's paid version(s).
- Internal organizational use is explicitly not a competitive offering.
- On the **Change Date** (two years after upstream public release), the license converts to Apache License 2.0.
The "Marlin" name and any logos are trademarks of NemoStation and are not granted by this license. The class identifier `MarlinForConditionalGeneration` and the module name `modeling_marlin.py` are preserved only because `auto_map` requires them for `trust_remote_code` loading; they do not imply trademark use beyond technical interoperability.
Upstream source: <https://huggingface.co/NemoStation/Marlin-2B>