---
license: other
license_name: bsl-1.1
license_link: LICENSE
language:
- en
base_model: Qwen/Qwen3.5-2B
pipeline_tag: video-text-to-text
library_name: transformers
tags:
- video
- multimodal
- video-captioning
- temporal-grounding
- qwen
- text-generation
- VLM
---

# video-scan

A 2B-parameter video VLM for dense captioning and natural-language temporal grounding. Given a video, it produces structured Scene + Event captions with second-precise timestamps, and resolves natural-language queries to `(start, end)` time spans in the video.

This repository is a redistribution of [`NemoStation/Marlin-2B`](https://huggingface.co/NemoStation/Marlin-2B) packaged for internal use. Weights are unmodified. Licensed under the Business Source License 1.1; see [`LICENSE`](LICENSE) and [`NOTICE`](NOTICE) for terms and attribution.

The internal Python module name (`modeling_marlin.py`) and class name (`MarlinForConditionalGeneration`) are preserved verbatim so that `trust_remote_code=True` loading via `auto_map` continues to work without modification.

## Capabilities

- **Caption mode**: returns `Scene: ` followed by `Events: ` lines.
- **Find mode**: given a natural-language event description, returns the matching time span as `From X.X to Y.Y.`.
- **Multichunk reasoning** (limited): ``-style chunked-video reasoning with chunk-time to source-time arithmetic. Not exposed through the `.caption()` / `.find()` helpers; use a raw prompt to access it.

## Architecture

Fine-tune of Qwen3.5-2B with the video-capable visual tower kept intact. Custom modeling code in `modeling_marlin.py` exposes two convenience methods (`.caption()` and `.find()`) that wrap a single canonical training prompt per mode and parse the structured output into typed Python dicts. Raw `.generate()` is also available for custom prompts.

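The helpers are described as parsing structured output into typed Python dicts. As a rough illustration of what that could look like for find-mode text obtained from a raw `.generate()` call, here is a minimal sketch. The function name `parse_find_output` and the regular expression are assumptions made for this example; they are not taken from `modeling_marlin.py`, whose actual parsing logic may differ.

```python
import re
from typing import Optional, Tuple

# Assumed pattern for the documented find-mode format "From X.X to Y.Y."
_SPAN_RE = re.compile(r"From\s+(\d+(?:\.\d+)?)\s+to\s+(\d+(?:\.\d+)?)\.?")


def parse_find_output(raw: str) -> dict:
    """Hypothetical helper returning the documented keys: raw, span, format_ok."""
    span: Optional[Tuple[float, float]] = None
    match = _SPAN_RE.search(raw)
    if match:
        start, end = float(match.group(1)), float(match.group(2))
        # Reject spans that end before they start.
        if start <= end:
            span = (start, end)
    return {"raw": raw, "span": span, "format_ok": span is not None}
```

Returning `None` for the span on a mismatch, rather than raising, keeps the sketch consistent with the documented behaviour of `result["span"]` on parse failure.
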
| Component | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| Parameters | 2.21B (text + vision combined) |
| Precision | bfloat16 |
| Storage on disk | ~5.5 GB |
| Architecture string | `MarlinForConditionalGeneration` |
| `model_type` | `qwen3_5` |
| Context length | 262144 tokens |

## Training (upstream)

The following describes the upstream training pipeline as documented by NemoStation. We have not retrained or modified the weights.

- **Data**: ~400K clip-level annotations assembled from ActivityNet, LSMDC, Charades, Charades-Ego, TREC-VTT, WebVid-10M, HC-STVG, VidSTG, and TimeLens, with dense re-annotations distilled from Gemini-3-Flash and targeted human review on the highest-impact splits.
- **Stage 1**: supervised fine-tuning on the curated corpus with a fixed canonical prompt per mode and Tarsier-schema output formatting.
- **Stage 2**: SimPO (Simple Preference Optimization) on a teacher-distilled preference set, scored against Gemini-3-Flash on factual accuracy, completeness, and temporal alignment.
- **Hardware**: single H100.

## Evaluation (upstream-reported)

Upstream benchmarks Marlin-2B on three suites:

- **CaReBench**: [arXiv:2501.00513](https://arxiv.org/abs/2501.00513)
- **DREAM-1K**: [arXiv:2407.00634](https://arxiv.org/abs/2407.00634)
- **TimeLens-Bench**: [arXiv:2512.14698](https://arxiv.org/abs/2512.14698)

Headline numbers reported by NemoStation: tops the CaReBench leaderboard at the 2B scale, +6.4 mIoU over Qwen2.5-VL-7B on TimeLens-Bench (Charades / ActivityNet / QVHighlights). These numbers have not been independently re-verified in this repository.

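The TimeLens-Bench figure above is reported in mIoU. For readers unfamiliar with the metric, the following is a minimal sketch of temporal IoU between a predicted and a reference `(start, end)` span, averaged over examples. The function names are illustrative, and scoring an unparseable prediction (`None`) as 0 is an assumption of this sketch; the benchmark's official scoring script may handle such cases differently.

```python
from typing import Optional, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, ref: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], ref[1]) - max(pred[0], ref[0]))
    union = (pred[1] - pred[0]) + (ref[1] - ref[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Optional[Span]], refs: Sequence[Span]) -> float:
    """Average temporal IoU; a None prediction counts as 0 (assumption)."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not refs:
        return 0.0
    scores = [temporal_iou(p, r) if p is not None else 0.0
              for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)
```

A predicted span such as the `(14.3, 18.2)` tuple returned by find mode could be passed directly as `pred`, with the reference span taken from a dataset's annotations.
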
## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
model.compile()  # optional: wraps torch.compile, faster after first call
```

### Caption mode

```python
result = model.caption("video.mp4")
print(result["caption"])  # full raw caption text (Scene: ... Events: ...)
print(result["scene"])    # parsed Scene paragraph
for ev in result["events"]:
    print(f"<{ev['start']:.1f} - {ev['end']:.1f}> {ev['description']}")
```

Optional kwargs:

- `max_new_tokens=2048`: generation token cap (default).
- `prompt=None`: override the canonical training prompt. Almost always leave as `None`.
- `do_sample=False`, `temperature=1.0`, `top_p=1.0`: sampling controls.

### Find mode

```python
result = model.find("video.mp4", event="a person enters the room")
print(result["raw"])        # "From 14.3 to 18.2." raw model output
print(result["span"])       # (14.3, 18.2) tuple in seconds, or None on parse failure
print(result["format_ok"])  # True if output matched the trained format
```

## Raw inference

To bypass the helper methods and call `generate()` directly:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "cudabenchmarktest/video-scan",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": "cuda"},
)
processor = AutoProcessor.from_pretrained(
    "cudabenchmarktest/video-scan", trust_remote_code=True
)

messages = [{"role": "user", "content": [
    {"type": "video", "video": "video.mp4"},
    {"type": "text", "text": "Your custom prompt here"},
]}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens so only the newly generated tokens are decoded.
out = out[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(out, skip_special_tokens=True)[0]
print(text)
```

## Output format notes

The model emits a `` token at the start of every response (an artifact of training with `add_non_thinking_prefix=True`). The `.caption()` and `.find()` helpers strip this automatically. When calling `generate()` directly, strip any leading `...` block (with or without closing tag) from the output before parsing.

## Requirements

- `transformers >= 5.7.0` (for native `qwen3_5` architecture)
- `torch >= 2.11.0`
- `torchcodec` (video decoding)
- `qwen-vl-utils >= 0.0.14`
- `av` (torchcodec system dependency)
- `pillow`

```bash
pip install "transformers>=5.7.0" "torch>=2.11.0" torchcodec "qwen-vl-utils>=0.0.14" av pillow
```

## Video preprocessing

The custom modeling code sets these environment variables internally to match the training-time setup. Override them in your shell **before** importing transformers if needed.

| Env var | Default | Purpose |
|---|---|---|
| `FORCE_QWENVL_VIDEO_READER` | `torchcodec` | Video decoder backend |
| `VIDEO_MAX_PIXELS` | `200704` | Max pixels per frame (~448x448) |
| `FPS` | `2.0` | Frame sampling rate |
| `FPS_MAX_FRAMES` | `240` | Cap on total frames (~2 min at 2 FPS) |
| `FPS_MIN_FRAMES` | `4` | Floor for very short videos |

## License and attribution

This redistribution is licensed under the **Business Source License 1.1**. The full license text is in [`LICENSE`](LICENSE). The Qwen3.5-2B base weights remain under Apache License 2.0; see [`LICENSE-QWEN-BASE`](LICENSE-QWEN-BASE) and [`NOTICE`](NOTICE).

Key terms of BSL 1.1 as applied here:

- Copy, modify, redistribute, and non-production use are permitted.
- Production use is permitted **except** for offering this work to third parties on a hosted or embedded basis in a way that competes with NemoStation's paid version(s).
- Internal organizational use is explicitly not a competitive offering.
- On the **Change Date** (two years after upstream public release), the license converts to Apache License 2.0.

The "Marlin" name and any logos are trademarks of NemoStation and are not granted by this license. The class identifier `MarlinForConditionalGeneration` and the module name `modeling_marlin.py` are preserved only because `auto_map` requires them for `trust_remote_code` loading; they do not imply trademark use beyond technical interoperability.

Upstream source: