Inference speed
Using Find on a 2 min. 1920x832 video takes: 459.15s on RTX 4090 - can anything be done to speed it up? Like downscaling the video beforehand? Or is a turbo version planned?
459s for a 2-min 1920×832 clip is on the slow end but expected at that resolution. Two things you can try:
- Pre-downscale the video. 1920×832 is roughly 8× over the model's per-frame pixel budget (we cap at ~200K pixels via smart_resize internally). The internal resize handles it, but at decode cost. Downscaling to ~640×270 before sending to the model cuts the visual-encoder time substantially without hurting accuracy for grounding-style queries.
- Quantise the weights. On a 4090, AWQ-quantised weights + bf16 KV-cache typically give 3-4× throughput vs vanilla bf16. We haven't shipped a quantized checkpoint ourselves yet, but you can do this in a half-hour with llm-compressor or AutoAWQ. If you do, we'd be curious what mIoU you get on TimeLens-Bench to compare against our bf16 numbers.
No "turbo" variant planned; the model is already 2B params, so the realistic speedup path is inference-side, not architectural.
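For anyone who wants to try the pre-downscale suggestion, here is a minimal sketch of how a target size could be picked. It assumes the ~200K-pixel budget quoted above and rounds each side down to an even number (an assumption made here because many video encoders require even dimensions; the model's own smart_resize may round differently). The helper names `fit_to_pixel_budget` and `ffmpeg_downscale_cmd` are hypothetical, and the command is only built, not executed.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=200_000, multiple=2):
    """Scale (width, height) down to fit max_pixels, keeping the aspect ratio.

    max_pixels is the ~200K per-frame budget mentioned in the thread;
    rounding down to a multiple of 2 is an assumption, not the model's rule.
    """
    if width * height <= max_pixels:
        scale = 1.0  # already within budget, leave the size unchanged
    else:
        scale = math.sqrt(max_pixels / (width * height))
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    return new_w, new_h


def ffmpeg_downscale_cmd(src, dst, width, height):
    """Build (but do not run) an ffmpeg command that rescales the video."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"scale={width}:{height}",
        "-c:a", "copy",  # keep the audio stream untouched
        dst,
    ]


if __name__ == "__main__":
    w, h = fit_to_pixel_budget(1920, 832)
    print(f"over budget by {1920 * 832 / 200_000:.1f}x, target size {w}x{h}")
    print(" ".join(ffmpeg_downscale_cmd("input.mp4", "small.mp4", w, h)))
```

The printed command can be run by hand or passed to `subprocess.run`. Whether this helps in practice depends on where the time is actually spent; if most of it goes into text generation rather than the visual encoder, a smaller input will not change much.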
I tried downscaling and it didn't help. I would like to add it to my Pallaidium AI add-on for Blender, but currently it is simply too slow for me. Will check in later to see if something has improved. Thank you.
Here is my attempt at SDNQ-int8 quantization (though I basically do not know what I'm doing). It seems to just hang for me during inference outside the built-in test (the script is included): https://huggingface.co/tintwotin/Marlin-2B-SDNQ-int8
1280x70 - 361 frames:
Marlin: captioning completed in 498.6s
Looks like nothing has been gained speed-wise.
Getting Triton to work on Windows seems to add a bit of speed with the SDNQ weights, but it is still very slow.