Inference speed

#8
by tintwotin - opened

Running Find on a 2-minute 1920x832 video takes 459.15s on an RTX 4090. Can anything be done to speed it up, such as downscaling the video beforehand? Or is a turbo version planned?

Nemo Station org

459s for a 2-min 1920×832 clip is on the slow end but expected at that resolution. Two things you can try:

  1. Pre-downscale the video. 1920×832 is roughly 8× over the model's per-frame pixel budget (we cap at ~200K pixels via smart_resize internally). The internal resize handles it, but at decode cost. Downscaling to ~640×270 before sending to the model cuts the visual-encoder time substantially without hurting accuracy for grounding-style queries.
  2. Quantise the weights. On a 4090, AWQ-quantised weights + bf16 KV-cache typically give 3-4× throughput vs vanilla bf16. We haven't shipped a quantised checkpoint ourselves yet, but you can do this in a half-hour with llm-compressor or AutoAWQ. If you do, we'd be curious what mIoU you get on TimeLens-Bench to compare against our bf16 numbers.
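
For the first suggestion, here is a rough sketch of how you might pick a target size and build an ffmpeg command. It is not the model's own smart_resize; the helper names, the rounding to even dimensions, and the file names are assumptions for illustration. Only the ~200K pixel figure comes from the note above.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=200_000, multiple=2):
    """Scale (width, height) down so that w * h <= max_pixels.

    The aspect ratio is kept approximately; both sides are rounded down to a
    multiple of `multiple` (2 by default, an assumption that suits most
    video codecs). Frames already under the budget are only rounded.
    """
    pixels = width * height
    scale = 1.0 if pixels <= max_pixels else math.sqrt(max_pixels / pixels)
    w = max(multiple, int(width * scale) // multiple * multiple)
    h = max(multiple, int(height * scale) // multiple * multiple)
    return w, h


def ffmpeg_downscale_cmd(src, dst, width, height):
    """Build an ffmpeg argument list that rescales the video stream only."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"scale={width}:{height}",
        "-c:a", "copy",  # leave any audio track untouched
        dst,
    ]


if __name__ == "__main__":
    w, h = fit_to_pixel_budget(1920, 832)
    print(f"target size: {w}x{h} ({w * h} pixels)")
    # Hypothetical file names; pass the list to subprocess.run to execute it.
    print(" ".join(ffmpeg_downscale_cmd("input.mp4", "small.mp4", w, h)))
```

Any size at or below the budget should avoid the internal resize; going smaller, as with the ~640×270 suggested above, trades a little detail for less decode work.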

No "turbo" variant is planned. The model is already 2B params, so the realistic speedup path is inference-side, not architectural.

I tried downscaling and it didn't help. I would like to add it to my Pallaidium AI add-on for Blender, but currently it is simply too slow for me. Will check in later to see if something has improved. Thank you.

My attempt at SDNQ-int8 quantization (though I basically do not know what I'm doing) seems to just hang during inference outside the built-in test. The script is included: https://huggingface.co/tintwotin/Marlin-2B-SDNQ-int8

1280x70 - 361 frames:
Marlin: captioning completed in 498.6s

Looks like nothing has been gained speedwise.

Getting Triton to work on Windows seems to add a bit of speed to the SDNQ weights, but it is still very slow.
