Commit ee88bb4 (verified) by yiyexy
1 Parent(s): bc71fa5

Add codec video backend & docs (README.md)

Files changed (1)
  1. README.md +69 -2
README.md CHANGED
@@ -23,10 +23,34 @@ The model is distributed as a HuggingFace `transformers` checkpoint with custom
 
  ## Requirements
 
  ```bash
  pip install "transformers>=5.7.0" "torch>=2.4" pillow requests decord
  ```
 
  ## Quick start
 
  The repository ships a ready-to-run `demo_inference.py` that covers both image and video paths.
@@ -90,11 +114,54 @@ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
  print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
  ```
 
  ## Notes
 
  - The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
- - `chat_template.jinja` follows the Qwen3 chat format and emits `<|vision_start|>...<|vision_end|>` placeholders; the processor expands them per-frame for video.
- - Inference was validated to be bit-exact at the pixel level and prefix-identical at the token level against the original reference implementation.
 
  ## License
 
 
  ## Requirements
 
+ ### Base (image + frame-sampling video)
+
  ```bash
  pip install "transformers>=5.7.0" "torch>=2.4" pillow requests decord
  ```
 
+ ### Optional: codec video backend
+
+ The model ships a second video backend (`video_backend="codec"`) that replaces
+ uniform frame sampling with codec-aware **canvas packing** driven by motion
+ vectors and bit-cost, which typically yields stronger long-video accuracy at the
+ same token budget. To enable it you need two extra pieces:
+
+ ```bash
+ # 1. The cv-preinfer CLI (PyPI: codec-video-prep) drives canvas extraction.
+ pip install codec-video-prep opencv-python
+
+ # 2. A working `ffmpeg` binary must be on PATH.
+ # Verify with: ffmpeg -version
+ ```
+
+ **ffmpeg version:** ffmpeg **4.4.x – 7.x** is recommended.
+
+ The codec backend additionally needs **`flock`-style file locking** (already
+ present on Linux/macOS) for the on-disk result cache, and roughly **2 GB of free
+ disk space** under `$ONLINE_CODEC_CACHE_DIR` (defaults to
+ `$HF_HOME/online_codec_v2`) per processed video.
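
The prerequisites listed above can be checked before the first codec run. The following is a minimal sketch and not part of the repository: `resolve_codec_cache_dir`, `existing_ancestor` and `check_codec_prerequisites` are hypothetical helpers, the executable name `cv-preinfer` is assumed from the CLI name, and the `~/.cache/huggingface` fallback for an unset `$HF_HOME` is the usual Hugging Face default rather than something this README states.

```python
import os
import shutil
from pathlib import Path


def resolve_codec_cache_dir(env=None):
    """Resolve the cache directory: $ONLINE_CODEC_CACHE_DIR, else $HF_HOME/online_codec_v2."""
    env = os.environ if env is None else env
    explicit = env.get("ONLINE_CODEC_CACHE_DIR")
    if explicit:
        return Path(explicit).expanduser()
    # Assumption: fall back to the common Hugging Face cache location.
    hf_home = env.get("HF_HOME") or "~/.cache/huggingface"
    return Path(hf_home).expanduser() / "online_codec_v2"


def existing_ancestor(path):
    """Walk up to the nearest directory that exists, so disk usage can be queried."""
    path = Path(path).resolve()
    while not path.exists():
        path = path.parent
    return path


def check_codec_prerequisites(min_free_gb=2.0, env=None):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for tool in ("ffmpeg", "cv-preinfer"):  # executable names are assumptions
        if shutil.which(tool) is None:
            problems.append(f"`{tool}` not found on PATH")
    cache_dir = resolve_codec_cache_dir(env)
    free_gb = shutil.disk_usage(existing_ancestor(cache_dir)).free / 1e9
    if free_gb < min_free_gb:
        problems.append(f"only {free_gb:.1f} GB free near {cache_dir}")
    return problems
```

Calling `check_codec_prerequisites()` once at start-up and printing the returned list gives a clearer message than a failure deep inside `processor(...)`. Because the README quotes roughly 2 GB per processed video, `min_free_gb` would need to scale with the number of videos.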
+
  ## Quick start
 
  The repository ships a ready-to-run `demo_inference.py` that covers both image and video paths.
 
  print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
  ```
 
+ ### Video: codec backend (recommended for long videos)
+
+ The codec backend is exposed as a single processor kwarg
+ (`video_backend="codec"`). Everything else (canvas extraction via
+ `cv-preinfer`, on-disk caching, patch-position bookkeeping, chat-template
+ rewriting) happens inside `processor(...)`:
+
+ ```python
+ # Make sure: `pip install codec-video-prep opencv-python` and ffmpeg on PATH.
+ messages = [{"role": "user", "content": [
+     {"type": "video"},
+     {"type": "text", "text": "Describe what happens in this long video."},
+ ]}]
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ inputs = processor(
+     text=[text],
+     videos=["long_clip.mp4"],
+     video_backend="codec",
+     max_pixels=150000,  # per-canvas pixel budget; lower if OOM
+     return_tensors="pt",
+     padding=True,
+     # Optional: override codec defaults from preprocessor_config.json
+     # codec_config={"target_canvas": 32, "group_size": 32, "images_per_group": 4},
+ )
+ inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}
+
+ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+ print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
+ ```
+
+ Defaults for the codec pipeline live in `preprocessor_config.json` under the
+ `"codec"` key (`target_canvas=32`, `group_size=32`, `images_per_group=4`,
+ `patch=14`, `min_group_frames=8`, `max_group_frames=64`); they can be
+ overridden per call via `codec_config={...}`.
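
To see which values a given `codec_config` override would leave in effect, the defaults and the override can be merged by hand. This is only an illustrative sketch: `effective_codec_config` is a hypothetical helper, the JSON text below simply mirrors the defaults quoted above, and a plain key-by-key merge is an assumption about how the processor combines the two.

```python
import json

# Example text mirroring the defaults listed in the README; the real
# preprocessor_config.json contains other keys as well.
EXAMPLE_PREPROCESSOR_CONFIG = """
{
  "codec": {
    "target_canvas": 32,
    "group_size": 32,
    "images_per_group": 4,
    "patch": 14,
    "min_group_frames": 8,
    "max_group_frames": 64
  }
}
"""


def effective_codec_config(config_text, overrides=None):
    """Merge per-call overrides on top of the defaults under the "codec" key."""
    defaults = json.loads(config_text).get("codec", {})
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        # Rejecting unknown keys is a choice made for this sketch only.
        raise KeyError(f"unknown codec option(s): {unknown}")
    return {**defaults, **overrides}


merged = effective_codec_config(EXAMPLE_PREPROCESSOR_CONFIG, {"target_canvas": 16})
print(merged["target_canvas"], merged["group_size"])  # prints: 16 32
```

Rejecting unknown keys catches typos such as `target_canvas` misspelled, at the cost of being stricter than the real processor may be.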
+
+ **Short-video behaviour:** if the input video has fewer frames than
+ `target_canvas` requires (or fewer than `min_group_frames`), a `UserWarning`
+ is emitted and inference proceeds with however many canvases `cv-preinfer`
+ can actually form. For very short clips, falling back to the frame-sampling
+ backend is usually a better choice.
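
One way to automate the fallback described above is to watch for the `UserWarning` and rerun with the frames backend. The sketch below assumes the warning is raised through Python's standard `warnings` machinery; `run_with_short_video_fallback` is a hypothetical helper, and it treats any `UserWarning` as the short-video signal, which real code would likely narrow by matching the message text.

```python
import warnings


def run_with_short_video_fallback(run_codec, run_frames):
    """Call run_codec(); if it emitted a UserWarning, use run_frames() instead.

    Both arguments are zero-argument callables, for example lambdas wrapping
    processor(..., video_backend="codec") and processor(..., video_backend="frames").
    Returns a (result, backend_name) tuple.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure repeated warnings are recorded
        result = run_codec()
    if any(issubclass(w.category, UserWarning) for w in caught):
        return run_frames(), "frames"
    return result, "codec"
```

The codec pass still runs once before the fallback, so this trades some preprocessing time for not having to know the exact frame threshold in advance. Warnings recorded inside the block are not shown to the user unless they are re-emitted.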
+
  ## Notes
 
  - The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
+ - `chat_template.jinja` follows the Qwen3 chat format and emits `<|vision_start|>...<|vision_end|>` placeholders; the processor expands them per-frame (frames backend) or per-canvas-patch-run (codec backend).
+ - Two video backends are available via `processor(..., video_backend=...)`: **`"frames"`** (default, uniform sampling) and **`"codec"`** (canvas packing via `cv-preinfer`, requires `codec-video-prep` + `ffmpeg`).
+ - Inference was validated to be bit-exact at the pixel level and prefix-identical at the token level against the original reference implementation, on both backends.
 
  ## License