Update model card for data/ + ckpts/ + legacy/ restructure
README.md
CHANGED
@@ -9,31 +9,44 @@ tags:
 - dinov3
 ---

-# VisionEncoder

-**Training code**: https://github.com/xiaomoguhz/VisionEncoder
+# VisionEncoder

+Hosted artifacts (derived data + trained checkpoints) for the **VisionEncoder** research project.

+**Training code + full reproduction guide**: https://github.com/xiaomoguhz/VisionEncoder

+The repo is organized into three top-level folders.

+## `data/` — current (V9.x) reproduction data (~6.5G)

+| Path | Content |
 |---|---|
+| `data/vmllm_cached/qwen3vit/` | S2 `cached_dataset` arrow (image/video, 10pct + full); fed directly to stage-2 |
+| `data/ms-swift-data/` | sampled sharegpt jsonl (10pct + full) |
+| `data/llava_video/` | V9 decode-probed `good_manifest` for the video path |
+
+## `ckpts/` — ready-made 4B MLLM inference weights
+
+| Path | Content |
+|---|---|
+| `ckpts/4b_stock` | 4B stock baseline (raw Qwen3.5 ViT, skips declip), checkpoint-505, 9.5G |
+| `ckpts/4b_v9_1` | 4B V9.1 (V-JEPA 2.1 video self-distill), checkpoint-505, 9.5G |
+
+Download either and feed it straight to evaluation (see the GitHub README, step 7) to skip declip + S1 + S2.
+
+## `legacy/` — historical assets (~368G)
+
+Early-line products, not needed to reproduce the current main line: `declip_siglip2/spatial_align`, `kd_mllm`, `self_refine`, `video_mllm_swift` (old SigLIP2 / image-only S1+S2 ckpts), and old ViT-family arrow caches.
+
+## Download
+
+```bash
+# current dev data
+huggingface-cli download xiaomoguhzz/VisionEncoder --include "data/*" --local-dir .
+# ready-made 4B MLLM ckpt (eval directly)
+huggingface-cli download xiaomoguhzz/VisionEncoder --include "ckpts/4b_v9_1/*" --local-dir .
+```
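The two commands above differ only in the `--include` pattern, so the same form should work for any other path listed in the tables, such as `ckpts/4b_stock` or `data/llava_video/`. Because `legacy/` is about 368G, scoping each download to one subfolder is usually what you want. The sketch below is not part of the repository: `ve_download_cmd` is a hypothetical helper name, and it only prints the command (a dry run) so you can inspect it before fetching several gigabytes.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not shipped with the repo: builds the download
# command for a single subfolder of the hosted artifacts.
REPO="xiaomoguhzz/VisionEncoder"

# Print the huggingface-cli invocation for one repo subfolder.
#   $1: subfolder inside the repo (a trailing slash is tolerated)
#   $2: local target directory (defaults to the current directory)
ve_download_cmd() {
  local subdir="${1%/}"   # strip one trailing slash, if present
  local dest="${2:-.}"
  printf 'huggingface-cli download %s --include "%s/*" --local-dir %s\n' \
    "$REPO" "$subdir" "$dest"
}

# Dry run: show the commands without downloading anything.
ve_download_cmd ckpts/4b_stock
ve_download_cmd data/llava_video ./VisionEncoder
```

To actually start a download, evaluate the printed command, for example `eval "$(ve_download_cmd ckpts/4b_stock)"`. The helper does not quote the target directory, so it assumes a path without spaces.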
+
+## Related
+
+- Code + reproduction guide: https://github.com/xiaomoguhz/VisionEncoder