xiaomoguhzz/VisionEncoder

@@ -9,31 +9,44 @@ tags:
   - dinov3
 ---
-# VisionEncoder Checkpoints
-Final model checkpoints from the **VisionEncoder** research project.
-**Training code**: https://github.com/xiaomoguhz/VisionEncoder
-## Contents
-Each directory corresponds to one training pipeline in the code repo:
-| Directory | Training code |
 |---|---|
-| `declip_siglip2/spatial_align/` | `declip_siglip2/` — DeCLIP spatial alignment distillation on SigLIP2 using DINOv2 / DINOv3 as teacher |
-| `kd_mllm/s1_kd_pretrain/` | `ms-swift/kd_mllm/` stage-1 pretrain (`ms-swift/run_s1.sh`) |
-| `kd_mllm/s1_siglip2_qwen3_4b/` | `ms-swift/kd_mllm/` stage-1, SigLIP2 + Qwen3-4B backbone |
-| `kd_mllm/s2_siglip2_qwen3_4b_10pct/` | `ms-swift/kd_mllm/` stage-2 SFT on 10% data (`run_s2.sh`) |
-| `self_refine/qwen3vl_2b_10pct/` | `ms-swift/self_refine/` — register token injection + auto-calibrated GP threshold loss |
-| `video_mllm_swift/s1_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with SigLIP2 encoder |
-| `video_mllm_swift/s1_declip_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with DeCLIP-SigLIP2 encoder |
-| `video_mllm_swift/s2_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, SigLIP2 |
-| `video_mllm_swift/s2_declip_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, DeCLIP-SigLIP2 |
-| `video_mllm_swift/s2_image_only_10pct/` | Ablation: image-only stage-2 training |
-| `ms-swift-data/` | Not a checkpoint — preprocessed SFT training data (`ms-swift/data/`) used by the pipelines above |
-## Related repositories
-- **Code**: https://github.com/xiaomoguhz/VisionEncoder
-- **Evaluation data (~323 GB tarballs)**: https://huggingface.co/datasets/xiaomoguhzz/R3-Bench-data

   - dinov3
 ---
+# VisionEncoder
+Hosted artifacts (derived data + trained checkpoints) for the **VisionEncoder** research project.
+**Training code + full reproduction guide**: https://github.com/xiaomoguhz/VisionEncoder
+The repo is organized into three top-level folders.
+## `data/` — current (V9.x) reproduction data (~6.5G)
+| Path | Content |
 |---|---|
+| `data/vmllm_cached/qwen3vit/` | S2 `cached_dataset` arrow (image/video, 10pct + full); fed directly to stage-2 |
+| `data/ms-swift-data/` | sampled sharegpt jsonl (10pct + full) |
+| `data/llava_video/` | V9 decode-probed `good_manifest` for the video path |
+## `ckpts/` — ready-made 4B MLLM inference weights
+| Path | Content |
+|---|---|
+| `ckpts/4b_stock` | 4B stock baseline (raw Qwen3.5 ViT, skips declip), checkpoint-505, 9.5G |
+| `ckpts/4b_v9_1` | 4B V9.1 (V-JEPA 2.1 video self-distill), checkpoint-505, 9.5G |
+Download either checkpoint and pass it directly to evaluation (see step 7 of the GitHub README); this skips the declip, S1, and S2 training stages.
+## `legacy/` — historical assets (~368G)
+Artifacts from earlier lines of work, not needed to reproduce the current main line: `declip_siglip2/spatial_align`, `kd_mllm`, `self_refine`, `video_mllm_swift` (old SigLIP2 / image-only S1+S2 ckpts), and old ViT-family arrow caches.
+## Download
+```bash
+# current dev data
+huggingface-cli download xiaomoguhzz/VisionEncoder --include "data/*" --local-dir .
+# ready-made 4B MLLM ckpt (eval directly)
+huggingface-cli download xiaomoguhzz/VisionEncoder --include "ckpts/4b_v9_1/*" --local-dir .
+```
+## Related
+- Code + reproduction guide: https://github.com/xiaomoguhz/VisionEncoder
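
The download commands in the new README take the same shape for every subset: one `--include` pattern per top-level folder. As a rough sketch, they could be wrapped in a small helper that also fetches the stock baseline (`ckpts/4b_stock`), which the README lists but does not show a command for. The names `fetch_subset` and `DRY_RUN` are hypothetical and are not part of the repository; only the `huggingface-cli download` flags are taken from the README.

```bash
# Sketch only: fetch_subset and DRY_RUN are illustrative names, not part of the repo.
REPO="xiaomoguhzz/VisionEncoder"

# fetch_subset PATTERN [DEST]
#   PATTERN  glob passed to --include, e.g. "data/*" or "ckpts/4b_stock/*"
#   DEST     target directory (defaults to the current directory)
fetch_subset() {
  pattern="$1"
  dest="${2:-.}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, so the pattern can be checked first.
    printf 'huggingface-cli download %s --include "%s" --local-dir %s\n' \
      "$REPO" "$pattern" "$dest"
  else
    huggingface-cli download "$REPO" --include "$pattern" --local-dir "$dest"
  fi
}

# Preview the command for the stock baseline without downloading anything.
DRY_RUN=1 fetch_subset "ckpts/4b_stock/*"
```

Leaving `DRY_RUN` unset runs the real download, for example `fetch_subset "ckpts/4b_stock/*"`. Previewing first may be worthwhile because a pattern that is too broad could pull in `legacy/`, which the README sizes at about 368G.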