Commit de4ee9c (verified), committed by xiaomoguhzz
1 Parent(s): 8eda807

Update model card for data/ + ckpts/ + legacy/ restructure

Files changed (1)
  1. README.md +35 -22
README.md CHANGED
@@ -9,31 +9,44 @@ tags:
  - dinov3
  ---

- # VisionEncoder Checkpoints

- Final model checkpoints from the **VisionEncoder** research project.

- **Training code**: https://github.com/xiaomoguhz/VisionEncoder

- ## Contents

- Each directory corresponds to one training pipeline in the code repo:

- | Directory | Training code |
  |---|---|
- | `declip_siglip2/spatial_align/` | `declip_siglip2/` DeCLIP spatial alignment distillation on SigLIP2 using DINOv2 / DINOv3 as teacher |
- | `kd_mllm/s1_kd_pretrain/` | `ms-swift/kd_mllm/` stage-1 pretrain (`ms-swift/run_s1.sh`) |
- | `kd_mllm/s1_siglip2_qwen3_4b/` | `ms-swift/kd_mllm/` stage-1, SigLIP2 + Qwen3-4B backbone |
- | `kd_mllm/s2_siglip2_qwen3_4b_10pct/` | `ms-swift/kd_mllm/` stage-2 SFT on 10% data (`run_s2.sh`) |
- | `self_refine/qwen3vl_2b_10pct/` | `ms-swift/self_refine/` register token injection + auto-calibrated GP threshold loss |
- | `video_mllm_swift/s1_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with SigLIP2 encoder |
- | `video_mllm_swift/s1_declip_siglip2_qwen3_1.7b/` | `ms-swift/video_mllm/` stage-1 with DeCLIP-SigLIP2 encoder |
- | `video_mllm_swift/s2_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, SigLIP2 |
- | `video_mllm_swift/s2_declip_siglip2_qwen3_1.7b_10pct/` | `ms-swift/video_mllm/` stage-2 SFT, DeCLIP-SigLIP2 |
- | `video_mllm_swift/s2_image_only_10pct/` | Ablation: image-only stage-2 training |
- | `ms-swift-data/` | Not a checkpoint — preprocessed SFT training data (`ms-swift/data/`) used by the pipelines above |
-
- ## Related repositories
-
- - **Code**: https://github.com/xiaomoguhz/VisionEncoder
- - **Evaluation data (~323 GB tarballs)**: https://huggingface.co/datasets/xiaomoguhzz/R3-Bench-data
+ # VisionEncoder

+ Hosted artifacts (derived data + trained checkpoints) for the **VisionEncoder** research project.

+ **Training code + full reproduction guide**: https://github.com/xiaomoguhz/VisionEncoder

+ The repo is organized into three top-level folders.

+ ## `data/` current (V9.x) reproduction data (~6.5G)

+ | Path | Content |
  |---|---|
+ | `data/vmllm_cached/qwen3vit/` | S2 `cached_dataset` arrow (image/video, 10pct + full); fed directly to stage-2 |
+ | `data/ms-swift-data/` | sampled sharegpt jsonl (10pct + full) |
+ | `data/llava_video/` | V9 decode-probed `good_manifest` for the video path |
+
+ ## `ckpts/` — ready-made 4B MLLM inference weights
+
+ | Path | Content |
+ |---|---|
+ | `ckpts/4b_stock` | 4B stock baseline (raw Qwen3.5 ViT, skips declip), checkpoint-505, 9.5G |
+ | `ckpts/4b_v9_1` | 4B V9.1 (V-JEPA 2.1 video self-distill), checkpoint-505, 9.5G |
+
+ Download either and feed it straight to evaluation (see the GitHub README, step 7) to skip declip + S1 + S2.
+
+ ## `legacy/` — historical assets (~368G)
+
+ Early-line products, not needed to reproduce the current main line: `declip_siglip2/spatial_align`, `kd_mllm`, `self_refine`, `video_mllm_swift` (old SigLIP2 / image-only S1+S2 ckpts), and old ViT-family arrow caches.
+
+ ## Download
+
+ ```bash
+ # current dev data
+ huggingface-cli download xiaomoguhzz/VisionEncoder --include "data/*" --local-dir .
+ # ready-made 4B MLLM ckpt (eval directly)
+ huggingface-cli download xiaomoguhzz/VisionEncoder --include "ckpts/4b_v9_1/*" --local-dir .
+ ```
+
+ ## Related
+
+ - Code + reproduction guide: https://github.com/xiaomoguhz/VisionEncoder
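
The two download commands in the new README share one shape: the repo id, an `--include` glob for one of the top-level folders, and a local directory. A minimal sketch of a wrapper around that shape follows. The function name `fetch_subset` and the `DRY_RUN` switch are hypothetical and not part of the repository; the `ckpts/4b_stock/*` pattern is assumed by analogy with the `ckpts/4b_v9_1/*` pattern shown in the README.

```bash
# Sketch only: fetch_subset and DRY_RUN are illustrative names, not repo tooling.
REPO="xiaomoguhzz/VisionEncoder"

# fetch_subset PATTERN [DEST]
#   PATTERN  glob passed to --include, e.g. "data/*" or "ckpts/4b_v9_1/*"
#   DEST     local directory (defaults to the current directory)
fetch_subset() {
  pattern="$1"
  dest="${2:-.}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, so nothing is downloaded.
    printf 'huggingface-cli download %s --include "%s" --local-dir %s\n' \
      "$REPO" "$pattern" "$dest"
  else
    huggingface-cli download "$REPO" --include "$pattern" --local-dir "$dest"
  fi
}

# Dry run: show the command for the stock baseline (pattern assumed, see above).
DRY_RUN=1 fetch_subset "ckpts/4b_stock/*"
```

Calling `fetch_subset "data/*"` without `DRY_RUN=1` runs the same command as the first line of the README's Download block. The dry-run mode is useful for checking a pattern before pulling a 9.5G checkpoint.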