spikefly committed on
Commit 3a4a841 · verified · 1 Parent(s): 6a6aed4

Release: SemanticVLA-LIBERO checkpoint + config + dataset_statistics + model card

Files changed (4)
  1. README.md +35 -37
  2. config.yaml +131 -0
  3. dataset_statistics.json +133 -0
  4. final_model/pytorch_model.pt +3 -0
README.md CHANGED
@@ -13,9 +13,7 @@ tags:
 
  # SemanticVLA · LIBERO
 
- > 🚧 **Placeholder.** The URL is stable; checkpoints will be uploaded incrementally per the [release roadmap](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/docs/ROADMAP.md).
-
- [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) finetuned on the [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO) benchmark.
+ [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) finetuned on the [LIBERO](https://github.com/Lifelong-Robot-Learning/LIBERO) benchmark. The unified OXE LAM is used as the latent-action tokenizer, and the trace + latent-action auxiliary heads are supervised in the VLM's language stream.
 
  ## Headline result
 
@@ -23,58 +21,58 @@ tags:
  |---|---:|
  | `libero_spatial` | 0.988 |
  | `libero_object` | 0.996 |
- | `libero_goal` | 0.986 |
- | `libero_10` | 0.966 |
- | **4-suite mean** | **0.9840** |
-
- Best configuration: `TL_saembs_lw010` (trace + latent semantic output, `sa_embs` injection, LM loss weight 0.10, step 30000).
+ | `libero_goal` | 0.974 |
+ | `libero_10` | 0.970 |
+ | **4-suite mean** | **0.982** |
 
  ## Architecture
 
- - **Backbone**: Qwen2.5-VL-3B (with trace + latent-action semantic heads)
- - **Action head**: GR00T-style flow-matching expert (continuous action chunks)
- - **LAM tokenizer**: [`SemanticVLA-LAM` → `libero/v5`](https://huggingface.co/spikefly/SemanticVLA-LAM)
- - **Action horizon**: 16
+ | Component | Choice |
+ |---|---|
+ | VLM backbone | Qwen3-VL-4B-Instruct |
+ | Action head | DiT-B (flow matching) |
+ | LAM tokenizer | [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) (unified OXE LAM) |
+ | Semantic supervision | Trace + latent action tokens predicted in the VLM's language stream; action decoder unmodified |
+ | Latent vocabulary size | 32 |
+ | Latent tokens per sample | 4 |
+ | Action horizon | 8 |
 
- ## Planned layout
+ ## Files
 
  ```
  SemanticVLA-LIBERO/
- ├── tl-saembs-lw010-best/
-     ├── pytorch_model.pt
-     ├── config.yaml
-     └── model_card.md
- └── README.md
+ ├── README.md
+ ├── config.yaml # loadable model config
+ ├── dataset_statistics.json # action normalization stats
+ └── final_model/
+     └── pytorch_model.pt # policy state_dict
  ```
 
- Additional ablation variants (`L_none_lw010`, `TL_none_lw010`, `TL_saembs_lw005`, etc.) may be uploaded as additional subdirectories upon release.
+ ## How to load
+
+ ```python
+ from semanticvla.model.framework.base_framework import baseframework
+
+ policy = baseframework.from_pretrained("final_model/pytorch_model.pt")
+ policy.eval()
+ ```
+
+ `baseframework.from_pretrained()` walks two directory levels up from the checkpoint file to locate `config.yaml` and `dataset_statistics.json`. The released layout follows this convention.
+
+ To run a full LIBERO evaluation, see [`examples/LIBERO/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/LIBERO) in the code repo.
 
  ## Sibling SemanticVLA checkpoint repos
 
  | Repo | Purpose |
  |---|---|
- | 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | LAM tokenizers used by this VLA |
- | 🤗 [`SemanticVLA-Bridge`](https://huggingface.co/spikefly/SemanticVLA-Bridge) | Bridge-finetuned VLA for SimplerEnv WidowX |
+ | 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | Unified OXE LAM consumed by this policy |
+ | 🤗 [`SemanticVLA-SimplerEnv`](https://huggingface.co/spikefly/SemanticVLA-SimplerEnv) | SimplerEnv WidowX policy |
 
  ## Related resources
 
  - **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial
- - **Datasets**: https://hf.co/collections/spikefly/semanticvla-datasets
- - **Collection · Model Zoo**: https://hf.co/collections/spikefly/semanticvla-model-zoo
-
- ## How to load (placeholder API)
-
- ```python
- from huggingface_hub import hf_hub_download
- import torch
-
- ckpt = hf_hub_download(
-     repo_id="spikefly/SemanticVLA-LIBERO",
-     filename="tl-saembs-lw010-best/pytorch_model.pt",
- )
- state = torch.load(ckpt, map_location="cpu")
- # loader will be released with the code repo
- ```
+ - **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets
+ - **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo
 
  ## Citation
 
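The loading convention stated in the new README ("two directory levels up from the checkpoint file") can be illustrated with `pathlib`. This is a minimal sketch, not the repository's loader: `resolve_sidecar_files` is a hypothetical helper, and it assumes "two levels up" means the parent of the checkpoint's own directory, which is consistent with the released layout.

```python
from pathlib import Path


def resolve_sidecar_files(checkpoint_path):
    """Hypothetical helper: locate config.yaml and dataset_statistics.json.

    Assumes both files live in the parent of the checkpoint's directory,
    e.g. <repo>/final_model/pytorch_model.pt -> <repo>/config.yaml.
    """
    # parents[0] is final_model/, parents[1] is the repository root.
    repo_root = Path(checkpoint_path).resolve().parents[1]
    return {
        "config": repo_root / "config.yaml",
        "dataset_statistics": repo_root / "dataset_statistics.json",
    }


if __name__ == "__main__":
    paths = resolve_sidecar_files("SemanticVLA-LIBERO/final_model/pytorch_model.pt")
    for name, path in paths.items():
        print(f"{name}: {path} (exists: {path.exists()})")
```

Under this assumption, downloading the whole repository (for example with `huggingface_hub.snapshot_download`) rather than the checkpoint file alone keeps the expected layout intact.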
config.yaml ADDED
@@ -0,0 +1,131 @@
+ # Loadable config for SemanticVLA-LIBERO.
+ #
+ # Load via:
+ # from semanticvla.model.framework.base_framework import baseframework
+ # policy = baseframework.from_pretrained("final_model/pytorch_model.pt")
+ #
+ # The loader walks two directory levels up from the checkpoint file to locate
+ # this `config.yaml` and the sibling `dataset_statistics.json`.
+
+ seed: 42
+
+ framework:
+ name: SemanticVLA
+ qwenvl:
+ base_vlm: Qwen/Qwen3-VL-4B-Instruct
+ attn_implementation: flash_attention_2
+ vl_hidden_dim: 2048
+ dino:
+ dino_backbone: dinov2_vits14
+ action_model:
+ action_model_type: DiT-B
+ action_hidden_dim: 1024
+ hidden_size: 1024
+ add_pos_embed: true
+ max_seq_len: 1024
+ action_dim: 7
+ state_dim: 7
+ future_action_window_size: 7
+ action_horizon: 8
+ past_action_window_size: 0
+ repeated_diffusion_steps: 8
+ noise_beta_alpha: 1.5
+ noise_beta_beta: 1.0
+ noise_s: 0.999
+ num_timestep_buckets: 1000
+ num_inference_timesteps: 4
+ num_target_vision_tokens: 32
+ diffusion_model_cfg:
+ cross_attention_dim: 2048
+ dropout: 0.2
+ final_dropout: true
+ interleave_self_attention: true
+ norm_type: ada_norm
+ num_layers: 16
+ output_dim: 1024
+ positional_embeddings: null
+ progress_dim: 0
+ trace_dim: 0
+ trace:
+ injection_mode: none
+ hidden_dim: 256
+ num_layers: 3
+ num_heads: 8
+ window_size: 12
+ num_tokens: 4
+ dropout: 0.1
+ num_anchor_points: 4
+ lm_aux_loss: false
+ aux_loss_weight: 0.1
+ coord_range: 1000
+ prompt_style: plain
+ semantic_output:
+ enabled: true
+ mode: trace_latent
+ order: trace_latent
+ lm_loss_weight: 0.1
+ latent_vocab_size: 32
+ latent_num_tokens: 4
+ latent_token_prefix: LAM
+ prompt_style: plain
+ trace_anchor_points: 4
+ parse_trace_for_decoder: false
+ trainable_token_rows: false
+ reduce_in_full_precision: true
+
+ datasets:
+ vla_data:
+ dataset_py: lerobot_datasets
+ data_root_dir: /path/to/libero_lerobot
+ data_mix: libero_all
+ action_type: delta_qpos
+ CoT_prompt: Your task is {instruction}. To identify the key objects for your task. Locate their bounding boxes in [x1,y1,x2,y2] format.
+ CoT_answer: bbox
+ default_image_resolution: [3, 224, 224]
+ per_device_batch_size: 16
+ load_all_data_for_training: true
+ obs: [image_0]
+ trace:
+ enabled: true
+ root: /path/to/trace_annotations/libero
+ window_size: 12
+ normalize: true
+ num_anchor_points: 4
+ latent_action_labels:
+ enabled: true
+ root: /path/to/lam_labels
+ variant: semanticvla_lam
+ strict: true
+ missing_policy: error
+ out_key: latent_action_idx
+
+ trainer:
+ epochs: 100
+ max_train_steps: 30000
+ num_warmup_steps: 5000
+ save_interval: 5000
+ eval_interval: 2000
+ learning_rate:
+ base: 4.0e-05
+ qwen_vl_interface: 1.0e-05
+ action_model: 1.0e-04
+ lr_scheduler_type: cosine_with_min_lr
+ scheduler_specific_kwargs:
+ min_lr: 1.0e-06
+ freeze_modules: ''
+ loss_scale:
+ vla: 1.0
+ vlm: 0.1
+ max_grad_norm: 1.0
+ warmup_ratio: 0.1
+ weight_decay: 0.0
+ logging_frequency: 100
+ gradient_clipping: 1.0
+ gradient_accumulation_steps: 1
+ optimizer:
+ name: AdamW
+ betas: [0.9, 0.95]
+ eps: 1.0e-08
+ weight_decay: 1.0e-08
+ enable_gradient_checkpointing: true
+ enable_mixed_precision_training: true
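The trainer settings in this config (`lr_scheduler_type: cosine_with_min_lr`, `num_warmup_steps: 5000`, `max_train_steps: 30000`, `min_lr: 1.0e-06`) suggest a warmup followed by cosine decay to a floor. The sketch below is an illustrative re-implementation in plain Python, not the training code. It assumes linear warmup over `num_warmup_steps`, a single half-cosine over the remaining steps, and that `min_lr` is an absolute floor shared by all parameter groups; the function name `lr_at_step` is hypothetical. The config also sets `warmup_ratio: 0.1`, and which of the two warmup settings the trainer honors is not stated in this commit.

```python
import math


def lr_at_step(step, base_lr, min_lr=1.0e-6, warmup_steps=5000, total_steps=30000):
    """Hypothetical schedule: linear warmup, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    # Per-group base rates copied from trainer.learning_rate in the config.
    groups = {"base": 4.0e-5, "qwen_vl_interface": 1.0e-5, "action_model": 1.0e-4}
    for step in (0, 2500, 5000, 17500, 30000):
        rates = {name: lr_at_step(step, lr) for name, lr in groups.items()}
        print(step, {name: f"{rate:.2e}" for name, rate in rates.items()})
```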
dataset_statistics.json ADDED
@@ -0,0 +1,133 @@
+ {
+   "franka": {
+     "action": {
+       "mean": [
+         0.07237596483901143,
+         0.08987006871029735,
+         -0.10144743137061596,
+         -0.00045383188989944756,
+         0.006273590726777911,
+         -0.003878799732774496,
+         0.524486355483532
+       ],
+       "std": [
+         0.3498823308902479,
+         0.37794140366375184,
+         0.460084266976933,
+         0.0403885784928603,
+         0.06616144248501059,
+         0.07763074391911857,
+         0.4994683356809767
+       ],
+       "max": [
+         0.9375,
+         0.9375,
+         0.9375,
+         0.3557142913341522,
+         0.375,
+         0.375,
+         1.0
+       ],
+       "min": [
+         -0.9375,
+         -0.9375,
+         -0.9375,
+         -0.2582142949104309,
+         -0.375,
+         -0.3675000071525574,
+         0.0
+       ],
+       "q01": [
+         -0.8785714507102966,
+         -0.8758928775787354,
+         -0.9375,
+         -0.1510714292526245,
+         -0.20678570866584778,
+         -0.2742857038974762,
+         0.0
+       ],
+       "q99": [
+         0.9375,
+         0.9107142686843872,
+         0.9375,
+         0.20357142388820648,
+         0.26357144117355347,
+         0.375,
+         1.0
+       ],
+       "mask": [
+         true,
+         true,
+         true,
+         true,
+         true,
+         true,
+         false
+       ]
+     },
+     "state": {
+       "mean": [
+         -0.04889854742214084,
+         0.03689368185587227,
+         0.7890402488410473,
+         2.9771945476531982,
+         -0.1417286954820156,
+         -0.11769362539052963,
+         0.026436020154505968,
+         -0.02665513101965189
+       ],
+       "std": [
+         0.10639013941746686,
+         0.15115733130675715,
+         0.38406895599530033,
+         0.3530238395244304,
+         0.8227341427331599,
+         0.32357567121520087,
+         0.014583991652936385,
+         0.014467005007200339
+       ],
+       "max": [
+         0.21031762659549713,
+         0.39128610491752625,
+         1.3660105466842651,
+         3.6714255809783936,
+         3.560650587081909,
+         1.386339545249939,
+         0.04233968257904053,
+         0.0013633022317662835
+       ],
+       "min": [
+         -0.4828203022480011,
+         -0.3255046010017395,
+         0.008128180168569088,
+         0.35277295112609863,
+         -3.641430377960205,
+         -1.842738389968872,
+         -0.0013586411951109767,
+         -0.042040832340717316
+       ],
+       "q01": [
+         -0.42401049643754957,
+         -0.2838300323486328,
+         0.009925739830359817,
+         1.3085840785503386,
+         -2.886677579879761,
+         -1.1599004411697387,
+         0.001503719249740243,
+         -0.040336399003863335
+       ],
+       "q99": [
+         0.1530261474847791,
+         0.3629165390133857,
+         1.2910678112506866,
+         3.303542451858519,
+         2.7496529006957933,
+         0.6893712210655194,
+         0.040610933862626555,
+         -0.0015016929572448147
+       ]
+     },
+     "num_transitions": 272104,
+     "num_trajectories": 1693
+   }
+ }
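One common way to use statistics like these is quantile normalization: each masked action dimension is mapped from [q01, q99] to [-1, 1], and unmasked dimensions (here the last one, whose `mask` entry is `false`) pass through unchanged. The sketch below assumes that convention. Whether SemanticVLA applies exactly this formula is not confirmed by this commit, and the names `normalize_action` and `unnormalize_action` are hypothetical.

```python
# q01, q99, and mask copied from the "franka" -> "action" entry above.
Q01 = [-0.8785714507102966, -0.8758928775787354, -0.9375,
       -0.1510714292526245, -0.20678570866584778, -0.2742857038974762, 0.0]
Q99 = [0.9375, 0.9107142686843872, 0.9375,
       0.20357142388820648, 0.26357144117355347, 0.375, 1.0]
MASK = [True, True, True, True, True, True, False]


def normalize_action(action):
    """Map masked dims from [q01, q99] to [-1, 1], clipping outliers."""
    out = []
    for x, lo, hi, use in zip(action, Q01, Q99, MASK):
        if not use:
            out.append(x)  # unmasked dims pass through unchanged
            continue
        scaled = 2.0 * (x - lo) / (hi - lo) - 1.0
        out.append(max(-1.0, min(1.0, scaled)))
    return out


def unnormalize_action(normalized):
    """Inverse of normalize_action for values inside [-1, 1]."""
    out = []
    for y, lo, hi, use in zip(normalized, Q01, Q99, MASK):
        out.append(0.5 * (y + 1.0) * (hi - lo) + lo if use else y)
    return out


if __name__ == "__main__":
    raw = [0.1, -0.2, 0.3, 0.0, 0.05, -0.1, 1.0]
    norm = normalize_action(raw)
    print(norm)
    print(unnormalize_action(norm))
```

Clipping makes the mapping lossy for values outside [q01, q99], so the round trip is exact only for in-range actions.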
final_model/pytorch_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ee9ab7537a8b25a628ed506ea8bf347b5f24131ce82f2f001f57c2f234d65446
+ size 9974427154
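The three lines added for `final_model/pytorch_model.pt` are a Git LFS pointer rather than the weights themselves: they record the SHA-256 digest and byte size of the real file. A downloaded checkpoint can be checked against those values with the standard library. The sketch below is one way to do that; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local path in the final comment is an assumption.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        # Stream in chunks so a multi-gigabyte checkpoint is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ee9ab7537a8b25a628ed506ea8bf347b5f24131ce82f2f001f57c2f234d65446
size 9974427154
"""

if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(pointer["algo"], pointer["size"])
    # After downloading, assuming the file sits at this relative path:
    # print(matches_pointer("final_model/pytorch_model.pt", pointer))
```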