spikefly committed
Commit 152ab68 · verified · 1 Parent(s): 683f092

Add files using upload-large-folder tool

Files changed (4)
  1. README.md +49 -43
  2. config.yaml +130 -0
  3. dataset_statistics.json +71 -0
  4. final_model/pytorch_model.pt +3 -0
README.md CHANGED
@@ -6,74 +6,80 @@ tags:
  - robotics
  - vision-language-action
  - vla
- - bridge
- - widowx
  - simpler-env
+ - widowx
+ - bridge
  - manipulation
  - qwen-vl
  ---
 
- # SemanticVLA · Bridge (SimplerEnv WidowX)
+ # SemanticVLA · SimplerEnv (WidowX)
 
- > 🚧 **Placeholder.** The URL is stable; checkpoints will be uploaded incrementally per the [release roadmap](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/docs/ROADMAP.md).
+ [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) policy trained on BridgeData V2 (Open X-Embodiment `bridge_orig`) for **100K steps**, intended for [SimplerEnv](https://github.com/simpler-env/SimplerEnv) WidowX evaluation. The unified OXE LAM is used as the latent-action tokenizer, and the trace + latent-action auxiliary heads are supervised in the VLM's language stream.
 
- [SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) finetuned on [BridgeData V2](https://rail-berkeley.github.io/bridgedata/), targeting [SimplerEnv](https://github.com/simpler-env/SimplerEnv) WidowX evaluation (`widowx_spoon_on_towel` / `widowx_carrot_on_plate` / `widowx_stack_cube` / `widowx_put_eggplant_in_basket`).
+ ## Headline result (SimplerEnv WidowX)
 
- ## Configuration
+ | Task | Success rate |
+ |---|---:|
+ | Put Eggplant in Basket | 0.958 |
+ | Spoon on Towel | 1.000 |
+ | Carrot on Plate | 0.792 |
+ | Stack Cube | 0.458 |
+ | **Mean** | **0.802** |
 
- | Field | Value |
+ ## Architecture
+
+ | Component | Choice |
  |---|---|
- | Backbone | Qwen3VL-4B (Qwen3VL-GR00T-Bridge-RT-1 init) |
- | Action head | GR00T-style flow-matching expert |
- | Semantic output | `trace_latent` (trace + LAM latent-action token), `none` injection |
- | LM loss weight | 0.10 |
+ | VLM backbone | Qwen3-VL-4B-Instruct |
+ | Action head | DiT-B (flow matching) |
+ | LAM tokenizer | [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) (unified OXE LAM) |
+ | Semantic supervision | Trace + latent action tokens predicted in the VLM's language stream; action decoder unmodified |
+ | Latent vocabulary size | 32 |
+ | Latent tokens per sample | 4 |
  | Action horizon | 16 |
- | LAM tokenizer | [`SemanticVLA-LAM` → `oxe-bridge-only/v4-step16k`](https://huggingface.co/spikefly/SemanticVLA-LAM) |
- | Training data | `bridge_orig_1.0.0_lerobot` (with dense trace labels via OXE NPY index) |
- | Target | 100,000 steps |
 
- ## Headline result
+ ## Training data
 
- SimplerEnv WidowX numbers will be filled in here once the 100k-step training and the 24-episodes-per-task evaluation complete. Training is in flight on Isambard; see the [code repo](https://github.com/Fei-Ni/SemanticVLA_Offcial) for the latest training metrics.
+ This checkpoint is trained on **BridgeData V2** (Open X-Embodiment `bridge_orig`) for 100K steps. It is intended specifically for SimplerEnv WidowX evaluation and is **not** meant as a general-purpose policy for unrelated robot embodiments.
 
- ## Planned layout
+ ## Files
 
  ```
- SemanticVLA-Bridge/
- ├── tl-none-lw010-step100k/
- │   ├── pytorch_model.pt
- │   ├── config.yaml
- │   └── model_card.md
- └── README.md
+ SemanticVLA-SimplerEnv/
+ ├── README.md
+ ├── config.yaml               # loadable model config
+ ├── dataset_statistics.json   # action normalization stats
+ └── final_model/
+     └── pytorch_model.pt      # policy state_dict
+ ```
+
+ ## How to load
+
+ ```python
+ from semanticvla.model.framework.base_framework import baseframework
+
+ policy = baseframework.from_pretrained("pytorch_model.pt")
+ policy.eval()
  ```
 
+ `baseframework.from_pretrained()` walks two directory levels up from the checkpoint file to locate `config.yaml` and `dataset_statistics.json`. The released layout follows this convention.
+
+ To run the SimplerEnv WidowX suite, see [`examples/SimplerEnv/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/SimplerEnv) in the code repo.
+
  ## Sibling SemanticVLA checkpoint repos
 
  | Repo | Purpose |
  |---|---|
- | 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | LAM tokenizers used by this VLA |
- | 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO-finetuned VLA |
+ | 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | Unified OXE LAM consumed by this policy |
+ | 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO policy |
 
  ## Related resources
 
  - **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial
- - **Datasets · Bridge subset**: https://huggingface.co/datasets/spikefly/SemanticVLA-Bridge-LeRobot
- - **Datasets · all**: https://hf.co/collections/spikefly/semanticvla-datasets
- - **Collection · Model Zoo**: https://hf.co/collections/spikefly/semanticvla-model-zoo
-
- ## How to load (placeholder API)
-
- ```python
- from huggingface_hub import hf_hub_download
- import torch
-
- ckpt = hf_hub_download(
-     repo_id="spikefly/SemanticVLA-Bridge",
-     filename="tl-none-lw010-step100k/pytorch_model.pt",
- )
- state = torch.load(ckpt, map_location="cpu")
- # loader will be released with the code repo
- ```
+ - **Dataset (BridgeData V2 in LeRobot v3 with dense traces)**: 🤗 [`SemanticVLA-TraceX-240K-Bridge`](https://huggingface.co/datasets/spikefly/SemanticVLA-TraceX-240K-Bridge)
+ - **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets
+ - **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo
 
  ## Citation
 
@@ -95,4 +101,4 @@ state = torch.load(ckpt, map_location="cpu")
 
  ## License
 
- Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE).
+ Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE), subject to the upstream BridgeData V2 license.
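The new README states that the loader walks two directory levels up from the checkpoint file to find `config.yaml` and `dataset_statistics.json`. The sketch below only illustrates that path arithmetic with `pathlib`. It is an assumption about how such a lookup could work, not the actual `baseframework.from_pretrained()` implementation, and `resolve_sidecar_files` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_sidecar_files(checkpoint_path):
    """Return (config_path, stats_path) two directory levels above the checkpoint file.

    Hypothetical helper: mirrors the convention described in the README,
    not the real SemanticVLA loader.
    """
    ckpt = Path(checkpoint_path)
    # checkpoint file -> final_model/ -> repository root
    root = ckpt.parent.parent
    return root / "config.yaml", root / "dataset_statistics.json"


if __name__ == "__main__":
    cfg, stats = resolve_sidecar_files(
        "SemanticVLA-SimplerEnv/final_model/pytorch_model.pt"
    )
    print(cfg)
    print(stats)
```

Under this reading, the checkpoint path passed to the loader should include the `final_model/` directory, so that two levels up lands on the repository root that holds both side files.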
config.yaml ADDED
@@ -0,0 +1,130 @@
+ # Loadable config for SemanticVLA-SimplerEnv (SimplerEnv WidowX policy).
+ #
+ # Load via:
+ #   from semanticvla.model.framework.base_framework import baseframework
+ #   policy = baseframework.from_pretrained("pytorch_model.pt")
+ #
+ # The loader walks two directory levels up from the checkpoint file to locate
+ # this `config.yaml` and the sibling `dataset_statistics.json`.
+
+ seed: 42
+
+ framework:
+   name: SemanticVLA
+   qwenvl:
+     base_vlm: Qwen/Qwen3-VL-4B-Instruct
+     attn_implementation: flash_attention_2
+     vl_hidden_dim: 2048
+   dino:
+     dino_backbone: dinov2_vits14
+   action_model:
+     action_model_type: DiT-B
+     action_hidden_dim: 1024
+     hidden_size: 1024
+     add_pos_embed: true
+     max_seq_len: 1024
+     action_dim: 7
+     state_dim: 7
+     future_action_window_size: 15
+     action_horizon: 16
+     past_action_window_size: 0
+     repeated_diffusion_steps: 8
+     noise_beta_alpha: 1.5
+     noise_beta_beta: 1.0
+     noise_s: 0.999
+     num_timestep_buckets: 1000
+     num_inference_timesteps: 4
+     num_target_vision_tokens: 32
+     diffusion_model_cfg:
+       cross_attention_dim: 2048
+       dropout: 0.2
+       final_dropout: true
+       interleave_self_attention: true
+       norm_type: ada_norm
+       num_layers: 16
+       output_dim: 1024
+       positional_embeddings: null
+       progress_dim: 0
+       trace_dim: 0
+   trace:
+     injection_mode: none
+     hidden_dim: 256
+     num_layers: 3
+     num_heads: 8
+     window_size: 12
+     num_tokens: 4
+     dropout: 0.1
+     num_anchor_points: 4
+     lm_aux_loss: false
+     aux_loss_weight: 0.1
+     coord_range: 1000
+     prompt_style: plain
+   semantic_output:
+     enabled: true
+     mode: trace_latent
+     order: trace_latent
+     lm_loss_weight: 0.1
+     latent_vocab_size: 32
+     latent_num_tokens: 4
+     latent_token_prefix: LAM
+     prompt_style: plain
+     trace_anchor_points: 4
+     parse_trace_for_decoder: false
+     trainable_token_rows: false
+   reduce_in_full_precision: true
+
+ datasets:
+   vla_data:
+     dataset_py: lerobot_datasets
+     data_root_dir: /path/to/bridge_lerobot
+     data_mix: bridge
+     statistics_key: oxe_bridge
+     action_horizon: 16
+     image_size: [224, 224]
+     default_image_resolution: [3, 224, 224]
+     per_device_batch_size: 16
+     num_workers: 4
+     trace:
+       enabled: true
+       root: /path/to/trace_annotations/bridge
+       window_size: 12
+       normalize: true
+       num_anchor_points: 4
+     latent_action_labels:
+       enabled: true
+       root: /path/to/lam_labels
+       variant: semanticvla_lam
+       strict: true
+       missing_policy: clip
+       out_key: latent_action_idx
+
+ trainer:
+   epochs: 100
+   max_train_steps: 100000
+   num_warmup_steps: 5000
+   save_interval: 5000
+   eval_interval: 2000
+   learning_rate:
+     base: 4.0e-05
+     qwen_vl_interface: 1.0e-05
+     action_model: 1.0e-04
+   lr_scheduler_type: cosine_with_min_lr
+   scheduler_specific_kwargs:
+     min_lr: 5.0e-07
+   freeze_modules: ''
+   loss_scale:
+     vla: 1.0
+     vlm: 0.1
+   max_grad_norm: 1.0
+   warmup_ratio: 0.1
+   weight_decay: 0.0
+   logging_frequency: 100
+   gradient_clipping: 1.0
+   gradient_accumulation_steps: 1
+   optimizer:
+     name: AdamW
+     betas: [0.9, 0.95]
+     eps: 1.0e-08
+     weight_decay: 1.0e-08
+   enable_gradient_checkpointing: true
+   enable_mixed_precision_training: true
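The `trainer` settings combine `lr_scheduler_type: cosine_with_min_lr` with `num_warmup_steps: 5000`, `max_train_steps: 100000`, and `min_lr: 5.0e-07`. The following is a rough sketch of the learning-rate curve these values would produce, assuming linear warmup followed by a single cosine decay down to `min_lr`, and assuming `num_warmup_steps` takes precedence over `warmup_ratio`. The scheduler code itself is not part of this commit, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, peak_lr, warmup=5000, total=100_000, min_lr=5.0e-07):
    """Learning rate at `step` under linear warmup + cosine decay to `min_lr`.

    Assumed schedule shape; defaults are copied from config.yaml.
    """
    if step < warmup:
        # Linear ramp from 0 to peak_lr over the warmup steps.
        return peak_lr * step / max(1, warmup)
    # Fraction of the post-warmup phase that has elapsed, capped at 1.
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    # Per-module peak rates from the `learning_rate` block.
    peaks = {"base": 4.0e-05, "qwen_vl_interface": 1.0e-05, "action_model": 1.0e-04}
    for name, peak in peaks.items():
        samples = [lr_at(s, peak) for s in (0, 5000, 52500, 100000)]
        print(name, samples)
```

Each module group reaches its own peak at step 5000 and decays toward the shared floor by step 100000 under these assumptions.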
dataset_statistics.json ADDED
@@ -0,0 +1,71 @@
+ {
+   "oxe_bridge": {
+     "action": {
+       "mean": [
+         0.00022731871285941452,
+         0.00013112221495248377,
+         -0.00012641931243706495,
+         -0.00014410706353373826,
+         -0.00039030605694279075,
+         0.0002406332059763372,
+         0.5765891671180725
+       ],
+       "std": [
+         0.009770569391548634,
+         0.013695062138140202,
+         0.012675146572291851,
+         0.028455283492803574,
+         0.03052123636007309,
+         0.07739030569791794,
+         0.4966523349285126
+       ],
+       "max": [
+         0.41691166162490845,
+         0.25864794850349426,
+         0.21218234300613403,
+         3.122201919555664,
+         1.8618112802505493,
+         6.272472858428955,
+         1.0
+       ],
+       "min": [
+         -0.4007510244846344,
+         -0.13874775171279907,
+         -0.22553899884223938,
+         -3.2010786533355713,
+         -1.8618112802505493,
+         -6.279075622558594,
+         0.0
+       ],
+       "q01": [
+         -0.02875255048274994,
+         -0.04170213546603918,
+         -0.026096721179783344,
+         -0.08052874729037285,
+         -0.09249906800687313,
+         -0.20738555490970612,
+         0.0
+       ],
+       "q99": [
+         0.028306663036346436,
+         0.04089853074401617,
+         0.0401805154979229,
+         0.08173403143882751,
+         0.07760760560631752,
+         0.2038465365767479,
+         1.0
+       ],
+       "mask": [
+         true,
+         true,
+         true,
+         true,
+         true,
+         true,
+         false
+       ]
+     },
+     "num_trajectories": 800,
+     "num_transitions": 27903
+   }
+ }
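The README describes `dataset_statistics.json` as action normalization stats, and the `mask` entry is `false` only for the seventh dimension. A minimal sketch of how such statistics are often applied is shown below, assuming quantile scaling of the masked dimensions to [-1, 1] with clipping, and assuming the unmasked last dimension (presumably the gripper command) passes through unchanged. Whether the SemanticVLA loader uses `q01`/`q99` or `mean`/`std` is not stated in this commit, and both function names are hypothetical.

```python
# q01, q99, and mask copied from the "oxe_bridge" -> "action" entry above.
ACTION_STATS = {
    "q01": [-0.02875255048274994, -0.04170213546603918, -0.026096721179783344,
            -0.08052874729037285, -0.09249906800687313, -0.20738555490970612, 0.0],
    "q99": [0.028306663036346436, 0.04089853074401617, 0.0401805154979229,
            0.08173403143882751, 0.07760760560631752, 0.2038465365767479, 1.0],
    "mask": [True, True, True, True, True, True, False],
}


def normalize_action(action, stats=ACTION_STATS):
    """Scale masked dimensions to [-1, 1] using q01/q99; leave the rest as-is."""
    out = []
    for a, lo, hi, use in zip(action, stats["q01"], stats["q99"], stats["mask"]):
        if not use:
            out.append(a)
            continue
        scaled = 2.0 * (a - lo) / (hi - lo) - 1.0
        out.append(max(-1.0, min(1.0, scaled)))  # clip outliers beyond the quantiles
    return out


def unnormalize_action(norm, stats=ACTION_STATS):
    """Invert normalize_action for values that were not clipped."""
    out = []
    for n, lo, hi, use in zip(norm, stats["q01"], stats["q99"], stats["mask"]):
        out.append(0.5 * (n + 1.0) * (hi - lo) + lo if use else n)
    return out


if __name__ == "__main__":
    raw = [0.01, -0.02, 0.0, 0.05, -0.03, 0.1, 1.0]
    norm = normalize_action(raw)
    print(norm)
    print(unnormalize_action(norm))
```

Quantile bounds are a common choice here because the `min`/`max` rows contain large outliers (for example about ±6.28 on the sixth dimension) that would otherwise compress the typical range.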
final_model/pytorch_model.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:60c52447f026e9739267d9b131acb81639a4ca8b56912797387d53ef0c7c1aff
+ size 9974438810
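The three added lines are a Git LFS pointer, not the weights themselves: the pointer records the SHA-256 digest and byte size of the real `pytorch_model.pt`. A small sketch of parsing the pointer and checking a downloaded file against it follows. `parse_lfs_pointer` and `sha256_of` are hypothetical helpers written for illustration, and the local file path is left to the caller.

```python
import hashlib

# Pointer text copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:60c52447f026e9739267d9b131acb81639a4ca8b56912797387d53ef0c7c1aff
size 9974438810
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a multi-gigabyte checkpoint fits in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


if __name__ == "__main__":
    info = parse_lfs_pointer(POINTER)
    print(info["algo"], info["digest"])
    print(f"{info['size'] / 1e9:.2f} GB")
```

After downloading the checkpoint, comparing `sha256_of(local_path)` with `info["digest"]` and the file size with `info["size"]` confirms that the transfer was complete.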