Add files using upload-large-folder tool

Files changed (4)

README.md +49 -43
config.yaml +130 -0
dataset_statistics.json +71 -0
final_model/pytorch_model.pt +3 -0

README.md CHANGED

@@ -6,74 +6,80 @@ tags:
   - robotics
   - vision-language-action
   - vla
-  - bridge
-  - widowx
   - simpler-env
   - manipulation
   - qwen-vl
 ---
-# SemanticVLA · Bridge (SimplerEnv WidowX)
-> 🚧 **Placeholder.** The URL is stable; checkpoints will be uploaded incrementally per the [release roadmap](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/docs/ROADMAP.md).
-[SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) finetuned on [BridgeData V2](https://rail-berkeley.github.io/bridgedata/), targeting [SimplerEnv](https://github.com/simpler-env/SimplerEnv) WidowX evaluation (`widowx_spoon_on_towel` / `widowx_carrot_on_plate` / `widowx_stack_cube` / `widowx_put_eggplant_in_basket`).
-## Configuration
-| Field | Value |
 |---|---|
-| Backbone | Qwen3VL-4B (Qwen3VL-GR00T-Bridge-RT-1 init) |
-| Action head | GR00T-style flow-matching expert |
-| Semantic output | `trace_latent` (trace + LAM latent-action token), `none` injection |
-| LM loss weight | 0.10 |
 | Action horizon | 16 |
-| LAM tokenizer | [`SemanticVLA-LAM` → `oxe-bridge-only/v4-step16k`](https://huggingface.co/spikefly/SemanticVLA-LAM) |
-| Training data | `bridge_orig_1.0.0_lerobot` (with dense trace labels via OXE NPY index) |
-| Target | 100,000 steps |
-## Headline result
-SimplerEnv WidowX numbers will be filled in here once the 100k-step training and the 24-episodes-per-task evaluation complete. Training is in flight on Isambard; see the [code repo](https://github.com/Fei-Ni/SemanticVLA_Offcial) for the latest training metrics.
-## Planned layout
 ```
-SemanticVLA-Bridge/
-├── tl-none-lw010-step100k/
-│   ├── pytorch_model.pt
-│   ├── config.yaml
-│   └── model_card.md
-└── README.md
 ```
 ## Sibling SemanticVLA checkpoint repos
 | Repo | Purpose |
 |---|---|
-| 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | LAM tokenizers used by this VLA |
-| 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO-finetuned VLA |
 ## Related resources
 - **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial
-- **Datasets · Bridge subset**: https://huggingface.co/datasets/spikefly/SemanticVLA-Bridge-LeRobot
-- **Datasets · all**: https://hf.co/collections/spikefly/semanticvla-datasets
-- **Collection · Model Zoo**: https://hf.co/collections/spikefly/semanticvla-model-zoo
-## How to load (placeholder API)
-```python
-from huggingface_hub import hf_hub_download
-import torch
-ckpt = hf_hub_download(
-    repo_id="spikefly/SemanticVLA-Bridge",
-    filename="tl-none-lw010-step100k/pytorch_model.pt",
-)
-state = torch.load(ckpt, map_location="cpu")
-# loader will be released with the code repo
-```
 ## Citation
@@ -95,4 +101,4 @@ state = torch.load(ckpt, map_location="cpu")
 ## License
-Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE).

   - robotics
   - vision-language-action
   - vla
   - simpler-env
+  - widowx
+  - bridge
   - manipulation
   - qwen-vl
 ---
+# SemanticVLA · SimplerEnv (WidowX)
+[SemanticVLA](https://github.com/Fei-Ni/SemanticVLA_Offcial) policy trained on BridgeData V2 (Open X-Embodiment `bridge_orig`) for **100K steps**, intended for [SimplerEnv](https://github.com/simpler-env/SimplerEnv) WidowX evaluation. The unified OXE LAM is used as the latent-action tokenizer, and the trace + latent-action auxiliary heads are supervised in the VLM's language stream.
+## Headline result (SimplerEnv WidowX)
+| Task | Success rate |
+|---|---:|
+| Put Eggplant in Basket | 0.958 |
+| Spoon on Towel         | 1.000 |
+| Carrot on Plate        | 0.792 |
+| Stack Cube             | 0.458 |
+| **Mean**               | **0.802** |
+## Architecture
+| Component | Choice |
 |---|---|
+| VLM backbone | Qwen3-VL-4B-Instruct |
+| Action head | DiT-B (flow matching) |
+| LAM tokenizer | [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) (unified OXE LAM) |
+| Semantic supervision | Trace + latent action tokens predicted in the VLM's language stream; action decoder unmodified |
+| Latent vocabulary size | 32 |
+| Latent tokens per sample | 4 |
 | Action horizon | 16 |
+## Training data
+This checkpoint is trained on **BridgeData V2** (Open X-Embodiment `bridge_orig`) for 100K steps. It is intended specifically for SimplerEnv WidowX evaluation and is **not** meant as a general-purpose policy for unrelated robot embodiments.
+## Files
 ```
+SemanticVLA-SimplerEnv/
+├── README.md
+├── config.yaml              # loadable model config
+├── dataset_statistics.json  # action normalization stats
+└── final_model/
+    └── pytorch_model.pt     # policy state_dict
+```
+## How to load
+```python
+from semanticvla.model.framework.base_framework import baseframework
+policy = baseframework.from_pretrained("SemanticVLA-SimplerEnv/final_model/pytorch_model.pt")  # path to the downloaded checkpoint
+policy.eval()
 ```
+`baseframework.from_pretrained()` walks two directory levels up from the checkpoint file to locate `config.yaml` and `dataset_statistics.json`. The released layout follows this convention.
+To run the SimplerEnv WidowX suite, see [`examples/SimplerEnv/`](https://github.com/Fei-Ni/SemanticVLA_Offcial/tree/main/examples/SimplerEnv) in the code repo.
 ## Sibling SemanticVLA checkpoint repos
 | Repo | Purpose |
 |---|---|
+| 🤗 [`SemanticVLA-LAM`](https://huggingface.co/spikefly/SemanticVLA-LAM) | Unified OXE LAM consumed by this policy |
+| 🤗 [`SemanticVLA-LIBERO`](https://huggingface.co/spikefly/SemanticVLA-LIBERO) | LIBERO policy |
 ## Related resources
 - **Code**: https://github.com/Fei-Ni/SemanticVLA_Offcial
+- **Dataset (BridgeData V2 in LeRobot v3 with dense traces)**: 🤗 [`SemanticVLA-TraceX-240K-Bridge`](https://huggingface.co/datasets/spikefly/SemanticVLA-TraceX-240K-Bridge)
+- **Datasets collection**: https://hf.co/collections/spikefly/semanticvla-datasets
+- **Model Zoo collection**: https://hf.co/collections/spikefly/semanticvla-model-zoo
 ## Citation
 ## License
+Released under the [MIT License](https://github.com/Fei-Ni/SemanticVLA_Offcial/blob/main/LICENSE), subject to the upstream BridgeData V2 license.
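
The layout convention described in the new README (checkpoint under `final_model/`, with `config.yaml` and `dataset_statistics.json` at the repo root) can be sanity-checked before calling the loader. The sketch below is not the `baseframework` implementation: `resolve_sidecar_files` is a hypothetical helper, and it assumes that "two directory levels up" means the parent of the directory that holds the checkpoint.

```python
from pathlib import Path


def resolve_sidecar_files(checkpoint_path):
    """Return the (config, stats) paths expected alongside a released checkpoint.

    Hypothetical helper. Assumes <repo>/final_model/pytorch_model.pt pairs with
    <repo>/config.yaml and <repo>/dataset_statistics.json.
    """
    ckpt = Path(checkpoint_path).resolve()
    # parents[0] is final_model/, parents[1] is the repo root.
    repo_root = ckpt.parents[1]
    config = repo_root / "config.yaml"
    stats = repo_root / "dataset_statistics.json"
    missing = [str(p) for p in (ckpt, config, stats) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"expected files not found: {missing}")
    return config, stats
```

Running this on a local copy of the repo before `from_pretrained()` turns a misplaced checkpoint into an explicit error that names the missing file.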

config.yaml ADDED

	@@ -0,0 +1,130 @@

+# Loadable config for SemanticVLA-SimplerEnv (SimplerEnv WidowX policy).
+#
+# Load via:
+#   from semanticvla.model.framework.base_framework import baseframework
+#   policy = baseframework.from_pretrained("SemanticVLA-SimplerEnv/final_model/pytorch_model.pt")
+#
+# The loader walks two directory levels up from the checkpoint file to locate
+# this `config.yaml` and the sibling `dataset_statistics.json`.
+seed: 42
+framework:
+  name: SemanticVLA
+  qwenvl:
+    base_vlm: Qwen/Qwen3-VL-4B-Instruct
+    attn_implementation: flash_attention_2
+    vl_hidden_dim: 2048
+  dino:
+    dino_backbone: dinov2_vits14
+  action_model:
+    action_model_type: DiT-B
+    action_hidden_dim: 1024
+    hidden_size: 1024
+    add_pos_embed: true
+    max_seq_len: 1024
+    action_dim: 7
+    state_dim: 7
+    future_action_window_size: 15
+    action_horizon: 16
+    past_action_window_size: 0
+    repeated_diffusion_steps: 8
+    noise_beta_alpha: 1.5
+    noise_beta_beta: 1.0
+    noise_s: 0.999
+    num_timestep_buckets: 1000
+    num_inference_timesteps: 4
+    num_target_vision_tokens: 32
+    diffusion_model_cfg:
+      cross_attention_dim: 2048
+      dropout: 0.2
+      final_dropout: true
+      interleave_self_attention: true
+      norm_type: ada_norm
+      num_layers: 16
+      output_dim: 1024
+      positional_embeddings: null
+      progress_dim: 0
+      trace_dim: 0
+    trace:
+      injection_mode: none
+      hidden_dim: 256
+      num_layers: 3
+      num_heads: 8
+      window_size: 12
+      num_tokens: 4
+      dropout: 0.1
+      num_anchor_points: 4
+      lm_aux_loss: false
+      aux_loss_weight: 0.1
+      coord_range: 1000
+      prompt_style: plain
+    semantic_output:
+      enabled: true
+      mode: trace_latent
+      order: trace_latent
+      lm_loss_weight: 0.1
+      latent_vocab_size: 32
+      latent_num_tokens: 4
+      latent_token_prefix: LAM
+      prompt_style: plain
+      trace_anchor_points: 4
+      parse_trace_for_decoder: false
+      trainable_token_rows: false
+  reduce_in_full_precision: true
+datasets:
+  vla_data:
+    dataset_py: lerobot_datasets
+    data_root_dir: /path/to/bridge_lerobot
+    data_mix: bridge
+    statistics_key: oxe_bridge
+    action_horizon: 16
+    image_size: [224, 224]
+    default_image_resolution: [3, 224, 224]
+    per_device_batch_size: 16
+    num_workers: 4
+    trace:
+      enabled: true
+      root: /path/to/trace_annotations/bridge
+      window_size: 12
+      normalize: true
+      num_anchor_points: 4
+    latent_action_labels:
+      enabled: true
+      root: /path/to/lam_labels
+      variant: semanticvla_lam
+      strict: true
+      missing_policy: clip
+      out_key: latent_action_idx
+trainer:
+  epochs: 100
+  max_train_steps: 100000
+  num_warmup_steps: 5000
+  save_interval: 5000
+  eval_interval: 2000
+  learning_rate:
+    base: 4.0e-05
+    qwen_vl_interface: 1.0e-05
+    action_model: 1.0e-04
+  lr_scheduler_type: cosine_with_min_lr
+  scheduler_specific_kwargs:
+    min_lr: 5.0e-07
+  freeze_modules: ''
+  loss_scale:
+    vla: 1.0
+    vlm: 0.1
+  max_grad_norm: 1.0
+  warmup_ratio: 0.1
+  weight_decay: 0.0
+  logging_frequency: 100
+  gradient_clipping: 1.0
+  gradient_accumulation_steps: 1
+  optimizer:
+    name: AdamW
+    betas: [0.9, 0.95]
+    eps: 1.0e-08
+    weight_decay: 1.0e-08
+  enable_gradient_checkpointing: true
+  enable_mixed_precision_training: true
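
The `trainer` block pairs `num_warmup_steps: 5000` with `lr_scheduler_type: cosine_with_min_lr` and `min_lr: 5.0e-07` over `max_train_steps: 100000`. Below is a minimal sketch of that schedule shape for the `base` learning rate, assuming linear warmup from zero followed by a single cosine decay to the floor. The training code may differ in details, for example in whether `num_warmup_steps` or `warmup_ratio` takes precedence, so the function here is illustrative only.

```python
import math


def cosine_with_min_lr(step, base_lr, min_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at the floor past the final step
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


# Values taken from the config above (learning_rate.base, min_lr, warmup, max steps).
for step in (0, 2500, 5000, 52500, 100000):
    lr = cosine_with_min_lr(step, 4.0e-05, 5.0e-07, 5000, 100000)
    print(f"step {step:>6}: lr = {lr:.3e}")
```

The same function can be evaluated with the `qwen_vl_interface` or `action_model` rates to see how each parameter group would decay under this assumption.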

dataset_statistics.json ADDED

	@@ -0,0 +1,71 @@

+{
+  "oxe_bridge": {
+    "action": {
+      "mean": [
+        0.00022731871285941452,
+        0.00013112221495248377,
+        -0.00012641931243706495,
+        -0.00014410706353373826,
+        -0.00039030605694279075,
+        0.0002406332059763372,
+        0.5765891671180725
+      ],
+      "std": [
+        0.009770569391548634,
+        0.013695062138140202,
+        0.012675146572291851,
+        0.028455283492803574,
+        0.03052123636007309,
+        0.07739030569791794,
+        0.4966523349285126
+      ],
+      "max": [
+        0.41691166162490845,
+        0.25864794850349426,
+        0.21218234300613403,
+        3.122201919555664,
+        1.8618112802505493,
+        6.272472858428955,
+        1.0
+      ],
+      "min": [
+        -0.4007510244846344,
+        -0.13874775171279907,
+        -0.22553899884223938,
+        -3.2010786533355713,
+        -1.8618112802505493,
+        -6.279075622558594,
+        0.0
+      ],
+      "q01": [
+        -0.02875255048274994,
+        -0.04170213546603918,
+        -0.026096721179783344,
+        -0.08052874729037285,
+        -0.09249906800687313,
+        -0.20738555490970612,
+        0.0
+      ],
+      "q99": [
+        0.028306663036346436,
+        0.04089853074401617,
+        0.0401805154979229,
+        0.08173403143882751,
+        0.07760760560631752,
+        0.2038465365767479,
+        1.0
+      ],
+      "mask": [
+        true,
+        true,
+        true,
+        true,
+        true,
+        true,
+        false
+      ]
+    },
+    "num_trajectories": 800,
+    "num_transitions": 27903
+  }
+}
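
The `mask` array marks the first six action dimensions for normalization and leaves the seventh (gripper) channel untouched. Below is a sketch of how such statistics are commonly applied at inference time, assuming the policy emits values in [-1, 1] that map linearly onto [`q01`, `q99`]. Whether SemanticVLA uses the quantile bounds or `min`/`max` is not stated in this commit, so `unnormalize_action` is a hypothetical helper, not the repo's API.

```python
# Quantile bounds and mask copied from dataset_statistics.json ("oxe_bridge" -> "action").
ACTION_STATS = {
    "q01": [-0.02875255048274994, -0.04170213546603918, -0.026096721179783344,
            -0.08052874729037285, -0.09249906800687313, -0.20738555490970612, 0.0],
    "q99": [0.028306663036346436, 0.04089853074401617, 0.0401805154979229,
            0.08173403143882751, 0.07760760560631752, 0.2038465365767479, 1.0],
    "mask": [True, True, True, True, True, True, False],
}


def unnormalize_action(normalized, stats=ACTION_STATS):
    """Map a normalized 7-D action back to dataset units (assumed convention)."""
    out = []
    for value, low, high, use in zip(normalized, stats["q01"], stats["q99"], stats["mask"]):
        if use:
            clipped = max(-1.0, min(1.0, value))  # keep within the normalized range
            out.append(0.5 * (clipped + 1.0) * (high - low) + low)
        else:
            out.append(value)  # masked-out dimension passes through unchanged
    return out


print(unnormalize_action([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]))
```

Under this convention a normalized value of -1 lands on `q01` and +1 lands on `q99`, while the gripper command is passed through as predicted.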

final_model/pytorch_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:60c52447f026e9739267d9b131acb81639a4ca8b56912797387d53ef0c7c1aff
+size 9974438810
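
The three added lines are a Git LFS pointer rather than the weights themselves: `oid` is the SHA-256 digest of the real file and `size` is its length in bytes (roughly 9.97 GB). Below is a small sketch for checking a downloaded checkpoint against such a pointer. The helper names are hypothetical and only the Python standard library is used.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into (algo, digest, size)."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return algo, digest, int(fields["size"])


def verify_lfs_object(path, pointer_text, chunk_size=1 << 20):
    """Return True if the file at `path` matches the pointer's size and digest."""
    algo, digest, size = parse_lfs_pointer(pointer_text)
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    path = Path(path)
    if path.stat().st_size != size:
        return False  # cheap check first; avoids hashing a truncated download
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == digest
```

Comparing the size before hashing keeps the check fast for the common failure case of an interrupted download.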