---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
datasets:
- homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
---

# AlphaSpace-1.5B

## Introduction

**AlphaSpace** ([Paper](https://huggingface.co/papers/2503.18769)) is a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. It employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Objects are represented with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This allows LLMs to manipulate objects accurately by positioning them at specific [x, y, z] coordinates.

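To make the idea of coarse and fine-grained position encoding concrete, here is a minimal sketch. It is not the AlphaSpace tokenizer: the cell size `CELL`, the token strings, and the function names `encode_position` and `decode_position` are all hypothetical, chosen only to illustrate how one coordinate can be split into a coarse cell index plus a fine offset inside that cell.

```python
# Illustrative only: the cell size and token strings below are assumptions,
# not the vocabulary actually used by AlphaSpace.
CELL = 4  # hypothetical side length of one coarse grid cell


def encode_position(x: int, y: int, z: int, cell: int = CELL) -> list[str]:
    """Split non-negative (x, y) into a coarse cell index and a fine offset; keep z as height."""
    gx, lx = divmod(x, cell)
    gy, ly = divmod(y, cell)
    return [f"<coarse-{gx}-{gy}>", f"<fine-{lx}-{ly}>", f"<height-{z}>"]


def decode_position(tokens: list[str], cell: int = CELL) -> tuple[int, int, int]:
    """Invert encode_position for non-negative coordinates."""
    gx, gy = (int(v) for v in tokens[0].strip("<>").split("-")[1:])
    lx, ly = (int(v) for v in tokens[1].strip("<>").split("-")[1:])
    z = int(tokens[2].strip("<>").split("-")[1])
    return gx * cell + lx, gy * cell + ly, z


tokens = encode_position(51, 43, 17)
print(tokens)
print(decode_position(tokens))
```

The coarse token narrows the position to a region and the fine token pins it down within that region, so the original coordinate can be recovered exactly from the token sequence.
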
Code: https://github.com/AlanDao/AlphaSpace

## Model Details

* Model architecture: [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
* Dataset:
  * Training: [homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2](https://huggingface.co/datasets/homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2)
  * Eval: [EmbodiedBench/EB-Manipulation](https://huggingface.co/datasets/EmbodiedBench/EB-Manipulation)
* License: Apache-2.0
* Developed by: Alan Dao, Dinh Bach Vu, Bui Quang Huy (Menlo Research)

## How to Get Started

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# tokenize_desk and SYSTEM_PROMPT come from a local utils module,
# assumed here to be the one shipped with the AlphaSpace code repository.
from utils import tokenize_desk, SYSTEM_PROMPT

# Load the model
model_path = "path/to/AlphaSpace-1.5B"  # placeholder: set to the local path or Hub repo id of this model
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Define your workspace: each entry maps an object name to its position
objects = [
    {"red-cube": [51, 43, 17]},
    {"black-cube": [44, 58, 17]},
    {"purple-cube": [74, 59, 17]},
    {"green-cube": [65, 82, 17]},
]

# Give a natural language instruction
instruction = "Throw the red cube on top of the blue cylinder"
desk, object_height = tokenize_desk(objects)
final_instruction = SYSTEM_PROMPT.format(object_height=object_height, instruction=instruction, TABLE_MAP=desk)
chat = [
    {"role": "user", "content": final_instruction.strip()}
]
tokenized_chat = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, use_system_prompt=False, return_tensors="pt")
# print(len(tokenized_chat[0]))

# Greedy decoding: temperature has no effect when do_sample=False, so it is omitted
generated_ids = model.generate(
    tokenized_chat.to(device),
    max_new_tokens=2048,
    do_sample=False,
)

# Get the solution: decode only the newly generated tokens
result = tokenizer.decode(generated_ids[0][tokenized_chat.shape[1]:], skip_special_tokens=True)
print(result)
```

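Because the base model is an R1 distillation, the decoded output may contain a reasoning trace followed by the final answer. The helper below is a minimal sketch for separating the two, assuming the trace is terminated by a `</think>` tag; that tag, the function name `split_reasoning`, and the sample string are assumptions for illustration, not documented behavior of this model.

```python
def split_reasoning(text: str, end_tag: str = "</think>") -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if end_tag is absent."""
    reasoning, sep, answer = text.rpartition(end_tag)
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    # Drop a leading opening tag if the model emitted one.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


# Made-up string for demonstration only, not real model output.
sample = "<think>Locate the red cube, then the target.</think> final answer goes here"
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

In the snippet above, `result` would be passed in place of `sample`. If the output contains no closing tag, the whole text is returned as the answer, so the helper is safe to apply unconditionally.
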
### Hardware

**GPU Configuration**: Cluster of 8x NVIDIA H200-SXM-140GB.

**GPU Usage**:
- **SFT**: 40 mins.

### Training Arguments

We use the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) library to train the model.

| **Parameter** | **Continual Training** |
| --- | --- |
| **Epoch** | 1 |
| **Global batch size** | 128 |
| **Learning Rate** | 1e-4 |
| **Learning Scheduler** | cosine with warmup |
| **Optimizer** | [AdamW Fused](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) |
| **Warmup Ratio** | 0.1 |
| **Max length** | 4096 |
| **Precision** | bf16 |

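To show how the learning rate, scheduler, and warmup ratio in the table interact, here is a minimal sketch of a cosine schedule with warmup. It assumes linear warmup from zero and cosine decay to zero, which is the common form of this schedule; the exact variant used in training is not stated here, and the step count `total` is hypothetical.

```python
import math

PEAK_LR = 1e-4      # "Learning Rate" from the table
WARMUP_RATIO = 0.1  # "Warmup Ratio" from the table


def lr_at(step: int, total_steps: int,
          peak_lr: float = PEAK_LR, warmup_ratio: float = WARMUP_RATIO) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed variant)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total))
```

With a warmup ratio of 0.1, the first tenth of the steps ramps the rate up to 1e-4, and the remaining steps decay it along a half cosine.
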
## Citation

- https://arxiv.org/abs/2503.18769

## More Information

* Contact the authors at alan@menlo.ai, bach@menlo.ai, yuuki@menlo.ai for further details.