---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
datasets:
- homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
---

# AlphaSpace-1.5B

## Introduction

**AlphaSpace** ([Paper](https://huggingface.co/papers/2503.18769)) is a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This allows LLMs to accurately manipulate objects by positioning them at specific [x, y, z] coordinates.

Code: https://github.com/AlanDao/AlphaSpace
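
The paper's actual token vocabulary is not reproduced here, but the coarse/fine idea can be sketched as follows. The function names, bin size, and token format below are illustrative assumptions, not the real AlphaSpace tokenizer (see the linked code for that):

```python
def encode_coord(value, fine_bins=10):
    # Hypothetical hierarchical encoding: one coarse token for the
    # high-order part and one fine token for the remainder.
    # The bin size of 10 is an assumption for illustration.
    return f"<coarse_{value // fine_bins}>", f"<fine_{value % fine_bins}>"

def encode_object(name, xyz):
    # Represent an object as an attribute token followed by its
    # hierarchically encoded [x, y, z] position.
    tokens = [f"<obj:{name}>"]
    for v in xyz:
        tokens.extend(encode_coord(v))
    return tokens

# The red cube at [51, 43, 17] from the usage example below
print(encode_object("red-cube", [51, 43, 17]))
```

The point of the hierarchy is that nearby positions share a coarse token, so the model can reason about rough placement before committing to an exact coordinate.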

## Model Details
* Model architecture: [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
* Dataset:
  * Training: [homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2](https://huggingface.co/datasets/homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2)
  * Eval: [EmbodiedBench/EB-Manipulation](https://huggingface.co/datasets/EmbodiedBench/EB-Manipulation)
* License: Apache-2.0
* Developed by: Alan Dao, Dinh Bach Vu, Bui Quang Huy (Menlo Research)


## How to Get Started 

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from utils import tokenize_desk, SYSTEM_PROMPT

# Load the model
model_path = "path/to/AlphaSpace-1.5B"  # set this to the model checkpoint
device = "cuda"

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Define your workspace
objects = [
    {"red-cube": [51, 43, 17]},
    {"black-cube": [44, 58, 17]},
    {"purple-cube": [74, 59, 17]},
    {"green-cube": [65, 82, 17]},
]

# Give a natural language instruction
instruction = "Put the red cube on top of the green cube"
desk, object_height = tokenize_desk(objects)
final_instruction = SYSTEM_PROMPT.format(object_height=object_height,instruction=instruction,TABLE_MAP=desk)
chat = [
    {"role": "user", "content": final_instruction.strip()}
]
tokenized_chat = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, use_system_prompt=False, return_tensors="pt")
generated_ids = model.generate(
    tokenized_chat.to("cuda"),
    max_new_tokens=2048,
    do_sample=False,  # greedy decoding; temperature is unused when sampling is off
)
# Get the solution
result = tokenizer.decode(generated_ids[0][tokenized_chat.shape[1]:], skip_special_tokens=True)
print(result)
```
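
The decoded `result` contains the model's reasoning and target position as text. If the output embeds coordinates as `[x, y, z]` integer triples (the exact output format depends on the training data, so treat this as an assumption), a simple regex can recover the final one:

```python
import re

def extract_last_triple(text):
    # Find all [x, y, z] integer triples in the generated text and
    # return the last one, which we assume is the final placement target.
    triples = re.findall(r"\[(\d+),\s*(\d+),\s*(\d+)\]", text)
    return [int(v) for v in triples[-1]] if triples else None

sample = "Reasoning... move the gripper to [51, 43, 25] and release."
print(extract_last_triple(sample))  # [51, 43, 25]
```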
### Hardware

**GPU Configuration**: Cluster of 8x NVIDIA H200-SXM-140GB.

**GPU Usage**:
  - **SFT**: 40 mins.

### Training Arguments

We use the [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) library to train the model.

| **Parameter** | **Continual Training** |
| --- | --- |
| **Epoch** | 1 |
| **Global batch size** | 128 |
| **Learning Rate** | 1e-4 |
| **Learning Scheduler** | cosine with warmup |
| **Optimizer** | [AdamW Fused](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) |
| **Warmup Ratio** | 0.1 |
| **Max length** | 4096 |
| **Precision** | bf16 |
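
A global batch size of 128 on the 8-GPU cluster implies a per-device batch and gradient-accumulation split along these lines. Only the product is given in the table, so the per-device value below is an assumption:

```python
# Hypothetical breakdown of the global batch size from the table above.
global_batch = 128
num_gpus = 8           # 8x H200 (see Hardware section)
per_device_batch = 8   # assumed; not stated in the card
grad_accum = global_batch // (num_gpus * per_device_batch)
assert num_gpus * per_device_batch * grad_accum == global_batch
print(grad_accum)  # 2
```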

## Citation
- https://arxiv.org/abs/2503.18769

## More Information
* Contact the authors at alan@menlo.ai, bach@menlo.ai, yuuki@menlo.ai for further details.