---
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
datasets:
- homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2
library_name: transformers
license: apache-2.0
pipeline_tag: robotics
---
# AlphaSpace-1.5B
## Introduction
**AlphaSpace** ([paper](https://huggingface.co/papers/2503.18769)) is a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. It employs a hierarchical, semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Objects are represented through structured tokens that capture their attributes, positions, and heights, enabling precise spatial reasoning without relying on traditional vision-based embeddings. As a result, the LLM can accurately manipulate objects by positioning them at specific [x, y, z] coordinates.
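The exact token vocabulary is defined in the repository (see `tokenize_desk` in the code below); the following is only a minimal sketch of the general idea of hierarchical coarse/fine position encoding, with hypothetical token names:

```python
# Illustrative sketch only: encode a table coordinate as a coarse cell token
# plus a fine-grained offset token within that cell. Token formats here are
# hypothetical, not AlphaSpace's actual vocabulary.
def encode_position(x: int, y: int, cell: int = 10) -> list[str]:
    coarse = f"<cell_{x // cell}_{y // cell}>"  # which coarse grid cell
    fine = f"<off_{x % cell}_{y % cell}>"       # offset inside that cell
    return [coarse, fine]

print(encode_position(51, 43))  # ['<cell_5_4>', '<off_1_3>']
```

Splitting a position into a coarse token and a fine token keeps the vocabulary small while still letting the model reason about exact coordinates.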
Code: https://github.com/AlanDao/AlphaSpace
## Model Details
* Model architecture: [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)
* Dataset:
* Training: [homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2](https://huggingface.co/datasets/homebrewltd/Pick-Place-Table-Reasoning-local-pos-v0.2)
* Eval: [EmbodiedBench/EB-Manipulation](https://huggingface.co/datasets/EmbodiedBench/EB-Manipulation)
* License: Apache-2.0
* Developed by: Alan Dao, Dinh Bach Vu, Bui Quang Huy (Menlo Research)
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
import torch
from utils import tokenize_desk, SYSTEM_PROMPT  # helpers from the AlphaSpace repository

model_path = "path/to/AlphaSpace-1.5B"  # local path or Hugging Face Hub id of this model
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Define your workspace
objects = [
{"red-cube": [51, 43, 17]},
{"black-cube": [44, 58, 17]},
{"purple-cube": [74, 59, 17]},
{"green-cube": [65, 82, 17]},
]
# Give a natural language instruction
instruction = "Throw the red cube on top of the green cube"
desk, object_height = tokenize_desk(objects)
final_instruction = SYSTEM_PROMPT.format(object_height=object_height,instruction=instruction,TABLE_MAP=desk)
chat = [
{"role": "user", "content": final_instruction.strip()}
]
tokenized_chat = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, use_system_prompt=False, return_tensors="pt")
generated_ids = model.generate(
    tokenized_chat.to(device),
    max_new_tokens=2048,
    do_sample=False,  # greedy decoding; temperature has no effect when sampling is disabled
)
# Get the solution
result = tokenizer.decode(generated_ids[0][tokenized_chat.shape[1]:], skip_special_tokens=True)
print(result)
```
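The decoded `result` contains the model's reasoning and the target coordinates. Assuming the output embeds coordinates in bracketed `[x, y, z]` form (as in the workspace definition above), they can be pulled out with a simple regex; this parser is an illustrative sketch, not part of the released code:

```python
import re

def extract_coordinates(text: str) -> list[list[int]]:
    """Extract [x, y, z] integer triplets from generated text.
    Assumes coordinates appear in bracketed form, e.g. [51, 43, 17]."""
    matches = re.findall(r"\[(\d+),\s*(\d+),\s*(\d+)\]", text)
    return [[int(n) for n in m] for m in matches]

sample = "Pick red-cube at [51, 43, 17] and place at [74, 59, 34]."
print(extract_coordinates(sample))  # [[51, 43, 17], [74, 59, 34]]
```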
### Hardware
**GPU Configuration**: Cluster of 8x NVIDIA H200-SXM-140GB.
**GPU Usage**:
- **SFT**: 40 minutes.
### Training Arguments
We use the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) library to train the model.
| **Parameter** | **Continual Training** |
| --- | --- |
| **Epoch** | 1 |
| **Global batch size** | 128 |
| **Learning Rate** | 1e-4 |
| **Learning Scheduler** | cosine with warmup |
| **Optimizer** | [AdamW Fused](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html) |
| **Warmup Ratio** | 0.1 |
| **Max length** | 4096 |
| **Precision** | bf16 |
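Training was run through LLaMA-Factory, so the parameters above are passed via its configuration rather than hand-written code. For reference, the optimizer and schedule they describe correspond roughly to the following plain-PyTorch sketch (the step counts are hypothetical; actual values depend on the dataset size and the global batch size of 128):

```python
import math
import torch

total_steps = 1000                      # hypothetical; depends on dataset size
warmup_steps = int(0.1 * total_steps)   # warmup ratio 0.1 from the table

model = torch.nn.Linear(8, 8)           # stand-in for the 1.5B model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # fused=True on CUDA

def lr_lambda(step: int) -> float:
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```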
## Citation
- Paper: [AlphaSpace (arXiv:2503.18769)](https://arxiv.org/abs/2503.18769)
## More Information
* Contact the authors at alan@menlo.ai, bach@menlo.ai, yuuki@menlo.ai for further details.