---
license: other
license_name: cc-by-nc-sa-4.0
license_link: LICENSE
language:
- en
metrics:
- accuracy
- bleu
base_model:
- Qwen/Qwen3-VL-4B-Instruct
tags:
- embodied
---
# Model Card for Thinker

**Thinker: A vision-language foundation model for embodied intelligence**

## Model Details

### Model Description

We are pleased to open-source Thinker, a state-of-the-art vision-language foundation model engineered specifically for embodied intelligence. Conventional VLMs often struggle with perspective confusion and temporal oversight; Thinker is designed to bridge the gap between general scene understanding and robust, robot-centric task-level capabilities.

Through high-quality dataset curation, multi-stage training, and reinforcement learning, Thinker exhibits advanced capabilities across four core dimensions:

- **Task Planning** with future-state prediction
- **Spatial Intelligence** grounded in an egocentric coordinate system
- **Temporal Understanding** through historical state integration
- precise **Visual Grounding**

Leveraging these capabilities, Thinker sets new records across 7 embodied AI benchmarks in Task Planning, Visual Grounding, and Spatial Understanding, significantly outperforming existing open-source, closed-source, and specialized baselines. These results demonstrate its potential as a foundation for embodied intelligence and autonomous robotic decision-making.

- **Developed by:** UBTECH Thinker Team
- **Project page:** https://github.com/UBTECH-Robot/Thinker
- **License:** Attribution-NonCommercial-ShareAlike 4.0 International
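
## How to Get Started with the Model

A minimal inference sketch, assuming Thinker follows the standard Hugging Face `transformers` interface of its base model, Qwen3-VL-4B-Instruct. The repo id `UBTECH-Robot/Thinker`, the image path, and the prompt below are placeholders for illustration, not confirmed identifiers; consult the project page for the actual model id and recommended usage.

```python
def build_messages(image_path: str, instruction: str) -> list:
    """Build a Qwen3-VL-style chat message with one image and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]


if __name__ == "__main__":
    # Heavy imports and weight loading happen only when run directly.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "UBTECH-Robot/Thinker"  # placeholder repo id, not confirmed
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

    # Example embodied-planning prompt (illustrative only).
    messages = build_messages("scene.jpg", "Plan the steps to pick up the red cup.")
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens.
    answer = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    print(answer)
```

The message-building step is kept in a separate pure function so the chat structure can be inspected or reused without loading the model.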