Update README with model/dataset documentation
README.md CHANGED
@@ -53,40 +53,6 @@ You FIRST observe the image in <observe> </observe> tags, then visualise the rel
 Image size: {Width} x {Height}
 ```
 
-## Output Format
-
-The model generates structured output with four components:
-
-1. **`<observe>`**: Scene description covering relevant objects
-2. **`<scene>`**: JSON scene graph with objects (id, bbox) and relationships (subject, predicate, object)
-3. **`<think>`**: Step-by-step reasoning as internal monologue
-4. **`<answer>`**: Final answer with option letter and text
-
-### Example Output
-
-```
-<observe>
-The image shows a living room with a couch, a coffee table, and a cat sitting on the floor.
-</observe>
-<scene>
-{
-  "objects": [
-    {"id": "couch.1", "bbox": [50, 100, 400, 350]},
-    {"id": "cat.1", "bbox": [200, 300, 280, 400]},
-    {"id": "table.1", "bbox": [150, 250, 350, 320]}
-  ],
-  "relationships": [
-    {"subject": "cat.1", "predicate": "in front of", "object": "couch.1"},
-    {"subject": "cat.1", "predicate": "beside", "object": "table.1"}
-  ]
-}
-</scene>
-<think>
-Looking at the scene graph, the cat is positioned in front of the couch and beside the coffee table. The bounding box coordinates show the cat is at y=300-400 while the couch extends to y=350, confirming the cat is on the floor in front of the couch.
-</think>
-<answer> (B) in front of the couch </answer>
-```
-
 ## Usage
 
 ```python
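The `<think>` example above reasons about which object sits lower in the image from the bounding-box coordinates. That check can be sketched in a couple of lines; the `[x1, y1, x2, y2]` coordinate convention and the `extends_lower` helper are assumptions inferred from the example, not part of this repository.

```python
# Illustrative only: compares the example's [x1, y1, x2, y2] boxes, where a
# larger y2 means the object extends lower (closer to the floor) in the image.
def extends_lower(bbox_a, bbox_b):
    """True if object A's bottom edge is below object B's bottom edge."""
    return bbox_a[3] > bbox_b[3]

cat = [200, 300, 280, 400]
couch = [50, 100, 400, 350]
# The cat's bottom (y2=400) is below the couch's (y2=350), matching the
# example's conclusion that the cat is on the floor in front of the couch.
print(extends_lower(cat, couch))  # -> True
```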
@@ -129,19 +95,6 @@ output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
 print(output)
 ```
 
-## Evaluation Results
-
-SpatialThinker-7B achieves state-of-the-art performance on spatial reasoning benchmarks:
-
-| Benchmark | SpatialThinker-7B |
-|-----------|-------------------|
-| CV-Bench (3D) | Strong performance |
-| BLINK-Spatial | Outperforms GPT-4o |
-| SpatialBench | SOTA results |
-| RealWorldQA | Competitive |
-
-See the [paper](https://arxiv.org/abs/2511.07403) for detailed results.
-
 ## Citation
 
 ```bibtex
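The removed Output Format section describes a tag-delimited structure that is straightforward to post-process. A minimal sketch of splitting a response into its four components; the `parse_output` helper and the sample text are illustrative assumptions, not part of this repository.

```python
import json
import re

# Tags documented in the Output Format section of the diff above.
TAGS = ("observe", "scene", "think", "answer")

def parse_output(text: str) -> dict:
    """Extract the <observe>, <scene>, <think>, <answer> blocks from model text."""
    parsed = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    # The <scene> block is documented as a JSON scene graph; decode it when present.
    if parsed.get("scene"):
        parsed["scene"] = json.loads(parsed["scene"])
    return parsed

sample = """<observe>A cat sits in front of a couch.</observe>
<scene>{"objects": [{"id": "cat.1", "bbox": [200, 300, 280, 400]}],
"relationships": []}</scene>
<think>The cat is in front of the couch.</think>
<answer> (B) in front of the couch </answer>"""

result = parse_output(sample)
print(result["answer"])                      # -> (B) in front of the couch
print(result["scene"]["objects"][0]["id"])   # -> cat.1
```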