---
license: apache-2.0
language:
- en
tags:
- computer-vision
- robotics
- spatial-reasoning
- vision-language-model
- multi-modal
- glm4v
- fine-tuned
base_model: glm4v
model_type: vision-language-model
datasets:
- custom
library_name: transformers
---

# TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

## Usage

### Environment Requirements

```bash
pip install -r requirements.txt
```

### Configuration

Before running inference, point the configuration file `glm4v_tisr_full_inference.yaml` and the example script at your data:

1. Update `media_dir` to your image directory:
   ```yaml
   media_dir: /path/to/your/images
   ```

2. Update the image path in `example_usage.py`:
   ```python
   image_paths = ["/path/to/your/image.jpg"]  # Replace with actual image path
   ```

### Basic Usage

```python
import sys

from llamafactory.chat.chat_model import ChatModel

# Path to the LLaMA-Factory inference configuration
config_file = "glm4v_tisr_full_inference.yaml"

# ChatModel() reads its arguments from sys.argv, so temporarily
# replace them with the config file path and restore them afterwards.
original_argv = sys.argv.copy()
sys.argv = [sys.argv[0], config_file]

try:
    chat_model = ChatModel()
finally:
    sys.argv = original_argv

# Prepare input
image_paths = ["/path/to/your/image.jpg"]  # Replace with an actual image path
question = (
    "Two points are circled on the image, labeled by A and B beside each circle. "
    "Which point is closer to the camera? Select from the following choices.\n"
    "(A) A is closer\n(B) B is closer"
)

messages = [
    {
        "role": "user",
        "content": question,
    }
]

# chat() returns a list of Response objects; collect their text
responses = chat_model.chat(messages, images=image_paths)
assistant_texts = []

for resp in responses:
    try:
        assistant_texts.append(resp.response_text)
    except AttributeError:
        # Fall back to the string form if the response has no response_text
        assistant_texts.append(str(resp))

response_text = "\n".join(assistant_texts)
print(response_text)
```
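If the model is loaded from more than one place, the `sys.argv` juggling above can be factored into a small context manager. This is a sketch using only the standard library; `temporary_argv` is a helper name introduced here, not part of LLaMA-Factory.

```python
import sys
from contextlib import contextmanager


@contextmanager
def temporary_argv(*args: str):
    """Temporarily replace sys.argv (keeping argv[0]), restoring it on exit."""
    original = sys.argv.copy()
    sys.argv = [sys.argv[0], *args]
    try:
        yield
    finally:
        # Restore even if model construction raises
        sys.argv = original


# Usage (assuming llamafactory is installed):
# with temporary_argv("glm4v_tisr_full_inference.yaml"):
#     chat_model = ChatModel()
```

Because the restore happens in a `finally` block, `sys.argv` is put back even when `ChatModel()` fails, e.g. on a bad config path.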

## Citation

If you use this model, please cite:

```bibtex
@misc{han2025tiger,
  title         = {TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics},
  author        = {Yi Han and Cheng Chi and Enshen Zhou and Shanyu Rong and Jingkun An and Pengwei Wang and Zhongyuan Wang and Lu Sheng and Shanghang Zhang},
  year          = {2025},
  eprint        = {2510.07181},
  archivePrefix = {arXiv},
}
```