---
license: apache-2.0
language:
- en
tags:
- computer-vision
- robotics
- spatial-reasoning
- vision-language-model
- multi-modal
- glm4v
- fine-tuned
base_model: glm4v
model_type: vision-language-model
datasets:
- custom
library_name: transformers
---

# TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

## Usage

### Environment Requirements

```bash
pip install -r requirements.txt
```

### Configuration

Before running the model, set two paths: one in the configuration file `glm4v_tisr_full_inference.yaml` and one in `example_usage.py`.

1. In `glm4v_tisr_full_inference.yaml`, set `media_dir` to your image directory:

   ```yaml
   media_dir: /path/to/your/images
   ```

2. In `example_usage.py`, set the image path:

   ```python
   image_paths = ["/path/to/your/image.jpg"]  # Replace with actual image path
   ```
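
Both settings above are plain filesystem paths, so a quick check before loading the model can catch a typo early. The following is a minimal sketch and not part of the released code: `find_missing_images` is a hypothetical helper name, and it relies only on the Python standard library.

```python
from pathlib import Path


def find_missing_images(image_paths):
    """Return the entries of image_paths that do not point to an existing file."""
    return [p for p in image_paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Placeholder path from the configuration step; replace with your own.
    image_paths = ["/path/to/your/image.jpg"]
    missing = find_missing_images(image_paths)
    if missing:
        print("Missing image files:", missing)
    else:
        print("All image files found.")
```

Running this check before building the `ChatModel` avoids paying the model-loading cost only to fail on a mistyped path.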

### Basic Usage

```python
import sys

from llamafactory.chat.chat_model import ChatModel

# Load model using LLaMA-Factory ChatModel
config_file = "glm4v_tisr_full_inference.yaml"

# Simulate command line arguments
original_argv = sys.argv.copy()
sys.argv = [sys.argv[0], config_file]

try:
    chat_model = ChatModel()
finally:
    # Restore original command line arguments
    sys.argv = original_argv

# Prepare input
image_paths = ["/path/to/your/image.jpg"]  # Replace with actual image path
question = "Two points are circled on the image, labeled by A and B beside each circle. Which point is closer to the camera? Select from the following choices.\n(A) A is closer\n(B) B is closer"

# Prepare messages
messages = [
    {
        "role": "user",
        "content": question
    }
]

# Get model response
response = chat_model.chat(messages, images=image_paths)
assistant_texts = []

for resp in response:
    try:
        assistant_texts.append(resp.response_text)
    except Exception:
        assistant_texts.append(str(resp))

response_text = "\n".join(assistant_texts)
print(response_text)
```
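
If the answer needs to be consumed by downstream code, the selected option can be pulled out of `response_text`. The sketch below rests on an assumption: that the reply names the chosen option in the same `(A)` / `(B)` form used in the prompt. The model's actual output format is not specified here, and `extract_choice` is a hypothetical helper, not part of the released code.

```python
import re


def extract_choice(text, choices=("A", "B")):
    """Return the last option letter that appears as "(X)" in text, or None.

    Assumption: the reply repeats the chosen option in parentheses, and the
    final mention is the answer (earlier mentions may be part of the reasoning).
    """
    letters = "".join(re.escape(c) for c in choices)
    matches = re.findall(r"\(([" + letters + r"])\)", text)
    return matches[-1] if matches else None


if __name__ == "__main__":
    # Hypothetical reply text, used only to illustrate the helper.
    sample = "Point A has the smaller depth, so the answer is (A)."
    print(extract_choice(sample))
```

Taking the last match rather than the first is a design choice for replies that discuss both options before concluding. If the model answers in a different format, the pattern needs to be adapted.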

## Citation

If you use this model, please cite:

```bibtex
@misc{2510.07181,
  Author = {Yi Han and Cheng Chi and Enshen Zhou and Shanyu Rong and Jingkun An and Pengwei Wang and Zhongyuan Wang and Lu Sheng and Shanghang Zhang},
  Title = {TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics},
  Year = {2025},
  Eprint = {arXiv:2510.07181},
}
```