---
license: apache-2.0
language:
- en
tags:
- computer-vision
- robotics
- spatial-reasoning
- vision-language-model
- multi-modal
- glm4v
- fine-tuned
base_model: glm4v
model_type: vision-language-model
datasets:
- custom
library_name: transformers
---

# TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

## Usage

### Environment Requirements

```bash
pip install -r requirements.txt
```

### Configuration

Before using the model, update two paths:

1. In the configuration file `glm4v_tisr_full_inference.yaml`, set `media_dir` to your image directory:

   ```yaml
   media_dir: /path/to/your/images
   ```

2. In `example_usage.py`, set the image path:

   ```python
   image_paths = ["/path/to/your/image.jpg"]  # Replace with actual image path
   ```

### Basic Usage

```python
import sys

from llamafactory.chat.chat_model import ChatModel

# Load the model using LLaMA-Factory's ChatModel
config_file = "glm4v_tisr_full_inference.yaml"

# ChatModel reads its configuration from the command line,
# so simulate command line arguments while it is constructed
original_argv = sys.argv.copy()
sys.argv = [sys.argv[0], config_file]

try:
    chat_model = ChatModel()
finally:
    # Restore original command line arguments
    sys.argv = original_argv

# Prepare input
image_paths = ["/path/to/your/image.jpg"]  # Replace with actual image path
question = (
    "Two points are circled on the image, labeled by A and B beside each circle. "
    "Which point is closer to the camera? Select from the following choices.\n"
    "(A) A is closer\n"
    "(B) B is closer"
)

# Prepare messages
messages = [
    {
        "role": "user",
        "content": question,
    }
]

# Get model response
response = chat_model.chat(messages, images=image_paths)

# Collect the text of each returned response, falling back to str()
# if a response object does not expose `response_text`
assistant_texts = []
for resp in response:
    try:
        assistant_texts.append(resp.response_text)
    except Exception:
        assistant_texts.append(str(resp))

response_text = "\n".join(assistant_texts)
print(response_text)
```

## Citation

If you use this model, please cite:

```bibtex
@misc{2510.07181,
  Author = {Yi Han and Cheng Chi and Enshen Zhou and Shanyu Rong and Jingkun An and Pengwei Wang and Zhongyuan Wang and Lu Sheng and Shanghang Zhang},
  Title = {TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics},
  Year = {2025},
  Eprint = {arXiv:2510.07181},
}
```
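
## Additional Notes

The Basic Usage example swaps `sys.argv` by hand so that `ChatModel()` picks up the configuration file, then restores it in a `finally` block. If you load the model in several places, the same pattern can be wrapped in a context manager. The sketch below is one possible way to do that; `patched_argv` is a hypothetical helper name and is not part of this repository or of LLaMA-Factory.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(*args):
    """Temporarily replace the command line arguments seen by Python code.

    `patched_argv` is an illustrative helper (an assumption, not part of
    LLaMA-Factory). It keeps the program name and swaps the remaining
    arguments for `args`, restoring the original list on exit.
    """
    original_argv = sys.argv.copy()
    program = original_argv[0] if original_argv else ""
    sys.argv = [program, *args]
    try:
        yield
    finally:
        # Restore the original arguments even if the body raised
        sys.argv = original_argv
```

With this helper, the model load from Basic Usage could be written as `with patched_argv(config_file): chat_model = ChatModel()`, which keeps the restore step in one place.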