---
pipeline_tag: image-text-to-text
library_name: transformers
license: mit # please verify the license in the repository
---
# VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

This model, VolCano, is presented in the paper [VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models](https://arxiv.org/abs/2405.16919) and is designed for multi-step, visually grounded reasoning.

Code and further details are available at: https://github.com/RupertLuo/VoCoT
## Quick Start

This example demonstrates basic usage. For more details, please refer to the project's GitHub repository.
```python
from model.load_model import load_model, infer
from PIL import Image

# load the model and its preprocessor in fp16
model_path = 'luoruipu1/Volcano-7b'
model, preprocessor = load_model(model_path, precision='fp16')

# run inference; passing cot=True activates VoCoT multi-step reasoning
input_image = Image.open('figs/sample_input.jpg')
response = infer(model, preprocessor, input_image, 'Describe the image.', cot=True)
print('response: ', response[0])
```