---
pipeline_tag: image-text-to-text
library_name: transformers
license: mit
---

# VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

This model, VolCano, is presented in the paper [VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models](https://arxiv.org/abs/2405.16919) and is designed for multi-step, visually grounded reasoning.

Code and further details are available at: https://github.com/RupertLuo/VoCoT

## Quick Start

This example demonstrates basic usage. For more details, please refer to the project's GitHub repository.

```python
from model.load_model import load_model, infer
from PIL import Image

# Load the model and its preprocessor in half precision
model_path = 'luoruipu1/Volcano-7b'
model, preprocessor = load_model(model_path, precision='fp16')

# Perform reasoning; activate VoCoT multi-step reasoning by passing cot=True
input_image = Image.open('figs/sample_input.jpg')
response = infer(model, preprocessor, input_image, 'Describe the image.', cot=True)
print('response: ', response[0])
```