---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Octopus-8B

Octopus-8B is built on Qwen3-VL-8B-Instruct and adds self-correcting reasoning ability.

Paper: https://arxiv.org/pdf/2602.08503

Project Page: https://dripnowhy.github.io/Octopus/

Code: https://github.com/DripNowhy/Octopus

This repository hosts the model weights for Octopus-8B.

---

## Model Performance

## Quickstart

Below are simple examples showing how to use `Octopus-8B` with vLLM and 🤗 Transformers.

Qwen3-VL support is available in the latest Hugging Face Transformers. We advise building from source with the following command:

```
pip install git+https://github.com/huggingface/transformers
# pip install transformers==4.57.0 # currently, V4.57.0 is not released
```

### Using vLLM to Chat

The following snippet shows how to chat with the model using `vllm`:

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

MODEL_PATH = "Tuwhy/Octopus-8B"


def main():
    # Initialize model
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
        seed=1,
        max_model_len=8192 * 8,
        trust_remote_code=True
    )

    # The processor is only used here to render the chat template
    processor = AutoProcessor.from_pretrained(
        MODEL_PATH,
        max_pixels=1280*28*28,
        min_pixels=256*28*28
    )

    # Single case
    prompt = "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?"
    image_path = "./head.png"

    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=0.95,
        top_k=-1,
        max_tokens=8192*2
    )

    # Prepare messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt + prompt_suffix}
            ]
        }
    ]

    # Render the chat template to a plain-text prompt (no tokenization)
    text_prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Load image
    image = Image.open(image_path).convert("RGB")

    # Prepare input
    inputs = {
        "prompt": text_prompt,
        "multi_modal_data": {
            "image": image
        }
    }

    # Generate
    outputs = llm.generate([inputs], sampling_params=sampling_params)

    # Print result
    generated_text = outputs[0].outputs[0].text

    print("Generated response:")
    print("=" * 50)
    print(generated_text)
    print("=" * 50)


if __name__ == '__main__':
    main()
```
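
The prompt suffix asks the model to wrap its reasoning in `<think>` tags, put the final answer in `\boxed{}`, and emit an empty `<self-correction> </self-correction>` pair before each retry. Below is a minimal sketch of how such a response could be post-processed. The helper names `split_attempts`, `extract_boxed`, and `final_answer` are hypothetical and not part of the Octopus codebase, and the parsing assumes the model actually follows the requested format.

```python
import re
from typing import List, Optional

# Assumption: a retry is marked by an empty tag pair, possibly with whitespace inside.
SELF_CORRECTION = re.compile(r"<self-correction>\s*</self-correction>")


def split_attempts(response: str) -> List[str]:
    """Split a response into attempts separated by self-correction tags."""
    return [part.strip() for part in SELF_CORRECTION.split(response) if part.strip()]


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last boxed answer, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def final_answer(response: str) -> Optional[str]:
    """Prefer the latest attempt; fall back to earlier ones if it has no box."""
    for attempt in reversed(split_attempts(response)):
        answer = extract_boxed(attempt)
        if answer is not None:
            return answer
    return None


demo = (
    "<think>first try</think> \\boxed{3} "
    "<self-correction> </self-correction> "
    "<think>second try</think> \\boxed{\\frac{1}{2}}"
)
print(final_answer(demo))  # prints \frac{1}{2}
```

Falling back to an earlier attempt when the last one has no boxed answer is a design choice of this sketch; a stricter evaluation could instead treat such a response as unanswered.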

### Using 🤗 Transformers to Chat

The following snippet shows how to chat with the model using `transformers`:

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Tuwhy/Octopus-8B", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# import torch
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Tuwhy/Octopus-8B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Tuwhy/Octopus-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./head.png",
            },
            {"type": "text", "text": "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?" + prompt_suffix},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192*2)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### Generation Hyperparameters

#### VL

```bash
export greedy='false'
export top_p=0.95
export top_k=-1
export temperature=0.6
export out_seq_length=16384
```
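
The exported variables above could be turned into sampling arguments roughly as follows. This is only a sketch: the function name `sampling_kwargs_from_env` is hypothetical, and mapping `out_seq_length` to `max_tokens` and `greedy='true'` to a temperature of 0 are assumptions about how an evaluation script would consume these variables. The defaults mirror the values listed above, and the returned keys match vLLM `SamplingParams` keyword names.

```python
import os


def sampling_kwargs_from_env(env=None):
    """Build sampling keyword arguments from the exported variables.

    `env` is any mapping of variable names to strings; defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Assumption: greedy decoding is expressed as temperature 0.
    greedy = env.get("greedy", "false").strip().lower() == "true"
    return {
        "temperature": 0.0 if greedy else float(env.get("temperature", "0.6")),
        "top_p": float(env.get("top_p", "0.95")),
        "top_k": int(env.get("top_k", "-1")),
        # Assumption: out_seq_length is the cap on generated tokens.
        "max_tokens": int(env.get("out_seq_length", "16384")),
    }


print(sampling_kwargs_from_env({"temperature": "0.6", "top_k": "-1"}))
```

With vLLM installed, the result can be passed on as `SamplingParams(**sampling_kwargs_from_env())`.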

## Citation

If you find our work helpful, please consider citing it.

```bibtex
@article{ding2025sherlock,
  title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
  author={Ding, Yi and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2505.22651},
  year={2025}
}
```