---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Octopus-8B
Octopus-8B is built on Qwen3-VL-8B-Instruct and features self-correcting reasoning.
Paper: https://arxiv.org/pdf/2602.08503
Project Page: https://dripnowhy.github.io/Octopus/
Code: https://github.com/DripNowhy/Octopus
This is the weight repository for Octopus-8B.
---
## Model Performance

## Quickstart
Below, we provide simple examples showing how to use `Octopus-8B` with vLLM and 🤗 Transformers.

First, Qwen3-VL support is included in the latest Hugging Face `transformers`, so we advise building from source:
```bash
pip install git+https://github.com/huggingface/transformers
# pip install transformers==4.57.0  # currently, v4.57.0 is not yet released
```
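Until a tagged release ships Qwen3-VL, you can guard against an incompatible install with a small version check. This is a sketch: the `4.57.0.dev0` floor is an assumption based on the release note above, and `supports_qwen3vl` is an illustrative helper, not part of the repo.

```python
from packaging import version  # available in any environment with pip/setuptools

# Assumed minimum: the comment above suggests Qwen3-VL lands in v4.57.0,
# so a source build typically reports something like "4.57.0.dev0".
MIN_VERSION = "4.57.0.dev0"

def supports_qwen3vl(installed: str, minimum: str = MIN_VERSION) -> bool:
    """Return True if the installed transformers version is new enough."""
    return version.parse(installed) >= version.parse(minimum)

print(supports_qwen3vl("4.57.0"))  # -> True (final release sorts after .dev0)
print(supports_qwen3vl("4.56.2"))  # -> False (too old)
```

You would call `supports_qwen3vl(transformers.__version__)` before loading the model to fail fast with a clear message.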
### Using vLLM to Chat
The following snippet shows how to chat with the model using `vllm`:
```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

MODEL_PATH = "Tuwhy/Octopus-8B"


def main():
    # Initialize the model
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
        seed=1,
        max_model_len=8192 * 8,
        trust_remote_code=True
    )
    processor = AutoProcessor.from_pretrained(
        MODEL_PATH,
        max_pixels=1280 * 28 * 28,
        min_pixels=256 * 28 * 28
    )

    # Single example
    prompt = "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?"
    image_path = "./head.png"
    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=0.95,
        top_k=-1,
        max_tokens=8192 * 2
    )

    # Prepare messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt + prompt_suffix}
            ]
        }
    ]
    text_prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Load the image
    image = Image.open(image_path).convert("RGB")

    # Prepare the generation input
    inputs = {
        "prompt": text_prompt,
        "multi_modal_data": {
            "image": image
        }
    }

    # Generate
    outputs = llm.generate([inputs], sampling_params=sampling_params)

    # Print the result
    generated_text = outputs[0].outputs[0].text
    print("Generated response:")
    print("=" * 50)
    print(generated_text)
    print("=" * 50)


if __name__ == '__main__':
    main()
```
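Because the self-correction prompt can make the model regenerate its answer from scratch, the last `\boxed{}` occurrence in the response is the one to keep. A minimal post-processing sketch (the helper name and demo string are illustrative, not part of this repo):

```python
import re

def final_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in a response, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

# Simulated output: a first answer followed by a self-corrected one.
demo = r"...first pass... \boxed{12%} ...regenerated reasoning... \boxed{9.2%}"
print(final_boxed_answer(demo))  # -> 9.2%
```

The same helper works on the `generated_text` produced by either the vLLM or Transformers snippet.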
### Using 🤗 Transformers to Chat
The following snippet shows how to chat with the model using `transformers`:
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
# import torch  # needed if you pass dtype=torch.bfloat16 below

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

# default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Tuwhy/Octopus-8B", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory
# saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Tuwhy/Octopus-8B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Tuwhy/Octopus-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./head.png",
            },
            {"type": "text", "text": "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?" + prompt_suffix},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=8192 * 2)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
### Generation Hyperparameters
#### VL
```bash
export greedy='false'
export top_p=0.95
export top_k=-1
export temperature=0.6
export out_seq_length=16384
```
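These shell variables map directly onto vLLM's `SamplingParams` keywords. A hedged sketch of that mapping (the helper and the greedy-mode fallback values are assumptions, not part of the released evaluation code):

```python
import os

def sampling_kwargs(env=None):
    """Translate the exported variables above into SamplingParams keywords."""
    env = os.environ if env is None else env
    greedy = env.get("greedy", "false").lower() == "true"
    return {
        # Greedy decoding in vLLM corresponds to temperature=0.0.
        "temperature": 0.0 if greedy else float(env.get("temperature", "0.6")),
        "top_p": 1.0 if greedy else float(env.get("top_p", "0.95")),
        "top_k": int(env.get("top_k", "-1")),
        "max_tokens": int(env.get("out_seq_length", "16384")),
    }

kw = sampling_kwargs({"greedy": "false", "temperature": "0.6", "top_p": "0.95",
                      "top_k": "-1", "out_seq_length": "16384"})
print(kw)  # -> {'temperature': 0.6, 'top_p': 0.95, 'top_k': -1, 'max_tokens': 16384}
```

The resulting dict can then be unpacked as `SamplingParams(**sampling_kwargs())`.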
## Citation
If you find our work helpful, please consider citing it:
```bibtex
@article{ding2025sherlock,
  title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
  author={Ding, Yi and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2505.22651},
  year={2025}
}
```