---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Octopus-8B

Octopus-8B is built on Qwen3-VL-8B-Instruct and features self-correcting reasoning: the model can flag its own draft answer with `<self-correction>` tags and regenerate its reasoning and answer from scratch.

Paper: https://arxiv.org/pdf/2602.08503

Project Page: https://dripnowhy.github.io/Octopus/

Code: https://github.com/DripNowhy/Octopus

This is the weight repository for Octopus-8B.


---

## Model Performance


![Octopus-8B benchmark performance](head.png)


## Quickstart

Below, we provide simple examples showing how to use Octopus-8B with vLLM and 🤗 Transformers.

First, Qwen3-VL support is included in the latest Hugging Face `transformers`, so we advise building from source:
```bash
pip install git+https://github.com/huggingface/transformers
# pip install transformers==4.57.0  # currently, v4.57.0 is not yet released
```

### Using vLLM to Chat

The following snippet shows how to chat with the model using `vllm`:
```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

MODEL_PATH = "Tuwhy/Octopus-8B"

def main():
    # Initialize model
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
        seed=1,
        max_model_len=8192 * 8,
        trust_remote_code=True
    )

    processor = AutoProcessor.from_pretrained(
        MODEL_PATH,
        max_pixels=1280*28*28,
        min_pixels=256*28*28
    )

    # Single case
    prompt = "What is the accuracy gap between Octopus-8B and the Qwen3-8B-VL-Thinking model?"
    image_path = "./head.png"

    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=0.95,
        top_k=-1,
        max_tokens=8192*2
    )

    # Prepare messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt + prompt_suffix}
            ]
        }
    ]

    text_prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Load image
    image = Image.open(image_path).convert("RGB")

    # Prepare input
    inputs = {
        "prompt": text_prompt,
        "multi_modal_data": {
            "image": image
        }
    }

    # Generate
    outputs = llm.generate([inputs], sampling_params=sampling_params)

    # Print result
    generated_text = outputs[0].outputs[0].text

    print("Generated response:")
    print("=" * 50)
    print(generated_text)
    print("=" * 50)

if __name__ == '__main__':
    main()

```

### Using 🤗 Transformers to Chat

The following snippet shows how to chat with the model using `transformers`:

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Tuwhy/Octopus-8B", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory savings, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Tuwhy/Octopus-8B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Tuwhy/Octopus-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./head.png",
            },
            {"type": "text", "text": "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?" + prompt_suffix},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192*2)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
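The prompt suffix above instructs the model to wrap reasoning in `<think>` tags, place the answer in `\boxed{}`, and optionally emit empty `<self-correction></self-correction>` tags before regenerating. A minimal post-processing sketch (the helper names and the demo string are our own, not part of the release):

```python
import re

def extract_final_answer(response: str):
    """Return the last \\boxed{...} answer in a response, or None.

    If the model emitted <self-correction></self-correction>, the text after
    the tags contains a regenerated reasoning trace and answer, so the last
    \\boxed{...} is the one to keep.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def was_self_corrected(response: str) -> bool:
    """True if the model chose to regenerate its answer."""
    return "<self-correction>" in response

# Hypothetical response illustrating the format described in prompt_suffix.
demo = (
    "<think>First pass.</think> \\boxed{7.2}"
    "<self-correction> </self-correction>"
    "<think>Second pass.</think> \\boxed{8.6}"
)
print(extract_final_answer(demo))  # -> 8.6 (the regenerated answer)
print(was_self_corrected(demo))    # -> True
```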

### Generation Hyperparameters
#### Vision-language tasks
```bash
export greedy='false'
export top_p=0.95
export top_k=-1
export temperature=0.6
export out_seq_length=16384
```
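These are shell variables; one way to consume them from Python is sketched below. The dictionary keys are vLLM-style names of our choosing (`top_k=-1` disables top-k filtering in vLLM), and the defaults mirror the exported values:

```python
import os

# Fall back to the exported defaults above when a variable is unset.
greedy = os.environ.get("greedy", "false").lower() == "true"

sampling_kwargs = {
    # Greedy decoding means temperature 0; otherwise use the exported value.
    "temperature": 0.0 if greedy else float(os.environ.get("temperature", "0.6")),
    "top_p": float(os.environ.get("top_p", "0.95")),
    "top_k": int(os.environ.get("top_k", "-1")),  # -1 disables top-k in vLLM
    "max_tokens": int(os.environ.get("out_seq_length", "16384")),
}
print(sampling_kwargs)
```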

## Citation

If you find our work helpful, please consider citing it:

```bibtex
@article{ding2025sherlock,
  title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
  author={Ding, Yi and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2505.22651},
  year={2025}
}
```