---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# Octopus-8B
Octopus-8B is built on Qwen3-VL-8B-Instruct and features self-correcting reasoning.
Paper: https://arxiv.org/pdf/2602.08503
Project Page: https://dripnowhy.github.io/Octopus/
Code: https://github.com/DripNowhy/Octopus
This is the weight repository for Octopus-8B.
---
## Model Performance

## Quickstart
Below, we provide simple examples showing how to use `Octopus-8B` with vLLM and 🤗 Transformers.
First, Qwen3-VL support has recently been merged into Hugging Face `transformers`, so we advise building from source:
```bash
pip install git+https://github.com/huggingface/transformers
# pip install transformers==4.57.0 # 4.57.0 is not yet released at the time of writing
```
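Before running the snippets below, you can sanity-check that your installed `transformers` actually exposes the Qwen3-VL model class. This is only a hedged sketch of such a check; the class name is the one used in the Transformers example below:

```python
import importlib.util


def has_qwen3_vl():
    """Return True if transformers is installed and exposes Qwen3-VL support."""
    if importlib.util.find_spec("transformers") is None:
        return False
    import transformers
    return hasattr(transformers, "Qwen3VLForConditionalGeneration")


ok = has_qwen3_vl()
print("Qwen3-VL support available:", ok)
```

If this prints `False`, reinstall `transformers` from source as shown above.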
### Using vLLM to Chat
Here is a code snippet showing how to chat with the model using `vllm`:
```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

MODEL_PATH = "Tuwhy/Octopus-8B"


def main():
    # Initialize model
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
        seed=1,
        max_model_len=8192 * 8,
        trust_remote_code=True
    )
    processor = AutoProcessor.from_pretrained(
        MODEL_PATH,
        max_pixels=1280 * 28 * 28,
        min_pixels=256 * 28 * 28
    )

    # Single case
    prompt = "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?"
    image_path = "./head.png"
    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=0.95,
        top_k=-1,
        max_tokens=8192 * 2
    )

    # Prepare messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt + prompt_suffix}
            ]
        }
    ]
    text_prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Load image
    image = Image.open(image_path).convert("RGB")

    # Prepare input
    inputs = {
        "prompt": text_prompt,
        "multi_modal_data": {
            "image": image
        }
    }

    # Generate
    outputs = llm.generate([inputs], sampling_params=sampling_params)

    # Print result
    generated_text = outputs[0].outputs[0].text
    print("Generated response:")
    print("=" * 50)
    print(generated_text)
    print("=" * 50)


if __name__ == '__main__':
    main()
```
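`llm.generate` also accepts a list of such input dicts, so several image-question pairs can be processed in one call. The sketch below only builds that batched input structure; the file names, questions, and the `render_prompt` chat-template placeholder are hypothetical stand-ins (in real use, each `"image"` value would be a loaded PIL image and the prompt would come from `processor.apply_chat_template`):

```python
prompt_suffix = "\n\n<...self-correction instructions as above...>"  # elided for brevity

cases = [  # hypothetical image/question pairs
    ("./chart1.png", "What is the highest bar in the chart?"),
    ("./chart2.png", "Which model achieves the best accuracy?"),
]


def build_inputs(cases, render_prompt):
    """render_prompt maps a raw question to a chat-templated prompt string."""
    batch = []
    for image_path, question in cases:
        batch.append({
            "prompt": render_prompt(question + prompt_suffix),
            "multi_modal_data": {"image": image_path},  # a PIL image in practice
        })
    return batch


# Placeholder template; real code would use processor.apply_chat_template here.
batch = build_inputs(cases, render_prompt=lambda q: f"<|user|>{q}<|assistant|>")
# llm.generate(batch, sampling_params=sampling_params) would then return one
# RequestOutput per case, in order.
```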
### Using 🤗 Transformers to Chat
Here is a code snippet showing how to chat with the model using `transformers`:
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Tuwhy/Octopus-8B", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios (requires `import torch`).
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Tuwhy/Octopus-8B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Tuwhy/Octopus-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./head.png",
            },
            {"type": "text", "text": "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?" + prompt_suffix},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192 * 2)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
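The `prompt_suffix` above makes the model emit a structured response: a `<think> </think>` monologue, a `\boxed{}` answer, and, when the model decides to self-correct, an empty `<self-correction></self-correction>` tag followed by a regenerated round. A small parser can split the raw text back into those rounds; this is a hedged sketch (not an official utility of the repo), and the sample response below is fabricated for illustration:

```python
import re


def parse_octopus_response(text):
    """Split an Octopus-style response into reasoning rounds.

    Each round has a <think>...</think> monologue and a \\boxed{...} answer;
    an empty <self-correction></self-correction> tag separates a discarded
    round from its regenerated replacement. Returns the list of rounds and
    the final answer (taken from the last round).
    """
    rounds = []
    for segment in re.split(r"<self-correction>\s*</self-correction>", text):
        think = re.search(r"<think>(.*?)</think>", segment, re.DOTALL)
        answer = re.search(r"\\boxed\{([^{}]*)\}", segment)
        rounds.append({
            "think": think.group(1).strip() if think else "",
            "answer": answer.group(1).strip() if answer else None,
        })
    return rounds, rounds[-1]["answer"]


# Fabricated sample output, just to exercise the parser.
sample = (
    "<think>The gap looks like about 4.</think> \\boxed{4\\%}"
    "<self-correction></self-correction>"
    "<think>Re-reading the chart, it is 5.4.</think> \\boxed{5.4\\%}"
)
rounds, final = parse_octopus_response(sample)
# Two rounds: the self-corrected second round supplies the final answer.
```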
### Generation Hyperparameters
#### VL
```bash
export greedy='false'
export top_p=0.95
export top_k=-1
export temperature=0.6
export out_seq_length=16384
```
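The exported variables above are evaluation settings rather than runnable code by themselves. The helper below sketches one way to turn them into vLLM `SamplingParams` keyword arguments; the env-var names come from the snippet above, while the mapping itself (e.g. treating `greedy=true` as temperature 0) is our interpretation:

```python
import os

# Mirror the exported evaluation settings so the helper has something to read.
os.environ.update({
    "greedy": "false",
    "top_p": "0.95",
    "top_k": "-1",
    "temperature": "0.6",
    "out_seq_length": "16384",
})


def sampling_kwargs_from_env(env=os.environ):
    """Translate the exported variables into SamplingParams-style kwargs."""
    greedy = env.get("greedy", "false").lower() == "true"
    return {
        # Greedy decoding corresponds to temperature 0 in vLLM's convention.
        "temperature": 0.0 if greedy else float(env.get("temperature", 1.0)),
        "top_p": float(env.get("top_p", 1.0)),
        "top_k": int(env.get("top_k", -1)),  # -1 disables top-k filtering
        "max_tokens": int(env.get("out_seq_length", 16384)),
    }


kwargs = sampling_kwargs_from_env()
# SamplingParams(**kwargs) would then reproduce the evaluation settings.
```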
## Citation
If you find our work helpful, please consider citing it.
```bibtex
@article{ding2025sherlock,
title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
author={Ding, Yi and Zhang, Ruqi},
journal={arXiv preprint arXiv:2505.22651},
year={2025}
}
```