	---
	license: apache-2.0
	pipeline_tag: image-text-to-text
	library_name: transformers
	---

	# Octopus-8B

	Octopus-8B is built on Qwen3-VL-8B-Instruct and adds self-correction reasoning ability.

	Paper: https://arxiv.org/pdf/2602.08503

	Project Page: https://dripnowhy.github.io/Octopus/

	Code: https://github.com/DripNowhy/Octopus

	This is the weight repository for Octopus-8B.


	---

	## Model Performance


	![](head.png)


	## Quickstart

	The examples below show how to run $\texttt{Octopus-8B}$ with vLLM and 🤗 Transformers.

	Qwen3-VL support is available in the latest Hugging Face `transformers`. We recommend installing it from source:
	```
	pip install git+https://github.com/huggingface/transformers
	# pip install transformers==4.57.0 # currently, V4.57.0 is not released
	```
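Since Qwen3-VL support depends on which `transformers` build is installed, it can help to check for it before downloading the weights. The helper below is a minimal sketch; the function name `has_qwen3_vl_support` is ours and is not part of any library. It only tests whether the `Qwen3VLForConditionalGeneration` class used later in this README can be resolved.

```python
import importlib


def has_qwen3_vl_support() -> bool:
    """Return True if the installed transformers exposes Qwen3VLForConditionalGeneration."""
    try:
        transformers = importlib.import_module("transformers")
    except ImportError:
        # transformers is not installed at all
        return False
    try:
        # transformers resolves top-level names lazily, so access the attribute explicitly
        getattr(transformers, "Qwen3VLForConditionalGeneration")
    except Exception:
        # Missing class or a backend (e.g. torch) that failed to import
        return False
    return True


if __name__ == "__main__":
    print("Qwen3-VL support available:", has_qwen3_vl_support())
```

If this prints `False`, reinstalling `transformers` from source as shown above is the first thing to try.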

	### Using vLLM to Chat

	The following snippet runs the chat model with `vllm`:
	```python
	from vllm import LLM, SamplingParams
	from transformers import AutoProcessor
	from PIL import Image

	prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

	MODEL_PATH = "Tuwhy/Octopus-8B"

	def main():
	    # Initialize the vLLM engine
	    llm = LLM(
	        model=MODEL_PATH,
	        tensor_parallel_size=1,
	        gpu_memory_utilization=0.9,
	        seed=1,
	        max_model_len=8192 * 8,
	        trust_remote_code=True,
	    )

	    # The processor is only used here to build the chat-formatted prompt
	    processor = AutoProcessor.from_pretrained(
	        MODEL_PATH,
	        max_pixels=1280 * 28 * 28,
	        min_pixels=256 * 28 * 28,
	    )

	    # Single case
	    prompt = "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?"
	    image_path = "./head.png"

	    sampling_params = SamplingParams(
	        temperature=1.0,
	        top_p=0.95,
	        top_k=-1,
	        max_tokens=8192 * 2,
	    )

	    # Prepare messages
	    messages = [
	        {
	            "role": "user",
	            "content": [
	                {"type": "image", "image": image_path},
	                {"type": "text", "text": prompt + prompt_suffix},
	            ],
	        }
	    ]

	    # Render the chat template to a plain string; vLLM tokenizes it itself
	    text_prompt = processor.apply_chat_template(
	        messages,
	        tokenize=False,
	        add_generation_prompt=True,
	    )

	    # Load image
	    image = Image.open(image_path).convert("RGB")

	    # Prepare input
	    inputs = {
	        "prompt": text_prompt,
	        "multi_modal_data": {"image": image},
	    }

	    # Generate
	    outputs = llm.generate([inputs], sampling_params=sampling_params)

	    # Print result
	    generated_text = outputs[0].outputs[0].text

	    print("Generated response:")
	    print("=" * 50)
	    print(generated_text)
	    print("=" * 50)


	if __name__ == '__main__':
	    main()

	```

	### Using 🤗 Transformers to Chat

	The following snippet runs the chat model with `transformers`:

	```python
	from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

	prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate <self-correction> </self-correction> tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within <think> </think> tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

	# default: Load the model on the available device(s)
	model = Qwen3VLForConditionalGeneration.from_pretrained(
	    "Tuwhy/Octopus-8B", dtype="auto", device_map="auto"
	)

	# We recommend enabling flash_attention_2 for better acceleration and memory saving,
	# especially in multi-image and video scenarios (requires `import torch`).
	# model = Qwen3VLForConditionalGeneration.from_pretrained(
	#     "Tuwhy/Octopus-8B",
	#     dtype=torch.bfloat16,
	#     attn_implementation="flash_attention_2",
	#     device_map="auto",
	# )

	processor = AutoProcessor.from_pretrained("Tuwhy/Octopus-8B")

	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "./head.png",
	            },
	            {"type": "text", "text": "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?" + prompt_suffix},
	        ],
	    }
	]

	# Preparation for inference
	inputs = processor.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_dict=True,
	    return_tensors="pt",
	)
	inputs = inputs.to(model.device)

	# Inference: generate, then strip the prompt tokens from each output sequence
	generated_ids = model.generate(**inputs, max_new_tokens=8192 * 2)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
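Both snippets return raw text in the layout requested by `prompt_suffix`: a `<think>` block, a `\boxed{}` answer, and optionally an empty `<self-correction> </self-correction>` pair followed by a fresh attempt. The sketch below shows one way to split such a response into attempts. The helper names `extract_boxed` and `parse_response` are hypothetical, and treating the last attempt as the final answer is our assumption rather than something this README specifies.

```python
import re

# Empty self-correction tag pair that separates successive attempts
SELF_CORRECTION = re.compile(r"<self-correction>\s*</self-correction>")
THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_boxed(text: str) -> list:
    """Return the contents of every \\boxed{...} in text, handling nested braces."""
    results = []
    marker = "\\boxed{"
    start = text.find(marker)
    while start != -1:
        i = start + len(marker)
        depth = 1
        # Walk forward until the brace opened by \boxed{ is closed
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:
            results.append(text[start + len(marker): i - 1])
        start = text.find(marker, i)
    return results


def parse_response(text: str) -> list:
    """Split a response into attempts, each with its reasoning and boxed answer."""
    attempts = []
    for chunk in SELF_CORRECTION.split(text):
        think = THINK.search(chunk)
        boxed = extract_boxed(chunk)
        attempts.append({
            "reasoning": think.group(1).strip() if think else None,
            # If several boxes appear in one attempt, keep the last one
            "answer": boxed[-1] if boxed else None,
        })
    return attempts


if __name__ == "__main__":
    sample = (
        "<think>2+2=5</think>\n\\boxed{5}\n"
        "<self-correction> </self-correction>\n"
        "<think>2+2=4</think>\n\\boxed{\\frac{8}{2}}"
    )
    attempts = parse_response(sample)
    print(len(attempts), "attempt(s); final answer:", attempts[-1]["answer"])
```

With the vLLM snippet you would pass `generated_text` to `parse_response`; with the Transformers snippet, `output_text[0]`.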

	### Generation Hyperparameters
	#### VL
	```bash
	export greedy='false'
	export top_p=0.95
	export top_k=-1
	export temperature=0.6
	export out_seq_length=16384
	```
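This README does not say which script reads these variables, so the mapping below is an assumption: a minimal sketch that turns them into keyword arguments matching vLLM's `SamplingParams` fields (`temperature`, `top_p`, `top_k`, `max_tokens`). The function name `sampling_kwargs_from_env` is hypothetical, and treating `greedy='true'` as `temperature=0` is our interpretation.

```python
import os


def sampling_kwargs_from_env(env=None) -> dict:
    """Build sampling keyword arguments from the variables listed above.

    Assumption: `greedy`, `temperature`, `top_p`, `top_k` and `out_seq_length`
    correspond to greedy decoding, temperature, top_p, top_k and max_tokens.
    """
    env = os.environ if env is None else env
    greedy = env.get("greedy", "false").strip().lower() == "true"
    return {
        # Temperature 0 is how vLLM expresses greedy decoding
        "temperature": 0.0 if greedy else float(env.get("temperature", "0.6")),
        "top_p": 1.0 if greedy else float(env.get("top_p", "0.95")),
        # -1 disables top-k filtering
        "top_k": -1 if greedy else int(env.get("top_k", "-1")),
        "max_tokens": int(env.get("out_seq_length", "16384")),
    }


if __name__ == "__main__":
    print(sampling_kwargs_from_env())
```

The result can then be unpacked as `SamplingParams(**sampling_kwargs_from_env())` in the vLLM snippet.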

	## Citation

	If you find our work helpful, please consider citing it.

	```bibtex
	@article{ding2025sherlock,
	title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
	author={Ding, Yi and Zhang, Ruqi},
	journal={arXiv preprint arXiv:2505.22651},
	year={2025}
	}
	```