---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Octopus-8B

Octopus-8B is built on Qwen3-VL-8B-Instruct and features self-correcting reasoning.

Paper: https://arxiv.org/pdf/2602.08503

Project Page: https://dripnowhy.github.io/Octopus/

Code: https://github.com/DripNowhy/Octopus

This is the weight repository for Octopus-8B.

---

## Model Performance

![](head.png)

## Quickstart

Below are simple examples showing how to use Octopus-8B with vLLM and 🤗 Transformers.

Qwen3-VL support has landed in the latest Hugging Face Transformers, so we advise installing from source:

```
pip install git+https://github.com/huggingface/transformers
# pip install transformers==4.57.0  # currently, v4.57.0 is not released
```

### Using vLLM to Chat

Here is a code snippet showing how to chat with the model via `vllm`:

```python
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

MODEL_PATH = "Tuwhy/Octopus-8B"


def main():
    # Initialize the model
    llm = LLM(
        model=MODEL_PATH,
        tensor_parallel_size=1,
        gpu_memory_utilization=0.9,
        seed=1,
        max_model_len=8192 * 8,
        trust_remote_code=True
    )

    processor = AutoProcessor.from_pretrained(
        MODEL_PATH,
        max_pixels=1280*28*28,
        min_pixels=256*28*28
    )

    # Single example
    prompt = "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?"
    image_path = "./head.png"

    sampling_params = SamplingParams(
        temperature=1.0,
        top_p=0.95,
        top_k=-1,
        max_tokens=8192*2
    )

    # Prepare the messages
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt + prompt_suffix}
            ]
        }
    ]

    text_prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # Load the image
    image = Image.open(image_path).convert("RGB")

    # Prepare the input
    inputs = {
        "prompt": text_prompt,
        "multi_modal_data": {
            "image": image
        }
    }

    # Generate
    outputs = llm.generate([inputs], sampling_params=sampling_params)

    # Print the result
    generated_text = outputs[0].outputs[0].text
    print("Generated response:")
    print("=" * 50)
    print(generated_text)
    print("=" * 50)


if __name__ == '__main__':
    main()
```

### Using 🤗 Transformers to Chat

Here is a code snippet showing how to chat with the model via `transformers`:

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

prompt_suffix = """\n\nYou first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. If you believe the answer can be further enhanced, generate tags enclosed with no content, and regenerate a new reasoning process and a new answer from scratch after that. The new response should first think through your reasoning process as an internal monologue, enclosed within tags. Then, provide your final answer enclosed within \\boxed{}. All reasoning, answer steps must be included without omission."""

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Tuwhy/Octopus-8B", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory
# savings, especially in multi-image and video scenarios:
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "Tuwhy/Octopus-8B",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("Tuwhy/Octopus-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./head.png",
            },
            {
                "type": "text",
                "text": "The accuracy gap between the Octopus-8B and the Qwen3-8B-VL-Thinking model is?" + prompt_suffix,
            },
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=8192*2)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### Generation Hyperparameters

#### VL

```bash
export greedy='false'
export top_p=0.95
export top_k=-1
export temperature=0.6
export out_seq_length=16384
```

## Citation

If you find our work helpful, feel free to cite us:

```bibtex
@article{ding2025sherlock,
  title={Sherlock: Self-Correcting Reasoning in Vision-Language Models},
  author={Ding, Yi and Zhang, Ruqi},
  journal={arXiv preprint arXiv:2505.22651},
  year={2025}
}
```
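
## Parsing the Output

The prompt suffix used in both snippets asks the model to wrap each final answer in `\boxed{}`, and because of the self-correction behavior a single response may contain several boxed answers, with the last one being final. A minimal sketch of extracting that answer from the generated text (the helper name and regex here are our own, not part of the Octopus codebase):

```python
import re

def extract_boxed_answer(text):
    """Return the contents of the last \\boxed{...} in `text`, or None.

    Handles one level of nested braces, enough for answers such as
    \\boxed{\\frac{1}{2}}. Takes the last match, since a self-corrected
    response may contain an earlier, superseded answer.
    """
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed_answer("The gap is \\boxed{4.2}"))  # -> 4.2
```

If an answer contains deeper brace nesting than this regex tolerates, a small brace-matching scanner would be needed instead.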