---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: image-text-to-text
---

# Isaac-0.2-2B by Perceptron

Introducing the 2B-parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model. This release brings major upgrades: optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output, all while remaining fast, compact, and deployable.

[Try it on our demo! 🚀](https://www.perceptron.inc/demo) - [API Docs 📘](https://docs.perceptron.inc/) - [Discord 💬](https://discord.gg/fgBeaACQzE)

## Extending the efficient frontier of perception

Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices. From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint.

![image](https://cdn-uploads.huggingface.co/production/uploads/65526dfffb76980adeffa369/yQl-9BAxLud6hhK8gCKLt.png)

## What's New in Isaac 0.2

* **Reasoning via Thinking Traces**: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
* **Perceptive Tool Calling + Focus (Zoom & Crop)**: Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region, dramatically improving fine-grained perception (see the sketch after this list).
* **Structured Outputs**: More reliable structured output generation for consistent JSON and predictable downstream integration.
* **Complex OCR**: Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
* **Desktop Use**: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases.
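Conceptually, a Focus call boils down to cropping and upscaling a region of interest, then running a second pass on the crop. The sketch below shows that loop in plain Pillow; the `focus` helper, the file path, and the coordinates are illustrative, not part of the Isaac API (see the [API Docs](https://docs.perceptron.inc/) for the actual tool-call schema):

```python
from PIL import Image


def focus(image: Image.Image, box: tuple[int, int, int, int], upscale: int = 2) -> Image.Image:
    """Crop a region of interest and upscale it before a second pass.

    `box` is (left, top, right, bottom) in pixels, e.g. a bounding box the
    model returned when queried on the full image.
    """
    region = image.crop(box)
    return region.resize((region.width * upscale, region.height * upscale), Image.Resampling.LANCZOS)


# Two-pass flow: pass 1 localizes on the full image, pass 2 answers on the crop.
full = Image.open("street_scene.jpg")     # placeholder path
crop = focus(full, (420, 310, 640, 470))  # illustrative coordinates
# Feed `crop` back through the processor/model exactly as in the usage example below.
```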
## Performance Benchmarks

![image](https://cdn-uploads.huggingface.co/production/uploads/65526dfffb76980adeffa369/scKXlSu474L4r8-I6Ahau.png)

## Chatting with Isaac in 🤗 Transformers

Learn more at our [Huggingface Example Repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface), where we demo extracting and rendering points.

```bash
pip install perceptron
```

### Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.image_utils import load_image
from transformers.utils.import_utils import is_torch_cuda_available


def document_to_messages(document: list[dict]):
    """Split a mixed text/image document into chat messages and loaded images."""
    messages, images = [], []
    for item in document:
        if not (content := item.get("content")):
            continue
        role = item.get("role", "user")
        if item.get("type") == "image":
            images.append(load_image(content))
            messages.append({"role": role, "content": ""})
        elif item.get("type") == "text":
            messages.append({"role": role, "content": content})
    return messages, images


hf_path = "PerceptronAI/Isaac-0.2-2B-Preview"
device, dtype = ("cuda", torch.bfloat16) if is_torch_cuda_available() else ("cpu", torch.float32)

# Load model/processor from the checkpoint
processor = AutoProcessor.from_pretrained(hf_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    hf_path,
    trust_remote_code=True,
    vision_attn_implementation="flash_attention_2",  # requires the flash-attn package
)
model = model.to(device=device, dtype=dtype)
model.eval()

# Prepare input for generation
document = [
    {
        "type": "text",
        "content": "BOX",
        "role": "user",
    },
    {
        "type": "image",
        "content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp",
        "role": "user",
    },
    {
        "type": "text",
        "content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.",
        "role": "user",
    },
]
messages, images = document_to_messages(document)
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, images=images, return_tensors="pt")

# Generate text using the model
generated_ids = model.generate(
    tensor_stream=inputs["tensor_stream"].to(next(model.parameters()).device),
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.tokenizer.decode(
    generated_ids[0],
    skip_special_tokens=False,  # keep special tokens so grounding tags survive
)
print(f"\nFull generated output:\n{generated_text}")
```
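The example repo linked above demos extracting and rendering points from outputs like this one. As a minimal sketch, assuming the model emits point tags of the form `<point x="..." y="...">label</point>` (an assumption; check the example repo for the exact tag format), you could pull coordinates out of `generated_text` with a regex:

```python
import re

# Assumed tag format; verify against the Huggingface Example Repo before relying on it.
POINT_RE = re.compile(r'<point\s+x="(\d+)"\s+y="(\d+)">([^<]*)</point>')


def extract_points(generated_text: str) -> list[dict]:
    """Return [{'x': int, 'y': int, 'label': str}, ...] for every point tag found."""
    return [
        {"x": int(x), "y": int(y), "label": label.strip()}
        for x, y, label in POINT_RE.findall(generated_text)
    ]


print(extract_points(generated_text))  # `generated_text` from the usage example above
```

Decoding with `skip_special_tokens=False`, as in the usage example, is what keeps tags like these visible in the output.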