---
license: cc-by-nc-4.0
language:
- en
---
# Isaac-0.2-2B by Perceptron
Introducing the 2B parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model.
This release brings major upgrades — optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output — while remaining fast, compact, and deployable.
## Extending the efficient frontier of perception
Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices.
From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint.

## What's New in Isaac 0.2
* **Reasoning via Thinking Traces**: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
* **Perceptive Tool Calling + Focus (Zoom & Crop)**: Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region — dramatically improving fine-grained perception.
* **Structured Outputs**: More reliable structured output generation for consistent JSON and predictable downstream integration.
* **Complex OCR**: Improved text recognition across cluttered, low-resolution, or distorted regions — enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
* **Desktop Use**: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases.
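The Focus mechanism above boils down to cropping a region of interest and re-querying the model on the enlarged crop. As a rough, model-agnostic sketch of that cropping step (the `focus_crop` helper, its box format, and the padding parameter are our own illustration, not Perceptron's actual tool-call API):

```python
from PIL import Image


def focus_crop(image: Image.Image, box: tuple[int, int, int, int], pad: float = 0.1) -> Image.Image:
    """Crop `box` (left, top, right, bottom) with relative padding, clamped to image bounds.

    Hypothetical helper for illustration only: Isaac's real Focus tool call
    and its coordinate format are defined by the Perceptron runtime.
    """
    left, top, right, bottom = box
    pad_w, pad_h = (right - left) * pad, (bottom - top) * pad
    left = max(0, int(left - pad_w))
    top = max(0, int(top - pad_h))
    right = min(image.width, int(right + pad_w))
    bottom = min(image.height, int(bottom + pad_h))
    return image.crop((left, top, right, bottom))


if __name__ == "__main__":
    img = Image.new("RGB", (640, 480))
    crop = focus_crop(img, (100, 100, 200, 200))
    print(crop.size)  # (120, 120)
```

The cropped region would then be fed back through the processor as a second image turn, letting the model resolve detail lost at the original resolution.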
## Performance Benchmarks

## Chatting with Isaac in 🤗 Transformers
Learn more at our [Hugging Face Example Repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface), where we demo extracting and rendering points.
```bash
pip install perceptron
```
### Usage
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.utils.import_utils import is_torch_cuda_available
from transformers.image_utils import load_image


def document_to_messages(document: list[dict]):
    messages, images = [], []
    for item in document:
        if not (content := item.get("content")):
            continue
        role = item.get("role", "user")
        if item.get("type") == "image":
            images.append(load_image(content))
            messages.append({"role": role, "content": "<image>"})
        elif item.get("type") == "text":
            messages.append({"role": role, "content": content})
    return messages, images


# Load model/processor from the checkpoint
checkpoint_path = "PerceptronAI/Isaac-0.2-2B-Preview"
processor = AutoProcessor.from_pretrained(checkpoint_path, trust_remote_code=True)
device, dtype = ("cuda", "bfloat16") if is_torch_cuda_available() else ("cpu", "float32")
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    trust_remote_code=True,
    vision_attn_implementation="flash_attention_2",
    dtype=dtype,
).to(device=device)

document = [
    {
        "type": "text",
        "content": "<hint>BOX</hint>",
        "role": "user",
    },
    {
        "type": "image",
        "content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp",
        "role": "user",
    },
    {
        "type": "text",
        "content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.",
        "role": "user",
    },
]

# Prepare inputs for generation
messages, images = document_to_messages(document)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt")

# Generation
generated_ids = model.generate(
    tensor_stream=inputs["tensor_stream"].to(device),
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=False)
print(f"\nOutput: {generated_text}")
```
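Because the prompt hints `BOX`, the decoded text carries grounded box coordinates. The official extraction and rendering helpers live in the example repo linked above; the snippet below is only a schematic post-processing sketch against a made-up tag syntax (the `<box …>` format and `extract_boxes` helper are illustrative assumptions, not Isaac's real output grammar):

```python
import re

# Hypothetical tag format for illustration; the real Isaac box syntax is
# defined by the Perceptron tooling (see the example repo for parsing code).
BOX_TAG = re.compile(r'<box x1="(\d+)" y1="(\d+)" x2="(\d+)" y2="(\d+)">')


def extract_boxes(generated_text: str) -> list[tuple[int, int, int, int]]:
    """Pull (x1, y1, x2, y2) integer boxes out of a decoded model response."""
    return [tuple(map(int, match)) for match in BOX_TAG.findall(generated_text)]


sample = 'Crosswalk signal: <box x1="12" y1="40" x2="96" y2="120"> shows a walk sign.'
print(extract_boxes(sample))  # [(12, 40, 96, 120)]
```

Once parsed, the boxes can be drawn over the input image or handed to the Focus flow for a follow-up crop-and-re-query pass.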