Isaac-0.2-1B by Perceptron
Introducing the 1B-parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model.
This release brings major upgrades: optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output, all while remaining fast, compact, and deployable.
Extending the efficient frontier of perception
Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices. From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint.
What's New in Isaac 0.2
Reasoning via Thinking Traces: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
Perceptive Tool Calling + Focus (Zoom & Crop): Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region, dramatically improving fine-grained perception.
Structured Outputs: More reliable structured output generation for consistent JSON and predictable downstream integration.
Complex OCR: Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
Desktop Use: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases.
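To make the zoom-and-crop idea behind Focus concrete, here is a minimal sketch of how a caller might turn a coarse detection into a padded crop region before re-querying on the smaller area. It is not Isaac's actual Focus interface: the `Box` type, the `focus_region` helper, and the `pad` parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel box (hypothetical type): left/top inclusive, right/bottom exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def focus_region(box: Box, image_w: int, image_h: int, pad: float = 0.25) -> Box:
    """Expand `box` by `pad` of its width/height on each side, clamped to the image bounds."""
    dx = round((box.right - box.left) * pad)
    dy = round((box.bottom - box.top) * pad)
    return Box(
        left=max(0, box.left - dx),
        top=max(0, box.top - dy),
        right=min(image_w, box.right + dx),
        bottom=min(image_h, box.bottom + dy),
    )


# A small 80x40 region in a 1920x1080 frame, padded by 25% on each side.
region = focus_region(Box(100, 50, 180, 90), image_w=1920, image_h=1080)
print(region)
```

With Pillow, the resulting region could be passed to `image.crop((region.left, region.top, region.right, region.bottom))`, and the cropped image sent back to the model as a new query. The padding keeps some surrounding context so that the second pass does not lose cues just outside the original box.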
Performance Benchmarks
Chatting with Isaac in 🤗 Transformers
Learn more at our Hugging Face Example Repo, where we demo extracting and rendering points.
pip install perceptron
Usage
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.image_utils import load_image
from transformers.utils.import_utils import is_torch_cuda_available
def document_to_messages(document: list[dict]):
    # Split a mixed text/image document into chat messages and loaded images
    messages, images = [], []
    for item in document:
        # Skip entries that carry no content
        if not (content := item.get("content")):
            continue
        role = item.get("role", "user")
        if item.get("type") == "image":
            images.append(load_image(content))
            # Each image is referenced in the chat by an <image> placeholder
            messages.append({"role": role, "content": "<image>"})
        elif item.get("type") == "text":
            messages.append({"role": role, "content": content})
    return messages, images
hf_path = "PerceptronAI/Isaac-0.2-1B"
device, dtype = ("cuda", torch.bfloat16) if is_torch_cuda_available() else ("cpu", torch.float32)
# Load model/processor from the checkpoint
processor = AutoProcessor.from_pretrained(hf_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    hf_path, trust_remote_code=True, vision_attn_implementation="flash_attention_2"
)
model = model.to(device=device, dtype=dtype)
model.eval()
# Prepare input for generation
document = [
    {
        "type": "text",
        "content": "<hint>BOX</hint>",
        "role": "user",
    },
    {
        "type": "image",
        "content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp",
        "role": "user",
    },
    {
        "type": "text",
        "content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.",
        "role": "user",
    },
]
messages, images = document_to_messages(document)
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=text, images=images, return_tensors="pt")
# Generate text using the model
generated_ids = model.generate(
    tensor_stream=inputs["tensor_stream"].to(next(model.parameters()).device),
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.tokenizer.decode(
    generated_ids[0], skip_special_tokens=False
)
print(f"\nFull generated output:\n{generated_text}")
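When the prompt asks for JSON, as in the structured-output use case described earlier, the decoded text may still contain special tokens or surrounding prose because `skip_special_tokens=False` is used above. The following is a minimal sketch of defensive parsing using only the standard library. The `extract_json` helper and the example payload are hypothetical; the model's exact output format is not specified here.

```python
import json
from typing import Optional


def extract_json(generated_text: str) -> Optional[dict]:
    """Return the first JSON object embedded in the text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = generated_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text
            obj, _ = decoder.raw_decode(generated_text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next candidate
        start = generated_text.find("{", start + 1)
    return None


# Hypothetical model output with prose before and after the JSON payload
sample = 'Answer: {"safe": false, "reason": "red light"} end of output'
parsed = extract_json(sample)
print(parsed)
```

Returning `None` instead of raising lets downstream code decide whether to retry the generation or fall back to the raw text.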

