metadata
license: cc-by-nc-4.0
language:
  - en

Isaac-0.2-2B by Perceptron

Introducing the 2B parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model.

This release brings major upgrades: optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output, all while remaining fast, compact, and deployable.

Extending the efficient frontier of perception

Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices. From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint.


What's New in Isaac 0.2

  • Reasoning via Thinking Traces: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.

  • Perceptive Tool Calling + Focus (Zoom & Crop): Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region, dramatically improving fine-grained perception.

  • Structured Outputs: More reliable structured output generation for consistent JSON and predictable downstream integration.

  • Complex OCR: Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.

  • Desktop Use: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases.
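To make the zoom-and-crop idea behind Focus concrete, here is a minimal sketch of the geometry only: turning a region of interest into a padded pixel crop that could be sent back to the model. It does not reproduce Isaac's actual tool-calling interface. The function name `focus_crop_box`, the `padding` parameter, and the use of normalized `[0, 1]` coordinates are all assumptions made for illustration.

```python
def focus_crop_box(box, image_size, padding=0.1):
    """Return a pixel crop (left, top, right, bottom) around a region.

    Hypothetical helper, not part of Isaac or the perceptron package.
    `box` is (x1, y1, x2, y2) in normalized [0, 1] coordinates (assumed).
    `image_size` is (width, height) in pixels.
    `padding` widens the region by a fraction of its own size on each side.
    """
    width, height = image_size
    x1, y1, x2, y2 = box

    # Pad relative to the box size so small regions keep some context.
    pad_x = (x2 - x1) * padding
    pad_y = (y2 - y1) * padding

    # Clamp to the image bounds before converting to pixels.
    left = max(0.0, x1 - pad_x)
    top = max(0.0, y1 - pad_y)
    right = min(1.0, x2 + pad_x)
    bottom = min(1.0, y2 + pad_y)

    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Example: crop the lower-right quadrant's upper-left part of a 1000x500 image.
crop = focus_crop_box((0.5, 0.5, 0.75, 0.75), (1000, 500), padding=0.0)
print(crop)
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the cropped region could be passed back to the model as a new image in a follow-up query.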

Performance Benchmarks

image

Chatting with Isaac in 🤗 Transformers

Learn more at our Hugging Face example repo, which demonstrates extracting and rendering points.

pip install perceptron

Usage

from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.utils.import_utils import is_torch_cuda_available
from transformers.image_utils import load_image

def document_to_messages(document: list[dict]):
    messages, images = [], []
    for item in document:
        if not (content := item.get("content")):
            continue
        role = item.get("role", "user")
        if item.get("type") == "image":
            images.append(load_image(content))
            messages.append({"role": role, "content": "<image>"})
        elif item.get("type") == "text":
            messages.append({"role": role, "content": content})
    return messages, images

# Load model/processor from the checkpoint
checkpoint_path = "PerceptronAI/Isaac-0.2-2B-Preview"
processor = AutoProcessor.from_pretrained(checkpoint_path, trust_remote_code=True)
device, dtype = ("cuda", "bfloat16") if is_torch_cuda_available() else ("cpu", "float32")
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    trust_remote_code=True,
    vision_attn_implementation="flash_attention_2",
    dtype=dtype,
).to(device=device)

document = [
    {
        "type": "text",
        "content": "<hint>BOX</hint>",
        "role": "user",
    },
    {
        "type": "image",
        "content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp",
        "role": "user",
    },
    {
        "type": "text",
        "content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.",
        "role": "user",
    },
]

# Prepare inputs for generation
messages, images = document_to_messages(document)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt")

# Generation
generated_ids = model.generate(
    tensor_stream=inputs["tensor_stream"].to(device),
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=False)
print(f"\n Output: {generated_text}")
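Because the prompt above requests boxes via `<hint>BOX</hint>` and decoding keeps special tokens, the generated text will likely contain grounding tags alongside the answer. The sketch below shows one way such tags might be pulled out with a regular expression. The tag format used here (`<point_box mention="..."> (x1,y1) (x2,y2) </point_box>`) is an assumption for illustration, as are the `Box` class and the `extract_boxes` helper; check the example repo for the actual output format and the official parsing utilities.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    """One grounded region parsed from model output (hypothetical type)."""
    mention: Optional[str]
    top_left: tuple
    bottom_right: tuple


# ASSUMED tag format: <point_box mention="label"> (x1,y1) (x2,y2) </point_box>
_BOX_RE = re.compile(
    r'<point_box(?:\s+mention="(?P<mention>[^"]*)")?[^>]*>\s*'
    r'\((?P<x1>\d+)\s*,\s*(?P<y1>\d+)\)\s*'
    r'\((?P<x2>\d+)\s*,\s*(?P<y2>\d+)\)\s*'
    r'</point_box>'
)


def extract_boxes(text: str) -> list:
    """Return every box found in `text`, in order of appearance."""
    return [
        Box(
            mention=m["mention"],  # None when the tag has no mention attribute
            top_left=(int(m["x1"]), int(m["y1"])),
            bottom_right=(int(m["x2"]), int(m["y2"])),
        )
        for m in _BOX_RE.finditer(text)
    ]


# Made-up sample string in the assumed format, not real model output.
sample = (
    'Wait for the signal. '
    '<point_box mention="traffic light"> (808,247) (863,386) </point_box>'
)
boxes = extract_boxes(sample)
print(boxes)
```

In the usage example, `extract_boxes(generated_text)` would be called on the decoded string; whether the coordinates are pixels or a normalized grid depends on the model's conventions and is not assumed here.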