---
license: cc-by-nc-4.0
language:
- en
---
# Isaac-0.2-1B by Perceptron

Introducing the 1B parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model.

This release brings major upgrades: optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output, all while remaining fast, compact, and deployable.

## Extending the efficient frontier of perception

Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices. From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint.

## What's New in Isaac 0.2

* **Reasoning via Thinking Traces**: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
* **Perceptive Tool Calling + Focus (Zoom & Crop)**: Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region, dramatically improving fine-grained perception; a minimal sketch of this loop follows the list.
* **Structured Outputs**: More reliable structured output generation for consistent JSON and predictable downstream integration.
* **Complex OCR**: Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
* **Desktop Use**: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases.
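
To make the Focus idea concrete, here is a minimal, illustrative sketch of the zoom-and-crop step, assuming you drive the loop yourself with Pillow: take a region of interest (from a model response or your own heuristic), crop it with a padding margin, and re-run the same chat flow on the crop. The `focus` helper and the coordinates are hypothetical, not Isaac's built-in tool-call schema.

```python
from PIL import Image

def focus(image: Image.Image, box: tuple[int, int, int, int], pad: float = 0.1) -> Image.Image:
    """Crop to a region of interest with a padding margin, clamped to image bounds."""
    x1, y1, x2, y2 = box
    dx, dy = int((x2 - x1) * pad), int((y2 - y1) * pad)
    x1, y1 = max(0, x1 - dx), max(0, y1 - dy)
    x2, y2 = min(image.width, x2 + dx), min(image.height, y2 + dy)
    return image.crop((x1, y1, x2, y2))

# Hypothetical example: zoom into a pedestrian-signal region, then pass the
# crop back through the same processor/generate flow shown in Usage below.
full = Image.open("street.jpg")
signal_crop = focus(full, (420, 120, 560, 300))
```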

## Performance Benchmarks

## Chatting with Isaac in 🤗 Transformers
Learn more at our [Huggingface Example Repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface), where we demo extracting and rendering points.

```bash
pip install perceptron
```

### Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.utils.import_utils import is_torch_cuda_available
from transformers.image_utils import load_image

def document_to_messages(document: list[dict]):
    """Split a document into chat messages plus the images they reference."""
    messages, images = [], []
    for item in document:
        if not (content := item.get("content")):
            continue
        role = item.get("role", "user")
        if item.get("type") == "image":
            images.append(load_image(content))
            messages.append({"role": role, "content": "<image>"})
        elif item.get("type") == "text":
            messages.append({"role": role, "content": content})
    return messages, images

# Load model/processor from the checkpoint
checkpoint_path = "PerceptronAI/Isaac-0.2-1B-Preview"
processor = AutoProcessor.from_pretrained(checkpoint_path, trust_remote_code=True)
device, dtype = ("cuda", "bfloat16") if is_torch_cuda_available() else ("cpu", "float32")
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path, trust_remote_code=True, vision_attn_implementation="flash_attention_2", dtype=dtype
).to(device=device)

# <hint>BOX</hint> asks the model to ground its answer with box annotations
document = [
    {
        "type": "text",
        "content": "<hint>BOX</hint>",
        "role": "user",
    },
    {
        "type": "image",
        "content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp",
        "role": "user",
    },
    {
        "type": "text",
        "content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.",
        "role": "user",
    },
]

# Prepare inputs for generation
messages, images = document_to_messages(document)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt")

# Greedy generation; keep special tokens so the grounding markup stays visible
generated_ids = model.generate(
    tensor_stream=inputs["tensor_stream"].to(device),
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=False)
print(f"\nOutput: {generated_text}")
```
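
Because `skip_special_tokens=False` keeps the grounding markup in the decoded string, you will likely want to parse it before display. The exact tag grammar is defined by the model's chat format; the snippet below is a hypothetical sketch that assumes boxes arrive as inline XML-style tags carrying comma-separated pixel coordinates. See the [Huggingface Example Repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface) for the canonical extraction and rendering utilities.

```python
import re

# Hypothetical tag format; the real grammar lives in the example repo.
BOX_TAG = re.compile(r"<box>\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*</box>")

def extract_boxes(text: str) -> list[tuple[float, ...]]:
    """Pull (x1, y1, x2, y2) tuples out of generated text."""
    return [tuple(map(float, m.groups())) for m in BOX_TAG.finditer(text)]

for x1, y1, x2, y2 in extract_boxes(generated_text):
    print(f"box: ({x1}, {y1}) -> ({x2}, {y2})")
```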