---
license: cc-by-nc-4.0
language:
- en
---
# Isaac-0.2-2B by Perceptron
Introducing the 2B-parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model.
This release brings major upgrades: optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output, all while remaining fast, compact, and deployable.
## Extending the efficient frontier of perception
Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices.
From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint.
![image](https://cdn-uploads.huggingface.co/production/uploads/65526dfffb76980adeffa369/yQl-9BAxLud6hhK8gCKLt.png)
## What's New in Isaac 0.2
* **Reasoning via Thinking Traces**: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
* **Perceptive Tool Calling + Focus (Zoom & Crop)**: Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region, dramatically improving fine-grained perception.
* **Structured Outputs**: More reliable structured output generation for consistent JSON and predictable downstream integration.
* **Complex OCR**: Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
* **Desktop Use**: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases.
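The Focus (zoom and crop) idea above can be sketched in plain Python. Note that `focus_box`, the normalized `(x1, y1, x2, y2)` box convention, and the padding margin are assumptions for illustration only; the actual tool-call format is defined by the model's chat template.

```python
def focus_box(size: tuple[int, int],
              box: tuple[float, float, float, float],
              pad: float = 0.05) -> tuple[int, int, int, int]:
    """Convert a normalized (x1, y1, x2, y2) box into padded pixel
    coordinates for cropping, clamped to the image bounds."""
    w, h = size
    x1, y1, x2, y2 = box
    return (max(0, int((x1 - pad) * w)), max(0, int((y1 - pad) * h)),
            min(w, int((x2 + pad) * w)), min(h, int((y2 + pad) * h)))

# Zoom into the upper-left quadrant of a 640x480 image, then crop the
# image to this region and re-query the model on the smaller view.
print(focus_box((640, 480), (0.0, 0.0, 0.5, 0.5)))  # (0, 0, 352, 264)
```

The padding keeps a little surrounding context in the cropped region, which tends to help the re-query when the target sits near a box edge.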
## Performance Benchmarks
![image](https://cdn-uploads.huggingface.co/production/uploads/65526dfffb76980adeffa369/scKXlSu474L4r8-I6Ahau.png)
## Chatting with Isaac in 🤗 Transformers
Learn more at our [Huggingface Example Repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface), where we demo extracting and rendering points.
```bash
pip install perceptron
```
### Usage
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.utils.import_utils import is_torch_cuda_available
from transformers.image_utils import load_image


def document_to_messages(document: list[dict]):
    messages, images = [], []
    for item in document:
        if not (content := item.get("content")):
            continue
        role = item.get("role", "user")
        if item.get("type") == "image":
            images.append(load_image(content))
            messages.append({"role": role, "content": "<image>"})
        elif item.get("type") == "text":
            messages.append({"role": role, "content": content})
    return messages, images


# Load model/processor from the checkpoint
checkpoint_path = "PerceptronAI/Isaac-0.2-2B-Preview"
processor = AutoProcessor.from_pretrained(checkpoint_path, trust_remote_code=True)
device, dtype = ("cuda", "bfloat16") if is_torch_cuda_available() else ("cpu", "float32")
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    trust_remote_code=True,
    vision_attn_implementation="flash_attention_2",
    dtype=dtype,
).to(device=device)

document = [
    {
        "type": "text",
        "content": "<hint>BOX</hint>",
        "role": "user",
    },
    {
        "type": "image",
        "content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp",
        "role": "user",
    },
    {
        "type": "text",
        "content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.",
        "role": "user",
    },
]

# Prepare inputs for generation
messages, images = document_to_messages(document)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt")

# Generation
generated_ids = model.generate(
    tensor_stream=inputs["tensor_stream"].to(device),
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=False)
print(f"\nOutput: {generated_text}")
```
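Since the example repo demonstrates extracting and rendering points, a minimal post-processing sketch might look like the following. The `<point x="..." y="...">` tag syntax here is a hypothetical placeholder, not the model's confirmed output format; consult the example repo for the actual grounding tags Isaac emits.

```python
import re

def extract_points(text: str) -> list[tuple[int, int]]:
    """Pull (x, y) pixel coordinates out of hypothetical
    <point x="..." y="..."> tags in generated text."""
    return [(int(x), int(y))
            for x, y in re.findall(r'<point x="(\d+)" y="(\d+)"', text)]

# Example with a made-up model response:
sample = 'The crossing signal is here: <point x="212" y="87">signal</point>'
print(extract_points(sample))  # [(212, 87)]
```

A parser like this can feed a renderer that draws the extracted points back onto the input image for inspection.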