---
license: cc-by-nc-4.0
language:
- en
---
# Isaac-0.2-1B by Perceptron

Introducing the 1B parameter variant of Isaac-0.2, the hybrid-reasoning vision-language model.

This release brings major upgrades: optional reasoning via thinking traces, perceptive tool calling (including our new Focus system), stronger grounding, better OCR, better desktop use, and improved structured output, all while remaining fast, compact, and deployable.

## Extending the efficient frontier of perception

Isaac 0.2 extends what we started with Isaac 0.1: small models that outperform systems 10× larger on visual reasoning and perception tasks, all running on commodity GPUs or edge devices. From robotics to media search to industrial inspection, Isaac 0.2 delivers high-accuracy perception without the heavy compute footprint.

## What's New in Isaac 0.2

* **Reasoning via Thinking Traces**: Short, structured reasoning traces improve multi-step decisions, small-object understanding, and ambiguous spatial tasks.
* **Perceptive Tool Calling + Focus (Zoom & Crop)**: Isaac 0.2 can trigger tool calls to focus (i.e., zoom and crop) and re-query the model on a smaller region, dramatically improving fine-grained perception; a minimal sketch of this loop follows the list.
* **Structured Outputs**: More reliable structured output generation for consistent JSON and predictable downstream integration.
* **Complex OCR**: Improved text recognition across cluttered, low-resolution, or distorted regions, enabling accurate extraction from documents, diagrams, labels, screens, and dense real-world scenes.
* **Desktop Use**: Better performance on everyday desktop and mobile workflows such as UI understanding and navigation, making Isaac faster and more capable for agentic use cases.
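
To make the Focus idea concrete, here is a minimal, illustrative sketch of the zoom-and-crop step, assuming you drive the loop yourself with Pillow: take a region of interest (from a model response or your own heuristic), crop it with a padding margin, and re-run the same chat flow on the crop. The `focus` helper and the coordinates are hypothetical, not Isaac's built-in tool-call schema.

```python
from PIL import Image

def focus(image: Image.Image, box: tuple[int, int, int, int], pad: float = 0.1) -> Image.Image:
    """Crop to a region of interest with a padding margin, clamped to image bounds."""
    x1, y1, x2, y2 = box
    dx, dy = int((x2 - x1) * pad), int((y2 - y1) * pad)
    x1, y1 = max(0, x1 - dx), max(0, y1 - dy)
    x2, y2 = min(image.width, x2 + dx), min(image.height, y2 + dy)
    return image.crop((x1, y1, x2, y2))

# Hypothetical example: zoom into a pedestrian-signal region, then pass the
# crop back through the same processor/generate flow shown in Usage below.
full = Image.open("street.jpg")
signal_crop = focus(full, (420, 120, 560, 300))
```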

## Performance Benchmarks

## Chatting with Isaac in 🤗 Transformers
Learn more at our [Huggingface Example Repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface), where we demo extracting and rendering points.

```bash
pip install perceptron
```

### Usage

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers.utils.import_utils import is_torch_cuda_available
from transformers.image_utils import load_image

def document_to_messages(document: list[dict]):
    """Split a document into chat messages plus the images they reference."""
    messages, images = [], []
    for item in document:
        if not (content := item.get("content")):
            continue
        role = item.get("role", "user")
        if item.get("type") == "image":
            images.append(load_image(content))
            messages.append({"role": role, "content": "<image>"})
        elif item.get("type") == "text":
            messages.append({"role": role, "content": content})
    return messages, images

# Load model/processor from the checkpoint
checkpoint_path = "PerceptronAI/Isaac-0.2-1B-Preview"
processor = AutoProcessor.from_pretrained(checkpoint_path, trust_remote_code=True)
device, dtype = ("cuda", "bfloat16") if is_torch_cuda_available() else ("cpu", "float32")
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path, trust_remote_code=True, vision_attn_implementation="flash_attention_2", dtype=dtype
).to(device=device)

# <hint>BOX</hint> asks the model to ground its answer with box annotations
document = [
    {
        "type": "text",
        "content": "<hint>BOX</hint>",
        "role": "user",
    },
    {
        "type": "image",
        "content": "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp",
        "role": "user",
    },
    {
        "type": "text",
        "content": "Determine whether it is safe to cross the street. Look for signage and moving traffic.",
        "role": "user",
    },
]

# Prepare inputs for generation
messages, images = document_to_messages(document)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=images, return_tensors="pt")

# Greedy generation; keep special tokens so the grounding markup stays visible
generated_ids = model.generate(
    tensor_stream=inputs["tensor_stream"].to(device),
    max_new_tokens=256,
    do_sample=False,
)
generated_text = processor.tokenizer.decode(generated_ids[0], skip_special_tokens=False)
print(f"\nOutput: {generated_text}")
```
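
Because `skip_special_tokens=False` keeps the grounding markup in the decoded string, you will likely want to parse it before display. The exact tag grammar is defined by the model's chat format; the snippet below is a hypothetical sketch that assumes boxes arrive as inline XML-style tags carrying comma-separated pixel coordinates. See the [Huggingface Example Repo](https://github.com/perceptron-ai-inc/perceptron/tree/main/huggingface) for the canonical extraction and rendering utilities.

```python
import re

# Hypothetical tag format; the real grammar lives in the example repo.
BOX_TAG = re.compile(r"<box>\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*</box>")

def extract_boxes(text: str) -> list[tuple[float, ...]]:
    """Pull (x1, y1, x2, y2) tuples out of generated text."""
    return [tuple(map(float, m.groups())) for m in BOX_TAG.finditer(text)]

for x1, y1, x2, y2 in extract_boxes(generated_text):
    print(f"box: ({x1}, {y1}) -> ({x2}, {y2})")
```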