---
library_name: transformers
tags:
- vision
- image-text
- clip
- zero-shot
---

<div align="center">
<img class="block dark:hidden" src="assets/Raon-VisionEncoder-Gradient-Black.png" alt="Raon VisionEncoder" width="600">
<img class="hidden dark:block" src="assets/Raon-VisionEncoder-Gradient-White.png" alt="Raon VisionEncoder" width="600">
</div>

<p align="center">
<a href="https://www.krafton.ai/ko/"><img src="https://img.shields.io/badge/Homepage-KRAFTON%20AI-blue?style=flat&logo=google-chrome&logoColor=white" alt="Homepage"></a>
<br>
<a href="https://huggingface.co/KRAFTON"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KRAFTON-yellow?style=flat" alt="Hugging Face"></a>
<a href="https://x.com/Krafton_AI"><img src="https://img.shields.io/badge/X-KRAFTON%20AI-white?style=flat&logo=x&logoColor=black" alt="X"></a>
<br>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey?style=flat" alt="License"></a>
</p>

**Raon-VisionEncoder** is a 1.14B-parameter vision-language foundation model by [KRAFTON](https://www.krafton.com) for image and text feature extraction.
It supports zero-shot image classification, image-text retrieval, and native-aspect-ratio inference via NaFlex.
It is built on [OpenCLIP](https://github.com/mlfoundations/open_clip) with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.
## Pretrained Models

| Model | Params (Inference) | Vision | Text | Patch Size | NaFlex Default Patches |
|-------|--------------------|--------|------|------------|------------------------|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |
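The NaFlex budget above means an image is not square-resized; it is mapped to a variable patch grid of at most 256 patches of 16x16 pixels while keeping roughly its native aspect ratio. A rough, self-contained sketch of that grid selection, assuming the general NaFlex recipe (this is an illustration, not this repository's actual preprocessing code):

```python
import math

def naflex_grid(width: int, height: int, patch: int = 16, max_patches: int = 256):
    """Pick a (cols, rows) patch grid with cols * rows <= max_patches
    that approximately preserves the input aspect ratio.

    Hypothetical helper for illustration; the real logic lives in the
    model's processor.
    """
    # Scale factor that makes the scaled image area equal the patch budget.
    scale = math.sqrt(max_patches * patch * patch / (width * height))
    cols = max(1, math.floor(width * scale / patch))
    rows = max(1, math.floor(height * scale / patch))
    return cols, rows  # the image would be resized to (cols*patch, rows*patch)

print(naflex_grid(1024, 1024))  # (16, 16): a square image fills the budget
print(naflex_grid(2048, 512))   # (32, 8): a 4:1 image keeps its shape
```

Flooring each side keeps the grid within the budget for any input shape.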
## Requirements

```bash
pip install torch torchvision timm transformers huggingface-hub safetensors ftfy
```
## Quick Start

```python
import torch
from transformers import AutoModel
from PIL import Image

# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Encode image and text
img_inputs = processor(images=Image.open("assets/photo.jpg"))
txt_inputs = processor(text=["a cat", "a dog"])

with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

# Compute similarity with the learned scale and bias
logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
probs = logits.softmax(dim=-1)
print(probs)
```
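The softmax above turns the scores into probabilities over the candidate captions. Because SigLIP2-style models are trained with a sigmoid loss, each image-text pair can also be scored independently with a sigmoid, which is what the learned `logit_bias` is for (note `logit_scale` is stored in log space, hence the `.exp()` above). A self-contained sketch of that arithmetic in plain Python; the scale and bias values below are made up for illustration, not the checkpoint's:

```python
import math

def pair_probability(similarity: float, log_scale: float, bias: float) -> float:
    """SigLIP-style match probability for one image-text pair.

    `log_scale` plays the role of model.logit_scale (a log-space value),
    `bias` the role of model.logit_bias.
    """
    logit = math.exp(log_scale) * similarity + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

# A well-matched pair (high cosine similarity) vs. a mismatched one.
match = pair_probability(0.8, log_scale=math.log(100.0), bias=-12.0)
mismatch = pair_probability(0.1, log_scale=math.log(100.0), bias=-12.0)
print(f"match={match:.3f} mismatch={mismatch:.3f}")
```

Unlike a softmax, these per-pair probabilities need not sum to one, so "none of the captions match" is representable.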
## API Reference

| Method | Input | Output |
|--------|-------|--------|
| `model.encode_image(**inputs)` | Processor output (image) | `[B, 1152]` normalized image features |
| `model.encode_text(**inputs)` | Processor output (text) | `[B, 1152]` normalized text features |
| `model.logit_scale` | - | Learned temperature parameter |
| `model.logit_bias` | - | Learned bias parameter |
| `model.get_processor(repo_id)` | Hugging Face repo ID | Processor instance |
| `processor(images=img)` | PIL Image | Preprocessed image dict |
| `processor(text=["a cat"])` | List of strings | Tokenized text dict |
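Since both feature types are L2-normalized, a plain dot product is already a cosine similarity, and retrieval reduces to an argmax over it. A toy, self-contained sketch with made-up 3-d vectors standing in for the model's 1152-d features:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm, as encode_image/encode_text do."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Dot product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-d stand-ins for the model's 1152-d normalized features.
text_feat = l2_normalize([0.2, 0.9, 0.1])            # query caption
gallery = [l2_normalize(v) for v in ([0.1, 1.0, 0.0],  # close match
                                     [1.0, 0.0, 0.0],  # unrelated
                                     [0.0, 0.1, 1.0])] # unrelated

scores = [cosine(text_feat, img) for img in gallery]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])  # the close match wins
```

For real retrieval, the same argmax runs over the `[B, 1152]` feature matrices, e.g. `(txt_feat @ img_feat.T).argmax(dim=-1)`.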
## License

This repository is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
Third-party notices are listed in [NOTICE](NOTICE).

© 2026 KRAFTON