---
library_name: transformers
tags:
- vision
- image-text
- clip
- zero-shot
---

<div align="center">
<img class="block dark:hidden" src="assets/Raon-VisionEncoder-Gradient-Black.png" alt="Raon VisionEncoder" width="600">
<img class="hidden dark:block" src="assets/Raon-VisionEncoder-Gradient-White.png" alt="Raon VisionEncoder" width="600">
</div>

<p align="center">
<a href="https://www.krafton.ai/ko/"><img src="https://img.shields.io/badge/Homepage-KRAFTON%20AI-blue?style=flat&logo=google-chrome&logoColor=white" alt="Homepage"></a>
<br>
<a href="https://huggingface.co/KRAFTON"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KRAFTON-yellow?style=flat" alt="Hugging Face"></a>
<a href="https://x.com/Krafton_AI"><img src="https://img.shields.io/badge/X-KRAFTON%20AI-white?style=flat&logo=x&logoColor=black" alt="X"></a>
<br>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey?style=flat" alt="License"></a>
</p>

**Raon-VisionEncoder** is a 1.14B-parameter vision-language foundation model by [KRAFTON](https://www.krafton.com) for image and text feature extraction.
It supports zero-shot image classification, image-text retrieval, and native-aspect-ratio inference via NaFlex.
It is built on [OpenCLIP](https://github.com/mlfoundations/open_clip) with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.
## Pretrained Models

| Model | Params (Inference) | Vision | Text | Patch Size | NaFlex Default Patches |
|-------|--------------------|--------|------|------------|------------------------|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |
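The NaFlex budget above means an image is not square-resized; it is mapped to a variable patch grid of at most 256 patches of 16x16 pixels while keeping roughly its native aspect ratio. A rough, self-contained sketch of that grid selection, assuming the general NaFlex recipe (this is an illustration, not this repository's actual preprocessing code):

```python
import math

def naflex_grid(width: int, height: int, patch: int = 16, max_patches: int = 256):
    """Pick a (cols, rows) patch grid with cols * rows <= max_patches
    that approximately preserves the input aspect ratio.

    Hypothetical helper for illustration; the real logic lives in the
    model's processor.
    """
    # Scale factor that makes the scaled image area equal the patch budget.
    scale = math.sqrt(max_patches * patch * patch / (width * height))
    cols = max(1, math.floor(width * scale / patch))
    rows = max(1, math.floor(height * scale / patch))
    return cols, rows  # the image would be resized to (cols*patch, rows*patch)

print(naflex_grid(1024, 1024))  # (16, 16): a square image fills the budget
print(naflex_grid(2048, 512))   # (32, 8): a 4:1 image keeps its shape
```

Flooring each side keeps the grid within the budget for any input shape.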
## Requirements

```bash
pip install torch torchvision timm transformers huggingface-hub safetensors ftfy
```
## Quick Start

```python
import torch
from transformers import AutoModel
from PIL import Image

# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Encode image and text
img_inputs = processor(images=Image.open("assets/photo.jpg"))
txt_inputs = processor(text=["a cat", "a dog"])

with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

# Compute similarity with the learned scale and bias
logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
probs = logits.softmax(dim=-1)
print(probs)
```
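The softmax above turns the scores into probabilities over the candidate captions. Because SigLIP2-style models are trained with a sigmoid loss, each image-text pair can also be scored independently with a sigmoid, which is what the learned `logit_bias` is for (note `logit_scale` is stored in log space, hence the `.exp()` above). A self-contained sketch of that arithmetic in plain Python; the scale and bias values below are made up for illustration, not the checkpoint's:

```python
import math

def pair_probability(similarity: float, log_scale: float, bias: float) -> float:
    """SigLIP-style match probability for one image-text pair.

    `log_scale` plays the role of model.logit_scale (a log-space value),
    `bias` the role of model.logit_bias.
    """
    logit = math.exp(log_scale) * similarity + bias
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid

# A well-matched pair (high cosine similarity) vs. a mismatched one.
match = pair_probability(0.8, log_scale=math.log(100.0), bias=-12.0)
mismatch = pair_probability(0.1, log_scale=math.log(100.0), bias=-12.0)
print(f"match={match:.3f} mismatch={mismatch:.3f}")
```

Unlike a softmax, these per-pair probabilities need not sum to one, so "none of the captions match" is representable.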
## API Reference

| Method | Input | Output |
|--------|-------|--------|
| `model.encode_image(**inputs)` | Processor output (image) | `[B, 1152]` normalized image features |
| `model.encode_text(**inputs)` | Processor output (text) | `[B, 1152]` normalized text features |
| `model.logit_scale` | - | Learned temperature parameter |
| `model.logit_bias` | - | Learned bias parameter |
| `model.get_processor(repo_id)` | Hugging Face repo ID | Processor instance |
| `processor(images=img)` | PIL Image | Preprocessed image dict |
| `processor(text=["a cat"])` | List of strings | Tokenized text dict |
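Since both feature types are L2-normalized, a plain dot product is already a cosine similarity, and retrieval reduces to an argmax over it. A toy, self-contained sketch with made-up 3-d vectors standing in for the model's 1152-d features:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm, as encode_image/encode_text do."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Dot product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))

# Toy 3-d stand-ins for the model's 1152-d normalized features.
text_feat = l2_normalize([0.2, 0.9, 0.1])            # query caption
gallery = [l2_normalize(v) for v in ([0.1, 1.0, 0.0],  # close match
                                     [1.0, 0.0, 0.0],  # unrelated
                                     [0.0, 0.1, 1.0])] # unrelated

scores = [cosine(text_feat, img) for img in gallery]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])  # the close match wins
```

For real retrieval, the same argmax runs over the `[B, 1152]` feature matrices, e.g. `(txt_feat @ img_feat.T).argmax(dim=-1)`.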
## License

This repository is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
Third-party notices are listed in [NOTICE](NOTICE).

© 2026 KRAFTON