---
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  I accept the terms and conditions: checkbox
  geo: ip_location
  ? By clicking Submit below I accept the terms of the license and acknowledge that
    the information I provide will be collected stored processed and shared in accordance
    with the Meta Privacy Policy
  : checkbox
extra_gated_description: The information you provide will be collected, stored, processed
  and shared in accordance with the [Meta Privacy Policy](https://www.facebook.com/privacy/policy/).
extra_gated_button_content: Submit
extra_gated_heading: Please be sure to provide your full legal name, date of birth,
  and full organization name with all corporate identifiers. Avoid the use of acronyms
  and special characters. Failure to follow these instructions may prevent you from
  accessing this model and others on Hugging Face. You will not have the ability to
  edit this form after submission, so please ensure all information is accurate.
language:
- en
tags:
- meta-ai
- meta-pytorch
license: fair-noncommercial-research-license
base_model: facebook/pixio-vit5b16
pipeline_tag: image-feature-extraction
library_name: transformers
---
|
|
|
|
|
# Model Card for Pixio |
|
|
|
|
|
Pixio is a family of versatile self-supervised vision foundation models. It produces competitive dense features through simple masked autoencoding (MAE) on 2B web-crawled images with minimal human curation.
|
|
|
|
|
Pixio enhances the MAE pre-training framework with a deeper decoder, masking at a larger granularity, and additional class tokens.
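To illustrate what "masking at a larger granularity" means, the sketch below masks whole blocks of patches rather than individual patches. The block size and mask ratio here are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch: instead of dropping individual 16x16 patches,
# group them into 2x2 blocks of patches and mask whole blocks.
grid = 16                 # 16x16 patch grid (256x256 image, patch size 16)
block = 2                 # assumed block size: mask in 2x2 groups of patches
mask_ratio = 0.75         # assumed mask ratio, for illustration only

blocks_per_side = grid // block
n_blocks = blocks_per_side ** 2
n_masked = int(mask_ratio * n_blocks)

# Choose blocks to mask, then expand the block mask back to the patch grid
block_mask = np.zeros(n_blocks, dtype=bool)
block_mask[rng.choice(n_blocks, size=n_masked, replace=False)] = True
patch_mask = block_mask.reshape(blocks_per_side, blocks_per_side)
patch_mask = patch_mask.repeat(block, axis=0).repeat(block, axis=1)
```

Masking at the block level removes larger contiguous regions, which makes the pixel-reconstruction task harder than independently masking single patches at the same ratio.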
|
|
|
|
|
## Model Details |
|
|
|
|
|
As described in the [Pixio](https://arxiv.org/abs/2512.15715) paper, 5 models are provided: |
|
|
|
|
|
- 1 ViT-5B trained from scratch
- 4 ViT-B/L/H/1B models distilled from the ViT-5B
|
|
|
|
|
Each model takes an image as input and returns eight class tokens and patch tokens. These models follow a standard ViT architecture, with a patch size of 16. For a 256x256 image, this results in 8 class tokens + 256 patch tokens = 264 tokens. |
|
|
|
|
|
The models can accept larger images provided their height and width are multiples of the patch size (16).
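The token arithmetic above generalizes to any valid input size. A small sketch, assuming the 8 class tokens and patch size 16 described above (the helper function is hypothetical, not part of the library):

```python
def num_tokens(height, width, patch_size=16, num_class_tokens=8):
    """Total tokens produced for an image of the given size.

    Height and width must be multiples of the patch size, as required.
    """
    assert height % patch_size == 0 and width % patch_size == 0
    num_patches = (height // patch_size) * (width // patch_size)
    return num_class_tokens + num_patches

print(num_tokens(256, 256))  # 8 class tokens + 256 patch tokens = 264
print(num_tokens(512, 384))  # 8 + 32 * 24 = 776
```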
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** FAIR at Meta, HKU
- **Model type:** Vision Transformer
- **License:** [FAIR Noncommercial Research License](https://github.com/facebookresearch/pixio?tab=License-1-ov-file#readme)
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [https://github.com/facebookresearch/pixio](https://github.com/facebookresearch/pixio)
- **Paper:** [In Pursuit of Pixel Supervision for Visual Pre-training](https://arxiv.org/abs/2512.15715)
|
|
|
|
|
### How to use |
|
|
|
|
|
Here is how to use this model: |
|
|
|
|
|
```python
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

# Load a sample image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and model
processor = AutoImageProcessor.from_pretrained('facebook/pixio-vith16')
model = AutoModel.from_pretrained('facebook/pixio-vith16')

# Preprocess the image and run a forward pass
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
last_hidden_states_norm = outputs.last_hidden_state  # 8 class tokens + patch tokens after last LayerNorm
last_hidden_states = outputs.hidden_states[-1]  # 8 class tokens + patch tokens before last LayerNorm
```
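For dense prediction tasks, you can split off the class tokens and reshape the patch tokens into a 2D feature grid. A minimal sketch on a dummy array, assuming the token layout described above (8 class tokens first, then patch tokens in row-major order); the hidden size here is illustrative, not the real model width:

```python
import numpy as np

# Dummy stand-in for outputs.last_hidden_state: (batch, 8 + num_patches, hidden_dim)
batch, hidden_dim = 1, 1280   # hidden_dim = 1280 is an illustrative assumption
grid = 256 // 16              # 16x16 patch grid for a 256x256 input
tokens = np.zeros((batch, 8 + grid * grid, hidden_dim))

class_tokens = tokens[:, :8]  # (1, 8, hidden_dim): the 8 class tokens
patch_tokens = tokens[:, 8:]  # (1, 256, hidden_dim): one token per 16x16 patch
# Reshape patch tokens into a spatial feature map for dense tasks
feature_map = patch_tokens.reshape(batch, grid, grid, hidden_dim)
```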
|
|
|
|
|
## Citation |
|
|
|
|
|
```
@article{pixio,
  title={In Pursuit of Pixel Supervision for Visual Pre-training},
  author={Yang, Lihe and Li, Shang-Wen and Li, Yang and Lei, Xinjie and Wang, Dong and Mohamed, Abdelrahman and Zhao, Hengshuang and Xu, Hu},
  journal={arXiv:2512.15715},
  year={2025}
}
```
|
|
|