---
datasets:
- ILSVRC/imagenet-1k
library_name: transformers
license: apache-2.0
pipeline_tag: image-feature-extraction
---

# NEPA: Next-Embedding Prediction Makes Strong Vision Learners

[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv&logoColor=b31b1b)](https://arxiv.org/abs/2512.16922)
[![Project Page](https://img.shields.io/badge/Project-Website-5B7493?logo=googlechrome&logoColor=5B7493)](https://sihanxu.me/nepa)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/SihanXU/nepa)

This is a PyTorch/GPU re-implementation of *Next-Embedding Prediction Makes Strong Vision Learners*.

**Next-Embedding Predictive Autoregression (NEPA).** An image is split into patches and embedded into a sequence. An autoregressive model then predicts each next embedding from the previous ones.
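To make the objective concrete, here is a minimal, illustrative sketch of next-embedding prediction in PyTorch. It is not the repository's actual model: the patch embedder, the causal Transformer, the sizes, and the MSE loss are simplified stand-ins for the real architecture in `models/vit_nepa.py`.

```python
# Minimal sketch of the next-embedding prediction objective (illustrative only).
import torch
import torch.nn as nn

patch, dim = 14, 256  # hypothetical sizes, not the paper's configuration

# 1) Patchify and embed: image -> sequence of patch embeddings.
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

# 2) A causal Transformer over the embedding sequence.
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)

images = torch.randn(2, 3, 224, 224)
tokens = embed(images).flatten(2).transpose(1, 2)  # (B, 256, dim): 224/14 = 16x16 patches

# Causal mask so position i only attends to positions <= i.
mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
pred = backbone(tokens, mask=mask)

# 3) Next-embedding objective: the prediction at position i is regressed onto
# the embedding at position i+1 (MSE and the detached target are stand-ins).
loss = nn.functional.mse_loss(pred[:, :-1], tokens[:, 1:].detach())
loss.backward()
```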

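Since the checkpoints are tagged for image feature extraction, token-level features can also be pulled out. The sketch below assumes the model follows the standard Transformers `output_hidden_states` convention; this is a generic mechanism, not a NEPA-specific API.

```python
# Sketch: extracting token embeddings from a NEPA checkpoint, assuming the
# standard Transformers `output_hidden_states` convention applies.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor
from models.vit_nepa import ViTNepaForImageClassification

url = 'https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('SixAILab/nepa-large-patch14-224-sft')
model = ViTNepaForImageClassification.from_pretrained('SixAILab/nepa-large-patch14-224-sft')

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

features = outputs.hidden_states[-1]  # (batch, sequence, hidden) token embeddings
print(features.shape)
```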
## Citation

```
@article{six2025nepa,
  title={Next-Embedding Prediction Makes Strong Vision Learners},
  author={Sihan Xu and Ziqiao Ma and Wenhao Chai and Xuweiyi Chen and Weiyang Jin and Joyce Chai and Saining Xie and Stella X. Yu},
  journal={arXiv preprint arXiv:2512.16922},
  year={2025}
}
```

## Environment

The codebase has been tested with the following environment:

- Python 3.10
- PyTorch 2.8.0
- Transformers 4.56.2

### Installation

First, clone the repository:

```bash
git clone https://github.com/SihanXU/nepa
cd nepa
```

Then, create a conda environment and install the dependencies:

```bash
conda env create -f environment.yml
conda activate nepa
```

Alternatively, you can install the dependencies manually:

```bash
pip install -r requirements.txt
```

## Quick Start

Here is a simple example that runs inference with a pretrained NEPA model:

```python
from transformers import AutoImageProcessor
from models.vit_nepa import ViTNepaForImageClassification
from PIL import Image
import requests

url = 'https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('SixAILab/nepa-large-patch14-224-sft')
model = ViTNepaForImageClassification.from_pretrained('SixAILab/nepa-large-patch14-224-sft')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1000 ImageNet classes.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

## Set Up a Hugging Face Token

To download pretrained models from the Hugging Face Hub, authenticate with your Hugging Face account:

```bash
hf auth login
```

## [Optional] Set Up a Wandb Token

We use [Weights & Biases](https://wandb.ai/) (wandb) to log and track experiments:

```bash
pip install wandb
wandb login
```

## Prepare the ImageNet-1k Dataset

We use the ImageNet-1k dataset for training and evaluation. To download it via Hugging Face Datasets:

```bash
python download_dataset.py
```

This script downloads and prepares the ImageNet-1k dataset, which requires approximately 150 GB of disk space. You may need to accept the dataset terms on [Hugging Face](https://huggingface.co/datasets/ILSVRC/imagenet-1k) before downloading.

## Evaluate NEPA for Image Classification

We provide pretrained checkpoints for NEPA models. The following table compares our reproduced results with the paper (ImageNet-1k top-1 accuracy, %):

| Model  | SwiGLU (paper) | GeLU (reproduced) |
|--------|---------------:|------------------:|
| NEPA-B | 83.8           | 83.75             |
| NEPA-L | 85.3           | 85.40             |

To evaluate the base model on the ImageNet-1k validation set:

```bash
bash scripts/eval/nepa_b_sft_eval.sh
```

This should give:

```
***** eval metrics *****
  eval_accuracy = 0.8375
  eval_loss     = 0.7169
```

To evaluate the large model:

```bash
bash scripts/eval/nepa_l_sft_eval.sh
```

This should give:

```
***** eval metrics *****
  eval_accuracy = 0.854
  eval_loss     = 0.6371
```

## Fine-tune

To fine-tune a pretrained NEPA model on ImageNet-1k for image classification:

For the base model:

```bash
bash scripts/finetune/nepa_b_sft.sh
```

For the large model:

```bash
bash scripts/finetune/nepa_l_sft.sh
```

You can modify the training hyperparameters (learning rate, batch size, epochs, etc.) in the corresponding script files.
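If you prefer a Python entry point over the shell scripts, a minimal `Trainer`-based fine-tuning sketch might look like the following. The checkpoint, preprocessing, and hyperparameters are illustrative placeholders, not the repository's actual settings; the authoritative configuration lives in the scripts under `scripts/finetune/`.

```python
# Illustrative fine-tuning sketch using the Transformers Trainer.
# All hyperparameters are placeholders; see scripts/finetune/ for the real ones.
import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, Trainer, TrainingArguments
from models.vit_nepa import ViTNepaForImageClassification

# Placeholder checkpoint; for a fresh fine-tune, start from a model converted
# with init_nepa_cls_from_pretrain.py (see the conversion section below).
ckpt = 'SixAILab/nepa-base-patch14-224-sft'
processor = AutoImageProcessor.from_pretrained(ckpt)
model = ViTNepaForImageClassification.from_pretrained(ckpt)

dataset = load_dataset('ILSVRC/imagenet-1k', split='train')

def preprocess(batch):
    # Convert PIL images to pixel tensors; labels pass through unchanged.
    images = [img.convert('RGB') for img in batch['image']]
    batch['pixel_values'] = processor(images=images, return_tensors='pt')['pixel_values']
    return batch

dataset.set_transform(preprocess)

def collate(examples):
    return {
        'pixel_values': torch.stack([ex['pixel_values'] for ex in examples]),
        'labels': torch.tensor([ex['label'] for ex in examples]),
    }

args = TrainingArguments(
    output_dir='./nepa-finetune',
    per_device_train_batch_size=64,  # placeholder
    learning_rate=1e-4,              # placeholder
    num_train_epochs=3,              # placeholder
    remove_unused_columns=False,     # keep the raw 'image' column for the transform
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collate).train()
```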
## Pretrain

To pretrain NEPA from scratch on ImageNet-1k:

For the base model:

```bash
bash scripts/pretrain/nepa_b.sh
```

For the large model:

```bash
bash scripts/pretrain/nepa_l.sh
```

Pretraining typically requires multiple GPUs. We recommend using at least 8 A100 GPUs for the large model.

## Convert a Pretrained Model to a Classification Model

After pretraining, you can convert the pretrained model to a classification model by initializing a classification head with the `init_nepa_cls_from_pretrain.py` script. Here is an example:

```bash
python init_nepa_cls_from_pretrain.py \
    --pretrained_model_id SixAILab/nepa-base-patch14-224 \
    --config_model_id configs/finetune/nepa-base-patch14-224-sft \
    --pretrained_revision main \
    --save_local \
    --local_dir ./nepa-base-patch14-224-sft
```

## Acknowledgements

We gratefully acknowledge the developers of [Transformers](https://github.com/huggingface/transformers), [Evaluate](https://github.com/huggingface/evaluate), and [timm](https://github.com/huggingface/pytorch-image-models) for their excellent open-source contributions.

## Contact

Feel free to contact me via email (sihanxu@umich.edu). Enjoy!