---
datasets:
- ILSVRC/imagenet-1k
library_name: transformers
license: apache-2.0
pipeline_tag: image-feature-extraction
---

# NEPA: Next-Embedding Prediction Makes Strong Vision Learners

[![Paper](https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=arxiv&logoColor=b31b1b)](https://arxiv.org/abs/2512.16922)
[![Project Page](https://img.shields.io/badge/Project-Website-5B7493?logo=googlechrome&logoColor=5B7493)](https://sihanxu.me/nepa)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717?logo=github)](https://github.com/SihanXU/nepa)

This is a PyTorch/GPU re-implementation of *Next-Embedding Prediction Makes Strong Vision Learners*.

**Next-Embedding Predictive Autoregression (NEPA).** An image is split into patches and embedded into a sequence. An autoregressive model then predicts each next embedding from the previous ones.
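To make the objective concrete, here is a minimal, illustrative sketch of next-embedding prediction in PyTorch. It is not the repository's actual model: the patch embedder, the causal Transformer, the sizes, and the MSE loss are simplified stand-ins for the real architecture in `models/vit_nepa.py`.

```python
# Minimal sketch of the next-embedding prediction objective (illustrative only).
import torch
import torch.nn as nn

patch, dim = 14, 256  # hypothetical sizes, not the paper's configuration

# 1) Patchify and embed: image -> sequence of patch embeddings.
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

# 2) A causal Transformer over the embedding sequence.
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=4)

images = torch.randn(2, 3, 224, 224)
tokens = embed(images).flatten(2).transpose(1, 2)  # (B, 256, dim): 224/14 = 16x16 patches

# Causal mask so position i only attends to positions <= i.
mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
pred = backbone(tokens, mask=mask)

# 3) Next-embedding objective: the prediction at position i is regressed onto
# the embedding at position i+1 (MSE and the detached target are stand-ins).
loss = nn.functional.mse_loss(pred[:, :-1], tokens[:, 1:].detach())
loss.backward()
```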

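Since the checkpoints are tagged for image feature extraction, token-level features can also be pulled out. The sketch below assumes the model follows the standard Transformers `output_hidden_states` convention; this is a generic mechanism, not a NEPA-specific API.

```python
# Sketch: extracting token embeddings from a NEPA checkpoint, assuming the
# standard Transformers `output_hidden_states` convention applies.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor
from models.vit_nepa import ViTNepaForImageClassification

url = 'https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('SixAILab/nepa-large-patch14-224-sft')
model = ViTNepaForImageClassification.from_pretrained('SixAILab/nepa-large-patch14-224-sft')

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

features = outputs.hidden_states[-1]  # (batch, sequence, hidden) token embeddings
print(features.shape)
```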
## Citation

```
@article{six2025nepa,
  title={Next-Embedding Prediction Makes Strong Vision Learners},
  author={Sihan Xu and Ziqiao Ma and Wenhao Chai and Xuweiyi Chen and Weiyang Jin and Joyce Chai and Saining Xie and Stella X. Yu},
  journal={arXiv preprint arXiv:2512.16922},
  year={2025}
}
```

## Environment

The codebase has been tested with the following environment:

- Python 3.10
- PyTorch 2.8.0
- Transformers 4.56.2

### Installation

First, clone the repository:

```bash
git clone https://github.com/SihanXU/nepa
cd nepa
```

Then, create a conda environment and install the dependencies:

```bash
conda env create -f environment.yml
conda activate nepa
```

Alternatively, you can install the dependencies manually:

```bash
pip install -r requirements.txt
```

## Quick Start

Here is a simple example that runs inference with a pretrained NEPA model:

```python
from transformers import AutoImageProcessor
from models.vit_nepa import ViTNepaForImageClassification
from PIL import Image
import requests

url = 'https://raw.githubusercontent.com/pytorch/hub/master/images/dog.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('SixAILab/nepa-large-patch14-224-sft')
model = ViTNepaForImageClassification.from_pretrained('SixAILab/nepa-large-patch14-224-sft')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1000 ImageNet classes.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

## Set Up a Hugging Face Token

To download pretrained models from the Hugging Face Hub, authenticate with your Hugging Face account:

```bash
hf auth login
```

## [Optional] Set Up a Wandb Token

We use [Weights & Biases](https://wandb.ai/) (wandb) to log and track experiments:

```bash
pip install wandb
wandb login
```

## Prepare the ImageNet-1k Dataset

We use the ImageNet-1k dataset for training and evaluation. To download it via Hugging Face Datasets:

```bash
python download_dataset.py
```

This script downloads and prepares the ImageNet-1k dataset, which requires approximately 150 GB of disk space. You may need to accept the dataset terms on [Hugging Face](https://huggingface.co/datasets/ILSVRC/imagenet-1k) before downloading.

## Evaluate NEPA for Image Classification

We provide pretrained checkpoints for NEPA models. The following table compares our reproduced results with the paper (ImageNet-1k top-1 accuracy, %):

| Model  | SwiGLU (paper) | GeLU (reproduced) |
|--------|---------------:|------------------:|
| NEPA-B | 83.8           | 83.75             |
| NEPA-L | 85.3           | 85.40             |

To evaluate the base model on the ImageNet-1k validation set:

```bash
bash scripts/eval/nepa_b_sft_eval.sh
```

This should give:

```
***** eval metrics *****
  eval_accuracy = 0.8375
  eval_loss     = 0.7169
```

To evaluate the large model:

```bash
bash scripts/eval/nepa_l_sft_eval.sh
```

This should give:

```
***** eval metrics *****
  eval_accuracy = 0.854
  eval_loss     = 0.6371
```

## Fine-tune

To fine-tune a pretrained NEPA model on ImageNet-1k for image classification:

For the base model:

```bash
bash scripts/finetune/nepa_b_sft.sh
```

For the large model:

```bash
bash scripts/finetune/nepa_l_sft.sh
```

You can modify the training hyperparameters (learning rate, batch size, epochs, etc.) in the corresponding script files.
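If you prefer a Python entry point over the shell scripts, a minimal `Trainer`-based fine-tuning sketch might look like the following. The checkpoint, preprocessing, and hyperparameters are illustrative placeholders, not the repository's actual settings; the authoritative configuration lives in the scripts under `scripts/finetune/`.

```python
# Illustrative fine-tuning sketch using the Transformers Trainer.
# All hyperparameters are placeholders; see scripts/finetune/ for the real ones.
import torch
from datasets import load_dataset
from transformers import AutoImageProcessor, Trainer, TrainingArguments
from models.vit_nepa import ViTNepaForImageClassification

# Placeholder checkpoint; for a fresh fine-tune, start from a model converted
# with init_nepa_cls_from_pretrain.py (see the conversion section below).
ckpt = 'SixAILab/nepa-base-patch14-224-sft'
processor = AutoImageProcessor.from_pretrained(ckpt)
model = ViTNepaForImageClassification.from_pretrained(ckpt)

dataset = load_dataset('ILSVRC/imagenet-1k', split='train')

def preprocess(batch):
    # Convert PIL images to pixel tensors; labels pass through unchanged.
    images = [img.convert('RGB') for img in batch['image']]
    batch['pixel_values'] = processor(images=images, return_tensors='pt')['pixel_values']
    return batch

dataset.set_transform(preprocess)

def collate(examples):
    return {
        'pixel_values': torch.stack([ex['pixel_values'] for ex in examples]),
        'labels': torch.tensor([ex['label'] for ex in examples]),
    }

args = TrainingArguments(
    output_dir='./nepa-finetune',
    per_device_train_batch_size=64,  # placeholder
    learning_rate=1e-4,              # placeholder
    num_train_epochs=3,              # placeholder
    remove_unused_columns=False,     # keep the raw 'image' column for the transform
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collate).train()
```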
## Pretrain

To pretrain NEPA from scratch on ImageNet-1k:

For the base model:

```bash
bash scripts/pretrain/nepa_b.sh
```

For the large model:

```bash
bash scripts/pretrain/nepa_l.sh
```

Pretraining typically requires multiple GPUs. We recommend using at least 8 A100 GPUs for the large model.

## Convert a Pretrained Model to a Classification Model

After pretraining, you can convert the pretrained model to a classification model by initializing a classification head with the `init_nepa_cls_from_pretrain.py` script. Here is an example:

```bash
python init_nepa_cls_from_pretrain.py \
    --pretrained_model_id SixAILab/nepa-base-patch14-224 \
    --config_model_id configs/finetune/nepa-base-patch14-224-sft \
    --pretrained_revision main \
    --save_local \
    --local_dir ./nepa-base-patch14-224-sft
```

## Acknowledgements

We gratefully acknowledge the developers of [Transformers](https://github.com/huggingface/transformers), [Evaluate](https://github.com/huggingface/evaluate), and [timm](https://github.com/huggingface/pytorch-image-models) for their excellent open-source contributions.

## Contact

Feel free to contact me via email (sihanxu@umich.edu). Enjoy!