InternViT-6B for Image Classification
This folder contains the implementation of InternViT-6B for image classification, which corresponds to Section 4.2.1 of our InternVL 1.0 paper. The codebase for this part is derived from InternImage, with some code adapted from EVA and DINOv2. We thank the authors for their great work.
In this part, we validate the visual perception capabilities of InternViT-6B, the core component of InternVL 1.0. We evaluate the quality of the visual representations produced by InternViT-6B on the ImageNet-1K dataset. Following common practice, we adopt linear probing evaluation, i.e., we train a linear classifier while keeping the backbone frozen. In addition to the ImageNet-1K validation set, we report results on several ImageNet variants to benchmark domain generalization.
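As a rough illustration of linear probing, the sketch below freezes a stand-in backbone and updates only a linear head. It assumes PyTorch is installed; the toy `backbone`, the feature size, and the random batch are hypothetical placeholders and are not InternViT-6B or the training loop in `main.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained backbone (NOT InternViT-6B).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 32), nn.GELU())
head = nn.Linear(32, 10)  # the linear classifier being trained

# Freeze the backbone: no gradients are tracked for its parameters.
backbone.requires_grad_(False)
backbone.eval()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(head.parameters(), lr=0.1)

# Random toy batch standing in for images and labels.
images = torch.randn(16, 3, 8, 8)
labels = torch.randint(0, 10, (16,))

# Snapshots used to confirm which weights change.
backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

for _ in range(5):
    with torch.no_grad():  # features come from the frozen backbone
        feats = backbone(images)
    loss = F.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After the loop, the backbone weights are unchanged and only `head` has been updated, which is the defining property of a linear probe.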
InternViT-6B follows the structure of vanilla ViT, and its hyperparameters are listed in the table below.
🛠️ Installation
Follow the installation guide to set up the environment.
📦 Data Preparation
Please prepare the dataset according to your needs.
- ImageNet-1K: We use the standard ImageNet dataset, which you can download from http://image-net.org/.
- ImageNet-A: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar.
- ImageNet-R: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar.
- ImageNetV2: Download it from https://imagenetv2public.s3-us-west-2.amazonaws.com/imagenetv2-matched-frequency.tar.gz.
- ImageNet-Sketch: Download it using gdown.

# GDown is needed to download the dataset.
# Please install it via `pip install gdown`
gdown --id 1Mj0i5HBthqH1p_yeXzsg22gZduvgoNeA
First, please prepare the ImageNet-1K, ImageNet-A, ImageNet-R, ImageNetV2, and ImageNet-Sketch datasets following the directory structure outlined below.
$ tree data
data
├── imagenet-1k
│   ├── train
│   │   ├── n01498041
│   │   └── ...
│   └── val
│       ├── ILSVRC2012_val_00000001.JPEG
│       └── ...
├── imagenet-a
│   ├── n01498041
│   └── ...
├── imagenet-r
│   ├── n01443537
│   └── ...
├── imagenet-sketch
│   ├── n01440764
│   └── ...
└── imagenetv2
    └── ImageNetV2-matched-frequency
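Before launching a job, it can help to confirm that the layout above is in place. The following is a minimal sketch using only the Python standard library; the helper name `missing_dataset_dirs` is hypothetical and not part of this codebase, and the nesting of the sub-directories is read from the tree above.

```python
from pathlib import Path

# Sub-directories taken from the tree above, relative to the data root.
EXPECTED_DIRS = [
    "imagenet-1k/train",
    "imagenet-1k/val",
    "imagenet-a",
    "imagenet-r",
    "imagenet-sketch",
    "imagenetv2/ImageNetV2-matched-frequency",
]


def missing_dataset_dirs(root):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs("data")
    print("all datasets found" if not missing else f"missing: {missing}")
```

The check only looks at directory names, so it does not verify that the images themselves are complete.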
Then, unzip the train.txt.zip and val.txt.zip in meta_data/.
cd meta_data/
unzip train.txt.zip
unzip val.txt.zip
📦 Model Preparation
| model name | type | download | size |
|---|---|---|---|
| intern_vit_6b_224px.pth | pytorch | 🤗 HF link | 12 GB |
| intern_vit_6b_224px_head.pth | pytorch | 🤗 HF link | 25.7 MB |
Please download the above model weights and place them in the pretrained/ folder.
cd pretrained
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth
The directory structure is:
pretrained
├── intern_vit_6b_224px_head.pth
└── intern_vit_6b_224px.pth
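Interrupted downloads of large checkpoints are easy to miss. The sketch below compares on-disk sizes with the approximate sizes in the table above. The helper name `check_checkpoints`, the decimal interpretation of GB/MB, and the 15% tolerance are all assumptions made for illustration, so treat a mismatch as a hint to re-download rather than as proof of corruption.

```python
from pathlib import Path

# Approximate sizes from the table above; decimal GB/MB is an assumption.
EXPECTED_BYTES = {
    "intern_vit_6b_224px.pth": 12e9,
    "intern_vit_6b_224px_head.pth": 25.7e6,
}


def check_checkpoints(folder="pretrained", tolerance=0.15):
    """Map each expected file name to 'ok', 'missing', or 'unexpected size'."""
    status = {}
    for name, expected in EXPECTED_BYTES.items():
        path = Path(folder) / name
        if not path.is_file():
            status[name] = "missing"
        elif abs(path.stat().st_size - expected) > tolerance * expected:
            status[name] = "unexpected size"
        else:
            status[name] = "ok"
    return status


if __name__ == "__main__":
    for name, state in check_checkpoints().items():
        print(f"{name}: {state}")
```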
Linear Probing on ImageNet-1K
Warning: Please install apex before training (see the installation guide for details).
To train a linear classifier for InternViT-6B on ImageNet with 8 GPUs, run:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --launcher slurm
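Note: it is normal for the following message to appear during training, and it can be safely ignored. It lists clip_projector weights that are present in the checkpoint but have no counterpart in the model being loaded: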
_IncompatibleKeys(missing_keys=[], unexpected_keys=['clip_projector.norm1_q.weight', 'clip_projector.norm1_q.bias', 'clip_projector.norm1_k.weight', 'clip_projector.norm1_k.bias', 'clip_projector.norm1_v.weight', 'clip_projector.norm1_v.bias', 'clip_projector.cross_attn.q_bias', 'clip_projector.cross_attn.k_bias', 'clip_projector.cross_attn.v_bias', 'clip_projector.cross_attn.q.weight', 'clip_projector.cross_attn.k.weight', 'clip_projector.cross_attn.v.weight', 'clip_projector.cross_attn.proj.weight', 'clip_projector.cross_attn.proj.bias'])
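`_IncompatibleKeys` is the result object PyTorch returns from `load_state_dict`, with `missing_keys` and `unexpected_keys` fields. To tell the harmless case above apart from a genuinely broken load, a small check along these lines may be useful. The function name `is_benign_load_result` and the example key `head.weight` are hypothetical.

```python
def is_benign_load_result(missing_keys, unexpected_keys,
                          ignorable_prefix="clip_projector."):
    """True if nothing is missing and every unexpected key has the ignorable prefix."""
    return not missing_keys and all(
        key.startswith(ignorable_prefix) for key in unexpected_keys
    )


# A few of the keys from the message above.
unexpected = [
    "clip_projector.norm1_q.weight",
    "clip_projector.cross_attn.q_bias",
    "clip_projector.cross_attn.proj.bias",
]

print(is_benign_load_result([], unexpected))               # True
print(is_benign_load_result(["head.weight"], unexpected))  # False: a weight is missing
```

With a real load, the two lists would come from the `missing_keys` and `unexpected_keys` attributes of the returned object.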
Evaluation
Warning: Please install apex before evaluation (see the installation guide for details).
| model name | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch | download |
|---|---|---|---|---|---|---|---|
| intern_vit_6b_1k_224.yaml | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 | ckpt / log |
Evaluate InternViT-6B on ImageNet-1K val with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 88.230 Acc@5 98.474
Accuracy of the network on the 50000 test images: 88.2%
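For readers unfamiliar with the `Acc@1` and `Acc@5` columns in these logs: a prediction counts as correct at `Acc@k` when the true label is among the k highest-scoring classes. The following plain-Python sketch illustrates the definition on toy scores. It is not the evaluation code used by `main.py`, and `topk_accuracy` is a hypothetical helper.

```python
def topk_accuracy(scores, labels, k=1):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return 100.0 * hits / len(labels)


scores = [
    [0.10, 0.70, 0.20],  # top-1 is class 1
    [0.50, 0.30, 0.20],  # top-1 is class 0, runner-up is class 1
    [0.20, 0.30, 0.50],  # top-1 is class 2
    [0.40, 0.35, 0.25],  # top-1 is class 0, runner-up is class 1
]
labels = [1, 1, 2, 2]

print(topk_accuracy(scores, labels, k=1))  # 50.0
print(topk_accuracy(scores, labels, k=2))  # 75.0
```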
Evaluate InternViT-6B on ImageNet-ReaL with 1 GPU:
Note: ImageNet-ReaL evaluation currently supports single-GPU testing only.
python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=1 GPUS_PER_NODE=1 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* ReaL Acc@1 90.437 Acc@5 98.567 loss 0.605
ReaL Accuracy of the network on the 50000 test images: 90.4%
Evaluate InternViT-6B on ImageNetV2 with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 79.940 Acc@5 95.340
Accuracy of the network on the 10000 test images: 79.9%
Evaluate InternViT-6B on ImageNet-A with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 77.479 Acc@5 92.737
Accuracy of the network on the 7500 test images: 77.5%
Evaluate InternViT-6B on ImageNet-R with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 89.777 Acc@5 97.023
Accuracy of the network on the 30000 test images: 89.8%
Evaluate InternViT-6B on ImageNet-Sketch with 8 GPUs:
python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
--cfg configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --eval \
--resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm
Expected results:
* Acc@1 69.117 Acc@5 88.341
Accuracy of the network on the 50889 test images: 69.1%
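When reproducing these numbers, the summary lines can be compared against the results table automatically. The sketch below parses a `* Acc@1 ... Acc@5 ...` line with a regular expression and checks the rounded top-1 value against the table. The helper names and the benchmark keys are assumptions chosen for illustration, and the log format is taken from the expected results shown above.

```python
import re

# Top-1 values copied from the results table in the Evaluation section.
TABLE_TOP1 = {
    "IN-1K": 88.2, "IN-ReaL": 90.4, "IN-V2": 79.9,
    "IN-A": 77.5, "IN-R": 89.8, "IN-Sketch": 69.1,
}

ACC_LINE = re.compile(r"Acc@1\s+([\d.]+)\s+Acc@5\s+([\d.]+)")


def parse_acc(line):
    """Return (top1, top5) from a summary line, or None if the line does not match."""
    match = ACC_LINE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def matches_table(benchmark, line, tol=0.05):
    """True if the logged top-1, rounded to one decimal, agrees with the table."""
    parsed = parse_acc(line)
    if parsed is None:
        return False
    return abs(round(parsed[0], 1) - TABLE_TOP1[benchmark]) <= tol


print(parse_acc(" * Acc@1 88.230 Acc@5 98.474"))  # (88.23, 98.474)
print(matches_table("IN-ReaL", " * ReaL Acc@1 90.437 Acc@5 98.567 loss 0.605"))  # True
```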