InternViT-6B for Image Classification

This folder contains the implementation of InternViT-6B for image classification, which corresponds to Section 4.2.1 of our InternVL 1.0 paper. The codebase for this part is derived from InternImage, with reference to code from EVA and DINOv2. Thanks for their great work.

In this part, we validate the visual perception capabilities of InternViT-6B, the core component of InternVL 1.0. We evaluate the quality of the visual representations produced by InternViT-6B using the ImageNet-1K dataset. Following common practice, we adopt linear probing evaluation, i.e., training a linear classifier while keeping the backbone frozen. In addition to the ImageNet-1K validation set, we also report performance on several ImageNet variants to benchmark domain generalization.

InternViT-6B follows the structure of vanilla ViT, and its hyperparameters are listed in the table below.

🛠️ Installation

Follow the installation guide to set up the environment.

📦 Data Preparation

Please prepare the dataset according to your needs.

ImageNet-1K: We use the standard ImageNet dataset, which you can download from http://image-net.org/.
ImageNet-A: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar.
ImageNet-R: Download it from https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar.
ImageNetV2: Download it from https://imagenetv2public.s3-us-west-2.amazonaws.com/imagenetv2-matched-frequency.tar.gz.

ImageNet-Sketch: Download it using gdown.

# gdown is needed to download the dataset.
# Install it via `pip install gdown`.
# Note: gdown 4.3.1 and later deprecate `--id`; there, the file ID can be passed directly.
gdown --id 1Mj0i5HBthqH1p_yeXzsg22gZduvgoNeA
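
The tar archives listed above can be fetched and unpacked with a small helper. This is a minimal sketch, not part of this codebase: the function name `fetch_and_unpack` and the choice to keep each archive next to its extracted files are assumptions. The top-level folder names inside each archive may differ from the layout expected below, so rename or move the extracted folders as needed.

```shell
# Hypothetical helper (not part of this repository): download an archive
# into a destination folder, unless it is already there, and unpack it.
fetch_and_unpack() {
    url=$1
    dest=$2
    archive="$dest/$(basename "$url")"
    mkdir -p "$dest" || return 1
    if [ ! -f "$archive" ]; then
        # Remove a partial file on failure so the next run retries.
        wget -O "$archive" "$url" || { rm -f "$archive"; return 1; }
    fi
    case "$archive" in
        *.tar.gz | *.tgz) tar -xzf "$archive" -C "$dest" ;;
        *.tar)            tar -xf "$archive" -C "$dest" ;;
        *) echo "unsupported archive type: $archive" >&2; return 1 ;;
    esac
}

# Usage (URLs are the ones listed above):
# fetch_and_unpack https://people.eecs.berkeley.edu/~hendrycks/imagenet-a.tar data
# fetch_and_unpack https://people.eecs.berkeley.edu/~hendrycks/imagenet-r.tar data
# fetch_and_unpack https://imagenetv2public.s3-us-west-2.amazonaws.com/imagenetv2-matched-frequency.tar.gz data/imagenetv2
```

Keeping the archive on disk means a second call skips the download and only repeats the extraction.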

First, please prepare the ImageNet-1K, ImageNet-A, ImageNet-R, ImageNetV2, and ImageNet-Sketch datasets following the directory structure outlined below.

$ tree data
data
├── imagenet-1k
│         ├── train
│         │    ├── n01498041
│         │    └── ...
│         └── val
│              ├── ILSVRC2012_val_00000001.JPEG
│              └── ...
├── imagenet-a
│         ├── n01498041
│         └── ...
├── imagenet-r
│         ├── n01443537
│         └── ...
├── imagenet-sketch
│         ├── n01440764
│         └── ...
└── imagenetv2
    └── ImageNetV2-matched-frequency
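
Before training, it can help to confirm that the folders match the tree above. The following is a minimal sketch; `check_layout` is a hypothetical helper, not a script shipped with this codebase, and it only checks that the expected folders exist, not their contents.

```shell
# Hypothetical helper: report any dataset folder from the tree above that is
# missing under the given root (default: data). Returns non-zero if any is missing.
check_layout() {
    root=${1:-data}
    missing=0
    for d in imagenet-1k/train imagenet-1k/val imagenet-a imagenet-r \
             imagenet-sketch imagenetv2/ImageNetV2-matched-frequency; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $root/$d" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
# check_layout data && echo "dataset layout looks complete"
```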

Then, unzip train.txt.zip and val.txt.zip in meta_data/.

cd meta_data/
unzip train.txt.zip
unzip val.txt.zip

📦 Model Preparation

| model name | type | download | size |
| --- | --- | --- | --- |
| intern_vit_6b_224px.pth | pytorch | 🤗 HF link | 12 GB |
| intern_vit_6b_224px_head.pth | pytorch | 🤗 HF link | 25.7 MB |

Please download the above model weights and place them in the pretrained/ folder.

cd pretrained
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px.pth
wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/intern_vit_6b_224px_head.pth

The directory structure is:

pretrained
├── intern_vit_6b_224px_head.pth
└── intern_vit_6b_224px.pth
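
Because intern_vit_6b_224px.pth is about 12 GB, an interrupted download is costly. The following is a minimal sketch of a resumable variant of the wget commands above, using wget's `-c` (continue a partial download) and `-P` (target directory) options. The `download_weights` function and the `DRY_RUN` variable are assumptions introduced for illustration.

```shell
# Hypothetical helper: fetch both checkpoints into the given folder
# (default: pretrained), resuming partial downloads.
# DRY_RUN=1 prints the wget commands instead of running them
# (the target folder is still created).
download_weights() {
    dest=${1:-pretrained}
    base=https://huggingface.co/OpenGVLab/InternVL/resolve/main
    mkdir -p "$dest" || return 1
    for f in intern_vit_6b_224px.pth intern_vit_6b_224px_head.pth; do
        ${DRY_RUN:+echo} wget -c -P "$dest" "$base/$f" || return 1
    done
}

# Usage:
# DRY_RUN=1 download_weights   # show what would be downloaded
# download_weights             # download into pretrained/
```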

🔍 Linear Probing on ImageNet-1K

Warning: Please install apex before training (see installation guide for details).

To train a linear classifier for InternViT-6B on ImageNet with 8 GPUs, run:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --cfg configs/intern_vit_6b_1k_224.yaml
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --launcher slurm

Note: it is normal for the following message to appear during training, and it can be safely ignored:

_IncompatibleKeys(missing_keys=[], unexpected_keys=['clip_projector.norm1_q.weight', 'clip_projector.norm1_q.bias', 'clip_projector.norm1_k.weight', 'clip_projector.norm1_k.bias', 'clip_projector.norm1_v.weight', 'clip_projector.norm1_v.bias', 'clip_projector.cross_attn.q_bias', 'clip_projector.cross_attn.k_bias', 'clip_projector.cross_attn.v_bias', 'clip_projector.cross_attn.q.weight', 'clip_projector.cross_attn.k.weight', 'clip_projector.cross_attn.v.weight', 'clip_projector.cross_attn.proj.weight', 'clip_projector.cross_attn.proj.bias'])

📊 Evaluation

Warning: Please install apex before evaluation (see installation guide for details).

| model name | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| intern_vit_6b_1k_224.yaml | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 | ckpt \| log |

Evaluate InternViT-6B on ImageNet-1K val with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 88.230 Acc@5 98.474
Accuracy of the network on the 50000 test images: 88.2%
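
To compare a run against the expected result above, the top-1 accuracy can be pulled out of a saved log. This is a minimal sketch that assumes the summary line has the same shape as the expected output shown here (a `*` marker followed by `Acc@1`, optionally preceded by `ReaL`); `extract_acc1` and the `eval.log` file name are hypothetical.

```shell
# Hypothetical helper: read an evaluation log on stdin and print the Acc@1
# value from the last summary line (the one marked with "*").
extract_acc1() {
    sed -n 's/.*\* \(ReaL \)\{0,1\}Acc@1 \([0-9][0-9.]*\).*/\2/p' | tail -n 1
}

# Usage, assuming the evaluation output was saved to eval.log:
# extract_acc1 < eval.log
```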

Evaluate InternViT-6B on ImageNet-ReaL with 1 GPU:

Note: ImageNet-ReaL evaluation currently supports single-GPU testing only.

python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=1 GPUS_PER_NODE=1 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_real.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

* ReaL Acc@1 90.437 Acc@5 98.567 loss 0.605
ReaL Accuracy of the network on the 50000 test images: 90.4%

Evaluate InternViT-6B on ImageNetV2 with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenetv2.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 79.940 Acc@5 95.340
Accuracy of the network on the 10000 test images: 79.9%

Evaluate InternViT-6B on ImageNet-A with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_a.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 77.479 Acc@5 92.737
Accuracy of the network on the 7500 test images: 77.5%

Evaluate InternViT-6B on ImageNet-R with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_r.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 89.777 Acc@5 97.023
Accuracy of the network on the 30000 test images: 89.8%

Evaluate InternViT-6B on ImageNet-Sketch with 8 GPUs:

python -m torch.distributed.launch --nproc_per_node 8 --master_port 12345 main.py --eval \
    --cfg configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --resume pretrained/intern_vit_6b_224px_head.pth
# or manage jobs with slurm
GPUS=8 sh train_in1k.sh <partition> <job-name> configs/intern_vit_6b_1k_224_test_imagenet_sketch.yaml --eval \
    --resume pretrained/intern_vit_6b_224px_head.pth --launcher slurm

Expected results:

 * Acc@1 69.117 Acc@5 88.341
Accuracy of the network on the 50889 test images: 69.1%
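
The six evaluations above differ only in the config file and, for ImageNet-ReaL, the GPU count, so they can be chained in one loop. This is a minimal sketch built from the non-slurm commands above; the `eval_all` function and the `DRY_RUN` variable are assumptions introduced for illustration, not part of this codebase.

```shell
# Hypothetical helper: run every evaluation listed above, one after another.
# Set DRY_RUN=1 to print the commands instead of executing them.
eval_all() {
    ckpt=${1:-pretrained/intern_vit_6b_224px_head.pth}
    for name in 1k_224 1k_224_test_imagenet_real 1k_224_test_imagenetv2 \
                1k_224_test_imagenet_a 1k_224_test_imagenet_r \
                1k_224_test_imagenet_sketch; do
        # ImageNet-ReaL supports single-GPU testing only.
        case "$name" in
            *imagenet_real) gpus=1 ;;
            *)              gpus=8 ;;
        esac
        ${DRY_RUN:+echo} python -m torch.distributed.launch \
            --nproc_per_node "$gpus" --master_port 12345 main.py --eval \
            --cfg "configs/intern_vit_6b_$name.yaml" --resume "$ckpt" || return 1
    done
}

# Usage:
# DRY_RUN=1 eval_all   # list the six commands
# eval_all             # run them in sequence, stopping at the first failure
```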