ViT-Flower: Vision Transformer for Flower Classification

Model Description

ViT-Flower is a Vision Transformer (ViT-Base-Patch16-384) fine-tuned for flower classification. It assigns each image to one of 152 flower categories, and every category is labeled with both a Chinese and an English name.

Model Architecture: Vision Transformer (ViT-Base-Patch16-384)
Task: Image Classification
Number of Classes: 152
Input Resolution: 384 × 384 pixels
Framework: PyTorch + Transformers

Model Architecture

ViT-Base-Patch16-384 Backbone
├── Patch Embedding: 16×16 patches → 768 dim
├── Transformer Encoder: 12 blocks
│   └── Each block: Multi-Head Self-Attention + MLP
└── CLS Token → 768-dim feature
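
The number of tokens the encoder processes follows directly from the input resolution and patch size listed above. The short sketch below works through that arithmetic; the variable names are illustrative and are not taken from the model code.

```python
# Sequence-length arithmetic for a ViT with 16x16 patches at 384x384 input.
image_size = 384
patch_size = 16
channels = 3
embed_dim = 768

patches_per_side = image_size // patch_size             # 384 / 16 = 24
num_patches = patches_per_side ** 2                     # 24 * 24 = 576
seq_len = num_patches + 1                               # plus one CLS token = 577
values_per_patch = patch_size * patch_size * channels   # 16 * 16 * 3 = 768

print(f"patches per side: {patches_per_side}")
print(f"patch tokens:     {num_patches}")
print(f"sequence length:  {seq_len} (including CLS)")
print(f"raw values per patch: {values_per_patch} -> projected to {embed_dim} dims")
```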

Classification Head
├── Linear: 768 → 1024
├── BatchNorm1d: 1024
├── GELU Activation
├── Dropout: 0.4
└── Linear: 1024 → 152
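
The head listed above can be written as a small PyTorch module. This is a minimal sketch built only from the layer list; the function name `build_head` and the use of `nn.Sequential` are assumptions, and the actual checkpoint may organize or name these layers differently.

```python
import torch
import torch.nn as nn

def build_head(in_features: int = 768, hidden: int = 1024,
               num_classes: int = 152, dropout: float = 0.4) -> nn.Sequential:
    """Classification head following the layer list above (names are illustrative)."""
    return nn.Sequential(
        nn.Linear(in_features, hidden),
        nn.BatchNorm1d(hidden),
        nn.GELU(),
        nn.Dropout(dropout),
        nn.Linear(hidden, num_classes),
    )

head = build_head()
head.eval()  # eval mode: BatchNorm uses running statistics, Dropout is disabled

cls_features = torch.randn(2, 768)  # stand-in for the backbone's CLS feature
with torch.no_grad():
    logits = head(cls_features)
print(logits.shape)  # torch.Size([2, 152])
```

Calling `head.eval()` matters at inference time because both BatchNorm1d and Dropout behave differently in training mode.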

Usage

Using Transformers

import torch
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image

# Load model and processor
model_name = "your-username/Vit-Flower"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForImageClassification.from_pretrained(model_name)

# Load and process image (convert to RGB in case the file is grayscale or RGBA)
image = Image.open("flower.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Inference (no gradients needed)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_id = logits.argmax(-1).item()
# id2label keys are integers once the config has been loaded
predicted_label = model.config.id2label[predicted_id]

print(f"Predicted class: {predicted_label}")
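
With 152 classes, it is often useful to inspect several candidates rather than only the arg-max. The sketch below turns logits into a ranked top-k list; `top_k_predictions` is a hypothetical helper, and the demo logits and labels are stand-ins so the snippet runs without downloading the model.

```python
import torch

def top_k_predictions(logits: torch.Tensor, id2label: dict, k: int = 5):
    """Return [(label, probability), ...] for the k most likely classes.

    `logits` has shape (1, num_classes); `id2label` maps int class index -> name.
    """
    probs = torch.softmax(logits, dim=-1)[0]
    top = torch.topk(probs, k=min(k, probs.numel()))
    return [(id2label[i], p)
            for p, i in zip(top.values.tolist(), top.indices.tolist())]

# Stand-in data (not real model output).
demo_logits = torch.tensor([[0.1, 2.0, -1.0, 0.5]])
demo_id2label = {0: "class_a", 1: "class_b", 2: "class_c", 3: "class_d"}

for label, prob in top_k_predictions(demo_logits, demo_id2label, k=3):
    print(f"{label}: {prob:.3f}")
```

With the loaded model, the equivalent call would be `top_k_predictions(outputs.logits, model.config.id2label)`.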

Using timm + safetensors

import torch
from safetensors.torch import load_file
import timm
from PIL import Image
from torchvision import transforms

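# Load model
# Note: the stock timm model ends in a single Linear(768 -> 152) head. If the
# checkpoint stores the two-layer head described above, load_state_dict will
# raise on mismatched keys and model.head must be rebuilt to match first.
model = timm.create_model('vit_base_patch16_384', pretrained=False, num_classes=152)
state_dict = load_file('Vit-Flower.safetensors')
model.load_state_dict(state_dict)
model.eval()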

# Preprocessing
transform = transforms.Compose([
    transforms.Resize((384, 384)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

# Inference
image = Image.open("flower.jpg").convert('RGB')
input_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(input_tensor)
    predicted_id = output.argmax(-1).item()

Input Preprocessing

Parameter	Value
Image Size	384 × 384
Mean	[0.5, 0.5, 0.5]
Std	[0.5, 0.5, 0.5]
Format	RGB
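
A mean and standard deviation of 0.5 per channel map pixel values into the range [-1, 1]. The sketch below shows the arithmetic for a single 8-bit value, assuming the usual scale-to-[0, 1] step happens first (as `ToTensor` does); `normalize_pixel` is an illustrative helper, not part of the model code.

```python
def normalize_pixel(value_8bit: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Scale 0..255 to 0..1, then apply (x - mean) / std."""
    scaled = value_8bit / 255.0
    return (scaled - mean) / std

for v in (0, 128, 255):
    print(v, "->", round(normalize_pixel(v), 4))
# 0 -> -1.0
# 128 -> 0.0039
# 255 -> 1.0
```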

Model Files

File	Description
`Vit-Flower.safetensors`	Model weights in safetensors format
`config.json`	Model configuration with id2label mapping
`preprocessor_config.json`	Image preprocessing configuration
`id2label_full.json`	Complete label mapping (category_id, chinese_name, english_name)

Label Categories (152 Classes)

Index	Category ID	Chinese Name	English Name
0	164	紫叶竹节秋海棠（紫竹梅）	Tradescantia pallida
1	165	龙牙草（仙鹤草）	Agrimonia eupatoria
2	166	络石（风车茉莉）	Trachelospermum jasminoides
3	167	湖北荚蒾	Eriocapitella hupehensis
4	168	桃叶风铃草	Campanula persicifolia
5	169	旱金莲	Tropaeolum majus
6	170	田野珍珠菜	Lysimachia arvensis
7	171	白花雪果	Symphoricarpos albus
8	172	吊兰	Chlorophytum comosum
9	173	啤酒花	Humulus lupulus
...	...	...	...
48	18	玫瑰	Rosa rugosa
...	...	...	...
151	1891	三色堇	Viola × wittrockiana

See id2label_full.json for complete 152 class mappings.
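
A predicted index can be turned into a readable label by looking it up in the mapping. The sketch below assumes a layout in which each index (as a string key) maps to an object with `category_id`, `chinese_name`, and `english_name` fields; the real layout of `id2label_full.json` may differ, and `describe_prediction` is a hypothetical helper. The two sample entries are copied from the table above.

```python
import json

# Hypothetical excerpt: the actual structure of id2label_full.json may differ.
sample_json = """
{
  "0": {"category_id": 164, "chinese_name": "紫叶竹节秋海棠（紫竹梅）", "english_name": "Tradescantia pallida"},
  "5": {"category_id": 169, "chinese_name": "旱金莲", "english_name": "Tropaeolum majus"}
}
"""

def describe_prediction(mapping: dict, predicted_id: int) -> str:
    """Format a model output index as 'English name / Chinese name (category N)'."""
    entry = mapping[str(predicted_id)]  # JSON object keys are always strings
    return (f"{entry['english_name']} / {entry['chinese_name']} "
            f"(category {entry['category_id']})")

mapping = json.loads(sample_json)
print(describe_prediction(mapping, 5))
```

If the file does follow this layout, it could be read with `json.load(open("id2label_full.json", encoding="utf-8"))` in place of the inline sample.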

Training Details

Parameter	Value
Epochs	40
Learning Rate	1.5e-4
Optimizer	AdamW
Weight Decay	8e-5
Batch Size	32
Warmup Epochs	6
Frozen Blocks	10 of 12
Loss Function	FocalCrossEntropyLoss (α=0.25, γ=2)
Label Smoothing	0.1
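
`FocalCrossEntropyLoss` is not a built-in PyTorch class, and the table does not say how it combines with label smoothing. The sketch below shows one plausible formulation, assuming the focal factor α(1 − p_t)^γ multiplies a label-smoothed cross-entropy per sample; it is not the author's implementation, and `focal_cross_entropy` is an illustrative name.

```python
import torch
import torch.nn.functional as F

def focal_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                        alpha: float = 0.25, gamma: float = 2.0,
                        label_smoothing: float = 0.1) -> torch.Tensor:
    """Focal-weighted cross-entropy (one possible formulation, assumed here)."""
    # Per-sample cross-entropy against smoothed targets.
    ce = F.cross_entropy(logits, targets, reduction="none",
                         label_smoothing=label_smoothing)
    # Probability the model assigns to the true class.
    p_t = torch.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    # Down-weight easy examples (p_t close to 1) by (1 - p_t)^gamma.
    focal_weight = alpha * (1.0 - p_t) ** gamma
    return (focal_weight * ce).mean()

logits = torch.randn(4, 152)                 # stand-in batch of 4 predictions
targets = torch.tensor([0, 48, 100, 151])    # stand-in class indices
loss = focal_cross_entropy(logits, targets)
print(loss.item())
```

Setting `gamma=0` and `alpha=1` reduces this to ordinary label-smoothed cross-entropy, which gives a quick sanity check on the formulation.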

Limitations

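Input images must be in RGB format; convert grayscale or RGBA images before inference.
Performance is best on flower images that resemble the training distribution.
The model expects 384×384 input, so images of other sizes must be resized first.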

Citation

@inproceedings{dosovitskiy2021vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and others},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2021}
}

License

Apache 2.0
