---
license: apache-2.0
datasets:
- deepghs/danbooru2024
tags:
- Classification
pipeline_tag: image-classification
---
# DT24-Tiny
A multi-label image classification model trained on [deepghs/danbooru2024](https://huggingface.co/datasets/deepghs/danbooru2024), designed to tag anime-style illustrations with a vocabulary of **10,000 tags**.
The model uses **ConvNeXt V2 Tiny** as the backbone and runs at 448px resolution, balancing speed and accuracy.
| Attribute | Details |
| :--- | :--- |
| **Model Architecture** | ConvNeXt V2 Tiny + GeM Pooling |
| **Resolution** | 448 x 448 (Letterbox Padding) |
| **Vocabulary** | Top 10,000 Tags (Danbooru) |
| **Format** | SafeTensors (`model.safetensors`) |
## ๐Ÿš€ Quick Start (Inference)
You need `torch`, `torchvision`, `timm`, `Pillow`, `huggingface_hub`, and `safetensors` installed.
```bash
pip install torch torchvision timm pillow huggingface_hub safetensors
```
### Python Inference Script
Since this model uses a custom head (GeM Pooling + Linear), you need to define the class structure before loading weights.
```python
import json

import timm
import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from PIL import Image
from safetensors.torch import load_file
from torchvision import transforms

# --- 1. Define Architecture ---
class GeM(nn.Module):
    """Generalized Mean (GeM) pooling with a learnable exponent p."""
    def __init__(self, p=3, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        return F.avg_pool2d(
            x.clamp(min=self.eps).pow(self.p), (x.size(-2), x.size(-1))
        ).pow(1.0 / self.p)

class DT24Tiny(nn.Module):
    def __init__(self, num_classes=10000):
        super().__init__()
        # Load backbone without head or pooling
        self.backbone = timm.create_model(
            "convnextv2_tiny.fcmae_ft_in1k",
            pretrained=False, num_classes=0, global_pool="",
        )
        self.pooling = GeM()
        self.head = nn.Linear(768, num_classes)

    def forward(self, x):
        # Sigmoid is applied here for multi-label classification
        return torch.sigmoid(self.head(self.pooling(self.backbone(x)).flatten(1)))

# --- 2. Load Model & Tags ---
REPO_ID = "igidn/DT24-Tiny"

# Load tags (tags.json maps Tag Name -> Index)
tag_path = hf_hub_download(repo_id=REPO_ID, filename="tags.json")
with open(tag_path, "r") as f:
    tag_map = json.load(f)
idx_to_tag = {v: k for k, v in tag_map.items()}

# Load weights (SafeTensors)
model_path = hf_hub_download(repo_id=REPO_ID, filename="model.safetensors")
state_dict = load_file(model_path)

model = DT24Tiny(num_classes=len(tag_map))
model.load_state_dict(state_dict)
model.eval()

# --- 3. Preprocessing (Letterbox Pad) ---
class LetterboxPad:
    def __init__(self, size):
        self.size = size

    def __call__(self, img):
        w, h = img.size
        scale = self.size / max(w, h)
        new_w, new_h = int(w * scale), int(h * scale)
        img = img.resize((new_w, new_h), Image.BICUBIC)
        # Center the resized image on a black square canvas
        new_img = Image.new("RGB", (self.size, self.size), (0, 0, 0))
        new_img.paste(img, ((self.size - new_w) // 2, (self.size - new_h) // 2))
        return new_img

transform = transforms.Compose([
    LetterboxPad(448),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# --- 4. Prediction ---
def predict(image_path, threshold=0.60):
    img = Image.open(image_path).convert("RGB")
    tensor = transform(img).unsqueeze(0)
    with torch.no_grad():
        probs = model(tensor)[0]
    # Keep tags scoring above the threshold, sorted by confidence
    results = {
        idx_to_tag[idx]: score.item()
        for idx, score in enumerate(probs)
        if score > threshold
    }
    return dict(sorted(results.items(), key=lambda item: item[1], reverse=True))

# Test
# print(predict("test_image.jpg"))
```
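The `GeM` layer defined above interpolates between average and max pooling. A quick sanity check on a toy tensor (independent of the model weights) shows that with `p=3` a single strong activation survives pooling much better than under plain averaging:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    # Same definition as in the inference script above
    def __init__(self, p=3, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        return F.avg_pool2d(
            x.clamp(min=self.eps).pow(self.p), (x.size(-2), x.size(-1))
        ).pow(1.0 / self.p)

# One channel with a single strong activation among near-zero ones
x = torch.zeros(1, 1, 4, 4)
x[0, 0, 0, 0] = 1.0

gem = GeM(p=3)
with torch.no_grad():
    gem_out = gem(x).item()   # ~0.40: the peak dominates
avg_out = x.mean().item()     # 0.0625: the peak is washed out

print(f"avg: {avg_out:.4f}, gem: {gem_out:.4f}")
```

This is why the model swaps Global Average Pooling for GeM: small but confident activations (e.g. a tiny accessory) contribute more to the pooled feature.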
## ๐Ÿ›  Model Details
### Training Data
* **Dataset:** `deepghs/danbooru2024`
* **Selection:** Top 10,000 most frequent tags.
### Preprocessing
Unlike standard resizing, which distorts the aspect ratio, this model uses **Letterbox Padding**:
1. Resize the longest edge to 448px, preserving aspect ratio.
2. Center-paste the image onto a black 448x448 canvas.
3. Apply standard ImageNet normalization.
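The resize-and-pad arithmetic from steps 1–2 can be sketched as a small standalone helper (`letterbox_geometry` is a hypothetical name for illustration; the actual preprocessing is the `LetterboxPad` transform in the script above):

```python
def letterbox_geometry(w, h, size=448):
    """Return the resized (w, h) and the top-left paste offset for
    letterboxing a w x h image onto a size x size canvas."""
    scale = size / max(w, h)                      # longest edge -> size
    new_w, new_h = int(w * scale), int(h * scale)
    off_x = (size - new_w) // 2                   # center horizontally
    off_y = (size - new_h) // 2                   # center vertically
    return (new_w, new_h), (off_x, off_y)

# A 1000x500 landscape image: width -> 448, height -> 224,
# padded with 112px black bars on top and bottom
print(letterbox_geometry(1000, 500))  # ((448, 224), (0, 112))
```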
### Architecture Nuances
* **Backbone:** ConvNeXt V2 Tiny (Pretrained on ImageNet-1k).
* **Pooling:** Replaced standard Global Average Pooling with **GeM (Generalized Mean Pooling)**. This allows the model to better focus on salient features (like small accessories) rather than washing them out.
* **Head:** A single Linear layer mapping 768 features to 10,000 tags.
* **Loss:** Trained with **Asymmetric Loss (ASL)** to handle the extreme class imbalance of sparse tagging.
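ASL focuses the gradient on hard examples and discards easy negatives, which matters when nearly all of the 10,000 labels are negative for any given image. The training hyperparameters are not published in this repo, so the sketch below uses the common defaults from the ASL paper (`gamma_neg=4`, `gamma_pos=0`, `clip=0.05`) purely as assumptions:

```python
import torch

def asymmetric_loss(probs, targets, gamma_neg=4.0, gamma_pos=0.0,
                    clip=0.05, eps=1e-8):
    """Minimal Asymmetric Loss over sigmoid probabilities.
    gamma_neg/gamma_pos/clip are assumed defaults, not this model's
    published training values."""
    # Probability shifting: easy negatives (p < clip) get exactly zero loss
    probs_neg = (1 - probs + clip).clamp(max=1.0)
    # Asymmetric focusing: down-weight easy examples on each side
    loss_pos = targets * torch.log(probs.clamp(min=eps)) * (1 - probs) ** gamma_pos
    loss_neg = (1 - targets) * torch.log(probs_neg.clamp(min=eps)) * (1 - probs_neg) ** gamma_neg
    return -(loss_pos + loss_neg).mean()

# Toy batch: 2 samples x 4 tags
probs = torch.tensor([[0.9, 0.1, 0.6, 0.05],
                      [0.2, 0.8, 0.3, 0.01]])
targets = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 1.0, 0.0, 0.0]])
print(asymmetric_loss(probs, targets))
```

Note the clipping: a negative tag predicted below `clip` contributes nothing, so the loss is not dominated by the thousands of trivially absent tags per image.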
## ๐Ÿ“‚ Files in Repo
* `model.safetensors`: The FP16 trained weights (use this for inference).
* `config.json`: Basic configuration parameters.
* `tags.json`: The mapping of `Tag Name -> Index`.
* `optimizer.pt`: (Optional) Optimizer state, only needed if you plan to resume training this model.
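Since `tags.json` stores `Tag Name -> Index`, it must be inverted once at load time so model output indices can be mapped back to tag names (a three-entry toy mapping stands in for the real 10,000-entry file):

```python
# Toy stand-in for the contents of tags.json (Tag Name -> Index)
tag_map = {"1girl": 0, "solo": 1, "smile": 2}

# Invert once: Index -> Tag Name, used when reading off predictions
idx_to_tag = {v: k for k, v in tag_map.items()}

print(idx_to_tag[2])  # smile
```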
## โš–๏ธ License
This model is released under the **Apache 2.0** license.