---
license: mit
base_model:
- apple/aimv2-large-patch14-native
pipeline_tag: image-classification
tags:
- image-classification
- vision
library_name: transformers
---

# AIMv2-Large-Patch14-Native Image Classification

[Original AIMv2 Paper](https://arxiv.org/abs/2411.14402) | [BibTeX](#citation)

This repository contains an adapted version of the original AIMv2 model, modified to be compatible with the `AutoModelForImageClassification` class from Hugging Face Transformers, so it can be loaded directly for image classification tasks.

**This model has not been trained or fine-tuned for classification.** Fine-tune it on a labeled dataset before relying on its predictions.

## Introduction

We have adapted the original `apple/aimv2-large-patch14-native` model to work with `AutoModelForImageClassification`. The AIMv2 family consists of vision models pre-trained with a multimodal autoregressive objective.

Highlights reported for the AIMv2 models include:

1. Outperforming OAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Surpassing DINOv2 in open-vocabulary object detection and referring expression comprehension.
3. Demonstrating strong recognition performance, with AIMv2-3B achieving **89.5% on ImageNet using a frozen trunk**.

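The "frozen trunk" setting in the last highlight generally means the pre-trained backbone weights stay fixed while only a lightweight head on top is trained. The sketch below illustrates that idea on a toy PyTorch module. `ToyClassifier`, `freeze_trunk`, and the `classifier` parameter prefix are hypothetical names chosen for illustration; they are not taken from the AIMv2 code, and the actual head name in this repository's model may differ.

```python
import torch
from torch import nn


def freeze_trunk(model: nn.Module, head_prefix: str = "classifier") -> list[str]:
    """Disable gradients outside the head and return the trainable parameter names.

    Assumption: head parameters are identified by a name prefix.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


class ToyClassifier(nn.Module):
    """Stand-in for a trunk plus linear head; this is not the AIMv2 architecture."""

    def __init__(self, dim: int = 8, num_labels: int = 3) -> None:
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.classifier = nn.Linear(dim, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.trunk(x))


toy = ToyClassifier()
trainable_names = freeze_trunk(toy)

# Only the parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in toy.parameters() if p.requires_grad), lr=1e-3
)
```

With a real checkpoint, it may help to print `model.named_parameters()` first to find the actual head prefix before freezing anything.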
## Usage

### PyTorch

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

checkpoint = "amaye15/aimv2-large-patch14-native-image-classification"
processor = AutoImageProcessor.from_pretrained(checkpoint)
# trust_remote_code=True allows Transformers to run the custom modeling code in the repository
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    trust_remote_code=True,
)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)

# Get predicted class
predictions = outputs.logits.softmax(dim=-1)
predicted_class = predictions.argmax(-1).item()

print(f"Predicted class: {model.config.id2label[predicted_class]}")
```

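The snippet above reports only the single most likely class. As a minimal sketch of what the softmax and argmax steps compute, and of how to list several candidates instead of one, the helper below ranks the top `k` labels. It is written in plain Python so it runs without downloading the model; `softmax`, `top_k`, and the example logits and labels are illustrative assumptions, not part of this repository.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits to probabilities, shifting by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits: list[float], id2label: dict[int, str], k: int = 5) -> list[tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, highest first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and labels, used only to exercise the helper.
example_logits = [2.0, 0.5, -1.0]
example_labels = {0: "cat", 1: "dog", 2: "remote"}
top2 = top_k(example_logits, example_labels, k=2)

for label, prob in top2:
    print(f"{label}: {prob:.3f}")
```

With the model loaded as above, the inputs would be `outputs.logits[0].tolist()` and `model.config.id2label`.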
## Model Details

- **Model Name**: `amaye15/aimv2-large-patch14-native-image-classification`
- **Original Model**: `apple/aimv2-large-patch14-native`
- **Adaptation**: Modified to be compatible with `AutoModelForImageClassification` for direct use in image classification tasks.
- **Framework**: PyTorch

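Because the adapted checkpoint has not been fine-tuned, preparing it for a labeled dataset usually starts with defining the label mappings stored in the model config. The sketch below shows one way this might look. The class names are hypothetical, and whether the repository's custom modeling code honors `num_labels` and resizes the head is an assumption that should be verified before training.

```python
def build_label_maps(class_names: list[str]) -> tuple[dict[int, str], dict[str, int]]:
    """Create the id2label / label2id dictionaries that Transformers configs expect."""
    id2label = {index: name for index, name in enumerate(class_names)}
    label2id = {name: index for index, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(
    class_names: list[str],
    checkpoint: str = "amaye15/aimv2-large-patch14-native-image-classification",
):
    """Load the checkpoint with a head sized for class_names (downloads weights).

    Assumption: the remote modeling code respects num_labels; this is not
    confirmed by the model card.
    """
    # Imported lazily so the label helper above works without Transformers installed.
    from transformers import AutoModelForImageClassification

    id2label, label2id = build_label_maps(class_names)
    return AutoModelForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # tolerate a head whose shape differs from the checkpoint
        trust_remote_code=True,
    )


# Hypothetical class names, for illustration only.
id2label, label2id = build_label_maps(["cat", "dog", "remote"])
```

Calling `load_for_finetuning(["cat", "dog", "remote"])` would then return a model ready to pass to a training loop or the Transformers `Trainer`.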
## Citation

If you use this model or find it helpful, please consider citing the original AIMv2 paper:

```bibtex
@article{fini2024aimv2,
  title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
  author={Fini, Enrico and Shukor, Mustafa and others},
  journal={arXiv preprint arXiv:2411.14402},
  year={2024}
}
```