Update model card
README.md
ADDED
@@ -0,0 +1,71 @@
---
library_name: lucid
license: bsd-3-clause
tags:
  - image-classification
  - vit
  - lucid
datasets:
  - imagenet-1k
pipeline_tag: image-classification
model-index:
  - name: vit-base-16
    results:
      - task: { type: image-classification }
        dataset: { name: ImageNet-1K, type: imagenet-1k }
        metrics:
          - { type: acc@1, value: 81.072 }
          - { type: acc@5, value: 95.318 }
---

# ViT-Base/16

> Dosovitskiy et al., 2021. *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale* (arXiv:2010.11929)

[Lucid](https://github.com/ChanLumerico/lucid) port of `torchvision/ViT_B_16_Weights.IMAGENET1K_V1`,
converted to Lucid-native safetensors.

## Available weights

| Tag | acc@1 (%) | acc@5 (%) | Params | GFLOPs | Size | Source |
|---|---|---|---|---|---|---|
| `IMAGENET1K_V1` *(default)* | 81.072 | 95.318 | 86.6M | 17.564 | 330.24 MB | torchvision |

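The Params and Size columns can be sanity-checked with a little arithmetic. The sketch below assumes the standard ViT-Base/16 configuration from the paper (hidden size 768, 12 encoder layers, MLP width 3072, 224x224 input, 1000 classes) and a torchvision-style layer layout with biases everywhere and a fused query/key/value projection. None of these values are read from Lucid itself, so treat the result as an estimate.

```python
# Assumed ViT-Base/16 hyperparameters (from the paper, not read from Lucid).
HIDDEN, LAYERS, MLP_DIM = 768, 12, 3072
PATCH, IMAGE, CHANNELS, CLASSES = 16, 224, 3, 1000


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(dim: int) -> int:
    """Scale plus shift of a layer normalisation."""
    return 2 * dim


tokens = (IMAGE // PATCH) ** 2 + 1  # 14 x 14 patches plus one class token

patch_embed = linear(CHANNELS * PATCH * PATCH, HIDDEN)  # 16x16 conv seen as a dense layer
class_token = HIDDEN
pos_embed = tokens * HIDDEN

encoder_layer = (
    layer_norm(HIDDEN)
    + linear(HIDDEN, 3 * HIDDEN)  # fused query/key/value projection
    + linear(HIDDEN, HIDDEN)      # attention output projection
    + layer_norm(HIDDEN)
    + linear(HIDDEN, MLP_DIM)     # MLP expansion
    + linear(MLP_DIM, HIDDEN)     # MLP contraction
)

total = (
    patch_embed
    + class_token
    + pos_embed
    + LAYERS * encoder_layer
    + layer_norm(HIDDEN)          # final norm
    + linear(HIDDEN, CLASSES)     # classification head
)

print(f"{total:,} parameters ({total / 1e6:.1f}M)")
print(f"{total * 4 / 2**20:.2f} MiB as float32, excluding file metadata")
```

Under these assumptions the count rounds to 86.6M, and four bytes per parameter gives roughly 330 MiB, which is consistent with the table row above.
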
## Usage

```python
import lucid.models as models
from lucid.models.vision.vit import VitBase16Weights

# default tag
model = models.vit_base_16_cls(pretrained=True)

# explicit tag (enum or string)
model = models.vit_base_16_cls(weights=VitBase16Weights.IMAGENET1K_V1)
model = models.vit_base_16_cls(pretrained="IMAGENET1K_V1")

# preprocessing travels with the weights
weights = VitBase16Weights.IMAGENET1K_V1
preprocess = weights.transforms()
# `image` is an input image supplied by the caller; [None] adds the batch dimension
logits = model(preprocess(image)[None]).logits
```

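The logits returned above still need to be turned into class predictions. The following is a minimal, framework-free sketch of that step, and of how top-k accuracy figures such as acc@1 and acc@5 are computed. It assumes the logits have already been converted to plain Python lists of floats; how to convert a Lucid tensor is not shown here, and the helper names `softmax`, `top_k` and `top_k_accuracy` are illustrative rather than part of Lucid.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=5):
    """Return [(class_index, probability), ...] for the k most likely classes."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


def top_k_accuracy(batch_logits, labels, k):
    """Fraction of samples whose true label is among the k most likely classes."""
    hits = sum(
        label in {i for i, _ in top_k(row, k)}
        for row, label in zip(batch_logits, labels)
    )
    return hits / len(labels)


# Toy batch: 2 samples, 4 classes (an ImageNet-1K head would emit 1000 logits).
toy_logits = [[0.1, 2.0, 0.3, 1.5], [3.0, 0.2, 0.1, 0.0]]
toy_labels = [3, 0]

best_two = [[i for i, _ in top_k(row, k=2)] for row in toy_logits]
acc1 = top_k_accuracy(toy_logits, toy_labels, k=1)
acc2 = top_k_accuracy(toy_logits, toy_labels, k=2)
print(best_two, acc1, acc2)
```

In the toy batch the first sample is wrong at k=1 but correct at k=2, which is why the two accuracies differ.
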
## Conversion

Converted from `torchvision/ViT_B_16_Weights.IMAGENET1K_V1` via
`python -m tools.convert_weights vit_base_16 --tag IMAGENET1K_V1`.
Key mapping and numerical parity were verified against the source weights.

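As a rough illustration of what a key-mapping and parity check can look like, here is a small sketch. It is not the logic of `tools.convert_weights`: the target key names in `RENAMES`, the helper names, and the `1e-5` tolerance are all hypothetical, and values are plain Python lists rather than tensors so the example stays self-contained.

```python
# Hypothetical rename rules: (source prefix, target prefix). The source prefixes
# follow torchvision's ViT naming; the target names are made up for illustration.
RENAMES = [
    ("encoder.layers.encoder_layer_", "blocks."),
    ("heads.head.", "head."),
]


def remap_keys(state, renames=RENAMES):
    """Rename parameter keys by prefix, refusing silent collisions."""
    remapped = {}
    for key, value in state.items():
        for old, new in renames:
            if key.startswith(old):
                key = new + key[len(old):]
                break
        if key in remapped:
            raise KeyError(f"duplicate key after remapping: {key}")
        remapped[key] = value
    return remapped


def max_abs_diff(reference, candidate):
    """Largest element-wise gap between two equally long flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def outputs_match(reference, candidate, atol=1e-5):
    """True when every element agrees within the (assumed) absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol


source = {
    "encoder.layers.encoder_layer_0.ln_1.weight": [1.0, 1.0],
    "heads.head.bias": [0.0],
}
converted = remap_keys(source)
parity_ok = outputs_match([0.12, -3.4], [0.12 + 1e-7, -3.4])
print(sorted(converted), parity_ok)
```

A parity check of this kind is usually run on the outputs of both models for the same input, so a mismatch points to a mapping or layout error rather than to rounding noise.
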
## License

`bsd-3-clause`, inherited from the original weights.

## Citation

```bibtex
@inproceedings{dosovitskiy2021image,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  booktitle={ICLR},
  year={2021}
}
```