---
language: en
tags:
- vision
license: apache-2.0
---
# Model Card for Mars ViT Base Model
## Model Architecture
- Architecture: Vision Transformer (ViT) Base
- Input Channels: 1 (grayscale images)
- Number of Classes: 0 (feature extraction)
## Training Method
- Method: Masked Autoencoder (MAE)
- Dataset: 2 million Mars CTX (Context Camera) images
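For background, MAE pretraining hides a large fraction of patch tokens and trains the network to reconstruct the missing pixels from the visible ones. A minimal sketch of the random-masking step (the 0.75 mask ratio is the common MAE default, not stated on this card):
```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; MAE reconstructs the rest."""
    batch, num_tokens, dim = tokens.shape
    num_keep = int(num_tokens * (1 - mask_ratio))
    noise = torch.rand(batch, num_tokens, device=tokens.device)  # per-token noise
    keep_idx = noise.argsort(dim=1)[:, :num_keep]                # random subset to keep
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)
    return torch.gather(tokens, dim=1, index=keep_idx)

# Example: 196 patch tokens (14x14 grid for 224px images, patch size 16), keep 25%
tokens = torch.randn(1, 196, 768)
visible = random_masking(tokens)
print(visible.shape)  # torch.Size([1, 49, 768])
```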
## Usage Examples
### Using timm (recommended)
First, download `checkpoint-1199.pth` (backbone weights only).
```python
import timm
import torch

model = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,
    num_classes=0,
    global_pool='',
    checkpoint_path="./checkpoint-1199.pth",  # must be a local path
)
model.eval()

# Input images must be single-channel, resized to 224x224, and normalized.
# Example transform:
# transform = transforms.Compose([
#     transforms.ToTensor(),
#     transforms.Resize((224, 224)),
#     transforms.Grayscale(num_output_channels=1),
#     transforms.Normalize(mean=[0.5], std=[0.5]),
# ])

x = torch.randn(1, 1, 224, 224)
with torch.no_grad():
    features = model.forward_features(x)  # shape [1, num_tokens, embed_dim]
print(features.shape)
cls_token = features[:, 0]      # [CLS] token embedding
patch_tokens = features[:, 1:]  # per-patch embeddings
```
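A hedged end-to-end sketch that applies the transform from the comments above to a real image and mean-pools the patch tokens into a single embedding. It reuses `model` and `torch` from the block above; `ctx_tile.png` is a placeholder filename, and mean pooling is one common choice rather than anything this card prescribes:
```python
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Grayscale(num_output_channels=1),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])

img = Image.open("ctx_tile.png").convert("RGB")  # placeholder path; Grayscale below reduces to 1 channel
x = transform(img).unsqueeze(0)                  # [1, 1, 224, 224]
with torch.no_grad():
    feats = model.forward_features(x)
embedding = feats[:, 1:].mean(dim=1)             # mean-pooled patch tokens, [1, 768]
```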
### Using transformers
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

model = AutoModel.from_pretrained("jfang/mars-vit-base-ctx2m")
image_processor = AutoImageProcessor.from_pretrained("jfang/mars-vit-base-ctx2m")

# Example usage
image = Image.open("some_image.png").convert("L")  # 1-channel grayscale
inputs = image_processor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
```
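The output exposes `last_hidden_state` of shape `[1, num_tokens, hidden_size]`. A short sketch of pulling per-image embeddings out of it, assuming the standard ViT token layout (a [CLS] token followed by patch tokens), which this card does not spell out:
```python
hidden = outputs.last_hidden_state          # [1, num_tokens, hidden_size]
cls_embedding = hidden[:, 0]                # [CLS] token, standard ViT convention
mean_embedding = hidden[:, 1:].mean(dim=1)  # mean of patch tokens
```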
## MAE reconstruction
The `./mae` folder contains the full encoder-decoder MAE model and a notebook for visualizing reconstructions.
## Limitations
The model is trained specifically on CTX images and may not generalize well to other types of images without further fine-tuning.
The model is designed for feature extraction and does not include a classification head.