---
language: en
tags:
- vision
license: apache-2.0
---

# Model Card for Mars ViT Base Model

## Model Architecture
- Architecture: Vision Transformer (ViT) Base
- Input Channels: 1 (grayscale images)
- Number of Classes: 0 (feature extraction only)
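
The token layout implied by this configuration can be worked out by hand. The sketch below assumes the 16-pixel patch size and 224-pixel input size encoded in the timm model name `vit_base_patch16_224`, plus the standard ViT-Base embedding width of 768; the variable names are illustrative only.

```python
# Token arithmetic for a ViT-Base with 16x16 patches on a 224x224 grayscale input.
# Image size, patch size, and embedding width are assumptions taken from the
# standard `vit_base_patch16_224` configuration, not values stated in this card.
image_size = 224
patch_size = 16
in_chans = 1
embed_dim = 768

patches_per_side = image_size // patch_size             # 14 patches along each axis
num_patches = patches_per_side ** 2                     # 196 patch tokens
num_tokens = num_patches + 1                            # plus one class token
pixels_per_patch = patch_size * patch_size * in_chans   # values flattened per patch

print(f"tokens: {num_tokens}, embed_dim: {embed_dim}, pixels per patch: {pixels_per_patch}")
```

With a single input channel, each patch flattens to 256 values instead of the 768 an RGB patch would have; the transformer width is unaffected.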

## Training Method
- Method: Masked Autoencoder (MAE)
- Dataset: 2 million CTX (Mars Reconnaissance Orbiter Context Camera) images
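
For readers unfamiliar with MAE pre-training, the core idea is to hide a random subset of patches and train the network to reconstruct them from the visible ones. The following is a minimal sketch of that random masking step only. The 0.75 mask ratio is the default from the original MAE paper and is an assumption here, since this card does not state the ratio used; `random_masking` is a hypothetical helper, not part of this repository.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets, MAE-style.

    mask_ratio=0.75 is an assumed default, not a value confirmed by this card.
    """
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)                      # random permutation of patch indices
    visible = sorted(order[:num_keep])      # patches the encoder sees
    masked = sorted(order[num_keep:])       # patches the decoder must reconstruct
    return visible, masked


visible, masked = random_masking(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

Only the visible patches pass through the encoder during pre-training, which is why the released backbone can be used on full images afterwards without any masking.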

## Usage Examples
### Using timm (currently recommended)

First, download `checkpoint-1199.pth` (backbone only).

```python
import timm
import torch

model = timm.create_model(
    'vit_base_patch16_224',
    in_chans=1,
    num_classes=0,
    global_pool='',
    checkpoint_path="./checkpoint-1199.pth",  # must be a local path
)

model.eval()

# Input images must be converted to a single channel, resized to 224x224, and normalized.

# Example transform (requires `from torchvision import transforms`):
# transform = transforms.Compose([
#     transforms.ToTensor(),
#     transforms.Resize((224, 224)),
#     transforms.Grayscale(num_output_channels=1),
#     transforms.Normalize(mean=[0.5], std=[0.5]),
# ])
x = torch.randn(1, 1, 224, 224)  # random stand-in for a preprocessed image batch
with torch.no_grad():
    features = model.forward_features(x)  # shape [1, tokens, embed_dim]
print(features.shape)

cls_token = features[:, 0]      # class token embedding
patch_tokens = features[:, 1:]  # per-patch embeddings
```
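
The `cls_token` and `patch_tokens` tensors can be turned into the two forms downstream code usually needs: one vector per image, and a spatial feature map. The sketch below uses a random tensor as a stand-in for the model output so it runs without the checkpoint. It assumes one class token followed by 196 patch tokens in row-major order, which matches a standard ViT-Base at 224x224 with 16x16 patches but is not confirmed by this card.

```python
import math

import torch

# Stand-in for `model.forward_features(x)`: 1 class token + 14*14 patch tokens.
# The [1, 197, 768] shape is an assumption based on the standard ViT-Base layout.
features = torch.randn(1, 197, 768)

cls_token = features[:, 0]        # [1, 768]
patch_tokens = features[:, 1:]    # [1, 196, 768]

# Option 1: a single global descriptor, averaged over all patch tokens.
mean_embedding = patch_tokens.mean(dim=1)  # [1, 768]

# Option 2: a dense feature map, e.g. as input to a segmentation or detection head.
batch, num_patches, dim = patch_tokens.shape
side = math.isqrt(num_patches)             # 14 patches per side
feature_map = patch_tokens.transpose(1, 2).reshape(batch, dim, side, side)

print(mean_embedding.shape, feature_map.shape)
```

Whether the class token or the mean of the patch tokens works better as an image descriptor depends on the downstream task, so it is worth trying both.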

### Using transformers

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model = AutoModel.from_pretrained("jfang/mars-vit-base-ctx2m")
image_processor = AutoImageProcessor.from_pretrained("jfang/mars-vit-base-ctx2m")
model.eval()

# Example usage
image = Image.open("some_image.png").convert("L")  # 1-channel
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
```

## MAE reconstruction
The `./mae` folder contains the full encoder-decoder MAE model and a notebook for visualization.

## Limitations
- The model was trained only on CTX images and may not generalize well to other image types without further fine-tuning.
- The model is designed for feature extraction and does not include a classification head.
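
Because no classification head is included, a downstream task needs its own. One common option is a linear probe trained on frozen backbone features. The sketch below shows a single training step under assumed values: `num_classes = 5`, the batch size, and the learning rate are hypothetical, and random tensors stand in for real class-token features and labels.

```python
import torch
from torch import nn

embed_dim = 768    # standard ViT-Base embedding width (assumed)
num_classes = 5    # hypothetical number of downstream classes

# Linear probe: the only trainable part; the backbone stays frozen.
head = nn.Linear(embed_dim, num_classes)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

# Stand-ins for class-token features from the frozen backbone and their labels.
cls_features = torch.randn(8, embed_dim)
labels = torch.randint(0, num_classes, (8,))

# One optimization step.
logits = head(cls_features)                          # [8, num_classes]
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(logits.shape, loss.item())
```

In practice the features would come from `model.forward_features` computed under `torch.no_grad()`, so that gradients flow only through the head.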