	---
	language: en
	tags:
	- vision
	license: apache-2.0
	---

	# Model Card for Mars ViT Base Model

	## Model Architecture
	- Architecture: Vision Transformer (ViT) Base
	- Input Channels: 1 (grayscale images)
	- Number of Classes: 0 (feature extraction only, no classification head)
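
As a rough guide to the output shape, the token count follows from the patch grid. The sketch below assumes the `vit_base_patch16_224` configuration named in the usage example (224x224 input, 16x16 patches, one class token) and the standard ViT-Base embedding width of 768. The function name `vit_token_layout` is illustrative and not part of this repository.

```python
def vit_token_layout(img_size: int = 224, patch_size: int = 16,
                     embed_dim: int = 768, num_prefix_tokens: int = 1):
    """Return (grid_side, num_patches, num_tokens, embed_dim) for a square ViT input."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    grid_side = img_size // patch_size            # patches per row and per column
    num_patches = grid_side * grid_side           # total patch tokens
    num_tokens = num_patches + num_prefix_tokens  # patch tokens plus class token
    return grid_side, num_patches, num_tokens, embed_dim


layout = vit_token_layout()
print(layout)  # (14, 196, 197, 768)
```

Under these assumptions, `forward_features` on a single image would return a tensor of shape `[1, 197, 768]`.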

	## Training Method
	- Method: Masked Autoencoder (MAE)
	- Dataset: 2 million CTX (Context Camera) images
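
For readers unfamiliar with MAE pre-training, the core idea is to hide a random subset of patches and train the model to reconstruct them from the visible ones. The sketch below illustrates only the index selection step in plain Python. The 75% mask ratio is the default from the original MAE work and is an assumption here; this model card does not state the ratio used in training. `random_masking` is an illustrative helper, not code from this repository.

```python
import random


def random_masking(num_patches: int, mask_ratio: float = 0.75, seed=None):
    """Split patch indices into visible (kept) and masked sets, MAE-style."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)                    # random permutation of patch indices
    keep_ids = sorted(shuffled[:num_keep])   # patches the encoder would see
    mask_ids = sorted(shuffled[num_keep:])   # patches the decoder would reconstruct
    return keep_ids, mask_ids


# 196 patches corresponds to a 14x14 grid (224x224 input, 16x16 patches)
keep_ids, mask_ids = random_masking(196, mask_ratio=0.75, seed=0)
print(len(keep_ids), len(mask_ids))  # 49 147
```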

	## Usage Examples
	### Using timm (currently recommended)

	First, download `checkpoint-1199.pth` (backbone only) to a local path.

	```python
	import timm
	import torch

	model = timm.create_model(
	    'vit_base_patch16_224',
	    in_chans=1,
	    num_classes=0,
	    global_pool='',
	    checkpoint_path="./checkpoint-1199.pth",  # must be a local path
	)

	model.eval()

	# Input images must be converted to a single channel, resized to 224x224, and normalized.

	# Transform example:
	# from torchvision import transforms
	# transform = transforms.Compose([
	#     transforms.ToTensor(),
	#     transforms.Resize((224, 224)),
	#     transforms.Grayscale(num_output_channels=1),
	#     transforms.Normalize(mean=[0.5], std=[0.5]),
	# ])
	x = torch.randn(1, 1, 224, 224)  # dummy batch standing in for a preprocessed image
	with torch.no_grad():
	    features = model.forward_features(x)  # shape [1, tokens, embed_dim]
	    print(features.shape)

	cls_token = features[:, 0]      # class token, one vector per image
	patch_tokens = features[:, 1:]  # one vector per image patch
	```
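
The class token and patch tokens from the timm example can be turned into an image-level embedding or a spatial feature map. The following is a minimal sketch of both, using a random tensor as a stand-in for the output of `model.forward_features(x)` so that it runs without the checkpoint. The `[batch, 197, 768]` shape assumes the standard ViT-Base layout with a 14x14 patch grid.

```python
import math

import torch

# Stand-in for `model.forward_features(x)`: [batch, 1 + 14*14 tokens, embed_dim]
features = torch.randn(2, 197, 768)

cls_token = features[:, 0]      # [2, 768]
patch_tokens = features[:, 1:]  # [2, 196, 768]

# Option 1: image-level embedding as the mean over patch tokens
mean_embedding = patch_tokens.mean(dim=1)  # [2, 768]

# Option 2: spatial feature map, e.g. as input to a dense-prediction head
batch, num_patches, channels = patch_tokens.shape
side = math.isqrt(num_patches)  # 14 for a 224x224 input with 16x16 patches
feature_map = patch_tokens.transpose(1, 2).reshape(batch, channels, side, side)

print(mean_embedding.shape, feature_map.shape)
```

Whether the class token or the mean of the patch tokens works better as an image descriptor depends on the downstream task, so it may be worth trying both.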

	### Using transformers
	```python
	from transformers import AutoModel, AutoImageProcessor

	model = AutoModel.from_pretrained("jfang/mars-vit-base-ctx2m")
	image_processor = AutoImageProcessor.from_pretrained("jfang/mars-vit-base-ctx2m")

	# Example usage
	from PIL import Image
	image = Image.open("some_image.png").convert("L")  # convert to single-channel grayscale
	inputs = image_processor(image, return_tensors="pt")

	outputs = model(**inputs)
	```
	## MAE reconstruction
	The `./mae` folder contains the full encoder-decoder MAE model and a notebook for visualizing reconstructions.

	## Limitations
	- The model is trained specifically on CTX images and may not generalize well to other types of images without further fine-tuning.
	- The model is designed for feature extraction and does not include a classification head.
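
Because no classification head is included, downstream classification requires adding one. The sketch below shows one common option, a linear probe on the class token, with a single training step. It is an illustration under assumptions rather than a recipe from this repository: `LinearProbe`, the 5-class setup, and the learning rate are hypothetical, and a random tensor stands in for frozen backbone features of shape `[batch, 197, 768]`.

```python
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Hypothetical linear classification head applied to the class token."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 5):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: [batch, tokens, embed_dim]; index 0 is the class token
        return self.head(features[:, 0])


probe = LinearProbe()
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

# Stand-ins for frozen backbone features and integer class labels
features = torch.randn(4, 197, 768)
labels = torch.randint(0, 5, (4,))

# One training step: only the probe's parameters are updated
logits = probe(features)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(logits.shape)  # torch.Size([4, 5])
```

In practice the features would come from `model.forward_features(x)` computed under `torch.no_grad()`, which keeps the backbone frozen while the head trains.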