---
license: apache-2.0
---
# Model Card: SuryaKrishna02/swinv2-roberta-openclip
## Model Description
The `swinv2-roberta-openclip` model is a multimodal vision-language model that combines the Swin Transformer V2 architecture for image processing with a RoBERTa text encoder, implemented using the OpenCLIP framework. The Swin Transformer V2 improves upon the original Swin Transformer architecture with better training stability, improved handling of resolution differences between pre-training and fine-tuning, and reduced data requirements.
This model follows the CLIP (Contrastive Language-Image Pre-training) approach, which enables zero-shot classification and multimodal understanding by learning joint image-text representations.
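The contrastive idea can be illustrated without any framework (a toy numpy sketch, not the model's actual code): image and text embeddings are L2-normalized, their dot products give cosine similarities, and a softmax over text prompts yields zero-shot "class" probabilities.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each vector to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Toy embeddings: 2 images and 3 text prompts in a shared 4-d space.
rng = np.random.default_rng(0)
image_emb = l2_normalize(rng.normal(size=(2, 4)))
text_emb = l2_normalize(rng.normal(size=(3, 4)))

# Cosine-similarity matrix: entry [i, j] scores image i against prompt j.
similarity = image_emb @ text_emb.T

# Softmax over prompts turns scores into zero-shot class probabilities.
probs = np.exp(similarity) / np.exp(similarity).sum(axis=1, keepdims=True)
```

In the real model the embeddings come from the two encoders below; the similarity-plus-softmax step is the same.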
## Model Architecture
- **Image Encoder**: Swin Transformer V2 Base (Window 12, 192px)
- Pre-trained `swinv2_base_window12_192.ms_in22k` model from timm
- A hierarchical vision transformer that uses shifted windows for efficient attention computation
- Patch dropout of 0.6
- Outputs image embeddings that capture visual features at multiple scales
- **Text Encoder**: RoBERTa Base
- Uses `roberta-base` from Hugging Face
- Mean pooling strategy for sentence embeddings
- Processes text inputs to generate text embeddings in the same latent space as image embeddings
- **Joint Embedding Space**: 512 dimensions
- Both image and text features are projected to this common space
- **Framework**: OpenCLIP
- An open-source implementation of the CLIP architecture that supports various vision and text encoder combinations
- Enables training on custom datasets with different model architectures
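The mean-pooling strategy used by the text encoder can be sketched as follows (a simplified numpy illustration, not OpenCLIP's actual pooler code): token embeddings are averaged over real tokens only, with padding positions excluded via the attention mask.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, hidden); attention_mask: (seq_len,) of 0/1.
    # Average only over real tokens, ignoring padding positions.
    mask = attention_mask[:, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=0)
    count = mask.sum()
    return summed / count

# 4 token embeddings; the last token is padding and must not affect the mean.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])
pooled = mean_pool(tokens, mask)  # → [3.0, 4.0]
```

The pooled sentence embedding is then projected linearly into the 512-dimensional joint space shared with the image encoder.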
## Use Cases
This model can be used for:
- Zero-shot image classification
- Text-to-image and image-to-text retrieval
- Multimodal search
- Visual reasoning tasks
- Foundation for fine-tuning on downstream tasks
## Limitations
- Performance may vary across domains not well-represented in the training data
- May exhibit biases present in the training datasets
- Visual understanding is limited to image-level features rather than fine-grained object detection
## Training
This model was trained on a subset of the PD12M dataset:
- **Dataset**: 100,000 image-text pairs from PD12M (Public Domain 12M)
- **Training Duration**: 3 epochs
- **Pre-processing**:
- Image normalization with mean [0.48145466, 0.4578275, 0.40821073] and std [0.26862954, 0.26130258, 0.27577711]
- Bicubic interpolation with "shortest" resize mode
- **Model Initialization**:
- Vision encoder: Initialized with pre-trained `swinv2_base_window12_192.ms_in22k` weights
- Text encoder: Initialized with pre-trained `roberta-base` weights
- **Image Size**: 192x192 pixels
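The per-channel normalization step above can be sketched in plain numpy (a hypothetical illustration of just the normalize step; the actual pipeline also resizes the shortest side with bicubic interpolation and crops to 192x192):

```python
import numpy as np

# Mean/std from the preprocessing config (CLIP's standard values).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def normalize(image):
    # image: (H, W, 3) float array with pixel values scaled to [0, 1].
    return (image - CLIP_MEAN) / CLIP_STD

# A uniform mid-gray 192x192 image as a toy input.
img = np.full((192, 192, 3), 0.5)
out = normalize(img)
```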
The training process involved:
1. Initializing the vision encoder (Swin Transformer V2) and text encoder (RoBERTa) with their respective pre-trained weights
2. Training both encoders jointly using a contrastive learning objective
3. Using the OpenCLIP framework for efficient training
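The contrastive objective in step 2 can be sketched as a symmetric cross-entropy over the image-text similarity matrix (an illustrative numpy version; OpenCLIP's implementation uses a learned temperature rather than the fixed scale assumed here):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, scale=100.0):
    # Both inputs: (batch, dim), assumed L2-normalized.
    # The i-th image and i-th text form the matching (positive) pair.
    logits = scale * image_emb @ text_emb.T  # (batch, batch) similarities
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lg)), labels].mean()

    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Perfectly aligned toy batch: 4 one-hot "embeddings", already unit-norm.
batch = np.eye(4)
loss = clip_contrastive_loss(batch, batch)  # ≈ 0.0
```

Training pushes matching image-text pairs together and mismatched pairs apart in the 512-dimensional joint space.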
## Usage
```python
import open_clip
import torch
from PIL import Image
# Load model and processors
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
# Process image
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)
# Process text
text = tokenizer(["a photo of a cat", "a photo of a dog"])
# Generate embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
# Normalize features
image_features = image_features / image_features.norm(dim=1, keepdim=True)
text_features = text_features / text_features.norm(dim=1, keepdim=True)
# Calculate similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probabilities: {similarity}")
```
## Citation
If you use this model in your research, please cite:
```
@software{swinv2_roberta_openclip,
  author    = {Guthikonda, Surya Krishna},
  title     = {Swinv2-Roberta-OpenCLIP},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/SuryaKrishna02/swinv2-roberta-openclip}
}
```
## Model Configuration
```json
{
  "model_cfg": {
    "embed_dim": 512,
    "vision_cfg": {
      "timm_model_name": "swinv2_base_window12_192.ms_in22k",
      "timm_model_pretrained": true,
      "patch_dropout": 0.6,
      "timm_pool": "avg",
      "timm_proj": "linear",
      "image_size": 192
    },
    "text_cfg": {
      "hf_model_name": "roberta-base",
      "hf_tokenizer_name": "roberta-base",
      "hf_pooler_type": "mean_pooler"
    }
  },
  "preprocess_cfg": {
    "mean": [0.48145466, 0.4578275, 0.40821073],
    "std": [0.26862954, 0.26130258, 0.27577711],
    "interpolation": "bicubic",
    "resize_mode": "shortest"
  }
}
```
## References
- OpenCLIP: An open source implementation of CLIP (https://github.com/mlfoundations/open_clip)
- Swin Transformer V2: Scaling Up Capacity and Resolution (https://arxiv.org/abs/2111.09883)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/abs/1907.11692)
- PD12M: Public Domain 12M, an image-text dataset (https://github.com/SuryaKrishna02/PD12M)
## License
This model is released under the Apache License 2.0.
```
Copyright 2025 Surya Guthikonda
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```