---
license: apache-2.0
---

# Model Card: SuryaKrishna02/swinv2-roberta-openclip

## Model Description

The `swinv2-roberta-openclip` model is a multimodal vision-language model that combines the Swin Transformer V2 architecture for image processing with a RoBERTa text encoder, implemented using the OpenCLIP framework. Swin Transformer V2 improves upon the original Swin Transformer architecture with better training stability, improved handling of resolution differences between pre-training and fine-tuning, and reduced data requirements.

This model follows the CLIP (Contrastive Language-Image Pre-training) approach, which enables zero-shot classification and multimodal understanding by learning joint image-text representations.

## Model Architecture

- **Image Encoder**: Swin Transformer V2 Base (window 12, 192px)
  - Pre-trained `swinv2_base_window12_192.ms_in22k` model from timm
  - A hierarchical vision transformer that uses shifted windows for efficient attention computation
  - Patch dropout of 0.6
  - Outputs image embeddings that capture visual features at multiple scales

- **Text Encoder**: RoBERTa Base
  - Uses `roberta-base` from Hugging Face
  - Mean pooling strategy for sentence embeddings (see the sketch after this list)
  - Processes text inputs to generate text embeddings in the same latent space as image embeddings

- **Joint Embedding Space**: 512 dimensions
  - Both image and text features are projected to this common space

- **Framework**: OpenCLIP
  - An open-source implementation of the CLIP architecture that supports various vision and text encoder combinations
  - Enables training on custom datasets with different model architectures
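
The mean-pooling step above can be illustrated with a short sketch. This is a minimal, illustrative version of mask-aware mean pooling, not the exact pooler code inside OpenCLIP's Hugging Face text tower; the pooled 768-dimensional RoBERTa output is then linearly projected into the 512-dimensional joint space.

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings into a sentence embedding, ignoring padding.

    token_embeddings: (batch, seq_len, hidden) output of the RoBERTa encoder
    attention_mask:   (batch, seq_len), 1 for real tokens and 0 for padding
    """
    # Broadcast the mask over the hidden dimension
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                       # guard against empty masks
    return summed / counts
```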

## Use Cases

This model can be used for:

- Zero-shot image classification
- Text-to-image and image-to-text retrieval (see the retrieval sketch after this list)
- Multimodal search
- Visual reasoning tasks
- A foundation for fine-tuning on downstream tasks
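
As an example of the retrieval use case, the sketch below embeds a small image collection once and ranks it against a text query by cosine similarity. The image file names are hypothetical placeholders.

```python
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
model.eval()

# Hypothetical image collection; replace with your own files
paths = ["cat.jpg", "dog.jpg", "car.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in paths])

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(tokenizer(["a photo of a dog"]))

# Cosine similarity is a dot product of L2-normalized embeddings
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
scores = (text_features @ image_features.T).squeeze(0)

best = scores.argmax().item()
print(f"Best match: {paths[best]} (score {scores[best]:.3f})")
```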

## Limitations

- Performance may vary across domains that are not well represented in the training data
- May exhibit biases present in the training datasets
- Visual understanding is limited to image-level features rather than fine-grained object detection

## Training

This model was trained on a subset of the PD12M dataset:

- **Dataset**: 100,000 image-text pairs from PD12M (Public Domain 12M)
- **Training Duration**: 3 epochs
- **Pre-processing** (see the torchvision sketch after this list):
  - Image normalization with mean [0.48145466, 0.4578275, 0.40821073] and std [0.26862954, 0.26130258, 0.27577711]
  - Bicubic interpolation with "shortest" resize mode
- **Model Initialization**:
  - Vision encoder: initialized with pre-trained `swinv2_base_window12_192.ms_in22k` weights
  - Text encoder: initialized with pre-trained `roberta-base` weights
- **Image Size**: 192x192 pixels
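
The validation transform produced by OpenCLIP for this configuration behaves roughly like the following torchvision pipeline. This is a sketch of the stated settings, not the exact OpenCLIP transform code; "shortest" resize mode is taken to mean the shorter image edge is resized to 192 before a center crop.

```python
from torchvision import transforms

# Approximation of the preprocessing described above
preprocess = transforms.Compose([
    transforms.Resize(192, interpolation=transforms.InterpolationMode.BICUBIC),  # shorter edge -> 192
    transforms.CenterCrop(192),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.48145466, 0.4578275, 0.40821073],
        std=[0.26862954, 0.26130258, 0.27577711],
    ),
])
```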

The training process involved:

1. Initializing the vision encoder (Swin Transformer V2) and text encoder (RoBERTa) with their respective pre-trained weights
2. Training both encoders jointly with a contrastive learning objective (sketched below)
3. Using the OpenCLIP framework for efficient training
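
The contrastive objective in step 2 is the standard symmetric CLIP loss: matching image-text pairs in a batch are pulled together while mismatched pairs are pushed apart. A minimal sketch, assuming L2-normalized features and a learnable temperature as in CLIP:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Both feature tensors are (batch, embed_dim) and L2-normalized;
    logit_scale is the exponentiated learnable temperature.
    """
    logits_per_image = logit_scale * image_features @ text_features.T
    logits_per_text = logits_per_image.T
    # The i-th image matches the i-th caption, so targets are the diagonal
    targets = torch.arange(image_features.shape[0], device=image_features.device)
    return (F.cross_entropy(logits_per_image, targets) +
            F.cross_entropy(logits_per_text, targets)) / 2
```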

## Usage

```python
import open_clip
import torch
from PIL import Image

# Load model and processors
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms(
    'hf-hub:SuryaKrishna02/swinv2-roberta-openclip'
)
tokenizer = open_clip.get_tokenizer('hf-hub:SuryaKrishna02/swinv2-roberta-openclip')
model.eval()  # disable dropout (including patch dropout) for inference

# Process image
image = preprocess_val(Image.open("example.jpg")).unsqueeze(0)

# Process text
text = tokenizer(["a photo of a cat", "a photo of a dog"])

# Generate embeddings
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# Normalize features
image_features = image_features / image_features.norm(dim=1, keepdim=True)
text_features = text_features / text_features.norm(dim=1, keepdim=True)

# Calculate similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f"Label probabilities: {similarity}")
```

## Citation

If you use this model in your research, please cite:

```bibtex
@software{swinv2_roberta_openclip,
  author = {Guthikonda, Surya Krishna},
  title = {Swinv2-Roberta-OpenCLIP},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/SuryaKrishna02/swinv2-roberta-openclip}
}
```

## Model Configuration

```json
{
  "model_cfg": {
    "embed_dim": 512,
    "vision_cfg": {
      "timm_model_name": "swinv2_base_window12_192.ms_in22k",
      "timm_model_pretrained": true,
      "patch_dropout": 0.6,
      "timm_pool": "avg",
      "timm_proj": "linear",
      "image_size": 192
    },
    "text_cfg": {
      "hf_model_name": "roberta-base",
      "hf_tokenizer_name": "roberta-base",
      "hf_pooler_type": "mean_pooler"
    }
  },
  "preprocess_cfg": {
    "mean": [0.48145466, 0.4578275, 0.40821073],
    "std": [0.26862954, 0.26130258, 0.27577711],
    "interpolation": "bicubic",
    "resize_mode": "shortest"
  }
}
```
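
When the model is loaded from the Hub (as in the Usage section), OpenCLIP reads this configuration automatically. To build the same architecture locally, one option is to save the `model_cfg` portion as a standalone JSON file and register it with OpenCLIP's config registry. The sketch below assumes the hypothetical file name `swinv2-roberta.json` and uses `open_clip.add_model_config` to register it by path.

```python
import json
import open_clip

# The "model_cfg" block from above, written as a standalone config file;
# the file stem ("swinv2-roberta") becomes the model name open_clip sees.
model_cfg = {
    "embed_dim": 512,
    "vision_cfg": {
        "timm_model_name": "swinv2_base_window12_192.ms_in22k",
        "timm_model_pretrained": True,
        "patch_dropout": 0.6,
        "timm_pool": "avg",
        "timm_proj": "linear",
        "image_size": 192,
    },
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_tokenizer_name": "roberta-base",
        "hf_pooler_type": "mean_pooler",
    },
}
with open("swinv2-roberta.json", "w") as f:
    json.dump(model_cfg, f)

# Register the config, then build the architecture by name
open_clip.add_model_config("swinv2-roberta.json")
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms("swinv2-roberta")
```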

## References

- OpenCLIP: An open-source implementation of CLIP (https://github.com/mlfoundations/open_clip)
- Swin Transformer V2: Scaling Up Capacity and Resolution (https://arxiv.org/abs/2111.09883)
- RoBERTa: A Robustly Optimized BERT Pretraining Approach (https://arxiv.org/abs/1907.11692)
- PD12M: Public Domain 12M image-text dataset (https://github.com/SuryaKrishna02/PD12M)

## License

This model is released under the Apache License 2.0.

```
Copyright 2025 Surya Guthikonda

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```