Commit 5db5ea4 (verified) by fadrizul · 1 parent: 296f50e

Upload clip_vision/open-clip-xlm-roberta-large-vit-huge-14_visual_fp16/README.md with huggingface_hub
# CLIP-ViT-H-14 Vision Encoder for Wan Video Models

This repository contains the CLIP vision encoder converted from the ComfyUI WanVideoWrapper format to the HuggingFace format for use with diffusers pipelines.

## Model Details

- **Base Architecture**: CLIP ViT-H/14 (Vision Transformer with 14x14 patches)
- **Image Size**: 224x224
- **Patch Size**: 14x14
- **Hidden Dimension**: 1280
- **Number of Layers**: 32
- **Number of Attention Heads**: 16
- **Output Dimension**: 1024

## Purpose

This model serves as the **image encoder** for Wan video diffusion models (Wan 2.1 I2V, Wan 2.2, etc.). It encodes input images into latent representations that are used as conditioning signals alongside the text embeddings (from T5) during video generation.

**Note**: This repository contains only the **vision encoder** component. Text encoding is handled separately by T5 models, not by CLIP's text encoder.

## Conversion Process

This model was converted from the ComfyUI WanVideoWrapper implementation to the HuggingFace format in the following steps:

### 1. Weight Conversion
The model weights were remapped from the ComfyUI format to the HuggingFace CLIP format using `scripts/convert_openclip_to_hf_clean.py`. Key remappings included:
- Vision transformer blocks
- Layer normalization parameters
- Attention projections (Q, K, V)
- MLP/FFN layers
- Position embeddings

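Such a remapping is typically a table of regex rewrites applied to each state-dict key. The patterns below are an illustrative sketch only (the key prefixes are assumptions based on the OpenCLIP naming convention; the authoritative mapping lives in `scripts/convert_openclip_to_hf_clean.py`):

```python
import re

# Illustrative key-remapping table (prefixes are assumptions; see the
# conversion script for the real mapping).
PATTERNS = [
    (r"^visual\.transformer\.resblocks\.(\d+)\.ln_1\.",
     r"vision_model.encoder.layers.\1.layer_norm1."),
    (r"^visual\.transformer\.resblocks\.(\d+)\.ln_2\.",
     r"vision_model.encoder.layers.\1.layer_norm2."),
    (r"^visual\.transformer\.resblocks\.(\d+)\.mlp\.c_fc\.",
     r"vision_model.encoder.layers.\1.mlp.fc1."),
    (r"^visual\.transformer\.resblocks\.(\d+)\.mlp\.c_proj\.",
     r"vision_model.encoder.layers.\1.mlp.fc2."),
    (r"^visual\.positional_embedding$",
     r"vision_model.embeddings.position_embedding.weight"),
]

def remap_key(key: str) -> str:
    """Rewrite one source key to its HuggingFace CLIP name; pass through unknowns."""
    for pattern, replacement in PATTERNS:
        new_key, n = re.subn(pattern, replacement, key)
        if n:
            return new_key
    return key
```

(Fused attention weights such as `attn.in_proj_weight` additionally need to be split into separate Q/K/V projections, which a plain rename cannot express.)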
### 2. Configuration Generation
The `config.json` was generated from the architecture parameters defined in the ComfyUI WanVideoWrapper's `wanvideo/modules/clip.py`:
- `image_size=224`
- `patch_size=14`
- `hidden_size=1280` (vision_dim)
- `num_hidden_layers=32` (vision_layers)
- `num_attention_heads=16` (vision_heads)
- `intermediate_size=5120` (mlp_ratio * hidden_size)
- `projection_dim=1024` (embed_dim)

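For reference, a `config.json` consistent with these parameters can be written out directly. The field names follow `transformers`' `CLIPVisionConfig`; the values are exactly the ones listed above:

```python
import json

# Sketch of the generated config.json (values from the list above; field
# names follow transformers' CLIPVisionConfig).
config = {
    "architectures": ["CLIPVisionModel"],
    "model_type": "clip_vision_model",
    "image_size": 224,
    "patch_size": 14,
    "hidden_size": 1280,
    "num_hidden_layers": 32,
    "num_attention_heads": 16,
    "intermediate_size": 5120,  # mlp_ratio (4) * hidden_size
    "projection_dim": 1024,
}
print(json.dumps(config, indent=2))
```

Note the internal consistency: the MLP expansion ratio is 4 (5120 / 1280), and the 1280-dim hidden state splits evenly into 16 heads of 80 dimensions each.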
### 3. Preprocessor Configuration
The `preprocessor_config.json` was copied from the original [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) repository, as the ComfyUI WanVideoWrapper uses identical preprocessing parameters:
- **Image mean**: `[0.48145466, 0.4578275, 0.40821073]`
- **Image std**: `[0.26862954, 0.26130258, 0.27577711]`
- **Resize**: 224x224
- **Interpolation**: Bicubic
- **Center crop**: Enabled

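These mean/std values are the standard CLIP normalization constants: after the bicubic resize and center crop, each channel is scaled to [0, 1] and normalized per channel. A minimal sketch of that per-pixel arithmetic:

```python
# CLIP normalization constants, as listed in preprocessor_config.json.
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]

def normalize_pixel(rgb):
    """Map one 8-bit RGB triple to the normalized values the model sees."""
    return [((v / 255.0) - m) / s for v, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD)]
```

`CLIPImageProcessor` applies exactly this `(x/255 - mean) / std` transform tensor-wide, so there is normally no reason to do it by hand.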
## Usage

### With Diffusers

```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from diffusers import WanImageToVideoPipeline
import torch

# Load the vision encoder
image_encoder = CLIPVisionModel.from_pretrained(
    "your-username/wan-clip-vit-h-14",
    torch_dtype=torch.float16
)

# Load the image processor
image_processor = CLIPImageProcessor.from_pretrained(
    "your-username/wan-clip-vit-h-14"
)

# Use with the Wan pipeline
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    image_encoder=image_encoder,
    image_processor=image_processor,
    torch_dtype=torch.bfloat16
)
```

### Direct Usage

```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import torch

model = CLIPVisionModel.from_pretrained("your-username/wan-clip-vit-h-14")
processor = CLIPImageProcessor.from_pretrained("your-username/wan-clip-vit-h-14")

image = Image.open("your_image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # hidden_states is only populated when output_hidden_states=True
    outputs = model(**inputs, output_hidden_states=True)

# Get image embeddings from the penultimate layer (as used in Wan models)
image_embeds = outputs.hidden_states[-2]  # Shape: [1, 257, 1280]
```

## Model Architecture

This is a **vision-only** CLIP model. The architecture consists of:

1. **Patch Embedding**: Splits a 224x224 image into a 16x16 grid of 14x14 patches (256 patch tokens plus one class token)
2. **Vision Transformer**: 32 layers of multi-head self-attention (16 heads, 1280 hidden dim)
3. **Projection Head**: Projects 1280-dim features to a 1024-dim output space

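The sequence length seen in the usage example follows directly from these numbers:

```python
# Token-count arithmetic for this configuration.
image_size, patch_size = 224, 14
grid = image_size // patch_size   # patches per side of the image
num_tokens = grid * grid + 1      # patch tokens + 1 class token
print(grid, num_tokens)           # 16 257
```

That 257 is exactly the middle dimension of the `[1, 257, 1280]` embedding shape.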
## Important Notes

- This model **does not include** the text encoder. For Wan video models, text encoding is performed by T5 (UMT5-XXL).
- The `tokenizer.json` and other text-related files from the original LAION CLIP model are **not needed** and are **not included** in this repository.
- The Wan video pipeline uses embeddings from the **penultimate layer** (`hidden_states[-2]`) rather than the final layer.

## Files Included

```
.
├── config.json               # Model architecture configuration
├── preprocessor_config.json  # Image preprocessing configuration
├── model.safetensors         # Model weights (safetensors format)
└── README.md                 # This file
```

## Source

- **Original Weights**: ComfyUI WanVideoWrapper CLIP model
- **Preprocessing Config**: [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k)
- **Conversion Script**: `scripts/convert_openclip_to_hf_clean.py`

## License

Please refer to the original model licenses:
- [ComfyUI-WanVideoWrapper License](https://github.com/kijai/ComfyUI-WanVideoWrapper)
- [OpenCLIP License](https://github.com/mlfoundations/open_clip)

## Citation

If you use this model, please cite the original CLIP and Wan video papers:

```bibtex
@inproceedings{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International Conference on Machine Learning},
  pages={8748--8763},
  year={2021},
  organization={PMLR}
}
```