Commit 5db5ea4 (verified) by fadrizul · 1 parent: 296f50e

Upload clip_vision/open-clip-xlm-roberta-large-vit-huge-14_visual_fp16/README.md with huggingface_hub
# CLIP-ViT-H-14 Vision Encoder for Wan Video Models

This repository contains the CLIP vision encoder converted from the ComfyUI WanVideoWrapper format to the HuggingFace format for use with diffusers pipelines.

## Model Details

- **Base Architecture**: CLIP ViT-H/14 (Vision Transformer with 14x14 patches)
- **Image Size**: 224x224
- **Patch Size**: 14x14
- **Hidden Dimension**: 1280
- **Number of Layers**: 32
- **Number of Attention Heads**: 16
- **Output Dimension**: 1024

## Purpose

This model serves as the **image encoder** for Wan video diffusion models (Wan 2.1 I2V, Wan 2.2, etc.). It encodes input images into latent representations that are used as conditioning signals alongside the text embeddings (from T5) during video generation.

**Note**: This repository contains only the **vision encoder** component. Text encoding is handled separately by T5 models, not by CLIP's text encoder.

## Conversion Process

This model was converted from the ComfyUI WanVideoWrapper implementation to the HuggingFace format in the following steps:

### 1. Weight Conversion
The model weights were remapped from the ComfyUI format to the HuggingFace CLIP format using `scripts/convert_openclip_to_hf_clean.py`. Key remappings included:
- Vision transformer blocks
- Layer normalization parameters
- Attention projections (Q, K, V)
- MLP/FFN layers
- Position embeddings

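Such a remapping is typically a table of regex rewrites applied to each state-dict key. The patterns below are an illustrative sketch only (the key prefixes are assumptions based on the OpenCLIP naming convention; the authoritative mapping lives in `scripts/convert_openclip_to_hf_clean.py`):

```python
import re

# Illustrative key-remapping table (prefixes are assumptions; see the
# conversion script for the real mapping).
PATTERNS = [
    (r"^visual\.transformer\.resblocks\.(\d+)\.ln_1\.",
     r"vision_model.encoder.layers.\1.layer_norm1."),
    (r"^visual\.transformer\.resblocks\.(\d+)\.ln_2\.",
     r"vision_model.encoder.layers.\1.layer_norm2."),
    (r"^visual\.transformer\.resblocks\.(\d+)\.mlp\.c_fc\.",
     r"vision_model.encoder.layers.\1.mlp.fc1."),
    (r"^visual\.transformer\.resblocks\.(\d+)\.mlp\.c_proj\.",
     r"vision_model.encoder.layers.\1.mlp.fc2."),
    (r"^visual\.positional_embedding$",
     r"vision_model.embeddings.position_embedding.weight"),
]

def remap_key(key: str) -> str:
    """Rewrite one source key to its HuggingFace CLIP name; pass through unknowns."""
    for pattern, replacement in PATTERNS:
        new_key, n = re.subn(pattern, replacement, key)
        if n:
            return new_key
    return key
```

(Fused attention weights such as `attn.in_proj_weight` additionally need to be split into separate Q/K/V projections, which a plain rename cannot express.)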
### 2. Configuration Generation
The `config.json` was generated from the architecture parameters defined in the ComfyUI WanVideoWrapper's `wanvideo/modules/clip.py`:
- `image_size=224`
- `patch_size=14`
- `hidden_size=1280` (vision_dim)
- `num_hidden_layers=32` (vision_layers)
- `num_attention_heads=16` (vision_heads)
- `intermediate_size=5120` (mlp_ratio * hidden_size)
- `projection_dim=1024` (embed_dim)

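For reference, a `config.json` consistent with these parameters can be written out directly. The field names follow `transformers`' `CLIPVisionConfig`; the values are exactly the ones listed above:

```python
import json

# Sketch of the generated config.json (values from the list above; field
# names follow transformers' CLIPVisionConfig).
config = {
    "architectures": ["CLIPVisionModel"],
    "model_type": "clip_vision_model",
    "image_size": 224,
    "patch_size": 14,
    "hidden_size": 1280,
    "num_hidden_layers": 32,
    "num_attention_heads": 16,
    "intermediate_size": 5120,  # mlp_ratio (4) * hidden_size
    "projection_dim": 1024,
}
print(json.dumps(config, indent=2))
```

Note the internal consistency: the MLP expansion ratio is 4 (5120 / 1280), and the 1280-dim hidden state splits evenly into 16 heads of 80 dimensions each.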
### 3. Preprocessor Configuration
The `preprocessor_config.json` was copied from the original [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) repository, as the ComfyUI WanVideoWrapper uses identical preprocessing parameters:
- **Image mean**: `[0.48145466, 0.4578275, 0.40821073]`
- **Image std**: `[0.26862954, 0.26130258, 0.27577711]`
- **Resize**: 224x224
- **Interpolation**: Bicubic
- **Center crop**: Enabled

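These mean/std values are the standard CLIP normalization constants: after the bicubic resize and center crop, each channel is scaled to [0, 1] and normalized per channel. A minimal sketch of that per-pixel arithmetic:

```python
# CLIP normalization constants, as listed in preprocessor_config.json.
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]

def normalize_pixel(rgb):
    """Map one 8-bit RGB triple to the normalized values the model sees."""
    return [((v / 255.0) - m) / s for v, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD)]
```

`CLIPImageProcessor` applies exactly this `(x/255 - mean) / std` transform tensor-wide, so there is normally no reason to do it by hand.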
## Usage

### With Diffusers

```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from diffusers import WanImageToVideoPipeline
import torch

# Load the vision encoder
image_encoder = CLIPVisionModel.from_pretrained(
    "your-username/wan-clip-vit-h-14",
    torch_dtype=torch.float16
)

# Load the image processor
image_processor = CLIPImageProcessor.from_pretrained(
    "your-username/wan-clip-vit-h-14"
)

# Use with the Wan pipeline
pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers",
    image_encoder=image_encoder,
    image_processor=image_processor,
    torch_dtype=torch.bfloat16
)
```

### Direct Usage

```python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import torch

model = CLIPVisionModel.from_pretrained("your-username/wan-clip-vit-h-14")
processor = CLIPImageProcessor.from_pretrained("your-username/wan-clip-vit-h-14")

image = Image.open("your_image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # hidden_states is only populated when output_hidden_states=True
    outputs = model(**inputs, output_hidden_states=True)

# Get image embeddings from the penultimate layer (as used in Wan models)
image_embeds = outputs.hidden_states[-2]  # Shape: [1, 257, 1280]
```

## Model Architecture

This is a **vision-only** CLIP model. The architecture consists of:

1. **Patch Embedding**: Splits a 224x224 image into a 16x16 grid of 14x14 patches (256 patch tokens plus one class token)
2. **Vision Transformer**: 32 layers of multi-head self-attention (16 heads, 1280 hidden dim)
3. **Projection Head**: Projects 1280-dim features to a 1024-dim output space

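The sequence length seen in the usage example follows directly from these numbers:

```python
# Token-count arithmetic for this configuration.
image_size, patch_size = 224, 14
grid = image_size // patch_size   # patches per side of the image
num_tokens = grid * grid + 1      # patch tokens + 1 class token
print(grid, num_tokens)           # 16 257
```

That 257 is exactly the middle dimension of the `[1, 257, 1280]` embedding shape.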
## Important Notes

- This model **does not include** the text encoder. For Wan video models, text encoding is performed by T5 (UMT5-XXL).
- The `tokenizer.json` and other text-related files from the original LAION CLIP model are **not needed** and are **not included** in this repository.
- The Wan video pipeline uses embeddings from the **penultimate layer** (`hidden_states[-2]`) rather than the final layer.

## Files Included

```
.
├── config.json               # Model architecture configuration
├── preprocessor_config.json  # Image preprocessing configuration
├── model.safetensors         # Model weights (safetensors format)
└── README.md                 # This file
```

## Source

- **Original Weights**: ComfyUI WanVideoWrapper CLIP model
- **Preprocessing Config**: [laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k)
- **Conversion Script**: `scripts/convert_openclip_to_hf_clean.py`

## License

Please refer to the original model licenses:
- [ComfyUI-WanVideoWrapper License](https://github.com/kijai/ComfyUI-WanVideoWrapper)
- [OpenCLIP License](https://github.com/mlfoundations/open_clip)

## Citation

If you use this model, please cite the original CLIP and Wan video papers:

```bibtex
@inproceedings{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International Conference on Machine Learning},
  pages={8748--8763},
  year={2021},
  organization={PMLR}
}
```