Add files using upload-large-folder tool
README.md CHANGED

@@ -14,6 +14,7 @@ tags:
 ## What’s different vs standard CLIP
 - Based on CLIP ViT-B/32 (from [laion/CLIP-ViT-B-32-laion2B-s34B-b79K](https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K))
 - **Longer text context**: `max_position_embeddings=512` (vs the usual 77).
+- Train data: 1.67M image-caption pairs, captions regenerated by Qwen2.5-VL-72B (512-token max length), images sampled from LAION-2B.
 ## Usage
 
 ```python
@@ -43,9 +44,6 @@ with torch.no_grad():
 print(probs[0].tolist())
 ```
 
-## Training data
-- Train data: 1.67M image-caption pairs, captions regenerated by Qwen2.5-VL-72B (512-token max length), images sampled from LAION-2B.
-
 ## Zero-shot classification example
 
 ```python
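The 512-token text context noted in the diff implies a position-embedding table much larger than stock CLIP's 77 slots. A common way to initialize the extra slots when stretching a pretrained table (an assumption for illustration here, not this model's documented recipe) is linear interpolation along the position axis; a minimal numpy sketch:

```python
import numpy as np

def extend_position_embeddings(table: np.ndarray, new_len: int) -> np.ndarray:
    """Stretch a (old_len, dim) position-embedding table to (new_len, dim)
    by linearly interpolating each embedding dimension along positions."""
    old_len, dim = table.shape
    old_x = np.linspace(0.0, 1.0, old_len)
    new_x = np.linspace(0.0, 1.0, new_len)
    return np.stack(
        [np.interp(new_x, old_x, table[:, d]) for d in range(dim)], axis=1
    )

# Stand-in for a pretrained CLIP text table: 77 positions, 512-dim embeddings.
old_table = np.random.randn(77, 512).astype(np.float32)
new_table = extend_position_embeddings(old_table, 512)
print(new_table.shape)  # (512, 512)
```

The first and last rows of the stretched table match the original endpoints exactly; intermediate rows are blends of their neighbors.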
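The zero-shot classification block is truncated in this diff, but the logic behind the `print(probs[0].tolist())` line reduces to cosine similarity between L2-normalized image and text embeddings, scaled by CLIP's learned logit scale (about 100), followed by a softmax over the class prompts. A self-contained numpy sketch with random stand-in embeddings (the real ones come from the model's image and text towers; all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(1, 512))  # one image embedding
text_embs = rng.normal(size=(3, 512))  # one embedding per class prompt

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarities scaled by ~100 (CLIP's typical learned logit scale),
# then a numerically stable softmax over the class prompts.
logits = 100.0 * l2_normalize(image_emb) @ l2_normalize(text_embs).T
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(probs[0].tolist())  # one probability per class prompt, summing to 1
```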