---
license: apache-2.0
language:
- en
base_model:
- qihoo360/fg-clip2-base
tags:
- CLIP
- FG-CLIP
- FG-CLIP2
- Image-Text Encoder
---

# FG-CLIP2

This version of FG-CLIP2 has been converted to run on the Axera NPU using w8a16 quantization.

Compatible Pulsar2 version: 4.2

For details on how to convert the FG-CLIP2 model into an axmodel that runs on an Axera NPU board, see [this guide](https://github.com/Jordan-5i/FG-CLIP/tree/main/ax_tools).

## Support Platform

- AX650

## On-board inference time

| Stage | Time |
|------|------|
| image_encoder | 125.197 ms |
| text_encoder | 10.817 ms |

## How to use

Download all files from this repository to the device, then run:

```bash
python3 run_axmodel.py
```

Example model input and output:

1. The input image:

![](bedroom.jpg)

2. The candidate text descriptions (in Chinese). The first caption matches the image; the second and third change small details such as the clothing colors, shoes, and plant; the fourth describes an unrelated street-market scene:

```
[
    "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双浅色鞋子,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
    "一个简约风格的卧室角落,黑色金属衣架上挂着多件红色和蓝色的衣物,下方架子放着两双黑色高跟鞋,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。",
    "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双运动鞋,旁边是一盆仙人掌,左侧可见一张铺有白色床单和灰色枕头的床。",
    "一个繁忙的街头市场,摊位上摆满水果,背景是高楼大厦,人们在喧闹中购物。"
]
```

3. The resulting image-text similarity scores:

```
Logits per image: tensor([[9.8757e-01, 4.7755e-03, 7.6510e-03, 1.3484e-14]], dtype=torch.float64)
```
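For reference, scores like the ones above follow the standard CLIP recipe: L2-normalize the image and text embeddings, scale their dot products, and apply a softmax over the candidate texts. The sketch below illustrates this post-processing in NumPy; the feature arrays, the embedding dimension, the `logit_scale` value, and the function name are placeholders for illustration and may differ from what `run_axmodel.py` actually does with the axmodel outputs.

```python
import numpy as np

def clip_similarity(image_features, text_features, logit_scale=100.0):
    """Score one image against several candidate texts, CLIP-style.

    image_features: (1, D) array from the image encoder
    text_features:  (N, D) array from the text encoder
    logit_scale:    placeholder value; the real model ships its own learned scale
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)

    # Scaled cosine similarities, shape (1, N)
    logits_per_image = logit_scale * image_features @ text_features.T

    # Softmax over the candidate texts yields per-text probabilities
    exp = np.exp(logits_per_image - logits_per_image.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    # Dummy features stand in for the image/text encoder axmodel outputs
    rng = np.random.default_rng(0)
    img = rng.standard_normal((1, 512))
    txt = rng.standard_normal((4, 512))
    print(clip_similarity(img, txt))
```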