Add pipeline tag and library name
This PR adds `pipeline_tag` and `library_name` to the model card metadata. The `pipeline_tag` is set to `image-text-to-text`, since the model takes images and text as input and generates text. `library_name` is set to `transformers`, based on the structure of the model files in the repository.
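With the `pipeline_tag` in place, the Hub can route the model to the standard `image-text-to-text` task. Below is a minimal sketch of what that enables, assuming this repo's weights and processor configs load directly through `transformers` (not verified here; requires a version that ships this pipeline task, roughly v4.46+):

```python
# Hedged sketch: querying the model via the generic image-text-to-text
# pipeline. Repo compatibility is an assumption, not verified.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="wyu1/Leopard-LLaVA")

# Chat-style input: one image plus a text question (the URL is a placeholder).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/chart.png"},
            {"type": "text", "text": "What does this chart show?"},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=64))
```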
README.md (CHANGED):

````diff
@@ -1,8 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
-
 # Leopard-LLaVA
 
 [Paper](https://arxiv.org/abs/2410.01744) | [Github](https://github.com/tencent-ailab/Leopard) | [Models-LLaVA](https://huggingface.co/wyu1/Leopard-LLaVA) | [Models-Idefics2](https://huggingface.co/wyu1/Leopard-Idefics2)
@@ -17,8 +18,6 @@ and resolutions of the input images. Experiments across a wide range of benchmar
 
 For Leopard-LLaVA, we use SigLIP-SO-400M with 364 × 364 image resolutions as the visual encoder, as it supports a larger resolution than the commonly used 224 × 224 resolution CLIP visual encoder. Each image is encoded into a sequence of 26 × 26 = 676 visual features with a patch size of 14. With the visual feature pixel shuffling strategy, each image is further processed into a sequence of 169 visual features. We limit the maximum number of images (M) in each sample to 50, which produces up to 8,450 visual features in total. Following LLaVA, we adopt a two-layer MLP as the visual-language connector. We use LLaMA-3.1 as the language model.
 
-
-
 ## Citation
 ```
 @article{jia2024leopard,
````
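As an aside for readers of the diff above: the visual-token budget in the README paragraph (676 features per image reduced to 169, up to 8,450 for 50 images) can be checked numerically. Here is a minimal sketch of a pixel-shuffle step, assuming a shuffle ratio of 2 and SigLIP-SO-400M's 1152-dim features; both the helper and the ratio are illustrative assumptions, not Leopard's actual code.

```python
import torch

def pixel_shuffle(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Fold each ratio x ratio neighborhood of visual tokens into one
    token with ratio**2 times the channels. Hypothetical helper that
    illustrates the strategy described in the README."""
    b, n, d = x.shape
    s = int(n ** 0.5)                      # 26 x 26 token grid
    x = x.view(b, s, s, d)
    x = x.view(b, s // ratio, ratio, s // ratio, ratio, d)
    x = x.permute(0, 1, 3, 2, 4, 5)        # group 2 x 2 neighborhoods
    return x.reshape(b, (s // ratio) ** 2, d * ratio ** 2)

# 364 x 364 image with patch size 14 -> 26 x 26 = 676 patch features.
feats = torch.randn(1, 676, 1152)
tokens = pixel_shuffle(feats)
assert tokens.shape == (1, 169, 4 * 1152)  # 676 / 4 = 169 tokens per image
print(50 * tokens.shape[1])                # 8450 visual features at M = 50
```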
|