nielsr HF Staff committed on
Commit 076d1de · verified · 1 Parent(s): d56530e

Add pipeline tag and library name

This PR adds the `pipeline_tag` and `library_name` to the model card metadata. The `pipeline_tag` is set to `image-text-to-text` as the model processes images and text to generate text. `library_name` is set to `transformers` given the structure of the model files present in the repository.
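As a quick sanity check on the metadata described above, the front matter can be read back with a few lines of Python. This is only a minimal sketch: `parse_front_matter` and `CARD` are hypothetical names introduced here, the parser handles only flat `key: value` pairs (not full YAML), and the card text is an abbreviated copy of the README header after this change.

```python
def parse_front_matter(card_text: str) -> dict:
    """Parse flat `key: value` pairs from a leading block delimited by '---' lines.

    Hypothetical helper for illustration; it does not handle nested YAML.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":
            # Closing delimiter reached: the front matter is complete.
            return metadata
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    # No closing delimiter was found, so treat the card as having no front matter.
    return {}


# Abbreviated copy of the README header after this PR (assumption: rest of card omitted).
CARD = """---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Leopard-LLaVA
"""

metadata = parse_front_matter(CARD)
print(metadata["pipeline_tag"], metadata["library_name"])
```

For real model cards, a proper YAML parser is the safer choice, since metadata fields may contain lists or nested values.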

Files changed (1)
  1. README.md +2 -3
README.md CHANGED
@@ -1,8 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
 
-
 # Leopard-LLaVA
 
 [Paper](https://arxiv.org/abs/2410.01744) | [Github](https://github.com/tencent-ailab/Leopard) | [Models-LLaVA](https://huggingface.co/wyu1/Leopard-LLaVA) | [Models-Idefics2](https://huggingface.co/wyu1/Leopard-Idefics2)
@@ -17,8 +18,6 @@ and resolutions of the input images. Experiments across a wide range of benchmar
 
 For Leopard-LLaVA, we use SigLIP-SO-400M with 364 × 364 image resolutions as the visual encoder, as it supports a larger resolution than the commonly used 224 × 224 resolution CLIP visual encoder. Each image is encoded into a sequence of 26 × 26 = 676 visual features with a patch size of 14. With the visual feature pixel shuffling strategy, each image is further processed into a sequence of 169 visual features. We limit the maximum number of images (M) in each sample to 50, which produces up to 8,450 visual features in total. Following LLaVA, we adopt a two-layer MLP as the visual-language connector. We use LLaMA-3.1 as the language model.
 
-
-
 ## Citation
 ```
 @article{jia2024leopard,
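
The feature counts quoted in the README paragraph within the diff above can be reproduced with simple arithmetic. The sketch below is illustrative only: `visual_tokens_per_image` and `MAX_IMAGES` are hypothetical names, and the 2 × 2 pixel shuffle factor is an assumption inferred from the 676 to 169 reduction, not a value stated in the README.

```python
def visual_tokens_per_image(resolution: int = 364,
                            patch_size: int = 14,
                            shuffle_factor: int = 2) -> tuple:
    """Return (raw_tokens, merged_tokens) for one image.

    shuffle_factor=2 is an assumption: merging each 2 x 2 group of
    features into one gives the 4x reduction implied by 676 -> 169.
    """
    patches_per_side = resolution // patch_size    # 364 // 14
    raw_tokens = patches_per_side ** 2             # patches on a square grid
    merged_tokens = raw_tokens // shuffle_factor ** 2
    return raw_tokens, merged_tokens


MAX_IMAGES = 50  # the README's limit M on images per sample

raw_tokens, merged_tokens = visual_tokens_per_image()
total_tokens = MAX_IMAGES * merged_tokens

print(raw_tokens, merged_tokens, total_tokens)
```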