Leyo committed on Aug 10, 2023

Commit 4edcc87

1 Parent(s): 697ab43

fix_details_model_card (#11)

- detail fixes (5edd82e6c8071cf4e18b5f390ec494165401ce8b)
- smalll fixes (258ef74eea4986fe296eaf65270e459e989a1ca2)
- add authors' names (8a431e388fb3a8c5d4a8c15e6752f0e3252f03dd)
- Merge commit 'refs/pr/11' of https://huggingface.co/HuggingFaceM4/idefics-80b into pr/11 (71dc0d079ff6b4e0a1d679db5e0d0b07a826274a)
- add colab link (27edbefdc6a43243806f6c78ad41a57b9ce99a2f)
- fix legend + display (7d92337cdbb979005914321b3997b66a21d57a7c)
- fix space (bb80664aa2a26e6af5ef5a7f1628bcf79c0a610d)

Files changed (1)

README.md +11 -11

@@ -60,7 +60,7 @@ The following screenshot is an example of interaction with the instructed model:
 # How to Get Started with the Model
-This [tutorial](https://github.com/huggingface/notebooks/pull/418/) shows a simple example to fine-tune IDEFICS on custom data. This [colab notebook](TODO) showcases how to do the fine-tuning in 4bits precision. TODO: change to the correct link once it's merged.
 We provide quick-start code for both the base and the instruct models.
@@ -139,7 +139,7 @@ for i, t in enumerate(generated_text):
 # Training Details
-## IDEFICS base
 We closely follow the training procedure layed out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
@@ -207,7 +207,7 @@ We start from the base IDEFICS models and fine-tune the models by unfreezing all
 We note that all these datasets were obtained by using ChatGPT/GPT-4 in one way or another.
-Additionally, we found it beneficial to include the pre-training data in the fine-tuning with the following sampling ratios: 5.1% of image-text pairs and 31.0 of multimodal web documents.
 The training objective is the standard next token prediction. We use the following hyper and training parameters:
 | Parameters | | IDEFICS-80b-instruct | IDEFICS-9b-instruct |
@@ -229,7 +229,7 @@ The training objective is the standard next token prediction. We use the followi
 # Evaluation
-## IDEFICS base
 We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.
@@ -243,15 +243,15 @@ As opposed to Flamingo, we did not train IDEFICS on video-text pairs datasets, a
 We note that since IDEFICS was trained on PMD (which contains COCO), the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitely have this dataset in the training mixture. Additionally, Flamingo is trained with images of resolution 320 x 320 while IDEFICS and OpenFlamingo were trained with images of 224 x 224 resolution.
-| Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group (text/image)</nobr> |
 |:------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
-| IDEFICS 80B |       0 |                 60.0 |                 45.2 |                   30.9 |                  36.0 |               56.8 |           91.8 |             65.0 |             53.7 |             48.8 |                     60.6 |                   68.9 |                      60.5 |                               8.0 (18.75/22.5)|
 |             |       4 |                 63.6 |                 52.4 |                   34.4 |                  40.4 |               72.7 |          110.3 |             99.6 |             73.7 |             48.4 |                     57.8 |                   58.9 |                      66.6 |                              - |
 |             |       8 |                 64.8 |                 55.1 |                   35.7 |                  46.1 |               77.6 |          114.3 |            105.7 |             76.6 |             47.9 |                     58.2 |                   - |                      67.8 |                              - |
 |             |      16 |                 65.4 |                 56.8 |                   36.3 |                  48.3 |               81.4 |          116.6 |            107.0 |             80.1 |             - |                     55.8 |                   - |                      67.7 |                              - |
 |             |      32 |                 65.9 |                 57.8 |                   36.7 |                  50.0 |               82.7 |          116.6 |            107.5 |             81.1 |             - |                     52.5 |                   - |                      67.3 |                              - |
 <br>
-| IDEFICS 9B  |       0 |                 50.9 |                 38.4 |                   25.9 |                  35.5 |               25.4 |           46.0 |             36.8 |             27.3 |             48.7 |                     51.7 |                   44.2 |                      61.8 |                               5.0 (16.8/20.8) |
 |             |       4 |                 55.4 |                 45.5 |                   27.6 |                  36.9 |               60.0 |           93.0 |             81.3 |             59.7 |             47.9 |                     50.7 |                   37.4 |                      62.3 |                              - |
 |             |       8 |                 56.4 |                 47.7 |                   27.5 |                  40.4 |               63.2 |           97.0 |             86.8 |             61.9 |             47.6 |                     51.0 |                   - |                      66.3 |                              - |
 |             |      16 |                 57.0 |                 48.4 |                   27.9 |                  42.6 |               67.4 |           99.7 |             89.4 |             64.5 |             - |                     50.9 |                   - |                      67.8 |                              - |
@@ -271,9 +271,9 @@ For ImageNet-1k, we also report results where the priming samples are selected t
 Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that M3IT contains in the training set a handful of the benchmarks we were evaluating on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We select the checkpoint at step 3'000 for IDEFICS-80b-instruct and at step 8'000 for IDEFICS-9b-instruct.
-| Model | Shots | <nobr>VQAv2 <br>OE VQA acc.</nobr> | <nobr>OKVQA <br>OE VQA acc.</nobr> | <nobr>TextVQA <br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps <br>CIDEr</nobr> | <nobr>Coco <br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial <br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA <br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group (text/image)</nobr> |
 | :--------------------- | --------: | ---------------------: | ---------------------: | -----------------------: | ----------------------: | -------------------: | ---------------: | -----------------: | -----------------: | -----------------: | -------------------------: | -----------------------: | --------------------------: | ----------------------------------: |
-| Finetuning data does not contain dataset | - | &#10060; | &#10060; | &#10060; | &#10004; | &#10060; | &#10060; | &#10004; | &#10004; | &#10060; | &#10004; | &#10060; | &#10004; | &#10060; |
 | <nobr>IDEFICS 80B Instruct<br> | 0 | 37.4 (-22.7) | 36.9 (-8.2) | 32.9 (1.9) | 26.2 (-9.8) | 76.5 (19.7) | 117.2 (25.4) | 104.5 (39.5) | 65.3 (11.7) | 49.3 (0.4) | 58.9 (-1.7) | 69.5 (0.5) | 67.3 (6.8) | 9.2/20.0/25.0 (1.2/1.2/2.5) |
 |  | 4 | 67.5 (4.0) | 54.0 (1.7) | 37.8 (3.5) | 39.8 (-0.7) | 71.7 (-1.0) | 116.9 (6.6) | 104.0 (4.4) | 67.1 (-6.6) | 48.9 (0.5) | 57.5 (-0.3) | 60.5 (1.6) | 65.5 (-1.1) | - |
 |  | 8 | 68.1 (3.4) | 56.9 (1.8) | 38.2 (2.5) | 44.8 (-1.3) | 72.7 (-4.9) | 116.8 (2.5) | 104.8 (-0.9) | 70.7 (-5.9) | 48.2 (0.3) | 58.0 (-0.2) | - | 68.6 (0.8) | - |
@@ -286,6 +286,7 @@ Similarly to the base IDEFICS models, we performed checkpoint selection to stop
 |  | 16 | 66.8 (9.8) | 51.7 (3.3) | 31.6 (3.7) | 44.8 (2.3) | 70.2 (2.7) | 128.8 (29.1) | 101.5 (12.2) | 75.8 (11.4) | - | 51.7 (0.7) | - | 63.3 (-4.6) | - |
 |  | 32 | 66.9 (9.0) | 52.3 (2.7) | 32.0 (3.7) | 46.0 (2.2) | 71.7 (3.6) | 127.8 (29.8) | 101.0 (10.5) | 76.3 (11.9) | - | 50.8 (1.0) | - | 60.9 (-6.1) | - |
 # Technical Specifications
@@ -393,8 +394,7 @@ We release the additional weights we trained under an MIT license.
 # Model Card Authors
-V, i, c, t, o, r, ,,  , S, t, a, s, ,,  , X, X, X
 # Model Card Contact
 Please open a discussion on the Community tab!

 # How to Get Started with the Model
+This [tutorial](https://github.com/huggingface/notebooks/pull/418/) shows a simple example to fine-tune IDEFICS on custom data. This [colab notebook](https://colab.research.google.com/drive/1o6hSdApDoaavkAXTI7clIG1ZWfJvwzRj?usp=sharing) showcases how to do the fine-tuning in 4bits precision. TODO: change to the correct link once it's merged.
 We provide quick-start code for both the base and the instruct models.
 # Training Details
+## IDEFICS
 We closely follow the training procedure layed out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
 We note that all these datasets were obtained by using ChatGPT/GPT-4 in one way or another.
+Additionally, we found it beneficial to include the pre-training data in the fine-tuning with the following sampling ratios: 5.1% of image-text pairs and 30.7% of OBELICS multimodal web documents.
 The training objective is the standard next token prediction. We use the following hyper and training parameters:
 | Parameters | | IDEFICS-80b-instruct | IDEFICS-9b-instruct |
 # Evaluation
+## IDEFICS
 We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.
 We note that since IDEFICS was trained on PMD (which contains COCO), the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitely have this dataset in the training mixture. Additionally, Flamingo is trained with images of resolution 320 x 320 while IDEFICS and OpenFlamingo were trained with images of 224 x 224 resolution.
+| Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group/text/image</nobr> |
 |:------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
+| IDEFICS 80B |       0 |                 60.0 |                 45.2 |                   30.9 |                  36.0 |               56.8 |           91.8 |             65.0 |             53.7 |             48.8 |                     60.6 |                   68.9 |                      60.5 |                               8.0/18.75/22.5|
 |             |       4 |                 63.6 |                 52.4 |                   34.4 |                  40.4 |               72.7 |          110.3 |             99.6 |             73.7 |             48.4 |                     57.8 |                   58.9 |                      66.6 |                              - |
 |             |       8 |                 64.8 |                 55.1 |                   35.7 |                  46.1 |               77.6 |          114.3 |            105.7 |             76.6 |             47.9 |                     58.2 |                   - |                      67.8 |                              - |
 |             |      16 |                 65.4 |                 56.8 |                   36.3 |                  48.3 |               81.4 |          116.6 |            107.0 |             80.1 |             - |                     55.8 |                   - |                      67.7 |                              - |
 |             |      32 |                 65.9 |                 57.8 |                   36.7 |                  50.0 |               82.7 |          116.6 |            107.5 |             81.1 |             - |                     52.5 |                   - |                      67.3 |                              - |
 <br>
+| IDEFICS 9B  |       0 |                 50.9 |                 38.4 |                   25.9 |                  35.5 |               25.4 |           46.0 |             36.8 |             27.3 |             48.7 |                     51.7 |                   44.2 |                      61.8 |                               5.0/16.8/20.8 |
 |             |       4 |                 55.4 |                 45.5 |                   27.6 |                  36.9 |               60.0 |           93.0 |             81.3 |             59.7 |             47.9 |                     50.7 |                   37.4 |                      62.3 |                              - |
 |             |       8 |                 56.4 |                 47.7 |                   27.5 |                  40.4 |               63.2 |           97.0 |             86.8 |             61.9 |             47.6 |                     51.0 |                   - |                      66.3 |                              - |
 |             |      16 |                 57.0 |                 48.4 |                   27.9 |                  42.6 |               67.4 |           99.7 |             89.4 |             64.5 |             - |                     50.9 |                   - |                      67.8 |                              - |
 Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that M3IT contains in the training set a handful of the benchmarks we were evaluating on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We select the checkpoint at step 3'000 for IDEFICS-80b-instruct and at step 8'000 for IDEFICS-9b-instruct.
+| Model | Shots | <nobr>VQAv2 <br>OE VQA acc.</nobr> | <nobr>OKVQA <br>OE VQA acc.</nobr> | <nobr>TextVQA <br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps <br>CIDEr</nobr> | <nobr>Coco <br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial <br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA <br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group/text/image</nobr> |
 | :--------------------- | --------: | ---------------------: | ---------------------: | -----------------------: | ----------------------: | -------------------: | ---------------: | -----------------: | -----------------: | -----------------: | -------------------------: | -----------------------: | --------------------------: | ----------------------------------: |
+| Finetuning data **does not** contain the evaluation dataset | - | &#10006; | &#10006; | &#10006; | &#10004; | &#10006; | &#10006; | &#10006; | &#10004; | &#10006; | &#10004; | &#10006; | &#10004; | &#10006; |
 | <nobr>IDEFICS 80B Instruct<br> | 0 | 37.4 (-22.7) | 36.9 (-8.2) | 32.9 (1.9) | 26.2 (-9.8) | 76.5 (19.7) | 117.2 (25.4) | 104.5 (39.5) | 65.3 (11.7) | 49.3 (0.4) | 58.9 (-1.7) | 69.5 (0.5) | 67.3 (6.8) | 9.2/20.0/25.0 (1.2/1.2/2.5) |
 |  | 4 | 67.5 (4.0) | 54.0 (1.7) | 37.8 (3.5) | 39.8 (-0.7) | 71.7 (-1.0) | 116.9 (6.6) | 104.0 (4.4) | 67.1 (-6.6) | 48.9 (0.5) | 57.5 (-0.3) | 60.5 (1.6) | 65.5 (-1.1) | - |
 |  | 8 | 68.1 (3.4) | 56.9 (1.8) | 38.2 (2.5) | 44.8 (-1.3) | 72.7 (-4.9) | 116.8 (2.5) | 104.8 (-0.9) | 70.7 (-5.9) | 48.2 (0.3) | 58.0 (-0.2) | - | 68.6 (0.8) | - |
 |  | 16 | 66.8 (9.8) | 51.7 (3.3) | 31.6 (3.7) | 44.8 (2.3) | 70.2 (2.7) | 128.8 (29.1) | 101.5 (12.2) | 75.8 (11.4) | - | 51.7 (0.7) | - | 63.3 (-4.6) | - |
 |  | 32 | 66.9 (9.0) | 52.3 (2.7) | 32.0 (3.7) | 46.0 (2.2) | 71.7 (3.6) | 127.8 (29.8) | 101.0 (10.5) | 76.3 (11.9) | - | 50.8 (1.0) | - | 60.9 (-6.1) | - |
+*() Improvement over non-instruct version.
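The parenthesised values in the instruct table can be cross-checked against the base table. The sketch below does this for the 0-shot IDEFICS 80B rows on the first six benchmarks, using only the numbers shown in the two tables. The dictionary names and the 0.1 tolerance are illustrative choices, not part of the model card. Some recomputed deltas differ from the reported ones by 0.1, presumably because the card computed them from unrounded scores (an assumption).

```python
# 0-shot scores copied from the tables: IDEFICS 80B (base) and IDEFICS 80B Instruct.
base = {"VQAv2": 60.0, "OKVQA": 45.2, "TextVQA": 30.9,
        "VizWiz": 36.0, "TextCaps": 56.8, "COCO": 91.8}
instruct = {"VQAv2": 37.4, "OKVQA": 36.9, "TextVQA": 32.9,
            "VizWiz": 26.2, "TextCaps": 76.5, "COCO": 117.2}
# Deltas as printed in parentheses in the instruct table.
reported = {"VQAv2": -22.7, "OKVQA": -8.2, "TextVQA": 1.9,
            "VizWiz": -9.8, "TextCaps": 19.7, "COCO": 25.4}


def deltas(instruct_scores, base_scores):
    """Return instruct minus base for every benchmark, rounded to one decimal."""
    return {name: round(instruct_scores[name] - base_scores[name], 1)
            for name in instruct_scores}


computed = deltas(instruct, base)

# Illustrative tolerance: one unit in the last printed decimal place.
TOLERANCE = 0.1 + 1e-9

for name, value in computed.items():
    status = "ok" if abs(value - reported[name]) <= TOLERANCE else "mismatch"
    print(f"{name}: computed {value:+.1f}, reported {reported[name]:+.1f} ({status})")
```

A positive delta means the instruct model scores higher than the base model on that benchmark.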
 # Technical Specifications
 # Model Card Authors
+Stas Bekman, Victor Sanh, Léo Tronchon, Hugo Laurençon
 # Model Card Contact
 Please open a discussion on the Community tab!
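The updated Training Details text states that pre-training data is mixed into fine-tuning with sampling ratios of 5.1% image-text pairs and 30.7% OBELICS multimodal web documents. A minimal sketch of what such a mixture could look like follows. It assumes the remaining probability mass goes to the instruction datasets, which the card does not state explicitly, and the source names and the `sample_sources` helper are hypothetical. It is not the authors' data loader.

```python
import random

# Ratios stated in the model card for pre-training data reused during fine-tuning.
PRETRAIN_RATIOS = {
    "image_text_pairs": 0.051,
    "obelics_web_documents": 0.307,
}

# ASSUMPTION: whatever is left is drawn from the instruction fine-tuning datasets.
ratios = dict(PRETRAIN_RATIOS)
ratios["instruction_data"] = 1.0 - sum(PRETRAIN_RATIOS.values())


def sample_sources(n, seed=0):
    """Draw n data-source names according to the mixture ratios (hypothetical helper)."""
    rng = random.Random(seed)
    names = list(ratios)
    weights = [ratios[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Count how often each source is picked over many draws.
counts = {}
for source in sample_sources(10_000):
    counts[source] = counts.get(source, 0) + 1

for name in ratios:
    print(f"{name}: target {ratios[name]:.3f}, observed {counts.get(name, 0) / 10_000:.3f}")
```

With enough draws, the observed frequencies should approach the target ratios.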