Text Generation
Transformers
PyTorch
Safetensors
English
idefics
image-text-to-text
multimodal
text
image
image-to-text
text-generation-inference
Instructions to use HuggingFaceM4/idefics-80b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceM4/idefics-80b with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="HuggingFaceM4/idefics-80b")# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-80b") model = AutoModelForImageTextToText.from_pretrained("HuggingFaceM4/idefics-80b") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use HuggingFaceM4/idefics-80b with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "HuggingFaceM4/idefics-80b" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceM4/idefics-80b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/HuggingFaceM4/idefics-80b
- SGLang
How to use HuggingFaceM4/idefics-80b with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "HuggingFaceM4/idefics-80b" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceM4/idefics-80b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "HuggingFaceM4/idefics-80b" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceM4/idefics-80b", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use HuggingFaceM4/idefics-80b with Docker Model Runner:
docker model run hf.co/HuggingFaceM4/idefics-80b
Commit ·
ade3656
1
Parent(s): de1ddb0
Update README.md
Browse files
README.md
CHANGED
|
@@ -115,19 +115,18 @@ The model is trained on the following data mixture of openly accessible English
|
|
| 115 |
|
| 116 |
| Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
|
| 117 |
|-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
|
| 118 |
-
| [
|
| 119 |
-
| [
|
| 120 |
-
| [
|
| 121 |
-
| [
|
| 122 |
|
| 123 |
-
**
|
| 124 |
-
|
| 125 |
-
**LAION** is a collection of image-text pairs collected from web pages from Common Crawl and texts are obtained using the alternative texts of each image.
|
| 126 |
|
| 127 |
**Wkipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.
|
| 128 |
|
| 129 |
-
**
|
| 130 |
|
|
|
|
| 131 |
|
| 132 |
For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
|
| 133 |
|
|
|
|
| 115 |
|
| 116 |
| Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
|
| 117 |
|-------------|-----------------------------------------|---------------------------|---------------------------|--------|-----------------------------------------|
|
| 118 |
+
| [OBELISC](https://huggingface.co/datasets/HuggingFaceM4/OBELISC) | Unstructured Multimodal Web Documents | TODO | TODO | 1 | 73.85% |
|
| 119 |
+
| [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | TODO | TODO | 3 | 17.18% |
|
| 120 |
+
| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | TODO | TODO | 1 | 6.15%
|
| 121 |
+
| [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | TODO | TODO | 3 | 2.82% | |
|
| 122 |
|
| 123 |
+
**OBELISC** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](TODO).
|
|
|
|
|
|
|
| 124 |
|
| 125 |
**Wkipedia** is the multimodal equivalent of the encyclopedia. We used the English dump of Wikipedia created on February 20th, 2023.
|
| 126 |
|
| 127 |
+
**LAION** is a collection of image-text pairs collected from web pages from Common Crawl and texts are obtained using the alternative texts of each image.
|
| 128 |
|
| 129 |
+
**PMD** is a collection of publicly-available image-text pair datasets. The dataset contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome and a subset of YFCC100M dataset. Due to a server failure at the time of the pre-processing, we did not include SBU captions.
|
| 130 |
|
| 131 |
For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
|
| 132 |
|