Improve model card: Add pipeline tag, library name, paper, code, and project page links
This PR improves the model card for **AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling** by:
- Adding the `pipeline_tag: any-to-any` to the metadata, which accurately reflects the model's multimodal capabilities and improves discoverability on the Hugging Face Hub (e.g., at https://huggingface.co/models?pipeline_tag=any-to-any).
- Adding `library_name: transformers` to the metadata. This is supported by the `config.json` file (showing `LlamaForCausalLM` and `transformers_version`), enabling the automated "how to use" widget and code snippets for the `transformers` library.
- Adding prominent direct links to the Hugging Face paper page (https://huggingface.co/papers/2402.12226), the GitHub repository (https://github.com/OpenMOSS/AnyGPT), and the project page (https://junzhan2000.github.io/AnyGPT.github.io/) at the very top of the model card content. This provides users with immediate access to key resources.
These additions improve the model's visibility, usability, and compliance with Hugging Face Hub best practices.
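With `library_name: transformers` set, the Hub can surface a loading snippet along these lines. This is a minimal sketch rather than an official usage example: the repo id `fnlp/AnyGPT-chat` is an assumption, and it exercises only the text backbone (`LlamaForCausalLM`, per `config.json`); full multimodal inference needs the separate image/speech tokenizers covered by the card's CLI instructions.

```python
# Minimal sketch of what the transformers widget enables.
# Assumptions: the repo id "fnlp/AnyGPT-chat" (hypothetical here) and a
# LlamaForCausalLM checkpoint, as config.json indicates. This drives only
# the text backbone; speech/image/music need the external tokenizers.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "fnlp/AnyGPT-chat"  # assumed Hub id for this model card
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Hello, AnyGPT!", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```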
````diff
@@ -1,10 +1,29 @@
 ---
+datasets:
+- fnlp/AnyInstruct
+language:
+- en
 license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
+---
+
+---
 datasets:
 - fnlp/AnyInstruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
 ---
+
+[Paper: AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling](https://huggingface.co/papers/2402.12226)
+
+[Code: OpenMOSS/AnyGPT](https://github.com/OpenMOSS/AnyGPT)
+
+[Project Page](https://junzhan2000.github.io/AnyGPT.github.io/)
+
 # Chat model for paper "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"
 
 ## Introduction
@@ -102,6 +121,49 @@
 
 The speech and music response will be saved to ```.wav``` files, and the image response will be saved to a ```jpg```. The filename will be a concatenation of the prompt and the time. The paths to these files will be indicated in the response.
 
+### Chat model CLI Inference
+
+```bash
+python anygpt/src/infer/cli_infer_chat_model.py \
+--model-name-or-path 'path/to/model' \
+--image-tokenizer-path 'path/to/model' \
+--speech-tokenizer-path 'path/to/model' \
+--speech-tokenizer-config 'path/to/config' \
+--soundstorm-path 'path/to/model' \
+--output-dir "infer_output/chat"
+```
+
+For example:
+
+```bash
+python anygpt/src/infer/cli_infer_chat_model.py \
+--model-name-or-path models/anygpt/chat \
+--image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
+--speech-tokenizer-path models/speechtokenizer/ckpt.dev \
+--speech-tokenizer-config models/speechtokenizer/config.json \
+--soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
+--output-dir "infer_output/chat"
+```
+
+Instruct format:
+
+```bash
+interleaved|{text_instruction}|{modality}|{image_path}|{voice_prompt}|{speech_instruction}|{music_path}
+```
+
+Where ``text_instruction`` is the input text command and ``speech_instruction`` is the input voice command; specify only one of the two.
+
+``image_path`` and ``music_path`` are the paths to the input image and music, respectively. ``voice_prompt`` specifies the voice of the model's spoken response; if it is omitted, a random voice is used.
+
+``modality`` is the desired output modality, which can be speech, image, or music; anything else is treated as text. It only selects which decoding configuration file under the config directory the model uses (because the model's training is limited, different modalities require different decoding strategies). Alternatively, decoding can proceed token by token, switching to the corresponding modality's strategy once that modality's start token is generated.
+
+**Examples**
+
+* interleaved||image|||static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav
+* interleaved||music|||static/infer/speech/instruction/Give me a similar style of music.wav
+
+To clear the conversation history, input ``|clear``
+
 ### Training
 #### Pretraining
 
````
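As a side note on the instruct format the diff introduces: the seven `|`-separated fields are easy to assemble programmatically. The helper below is hypothetical (it is not part of the AnyGPT codebase); it simply mirrors the field order documented in the added section, dropping trailing empty fields the way the card's examples do.

```python
# Hypothetical helper, not from the AnyGPT repo: builds the pipe-delimited
# instruct string described in the model card's CLI section.
def build_instruct(text_instruction="", modality="", image_path="",
                   voice_prompt="", speech_instruction="", music_path=""):
    # The card says to specify only one of the text/speech instructions.
    if bool(text_instruction) == bool(speech_instruction):
        raise ValueError("specify exactly one of text_instruction / speech_instruction")
    fields = ["interleaved", text_instruction, modality, image_path,
              voice_prompt, speech_instruction, music_path]
    while fields[-1] == "":  # the card's examples omit trailing empty fields
        fields.pop()
    return "|".join(fields)

# Reproduces the card's image example:
print(build_instruct(
    modality="image",
    speech_instruction="static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav",
))
# -> interleaved||image|||static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav
```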