Improve model card: Add pipeline tag, library name, paper, code, and project page links
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,10 +1,29 @@

Old (lines 1-10):

---
license: apache-2.0
datasets:
- fnlp/AnyInstruct
language:
- en
---
# Chat model for paper "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

## Introduction
@@ -102,6 +121,49 @@ Due to limitations in data and training resources, the model's generation may st

Old (lines 102-107):

The speech and music responses will be saved as `.wav` files and the image response as a `.jpg` file. The filename is a concatenation of the prompt and the time; the paths to these files are given in the response.

### Training
#### Pretraining
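The filename convention mentioned above (prompt concatenated with the time) can be sketched as follows. This is illustrative only: `response_path` is a hypothetical helper, not part of the AnyGPT code, and the script's actual naming scheme may differ in detail.

```python
import re
import time

def response_path(prompt, ext, out_dir="infer_output/chat"):
    # Illustrative sketch: the card says filenames concatenate the
    # prompt and the time; the exact scheme is the script's choice.
    stem = re.sub(r"[^\w\- ]", "", prompt).strip().replace(" ", "_")[:60]
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return f"{out_dir}/{stem}_{stamp}.{ext}"
```

For example, `response_path("Can you draw me a picture of a sunny beach", "jpg")` yields a path like `infer_output/chat/Can_you_draw_me_a_picture_of_a_sunny_beach_20240301-120000.jpg`.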
New (lines 1-29), with `+` marking the lines added by this PR:

 ---
+datasets:
+- fnlp/AnyInstruct
+language:
+- en
 license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
+---
+
+---
 datasets:
 - fnlp/AnyInstruct
 language:
 - en
+license: apache-2.0
+pipeline_tag: any-to-any
+library_name: transformers
 ---
+
+[Paper: AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling](https://huggingface.co/papers/2402.12226)
+
+[Code: OpenMOSS/AnyGPT](https://github.com/OpenMOSS/AnyGPT)
+
+[Project Page](https://junzhan2000.github.io/AnyGPT.github.io/)
+
 # Chat model for paper "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

 ## Introduction
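Only the leading `---` ... `---` block at the very top of a README is treated as card metadata. As a quick local check, a minimal sketch (not the Hub's actual parser; a real implementation would use a YAML library) can confirm that the keys this PR adds land in that first block:

```python
def leading_metadata(readme_text):
    # Minimal sketch: read the first '---' ... '---' block at the top of
    # the file and collect simple 'key: value' lines (enough for this
    # card's flat keys; list items like '- en' are skipped).
    lines = readme_text.splitlines()
    if not lines or lines[0] != "---":
        return {}
    end = lines[1:].index("---") + 1  # index of the closing '---'
    meta = {}
    for line in lines[1:end]:
        if ":" in line and not line.startswith("- "):
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return meta
```

Running it on the first nine lines of the new README returns the `license`, `pipeline_tag`, and `library_name` values added by this PR.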
New (lines 121-169); lines 124-166 are added by this PR:

The speech and music responses will be saved as `.wav` files and the image response as a `.jpg` file. The filename is a concatenation of the prompt and the time; the paths to these files are given in the response.

### Chat model CLI Inference

```bash
python anygpt/src/infer/cli_infer_chat_model.py \
    --model-name-or-path 'path/to/model' \
    --image-tokenizer-path 'path/to/model' \
    --speech-tokenizer-path 'path/to/model' \
    --speech-tokenizer-config 'path/to/config' \
    --soundstorm-path 'path/to/model' \
    --output-dir "infer_output/chat"
```

For example:

```bash
python anygpt/src/infer/cli_infer_chat_model.py \
    --model-name-or-path models/anygpt/chat \
    --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
    --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
    --speech-tokenizer-config models/speechtokenizer/config.json \
    --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
    --output-dir "infer_output/chat"
```

Instruction format:

```
interleaved|{text_instruction}|{modality}|{image_path}|{voice_prompt}|{speech_instruction}|{music_path}
```

`text_instruction` is the input text instruction and `speech_instruction` is the input speech instruction; only one of the two needs to be specified.

`image_path` and `music_path` are the paths of the input image and music, respectively. `voice_prompt` specifies the voice for the model's spoken response; if it is not given, a random voice is used.

`modality` is the desired output modality and can be set to speech, image, or music; any other value is treated as text. It only determines which decoding configuration file under the `config` directory the model uses (the model's training was limited, so different modalities call for different decoding strategies). Alternatively, decoding can proceed token by token, switching to the corresponding modality's decoding strategy once that modality's start token is generated.

**Examples**

* interleaved||image|||static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav
* interleaved||music|||static/infer/speech/instruction/Give me a similar style of music.wav

To clear the conversation history, input `|clear`.

### Training
#### Pretraining
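The pipe-delimited instruction strings above can be generated programmatically. `build_instruction` below is a hypothetical helper, not part of the AnyGPT repository; it fills the slots in the documented order and drops trailing empty slots, matching the two examples in the card:

```python
def build_instruction(text_instruction="", modality="", image_path="",
                      voice_prompt="", speech_instruction="", music_path=""):
    # Hypothetical helper: slot the fields into the documented order.
    fields = ["interleaved", text_instruction, modality, image_path,
              voice_prompt, speech_instruction, music_path]
    # Trailing empty slots are omitted in the card's examples.
    return "|".join(fields).rstrip("|")
```

For instance, `build_instruction(modality="image", speech_instruction="static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav")` reproduces the first example string above.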