nielsr HF Staff committed on
Commit 2bbab83 · verified · 1 Parent(s): c899962

Improve model card: Add pipeline tag, library name, paper, code, and project page links

This PR significantly enhances the model card for **AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling** by:

- Adding `pipeline_tag: any-to-any` to the metadata, which accurately reflects the model's multimodal capabilities and improves discoverability on the Hugging Face Hub (e.g., at https://huggingface.co/models?pipeline_tag=any-to-any).
- Adding `library_name: transformers` to the metadata. This is supported by the `config.json` file (which lists `LlamaForCausalLM` and a `transformers_version`), enabling the automated "how to use" widget and code snippets for the `transformers` library; a minimal loading sketch follows after this list.
- Adding prominent direct links to the Hugging Face paper page (https://huggingface.co/papers/2402.12226), the GitHub repository (https://github.com/OpenMOSS/AnyGPT), and the project page (https://junzhan2000.github.io/AnyGPT.github.io/) at the very top of the model card content. This provides users with immediate access to key resources.

These additions improve the model's visibility, usability, and compliance with Hugging Face Hub best practices.
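
For context, once `library_name: transformers` is set, the Hub's automated "how to use" widget can surface a snippet along the following lines. This is a minimal sketch rather than official usage: the repo ID below is a placeholder, and plain text-only generation is assumed, since full any-to-any inference goes through the repository's own tokenizers and scripts (see the CLI section in the diff below).

```python
# Minimal sketch of loading the checkpoint with transformers.
# Assumptions: "fnlp/AnyGPT-chat" is a placeholder repo ID, and the checkpoint
# loads as a plain causal LM (config.json lists LlamaForCausalLM); real
# multimodal inference relies on the repo's own tokenizers and scripts.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "fnlp/AnyGPT-chat"  # placeholder; substitute the actual Hub repo ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```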

Files changed (1): README.md +62 -0

README.md CHANGED
@@ -1,10 +1,29 @@
---
+ datasets:
+ - fnlp/AnyInstruct
+ language:
+ - en
license: apache-2.0
+ pipeline_tag: any-to-any
+ library_name: transformers
+ ---
+
+ ---
datasets:
- fnlp/AnyInstruct
language:
- en
+ license: apache-2.0
+ pipeline_tag: any-to-any
+ library_name: transformers
---
+
+ [Paper: AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling](https://huggingface.co/papers/2402.12226)
+
+ [Code: OpenMOSS/AnyGPT](https://github.com/OpenMOSS/AnyGPT)
+
+ [Project Page](https://junzhan2000.github.io/AnyGPT.github.io/)
+
# Chat model for paper "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

## Introduction
 
@@ -102,6 +121,49 @@ Due to limitations in data and training resources, the model's generation may st

  The speech and music response will be saved to ```.wav``` files, and the image response will be saved to a ```jpg```. The filename will be a concatenation of the prompt and the time. The paths to these files will be indicated in the response.

+ ### Chat model CLI Inference
+
+ ```bash
+ python anygpt/src/infer/cli_infer_chat_model.py \
+ --model-name-or-path 'path/to/model' \
+ --image-tokenizer-path 'path/to/model' \
+ --speech-tokenizer-path 'path/to/model' \
+ --speech-tokenizer-config 'path/to/config' \
+ --soundstorm-path 'path/to/model' \
+ --output-dir "infer_output/chat"
+ ```
+
+ For example:
+
+ ```bash
+ python anygpt/src/infer/cli_infer_chat_model.py \
+ --model-name-or-path models/anygpt/chat \
+ --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
+ --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
+ --speech-tokenizer-config models/speechtokenizer/config.json \
+ --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
+ --output-dir "infer_output/chat"
+ ```
+
+ Instruct format:
+
+ ```bash
+ interleaved|{text_instruction}|{modality}|{image_path}|{voice_prompt}|{speech_instruction}|{music_path}
+ ```
+
+ where ``text_instruction`` is the input text command and ``speech_instruction`` is the input voice command; only one of the two needs to be specified.
+
+ ``image_path`` and ``music_path`` are the paths to the input image and music, respectively. ``voice_prompt`` specifies the voice used for the model's spoken response; if it is not given, a random voice is used.
+
+ ``modality`` is the desired output modality, chosen from speech, image, or music; anything else is treated as text. It only selects which decoding configuration file under the config directory the model uses (because the model's training is limited, different modalities call for different decoding strategies). Alternatively, one can decode token by token and switch the decoding strategy to the corresponding modality when its start token is generated.
+
+ **Examples**
+
+ * interleaved||image|||static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav
+ * interleaved||music|||static/infer/speech/instruction/Give me a similar style of music.wav
+
+ To clear the conversation history, input ``|clear``.
### Training
#### Pretraining
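
As a quick illustration of the instruct format described in the added CLI section, the sketch below assembles the pipe-separated string. It is a minimal sketch under stated assumptions: the helper name and its validation are hypothetical, and only the field order comes from the format line in the README addition.

```python
# Hypothetical helper for the pipe-separated instruct format:
# interleaved|{text_instruction}|{modality}|{image_path}|{voice_prompt}|{speech_instruction}|{music_path}
# The function name and validation are illustrative, not part of the AnyGPT repo.
def build_instruct(text_instruction: str = "", modality: str = "",
                   image_path: str = "", voice_prompt: str = "",
                   speech_instruction: str = "", music_path: str = "") -> str:
    # The README says exactly one of the text/speech instructions should be given.
    if bool(text_instruction) == bool(speech_instruction):
        raise ValueError("specify exactly one of text_instruction or speech_instruction")
    parts = ["interleaved", text_instruction, modality, image_path,
             voice_prompt, speech_instruction, music_path]
    # The README's examples omit trailing empty fields, so drop them here too.
    while len(parts) > 1 and parts[-1] == "":
        parts.pop()
    return "|".join(parts)

# Reproduces the first example from the README addition:
print(build_instruct(
    modality="image",
    speech_instruction="static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav",
))
# -> interleaved||image|||static/infer/speech/instruction/Can you draw me a picture of a sunny beach.wav
```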