eustlb (HF Staff) committed · Commit cff39b2 · verified · 1 parent: a0a212e

Create README.md

Files changed (1)
  1. README.md +392 -0
README.md ADDED
---
license: apache-2.0
language:
- en
pipeline_tag: text-to-speech
tags:
- text-to-speech
---

## CSM 1B

**2025/03/13** - We are releasing the 1B CSM variant. The original code is available on GitHub: [SesameAILabs/csm](https://github.com/SesameAILabs/csm).

---

CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ (residual vector quantization) audio codes from text and audio inputs. The architecture pairs a [Llama](https://www.llama.com/) backbone with a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.
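
To make the term "RVQ audio codes" concrete, here is a minimal, standard-library sketch of residual vector quantization on toy 2-D vectors. It only illustrates the general technique: the codebooks, the function names (`nearest`, `rvq_encode`, `rvq_decode`), and the two-stage setup are made up for this example and are not Mimi's actual codebooks or implementation.

```python
# Toy residual vector quantization (RVQ), standard library only.
# The codebooks below are invented for illustration; they are not Mimi's.

def nearest(codebook, vector):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # What this stage could not represent is passed on to the next stage.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage, refines the residual
]
codes = rvq_encode([1.2, 0.9], codebooks)
reconstruction = rvq_decode(codes, codebooks)
print(codes)           # [1, 1]
print(reconstruction)  # [1.25, 1.0]
```

Each input vector is thus represented by one small integer per stage, which is the kind of discrete code sequence a model can be trained to predict.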

A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

A hosted [Hugging Face Space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.

## Usage

### Without Conversational Context

CSM can generate speech directly from a text prompt:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]The past is just a story we tell ourselves."  # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)

# another equivalent way to prepare the inputs
conversation = [
    {"role": "0", "content": [{"type": "text", "text": "The past is just a story we tell ourselves."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# run inference
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
```
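
The two input styles above describe the same prompt: a speaker id plus text. As a minimal sketch, assuming only the message layout shown in this README, a small helper can build such turns and show how the `[0]` prefix relates to the `role` field. `make_turn` is a hypothetical name introduced here, not a `transformers` function.

```python
# Hypothetical helper (not part of transformers): build one chat-template turn.
def make_turn(speaker_id, text, audio=None):
    """Return a message dict in the layout used by the examples in this README."""
    content = [{"type": "text", "text": text}]
    if audio is not None:
        # Context turns carry their audio next to the text.
        content.append({"type": "audio", "path": audio})
    return {"role": str(speaker_id), "content": content}

conversation = [make_turn(0, "The past is just a story we tell ourselves.")]

# The flat-text style prefixes the text with the speaker id in brackets.
first = conversation[0]
flat_prompt = f"[{first['role']}]{first['content'][0]['text']}"
print(flat_prompt)  # [0]The past is just a story we tell ourselves.
```

The resulting `conversation` list has the same shape as the one passed to `processor.apply_chat_template` above.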

### With Conversational Context

CSM can also generate speech conditioned on a conversation, which keeps voices consistent across turns and makes generation content-aware:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []

# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
    conversation.append(
        {
            "role": f"{speaker_id}",
            "content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
        }
    )

# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# run inference
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```
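
The example above casts the audio to 24 kHz before handing it to the processor. As a rough sketch of what that implies for sequence length, the snippet below estimates how many codec frames a clip produces. It assumes the 12.5 Hz frame rate stated on the Mimi model card, which is an outside figure and not something this README specifies; the function names and the rounding-up of a partial last frame are likewise assumptions for illustration.

```python
SAMPLING_RATE = 24_000  # Hz, the rate the audio is cast to in the example above
FRAME_RATE = 12.5       # Hz, assumed from the Mimi model card, not from this README

def samples_per_frame(sampling_rate=SAMPLING_RATE, frame_rate=FRAME_RATE):
    """Number of waveform samples summarized by one codec frame."""
    return int(sampling_rate / frame_rate)

def num_frames(num_samples, sampling_rate=SAMPLING_RATE, frame_rate=FRAME_RATE):
    """Frames needed to cover `num_samples`, rounding a partial last frame up."""
    spf = samples_per_frame(sampling_rate, frame_rate)
    return -(-num_samples // spf)  # ceiling division

print(samples_per_frame())     # 1920
print(num_frames(3 * 24_000))  # 38 frames for a 3-second clip
```

Under these assumptions, a few seconds of context audio adds a few dozen frames per turn, which is worth keeping in mind when choosing generation length limits.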

### Batched Inference

CSM supports batched inference, so several prompts can be processed in a single `generate` call:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio

model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# here a batch with two prompts
conversation = [
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
                {"type": "audio", "path": ds[0]["audio"]["array"]},
            ],
        },
        {
            "role": f"{ds[1]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[1]["text"]},
            ],
        },
    ],
    [
        {
            "role": f"{ds[0]['speaker_id']}",
            "content": [
                {"type": "text", "text": ds[0]["text"]},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
```
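
Batching prompts of different lengths generally requires padding them to a common length and tracking which positions are real. The sketch below illustrates that general idea in plain Python. It is not the processor's actual implementation: `pad_batch`, the pad id of `0`, and the default left-padding are assumptions made purely for illustration.

```python
def pad_batch(sequences, pad_id=0, side="left"):
    """Pad token-id lists to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)  # 0 marks padded positions
        real = [1] * len(seq)          # 1 marks real tokens
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

In the example above, `processor.apply_chat_template` takes care of this step when given a list of conversations.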

### Making The Model Go Brrr

CSM supports full-graph compilation with CUDA graphs! The first generation pays the cost of compiling and recording the graphs, and later generations reuse them:

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset

model_id = "eustlb/csm-1b"
device = "cuda"

# enable logs to check that no recompilation or graph break occurs
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# use a static cache, which automatically enables torch.compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250  # big enough to avoid recompilation
model.generation_config.max_new_tokens = None  # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"

# generation kwargs
gen_kwargs = {
    "do_sample": False,
    "depth_decoder_do_sample": False,
    "temperature": 1.0,
    "depth_decoder_temperature": 1.0,
}

# define a timing context manager
class TimerContext:
    def __init__(self, name="Execution"):
        self.name = name
        self.start_event = None
        self.end_event = None

    def __enter__(self):
        # Use CUDA events for more accurate GPU timing
        self.start_event = torch.cuda.Event(enable_timing=True)
        self.end_event = torch.cuda.Event(enable_timing=True)
        self.start_event.record()
        return self

    def __exit__(self, *args):
        self.end_event.record()
        torch.cuda.synchronize()
        # elapsed_time returns milliseconds
        elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
        print(f"{self.name} time: {elapsed_time:.4f} seconds")

# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")

conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[0]["text"]},
            {"type": "audio", "path": ds[0]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[1]["text"]},
            {"type": "audio", "path": ds[1]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
        ],
    },
]

padded_inputs_1 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "="*50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)

print("\n" + "="*50)
print("Second generation - fast !!!")
with TimerContext("Second generation"):
    _ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)

# now with different inputs
conversation = [
    {
        "role": f"{ds[0]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[2]["text"]},
            {"type": "audio", "path": ds[2]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[1]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[3]["text"]},
            {"type": "audio", "path": ds[3]["audio"]["array"]},
        ],
    },
    {
        "role": f"{ds[2]['speaker_id']}",
        "content": [
            {"type": "text", "text": ds[4]["text"]},
        ],
    },
]
padded_inputs_2 = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

print("\n" + "="*50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
    _ = model.generate(**padded_inputs_2, **gen_kwargs)
print("="*50)
```
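
The `TimerContext` above relies on CUDA events, so it only works on a GPU. For a quick comparison on a machine without CUDA, a wall-clock variant can be sketched with the standard library. `WallClockTimer` is a hypothetical name introduced here, and the summed squares are only a stand-in workload for `model.generate`.

```python
import time

class WallClockTimer:
    """Context manager that reports elapsed wall-clock time (no CUDA required)."""

    def __init__(self, name="Execution"):
        self.name = name
        self.start = None
        self.elapsed = None

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *args):
        self.elapsed = time.perf_counter() - self.start
        print(f"{self.name} time: {self.elapsed:.4f} seconds")

# usage: time a small stand-in workload instead of model.generate
with WallClockTimer("Stand-in workload") as timer:
    total = sum(i * i for i in range(100_000))
```

Wall-clock timing does not account for asynchronous GPU execution, so on a GPU either call `torch.cuda.synchronize()` before reading the clock or keep the CUDA-event version.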

### Fine-tuning & training

CSM can be fine-tuned using [Transformers' Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer). The data collator below builds one conversation per sample and asks the processor to return labels (`output_labels=True`):

```python
from datasets import load_dataset, Audio
from transformers import (
    CsmForConditionalGeneration,
    TrainingArguments,
    CsmProcessor,
    Trainer,
)

processor = CsmProcessor.from_pretrained("eustlb/csm-1b")
model = CsmForConditionalGeneration.from_pretrained("eustlb/csm-1b")
model.train()

ds = load_dataset("eustlb/dailytalk-conversations-grouped", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

def data_collator(samples):
    conversations = []

    for sample in samples:
        # each sample holds one concatenated waveform plus (start, end) cut indices per utterance
        concatenated_audio_array = sample["audio"]["array"]
        audio_segments = [concatenated_audio_array[s:e] for s, e in sample["audio_cut_idxs"]]

        conversation = []
        for speaker_id, text, audio in zip(sample["speaker_ids"], sample["texts"], audio_segments):
            conversation.append({
                "role": f"{speaker_id}",
                "content": [
                    {"type": "text", "text": text},
                    {"type": "audio", "audio": audio},
                ],
            })

        conversations.append(conversation)

    inputs = processor.apply_chat_template(
        conversations,
        tokenize=True,
        return_dict=True,
        output_labels=True,
    )
    return inputs

training_args = TrainingArguments(
    "test-trainer",
    remove_unused_columns=False,
    gradient_checkpointing=True,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=ds,
    data_collator=data_collator,
)

trainer.train()
```
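
The collator above splits one concatenated waveform into per-utterance segments using `audio_cut_idxs`. The toy example below shows that slicing step in isolation. It assumes, based only on how the collator unpacks the field, that each entry is a `(start, end)` pair with an exclusive `end`; `split_by_cut_idxs` is a hypothetical helper name.

```python
def split_by_cut_idxs(samples, cut_idxs):
    """Slice one concatenated sequence into per-utterance segments.

    `cut_idxs` is a list of (start, end) pairs; `end` is exclusive, as in Python slicing.
    """
    return [samples[start:end] for start, end in cut_idxs]

# toy waveform: integers stand in for audio samples
waveform = list(range(10))
cut_idxs = [(0, 4), (4, 7), (7, 10)]
segments = split_by_cut_idxs(waveform, cut_idxs)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each segment is then paired with its speaker id and text to form one turn of the conversation.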

## FAQ

**Does this model come with any voices?**

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

**Can I converse with the model?**

CSM is trained to be an audio generation model, not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.

**Does it support other languages?**

The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

## Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following:

- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.

**Authors**
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.