ibrazebra committed on
Commit d8526f3 · 0 parent(s)

Initial upload of CSM-1B

.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
prompts/conversational_a.wav filter=lfs diff=lfs merge=lfs -text
prompts/conversational_b.wav filter=lfs diff=lfs merge=lfs -text
prompts/read_speech_a.wav filter=lfs diff=lfs merge=lfs -text
prompts/read_speech_b.wav filter=lfs diff=lfs merge=lfs -text
prompts/read_speech_c.wav filter=lfs diff=lfs merge=lfs -text
prompts/read_speech_d.wav filter=lfs diff=lfs merge=lfs -text
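
Each entry above maps a glob pattern to the Git LFS clean/smudge filters, so matching files are stored as small pointer files instead of raw blobs. As a rough illustration only (not Git's actual attribute matcher, which handles `**` and path anchoring with its own rules), the matching idea can be sketched with Python's `fnmatch`:

```python
from fnmatch import fnmatch

# A few patterns taken from the .gitattributes above; this matcher is an
# approximation of Git's attribute rules, for illustration only.
lfs_patterns = ["*.safetensors", "*.pt", "*tfevents*", "prompts/conversational_a.wav"]

def is_lfs_tracked(path: str) -> bool:
    # Git matches the basename for patterns without "/", the full path otherwise.
    name = path.rsplit("/", 1)[-1]
    return any(fnmatch(path if "/" in pat else name, pat) for pat in lfs_patterns)

print(is_lfs_tracked("model.safetensors"))             # True
print(is_lfs_tracked("prompts/conversational_a.wav"))  # True
print(is_lfs_tracked("README.md"))                     # False
```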
README.md ADDED
---
license: apache-2.0
language:
- en
pipeline_tag: text-to-speech
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- text-to-speech
---

## CSM 1B

**2025/03/13** - We are releasing the 1B CSM variant. Code is available on GitHub: [SesameAILabs/csm](https://github.com/SesameAILabs/csm).

---

CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.

A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

A hosted [Hugging Face space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.

## Usage

Set up the repo:

```bash
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# You will need access to sesame/csm-1b and meta-llama/Llama-3.2-1B
huggingface-cli login
```

Generate a sentence:

```python
from generator import load_csm_1b
import torchaudio

generator = load_csm_1b(device="cuda")

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# generate() returns a 1-D waveform tensor; add a channel dim before saving.
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker utterance.

```python
from generator import Segment  # Segment is defined alongside load_csm_1b in generator.py

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load the audio and resample it to the rate the generator expects.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
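
The `load_audio` helper resamples each prompt to `generator.sample_rate` before it is passed as context. Conceptually, resampling maps each output sample time back to a fractional position in the input signal; the pure-Python linear-interpolation sketch below is illustrative only (`torchaudio.functional.resample` uses a proper windowed-sinc filter, not this):

```python
def resample_linear(samples, orig_freq, new_freq):
    """Naive linear-interpolation resampler (illustration, not production DSP)."""
    if orig_freq == new_freq:
        return list(samples)
    n_out = int(len(samples) * new_freq / orig_freq)
    out = []
    for i in range(n_out):
        pos = i * orig_freq / new_freq        # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# 4 samples at 16 kHz become 6 samples at 24 kHz (same duration, 1.5x the rate).
upsampled = resample_linear([0.0, 1.0, 0.0, -1.0], orig_freq=16_000, new_freq=24_000)
print(len(upsampled))  # 6
```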

## FAQ

**Does this model come with any voices?**

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

**Can I converse with the model?**

CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.

**Does it support other languages?**

The model has some capacity for non-English languages due to data contamination in the training data, but it is unlikely to perform well in them.

## Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following:

- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.

**Authors**

Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
config.json ADDED
{
  "architectures": [
    "CsmForConditionalGeneration"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "audio_eos_token_id": 128003,
  "audio_token_id": 128002,
  "bos_token_id": 128000,
  "codebook_eos_token_id": 0,
  "codebook_pad_token_id": 2050,
  "codec_config": {
    "_name_or_path": "kyutai/mimi",
    "architectures": [
      "MimiModel"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "audio_channels": 1,
    "codebook_dim": 256,
    "codebook_size": 2048,
    "compress": 2,
    "dilation_growth_rate": 2,
    "frame_rate": 12.5,
    "head_dim": 64,
    "hidden_act": "gelu",
    "hidden_size": 512,
    "initializer_range": 0.02,
    "intermediate_size": 2048,
    "kernel_size": 7,
    "last_kernel_size": 3,
    "layer_scale_initial_scale": 0.01,
    "max_position_embeddings": 8000,
    "model_type": "mimi",
    "norm_eps": 1e-05,
    "normalize": false,
    "num_attention_heads": 8,
    "num_filters": 64,
    "num_hidden_layers": 8,
    "num_key_value_heads": 8,
    "num_quantizers": 32,
    "num_residual_layers": 1,
    "num_semantic_quantizers": 1,
    "pad_mode": "constant",
    "residual_kernel_size": 3,
    "rope_theta": 10000.0,
    "sampling_rate": 24000,
    "sliding_window": 250,
    "torch_dtype": "float32",
    "trim_right_ratio": 1.0,
    "upsample_groups": 512,
    "upsampling_ratios": [
      8,
      6,
      5,
      4
    ],
    "use_cache": false,
    "use_causal_conv": true,
    "use_conv_shortcut": false,
    "vector_quantization_hidden_dimension": 256
  },
  "depth_decoder_config": {
    "attention_bias": false,
    "attention_dropout": 0.0,
    "backbone_hidden_size": 2048,
    "head_dim": 128,
    "hidden_act": "silu",
    "hidden_size": 1024,
    "initializer_range": 0.02,
    "intermediate_size": 8192,
    "max_position_embeddings": 33,
    "mlp_bias": false,
    "model_type": "csm_depth_decoder_model",
    "num_attention_heads": 8,
    "num_codebooks": 32,
    "num_hidden_layers": 4,
    "num_key_value_heads": 2,
    "rms_norm_eps": 1e-05,
    "rope_scaling": {
      "factor": 32.0,
      "high_freq_factor": 0.0078125,
      "low_freq_factor": 0.001953125,
      "original_max_position_embeddings": 16,
      "rope_type": "llama3"
    },
    "rope_theta": 500000,
    "use_cache": true,
    "vocab_size": 2051
  },
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 2048,
  "mlp_bias": false,
  "model_type": "csm",
  "num_attention_heads": 32,
  "num_codebooks": 32,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "pad_token_id": 128002,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 0.5,
    "low_freq_factor": 0.125,
    "original_max_position_embeddings": 1024,
    "rope_type": "llama3"
  },
  "rope_theta": 500000,
  "text_vocab_size": 128256,
  "tie_codebooks_embeddings": true,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.52.0.dev0",
  "use_cache": true,
  "vocab_size": 2051,
  "audio_num_codebooks": 32,
  "audio_vocab_size": 2051,
  "backbone_flavor": "llama-1B",
  "decoder_flavor": "llama-100M",
  "transformers_weights": "transformers.safetensors.index.json"
}
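
The codec settings in this config pin down the token budget: Mimi runs at a `frame_rate` of 12.5 frames per second over 24 kHz audio, and CSM predicts 32 codebooks per frame. A quick sanity check of the implied rates, derived purely from the config values above:

```python
import math

# Values taken from config.json above.
sampling_rate = 24_000   # Hz (codec_config.sampling_rate)
frame_rate = 12.5        # codec frames per second (codec_config.frame_rate)
num_codebooks = 32       # RVQ codebooks predicted per frame (num_codebooks)
codebook_size = 2048     # entries per codebook (codec_config.codebook_size)

samples_per_frame = int(sampling_rate / frame_rate)
codes_per_second = int(frame_rate * num_codebooks)
bits_per_second = codes_per_second * math.log2(codebook_size)

print(samples_per_frame)  # 1920 audio samples per codec frame
print(codes_per_second)   # 400 discrete codes per second of audio
print(bits_per_second)    # 4400.0 bits/s if all 32 codebooks are kept
```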
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:2e7721144afe38b906d4f1048671da639fe142423f4a26283606ecebe894f4bf
size 6211186784
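
Entries like this are Git LFS pointer files: three `key value` lines standing in for the real blob, which lives in LFS storage keyed by its SHA-256. A minimal parser sketch, with field names following the pointer shown above:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a {key: value} dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The model.safetensors pointer from above.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:2e7721144afe38b906d4f1048671da639fe142423f4a26283606ecebe894f4bf
size 6211186784
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 6211186784 bytes, i.e. roughly a 5.8 GiB checkpoint
```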
prompts/conversational_a.wav ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:356648c1bc6c1da7883004557e9b21a2ef7d01682d8b9d02d6dcb950b348b04f
size 2646044
prompts/conversational_b.wav ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c247153011385d33aaeed193adfec562c32182e2facd30cc8cd0b3e820e94afb
size 2646044
prompts/read_speech_a.wav ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:59480708f84c77ab2967d14d821c2ccade9d7761685d060575121f49a149005b
size 831412
prompts/read_speech_b.wav ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:f582640265864499cbe6a8c687ea0f9e08e7fa41eeb2caa923d0a3bada55fcef
size 576052
prompts/read_speech_c.wav ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7da15ab3ee7f8bbc8abfce73ce65936a80a535ae4a86db2d9c4756caba69e9c3
size 385964
prompts/read_speech_d.wav ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:09cad0494f9d0038b0f0eb039f47d752c45e56d92679f96587e20f67b2c1b7d8
size 435884