---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- amphion/Emilia-Dataset
- facebook/voxpopuli
- uhhlt/Tuda-De
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- Thorsten-Voice/TV-44kHz-Full
- CSTR-Edinburgh/vctk
- commonvoice_23
- kerstin
language:
- de
- en
base_model:
- utter-project/EuroLLM-1.7B
pipeline_tag: text-to-speech
---

<img src="https://educaai.de/webapp/splash/img/dark-1x.png" style="float:left">

## educa AI voice (preview)

---

educa AI voice is our in-house text-to-speech model developed on top of [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B).

This version of the model is trained on a single speaker and is capable of generating natural-sounding German (and, to some extent, English) speech.

Be advised that this is a preview model meant to showcase the base model's capability. We are going to publish more advanced models in the near future (see the bottom of this model card).

#### Examples:

<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_1.mp3"></audio>
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_2.mp3"></audio>
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_3.mp3"></audio>

### Model details

- **Base LLM**: [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)
- **Audio Tokenizer**: [NeuCodec](https://huggingface.co/neuphonic/neucodec)
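
NeuCodec turns a waveform into a single stream of discrete codes and back, which is what lets the LLM treat speech as just another token sequence. As a quick way to get a feel for the tokenizer, here is a minimal round-trip sketch assuming the `encode_code`/`decode_code` interface from the NeuCodec model card; the input file path is a placeholder:

```python
import torch
import torchaudio
from neucodec import NeuCodec

# Load the codec; CPU is fine for a quick round-trip test.
codec = NeuCodec.from_pretrained("neuphonic/neucodec").eval()

# NeuCodec consumes 16 kHz audio; resample if necessary.
wav, sr = torchaudio.load("reference.wav")  # placeholder path
if sr != 16_000:
    wav = torchaudio.functional.resample(wav, sr, 16_000)

with torch.no_grad():
    codes = codec.encode_code(wav)    # discrete codes, one stream per clip
    recon = codec.decode_code(codes)  # reconstruction at 24 kHz

torchaudio.save("roundtrip.wav", recon[0, :, :], 24_000)
```

The difference between `reference.wav` and `roundtrip.wav` is the reconstruction error the TTS model has to live with, since it only ever sees the discrete codes.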

#### Pre-training

We pre-trained the model in two stages: first on billions of tokens of mixed audio and text data using a next-token-prediction objective, then on tens of thousands of hours of German and English speech mixed with a small amount of text instruction data to preserve the model's text understanding. A sketch of what such a training sequence can look like follows the dataset list below.

We used the following datasets, as well as some in-house datasets:
- [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
- [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset) (German and English YODAS subsets)
- [facebook/voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
- [uhhlt/Tuda-De](https://huggingface.co/datasets/uhhlt/Tuda-De)
- [openslr/librispeech_asr](https://huggingface.co/datasets/openslr/librispeech_asr)
- [facebook/multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
- [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full)
- [CSTR-Edinburgh/vctk](https://huggingface.co/datasets/CSTR-Edinburgh/vctk)
- [commonvoice_23](https://datacollective.mozillafoundation.org/datasets?q=common+voice)
- [kerstin](https://datacollective.mozillafoundation.org/datasets/cmi7mgbam000bnx074097g2yg)
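
We do not spell out the exact packing of text and audio tokens here, so the following is purely illustrative: a sketch of how a speech training sample could be laid out for next-token prediction, reusing the special tokens and the audio-token offset from the inference example below. `build_training_sample` is a hypothetical helper, not part of our pipeline.

```python
# Illustrative only: one way a TTS training sample could be packed for
# next-token prediction. The special tokens and the offset come from the
# inference example below; the real training pipeline may differ.
from transformers import AutoTokenizer

AUDIO_TOKENS_OFFSET = 128006  # codec code 0 <-> vocab id 128006
AUDIO_END_TOKEN_ID = 128001   # marks the end of the audio stream

def build_training_sample(text, codec_codes, tokenizer):
    """Pack a transcript and its codec codes into one token sequence."""
    prompt_ids = tokenizer.encode(f"<|task_tts|>{text} <|audio_start|>")
    audio_ids = [code + AUDIO_TOKENS_OFFSET for code in codec_codes]
    return prompt_ids + audio_ids + [AUDIO_END_TOKEN_ID]

tokenizer = AutoTokenizer.from_pretrained("DigitalLearningGmbH/educa-ai-voice-preview")
sample = build_training_sample("Hallo Welt!", [12, 345, 678], tokenizer)
print(sample[-4:])  # the three shifted audio ids, then the end-of-audio id
```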


### Inference example

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec

device = "cuda"
model_id = "DigitalLearningGmbH/educa-ai-voice-preview"
audio_end_token_id = 128001    # marks the end of the generated audio stream
audio_tokens_offset = 128006   # vocab id of codec code 0

# Load the TTS model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the codec that turns generated audio tokens back into a waveform.
codec_model = NeuCodec.from_pretrained("neuphonic/neucodec")
codec_model = codec_model.eval().to(device)

prompt_template = "<|task_tts|>{prompt} <|audio_start|>"
prompt = "Brautkleid bleibt Brautkleid und Blaukraut bleibt Blaukraut."

input_ids = tokenizer.encode(prompt_template.format(prompt=prompt), return_tensors="pt").to(device)

outputs = model.generate(input_ids=input_ids, do_sample=True, temperature=0.6, top_p=0.999, repetition_penalty=1.1, max_new_tokens=2048)

# Keep only the generated audio tokens (everything between the prompt and the
# first end-of-audio token) and shift them back into codec-code space.
audio_end = (outputs[0] == audio_end_token_id).nonzero(as_tuple=True)[0][0].item()
outputs_audio = outputs[0][input_ids.shape[1]:audio_end] - audio_tokens_offset

# Decode the codes to a 24 kHz waveform; decode_code expects shape (B, 1, T).
with torch.no_grad():
    recon = codec_model.decode_code(outputs_audio.unsqueeze(0).unsqueeze(0).to(device)).cpu()

torchaudio.save("tts.wav", recon[0, :, :], 24_000)
```
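
One caveat with the snippet above: if generation exhausts `max_new_tokens` before emitting the end-of-audio token, the `nonzero(...)` lookup finds no match and indexing it raises an `IndexError`. A small guard (our addition here, not required by the model) falls back to the full generated span:

```python
# Use everything after the prompt when no end-of-audio token was generated.
generated = outputs[0][input_ids.shape[1]:]
end_positions = (generated == audio_end_token_id).nonzero(as_tuple=True)[0]
end = end_positions[0].item() if len(end_positions) > 0 else generated.shape[0]
outputs_audio = generated[:end] - audio_tokens_offset
```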

### What's to come

As its name says, this is a preview model, mainly meant to showcase the capability of the base model.
We trained it on a small single-speaker dataset without any special emotion tagging or similar annotations.

We are actively working on
- multiple speakers with emotional control and nonverbal elements (fillers, laughing, ...)
- fine-tuning for general zero-shot voice cloning
- post-training with reinforcement learning

We also have a fine-tuned version of NeuCodec, which we used to generate the speech examples above and which we plan to release as well.

Stay tuned - January 2026 is going to be exciting!