---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- amphion/Emilia-Dataset
- facebook/voxpopuli
- uhhlt/Tuda-De
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- Thorsten-Voice/TV-44kHz-Full
- CSTR-Edinburgh/vctk
- commonvoice_23
- kerstin
language:
- de
- en
base_model:
- utter-project/EuroLLM-1.7B
pipeline_tag: text-to-speech
---

<img src="https://educaai.de/webapp/splash/img/dark-1x.png" style="float:left">

## educa AI voice (preview)

---

educa AI voice is our in-house text-to-speech model developed on top of [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B).

This version of the model is trained on a single speaker and is capable of generating natural-sounding German (and, to some extent, English) speech.

Be advised that this is a preview model meant to showcase the base model's capabilities. We will publish more advanced models in the near future (see the bottom of this model card).

#### Examples:

<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_1.mp3"></audio>
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_2.mp3"></audio>
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_3.mp3"></audio>

### Model details

- **Base LLM**: [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B)
- **Audio Tokenizer**: [NeuCodec](https://huggingface.co/neuphonic/neucodec)
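
The codec's discrete audio codes live in the same vocabulary as the LLM's text tokens, so speech generation is ordinary next-token prediction. A minimal sketch of the mapping, assuming (per the constants in the inference example below) that audio token IDs start at offset 128006 in the extended vocabulary; the helper names are hypothetical:

```python
# Illustrative sketch only: how codec codes map into the extended LLM
# vocabulary. The offset 128006 comes from the inference example below.
import torch

AUDIO_TOKENS_OFFSET = 128006  # first audio token ID in the LLM vocabulary

def codec_to_llm(codes: torch.Tensor) -> torch.Tensor:
    """Shift NeuCodec code indices into LLM token-ID space."""
    return codes + AUDIO_TOKENS_OFFSET

def llm_to_codec(token_ids: torch.Tensor) -> torch.Tensor:
    """Shift generated LLM token IDs back to codec code indices."""
    return token_ids - AUDIO_TOKENS_OFFSET
```
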
#### Pre-training

We pre-trained the model in two stages: first on billions of tokens of mixed audio and text data using a next-token-prediction objective, then on tens of thousands of hours of German and English speech mixed with a small amount of text instruction data to preserve the model's text-understanding capability (see the sketch after the dataset list below).

We used the following datasets, as well as some in-house datasets:
- [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
- [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset) (German and English YODAS subsets)
- [facebook/voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli)
- [uhhlt/Tuda-De](https://huggingface.co/datasets/uhhlt/Tuda-De)
- [openslr/librispeech_asr](https://huggingface.co/datasets/openslr/librispeech_asr)
- [facebook/multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
- [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full)
- [CSTR-Edinburgh/vctk](https://huggingface.co/datasets/CSTR-Edinburgh/vctk)
- [commonvoice_23](https://datacollective.mozillafoundation.org/datasets?q=common+voice)
- [kerstin](https://datacollective.mozillafoundation.org/datasets/cmi7mgbam000bnx074097g2yg)
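
Because text and audio tokens share one vocabulary, both pre-training stages optimize the same causal language-modeling loss. A minimal sketch of that objective follows, with the caveat that batch construction, packing, and every detail below are illustrative assumptions rather than our actual training code:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Causal LM loss over a batch of token sequences.

    `token_ids` may hold text tokens, audio codec tokens, or a mix of
    both; since they live in one vocabulary, the objective is identical.
    """
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(input_ids=inputs).logits  # (batch, seq - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```
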
### Inference example

```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec

device = "cuda"
model_id = "DigitalLearningGmbH/educa-ai-voice-preview"
audio_end_token_id = 128001
audio_tokens_offset = 128006

# Load the TTS language model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the audio codec used to turn generated tokens back into a waveform.
codec_model = NeuCodec.from_pretrained("neuphonic/neucodec")
codec_model = codec_model.eval().to(device)

prompt_template = "<|task_tts|>{prompt} <|audio_start|>"
prompt = "Brautkleid bleibt Brautkleid und Blaukraut bleibt Blaukraut."

input_ids = tokenizer.encode(prompt_template.format(prompt=prompt), return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=input_ids,
    do_sample=True,
    temperature=0.6,
    top_p=0.999,
    repetition_penalty=1.1,
    max_new_tokens=2048,
)

# Keep only the generated tokens up to the first audio-end token, then
# shift the LLM token IDs back into the codec's code space.
audio_end = (outputs[0] == audio_end_token_id).nonzero(as_tuple=True)[0][0].item()
outputs_audio = outputs[0][input_ids.shape[1]:audio_end] - audio_tokens_offset

# Decode the audio codes to a waveform and save at 24 kHz.
with torch.no_grad():
    recon = codec_model.decode_code(outputs_audio.unsqueeze(0).unsqueeze(0).to(device)).cpu()

torchaudio.save("tts.wav", recon[0, :, :], 24_000)
```
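
For synthesizing multiple sentences, the snippet above can be wrapped in a small helper. This sketch only rearranges the calls already shown; the function name and example text are our own:

```python
def synthesize(text: str, out_path: str = "tts.wav") -> None:
    """Generate speech for `text` using the models loaded above."""
    ids = tokenizer.encode(prompt_template.format(prompt=text), return_tensors="pt").to(device)
    out = model.generate(input_ids=ids, do_sample=True, temperature=0.6,
                         top_p=0.999, repetition_penalty=1.1, max_new_tokens=2048)
    end = (out[0] == audio_end_token_id).nonzero(as_tuple=True)[0][0].item()
    codes = out[0][ids.shape[1]:end] - audio_tokens_offset
    with torch.no_grad():
        wav = codec_model.decode_code(codes.unsqueeze(0).unsqueeze(0).to(device)).cpu()
    torchaudio.save(out_path, wav[0, :, :], 24_000)

synthesize("Fischers Fritz fischt frische Fische.", "example_de.wav")
```
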
### What's to come

As stated in the model's name, this is a preview model, mainly meant to showcase the capability of the base model.
We trained on a small single-speaker dataset without any special emotion tagging.

We are actively working on
- multiple speakers with emotional control and nonverbal elements (fillers, laughing, ...)
- fine-tuning for general zero-shot voice cloning
- post-training with reinforcement learning

We also have a fine-tuned version of NeuCodec, which we used to generate the speech examples above and which we plan to release as well.

Stay tuned - January 2026 is going to be exciting!