---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- amphion/Emilia-Dataset
- facebook/voxpopuli
- uhhlt/Tuda-De
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- Thorsten-Voice/TV-44kHz-Full
- CSTR-Edinburgh/vctk
- commonvoice_23
- kerstin
language:
- de
- en
base_model:
- utter-project/EuroLLM-1.7B
pipeline_tag: text-to-speech
---
|
|
|
|
|
|
|
|
<img src="https://educaai.de/webapp/splash/img/dark-1x.png" style="float:left"> |
|
|
|
|
|
## educa AI voice (preview) |
|
|
|
|
|
--- |
|
|
|
|
|
educa AI voice is our in-house text-to-speech model, developed on top of [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B).
|
|
|
|
|
This version of the model is trained on a single speaker and generates natural-sounding German (and, to some extent, English) speech.
|
|
|
|
|
Be advised that this is a preview model meant to showcase the base model's capabilities. We will publish more advanced models in the near future (see the bottom of this model card).
|
|
|
|
|
#### Examples
|
|
|
|
|
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_1.mp3"></audio> |
|
|
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_2.mp3"></audio> |
|
|
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_3.mp3"></audio> |
|
|
|
|
|
|
|
|
### Model details |
|
|
|
|
|
- **Base LLM**: [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) |
|
|
- **Audio Tokenizer**: [NeuCodec](https://huggingface.co/neuphonic/neucodec) |
|
|
|
|
|
#### Pre-training |
|
|
|
|
|
We pre-trained the model in two stages: first on billions of tokens of mixed speech and text data with a next-token-prediction objective, then on tens of thousands of hours of German and English TTS data, mixed with a small amount of text instruction data to preserve the model's text understanding.
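To make the second stage concrete, here is a minimal sketch of how a single TTS training example can be laid out for next-token prediction. It is an illustration only: the prompt template and token constants are taken from the inference example below, `build_tts_example` is a hypothetical helper, and our actual data pipeline is not part of this release.

```python
import torch
from transformers import AutoTokenizer

AUDIO_END_TOKEN_ID = 128001    # same constants as in the inference example below
AUDIO_TOKENS_OFFSET = 128006   # audio codes are shifted into the LLM vocabulary

tokenizer = AutoTokenizer.from_pretrained("DigitalLearningGmbH/educa-ai-voice-preview")

def build_tts_example(text: str, audio_codes: torch.Tensor) -> torch.Tensor:
    """Build one next-token-prediction sequence from a transcript and the
    1-D NeuCodec code sequence of the matching utterance."""
    text_ids = tokenizer.encode(f"<|task_tts|>{text} <|audio_start|>", return_tensors="pt")[0]
    audio_ids = audio_codes.long() + AUDIO_TOKENS_OFFSET  # shift codes past the text vocabulary
    end = torch.tensor([AUDIO_END_TOKEN_ID])
    return torch.cat([text_ids, audio_ids, end])
```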
|
|
|
|
|
We used the following datasets, as well as some in-house datasets: |
|
|
- [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) |
|
|
- [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) |
|
|
- [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset) (German and English YODAS subsets) |
|
|
- [facebook/voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) |
|
|
- [uhhlt/Tuda-De](https://huggingface.co/datasets/uhhlt/Tuda-De) |
|
|
- [openslr/librispeech_asr](https://huggingface.co/datasets/openslr/librispeech_asr) |
|
|
- [facebook/multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) |
|
|
- [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) |
|
|
- [CSTR-Edinburgh/vctk](https://huggingface.co/datasets/CSTR-Edinburgh/vctk) |
|
|
- [commonvoice_23](https://datacollective.mozillafoundation.org/datasets?q=common+voice) |
|
|
- [kerstin](https://datacollective.mozillafoundation.org/datasets/cmi7mgbam000bnx074097g2yg) |
|
|
|
|
|
|
|
|
|
|
|
### Inference example |
|
|
|
|
|
```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec

device = "cuda"
model_id = "DigitalLearningGmbH/educa-ai-voice-preview"
audio_end_token_id = 128001    # marks the end of the generated audio codes
audio_tokens_offset = 128006   # audio codes are shifted by this offset in the LLM vocabulary

# Load the TTS model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the NeuCodec audio codec used to turn generated codes back into a waveform
codec_model = NeuCodec.from_pretrained("neuphonic/neucodec")
codec_model = codec_model.eval().to(device)

prompt_template = "<|task_tts|>{prompt} <|audio_start|>"
prompt = "Brautkleid bleibt Brautkleid und Blaukraut bleibt Blaukraut."

input_ids = tokenizer.encode(prompt_template.format(prompt=prompt), return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=input_ids,
    do_sample=True,
    temperature=0.6,
    top_p=0.999,
    repetition_penalty=1.1,
    max_new_tokens=2048,
)

# Keep only the generated tokens up to the first audio end token,
# then shift them back into the NeuCodec code range
audio_end_pos = (outputs[0] == audio_end_token_id).nonzero(as_tuple=True)[0][0].item()
outputs_audio = outputs[0][input_ids.shape[1]:audio_end_pos] - audio_tokens_offset

# Decode the codes into a waveform; decode_code expects a [batch, codebook, time] tensor
with torch.no_grad():
    recon = codec_model.decode_code(outputs_audio.unsqueeze(0).unsqueeze(0).to(device)).cpu()

# NeuCodec decodes to 24 kHz audio
torchaudio.save("tts.wav", recon[0, :, :], 24_000)
```
|
|
|
|
|
For even higher fidelity in German speech, use our [finetuned NeuCodec decoder](https://huggingface.co/DigitalLearningGmbH/neucodec-decoder-ft-de). |
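Assuming the finetuned decoder is a drop-in replacement for the stock checkpoint and loads the same way (an assumption; check its model card for the exact code), swapping it in only changes the `from_pretrained` call in the example above:

```python
from neucodec import NeuCodec

# Assumed drop-in replacement for the stock "neuphonic/neucodec" checkpoint
codec_model = NeuCodec.from_pretrained("DigitalLearningGmbH/neucodec-decoder-ft-de")
codec_model = codec_model.eval().to(device)
# decode_code(...) is then used exactly as in the inference example above
```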
|
|
|
|
|
### What's to come |
|
|
|
|
|
As its name states, this is a preview model, mainly meant to showcase the capability of the base model. We trained on a small single-speaker dataset without any special emotion tagging or similar annotations.
|
|
|
|
|
We are actively working on:
|
|
- multiple speakers with emotion control and nonverbal elements (fillers, laughter, ...)
|
|
- fine-tuning for general zero-shot voice cloning |
|
|
- phoneme-based / hybrid generation |
|
|
- post-training with reinforcement learning |
|
|
|
|
|
Stay tuned - January 2026 is going to be exciting!
|
|
|