---
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb
- HuggingFaceFW/fineweb-2
- amphion/Emilia-Dataset
- facebook/voxpopuli
- uhhlt/Tuda-De
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- Thorsten-Voice/TV-44kHz-Full
- CSTR-Edinburgh/vctk
- commonvoice_23
- kerstin
language:
- de
- en
base_model:
- utter-project/EuroLLM-1.7B
pipeline_tag: text-to-speech
---
|
|
|
|
|
|
|
|
<img src="https://educaai.de/webapp/splash/img/dark-1x.png" style="float:left"> |
|
|
|
|
|
## educa AI voice (preview) |
|
|
|
|
|
--- |
|
|
|
|
|
educa AI voice is our in-house text-to-speech model, developed on top of [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B).
|
|
|
|
|
This version of the model is trained on a single speaker and generates natural-sounding German (and, to some extent, English) speech.
|
|
|
|
|
Be advised that this is a preview model meant to showcase the base model's capabilities. We will publish more advanced models in the near future (see the bottom of this model card).
|
|
|
|
|
#### Examples
|
|
|
|
|
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_1.mp3"></audio> |
|
|
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_2.mp3"></audio> |
|
|
<audio controls src="https://huggingface.co/DigitalLearningGmbH/educa-ai-voice-preview/resolve/main/example_3.mp3"></audio> |
|
|
|
|
|
|
|
|
### Model details |
|
|
|
|
|
- **Base LLM**: [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) |
|
|
- **Audio Tokenizer**: [NeuCodec](https://huggingface.co/neuphonic/neucodec) |
|
|
|
|
|
#### Pre-training |
|
|
|
|
|
We pre-trained the model in two stages: first on billions of tokens of mixed speech and text data with a next-token-prediction objective, then on tens of thousands of hours of German and English TTS data, mixed with a small amount of text instruction data to preserve the model's text understanding.
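To make the second stage concrete, here is a minimal sketch of how a single TTS training example can be laid out for next-token prediction. It is an illustration only: the prompt template and token constants are taken from the inference example below, `build_tts_example` is a hypothetical helper, and our actual data pipeline is not part of this release.

```python
import torch
from transformers import AutoTokenizer

AUDIO_END_TOKEN_ID = 128001    # same constants as in the inference example below
AUDIO_TOKENS_OFFSET = 128006   # audio codes are shifted into the LLM vocabulary

tokenizer = AutoTokenizer.from_pretrained("DigitalLearningGmbH/educa-ai-voice-preview")

def build_tts_example(text: str, audio_codes: torch.Tensor) -> torch.Tensor:
    """Build one next-token-prediction sequence from a transcript and the
    1-D NeuCodec code sequence of the matching utterance."""
    text_ids = tokenizer.encode(f"<|task_tts|>{text} <|audio_start|>", return_tensors="pt")[0]
    audio_ids = audio_codes.long() + AUDIO_TOKENS_OFFSET  # shift codes past the text vocabulary
    end = torch.tensor([AUDIO_END_TOKEN_ID])
    return torch.cat([text_ids, audio_ids, end])
```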
|
|
|
|
|
We used the following datasets, as well as some in-house datasets: |
|
|
- [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) |
|
|
- [HuggingFaceFW/fineweb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) |
|
|
- [amphion/Emilia-Dataset](https://huggingface.co/datasets/amphion/Emilia-Dataset) (German and English YODAS subsets) |
|
|
- [facebook/voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) |
|
|
- [uhhlt/Tuda-De](https://huggingface.co/datasets/uhhlt/Tuda-De) |
|
|
- [openslr/librispeech_asr](https://huggingface.co/datasets/openslr/librispeech_asr) |
|
|
- [facebook/multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) |
|
|
- [Thorsten-Voice/TV-44kHz-Full](https://huggingface.co/datasets/Thorsten-Voice/TV-44kHz-Full) |
|
|
- [CSTR-Edinburgh/vctk](https://huggingface.co/datasets/CSTR-Edinburgh/vctk) |
|
|
- [commonvoice_23](https://datacollective.mozillafoundation.org/datasets?q=common+voice) |
|
|
- [kerstin](https://datacollective.mozillafoundation.org/datasets/cmi7mgbam000bnx074097g2yg) |
|
|
|
|
|
|
|
|
|
|
|
### Inference example |
|
|
|
|
|
```python
import torch
import torchaudio
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec

device = "cuda"
model_id = "DigitalLearningGmbH/educa-ai-voice-preview"
audio_end_token_id = 128001    # marks the end of the generated audio codes
audio_tokens_offset = 128006   # audio codes are shifted by this offset in the LLM vocabulary

# Load the TTS model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
model = model.to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the NeuCodec audio codec used to turn generated codes back into a waveform
codec_model = NeuCodec.from_pretrained("neuphonic/neucodec")
codec_model = codec_model.eval().to(device)

prompt_template = "<|task_tts|>{prompt} <|audio_start|>"
prompt = "Brautkleid bleibt Brautkleid und Blaukraut bleibt Blaukraut."

input_ids = tokenizer.encode(prompt_template.format(prompt=prompt), return_tensors="pt").to(device)

outputs = model.generate(
    input_ids=input_ids,
    do_sample=True,
    temperature=0.6,
    top_p=0.999,
    repetition_penalty=1.1,
    max_new_tokens=2048,
)

# Keep only the generated tokens up to the first audio end token,
# then shift them back into the NeuCodec code range
audio_end_pos = (outputs[0] == audio_end_token_id).nonzero(as_tuple=True)[0][0].item()
outputs_audio = outputs[0][input_ids.shape[1]:audio_end_pos] - audio_tokens_offset

# Decode the codes into a waveform; decode_code expects a [batch, codebook, time] tensor
with torch.no_grad():
    recon = codec_model.decode_code(outputs_audio.unsqueeze(0).unsqueeze(0).to(device)).cpu()

# NeuCodec decodes to 24 kHz audio
torchaudio.save("tts.wav", recon[0, :, :], 24_000)
```
|
|
|
|
|
For even higher fidelity in German speech, use our [finetuned NeuCodec decoder](https://huggingface.co/DigitalLearningGmbH/neucodec-decoder-ft-de). |
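Assuming the finetuned decoder is a drop-in replacement for the stock checkpoint and loads the same way (an assumption; check its model card for the exact code), swapping it in only changes the `from_pretrained` call in the example above:

```python
from neucodec import NeuCodec

# Assumed drop-in replacement for the stock "neuphonic/neucodec" checkpoint
codec_model = NeuCodec.from_pretrained("DigitalLearningGmbH/neucodec-decoder-ft-de")
codec_model = codec_model.eval().to(device)
# decode_code(...) is then used exactly as in the inference example above
```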
|
|
|
|
|
### What's to come |
|
|
|
|
|
As its name states, this is a preview model, mainly meant to showcase the capability of the base model. We trained on a small single-speaker dataset without any special emotion tagging or similar annotations.
|
|
|
|
|
We are actively working on:
|
|
- multiple speakers with emotion control and nonverbal elements (fillers, laughter, ...)
|
|
- fine-tuning for general zero-shot voice cloning |
|
|
- phoneme-based / hybrid generation |
|
|
- post-training with reinforcement learning |
|
|
|
|
|
Stay tuned - January 2026 is going to be exciting!
|
|
|