	---
	license: mit
	language:
	- lb
	tags:
	- text-to-speech
	- tts
	- vits2
	- luxembourgish
	pipeline_tag: text-to-speech
	---

	# VITS2 - Claude (Luxembourgish Gender-Neutral Voice)

	A VITS2-based text-to-speech model for Luxembourgish, featuring a synthetic gender-neutral voice.

	## Model Description

	This model was trained using the VITS2 architecture on Luxembourgish speech data from the [Lëtzebuerger Online Dictionnaire (LOD)](https://lod.lu) example sentences.

	"Claude" is a synthetic gender-neutral Luxembourgish voice created by modulating the original LOD recordings.

	### Model Details

	- Architecture: VITS2 with duration discriminator and transformer flows
	- Language: Luxembourgish (lb)
	- Speaker: Single speaker (gender-neutral, synthetic)
	- Sample Rate: 24000 Hz
	- Checkpoint: G_57000 (57,000 steps)
	- License: MIT

	## Usage

	Note: Text should be lowercased before synthesis. Additional text normalization may be required.

	This model requires the included Python source files for inference.
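The lowercasing step in the note above could be sketched as follows. This is a minimal illustration, not the model's actual preprocessing: the `normalize_text` helper is hypothetical, and the Unicode NFC step and whitespace collapsing are assumptions about what "additional text normalization" might include.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical helper: prepare raw text before passing it to the model."""
    # Assumption: compose characters such as "e" + combining acute into "é"
    # so that accented letters have a single, consistent representation.
    text = unicodedata.normalize("NFC", text)
    # Lowercase, as the note above requires.
    text = text.lower()
    # Assumption: collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("  Moien,   wéi geet et Dir? "))
```

Numbers, abbreviations, and symbols are not handled here; whether they need to be spelled out depends on the symbol set the model was trained with.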

	### Basic Usage

	```python
	import torch
	import scipy.io.wavfile as wavfile
	from vits2_engine import VITS2Engine

	# Load the model
	engine = VITS2Engine(model_dir="path/to/vits2-claude")

	# Generate speech
	wav = engine.tts("moien, wéi geet et dir?")

	# Save to file
	wavfile.write("output.wav", engine.sample_rate, wav)
	```

	### Command Line

	```bash
	python inference.py "moien, wéi geet et dir?"

	# With custom parameters
	python inference.py "Text" --noise_scale 0.5 --length_scale 1.1 -o output.wav
	```

	### Parameters

	- `noise_scale`: Controls voice variation (default: 0.667, lower = more consistent)
	- `noise_scale_w`: Controls duration variation (default: 0.8)
	- `length_scale`: Controls speech speed (default: 1.0, higher = slower)
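To make the effect of `noise_scale` and `length_scale` concrete, here is a NumPy sketch of how the reference VITS inference code applies them. It assumes this model's inference follows the same convention, which this README does not confirm; the function names and input values are made up for illustration.

```python
import numpy as np


def scale_durations(log_durations, length_scale=1.0):
    """Turn predicted log-durations into whole frame counts per input token.

    A larger length_scale stretches every token, so speech gets slower.
    """
    return np.ceil(np.exp(log_durations) * length_scale).astype(int)


def sample_prior(mean, log_std, noise_scale=0.667, rng=None):
    """Sample a latent around the predicted mean.

    A smaller noise_scale keeps samples closer to the mean (more consistent).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(mean.shape)
    return mean + noise * np.exp(log_std) * noise_scale


# Illustrative values only, not taken from the model.
log_d = np.array([0.5, 1.0, 1.5])
frames_default = scale_durations(log_d, length_scale=1.0)
frames_slow = scale_durations(log_d, length_scale=2.0)
print(frames_default.sum(), frames_slow.sum())

mean = np.zeros(4)
log_std = np.zeros(4)
print(np.abs(sample_prior(mean, log_std, noise_scale=0.667)).max())
print(np.abs(sample_prior(mean, log_std, noise_scale=0.1)).max())
```

With the same random seed, the lower `noise_scale` yields samples that deviate less from the mean, and the larger `length_scale` yields more output frames.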

	## Technical Specifications

	| Parameter | Value |
	|-----------|-------|
	| Hidden Channels | 192 |
	| Filter Channels | 768 |
	| Attention Heads | 2 |
	| Encoder Layers | 6 |
	| Mel Channels | 80 |
	| FFT Size | 1024 |
	| Hop Length | 256 |
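As a quick sanity check, the spectrogram settings in the table can be combined with the 24000 Hz sample rate from Model Details to get frame timing. This is standard STFT arithmetic rather than code from the model. It assumes the analysis window length equals the FFT size, which the table does not state.

```python
SAMPLE_RATE = 24000  # Hz, from Model Details
HOP_LENGTH = 256     # samples between consecutive frames
FFT_SIZE = 1024      # samples per FFT (window length assumed equal)

# Time covered by one hop and by one analysis window, in milliseconds.
frame_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
window_ms = 1000 * FFT_SIZE / SAMPLE_RATE

# Number of spectrogram frames produced per second of audio.
frames_per_second = SAMPLE_RATE / HOP_LENGTH

# Linear-frequency bins of a real-valued FFT, before the 80-channel mel projection.
n_freq_bins = FFT_SIZE // 2 + 1


def frames_to_seconds(n_frames: int) -> float:
    """Approximate audio duration for a given number of frames."""
    return n_frames * HOP_LENGTH / SAMPLE_RATE


print(f"{frame_ms:.2f} ms per frame, {frames_per_second} frames/s")
print(f"{window_ms:.2f} ms window, {n_freq_bins} frequency bins")
```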

	## Requirements

	- Python 3.8+
	- PyTorch
	- scipy
	- numpy
	- Cython (for monotonic_align)

	## Citation

	If you use this model, please cite:

	```bibtex
	@misc{zls2025vits2claude,
	  title={VITS2 Claude - Luxembourgish Gender-Neutral Voice},
	  author={Zenter fir d'Lëtzebuerger Sprooch},
	  year={2025},
	  publisher={Hugging Face},
	  url={https://huggingface.co/ZLSCompLing/VITS2-Claude}
	}
	```

	## Acknowledgments

	Developed by [Zenter fir d'Lëtzebuerger Sprooch](https://zls.lu).

	Voice data was sourced from the [Lëtzebuerger Online Dictionnaire (LOD)](https://lod.lu). The example sentence IDs are listed in an XML file in the [LOD linguistic data on data.public.lu](https://data.public.lu/en/datasets/letzebuerger-online-dictionnaire-lod-linguistesch-daten/). The original audio file for each ID can be accessed at:

	```
	https://lod.lu/uploads/examples/AAC/{folder}/{id}.m4a
	```

	where `{folder}` is the first 2 characters of `{id}`.
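The URL pattern above can be expressed as a small helper. The function name and the example ID below are hypothetical. Real IDs come from the XML file, whose structure is not described here.

```python
LOD_AUDIO_URL = "https://lod.lu/uploads/examples/AAC/{folder}/{id}.m4a"


def lod_audio_url(example_id: str) -> str:
    """Build the audio URL for an example sentence ID (hypothetical helper)."""
    if len(example_id) < 2:
        raise ValueError("example ID must have at least 2 characters")
    # The folder is the first 2 characters of the ID.
    return LOD_AUDIO_URL.format(folder=example_id[:2], id=example_id)


# "ab123" is a made-up ID used only to show the pattern.
print(lod_audio_url("ab123"))
```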

	This model is used in [Sproochmaschinn](https://sproochmaschinn.lu), a Luxembourgish speech processing platform.