| --- |
| language: |
- bm
- bmq
| pipeline_tag: text-to-speech |
| library_name: transformers |
| datasets: |
| - YazoPi/maya-multi-l-tts-data |
| base_model: |
| - maya-research/maya1 |
| license: cc-by-nc-sa-4.0 |
| tags: |
| - text-to-speech |
| - tts |
| - llm-based-tts |
| - bambara |
| - bomu |
| - llama |
| - african-languages |
| - open-source |
| - mali |
| - text-generation-inference |
| - transformers |
| - unsloth |
| --- |
| |
| # Wuro — Text-to-Speech for Bambara and Bomu |
|
|
| **Wuro** is a **text-to-speech (TTS)** model designed to generate speech in **Bambara** and **Bomu** from text. |
|
|
| ## Description |
|
|
This model was developed for speech synthesis in Malian languages, especially **Bambara** and **Bomu**.
It follows an autoregressive approach: the model reads tokenized text and generates compressed audio codes, which are then decoded back into a waveform.
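
As a rough illustration, the text/audio token stream can be sketched as below. The decimal token IDs are copied from the inference snippet later in this card; treat the exact layout as an assumption specific to this model family, not a general standard:

```python
# Sketch of the token stream: a text "header" wrapped in special tokens,
# followed by compressed SNAC audio codes. IDs are taken from the
# inference snippet in this card.
SOH_ID, BOS_ID = 128259, 128000        # start of header, begin-of-sequence
TEXT_EOT_ID, EOH_ID = 128009, 128260   # end of text, end of header
SOA_ID = 128261                        # start of audio
CODE_START_TOKEN_ID = 128257           # start of SNAC codes
CODE_END_TOKEN_ID = 128258             # end of SNAC codes

def token_layout(text_ids, audio_code_ids):
    """Order in which text and audio tokens are concatenated."""
    return (
        [SOH_ID, BOS_ID] + list(text_ids)
        + [TEXT_EOT_ID, EOH_ID, SOA_ID, CODE_START_TOKEN_ID]
        + list(audio_code_ids) + [CODE_END_TOKEN_ID]
    )

seq = token_layout([101, 102], [128266, 128267])
print(seq[0], seq[-1])  # 128259 128258
```

At inference time only the header (up to `CODE_START_TOKEN_ID`) is provided as the prompt; the model generates the audio codes until it emits `CODE_END_TOKEN_ID`.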
|
|
|
|
| ## Supported Languages |
|
|
| - **Bambara** (`bm`) |
| - **Bomu** (`bmq`) |
|
|
| ## Training Data |
|
|
| The model was trained on a TTS dataset built from multiple sources in Bambara and Bomu. |
|
|
| **Dataset used:** |
|
|
| - [YazoPi/bambara-bomu-tts-dataset](https://huggingface.co/datasets/YazoPi/bambara-bomu-tts-dataset) |
|
|
| **Base model:** |
|
|
| - [maya-research/maya1](https://huggingface.co/maya-research/maya1) |
|
|
| --- |
|
|
| ## Features |
|
|
| - Text-to-speech generation |
| - Multilingual support (Bambara and Bomu) |
| - Audio generation based on compressed SNAC tokens |
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Installation |
|
|
| ```bash |
| pip install unsloth snac bambara-text-normalizer |
| ``` |
|
|
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="YazoPi/Wuro",
    max_seq_length=4096,
    dtype=None,
    load_in_4bit=False,
)

FastLanguageModel.for_inference(model)
```
#### Bambara Normalization
|
|
The *bambara_normalizer* module applies only to Bambara: it normalizes dates, times, measurements, and numbers in the text before generation.
|
|
| For more information on how to use it, see: *[bambara-text-normalizer](https://github.com/sudoping01/bambara-text-normalization)* |
|
|
```python
import torch
from snac import SNAC
import soundfile as sf
from bambara_normalizer import (
    normalize_dates_in_text,
    normalize_measurements_in_text,
    normalize_numbers_in_text,
    normalize_times_in_text,
)

CODE_START_TOKEN_ID = 128257
CODE_END_TOKEN_ID = 128258
CODE_TOKEN_OFFSET = 128266
SNAC_MIN_ID = 128266
SNAC_MAX_ID = 156937
SNAC_TOKENS_PER_FRAME = 7

SOH_ID = 128259
EOH_ID = 128260
SOA_ID = 128261
BOS_ID = 128000
TEXT_EOT_ID = 128009


def build_prompt(tokenizer, description: str, text: str, normalize=False) -> str:
    """Build the formatted prompt for the model."""
    soh_token = tokenizer.decode([SOH_ID])
    eoh_token = tokenizer.decode([EOH_ID])
    soa_token = tokenizer.decode([SOA_ID])
    sos_token = tokenizer.decode([CODE_START_TOKEN_ID])
    eot_token = tokenizer.decode([TEXT_EOT_ID])
    bos_token = tokenizer.bos_token

    if normalize:
        text = normalize_dates_in_text(text)
        text = normalize_times_in_text(text)
        text = normalize_measurements_in_text(text)
        text = normalize_numbers_in_text(text)

    formatted_text = f'<description="{description}"> {text}'

    prompt = (
        soh_token + bos_token + formatted_text + eot_token +
        eoh_token + soa_token + sos_token
    )

    return prompt


def extract_snac_codes(token_ids: list) -> list:
    """Extract SNAC codes from the generated tokens."""
    try:
        eos_idx = token_ids.index(CODE_END_TOKEN_ID)
    except ValueError:
        eos_idx = len(token_ids)

    return [
        token_id for token_id in token_ids[:eos_idx]
        if SNAC_MIN_ID <= token_id <= SNAC_MAX_ID
    ]


def unpack_snac_from_7(snac_tokens: list) -> list:
    """Unpack 7-token SNAC frames into the 3 hierarchical levels."""
    if snac_tokens and snac_tokens[-1] == CODE_END_TOKEN_ID:
        snac_tokens = snac_tokens[:-1]

    frames = len(snac_tokens) // SNAC_TOKENS_PER_FRAME
    snac_tokens = snac_tokens[:frames * SNAC_TOKENS_PER_FRAME]

    if frames == 0:
        return [[], [], []]

    l1, l2, l3 = [], [], []

    for i in range(frames):
        slots = snac_tokens[i * 7:(i + 1) * 7]
        # Slot 0 -> level 1; slots 1 and 4 -> level 2; slots 2, 3, 5, 6 -> level 3
        l1.append((slots[0] - CODE_TOKEN_OFFSET) % 4096)
        l2.extend([
            (slots[1] - CODE_TOKEN_OFFSET) % 4096,
            (slots[4] - CODE_TOKEN_OFFSET) % 4096,
        ])
        l3.extend([
            (slots[2] - CODE_TOKEN_OFFSET) % 4096,
            (slots[3] - CODE_TOKEN_OFFSET) % 4096,
            (slots[5] - CODE_TOKEN_OFFSET) % 4096,
            (slots[6] - CODE_TOKEN_OFFSET) % 4096,
        ])

    return [l1, l2, l3]


device = "cuda" if torch.cuda.is_available() else "cpu"
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().to(device)


def main(
    description="bomu",
    text="bwe wa wuro.",
    temp=1.0,
    top_p=0.95,
    max_tokens=4096,
    min_tokens=28,
    rp=1.2,
    do_sample=True,
    normalize=False,
):
    prompt = build_prompt(tokenizer, description, text, normalize)

    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.inference_mode():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            min_new_tokens=min_tokens,
            temperature=temp,
            top_p=top_p,
            repetition_penalty=rp,
            do_sample=do_sample,
            eos_token_id=CODE_END_TOKEN_ID,
            pad_token_id=tokenizer.pad_token_id,
        )

    generated_ids = outputs[0, inputs["input_ids"].shape[1]:].tolist()

    snac_tokens = extract_snac_codes(generated_ids)

    # Diagnostics: how many generated tokens fall in the SNAC code range
    snac_count = sum(1 for t in generated_ids if SNAC_MIN_ID <= t <= SNAC_MAX_ID)
    print(f"Generated {snac_count} SNAC tokens ({len(generated_ids) - snac_count} other tokens)")

    if len(snac_tokens) < SNAC_TOKENS_PER_FRAME:
        print("Error: not enough SNAC tokens generated")
        return

    levels = unpack_snac_from_7(snac_tokens)
    frames = len(levels[0])

    print(f"Unpacked to {frames} frames")
    print(f"L1: {len(levels[0])} codes")
    print(f"L2: {len(levels[1])} codes")
    print(f"L3: {len(levels[2])} codes")

    codes_tensor = [
        torch.tensor(level, dtype=torch.long, device=device).unsqueeze(0)
        for level in levels
    ]

    print("Decoding to audio...")
    with torch.inference_mode():
        z_q = snac_model.quantizer.from_codes(codes_tensor)
        audio = snac_model.decoder(z_q)[0, 0].cpu().numpy()

    # Trim the leading warm-up samples (note: reuses max_tokens as a sample count)
    if len(audio) > max_tokens:
        audio = audio[max_tokens:]

    duration_sec = len(audio) / 24000
    print(f"Audio generated: {len(audio)} samples ({duration_sec:.2f}s)")

    output_file = f"{description}-{text}.wav"
    sf.write(output_file, audio, 24000)
```
|
|
|
|
| #### Example with Bomu |
|
|
| A Bomu audio sample. |
| ```python |
main(
    description="bomu",
    text="Hee wa banu yɛrɛ wuro.",
    max_tokens=2048,
    temp=0.4,
    top_p=0.9,
    rp=1.1,
)
| ``` |
| <audio controls src="https://github.com/yazopie/TTS-teste/raw/refs/heads/main/bomu-Hee%20wa%20banu%20y%C9%9Br%C9%9B%20wuro..wav"></audio> |
|
|
| You can also read a Bambara text with a Bwa (Bomu) accent: |
| ```python |
main(
    description="bomu",
    text="An me kɛrɛlamana dɔw don a tigilamɔgɔ la.",
    max_tokens=2048,
    temp=0.4,
    top_p=0.9,
    rp=1.1,
)
| ``` |
| <audio controls src="https://github.com/yazopie/TTS-teste/raw/refs/heads/main/bomu-An%20me%20k%C9%9Br%C9%9Blamana%20d%C9%94w%20don%20a%20tigilam%C9%94g%C9%94%20la..wav"></audio> |
|
|
#### Example with Bambara
|
|
| ```python |
main(
    description="A Male voice with a bambara accent.",
    text="Mali ye anw faso de ye !",
    max_tokens=2048,
    temp=0.4,
    top_p=0.9,
    rp=1.1,
)
| ``` |
| <audio controls src="https://github.com/yazopie/TTS-teste/raw/refs/heads/main/A%20Male%20voice%20with%20a%20bambara%20accent.-Mali%20ye%20anw%20faso%20de%20ye%20!.wav"></audio> |
|
|
|
|
#### Example with text normalization
| The *[bambara-text-normalizer](https://github.com/sudoping01/bambara-text-normalization)* package is very useful for normalizing text before speech synthesis: |
|
|
| ```python |
main(
    description="Clear bambara voice.",
    text="Ne taara sugu la 24-12-2025 la, 10:45 waati, ne ye tulu 6 l san ani sukaro 10 kg.",
    normalize=True,
    do_sample=False,
)
| ``` |
| <audio controls src="https://github.com/yazopie/TTS-teste/raw/refs/heads/main/Clear%20bambara%20voice.-Ne%20taara%20sugu%20la%2024-12-2025%20la,%2010_45%20waati,%20ne%20ye%20tulu%206%20l%20san%20ani%20sukaro%2010%20kg..wav"></audio> |
|
|
| Read a Bomu text with a Bambara accent: |
| ```python |
main(
    description="bambara",
    text="Yacouba, yɛrɛ we zin ma nucoza ue.",
    max_tokens=2048,
    temp=0.4,
    top_p=0.9,
    rp=1.1,
)
| ``` |
| <audio controls src="https://github.com/yazopie/TTS-teste/raw/refs/heads/main/bambara-Yacouba,%20y%C9%9Br%C9%9B%20we%20zin%20ma%20nucoza%20ue..wav"></audio> |
|
|
#### Example with English
|
|
| ```python |
main(
    description="bomu",
    text="Hello! This is RobotsMali TTS model!",
    max_tokens=2048,
    temp=0.4,
    top_p=0.9,
    rp=1.1,
    do_sample=False,
)
| ``` |
| <audio controls src="https://github.com/yazopie/TTS-teste/raw/refs/heads/main/bomu-Hello!%20This%20is%20RobotsMali%20TTS%20model!.wav"></audio> |
|
|
| ## License |
|
|
This model is licensed for **non-commercial use only** (CC BY-NC-SA 4.0), because some of the data used for training is restricted to non-commercial use.