---
language:
  - ee
  - en
license: cc-by-nc-2.0
tags:
  - text-to-speech
  - outetts
  - unsloth
  - ewe
  - audio
  - gaia-suite
datasets:
  - google/WaxalNLP
base_model: OuteAI/Llama-OuteTTS-1.0-1B
---

# Gaia Suite: Llama-OuteTTS-1.0-1B - Ewe (ee)

This model is part of the Gaia Suite, a collection of models for local languages; this release is adapted to Ewe (Eʋegbe).

This is a fine-tuned version of OuteAI/Llama-OuteTTS-1.0-1B specifically trained to synthesize speech in the Ewe language. The model was fine-tuned using the Unsloth library with 16-bit LoRA adapters (Rank 64) for memory-efficient and fast training.

## Model Details

- **Model Type:** Text-to-Speech (TTS) autoregressive language model
- **Language(s):** Ewe (`ee`)
- **Base Model:** OuteAI/Llama-OuteTTS-1.0-1B
- **Training Dataset:** google/WaxalNLP (Ewe TTS subset)
- **Fine-Tuning Method:** LoRA (rank 64, 16-bit)
- **Framework:** Hugging Face `transformers`, `trl`, `unsloth`
- **License:** CC-BY-NC-2.0 (attribution required, non-commercial)

## Intended Use

This model is intended for generating Ewe speech from text. It is suitable for:

- Accessibility tools for Ewe speakers
- Educational applications and language learning
- Voice assistants and read-aloud features in Ewe

## Citation & Attribution

If you use this model in your research, applications, or projects, you must cite and attribute Junior Adenyo.

## Limitations & Preprocessing

- **Text Normalization:** Like many TTS models, this model struggles with raw numbers, acronyms, and special symbols. It is highly recommended to spell out numbers and dates in Ewe (e.g., convert 240 to its Ewe word equivalent) before feeding the text to the model.
- **Ewe Orthography:** Ensure the input text uses the Ewe-specific characters (Ɖ, Ɛ, Ƒ, Ɣ, Ŋ, Ɔ, Ʋ, ɖ, ɛ, ƒ, ɣ, ŋ, ɔ, ʋ) correctly, as the tokenizer has been explicitly resized to support them.
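A lightweight pre-flight check for both constraints can be sketched in plain Python. This is an illustrative helper, not part of the model's tooling; the digit and acronym patterns are assumptions about what typically trips up TTS input:

```python
import re

def preflight_check(text: str) -> list:
    """Return warnings for input the model may mishandle."""
    warnings = []
    # Raw digits should be spelled out in Ewe before synthesis.
    for match in re.findall(r"\d+", text):
        warnings.append(f"raw number '{match}': spell it out in Ewe")
    # All-caps runs are likely acronyms, which TTS models often mispronounce.
    for match in re.findall(r"\b[A-Z]{2,}\b", text):
        warnings.append(f"possible acronym '{match}': expand or respell it")
    return warnings

print(preflight_check("Ŋdi nyuie!"))       # clean Ewe text -> []
print(preflight_check("Ƒe 240 kple TTS"))  # flags the number and the acronym
```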

## Usage (with OuteTTS and Unsloth)

```python
import torch
import re
from unsloth import FastModel

# Load the fine-tuned model
model, tokenizer = FastModel.from_pretrained(
    model_name="analist/oute_ewe_r64_16bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
)
FastModel.for_inference(model)

# Prepare your Ewe text
input_text = "Ya ʋuduʋudu si ƒe kpekpeme anɔ abe agbadroƒe blaatɔ̄ le gaƒoƒo ɖeka me ene la, aƒo."
formatted_text = "<|text_start|>" + input_text + "<|text_end|>"
prompt = "\n".join([
    "<|im_start|>",
    formatted_text,
    "<|audio_start|><|global_features_start|>",
])

model_inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

# Generate audio tokens
with torch.inference_mode():
    with torch.amp.autocast('cuda', dtype=model.dtype):
        generated_ids = model.generate(
            **model_inputs,
            temperature=0.1,
            top_k=40,
            top_p=0.9,
            repetition_penalty=1.0,
            min_p=0.05,
            max_new_tokens=4096,
        )

# Extract the two codebook streams from the generated token string
decoded_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=False)[0]
c1 = list(map(int, re.findall(r"<\|c1_(\d+)\|>", decoded_output)))
c2 = list(map(int, re.findall(r"<\|c2_(\d+)\|>", decoded_output)))

# Truncate both streams to the same length so the codebooks stay aligned
t = min(len(c1), len(c2))
audio_tokens = [c1[:t], c2[:t]]

# Note: To decode the generated tokens into a waveform,
# you will need the DAC (Descript Audio Codec) interface from the OuteTTS library.
# from outetts.dac.interface import DacInterface
# dac = DacInterface()
# audio_waveform = dac.decode(torch.tensor([audio_tokens], dtype=torch.int64).to(dac.device))
```
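The two-codebook parsing step above is plain string processing, so it can be exercised without a GPU on a synthetic decoded string (the token values below are made up for illustration):

```python
import re

# Synthetic stand-in for the tokenizer.batch_decode output; values are illustrative.
decoded_output = "<|audio_start|><|c1_17|><|c2_901|><|c1_5|><|c2_44|><|c1_300|>"

c1 = list(map(int, re.findall(r"<\|c1_(\d+)\|>", decoded_output)))
c2 = list(map(int, re.findall(r"<\|c2_(\d+)\|>", decoded_output)))

# Generation can stop mid-frame, leaving the streams uneven; truncate to
# the shorter one so the two codebooks stay frame-aligned for the decoder.
t = min(len(c1), len(c2))
audio_tokens = [c1[:t], c2[:t]]

print(audio_tokens)  # [[17, 5], [901, 44]]
```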

## Training Procedure

- **Batch Size:** 2 (with gradient accumulation steps = 8)
- **Learning Rate:** 5e-5
- **Epochs:** 6
- **Optimizer:** adamw_8bit
- **Hardware:** a single NVIDIA RTX PRO 6000 Blackwell Edition
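These hyperparameters give an effective batch size of 2 × 8 = 16. As a rough sketch (not the exact training script), they map onto standard `transformers` training arguments like this; `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters, not the exact training configuration.
args = TrainingArguments(
    output_dir="oute_ewe_r64_16bit",  # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size: 2 * 8 = 16
    learning_rate=5e-5,
    num_train_epochs=6,
    optim="adamw_8bit",               # requires bitsandbytes
)
```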

## Acknowledgements

- Model architecture by OuteAI.
- Dataset provided by Google's WaxalNLP project.
- Fine-tuning powered by Unsloth.