
Multilingual TTS System

This repository contains a modular Multilingual Text-to-Speech (TTS) system consisting of multiple microservices:

  • Gateway Service β†’ Main API entrypoint
  • ASR Service β†’ Speech-to-text using Whisper
  • Transliteration Service β†’ Converts Romanized text to native scripts
  • Normalization Service β†’ Converts transcripts (including reference-derived and input text) into canonical spoken-form representations
  • F5‑TTS Core Module β†’ Performs inference for multilingual speech generation
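For intuition, spoken-form normalization expands digits, symbols, and abbreviations into words before synthesis. The toy English-only sketch below is illustrative; `normalize` and `number_to_words` are hypothetical names, not the service's actual API, and the real service covers far more token classes across 11 Indic languages.

```python
import re

# Toy spoken-form normalizer: expands integers 0-99 into English words.
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty",
         "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n: int) -> str:
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + ("-" + _ONES[ones] if ones else "")
    return str(n)  # larger numbers left untouched in this sketch

def normalize(text: str) -> str:
    # Replace every run of digits with its spoken form.
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)

print(normalize("Meet me at 7, gate 42"))  # Meet me at seven, gate forty-two
```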

Each service runs independently with its own Docker container, and the entire stack can be deployed using Docker Compose or run individually as standalone services.
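Conceptually, a synthesis request flows through the services roughly as sketched below. This is a hypothetical outline of the orchestration; `asr`, `transliterate`, `normalize`, and `tts` are placeholder callables standing in for the microservice calls, not the gateway's real internals.

```python
# Hypothetical orchestration sketch of one synthesis request.
def run_pipeline(text, lang, speaker_wav, *, asr, transliterate, normalize, tts):
    ref_text = normalize(asr(speaker_wav), lang)   # transcribe + normalize the reference audio
    native_text = transliterate(text, lang)        # Romanized input -> native script
    spoken_text = normalize(native_text, lang)     # canonical spoken form
    return tts(spoken_text, ref_text, speaker_wav, lang)
```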


πŸš€ Features

  • Multilingual Text‑to‑Speech with support for 11 Indic languages
  • Automatic ASR transcription of the input reference speaker wav
  • Text normalization & transliteration
  • Modular microservice-based architecture
  • Runs with a single command (docker compose) or as individually deployed services

πŸ“ Project Structure

.
β”œβ”€β”€ gateway/                  # Main API
β”œβ”€β”€ asr_service/              # Whisper speech-to-text
β”œβ”€β”€ transliteration_service/  # Language transliteration
β”œβ”€β”€ normalization_service/    # Text cleaning and formatting
β”œβ”€β”€ f5_tts/                   # Core multilingual TTS inference module
β”œβ”€β”€ docker-compose.yml        # Docker compose file for running all services 
β”œβ”€β”€ .env                      # Required environment variables
β”œβ”€β”€ Dockerfile                # Gateway Dockerfile
β”œβ”€β”€ requirements.txt          # Gateway requirements file
β”œβ”€β”€ requirements_all.txt      # Requirements file for local development 
└── utils/                    # Utilities for TTS model loading

πŸ› οΈ Prerequisites

Before running, ensure you have (the stack was tested with Docker version 29.1.1, build 0aedba5, and Docker Compose v2.40.3):

  • Docker >= 24.0
  • Docker Compose >= v2
  • GPU with CUDA (optional but recommended)

CPU is supported, but GPU acceleration significantly improves inference speed.
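A quick way to check from Python whether the host exposes an NVIDIA GPU is to look for the driver's nvidia-smi utility. This is a heuristic helper for convenience, not part of this repository:

```python
import shutil

def host_has_nvidia_gpu() -> bool:
    """Heuristic: the NVIDIA driver ships nvidia-smi, so its presence
    on PATH suggests a CUDA-capable GPU is available on the host."""
    return shutil.which("nvidia-smi") is not None

print("GPU detected:", host_has_nvidia_gpu())
```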


πŸ“¦ Install Required System Packages

If you're running on a Linux host and plan to enable GPU support via NVIDIA Container Toolkit, install the following prerequisites:

sudo apt-get update && sudo apt-get install -y --no-install-recommends \
   curl \
   gnupg2

πŸ”§ Install NVIDIA Container Toolkit (For GPU Support)

  1. Add NVIDIA repository:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  2. Update package list:
sudo apt-get update
  3. Install toolkit:
export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.1-1
sudo apt-get install -y \
    nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
    libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}
  4. Restart Docker:
sudo systemctl restart docker

🧱 Installation

Clone the repository:

git clone https://huggingface.co/immverse-ai/voice-tech-for-all-challenge
cd voice-tech-for-all-challenge

πŸƒ Running the System

Option 1 β€” Run Entire System With Docker Compose (Recommended)

docker compose up -d

This spins up:

  • gateway β†’ exposed service
  • asr_service
  • transliteration_service
  • normalization_service

After startup, the API is available at: http://localhost:8080
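The containers can take a moment to become ready, so a first request fired immediately after `docker compose up -d` may fail. A small poll-until-ready helper (an illustrative utility, not shipped with this repository) avoids that race:

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    """Poll until a TCP port accepts connections or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# Example: block until the gateway is reachable before sending requests
# if wait_for_port("localhost", 8080):
#     print("Gateway is up at http://localhost:8080")
```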


Option 2 β€” Run Each Container Individually

1️⃣ Build images

docker build -t tts-gateway ./gateway
docker build -t tts-asr ./asr_service
docker build -t tts-translit ./transliteration_service
docker build -t tts-normalization ./normalization_service

2️⃣ Run containers manually

sudo docker run --name tts-gateway --rm -d --gpus all \
  -e REPO_ID="" \
  -e HF_TOKEN="" \
  -e ASR_URL="" \
  -e TRANSLIT_URL="" \
  -e NORM_URL="" \
  -p 8080:8080 \
  tts-gateway
sudo docker run -d --name tts-asr --gpus all -p 8072:8072 tts-asr
sudo docker run -d --name tts-translit -p 8071:8071 tts-translit
sudo docker run -d --name tts-normalization -p 8070:8070 tts-normalization

3️⃣ View container logs

sudo docker logs -f tts-gateway
sudo docker logs -f tts-asr
sudo docker logs -f tts-translit
sudo docker logs -f tts-normalization

πŸ—£οΈ API Usage (Example)

Python Example

import time

import requests

base_url = 'http://<ip_address>:8080/Get_Inference'

speaker_wav = ""  # path to the reference speaker .wav file

params = {
    'text': "",
    'lang': "",
}

t1 = time.time()
with open(speaker_wav, "rb") as audio_file:
    response = requests.get(base_url, params=params, files={'speaker_wav': audio_file.read()})
t2 = time.time()

print("Time taken:", t2 - t1, "sec")

file_path = ""  # where to save the generated audio

if response.status_code == 200:
    with open(file_path, 'wb') as f:
        f.write(response.content)
    print("Audio saved at:", file_path)
else:
    print(f"Request failed with status code {response.status_code}")
    print("Response:", response.text)

CURL Example

curl -X GET "http://<ip_address>:8080/Get_Inference?text=<your_text>&lang=<your_language>" \
     -F "speaker_wav=@/path/to/your/file.wav" \
     --output output.wav

Note: do not set the Content-Type header manually; with -F, curl generates the multipart/form-data header with the required boundary itself.

🧩 Environment Variables

The following environment variables are used by the Gateway Service to communicate with the other microservices. The defaults below match the Docker Compose configuration; adjust them only when deploying services manually (Option 2) or customizing the stack.

Variable       Description                                      Example
ASR_URL        Endpoint of the ASR (Whisper) service            http://asr:8072
TRANSLIT_URL   Endpoint of the Transliteration service          http://translit:8071
NORM_URL       Endpoint of the Text Normalization service       http://norm:8070
HF_TOKEN       Optional Hugging Face token for private models   xxxxxxx
REPO_ID        Hugging Face model repo ID for TTS               microsoft/f5-tts

Example .env file (used automatically by Docker Compose):

ASR_URL=http://asr:8072
TRANSLIT_URL=http://translit:8071
NORM_URL=http://norm:8070
HF_TOKEN=
REPO_ID=
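For reference, .env files of this shape are plain KEY=VALUE lines. The minimal parser below is only an illustration of the format; Docker Compose (and libraries like python-dotenv) handle this for you:

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """\
ASR_URL=http://asr:8072
TRANSLIT_URL=http://translit:8071
NORM_URL=http://norm:8070
HF_TOKEN=
REPO_ID=
"""
print(parse_env(sample)["ASR_URL"])  # http://asr:8072
```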

πŸ”§ Development Mode

Run gateway directly without Docker (for debugging):

# 1. Create and activate virtual environment
python -m venv tts-venv

# On macOS/Linux
source tts-venv/bin/activate

# 2. Install dependencies
pip install -r requirements_all.txt

# 3. Run the ASR service
python -m asr_service.whisper_api

# 4. Run the text normalization service
python -m normalization_service.text_norm_api

# 5. Run the transliteration service
python -m transliteration_service.transliteration_api

# 6. Run the gateway
python -m gateway.main

πŸ§‘β€πŸ’» Contributing

Pull requests and feature requests are welcome!

πŸ’¬ Support

For issues or improvements, open a ticket in the repository.
