# Dia TTS Server - Technical Documentation

**Version:** 1.0.0
**Date:** 2025-04-22

**Table of Contents:**

1. [Overview](#1-overview)
2. [Visual Overview](#2-visual-overview)
    * [Directory Structure](#21-directory-structure)
    * [Component Diagram](#22-component-diagram)
3. [System Prerequisites](#3-system-prerequisites)
4. [Installation and Setup](#4-installation-and-setup)
    * [Cloning the Repository](#41-cloning-the-repository)
    * [Setting up Python Virtual Environment](#42-setting-up-python-virtual-environment)
        * [Windows Setup](#421-windows-setup)
        * [Linux Setup (Debian/Ubuntu Example)](#422-linux-setup-debianubuntu-example)
    * [Installing Dependencies](#43-installing-dependencies)
    * [NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)
        * [Step 1: Check/Install NVIDIA Drivers](#441-step-1-checkinstall-nvidia-drivers)
        * [Step 2: Install PyTorch with CUDA Support](#442-step-2-install-pytorch-with-cuda-support)
        * [Step 3: Verify PyTorch CUDA Installation](#443-step-3-verify-pytorch-cuda-installation)
5. [Configuration](#5-configuration)
    * [Configuration Files (`.env` and `config.py`)](#51-configuration-files-env-and-configpy)
    * [Configuration Parameters](#52-configuration-parameters)
6. [Running the Server](#6-running-the-server)
7. [Usage](#7-usage)
    * [Web User Interface (Web UI)](#71-web-user-interface-web-ui)
        * [Main Generation Form](#711-main-generation-form)
        * [Presets](#712-presets)
        * [Voice Cloning](#713-voice-cloning)
        * [Generation Parameters](#714-generation-parameters)
        * [Server Configuration (UI)](#715-server-configuration-ui)
        * [Generated Audio Player](#716-generated-audio-player)
        * [Theme Toggle](#717-theme-toggle)
    * [API Endpoints](#72-api-endpoints)
        * [POST /v1/audio/speech (OpenAI Compatible)](#721-post-v1audiospeech-openai-compatible)
        * [POST /tts (Custom Parameters)](#722-post-tts-custom-parameters)
        * [Configuration & Helper Endpoints](#723-configuration--helper-endpoints)
8. [Troubleshooting](#8-troubleshooting)
9. [Project Architecture](#9-project-architecture)
10. [License and Disclaimer](#10-license-and-disclaimer)

---
## 1. Overview

The Dia TTS Server provides a backend service and web interface for generating high-fidelity speech, including dialogue with multiple speakers and non-verbal sounds, using the Dia text-to-speech model family (originally from Nari Labs, with support for community conversions like SafeTensors).

This server is built using the FastAPI framework and offers both a RESTful API (including an OpenAI-compatible endpoint) and an interactive web UI powered by Jinja2, Tailwind CSS, and JavaScript. It supports voice cloning via audio prompts and allows configuration of various generation parameters.

**Key Features:**

* **High-Quality TTS:** Leverages the Dia model for realistic speech synthesis.
* **Dialogue Generation:** Supports `[S1]` and `[S2]` tags for multi-speaker dialogue.
* **Non-Verbal Sounds:** Can generate sounds like `(laughs)`, `(sighs)`, etc., when included in the text.
* **Voice Cloning:** Allows conditioning the output voice on a provided reference audio file.
* **Flexible Model Loading:** Supports loading models from Hugging Face repositories, including both `.pth` and `.safetensors` formats (defaults to BF16 SafeTensors for efficiency).
* **API Access:** Provides a custom API endpoint (`/tts`) and an OpenAI-compatible endpoint (`/v1/audio/speech`).
* **Web Interface:** Offers an easy-to-use UI for text input, parameter adjustment, preset loading, reference audio management, and audio playback.
* **Configuration:** Server settings, model sources, paths, and default generation parameters are configurable via an `.env` file.
* **GPU Acceleration:** Utilizes NVIDIA GPUs via CUDA for significantly faster inference when available, falling back to CPU otherwise.

---
## 2. Visual Overview

### 2.1 Directory Structure

```
dia-tts-server/
│
├── .env                  # Local configuration overrides (user-created)
├── config.py             # Default configuration and management class
├── engine.py             # Core model loading and generation logic
├── models.py             # Pydantic models for API requests
├── requirements.txt      # Python dependencies
├── server.py             # Main FastAPI application, API endpoints, UI routes
├── utils.py              # Utility functions (audio encoding, saving, etc.)
│
├── dia/                  # Core Dia model implementation package
│   ├── __init__.py
│   ├── audio.py          # Audio processing helpers (delay, codebook conversion)
│   ├── config.py         # Pydantic models for Dia model architecture config
│   ├── layers.py         # Custom PyTorch layers for the Dia model
│   └── model.py          # Dia model class wrapper (loading, generation)
│
├── static/               # Static assets (e.g., favicon.ico)
│   └── favicon.ico
│
├── ui/                   # Web User Interface files
│   ├── index.html        # Main HTML template (Jinja2)
│   ├── presets.yaml      # Predefined UI examples
│   ├── script.js         # Frontend JavaScript logic
│   └── style.css         # Frontend CSS styling (Tailwind via CDN/build)
│
├── model_cache/          # Default directory for downloaded model files (configurable)
├── outputs/              # Default directory for saved audio output (configurable)
└── reference_audio/      # Default directory for voice cloning reference files (configurable)
```
### 2.2 Component Diagram

```
┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
│   User (Web UI /    │──►│   FastAPI Server    │──►│     TTS Engine      │──►│  Dia Model Wrapper  │
│     API Client)     │   │     (server.py)     │   │     (engine.py)     │   │    (dia/model.py)   │
└─────────────────────┘   └──────────┬──────────┘   └──────────┬──────────┘   └──────────┬──────────┘
                                     │ Uses                    │ Uses                    │ Uses
                                     ▼                         ▼                         ▼
                          ┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
                          │    Configuration    │◄──│      .env File      │   │  Dia Model Layers   │
                          │     (config.py)     │   └─────────────────────┘   │   (dia/layers.py)   │
                          └──────────┬──────────┘                             └──────────┬──────────┘
                                     │ Uses                                              │ Uses
                                     ▼                                                   ▼
                          ┌─────────────────────┐                             ┌─────────────────────┐
                          │      Utilities      │                             │    PyTorch / CUDA   │
                          │      (utils.py)     │                             └─────────────────────┘
                          └─────────────────────┘

┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
│    Web UI Files     │◄──│  Jinja2 Templates   │   │      DAC Model      │
│        (ui/)        │   └─────────────────────┘   │ (descript-audio..)  │
└─────────────────────┘                             └─────────────────────┘
  ▲ Rendered by the FastAPI Server                    ▲ Used by the TTS Engine
```

**Diagram Legend:**

* Boxes represent major components or file groups.
* Arrows (`──►`, `▼`) indicate primary data flow or control flow.
* Lines labeled "Uses" indicate dependencies or function calls.

---
## 3. System Prerequisites

Before installing and running the Dia TTS Server, ensure your system meets the following requirements:

* **Operating System:**
    * Windows 10/11 (64-bit)
    * Linux (Debian/Ubuntu recommended; other distributions may require adjustments)
* **Python:** Python 3.10 or later (3.10.x recommended). Ensure Python and Pip are added to your system's PATH.
* **Version Control:** Git (for cloning the repository).
* **Internet Connection:** Required for downloading dependencies and model files.
* **(Optional but Highly Recommended for Performance):**
    * **NVIDIA GPU:** A CUDA-compatible NVIDIA GPU (Maxwell architecture or newer). Check compatibility [here](https://developer.nvidia.com/cuda-gpus). Sufficient VRAM is needed (the BF16 model requires ~5-6 GB; full precision requires ~10 GB).
    * **NVIDIA Drivers:** The latest appropriate drivers for your GPU and OS.
    * **CUDA Toolkit:** A version compatible with the chosen PyTorch build (e.g., 11.8, 12.1). See [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration).
* **(Linux System Libraries):**
    * `libsndfile1`: Required by the `soundfile` Python library for audio I/O. Install using your package manager (e.g., `sudo apt install libsndfile1` on Debian/Ubuntu).

---
## 4. Installation and Setup

Follow these steps to set up the project environment and install necessary dependencies.

### 4.1 Cloning the Repository

Open your terminal or command prompt and navigate to the directory where you want to store the project. Then, clone the repository:

```bash
git clone https://github.com/devnen/dia-tts-server.git # Replace with the actual repo URL if different
cd dia-tts-server
```
### 4.2 Setting up Python Virtual Environment

Using a virtual environment is strongly recommended to isolate project dependencies.

#### 4.2.1 Windows Setup

1. **Open PowerShell or Command Prompt** in the project directory (`dia-tts-server`).
2. **Create the virtual environment:**

    ```powershell
    python -m venv venv
    ```

3. **Activate the virtual environment:**

    ```powershell
    .\venv\Scripts\activate
    ```

    Your terminal prompt should now be prefixed with `(venv)`.

#### 4.2.2 Linux Setup (Debian/Ubuntu Example)

1. **Install prerequisites (if not already present):**

    ```bash
    sudo apt update
    sudo apt install python3 python3-venv python3-pip libsndfile1 -y
    ```

2. **Open your terminal** in the project directory (`dia-tts-server`).
3. **Create the virtual environment:**

    ```bash
    python3 -m venv venv
    ```

4. **Activate the virtual environment:**

    ```bash
    source venv/bin/activate
    ```

    Your terminal prompt should now be prefixed with `(venv)`.
### 4.3 Installing Dependencies

With your virtual environment activated (`(venv)` prefix visible), install the required Python packages:

```bash
# Upgrade pip first (optional but good practice)
pip install --upgrade pip

# Install all dependencies from requirements.txt
pip install -r requirements.txt
```

**Note:** This command installs the CPU-only version of PyTorch by default. If you have a compatible NVIDIA GPU and want acceleration, proceed to [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration) **before** running the server.
### 4.4 NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)

Follow these steps **only if you have a compatible NVIDIA GPU** and want faster inference.

#### 4.4.1 Step 1: Check/Install NVIDIA Drivers

1. **Check Existing Driver:** Open Command Prompt (Windows) or Terminal (Linux) and run:

    ```bash
    nvidia-smi
    ```

2. **Interpret Output:**
    * If the command runs successfully, note the **Driver Version** and the **CUDA Version** listed in the top right corner. This CUDA version is the *maximum* supported by your current driver.
    * If the command fails ("not recognized"), you need to install or update your NVIDIA drivers.
3. **Install/Update Drivers:** Go to the [NVIDIA Driver Downloads](https://www.nvidia.com/Download/index.aspx) page. Select your GPU model and OS, then download and install the latest recommended driver (Game Ready or Studio). **Reboot your computer** after installation. Run `nvidia-smi` again to confirm it works.
#### 4.4.2 Step 2: Install PyTorch with CUDA Support

1. **Go to PyTorch Website:** Visit [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
2. **Configure:** Select:
    * **PyTorch Build:** Stable
    * **Your OS:** Windows or Linux
    * **Package:** Pip
    * **Language:** Python
    * **Compute Platform:** Choose the CUDA version **equal to or lower than** the version reported by `nvidia-smi`. For example, if `nvidia-smi` shows `CUDA Version: 12.4`, select `CUDA 12.1`. If it shows `11.8`, select `CUDA 11.8`. **Do not select a version higher than your driver supports.** (CUDA 12.1 and 11.8 are common stable choices.)
3. **Copy Command:** Copy the generated installation command. It will look similar to:

    ```bash
    # Example for CUDA 12.1 (Windows/Linux):
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

    # Example for CUDA 11.8 (Windows/Linux):
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    ```

    *(Use `pip` instead of `pip3` if that is your command.)*
4. **Install in Activated venv:**
    * Ensure your `(venv)` is active.
    * **Uninstall the CPU-only PyTorch first:**

        ```bash
        pip uninstall torch torchvision torchaudio -y
        ```

    * **Paste and run the copied command** from the PyTorch website.
#### 4.4.3 Step 3: Verify PyTorch CUDA Installation

1. With the `(venv)` still active, start a Python interpreter:

    ```bash
    python
    ```

2. Run the following Python code:

    ```python
    import torch

    print(f"PyTorch version: {torch.__version__}")
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")
    if cuda_available:
        print(f"CUDA version used by PyTorch: {torch.version.cuda}")
        print(f"Device count: {torch.cuda.device_count()}")
        print(f"Current device index: {torch.cuda.current_device()}")
        print(f"Device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
    else:
        print("CUDA not available to PyTorch. Ensure drivers and CUDA-enabled PyTorch are installed correctly.")
    exit()
    ```

3. If `CUDA available:` shows `True`, the setup was successful. If `False`, review the driver installation and the PyTorch installation command.

---
## 5. Configuration

The server's behavior, including model selection, paths, and default generation parameters, is controlled via configuration settings.

### 5.1 Configuration Files (`.env` and `config.py`)

* **`config.py`:** Defines the *default* values for all configuration parameters in the `DEFAULT_CONFIG` dictionary. It also contains the `ConfigManager` class and getter functions used by the application.
* **`.env` File:** This file, located in the project root directory (`dia-tts-server/.env`), allows you to *override* the default values. Create this file if it doesn't exist. Settings are defined as `KEY=VALUE` pairs, one per line. The server reads this file on startup using `python-dotenv`.

**Priority:** Values set in the `.env` file take precedence over the defaults in `config.py`. Environment variables set directly in your system also override `.env` file values (though using `.env` is generally recommended for project-specific settings).
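The precedence chain can be sketched in a few lines of plain Python. This is an illustrative stand-in for the real `ConfigManager` in `config.py` (the function names and trimmed-down parser here are hypothetical), showing only the lookup order: real environment variables, then `.env` values, then the coded defaults.

```python
import os

# Defaults mirroring a small slice of DEFAULT_CONFIG in config.py (illustrative subset)
DEFAULT_CONFIG = {"HOST": "0.0.0.0", "PORT": "8003"}


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def effective_setting(key, dotenv_values):
    """Lookup order: real environment variable > .env value > coded default."""
    if key in os.environ:
        return os.environ[key]
    if key in dotenv_values:
        return dotenv_values[key]
    return DEFAULT_CONFIG[key]


dotenv = parse_dotenv("# local overrides\nPORT=8080\n")
# The .env value wins over the coded default (unless PORT is also set in your shell)
print(effective_setting("PORT", dotenv))
print(effective_setting("HOST", dotenv))
```

The real server uses `python-dotenv` rather than this hand-rolled parser, but the resolution order is the same.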
### 5.2 Configuration Parameters

The following parameters can be set in your `.env` file:

| Parameter Name (in `.env`) | Default Value (`config.py`) | Description | Example `.env` Value |
| :--- | :--- | :--- | :--- |
| **Server Settings** | | | |
| `HOST` | `0.0.0.0` | The network interface address the server listens on. `0.0.0.0` makes it accessible on your local network. | `127.0.0.1` (localhost only) |
| `PORT` | `8003` | The port number the server listens on. | `8080` |
| **Model Source Settings** | | | |
| `DIA_MODEL_REPO_ID` | `ttj/dia-1.6b-safetensors` | The Hugging Face repository ID containing the model files. | `nari-labs/Dia-1.6B` |
| `DIA_MODEL_CONFIG_FILENAME` | `config.json` | The filename of the model's configuration JSON within the repository. | `config.json` |
| `DIA_MODEL_WEIGHTS_FILENAME` | `dia-v0_1_bf16.safetensors` | The filename of the model weights file (`.safetensors` or `.pth`) within the repository to load. | `dia-v0_1.safetensors` or `dia-v0_1.pth` |
| **Path Settings** | | | |
| `DIA_MODEL_CACHE_PATH` | `./model_cache` | Local directory to store downloaded model files. Relative paths are based on the project root. | `/path/to/shared/cache` |
| `REFERENCE_AUDIO_PATH` | `./reference_audio` | Local directory to store reference audio files (`.wav`, `.mp3`) used for voice cloning. | `./voices` |
| `OUTPUT_PATH` | `./outputs` | Local directory where generated audio files from the Web UI are saved. | `./generated_speech` |
| **Default Generation Parameters** | | *(These set the initial UI values and can be saved via the UI)* | |
| `GEN_DEFAULT_SPEED_FACTOR` | `0.90` | Default playback speed factor applied *after* generation (UI slider initial value). | `1.0` |
| `GEN_DEFAULT_CFG_SCALE` | `3.0` | Default Classifier-Free Guidance scale (UI slider initial value). | `2.5` |
| `GEN_DEFAULT_TEMPERATURE` | `1.3` | Default sampling temperature (UI slider initial value). | `1.2` |
| `GEN_DEFAULT_TOP_P` | `0.95` | Default nucleus sampling probability (UI slider initial value). | `0.9` |
| `GEN_DEFAULT_CFG_FILTER_TOP_K` | `35` | Default Top-K value for CFG filtering (UI slider initial value). | `40` |
**Example `.env` File (Using the Original Nari Labs Model):**

```dotenv
# .env
# Example configuration to use the original Nari Labs model

HOST=0.0.0.0
PORT=8003

DIA_MODEL_REPO_ID=nari-labs/Dia-1.6B
DIA_MODEL_CONFIG_FILENAME=config.json
DIA_MODEL_WEIGHTS_FILENAME=dia-v0_1.pth

# Keep other paths as default or specify custom ones
# DIA_MODEL_CACHE_PATH=./model_cache
# REFERENCE_AUDIO_PATH=./reference_audio
# OUTPUT_PATH=./outputs

# Keep default generation parameters or override them
# GEN_DEFAULT_SPEED_FACTOR=0.90
# GEN_DEFAULT_CFG_SCALE=3.0
# GEN_DEFAULT_TEMPERATURE=1.3
# GEN_DEFAULT_TOP_P=0.95
# GEN_DEFAULT_CFG_FILTER_TOP_K=35
```

**Important:** You must **restart the server** after making changes to the `.env` file for them to take effect.

---
## 6. Running the Server

1. **Activate Virtual Environment:** Ensure your virtual environment is activated (`(venv)` prefix visible).
    * Windows: `.\venv\Scripts\activate`
    * Linux: `source venv/bin/activate`
2. **Navigate to Project Root:** Make sure your terminal is in the `dia-tts-server` directory.
3. **Run the Server:**

    ```bash
    python server.py
    ```

4. **Server Output:** You should see log messages indicating the server is starting, including:
    * The configuration being used (repo ID, filenames, paths).
    * The device being used (CPU or CUDA).
    * Model loading progress (downloading if necessary).
    * Confirmation that the server is running (e.g., `Uvicorn running on http://0.0.0.0:8003`).
    * URLs for accessing the Web UI and API docs.
5. **Accessing the Server:**
    * **Web UI:** Open your web browser and go to `http://localhost:PORT` (e.g., `http://localhost:8003` if using the default port). If running on a different machine or VM, replace `localhost` with the server's IP address.
    * **API Docs:** Access the interactive API documentation (Swagger UI) at `http://localhost:PORT/docs`.
6. **Stopping the Server:** Press `CTRL+C` in the terminal where the server is running.

**Auto-Reload:** The server is configured to run with `reload=True`. This means Uvicorn will automatically restart the server if it detects changes in `.py`, `.html`, `.css`, `.js`, `.env`, or `.yaml` files within the project or `ui` directory. This is useful for development but should generally be disabled in production.
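Once the server reports it is running, a quick way to confirm it is responding is to call the `/health` endpoint (documented in Section 7.2.3). The snippet below is a minimal, stdlib-only sketch; the base URL assumes the default `HOST`/`PORT` settings, and it fails gracefully if nothing is listening.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def check_health(base_url: str = "http://localhost:8003") -> Optional[dict]:
    """Call the /health endpoint and return its JSON, or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    status = check_health()
    if status is None:
        print("Server not reachable - is it running?")
    else:
        print(f"Status: {status.get('status')}, model loaded: {status.get('model_loaded')}")
```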
---
## 7. Usage

The Dia TTS Server can be used via its Web UI or its API endpoints.

### 7.1 Web User Interface (Web UI)

Access the UI by navigating to the server's base URL (e.g., `http://localhost:8003`).

#### 7.1.1 Main Generation Form

* **Text to speak:** Enter the text you want to synthesize.
    * Use `[S1]` and `[S2]` tags to indicate speaker turns for dialogue.
    * Include non-verbal cues like `(laughs)`, `(sighs)`, `(clears throat)` directly in the text where desired.
    * For voice cloning, **prepend the exact transcript** of the selected reference audio before the text you want generated (e.g., `[S1] Reference transcript text. [S1] This is the new text to generate in the cloned voice.`).
* **Voice Mode:** Select the desired generation mode:
    * **Single / Dialogue (Use [S1]/[S2]):** Use this for single-speaker text (you can use `[S1]` or omit tags if the model handles it) or multi-speaker dialogue (using `[S1]` and `[S2]`).
    * **Voice Clone (from Reference):** Enables voice cloning based on a selected audio file. Requires selecting a file below and prepending its transcript to the text input.
* **Generate Speech Button:** Submits the text and settings to the server to start generation.
#### 7.1.2 Presets

* Located below the Voice Mode selection.
* Clicking a preset button (e.g., "Standard Dialogue", "Expressive Narration") automatically populates the "Text to speak" area and the "Generation Parameters" sliders with predefined values, demonstrating different use cases.

#### 7.1.3 Voice Cloning

* This section appears only when "Voice Clone" mode is selected.
* **Reference Audio File Dropdown:** Lists the available `.wav` and `.mp3` files found in the configured `REFERENCE_AUDIO_PATH`. Select the file whose voice you want to clone. Remember to prepend its transcript to the main text input.
* **Load Button:** Click this to open your system's file browser. You can select one or more `.wav` or `.mp3` files to upload. The selected files are copied to the server's `REFERENCE_AUDIO_PATH`, and the dropdown list refreshes automatically. The first newly uploaded file is selected in the dropdown.

#### 7.1.4 Generation Parameters

* Expand this section to fine-tune the generation process. These values correspond to the parameters used by the underlying Dia model.
* **Sliders:** Adjust Speed Factor, CFG Scale, Temperature, Top P, and CFG Filter Top K. The current value is displayed next to each label.
* **Save Generation Defaults Button:** Saves the *current* values of these sliders to the `.env` file (as `GEN_DEFAULT_...` keys). These saved values become the default settings loaded into the UI the next time the server starts.

#### 7.1.5 Server Configuration (UI)

* Expand this section to view and modify server-level settings stored in the `.env` file.
* **Fields:** Edit the Model Repo ID, Config/Weights Filenames, Cache/Reference/Output Paths, Host, and Port.
* **Save Server Configuration Button:** Saves the values currently shown in these fields to the `.env` file. **A server restart is required** for most of these changes (especially the model source or paths) to take effect.
* **Restart Server Button:** (Appears after saving.) Attempts to trigger a server restart. This works best if the server was started with `reload=True` or is managed by a process manager such as systemd or Supervisor.

#### 7.1.6 Generated Audio Player

* Appears below the main form after a successful generation.
* **Waveform:** Visual representation of the generated audio.
* **Play/Pause Button:** Controls audio playback.
* **Download WAV Button:** Downloads the generated audio as a `.wav` file.
* **Info:** Displays the voice mode used, the generation time, and the audio duration.

#### 7.1.7 Theme Toggle

* Located in the top-right navigation bar.
* Click the Sun/Moon icon to switch between the Light and Dark themes. Your preference is saved in your browser's `localStorage`.
### 7.2 API Endpoints

Access the interactive API documentation via the `/docs` path (e.g., `http://localhost:8003/docs`).

#### 7.2.1 POST `/v1/audio/speech` (OpenAI Compatible)

* **Purpose:** Provides an endpoint compatible with the basic OpenAI TTS API for easier integration with existing tools.
* **Request Body:** (`application/json`) - Uses the `OpenAITTSRequest` model.

| Field | Type | Required | Description | Example |
| :--- | :--- | :--- | :--- | :--- |
| `model` | string | No | Ignored by this server (always uses Dia). Included for compatibility. Defaults to `dia-1.6b`. | `"dia-1.6b"` |
| `input` | string | Yes | The text to synthesize. Use `[S1]`/`[S2]` tags for dialogue. For cloning, prepend the reference transcript. | `"Hello [S1] world."` |
| `voice` | string | No | Maps to Dia modes. Use `"S1"`, `"S2"`, `"dialogue"`, or the filename of a reference audio (e.g., `"my_ref.wav"`) for cloning. Defaults to `S1`. | `"dialogue"` or `"ref.mp3"` |
| `response_format` | `"opus"` \| `"wav"` | No | Desired audio output format. Defaults to `opus`. | `"wav"` |
| `speed` | float | No | Playback speed factor (0.5-2.0). Applied *after* generation. Defaults to `1.0`. | `0.9` |

* **Response:**
    * **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).
    * **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).
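A minimal client for this endpoint can be written with just the standard library. This is a sketch, not part of the server: the helper names are made up, the field values mirror the request table above, and the URL and output filename assume the default host, port, and a WAV response.

```python
import json
import urllib.error
import urllib.request


def build_openai_payload(text: str, voice: str = "dialogue",
                         response_format: str = "wav", speed: float = 0.9) -> dict:
    """Build a request body matching the OpenAITTSRequest fields."""
    return {
        "model": "dia-1.6b",   # ignored by the server; kept for compatibility
        "input": text,
        "voice": voice,        # "S1", "S2", "dialogue", or a reference filename
        "response_format": response_format,
        "speed": speed,
    }


def synthesize(payload: dict,
               url: str = "http://localhost:8003/v1/audio/speech") -> bytes:
    """POST the payload and return the raw binary audio response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.read()  # audio/wav or audio/opus bytes


if __name__ == "__main__":
    payload = build_openai_payload("[S1] Hello there. [S2] Hi! (laughs)")
    try:
        audio = synthesize(payload)
        with open("speech.wav", "wb") as f:
            f.write(audio)
        print(f"Wrote {len(audio)} bytes to speech.wav")
    except (urllib.error.URLError, OSError) as exc:
        print(f"Request failed (is the server running?): {exc}")
```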
#### 7.2.2 POST `/tts` (Custom Parameters)

* **Purpose:** Allows generation using the full set of Dia-specific generation parameters.
* **Request Body:** (`application/json`) - Uses the `CustomTTSRequest` model.

| Field | Type | Required | Description | Default |
| :--- | :--- | :--- | :--- | :--- |
| `text` | string | Yes | The text to synthesize. Use `[S1]`/`[S2]` tags. Prepend the transcript for cloning. | |
| `voice_mode` | `"dialogue"` \| `"clone"` | No | Generation mode. Note: `single_s1`/`single_s2` are handled via `dialogue` mode with appropriate tags in the text. | `dialogue` |
| `clone_reference_filename` | string \| null | No | Filename of reference audio in `REFERENCE_AUDIO_PATH`. **Required if `voice_mode` is `clone`.** | `null` |
| `output_format` | `"opus"` \| `"wav"` | No | Desired audio output format. | `opus` |
| `max_tokens` | integer \| null | No | Maximum audio tokens to generate. `null` uses the model's default. | `null` |
| `cfg_scale` | float | No | Classifier-Free Guidance scale. | `3.0` |
| `temperature` | float | No | Sampling temperature. | `1.3` |
| `top_p` | float | No | Nucleus sampling probability. | `0.95` |
| `speed_factor` | float | No | Playback speed factor (0.5-2.0). Applied *after* generation. | `0.90` |
| `cfg_filter_top_k` | integer | No | Top-K value for CFG filtering. | `35` |

* **Response:**
    * **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).
    * **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).
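As a sketch of assembling a `/tts` request body (the helper name and URL are illustrative; the defaults mirror the table above, including the rule that `clone` mode requires a reference filename):

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def build_tts_payload(text: str, voice_mode: str = "dialogue",
                      clone_reference_filename: Optional[str] = None) -> dict:
    """Build a CustomTTSRequest-style body with the documented defaults."""
    payload = {
        "text": text,
        "voice_mode": voice_mode,
        "output_format": "wav",
        "cfg_scale": 3.0,
        "temperature": 1.3,
        "top_p": 0.95,
        "speed_factor": 0.90,
        "cfg_filter_top_k": 35,
    }
    if voice_mode == "clone":
        if not clone_reference_filename:
            raise ValueError("clone mode requires clone_reference_filename")
        payload["clone_reference_filename"] = clone_reference_filename
    return payload


if __name__ == "__main__":
    body = build_tts_payload("[S1] Testing the custom endpoint. (sighs)")
    req = urllib.request.Request(
        "http://localhost:8003/tts",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            with open("tts_output.wav", "wb") as f:
                f.write(resp.read())
    except (urllib.error.URLError, OSError) as exc:
        print(f"Request failed (is the server running?): {exc}")
```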
#### 7.2.3 Configuration & Helper Endpoints

* **GET `/get_config`:** Returns the current server configuration as JSON.
* **POST `/save_config`:** Saves server configuration settings provided in the JSON request body to the `.env` file. Requires a server restart.
* **POST `/save_generation_defaults`:** Saves default generation parameters provided in the JSON request body to the `.env` file. Affects UI defaults on the next load.
* **POST `/restart_server`:** Attempts to trigger a server restart (reliability depends on the execution environment).
* **POST `/upload_reference`:** Uploads one or more audio files (`.wav`, `.mp3`) as `multipart/form-data` to the reference audio directory. Returns JSON with status and the updated file list.
* **GET `/health`:** Basic health check endpoint. Returns `{"status": "healthy", "model_loaded": true/false}`.
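For `/upload_reference`, the request must be `multipart/form-data`. The sketch below builds such a body using only the standard library; note that the form field name `files`, the URL, and the example filename are assumptions — check the interactive `/docs` page for your server's exact schema.

```python
import uuid
from pathlib import Path


def build_multipart(paths, field="files"):
    """Return (content_type, body) for a multipart/form-data file upload.

    The field name "files" is an assumption; verify it against /docs.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for path in paths:
        p = Path(path)
        subtype = "mpeg" if p.suffix.lower() == ".mp3" else "wav"
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{field}"; filename="{p.name}"\r\n'
            f"Content-Type: audio/{subtype}\r\n\r\n"
        ).encode("utf-8")
        chunks.append(header + p.read_bytes() + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return f"multipart/form-data; boundary={boundary}", b"".join(chunks)


if __name__ == "__main__":
    import json
    import urllib.error
    import urllib.request

    try:
        content_type, body = build_multipart(["my_voice.wav"])  # illustrative filename
        req = urllib.request.Request(
            "http://localhost:8003/upload_reference",
            data=body,
            headers={"Content-Type": content_type},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            print(json.loads(resp.read().decode("utf-8")))
    except (urllib.error.URLError, OSError) as exc:
        print(f"Upload failed (missing file or server not running?): {exc}")
```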
---
## 8. Troubleshooting

* **Error: `CUDA available: False` or slow performance:**
    * Verify NVIDIA drivers are installed correctly (`nvidia-smi` command).
    * Ensure you installed the correct PyTorch version with CUDA support matching your driver (see [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)). Reinstall PyTorch using the command from the official website if unsure.
    * Check whether another process is using all of the GPU VRAM.
* **Error: `ImportError: No module named 'dac'` (or `safetensors`, `yaml`, etc.):**
    * Make sure your virtual environment is activated.
    * Run `pip install -r requirements.txt` again to install missing dependencies.
    * Specifically for `dac`, ensure you installed `descript-audio-codec` and not a different package named `dac`. Run `pip uninstall dac -y && pip install descript-audio-codec`.
* **Error: `libsndfile library not found` (or a similar `soundfile` error, mainly on Linux):**
    * Install the system library: `sudo apt update && sudo apt install libsndfile1` (Debian/Ubuntu) or the equivalent for your distribution.
* **Error: model download fails (e.g., `HTTPError`, `ConnectionError`):**
    * Check your internet connection.
    * Verify that `DIA_MODEL_REPO_ID`, `DIA_MODEL_CONFIG_FILENAME`, and `DIA_MODEL_WEIGHTS_FILENAME` in your `.env` file (or the defaults in `config.py`) are correct and accessible on the Hugging Face Hub.
    * Check Hugging Face Hub status if multiple downloads fail.
    * Ensure the cache directory (`DIA_MODEL_CACHE_PATH`) is writable.
* **Error: `RuntimeError: Failed to load DAC model...`:**
    * This usually indicates an issue with the `descript-audio-codec` installation or a version incompatibility. Ensure it is installed correctly (see the `ImportError` entry above).
    * Check the logs for specific `AttributeError` messages (such as missing `utils` or `download`), which may indicate a version mismatch between what the Dia code expects and the installed library. The current code expects `dac.utils.download()`.
* **Error: `FileNotFoundError` during generation (reference audio):**
    * Ensure the filename selected/provided for voice cloning exists in the configured `REFERENCE_AUDIO_PATH`.
    * Check that the path in `config.py` or `.env` is correct and that the server has permission to read from it.
* **Error: cannot save output/reference files (`PermissionError`, etc.):**
    * Ensure the directories specified by `OUTPUT_PATH` and `REFERENCE_AUDIO_PATH` exist and that the server process has write permissions for them.
* **Web UI issues (buttons don't work, styles missing):**
    * Clear your browser cache.
    * Check the browser's developer console (usually F12) for JavaScript errors.
    * Ensure `ui/script.js` and `ui/style.css` are being loaded correctly (check the network tab in the developer tools).
* **Generation Cancel button doesn't stop the process:**
    * This is expected ("fake cancel"). The button currently only prevents the UI from processing the result when it eventually arrives. True cancellation is complex and not implemented. Clicking "Generate" again *will* cancel the *previous UI request's result processing* before starting the new one.

---
## 9. Project Architecture

* **`server.py`:** The main entry point, built on FastAPI. Defines API routes, serves the Web UI using Jinja2, handles requests, and orchestrates calls to the engine.
* **`engine.py`:** Responsible for loading the Dia model (including downloading files via `huggingface_hub`), managing the model instance, preparing inputs for the model's `generate` method based on user requests (handling voice modes), and calling the model's generation function. Also handles post-processing such as speed adjustment.
* **`config.py`:** Manages all configuration settings using default values and overrides from a `.env` file. Provides getter functions for easy access to settings.
* **`dia/` package:** Contains the core implementation of the Dia model itself.
    * `model.py`: Defines the `Dia` class, which wraps the underlying PyTorch model (`DiaModel`). It handles loading weights (`.pth` or `.safetensors`), loading the required DAC model, preparing inputs specifically for the `DiaModel` forward pass (including CFG logic), and running the autoregressive generation loop.
    * `config.py` (within `dia/`): Defines Pydantic models representing the *structure* and hyperparameters of the Dia model architecture (encoder, decoder, data parameters). This is loaded from the `config.json` file associated with the model weights.
    * `layers.py`: Contains custom PyTorch `nn.Module` implementations used within the `DiaModel` (e.g., attention blocks, MLP blocks, RoPE).
    * `audio.py`: Includes helper functions for audio processing specific to the model's tokenization and delay patterns (e.g., `audio_to_codebook`, `codebook_to_audio`, `apply_audio_delay`).
* **`ui/` directory:** Contains all files related to the Web UI.
    * `index.html`: The main Jinja2 template.
    * `script.js`: Frontend JavaScript for interactivity, API calls, theme switching, etc.
    * `presets.yaml`: Definitions for the UI preset examples.
* **`utils.py`:** General utility functions, such as audio encoding (`encode_audio`) and saving (`save_audio_to_file`) using the `soundfile` library.
* **Dependencies:** Relies heavily on `FastAPI`, `Uvicorn`, `PyTorch`, `torchaudio`, `huggingface_hub`, `safetensors`, `descript-audio-codec`, `soundfile`, `PyYAML`, `python-dotenv`, `pydantic`, and `Jinja2`.

---
## 10. License and Disclaimer

* **License:** This project is licensed under the MIT License.
* **Disclaimer:** This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are **strictly forbidden**:
    * **Identity Misuse:** Do not produce audio resembling real individuals without permission.
    * **Deceptive Content:** Do not use this model to generate misleading content (e.g., fake news).
    * **Illegal or Malicious Use:** Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. The creators **are not responsible** for any misuse and firmly oppose any unethical usage of this technology.

---