---
title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Prerequisites](#prerequisites)
3. [Environment Setup](#environment-setup)
4. [Starting the Server](#starting-the-server)
   - [Option A — Docker (Recommended)](#option-a--docker-recommended)
   - [Option B — Local Development (no Docker)](#option-b--local-development-no-docker)
5. [API Reference](#api-reference)
   - [GET /health](#get-health)
   - [POST /api/v1/transcribe](#post-apiv1transcribe)
   - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
   - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
   - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
6. [Error Codes](#error-codes)
7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
8. [Training Pipeline](#training-pipeline)

---

## Project Overview

This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.

**Stack:**

- **Inference:** Whisper (HuggingFace Transformers) + Silero VAD
- **Post-processing:** Google Gemini (speaker diarisation, entity extraction, call analysis)
- **API:** FastAPI + Uvicorn
- **Reverse proxy:** Nginx
- **Container:** Docker + Docker Compose

---

## Prerequisites

### For Docker deployment (recommended)

| Requirement | Version |
| --- | --- |
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |

### For local development (no Docker)

| Requirement | Version |
| --- | --- |
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |

---

## Environment Setup

**Step 1 — Copy the example environment file:**

```bash
cp .env.example .env
```

**Step 2 — Open `.env` and fill in your values:**

```env
# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model

# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models

# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda

# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash
```

**Key variables explained:**

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `MODEL_PATH` | Yes | `/models/merged_model` | Path **inside the container** to the Whisper model directory |
| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the **host machine** that gets mounted into the container as `/models` |
| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
| `GEMINI_API_KEY` | For AI endpoints | — | Google Gemini API key |
| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |

> **Note:** If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503 Service Unavailable`.
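
The table above can be read as a small settings loader. The sketch below is an illustration only, not the project's actual configuration code: the `Settings` class and `load_settings` function are hypothetical names, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables documented above."""
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]  # None means the Gemini endpoints are unavailable
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the defaults from the table."""
    device = env.get("DEVICE", "").strip().lower() or None
    if device not in (None, "cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda', 'cpu', or blank, got {device!r}")
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=device,
        gemini_api_key=env.get("GEMINI_API_KEY", "").strip() or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check, for example `load_settings({"DEVICE": "cpu"}).gemini_enabled` is `False`.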

---

## Starting the Server

### Option A — Docker (Recommended)

This runs FastAPI behind an Nginx reverse proxy, with GPU support.

**Step 1 — Make sure `.env` is configured** (see [Environment Setup](#environment-setup) above).

**Step 2 — Build and start all services:**

```bash
docker compose up --build -d
```

This will:
1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
2. Start the `stt-api` container (FastAPI on port 8000 internally)
3. Start the `stt-nginx` container (Nginx on port **80** externally)
4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)

**Step 3 — Verify the server is healthy:**

```bash
curl http://localhost/health
```

Expected response when ready:
```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

If `whisper_loaded` is `false`, the model is still loading or failed to load. Check the container logs:

```bash
docker compose logs api
```

**Step 4 — Send your first request:**

```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@/path/to/your/audio.mp3"
```

---

**Useful Docker commands:**

```bash
# View live logs
docker compose logs -f api

# Stop all services
docker compose down

# Restart after a code change (rebuild image)
docker compose up --build -d

# Check container status
docker compose ps
```

---

**CPU-only deployment:**

If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:

```yaml
# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.

---

### Option B — Local Development (no Docker)

**Step 1 — Install system dependencies:**

On Ubuntu/Debian:
```bash
sudo apt-get install -y ffmpeg libsndfile1
```

On macOS (Homebrew):
```bash
brew install ffmpeg libsndfile
```

On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.

**Step 2 — Create and activate a virtual environment:**

```bash
python -m venv .venv
# Activate it with the one line that matches your platform:
source .venv/bin/activate        # Linux/macOS
.venv\Scripts\activate           # Windows
```

**Step 3 — Install API dependencies:**

```bash
pip install -r requirements-api.txt
```

**Step 4 — Create your `.env` file** (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:

```env
MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here
```

**Step 5 — Start the server:**

```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
```

The server will be available at `http://localhost:8000`.

> Remove `--reload` in production: it restarts the server whenever a source file changes and is intended for development only.

**Step 6 — Verify:**

```bash
curl http://localhost:8000/health
```

---

## API Reference

All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.

**Supported audio formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`

**Maximum file size:** 200 MB

**Base URL:**
- Docker deployment: `http://localhost` (port 80, via Nginx)
- Local development: `http://localhost:8000`
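
Checking the format and size limits on the client avoids uploading a file the server will reject. The following is a minimal sketch, assuming the limit is counted in decimal megabytes (the conservative reading of "200 MB"); `check_upload` and the two constants are hypothetical names, not part of the API.

```python
from pathlib import Path

# Extensions listed under "Supported audio formats" above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
# Assumption: 200 MB = 200,000,000 bytes. The server may count in MiB instead.
MAX_UPLOAD_BYTES = 200 * 1000 * 1000


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension {(ext or '(none)')!r}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, over the {MAX_UPLOAD_BYTES}-byte limit"
        )
    return problems
```

The check is only a convenience; the server remains the authority and still returns `413` or `422` for files it does not accept.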

---

### GET /health

Check the server status and which services are loaded.

**Request:**
```bash
curl http://localhost/health
```

**Response `200 OK`:**
```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
| `model_path` | `string` | The model path the server loaded from |
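
Because the model can take a minute or two to load, a client or deployment script may want to poll `/health` before sending audio. This is a minimal sketch using only the standard library; `fetch_health` and `wait_until_ready` are hypothetical helpers, and the timeout and interval values are arbitrary choices.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until status is "ok" and Whisper is loaded; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    last_error = None
    while True:
        try:
            health = fetch()
            if health.get("status") == "ok" and health.get("whisper_loaded"):
                return health
            last_error = f"status={health.get('status')!r}"
        except OSError as exc:  # connection refused, timeouts, HTTP errors
            last_error = str(exc)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server not ready after {timeout_s}s ({last_error})")
        sleep(interval_s)
```

The returned dictionary still carries `gemini_available`, so the caller can decide whether the Gemini-backed endpoints are worth calling.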

---

### POST /api/v1/transcribe

Transcribe an audio file using Whisper only. No post-processing is applied — returns raw Arabic text directly from the model.

**When to use:** You need a fast transcript and do not need speaker labels or error correction.

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Arabic text from Whisper |
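
The same request can be made from Python. The sketch below builds the `multipart/form-data` body by hand with the standard library so it has no extra dependencies; `build_multipart` and `transcribe` are hypothetical helper names, and the 600-second timeout is an arbitrary assumption for long recordings.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path: str, base_url: str = "http://localhost",
               endpoint: str = "/api/v1/transcribe") -> dict:
    """POST the file in the `audio` field and return the decoded JSON response."""
    path = Path(audio_path)
    body, content_type = build_multipart("audio", path.name, path.read_bytes())
    req = urllib.request.Request(
        base_url + endpoint, data=body,
        headers={"Content-Type": content_type}, method="POST",
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)
```

Calling `transcribe("recording.mp3")["transcript"]` would return the raw text. The helper reads the whole file into memory and does not escape quotes in filenames, which is acceptable for a sketch but worth revisiting for large uploads. Passing a different `endpoint` reuses it for the other transcription routes.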

---

### POST /api/v1/transcribe/autocorrect

Transcribe with Whisper, then send the raw transcript to Gemini for **phonetic and orthographic correction only**. No speaker labels are added — returns a single continuous Arabic text.

**When to use:** You need clean, corrected Arabic text but do not care who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/autocorrect \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |

---

### POST /api/v1/transcribe/corrected

Transcribe with Whisper, then send the transcript to Gemini, which returns a **speaker-separated, phonetically corrected** version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).

**When to use:** You need a clean, readable transcript that shows who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/corrected \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |
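
Downstream code usually wants the speaker-labelled text as a list of turns. The sketch below assumes the format shown in the sample response (one `SPEAKER_NN: text` turn per line); Gemini's output may not always follow it exactly, so unlabelled lines are appended to the previous turn. `Turn` and `split_turns` are hypothetical names.

```python
import re
from typing import NamedTuple

# Role mapping documented for this endpoint.
ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
_TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


class Turn(NamedTuple):
    speaker: str
    role: str
    text: str


def split_turns(corrected_transcript: str) -> list[Turn]:
    """Split a speaker-labelled transcript into a list of turns."""
    turns: list[Turn] = []
    for line in corrected_transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _TURN.match(line)
        if match:
            speaker, text = match.groups()
            turns.append(Turn(speaker, ROLES.get(speaker, "Unknown"), text))
        elif turns:
            # Continuation line without a label: attach it to the previous turn.
            prev = turns[-1]
            turns[-1] = prev._replace(text=f"{prev.text} {line}")
    return turns
```

Text that appears before the first label is dropped by this sketch; a stricter parser could raise an error instead.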

---

### POST /api/v1/transcribe/analyze

Transcribes the audio, then runs a full **Gemini call analysis** that extracts structured information from the conversation: a speaker-labelled transcript, named entities (agent, customer, unit, project), a satisfaction score, and follow-up items.

**When to use:** You want a complete picture of the call — who spoke, what happened, what needs follow-up.

**Requires:** `GEMINI_API_KEY`

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/analyze \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}
```

**Response fields:**

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
| `customer_satisfaction` | `integer` | Satisfaction score **1–5** inferred from tone (1 = very unhappy, 5 = very happy) |
| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
| `pain_points` | `string[]` | List of issues or complaints mentioned |
| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
| `next_steps` | `string[]` | Follow-up actions identified |
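
A typed wrapper makes the response easier to consume and catches values outside the documented ranges. This is a sketch built from the field table above, not the server's own schema; `CallAnalysis` and `parse_analysis` are hypothetical names.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallAnalysis:
    audio_filename: str
    transcript: str
    cleaned_transcript: str
    call_type: str
    customer_satisfaction: int
    is_urgent: bool
    agent_name: Optional[str] = None
    customer_name: Optional[str] = None
    project_name: Optional[str] = None
    department_mentioned: Optional[str] = None
    unit_number: list[str] = field(default_factory=list)
    pain_points: list[str] = field(default_factory=list)
    action_items_promised: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the value ranges stated in the field table.
        if self.call_type not in ("Inbound", "Outbound"):
            raise ValueError(f"unexpected call_type {self.call_type!r}")
        if not 1 <= self.customer_satisfaction <= 5:
            raise ValueError(
                f"customer_satisfaction out of range: {self.customer_satisfaction}"
            )


def parse_analysis(payload: dict) -> CallAnalysis:
    """Build a CallAnalysis from the decoded JSON, ignoring unknown keys."""
    known = {f.name for f in dataclasses.fields(CallAnalysis)}
    return CallAnalysis(**{k: v for k, v in payload.items() if k in known})
```

Ignoring unknown keys keeps the client working if the server later adds fields; a missing required field raises `TypeError`.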

---

## Error Codes

| Code | Meaning | How to fix |
| --- | --- | --- |
| `200` | Success | — |
| `413` | File exceeds 200 MB limit | Compress or trim the audio |
| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
| `503` | Service unavailable | Whisper did not load, or `GEMINI_API_KEY` is not set for a Gemini endpoint. Check the logs and `.env` |
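
A client can turn the table into a small lookup. In the sketch below the hints come from the table, while the `retryable` flags are an assumption (a `502` or `503` may clear once Gemini is reachable or the model finishes loading, but not if the API key is missing). `Outcome` and `interpret_status` are hypothetical names.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    ok: bool
    retryable: bool  # assumption: whether trying again later might succeed
    hint: str


_OUTCOMES = {
    200: Outcome(True, False, "success"),
    413: Outcome(False, False, "file exceeds the 200 MB limit; compress or trim the audio"),
    422: Outcome(False, False, "unsupported audio format"),
    500: Outcome(False, False, "Whisper transcription failed; check the server logs"),
    502: Outcome(False, True, "Gemini call failed; check GEMINI_API_KEY and network access"),
    503: Outcome(False, True, "Whisper or Gemini not initialised; check the logs"),
}


def interpret_status(code: int) -> Outcome:
    """Map an HTTP status code from the API to a suggested client reaction."""
    return _OUTCOMES.get(code, Outcome(False, False, f"unexpected status {code}"))
```

A caller might retry with a delay only when `interpret_status(code).retryable` is true and surface the hint to the user otherwise.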

---

## Interactive Docs (Swagger UI)

FastAPI automatically generates interactive API documentation.

| URL | Description |
| --- | --- |
| `http://localhost/docs` | Swagger UI — try endpoints directly in the browser |
| `http://localhost/redoc` | ReDoc — clean, readable reference |
| `http://localhost/openapi.json` | Raw OpenAPI schema (JSON) |

> For local development (no Docker), replace `localhost` with `localhost:8000`.

---

## Training Pipeline

### Project structure

```
.
├── config/
│   └── training_config.yaml    # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/              ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/        ← matching .txt transcript files (same filename stem)
│   └── processed/              ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py ← run once to import files from project root
│   ├── prepare_data.py         ← step 1: build dataset
│   ├── train.py                ← step 2: fine-tune
│   └── transcribe.py           ← step 3: run inference CLI
├── api/                        ← FastAPI server
├── nginx/                      ← Nginx config
├── Dockerfile
└── docker-compose.yml
```

### Transcript format

Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):

```
0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
```
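
Validating transcripts before running the pipeline catches formatting mistakes early. The following is a minimal sketch of a parser for this format, not the code in `parse_transcripts.py`; `Entry` and `parse_transcript` are hypothetical names, and the requirement that timestamps never decrease is an assumption.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    start: float  # seconds from the beginning of the audio
    text: str


def parse_transcript(text: str) -> list[Entry]:
    """Parse 'seconds: text' lines into entries, raising ValueError on bad input."""
    entries: list[Entry] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        # Split at the first colon only, so colons inside the text are kept.
        stamp, sep, body = line.partition(":")
        if not sep:
            raise ValueError(f"line {lineno}: missing ':' separator")
        try:
            start = float(stamp)
        except ValueError:
            raise ValueError(f"line {lineno}: bad timestamp {stamp!r}") from None
        if entries and start < entries[-1].start:
            raise ValueError(f"line {lineno}: timestamps must not decrease")
        entries.append(Entry(start, body.strip()))
    return entries
```

Running it over each file in `data/raw/transcripts/` before Step 3 reports the offending line number instead of failing partway through dataset preparation.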

### Step 1 — Install dependencies

```bash
pip install -r requirements.txt
```

### Step 2 — Add your data

Option A — files already in the project root:
```bash
python scripts/import_existing_data.py
```

Option B — place files directly:
- Copy audio → `data/raw/audio/my_file.mp3`
- Copy transcript → `data/raw/transcripts/my_file.txt` *(same stem)*

### Step 3 — Prepare the dataset

```bash
python scripts/prepare_data.py
```

Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace `DatasetDict` saved to `data/processed/`.
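
One way to picture the segmentation step is a greedy grouping of consecutive transcript entries until the next one would push the segment past the limit. This sketch illustrates the idea only and is not necessarily how `segment_audio.py` works; `Segment` and `group_segments` are hypothetical names, and it assumes each entry ends where the next one starts.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float
    end: float
    text: str


def group_segments(entries: list[tuple[float, str]], audio_duration: float,
                   max_duration: float = 25.0) -> list[Segment]:
    """Greedily merge consecutive (start, text) entries into segments."""
    # Entry k spans bounds[k]..bounds[k + 1]; the last entry ends with the audio.
    bounds = [start for start, _ in entries] + [audio_duration]
    segments: list[Segment] = []
    i = 0
    while i < len(entries):
        j = i + 1
        # Keep adding entry j while the span stays within the limit.
        while j < len(entries) and bounds[j + 1] - bounds[i] <= max_duration:
            j += 1
        text = " ".join(t for _, t in entries[i:j])
        # A single entry longer than max_duration is emitted as-is (oversized).
        segments.append(Segment(bounds[i], bounds[j], text))
        i = j
    return segments
```

With entries every 10 seconds in a 50-second file, this yields segments covering 0 to 20, 20 to 40, and 40 to 50 seconds.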

### Step 4 — Fine-tune

```bash
python scripts/train.py

# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500
```

### Step 5 — Transcribe via CLI

```bash
# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3

# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3

# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt
```

### Adding more data later

1. Add new pairs with matching stems: audio in `data/raw/audio/` and transcripts in `data/raw/transcripts/` (for example `my_file.mp3` + `my_file.txt`).
2. Re-run `python scripts/prepare_data.py` — rebuilds everything from scratch.
3. Re-run `python scripts/train.py`.

### Configuration

Edit `config/training_config.yaml` to change:
- `model.base_model` — swap to `openai/whisper-medium` for faster training
- `training.per_device_train_batch_size` — reduce if out of GPU memory
- `training.fp16: false` — disable on CPU or older GPUs
- `data.max_segment_duration` — segment length (max 30 s for Whisper)

### GPU requirements

| Model | Min VRAM | Recommended |
| --- | --- | --- |
| whisper-large-v3 | 16 GB | 24 GB or more (e.g. A10 24 GB, A100 40/80 GB) |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |

Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.