---
title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.
---
## Table of Contents
1. [Project Overview](#project-overview)
2. [Prerequisites](#prerequisites)
3. [Environment Setup](#environment-setup)
4. [Starting the Server](#starting-the-server)
   - [Option A – Docker (Recommended)](#option-a--docker-recommended)
   - [Option B – Local Development (no Docker)](#option-b--local-development-no-docker)
5. [API Reference](#api-reference)
   - [GET /health](#get-health)
   - [POST /api/v1/transcribe](#post-apiv1transcribe)
   - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
   - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
   - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
6. [Error Codes](#error-codes)
7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
8. [Training Pipeline](#training-pipeline)
---
## Project Overview
This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.
**Stack:**
- **Inference:** Whisper (HuggingFace Transformers) + Silero VAD
- **Post-processing:** Google Gemini (speaker diarisation, entity extraction, call analysis)
- **API:** FastAPI + Uvicorn
- **Reverse proxy:** Nginx
- **Container:** Docker + Docker Compose
---
## Prerequisites
### For Docker deployment (recommended)
| Requirement | Version |
| --- | --- |
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |
### For local development (no Docker)
| Requirement | Version |
| --- | --- |
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |
---
## Environment Setup
**Step 1 – Copy the example environment file:**
```bash
cp .env.example .env
```
**Step 2 – Open `.env` and fill in your values:**
```env
# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model
# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models
# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda
# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash
```
**Key variables explained:**
| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `MODEL_PATH` | Yes | `/models/merged_model` | Path **inside the container** to the Whisper model directory |
| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the **host machine** that gets mounted into the container as `/models` |
| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
| `GEMINI_API_KEY` | For AI endpoints | – | Google Gemini API key |
| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |
> **Note:** If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503 Service Unavailable`.
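The variables above could be read and validated at start-up roughly as follows. This is a minimal sketch, not the service's actual configuration code: `Settings`, `load_settings`, and the `cuda_available` flag are hypothetical names, and the auto-detect rule (use `cuda` when a GPU is present) is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: str
    gemini_api_key: Optional[str]
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        # Without a key, the Gemini-backed endpoints cannot work.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str], cuda_available: bool = False) -> Settings:
    """Build Settings from an environment mapping, applying the documented defaults."""
    device = env.get("DEVICE", "").strip().lower()
    if not device:
        # Blank DEVICE means auto-detect (assumed rule: use the GPU if present).
        device = "cuda" if cuda_available else "cpu"
    if device not in ("cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda' or 'cpu', got {device!r}")
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=device,
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

In a real process the mapping would typically be `os.environ`; passing it in explicitly keeps the function easy to test.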
---
## Starting the Server
### Option A – Docker (Recommended)
This runs FastAPI behind an Nginx reverse proxy, with GPU support.
**Step 1 – Make sure `.env` is configured** (see [Environment Setup](#environment-setup) above).
**Step 2 – Build and start all services:**
```bash
docker compose up --build -d
```
This will:
1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
2. Start the `stt-api` container (FastAPI on port 8000 internally)
3. Start the `stt-nginx` container (Nginx on port **80** externally)
4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)
**Step 3 – Verify the server is healthy:**
```bash
curl http://localhost/health
```
Expected response when ready:
```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```
If `whisper_loaded` is `false`, the model failed to load. Check the container logs:
```bash
docker compose logs api
```
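Because the model can take a minute or two to load, a deployment script may prefer to poll `/health` instead of checking once. The following is a minimal sketch using only the standard library; `fetch_health` and `wait_until_ready` are hypothetical helper names, and the 180 s timeout and 5 s interval are arbitrary assumptions.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str = "http://localhost/health") -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until `whisper_loaded` is true; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            # Connection refused or HTTP errors while the containers start.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("model did not load in time; check `docker compose logs api`")
        sleep(interval_s)
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running server.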
**Step 4 – Send your first request:**
```bash
curl -X POST http://localhost/api/v1/transcribe \
-F "audio=@/path/to/your/audio.mp3"
```
---
**Useful Docker commands:**
```bash
# View live logs
docker compose logs -f api
# Stop all services
docker compose down
# Restart after a code change (rebuild image)
docker compose up --build -d
# Check container status
docker compose ps
```
---
**CPU-only deployment:**
If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:
```yaml
# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```
Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.
---
### Option B – Local Development (no Docker)
**Step 1 – Install system dependencies:**
On Ubuntu/Debian:
```bash
sudo apt-get install -y ffmpeg libsndfile1
```
On macOS (Homebrew):
```bash
brew install ffmpeg libsndfile
```
On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.
**Step 2 – Create and activate a virtual environment:**
```bash
python -m venv .venv
# Activate it: run only the line for your platform
source .venv/bin/activate   # Linux/macOS
.venv\Scripts\activate      # Windows
```
**Step 3 – Install API dependencies:**
```bash
pip install -r requirements-api.txt
```
**Step 4 – Create your `.env` file** (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:
```env
MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here
```
**Step 5 – Start the server:**
```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
```
The server will be available at `http://localhost:8000`.
> Remove `--reload` in production: it restarts the server whenever a source file changes and is intended for development only.
**Step 6 – Verify:**
```bash
curl http://localhost:8000/health
```
---
## API Reference
All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.
**Supported audio formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`
**Maximum file size:** 200 MB
**Base URL:**
- Docker deployment: `http://localhost` (port 80, via Nginx)
- Local development: `http://localhost:8000`
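A client can check these limits before uploading and so avoid a wasted round trip. The sketch below is an illustration only: `check_upload` is a hypothetical helper, and it assumes "200 MB" means 200 × 1024 × 1024 bytes, which the server may define differently.

```python
from pathlib import PurePath

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 200 * 1024 * 1024  # assumption: 200 MB interpreted as 200 MiB


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected by the documented limits."""
    ext = PurePath(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; the limit is {MAX_BYTES}")
```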
---
### GET /health
Check the server status and which services are loaded.
**Request:**
```bash
curl http://localhost/health
```
**Response `200 OK`:**
```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```
| Field | Type | Description |
| --- | --- | --- |
| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
| `model_path` | `string` | The model path the server loaded from |
---
### POST /api/v1/transcribe
Transcribe an audio file using Whisper only. No post-processing is applied; the endpoint returns the raw Arabic text directly from the model.
**When to use:** You need a fast transcript and do not need speaker labels or error correction.
**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe \
-F "audio=@recording.mp3"
```
**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}
```
| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Arabic text from Whisper |
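From Python, the same request can be sent without third-party packages. This is a minimal sketch, not an official client: `encode_multipart` and `transcribe` are hypothetical helpers, the part is sent as `application/octet-stream` on the assumption that the server decides by file extension, and filenames containing quotes or non-ASCII characters are not handled.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(field: str, filename: str, data: bytes, boundary: str) -> bytes:
    """Encode a single file as a multipart/form-data request body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail


def transcribe(path: str, base_url: str = "http://localhost") -> dict:
    """POST an audio file to /api/v1/transcribe and return the decoded JSON."""
    boundary = uuid.uuid4().hex
    audio = Path(path)
    body = encode_multipart("audio", audio.name, audio.read_bytes(), boundary)
    request = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `transcribe("recording.mp3")["transcript"]` would then yield the raw text, assuming the server is reachable at `base_url`.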
---
### POST /api/v1/transcribe/autocorrect
Transcribe with Whisper, then send the raw transcript to Gemini for **phonetic and orthographic correction only**. No speaker labels are added โ€” returns a single continuous Arabic text.
**When to use:** You need clean, corrected Arabic text but do not care who said what.
**Requires:** `GEMINI_API_KEY`
**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/autocorrect \
-F "audio=@recording.mp3"
```
**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}
```
| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |
---
### POST /api/v1/transcribe/corrected
Transcribe with Whisper, then send the transcript to Gemini, which returns a **speaker-separated, phonetically corrected** version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).
**When to use:** You need a clean, readable transcript that shows who said what.
**Requires:** `GEMINI_API_KEY`
**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/corrected \
-F "audio=@recording.mp3"
```
**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}
```
| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |
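The speaker-labelled string can be split into turns on the client side. The sketch below assumes each turn starts on its own line in the form `SPEAKER_NN: text`, as in the example response; `parse_turns` is a hypothetical helper, and unlabelled lines are treated as continuations of the previous turn.

```python
import re
from typing import List, Tuple

# Role mapping as documented for this endpoint.
SPEAKER_ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
_TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


def parse_turns(corrected_transcript: str) -> List[Tuple[str, str]]:
    """Split a speaker-labelled transcript into (role, text) pairs."""
    turns: List[Tuple[str, str]] = []
    for line in corrected_transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _TURN.match(line)
        if match:
            label, text = match.group(1), match.group(2)
            # Unknown labels are kept as-is rather than dropped.
            turns.append((SPEAKER_ROLES.get(label, label), text))
        elif turns:
            # No label: treat the line as a continuation of the previous turn.
            role, text = turns[-1]
            turns[-1] = (role, f"{text} {line}")
    return turns
```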
---
### POST /api/v1/transcribe/analyze
The most comprehensive endpoint. Transcribes the audio, then runs a full **Gemini call analysis** that extracts structured information from the conversation.
**When to use:** You want a complete picture of the call: who spoke, what happened, and what needs follow-up.
**Requires:** `GEMINI_API_KEY`
**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/analyze \
-F "audio=@recording.mp3"
```
**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}
```
**Response fields:**
| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
| `customer_satisfaction` | `integer` | Satisfaction score **1–5** inferred from tone (1 = very unhappy, 5 = very happy) |
| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
| `pain_points` | `string[]` | List of issues or complaints mentioned |
| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
| `next_steps` | `string[]` | Follow-up actions identified |
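Since these fields are produced by a language model, a consumer may want to sanity-check them before acting on them. The following is a minimal sketch under that assumption; `validate_analysis` and `needs_follow_up` are hypothetical helpers, and the follow-up rule is an illustrative triage policy rather than part of the API.

```python
from typing import Any, Mapping

LIST_FIELDS = ("unit_number", "pain_points", "action_items_promised", "next_steps")


def validate_analysis(analysis: Mapping[str, Any]) -> None:
    """Raise ValueError if a field violates the documented type or range."""
    score = analysis["customer_satisfaction"]
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"customer_satisfaction must be an integer 1-5, got {score!r}")
    if analysis["call_type"] not in ("Inbound", "Outbound"):
        raise ValueError(f"unexpected call_type: {analysis['call_type']!r}")
    for key in LIST_FIELDS:
        if not isinstance(analysis[key], list):
            raise ValueError(f"{key} must be a list")


def needs_follow_up(analysis: Mapping[str, Any]) -> bool:
    """Illustrative rule: flag urgent calls and low satisfaction scores."""
    return bool(analysis["is_urgent"]) or analysis["customer_satisfaction"] <= 2
```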
---
## Error Codes
| Code | Meaning | How to fix |
| --- | --- | --- |
| `200` | Success | – |
| `413` | File exceeds 200 MB limit | Compress or trim the audio |
| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
| `503` | Model not loaded | Whisper or Gemini did not initialise. Check the logs and, for Gemini endpoints, that `GEMINI_API_KEY` is set |
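A client might translate these status codes into actionable messages along the following lines. This is a sketch only: `explain` is a hypothetical helper, and treating `502` and `503` as retryable is an assumption about typical usage, not documented server behaviour.

```python
from typing import Tuple

# Meanings and fixes as listed in the table above.
ERROR_HINTS = {
    413: "File exceeds the 200 MB limit: compress or trim the audio.",
    422: "Unsupported audio format: use .wav, .mp3, .m4a, .flac, .ogg, or .webm.",
    500: "Whisper transcription failed: check the server logs.",
    502: "Gemini call failed: check GEMINI_API_KEY and network access to Google APIs.",
    503: "Model not loaded: Whisper or Gemini did not initialise.",
}
# Assumption: only upstream and availability errors are worth retrying.
RETRYABLE = frozenset({502, 503})


def explain(status: int) -> Tuple[str, bool]:
    """Return (message, should_retry) for an HTTP status code."""
    if 200 <= status < 300:
        return ("Success", False)
    hint = ERROR_HINTS.get(status, f"Unexpected status {status}.")
    return (hint, status in RETRYABLE)
```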
---
## Interactive Docs (Swagger UI)
FastAPI automatically generates interactive API documentation.
| URL | Description |
| --- | --- |
| `http://localhost/docs` | Swagger UI: try endpoints directly in the browser |
| `http://localhost/redoc` | ReDoc: clean, readable reference |
| `http://localhost/openapi.json` | Raw OpenAPI schema (JSON) |
> For local development (no Docker), replace `localhost` with `localhost:8000`.
---
## Training Pipeline
### Project structure
```
.
├── config/
│   └── training_config.yaml       # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/                 ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/           ← matching .txt transcript files (same filename stem)
│   └── processed/                 ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py    ← run once to import files from project root
│   ├── prepare_data.py            ← step 1: build dataset
│   ├── train.py                   ← step 2: fine-tune
│   └── transcribe.py              ← step 3: run inference CLI
├── api/                           ← FastAPI server
├── nginx/                         ← Nginx config
├── Dockerfile
└── docker-compose.yml
```
### Transcript format
Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):
```
0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
```
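A file in this format can be read with a few lines of Python. The sketch below is not the project's `parse_transcripts.py`; `parse_transcript` is a hypothetical helper, and rejecting decreasing timestamps is an added assumption about what counts as a valid file.

```python
import re
from typing import List, Tuple

# "<seconds as float>: <text>", one entry per line.
_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*:\s*(.*\S)\s*$")


def parse_transcript(text: str) -> List[Tuple[float, str]]:
    """Parse timestamped lines into (start_seconds, text) pairs."""
    entries: List[Tuple[float, str]] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue  # ignore blank lines
        match = _LINE.match(raw)
        if match is None:
            raise ValueError(f"line {lineno}: expected '<seconds>: <text>'")
        start = float(match.group(1))
        if entries and start < entries[-1][0]:
            raise ValueError(f"line {lineno}: timestamps must not decrease")
        entries.append((start, match.group(2)))
    return entries
```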
### Step 1 – Install dependencies
```bash
pip install -r requirements.txt
```
### Step 2 – Add your data
Option A – files already in the project root:
```bash
python scripts/import_existing_data.py
```
Option B – place files directly:
- Copy audio → `data/raw/audio/my_file.mp3`
- Copy transcript → `data/raw/transcripts/my_file.txt` *(same stem)*
### Step 3 – Prepare the dataset
```bash
python scripts/prepare_data.py
```
This splits the audio into WAV segments of at most 25 seconds, aligned to the transcript timestamps, then builds a HuggingFace `DatasetDict` and saves it to `data/processed/`.
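One way to picture the segmentation is greedy packing of consecutive transcript entries into windows no longer than the limit. The sketch below illustrates that idea only and is not necessarily how `segment_audio.py` works: `group_segments` is a hypothetical function, each entry is assumed to end where the next one starts, and a single entry longer than the limit is left as one over-long segment.

```python
from typing import List, Tuple

Entry = Tuple[float, str]            # (start_seconds, text)
Segment = Tuple[float, float, str]   # (start_seconds, end_seconds, text)


def group_segments(
    entries: List[Entry], audio_duration: float, max_len: float = 25.0
) -> List[Segment]:
    """Greedily merge consecutive entries into segments of at most max_len seconds."""
    segments: List[Segment] = []
    for i, (start, text) in enumerate(entries):
        # An entry ends where the next begins; the last one ends with the audio.
        end = entries[i + 1][0] if i + 1 < len(entries) else audio_duration
        if segments and end - segments[-1][0] <= max_len:
            seg_start, _, seg_text = segments[-1]
            segments[-1] = (seg_start, end, f"{seg_text} {text}")
        else:
            segments.append((start, end, text))
    return segments
```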
### Step 4 – Fine-tune
```bash
python scripts/train.py
# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500
```
### Step 5 – Transcribe via CLI
```bash
# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3
# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3
# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt
```
### Adding more data later
1. Drop each new audio and transcript pair (for example `call.mp3` and `call.txt`) into `data/raw/audio/` and `data/raw/transcripts/` respectively.
2. Re-run `python scripts/prepare_data.py`, which rebuilds everything from scratch.
3. Re-run `python scripts/train.py`.
### Configuration
Edit `config/training_config.yaml` to change:
- `model.base_model` – swap to `openai/whisper-medium` for faster training
- `training.per_device_train_batch_size` – reduce if out of GPU memory
- `training.fp16: false` – disable on CPU or older GPUs
- `data.max_segment_duration` – segment length (max 30 s for Whisper)
### GPU requirements
| Model | Min VRAM | Recommended |
| --- | --- | --- |
| whisper-large-v3 | 16 GB | 24 GB or more (e.g. A10, A100) |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |
Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.