---
title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Prerequisites](#prerequisites)
3. [Environment Setup](#environment-setup)
4. [Starting the Server](#starting-the-server)
   - [Option A – Docker (Recommended)](#option-a--docker-recommended)
   - [Option B – Local Development (no Docker)](#option-b--local-development-no-docker)
5. [API Reference](#api-reference)
   - [GET /health](#get-health)
   - [POST /api/v1/transcribe](#post-apiv1transcribe)
   - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
   - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
   - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
6. [Error Codes](#error-codes)
7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
8. [Training Pipeline](#training-pipeline)

---
## Project Overview

This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.

**Stack:**

- **Inference:** Whisper (HuggingFace Transformers) + Silero VAD
- **Post-processing:** Google Gemini (speaker diarisation, entity extraction, call analysis)
- **API:** FastAPI + Uvicorn
- **Reverse proxy:** Nginx
- **Container:** Docker + Docker Compose

---
## Prerequisites

### For Docker deployment (recommended)

| Requirement | Version |
| --- | --- |
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |

### For local development (no Docker)

| Requirement | Version |
| --- | --- |
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |

---
## Environment Setup

**Step 1 – Copy the example environment file:**

```bash
cp .env.example .env
```

**Step 2 – Open `.env` and fill in your values:**

```env
# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model
# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models
# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda
# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash
```

**Key variables explained:**

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `MODEL_PATH` | Yes | `/models/merged_model` | Path **inside the container** to the Whisper model directory |
| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the **host machine** that gets mounted into the container as `/models` |
| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
| `GEMINI_API_KEY` | For AI endpoints | none | Google Gemini API key |
| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |

> **Note:** If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503 Service Unavailable`.

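
The variables above can be mirrored in code when writing tooling around the service. The following is a minimal sketch, not the service's actual configuration loader: `Settings` and `load_settings` are hypothetical names, and the defaults are simply copied from the table. A blank or missing `GEMINI_API_KEY` is treated as "Gemini endpoints disabled", matching the note above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]  # None disables the Gemini endpoints
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        return self.gemini_api_key is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    device = env.get("DEVICE", "").strip().lower() or None
    if device not in (None, "cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda', 'cpu', or blank, got {device!r}")
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=device,
        gemini_api_key=env.get("GEMINI_API_KEY", "").strip() or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```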

---

## Starting the Server

### Option A – Docker (Recommended)

This runs FastAPI behind an Nginx reverse proxy, with GPU support.

**Step 1 – Make sure `.env` is configured** (see [Environment Setup](#environment-setup) above).

**Step 2 – Build and start all services:**

```bash
docker compose up --build -d
```

This will:

1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
2. Start the `stt-api` container (FastAPI on port 8000 internally)
3. Start the `stt-nginx` container (Nginx on port **80** externally)
4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60 to 120 s to load)

**Step 3 – Verify the server is healthy:**

```bash
curl http://localhost/health
```

Expected response when ready:

```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

If `whisper_loaded` is `false`, the model failed to load; check the container logs:

```bash
docker compose logs api
```
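
Because the model can take a minute or two to load, scripts that call the API right after `docker compose up` may want to poll `/health` first. The sketch below uses only the standard library; `fetch_health` and `wait_until_ready` are hypothetical helper names, and the 180 s timeout and 5 s interval are arbitrary choices rather than values taken from this project. Connection errors are swallowed while waiting, since the server may not be accepting connections yet.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
) -> dict:
    """Poll until whisper_loaded is true or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            pass  # connection refused or HTTP error: server not ready yet
        if time.monotonic() >= deadline:
            raise TimeoutError("Whisper did not load in time; check the container logs")
        time.sleep(interval_s)
```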
**Step 4 – Send your first request:**

```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@/path/to/your/audio.mp3"
```

---

**Useful Docker commands:**

```bash
# View live logs
docker compose logs -f api
# Stop all services
docker compose down
# Restart after a code change (rebuild image)
docker compose up --build -d
# Check container status
docker compose ps
```

---

**CPU-only deployment:**

If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:

```yaml
# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.

---
### Option B – Local Development (no Docker)

**Step 1 – Install system dependencies:**

On Ubuntu/Debian:

```bash
sudo apt-get install -y ffmpeg libsndfile1
```

On macOS (Homebrew):

```bash
brew install ffmpeg libsndfile
```

On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.

**Step 2 – Create and activate a virtual environment:**

```bash
python -m venv .venv
source .venv/bin/activate   # Linux/macOS
.venv\Scripts\activate      # Windows
```

**Step 3 – Install API dependencies:**

```bash
pip install -r requirements-api.txt
```

**Step 4 – Create your `.env` file** (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:

```env
MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here
```

**Step 5 – Start the server:**

```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
```

The server will be available at `http://localhost:8000`.

> Remove `--reload` in production: it watches for file changes and is not suitable for production use.

**Step 6 – Verify:**

```bash
curl http://localhost:8000/health
```

---
## API Reference

All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.

**Supported audio formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`

**Maximum file size:** 200 MB

**Base URL:**

- Docker deployment: `http://localhost` (port 80, via Nginx)
- Local development: `http://localhost:8000`

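
A client can check these limits before uploading and so avoid a round trip that would end in a `413` or `422`. This is a minimal client-side sketch: `check_upload` is a hypothetical helper, and it assumes "200 MB" means 200 × 1024 × 1024 bytes, which the server may or may not interpret the same way.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" is read as 200 MiB


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file falls outside the documented limits."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; the limit is {MAX_BYTES}")
```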

---

### GET /health

Check the server status and which services are loaded.

**Request:**

```bash
curl http://localhost/health
```

**Response `200 OK`:**

```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
| `model_path` | `string` | The model path the server loaded from |

---
### POST /api/v1/transcribe

Transcribe an audio file using Whisper only. No post-processing is applied; the endpoint returns the raw Arabic text directly from the model.

**When to use:** You need a fast transcript and do not need speaker labels or error correction.

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Arabic text from Whisper |

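
The same request can be made from Python without third-party packages. The sketch below encodes the `audio` field by hand with the standard library; `build_multipart` and `transcribe` are hypothetical helper names, and the default base URL assumes the Docker deployment. In practice a library such as `requests` or `httpx` would shorten this considerably.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> dict:
    """POST the file to /api/v1/transcribe and return the decoded JSON."""
    with open(path, "rb") as f:
        body, content_type = build_multipart("audio", os.path.basename(path), f.read())
    req = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```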

---

### POST /api/v1/transcribe/autocorrect

Transcribe with Whisper, then send the raw transcript to Gemini for **phonetic and orthographic correction only**. No speaker labels are added; the endpoint returns a single continuous Arabic text.

**When to use:** You need clean, corrected Arabic text but do not care who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe/autocorrect \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |

---
### POST /api/v1/transcribe/corrected

Transcribe with Whisper, then send the transcript to Gemini, which returns a **speaker-separated, phonetically corrected** version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).

**When to use:** You need a clean, readable transcript that shows who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe/corrected \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |

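
Downstream code usually wants the speaker-labelled string as a list of turns. The sketch below assumes one `SPEAKER_xx: text` entry per line, as in the example response; `Turn` and `split_turns` are hypothetical names. Gemini output is free-form text, so the parser is deliberately forgiving: a line without a known label is appended to the previous turn instead of raising.

```python
from typing import NamedTuple

ROLES = {"SPEAKER_01": "agent", "SPEAKER_00": "customer"}


class Turn(NamedTuple):
    speaker: str
    role: str
    text: str


def split_turns(corrected_transcript: str) -> list[Turn]:
    """Split 'SPEAKER_xx: text' lines into turns."""
    turns: list[Turn] = []
    for line in corrected_transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        label, sep, text = line.partition(":")
        label = label.strip()
        if sep and label in ROLES:
            turns.append(Turn(label, ROLES[label], text.strip()))
        elif turns:
            # Unlabelled line: treat it as a continuation of the previous turn.
            prev = turns[-1]
            turns[-1] = prev._replace(text=f"{prev.text} {line}")
        else:
            turns.append(Turn("UNKNOWN", "unknown", line))
    return turns
```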

---

### POST /api/v1/transcribe/analyze

The most powerful endpoint. Transcribes the audio, then runs a full **Gemini call analysis** that extracts structured information from the conversation.

**When to use:** You want a complete picture of the call: who spoke, what happened, and what needs follow-up.

**Requires:** `GEMINI_API_KEY`

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe/analyze \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}
```

**Response fields:**

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
| `customer_satisfaction` | `integer` | Satisfaction score from **1 to 5** inferred from tone (1 = very unhappy, 5 = very happy) |
| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
| `pain_points` | `string[]` | List of issues or complaints mentioned |
| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
| `next_steps` | `string[]` | Follow-up actions identified |

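
Since these fields are produced by a language model, a consumer may want to validate the documented ranges before acting on them. The following sketch keeps only a subset of the fields; `CallSummary`, `parse_analysis`, and `needs_follow_up` are hypothetical names and not part of this API. The follow-up rule simply restates the documented urgency threshold (satisfaction ≤ 2).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallSummary:
    audio_filename: str
    call_type: str
    customer_satisfaction: int
    is_urgent: bool
    agent_name: Optional[str] = None
    customer_name: Optional[str] = None
    pain_points: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)


def parse_analysis(payload: dict) -> CallSummary:
    """Check the documented value ranges and build a typed summary."""
    score = payload["customer_satisfaction"]
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"customer_satisfaction out of range: {score!r}")
    if payload["call_type"] not in ("Inbound", "Outbound"):
        raise ValueError(f"unexpected call_type: {payload['call_type']!r}")
    return CallSummary(
        audio_filename=payload["audio_filename"],
        call_type=payload["call_type"],
        customer_satisfaction=score,
        is_urgent=bool(payload["is_urgent"]),
        agent_name=payload.get("agent_name"),
        customer_name=payload.get("customer_name"),
        pain_points=list(payload.get("pain_points") or []),
        next_steps=list(payload.get("next_steps") or []),
    )


def needs_follow_up(summary: CallSummary) -> bool:
    """Flag calls marked urgent or scored 2 or lower."""
    return summary.is_urgent or summary.customer_satisfaction <= 2
```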

---

## Error Codes

| Code | Meaning | How to fix |
| --- | --- | --- |
| `200` | Success | - |
| `413` | File exceeds 200 MB limit | Compress or trim the audio |
| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
| `503` | Model not loaded | Whisper or Gemini did not initialise; check logs |

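
A client can turn this table into actionable messages. In the sketch below, `explain` is a hypothetical helper and the hints are paraphrased from the table. Treating `502` and `503` as worth retrying is an assumption of this sketch: it fits a model that is still loading or a transient Gemini failure, but not a missing `GEMINI_API_KEY`, which no retry will fix.

```python
HINTS = {
    413: "File exceeds the 200 MB limit: compress or trim the audio.",
    422: "Unsupported audio format: use .wav, .mp3, .m4a, .flac, .ogg, or .webm.",
    500: "Whisper transcription failed: check the server logs.",
    502: "Gemini call failed: check GEMINI_API_KEY and network access to Google APIs.",
    503: "Model not loaded: Whisper or Gemini did not initialise.",
}
RETRYABLE = frozenset({502, 503})  # assumption of this sketch, not documented behaviour


def explain(status: int) -> tuple[str, bool]:
    """Return (hint, worth_retrying) for a response status code."""
    if status == 200:
        return "Success.", False
    hint = HINTS.get(status, f"Unexpected status {status}: check the server logs.")
    return hint, status in RETRYABLE
```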

---

## Interactive Docs (Swagger UI)

FastAPI automatically generates interactive API documentation.

| URL | Description |
| --- | --- |
| `http://localhost/docs` | Swagger UI: try endpoints directly in the browser |
| `http://localhost/redoc` | ReDoc: clean, readable reference |
| `http://localhost/openapi.json` | Raw OpenAPI 3.0 schema |

> For local development (no Docker), replace `localhost` with `localhost:8000`.

---
## Training Pipeline

### Project structure

```
.
├── config/
│   └── training_config.yaml          # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/                    ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/              ← matching .txt transcript files (same filename stem)
│   └── processed/                    ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py       ← run once to import files from project root
│   ├── prepare_data.py               ← step 1: build dataset
│   ├── train.py                      ← step 2: fine-tune
│   └── transcribe.py                 ← step 3: run inference CLI
├── api/                              ← FastAPI server
├── nginx/                            ← Nginx config
├── Dockerfile
└── docker-compose.yml
```
### Transcript format

Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):

```
0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
```
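
To make the format concrete, here is a minimal parser sketch. It is not the project's `parse_transcripts.py`, whose behaviour is not shown in this README; `parse_transcript` is a hypothetical name. It splits each line at the first colon, so colons inside the text are preserved, and it rejects timestamps that go backwards, which is an assumption about what a valid file looks like.

```python
def parse_transcript(text: str) -> list[tuple[float, str]]:
    """Parse '<seconds>: <text>' lines into (start_seconds, text) pairs."""
    entries: list[tuple[float, str]] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        stamp, sep, content = line.partition(":")
        if not sep:
            raise ValueError(f"line {lineno}: missing ':' separator")
        try:
            start = float(stamp)
        except ValueError:
            raise ValueError(f"line {lineno}: bad timestamp {stamp!r}") from None
        if entries and start < entries[-1][0]:
            raise ValueError(f"line {lineno}: timestamps must not decrease")
        entries.append((start, content.strip()))
    return entries
```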
### Step 1 – Install dependencies

```bash
pip install -r requirements.txt
```

### Step 2 – Add your data

Option A (files already in the project root):

```bash
python scripts/import_existing_data.py
```

Option B (place files directly):

- Copy audio → `data/raw/audio/my_file.mp3`
- Copy transcript → `data/raw/transcripts/my_file.txt` *(same stem)*

### Step 3 – Prepare the dataset

```bash
python scripts/prepare_data.py
```

Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace `DatasetDict` saved to `data/processed/`.

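
One way to picture the segmentation step is as a greedy merge of consecutive transcript entries. The sketch below is an illustration under stated assumptions, not the logic of `segment_audio.py`: it assumes each entry lasts until the next entry's timestamp (the last one until the end of the audio), and `group_segments` is a hypothetical name. It only computes time spans and merged text; cutting the audio itself is left out.

```python
def group_segments(
    entries: list[tuple[float, str]],
    audio_duration: float,
    max_len: float = 25.0,
) -> list[tuple[float, float, str]]:
    """Greedily merge consecutive entries into (start, end, text) spans of at most max_len seconds."""
    segments: list[tuple[float, float, str]] = []
    for i, (start, text) in enumerate(entries):
        # An entry ends where the next one starts, or at the end of the audio.
        end = entries[i + 1][0] if i + 1 < len(entries) else audio_duration
        if segments and end - segments[-1][0] <= max_len:
            seg_start, _, seg_text = segments[-1]
            segments[-1] = (seg_start, end, f"{seg_text} {text}")
        else:
            # Note: a single entry longer than max_len is emitted unchanged.
            segments.append((start, end, text))
    return segments
```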
### Step 4 – Fine-tune

```bash
python scripts/train.py
# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500
```

### Step 5 – Transcribe via CLI

```bash
# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3
# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3
# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt
```

### Adding more data later

1. Drop new `audio.mp3` + `audio.txt` pairs into `data/raw/audio/` and `data/raw/transcripts/` respectively.
2. Re-run `python scripts/prepare_data.py`; this rebuilds everything from scratch.
3. Re-run `python scripts/train.py`.

### Configuration

Edit `config/training_config.yaml` to change:

- `model.base_model`: swap to `openai/whisper-medium` for faster training
- `training.per_device_train_batch_size`: reduce if out of GPU memory
- `training.fp16: false`: disable on CPU or older GPUs
- `data.max_segment_duration`: segment length (max 30 s for Whisper)

### GPU requirements

| Model | Min VRAM | Recommended |
| --- | --- | --- |
| whisper-large-v3 | 16 GB | 24 GB A10/A100 |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |

Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.