---
title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Prerequisites](#prerequisites)
3. [Environment Setup](#environment-setup)
4. [Starting the Server](#starting-the-server)
   - [Option A — Docker (Recommended)](#option-a--docker-recommended)
   - [Option B — Local Development (no Docker)](#option-b--local-development-no-docker)
5. [API Reference](#api-reference)
   - [GET /health](#get-health)
   - [POST /api/v1/transcribe](#post-apiv1transcribe)
   - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
   - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
   - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
6. [Error Codes](#error-codes)
7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
8. [Training Pipeline](#training-pipeline)

---

## Project Overview

This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.

**Stack:**

- **Inference:** Whisper (HuggingFace Transformers) + Silero VAD
- **Post-processing:** Google Gemini (speaker diarisation, entity extraction, call analysis)
- **API:** FastAPI + Uvicorn
- **Reverse proxy:** Nginx
- **Container:** Docker + Docker Compose

---

## Prerequisites

### For Docker deployment (recommended)

| Requirement | Version |
| --- | --- |
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |

### For local development (no Docker)

| Requirement | Version |
| --- | --- |
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |

---

## Environment Setup

**Step 1 — Copy the example environment file:**

```bash
cp .env.example .env
```

**Step 2 — Open `.env` and fill in your values:**

```env
# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model

# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models

# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda

# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash
```

**Key variables explained:**

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `MODEL_PATH` | Yes | `/models/merged_model` | Path **inside the container** to the Whisper model directory |
| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the **host machine** that gets mounted into the container as `/models` |
| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
| `GEMINI_API_KEY` | For AI endpoints | — | Google Gemini API key |
| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |

> **Note:** If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503
> Service Unavailable`.

---

## Starting the Server

### Option A — Docker (Recommended)

This runs FastAPI behind an Nginx reverse proxy, with GPU support.

**Step 1 — Make sure `.env` is configured** (see [Environment Setup](#environment-setup) above).

**Step 2 — Build and start all services:**

```bash
docker compose up --build -d
```

This will:

1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
2. Start the `stt-api` container (FastAPI on port 8000 internally)
3. Start the `stt-nginx` container (Nginx on port **80** externally)
4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)

**Step 3 — Verify the server is healthy:**

```bash
curl http://localhost/health
```

Expected response when ready:

```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

If `whisper_loaded` is `false`, the model failed to load — check container logs:

```bash
docker compose logs api
```

**Step 4 — Send your first request:**

```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@/path/to/your/audio.mp3"
```

---

**Useful Docker commands:**

```bash
# View live logs
docker compose logs -f api

# Stop all services
docker compose down

# Restart after a code change (rebuild image)
docker compose up --build -d

# Check container status
docker compose ps
```

---

**CPU-only deployment:**

If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:

```yaml
# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.
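
Because Whisper can take 60–120 s to load, a script that calls the API right after `docker compose up` may want to wait for `/health` first. The following is a minimal Python sketch of that idea and is not part of this repository: the names `fetch_health` and `wait_until_ready`, the 180 s timeout, and the 5 s polling interval are illustrative assumptions. Only the `/health` URL and the `whisper_loaded` field come from the response shown in Step 3.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost/health"  # Docker deployment (port 80 via Nginx)


def fetch_health(url: str = HEALTH_URL) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,   # assumed budget, a bit above the 60-120 s load time
    interval_s: float = 5.0,    # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until `whisper_loaded` is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            status = fetch()
            if status.get("whisper_loaded"):
                return status
        except OSError:
            # Connection refused, HTTP error, or socket timeout:
            # the API is probably still starting, so keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("Whisper model did not load in time")
        sleep(interval_s)


# Example (performs real HTTP requests, so it is left commented out):
# status = wait_until_ready()
# print(status["model_path"])
```

For local development without Docker, pass a fetcher bound to the other base URL, for example `wait_until_ready(fetch=lambda: fetch_health("http://localhost:8000/health"))`.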

---

### Option B — Local Development (no Docker)

**Step 1 — Install system dependencies:**

On Ubuntu/Debian:

```bash
sudo apt-get install -y ffmpeg libsndfile1
```

On macOS (Homebrew):

```bash
brew install ffmpeg libsndfile
```

On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.

**Step 2 — Create and activate a virtual environment:**

```bash
python -m venv .venv
source .venv/bin/activate      # Linux/macOS
# .venv\Scripts\activate       # Windows: run this instead of the line above
```

**Step 3 — Install API dependencies:**

```bash
pip install -r requirements-api.txt
```

**Step 4 — Create your `.env` file** (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:

```env
MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here
```

**Step 5 — Start the server:**

```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
```

The server will be available at `http://localhost:8000`.

> Remove `--reload` in production — it watches for file changes and is not suitable for production use.

**Step 6 — Verify:**

```bash
curl http://localhost:8000/health
```

---

## API Reference

All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.

**Supported audio formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`

**Maximum file size:** 200 MB

**Base URL:**

- Docker deployment: `http://localhost` (port 80, via Nginx)
- Local development: `http://localhost:8000`

---

### GET /health

Check the server status and which services are loaded.

**Request:**

```bash
curl http://localhost/health
```

**Response `200 OK`:**

```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
| `model_path` | `string` | The model path the server loaded from |

---

### POST /api/v1/transcribe

Transcribe an audio file using Whisper only. No post-processing is applied — returns raw Arabic text directly from the model.

**When to use:** You need a fast transcript and do not need speaker labels or error correction.

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Arabic text from Whisper |

---

### POST /api/v1/transcribe/autocorrect

Transcribe with Whisper, then send the raw transcript to Gemini for **phonetic and orthographic correction only**. No speaker labels are added — returns a single continuous Arabic text.

**When to use:** You need clean, corrected Arabic text but do not care who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe/autocorrect \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |

---

### POST /api/v1/transcribe/corrected

Transcribe with Whisper, then send the transcript to Gemini, which returns a **speaker-separated, phonetically corrected** version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).

**When to use:** You need a clean, readable transcript that shows who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe/corrected \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |

---

### POST /api/v1/transcribe/analyze

The most powerful endpoint. Transcribes the audio, then runs a full **Gemini call analysis** that extracts structured information from the conversation.

**When to use:** You want a complete picture of the call — who spoke, what happened, what needs follow-up.

**Requires:** `GEMINI_API_KEY`

**Request:**

```bash
curl -X POST http://localhost/api/v1/transcribe/analyze \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**

```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}
```

**Response fields:**

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
| `customer_satisfaction` | `integer` | Satisfaction score **1–5** inferred from tone (1 = very unhappy, 5 = very happy) |
| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
| `pain_points` | `string[]` | List of issues or complaints mentioned |
| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
| `next_steps` | `string[]` | Follow-up actions identified |

---

## Error Codes

| Code | Meaning | How to fix |
| --- | --- | --- |
| `200` | Success | — |
| `413` | File exceeds 200 MB limit | Compress or trim the audio |
| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
| `503` | Model not loaded | Whisper or Gemini did not initialise — check logs |

---

## Interactive Docs (Swagger UI)

FastAPI automatically generates interactive API documentation.

| URL | Description |
| --- | --- |
| `http://localhost/docs` | Swagger UI — try endpoints directly in the browser |
| `http://localhost/redoc` | ReDoc — clean, readable reference |
| `http://localhost/openapi.json` | Raw OpenAPI schema (JSON) |

> For local development (no Docker), replace `localhost` with `localhost:8000`.

---

## Training Pipeline

### Project structure

```
.
├── config/
│   └── training_config.yaml      # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/                ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/          ← matching .txt transcript files (same filename stem)
│   └── processed/                ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py   ← run once to import files from project root
│   ├── prepare_data.py           ← step 1: build dataset
│   ├── train.py                  ← step 2: fine-tune
│   └── transcribe.py             ← step 3: run inference CLI
├── api/                          ← FastAPI server
├── nginx/                        ← Nginx config
├── Dockerfile
└── docker-compose.yml
```

### Transcript format

Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):

```
0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
```

### Step 1 — Install dependencies

```bash
pip install -r requirements.txt
```

### Step 2 — Add your data

Option A — files already in the project root:

```bash
python scripts/import_existing_data.py
```

Option B — place files directly:

- Copy audio → `data/raw/audio/my_file.mp3`
- Copy transcript → `data/raw/transcripts/my_file.txt` *(same stem)*

### Step 3 — Prepare the dataset

```bash
python scripts/prepare_data.py
```

Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace `DatasetDict` saved to `data/processed/`.
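
To make the transcript format and the ≤25-second segmentation more concrete, here is a rough, standalone Python sketch. It is not the code in `src/data_preparation/`: the names `Entry`, `parse_transcript`, and `group_entries` are hypothetical, and it assumes that each entry ends where the next one starts (the last entry ending at the audio's duration), which the actual pipeline may handle differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    start: float  # seconds from the beginning of the audio
    text: str


def parse_transcript(raw: str) -> List[Entry]:
    """Parse lines of the form '<seconds>: <text>' into entries."""
    entries = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; the text itself may contain colons.
        stamp, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' separator in line: {line!r}")
        entries.append(Entry(start=float(stamp), text=text.strip()))
    return entries


def group_entries(
    entries: List[Entry], audio_duration: float, max_len: float = 25.0
) -> List[Tuple[float, float, str]]:
    """Greedily merge consecutive entries into (start, end, text) segments.

    A segment is closed before it would exceed max_len seconds. A single
    entry longer than max_len is kept as its own (over-long) segment.
    """
    segments = []
    i = 0
    while i < len(entries):
        seg_start = entries[i].start
        seg_end = seg_start
        texts = []
        j = i
        while j < len(entries):
            # Assumed end of entry j: start of entry j+1, or end of audio.
            end = entries[j + 1].start if j + 1 < len(entries) else audio_duration
            if texts and end - seg_start > max_len:
                break
            texts.append(entries[j].text)
            seg_end = end
            j += 1
        segments.append((seg_start, seg_end, " ".join(texts)))
        i = j
    return segments


SAMPLE = """\
0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
"""

entries = parse_transcript(SAMPLE)
segments = group_entries(entries, audio_duration=7.5)
```

With the three-line sample and an assumed 7.5 s audio file, everything fits into one segment; lowering `max_len` splits it at entry boundaries.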

### Step 4 — Fine-tune

```bash
python scripts/train.py

# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500
```

### Step 5 — Transcribe via CLI

```bash
# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3

# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3

# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt
```

### Adding more data later

1. Drop each new audio file into `data/raw/audio/` and its matching transcript into `data/raw/transcripts/` (for example `audio.mp3` + `audio.txt`).
2. Re-run `python scripts/prepare_data.py`; this rebuilds the dataset from scratch.
3. Re-run `python scripts/train.py`.

### Configuration

Edit `config/training_config.yaml` to change:

- `model.base_model` — swap to `openai/whisper-medium` for faster training
- `training.per_device_train_batch_size` — reduce if out of GPU memory
- `training.fp16: false` — disable on CPU or older GPUs
- `data.max_segment_duration` — segment length (max 30 s for Whisper)

### GPU requirements

| Model | Min VRAM | Recommended |
| --- | --- | --- |
| whisper-large-v3 | 16 GB | 24 GB A10/A100 |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |

Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.
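
As a rough aid for choosing `model.base_model`, the table above can be turned into a small lookup. This is only a sketch: `suggest_base_model`, `detect_vram_gb`, and the 0.5 GB tolerance are hypothetical names and values, the thresholds are the "Min VRAM" column, and `openai/whisper-small` is assumed to be the Hub ID behind the table's `whisper-small` row. Actual memory use also depends on batch size, precision, and gradient checkpointing.

```python
from typing import Optional

# (model ID, minimum VRAM in GB) from the GPU requirements table, largest first.
MIN_VRAM_GB = [
    ("openai/whisper-large-v3", 16),
    ("openai/whisper-medium", 8),
    ("openai/whisper-small", 4),
]


def suggest_base_model(vram_gb: float, tolerance_gb: float = 0.5) -> Optional[str]:
    """Return the largest model whose minimum VRAM fits, or None.

    GPUs usually report slightly less than their nominal size, so a small
    (assumed) tolerance is added before comparing against the thresholds.
    """
    for name, min_gb in MIN_VRAM_GB:
        if vram_gb + tolerance_gb >= min_gb:
            return name
    return None


def detect_vram_gb() -> float:
    """Total memory of GPU 0 in GB; 0.0 when PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


suggestion = suggest_base_model(detect_vram_gb())
```

A result of `None` means the detected GPU is below the smallest row of the table (or no GPU was found), in which case training on this machine is unlikely to be practical.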