title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.


Table of Contents

  1. Project Overview
  2. Prerequisites
  3. Environment Setup
  4. Starting the Server
  5. API Reference
  6. Error Codes
  7. Interactive Docs (Swagger UI)
  8. Training Pipeline

Project Overview

This project fine-tunes openai/whisper-large-v3 on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.

Stack:

  • Inference: Whisper (HuggingFace Transformers) + Silero VAD
  • Post-processing: Google Gemini (speaker diarisation, entity extraction, call analysis)
  • API: FastAPI + Uvicorn
  • Reverse proxy: Nginx
  • Container: Docker + Docker Compose

Prerequisites

For Docker deployment (recommended)

| Requirement | Version |
|---|---|
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |

For local development (no Docker)

| Requirement | Version |
|---|---|
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |

Environment Setup

Step 1: Copy the example environment file:

cp .env.example .env

Step 2: Open .env and fill in your values:

# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model

# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models

# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda

# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash

Key variables explained:

| Variable | Required | Default | Description |
|---|---|---|---|
| MODEL_PATH | Yes | /models/merged_model | Path inside the container to the Whisper model directory |
| MODEL_DIR | Yes | /opt/stt/models | Path on the host machine that gets mounted into the container as /models |
| DEVICE | No | auto-detect | cuda or cpu |
| GEMINI_API_KEY | For AI endpoints | none | Google Gemini API key |
| GEMINI_MODEL | No | gemini-2.5-flash | Gemini model to use |

Note: If GEMINI_API_KEY is not set, the /autocorrect, /corrected, and /analyze endpoints will return 503 Service Unavailable.
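
As an illustration of how these variables fit together, here is a minimal Python sketch that reads them with the defaults from the table above. The `Settings` class and `load_settings` function are hypothetical names used for this example only; the actual server may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        # Without a key, the Gemini-backed endpoints are documented to return 503.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    device = env.get("DEVICE", "").strip().lower() or None
    if device not in (None, "cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda', 'cpu', or blank, got {device!r}")
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=device,
        gemini_api_key=env.get("GEMINI_API_KEY", "").strip() or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

Calling `load_settings()` with no argument reads the current process environment; passing a plain dict makes the function easy to test.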


Starting the Server

Option A: Docker (Recommended)

This runs FastAPI behind an Nginx reverse proxy, with GPU support.

Step 1: Make sure .env is configured (see Environment Setup above).

Step 2: Build and start all services:

docker compose up --build -d

This will:

  1. Build the inference Docker image (installs Python deps, copies src/inference/ and api/)
  2. Start the stt-api container (FastAPI on port 8000 internally)
  3. Start the stt-nginx container (Nginx on port 80 externally)
  4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)

Step 3: Verify the server is healthy:

curl http://localhost/health

Expected response when ready:

{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}

If whisper_loaded is false, the model failed to load. Check the container logs:

docker compose logs api
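
Because Whisper can take a minute or two to load, a deployment script may prefer to poll /health instead of checking once. The sketch below is one way to do that with the Python standard library; `fetch_health` and `wait_until_ready` are hypothetical helper names, and the 180-second timeout and 5-second interval are assumptions rather than values taken from the project.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,      # assumed budget, not a project setting
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until whisper_loaded is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            pass  # connection refused or HTTP error: the server is not up yet
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Whisper did not load in time; check `docker compose logs api`"
            )
        sleep(interval_s)
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running server.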

Step 4: Send your first request:

curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@/path/to/your/audio.mp3"

Useful Docker commands:

# View live logs
docker compose logs -f api

# Stop all services
docker compose down

# Restart after a code change (rebuild image)
docker compose up --build -d

# Check container status
docker compose ps

CPU-only deployment:

If you do not have an NVIDIA GPU, remove the deploy block from docker-compose.yml:

# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

Then set DEVICE=cpu in your .env file. Transcription will be significantly slower.


Option B: Local Development (no Docker)

Step 1: Install system dependencies:

On Ubuntu/Debian:

sudo apt-get install -y ffmpeg libsndfile1

On macOS (Homebrew):

brew install ffmpeg libsndfile

On Windows: install ffmpeg and add it to PATH.

Step 2: Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate        # Linux/macOS
.venv\Scripts\activate           # Windows

Step 3: Install API dependencies:

pip install -r requirements-api.txt

Step 4: Create your .env file (see Environment Setup) and point MODEL_PATH to your local model directory:

MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here

Step 5: Start the server:

uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

The server will be available at http://localhost:8000.

Remove --reload in production: it restarts the server whenever a source file changes, which is only useful during development.

Step 6: Verify:

curl http://localhost:8000/health

API Reference

All transcription endpoints accept a multipart/form-data POST request with a single field named audio.

Supported audio formats: .wav, .mp3, .m4a, .flac, .ogg, .webm

Maximum file size: 200 MB

Base URL:

  • Docker deployment: http://localhost (port 80, via Nginx)
  • Local development: http://localhost:8000
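
The format and size limits above can be checked on the client before uploading, which avoids a round trip that would end in 413 or 422. This is a minimal sketch; `check_upload` is a hypothetical helper, and treating "200 MB" as 200 × 1024 × 1024 bytes is an assumption about how the server counts.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" interpreted as 200 MiB


def check_upload(path: str, max_bytes: int = MAX_BYTES) -> Path:
    """Return the path if it looks acceptable, otherwise raise ValueError."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_EXTENSIONS:
        allowed = ", ".join(sorted(ALLOWED_EXTENSIONS))
        raise ValueError(f"unsupported format {p.suffix!r}; use one of {allowed}")
    size = p.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{p.name} is {size} bytes, over the {max_bytes}-byte limit")
    return p
```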

GET /health

Check the server status and which services are loaded.

Request:

curl http://localhost/health

Response 200 OK:

{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}

| Field | Type | Description |
|---|---|---|
| status | string | "ok" if Whisper is loaded, "degraded" otherwise |
| whisper_loaded | boolean | Whether the Whisper model loaded successfully |
| gemini_available | boolean | Whether the Gemini analyzer is ready (requires GEMINI_API_KEY) |
| model_path | string | The model path the server loaded from |

POST /api/v1/transcribe

Transcribe an audio file using Whisper only. No post-processing is applied; the endpoint returns the raw Arabic text directly from the model.

When to use: You need a fast transcript and do not need speaker labels or error correction.

Request:

curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}

| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Arabic text from Whisper |
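
For callers that prefer Python to curl, the same request can be made with the standard library alone. The sketch below builds the multipart/form-data body by hand with the single `audio` field the API expects; `build_multipart` and `transcribe` are hypothetical helper names, and a library such as `requests` would make this shorter if it is available.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> dict:
    """POST the file to /api/v1/transcribe and return the decoded JSON response."""
    audio = Path(path)
    body, content_type = build_multipart("audio", audio.name, audio.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Example (needs a running server):
#   print(transcribe("recording.mp3")["transcript"])
```

Swapping the URL path for /api/v1/transcribe/autocorrect, /corrected, or /analyze should work the same way, since all transcription endpoints accept the same form field.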

POST /api/v1/transcribe/autocorrect

Transcribe with Whisper, then send the raw transcript to Gemini for phonetic and orthographic correction only. No speaker labels are added; the endpoint returns a single continuous Arabic text.

When to use: You need clean, corrected Arabic text but do not care who said what.

Requires: GEMINI_API_KEY

Request:

curl -X POST http://localhost/api/v1/transcribe/autocorrect \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}

| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Whisper output (unmodified) |
| corrected_transcript | string | Phonetically and orthographically corrected Arabic text |

POST /api/v1/transcribe/corrected

Transcribe with Whisper, then send the transcript to Gemini, which returns a speaker-separated, phonetically corrected version. Speakers are labelled as SPEAKER_01 (Agent) and SPEAKER_00 (Customer).

When to use: You need a clean, readable transcript that shows who said what.

Requires: GEMINI_API_KEY

Request:

curl -X POST http://localhost/api/v1/transcribe/corrected \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}

| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Whisper output (unmodified) |
| corrected_transcript | string | Speaker-labelled, corrected Arabic transcript (SPEAKER_01 = Agent, SPEAKER_00 = Customer) |
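
Downstream code will often want the speaker-labelled string as a list of turns. The following sketch assumes each turn starts on its own line with a `SPEAKER_NN:` prefix, as in the sample response; lines without a prefix are treated as continuations of the previous turn, which is an assumption rather than documented behaviour. `Turn` and `parse_turns` are hypothetical names.

```python
import re
from typing import NamedTuple

# Role mapping taken from the endpoint description.
ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
_TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


class Turn(NamedTuple):
    speaker: str
    role: str
    text: str


def parse_turns(corrected_transcript: str) -> list[Turn]:
    """Split a speaker-labelled transcript into (speaker, role, text) turns."""
    turns: list[Turn] = []
    for raw in corrected_transcript.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _TURN.match(line)
        if match:
            speaker, text = match.groups()
            turns.append(Turn(speaker, ROLES.get(speaker, "Unknown"), text))
        elif turns:
            # No speaker prefix: append to the previous turn.
            last = turns[-1]
            turns[-1] = last._replace(text=f"{last.text} {line}")
    return turns
```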

POST /api/v1/transcribe/analyze

The most powerful endpoint. Transcribes the audio, then runs a full Gemini call analysis that extracts structured information from the conversation.

When to use: You want a complete picture of the call: who spoke, what happened, and what needs follow-up.

Requires: GEMINI_API_KEY

Request:

curl -X POST http://localhost/api/v1/transcribe/analyze \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}

Response fields:

| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Whisper output (unmodified) |
| cleaned_transcript | string | Speaker-labelled, corrected Arabic transcript |
| agent_name | string or null | Name of the agent extracted from the conversation |
| customer_name | string or null | Name of the customer extracted from the conversation |
| unit_number | string[] | Unit identifiers mentioned (e.g. ["B2-401"]) |
| project_name | string or null | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| department_mentioned | string or null | Department referenced (Sales, Maintenance, Housekeeping) |
| call_type | string | "Inbound" or "Outbound" |
| customer_satisfaction | integer | Satisfaction score 1–5 inferred from tone (1 = very unhappy, 5 = very happy) |
| is_urgent | boolean | true if satisfaction ≤ 2 or the customer expressed critical frustration |
| pain_points | string[] | List of issues or complaints mentioned |
| action_items_promised | string[] | Commitments made by the agent during the call |
| next_steps | string[] | Follow-up actions identified |
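
A consumer of this response might route urgent calls to a supervisor and file the rest. The sketch below shows one possible way to do that from the documented fields; `needs_escalation` and `follow_up_summary` are hypothetical helpers, and re-checking the score alongside `is_urgent` is a defensive assumption, not something the API requires.

```python
from typing import Any, Mapping


def needs_escalation(analysis: Mapping[str, Any]) -> bool:
    """True when the call is flagged urgent or the satisfaction score is 2 or lower."""
    score = analysis.get("customer_satisfaction")
    low_score = isinstance(score, int) and score <= 2
    return bool(analysis.get("is_urgent")) or low_score


def follow_up_summary(analysis: Mapping[str, Any]) -> str:
    """Plain-text summary of who was on the call and what was promised."""
    lines = [
        f"Call: {analysis.get('audio_filename', '?')} ({analysis.get('call_type', '?')})",
        f"Agent: {analysis.get('agent_name') or 'unknown'}",
        f"Customer: {analysis.get('customer_name') or 'unknown'}",
    ]
    sections = (("Promised", "action_items_promised"), ("Next step", "next_steps"))
    for title, key in sections:
        for item in analysis.get(key) or []:
            lines.append(f"{title}: {item}")
    return "\n".join(lines)
```

Nullable fields such as `agent_name` are handled with `or 'unknown'` so that a null from the API does not print as "None".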

Error Codes

| Code | Meaning | How to fix |
|---|---|---|
| 200 | Success | n/a |
| 413 | File exceeds 200 MB limit | Compress or trim the audio |
| 422 | Unsupported audio format | Use .wav, .mp3, .m4a, .flac, .ogg, or .webm |
| 500 | Whisper transcription failed | Check server logs: docker compose logs api |
| 502 | Gemini call failed | Check GEMINI_API_KEY and network access to Google APIs |
| 503 | Model not loaded | Whisper or Gemini did not initialise; check logs |
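
A client can turn these codes into actionable messages. In the sketch below the hints are copied from the table, while the choice to treat 502 and 503 as worth retrying is an assumption (an upstream Gemini failure or a model that is still loading may clear up on its own); `explain` and `should_retry` are hypothetical names.

```python
ERROR_HINTS = {
    413: "File exceeds the 200 MB limit: compress or trim the audio.",
    422: "Unsupported audio format: use .wav, .mp3, .m4a, .flac, .ogg, or .webm.",
    500: "Whisper transcription failed: check `docker compose logs api`.",
    502: "Gemini call failed: check GEMINI_API_KEY and network access to Google APIs.",
    503: "Model not loaded: Whisper or Gemini did not initialise; check logs.",
}

# Assumption: only upstream or startup failures are likely to be transient.
RETRYABLE = frozenset({502, 503})


def explain(status: int) -> str:
    """Human-readable hint for a response status code."""
    if status == 200:
        return "Success."
    return ERROR_HINTS.get(status, f"Unexpected status {status}.")


def should_retry(status: int) -> bool:
    """Whether repeating the same request later might succeed."""
    return status in RETRYABLE
```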

Interactive Docs (Swagger UI)

FastAPI automatically generates interactive API documentation.

| URL | Description |
|---|---|
| http://localhost/docs | Swagger UI: try endpoints directly in the browser |
| http://localhost/redoc | ReDoc: clean, readable reference |
| http://localhost/openapi.json | Raw OpenAPI 3.0 schema |

For local development (no Docker), replace localhost with localhost:8000.
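
The raw schema is also handy for scripts, for example to confirm which routes a deployed build exposes. The sketch below lists the operations found in an OpenAPI document that has already been loaded into a dict; `list_operations` is a hypothetical helper and relies only on the standard `paths` object of the OpenAPI format.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_operations(schema: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs from an OpenAPI schema, sorted by path."""
    operations = []
    for path, item in schema.get("paths", {}).items():
        for key in item:
            # Path items can also hold non-method keys such as "parameters".
            if key.lower() in HTTP_METHODS:
                operations.append((key.upper(), path))
    return sorted(operations, key=lambda op: (op[1], op[0]))


# Example (needs a running server):
#   import json, urllib.request
#   with urllib.request.urlopen("http://localhost/openapi.json") as resp:
#       for method, path in list_operations(json.load(resp)):
#           print(method, path)
```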


Training Pipeline

Project structure

.
├── config/
│   └── training_config.yaml    # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/              ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/        ← matching .txt transcript files (same filename stem)
│   └── processed/              ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py ← run once to import files from project root
│   ├── prepare_data.py         ← step 1: build dataset
│   ├── train.py                ← step 2: fine-tune
│   └── transcribe.py           ← step 3: run inference CLI
├── api/                        ← FastAPI server
├── nginx/                      ← Nginx config
├── Dockerfile
└── docker-compose.yml

Transcript format

Each .txt file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):

0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
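
To make the format concrete, here is a minimal parser sketch for such a file. It is not the project's parse_transcripts.py; `parse_transcript` is a hypothetical function, and rejecting out-of-order timestamps is an assumption about what a valid file looks like.

```python
def parse_transcript(text: str) -> list[tuple[float, str]]:
    """Parse 'seconds: text' lines into (start_seconds, text) pairs."""
    entries: list[tuple[float, str]] = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so colons inside the text survive.
        stamp, sep, content = line.partition(":")
        if not sep:
            raise ValueError(f"line {lineno}: expected '<seconds>: <text>'")
        try:
            start = float(stamp)
        except ValueError:
            raise ValueError(f"line {lineno}: bad timestamp {stamp!r}") from None
        entries.append((start, content.strip()))
    starts = [start for start, _ in entries]
    if starts != sorted(starts):
        raise ValueError("timestamps must be in non-decreasing order")
    return entries
```

Reading a file is then `parse_transcript(Path("data/raw/transcripts/my_file.txt").read_text(encoding="utf-8"))`; passing the encoding explicitly matters for Arabic text on platforms whose default is not UTF-8.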

Step 1: Install dependencies

pip install -r requirements.txt

Step 2: Add your data

Option A: files already in the project root:

python scripts/import_existing_data.py

Option B: place files directly:

  • Copy audio → data/raw/audio/my_file.mp3
  • Copy transcript → data/raw/transcripts/my_file.txt (same stem)

Step 3: Prepare the dataset

python scripts/prepare_data.py

Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace DatasetDict saved to data/processed/.
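
The segmentation idea can be illustrated with a small greedy grouping over the transcript timestamps. This is only a sketch under stated assumptions, not the logic in segment_audio.py: each entry is assumed to end where the next one starts, the last entry ends at the audio duration, and `group_segments` is a hypothetical function name.

```python
def group_segments(
    entries: list[tuple[float, str]],
    audio_duration: float,
    max_len: float = 25.0,
) -> list[tuple[float, float, str]]:
    """Merge consecutive (start, text) entries into (start, end, text) segments."""
    segments: list[tuple[float, float, str]] = []
    seg_start = None
    seg_end = 0.0
    texts: list[str] = []
    for i, (start, text) in enumerate(entries):
        # An entry ends where the next begins; the last one ends with the audio.
        end = entries[i + 1][0] if i + 1 < len(entries) else audio_duration
        if texts and end - seg_start > max_len:
            # Adding this entry would exceed the limit: close the current segment.
            segments.append((seg_start, start, " ".join(texts)))
            seg_start, texts = None, []
        if seg_start is None:
            seg_start = start
        texts.append(text)
        seg_end = end
    if texts:
        segments.append((seg_start, seg_end, " ".join(texts)))
    return segments
```

A single entry that is itself longer than `max_len` is kept as one over-long segment here; a real pipeline would need to split or drop it.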

Step 4: Fine-tune

python scripts/train.py

# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500

Step 5: Transcribe via CLI

# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3

# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3

# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt

Adding more data later

  1. Drop new audio.mp3 + audio.txt pairs into data/raw/audio/ and data/raw/transcripts/ respectively.
  2. Re-run python scripts/prepare_data.py, which rebuilds everything from scratch.
  3. Re-run python scripts/train.py.

Configuration

Edit config/training_config.yaml to change:

  • model.base_model: swap to openai/whisper-medium for faster training
  • training.per_device_train_batch_size: reduce if out of GPU memory
  • training.fp16: set to false on CPU or older GPUs
  • data.max_segment_duration: segment length (max 30 s for Whisper)

GPU requirements

| Model | Min VRAM | Recommended |
|---|---|---|
| whisper-large-v3 | 16 GB | 24 GB (A10/A100) |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |

Use gradient_checkpointing: true and lower per_device_train_batch_size to fit in less VRAM at the cost of slower training.
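
As a rough aid for choosing model.base_model, the minimum-VRAM column above can be encoded as a lookup. The sketch below is illustrative only; `largest_model_for` is a hypothetical helper, and the figures are the table's minimums, so real headroom depends on batch size and gradient checkpointing.

```python
from typing import Optional

# Minimum VRAM in GB, copied from the table above (largest model first).
MIN_VRAM_GB = [
    ("openai/whisper-large-v3", 16),
    ("openai/whisper-medium", 8),
    ("openai/whisper-small", 4),
]


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Largest listed model whose minimum VRAM fits, or None if none fits."""
    for model, needed in MIN_VRAM_GB:
        if vram_gb >= needed:
            return model
    return None
```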