title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.
Table of Contents
- Project Overview
- Prerequisites
- Environment Setup
- Starting the Server
- API Reference
- Error Codes
- Interactive Docs (Swagger UI)
- Training Pipeline
Project Overview
This project fine-tunes openai/whisper-large-v3 on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.
Stack:
- Inference: Whisper (HuggingFace Transformers) + Silero VAD
- Post-processing: Google Gemini (speaker diarisation, entity extraction, call analysis)
- API: FastAPI + Uvicorn
- Reverse proxy: Nginx
- Container: Docker + Docker Compose
Prerequisites
For Docker deployment (recommended)
| Requirement | Version |
|---|---|
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |
For local development (no Docker)
| Requirement | Version |
|---|---|
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |
Environment Setup
Step 1 - Copy the example environment file:
cp .env.example .env
Step 2 - Open .env and fill in your values:
# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model
# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models
# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda
# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash
Key variables explained:
| Variable | Required | Default | Description |
|---|---|---|---|
| MODEL_PATH | Yes | /models/merged_model | Path inside the container to the Whisper model directory |
| MODEL_DIR | Yes | /opt/stt/models | Path on the host machine that gets mounted into the container as /models |
| DEVICE | No | auto-detect | cuda or cpu |
| GEMINI_API_KEY | For AI endpoints | - | Google Gemini API key |
| GEMINI_MODEL | No | gemini-2.5-flash | Gemini model to use |
Note: If GEMINI_API_KEY is not set, the /autocorrect, /corrected, and /analyze endpoints will return 503 Service Unavailable.
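As an illustration of how the variables above fit together, the following is a minimal sketch of reading them at startup. It is not the project's actual settings code: the `Settings` dataclass and `load_settings` helper are hypothetical names, and only the variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]  # None disables the Gemini-backed endpoints
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        return self.gemini_api_key is not None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, applying the documented defaults."""
    def clean(name: str) -> Optional[str]:
        value = env.get(name, "").strip()
        return value or None  # blank values count as unset

    device = clean("DEVICE")
    if device not in (None, "cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda', 'cpu', or blank, got {device!r}")
    return Settings(
        model_path=clean("MODEL_PATH") or "/models/merged_model",
        device=device,
        gemini_api_key=clean("GEMINI_API_KEY"),
        gemini_model=clean("GEMINI_MODEL") or "gemini-2.5-flash",
    )
```

With this shape, a service could check `gemini_enabled` once and answer 503 on the Gemini-backed routes when it is false, which matches the behaviour described in the note.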
Starting the Server
Option A - Docker (Recommended)
This runs FastAPI behind an Nginx reverse proxy, with GPU support.
Step 1 - Make sure .env is configured (see Environment Setup above).
Step 2 - Build and start all services:
docker compose up --build -d
This will:
- Build the inference Docker image (installs Python deps, copies src/inference/ and api/)
- Start the stt-api container (FastAPI on port 8000 internally)
- Start the stt-nginx container (Nginx on port 80 externally)
- Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)
Step 3 - Verify the server is healthy:
curl http://localhost/health
Expected response when ready:
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
If whisper_loaded is false, the model failed to load. Check the container logs:
docker compose logs api
Step 4 - Send your first request:
curl -X POST http://localhost/api/v1/transcribe \
-F "audio=@/path/to/your/audio.mp3"
Useful Docker commands:
# View live logs
docker compose logs -f api
# Stop all services
docker compose down
# Restart after a code change (rebuild image)
docker compose up --build -d
# Check container status
docker compose ps
CPU-only deployment:
If you do not have an NVIDIA GPU, remove the deploy block from docker-compose.yml:
# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
Then set DEVICE=cpu in your .env file. Transcription will be significantly slower.
Option B - Local Development (no Docker)
Step 1 - Install system dependencies:
On Ubuntu/Debian:
sudo apt-get install -y ffmpeg libsndfile1
On macOS (Homebrew):
brew install ffmpeg libsndfile
On Windows: install ffmpeg and add it to PATH.
Step 2 - Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # Linux/macOS
.venv\Scripts\activate # Windows
Step 3 - Install API dependencies:
pip install -r requirements-api.txt
Step 4 - Create your .env file (see Environment Setup) and point MODEL_PATH to your local model directory:
MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here
Step 5 - Start the server:
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
The server will be available at http://localhost:8000.
Remove --reload in production: it restarts the server whenever source files change and is meant for development only.
Step 6 - Verify:
curl http://localhost:8000/health
API Reference
All transcription endpoints accept a multipart/form-data POST request with a single field named audio.
Supported audio formats: .wav, .mp3, .m4a, .flac, .ogg, .webm
Maximum file size: 200 MB
Base URL:
- Docker deployment: http://localhost (port 80, via Nginx)
- Local development: http://localhost:8000
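A client can apply the same format and size limits before uploading, which avoids sending a large file only to receive a 413 or 422. The sketch below is a client-side pre-check, not the server's validation code; `check_upload` is a hypothetical helper, and it assumes "200 MB" means 200 × 1024 × 1024 bytes.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 200 * 1024 * 1024  # assumption: the 200 MB limit is counted in MiB


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would be rejected by the documented limits."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; the limit is {MAX_BYTES}")
```

For a file on disk, the size can be taken from `Path(path).stat().st_size`.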
GET /health
Check the server status and which services are loaded.
Request:
curl http://localhost/health
Response 200 OK:
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
| Field | Type | Description |
|---|---|---|
| status | string | "ok" if Whisper is loaded, "degraded" otherwise |
| whisper_loaded | boolean | Whether the Whisper model loaded successfully |
| gemini_available | boolean | Whether the Gemini analyzer is ready (requires GEMINI_API_KEY) |
| model_path | string | The model path the server loaded from |
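Because the model can take a minute or two to load, a deployment script may want to poll this endpoint rather than call it once. The following is a minimal sketch using only the standard library; `fetch_health` and `wait_until_ready` are hypothetical helpers, and the 180-second timeout and 5-second interval are assumptions, not project settings.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str = "http://localhost/health") -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until whisper_loaded is true; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            pass  # connection refused or HTTP error: server is not up yet
        if time.monotonic() >= deadline:
            raise TimeoutError("Whisper did not finish loading in time")
        sleep(interval_s)
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.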
POST /api/v1/transcribe
Transcribe an audio file using Whisper only. No post-processing is applied; the endpoint returns the raw Arabic text directly from the model.
When to use: You need a fast transcript and do not need speaker labels or error correction.
Request:
curl -X POST http://localhost/api/v1/transcribe \
-F "audio=@recording.mp3"
Response 200 OK:
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}
| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Arabic text from Whisper |
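The same request can be made from Python without third-party packages. This is a sketch rather than an official client: `build_multipart` and `transcribe` are hypothetical helpers, the 600-second timeout is an assumption, and the filename is not escaped, so it should not contain double quotes.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(field: str, filename: str, payload: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> str:
    """POST an audio file to /api/v1/transcribe and return the transcript."""
    audio = Path(path)
    body, content_type = build_multipart("audio", audio.name, audio.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)["transcript"]
```

For local development, pass `base_url="http://localhost:8000"`.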
POST /api/v1/transcribe/autocorrect
Transcribe with Whisper, then send the raw transcript to Gemini for phonetic and orthographic correction only. No speaker labels are added; the response is a single continuous Arabic text.
When to use: You need clean, corrected Arabic text but do not care who said what.
Requires: GEMINI_API_KEY
Request:
curl -X POST http://localhost/api/v1/transcribe/autocorrect \
-F "audio=@recording.mp3"
Response 200 OK:
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}
| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Whisper output (unmodified) |
| corrected_transcript | string | Phonetically and orthographically corrected Arabic text |
POST /api/v1/transcribe/corrected
Transcribe with Whisper, then send the transcript to Gemini, which returns a speaker-separated, phonetically corrected version. Speakers are labelled as SPEAKER_01 (Agent) and SPEAKER_00 (Customer).
When to use: You need a clean, readable transcript that shows who said what.
Requires: GEMINI_API_KEY
Request:
curl -X POST http://localhost/api/v1/transcribe/corrected \
-F "audio=@recording.mp3"
Response 200 OK:
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}
| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Whisper output (unmodified) |
| corrected_transcript | string | Speaker-labelled, corrected Arabic transcript (SPEAKER_01 = Agent, SPEAKER_00 = Customer) |
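Downstream code usually wants the speaker-labelled string as a list of turns. The sketch below assumes the layout shown in the example response, one `SPEAKER_xx: text` turn per line; `Turn` and `split_turns` are hypothetical names, and treating unlabelled lines as continuations of the previous turn is this sketch's choice, not documented behaviour.

```python
import re
from typing import List, NamedTuple

ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
_TURN = re.compile(r"^(SPEAKER_\d{2}):\s*(.*)$")


class Turn(NamedTuple):
    speaker: str
    role: str
    text: str


def split_turns(corrected_transcript: str) -> List[Turn]:
    """Split a speaker-labelled transcript into ordered turns."""
    turns: List[Turn] = []
    for raw_line in corrected_transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = _TURN.match(line)
        if match:
            speaker, text = match.groups()
            turns.append(Turn(speaker, ROLES.get(speaker, "Unknown"), text))
        elif turns:
            # Unlabelled line: treat it as a continuation of the previous turn.
            previous = turns[-1]
            turns[-1] = previous._replace(text=f"{previous.text} {line}")
        # Unlabelled text before the first label is dropped.
    return turns
```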
POST /api/v1/transcribe/analyze
The most comprehensive endpoint. Transcribes the audio, then runs a full Gemini call analysis that extracts structured information from the conversation.
When to use: You want a complete picture of the call: who spoke, what happened, and what needs follow-up.
Requires: GEMINI_API_KEY
Request:
curl -X POST http://localhost/api/v1/transcribe/analyze \
-F "audio=@recording.mp3"
Response 200 OK:
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}
Response fields:
| Field | Type | Description |
|---|---|---|
| audio_filename | string | Name of the uploaded file |
| transcript | string | Raw Whisper output (unmodified) |
| cleaned_transcript | string | Speaker-labelled, corrected Arabic transcript |
| agent_name | string or null | Name of the agent extracted from the conversation |
| customer_name | string or null | Name of the customer extracted from the conversation |
| unit_number | string[] | Unit identifiers mentioned (e.g. ["B2-401"]) |
| project_name | string or null | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| department_mentioned | string or null | Department referenced (Sales, Maintenance, Housekeeping) |
| call_type | string | "Inbound" or "Outbound" |
| customer_satisfaction | integer | Satisfaction score 1–5 inferred from tone (1 = very unhappy, 5 = very happy) |
| is_urgent | boolean | true if satisfaction ≤ 2 or the customer expressed critical frustration |
| pain_points | string[] | List of issues or complaints mentioned |
| action_items_promised | string[] | Commitments made by the agent during the call |
| next_steps | string[] | Follow-up actions identified |
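Since these fields are produced by a language model, a consumer may want to sanity-check them before storing the result. The following is a rough sketch of such a check built only from the constraints in the table; `validate_analysis` is a hypothetical helper and not part of the service.

```python
from typing import Any, Dict, List


def validate_analysis(payload: Dict[str, Any]) -> List[str]:
    """Return the problems found in an /analyze response (empty if none)."""
    problems: List[str] = []

    score = payload.get("customer_satisfaction")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 5:
        problems.append("customer_satisfaction must be an integer from 1 to 5")
    elif score <= 2 and payload.get("is_urgent") is not True:
        problems.append("is_urgent should be true when satisfaction is 2 or lower")

    if payload.get("call_type") not in ("Inbound", "Outbound"):
        problems.append('call_type must be "Inbound" or "Outbound"')

    for key in ("unit_number", "pain_points", "action_items_promised", "next_steps"):
        value = payload.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{key} must be a list of strings")

    return problems
```

Note that `is_urgent` may legitimately be true with a score above 2, because the table also allows it when the customer expressed critical frustration; the check only flags the opposite case.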
Error Codes
| Code | Meaning | How to fix |
|---|---|---|
| 200 | Success | - |
| 413 | File exceeds 200 MB limit | Compress or trim the audio |
| 422 | Unsupported audio format | Use .wav, .mp3, .m4a, .flac, .ogg, or .webm |
| 500 | Whisper transcription failed | Check server logs: docker compose logs api |
| 502 | Gemini call failed | Check GEMINI_API_KEY and network access to Google APIs |
| 503 | Model not loaded | Whisper or Gemini did not initialise; check logs |
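A client can turn these codes into a simple decision about whether to retry. In the sketch below the hints are copied from the table, while the retry flags are an assumption of this example: 502 and 503 are treated as possibly transient, which will not help if the cause is a missing or invalid GEMINI_API_KEY. `Advice` and `advise` are hypothetical names.

```python
from typing import NamedTuple


class Advice(NamedTuple):
    retry: bool  # whether an automatic retry is worth attempting (assumption)
    hint: str    # remediation text from the error table


ADVICE = {
    413: Advice(False, "Compress or trim the audio"),
    422: Advice(False, "Use .wav, .mp3, .m4a, .flac, .ogg, or .webm"),
    500: Advice(False, "Check server logs: docker compose logs api"),
    502: Advice(True, "Check GEMINI_API_KEY and network access to Google APIs"),
    503: Advice(True, "Whisper or Gemini did not initialise; check logs"),
}


def advise(status_code: int) -> Advice:
    """Map an HTTP status code to a retry decision and a remediation hint."""
    if 200 <= status_code < 300:
        return Advice(False, "Success")
    return ADVICE.get(status_code, Advice(False, f"Unexpected status {status_code}"))
```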
Interactive Docs (Swagger UI)
FastAPI automatically generates interactive API documentation.
| URL | Description |
|---|---|
| http://localhost/docs | Swagger UI: try endpoints directly in the browser |
| http://localhost/redoc | ReDoc: clean, readable reference |
| http://localhost/openapi.json | Raw OpenAPI 3.0 schema |
For local development (no Docker), replace localhost with localhost:8000.
Training Pipeline
Project structure
.
├── config/
│   └── training_config.yaml       # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/                 ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/           ← matching .txt transcript files (same filename stem)
│   └── processed/                 ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py    ← run once to import files from project root
│   ├── prepare_data.py            ← step 1: build dataset
│   ├── train.py                   ← step 2: fine-tune
│   └── transcribe.py              ← step 3: run inference CLI
├── api/                           ← FastAPI server
├── nginx/                         ← Nginx config
├── Dockerfile
└── docker-compose.yml
Transcript format
Each .txt file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):
0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
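To make the format concrete, here is a minimal parser for it. This is a sketch and not the project's `parse_transcripts.py`: `Entry` and `parse_transcript` are hypothetical names, and rejecting decreasing timestamps is an assumption of this example.

```python
import re
from typing import List, NamedTuple

# "<seconds>: <text>", where seconds is an integer or a decimal number.
_ENTRY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*:\s*(.+?)\s*$")


class Entry(NamedTuple):
    start: float  # start time in seconds
    text: str


def parse_transcript(content: str) -> List[Entry]:
    """Parse timestamped transcript text into ordered entries."""
    entries: List[Entry] = []
    for number, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        match = _ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {number}: expected '<seconds>: <text>'")
        start = float(match.group(1))
        if entries and start < entries[-1].start:
            raise ValueError(f"line {number}: timestamps must not decrease")
        entries.append(Entry(start, match.group(2)))
    return entries
```

Only the first colon on a line separates the timestamp from the text, so colons inside the utterance are preserved.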
Step 1 - Install dependencies
pip install -r requirements.txt
Step 2 - Add your data
Option A - files already in the project root:
python scripts/import_existing_data.py
Option B - place files directly:
- Copy audio → data/raw/audio/my_file.mp3
- Copy transcript → data/raw/transcripts/my_file.txt (same stem)
Step 3 - Prepare the dataset
python scripts/prepare_data.py
This splits the audio into WAV segments of at most 25 seconds, aligned to the transcript timestamps, then builds a HuggingFace DatasetDict and saves it to data/processed/.
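One plausible way to derive such segments from a timestamped transcript is to merge consecutive entries greedily until the next one would exceed the limit. The sketch below illustrates that idea only; it is an assumption about the approach, not the logic in `segment_audio.py`, and `group_segments` is a hypothetical helper. Each entry is taken to end where the next one starts, with the last ending at the audio duration.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text)


def group_segments(
    starts: Sequence[float],
    texts: Sequence[str],
    audio_duration: float,
    max_duration: float = 25.0,
) -> List[Segment]:
    """Greedily merge consecutive entries into segments of at most max_duration."""
    bounds = list(starts) + [audio_duration]
    segments: List[Segment] = []
    seg_start = 0.0
    seg_texts: List[str] = []
    for i, text in enumerate(texts):
        start, end = bounds[i], bounds[i + 1]
        # Close the current segment if adding this entry would overrun the limit.
        if seg_texts and end - seg_start > max_duration:
            segments.append((seg_start, start, " ".join(seg_texts)))
            seg_texts = []
        if not seg_texts:
            seg_start = start
        seg_texts.append(text)
    if seg_texts:
        segments.append((seg_start, bounds[-1], " ".join(seg_texts)))
    return segments
```

A single entry longer than `max_duration` is kept as its own oversized segment here; a real pipeline would need to split or drop it.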
Step 4 - Fine-tune
python scripts/train.py
# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500
Step 5 - Transcribe via CLI
# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3
# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3
# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt
Adding more data later
- Drop new audio.mp3 + audio.txt pairs into data/raw/.
- Re-run python scripts/prepare_data.py, which rebuilds everything from scratch.
- Re-run python scripts/train.py.
Configuration
Edit config/training_config.yaml to change:
- model.base_model - swap to openai/whisper-medium for faster training
- training.per_device_train_batch_size - reduce if out of GPU memory
- training.fp16: false - disable on CPU or older GPUs
- data.max_segment_duration - segment length (max 30 s for Whisper)
GPU requirements
| Model | Min VRAM | Recommended |
|---|---|---|
| whisper-large-v3 | 16 GB | 24 GB A10/A100 |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |
Use gradient_checkpointing: true and lower per_device_train_batch_size to fit in less VRAM at the cost of slower training.
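The table can be read as a lookup from available VRAM to the largest model that meets its minimum. The helper below is a small illustration of that reading; `largest_model_for` is a hypothetical name, the figures are the "Min VRAM" column, and the `openai/` prefix on the model ids is assumed to match the Hugging Face naming used elsewhere in this README.

```python
from typing import Optional

# Minimum VRAM in GB, copied from the table above.
MIN_VRAM_GB = {
    "openai/whisper-large-v3": 16,
    "openai/whisper-medium": 8,
    "openai/whisper-small": 4,
}


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the most demanding model whose minimum fits, or None."""
    candidates = [(need, name) for name, need in MIN_VRAM_GB.items() if need <= vram_gb]
    return max(candidates)[1] if candidates else None
```

The result is a starting point for model.base_model; the memory actually used also depends on batch size and gradient checkpointing.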