
metadata

title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false

Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.

Project Overview
Prerequisites
Environment Setup
Starting the Server
- Option A — Docker (Recommended)
- Option B — Local Development (no Docker)
API Reference
Error Codes
Interactive Docs (Swagger UI)
Training Pipeline

Project Overview

This project fine-tunes openai/whisper-large-v3 on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.

Stack:

- Inference: Whisper (HuggingFace Transformers) + Silero VAD
- Post-processing: Google Gemini (speaker diarisation, entity extraction, call analysis)
- API: FastAPI + Uvicorn
- Reverse proxy: Nginx
- Container: Docker + Docker Compose

Prerequisites

For Docker deployment (recommended)

| Requirement | Version |
| --- | --- |
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |

For local development (no Docker)

| Requirement | Version |
| --- | --- |
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |

Environment Setup

Step 1 — Copy the example environment file:

cp .env.example .env

Step 2 — Open .env and fill in your values:

# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model

# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models

# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda

# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash

Key variables explained:

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `MODEL_PATH` | Yes | `/models/merged_model` | Path inside the container to the Whisper model directory |
| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the host machine that gets mounted into the container as `/models` |
| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
| `GEMINI_API_KEY` | For AI endpoints | — | Google Gemini API key |
| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |

Note: If GEMINI_API_KEY is not set, the /autocorrect, /corrected, and /analyze endpoints will return 503 Service Unavailable.
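
As an illustration of how these variables and defaults fit together, here is a minimal sketch of a settings loader. The `Settings` class and `load_settings` function are hypothetical names chosen for this example; the service's actual configuration code may be organised differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        # Without a key, the Gemini-backed endpoints are documented to return 503.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=env.get("DEVICE") or None,            # blank counts as unset
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

Calling `load_settings()` with no arguments reads the process environment; passing a plain dict makes the behaviour easy to check in isolation.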

Starting the Server

Option A — Docker (Recommended)

This runs FastAPI behind an Nginx reverse proxy, with GPU support.

Step 1 — Make sure .env is configured (see Environment Setup above).

Step 2 — Build and start all services:

docker compose up --build -d

This will:

1. Build the inference Docker image (installs Python deps, copies src/inference/ and api/)
2. Start the stt-api container (FastAPI on port 8000 internally)
3. Start the stt-nginx container (Nginx on port 80 externally)
4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)

Step 3 — Verify the server is healthy:

curl http://localhost/health

Expected response when ready:

{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}

If whisper_loaded is false, the model is either still loading or failed to load. Check the container logs:

docker compose logs api
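
Because the model can take a minute or two to load, a deployment script may prefer to poll the health endpoint instead of checking it once. The following is a minimal sketch using only the standard library; the function names, the 180-second timeout, and the 5-second interval are assumptions for this example, not part of the project.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(
    url: str = "http://localhost/health",
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
    fetch: Callable[[str], dict] = fetch_health,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until `whisper_loaded` is true; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch(url)
            if health.get("whisper_loaded"):
                return health
        except OSError:
            pass  # connection refused or HTTP error: server not ready yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{url} not ready after {timeout_s} s")
        sleep(interval_s)
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.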

Step 4 — Send your first request:

curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@/path/to/your/audio.mp3"

Useful Docker commands:

# View live logs
docker compose logs -f api

# Stop all services
docker compose down

# Restart after a code change (rebuild image)
docker compose up --build -d

# Check container status
docker compose ps

CPU-only deployment:

If you do not have an NVIDIA GPU, remove the deploy block from docker-compose.yml:

# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

Then set DEVICE=cpu in your .env file. Transcription will be significantly slower.

Option B — Local Development (no Docker)

Step 1 — Install system dependencies:

On Ubuntu/Debian:

sudo apt-get install -y ffmpeg libsndfile1

On macOS (Homebrew):

brew install ffmpeg libsndfile

On Windows: install ffmpeg and add it to PATH.

Step 2 — Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate        # Linux/macOS
.venv\Scripts\activate           # Windows

Step 3 — Install API dependencies:

pip install -r requirements-api.txt

Step 4 — Create your .env file (see Environment Setup) and point MODEL_PATH to your local model directory:

MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here

Step 5 — Start the server:

uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

The server will be available at http://localhost:8000.

Omit --reload in production: it restarts the server whenever a source file changes and is meant for development only.

Step 6 — Verify:

curl http://localhost:8000/health

API Reference

All transcription endpoints accept a multipart/form-data POST request with a single field named audio.

Supported audio formats: .wav, .mp3, .m4a, .flac, .ogg, .webm

Maximum file size: 200 MB

Base URL:

Docker deployment: http://localhost (port 80, via Nginx)
Local development: http://localhost:8000
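
A client can check the format and size limits locally before uploading, which avoids a round trip that would end in an error. This is a sketch under two assumptions: that the server decides on the file extension, and that "200 MB" means 200 × 1024 × 1024 bytes. `check_upload` is a hypothetical helper, not part of the API.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" is interpreted as 200 MiB


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the server would likely reject the file."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format {ext!r} (server would answer 422)")
    if size_bytes > MAX_BYTES:
        raise ValueError("file exceeds 200 MB (server would answer 413)")
```

For a file on disk, `check_upload(path, os.path.getsize(path))` covers both checks.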

GET /health

Check the server status and which services are loaded.

Request:

curl http://localhost/health

Response 200 OK:

{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}

| Field | Type | Description |
| --- | --- | --- |
| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
| `model_path` | `string` | The model path the server loaded from |

POST /api/v1/transcribe

Transcribe an audio file using Whisper only. No post-processing is applied — returns raw Arabic text directly from the model.

When to use: You need a fast transcript and do not need speaker labels or error correction.

Request:

curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Arabic text from Whisper |
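
For callers that prefer Python to curl, the same request can be sketched with the standard library alone. `build_multipart` and `transcribe` are hypothetical helper names; guessing the part's content type from the filename is an assumption that mirrors what curl does, and the server may not depend on it.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    part_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {part_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> dict:
    """POST an audio file to /api/v1/transcribe and return the decoded JSON."""
    with open(path, "rb") as f:
        body, content_type = build_multipart("audio", os.path.basename(path), f.read())
    request = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

With the server running, `transcribe("recording.mp3")["transcript"]` would return the raw text. For local development, pass `base_url="http://localhost:8000"`.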

POST /api/v1/transcribe/autocorrect

Transcribe with Whisper, then send the raw transcript to Gemini for phonetic and orthographic correction only. No speaker labels are added — returns a single continuous Arabic text.

When to use: You need clean, corrected Arabic text but do not care who said what.

Requires: GEMINI_API_KEY

Request:

curl -X POST http://localhost/api/v1/transcribe/autocorrect \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |

POST /api/v1/transcribe/corrected

Transcribe with Whisper, then send the transcript to Gemini, which returns a speaker-separated, phonetically corrected version. Speakers are labelled as SPEAKER_01 (Agent) and SPEAKER_00 (Customer).

When to use: You need a clean, readable transcript that shows who said what.

Requires: GEMINI_API_KEY

Request:

curl -X POST http://localhost/api/v1/transcribe/corrected \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |
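
Since `corrected_transcript` is a single string with one `SPEAKER_xx:` prefix per turn, a consumer will usually want to split it into structured turns. A minimal sketch follows; `Turn` and `parse_turns` are hypothetical names, and treating an unlabelled line as a continuation of the previous turn is an assumption about how Gemini formats long turns.

```python
import re
from typing import NamedTuple

ROLES = {"SPEAKER_01": "agent", "SPEAKER_00": "customer"}
_TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


class Turn(NamedTuple):
    role: str
    text: str


def parse_turns(corrected_transcript: str) -> list[Turn]:
    """Split a speaker-labelled transcript into (role, text) turns."""
    turns: list[Turn] = []
    for raw_line in corrected_transcript.splitlines():
        line = raw_line.strip()
        match = _TURN.match(line)
        if match:
            label, text = match.groups()
            # Unknown labels are kept as-is rather than dropped.
            turns.append(Turn(ROLES.get(label, label), text))
        elif turns and line:
            # Assumption: an unlabelled line continues the previous turn.
            prev = turns[-1]
            turns[-1] = Turn(prev.role, f"{prev.text} {line}")
    return turns
```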

POST /api/v1/transcribe/analyze

The most comprehensive endpoint: it transcribes the audio, then runs a full Gemini call analysis that extracts structured information from the conversation.

When to use: You want a complete picture of the call — who spoke, what happened, what needs follow-up.

Requires: GEMINI_API_KEY

Request:

curl -X POST http://localhost/api/v1/transcribe/analyze \
  -F "audio=@recording.mp3"

Response 200 OK:

{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}

Response fields:

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
| `customer_satisfaction` | `integer` | Satisfaction score 1–5 inferred from tone (1 = very unhappy, 5 = very happy) |
| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
| `pain_points` | `string[]` | List of issues or complaints mentioned |
| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
| `next_steps` | `string[]` | Follow-up actions identified |
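
On the client side, the response fields above map naturally onto a typed record. The sketch below is one way to do that; `CallAnalysis` and `parse_analysis` are hypothetical names, and ignoring unknown keys and range-checking the satisfaction score are choices made for this example, not behaviour of the API.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class CallAnalysis:
    audio_filename: str
    transcript: str
    cleaned_transcript: str
    call_type: str                  # "Inbound" or "Outbound"
    customer_satisfaction: int      # 1 (very unhappy) to 5 (very happy)
    is_urgent: bool
    agent_name: Optional[str] = None
    customer_name: Optional[str] = None
    project_name: Optional[str] = None
    department_mentioned: Optional[str] = None
    unit_number: list[str] = field(default_factory=list)
    pain_points: list[str] = field(default_factory=list)
    action_items_promised: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)


def parse_analysis(payload: str) -> CallAnalysis:
    """Decode an /analyze response body, ignoring keys not listed in the table."""
    data = json.loads(payload)
    known = {f.name for f in fields(CallAnalysis)}
    analysis = CallAnalysis(**{k: v for k, v in data.items() if k in known})
    if not 1 <= analysis.customer_satisfaction <= 5:
        raise ValueError("customer_satisfaction must be between 1 and 5")
    return analysis
```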

Error Codes

| Code | Meaning | How to fix |
| --- | --- | --- |
| `200` | Success | — |
| `413` | File exceeds 200 MB limit | Compress or trim the audio |
| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
| `503` | Model not loaded | Whisper or Gemini did not initialise — check logs |
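
A client can turn this table into a small lookup that also records whether retrying is worthwhile. In the sketch below, the `retryable` flags are an assumption (502 and 503 are treated as possibly transient, everything else as permanent); the API itself does not document a retry policy, and `ErrorHint` and `explain` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorHint:
    meaning: str
    retryable: bool  # assumption for this sketch, not documented by the API


ERROR_HINTS = {
    413: ErrorHint("File exceeds 200 MB limit", retryable=False),
    422: ErrorHint("Unsupported audio format", retryable=False),
    500: ErrorHint("Whisper transcription failed", retryable=False),
    502: ErrorHint("Gemini call failed", retryable=True),
    503: ErrorHint("Model not loaded", retryable=True),
}


def explain(status: int) -> ErrorHint:
    """Map an HTTP status code to the meaning listed in the table above."""
    if 200 <= status < 300:
        return ErrorHint("Success", retryable=False)
    return ERROR_HINTS.get(status, ErrorHint("Undocumented status", retryable=False))
```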

Interactive Docs (Swagger UI)

FastAPI automatically generates interactive API documentation.

| URL | Description |
| --- | --- |
| `http://localhost/docs` | Swagger UI — try endpoints directly in the browser |
| `http://localhost/redoc` | ReDoc — clean, readable reference |
| `http://localhost/openapi.json` | Raw OpenAPI 3.0 schema |

For local development (no Docker), replace localhost with localhost:8000.

Training Pipeline

Project structure

.
├── config/
│   └── training_config.yaml    # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/              ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/        ← matching .txt transcript files (same filename stem)
│   └── processed/              ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py ← run once to import files from project root
│   ├── prepare_data.py         ← step 1: build dataset
│   ├── train.py                ← step 2: fine-tune
│   └── transcribe.py           ← step 3: run inference CLI
├── api/                        ← FastAPI server
├── nginx/                      ← Nginx config
├── Dockerfile
└── docker-compose.yml

Transcript format

Each .txt file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):

0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
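
To make the format concrete, here is a minimal sketch of a parser for such a file. It is not the project's `parse_transcripts.py`, whose behaviour may differ; `Entry` and `parse_transcript` are names chosen for this example. The line is split at the first colon only, so text that itself contains a colon is preserved.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    start: float  # start time in seconds
    text: str


def parse_transcript(content: str) -> list[Entry]:
    """Parse 'seconds: text' lines; blank lines are skipped."""
    entries: list[Entry] = []
    for number, raw_line in enumerate(content.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue
        stamp, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"line {number}: missing ':' separator")
        # float() raises ValueError if the timestamp is not a number.
        entries.append(Entry(float(stamp), text.strip()))
    return entries
```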

Step 1 — Install dependencies

pip install -r requirements.txt

Step 2 — Add your data

Option A — files already in the project root:

python scripts/import_existing_data.py

Option B — place files directly:

Copy audio → data/raw/audio/my_file.mp3
Copy transcript → data/raw/transcripts/my_file.txt (same stem)

Step 3 — Prepare the dataset

python scripts/prepare_data.py

Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace DatasetDict saved to data/processed/.
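
One plausible way to derive such segments from timestamped entries is a greedy grouping, sketched below. This is an illustration only and not necessarily what `segment_audio.py` does. It assumes each entry ends where the next one starts (the last one ends at the end of the audio), and `group_segments` is a hypothetical name.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float
    end: float
    text: str


def group_segments(
    entries: list[tuple[float, str]],
    audio_duration: float,
    max_duration: float = 25.0,
) -> list[Segment]:
    """Greedily merge consecutive entries while the segment stays within max_duration."""

    def end_of(k: int) -> float:
        # An entry ends where the next begins; the last ends with the audio.
        return entries[k + 1][0] if k + 1 < len(entries) else audio_duration

    segments: list[Segment] = []
    i = 0
    while i < len(entries):
        seg_start = entries[i][0]
        texts = [entries[i][1]]
        j = i + 1
        while j < len(entries) and end_of(j) - seg_start <= max_duration:
            texts.append(entries[j][1])
            j += 1
        segments.append(Segment(seg_start, end_of(j - 1), " ".join(texts)))
        i = j
    return segments
```

A single entry longer than `max_duration` still becomes its own oversized segment in this sketch; a real pipeline would need to split or drop it.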

Step 4 — Fine-tune

python scripts/train.py

# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500

Step 5 — Transcribe via CLI

# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3

# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3

# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt

Adding more data later

1. Add each new pair under data/raw/: the audio file in data/raw/audio/ and its transcript (same filename stem) in data/raw/transcripts/.
2. Re-run python scripts/prepare_data.py, which rebuilds the dataset from scratch.
3. Re-run python scripts/train.py.

Configuration

Edit config/training_config.yaml to change:

- model.base_model: swap to openai/whisper-medium for faster training
- training.per_device_train_batch_size: reduce if you run out of GPU memory
- training.fp16: set to false on CPU or older GPUs
- data.max_segment_duration: segment length in seconds (Whisper accepts at most 30 s)

GPU requirements

| Model | Min VRAM | Recommended |
| --- | --- | --- |
| whisper-large-v3 | 16 GB | 24 GB A10/A100 |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |

Use gradient_checkpointing: true and lower per_device_train_batch_size to fit in less VRAM at the cost of slower training.
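
As a small worked example of reading the table, the helper below picks the largest base model whose minimum VRAM fits the available GPU. The thresholds come from the "Min VRAM" column; `largest_model_for` is a hypothetical name, and the `openai/` prefixes assume the Hugging Face Hub model IDs.

```python
from typing import Optional

# (model ID, minimum VRAM in GB), ordered from largest to smallest model
MIN_VRAM_GB = [
    ("openai/whisper-large-v3", 16),
    ("openai/whisper-medium", 8),
    ("openai/whisper-small", 4),
]


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest listed model that meets its minimum VRAM, or None."""
    for name, min_vram in MIN_VRAM_GB:
        if vram_gb >= min_vram:
            return name
    return None
```

The result could then be written to `model.base_model` in config/training_config.yaml.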
