	---
	title: Speech To Text API
	emoji: 🎙️
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_port: 7860
	pinned: false
	---

	Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.

	---

	## Table of Contents

	1. [Project Overview](#project-overview)
	2. [Prerequisites](#prerequisites)
	3. [Environment Setup](#environment-setup)
	4. [Starting the Server](#starting-the-server)
	   - [Option A — Docker (Recommended)](#option-a--docker-recommended)
	   - [Option B — Local Development (no Docker)](#option-b--local-development-no-docker)
	5. [API Reference](#api-reference)
	   - [GET /health](#get-health)
	   - [POST /api/v1/transcribe](#post-apiv1transcribe)
	   - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
	   - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
	   - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
	6. [Error Codes](#error-codes)
	7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
	8. [Training Pipeline](#training-pipeline)

	---

	## Project Overview

	This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.

	Stack:

	- Inference: Whisper (HuggingFace Transformers) + Silero VAD
	- Post-processing: Google Gemini (speaker diarisation, entity extraction, call analysis)
	- API: FastAPI + Uvicorn
	- Reverse proxy: Nginx
	- Container: Docker + Docker Compose

	---

	## Prerequisites

	### For Docker deployment (recommended)

	| Requirement | Version |
	| --- | --- |
	| Docker | ≥ 24 |
	| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
	| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
	| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |

	### For local development (no Docker)

	| Requirement | Version |
	| --- | --- |
	| Python | 3.10 or 3.11 |
	| ffmpeg | Any recent version |
	| libsndfile | Any recent version (Linux/macOS) |
	| CUDA toolkit | 12.x (optional, for GPU) |

	---

	## Environment Setup

	Step 1 — Copy the example environment file:

	```bash
	cp .env.example .env
	```

	Step 2 — Open `.env` and fill in your values:

	```env
	# Path inside the container where the model will be mounted
	MODEL_PATH=/models/merged_model

	# Host machine path to your model directory (mounted into the container)
	MODEL_DIR=/opt/stt/models

	# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
	DEVICE=cuda

	# Required for /autocorrect, /corrected, and /analyze endpoints
	GEMINI_API_KEY=your_gemini_api_key_here
	GEMINI_MODEL=gemini-2.5-flash
	```

	Key variables explained:

	| Variable | Required | Default | Description |
	| --- | --- | --- | --- |
	| `MODEL_PATH` | Yes | `/models/merged_model` | Path inside the container to the Whisper model directory |
	| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the host machine that gets mounted into the container as `/models` |
	| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
	| `GEMINI_API_KEY` | For AI endpoints | — | Google Gemini API key |
	| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |

	> Note: If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503 Service Unavailable`.
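
The variables and defaults in the table can be mirrored in code roughly as follows. This is a minimal sketch, assuming the server reads its configuration from the process environment; `Settings` and `load_settings` are hypothetical names for illustration and are not taken from the project's source.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        # Without a key, the Gemini-backed endpoints are documented to return 503.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables and apply the documented defaults."""
    device = env.get("DEVICE", "").strip().lower() or None
    if device not in (None, "cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda', 'cpu', or blank, got {device!r}")
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=device,
        gemini_api_key=env.get("GEMINI_API_KEY", "").strip() or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

Passing a plain dictionary instead of `os.environ` makes it easy to check a `.env` draft before starting the server.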

	---

	## Starting the Server

	### Option A — Docker (Recommended)

	This runs FastAPI behind an Nginx reverse proxy, with GPU support.

	Step 1 — Make sure `.env` is configured (see [Environment Setup](#environment-setup) above).

	Step 2 — Build and start all services:

	```bash
	docker compose up --build -d
	```

	This will:
	1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
	2. Start the `stt-api` container (FastAPI on port 8000 internally)
	3. Start the `stt-nginx` container (Nginx on port 80 externally)
	4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)

	Step 3 — Verify the server is healthy:

	```bash
	curl http://localhost/health
	```

	Expected response when ready:
	```json
	{
	  "status": "ok",
	  "whisper_loaded": true,
	  "gemini_available": true,
	  "model_path": "/models/merged_model"
	}
	```

	If `whisper_loaded` is `false`, the model failed to load — check container logs:

	```bash
	docker compose logs api
	```
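
Because Whisper can take a minute or two to load, a deployment script may prefer to poll `/health` rather than call it once. The sketch below uses only the standard library; the function names, the 180 s timeout, and the 5 s interval are illustrative assumptions, not values from the project.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str = "http://localhost/health") -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,   # assumed budget, somewhat above the 60-120 s load time
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until `whisper_loaded` is true or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            # Connection refused, HTTP error, or socket timeout:
            # the API is probably still starting.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("Whisper did not finish loading in time")
        sleep(interval_s)
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server.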

	Step 4 — Send your first request:

	```bash
	curl -X POST http://localhost/api/v1/transcribe \
	  -F "audio=@/path/to/your/audio.mp3"
	```

	---

	Useful Docker commands:

	```bash
	# View live logs
	docker compose logs -f api

	# Stop all services
	docker compose down

	# Restart after a code change (rebuild image)
	docker compose up --build -d

	# Check container status
	docker compose ps
	```

	---

	CPU-only deployment:

	If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:

	```yaml
	# Delete these lines from the `api` service:
	deploy:
	  resources:
	    reservations:
	      devices:
	        - driver: nvidia
	          count: 1
	          capabilities: [gpu]
	```

	Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.

	---

	### Option B — Local Development (no Docker)

	Step 1 — Install system dependencies:

	On Ubuntu/Debian:
	```bash
	sudo apt-get install -y ffmpeg libsndfile1
	```

	On macOS (Homebrew):
	```bash
	brew install ffmpeg libsndfile
	```

	On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.

	Step 2 — Create and activate a virtual environment:

	```bash
	python -m venv .venv
	# Linux/macOS:
	source .venv/bin/activate
	# Windows (use this line instead):
	# .venv\Scripts\activate
	```

	Step 3 — Install API dependencies:

	```bash
	pip install -r requirements-api.txt
	```

	Step 4 — Create your `.env` file (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:

	```env
	MODEL_PATH=outputs/checkpoints/merged_model
	GEMINI_API_KEY=your_gemini_api_key_here
	```

	Step 5 — Start the server:

	```bash
	uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
	```

	The server will be available at `http://localhost:8000`.

	> Remove `--reload` in production: it restarts the server whenever a source file changes, which is only useful during development.

	Step 6 — Verify:

	```bash
	curl http://localhost:8000/health
	```

	---

	## API Reference

	All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.

	Supported audio formats: `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`

	Maximum file size: 200 MB
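
A client can check both limits before uploading and avoid a wasted round trip. This is a sketch only: `check_upload` is a hypothetical helper, and it assumes that "200 MB" means 200 × 1024 × 1024 bytes and that the server decides the format from the file extension.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
# Assumption: the limit is counted in binary megabytes; the server may count differently.
MAX_BYTES = 200 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the server would likely reject this file."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format {ext or '(none)'}; the API answers 422")
    if size_bytes > MAX_BYTES:
        raise ValueError("file exceeds 200 MB; the API answers 413")
```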

	Base URL:
	- Docker deployment: `http://localhost` (port 80, via Nginx)
	- Local development: `http://localhost:8000`

	---

	### GET /health

	Check the server status and which services are loaded.

	Request:
	```bash
	curl http://localhost/health
	```

	Response `200 OK`:
	```json
	{
	  "status": "ok",
	  "whisper_loaded": true,
	  "gemini_available": true,
	  "model_path": "/models/merged_model"
	}
	```

	| Field | Type | Description |
	| --- | --- | --- |
	| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
	| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
	| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
	| `model_path` | `string` | The model path the server loaded from |

	---

	### POST /api/v1/transcribe

	Transcribe an audio file using Whisper only. No post-processing is applied — returns raw Arabic text directly from the model.

	When to use: You need a fast transcript and do not need speaker labels or error correction.

	Request:
	```bash
	curl -X POST http://localhost/api/v1/transcribe \
	  -F "audio=@recording.mp3"
	```

	Response `200 OK`:
	```json
	{
	  "audio_filename": "recording.mp3",
	  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
	}
	```

	| Field | Type | Description |
	| --- | --- | --- |
	| `audio_filename` | `string` | Name of the uploaded file |
	| `transcript` | `string` | Raw Arabic text from Whisper |
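
The same request can be made from Python without third-party packages. The sketch below builds the `multipart/form-data` body by hand; `build_multipart` and `transcribe` are hypothetical helper names, and sending the part as `application/octet-stream` is an assumption about what the server accepts.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> str:
    """POST an audio file to /api/v1/transcribe and return the transcript."""
    audio = Path(path)
    body, content_type = build_multipart("audio", audio.name, audio.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["transcript"]
```

For local development, pass `base_url="http://localhost:8000"`. The other transcription endpoints take the same `audio` field, so only the URL path and the fields read from the response would change.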

	---

	### POST /api/v1/transcribe/autocorrect

	Transcribe with Whisper, then send the raw transcript to Gemini for phonetic and orthographic correction only. No speaker labels are added — returns a single continuous Arabic text.

	When to use: You need clean, corrected Arabic text but do not care who said what.

	Requires: `GEMINI_API_KEY`

	Request:
	```bash
	curl -X POST http://localhost/api/v1/transcribe/autocorrect \
	  -F "audio=@recording.mp3"
	```

	Response `200 OK`:
	```json
	{
	  "audio_filename": "recording.mp3",
	  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
	  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
	}
	```

	| Field | Type | Description |
	| --- | --- | --- |
	| `audio_filename` | `string` | Name of the uploaded file |
	| `transcript` | `string` | Raw Whisper output (unmodified) |
	| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |

	---

	### POST /api/v1/transcribe/corrected

	Transcribe with Whisper, then send the transcript to Gemini, which returns a speaker-separated, phonetically corrected version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).

	When to use: You need a clean, readable transcript that shows who said what.

	Requires: `GEMINI_API_KEY`

	Request:
	```bash
	curl -X POST http://localhost/api/v1/transcribe/corrected \
	  -F "audio=@recording.mp3"
	```

	Response `200 OK`:
	```json
	{
	  "audio_filename": "recording.mp3",
	  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
	  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
	}
	```

	| Field | Type | Description |
	| --- | --- | --- |
	| `audio_filename` | `string` | Name of the uploaded file |
	| `transcript` | `string` | Raw Whisper output (unmodified) |
	| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |
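
Since `corrected_transcript` is a single string with one `SPEAKER_xx:` line per turn, a client will usually want to split it into structured turns. A minimal sketch, assuming each turn starts on its own line as in the example response; `Turn` and `parse_turns` are hypothetical names.

```python
import re
from typing import NamedTuple

# Role mapping as documented for this endpoint.
ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


class Turn(NamedTuple):
    speaker: str
    role: str
    text: str


def parse_turns(corrected_transcript: str) -> list[Turn]:
    """Split a speaker-labelled transcript into a list of turns."""
    turns: list[Turn] = []
    for line in corrected_transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN.match(line)
        if match:
            speaker, text = match.groups()
            turns.append(Turn(speaker, ROLES.get(speaker, "Unknown"), text))
        elif turns:
            # Assumption: a line without a label continues the previous turn.
            last = turns[-1]
            turns[-1] = last._replace(text=f"{last.text} {line}")
    return turns
```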

	---

	### POST /api/v1/transcribe/analyze

	The most comprehensive endpoint. It transcribes the audio, then runs a full Gemini call analysis that extracts structured fields from the conversation.

	When to use: You want a complete picture of the call — who spoke, what happened, what needs follow-up.

	Requires: `GEMINI_API_KEY`

	Request:
	```bash
	curl -X POST http://localhost/api/v1/transcribe/analyze \
	  -F "audio=@recording.mp3"
	```

	Response `200 OK`:
	```json
	{
	  "audio_filename": "recording.mp3",
	  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
	  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
	  "agent_name": "أحمد",
	  "customer_name": "محمد السيد",
	  "unit_number": ["B2-401"],
	  "project_name": "IL BOSCO",
	  "department_mentioned": "Sales",
	  "call_type": "Inbound",
	  "customer_satisfaction": 3,
	  "is_urgent": false,
	  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
	  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
	  "next_steps": ["متابعة العميل خلال 48 ساعة"]
	}
	```

	Response fields:

	| Field | Type | Description |
	| --- | --- | --- |
	| `audio_filename` | `string` | Name of the uploaded file |
	| `transcript` | `string` | Raw Whisper output (unmodified) |
	| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
	| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
	| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
	| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
	| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
	| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
	| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
	| `customer_satisfaction` | `integer` | Satisfaction score 1–5 inferred from tone (1 = very unhappy, 5 = very happy) |
	| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
	| `pain_points` | `string[]` | List of issues or complaints mentioned |
	| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
	| `next_steps` | `string[]` | Follow-up actions identified |
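
Downstream code often reduces this response to two things: whether the call needs escalation, and what has to be followed up. The sketch below applies the documented `is_urgent` rule on the client side; `needs_escalation` and `follow_ups` are hypothetical helpers, not part of the API.

```python
from typing import Any, Mapping


def needs_escalation(analysis: Mapping[str, Any]) -> bool:
    """Apply the documented urgency rule to a parsed /analyze response."""
    score = analysis["customer_satisfaction"]
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"customer_satisfaction must be an integer 1-5, got {score!r}")
    # `is_urgent` can also be true for critical frustration at a higher score,
    # so the flag from the response is honoured alongside the score threshold.
    return bool(analysis.get("is_urgent")) or score <= 2


def follow_ups(analysis: Mapping[str, Any]) -> list[str]:
    """Merge promised action items and next steps, dropping duplicates in order."""
    merged = [*analysis.get("action_items_promised", []), *analysis.get("next_steps", [])]
    return list(dict.fromkeys(merged))
```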

	---

	## Error Codes

	| Code | Meaning | How to fix |
	| --- | --- | --- |
	| `200` | Success | — |
	| `413` | File exceeds 200 MB limit | Compress or trim the audio |
	| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
	| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
	| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
	| `503` | Model not loaded | Whisper or Gemini did not initialise — check logs |
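
A client can turn this table into a small lookup that also decides whether a retry makes sense. This is a sketch under one loud assumption: that only `502` and `503` are worth retrying. A `503` caused by a missing `GEMINI_API_KEY` will not recover on its own, so retries should stay bounded. All names here are hypothetical.

```python
ADVICE = {
    200: "success",
    413: "file exceeds 200 MB: compress or trim the audio",
    422: "unsupported audio format",
    500: "Whisper transcription failed: check the server logs",
    502: "Gemini call failed: check GEMINI_API_KEY and network access",
    503: "model not loaded: Whisper or Gemini did not initialise",
}
# Assumption: upstream and start-up failures may be transient; the rest are not.
RETRYABLE = {502, 503}


def handle_status(code: int) -> tuple[str, bool]:
    """Return (advice, should_retry) for an HTTP status code."""
    return ADVICE.get(code, f"unexpected status {code}"), code in RETRYABLE


def backoff_delays(attempts: int, base_s: float = 2.0, cap_s: float = 30.0) -> list[float]:
    """Exponential backoff schedule in seconds, capped at `cap_s`."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]
```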

	---

	## Interactive Docs (Swagger UI)

	FastAPI automatically generates interactive API documentation.

	| URL | Description |
	| --- | --- |
	| `http://localhost/docs` | Swagger UI — try endpoints directly in the browser |
	| `http://localhost/redoc` | ReDoc — clean, readable reference |
	| `http://localhost/openapi.json` | Raw OpenAPI 3.0 schema |

	> For local development (no Docker), replace `localhost` with `localhost:8000`.

	---

	## Training Pipeline

	### Project structure

	```
	.
	├── config/
	│   └── training_config.yaml      # All hyperparameters in one place
	├── data/
	│   ├── raw/
	│   │   ├── audio/                ← put your audio files here (.mp3, .wav, …)
	│   │   └── transcripts/          ← matching .txt transcript files (same filename stem)
	│   └── processed/                ← auto-generated (segments + HF dataset)
	├── src/
	│   ├── data_preparation/
	│   │   ├── parse_transcripts.py
	│   │   ├── segment_audio.py
	│   │   └── build_dataset.py
	│   ├── training/
	│   │   └── trainer.py
	│   └── inference/
	│       ├── transcribe.py
	│       └── analyze_call.py
	├── scripts/
	│   ├── import_existing_data.py   ← run once to import files from project root
	│   ├── prepare_data.py           ← step 1: build dataset
	│   ├── train.py                  ← step 2: fine-tune
	│   └── transcribe.py             ← step 3: run inference CLI
	├── api/                          ← FastAPI server
	├── nginx/                        ← Nginx config
	├── Dockerfile
	└── docker-compose.yml
	```

	### Transcript format

	Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):

	```
	0.0: سيادة الكولونيل، صبرك في محله،
	3.076: مبروك علينا،
	4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
	```
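
To make the format concrete, here is one way such a file could be parsed. It is an illustrative sketch, not the code in `src/data_preparation/parse_transcripts.py`; `Entry` and `parse_transcript` are hypothetical names, and rejecting timestamps that go backwards is an added assumption.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    start: float  # seconds from the beginning of the audio
    text: str


def parse_transcript(content: str) -> list[Entry]:
    """Parse '<seconds>: <text>' lines into entries, skipping blank lines."""
    entries: list[Entry] = []
    for number, line in enumerate(content.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so the text itself may contain colons.
        stamp, sep, text = line.partition(":")
        try:
            if not sep:
                raise ValueError("missing ':' separator")
            start = float(stamp)
        except ValueError:
            raise ValueError(f"line {number}: expected '<seconds>: <text>'") from None
        # Assumption: entries are listed in chronological order.
        if entries and start < entries[-1].start:
            raise ValueError(f"line {number}: timestamps must not go backwards")
        entries.append(Entry(start, text.strip()))
    return entries
```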

	### Step 1 — Install dependencies

	```bash
	pip install -r requirements.txt
	```

	### Step 2 — Add your data

	Option A — files already in the project root:
	```bash
	python scripts/import_existing_data.py
	```

	Option B — place files directly:
	- Copy audio → `data/raw/audio/my_file.mp3`
	- Copy transcript → `data/raw/transcripts/my_file.txt` (same stem)
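
Before preparing the dataset it can help to confirm that every audio file has a transcript with the same stem, and vice versa. A small sketch, assuming the `data/raw/audio/` and `data/raw/transcripts/` layout shown earlier; `find_unpaired` is a hypothetical helper.

```python
from pathlib import Path


def find_unpaired(raw_dir: str = "data/raw") -> tuple[set[str], set[str]]:
    """Return (audio stems without a transcript, transcript stems without audio)."""
    root = Path(raw_dir)
    audio = {
        p.stem
        for p in (root / "audio").iterdir()
        if p.is_file() and not p.name.startswith(".")  # skip hidden files such as .gitkeep
    }
    transcripts = {p.stem for p in (root / "transcripts").glob("*.txt")}
    return audio - transcripts, transcripts - audio
```

Both sets are empty when the data is consistent.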

	### Step 3 — Prepare the dataset

	```bash
	python scripts/prepare_data.py
	```

	Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace `DatasetDict` saved to `data/processed/`.
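
One plausible way to form segments of at most 25 seconds from timestamped entries is a greedy merge: keep appending consecutive entries to the current segment until the next one would exceed the limit. The sketch below illustrates that idea only. It is not the logic in `src/data_preparation/segment_audio.py`, and it assumes each entry ends where the next begins.

```python
def group_segments(
    entries: list[tuple[float, str]],
    audio_duration: float,
    max_s: float = 25.0,
) -> list[tuple[float, float, str]]:
    """Greedily merge (start, text) entries into (start, end, text) segments."""
    segments: list[tuple[float, float, str]] = []
    for i, (start, text) in enumerate(entries):
        # Assumption: an entry lasts until the next entry starts (or the audio ends).
        end = entries[i + 1][0] if i + 1 < len(entries) else audio_duration
        if segments and end - segments[-1][0] <= max_s:
            seg_start, _, seg_text = segments[-1]
            segments[-1] = (seg_start, end, f"{seg_text} {text}")
        else:
            # Starts a new segment. A single entry longer than max_s is kept
            # whole here and would need separate handling.
            segments.append((start, end, text))
    return segments
```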

	### Step 4 — Fine-tune

	```bash
	python scripts/train.py

	# Resume from a checkpoint
	python scripts/train.py --resume outputs/checkpoints/checkpoint-500
	```

	### Step 5 — Transcribe via CLI

	```bash
	# Use the fine-tuned model (auto-detected)
	python scripts/transcribe.py path/to/audio.mp3

	# Specify a model explicitly
	python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3

	# Save output to file
	python scripts/transcribe.py audio.mp3 --output result.txt
	```

	### Adding more data later

	1. Add each new pair in place: `audio.mp3` goes into `data/raw/audio/` and the matching `audio.txt` into `data/raw/transcripts/`.
	2. Re-run `python scripts/prepare_data.py` — rebuilds everything from scratch.
	3. Re-run `python scripts/train.py`.

	### Configuration

	Edit `config/training_config.yaml` to change:
	- `model.base_model` — swap to `openai/whisper-medium` for faster training
	- `training.per_device_train_batch_size` — reduce if out of GPU memory
	- `training.fp16: false` — disable on CPU or older GPUs
	- `data.max_segment_duration` — segment length (max 30 s for Whisper)

	### GPU requirements

	| Model | Min VRAM | Recommended |
	| --- | --- | --- |
	| whisper-large-v3 | 16 GB | 24 GB A10/A100 |
	| whisper-medium | 8 GB | 16 GB |
	| whisper-small | 4 GB | 8 GB |

	Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.
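
As a rough aid for choosing `model.base_model`, the minimum-VRAM column can be encoded as a lookup. This is a sketch using only the figures from the table above; `largest_model_for` is a hypothetical helper, and real memory use also depends on batch size and gradient checkpointing.

```python
from typing import Optional

# (model id, minimum VRAM in GB), largest model first; values copied from the table.
MIN_VRAM_GB = [
    ("openai/whisper-large-v3", 16),
    ("openai/whisper-medium", 8),
    ("openai/whisper-small", 4),
]


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest listed model whose minimum VRAM fits, or None."""
    for model, minimum in MIN_VRAM_GB:
        if vram_gb >= minimum:
            return model
    return None
```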