MIP-Tech committed 22 days ago

Commit

e333dd9

1 Parent(s): 0db822c

Add README with Space config

README.md +579 -0

	@@ -0,0 +1,579 @@

+---
+title: Speech To Text API
+emoji: 🎙️
+colorFrom: blue
+colorTo: purple
+sdk: docker
+app_port: 7860
+pinned: false
+---
+Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.
+
+---
+## Table of Contents
+1. [Project Overview](#project-overview)
+2. [Prerequisites](#prerequisites)
+3. [Environment Setup](#environment-setup)
+4. [Starting the Server](#starting-the-server)
+   - [Option A — Docker (Recommended)](#option-a--docker-recommended)
+   - [Option B — Local Development (no Docker)](#option-b--local-development-no-docker)
+5. [API Reference](#api-reference)
+   - [GET /health](#get-health)
+   - [POST /api/v1/transcribe](#post-apiv1transcribe)
+   - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
+   - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
+   - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
+6. [Error Codes](#error-codes)
+7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
+8. [Training Pipeline](#training-pipeline)
+---
+## Project Overview
+This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.
+**Stack:**
+- **Inference:** Whisper (HuggingFace Transformers) + Silero VAD
+- **Post-processing:** Google Gemini (speaker diarisation, entity extraction, call analysis)
+- **API:** FastAPI + Uvicorn
+- **Reverse proxy:** Nginx
+- **Container:** Docker + Docker Compose
+---
+## Prerequisites
+### For Docker deployment (recommended)
+| Requirement | Version |
+| --- | --- |
+| Docker | ≥ 24 |
+| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
+| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
+| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |
+### For local development (no Docker)
+| Requirement | Version |
+| --- | --- |
+| Python | 3.10 or 3.11 |
+| ffmpeg | Any recent version |
+| libsndfile | Any recent version (Linux/macOS) |
+| CUDA toolkit | 12.x (optional, for GPU) |
+---
+## Environment Setup
+**Step 1 — Copy the example environment file:**
+```bash
+cp .env.example .env
+```
+**Step 2 — Open `.env` and fill in your values:**
+```env
+# Path inside the container where the model will be mounted
+MODEL_PATH=/models/merged_model
+# Host machine path to your model directory (mounted into the container)
+MODEL_DIR=/opt/stt/models
+# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
+DEVICE=cuda
+# Required for /autocorrect, /corrected, and /analyze endpoints
+GEMINI_API_KEY=your_gemini_api_key_here
+GEMINI_MODEL=gemini-2.5-flash
+```
+**Key variables explained:**
+| Variable | Required | Default | Description |
+| --- | --- | --- | --- |
+| `MODEL_PATH` | Yes | `/models/merged_model` | Path **inside the container** to the Whisper model directory |
+| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the **host machine** that gets mounted into the container as `/models` |
+| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
+| `GEMINI_API_KEY` | For AI endpoints | — | Google Gemini API key |
+| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |
+> **Note:** If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503 Service Unavailable`.
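
The variables and defaults in the table above can be read with a small settings loader. This is a minimal sketch, not the server's actual configuration code: the `Settings` class and `load_settings` function are hypothetical names introduced for illustration. `MODEL_DIR` is left out because it is a host path used only for the container mount.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables documented above."""
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        # Without a key, the Gemini-backed endpoints answer 503.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the documented defaults."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=env.get("DEVICE") or None,              # blank -> auto-detect
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.
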
+---
+## Starting the Server
+### Option A — Docker (Recommended)
+This runs FastAPI behind an Nginx reverse proxy, with GPU support.
+**Step 1 — Make sure `.env` is configured** (see [Environment Setup](#environment-setup) above).
+**Step 2 — Build and start all services:**
+```bash
+docker compose up --build -d
+```
+This will:
+1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
+2. Start the `stt-api` container (FastAPI on port 8000 internally)
+3. Start the `stt-nginx` container (Nginx on port **80** externally)
+4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)
+**Step 3 — Verify the server is healthy:**
+```bash
+curl http://localhost/health
+```
+Expected response when ready:
+```json
+{
+  "status": "ok",
+  "whisper_loaded": true,
+  "gemini_available": true,
+  "model_path": "/models/merged_model"
+}
+```
+If `whisper_loaded` is `false`, the model failed to load — check container logs:
+```bash
+docker compose logs api
+```
+**Step 4 — Send your first request:**
+```bash
+curl -X POST http://localhost/api/v1/transcribe \
+  -F "audio=@/path/to/your/audio.mp3"
+```
+---
+**Useful Docker commands:**
+```bash
+# View live logs
+docker compose logs -f api
+# Stop all services
+docker compose down
+# Restart after a code change (rebuild image)
+docker compose up --build -d
+# Check container status
+docker compose ps
+```
+---
+**CPU-only deployment:**
+If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:
+```yaml
+# Delete these lines from the `api` service:
+deploy:
+  resources:
+    reservations:
+      devices:
+        - driver: nvidia
+          count: 1
+          capabilities: [gpu]
+```
+Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.
+
+---
+### Option B — Local Development (no Docker)
+**Step 1 — Install system dependencies:**
+On Ubuntu/Debian:
+```bash
+sudo apt-get install -y ffmpeg libsndfile1
+```
+On macOS (Homebrew):
+```bash
+brew install ffmpeg libsndfile
+```
+On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.
+**Step 2 — Create and activate a virtual environment:**
+```bash
+python -m venv .venv
+# Activate it: run only the line that matches your OS
+source .venv/bin/activate        # Linux/macOS
+.venv\Scripts\activate           # Windows
+```
+**Step 3 — Install API dependencies:**
+```bash
+pip install -r requirements-api.txt
+```
+**Step 4 — Create your `.env` file** (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:
+```env
+MODEL_PATH=outputs/checkpoints/merged_model
+GEMINI_API_KEY=your_gemini_api_key_here
+```
+**Step 5 — Start the server:**
+```bash
+uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
+```
+The server will be available at `http://localhost:8000`.
+> Remove `--reload` in production: it restarts the server whenever a source file changes and is intended for development only.
+
+**Step 6 — Verify:**
+```bash
+curl http://localhost:8000/health
+```
+---
+## API Reference
+All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.
+**Supported audio formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`
+**Maximum file size:** 200 MB
+**Base URL:**
+- Docker deployment: `http://localhost` (port 80, via Nginx)
+- Local development: `http://localhost:8000`
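
A client can check the format and size limits above before uploading, which avoids sending a large file only to have it rejected. The sketch below is an illustration under assumptions: `check_upload` is a hypothetical helper, and "200 MB" is interpreted as 200 MiB, which may differ from how the server counts.

```python
from pathlib import Path

# Extensions listed under "Supported audio formats".
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
# Assumption: "200 MB" is treated as 200 MiB here.
MAX_BYTES = 200 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> None:
    """Raise ValueError if the file would likely be rejected by the API."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
```

For a file on disk, the size can be taken from `Path(path).stat().st_size`.
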
+---
+### GET /health
+Check the server status and which services are loaded.
+**Request:**
+```bash
+curl http://localhost/health
+```
+**Response `200 OK`:**
+```json
+{
+  "status": "ok",
+  "whisper_loaded": true,
+  "gemini_available": true,
+  "model_path": "/models/merged_model"
+}
+```
+| Field | Type | Description |
+| --- | --- | --- |
+| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
+| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
+| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
+| `model_path` | `string` | The model path the server loaded from |
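
Because the model can take a minute or two to load, scripts that start the stack and then send requests may want to poll `/health` first. The following is a rough sketch using only the standard library. The function names, the 180-second timeout, and the 5-second interval are assumptions chosen for illustration.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
) -> dict:
    """Poll until `whisper_loaded` is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            pass  # server is not accepting connections yet
        if time.monotonic() >= deadline:
            raise TimeoutError("Whisper model did not load in time")
        time.sleep(interval_s)
```

The `fetch` parameter is injectable so the polling logic can be tried without a running server.
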
+---
+### POST /api/v1/transcribe
+Transcribe an audio file using Whisper only. No post-processing is applied — returns raw Arabic text directly from the model.
+**When to use:** You need a fast transcript and do not need speaker labels or error correction.
+**Request:**
+```bash
+curl -X POST http://localhost/api/v1/transcribe \
+  -F "audio=@recording.mp3"
+```
+**Response `200 OK`:**
+```json
+{
+  "audio_filename": "recording.mp3",
+  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
+}
+```
+| Field | Type | Description |
+| --- | --- | --- |
+| `audio_filename` | `string` | Name of the uploaded file |
+| `transcript` | `string` | Raw Arabic text from Whisper |
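
The same request can be made from Python without third-party packages. This is a sketch under assumptions rather than an official client: `encode_multipart` and `transcribe` are hypothetical helpers, and the base URL assumes the Docker deployment.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a multipart/form-data body containing one file field.

    Returns (body, value for the Content-Type header).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> dict:
    """POST the file to /api/v1/transcribe and return the decoded JSON."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart("audio", os.path.basename(path), f.read())
    req = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `transcribe("recording.mp3")["transcript"]` would then yield the raw text. For local development, pass `base_url="http://localhost:8000"`.
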
+---
+### POST /api/v1/transcribe/autocorrect
+Transcribe with Whisper, then send the raw transcript to Gemini for **phonetic and orthographic correction only**. No speaker labels are added — returns a single continuous Arabic text.
+**When to use:** You need clean, corrected Arabic text but do not care who said what.
+**Requires:** `GEMINI_API_KEY`
+**Request:**
+```bash
+curl -X POST http://localhost/api/v1/transcribe/autocorrect \
+  -F "audio=@recording.mp3"
+```
+**Response `200 OK`:**
+```json
+{
+  "audio_filename": "recording.mp3",
+  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
+  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
+}
+```
+| Field | Type | Description |
+| --- | --- | --- |
+| `audio_filename` | `string` | Name of the uploaded file |
+| `transcript` | `string` | Raw Whisper output (unmodified) |
+| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |
+---
+### POST /api/v1/transcribe/corrected
+Transcribe with Whisper, then send the transcript to Gemini, which returns a **speaker-separated, phonetically corrected** version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).
+**When to use:** You need a clean, readable transcript that shows who said what.
+**Requires:** `GEMINI_API_KEY`
+**Request:**
+```bash
+curl -X POST http://localhost/api/v1/transcribe/corrected \
+  -F "audio=@recording.mp3"
+```
+**Response `200 OK`:**
+```json
+{
+  "audio_filename": "recording.mp3",
+  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
+  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
+}
+```
+| Field | Type | Description |
+| --- | --- | --- |
+| `audio_filename` | `string` | Name of the uploaded file |
+| `transcript` | `string` | Raw Whisper output (unmodified) |
+| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |
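
Since `corrected_transcript` arrives as one string, a consumer usually needs to split it into turns. The sketch below assumes each turn starts on its own line with a `SPEAKER_xx:` label, as in the sample response. Real output may vary, so unlabelled lines are folded into the previous turn. `Turn` and `split_turns` are hypothetical names.

```python
import re
from typing import NamedTuple

# Label-to-role mapping stated in the endpoint description.
ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
_TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


class Turn(NamedTuple):
    role: str
    text: str


def split_turns(corrected_transcript: str) -> list[Turn]:
    """Split a speaker-labelled transcript into (role, text) turns."""
    turns: list[Turn] = []
    for raw in corrected_transcript.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _TURN.match(line)
        if match:
            label, text = match.groups()
            turns.append(Turn(ROLES.get(label, label), text))
        elif turns:
            # Unlabelled line: treat it as a continuation of the previous turn.
            last = turns[-1]
            turns[-1] = last._replace(text=f"{last.text} {line}")
        else:
            turns.append(Turn("Unknown", line))
    return turns
```
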
+---
+### POST /api/v1/transcribe/analyze
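+The most comprehensive endpoint. Transcribes the audio, then runs a full **Gemini call analysis** that returns the speaker-labelled transcript together with structured fields: participant names, unit and project, call type, satisfaction score, urgency, pain points, and follow-up actions.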
+**When to use:** You want a complete picture of the call — who spoke, what happened, what needs follow-up.
+**Requires:** `GEMINI_API_KEY`
+**Request:**
+```bash
+curl -X POST http://localhost/api/v1/transcribe/analyze \
+  -F "audio=@recording.mp3"
+```
+**Response `200 OK`:**
+```json
+{
+  "audio_filename": "recording.mp3",
+  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
+  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
+  "agent_name": "أحمد",
+  "customer_name": "محمد السيد",
+  "unit_number": ["B2-401"],
+  "project_name": "IL BOSCO",
+  "department_mentioned": "Sales",
+  "call_type": "Inbound",
+  "customer_satisfaction": 3,
+  "is_urgent": false,
+  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
+  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
+  "next_steps": ["متابعة العميل خلال 48 ساعة"]
+}
+```
+**Response fields:**
+| Field | Type | Description |
+| --- | --- | --- |
+| `audio_filename` | `string` | Name of the uploaded file |
+| `transcript` | `string` | Raw Whisper output (unmodified) |
+| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
+| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
+| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
+| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
+| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
+| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
+| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
+| `customer_satisfaction` | `integer` | Satisfaction score **1–5** inferred from tone (1 = very unhappy, 5 = very happy) |
+| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
+| `pain_points` | `string[]` | List of issues or complaints mentioned |
+| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
+| `next_steps` | `string[]` | Follow-up actions identified |
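
On the client side, the response can be loaded into a typed structure so that missing or out-of-range values fail early. This is an illustrative sketch only: `CallAnalysis` is a hypothetical class that mirrors the field table above, and the checks encode the documented ranges, not the server's own validation.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class CallAnalysis:
    audio_filename: str
    transcript: str
    cleaned_transcript: str
    call_type: str
    customer_satisfaction: int
    is_urgent: bool
    agent_name: Optional[str] = None
    customer_name: Optional[str] = None
    project_name: Optional[str] = None
    department_mentioned: Optional[str] = None
    unit_number: list[str] = field(default_factory=list)
    pain_points: list[str] = field(default_factory=list)
    action_items_promised: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Ranges taken from the field descriptions above.
        if self.call_type not in ("Inbound", "Outbound"):
            raise ValueError(f"unexpected call_type: {self.call_type!r}")
        if not 1 <= self.customer_satisfaction <= 5:
            raise ValueError(f"satisfaction out of range: {self.customer_satisfaction}")

    @classmethod
    def from_response(cls, payload: dict) -> "CallAnalysis":
        """Build from the decoded JSON body, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```
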
+---
+## Error Codes
+| Code | Meaning | How to fix |
+| --- | --- | --- |
+| `200` | Success | — |
+| `413` | File exceeds 200 MB limit | Compress or trim the audio |
+| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
+| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
+| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
+| `503` | Service unavailable | Whisper did not load, or Gemini is unavailable (for example, `GEMINI_API_KEY` is not set); check the logs and `.env` |
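
A client can turn these status codes into actionable messages. In the sketch below, the hint texts paraphrase the table. The retry rule is an assumption made for illustration (only `502` and `503` are treated as possibly transient), not documented server behaviour. `describe_error` and `should_retry` are hypothetical names.

```python
# Hints paraphrased from the error table above.
ERROR_HINTS = {
    413: "File exceeds the 200 MB limit: compress or trim the audio.",
    422: "Unsupported audio format: use .wav, .mp3, .m4a, .flac, .ogg, or .webm.",
    500: "Whisper transcription failed: check the server logs.",
    502: "Gemini call failed: check GEMINI_API_KEY and network access to Google APIs.",
    503: "Service not ready: Whisper or Gemini did not initialise; check the logs.",
}

# Assumption: only upstream or start-up problems are worth retrying.
RETRYABLE = {502, 503}


def describe_error(status: int) -> str:
    """Return a human-readable hint for a non-200 status code."""
    return ERROR_HINTS.get(status, f"Unexpected HTTP status {status}.")


def should_retry(status: int) -> bool:
    """True if a later retry might succeed under the assumption above."""
    return status in RETRYABLE
```
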
+---
+## Interactive Docs (Swagger UI)
+FastAPI automatically generates interactive API documentation.
+| URL | Description |
+| --- | --- |
+| `http://localhost/docs` | Swagger UI — try endpoints directly in the browser |
+| `http://localhost/redoc` | ReDoc — clean, readable reference |
+| `http://localhost/openapi.json` | Raw OpenAPI schema (JSON) |
+> For local development (no Docker), replace `localhost` with `localhost:8000`.
+---
+## Training Pipeline
+### Project structure
+```
+.
+├── config/
+│   └── training_config.yaml    # All hyperparameters in one place
+├── data/
+│   ├── raw/
+│   │   ├── audio/              ← put your audio files here (.mp3, .wav, …)
+│   │   └── transcripts/        ← matching .txt transcript files (same filename stem)
+│   └── processed/              ← auto-generated (segments + HF dataset)
+├── src/
+│   ├── data_preparation/
+│   │   ├── parse_transcripts.py
+│   │   ├── segment_audio.py
+│   │   └── build_dataset.py
+│   ├── training/
+│   │   └── trainer.py
+│   └── inference/
+│       ├── transcribe.py
+│       └── analyze_call.py
+├── scripts/
+│   ├── import_existing_data.py ← run once to import files from project root
+│   ├── prepare_data.py         ← step 1: build dataset
+│   ├── train.py                ← step 2: fine-tune
+│   └── transcribe.py           ← step 3: run inference CLI
+├── api/                        ← FastAPI server
+├── nginx/                      ← Nginx config
+├── Dockerfile
+└── docker-compose.yml
+```
+### Transcript format
+Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):
+```
+0.0: سيادة الكولونيل، صبرك في محله،
+3.076: مبروك علينا،
+4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
+```
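
A quick way to check a transcript file against this format is to parse it. The function below is a minimal sketch, not the project's `parse_transcripts.py`: it assumes each non-empty line is `<seconds>: <text>` and that timestamps never decrease. `parse_transcript` is a hypothetical name.

```python
import re

# "<seconds as float>: <text>", for example "3.076: some text"
_ENTRY = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*:\s*(.+?)\s*$")


def parse_transcript(text: str) -> list[tuple[float, str]]:
    """Parse timestamped lines into (start_seconds, text) pairs."""
    entries: list[tuple[float, str]] = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        match = _ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected '<seconds>: <text>'")
        start = float(match.group(1))
        if entries and start < entries[-1][0]:
            raise ValueError(f"line {lineno}: timestamp goes backwards")
        entries.append((start, match.group(2)))
    return entries
```
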
+### Step 1 — Install dependencies
+```bash
+pip install -r requirements.txt
+```
+### Step 2 — Add your data
+Option A — files already in the project root:
+```bash
+python scripts/import_existing_data.py
+```
+Option B — place files directly:
+- Copy audio → `data/raw/audio/my_file.mp3`
+- Copy transcript → `data/raw/transcripts/my_file.txt` *(same stem)*
+### Step 3 — Prepare the dataset
+```bash
+python scripts/prepare_data.py
+```
+Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace `DatasetDict` saved to `data/processed/`.
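
One plausible way to form segments of at most 25 seconds is to group consecutive transcript entries greedily. The sketch below illustrates that idea only and is not the project's `segment_audio.py`. It assumes each entry ends where the next one starts, and that the total audio duration is known. `group_segments` is a hypothetical name.

```python
def group_segments(
    entries: list[tuple[float, str]],
    total_duration: float,
    max_len: float = 25.0,
) -> list[tuple[float, float, str]]:
    """Greedily merge consecutive entries into (start, end, text) segments.

    A single entry longer than `max_len` is kept as its own over-long
    segment; a real pipeline would need to split or drop it.
    """
    # Each entry ends where the next begins; the last ends with the audio.
    ends = [start for start, _ in entries[1:]] + [total_duration]
    segments: list[tuple[float, float, str]] = []
    seg_start, seg_end, texts = 0.0, 0.0, []
    for (start, text), end in zip(entries, ends):
        if texts and end - seg_start > max_len:
            # Adding this entry would exceed the limit: close the segment.
            segments.append((seg_start, seg_end, " ".join(texts)))
            texts = []
        if not texts:
            seg_start = start
        texts.append(text)
        seg_end = end
    if texts:
        segments.append((seg_start, seg_end, " ".join(texts)))
    return segments
```

Each resulting `(start, end)` pair could then be used to cut the corresponding WAV slice.
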
+### Step 4 — Fine-tune
+```bash
+python scripts/train.py
+# Resume from a checkpoint
+python scripts/train.py --resume outputs/checkpoints/checkpoint-500
+```
+### Step 5 — Transcribe via CLI
+```bash
+# Use the fine-tuned model (auto-detected)
+python scripts/transcribe.py path/to/audio.mp3
+# Specify a model explicitly
+python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3
+# Save output to file
+python scripts/transcribe.py audio.mp3 --output result.txt
+```
+### Adding more data later
+1. Drop new audio files into `data/raw/audio/` and their matching `.txt` transcripts (same filename stem) into `data/raw/transcripts/`.
+2. Re-run `python scripts/prepare_data.py`, which rebuilds the dataset from scratch.
+3. Re-run `python scripts/train.py`.
+### Configuration
+Edit `config/training_config.yaml` to change:
+- `model.base_model` — swap to `openai/whisper-medium` for faster training
+- `training.per_device_train_batch_size` — reduce if out of GPU memory
+- `training.fp16: false` — disable on CPU or older GPUs
+- `data.max_segment_duration` — segment length (max 30 s for Whisper)
+### GPU requirements
+| Model | Min VRAM | Recommended |
+| --- | --- | --- |
+| whisper-large-v3 | 16 GB | 24 GB or more (e.g. A10, A100) |
+| whisper-medium | 8 GB | 16 GB |
+| whisper-small | 4 GB | 8 GB |
+Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.