---
title: Speech To Text API
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.

---

## Table of Contents

1. [Project Overview](#project-overview)
2. [Prerequisites](#prerequisites)
3. [Environment Setup](#environment-setup)
4. [Starting the Server](#starting-the-server)
   - [Option A – Docker (Recommended)](#option-a--docker-recommended)
   - [Option B – Local Development (no Docker)](#option-b--local-development-no-docker)
5. [API Reference](#api-reference)
   - [GET /health](#get-health)
   - [POST /api/v1/transcribe](#post-apiv1transcribe)
   - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
   - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
   - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
6. [Error Codes](#error-codes)
7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
8. [Training Pipeline](#training-pipeline)

---

## Project Overview

This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.

**Stack:**

- **Inference:** Whisper (HuggingFace Transformers) + Silero VAD
- **Post-processing:** Google Gemini (speaker diarisation, entity extraction, call analysis)
- **API:** FastAPI + Uvicorn
- **Reverse proxy:** Nginx
- **Container:** Docker + Docker Compose

---

## Prerequisites

### For Docker deployment (recommended)

| Requirement | Version |
| --- | --- |
| Docker | ≥ 24 |
| Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
| NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
| NVIDIA GPU driver | ≥ 525 (for CUDA 12) |

### For local development (no Docker)

| Requirement | Version |
| --- | --- |
| Python | 3.10 or 3.11 |
| ffmpeg | Any recent version |
| libsndfile | Any recent version (Linux/macOS) |
| CUDA toolkit | 12.x (optional, for GPU) |

---

## Environment Setup

**Step 1 – Copy the example environment file:**

```bash
cp .env.example .env
```

**Step 2 – Open `.env` and fill in your values:**

```env
# Path inside the container where the model will be mounted
MODEL_PATH=/models/merged_model

# Host machine path to your model directory (mounted into the container)
MODEL_DIR=/opt/stt/models

# Inference device: "cuda" or "cpu" (leave blank to auto-detect)
DEVICE=cuda

# Required for /autocorrect, /corrected, and /analyze endpoints
GEMINI_API_KEY=your_gemini_api_key_here
GEMINI_MODEL=gemini-2.5-flash
```

**Key variables explained:**

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `MODEL_PATH` | Yes | `/models/merged_model` | Path **inside the container** to the Whisper model directory |
| `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the **host machine** that gets mounted into the container as `/models` |
| `DEVICE` | No | auto-detect | `cuda` or `cpu` |
| `GEMINI_API_KEY` | For AI endpoints | (none) | Google Gemini API key |
| `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |

> **Note:** If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503 Service Unavailable`.
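
The variable table can be mirrored in a small settings loader. The following is a minimal sketch, not the project's actual configuration code: the `Settings` class and `load_settings` function are hypothetical names, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        # Without a key, the Gemini-backed endpoints are documented to return 503.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, applying the defaults from the table."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=env.get("DEVICE") or None,              # blank -> auto-detect
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.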

---

## Starting the Server

### Option A – Docker (Recommended)

This runs FastAPI behind an Nginx reverse proxy, with GPU support.

**Step 1 – Make sure `.env` is configured** (see [Environment Setup](#environment-setup) above).

**Step 2 – Build and start all services:**

```bash
docker compose up --build -d
```

This will:
1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
2. Start the `stt-api` container (FastAPI on port 8000 internally)
3. Start the `stt-nginx` container (Nginx on port **80** externally)
4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)

**Step 3 – Verify the server is healthy:**

```bash
curl http://localhost/health
```

Expected response when ready:
```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

If `whisper_loaded` is `false`, the model failed to load; check the container logs:

```bash
docker compose logs api
```

**Step 4 – Send your first request:**

```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@/path/to/your/audio.mp3"
```

---

**Useful Docker commands:**

```bash
# View live logs
docker compose logs -f api

# Stop all services
docker compose down

# Restart after a code change (rebuild image)
docker compose up --build -d

# Check container status
docker compose ps
```

---

**CPU-only deployment:**

If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:

```yaml
# Delete these lines from the `api` service:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
```

Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.

---

### Option B – Local Development (no Docker)

**Step 1 – Install system dependencies:**

On Ubuntu/Debian:
```bash
sudo apt-get install -y ffmpeg libsndfile1
```

On macOS (Homebrew):
```bash
brew install ffmpeg libsndfile
```

On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.

**Step 2 – Create and activate a virtual environment:**

```bash
python -m venv .venv
source .venv/bin/activate        # Linux/macOS
.venv\Scripts\activate           # Windows
```

**Step 3 – Install API dependencies:**

```bash
pip install -r requirements-api.txt
```

**Step 4 – Create your `.env` file** (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:

```env
MODEL_PATH=outputs/checkpoints/merged_model
GEMINI_API_KEY=your_gemini_api_key_here
```

**Step 5 – Start the server:**

```bash
uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
```

The server will be available at `http://localhost:8000`.

> Remove `--reload` in production: it watches for file changes and restarts the server, which is only useful during development.

**Step 6 – Verify:**

```bash
curl http://localhost:8000/health
```

---

## API Reference

All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.

**Supported audio formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`

**Maximum file size:** 200 MB

**Base URL:**
- Docker deployment: `http://localhost` (port 80, via Nginx)
- Local development: `http://localhost:8000`
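
The format and size limits can be checked on the client before uploading, which avoids a round trip that would end in an error. This is a sketch under two assumptions: that "200 MB" means 200 MiB, and that the server decides on the file extension. `check_upload` is a hypothetical helper, not part of the API.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_BYTES = 200 * 1024 * 1024  # assumption: the 200 MB limit is counted in MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append("unsupported format (the server would answer 422)")
    if size_bytes > MAX_BYTES:
        problems.append("file too large (the server would answer 413)")
    return problems
```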

---

### GET /health

Check the server status and which services are loaded.

**Request:**
```bash
curl http://localhost/health
```

**Response `200 OK`:**
```json
{
  "status": "ok",
  "whisper_loaded": true,
  "gemini_available": true,
  "model_path": "/models/merged_model"
}
```

| Field | Type | Description |
| --- | --- | --- |
| `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
| `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
| `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
| `model_path` | `string` | The model path the server loaded from |
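
Because Whisper can take a while to load, a deployment script may want to poll `/health` until `whisper_loaded` turns `true`. The sketch below uses only the standard library; `fetch_health` and `wait_until_ready` are hypothetical names, and the timeout and interval values are arbitrary choices rather than project defaults.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch: Callable[[], dict] = fetch_health,
                     timeout_s: float = 180.0,
                     interval_s: float = 5.0) -> dict:
    """Poll until whisper_loaded is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            pass  # connection refused or HTTP error: the server is not ready yet
        if time.monotonic() >= deadline:
            raise TimeoutError("Whisper model did not report as loaded in time")
        time.sleep(interval_s)
```

The `fetch` parameter is injectable so the loop can be tested without a running server.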

---

### POST /api/v1/transcribe

Transcribe an audio file using Whisper only. No post-processing is applied: the endpoint returns raw Arabic text directly from the model.

**When to use:** You need a fast transcript and do not need speaker labels or error correction.

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Arabic text from Whisper |
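
The same request can be sent from Python without third-party packages. This is a minimal sketch: `build_multipart` and `transcribe` are hypothetical helpers, and sending the part as `application/octet-stream` assumes the server does not depend on the part's content type.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> dict:
    """POST the file to /api/v1/transcribe and return the decoded JSON."""
    audio = Path(path)
    body, content_type = build_multipart("audio", audio.name, audio.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `transcribe("recording.mp3")["transcript"]` would then yield the raw text, provided the server is reachable at the base URL.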

---

### POST /api/v1/transcribe/autocorrect

Transcribe with Whisper, then send the raw transcript to Gemini for **phonetic and orthographic correction only**. No speaker labels are added; the endpoint returns a single continuous Arabic text.

**When to use:** You need clean, corrected Arabic text but do not care who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/autocorrect \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
  "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |

---

### POST /api/v1/transcribe/corrected

Transcribe with Whisper, then send the transcript to Gemini, which returns a **speaker-separated, phonetically corrected** version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).

**When to use:** You need a clean, readable transcript that shows who said what.

**Requires:** `GEMINI_API_KEY`

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/corrected \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
}
```

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |
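
Since `corrected_transcript` is a single string with one labelled turn per line, a client will usually want to split it into turns. The sketch below assumes every turn starts with `SPEAKER_<n>:` and treats unlabelled lines as continuations of the previous turn; `split_turns` is a hypothetical helper and the continuation rule is an assumption, not documented behaviour.

```python
import re

# Role mapping as documented for this endpoint.
ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
_TURN = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


def split_turns(corrected_transcript: str) -> list[tuple[str, str]]:
    """Return (role, text) pairs in conversation order."""
    turns: list[tuple[str, str]] = []
    for line in corrected_transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _TURN.match(line)
        if match:
            label, text = match.groups()
            turns.append((ROLES.get(label, label), text))
        elif turns:
            # Unlabelled line: append it to the previous turn.
            role, text = turns[-1]
            turns[-1] = (role, f"{text} {line}")
        else:
            turns.append(("Unknown", line))
    return turns
```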

---

### POST /api/v1/transcribe/analyze

The most comprehensive endpoint: it transcribes the audio, then runs a full **Gemini call analysis** that extracts structured information from the conversation.

**When to use:** You want a complete picture of the call: who spoke, what happened, and what needs follow-up.

**Requires:** `GEMINI_API_KEY`

**Request:**
```bash
curl -X POST http://localhost/api/v1/transcribe/analyze \
  -F "audio=@recording.mp3"
```

**Response `200 OK`:**
```json
{
  "audio_filename": "recording.mp3",
  "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
  "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
  "agent_name": "أحمد",
  "customer_name": "محمد السيد",
  "unit_number": ["B2-401"],
  "project_name": "IL BOSCO",
  "department_mentioned": "Sales",
  "call_type": "Inbound",
  "customer_satisfaction": 3,
  "is_urgent": false,
  "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
  "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
  "next_steps": ["متابعة العميل خلال 48 ساعة"]
}
```

**Response fields:**

| Field | Type | Description |
| --- | --- | --- |
| `audio_filename` | `string` | Name of the uploaded file |
| `transcript` | `string` | Raw Whisper output (unmodified) |
| `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
| `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
| `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
| `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
| `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
| `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
| `call_type` | `string` | `"Inbound"` or `"Outbound"` |
| `customer_satisfaction` | `integer` | Satisfaction score **1–5** inferred from tone (1 = very unhappy, 5 = very happy) |
| `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
| `pain_points` | `string[]` | List of issues or complaints mentioned |
| `action_items_promised` | `string[]` | Commitments made by the agent during the call |
| `next_steps` | `string[]` | Follow-up actions identified |
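
A consumer of this response typically needs to handle the nullable fields and decide which calls to escalate. The sketch below applies the documented urgency rule (satisfaction ≤ 2, or `is_urgent` set) on the client side; `needs_escalation` and `follow_up_summary` are hypothetical helpers, and the summary format is purely illustrative.

```python
from typing import Any


def needs_escalation(analysis: dict[str, Any]) -> bool:
    """Apply the documented urgency rule, even if is_urgent is missing."""
    score = analysis.get("customer_satisfaction")
    low_score = isinstance(score, int) and score <= 2
    return bool(analysis.get("is_urgent")) or low_score


def follow_up_summary(analysis: dict[str, Any]) -> str:
    """Build a one-line summary; nullable fields fall back to placeholders."""
    who = analysis.get("customer_name") or "unknown customer"
    project = analysis.get("project_name") or "unknown project"
    todo = analysis.get("action_items_promised", []) + analysis.get("next_steps", [])
    flag = "URGENT" if needs_escalation(analysis) else "normal"
    return f"[{flag}] {who} / {project}: {len(todo)} open item(s)"
```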

---

## Error Codes

| Code | Meaning | How to fix |
| --- | --- | --- |
| `200` | Success | n/a |
| `413` | File exceeds 200 MB limit | Compress or trim the audio |
| `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
| `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
| `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
| `503` | Model not loaded | Whisper or Gemini did not initialise; check the logs |
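
On the client side, the table can be turned into a small lookup that also decides whether a retry is worthwhile. This is a sketch: `explain` is a hypothetical helper, and treating only `502` and `503` as retryable is an assumption, not something the API specifies.

```python
ERROR_HINTS = {
    413: "File exceeds the 200 MB limit: compress or trim the audio.",
    422: "Unsupported audio format.",
    500: "Whisper transcription failed: check the server logs.",
    502: "Gemini call failed: check GEMINI_API_KEY and network access.",
    503: "Model not loaded: Whisper or Gemini did not initialise.",
}

# Assumption: only upstream and model-loading failures may succeed on retry.
RETRYABLE = {502, 503}


def explain(status: int) -> tuple[str, bool]:
    """Return (hint, should_retry) for a response status code."""
    if status == 200:
        return "Success", False
    hint = ERROR_HINTS.get(status, f"Unexpected status {status}")
    return hint, status in RETRYABLE
```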

---

## Interactive Docs (Swagger UI)

FastAPI automatically generates interactive API documentation.

| URL | Description |
| --- | --- |
| `http://localhost/docs` | Swagger UI: try endpoints directly in the browser |
| `http://localhost/redoc` | ReDoc: clean, readable reference |
| `http://localhost/openapi.json` | Raw OpenAPI 3.0 schema |

> For local development (no Docker), replace `localhost` with `localhost:8000`.

---

## Training Pipeline

### Project structure

```
.
├── config/
│   └── training_config.yaml    # All hyperparameters in one place
├── data/
│   ├── raw/
│   │   ├── audio/              ← put your audio files here (.mp3, .wav, …)
│   │   └── transcripts/        ← matching .txt transcript files (same filename stem)
│   └── processed/              ← auto-generated (segments + HF dataset)
├── src/
│   ├── data_preparation/
│   │   ├── parse_transcripts.py
│   │   ├── segment_audio.py
│   │   └── build_dataset.py
│   ├── training/
│   │   └── trainer.py
│   └── inference/
│       ├── transcribe.py
│       └── analyze_call.py
├── scripts/
│   ├── import_existing_data.py ← run once to import files from project root
│   ├── prepare_data.py         ← step 1: build dataset
│   ├── train.py                ← step 2: fine-tune
│   └── transcribe.py           ← step 3: run inference CLI
├── api/                        ← FastAPI server
├── nginx/                      ← Nginx config
├── Dockerfile
└── docker-compose.yml
```

### Transcript format

Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):

```
0.0: سيادة الكولونيل، صبرك في محله،
3.076: مبروك علينا،
4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
```
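
To make the format concrete, here is a minimal parser for such files. It is an illustration only and not the code in `src/data_preparation/parse_transcripts.py`; the `Entry` type and `parse_transcript` function are hypothetical names. It splits on the first colon only, so colons inside the text are preserved.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    start: float   # seconds from the start of the audio
    text: str


def parse_transcript(raw: str) -> list[Entry]:
    """Parse '<seconds>: <text>' lines, skipping blank lines."""
    entries = []
    for number, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue
        stamp, sep, text = line.partition(":")
        if not sep:
            raise ValueError(f"line {number}: expected '<seconds>: <text>'")
        # float() raises ValueError if the timestamp is not a number.
        entries.append(Entry(float(stamp), text.strip()))
    return entries
```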

### Step 1 – Install dependencies

```bash
pip install -r requirements.txt
```

### Step 2 – Add your data

Option A: files already in the project root:
```bash
python scripts/import_existing_data.py
```

Option B: place files directly:
- Copy audio → `data/raw/audio/my_file.mp3`
- Copy transcript → `data/raw/transcripts/my_file.txt` *(same stem)*
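
Before preparing the dataset, it can help to confirm that every audio file has a transcript with the same stem, and vice versa. The check below is a sketch that assumes the directory layout described above; `unmatched_files` is a hypothetical helper and not one of the project's scripts.

```python
from pathlib import Path


def unmatched_files(raw_dir: str = "data/raw") -> tuple[set[str], set[str]]:
    """Return (audio stems without a transcript, transcript stems without audio)."""
    root = Path(raw_dir)
    audio = {p.stem for p in (root / "audio").iterdir() if p.is_file()}
    text = {p.stem for p in (root / "transcripts").glob("*.txt")}
    return audio - text, text - audio
```

Two empty sets mean every file is paired.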

### Step 3 – Prepare the dataset

```bash
python scripts/prepare_data.py
```

This splits the audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace `DatasetDict` saved to `data/processed/`.
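
One way to picture the segmentation is a greedy merge of consecutive transcript entries until the next one would push the span past the limit. The sketch below is an illustration under stated assumptions, not the logic of `segment_audio.py`: it assumes each entry ends where the next one starts, that the total audio duration is known, and `group_segments` is a hypothetical name.

```python
def group_segments(starts: list[float], texts: list[str],
                   audio_duration: float, max_len: float = 25.0):
    """Greedily merge consecutive entries into (start, end, text) spans <= max_len."""
    # Assumption: an entry ends where the next one starts; the last ends with the audio.
    ends = starts[1:] + [audio_duration]
    segments = []
    seg_start, seg_end, seg_text = None, None, []
    for start, end, text in zip(starts, ends, texts):
        # Close the current segment if adding this entry would exceed the limit.
        if seg_text and end - seg_start > max_len:
            segments.append((seg_start, seg_end, " ".join(seg_text)))
            seg_text = []
        if not seg_text:
            seg_start = start
        seg_text.append(text)
        seg_end = end
    if seg_text:
        segments.append((seg_start, seg_end, " ".join(seg_text)))
    return segments
```

A single entry longer than `max_len` is kept as its own over-long segment here; a real pipeline would need to split or drop it.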

### Step 4 – Fine-tune

```bash
python scripts/train.py

# Resume from a checkpoint
python scripts/train.py --resume outputs/checkpoints/checkpoint-500
```

### Step 5 – Transcribe via CLI

```bash
# Use the fine-tuned model (auto-detected)
python scripts/transcribe.py path/to/audio.mp3

# Specify a model explicitly
python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3

# Save output to file
python scripts/transcribe.py audio.mp3 --output result.txt
```

### Adding more data later

1. Drop new audio and transcript pairs (for example `audio.mp3` + `audio.txt`) into `data/raw/audio/` and `data/raw/transcripts/`.
2. Re-run `python scripts/prepare_data.py`, which rebuilds everything from scratch.
3. Re-run `python scripts/train.py`.

### Configuration

Edit `config/training_config.yaml` to change:
- `model.base_model`: swap to `openai/whisper-medium` for faster training
- `training.per_device_train_batch_size`: reduce if you run out of GPU memory
- `training.fp16`: set to `false` on CPU or older GPUs
- `data.max_segment_duration`: segment length (max 30 s for Whisper)

### GPU requirements

| Model | Min VRAM | Recommended |
| --- | --- | --- |
| whisper-large-v3 | 16 GB | 24 GB A10/A100 |
| whisper-medium | 8 GB | 16 GB |
| whisper-small | 4 GB | 8 GB |

Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.