MIP-Tech committed on
Commit
e333dd9
·
1 Parent(s): 0db822c

Add README with Space config

Files changed (1)
  1. README.md +579 -0
README.md ADDED
@@ -0,0 +1,579 @@
1
+ ---
2
+ title: Speech To Text API
3
+ emoji: 🎙️
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: docker
7
+ app_port: 7860
8
+ pinned: false
9
+ ---
10
+
11
+ Arabic speech transcription powered by a fine-tuned Whisper model, with optional Gemini post-processing for speaker diarisation, phonetic correction, and real estate call analysis.
12
+
13
+ ---
14
+
15
+ ## Table of Contents
16
+
17
+ 1. [Project Overview](#project-overview)
18
+ 2. [Prerequisites](#prerequisites)
19
+ 3. [Environment Setup](#environment-setup)
20
+ 4. [Starting the Server](#starting-the-server)
21
+    - [Option A – Docker (Recommended)](#option-a--docker-recommended)
22
+    - [Option B – Local Development (no Docker)](#option-b--local-development-no-docker)
23
+ 5. [API Reference](#api-reference)
24
+    - [GET /health](#get-health)
25
+    - [POST /api/v1/transcribe](#post-apiv1transcribe)
26
+    - [POST /api/v1/transcribe/autocorrect](#post-apiv1transcribeautocorrect)
27
+    - [POST /api/v1/transcribe/corrected](#post-apiv1transcribecorrected)
28
+    - [POST /api/v1/transcribe/analyze](#post-apiv1transcribeanalyze)
29
+ 6. [Error Codes](#error-codes)
30
+ 7. [Interactive Docs (Swagger UI)](#interactive-docs-swagger-ui)
31
+ 8. [Training Pipeline](#training-pipeline)
32
+
33
+ ---
34
+
35
+ ## Project Overview
36
+
37
+ This project fine-tunes `openai/whisper-large-v3` on Egyptian Arabic speech data (real estate sales calls from Misr Italia Properties) and exposes the model through a production-ready FastAPI service.
38
+
39
+ **Stack:**
40
+
41
+ - **Inference:** Whisper (HuggingFace Transformers) + Silero VAD
42
+ - **Post-processing:** Google Gemini (speaker diarisation, entity extraction, call analysis)
43
+ - **API:** FastAPI + Uvicorn
44
+ - **Reverse proxy:** Nginx
45
+ - **Container:** Docker + Docker Compose
46
+
47
+ ---
48
+
49
+ ## Prerequisites
50
+
51
+ ### For Docker deployment (recommended)
52
+
53
+ | Requirement | Version |
54
+ | --- | --- |
55
+ | Docker | ≥ 24 |
56
+ | Docker Compose | ≥ 2.20 (bundled with Docker Desktop) |
57
+ | NVIDIA Container Toolkit | Required for GPU; skip for CPU-only |
58
+ | NVIDIA GPU driver | ≥ 525 (for CUDA 12) |
59
+
60
+ ### For local development (no Docker)
61
+
62
+ | Requirement | Version |
63
+ | --- | --- |
64
+ | Python | 3.10 or 3.11 |
65
+ | ffmpeg | Any recent version |
66
+ | libsndfile | Any recent version (Linux/macOS) |
67
+ | CUDA toolkit | 12.x (optional, for GPU) |
68
+
69
+ ---
70
+
71
+ ## Environment Setup
72
+
73
+ **Step 1 – Copy the example environment file:**
74
+
75
+ ```bash
76
+ cp .env.example .env
77
+ ```
78
+
79
+ **Step 2 – Open `.env` and fill in your values:**
80
+
81
+ ```env
82
+ # Path inside the container where the model will be mounted
83
+ MODEL_PATH=/models/merged_model
84
+
85
+ # Host machine path to your model directory (mounted into the container)
86
+ MODEL_DIR=/opt/stt/models
87
+
88
+ # Inference device: "cuda" or "cpu" (leave blank to auto-detect)
89
+ DEVICE=cuda
90
+
91
+ # Required for /autocorrect, /corrected, and /analyze endpoints
92
+ GEMINI_API_KEY=your_gemini_api_key_here
93
+ GEMINI_MODEL=gemini-2.5-flash
94
+ ```
95
+
96
+ **Key variables explained:**
97
+
98
+ | Variable | Required | Default | Description |
99
+ | --- | --- | --- | --- |
100
+ | `MODEL_PATH` | Yes | `/models/merged_model` | Path **inside the container** to the Whisper model directory |
101
+ | `MODEL_DIR` | Yes | `/opt/stt/models` | Path on the **host machine** that gets mounted into the container as `/models` |
102
+ | `DEVICE` | No | auto-detect | `cuda` or `cpu` |
103
+ | `GEMINI_API_KEY` | For AI endpoints | none | Google Gemini API key |
104
+ | `GEMINI_MODEL` | No | `gemini-2.5-flash` | Gemini model to use |
105
+
106
+ > **Note:** If `GEMINI_API_KEY` is not set, the `/autocorrect`, `/corrected`, and `/analyze` endpoints will return `503 Service Unavailable`.
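The variables above could be read and validated roughly as follows. This is a minimal sketch, not the project's actual configuration code: the `Settings` class and `load_settings` helper are hypothetical names, and only the defaults are taken from the table.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values documented in the table above."""
    model_path: str
    device: Optional[str]          # None means "auto-detect"
    gemini_api_key: Optional[str]
    gemini_model: str

    @property
    def gemini_enabled(self) -> bool:
        # Without a key, the Gemini-backed endpoints are documented to return 503.
        return bool(self.gemini_api_key)


def load_settings(env: Mapping[str, str]) -> Settings:
    """Build Settings from an environment mapping such as os.environ."""
    device = env.get("DEVICE", "").strip().lower() or None
    if device not in (None, "cuda", "cpu"):
        raise ValueError(f"DEVICE must be 'cuda', 'cpu', or blank, got {device!r}")
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/merged_model"),
        device=device,
        gemini_api_key=env.get("GEMINI_API_KEY", "").strip() or None,
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
    )
```

Calling `load_settings(os.environ)` at startup would surface a misspelled `DEVICE` value immediately instead of at the first request.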
107
+
108
+ ---
109
+
110
+ ## Starting the Server
111
+
112
+ ### Option A – Docker (Recommended)
113
+
114
+ This runs FastAPI behind an Nginx reverse proxy, with GPU support.
115
+
116
+ **Step 1 – Make sure `.env` is configured** (see [Environment Setup](#environment-setup) above).
117
+
118
+ **Step 2 – Build and start all services:**
119
+
120
+ ```bash
121
+ docker compose up --build -d
122
+ ```
123
+
124
+ This will:
125
+ 1. Build the inference Docker image (installs Python deps, copies `src/inference/` and `api/`)
126
+ 2. Start the `stt-api` container (FastAPI on port 8000 internally)
127
+ 3. Start the `stt-nginx` container (Nginx on port **80** externally)
128
+ 4. Wait for the API health check before Nginx accepts traffic (Whisper can take 60–120 s to load)
129
+
130
+ **Step 3 – Verify the server is healthy:**
131
+
132
+ ```bash
133
+ curl http://localhost/health
134
+ ```
135
+
136
+ Expected response when ready:
137
+ ```json
138
+ {
139
+   "status": "ok",
140
+   "whisper_loaded": true,
141
+   "gemini_available": true,
142
+   "model_path": "/models/merged_model"
143
+ }
144
+ ```
145
+
146
+ If `whisper_loaded` is `false`, the model failed to load; check the container logs:
147
+
148
+ ```bash
149
+ docker compose logs api
150
+ ```
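Because Whisper can take a minute or two to load, a deployment script may want to poll `/health` rather than check it once. The sketch below uses only the standard library; `fetch_health` and `wait_until_ready` are hypothetical helper names, and the 180 s timeout is an assumption chosen to sit above the 60–120 s load time mentioned earlier.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost") -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(
    fetch: Callable[[], dict] = fetch_health,
    timeout_s: float = 180.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until `whisper_loaded` is true or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch()
            if health.get("whisper_loaded"):
                return health
        except OSError:
            pass  # connection refused or HTTP error: server not ready yet
        if time.monotonic() >= deadline:
            raise TimeoutError("Whisper did not report as loaded in time")
        sleep(interval_s)
```

The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running server.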
151
+
152
+ **Step 4 – Send your first request:**
153
+
154
+ ```bash
155
+ curl -X POST http://localhost/api/v1/transcribe \
156
+ -F "audio=@/path/to/your/audio.mp3"
157
+ ```
158
+
159
+ ---
160
+
161
+ **Useful Docker commands:**
162
+
163
+ ```bash
164
+ # View live logs
165
+ docker compose logs -f api
166
+
167
+ # Stop all services
168
+ docker compose down
169
+
170
+ # Restart after a code change (rebuild image)
171
+ docker compose up --build -d
172
+
173
+ # Check container status
174
+ docker compose ps
175
+ ```
176
+
177
+ ---
178
+
179
+ **CPU-only deployment:**
180
+
181
+ If you do not have an NVIDIA GPU, remove the `deploy` block from `docker-compose.yml`:
182
+
183
+ ```yaml
184
+ # Delete these lines from the `api` service:
185
+ deploy:
186
+   resources:
187
+     reservations:
188
+       devices:
189
+         - driver: nvidia
190
+           count: 1
191
+           capabilities: [gpu]
192
+ ```
193
+
194
+ Then set `DEVICE=cpu` in your `.env` file. Transcription will be significantly slower.
195
+
196
+ ---
197
+
198
+ ### Option B – Local Development (no Docker)
199
+
200
+ **Step 1 – Install system dependencies:**
201
+
202
+ On Ubuntu/Debian:
203
+ ```bash
204
+ sudo apt-get install -y ffmpeg libsndfile1
205
+ ```
206
+
207
+ On macOS (Homebrew):
208
+ ```bash
209
+ brew install ffmpeg libsndfile
210
+ ```
211
+
212
+ On Windows: install [ffmpeg](https://ffmpeg.org/download.html) and add it to `PATH`.
213
+
214
+ **Step 2 – Create and activate a virtual environment:**
215
+
216
+ ```bash
217
+ python -m venv .venv
218
+ source .venv/bin/activate # Linux/macOS
219
+ # On Windows, run this instead: .venv\Scripts\activate
220
+ ```
221
+
222
+ **Step 3 – Install API dependencies:**
223
+
224
+ ```bash
225
+ pip install -r requirements-api.txt
226
+ ```
227
+
228
+ **Step 4 – Create your `.env` file** (see [Environment Setup](#environment-setup)) and point `MODEL_PATH` to your local model directory:
229
+
230
+ ```env
231
+ MODEL_PATH=outputs/checkpoints/merged_model
232
+ GEMINI_API_KEY=your_gemini_api_key_here
233
+ ```
234
+
235
+ **Step 5 – Start the server:**
236
+
237
+ ```bash
238
+ uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
239
+ ```
240
+
241
+ The server will be available at `http://localhost:8000`.
242
+
243
+ > Remove `--reload` in production: it restarts the server whenever a source file changes and is meant for development only.
244
+
245
+ **Step 6 – Verify:**
246
+
247
+ ```bash
248
+ curl http://localhost:8000/health
249
+ ```
250
+
251
+ ---
252
+
253
+ ## API Reference
254
+
255
+ All transcription endpoints accept a `multipart/form-data` POST request with a single field named `audio`.
256
+
257
+ **Supported audio formats:** `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, `.webm`
258
+
259
+ **Maximum file size:** 200 MB
260
+
261
+ **Base URL:**
262
+ - Docker deployment: `http://localhost` (port 80, via Nginx)
263
+ - Local development: `http://localhost:8000`
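A client can check the documented format and size limits before uploading, which avoids a round trip that would end in `413` or `422`. This is a sketch under assumptions: `check_upload` is a hypothetical helper, and "200 MB" is interpreted as 200 × 1024 × 1024 bytes, which may differ from how the server counts.

```python
from pathlib import PurePath

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac", ".ogg", ".webm"}
MAX_UPLOAD_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" means 200 MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = PurePath(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format {ext or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 200 MB limit")
    return problems
```

For a file on disk, `check_upload(path, os.path.getsize(path))` gives the same answer without reading the audio into memory.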
264
+
265
+ ---
266
+
267
+ ### GET /health
268
+
269
+ Check the server status and which services are loaded.
270
+
271
+ **Request:**
272
+ ```bash
273
+ curl http://localhost/health
274
+ ```
275
+
276
+ **Response `200 OK`:**
277
+ ```json
278
+ {
279
+   "status": "ok",
280
+   "whisper_loaded": true,
281
+   "gemini_available": true,
282
+   "model_path": "/models/merged_model"
283
+ }
284
+ ```
285
+
286
+ | Field | Type | Description |
287
+ | --- | --- | --- |
288
+ | `status` | `string` | `"ok"` if Whisper is loaded, `"degraded"` otherwise |
289
+ | `whisper_loaded` | `boolean` | Whether the Whisper model loaded successfully |
290
+ | `gemini_available` | `boolean` | Whether the Gemini analyzer is ready (requires `GEMINI_API_KEY`) |
291
+ | `model_path` | `string` | The model path the server loaded from |
292
+
293
+ ---
294
+
295
+ ### POST /api/v1/transcribe
296
+
297
+ Transcribe an audio file using Whisper only. No post-processing is applied; the endpoint returns raw Arabic text directly from the model.
298
+
299
+ **When to use:** You need a fast transcript and do not need speaker labels or error correction.
300
+
301
+ **Request:**
302
+ ```bash
303
+ curl -X POST http://localhost/api/v1/transcribe \
304
+ -F "audio=@recording.mp3"
305
+ ```
306
+
307
+ **Response `200 OK`:**
308
+ ```json
309
+ {
310
+   "audio_filename": "recording.mp3",
311
+   "transcript": "ازيك يا فندم، أنا بتصل من شركة مصر إيطاليا عشان..."
312
+ }
313
+ ```
314
+
315
+ | Field | Type | Description |
316
+ | --- | --- | --- |
317
+ | `audio_filename` | `string` | Name of the uploaded file |
318
+ | `transcript` | `string` | Raw Arabic text from Whisper |
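The same request can be sent from Python without third-party packages. The sketch below builds the `multipart/form-data` body by hand; `encode_multipart` and `transcribe` are hypothetical helpers, the `application/octet-stream` part type is an assumption (the server may only look at the file extension), and filenames containing double quotes are not escaped.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, base_url: str = "http://localhost") -> str:
    """POST an audio file to /api/v1/transcribe and return the `transcript` field."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart("audio", os.path.basename(path), f.read())
    request = urllib.request.Request(
        f"{base_url}/api/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["transcript"]
```

The form field must be named `audio`, matching the `-F "audio=@..."` argument in the curl example.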
319
+
320
+ ---
321
+
322
+ ### POST /api/v1/transcribe/autocorrect
323
+
324
+ Transcribe with Whisper, then send the raw transcript to Gemini for **phonetic and orthographic correction only**. No speaker labels are added; the response is a single continuous Arabic text.
325
+
326
+ **When to use:** You need clean, corrected Arabic text but do not care who said what.
327
+
328
+ **Requires:** `GEMINI_API_KEY`
329
+
330
+ **Request:**
331
+ ```bash
332
+ curl -X POST http://localhost/api/v1/transcribe/autocorrect \
333
+ -F "audio=@recording.mp3"
334
+ ```
335
+
336
+ **Response `200 OK`:**
337
+ ```json
338
+ {
338
+   "audio_filename": "recording.mp3",
339
+   "transcript": "ازيك يا فندم انا بتصل من شركة مصر ايطاليا...",
340
+   "corrected_transcript": "أزيك يا فندم، أنا بتصل من شركة مصر إيطاليا..."
342
+ }
343
+ ```
344
+
345
+ | Field | Type | Description |
346
+ | --- | --- | --- |
347
+ | `audio_filename` | `string` | Name of the uploaded file |
348
+ | `transcript` | `string` | Raw Whisper output (unmodified) |
349
+ | `corrected_transcript` | `string` | Phonetically and orthographically corrected Arabic text |
350
+
351
+ ---
352
+
353
+ ### POST /api/v1/transcribe/corrected
354
+
355
+ Transcribe with Whisper, then send the transcript to Gemini, which returns a **speaker-separated, phonetically corrected** version. Speakers are labelled as `SPEAKER_01` (Agent) and `SPEAKER_00` (Customer).
356
+
357
+ **When to use:** You need a clean, readable transcript that shows who said what.
358
+
359
+ **Requires:** `GEMINI_API_KEY`
360
+
361
+ **Request:**
362
+ ```bash
363
+ curl -X POST http://localhost/api/v1/transcribe/corrected \
364
+ -F "audio=@recording.mp3"
365
+ ```
366
+
367
+ **Response `200 OK`:**
368
+ ```json
369
+ {
369
+   "audio_filename": "recording.mp3",
370
+   "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
371
+   "corrected_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا، كيف أقدر أساعدك؟\nSPEAKER_00: أهلاً، أنا عايز أعرف تفاصيل الوحدة..."
373
+ }
374
+ ```
375
+
376
+ | Field | Type | Description |
377
+ | --- | --- | --- |
378
+ | `audio_filename` | `string` | Name of the uploaded file |
379
+ | `transcript` | `string` | Raw Whisper output (unmodified) |
380
+ | `corrected_transcript` | `string` | Speaker-labelled, corrected Arabic transcript (`SPEAKER_01` = Agent, `SPEAKER_00` = Customer) |
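Downstream code usually wants the speaker-labelled string as a list of turns. The sketch below assumes the layout shown in the example response, one `SPEAKER_NN: text` turn per line; the `Turn` type and `split_turns` helper are hypothetical, and unlabelled lines are treated as continuations of the previous turn.

```python
import re
from typing import NamedTuple

# Role mapping as documented for this endpoint.
ROLES = {"SPEAKER_01": "Agent", "SPEAKER_00": "Customer"}
_LABEL = re.compile(r"^(SPEAKER_\d+):\s*(.*)$")


class Turn(NamedTuple):
    speaker: str
    role: str
    text: str


def split_turns(corrected_transcript: str) -> list[Turn]:
    """Split a speaker-labelled transcript into one Turn per labelled line."""
    turns: list[Turn] = []
    for line in corrected_transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = _LABEL.match(line)
        if match:
            speaker, text = match.groups()
            turns.append(Turn(speaker, ROLES.get(speaker, "Unknown"), text))
        elif turns:
            # Unlabelled line: append to the previous turn.
            last = turns[-1]
            turns[-1] = last._replace(text=f"{last.text} {line}")
        # An unlabelled line before any label is dropped.
    return turns
```

Filtering the result by `role == "Customer"` then gives only the customer's side of the call.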
381
+
382
+ ---
383
+
384
+ ### POST /api/v1/transcribe/analyze
385
+
386
+ The most comprehensive endpoint: it transcribes the audio, then runs a full **Gemini call analysis** that extracts structured fields (names, unit numbers, satisfaction score, action items) from the conversation.
387
+
388
+ **When to use:** You want a complete picture of the call: who spoke, what happened, what needs follow-up.
389
+
390
+ **Requires:** `GEMINI_API_KEY`
391
+
392
+ **Request:**
393
+ ```bash
394
+ curl -X POST http://localhost/api/v1/transcribe/analyze \
395
+ -F "audio=@recording.mp3"
396
+ ```
397
+
398
+ **Response `200 OK`:**
399
+ ```json
400
+ {
400
+   "audio_filename": "recording.mp3",
401
+   "transcript": "ازيك يا فندم انا بتصل من مصر ايطاليا...",
402
+   "cleaned_transcript": "SPEAKER_01: أهلاً، معاك أحمد من مصر إيطاليا...\nSPEAKER_00: ...",
403
+   "agent_name": "أحمد",
404
+   "customer_name": "محمد السيد",
405
+   "unit_number": ["B2-401"],
406
+   "project_name": "IL BOSCO",
407
+   "department_mentioned": "Sales",
408
+   "call_type": "Inbound",
409
+   "customer_satisfaction": 3,
410
+   "is_urgent": false,
411
+   "pain_points": ["تأخير موعد التسليم", "عدم وضوح معاد الصيانة"],
412
+   "action_items_promised": ["إرسال بريد إلكتروني بمواعيد التسليم"],
413
+   "next_steps": ["متابعة العميل خلال 48 ساعة"]
415
+ }
416
+ ```
417
+
418
+ **Response fields:**
419
+
420
+ | Field | Type | Description |
421
+ | --- | --- | --- |
422
+ | `audio_filename` | `string` | Name of the uploaded file |
423
+ | `transcript` | `string` | Raw Whisper output (unmodified) |
424
+ | `cleaned_transcript` | `string` | Speaker-labelled, corrected Arabic transcript |
425
+ | `agent_name` | `string \| null` | Name of the agent extracted from the conversation |
426
+ | `customer_name` | `string \| null` | Name of the customer extracted from the conversation |
427
+ | `unit_number` | `string[]` | Unit identifiers mentioned (e.g. `["B2-401"]`) |
428
+ | `project_name` | `string \| null` | Project name (IL BOSCO, La Nuova Vista, KAI Sokhna, etc.) |
429
+ | `department_mentioned` | `string \| null` | Department referenced (Sales, Maintenance, Housekeeping) |
430
+ | `call_type` | `string` | `"Inbound"` or `"Outbound"` |
431
+ | `customer_satisfaction` | `integer` | Satisfaction score **1–5** inferred from tone (1 = very unhappy, 5 = very happy) |
431
+ | `is_urgent` | `boolean` | `true` if satisfaction ≤ 2 or the customer expressed critical frustration |
433
+ | `pain_points` | `string[]` | List of issues or complaints mentioned |
434
+ | `action_items_promised` | `string[]` | Commitments made by the agent during the call |
435
+ | `next_steps` | `string[]` | Follow-up actions identified |
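Since these fields are produced by a language model, a consumer may want to check them against the documented types and ranges before storing them. The sketch below is one way to do that; `validate_analysis` and `needs_escalation` are hypothetical helpers, and the escalation rule simply restates the `is_urgent` description from the table.

```python
LIST_FIELDS = ("unit_number", "pain_points", "action_items_promised", "next_steps")


def validate_analysis(payload: dict) -> list[str]:
    """Return a list of violations of the documented response schema."""
    errors = []
    score = payload.get("customer_satisfaction")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(score, int) or isinstance(score, bool) or not 1 <= score <= 5:
        errors.append("customer_satisfaction must be an integer from 1 to 5")
    if payload.get("call_type") not in ("Inbound", "Outbound"):
        errors.append("call_type must be 'Inbound' or 'Outbound'")
    for key in LIST_FIELDS:
        if not isinstance(payload.get(key), list):
            errors.append(f"{key} must be a list")
    if not isinstance(payload.get("is_urgent"), bool):
        errors.append("is_urgent must be a boolean")
    return errors


def needs_escalation(payload: dict) -> bool:
    """True if the server flagged the call as urgent or the score is 2 or lower."""
    score = payload.get("customer_satisfaction")
    low_score = isinstance(score, int) and not isinstance(score, bool) and score <= 2
    return bool(payload.get("is_urgent")) or low_score
```

Checking the score in addition to `is_urgent` is a defensive choice in case the two fields disagree in a given response.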
436
+
437
+ ---
438
+
439
+ ## Error Codes
440
+
441
+ | Code | Meaning | How to fix |
442
+ | --- | --- | --- |
443
+ | `200` | Success | n/a |
444
+ | `413` | File exceeds 200 MB limit | Compress or trim the audio |
445
+ | `422` | Unsupported audio format | Use `.wav`, `.mp3`, `.m4a`, `.flac`, `.ogg`, or `.webm` |
446
+ | `500` | Whisper transcription failed | Check server logs: `docker compose logs api` |
447
+ | `502` | Gemini call failed | Check `GEMINI_API_KEY` and network access to Google APIs |
448
+ | `503` | Model not loaded | Whisper or Gemini did not initialise; check the logs |
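A client can turn the table above into a simple decision rule. The grouping below is an assumption rather than documented server behaviour: `502` and `503` are treated as worth retrying, although a `503` caused by a missing `GEMINI_API_KEY` will not clear on its own. `next_action` and `backoff_delays` are hypothetical helpers.

```python
RETRYABLE = {502, 503}       # upstream or startup problems that may clear up
CLIENT_ERRORS = {413, 422}   # the request itself has to change


def next_action(status_code: int) -> str:
    """Map an HTTP status code from the API to a coarse client-side action."""
    if status_code == 200:
        return "ok"
    if status_code in RETRYABLE:
        return "retry"
    if status_code in CLIENT_ERRORS:
        return "fix_request"
    return "inspect_logs"    # 500 and anything not listed in the table


def backoff_delays(attempts: int, base_s: float = 2.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base_s * 2 ** i for i in range(attempts)]
```

A caller would sleep for each delay in turn while `next_action` keeps returning `"retry"`, then give up.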
449
+
450
+ ---
451
+
452
+ ## Interactive Docs (Swagger UI)
453
+
454
+ FastAPI automatically generates interactive API documentation.
455
+
456
+ | URL | Description |
457
+ | --- | --- |
458
+ | `http://localhost/docs` | Swagger UI: try endpoints directly in the browser |
458
+ | `http://localhost/redoc` | ReDoc: clean, readable reference |
460
+ | `http://localhost/openapi.json` | Raw OpenAPI 3.0 schema |
461
+
462
+ > For local development (no Docker), replace `localhost` with `localhost:8000`.
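The raw schema is also handy for scripts, for example to confirm that a deployment exposes the expected routes. The sketch below relies only on the standard OpenAPI layout, in which `paths` maps each route to its HTTP methods; `list_operations` is a hypothetical helper.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_operations(schema: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs found in an OpenAPI schema dict."""
    operations = []
    for path, item in sorted(schema.get("paths", {}).items()):
        for method in sorted(item):
            # Path items can also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                operations.append((method.upper(), path))
    return operations
```

The input would typically be the parsed body of `http://localhost/openapi.json`, loaded with `json.load`.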
463
+
464
+ ---
465
+
466
+ ## Training Pipeline
467
+
468
+ ### Project structure
469
+
470
+ ```
471
+ .
472
+ ├── config/
473
+ │   └── training_config.yaml     # All hyperparameters in one place
474
+ ├── data/
475
+ │   ├── raw/
476
+ │   │   ├── audio/               ← put your audio files here (.mp3, .wav, …)
477
+ │   │   └── transcripts/         ← matching .txt transcript files (same filename stem)
478
+ │   └── processed/               ← auto-generated (segments + HF dataset)
479
+ ├── src/
480
+ │   ├── data_preparation/
481
+ │   │   ├── parse_transcripts.py
482
+ │   │   ├── segment_audio.py
483
+ │   │   └── build_dataset.py
484
+ │   ├── training/
485
+ │   │   └── trainer.py
486
+ │   └── inference/
487
+ │       ├── transcribe.py
488
+ │       └── analyze_call.py
489
+ ├── scripts/
490
+ │   ├── import_existing_data.py  ← run once to import files from project root
491
+ │   ├── prepare_data.py          ← step 1: build dataset
492
+ │   ├── train.py                 ← step 2: fine-tune
493
+ │   └── transcribe.py            ← step 3: run inference CLI
494
+ ├── api/                         ← FastAPI server
495
+ ├── nginx/                       ← Nginx config
496
+ ├── Dockerfile
497
+ └── docker-compose.yml
498
+ ```
499
+
500
+ ### Transcript format
501
+
502
+ Each `.txt` file must match its audio file's name (same stem) and use this timestamped format (seconds as float, one entry per line):
503
+
504
+ ```
505
+ 0.0: سيادة الكولونيل، صبرك في محله،
505
+ 3.076: مبروك علينا،
506
+ 4.238: عملنا أفجر طيارة في تاريخ "أمريكا".
508
+ ```
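To illustrate the format, the sketch below parses such a file into `(start_seconds, text)` pairs. It is not the project's `parse_transcripts.py`; `parse_transcript` is a hypothetical helper, and the requirement that timestamps ascend is an assumption.

```python
import re

# "<seconds>: <text>", where seconds is an integer or decimal number.
_ENTRY = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*:\s*(.+?)\s*$")


def parse_transcript(text: str) -> list[tuple[float, str]]:
    """Parse timestamped transcript text into (start_seconds, text) pairs."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        match = _ENTRY.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: expected '<seconds>: <text>'")
        entries.append((float(match.group(1)), match.group(2)))
    starts = [start for start, _ in entries]
    if starts != sorted(starts):
        # Assumption: entries are listed in chronological order.
        raise ValueError("timestamps must be in ascending order")
    return entries
```

Only the first colon on a line separates the timestamp from the text, so colons inside the spoken text are preserved.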
509
+
510
+ ### Step 1 – Install dependencies
511
+
512
+ ```bash
513
+ pip install -r requirements.txt
514
+ ```
515
+
516
+ ### Step 2 – Add your data
516
+
517
+ Option A – files already in the project root:
519
+ ```bash
520
+ python scripts/import_existing_data.py
521
+ ```
522
+
523
+ Option B – place files directly:
523
+ - Copy audio → `data/raw/audio/my_file.mp3`
524
+ - Copy transcript → `data/raw/transcripts/my_file.txt` *(same stem)*
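Because pairing is by filename stem, a quick check before preparing the dataset can catch audio files without a transcript and the reverse. `unpaired_stems` below is a hypothetical helper operating on plain filename lists; it is a sketch, not part of the repository's scripts.

```python
from pathlib import PurePath
from typing import Iterable


def unpaired_stems(
    audio_files: Iterable[str], transcript_files: Iterable[str]
) -> tuple[list[str], list[str]]:
    """Return (audio stems without a transcript, transcript stems without audio)."""
    audio = {PurePath(name).stem for name in audio_files}
    text = {PurePath(name).stem for name in transcript_files if name.endswith(".txt")}
    return sorted(audio - text), sorted(text - audio)
```

It could be fed with `os.listdir("data/raw/audio")` and `os.listdir("data/raw/transcripts")`; two empty lists mean every file has a partner.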
526
+
527
+ ### Step 3 – Prepare the dataset
528
+
529
+ ```bash
530
+ python scripts/prepare_data.py
531
+ ```
532
+
533
+ Splits audio into ≤25-second WAV segments aligned to the transcript, then builds a HuggingFace `DatasetDict` saved to `data/processed/`.
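One way to picture the segmentation is a greedy merge of consecutive transcript entries until the next entry would push the span past the limit. The sketch below only illustrates that idea and is not the logic in `segment_audio.py`; `group_segments` is a hypothetical function, and it assumes each entry ends where the next one starts.

```python
def group_segments(
    entries: list[tuple[float, str]], audio_duration: float, max_len: float = 25.0
) -> list[tuple[float, float, str]]:
    """Greedily merge consecutive entries into (start, end, text) spans of at most max_len seconds."""
    # Assumption: entry i ends at the start of entry i + 1; the last ends with the audio.
    bounds = [start for start, _ in entries] + [audio_duration]
    segments = []
    seg_start, seg_texts = 0.0, []
    for i, (start, text) in enumerate(entries):
        end = bounds[i + 1]
        if seg_texts and end - seg_start > max_len:
            # Adding this entry would exceed the limit: close the current segment.
            segments.append((seg_start, start, " ".join(seg_texts)))
            seg_texts = []
        if not seg_texts:
            seg_start = start
        seg_texts.append(text)
    if seg_texts:
        segments.append((seg_start, audio_duration, " ".join(seg_texts)))
    return segments
```

A single entry longer than `max_len` is kept as one over-long segment here; a real pipeline would need to split or discard it.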
534
+
535
+ ### Step 4 – Fine-tune
536
+
537
+ ```bash
538
+ python scripts/train.py
539
+
540
+ # Resume from a checkpoint
541
+ python scripts/train.py --resume outputs/checkpoints/checkpoint-500
542
+ ```
543
+
544
+ ### Step 5 – Transcribe via CLI
545
+
546
+ ```bash
547
+ # Use the fine-tuned model (auto-detected)
548
+ python scripts/transcribe.py path/to/audio.mp3
549
+
550
+ # Specify a model explicitly
551
+ python scripts/transcribe.py --model openai/whisper-large-v3 audio.mp3
552
+
553
+ # Save output to file
554
+ python scripts/transcribe.py audio.mp3 --output result.txt
555
+ ```
556
+
557
+ ### Adding more data later
558
+
559
+ 1. Drop each new audio file into `data/raw/audio/` and its transcript into `data/raw/transcripts/` (same filename stem, e.g. `audio.mp3` + `audio.txt`).
559
+ 2. Re-run `python scripts/prepare_data.py`, which rebuilds everything from scratch.
561
+ 3. Re-run `python scripts/train.py`.
562
+
563
+ ### Configuration
564
+
565
+ Edit `config/training_config.yaml` to change:
566
+ - `model.base_model`: swap to `openai/whisper-medium` for faster training
566
+ - `training.per_device_train_batch_size`: reduce if out of GPU memory
567
+ - `training.fp16`: set to `false` on CPU or older GPUs
568
+ - `data.max_segment_duration`: segment length (max 30 s for Whisper)
570
+
571
+ ### GPU requirements
572
+
573
+ | Model | Min VRAM | Recommended |
574
+ | --- | --- | --- |
575
+ | whisper-large-v3 | 16 GB | 24 GB or more (e.g. A10, A100) |
576
+ | whisper-medium | 8 GB | 16 GB |
577
+ | whisper-small | 4 GB | 8 GB |
578
+
579
+ Use `gradient_checkpointing: true` and lower `per_device_train_batch_size` to fit in less VRAM at the cost of slower training.