Chirapath committed 841f71f (verified) · 1 parent: a5568a5

Delete Developer.md

Files changed (1):
  1. Developer.md (+0, -1904)

Developer.md (DELETED):
# 🛠️ Azure Speech Transcription - Developer Guide

## 📋 Table of Contents

- [System Architecture](#-system-architecture)
- [Development Environment](#-development-environment)
- [Deployment Guide](#-deployment-guide)
- [API Documentation](#-api-documentation)
- [Database Schema](#-database-schema)
- [Security Implementation](#-security-implementation)
- [Monitoring & Maintenance](#-monitoring--maintenance)
- [Contributing Guidelines](#-contributing-guidelines)
- [Advanced Configuration](#-advanced-configuration)
- [Troubleshooting](#-troubleshooting)

---
## 🏗️ System Architecture

### Overview

The Azure Speech Transcription service is built with a modern, secure architecture focused on user privacy, PDPA compliance, and scalability.

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend UI   │    │   Backend API   │    │ Azure Services  │
│    (Gradio)     │◄──►│    (Python)     │◄──►│  Speech & Blob  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  User Session   │    │ SQLite Database │    │  User Storage   │
│   Management    │    │   (Metadata)    │    │   (Isolated)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
### Core Components

#### 1. Frontend Layer (`gradio_app.py`)
- **Technology**: Gradio with custom CSS
- **Purpose**: User interface and session management
- **Features**: Authentication, file upload, real-time status, history management

#### 2. Backend Layer (`app_core.py`)
- **Technology**: Python with threading and async processing
- **Purpose**: Business logic, authentication, and Azure integration
- **Features**: User management, transcription processing, PDPA compliance

#### 3. Data Layer
- **Database**: SQLite with Azure Blob backup
- **Storage**: Azure Blob Storage with per-user separation
- **Security**: User-isolated folders and encrypted connections

#### 4. External Services
- **Azure Speech Services**: Transcription processing
- **Azure Blob Storage**: File and database storage
- **FFmpeg**: Audio/video conversion

### Data Flow

```
1. User uploads file → 2. Authentication check → 3. File validation
                                                         ↓
6. Process with Azure ← 5. Background processing ← 4. Save to user folder
         ↓
7. Store transcript  → 8. Download results     → 9. Update UI status
```
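The numbered stages above can be condensed into one illustrative function. This is a sketch only: the real logic is split across `gradio_app.py`, `app_core.py`, and a background worker, and the names below are hypothetical.

```python
# Illustrative walk-through of the data flow; stage numbers match the diagram.
def process_upload(file_bytes: bytes, user_authenticated: bool) -> dict:
    if not user_authenticated:                            # 2. Authentication check
        return {"status": "rejected", "reason": "not authenticated"}
    if not file_bytes:                                    # 3. File validation
        return {"status": "rejected", "reason": "empty file"}
    job = {"status": "pending", "size": len(file_bytes)}  # 4. Save to user folder
    job["status"] = "processing"                          # 5. Background processing
    job["transcript"] = f"<transcript of {job['size']} bytes>"  # 6-7. Azure + store
    job["status"] = "completed"                           # 8-9. Results ready for UI
    return job
```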

---

## 💻 Development Environment

### Prerequisites

- **Python**: 3.8 or higher
- **Azure Account**: With Speech Services and Blob Storage
- **FFmpeg**: For audio/video processing
- **Git**: For version control

### Environment Setup

#### 1. Clone Repository
```bash
git clone <repository-url>
cd azure-speech-transcription
```

#### 2. Virtual Environment
```bash
# Create virtual environment
python -m venv venv

# Activate (Windows)
venv\Scripts\activate

# Activate (macOS/Linux)
source venv/bin/activate
```

#### 3. Install Dependencies
```bash
pip install -r requirements.txt
```

#### 4. Environment Configuration
```bash
# Copy environment template
cp .env.example .env

# Edit with your Azure credentials
nano .env
```

#### 5. Install FFmpeg

**Windows (Chocolatey):**
```bash
choco install ffmpeg
```

**macOS (Homebrew):**
```bash
brew install ffmpeg
```

**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install ffmpeg
```

#### 6. Verify Installation
```bash
python -c "
import gradio as gr
from azure.storage.blob import BlobServiceClient
import subprocess
print('Gradio:', gr.__version__)
print('FFmpeg:', subprocess.run(['ffmpeg', '-version'], capture_output=True).returncode == 0)
print('Azure Blob:', 'OK')
"
```

### Development Server

```bash
# Start development server
python gradio_app.py

# Server will be available at:
# http://localhost:7860
```

### Development Tools

#### Recommended IDE Setup
- **VS Code**: With Python, Azure, and Git extensions
- **PyCharm**: Professional edition with Azure toolkit
- **Vim/Emacs**: With appropriate Python plugins

#### Useful Extensions
```json
{
  "recommendations": [
    "ms-python.python",
    "ms-vscode.azure-cli",
    "ms-azuretools.azure-cli-tools",
    "ms-python.black-formatter",
    "ms-python.flake8"
  ]
}
```

#### Code Quality Tools
```bash
# Install development tools
pip install black flake8 pytest mypy

# Format code
black .

# Lint code
flake8 .

# Type checking
mypy app_core.py gradio_app.py
```

---

## 🚀 Deployment Guide

### Production Deployment Options

#### Option 1: Traditional Server Deployment

**1. Server Preparation**
```bash
# Update system
sudo apt update && sudo apt upgrade -y

# Install Python and dependencies
sudo apt install python3 python3-pip python3-venv nginx ffmpeg -y

# Create application user
sudo useradd -m -s /bin/bash transcription
sudo su - transcription
```

**2. Application Setup**
```bash
# Clone repository
git clone <repository-url> /home/transcription/app
cd /home/transcription/app

# Set up virtual environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with production values
```

**3. Systemd Service**
```ini
# /etc/systemd/system/transcription.service
[Unit]
Description=Azure Speech Transcription Service
After=network.target

[Service]
Type=simple
User=transcription
Group=transcription
WorkingDirectory=/home/transcription/app
Environment=PATH=/home/transcription/app/venv/bin
ExecStart=/home/transcription/app/venv/bin/python gradio_app.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

**4. Nginx Configuration**
```nginx
# /etc/nginx/sites-available/transcription
server {
    listen 80;
    server_name your-domain.com;
    client_max_body_size 500M;

    location / {
        proxy_pass http://127.0.0.1:7860;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }
}
```

**5. SSL Certificate**
```bash
# Install Certbot
sudo apt install certbot python3-certbot-nginx -y

# Get SSL certificate
sudo certbot --nginx -d your-domain.com

# Verify auto-renewal
sudo certbot renew --dry-run
```

**6. Start Services**
```bash
# Enable and start application
sudo systemctl enable transcription
sudo systemctl start transcription

# Enable and restart nginx
sudo systemctl enable nginx
sudo systemctl restart nginx

# Check status
sudo systemctl status transcription
sudo systemctl status nginx
```

#### Option 2: Docker Deployment

**1. Dockerfile**
```dockerfile
FROM python:3.9-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Create necessary directories
RUN mkdir -p uploads database temp

# Expose port
EXPOSE 7860

# Run application
CMD ["python", "gradio_app.py"]
```

**2. Docker Compose**
```yaml
# docker-compose.yml
version: '3.8'

services:
  transcription:
    build: .
    ports:
      - "7860:7860"
    environment:
      - AZURE_SPEECH_KEY=${AZURE_SPEECH_KEY}
      - AZURE_SPEECH_KEY_ENDPOINT=${AZURE_SPEECH_KEY_ENDPOINT}
      - AZURE_REGION=${AZURE_REGION}
      - AZURE_BLOB_CONNECTION=${AZURE_BLOB_CONNECTION}
      - AZURE_CONTAINER=${AZURE_CONTAINER}
      - AZURE_BLOB_SAS_TOKEN=${AZURE_BLOB_SAS_TOKEN}
      - ALLOWED_LANGS=${ALLOWED_LANGS}
    volumes:
      - ./uploads:/app/uploads
      - ./database:/app/database
      - ./temp:/app/temp
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/ssl/certs
    depends_on:
      - transcription
    restart: unless-stopped
```

**3. Deploy with Docker**
```bash
# Build and start
docker-compose up -d

# View logs
docker-compose logs -f transcription

# Update application
git pull
docker-compose build transcription
docker-compose up -d transcription
```

#### Option 3: Cloud Deployment (Azure Container Instances)

**1. Create Container Registry**
```bash
# Create ACR
az acr create --resource-group myResourceGroup \
  --name myregistry --sku Basic

# Log in to ACR
az acr login --name myregistry

# Build and push image
docker build -t myregistry.azurecr.io/transcription:latest .
docker push myregistry.azurecr.io/transcription:latest
```

**2. Deploy Container Instance**
```bash
# Create container instance
az container create \
  --resource-group myResourceGroup \
  --name transcription-app \
  --image myregistry.azurecr.io/transcription:latest \
  --cpu 2 --memory 4 \
  --port 7860 \
  --environment-variables \
    AZURE_SPEECH_KEY=$AZURE_SPEECH_KEY \
    AZURE_SPEECH_KEY_ENDPOINT=$AZURE_SPEECH_KEY_ENDPOINT \
    AZURE_REGION=$AZURE_REGION \
    AZURE_BLOB_CONNECTION="$AZURE_BLOB_CONNECTION" \
    AZURE_CONTAINER=$AZURE_CONTAINER \
    AZURE_BLOB_SAS_TOKEN="$AZURE_BLOB_SAS_TOKEN"
```

---

## 📡 API Documentation

### Core Classes and Methods

#### TranscriptionManager Class

**Purpose**: Main service class handling all transcription operations

```python
class TranscriptionManager:
    def __init__(self)

    # User authentication
    def register_user(email: str, username: str, password: str,
                      gdpr_consent: bool, data_retention_agreed: bool,
                      marketing_consent: bool) -> Tuple[bool, str, Optional[str]]

    def login_user(login: str, password: str) -> Tuple[bool, str, Optional[User]]

    # Transcription operations
    def submit_transcription(file_bytes: bytes, original_filename: str,
                             user_id: str, language: str,
                             settings: Dict) -> str

    def get_job_status(job_id: str) -> Optional[TranscriptionJob]

    # Data management
    def get_user_history(user_id: str, limit: int) -> List[TranscriptionJob]
    def get_user_stats(user_id: str) -> Dict
    def export_user_data(user_id: str) -> Dict
    def delete_user_account(user_id: str) -> bool
```
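The caller-facing contract is the `(success, message, payload)` tuple. The stub below mirrors that contract so the call pattern can be seen end to end; it is an assumption-laden sketch, not the real `TranscriptionManager` (which lives in `app_core.py`), and the payload is simplified to a user id string.

```python
from typing import Optional, Tuple

class StubTranscriptionManager:
    """Minimal stand-in mirroring the (success, message, payload) contract."""

    def __init__(self):
        self._users = {}

    def register_user(self, email: str, username: str, password: str,
                      gdpr_consent: bool, data_retention_agreed: bool,
                      marketing_consent: bool) -> Tuple[bool, str, Optional[str]]:
        if not gdpr_consent:
            return False, "GDPR/PDPA consent is required", None
        user_id = f"user-{len(self._users) + 1}"
        self._users[username] = (user_id, password)
        return True, "Registered", user_id

    def login_user(self, login: str, password: str) -> Tuple[bool, str, Optional[str]]:
        user = self._users.get(login)
        if user and user[1] == password:
            return True, "Welcome", user[0]
        return False, "Invalid credentials", None

mgr = StubTranscriptionManager()
ok, msg, user_id = mgr.register_user("a@b.co", "alice", "Secret123",
                                     True, True, False)
ok2, msg2, uid2 = mgr.login_user("alice", "Secret123")
```

Callers always check the boolean first, then branch on the message or payload.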

#### DatabaseManager Class

**Purpose**: Handles database operations and Azure Blob synchronization

```python
class DatabaseManager:
    def __init__(db_path: str = None)

    # User operations
    def create_user(...) -> Tuple[bool, str, Optional[str]]
    def authenticate_user(login: str, password: str) -> Tuple[bool, str, Optional[User]]
    def get_user_by_id(user_id: str) -> Optional[User]

    # Job operations
    def save_job(job: TranscriptionJob)
    def get_job(job_id: str) -> Optional[TranscriptionJob]
    def get_user_jobs(user_id: str, limit: int) -> List[TranscriptionJob]
    def get_pending_jobs() -> List[TranscriptionJob]
```

#### AuthManager Class

**Purpose**: Authentication utilities and validation (all methods are static)

```python
class AuthManager:
    @staticmethod
    def hash_password(password: str) -> str
    def verify_password(password: str, password_hash: str) -> bool
    def validate_email(email: str) -> bool
    def validate_username(username: str) -> bool
    def validate_password(password: str) -> Tuple[bool, str]
```

### Data Models

#### User Model
```python
@dataclass
class User:
    user_id: str
    email: str
    username: str
    password_hash: str
    created_at: str
    last_login: Optional[str] = None
    is_active: bool = True
    gdpr_consent: bool = False
    data_retention_agreed: bool = False
    marketing_consent: bool = False
```

#### TranscriptionJob Model
```python
@dataclass
class TranscriptionJob:
    job_id: str
    user_id: str
    original_filename: str
    audio_url: str
    language: str
    status: str  # pending, processing, completed, failed
    created_at: str
    completed_at: Optional[str] = None
    transcript_text: Optional[str] = None
    transcript_url: Optional[str] = None
    error_message: Optional[str] = None
    azure_trans_id: Optional[str] = None
    settings: Optional[Dict] = None
```
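Since the models are plain dataclasses, constructing one for a freshly submitted job is direct. A self-contained sketch (the job/user ids and filename are illustrative):

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Dict, Optional
import uuid

@dataclass
class TranscriptionJob:
    job_id: str
    user_id: str
    original_filename: str
    audio_url: str
    language: str
    status: str  # pending, processing, completed, failed
    created_at: str
    completed_at: Optional[str] = None
    transcript_text: Optional[str] = None
    transcript_url: Optional[str] = None
    error_message: Optional[str] = None
    azure_trans_id: Optional[str] = None
    settings: Optional[Dict] = None

# A new job starts life as 'pending' with all result fields unset.
job = TranscriptionJob(
    job_id=str(uuid.uuid4()),
    user_id="user-123",
    original_filename="meeting.wav",
    audio_url="",
    language="en-US",
    status="pending",
    created_at=datetime.now().isoformat(),
)
```

`asdict(job)` yields a plain dict, which is convenient when serializing the row for SQLite or a JSON export.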

### Configuration Parameters

#### Environment Variables
```python
# Required
AZURE_SPEECH_KEY: str
AZURE_SPEECH_KEY_ENDPOINT: str
AZURE_REGION: str
AZURE_BLOB_CONNECTION: str
AZURE_CONTAINER: str
AZURE_BLOB_SAS_TOKEN: str

# Optional
ALLOWED_LANGS: str  # JSON string
API_VERSION: str = "v3.2"
PASSWORD_SALT: str = "default_salt"
MAX_FILE_SIZE_MB: int = 500
```
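One way to make the required/optional split enforceable is to fail fast at startup. A stdlib-only sketch (the `load_config` helper is an assumption, not part of the codebase; variable names match the table above):

```python
import os

REQUIRED = [
    "AZURE_SPEECH_KEY", "AZURE_SPEECH_KEY_ENDPOINT", "AZURE_REGION",
    "AZURE_BLOB_CONNECTION", "AZURE_CONTAINER", "AZURE_BLOB_SAS_TOKEN",
]

def load_config(env=os.environ) -> dict:
    """Raise at startup if any required setting is missing; apply defaults."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {
        **{name: env[name] for name in REQUIRED},
        "API_VERSION": env.get("API_VERSION", "v3.2"),
        "PASSWORD_SALT": env.get("PASSWORD_SALT", "default_salt"),
        "MAX_FILE_SIZE_MB": int(env.get("MAX_FILE_SIZE_MB", "500")),
    }

# Example with a fake environment (real code passes os.environ):
fake_env = {name: "x" for name in REQUIRED}
config = load_config(fake_env)
```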

#### Transcription Settings
```python
settings = {
    'audio_format': str,            # wav, mp3, etc.
    'diarization_enabled': bool,    # Speaker identification
    'speakers': int,                # Max speakers (1-10)
    'profanity': str,               # masked, removed, raw
    'punctuation': str,             # automatic, dictated, none
    'timestamps': bool,             # Include timestamps
    'lexical': bool,                # Include lexical forms
    'language_id_enabled': bool,    # Auto language detection
    'candidate_locales': List[str]  # Language candidates
}
```
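A concrete instance of the schema above, for a hypothetical two-speaker Thai/English meeting recording (values are examples only, chosen from the allowed options listed in the comments):

```python
# Example settings dict for submit_transcription(); keys follow the
# reference above, values are one valid choice per key.
settings = {
    "audio_format": "wav",
    "diarization_enabled": True,
    "speakers": 2,
    "profanity": "masked",
    "punctuation": "automatic",
    "timestamps": True,
    "lexical": False,
    "language_id_enabled": True,
    "candidate_locales": ["th-TH", "en-US"],
}
```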

---

## 🗄️ Database Schema

### SQLite Database Structure

#### Users Table
```sql
CREATE TABLE users (
    user_id TEXT PRIMARY KEY,
    email TEXT UNIQUE NOT NULL,
    username TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,
    created_at TEXT NOT NULL,
    last_login TEXT,
    is_active BOOLEAN DEFAULT 1,
    gdpr_consent BOOLEAN DEFAULT 0,
    data_retention_agreed BOOLEAN DEFAULT 0,
    marketing_consent BOOLEAN DEFAULT 0
);

-- Indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_username ON users(username);
```

#### Transcriptions Table
```sql
CREATE TABLE transcriptions (
    job_id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    original_filename TEXT NOT NULL,
    audio_url TEXT,
    language TEXT NOT NULL,
    status TEXT NOT NULL,
    created_at TEXT NOT NULL,
    completed_at TEXT,
    transcript_text TEXT,
    transcript_url TEXT,
    error_message TEXT,
    azure_trans_id TEXT,
    settings TEXT,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

-- Indexes
CREATE INDEX idx_transcriptions_user_id ON transcriptions(user_id);
CREATE INDEX idx_transcriptions_status ON transcriptions(status);
CREATE INDEX idx_transcriptions_created_at ON transcriptions(created_at DESC);
CREATE INDEX idx_transcriptions_user_created ON transcriptions(user_id, created_at DESC);
```

### Azure Blob Storage Structure

```
Container: {AZURE_CONTAINER}/
├── shared/
│   └── database/
│       └── transcriptions.db      # Shared database backup
├── users/
│   ├── {user-id-1}/
│   │   ├── audio/                 # Processed audio files
│   │   │   ├── {job-id-1}.wav
│   │   │   └── {job-id-2}.wav
│   │   ├── transcripts/           # Transcript files
│   │   │   ├── {job-id-1}.txt
│   │   │   └── {job-id-2}.txt
│   │   └── originals/             # Original uploaded files
│   │       ├── {job-id-1}_(unknown).mp4
│   │       └── {job-id-2}_(unknown).wav
│   └── {user-id-2}/
│       ├── audio/
│       ├── transcripts/
│       └── originals/
```

### Database Operations

#### User Management Queries
```sql
-- Create user
INSERT INTO users (user_id, email, username, password_hash, created_at,
                   gdpr_consent, data_retention_agreed, marketing_consent)
VALUES (?, ?, ?, ?, ?, ?, ?, ?);

-- Authenticate user
SELECT * FROM users
WHERE (email = ? OR username = ?) AND is_active = 1;

-- Update last login
UPDATE users SET last_login = ? WHERE user_id = ?;

-- Get user stats
SELECT status, COUNT(*) FROM transcriptions
WHERE user_id = ? GROUP BY status;
```

#### Job Management Queries
```sql
-- Create job
INSERT INTO transcriptions (job_id, user_id, original_filename, language,
                            status, created_at, settings)
VALUES (?, ?, ?, ?, 'pending', ?, ?);

-- Update job status
UPDATE transcriptions
SET status = ?, completed_at = ?, transcript_text = ?, transcript_url = ?
WHERE job_id = ?;

-- Get user jobs
SELECT * FROM transcriptions
WHERE user_id = ?
ORDER BY created_at DESC LIMIT ?;

-- Get pending jobs for the background processor
SELECT * FROM transcriptions
WHERE status IN ('pending', 'processing');
```
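The job queries above can be exercised end to end against an in-memory SQLite database with the standard library. A sketch with the schema trimmed to the columns these queries touch (ids and filenames are illustrative):

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transcriptions (
        job_id TEXT PRIMARY KEY, user_id TEXT NOT NULL,
        original_filename TEXT NOT NULL, language TEXT NOT NULL,
        status TEXT NOT NULL, created_at TEXT NOT NULL,
        completed_at TEXT, transcript_text TEXT,
        transcript_url TEXT, settings TEXT)
""")

now = datetime.now().isoformat()

# Create job (status starts as 'pending')
conn.execute(
    "INSERT INTO transcriptions (job_id, user_id, original_filename, language,"
    " status, created_at, settings) VALUES (?, ?, ?, ?, 'pending', ?, ?)",
    ("job-1", "user-1", "meeting.wav", "en-US", now, "{}"),
)

# Update job status once transcription finishes
conn.execute(
    "UPDATE transcriptions SET status = ?, completed_at = ?,"
    " transcript_text = ?, transcript_url = ? WHERE job_id = ?",
    ("completed", now, "hello world", "users/user-1/transcripts/job-1.txt", "job-1"),
)

# Get user jobs, newest first
rows = conn.execute(
    "SELECT * FROM transcriptions WHERE user_id = ?"
    " ORDER BY created_at DESC LIMIT ?", ("user-1", 10),
).fetchall()
```

Parameterized `?` placeholders, as used throughout, are what keep these queries safe from SQL injection (see the Security section).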

---

## 🔒 Security Implementation

### Authentication Security

#### Password Security
```python
# Password hashing with salt
def hash_password(password: str) -> str:
    salt = os.environ.get("PASSWORD_SALT", "default_salt")
    return hashlib.sha256((password + salt).encode()).hexdigest()

# Password validation
def validate_password(password: str) -> Tuple[bool, str]:
    if len(password) < 8:
        return False, "Password must be at least 8 characters"
    if not re.search(r'[A-Z]', password):
        return False, "Password must contain an uppercase letter"
    if not re.search(r'[a-z]', password):
        return False, "Password must contain a lowercase letter"
    if not re.search(r'\d', password):
        return False, "Password must contain a number"
    return True, "Valid"
```

#### Session Management
```python
# User session state
session_state = {
    'user_id': str,
    'username': str,
    'logged_in_at': datetime,
    'last_activity': datetime
}

# Session validation
def validate_session(session_state: dict) -> bool:
    if not session_state or 'user_id' not in session_state:
        return False

    # Check session timeout (if implemented)
    last_activity = session_state.get('last_activity')
    if last_activity:
        timeout = timedelta(hours=24)  # 24-hour sessions
        if datetime.now() - last_activity > timeout:
            return False

    return True
```

### Data Security

#### Access Control
```python
# User data access verification
def verify_user_access(job_id: str, user_id: str) -> bool:
    job = get_job(job_id)
    return job is not None and job.user_id == user_id

# File path security
def get_user_blob_path(user_id: str, blob_type: str, filename: str) -> str:
    # Ensure users can only access their own folder
    safe_filename = os.path.basename(filename)  # Prevent path traversal
    return f"users/{user_id}/{blob_type}/{safe_filename}"
```

#### Data Encryption
```python
# Azure Blob Storage encryption (configured at the Azure level)
# - Encryption at rest: enabled by default
# - Encryption in transit: HTTPS enforced
# - Customer-managed keys: optional enhancement

# Database encryption (for sensitive fields)
from cryptography.fernet import Fernet

def encrypt_sensitive_data(data: str, key: bytes) -> str:
    f = Fernet(key)
    return f.encrypt(data.encode()).decode()

def decrypt_sensitive_data(encrypted_data: str, key: bytes) -> str:
    f = Fernet(key)
    return f.decrypt(encrypted_data.encode()).decode()
```

### Azure Security

#### Blob Storage Security
```python
# SAS token configuration for least privilege
sas_permissions = BlobSasPermissions(
    read=True,
    write=True,
    delete=True,
    list=True
)

# IP restrictions (optional)
sas_ip_range = "192.168.1.0/24"  # Restrict to a specific IP range

# Time-limited tokens
sas_expiry = datetime.utcnow() + timedelta(hours=1)
```

#### Speech Service Security
```python
# Secure API calls
headers = {
    "Ocp-Apim-Subscription-Key": AZURE_SPEECH_KEY,
    "Content-Type": "application/json"
}

# Request timeout; SSL verification stays on
response = requests.post(
    url,
    headers=headers,
    json=body,
    timeout=30,
    verify=True  # Verify SSL certificates
)
```
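Transient network failures against the Speech endpoint can be absorbed with a retry wrapper around calls like the one above. A stdlib-only sketch (the helper name and defaults are assumptions; in production you might instead attach `urllib3`'s `Retry` to a `requests.Session` adapter):

```python
import time

def with_retries(call, attempts=3, base_delay=0.5,
                 retriable=(ConnectionError, TimeoutError)):
    """Invoke call() up to `attempts` times, doubling the delay each retry."""
    for attempt in range(attempts):
        try:
            return call()
        except retriable:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Example with a flaky fake call that succeeds on the third try:
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = with_retries(flaky, attempts=3, base_delay=0.0)
```

Only retriable (transient) exception types should be listed; a 401 from a bad subscription key, for example, should fail immediately rather than retry.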

### Input Validation

#### File Upload Security
```python
def validate_uploaded_file(file_path: str, max_size: int = 500 * 1024 * 1024) -> Tuple[bool, str]:
    try:
        # Check file exists
        if not os.path.exists(file_path):
            return False, "File not found"

        # Check file size
        file_size = os.path.getsize(file_path)
        if file_size > max_size:
            return False, f"File too large: {file_size / 1024 / 1024:.1f}MB"

        # Check file type by content (not just extension)
        import magic  # python-magic
        mime_type = magic.from_file(file_path, mime=True)
        allowed_types = ['audio/', 'video/']
        if not any(mime_type.startswith(t) for t in allowed_types):
            return False, f"Invalid file type: {mime_type}"

        return True, "Valid"

    except Exception as e:
        return False, f"Validation error: {str(e)}"
```

#### SQL Injection Prevention
```python
# Use parameterized queries (already implemented)
cursor.execute(
    "SELECT * FROM users WHERE email = ? AND password_hash = ?",
    (email, password_hash)
)

# Input sanitization
def sanitize_input(user_input: str) -> str:
    # Escape HTML-special characters
    import html
    sanitized = html.escape(user_input)
    # Limit length
    return sanitized[:1000]
```

---

## 📊 Monitoring & Maintenance

### Application Monitoring

#### Health Checks
```python
def health_check() -> Dict[str, Any]:
    """System health check endpoint."""
    try:
        # Database check
        db_status = check_database_connection()

        # Azure services check
        blob_status = check_blob_storage()
        speech_status = check_speech_service()

        # FFmpeg check
        ffmpeg_status = check_ffmpeg_installation()

        # Disk space check
        disk_status = check_disk_space()

        return {
            'status': 'healthy' if all([db_status, blob_status, speech_status, ffmpeg_status]) else 'unhealthy',
            'timestamp': datetime.now().isoformat(),
            'services': {
                'database': db_status,
                'blob_storage': blob_status,
                'speech_service': speech_status,
                'ffmpeg': ffmpeg_status,
                'disk_space': disk_status
            }
        }

    except Exception as e:
        return {
            'status': 'error',
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }

def check_database_connection() -> bool:
    try:
        with transcription_manager.db.get_connection() as conn:
            conn.execute("SELECT 1").fetchone()
        return True
    except Exception:
        return False

def check_blob_storage() -> bool:
    try:
        client = BlobServiceClient.from_connection_string(AZURE_BLOB_CONNECTION)
        client.list_containers(results_per_page=1)
        return True
    except Exception:
        return False
```

#### Logging Configuration
```python
import logging
import os
from logging.handlers import RotatingFileHandler

def setup_logging():
    """Configure application logging."""

    # Ensure the log directory exists
    os.makedirs('logs', exist_ok=True)

    # Create formatter
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    # Console handler
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(formatter)
    console_handler.setLevel(logging.INFO)

    # File handler with rotation
    file_handler = RotatingFileHandler(
        'logs/transcription.log',
        maxBytes=10*1024*1024,  # 10 MB
        backupCount=5
    )
    file_handler.setFormatter(formatter)
    file_handler.setLevel(logging.DEBUG)

    # Configure root logger
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logger.addHandler(console_handler)
    logger.addHandler(file_handler)

    # Separate logger for sensitive operations
    auth_logger = logging.getLogger('auth')
    auth_handler = RotatingFileHandler(
        'logs/auth.log',
        maxBytes=5*1024*1024,  # 5 MB
        backupCount=10
    )
    auth_handler.setFormatter(formatter)
    auth_logger.addHandler(auth_handler)
    auth_logger.setLevel(logging.INFO)
```

#### Performance Monitoring
```python
import time
from functools import wraps

def monitor_performance(func):
    """Decorator to monitor function performance."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            duration = time.time() - start_time
            logging.info(f"{func.__name__} completed in {duration:.2f}s")
            return result
        except Exception as e:
            duration = time.time() - start_time
            logging.error(f"{func.__name__} failed after {duration:.2f}s: {str(e)}")
            raise
    return wrapper

# Usage
@monitor_performance
def submit_transcription(self, file_bytes, filename, user_id, language, settings):
    # Implementation here
    pass
```

### Database Maintenance

#### Backup Strategy
```python
def backup_database():
    """Back up the database to Azure Blob Storage."""
    try:
        # Create timestamped backup
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_name = f"shared/backups/transcriptions_backup_{timestamp}.db"

        # Upload current database
        blob_client = blob_service.get_blob_client(
            container=AZURE_CONTAINER,
            blob=backup_name
        )

        with open(db_path, "rb") as data:
            blob_client.upload_blob(data)

        logging.info(f"Database backup created: {backup_name}")

        # Clean old backups (keep the last 30 days)
        cleanup_old_backups()

    except Exception as e:
        logging.error(f"Database backup failed: {str(e)}")

def cleanup_old_backups():
    """Remove backups older than 30 days."""
    try:
        # blob.last_modified is timezone-aware, so compare against an aware cutoff
        cutoff_date = datetime.now(timezone.utc) - timedelta(days=30)
        container_client = blob_service.get_container_client(AZURE_CONTAINER)

        for blob in container_client.list_blobs(name_starts_with="shared/backups/"):
            if blob.last_modified < cutoff_date:
                container_client.delete_blob(blob.name)
                logging.info(f"Deleted old backup: {blob.name}")

    except Exception as e:
        logging.error(f"Backup cleanup failed: {str(e)}")
```

#### Database Optimization
```python
def optimize_database():
    """Optimize database performance."""
    try:
        with transcription_manager.db.get_connection() as conn:
            # Analyze tables
            conn.execute("ANALYZE")

            # Vacuum database (compact)
            conn.execute("VACUUM")

            # Update statistics
            conn.execute("PRAGMA optimize")

        logging.info("Database optimization completed")

    except Exception as e:
        logging.error(f"Database optimization failed: {str(e)}")

# Schedule optimization weekly and backups daily
import schedule

schedule.every().week.do(optimize_database)
schedule.every().day.at("02:00").do(backup_database)
```

### Resource Management

#### Cleanup Tasks
```python
def cleanup_temporary_files():
    """Clean up temporary files older than 24 hours."""
    try:
        cutoff_time = time.time() - (24 * 60 * 60)  # 24 hours ago
        temp_dirs = ['uploads', 'temp']

        for temp_dir in temp_dirs:
            if os.path.exists(temp_dir):
                for filename in os.listdir(temp_dir):
                    filepath = os.path.join(temp_dir, filename)
                    if os.path.isfile(filepath) and os.path.getmtime(filepath) < cutoff_time:
                        os.remove(filepath)
                        logging.info(f"Cleaned up temporary file: {filepath}")

    except Exception as e:
        logging.error(f"Temporary file cleanup failed: {str(e)}")

def monitor_disk_space():
    """Monitor and alert on disk space."""
    try:
        import shutil
        total, used, free = shutil.disk_usage("/")

        # Convert to GB
        free_gb = free // (1024**3)
        total_gb = total // (1024**3)
        usage_percent = (used / total) * 100

        if usage_percent > 85:
            logging.warning(f"High disk usage: {usage_percent:.1f}% ({free_gb}GB free)")

        if free_gb < 5:
            logging.critical(f"Low disk space: {free_gb}GB remaining")

    except Exception as e:
        logging.error(f"Disk space monitoring failed: {str(e)}")
```
1082
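
The `schedule` registrations above only fire if something periodically calls `schedule.run_pending()`. A minimal sketch of a background runner that drives any scheduler (the function name and signature here are illustrative, not existing app code):

```python
import threading

def start_scheduler_loop(run_pending, stop_event: threading.Event,
                         interval: float = 1.0) -> threading.Thread:
    """Drive a job scheduler in a background daemon thread until stop_event is set."""
    def loop():
        while not stop_event.is_set():
            run_pending()              # e.g. schedule.run_pending
            stop_event.wait(interval)  # wakes early if stop_event is set

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread

# In the app:
#   stop_event = threading.Event()
#   start_scheduler_loop(schedule.run_pending, stop_event)
```

Using `stop_event.wait()` instead of `time.sleep()` lets the loop shut down promptly on exit.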

### Monitoring Alerts

#### Email Alerts (Optional)
```python
import logging
import os
import smtplib
from email.mime.text import MIMEText

def send_alert(subject: str, message: str):
    """Send email alert for critical issues"""
    try:
        smtp_server = os.environ.get("SMTP_SERVER")
        smtp_port = int(os.environ.get("SMTP_PORT", "587"))
        smtp_user = os.environ.get("SMTP_USER")
        smtp_pass = os.environ.get("SMTP_PASS")
        alert_email = os.environ.get("ALERT_EMAIL")

        if not all([smtp_server, smtp_user, smtp_pass, alert_email]):
            return  # Email not configured

        msg = MIMEText(message)
        msg['Subject'] = f"[Transcription Service] {subject}"
        msg['From'] = smtp_user
        msg['To'] = alert_email

        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()
            server.login(smtp_user, smtp_pass)
            server.send_message(msg)

    except Exception as e:
        logging.error(f"Failed to send alert: {str(e)}")
```
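
To connect `monitor_disk_space()` with `send_alert()`, it helps to factor the thresholds used above (85% usage, 5 GB free) into a pure function that is trivial to unit-test; `disk_alert_level` is an illustrative name, not existing app code:

```python
def disk_alert_level(usage_percent: float, free_gb: float) -> str:
    """Classify disk state using the monitoring thresholds: 'critical' beats 'warning'."""
    if free_gb < 5:
        return "critical"
    if usage_percent > 85:
        return "warning"
    return "ok"

# monitor_disk_space() could then call send_alert() whenever the level
# is not "ok", keeping both thresholds in one testable place.
```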

---

## 🀝 Contributing Guidelines

### Development Workflow

#### 1. Setup Development Environment
```bash
# Fork repository
git clone https://github.com/your-username/azure-speech-transcription.git
cd azure-speech-transcription

# Create feature branch
git checkout -b feature/your-feature-name

# Setup environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install -r requirements-dev.txt  # Development dependencies
```

#### 2. Code Quality Standards

**Python Style Guide**
- Follow PEP 8 style guidelines
- Use type hints for function parameters and return values
- Maximum line length: 88 characters (Black formatter)
- Use meaningful variable and function names

**Code Formatting**
```bash
# Install development tools
pip install black flake8 mypy pytest

# Format code
black .

# Check style
flake8 .

# Type checking
mypy app_core.py gradio_app.py

# Run tests
pytest tests/
```

**Documentation Standards**
- All functions must have docstrings
- Include type hints
- Document complex logic with inline comments
- Update README.md for new features

```python
def submit_transcription(
    self,
    file_bytes: bytes,
    original_filename: str,
    user_id: str,
    language: str,
    settings: Dict[str, Any]
) -> str:
    """
    Submit a new transcription job for processing.

    Args:
        file_bytes: Raw bytes of the audio/video file
        original_filename: Original name of the uploaded file
        user_id: ID of the authenticated user
        language: Language code for transcription (e.g., 'en-US')
        settings: Transcription configuration options

    Returns:
        str: Unique job ID for tracking transcription progress

    Raises:
        ValueError: If user_id is invalid or file is too large
        ConnectionError: If Azure services are unavailable
    """
```

#### 3. Testing Requirements

**Unit Tests**
```python
import pytest
from unittest.mock import Mock, patch
from app_core import TranscriptionManager, AuthManager

class TestAuthManager:
    def test_password_hashing(self):
        password = "TestPassword123"
        hashed = AuthManager.hash_password(password)

        assert hashed != password
        assert AuthManager.verify_password(password, hashed)
        assert not AuthManager.verify_password("wrong", hashed)

    def test_email_validation(self):
        assert AuthManager.validate_email("test@example.com")
        assert not AuthManager.validate_email("invalid-email")
        assert not AuthManager.validate_email("")

class TestTranscriptionManager:
    @patch('app_core.BlobServiceClient')
    def test_submit_transcription(self, mock_blob):
        manager = TranscriptionManager()

        job_id = manager.submit_transcription(
            b"fake audio data",
            "test.wav",
            "user123",
            "en-US",
            {"audio_format": "wav"}
        )

        assert isinstance(job_id, str)
        assert len(job_id) == 36  # UUID length
```

**Integration Tests**
```python
class TestIntegration:
    def test_full_transcription_workflow(self):
        # Test complete workflow from upload to download
        pass

    def test_user_registration_and_login(self):
        # Test complete auth workflow
        pass
```
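
Tests that touch the metadata store should not run against the production SQLite file. One option is an in-memory database with a simplified stand-in for the real `transcriptions` table (the schema below is an assumption for illustration; the real schema lives in `app_core`):

```python
import sqlite3

def make_test_db() -> sqlite3.Connection:
    """In-memory SQLite database with a simplified transcriptions table for tests."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transcriptions ("
        "  job_id TEXT PRIMARY KEY,"
        "  user_id TEXT NOT NULL,"
        "  status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn

# Example: verify a status transition without touching real data
conn = make_test_db()
conn.execute("INSERT INTO transcriptions (job_id, user_id) VALUES ('j1', 'u1')")
conn.execute("UPDATE transcriptions SET status = 'completed' WHERE job_id = 'j1'")
status = conn.execute(
    "SELECT status FROM transcriptions WHERE job_id = 'j1'"
).fetchone()[0]
```

Each test gets a fresh, isolated database with no file-system cleanup to worry about.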

#### 4. Commit Guidelines

**Commit Message Format**
```
type(scope): brief description

Detailed explanation of changes if needed

- List specific changes
- Include any breaking changes
- Reference issue numbers

Closes #123
```

**Commit Types**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes (formatting, etc.)
- `refactor`: Code refactoring
- `test`: Adding or updating tests
- `chore`: Maintenance tasks

**Example Commits**
```bash
git commit -m "feat(auth): add password strength validation

- Implement password complexity requirements
- Add client-side validation feedback
- Update registration form UI

Closes #45"

git commit -m "fix(transcription): handle Azure service timeouts

- Add retry logic for failed API calls
- Improve error messages for users
- Log detailed error information

Fixes #67"
```

#### 5. Pull Request Process

**PR Checklist**
- [ ] Code follows style guidelines
- [ ] All tests pass
- [ ] Documentation updated
- [ ] Security considerations reviewed
- [ ] Performance impact assessed
- [ ] Breaking changes documented

**PR Template**
```markdown
## Description
Brief description of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed

## Security
- [ ] No sensitive data exposed
- [ ] Input validation implemented
- [ ] Access controls maintained

## Performance
- [ ] No performance degradation
- [ ] Database queries optimized
- [ ] Resource usage considered
```

### Feature Development

#### Adding New Languages
```python
# 1. Update environment configuration
ALLOWED_LANGS = {
    "en-US": "English (United States)",
    "es-ES": "Spanish (Spain)",
    "new-LANG": "New Language Name"
}

# 2. Test language support
def test_new_language():
    # Verify Azure Speech Services supports the language
    # Test transcription accuracy
    # Update documentation
    ...
```

#### Adding New Audio Formats
```python
# 1. Update supported formats list
AUDIO_FORMATS = [
    "wav", "mp3", "ogg", "opus", "flac",
    "new_format"  # Add new format
]

# 2. Update FFmpeg conversion logic
def _convert_to_audio(self, input_path, output_path, audio_format="wav"):
    if audio_format == "new_format":
        # Add specific conversion parameters
        cmd = ["ffmpeg", "-i", input_path, "-codec", "new_codec", output_path]
```

#### Adding New Features
```python
# 1. Database schema updates
def upgrade_database_schema():
    with transcription_manager.db.get_connection() as conn:
        conn.execute("""
            ALTER TABLE transcriptions
            ADD COLUMN new_feature_data TEXT
        """)

# 2. API endpoint updates
def new_feature_endpoint(user_id: str, feature_data: Dict) -> Dict:
    # Implement new feature logic
    pass

# 3. UI updates
def add_new_feature_ui():
    new_feature_input = gr.Textbox(label="New Feature")
    new_feature_button = gr.Button("Use New Feature")
```
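
A bare `ALTER TABLE` fails once the column already exists, so repeated startups need migration tracking. A minimal sketch using SQLite's `PRAGMA user_version` (the `MIGRATIONS` list and function name are illustrative, not existing app code):

```python
import sqlite3

# Ordered list of one-off schema migrations; append new statements at the end
# and never reorder or remove entries that have shipped.
MIGRATIONS = [
    "ALTER TABLE transcriptions ADD COLUMN new_feature_data TEXT",
]

def apply_migrations(conn: sqlite3.Connection) -> int:
    """Apply pending migrations, tracking progress in PRAGMA user_version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for index, statement in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute(statement)
        conn.execute(f"PRAGMA user_version = {index}")
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Running it at every startup is safe: already-applied migrations are skipped.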

---

## βš™οΈ Advanced Configuration

### Performance Optimization

#### Concurrent Processing
```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

# Adjust worker thread pool size based on server capacity
class TranscriptionManager:
    def __init__(self, max_workers: int = None):
        if max_workers is None:
            # Auto-detect based on CPU cores
            max_workers = min(multiprocessing.cpu_count(), 10)

        self.executor = ThreadPoolExecutor(max_workers=max_workers)

# Configure based on server specs:
#   Small server:  max_workers=2-4
#   Medium server: max_workers=5-8
#   Large server:  max_workers=10+
```

#### Database Optimization
```python
import sqlite3

# SQLite performance tuning
def configure_database_performance(db_path: str):
    with sqlite3.connect(db_path) as conn:
        # Enable WAL mode for better concurrency
        conn.execute("PRAGMA journal_mode=WAL")

        # Increase cache size (10,000 pages; use a negative value to size in KiB)
        conn.execute("PRAGMA cache_size=10000")

        # Optimize synchronization
        conn.execute("PRAGMA synchronous=NORMAL")

        # Enable foreign keys
        conn.execute("PRAGMA foreign_keys=ON")
```

#### Memory Management
```python
import gc

import schedule

# Large file handling
def process_large_file(file_path: str):
    """Process large files in chunks to manage memory"""
    chunk_size = 64 * 1024 * 1024  # 64MB chunks

    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            # Process chunk
            yield chunk

# Garbage collection for long-running processes
def cleanup_memory():
    """Force garbage collection"""
    gc.collect()

# Schedule periodic cleanup
schedule.every(30).minutes.do(cleanup_memory)
```

### Security Hardening

#### Rate Limiting
```python
from collections import defaultdict
from time import time

class RateLimiter:
    def __init__(self, max_requests: int = 100, window: int = 3600):
        self.max_requests = max_requests
        self.window = window
        self.requests = defaultdict(list)

    def is_allowed(self, user_id: str) -> bool:
        now = time()
        user_requests = self.requests[user_id]

        # Clean old requests
        user_requests[:] = [req_time for req_time in user_requests
                            if now - req_time < self.window]

        # Check limit
        if len(user_requests) >= self.max_requests:
            return False

        user_requests.append(now)
        return True

# Usage in endpoints
rate_limiter = RateLimiter(max_requests=50, window=3600)  # 50 per hour

def submit_transcription(self, user_id: str, ...):
    if not rate_limiter.is_allowed(user_id):
        raise Exception("Rate limit exceeded")
```

#### Input Sanitization
```python
import os
import re

import bleach

def sanitize_filename(filename: str) -> str:
    """Sanitize uploaded filename"""
    # Remove path traversal attempts
    filename = os.path.basename(filename)

    # Remove dangerous characters
    filename = re.sub(r'[<>:"/\\|?*]', '_', filename)

    # Limit length
    if len(filename) > 255:
        name, ext = os.path.splitext(filename)
        filename = name[:250] + ext

    return filename

def sanitize_user_input(text: str) -> str:
    """Sanitize user text input"""
    # Remove HTML tags
    text = bleach.clean(text, tags=[], strip=True)

    # Limit length
    text = text[:1000]

    return text.strip()
```

#### Audit Logging
```python
import json
import logging
from datetime import datetime
from typing import Dict

class AuditLogger:
    def __init__(self):
        self.logger = logging.getLogger('audit')

    def log_user_action(self, user_id: str, action: str, details: Dict = None):
        """Log user actions for security auditing"""
        audit_entry = {
            'timestamp': datetime.now().isoformat(),
            'user_id': user_id,
            'action': action,
            'details': details or {},
            'ip_address': self._get_client_ip(),
            'user_agent': self._get_user_agent()
        }

        self.logger.info(json.dumps(audit_entry))

    def _get_client_ip(self) -> str:
        # Implementation depends on deployment setup
        return "unknown"

    def _get_user_agent(self) -> str:
        # Implementation depends on deployment setup
        return "unknown"

# Usage
audit = AuditLogger()
audit.log_user_action(user_id, "login", {"success": True})
audit.log_user_action(user_id, "transcription_submit", {"filename": filename})
```

### Custom Extensions

#### Plugin Architecture
```python
from typing import Dict, List

class TranscriptionPlugin:
    """Base class for transcription plugins"""

    def pre_process(self, file_bytes: bytes, settings: Dict) -> bytes:
        """Pre-process audio before transcription"""
        return file_bytes

    def post_process(self, transcript: str, settings: Dict) -> str:
        """Post-process transcript text"""
        return transcript

    def get_name(self) -> str:
        """Return plugin name"""
        raise NotImplementedError

class NoiseReductionPlugin(TranscriptionPlugin):
    def get_name(self) -> str:
        return "noise_reduction"

    def pre_process(self, file_bytes: bytes, settings: Dict) -> bytes:
        # Implement noise reduction using an audio processing library.
        # This is a placeholder - an actual implementation would use
        # libraries like librosa, scipy, or pydub
        return file_bytes

class LanguageDetectionPlugin(TranscriptionPlugin):
    def get_name(self) -> str:
        return "language_detection"

    def pre_process(self, file_bytes: bytes, settings: Dict) -> bytes:
        # Detect language and update settings
        # (_detect_language is left to the plugin author)
        detected_language = self._detect_language(file_bytes)
        settings['detected_language'] = detected_language
        return file_bytes

# Plugin manager
class PluginManager:
    def __init__(self):
        self.plugins: List[TranscriptionPlugin] = []

    def register_plugin(self, plugin: TranscriptionPlugin):
        self.plugins.append(plugin)

    def apply_pre_processing(self, file_bytes: bytes, settings: Dict) -> bytes:
        for plugin in self.plugins:
            file_bytes = plugin.pre_process(file_bytes, settings)
        return file_bytes

    def apply_post_processing(self, transcript: str, settings: Dict) -> str:
        for plugin in self.plugins:
            transcript = plugin.post_process(transcript, settings)
        return transcript
```
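
Wiring plugins into the pipeline might look like the sketch below; it redefines a stripped-down base class and manager so it runs standalone, and `WhitespaceCleanupPlugin` is a hypothetical example rather than a shipped plugin:

```python
from typing import Dict, List

class TranscriptionPlugin:
    def post_process(self, transcript: str, settings: Dict) -> str:
        return transcript

class WhitespaceCleanupPlugin(TranscriptionPlugin):
    """Hypothetical plugin: collapse runs of whitespace in the transcript."""
    def post_process(self, transcript: str, settings: Dict) -> str:
        return " ".join(transcript.split())

class PluginManager:
    def __init__(self) -> None:
        self.plugins: List[TranscriptionPlugin] = []

    def register_plugin(self, plugin: TranscriptionPlugin) -> None:
        self.plugins.append(plugin)

    def apply_post_processing(self, transcript: str, settings: Dict) -> str:
        # Plugins run in registration order, each seeing the previous output
        for plugin in self.plugins:
            transcript = plugin.post_process(transcript, settings)
        return transcript

manager = PluginManager()
manager.register_plugin(WhitespaceCleanupPlugin())
cleaned = manager.apply_post_processing("hello    world \n test", {})
```

Because plugins run in registration order, ordering matters when plugins depend on each other's output.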

---

## πŸ”§ Troubleshooting

### Common Development Issues

#### Environment Setup Problems

**Issue**: Azure connection fails
```bash
# Check environment variables
python -c "
import os
print('AZURE_SPEECH_KEY:', bool(os.getenv('AZURE_SPEECH_KEY')))
print('AZURE_BLOB_CONNECTION:', bool(os.getenv('AZURE_BLOB_CONNECTION')))
"

# Test Azure connection
python -c "
from azure.storage.blob import BlobServiceClient
client = BlobServiceClient.from_connection_string('$AZURE_BLOB_CONNECTION')
print('Containers:', list(client.list_containers()))
"
```

**Issue**: FFmpeg not found
```bash
# Check FFmpeg installation
ffmpeg -version

# Install FFmpeg (Ubuntu/Debian)
sudo apt update && sudo apt install ffmpeg

# Install FFmpeg (Windows with Chocolatey)
choco install ffmpeg

# Install FFmpeg (macOS with Homebrew)
brew install ffmpeg
```

**Issue**: Database initialization fails
```python
import os
import sqlite3

# Check database permissions
db_dir = "database"
if not os.path.exists(db_dir):
    os.makedirs(db_dir)
    print(f"Created directory: {db_dir}")

# Test database creation
conn = sqlite3.connect("database/test.db")
conn.execute("CREATE TABLE test (id INTEGER)")
conn.close()
print("Database test successful")
```
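
The checks above can be consolidated into a single startup preflight function. The sketch below reuses the environment variable names from the examples and is illustrative, not part of the app:

```python
import os
import shutil
from typing import List

def preflight_checks() -> List[str]:
    """Return human-readable problems found before startup (empty list = OK)."""
    problems: List[str] = []

    # External binaries
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")

    # Required configuration
    for var in ("AZURE_SPEECH_KEY", "AZURE_BLOB_CONNECTION"):
        if not os.environ.get(var):
            problems.append(f"missing environment variable: {var}")

    return problems

for problem in preflight_checks():
    print(f"⚠️ {problem}")
```

Running this at startup surfaces misconfiguration immediately instead of at the first transcription attempt.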

#### Runtime Issues

**Issue**: Memory errors with large files
```python
import psutil

# Monitor memory usage
def check_memory():
    memory = psutil.virtual_memory()
    print(f"Memory usage: {memory.percent}%")
    print(f"Available: {memory.available / 1024**3:.1f}GB")

# Implement file chunking for large uploads
def process_large_file_in_chunks(file_path: str, chunk_size: int = 64*1024*1024):
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk
```

**Issue**: Transcription jobs stuck
```python
from datetime import datetime

# Check pending jobs
def diagnose_stuck_jobs():
    pending_jobs = transcription_manager.db.get_pending_jobs()
    print(f"Pending jobs: {len(pending_jobs)}")

    for job in pending_jobs:
        duration = datetime.now() - datetime.fromisoformat(job.created_at)
        print(f"Job {job.job_id}: {job.status} for {duration}")

        if duration.total_seconds() > 3600:  # 1 hour
            print(f"⚠️ Job {job.job_id} may be stuck")

# Reset stuck jobs
def reset_stuck_jobs():
    with transcription_manager.db.get_connection() as conn:
        conn.execute("""
            UPDATE transcriptions
            SET status = 'pending', azure_trans_id = NULL
            WHERE status = 'processing'
              AND created_at < datetime('now', '-1 hour')
        """)
```

**Issue**: Azure API errors
```python
import requests

# Test Azure Speech Service
def test_azure_speech():
    try:
        url = f"{AZURE_SPEECH_KEY_ENDPOINT}/speechtotext/v3.2/transcriptions"
        headers = {"Ocp-Apim-Subscription-Key": AZURE_SPEECH_KEY}

        response = requests.get(url, headers=headers)
        print(f"Status: {response.status_code}")
        print(f"Response: {response.text[:200]}")

    except Exception as e:
        print(f"Azure Speech test failed: {e}")

# Check Azure service status
def check_azure_status():
    status_url = "https://status.azure.com/en-us/status"
    print(f"Check Azure status: {status_url}")
```
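
Transient Azure failures (timeouts, throttling) are usually best handled with retries rather than immediate errors, as the example commit in the Contributing section suggests. A generic sketch with exponential backoff and jitter; the helper name and defaults are illustrative, not existing app code:

```python
import random
import time

def with_retry(call, attempts: int = 4, base_delay: float = 1.0, jitter: float = 0.1):
    """Invoke `call` with exponential backoff; re-raise after the last attempt."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus a little jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, jitter))

# Example:
#   response = with_retry(lambda: requests.get(url, headers=headers, timeout=30))
```

In production, catching a narrower exception type (connection and timeout errors only) avoids retrying on genuine bugs.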

### Debugging Tools

#### Debug Mode Configuration
```python
import logging
import os

# Enable debug mode
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"

if DEBUG:
    logging.basicConfig(level=logging.DEBUG)

# Enable Gradio debug mode
demo.launch(debug=True, show_error=True)
```

#### Performance Profiling
```python
import cProfile
import pstats

def profile_function(func):
    """Profile function performance"""
    profiler = cProfile.Profile()

    def wrapper(*args, **kwargs):
        profiler.enable()
        result = func(*args, **kwargs)
        profiler.disable()

        # Print stats
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)  # Top 10 functions

        return result

    return wrapper

# Usage
@profile_function
def submit_transcription(self, ...):
    # Function implementation
    pass
```

#### Log Analysis
```python
import re

def analyze_logs(log_file: str = "logs/transcription.log"):
    """Analyze application logs for issues"""
    errors = []
    warnings = []
    performance_issues = []

    with open(log_file, 'r') as f:
        for line in f:
            if 'ERROR' in line:
                errors.append(line.strip())
            elif 'WARNING' in line:
                warnings.append(line.strip())
            elif 'completed in' in line:
                # Extract timing information
                match = re.search(r'completed in (\d+\.\d+)s', line)
                if match and float(match.group(1)) > 30:  # > 30 seconds
                    performance_issues.append(line.strip())

    print(f"Errors: {len(errors)}")
    print(f"Warnings: {len(warnings)}")
    print(f"Performance issues: {len(performance_issues)}")

    return {
        'errors': errors[-10:],  # Last 10 errors
        'warnings': warnings[-10:],  # Last 10 warnings
        'performance_issues': performance_issues[-10:]
    }
```

### Production Troubleshooting

#### Service Health Check
```bash
#!/bin/bash
# health_check.sh

echo "=== System Health Check ==="

# Check service status
systemctl is-active transcription
systemctl is-active nginx

# Check disk space
df -h

# Check memory usage
free -h

# Check CPU usage
top -b -n1 | grep "Cpu(s)"

# Check logs for errors
tail -n 50 /home/transcription/app/logs/transcription.log | grep ERROR

# Check Azure connectivity
curl -s -o /dev/null -w "%{http_code}" https://azure.microsoft.com/

echo "=== Health Check Complete ==="
```

#### Database Recovery
```python
def recover_database():
    """Recover database from Azure backup"""
    try:
        # List available backups
        container_client = blob_service.get_container_client(AZURE_CONTAINER)
        backups = []

        for blob in container_client.list_blobs(name_starts_with="shared/backups/"):
            backups.append({
                'name': blob.name,
                'modified': blob.last_modified
            })

        # Sort by date (newest first)
        backups.sort(key=lambda x: x['modified'], reverse=True)

        if not backups:
            print("No backups found")
            return

        # Download latest backup
        latest_backup = backups[0]['name']
        print(f"Restoring from: {latest_backup}")

        blob_client = blob_service.get_blob_client(
            container=AZURE_CONTAINER,
            blob=latest_backup
        )

        # Download backup
        with open("database/transcriptions_restored.db", "wb") as f:
            f.write(blob_client.download_blob().readall())

        print("Database restored successfully")
        print("Restart the application to use restored database")

    except Exception as e:
        print(f"Database recovery failed: {str(e)}")
```

---

## πŸ“š Additional Resources

### Documentation Links
- [Azure Speech Services Documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/)
- [Azure Blob Storage Documentation](https://docs.microsoft.com/en-us/azure/storage/blobs/)
- [Gradio Documentation](https://gradio.app/docs/)
- [SQLite Documentation](https://www.sqlite.org/docs.html)
- [FFmpeg Documentation](https://ffmpeg.org/documentation.html)

### Useful Tools
- **Azure Storage Explorer**: GUI for managing blob storage
- **DB Browser for SQLite**: Visual database management
- **Postman**: API testing and development
- **Azure CLI**: Command-line Azure management
- **Visual Studio Code**: Recommended IDE with Azure extensions

### Community Resources
- [Azure Speech Services Community](https://docs.microsoft.com/en-us/answers/topics/azure-speech-services.html)
- [Gradio Community](https://github.com/gradio-app/gradio/discussions)
- [Python Audio Processing Libraries](https://github.com/topics/audio-processing)

---

**This developer guide provides comprehensive information for setting up, developing, deploying, and maintaining the Azure Speech Transcription service. For additional help, refer to the linked documentation and community resources.** πŸš€