# 🛠️ Azure Speech Transcription - Developer Guide
## 📋 Table of Contents
- [System Architecture](#-system-architecture)
- [Development Environment](#-development-environment)
- [Deployment Guide](#-deployment-guide)
- [API Documentation](#-api-documentation)
- [Database Schema](#-database-schema)
- [Security Implementation](#-security-implementation)
- [Monitoring & Maintenance](#-monitoring--maintenance)
- [Contributing Guidelines](#-contributing-guidelines)
- [Advanced Configuration](#-advanced-configuration)
- [Troubleshooting](#-troubleshooting)
---
## 🏗️ System Architecture
### Overview
The Azure Speech Transcription service is built with a modern, secure architecture focusing on user privacy, PDPA compliance, and scalability.
```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Frontend UI   │      │   Backend API   │      │ Azure Services  │
│    (Gradio)     │◄────►│    (Python)     │◄────►│  Speech & Blob  │
└─────────────────┘      └─────────────────┘      └─────────────────┘
         │                        │                        │
         ▼                        ▼                        ▼
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  User Session   │      │ SQLite Database │      │  User Storage   │
│   Management    │      │   (Metadata)    │      │   (Isolated)    │
└─────────────────┘      └─────────────────┘      └─────────────────┘
```
### Core Components
#### 1. Frontend Layer (`gradio_app.py`)
- **Technology**: Gradio with custom CSS
- **Purpose**: User interface and session management
- **Features**: Authentication, file upload, real-time status, history management
#### 2. Backend Layer (`app_core.py`)
- **Technology**: Python with threading and async processing
- **Purpose**: Business logic, authentication, and Azure integration
- **Features**: User management, transcription processing, PDPA compliance
#### 3. Data Layer
- **Database**: SQLite with Azure Blob backup
- **Storage**: Azure Blob Storage with user separation
- **Security**: User-isolated folders and encrypted connections
#### 4. External Services
- **Azure Speech Services**: Transcription processing
- **Azure Blob Storage**: File and database storage
- **FFmpeg**: Audio/video conversion (see the conversion sketch below)
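For illustration, here is a minimal sketch of how the FFmpeg conversion step might be invoked from Python. The helper name `convert_to_wav` and the 16 kHz mono WAV settings are assumptions based on typical speech-to-text requirements; the actual `_convert_to_audio` in `app_core.py` may use different flags.
```python
import subprocess

def convert_to_wav(input_path: str, output_path: str) -> None:
    """Convert any audio/video input to 16 kHz mono WAV using FFmpeg (hypothetical helper)."""
    cmd = [
        "ffmpeg", "-y",          # overwrite output if it exists
        "-i", input_path,        # input file (audio or video)
        "-vn",                   # drop any video stream
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate
        output_path,
    ]
    result = subprocess.run(cmd, capture_output=True)
    if result.returncode != 0:
        raise RuntimeError(f"FFmpeg conversion failed: {result.stderr.decode(errors='ignore')[:200]}")
```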
### Data Flow
```
1. User uploads file → 2. Authentication check → 3. File validation
→ 4. Save to user folder → 5. Background processing → 6. Process with Azure
→ 7. Store transcript → 8. Download results → 9. Update UI status
```
---
## 💻 Development Environment
### Prerequisites
- **Python**: 3.8 or higher
- **Azure Account**: With Speech Services and Blob Storage
- **FFmpeg**: For audio/video processing
- **Git**: For version control
### Environment Setup
#### 1. Clone Repository
```bash
git clone <repository-url>
cd azure-speech-transcription
```
#### 2. Virtual Environment
```bash
# Create virtual environment
python -m venv venv
# Activate (Windows)
venv\Scripts\activate
# Activate (macOS/Linux)
source venv/bin/activate
```
#### 3. Install Dependencies
```bash
pip install -r requirements.txt
```
#### 4. Environment Configuration
```bash
# Copy environment template
cp .env.example .env
# Edit with your Azure credentials
nano .env
```
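The exact contents of `.env.example` are not reproduced here; as a rough guide, a populated `.env` might look like the following. Variable names come from the Configuration Parameters section later in this guide, and all values are placeholders.
```bash
# .env (placeholder values - never commit real credentials)
AZURE_SPEECH_KEY=your-speech-key
AZURE_SPEECH_KEY_ENDPOINT=https://<region>.api.cognitive.microsoft.com
AZURE_REGION=southeastasia
AZURE_BLOB_CONNECTION="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
AZURE_CONTAINER=transcriptions
AZURE_BLOB_SAS_TOKEN="sv=..."
ALLOWED_LANGS={"en-US": "English (United States)"}
```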
#### 5. Install FFmpeg
**Windows (Chocolatey):**
```bash
choco install ffmpeg
```
**macOS (Homebrew):**
```bash
brew install ffmpeg
```
**Ubuntu/Debian:**
```bash
sudo apt update
sudo apt install ffmpeg
```
#### 6. Verify Installation
```python
python -c "
import gradio as gr
from azure.storage.blob import BlobServiceClient
import subprocess
print('Gradio:', gr.__version__)
print('FFmpeg:', subprocess.run(['ffmpeg', '-version'], capture_output=True).returncode == 0)
print('Azure Blob:', 'OK')
"
```
### Development Server
```bash
# Start development server
python gradio_app.py
# Server will be available at:
# http://localhost:7860
```
### Development Tools
#### Recommended IDE Setup
- **VS Code**: With Python, Azure, and Git extensions
- **PyCharm**: Professional edition with Azure toolkit
- **Vim/Emacs**: With appropriate Python plugins
#### Useful Extensions
```json
{
  "recommendations": [
    "ms-python.python",
    "ms-vscode.azure-cli",
    "ms-azuretools.azure-cli-tools",
    "ms-python.black-formatter",
    "ms-python.flake8"
  ]
}
```
#### Code Quality Tools
```bash
# Install development tools
pip install black flake8 pytest mypy
# Format code
black .
# Lint code
flake8 .
# Type checking
mypy app_core.py gradio_app.py
```
---
## 🚀 Deployment Guide
### Production Deployment Options
#### Option 1: Traditional Server Deployment
**1. Server Preparation**
```bash
# Update system
sudo apt update && sudo apt upgrade -y
# Install Python and dependencies
sudo apt install python3 python3-pip python3-venv nginx ffmpeg -y
# Create application user
sudo useradd -m -s /bin/bash transcription
sudo su - transcription
```
**2. Application Setup**
```bash
# Clone repository
git clone <repository-url> /home/transcription/app
cd /home/transcription/app
# Setup virtual environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env with production values
```
**3. Systemd Service**
```ini
# /etc/systemd/system/transcription.service
[Unit]
Description=Azure Speech Transcription Service
After=network.target
[Service]
Type=simple
User=transcription
Group=transcription
WorkingDirectory=/home/transcription/app
Environment=PATH=/home/transcription/app/venv/bin
ExecStart=/home/transcription/app/venv/bin/python gradio_app.py
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
```
**4. Nginx Configuration**
```nginx
# /etc/nginx/sites-available/transcription
server {
    listen 80;
    server_name your-domain.com;
    client_max_body_size 500M;

    location / {
        proxy_pass http://127.0.0.1:7860;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
        proxy_connect_timeout 75s;
    }
}
```
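Depending on the Gradio version, the UI may keep a WebSocket connection open for live status updates. If the proxied interface loads but never refreshes, adding the standard upgrade headers to the `location` block is worth trying; this is an assumption, not part of the original configuration.
```nginx
location / {
    # ...existing proxy_* directives...
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}
```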
**5. SSL Certificate**
```bash
# Install Certbot
sudo apt install certbot python3-certbot-nginx -y
# Get SSL certificate
sudo certbot --nginx -d your-domain.com
# Verify auto-renewal
sudo certbot renew --dry-run
```
**6. Start Services**
```bash
# Enable and start application
sudo systemctl enable transcription
sudo systemctl start transcription
# Enable and restart nginx
sudo systemctl enable nginx
sudo systemctl restart nginx
# Check status
sudo systemctl status transcription
sudo systemctl status nginx
```
#### Option 2: Docker Deployment
**1. Dockerfile**
```dockerfile
FROM python:3.9-slim
# Install system dependencies
RUN apt-get update && apt-get install -y \
ffmpeg \
&& rm -rf /var/lib/apt/lists/*
# Set working directory
WORKDIR /app
# Copy requirements and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Create necessary directories
RUN mkdir -p uploads database temp
# Expose port
EXPOSE 7860
# Run application
CMD ["python", "gradio_app.py"]
```
**2. Docker Compose**
```yaml
# docker-compose.yml
version: '3.8'

services:
  transcription:
    build: .
    ports:
      - "7860:7860"
    environment:
      - AZURE_SPEECH_KEY=${AZURE_SPEECH_KEY}
      - AZURE_SPEECH_KEY_ENDPOINT=${AZURE_SPEECH_KEY_ENDPOINT}
      - AZURE_REGION=${AZURE_REGION}
      - AZURE_BLOB_CONNECTION=${AZURE_BLOB_CONNECTION}
      - AZURE_CONTAINER=${AZURE_CONTAINER}
      - AZURE_BLOB_SAS_TOKEN=${AZURE_BLOB_SAS_TOKEN}
      - ALLOWED_LANGS=${ALLOWED_LANGS}
    volumes:
      - ./uploads:/app/uploads
      - ./database:/app/database
      - ./temp:/app/temp
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/ssl/certs
    depends_on:
      - transcription
    restart: unless-stopped
```
**3. Deploy with Docker**
```bash
# Build and start
docker-compose up -d
# View logs
docker-compose logs -f transcription
# Update application
git pull
docker-compose build transcription
docker-compose up -d transcription
```
#### Option 3: Cloud Deployment (Azure Container Instances)
**1. Create Container Registry**
```bash
# Create ACR
az acr create --resource-group myResourceGroup \
--name myregistry --sku Basic
# Login to ACR
az acr login --name myregistry
# Build and push image
docker build -t myregistry.azurecr.io/transcription:latest .
docker push myregistry.azurecr.io/transcription:latest
```
**2. Deploy Container Instance**
```bash
# Create container instance
az container create \
--resource-group myResourceGroup \
--name transcription-app \
--image myregistry.azurecr.io/transcription:latest \
--cpu 2 --memory 4 \
--port 7860 \
--environment-variables \
AZURE_SPEECH_KEY=$AZURE_SPEECH_KEY \
AZURE_SPEECH_KEY_ENDPOINT=$AZURE_SPEECH_KEY_ENDPOINT \
AZURE_REGION=$AZURE_REGION \
AZURE_BLOB_CONNECTION="$AZURE_BLOB_CONNECTION" \
AZURE_CONTAINER=$AZURE_CONTAINER \
AZURE_BLOB_SAS_TOKEN="$AZURE_BLOB_SAS_TOKEN"
```
---
## 📡 API Documentation
### Core Classes and Methods
#### TranscriptionManager Class
**Purpose**: Main service class handling all transcription operations
```python
class TranscriptionManager:
    def __init__(self)

    # User Authentication
    def register_user(email: str, username: str, password: str,
                      gdpr_consent: bool, data_retention_agreed: bool,
                      marketing_consent: bool) -> Tuple[bool, str, Optional[str]]
    def login_user(login: str, password: str) -> Tuple[bool, str, Optional[User]]

    # Transcription Operations
    def submit_transcription(file_bytes: bytes, original_filename: str,
                             user_id: str, language: str,
                             settings: Dict) -> str
    def get_job_status(job_id: str) -> Optional[TranscriptionJob]

    # Data Management
    def get_user_history(user_id: str, limit: int) -> List[TranscriptionJob]
    def get_user_stats(user_id: str) -> Dict
    def export_user_data(user_id: str) -> Dict
    def delete_user_account(user_id: str) -> bool
```
#### DatabaseManager Class
**Purpose**: Handle database operations and Azure blob synchronization
```python
class DatabaseManager:
    def __init__(db_path: str = None)

    # User Operations
    def create_user(...) -> Tuple[bool, str, Optional[str]]
    def authenticate_user(login: str, password: str) -> Tuple[bool, str, Optional[User]]
    def get_user_by_id(user_id: str) -> Optional[User]

    # Job Operations
    def save_job(job: TranscriptionJob)
    def get_job(job_id: str) -> Optional[TranscriptionJob]
    def get_user_jobs(user_id: str, limit: int) -> List[TranscriptionJob]
    def get_pending_jobs() -> List[TranscriptionJob]
```
#### AuthManager Class
**Purpose**: Authentication utilities and validation
```python
class AuthManager:
    @staticmethod
    def hash_password(password: str) -> str
    def verify_password(password: str, password_hash: str) -> bool
    def validate_email(email: str) -> bool
    def validate_username(username: str) -> bool
    def validate_password(password: str) -> Tuple[bool, str]
```
### Data Models
#### User Model
```python
@dataclass
class User:
    user_id: str
    email: str
    username: str
    password_hash: str
    created_at: str
    last_login: Optional[str] = None
    is_active: bool = True
    gdpr_consent: bool = False
    data_retention_agreed: bool = False
    marketing_consent: bool = False
```
#### TranscriptionJob Model
```python
@dataclass
class TranscriptionJob:
    job_id: str
    user_id: str
    original_filename: str
    audio_url: str
    language: str
    status: str                          # pending, processing, completed, failed
    created_at: str
    completed_at: Optional[str] = None
    transcript_text: Optional[str] = None
    transcript_url: Optional[str] = None
    error_message: Optional[str] = None
    azure_trans_id: Optional[str] = None
    settings: Optional[Dict] = None
```
### Configuration Parameters
#### Environment Variables
```python
# Required
AZURE_SPEECH_KEY: str
AZURE_SPEECH_KEY_ENDPOINT: str
AZURE_REGION: str
AZURE_BLOB_CONNECTION: str
AZURE_CONTAINER: str
AZURE_BLOB_SAS_TOKEN: str
# Optional
ALLOWED_LANGS: str # JSON string
API_VERSION: str = "v3.2"
PASSWORD_SALT: str = "default_salt"
MAX_FILE_SIZE_MB: int = 500
```
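A hedged sketch of how these variables might be validated at startup; the helper name `load_config` is illustrative and not part of `app_core.py`.
```python
import os

REQUIRED_VARS = [
    "AZURE_SPEECH_KEY", "AZURE_SPEECH_KEY_ENDPOINT", "AZURE_REGION",
    "AZURE_BLOB_CONNECTION", "AZURE_CONTAINER", "AZURE_BLOB_SAS_TOKEN",
]

def load_config() -> dict:
    """Fail fast if a required Azure setting is missing (illustrative helper)."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return {
        **{name: os.environ[name] for name in REQUIRED_VARS},
        "API_VERSION": os.environ.get("API_VERSION", "v3.2"),
        "MAX_FILE_SIZE_MB": int(os.environ.get("MAX_FILE_SIZE_MB", "500")),
    }
```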
#### Transcription Settings
```python
settings = {
    'audio_format': str,            # wav, mp3, etc.
    'diarization_enabled': bool,    # Speaker identification
    'speakers': int,                # Max speakers (1-10)
    'profanity': str,               # masked, removed, raw
    'punctuation': str,             # automatic, dictated, none
    'timestamps': bool,             # Include timestamps
    'lexical': bool,                # Include lexical forms
    'language_id_enabled': bool,    # Auto language detection
    'candidate_locales': List[str]  # Language candidates
}
```
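For example, a concrete settings payload for an English meeting recording might look like this; the values are illustrative, and `transcription_manager` is the service instance used elsewhere in this guide.
```python
settings = {
    'audio_format': 'wav',
    'diarization_enabled': True,
    'speakers': 4,
    'profanity': 'masked',
    'punctuation': 'automatic',
    'timestamps': True,
    'lexical': False,
    'language_id_enabled': False,
    'candidate_locales': ['en-US', 'en-GB'],
}

job_id = transcription_manager.submit_transcription(
    file_bytes, "meeting.wav", user_id, "en-US", settings
)
```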
---
## 🗄️ Database Schema
### SQLite Database Structure
#### Users Table
```sql
CREATE TABLE users (
user_id TEXT PRIMARY KEY,
email TEXT UNIQUE NOT NULL,
username TEXT UNIQUE NOT NULL,
password_hash TEXT NOT NULL,
created_at TEXT NOT NULL,
last_login TEXT,
is_active BOOLEAN DEFAULT 1,
gdpr_consent BOOLEAN DEFAULT 0,
data_retention_agreed BOOLEAN DEFAULT 0,
marketing_consent BOOLEAN DEFAULT 0
);
-- Indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_username ON users(username);
```
#### Transcriptions Table
```sql
CREATE TABLE transcriptions (
job_id TEXT PRIMARY KEY,
user_id TEXT NOT NULL,
original_filename TEXT NOT NULL,
audio_url TEXT,
language TEXT NOT NULL,
status TEXT NOT NULL,
created_at TEXT NOT NULL,
completed_at TEXT,
transcript_text TEXT,
transcript_url TEXT,
error_message TEXT,
azure_trans_id TEXT,
settings TEXT,
FOREIGN KEY (user_id) REFERENCES users (user_id)
);
-- Indexes
CREATE INDEX idx_transcriptions_user_id ON transcriptions(user_id);
CREATE INDEX idx_transcriptions_status ON transcriptions(status);
CREATE INDEX idx_transcriptions_created_at ON transcriptions(created_at DESC);
CREATE INDEX idx_transcriptions_user_created ON transcriptions(user_id, created_at DESC);
```
### Azure Blob Storage Structure
```
Container: {AZURE_CONTAINER}/
├── shared/
│   └── database/
│       └── transcriptions.db              # Shared database backup
└── users/
    ├── {user-id-1}/
    │   ├── audio/                         # Processed audio files
    │   │   ├── {job-id-1}.wav
    │   │   └── {job-id-2}.wav
    │   ├── transcripts/                   # Transcript files
    │   │   ├── {job-id-1}.txt
    │   │   └── {job-id-2}.txt
    │   └── originals/                     # Original uploaded files
    │       ├── {job-id-1}_{filename}.mp4
    │       └── {job-id-2}_{filename}.wav
    └── {user-id-2}/
        ├── audio/
        ├── transcripts/
        └── originals/
```
### Database Operations
#### User Management Queries
```sql
-- Create user
INSERT INTO users (user_id, email, username, password_hash, created_at,
gdpr_consent, data_retention_agreed, marketing_consent)
VALUES (?, ?, ?, ?, ?, ?, ?, ?);
-- Authenticate user
SELECT * FROM users
WHERE (email = ? OR username = ?) AND is_active = 1;
-- Update last login
UPDATE users SET last_login = ? WHERE user_id = ?;
-- Get user stats
SELECT status, COUNT(*) FROM transcriptions
WHERE user_id = ? GROUP BY status;
```
#### Job Management Queries
```sql
-- Create job
INSERT INTO transcriptions (job_id, user_id, original_filename, language,
status, created_at, settings)
VALUES (?, ?, ?, ?, 'pending', ?, ?);
-- Update job status
UPDATE transcriptions
SET status = ?, completed_at = ?, transcript_text = ?, transcript_url = ?
WHERE job_id = ?;
-- Get user jobs
SELECT * FROM transcriptions
WHERE user_id = ?
ORDER BY created_at DESC LIMIT ?;
-- Get pending jobs for background processor
SELECT * FROM transcriptions
WHERE status IN ('pending', 'processing');
```
---
## 🔒 Security Implementation
### Authentication Security
#### Password Security
```python
import hashlib
import os
import re
from typing import Tuple

# Password hashing with salt
def hash_password(password: str) -> str:
    salt = os.environ.get("PASSWORD_SALT", "default_salt")
    return hashlib.sha256((password + salt).encode()).hexdigest()

# Password validation
def validate_password(password: str) -> Tuple[bool, str]:
    if len(password) < 8:
        return False, "Password must be at least 8 characters"
    if not re.search(r'[A-Z]', password):
        return False, "Password must contain uppercase letter"
    if not re.search(r'[a-z]', password):
        return False, "Password must contain lowercase letter"
    if not re.search(r'\d', password):
        return False, "Password must contain number"
    return True, "Valid"
```
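`AuthManager.verify_password` is referenced in the API documentation but not shown above; a minimal sketch, assuming the same salted-SHA-256 scheme and a constant-time comparison:
```python
import hmac

def verify_password(password: str, password_hash: str) -> bool:
    """Compare a candidate password against a stored hash (sketch, assumes hash_password above)."""
    return hmac.compare_digest(hash_password(password), password_hash)
```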
#### Session Management
```python
from datetime import datetime, timedelta

# User session state
session_state = {
    'user_id': str,
    'username': str,
    'logged_in_at': datetime,
    'last_activity': datetime
}

# Session validation
def validate_session(session_state: dict) -> bool:
    if not session_state or 'user_id' not in session_state:
        return False
    # Check session timeout (if implemented)
    last_activity = session_state.get('last_activity')
    if last_activity:
        timeout = timedelta(hours=24)  # 24-hour sessions
        if datetime.now() - last_activity > timeout:
            return False
    return True
```
### Data Security
#### Access Control
```python
import os

# User data access verification
def verify_user_access(job_id: str, user_id: str) -> bool:
    job = get_job(job_id)
    return job and job.user_id == user_id

# File path security
def get_user_blob_path(user_id: str, blob_type: str, filename: str) -> str:
    # Ensure user can only access their own folder
    safe_filename = os.path.basename(filename)  # Prevent path traversal
    return f"users/{user_id}/{blob_type}/{safe_filename}"
```
#### Data Encryption
```python
# Azure Blob Storage encryption (configured at the Azure level)
# - Encryption at rest: enabled by default
# - Encryption in transit: HTTPS enforced
# - Customer-managed keys: optional enhancement

# Database encryption (for sensitive fields)
from cryptography.fernet import Fernet

def encrypt_sensitive_data(data: str, key: bytes) -> str:
    f = Fernet(key)
    return f.encrypt(data.encode()).decode()

def decrypt_sensitive_data(encrypted_data: str, key: bytes) -> str:
    f = Fernet(key)
    return f.decrypt(encrypted_data.encode()).decode()
```
### Azure Security
#### Blob Storage Security
```python
# SAS token configuration for least privilege
sas_permissions = BlobSasPermissions(
    read=True,
    write=True,
    delete=True,
    list=True
)

# IP restrictions (optional)
sas_ip_range = "192.168.1.0/24"  # Restrict to a specific IP range

# Time-limited tokens
sas_expiry = datetime.utcnow() + timedelta(hours=1)
```
#### Speech Service Security
```python
import requests

# Secure API calls
headers = {
    "Ocp-Apim-Subscription-Key": AZURE_SPEECH_KEY,
    "Content-Type": "application/json"
}

# Request timeout (retry logic is sketched below)
response = requests.post(
    url,
    headers=headers,
    json=body,
    timeout=30,
    verify=True  # Verify SSL certificates
)
```
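The original comment mentions retry logic but does not show it; here is a minimal sketch of a retry wrapper with exponential backoff. The helper name `post_with_retries` is illustrative and not part of `app_core.py`.
```python
import time
import requests

def post_with_retries(url: str, headers: dict, body: dict, attempts: int = 3) -> requests.Response:
    """POST with simple exponential backoff on transient failures (illustrative sketch)."""
    last_error = None
    for attempt in range(attempts):
        try:
            response = requests.post(url, headers=headers, json=body, timeout=30)
            if response.status_code < 500:
                return response          # success, or a client error we should not retry
        except requests.RequestException as exc:
            last_error = exc
        time.sleep(2 ** attempt)         # 1s, 2s, 4s, ...
    raise ConnectionError(f"Azure Speech request failed after {attempts} attempts: {last_error}")
```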
### Input Validation
#### File Upload Security
```python
import os
from typing import Tuple

def validate_uploaded_file(file_path: str, max_size: int = 500 * 1024 * 1024) -> Tuple[bool, str]:
    try:
        # Check file exists
        if not os.path.exists(file_path):
            return False, "File not found"
        # Check file size
        file_size = os.path.getsize(file_path)
        if file_size > max_size:
            return False, f"File too large: {file_size / 1024 / 1024:.1f}MB"
        # Check file type by content (not just extension)
        import magic
        mime_type = magic.from_file(file_path, mime=True)
        allowed_types = ['audio/', 'video/']
        if not any(mime_type.startswith(t) for t in allowed_types):
            return False, f"Invalid file type: {mime_type}"
        return True, "Valid"
    except Exception as e:
        return False, f"Validation error: {str(e)}"
```
#### SQL Injection Prevention
```python
# Use parameterized queries (already implemented)
cursor.execute(
    "SELECT * FROM users WHERE email = ? AND password_hash = ?",
    (email, password_hash)
)

# Input sanitization
def sanitize_input(user_input: str) -> str:
    # Escape dangerous characters
    import html
    sanitized = html.escape(user_input)
    # Limit length
    return sanitized[:1000]
```
---
## 📊 Monitoring & Maintenance
### Application Monitoring
#### Health Checks
```python
from datetime import datetime
from typing import Any, Dict

def health_check() -> Dict[str, Any]:
    """System health check endpoint"""
    try:
        # Database check
        db_status = check_database_connection()
        # Azure services check
        blob_status = check_blob_storage()
        speech_status = check_speech_service()
        # FFmpeg check
        ffmpeg_status = check_ffmpeg_installation()
        # Disk space check
        disk_status = check_disk_space()
        return {
            'status': 'healthy' if all([db_status, blob_status, speech_status, ffmpeg_status]) else 'unhealthy',
            'timestamp': datetime.now().isoformat(),
            'services': {
                'database': db_status,
                'blob_storage': blob_status,
                'speech_service': speech_status,
                'ffmpeg': ffmpeg_status,
                'disk_space': disk_status
            }
        }
    except Exception as e:
        return {
            'status': 'error',
            'timestamp': datetime.now().isoformat(),
            'error': str(e)
        }

def check_database_connection() -> bool:
    try:
        with transcription_manager.db.get_connection() as conn:
            conn.execute("SELECT 1").fetchone()
        return True
    except Exception:
        return False

def check_blob_storage() -> bool:
    try:
        client = BlobServiceClient.from_connection_string(AZURE_BLOB_CONNECTION)
        client.list_containers(max_results=1)
        return True
    except Exception:
        return False
```
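`check_speech_service`, `check_ffmpeg_installation`, and `check_disk_space` are referenced above but not defined in this guide. Minimal sketches under the same conventions follow; the endpoint path mirrors the troubleshooting section, and the 5 GB free-space threshold is an assumption.
```python
import shutil
import subprocess
import requests

def check_speech_service() -> bool:
    try:
        url = f"{AZURE_SPEECH_KEY_ENDPOINT}/speechtotext/v3.2/transcriptions"
        headers = {"Ocp-Apim-Subscription-Key": AZURE_SPEECH_KEY}
        return requests.get(url, headers=headers, timeout=10).status_code == 200
    except Exception:
        return False

def check_ffmpeg_installation() -> bool:
    try:
        return subprocess.run(["ffmpeg", "-version"], capture_output=True).returncode == 0
    except Exception:
        return False

def check_disk_space(min_free_gb: int = 5) -> bool:
    # Assumed threshold: at least 5 GB free on the root filesystem
    _, _, free = shutil.disk_usage("/")
    return free // (1024 ** 3) >= min_free_gb
```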
#### Logging Configuration
```python
import logging
from logging.handlers import RotatingFileHandler

def setup_logging():
    """Configure application logging"""
    # Create formatter
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    # Console handler
    console_handler = logging.StreamHandler()
    console_handler.setFormatter(formatter)
    console_handler.setLevel(logging.INFO)

    # File handler with rotation
    file_handler = RotatingFileHandler(
        'logs/transcription.log',
        maxBytes=10*1024*1024,  # 10MB
        backupCount=5
    )
    file_handler.setFormatter(formatter)
    file_handler.setLevel(logging.DEBUG)

    # Configure root logger
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    logger.addHandler(console_handler)
    logger.addHandler(file_handler)

    # Separate logger for sensitive operations
    auth_logger = logging.getLogger('auth')
    auth_handler = RotatingFileHandler(
        'logs/auth.log',
        maxBytes=5*1024*1024,  # 5MB
        backupCount=10
    )
    auth_handler.setFormatter(formatter)
    auth_logger.addHandler(auth_handler)
    auth_logger.setLevel(logging.INFO)
```
#### Performance Monitoring
```python
import time
import logging
from functools import wraps

def monitor_performance(func):
    """Decorator to monitor function performance"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            duration = time.time() - start_time
            logging.info(f"{func.__name__} completed in {duration:.2f}s")
            return result
        except Exception as e:
            duration = time.time() - start_time
            logging.error(f"{func.__name__} failed after {duration:.2f}s: {str(e)}")
            raise
    return wrapper

# Usage
@monitor_performance
def submit_transcription(self, file_bytes, filename, user_id, language, settings):
    # Implementation here
    pass
```
### Database Maintenance
#### Backup Strategy
```python
from datetime import datetime, timedelta, timezone

def backup_database():
    """Backup database to Azure Blob Storage"""
    try:
        # Create timestamped backup
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        backup_name = f"shared/backups/transcriptions_backup_{timestamp}.db"

        # Upload current database
        blob_client = blob_service.get_blob_client(
            container=AZURE_CONTAINER,
            blob=backup_name
        )
        with open(db_path, "rb") as data:
            blob_client.upload_blob(data)
        logging.info(f"Database backup created: {backup_name}")

        # Clean old backups (keep last 30 days)
        cleanup_old_backups()
    except Exception as e:
        logging.error(f"Database backup failed: {str(e)}")

def cleanup_old_backups():
    """Remove backups older than 30 days"""
    try:
        # blob.last_modified is timezone-aware, so compare against a UTC-aware cutoff
        cutoff_date = datetime.now(timezone.utc) - timedelta(days=30)
        container_client = blob_service.get_container_client(AZURE_CONTAINER)
        for blob in container_client.list_blobs(name_starts_with="shared/backups/"):
            if blob.last_modified < cutoff_date:
                container_client.delete_blob(blob.name)
                logging.info(f"Deleted old backup: {blob.name}")
    except Exception as e:
        logging.error(f"Backup cleanup failed: {str(e)}")
```
#### Database Optimization
```python
def optimize_database():
    """Optimize database performance"""
    try:
        with transcription_manager.db.get_connection() as conn:
            # Analyze tables
            conn.execute("ANALYZE")
            # Vacuum database (compact)
            conn.execute("VACUUM")
            # Update statistics
            conn.execute("PRAGMA optimize")
        logging.info("Database optimization completed")
    except Exception as e:
        logging.error(f"Database optimization failed: {str(e)}")

# Schedule maintenance (optimization weekly, backups daily)
import schedule
schedule.every().week.do(optimize_database)
schedule.every().day.at("02:00").do(backup_database)
```
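`schedule` only registers jobs; something still has to call `run_pending()`. A hedged sketch of running the scheduler in a background thread (the thread name and one-minute polling interval are assumptions):
```python
import threading
import time
import schedule

def run_scheduler(poll_seconds: int = 60) -> None:
    """Run registered schedule jobs until the process exits (illustrative)."""
    while True:
        schedule.run_pending()
        time.sleep(poll_seconds)

# Start alongside the Gradio app so maintenance jobs actually fire
threading.Thread(target=run_scheduler, daemon=True, name="maintenance-scheduler").start()
```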
### Resource Management
#### Cleanup Tasks
```python
def cleanup_temporary_files():
    """Clean up temporary files older than 24 hours"""
    try:
        cutoff_time = time.time() - (24 * 60 * 60)  # 24 hours ago
        temp_dirs = ['uploads', 'temp']
        for temp_dir in temp_dirs:
            if os.path.exists(temp_dir):
                for filename in os.listdir(temp_dir):
                    filepath = os.path.join(temp_dir, filename)
                    if os.path.isfile(filepath) and os.path.getmtime(filepath) < cutoff_time:
                        os.remove(filepath)
                        logging.info(f"Cleaned up temporary file: {filepath}")
    except Exception as e:
        logging.error(f"Temporary file cleanup failed: {str(e)}")

def monitor_disk_space():
    """Monitor and alert on disk space"""
    try:
        import shutil
        total, used, free = shutil.disk_usage("/")
        # Convert to GB
        free_gb = free // (1024**3)
        total_gb = total // (1024**3)
        usage_percent = (used / total) * 100
        if usage_percent > 85:
            logging.warning(f"High disk usage: {usage_percent:.1f}% ({free_gb}GB free)")
        if free_gb < 5:
            logging.critical(f"Low disk space: {free_gb}GB remaining")
    except Exception as e:
        logging.error(f"Disk space monitoring failed: {str(e)}")
```
### Monitoring Alerts
#### Email Alerts (Optional)
```python
import smtplib
from email.mime.text import MIMEText

def send_alert(subject: str, message: str):
    """Send email alert for critical issues"""
    try:
        smtp_server = os.environ.get("SMTP_SERVER")
        smtp_port = int(os.environ.get("SMTP_PORT", "587"))
        smtp_user = os.environ.get("SMTP_USER")
        smtp_pass = os.environ.get("SMTP_PASS")
        alert_email = os.environ.get("ALERT_EMAIL")

        if not all([smtp_server, smtp_user, smtp_pass, alert_email]):
            return  # Email not configured

        msg = MIMEText(message)
        msg['Subject'] = f"[Transcription Service] {subject}"
        msg['From'] = smtp_user
        msg['To'] = alert_email

        with smtplib.SMTP(smtp_server, smtp_port) as server:
            server.starttls()
            server.login(smtp_user, smtp_pass)
            server.send_message(msg)
    except Exception as e:
        logging.error(f"Failed to send alert: {str(e)}")
```
---
## 🤝 Contributing Guidelines
### Development Workflow
#### 1. Setup Development Environment
```bash
# Fork repository
git clone https://github.com/your-username/azure-speech-transcription.git
cd azure-speech-transcription
# Create feature branch
git checkout -b feature/your-feature-name
# Setup environment
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install -r requirements-dev.txt # Development dependencies
```
#### 2. Code Quality Standards
**Python Style Guide**
- Follow PEP 8 style guidelines
- Use type hints for function parameters and return values
- Maximum line length: 88 characters (Black formatter)
- Use meaningful variable and function names
**Code Formatting**
```bash
# Install development tools
pip install black flake8 mypy pytest
# Format code
black .
# Check style
flake8 .
# Type checking
mypy app_core.py gradio_app.py
# Run tests
pytest tests/
```
**Documentation Standards**
- All functions must have docstrings
- Include type hints
- Document complex logic with inline comments
- Update README.md for new features
```python
def submit_transcription(
    self,
    file_bytes: bytes,
    original_filename: str,
    user_id: str,
    language: str,
    settings: Dict[str, Any]
) -> str:
    """
    Submit a new transcription job for processing.

    Args:
        file_bytes: Raw bytes of the audio/video file
        original_filename: Original name of the uploaded file
        user_id: ID of the authenticated user
        language: Language code for transcription (e.g., 'en-US')
        settings: Transcription configuration options

    Returns:
        str: Unique job ID for tracking transcription progress

    Raises:
        ValueError: If user_id is invalid or file is too large
        ConnectionError: If Azure services are unavailable
    """
```
#### 3. Testing Requirements
**Unit Tests**
```python
import pytest
from unittest.mock import Mock, patch
from app_core import TranscriptionManager, AuthManager

class TestAuthManager:
    def test_password_hashing(self):
        password = "TestPassword123"
        hashed = AuthManager.hash_password(password)
        assert hashed != password
        assert AuthManager.verify_password(password, hashed)
        assert not AuthManager.verify_password("wrong", hashed)

    def test_email_validation(self):
        assert AuthManager.validate_email("test@example.com")
        assert not AuthManager.validate_email("invalid-email")
        assert not AuthManager.validate_email("")

class TestTranscriptionManager:
    @patch('app_core.BlobServiceClient')
    def test_submit_transcription(self, mock_blob):
        manager = TranscriptionManager()
        job_id = manager.submit_transcription(
            b"fake audio data",
            "test.wav",
            "user123",
            "en-US",
            {"audio_format": "wav"}
        )
        assert isinstance(job_id, str)
        assert len(job_id) == 36  # UUID length
```
**Integration Tests**
```python
class TestIntegration:
    def test_full_transcription_workflow(self):
        # Test complete workflow from upload to download
        pass

    def test_user_registration_and_login(self):
        # Test complete auth workflow
        pass
```
#### 4. Commit Guidelines
**Commit Message Format**
```
type(scope): brief description
Detailed explanation of changes if needed
- List specific changes
- Include any breaking changes
- Reference issue numbers
Closes #123
```
**Commit Types**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes (formatting, etc.)
- `refactor`: Code refactoring
- `test`: Adding or updating tests
- `chore`: Maintenance tasks
**Example Commits**
```bash
git commit -m "feat(auth): add password strength validation
- Implement password complexity requirements
- Add client-side validation feedback
- Update registration form UI
Closes #45"
git commit -m "fix(transcription): handle Azure service timeouts
- Add retry logic for failed API calls
- Improve error messages for users
- Log detailed error information
Fixes #67"
```
#### 5. Pull Request Process
**PR Checklist**
- [ ] Code follows style guidelines
- [ ] All tests pass
- [ ] Documentation updated
- [ ] Security considerations reviewed
- [ ] Performance impact assessed
- [ ] Breaking changes documented
**PR Template**
```markdown
## Description
Brief description of changes
## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update
## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed
## Security
- [ ] No sensitive data exposed
- [ ] Input validation implemented
- [ ] Access controls maintained
## Performance
- [ ] No performance degradation
- [ ] Database queries optimized
- [ ] Resource usage considered
```
### Feature Development
#### Adding New Languages
```python
# 1. Update environment configuration
ALLOWED_LANGS = {
    "en-US": "English (United States)",
    "es-ES": "Spanish (Spain)",
    "new-LANG": "New Language Name"
}

# 2. Test language support
def test_new_language():
    # Verify Azure Speech Services supports the language
    # Test transcription accuracy
    # Update documentation
    pass
```
#### Adding New Audio Formats
```python
# 1. Update supported formats list
AUDIO_FORMATS = [
    "wav", "mp3", "ogg", "opus", "flac",
    "new_format"  # Add new format
]

# 2. Update FFmpeg conversion logic
def _convert_to_audio(self, input_path, output_path, audio_format="wav"):
    if audio_format == "new_format":
        # Add specific conversion parameters
        cmd = ["ffmpeg", "-i", input_path, "-codec", "new_codec", output_path]
```
#### Adding New Features
```python
# 1. Database schema updates
def upgrade_database_schema(self):
    with self.get_connection() as conn:
        conn.execute("""
            ALTER TABLE transcriptions
            ADD COLUMN new_feature_data TEXT
        """)

# 2. API endpoint updates
def new_feature_endpoint(user_id: str, feature_data: Dict) -> Dict:
    # Implement new feature logic
    pass

# 3. UI updates
def add_new_feature_ui():
    new_feature_input = gr.Textbox(label="New Feature")
    new_feature_button = gr.Button("Use New Feature")
```
---
## ⚙️ Advanced Configuration
### Performance Optimization
#### Concurrent Processing
```python
# Adjust worker thread pool size based on server capacity
class TranscriptionManager:
    def __init__(self, max_workers: int = None):
        if max_workers is None:
            # Auto-detect based on CPU cores
            import multiprocessing
            max_workers = min(multiprocessing.cpu_count(), 10)
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

# Configure based on server specs
# Small server:  max_workers=2-4
# Medium server: max_workers=5-8
# Large server:  max_workers=10+
```
#### Database Optimization
```python
import sqlite3

# SQLite performance tuning
def configure_database_performance(db_path: str):
    with sqlite3.connect(db_path) as conn:
        # Enable WAL mode for better concurrency
        conn.execute("PRAGMA journal_mode=WAL")
        # Increase cache size (pages; negative values mean KiB)
        conn.execute("PRAGMA cache_size=10000")
        # Optimize synchronization
        conn.execute("PRAGMA synchronous=NORMAL")
        # Enable foreign keys
        conn.execute("PRAGMA foreign_keys=ON")
```
#### Memory Management
```python
# Large file handling
def process_large_file(file_path: str):
    """Process large files in chunks to manage memory"""
    chunk_size = 64 * 1024 * 1024  # 64MB chunks
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            # Process chunk
            yield chunk

# Garbage collection for long-running processes
import gc

def cleanup_memory():
    """Force garbage collection"""
    gc.collect()

# Schedule periodic cleanup
schedule.every(30).minutes.do(cleanup_memory)
```
### Security Hardening
#### Rate Limiting
```python
from collections import defaultdict
from time import time

class RateLimiter:
    def __init__(self, max_requests: int = 100, window: int = 3600):
        self.max_requests = max_requests
        self.window = window
        self.requests = defaultdict(list)

    def is_allowed(self, user_id: str) -> bool:
        now = time()
        user_requests = self.requests[user_id]
        # Clean old requests
        user_requests[:] = [req_time for req_time in user_requests
                            if now - req_time < self.window]
        # Check limit
        if len(user_requests) >= self.max_requests:
            return False
        user_requests.append(now)
        return True

# Usage in endpoints
rate_limiter = RateLimiter(max_requests=50, window=3600)  # 50 per hour

def submit_transcription(self, user_id: str, ...):
    if not rate_limiter.is_allowed(user_id):
        raise Exception("Rate limit exceeded")
```
#### Input Sanitization
```python
import os
import re
import bleach

def sanitize_filename(filename: str) -> str:
    """Sanitize uploaded filename"""
    # Remove path traversal attempts
    filename = os.path.basename(filename)
    # Remove dangerous characters
    filename = re.sub(r'[<>:"/\\|?*]', '_', filename)
    # Limit length
    if len(filename) > 255:
        name, ext = os.path.splitext(filename)
        filename = name[:250] + ext
    return filename

def sanitize_user_input(text: str) -> str:
    """Sanitize user text input"""
    # Remove HTML tags
    text = bleach.clean(text, tags=[], strip=True)
    # Limit length
    text = text[:1000]
    return text.strip()
```
#### Audit Logging
```python
import json
import logging
from datetime import datetime
from typing import Dict

class AuditLogger:
    def __init__(self):
        self.logger = logging.getLogger('audit')

    def log_user_action(self, user_id: str, action: str, details: Dict = None):
        """Log user actions for security auditing"""
        audit_entry = {
            'timestamp': datetime.now().isoformat(),
            'user_id': user_id,
            'action': action,
            'details': details or {},
            'ip_address': self._get_client_ip(),
            'user_agent': self._get_user_agent()
        }
        self.logger.info(json.dumps(audit_entry))

    def _get_client_ip(self) -> str:
        # Implementation depends on deployment setup
        return "unknown"

    def _get_user_agent(self) -> str:
        # Implementation depends on deployment setup
        return "unknown"

# Usage
audit = AuditLogger()
audit.log_user_action(user_id, "login", {"success": True})
audit.log_user_action(user_id, "transcription_submit", {"filename": filename})
```
### Custom Extensions
#### Plugin Architecture
```python
from typing import Dict, List

class TranscriptionPlugin:
    """Base class for transcription plugins"""
    def pre_process(self, file_bytes: bytes, settings: Dict) -> bytes:
        """Pre-process audio before transcription"""
        return file_bytes

    def post_process(self, transcript: str, settings: Dict) -> str:
        """Post-process transcript text"""
        return transcript

    def get_name(self) -> str:
        """Return plugin name"""
        raise NotImplementedError

class NoiseReductionPlugin(TranscriptionPlugin):
    def get_name(self) -> str:
        return "noise_reduction"

    def pre_process(self, file_bytes: bytes, settings: Dict) -> bytes:
        # Implement noise reduction using an audio processing library.
        # This is a placeholder - an actual implementation would use
        # libraries like librosa, scipy, or pydub.
        return file_bytes

class LanguageDetectionPlugin(TranscriptionPlugin):
    def get_name(self) -> str:
        return "language_detection"

    def pre_process(self, file_bytes: bytes, settings: Dict) -> bytes:
        # Detect language and update settings
        detected_language = self._detect_language(file_bytes)
        settings['detected_language'] = detected_language
        return file_bytes

# Plugin manager
class PluginManager:
    def __init__(self):
        self.plugins: List[TranscriptionPlugin] = []

    def register_plugin(self, plugin: TranscriptionPlugin):
        self.plugins.append(plugin)

    def apply_pre_processing(self, file_bytes: bytes, settings: Dict) -> bytes:
        for plugin in self.plugins:
            file_bytes = plugin.pre_process(file_bytes, settings)
        return file_bytes

    def apply_post_processing(self, transcript: str, settings: Dict) -> str:
        for plugin in self.plugins:
            transcript = plugin.post_process(transcript, settings)
        return transcript
```
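A short usage sketch showing how the manager might be wired into the transcription flow; the integration point inside `submit_transcription` is an assumption.
```python
plugin_manager = PluginManager()
plugin_manager.register_plugin(NoiseReductionPlugin())
plugin_manager.register_plugin(LanguageDetectionPlugin())

# Hypothetical integration point before uploading audio to Azure
file_bytes = plugin_manager.apply_pre_processing(file_bytes, settings)
# ... transcription happens ...
transcript = plugin_manager.apply_post_processing(transcript, settings)
```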
---
## 🔧 Troubleshooting
### Common Development Issues
#### Environment Setup Problems
**Issue**: Azure connection fails
```bash
# Check environment variables
python -c "
import os
print('AZURE_SPEECH_KEY:', bool(os.getenv('AZURE_SPEECH_KEY')))
print('AZURE_BLOB_CONNECTION:', bool(os.getenv('AZURE_BLOB_CONNECTION')))
"
# Test Azure connection
python -c "
from azure.storage.blob import BlobServiceClient
client = BlobServiceClient.from_connection_string('$AZURE_BLOB_CONNECTION')
print('Containers:', list(client.list_containers()))
"
```
**Issue**: FFmpeg not found
```bash
# Check FFmpeg installation
ffmpeg -version
# Install FFmpeg (Ubuntu/Debian)
sudo apt update && sudo apt install ffmpeg
# Install FFmpeg (Windows with Chocolatey)
choco install ffmpeg
# Install FFmpeg (macOS with Homebrew)
brew install ffmpeg
```
**Issue**: Database initialization fails
```python
# Check database permissions
import os
db_dir = "database"
if not os.path.exists(db_dir):
    os.makedirs(db_dir)
    print(f"Created directory: {db_dir}")

# Test database creation
import sqlite3
conn = sqlite3.connect("database/test.db")
conn.execute("CREATE TABLE test (id INTEGER)")
conn.close()
print("Database test successful")
```
#### Runtime Issues
**Issue**: Memory errors with large files
```python
# Monitor memory usage
import psutil

def check_memory():
    memory = psutil.virtual_memory()
    print(f"Memory usage: {memory.percent}%")
    print(f"Available: {memory.available / 1024**3:.1f}GB")

# Implement file chunking for large uploads
def process_large_file_in_chunks(file_path: str, chunk_size: int = 64*1024*1024):
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk
```
**Issue**: Transcription jobs stuck
```python
# Check pending jobs
def diagnose_stuck_jobs():
    pending_jobs = transcription_manager.db.get_pending_jobs()
    print(f"Pending jobs: {len(pending_jobs)}")
    for job in pending_jobs:
        duration = datetime.now() - datetime.fromisoformat(job.created_at)
        print(f"Job {job.job_id}: {job.status} for {duration}")
        if duration.total_seconds() > 3600:  # 1 hour
            print(f"⚠️ Job {job.job_id} may be stuck")

# Reset stuck jobs
def reset_stuck_jobs():
    with transcription_manager.db.get_connection() as conn:
        conn.execute("""
            UPDATE transcriptions
            SET status = 'pending', azure_trans_id = NULL
            WHERE status = 'processing'
              AND created_at < datetime('now', '-1 hour')
        """)
```
**Issue**: Azure API errors
```python
# Test Azure Speech Service
def test_azure_speech():
    try:
        url = f"{AZURE_SPEECH_KEY_ENDPOINT}/speechtotext/v3.2/transcriptions"
        headers = {"Ocp-Apim-Subscription-Key": AZURE_SPEECH_KEY}
        response = requests.get(url, headers=headers)
        print(f"Status: {response.status_code}")
        print(f"Response: {response.text[:200]}")
    except Exception as e:
        print(f"Azure Speech test failed: {e}")

# Check Azure service status
def check_azure_status():
    # Check the Azure status page
    status_url = "https://status.azure.com/en-us/status"
    print(f"Check Azure status: {status_url}")
```
### Debugging Tools
#### Debug Mode Configuration
```python
# Enable debug mode
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
if DEBUG:
    logging.basicConfig(level=logging.DEBUG)
    # Enable Gradio debug mode
    demo.launch(debug=True, show_error=True)
```
#### Performance Profiling
```python
import cProfile
import pstats

def profile_function(func):
    """Profile function performance"""
    profiler = cProfile.Profile()

    def wrapper(*args, **kwargs):
        profiler.enable()
        result = func(*args, **kwargs)
        profiler.disable()
        # Print stats
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)  # Top 10 functions
        return result
    return wrapper

# Usage
@profile_function
def submit_transcription(self, ...):
    # Function implementation
    pass
```
#### Log Analysis
```python
import re

def analyze_logs(log_file: str = "logs/transcription.log"):
    """Analyze application logs for issues"""
    errors = []
    warnings = []
    performance_issues = []

    with open(log_file, 'r') as f:
        for line in f:
            if 'ERROR' in line:
                errors.append(line.strip())
            elif 'WARNING' in line:
                warnings.append(line.strip())
            elif 'completed in' in line:
                # Extract timing information
                match = re.search(r'completed in (\d+\.\d+)s', line)
                if match and float(match.group(1)) > 30:  # > 30 seconds
                    performance_issues.append(line.strip())

    print(f"Errors: {len(errors)}")
    print(f"Warnings: {len(warnings)}")
    print(f"Performance issues: {len(performance_issues)}")

    return {
        'errors': errors[-10:],            # Last 10 errors
        'warnings': warnings[-10:],        # Last 10 warnings
        'performance_issues': performance_issues[-10:]
    }
```
### Production Troubleshooting
#### Service Health Check
```bash
#!/bin/bash
# health_check.sh
echo "=== System Health Check ==="
# Check service status
systemctl is-active transcription
systemctl is-active nginx
# Check disk space
df -h
# Check memory usage
free -h
# Check CPU usage
top -b -n1 | grep "Cpu(s)"
# Check logs for errors
tail -n 50 /home/transcription/app/logs/transcription.log | grep ERROR
# Check Azure connectivity
curl -s -o /dev/null -w "%{http_code}" https://azure.microsoft.com/
echo "=== Health Check Complete ==="
```
#### Database Recovery
```python
def recover_database():
    """Recover database from Azure backup"""
    try:
        # List available backups
        container_client = blob_service.get_container_client(AZURE_CONTAINER)
        backups = []
        for blob in container_client.list_blobs(name_starts_with="shared/backups/"):
            backups.append({
                'name': blob.name,
                'modified': blob.last_modified
            })

        # Sort by date (newest first)
        backups.sort(key=lambda x: x['modified'], reverse=True)
        if not backups:
            print("No backups found")
            return

        # Download latest backup
        latest_backup = backups[0]['name']
        print(f"Restoring from: {latest_backup}")
        blob_client = blob_service.get_blob_client(
            container=AZURE_CONTAINER,
            blob=latest_backup
        )
        with open("database/transcriptions_restored.db", "wb") as f:
            f.write(blob_client.download_blob().readall())

        print("Database restored successfully")
        print("Restart the application to use the restored database")
    except Exception as e:
        print(f"Database recovery failed: {str(e)}")
```
---
## 📚 Additional Resources
### Documentation Links
- [Azure Speech Services Documentation](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/)
- [Azure Blob Storage Documentation](https://docs.microsoft.com/en-us/azure/storage/blobs/)
- [Gradio Documentation](https://gradio.app/docs/)
- [SQLite Documentation](https://www.sqlite.org/docs.html)
- [FFmpeg Documentation](https://ffmpeg.org/documentation.html)
### Useful Tools
- **Azure Storage Explorer**: GUI for managing blob storage
- **DB Browser for SQLite**: Visual database management
- **Postman**: API testing and development
- **Azure CLI**: Command-line Azure management
- **Visual Studio Code**: Recommended IDE with Azure extensions
### Community Resources
- [Azure Speech Services Community](https://docs.microsoft.com/en-us/answers/topics/azure-speech-services.html)
- [Gradio Community](https://github.com/gradio-app/gradio/discussions)
- [Python Audio Processing Libraries](https://github.com/topics/audio-processing)
---
**This developer guide provides comprehensive information for setting up, developing, deploying, and maintaining the Azure Speech Transcription service. For additional help, refer to the linked documentation and community resources.** 🚀