Spaces: vyluong/POC_ASR_v6


vyluong committed on Mar 17

Commit 35b29f2 (verified) · 1 parent: 4d1f2ff

PoC deployment


Files changed (31)

Dockerfile +68 -0
README.md +99 -7
app/__init__.py +1 -0
app/api/__init__.py +1 -0
app/api/routes.py +209 -0
app/core/__init__.py +1 -0
app/core/config.py +122 -0
app/main.py +131 -0
app/schemas/__init__.py +1 -0
app/schemas/models.py +158 -0
app/services/__init__.py +5 -0
app/services/alignment.py +294 -0
app/services/audio_processor.py +247 -0
app/services/denoiser.py +142 -0
app/services/diarization.py +223 -0
app/services/emo.py +169 -0
app/services/processor.py +623 -0
app/services/silero_vad_service.py +72 -0
app/services/transcription.py +283 -0
app/services/vocal_separator.py +118 -0
app/static/css/style.css +673 -0
app/static/js/app.js +338 -0
app/templates/index.html +162 -0
data/processed/.gitkeep +0 -0
data/uploads/.gitkeep +0 -0
docker-compose.yml +60 -0
docker/.gitkeep +0 -0
precision_voice_eval_ASR.ipynb +0 -0
precision_voice_simple.ipynb +672 -0
requirements.txt +48 -0
scripts/verify_model_config.py +18 -0

Dockerfile ADDED

	@@ -0,0 +1,68 @@

+# ================================
+# PrecisionVoice Dockerfile
+# Optimized for performance and size
+# ================================
+# Stage 1: Builder
+FROM python:3.10-slim-bullseye AS builder
+WORKDIR /app
+# Install build dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    build-essential \
+    git \
+    ffmpeg \
+    libsndfile1-dev \
+    && rm -rf /var/lib/apt/lists/*
+# Copy requirements and install dependencies
+# Using --user to keep packages in /root/.local
+COPY requirements.txt .
+RUN pip install --no-cache-dir --user -r requirements.txt
+# ================================
+# Stage 2: Runtime
+# ================================
+FROM python:3.10-slim-bullseye
+WORKDIR /app
+# Install runtime dependencies
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    ffmpeg \
+    libsndfile1 \
+    && rm -rf /var/lib/apt/lists/* \
+    && apt-get clean
+# Copy Python packages from builder
+COPY --from=builder /root/.local /root/.local
+# Ensure scripts in .local are available
+ENV PATH=/root/.local/bin:$PATH
+ENV PYTHONUNBUFFERED=1
+ENV PYTHONDONTWRITEBYTECODE=1
+# Model cache directories
+ENV HF_HOME=/root/.cache/huggingface
+ENV TORCH_HOME=/root/.cache/torch
+ENV TRANSFORMERS_CACHE=/root/.cache/huggingface
+# Copy application code
+COPY app/ ./app/
+COPY data/ ./data/
+# Create necessary directories
+RUN mkdir -p /app/data/uploads /app/data/processed
+# Port configuration
+ARG PORT=7860
+ENV PORT=${PORT}
+EXPOSE ${PORT}
+# Health check
+HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:${PORT}/api/health')" || exit 1
+# Run the application
+CMD ["sh", "-c", "uvicorn app.main:app --host 0.0.0.0 --port ${PORT}"]
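
The `HEALTHCHECK` one-liner above relies on `urllib.request.urlopen` raising on connection failures and on 4xx/5xx responses. A minimal standalone sketch of the same probe follows, assuming a hypothetical helper name `probe` and an explicit socket timeout; neither is part of the commit.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status, False otherwise.

    Hypothetical helper for illustration; the Dockerfile inlines the call.
    """
    try:
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


if __name__ == "__main__":
    import sys

    # Exit code 0 marks the container healthy, 1 unhealthy.
    sys.exit(0 if probe("http://localhost:7860/api/health") else 1)
```

Passing a timeout to `urlopen` keeps the probe from hanging on an unresponsive socket; the Dockerfile version instead depends on Docker's own `--timeout=10s` to cut it off.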

README.md CHANGED

@@ -1,12 +1,104 @@
 ---
-title: PoC ASR V6 Dev
-emoji: 🦀
-colorFrom: purple
-colorTo: green
-sdk: gradio
-sdk_version: 6.9.0
-app_file: app.py
 pinned: false
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: PrecisionVoice
+emoji: 🎙️
+colorFrom: blue
+colorTo: purple
+sdk: docker
+app_file: app/main.py
 pinned: false
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# PrecisionVoice - STT & Speaker Diarization
+A production-ready Speech-to-Text and Speaker Diarization web application using FastAPI, faster-whisper, and pyannote.audio.
+## Features
+- 🎙️ Speech-to-Text using `erax-ai/EraX-WoW-Turbo-V1.1-CT2` (8x faster, 8 Vietnamese dialects)
+- 👥 Speaker Diarization using `pyannote/speaker-diarization-3.1`
+- 🧼 Speech Enhancement using `SpeechBrain SepFormer DNS4` (noise + reverb removal)
+- 🔇 Voice Activity Detection using `Silero VAD v5` (prevents hallucination)
+- 🎤 Vocal Isolation using `MDX-Net` (UVR-MDX-NET-Voc_FT)
+- 🔄 Automatic speaker-transcript alignment
+- 📥 Download results in TXT or CSV format
+- 🐳 Docker-ready with persistent model caching and GPU support
+## Quick Start
+### Prerequisites
+1. Docker and Docker Compose
+2. (Optional) NVIDIA GPU with CUDA support
+3. HuggingFace account with access to pyannote models
+### Setup
+1. Clone and configure:
+   ```bash
+   cp .env.example .env
+   # Edit .env and add your HuggingFace token
+   ```
+2. Build and run:
+   ```bash
+   docker compose up --build
+   ```
+3. Open http://localhost:8000
+## Audio Processing Pipeline
+The system runs each upload through a multi-stage pipeline:
+1. **Speech Enhancement**: Background noise and reverb are removed using `SpeechBrain SepFormer` (DNS4 Challenge winner).
+2. **Vocal Isolation**: Vocals are separated from background music using `MDX-Net`.
+3. **VAD Filtering**: Silence is removed using `Silero VAD v5` to prevent ASR hallucination.
+4. **Refinement**: Highpass filtering and EBU R128 loudness normalization.
+5. **Transcription**: High-precision Vietnamese transcription using `PhoWhisper`.
+6. **Diarization**: Segmenting audio by speaker using `Pyannote 3.1`.
+7. **Alignment**: Merging transcripts with speaker segments + timestamp reconstruction.
+## Configuration
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `HF_TOKEN` | - | Required for Pyannote models |
+| `ENABLE_SPEECH_ENHANCEMENT` | `True` | Toggle SpeechBrain speech enhancement |
+| `ENHANCEMENT_MODEL` | `speechbrain/sepformer-dns4-16k-enhancement` | Model for speech enhancement |
+| `ENABLE_SILERO_VAD` | `True` | Toggle Silero VAD for hallucination prevention |
+| `ENABLE_VOCAL_SEPARATION` | `True` | Toggle MDX-Net vocal isolation |
+| `MDX_MODEL` | `UVR-MDX-NET-Voc_FT` | Model for vocal separation |
+| `DEVICE` | `auto` | `cuda`, `cpu`, or `auto` |
+## Development
+### Local Setup (without Docker)
+```bash
+python -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+uvicorn app.main:app --reload
+```
+### API Endpoints
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Web UI |
+| `/api/transcribe` | POST | Upload and transcribe audio |
+| `/api/download/{filename}` | GET | Download result files |
+## Supported Audio Formats
+- MP3
+- WAV
+- M4A
+- OGG
+## License
+MIT

app/__init__.py ADDED

	@@ -0,0 +1 @@

+# App package

app/api/__init__.py ADDED

	@@ -0,0 +1 @@

+# API package

app/api/routes.py ADDED

	@@ -0,0 +1,209 @@

+"""
+API routes for the transcription service.
+"""
+import logging
+import time
+from pathlib import Path
+import csv
+from fastapi import APIRouter, UploadFile, File, HTTPException, BackgroundTasks, Form
+from fastapi.responses import FileResponse
+from app.core.config import get_settings
+from app.schemas.models import TranscriptionResponse, HealthResponse
+from app.services.audio_processor import AudioProcessor, AudioProcessingError
+from app.services.transcription import TranscriptionService, AVAILABLE_MODELS
+from app.services.diarization import DiarizationService
+from app.services.processor import Processor
+logger = logging.getLogger(__name__)
+settings = get_settings()
+router = APIRouter()
+@router.get("/api/health", response_model=HealthResponse)
+async def health_check():
+    """Health check endpoint."""
+    return HealthResponse(
+        status="healthy",
+        models_loaded=TranscriptionService.is_loaded() and DiarizationService.is_loaded(),
+        device=settings.resolved_device
+    )
+@router.get("/api/models")
+async def get_models():
+    """Get available Whisper models."""
+    return {
+        "models": list(AVAILABLE_MODELS.keys()),
+        "default": settings.default_whisper_model
+    }
+@router.post("/api/transcribe", response_model=TranscriptionResponse)
+async def transcribe_audio(
+    background_tasks: BackgroundTasks,
+    file: UploadFile = File(..., description="Audio file to transcribe"),
+    model: str = Form(default="PhoWhisper Large", description="Whisper model to use"),
+    language: str = Form(default="vi", description="Language code")
+):
+    """
+    Upload and transcribe an audio file.
+    Uses diarize-first workflow:
+    1. Diarization to identify speakers
+    2. Transcribe each speaker segment
+    3. Return combined result
+    4. Predict emotion segments
+    """
+    upload_path = None
+    try:
+        # Read file content
+        file_content = await file.read()
+        # Validate
+        try:
+            AudioProcessor.validate_file(file.filename or "audio.wav", len(file_content))
+        except AudioProcessingError as e:
+            raise HTTPException(status_code=400, detail=str(e))
+        # Save upload
+        upload_path = await AudioProcessor.save_upload(file_content, file.filename or "audio.wav")
+        # Process with new workflow
+        logger.info(f"Processing audio with model={model}, language={language}")
+        result = await Processor.process_audio(
+            audio_path=upload_path,
+            language=language,
+        )
+        # Name output files
+        base_name = Path(file.filename or "audio").stem
+        txt_filename = f"{base_name}_output.txt"
+        csv_filename = f"{base_name}_output.csv"
+        txt_path = settings.processed_dir / txt_filename
+        csv_path = settings.processed_dir / csv_filename
+        # Write TXT
+        txt_path.write_text(result.txt_content, encoding="utf-8")
+        # Write CSV (UTF-8)
+        roles = result.roles or {}
+        with csv_path.open("w", newline="", encoding="utf-8-sig") as f:
+            writer = csv.DictWriter(
+                f,
+                fieldnames=["start", "end", "speaker", "text"],
+            )
+            writer.writeheader()
+            for seg in result.segments:
+                writer.writerow({
+                    "start": round(seg.start, 2),
+                    "end": round(seg.end, 2),
+                    "speaker": roles.get(seg.speaker, seg.speaker),
+                    "text": seg.text,
+                })
+        # Schedule cleanup
+        background_tasks.add_task(cleanup_files, upload_path)
+        # Build response
+        return TranscriptionResponse(
+            success=True,
+            segments=[
+                {
+                    "start": seg.start,
+                    "end": seg.end,
+                    "speaker": seg.speaker,
+                    "role": seg.role,
+                    "text": seg.text,
+                    "emotion": seg.emotion
+                }
+                for seg in result.segments
+            ],
+            speaker_count=result.speaker_count,
+            speakers=result.speakers,
+            duration=result.duration,
+            processing_time=result.processing_time,
+            roles=result.roles,
+            emotion_timeline=[
+                {
+                    "time": p.time,
+                    "emotion": p.emotion
+                }
+                for p in (result.emotion_timeline or [])
+            ],
+            emotion_changes=[
+                {
+                    "time": c.time,
+                    # Keys must match the EmotionChange field names
+                    "emotion_from": c.emotion_from,
+                    "emotion_to": c.emotion_to,
+                    "icon_from": c.icon_from,
+                    "icon_to": c.icon_to
+                }
+                for c in (result.emotion_changes or [])
+            ],
+            download_txt=f"/api/download/{txt_filename}",
+            download_csv=f"/api/download/{csv_filename}",
+        )
+    except HTTPException:
+        raise
+    except Exception as e:
+        logger.exception("Processing failed")
+        if upload_path and upload_path.exists():
+            background_tasks.add_task(cleanup_files, upload_path)
+        raise HTTPException(status_code=500, detail=f"Processing failed: {str(e)}")
+@router.get("/api/download/{filename}")
+async def download_file(filename: str):
+    """
+    Download a generated transcript file.
+    Supports: .txt and .csv files
+    """
+    # Security: only allow specific extensions and no path traversal
+    if not filename.endswith(('.txt', '.csv')) or '/' in filename or '..' in filename:
+        raise HTTPException(status_code=400, detail="Invalid filename")
+    filepath = settings.processed_dir / filename
+    if not filepath.exists():
+        raise HTTPException(status_code=404, detail="File not found")
+    # Determine media type
+    if filename.endswith(".txt"):
+        media_type = "text/plain; charset=utf-8"
+    elif filename.endswith(".csv"):
+        media_type = "text/csv; charset=utf-8"
+    elif filename.endswith(".srt"):
+        media_type = "application/x-subrip"
+    else:
+        media_type = "application/octet-stream"
+    return FileResponse(
+        path=filepath,
+        filename=filename,
+        media_type=media_type,
+        headers={
+            "Content-Disposition": f'attachment; filename="{filename}"'},
+    )
+async def cleanup_files(*paths: Path):
+    """Background task to cleanup temporary files."""
+    import asyncio
+    # Wait a bit before cleanup
+    await asyncio.sleep(5)
+    await AudioProcessor.cleanup_files(*paths)
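
The download route guards against path traversal with string checks (`'/'`, `'..'`, and an extension allow-list). A stricter variant can be sketched as a standalone helper. The name `is_safe_download_name` and the `ALLOWED_SUFFIXES` constant are assumptions for illustration and do not appear in the commit.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Assumption: mirrors the two formats that /api/transcribe writes.
ALLOWED_SUFFIXES = {".txt", ".csv"}


def is_safe_download_name(filename: str) -> bool:
    """Return True only for a bare file name with an allowed extension."""
    if not filename or filename in {".", ".."}:
        return False
    # Reject names that carry a directory part under POSIX rules ("a/b.txt").
    if PurePosixPath(filename).name != filename:
        return False
    # Reject names that carry a directory or drive part under Windows rules.
    if PureWindowsPath(filename).name != filename:
        return False
    return PurePosixPath(filename).suffix.lower() in ALLOWED_SUFFIXES


print(is_safe_download_name("call_output.csv"))  # True
print(is_safe_download_name("../secret.txt"))    # False
```

Comparing the input against its own `.name` also rejects backslash separators and drive prefixes, which the substring checks in the route do not cover.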

app/core/__init__.py ADDED

	@@ -0,0 +1 @@

+# Core package

app/core/config.py ADDED

	@@ -0,0 +1,122 @@

+"""
+Application configuration using Pydantic Settings.
+"""
+import os
+from pathlib import Path
+from functools import lru_cache
+from typing import Literal, Dict
+from pydantic_settings import BaseSettings, SettingsConfigDict
+class Settings(BaseSettings):
+    """Application settings loaded from environment variables."""
+    model_config = SettingsConfigDict(
+        env_file=".env",
+        env_file_encoding="utf-8",
+        extra="ignore"
+    )
+    # HuggingFace
+    hf_token: str = ""
+    enable_noise_reduction: bool = True
+    # Denoising (Speech Enhancement)
+    enable_denoiser: bool = True
+    denoiser_model: str = "dns64"
+    # MDX-Net Vocal Separation
+    enable_vocal_separation: bool = True
+    mdx_model: str = "Kim_Vocal_2.onnx"  # High quality vocal isolation
+    available_whisper_models: Dict[str, str] = {
+        "EraX-WoW-Turbo": "erax-ai/EraX-WoW-Turbo-V1.1-CT2",
+        "PhoWhisper Large": "kiendt/PhoWhisper-large-ct2",
+        "PhoWhisper Lora Finetuned": "vyluong/pho-whisper-vi-ct2"
+    }
+    # S2T model
+    default_whisper_model: str = "vyluong/pho-whisper-vi-ct2"
+    # voice emotion detection model
+    default_dual_emotion_model: str = "vyluong/emo_dual_classi"
+    # sentiment model based text
+    # default_bert_sentiment_model: str = ""
+    # Diarization model
+    diarization_model: str = "pyannote/speaker-diarization-community-1"
+    # Device settings
+    device: Literal["cuda", "cpu", "auto"] = "auto"
+    compute_type: str = "float16"  # float16 for GPU, int8 for CPU
+    # Upload settings
+    max_upload_size_mb: int = 100
+    allowed_extensions: list[str] = ["mp3", "wav", "m4a", "ogg", "flac", "webm"]
+    # Audio processing settings
+    sample_rate: int = 16000
+    channels: int = 1  # Mono
+    enable_loudnorm: bool = True
+    # VAD parameters
+    vad_threshold: float = 0.55
+    vad_min_speech_duration_ms: int = 200
+    vad_min_silence_duration_ms: int = 450
+    vad_speech_pad_ms: int = 250
+    # Post-processing
+    merge_threshold_s: float = 0.35  # Merge segments from same speaker if gap < this
+    min_segment_duration_s: float = 0.85 # Remove segments shorter than this
+    # Server settings
+    host: str = "0.0.0.0"
+    port: int = 7860
+    # Paths
+    base_dir: Path = Path(__file__).parent.parent.parent
+    data_dir: Path = base_dir / "data"
+    upload_dir: Path = data_dir / "uploads"
+    processed_dir: Path = data_dir / "processed"
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        # Ensure directories exist
+        self.upload_dir.mkdir(parents=True, exist_ok=True)
+        self.processed_dir.mkdir(parents=True, exist_ok=True)
+    @property
+    def max_upload_size_bytes(self) -> int:
+        return self.max_upload_size_mb * 1024 * 1024
+    @property
+    def resolved_device(self) -> str:
+        """Resolve 'auto' to actual device."""
+        if self.device == "auto":
+            try:
+                import torch
+                return "cuda" if torch.cuda.is_available() else "cpu"
+            except ImportError:
+                return "cpu"
+        return self.device
+    @property
+    def resolved_compute_type(self) -> str:
+        """Get appropriate compute type for device."""
+        if self.resolved_device == "cuda":
+            return "float16"
+        return "int8"
+@lru_cache
+def get_settings() -> Settings:
+    """Get cached settings instance."""
+    return Settings()
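
The device logic in `resolved_device` and `resolved_compute_type` can be sketched as two pure functions, which makes the mapping easy to check without importing `torch`. The function names and the `cuda_available` parameter are assumptions for illustration; in the commit the check is `torch.cuda.is_available()` inside the property.

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map 'auto' to a concrete device; pass explicit choices through."""
    if requested == "auto":
        return "cuda" if cuda_available else "cpu"
    return requested


def resolve_compute_type(device: str) -> str:
    """float16 on GPU, int8 on CPU, matching the properties in Settings."""
    return "float16" if device == "cuda" else "int8"


for requested, has_cuda in [("auto", True), ("auto", False), ("cpu", True)]:
    device = resolve_device(requested, has_cuda)
    print(requested, has_cuda, "->", device, resolve_compute_type(device))
```

As written in the diff, `resolved_compute_type` does not read the `compute_type` field, so setting that field in `.env` would have no effect on callers that use the property.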

app/main.py ADDED

	@@ -0,0 +1,131 @@

+"""
+PrecisionVoice - Speech-to-Text & Speaker Diarization Application
+Main FastAPI application entry point.
+"""
+import logging
+from contextlib import asynccontextmanager
+from fastapi import FastAPI, Request
+from fastapi.staticfiles import StaticFiles
+from fastapi.templating import Jinja2Templates
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import HTMLResponse
+from app.core.config import get_settings
+from app.api.routes import router
+from app.services.transcription import TranscriptionService
+from app.services.diarization import DiarizationService
+from app.services.emo import EmotionService
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+logger = logging.getLogger(__name__)
+settings = get_settings()
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """
+    Application lifespan handler.
+    Preloads models on startup for faster first request.
+    """
+    logger.info("Starting PrecisionVoice application...")
+    logger.info(f"Device: {settings.resolved_device}")
+    logger.info(f"Default Whisper model: {settings.default_whisper_model}")
+    logger.info(f"Diarization model: {settings.diarization_model}")
+    logger.info(f"Emotion voice model: {settings.default_dual_emotion_model}")
+    # Preload default Whisper model
+    try:
+        logger.info("Preloading Whisper model...")
+        TranscriptionService.preload_model()
+    except Exception as e:
+        logger.error(f"Failed to preload Whisper model: {e}")
+    # Preload diarization pipeline
+    try:
+        if settings.hf_token:
+            logger.info("Preloading diarization pipeline...")
+            DiarizationService.preload_pipeline()
+        else:
+            logger.warning("HF_TOKEN not set, diarization will not be available")
+    except Exception as e:
+        logger.warning(f"Diarization preload failed: {e}")
+    # Preload emotion model (must happen before the single yield)
+    try:
+        logger.info("Preloading emotion model...")
+        EmotionService.preload_model()
+        logger.info("Emotion model loaded")
+    except Exception as e:
+        logger.warning(f"Emotion model preload failed: {e}")
+    logger.info("Application startup complete")
+    yield
+    logger.info("Shutting down PrecisionVoice application...")
+# Create FastAPI app
+app = FastAPI(
+    title="PrecisionVoice",
+    description="QA Voice MultipleModel API",
+    version="2.0.0",
+    lifespan=lifespan
+)
+# CORS middleware
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# Mount static files
+app.mount(
+    "/static",
+    StaticFiles(directory="app/static"),
+    name="static"
+)
+# Templates
+templates = Jinja2Templates(directory="app/templates")
+# Include API routes
+app.include_router(router)
+@app.get("/", response_class=HTMLResponse)
+async def index(request: Request):
+    """Serve the main web interface."""
+    return templates.TemplateResponse(
+        "index.html",
+        {
+            "request": request,
+            "max_upload_mb": settings.max_upload_size_mb,
+            "allowed_formats": ", ".join(settings.allowed_extensions)
+        }
+    )
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(
+        "app.main:app",
+        host=settings.host,
+        port=settings.port,
+        reload=True
+    )

app/schemas/__init__.py ADDED

	@@ -0,0 +1 @@

+# Schemas package

app/schemas/models.py ADDED

	@@ -0,0 +1,158 @@

+"""
+Pydantic models for API requests and responses.
+"""
+from pydantic import BaseModel, Field
+from typing import List, Optional
+from enum import Enum
+class ProcessingStatus(str, Enum):
+    """Status of the transcription process."""
+    PENDING = "pending"
+    PROCESSING = "processing"
+    COMPLETED = "completed"
+    FAILED = "failed"
+class TranscriptSegment(BaseModel):
+    start: float
+    end: float
+    speaker: Optional[str] = Field(
+        default=None,
+        description="Internal speaker id (debug only)"
+    )
+    role: str = Field(
+        ...,
+        description="Conversation role (NV = agent, KH = customer)"
+    )
+    text: str = Field(
+        ...,
+        description="Transcribed text"
+    )
+    emotion: Optional[str] = Field(
+        default=None,
+        description="Predicted emotion label"
+    )
+    emotion_scores: Optional[List[float]] = Field(
+        default=None,
+        description="Emotion probability scores"
+    )
+    @property
+    def start_formatted(self) -> str:
+        """Format start time as HH:MM:SS."""
+        return self._format_time(self.start)
+    @property
+    def end_formatted(self) -> str:
+        """Format end time as HH:MM:SS."""
+        return self._format_time(self.end)
+    @staticmethod
+    def _format_time(seconds: float) -> str:
+        """Convert seconds to HH:MM:SS format."""
+        hours = int(seconds // 3600)
+        minutes = int((seconds % 3600) // 60)
+        secs = int(seconds % 60)
+        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
+class EmotionPoint(BaseModel):
+    time: float = Field(..., description="Time in seconds")
+    emotion: str = Field(..., description="Emotion label")
+    icon: Optional[str] = Field(
+        default=None,
+        description="Emotion icon (emoji)"
+    )
+class EmotionChange(BaseModel):
+    time: float = Field(..., description="Time of emotion change")
+    emotion_from: str = Field(..., description="Previous emotion")
+    emotion_to: str = Field(..., description="New emotion")
+    icon_from: Optional[str] = Field(
+        default=None,
+        description="Previous emotion icon"
+    )
+    icon_to: Optional[str] = Field(
+        default=None,
+        description="New emotion icon"
+    )
+class TranscriptionRequest(BaseModel):
+    """Request model for transcription settings."""
+    language: str = Field(
+        default="vi",
+        description="Language code for transcription"
+    )
+    num_speakers: Optional[int] = Field(
+        default=None,
+        description="Expected number of speakers (None for auto-detect)"
+    )
+    output_format: str = Field(
+        default="json",
+        description="Output format: json, txt, csv"
+    )
+class TranscriptionResponse(BaseModel):
+    """Response containing the transcription results."""
+    success: bool = Field(..., description="Whether transcription succeeded")
+    message: str = Field(default="", description="Status message")
+    segments: list[TranscriptSegment] = Field(
+        default_factory=list,
+        description="Transcript segments with speaker and role")
+    duration: float = Field(default=0.0, description="Audio duration in seconds")
+    speaker_count: int = Field(default=0, description="Number of detected speakers")
+    processing_time: float = Field(default=0.0, description="Processing time in seconds")
+    speakers: Optional[list[str]] = None
+    roles: Optional[dict[str, str]] = Field(
+        default=None,
+        description="Internal mapping speaker_id → role (debug / audit only)"
+    )
+    # Emotion Analysis
+    emotion_timeline: Optional[List[EmotionPoint]] = Field(
+        default=None,
+        description="Emotion timeline across conversation"
+    )
+    emotion_changes: Optional[List[EmotionChange]] = Field(
+        default=None,
+        description="Detected emotion change events"
+    )
+    customer_emotion_score: Optional[float] = Field(
+        default=None,
+        description="Overall customer emotion score"
+    )
+    download_txt: Optional[str] = Field(default=None, description="Download URL for TXT file")
+    download_csv: Optional[str] = Field(default=None, description="Download URL for CSV file")
+class ErrorResponse(BaseModel):
+    """Error response model."""
+    success: bool = False
+    error: str = Field(..., description="Error message")
+    detail: Optional[str] = Field(default=None, description="Detailed error information")
+class HealthResponse(BaseModel):
+    """Health check response."""
+    status: str = "healthy"
+    models_loaded: bool = False
+    device: str = "cpu"
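
`TranscriptSegment._format_time` converts a float number of seconds to `HH:MM:SS` by integer truncation. A standalone copy of that arithmetic, under the hypothetical name `format_hms`, shows what happens to fractional seconds:

```python
def format_hms(seconds: float) -> str:
    """Convert seconds to HH:MM:SS, truncating (not rounding) the fraction."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_hms(3725.9))  # 01:02:05
print(format_hms(59.99))   # 00:00:59
```

Because the fraction is dropped, a segment ending at 59.99 s is shown as `00:00:59` rather than `00:01:00`. That is usually acceptable for display, though a subtitle format would need millisecond precision.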

app/services/__init__.py ADDED

	@@ -0,0 +1,5 @@

+"""Services package."""
+from app.services.transcription import TranscriptionService
+from app.services.diarization import DiarizationService
+from app.services.processor import Processor
+from app.services.audio_processor import AudioProcessor

app/services/alignment.py ADDED

	@@ -0,0 +1,294 @@

+"""
+Precision alignment service - Word-center-based speaker assignment.
+Merges word-level transcription with speaker diarization using precise timestamps.
+"""
+import logging
+from pathlib import Path
+from typing import List, Tuple, Optional
+from dataclasses import dataclass
+from app.core.config import get_settings
+from app.services.transcription import WordTimestamp
+from app.services.diarization import SpeakerSegment
+from app.schemas.models import TranscriptSegment
+logger = logging.getLogger(__name__)
+settings = get_settings()
+@dataclass
+class WordWithSpeaker:
+    """A word with assigned speaker."""
+    word: str
+    start: float
+    end: float
+    speaker: str
+class AlignmentService:
+    """
+    Precision alignment service.
+    Uses word-center-based algorithm for accurate speaker-to-text mapping.
+    """
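+    PAUSE_THRESHOLD = 0.45  # s; a longer pause between words starts a new segment
+    CENTER_TOL = 0.15  # s (150 ms); tolerance around a speaker segment when matching a word center
+    OVERLAP_TH = 0.12  # minimum fraction of a word's duration that must overlap a speaker segment
+    DIA_MERGE_GAP = 0.25  # s; same-speaker diarization segments closer than this are merged
+    MAX_SEGMENT_DURATION = 7.5  # s; longer segments are split at the next pause > 0.15 s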
+    @staticmethod
+    def get_word_center(word: WordTimestamp) -> float:
+        """Calculate the center time of a word."""
+        return (word.start + word.end) / 2
+    @staticmethod
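+    def overlap_ratio(w_start: float, w_end: float, s_start: float, s_end: float) -> float:
+        """Fraction of the word's duration that overlaps the speaker segment."""
+        overlap = max(0.0, min(w_end, s_end) - max(w_start, s_start))
+        dur = max(1e-6, w_end - w_start)  # guard against zero-length words
+        return overlap / dur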
+    # Diarization merge
+    @classmethod
+    def merge_dia_segments(cls, segments: List[SpeakerSegment]) -> List[SpeakerSegment]:
+        if not segments:
+            return []
+        segments = sorted(segments, key=lambda s: s.start)
+        merged = [segments[0]]
+        for s in segments[1:]:
+            p = merged[-1]
+            if s.speaker == p.speaker and (s.start - p.end) <= cls.DIA_MERGE_GAP:
+                # max() keeps the merged end from shrinking if s lies inside p
+                p.end = max(p.end, s.end)
+            else:
+                merged.append(s)
+        return merged
+    @classmethod
+    def find_speaker_center(
+        cls,
+        time: float,
+        speaker_segments: List[SpeakerSegment],
+    ) -> Optional[str]:
+        for seg in speaker_segments:
+            if seg.start - cls.CENTER_TOL <= time <= seg.end + cls.CENTER_TOL:
+                return seg.speaker
+        return None
+    @staticmethod
+    def find_closest_speaker(time: float, speaker_segments: List[SpeakerSegment]) -> str:
+        if not speaker_segments:
+            return "Unknown"
+        min_dist = float("inf")
+        closest = "Unknown"
+        for seg in speaker_segments:
+            d = min(abs(time - seg.start), abs(time - seg.end))
+            if d < min_dist:
+                min_dist = d
+                closest = seg.speaker
+        return closest
+    @classmethod
+    def assign_speakers_to_words(
+        cls,
+        words: List[WordTimestamp],
+        speaker_segments: List[SpeakerSegment],
+    ) -> List[WordWithSpeaker]:
+        words = [w for w in words if w.word and w.word.strip()]
+        if not speaker_segments:
+            logger.warning("No diarization, fallback single speaker")
+            return [
+                WordWithSpeaker(w.word, w.start, w.end, "Speaker 1")
+                for w in words
+            ]
+        speaker_segments = cls.merge_dia_segments(speaker_segments)
+        results = []
+        for word in words:
+            center = cls.get_word_center(word)
+            # 1. CENTER
+            speaker = cls.find_speaker_center(center, speaker_segments)
+            if speaker is None:
+                # 2. OVERLAP
+                best_ratio = 0
+                best_spk = None
+                for seg in speaker_segments:
+                    r = cls.overlap_ratio(word.start, word.end, seg.start, seg.end)
+                    if r > best_ratio:
+                        best_ratio = r
+                        best_spk = seg.speaker
+                if best_ratio >= cls.OVERLAP_TH:
+                    speaker = best_spk
+                else:
+                    # 3. CLOSEST
+                    speaker = cls.find_closest_speaker(center, speaker_segments)
+            results.append(
+                WordWithSpeaker(word.word, word.start, word.end, speaker)
+            )
+        return results
+    @classmethod
+    def reconstruct_segments(
+        cls,
+        words_with_speakers: List[WordWithSpeaker]
+    ) -> List[TranscriptSegment]:
+        """
+        Step 3d: Reconstruct sentence segments from words.
+        Groups consecutive words of the same speaker into segments.
+        Creates new segment when:
+        - Speaker changes
+        - Pause > PAUSE_THRESHOLD between words
+        Args:
+            words_with_speakers: List of words with speaker assignments
+        Returns:
+            List of TranscriptSegment with complete sentences
+        """
+        if not words_with_speakers:
+            return []
+        segments = []
+        # Start first segment
+        current_speaker = words_with_speakers[0].speaker
+        current_start = words_with_speakers[0].start
+        current_end = words_with_speakers[0].end
+        current_words = [words_with_speakers[0].word]
+        for i in range(1, len(words_with_speakers)):
+            word = words_with_speakers[i]
+            prev_word = words_with_speakers[i - 1]
+            # Calculate pause between words
+            pause = word.start - prev_word.end
+            # Check if we need to start a new segment
+            speaker_changed = word.speaker != current_speaker
+            significant_pause = pause > cls.PAUSE_THRESHOLD
+            segment_duration = current_end - current_start
+            too_long = segment_duration > cls.MAX_SEGMENT_DURATION and pause > 0.15
+            if speaker_changed or significant_pause or too_long:
+                # Save current segment
+                segments.append(TranscriptSegment(
+                    start=current_start,
+                    end=current_end,
+                    speaker=current_speaker,
+                    role="UNKNOWN",
+                    text=" ".join(current_words)
+                ))
+                # Start new segment
+                current_speaker = word.speaker
+                current_start = word.start
+                current_end = word.end
+                current_words = [word.word]
+            else:
+                # Continue current segment
+                current_end = word.end
+                current_words.append(word.word)
+        if current_words:
+            segments.append(TranscriptSegment(
+                start=current_start,
+                end=current_end,
+                speaker=current_speaker,
+                role="UNKNOWN",
+                text=" ".join(current_words)
+            ))
+        logger.debug(f"Reconstructed {len(segments)} segments from {len(words_with_speakers)} words")
+        return segments
+    @classmethod
+    def resize_and_merge_segments(
+        cls,
+        segments: List[TranscriptSegment]
+    ) -> List[TranscriptSegment]:
+        """
+        Merge consecutive segments of the same speaker if the gap is small.
+        Also filters out extremely short segments.
+        """
+        if not segments:
+            return []
+        # Filter 1: Remove extremely short blips (noise)
+        segments = [s for s in segments if (s.end - s.start) >= settings.min_segment_duration_s]
+        if not segments:
+            return []
+        merged = []
+        curr = segments[0]
+        for i in range(1, len(segments)):
+            next_seg = segments[i]
+            # If same speaker and gap is small, merge
+            gap = next_seg.start - curr.end
+            if next_seg.speaker == curr.speaker and gap < settings.merge_threshold_s:
+                curr.end = next_seg.end
+                curr.text += " " + next_seg.text
+            else:
+                merged.append(curr)
+                curr = next_seg
+        merged.append(curr)
+        logger.debug(f"Merged segments: {len(segments)} -> {len(merged)}")
+        return merged
+    @classmethod
+    def align_precision(
+        cls,
+        words: List[WordTimestamp],
+        speaker_segments: List[SpeakerSegment]
+    ) -> List[TranscriptSegment]:
+        """
+        Full precision alignment pipeline.
+        Args:
+            words: Word-level timestamps from transcription
+            speaker_segments: Speaker segments from diarization
+        Returns:
+            List of TranscriptSegment with proper speaker assignments
+        """
+        # Step 3c: Assign speakers to words
+        words_with_speakers = cls.assign_speakers_to_words(words, speaker_segments)
+        # Step 3d: Reconstruct segments
+        segments = cls.reconstruct_segments(words_with_speakers)
+        # Step 3e: Clustering/Merging (Optimization)
+        segments = cls.resize_and_merge_segments(segments)
+        return segments
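
A standalone sketch of the grouping rule used by `reconstruct_segments` above, for trying the logic outside the service. `Word`, `Segment`, `group_words`, and `_close` are hypothetical stand-ins for the project's classes, and the 0.5 s and 15 s defaults are assumed values, since `PAUSE_THRESHOLD` and `MAX_SEGMENT_DURATION` are defined outside this part of the diff.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:  # hypothetical stand-in for WordWithSpeaker
    word: str
    start: float
    end: float
    speaker: str


@dataclass
class Segment:  # hypothetical stand-in for TranscriptSegment
    start: float
    end: float
    speaker: str
    text: str


def _close(words: List[Word]) -> Segment:
    """Collapse a run of words into one segment."""
    return Segment(
        start=words[0].start,
        end=words[-1].end,
        speaker=words[0].speaker,
        text=" ".join(w.word for w in words),
    )


def group_words(
    words: List[Word],
    pause_threshold: float = 0.5,  # assumed value
    max_duration: float = 15.0,    # assumed value
    soft_pause: float = 0.15,      # same literal as in the diff
) -> List[Segment]:
    """Split on speaker change, long pause, or an over-long segment at a short pause."""
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        if current:
            pause = word.start - current[-1].end
            duration = current[-1].end - current[0].start
            split = (
                word.speaker != current[0].speaker
                or pause > pause_threshold
                or (duration > max_duration and pause > soft_pause)
            )
            if split:
                segments.append(_close(current))
                current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments


words = [
    Word("hello", 0.0, 0.3, "Speaker 1"),
    Word("there", 0.35, 0.7, "Speaker 1"),
    Word("yes", 0.8, 1.0, "Speaker 2"),   # speaker change starts a new segment
    Word("okay", 2.0, 2.2, "Speaker 2"),  # 1.0 s pause starts a new segment
]
segments = group_words(words)
```

Keeping the words of the open segment in a list, rather than four separate `current_*` variables, is only a presentational choice; the split conditions are the same three as in the committed method.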

app/services/audio_processor.py ADDED

	@@ -0,0 +1,247 @@

+"""
+Audio processing utilities.
+Simple validation and file handling.
+"""
+import logging
+import uuid
+from pathlib import Path
+from typing import Optional, Tuple
+from app.core.config import get_settings
+import ffmpeg
+import asyncio
+from app.services.vocal_separator import VocalSeparator
+from app.services.denoiser import DenoiserService
+logger = logging.getLogger(__name__)
+settings = get_settings()
+class AudioProcessingError(Exception):
+    """Custom exception for audio processing errors."""
+    pass
+class AudioProcessor:
+    ALLOWED_EXTENSIONS = settings.allowed_extensions
+    TARGET_SAMPLE_RATE = settings.sample_rate
+    TARGET_CHANNELS = settings.channels
+    @classmethod
+    def validate_file(cls, filename: str, file_size: int) -> None:
+        """
+        Validate uploaded file.
+        Args:
+            filename: Original filename
+            file_size: File size in bytes
+        Raises:
+            AudioProcessingError: If validation fails
+        """
+        # Check extension
+        ext = filename.rsplit('.', 1)[-1].lower() if '.' in filename else ''
+        if ext not in settings.allowed_extensions:
+            raise AudioProcessingError(
+                f"File type '.{ext}' not supported. "
+                f"Allowed: {', '.join(settings.allowed_extensions)}"
+            )
+        # Check size
+        if file_size > settings.max_upload_size_bytes:
+            raise AudioProcessingError(
+                f"File too large ({file_size / 1024 / 1024:.1f}MB). "
+                f"Maximum size: {settings.max_upload_size_mb}MB"
+            )
+    @classmethod
+    async def save_upload(cls, file_content: bytes, original_filename: str) -> Path:
+        """
+        Save uploaded file to disk.
+        Args:
+            file_content: Raw file bytes
+            original_filename: Original filename for extension
+        Returns:
+            Path to saved file
+        """
+        import aiofiles
+        # Generate unique filename
+        ext = original_filename.rsplit('.', 1)[-1].lower() if '.' in original_filename else 'wav'
+        unique_filename = f"{uuid.uuid4()}.{ext}"
+        file_path = settings.upload_dir / unique_filename
+        # Save file
+        async with aiofiles.open(file_path, 'wb') as f:
+            await f.write(file_content)
+        logger.info(f"Saved upload: {file_path} ({len(file_content) / 1024:.1f}KB)")
+        return file_path
+    @classmethod
+    async def convert_to_wav(cls, input_path: Path) -> Path:
+        """
+        Convert audio to 16kHz mono WAV using FFmpeg.
+        Args:
+            input_path: Path to input audio file
+        Returns:
+            Path to converted WAV file
+        """
+        output_filename = f"{input_path.stem}_processed.wav"
+        output_path = settings.processed_dir / output_filename
+        try:
+            # Run ffmpeg conversion in executor to not block
+            loop = asyncio.get_event_loop()
+            await loop.run_in_executor(None, lambda: cls._run_ffmpeg_conversion(input_path, output_path))
+            logger.info(f"Converted to WAV: {output_path}")
+            return output_path
+        except ffmpeg.Error as e:
+            error_msg = e.stderr.decode() if e.stderr else str(e)
+            logger.error(f"FFmpeg error: {error_msg}")
+            raise AudioProcessingError(f"Audio conversion failed: {error_msg}")
+    @staticmethod
+    def _run_ffmpeg_conversion(input_path: Path, output_path: Path) -> None:
+        """Run the actual FFmpeg conversion (blocking)."""
+        stream = ffmpeg.input(str(input_path))
+        # Apply normalization if enabled (loudnorm is best for speech consistency)
+        if settings.enable_loudnorm:
+            logger.debug("Applying loudnorm normalization...")
+            stream = stream.filter('loudnorm', I=-20, TP=-2, LRA=7)
+        # Apply basic cleanup filters if enabled: band-limit to the speech range, then trim silence
+        if settings.enable_noise_reduction:
+            logger.debug("Applying highpass/lowpass filters and silence trimming...")
+            stream = (
+                stream
+                .filter('highpass', f=60)
+                .filter('lowpass', f=7500)
+                .filter(
+                    # Silence trimming
+                    'silenceremove',
+                    stop_periods=-1,
+                    stop_duration=0.4,
+                    stop_threshold='-45dB'
+                )
+            )
+        # Always write the output, whether or not the optional filters ran
+        (
+            stream.output(
+                str(output_path),
+                acodec='pcm_s16le',
+                ar=16000,
+                ac=1
+            )
+            .overwrite_output()
+            .run(quiet=True, capture_stderr=True)
+        )
+    @classmethod
+    async def get_audio_duration(cls, filepath: Path) -> float:
+        """
+        Get audio file duration in seconds.
+        Args:
+            filepath: Path to audio file
+        Returns:
+            Duration in seconds
+        """
+        try:
+            loop = asyncio.get_event_loop()
+            probe = await loop.run_in_executor(
+                None,
+                lambda: ffmpeg.probe(str(filepath))
+            )
+            duration = float(probe['format'].get('duration', 0))
+            return duration
+        except ffmpeg.Error as e:
+            logger.warning(f"Could not probe audio duration: {e}")
+            return 0.0
+    @classmethod
+    async def cleanup_files(cls, *paths: Path) -> None:
+        """Remove temporary files."""
+        for path in paths:
+            try:
+                if path and path.exists():
+                    path.unlink()
+                    logger.debug(f"Cleaned up: {path}")
+            except Exception as e:
+                logger.warning(f"Failed to cleanup {path}: {e}")
+    @classmethod
+    async def process_upload(cls, file_content: bytes, filename: str) -> Tuple[Path, float]:
+        """
+        Full upload processing pipeline: validate, save, convert.
+        Args:
+            file_content: Uploaded file bytes
+            filename: Original filename
+        Returns:
+            Tuple of (processed WAV path, duration in seconds)
+        """
+        # Validate
+        cls.validate_file(filename, len(file_content))
+        # Save original
+        original_path = await cls.save_upload(file_content, filename)
+        # Track intermediates up front so cleanup can reference them on any code path
+        denoised_path = None
+        vocals_path = None
+        try:
+            # Step 1: Denoising (Speech Enhancement)
+            if settings.enable_denoiser:
+                denoised_path = await DenoiserService.enhance_audio(original_path)
+                source_for_separation = denoised_path
+            else:
+                source_for_separation = original_path
+            # Step 2: Vocal separation using MDX-Net
+            if settings.enable_vocal_separation:
+                vocals_path = await VocalSeparator.separate_vocals(source_for_separation)
+                source_for_conversion = vocals_path
+            else:
+                source_for_conversion = source_for_separation
+            # Step 3: Convert to 16kHz mono WAV (includes normalization)
+            wav_path = await cls.convert_to_wav(source_for_conversion)
+            # Get duration
+            duration = await cls.get_audio_duration(wav_path)
+            # Cleanup intermediate files
+            to_cleanup = [original_path]
+            if denoised_path and denoised_path != original_path:
+                to_cleanup.append(denoised_path)
+            if vocals_path and vocals_path not in [original_path, denoised_path]:
+                to_cleanup.append(vocals_path)
+            await cls.cleanup_files(*to_cleanup)
+            return wav_path, duration
+        except Exception:
+            # Cleanup on error, then re-raise
+            await cls.cleanup_files(original_path)
+            if denoised_path and denoised_path != original_path:
+                await cls.cleanup_files(denoised_path)
+            if vocals_path and vocals_path not in [original_path, denoised_path]:
+                await cls.cleanup_files(vocals_path)
+            raise
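
The cleanup bookkeeping in `process_upload` above (delete every intermediate, keep only the final WAV, and delete everything on failure) can be sketched generically. This is a minimal synchronous illustration, not the committed code: `run_stages`, `Stage`, and `suffix_stage` are hypothetical names, and each stage simply copies bytes instead of denoising or converting audio.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

Stage = Callable[[Path], Path]  # hypothetical: takes an input file, returns an output file


def run_stages(source: Path, stages: List[Stage]) -> Path:
    """Run stages in order; keep only the last output, delete all other files."""
    produced: List[Path] = [source]
    result: Optional[Path] = None
    try:
        for stage in stages:
            produced.append(stage(produced[-1]))
        result = produced[-1]
        return result
    finally:
        # On success `result` is the final file; on failure it is None,
        # so every file created so far (including the upload) is removed.
        for path in produced:
            if path != result and path.exists():
                path.unlink()


def suffix_stage(tag: str) -> Stage:
    """Build a fake stage that copies its input to '<stem>_<tag>.wav'."""
    def stage(path: Path) -> Path:
        out = path.with_name(f"{path.stem}_{tag}.wav")
        out.write_bytes(path.read_bytes())
        return out
    return stage


with tempfile.TemporaryDirectory() as tmp:
    upload = Path(tmp) / "upload.wav"
    upload.write_bytes(b"RIFF")
    final = run_stages(upload, [suffix_stage("denoised"), suffix_stage("processed")])
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

A stage that returns its input unchanged, as `DenoiserService.enhance_audio` does when it falls back to the original file, is handled by the `path.exists()` check, which skips a path that was already removed.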

app/services/denoiser.py ADDED

	@@ -0,0 +1,142 @@

+"""
+Speech Enhancement Service using Facebook's Denoiser.
+Removes background noise and enhances speech quality.
+"""
+import os
+import asyncio
+import logging
+from pathlib import Path
+import torch
+import torchaudio
+from app.core.config import get_settings
+logger = logging.getLogger(__name__)
+settings = get_settings()
+class DenoiserError(Exception):
+    """Custom exception for denoiser errors."""
+    pass
+class DenoiserService:
+    """
+    Service for enhancing speech using Facebook's Denoiser models.
+    Supports dns48, dns64, master64, etc.
+    """
+    _model = None
+    _model_name: str = None
+    @classmethod
+    def _get_model(cls):
+        """Lazy load the Denoiser model."""
+        if cls._model is None or cls._model_name != settings.denoiser_model:
+            from denoiser.pretrained import dns48, dns64, master64
+            model_map = {
+                "dns48": dns48,
+                "dns64": dns64,
+                "master64": master64
+            }
+            model_func = model_map.get(settings.denoiser_model, dns64)
+            logger.debug(f"Loading Denoiser model: {settings.denoiser_model}")
+            model = model_func()
+            device = settings.resolved_device
+            model.to(device)
+            model.eval()
+            cls._model = model
+            cls._model_name = settings.denoiser_model
+            logger.debug(f"Denoiser model loaded on {device}")
+        return cls._model
+    @classmethod
+    async def enhance_audio(cls, input_path: Path) -> Path:
+        """
+        Enhance audio by removing noise.
+        Args:
+            input_path: Path to input audio file
+        Returns:
+            Path to enhanced WAV file
+        """
+        if not settings.enable_denoiser:
+            logger.debug("Denoiser disabled, skipping...")
+            return input_path
+        logger.debug(f"Starting speech enhancement for: {input_path.name}")
+        try:
+            # Run enhancement in executor to not block
+            loop = asyncio.get_event_loop()
+            enhanced_path = await loop.run_in_executor(
+                None,
+                lambda: cls._run_enhancement(input_path)
+            )
+            logger.info(f"Speech enhancement complete: {enhanced_path.name}")
+            return enhanced_path
+        except Exception as e:
+            logger.error(f"Speech enhancement failed: {e}")
+            # Fallback to original on failure rather than failing the whole pipeline
+            logger.warning("Falling back to original audio.")
+            return input_path
+    @classmethod
+    def _run_enhancement(cls, input_path: Path) -> Path:
+        """Run the actual denoiser enhancement (blocking)."""
+        from denoiser.enhance import enhance
+        model = cls._get_model()
+        device = settings.resolved_device
+        # Load audio
+        wav, sr = torchaudio.load(str(input_path))
+        wav = wav.to(device)
+        # Ensure correct sample rate for the model
+        if sr != model.sample_rate:
+            resampler = torchaudio.transforms.Resample(sr, model.sample_rate).to(device)
+            wav = resampler(wav)
+            sr = model.sample_rate
+        # Enhance
+        # wav shape: [channels, time]; the model expects [batch, channels, time]
+        with torch.no_grad():
+            if wav.dim() == 1:
+                wav = wav.unsqueeze(0).unsqueeze(0)
+            elif wav.dim() == 2:
+                wav = wav.unsqueeze(0)
+            # Run the model directly, as in the denoiser README usage.
+            # denoiser.enhance.enhance() appears to be the file-based CLI entry point
+            # rather than a tensor API, so it is not called here.
+            enhanced = model(wav)
+            # remove batch dim
+            enhanced = enhanced.squeeze(0)
+        # Save enhanced audio
+        output_filename = f"{input_path.stem}_denoised.wav"
+        output_path = settings.processed_dir / output_filename
+        torchaudio.save(
+            str(output_path),
+            enhanced.cpu(),
+            sr
+        )
+        return output_path
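
The lazy-loading pattern in `_get_model` above (load on first use, reload only when the configured model name changes, fall back to a default for unknown names) can be sketched without torch. `LazyModelCache` and `fake_loader` are hypothetical names for illustration; the loaders return strings where the real code would return denoiser models.

```python
from typing import Callable, Dict, List, Optional


class LazyModelCache:
    """Load a model on first use and reload only when the resolved name changes."""

    def __init__(self, loaders: Dict[str, Callable[[], object]], default: str) -> None:
        self._loaders = loaders
        self._default = default
        self._model: Optional[object] = None
        self._name: Optional[str] = None

    @property
    def loaded_name(self) -> Optional[str]:
        return self._name

    def get(self, requested: str) -> object:
        # Unknown names resolve to the default, mirroring model_map.get(..., dns64)
        name = requested if requested in self._loaders else self._default
        if self._model is None or self._name != name:
            self._model = self._loaders[name]()
            self._name = name
        return self._model


load_calls: List[str] = []


def fake_loader(name: str) -> Callable[[], object]:
    """Hypothetical loader that records each call instead of loading weights."""
    def load() -> object:
        load_calls.append(name)
        return f"<model {name}>"
    return load


cache = LazyModelCache(
    {name: fake_loader(name) for name in ("dns48", "dns64", "master64")},
    default="dns64",
)
first = cache.get("dns48")
second = cache.get("dns48")      # served from the cache, no second load
fallback = cache.get("unknown")  # resolves to the default entry
```

One difference from the committed method is deliberate: the sketch stores the resolved name rather than the configured one, so an unknown name does not trigger a reload on every call.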

app/services/diarization.py ADDED

	@@ -0,0 +1,223 @@

+# The `DiarizationService` class provides a production-grade speaker diarization service for call
+# centers, including role inference based on speaking duration and asynchronous diarization
+# capabilities.
+"""
+Speaker diarization service using pyannote.audio.
+QA / Production optimized diarization for call center.
+"""
+import logging
+from pathlib import Path
+from typing import List, Optional, Dict
+from dataclasses import dataclass
+import torch
+from app.core.config import get_settings
+logger = logging.getLogger(__name__)
+settings = get_settings()
+# =========================
+# Data model
+# =========================
+@dataclass
+class SpeakerSegment:
+    start: float
+    end: float
+    speaker: str
+    @property
+    def duration(self) -> float:
+        return self.end - self.start
+@dataclass
+class DiarizationResult:
+    segments: List[SpeakerSegment]
+    speaker_count: int
+    speakers: List[str]
+    roles: Dict[str, str]
+# =========================
+# Diarization Service
+# =========================
+class DiarizationService:
+    """
+    Production-grade speaker diarization service.
+    """
+    _instance: Optional["DiarizationService"] = None
+    _pipeline = None
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+    # -------------------------
+    # Pipeline loading
+    # -------------------------
+    @classmethod
+    def get_pipeline(cls):
+        if cls._pipeline is None:
+            from pyannote.audio import Pipeline
+            if not settings.hf_token:
+                raise ValueError("HF_TOKEN is required for diarization")
+            logger.info(
+                f"Loading diarization model: {settings.diarization_model}"
+            )
+            pipeline = Pipeline.from_pretrained(
+                settings.diarization_model,
+                token=settings.hf_token
+            )
+            pipeline.instantiate({
+                "clustering": {
+                    "threshold": 0.65
+                },
+                "segmentation": {
+                    "min_duration_off": 0.4  # avoid fragment explosion
+                }
+            })
+            device = torch.device(settings.resolved_device)
+            if device.type == "cuda":
+                pipeline = pipeline.to(device)
+                logger.info("Diarization pipeline moved to GPU")
+            cls._pipeline = pipeline
+        return cls._pipeline
+    # -------------------------
+    # Role inference (CALL CENTER)
+    # -------------------------
+    @staticmethod
+    def infer_roles(segments: List[SpeakerSegment]) -> Dict[str, str]:
+        """
+        Infer Agent / Customer roles based on total speaking duration.
+        Agent usually speaks the most.
+        """
+        duration_map: Dict[str, float] = {}
+        for seg in segments:
+            duration_map[seg.speaker] = (
+                duration_map.get(seg.speaker, 0.0) + seg.duration
+            )
+        if not duration_map:
+            return {}
+        # Agent = speaker with max duration
+        agent = max(duration_map, key=duration_map.get)
+        roles = {}
+        for speaker in duration_map:
+            roles[speaker] = "NV" if speaker == agent else "KH"
+        return roles
+    # -------------------------
+    # Main diarization
+    # -------------------------
+    @classmethod
+    def diarize(
+        cls,
+        audio_path: Path,
+        num_speakers: Optional[int] = None,
+        min_speakers: int = 1,
+        max_speakers: int = 10
+    ) -> DiarizationResult:
+        pipeline = cls.get_pipeline()
+        logger.debug(f"Diarizing file: {audio_path}")
+        params = {}
+        if num_speakers is not None:
+            params["num_speakers"] = num_speakers
+        else:
+            params["min_speakers"] = min_speakers
+            params["max_speakers"] = max_speakers
+        diarization = pipeline(str(audio_path), **params)
+        annotation = (
+            diarization.speaker_diarization
+            if hasattr(diarization, "speaker_diarization")
+            else diarization
+        )
+        # step 1: diarize
+        raw_segments: List[SpeakerSegment] = []
+        speaker_map = {}
+        speaker_idx = 1
+        for turn, _, speaker in annotation.itertracks(yield_label=True):
+            if speaker not in speaker_map:
+                speaker_map[speaker] = f"Speaker {speaker_idx}"
+                speaker_idx += 1
+            raw_segments.append(
+                SpeakerSegment(
+                    start=float(turn.start),
+                    end=float(turn.end),
+                    speaker=speaker_map[speaker]
+                )
+            )
+        raw_segments.sort(key=lambda s: s.start)
+        unique_speakers = []
+        for seg in raw_segments:
+            if seg.speaker not in unique_speakers:
+                unique_speakers.append(seg.speaker)
+        roles = cls.infer_roles(raw_segments)
+        logger.info(
+            f"Diarization done | "
+            f"Segments: {len(raw_segments)} | "
+            f"Speakers: {len(unique_speakers)} | "
+            f"Roles: {roles}"
+        )
+        return DiarizationResult(
+            segments=raw_segments,
+            speaker_count=len(unique_speakers),
+            speakers=unique_speakers,
+            roles=roles
+        )
+    # -------------------------
+    # Async
+    # -------------------------
+    @classmethod
+    async def diarize_async(
+        cls,
+        audio_path: Path,
+        num_speakers: Optional[int] = None,
+        min_speakers: int = 1,
+        max_speakers: int = 10
+    ) -> DiarizationResult:
+        import asyncio
+        loop = asyncio.get_event_loop()
+        return await loop.run_in_executor(
+            None,
+            lambda: cls.diarize(
+                audio_path,
+                num_speakers,
+                min_speakers,
+                max_speakers
+            )
+        )
+    @classmethod
+    def preload_pipeline(cls) -> None:
+        try:
+            cls.get_pipeline()
+        except Exception as e:
+            logger.warning(
+                f"Failed to preload diarization pipeline: {e}"
+            )
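
The duration-based role heuristic in `infer_roles` above can be tried on plain tuples. This is a minimal sketch under stated assumptions: the `Turn` alias and the sample turns are invented for illustration, and the labels `NV` and `KH` are copied from the diff, where they mark the agent and the customer respectively.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Turn = Tuple[float, float, str]  # hypothetical: (start, end, speaker)


def infer_roles(
    turns: Iterable[Turn],
    agent_label: str = "NV",
    customer_label: str = "KH",
) -> Dict[str, str]:
    """Label the speaker with the largest total speaking time as the agent."""
    totals: Dict[str, float] = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    if not totals:
        return {}
    agent = max(totals, key=totals.get)
    return {
        speaker: agent_label if speaker == agent else customer_label
        for speaker in totals
    }


turns = [
    (0.0, 4.0, "Speaker 1"),
    (4.2, 5.0, "Speaker 2"),
    (5.1, 9.0, "Speaker 1"),
]
roles = infer_roles(turns)
```

The heuristic mislabels calls in which the customer talks more than the agent, and on an exact tie `max` returns whichever speaker appeared first, so the result is best treated as a default that downstream code may override.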

app/services/emo.py ADDED

	@@ -0,0 +1,169 @@

+import os
+import sys
+import time
+import logging
+import librosa
+import numpy as np
+import torch
+from torch.nn import functional as F
+from huggingface_hub import hf_hub_download
+logger = logging.getLogger(__name__)
+# HuggingFace repo
+AVAILABLE_MODELS = {
+    "dual_emotion": "vyluong/emo_dual_classi"
+}
+emotion_labels = ['Angry', 'Anxiety', 'Happy', 'Sad', 'Neutral']
+EMOTION_META = {
+    "Angry": {"emoji": "😡", "color": "#ff4d4f"},
+    "Anxiety": {"emoji": "😰", "color": "#faad14"},
+    "Happy": {"emoji": "😊", "color": "#52c41a"},
+    "Sad": {"emoji": "😢", "color": "#1890ff"},
+    "Neutral": {"emoji": "😐", "color": "#d9d9d9"},
+}
+class EmotionService:
+    _models = {}
+    emotion_labels = emotion_labels
+    meta = EMOTION_META
+    @classmethod
+    def load_dual_model(cls, repo_id, device):
+        logger.info(f"Downloading model from HF: {repo_id}")
+        model_file = hf_hub_download(
+            repo_id=repo_id,
+            filename="pytorch_model.bin"
+        )
+        model_code = hf_hub_download(
+            repo_id=repo_id,
+            filename="model.py"
+        )
+        # add model folder to python path
+        sys.path.append(os.path.dirname(model_code))
+        from model import Dual
+        model = Dual()
+        state_dict = torch.load(model_file, map_location=device)
+        model.load_state_dict(state_dict)
+        model.to(device)
+        model.eval()
+        logger.info("Emotion model loaded successfully")
+        return model
+    @classmethod
+    def get_model(cls, model_name="dual_emotion"):
+        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+        if model_name in cls._models:
+            return cls._models[model_name]
+        repo_id = AVAILABLE_MODELS[model_name]
+        model = cls.load_dual_model(repo_id, device)
+        cls._models[model_name] = model
+        return model
+    @classmethod
+    def preload_model(cls):
+        logger.info("Preloading emotion model...")
+        cls.get_model()
+        logger.info("Emotion model ready")
+    # extract mfcc from segments
+    @staticmethod
+    def extract_mfcc_segment(
+        audio: np.ndarray,
+        sr: int,
+        start: float,
+        end: float,
+        duration: float = 5.0,
+        n_mfcc: int = 128,
+        n_fft: int = 2048,
+        hop_length: int = 512
+    ):
+        start_sample = int(start * sr)
+        end_sample = int(end * sr)
+        segment = audio[start_sample:end_sample]
+        if len(segment) == 0:
+            return None
+        target_len = int(sr * duration)
+        if len(segment) < target_len:
+            segment = np.pad(segment, (0, target_len - len(segment)), mode="symmetric")
+        else:
+            segment = segment[:target_len]
+        mfcc = librosa.feature.mfcc(
+            y=segment,
+            sr=sr,
+            n_mfcc=n_mfcc,
+            n_fft=n_fft,
+            hop_length=hop_length
+        )
+        return mfcc
+    @classmethod
+    def predict_from_mfcc(cls, mfcc):
+        model = cls.get_model()
+        tensor = torch.from_numpy(mfcc).unsqueeze(0).unsqueeze(0).float()
+        device = next(model.parameters()).device
+        tensor = tensor.to(device)
+        with torch.no_grad():
+            output = model(tensor)
+        probs = F.softmax(output.squeeze(), dim=0).cpu().numpy()
+        label = cls.emotion_labels[np.argmax(probs)]
+        return label
+    # predict from segments
+    @classmethod
+    def predict_segment(cls, audio, sr, start, end):
+        mfcc = cls.extract_mfcc_segment(audio, sr, start, end)
+        if mfcc is None:
+            return "Neutral"
+        return cls.predict_from_mfcc(mfcc)
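
The last step of `predict_from_mfcc` above (softmax over the model output, then argmax into the label list) can be sketched in pure Python. The label order is copied from the diff; `softmax`, `top_emotion`, and the logits are hypothetical, and returning the probability alongside the label is an illustrative addition, since the committed method returns only the label.

```python
import math
from typing import List, Sequence, Tuple

# Same order as emotion_labels in the diff
EMOTION_LABELS = ["Angry", "Anxiety", "Happy", "Sad", "Neutral"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_emotion(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    if len(logits) != len(EMOTION_LABELS):
        raise ValueError("expected one logit per emotion label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTION_LABELS[best], probs[best]


# Made-up logits for illustration; index 2 ("Happy") is the largest
label, confidence = top_emotion([0.1, -1.2, 2.5, 0.3, 0.0])
```

Exposing the probability would let a caller fall back to "Neutral" below some confidence threshold, which is one possible refinement rather than something the commit does.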

app/services/processor.py ADDED

	@@ -0,0 +1,623 @@

+import logging
+import subprocess
+import time
+from pathlib import Path
+from typing import List, Dict, Optional, Tuple
+from dataclasses import dataclass
+from collections import defaultdict, Counter
+import numpy as np
+import librosa
+import torch
+from app.core.config import get_settings
+from app.services.transcription import TranscriptionService
+from app.services.alignment import AlignmentService
+from app.services.transcription import WordTimestamp
+from app.services.emo import EmotionService
+from app.services.diarization import DiarizationService, SpeakerSegment, DiarizationResult
+logger = logging.getLogger(__name__)
+settings = get_settings()
+@dataclass
+class TranscriptSegment:
+    """A transcribed segment with speaker info."""
+    start: float
+    end: float
+    speaker: str
+    role: Optional[str]
+    text: str
+    emotion: Optional[str] = None
+@dataclass
+class EmotionPoint:
+    time: float
+    emotion: str
+@dataclass
+class EmotionChange:
+    time: float
+    emotion_from: str
+    emotion_to: str
+    icon_from: Optional[str] = None
+    icon_to: Optional[str] = None
+@dataclass
+class ProcessingResult:
+    """Result of audio processing."""
+    segments: List[TranscriptSegment]
+    speaker_count: int
+    duration: float
+    processing_time: float
+    speakers: List[str]
+    roles: Dict[str, str]
+    txt_content: str = ""
+    csv_content: str = ""
+    emotion_timeline: Optional[List[EmotionPoint]] = None
+    emotion_changes: Optional[List[EmotionChange]] = None
+def pad_and_refine_tensor(
+    waveform: torch.Tensor,
+    sr: int,
+    start_s: float,
+    end_s: float,
+    pad_ms: int = 250,
+) -> Tuple[float, float]:
+    total_len = waveform.shape[1]
+    s = max(int((start_s - pad_ms / 1000) * sr), 0)
+    e = min(int((end_s + pad_ms / 1000) * sr), total_len)
+    if e <= s:
+        return start_s, end_s
+    return s / sr, e / sr
+def normalize_asr_result(result: dict):
+    words = []
+    for w in result.get("words", []):
+        word = w.get("word", "").strip()
+        if not word:
+            continue
+        words.append(
+            {
+                "word": word,
+                "start": float(w["start"]),
+                "end": float(w["end"]),
+                "speaker": w.get("speaker"),
+            }
+        )
+    text = result.get("text", "").strip()
+    return text, words
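+def guess_speaker_by_overlap(start, end, diar_segments):
+    """Return the speaker whose diarization segment overlaps [start, end] the most.
+    Falls back to the first segment's speaker when nothing overlaps; expects a non-empty list.
+    """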
+    best_spk = None
+    best_overlap = 0.0
+    for seg in diar_segments:
+        overlap = max(0.0, min(end, seg.end) - max(start, seg.start))
+        if overlap > best_overlap:
+            best_overlap = overlap
+            best_spk = seg.speaker
+    return best_spk or diar_segments[0].speaker
+def convert_audio_to_wav(audio_path: Path) -> Path:
+    """Convert any audio to WAV 16kHz Mono using ffmpeg."""
+    output_path = audio_path.parent / f"{audio_path.stem}_processed.wav"
+    if output_path.exists():
+        output_path.unlink()
+    command = ["ffmpeg", "-i", str(audio_path), "-ar", "16000", "-ac", "1", "-y", str(output_path)]
+    try:
+        subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
+        logger.info(f"Converted audio to WAV: {output_path}")
+        return output_path
+    except subprocess.CalledProcessError as e:
+        logger.error(f"FFmpeg conversion failed: {e}")
+        return audio_path
+def format_timestamp(seconds: float) -> str:
+    m = int(seconds // 60)
+    s = seconds % 60
+    return f"{m:02d}:{s:06.3f}"
+def extract_mfcc_segment(
+    audio: np.ndarray,
+    sr: int,
+    start: float,
+    end: float,
+    duration=5,
+):
+    start_sample = int(start * sr)
+    end_sample = int(end * sr)
+    segment = audio[start_sample:end_sample]
+    if len(segment) == 0:
+        return None
+    target_len = int(sr * duration)
+    if len(segment) < target_len:
+        segment = np.pad(segment, (0, target_len - len(segment)), mode="symmetric")
+    else:
+        segment = segment[:target_len]
+    mfcc = librosa.feature.mfcc(
+        y=segment,
+        sr=sr,
+        n_mfcc=128,
+        n_fft=2048,
+        hop_length=512
+    )
+    return mfcc
+def merge_consecutive_segments(
+    segments: List[SpeakerSegment],
+    max_gap: float = 0.8,
+    min_duration: float = 0.15,
+) -> List[SpeakerSegment]:
+    """Merge consecutive segments from same speaker."""
+    if not segments:
+        return []
+    merged = []
+    current = SpeakerSegment(
+        start=segments[0].start,
+        end=segments[0].end,
+        speaker=segments[0].speaker
+    )
+    for seg in segments[1:]:
+        seg_dur = seg.end - seg.start
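+        same_speaker_close = seg.speaker == current.speaker and (seg.start - current.end) <= max_gap
+        # Very short segments are absorbed into the current one regardless of speaker
+        if same_speaker_close or seg_dur < min_duration:
+            # Merge: extend current segment
+            current.end = seg.end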
+        else:
+            # New speaker or gap too large
+            merged.append(current)
+            current = SpeakerSegment(
+                start=seg.start,
+                end=seg.end,
+                speaker=seg.speaker
+            )
+    merged.append(current)
+    return merged
+def overlap_prefix(a: str, b: str, n: int = 12) -> bool:
+    if not a or not b:
+        return False
+    a = a.strip().lower()
+    b = b.strip().lower()
+    return a[:n] in b or b[:n] in a
+class Processor:
+    @classmethod
+    async def process_audio(
+        cls,
+        audio_path: Path,
+        model_name: str = "PhoWhisper Lora Finetuned",
+        language="vi",
+        merge_segments: bool = True,
+    ) -> ProcessingResult:
+        import asyncio
+        t0 = time.time()
+        EmotionService.preload_model()
+        # 1: Convert to WAV
+        logger.info("Step 1: Converting audio to WAV 16kHz...")
+        wav_path = await asyncio.get_event_loop().run_in_executor(None, convert_audio_to_wav, audio_path)
+        # 2: Load audio
+        y, sr = librosa.load(wav_path, sr=16000, mono=True)
+        waveform = torch.from_numpy(y).unsqueeze(0)
+        if y.size == 0:
+            raise ValueError("Empty audio")
+        duration = len(y) / sr
+        # 3: Diarization
+        logger.info("Step 3: Running diarization...")
+        diarization: DiarizationResult = await DiarizationService.diarize_async(wav_path)
+        diarization_segments = diarization.segments or []
+        speakers = diarization.speakers or []
+        roles = diarization.roles or {}
+        if not diarization_segments:
+            diarization_segments = [SpeakerSegment(0.0, duration, "SPEAKER_0")]
+            speakers = ["SPEAKER_0"]
+            roles = {"SPEAKER_0": "KH"}
+        diarization_segments.sort(key=lambda x: x.start)
+        diarization_segments = [
+            SpeakerSegment(
+                *pad_and_refine_tensor(waveform, sr, s.start, s.end),
+                speaker=s.speaker,
+            )
+            for s in diarization_segments
+        ]
+        diarization_segments.sort(key=lambda x: x.start)
+        if merge_segments and diarization_segments:
+            logger.info("Step 4: Merging consecutive segments...")
+            diarization_segments = merge_consecutive_segments(diarization_segments)
+        # 5: Normalize speaker labels (raw diarization IDs -> "Speaker N")
+        raw_speakers = sorted({seg.speaker for seg in diarization_segments})
+        speaker_map = {
+            spk: f"Speaker {i+1}"
+            for i, spk in enumerate(raw_speakers)
+        }
+        speakers = list(speaker_map.values())
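+        # 6: Normalize roles. The speaker with the longest total talk time
+        # is tagged "NV"; every other speaker is tagged "KH".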
+        speaker_duration = defaultdict(float)
+        for seg in diarization_segments:
+            speaker_duration[seg.speaker] += seg.end - seg.start
+        logger.info(f"speaker_duration(raw) = {speaker_duration}")
+        if speaker_duration:
+            agent_raw = max(speaker_duration, key=speaker_duration.get)
+            roles = {
+                speaker_map[spk]: ("NV" if spk == agent_raw else "KH")
+                for spk in speaker_duration
+            }
+        else:
+            roles = {}
+        # Default fallback
+        for label in speakers:
+            roles.setdefault(label, "KH")
+        logger.info(f"roles(mapped) = {roles}")
+        # 7: Transcribe segments after diarization
+        logger.info("Step 7: Running ASR with external VAD batch...")
+        asr_result = await TranscriptionService.transcribe_with_words_async(
+            audio_array=y,
+            model_name=model_name,
+            language=language,
+            vad_options=True
+        )
+        text, raw_words = normalize_asr_result(asr_result)
+        processed_segments: List[TranscriptSegment] = []
+        if not raw_words:
+            processed_segments = [
+                TranscriptSegment(
+                    start=0.0,
+                    end=duration,
+                    speaker=speakers[0],
+                    role=roles[speakers[0]],
+                    text="(No speech detected)"
+                )
+            ]
+        else:
+            # ===== CONVERT TO WordTimestamp =====
+            word_objs: List[WordTimestamp] = []
+            for w in raw_words:
+                spk = w.get("speaker")
+                if spk is None:
+                    spk = guess_speaker_by_overlap(
+                        w["start"], w["end"], diarization_segments
+                    )
+                word_objs.append(
+                    WordTimestamp(
+                        word=w["word"],
+                        start=w["start"],
+                        end=w["end"],
+                        speaker=spk,
+                    )
+                )
+            word_objs.sort(key=lambda x: x.start)
+            # ===== ALIGNMENT =====
+            aligned_segments = AlignmentService.align_precision(
+                word_objs,
+                diarization_segments
+            )
+            processed_segments = []
+            if not aligned_segments:
+                vote = [w.speaker for w in word_objs if w.speaker]
+                if vote:
+                    raw_spk = Counter(vote).most_common(1)[0][0]
+                else:
+                    raw_spk = diarization_segments[0].speaker
+                label = speaker_map.get(raw_spk, "Speaker 1")
+                processed_segments.append(
+                    TranscriptSegment(
+                        start=0.0,
+                        end=duration,
+                        speaker=label,
+                        role=roles.get(label, "KH"),
+                        text=text,
+                    )
+                )
+            else:
+                for seg in aligned_segments:
+                    raw_spk = seg.speaker
+                    label = speaker_map.get(raw_spk, "Speaker 1")
+                    role = roles.get(label, "KH")
+                    processed_segments.append(
+                        TranscriptSegment(
+                            start=seg.start,
+                            end=seg.end,
+                            speaker=label,
+                            role=role,
+                            text=seg.text,
+                        )
+                    )
+        processed_segments = cls._merge_adjacent_segments(
+            processed_segments
+        )
+        processed_segments.sort(key=lambda x: x.start)
+        # 8 : Predict emotion segments
+        logger.info("Step 8: Predicting emotion per segment...")
+        processed_segments = cls._predict_emotion_segments(
+            processed_segments,
+            y,
+            sr
+        )
+        # build emotion timeline
+        emotion_timeline = cls.build_emotion_timeline(processed_segments)
+        # detect emotion change
+        emotion_changes = cls.detect_emotion_changes(emotion_timeline)
+        processing_time = time.time() - t0
+        txt_content = cls._generate_txt(
+            processed_segments,
+            len(speakers),
+            processing_time,
+            duration,
+            roles
+        )
+        csv_content = cls._generate_csv(processed_segments)
+        return ProcessingResult(
+            segments=processed_segments,
+            speaker_count=len(speakers),
+            duration=duration,
+            processing_time=processing_time,
+            speakers=speakers,
+            roles=roles,
+            txt_content=txt_content,
+            csv_content=csv_content,
+            emotion_timeline=emotion_timeline,
+            emotion_changes=emotion_changes
+        )
+    @staticmethod
+    def _merge_adjacent_segments(
+        segments: List[TranscriptSegment],
+        max_gap_s: float = 0.8,
+        max_segment_duration: float = 9.0
+    ) -> List[TranscriptSegment]:
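+        """
+        Merge adjacent segments when all of the following hold:
+        - same speaker and same role
+        - gap <= max_gap_s
+        - combined duration <= max_segment_duration
+        - neither text starts with a prefix found in the other (likely duplicate)
+        """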
+        if not segments:
+            return segments
+        segments = sorted(segments, key=lambda s: s.start)
+        merged = [segments[0]]
+        for seg in segments[1:]:
+            prev = merged[-1]
+            gap = seg.start - prev.end
+            combined_duration = seg.end - prev.start
+            if (
+                seg.speaker == prev.speaker  and seg.role == prev.role
+                and gap <= max_gap_s
+                and combined_duration <= max_segment_duration
+                and not overlap_prefix(seg.text, prev.text)
+            ):
+                # MERGE
+                prev.text = f"{prev.text} {seg.text}".strip()
+                prev.end = max(prev.end, seg.end)
+            else:
+                merged.append(seg)
+        return merged
+    @staticmethod
+    def _predict_emotion_segments(
+        segments: List[TranscriptSegment],
+        audio: np.ndarray,
+        sr: int
+    ):
+        for seg in segments:
+            # Only predict emotion for KH segments
+            if seg.role != "KH":
+                seg.emotion = None
+                continue
+            seg.emotion = EmotionService.predict_segment(
+                audio,
+                sr,
+                seg.start,
+                seg.end
+            )
+        return segments
+    @staticmethod
+    def build_emotion_timeline(segments):
+        timeline = []
+        for seg in segments:
+            if seg.role != "KH":
+                continue
+            if not seg.emotion:
+                continue
+            icon = EmotionService.meta.get(seg.emotion, {}).get("emoji", "🙂")
+            timeline.append(
+                EmotionPoint(
+                    time=seg.start,
+                    emotion=seg.emotion,
+                    icon=icon
+                )
+            )
+        return timeline
+    @staticmethod
+    def detect_emotion_changes(timeline):
+        changes = []
+        prev = None
+        for point in timeline:
+            if prev is not None and prev.emotion != point.emotion:
+                icon_from = EmotionService.meta.get(prev.emotion, {}).get("emoji", "🙂")
+                icon_to = EmotionService.meta.get(point.emotion, {}).get("emoji", "🙂")
+                changes.append(
+                    EmotionChange(
+                        time=point.time,
+                        emotion_from=prev.emotion,
+                        emotion_to=point.emotion,
+                        icon_from=icon_from,
+                        icon_to=icon_to
+                    )
+                )
+            prev = point
+        return changes
+    @classmethod
+    def _generate_txt(
+            cls,
+            segments: List[TranscriptSegment],
+            speaker_count: int,
+            processing_time: float,
+            duration: float,
+            roles: Dict[str, str],
+        ) -> str:
+        segments = sorted(segments, key=lambda s: s.start)
+        speakers = []
+        for seg in segments:
+            if seg.speaker and seg.speaker not in speakers:
+                speakers.append(seg.speaker)
+        lines = [
+            "# Transcription Result",
+            f"# Duration: {format_timestamp(duration)}",
+            f"# Speakers: {speaker_count}",
+            f"# Roles: {roles}",
+            f"# Processing time: {processing_time:.1f}s",
+            "",
+        ]
+        icon_pool = ["🔵", "🟢", "🟡", "🟠", "🔴", "🟣"]
+        speaker_icons = {
+            spk: icon_pool[i % len(icon_pool)]
+            for i, spk in enumerate(speakers)
+        }
+        for seg in segments:
+            ts = f"[{format_timestamp(seg.start)} → {format_timestamp(seg.end)}]"
+            role = seg.role or "UNKNOWN"
+            speaker_icon = speaker_icons.get(seg.speaker, "⚪")
+            lines.append(
+                f"{ts} {speaker_icon} [{seg.speaker}|{role}] {seg.text}"
+            )
+        return "\n".join(lines)
+    @classmethod
+    def _generate_csv(cls, segments: List[TranscriptSegment]) -> str:
+        import csv
+        from io import StringIO
+        output = StringIO()
+        writer = csv.writer(output)
+        writer.writerow(["start", "end", "speaker", "text"])
+        for seg in segments:
+            writer.writerow([round(seg.start, 3), round(seg.end, 3), seg.speaker, seg.text])
+        return output.getvalue()
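
The role assignment in `Processor.process_audio` comes down to a talk-time heuristic: the speaker with the largest total duration is tagged `NV` and everyone else `KH`. The sketch below isolates that rule so it can be tried without the rest of the pipeline. `Segment` and `assign_roles` are hypothetical names introduced here for illustration (they are not part of the commit), and reading `NV` as the agent and `KH` as the customer is an inference from the `agent_raw` variable name, not something the source states.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    # Hypothetical stand-in for SpeakerSegment; only the fields used here.
    start: float
    end: float
    speaker: str


def assign_roles(segments: List[Segment]) -> Dict[str, str]:
    """Tag the speaker with the most total talk time "NV", all others "KH"."""
    talk_time: Dict[str, float] = defaultdict(float)
    for seg in segments:
        talk_time[seg.speaker] += seg.end - seg.start
    if not talk_time:
        return {}
    # Display labels: sorted raw IDs become "Speaker 1", "Speaker 2", ...
    speaker_map = {
        spk: f"Speaker {i + 1}" for i, spk in enumerate(sorted(talk_time))
    }
    agent = max(talk_time, key=talk_time.get)
    return {
        speaker_map[spk]: ("NV" if spk == agent else "KH") for spk in talk_time
    }


segments = [
    Segment(0.0, 4.0, "SPEAKER_01"),
    Segment(4.0, 5.0, "SPEAKER_00"),
    Segment(5.0, 9.0, "SPEAKER_01"),
]
roles = assign_roles(segments)
print(roles)
```

Because the rule depends only on accumulated duration, a call in which the customer talks longer than the agent would be labelled the other way round; a caller that needs stronger guarantees would have to add another signal.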

app/services/silero_vad_service.py ADDED

	@@ -0,0 +1,72 @@

+import numpy as np
+import librosa
+from typing import List, Tuple
+from silero_vad import load_silero_vad, get_speech_timestamps
+class SileroVADService:
+    _model = None
+    @classmethod
+    def load_model(cls):
+        if cls._model is None:
+            cls._model = load_silero_vad()
+        return cls._model
+    @classmethod
+    def get_speech_timestamps(
+        cls,
+        audio: np.ndarray,
+        sr: int
+    ) -> List[Tuple[float, float]]:
+        model = cls.load_model()
+        audio = audio.astype(np.float32)
+        # Peak-normalize to [-1, 1] (skipped for all-zero audio)
+        peak = np.max(np.abs(audio))
+        if peak > 0:
+            audio = audio / peak
+        if sr != 16000:
+            audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
+            sr = 16000
+        speech = get_speech_timestamps(
+            audio,
+            model,
+            sampling_rate=sr
+        )
+        # Convert sample indices to seconds
+        segments = [
+            (seg["start"] / sr, seg["end"] / sr)
+            for seg in speech
+            if seg["end"] > seg["start"]
+        ]
+        # Drop segments shorter than MIN_SPEECH_SEC
+        MIN_SPEECH_SEC = 0.25
+        segments = [
+            (s, e) for s, e in segments
+            if (e - s) >= MIN_SPEECH_SEC
+        ]
+        # Merge segments separated by less than MERGE_GAP seconds
+        MERGE_GAP = 0.15
+        merged = []
+        for s, e in segments:
+            if not merged:
+                merged.append([s, e])
+                continue
+            prev = merged[-1]
+            if s - prev[1] < MERGE_GAP:
+                prev[1] = e
+            else:
+                merged.append([s, e])
+        return [(s, e) for s, e in merged]
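
The post-processing at the end of `SileroVADService.get_speech_timestamps` can be exercised without loading the Silero model. The sketch below reproduces only the two list operations, in the same order as the service (short segments are dropped first, then neighbours closer than the merge gap are joined). `postprocess_speech` is a hypothetical helper name, and the input spans are made-up values in seconds.

```python
from typing import List, Tuple


def postprocess_speech(
    segments: List[Tuple[float, float]],
    min_speech_sec: float = 0.25,
    merge_gap: float = 0.15,
) -> List[Tuple[float, float]]:
    """Drop short speech spans, then merge spans separated by a small gap."""
    # 1) Filter: keep only spans at least min_speech_sec long.
    kept = [(s, e) for s, e in segments if (e - s) >= min_speech_sec]

    # 2) Merge: extend the previous span when the gap is below merge_gap.
    merged: List[List[float]] = []
    for s, e in kept:
        if merged and s - merged[-1][1] < merge_gap:
            merged[-1][1] = e
        else:
            merged.append([s, e])
    return [(s, e) for s, e in merged]


spans = [(0.0, 1.0), (1.1, 2.0), (2.5, 2.6), (3.0, 3.5)]
result = postprocess_speech(spans)
print(result)
```

Filtering before merging means a short burst is discarded even if it sits right next to a longer span; swapping the two steps would keep such bursts by absorbing them into their neighbour.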

app/services/transcription.py ADDED

	@@ -0,0 +1,283 @@

+"""
+Transcription service using faster-whisper.
+Supports multiple Vietnamese Whisper models with caching.
+"""
+import logging
+from typing import Dict, List, Optional, Tuple
+from dataclasses import dataclass
+import re
+import librosa
+import numpy as np
+from faster_whisper import WhisperModel
+from app.core.config import get_settings
+logger = logging.getLogger(__name__)
+settings = get_settings()
+# Available Whisper models for Vietnamese
+AVAILABLE_MODELS = {
+    "EraX-WoW-Turbo": "erax-ai/EraX-WoW-Turbo-V1.1-CT2",
+    "PhoWhisper Large": "kiendt/PhoWhisper-large-ct2",
+    "PhoWhisper Lora Finetuned": "vyluong/pho-whisper-vi-ct2"
+}
+@dataclass
+class WordTimestamp:
+    """A single word with precise timestamp."""
+    word: str
+    start: float
+    end: float
+    speaker: Optional[str] = None
+class TranscriptionService:
+    """
+    Service for speech-to-text transcription using faster-whisper.
+    Supports multiple models with caching.
+    """
+    _models: Dict[str, WhisperModel] = {}
+    @classmethod
+    def get_model(cls, model_name: str = None) -> WhisperModel:
+        """
+        Get or load a Whisper model (lazy loading with caching).
+        Args:
+            model_name: Name of the model from AVAILABLE_MODELS
+        Returns:
+            Loaded WhisperModel instance
+        """
+        if model_name is None:
+            model_name = settings.default_whisper_model
+        cache_key = f"{model_name}_{settings.resolved_compute_type}"
+        if cache_key in cls._models:
+            return cls._models[cache_key]
+        # Get model path
+        if model_name in AVAILABLE_MODELS:
+            model_path = AVAILABLE_MODELS[model_name]
+        else:
+            # Fallback to first available model
+            model_name = list(AVAILABLE_MODELS.keys())[0]
+            model_path = AVAILABLE_MODELS[model_name]
+        logger.info(f"Loading Whisper model: {model_name} ({model_path})")
+        logger.debug(f"Device: {settings.resolved_device}, Compute type: {settings.resolved_compute_type}")
+        model = WhisperModel(
+            model_path,
+            device=settings.resolved_device,
+            compute_type=settings.resolved_compute_type,
+        )
+        cls._models[cache_key] = model
+        logger.info(f"Whisper model loaded: {model_name}")
+        return model
+    @classmethod
+    def is_loaded(cls, model_name: str = None) -> bool:
+        """Check if a model is loaded."""
+        if model_name is None:
+            model_name = settings.default_whisper_model
+        cache_key = f"{model_name}_{settings.resolved_compute_type}"
+        return cache_key in cls._models
+    @classmethod
+    def preload_model(cls, model_name: str = None) -> None:
+        """Preload a model during startup."""
+        if model_name is None:
+            model_name = settings.default_whisper_model
+        try:
+            cls.get_model(model_name)
+        except Exception as e:
+            logger.error(f"Failed to preload Whisper model: {e}")
+            raise
+    @classmethod
+    def transcribe_with_words(
+        cls,
+        audio_array: np.ndarray,
+        model_name: str = None,
+        language: str = "vi",
+        vad_options: Optional[dict | bool] = None,
+        beam_size: int = 3,
+        temperature: float = 0.0,
+        best_of: int = 5,
+        patience: float = 1.0,
+        length_penalty: float = 1.0,
+        no_repeat_ngram_size: int = 3,
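+        # Prompting. The default prompt is Vietnamese for: "Call-center
+        # conversation. Transcribe only what is actually said in the audio."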
+        initial_prompt: str = "Hội thoại tổng đài. Chỉ ghi lại đúng lời nói trong audio.",
+        prefix_text: Optional[str] = None,
+        # Stability / filtering
+        condition_on_previous_text: bool = False,
+        no_speech_threshold: float = 0.70,
+        log_prob_threshold: float = -1.0,
+        compression_ratio_threshold: float = 2.4
+    ) -> Dict:
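+        """
+        Transcribe audio and return word-level timestamps.
+        Args:
+            audio_array: Audio samples as a NumPy array
+            model_name: Name of the model from AVAILABLE_MODELS
+            language: Language code, or "auto" to let the model detect it
+            vad_options: True to use the VAD settings from config, a dict of
+                VAD parameters to pass through as-is, or None/False to disable VAD
+        Returns:
+            Dict with "text" (full transcript), "words" (list of dicts with
+            "word", "start", "end") and "info" (as returned by model.transcribe)
+        """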
+        model = cls.get_model(model_name)
+        if vad_options is None or vad_options is False:
+            use_vad = False
+            vad_parameters = None
+        elif vad_options is True:
+            use_vad = True
+            vad_parameters = {
+                "threshold": settings.vad_threshold,
+                "min_speech_duration_ms": settings.vad_min_speech_duration_ms,
+                "min_silence_duration_ms": settings.vad_min_silence_duration_ms,
+            }
+        elif isinstance(vad_options, dict):
+            use_vad = True
+            vad_parameters = vad_options
+        else:
+            use_vad = False
+            vad_parameters = None
+        prompt = (
+            initial_prompt.strip()
+            if isinstance(initial_prompt, str) and initial_prompt.strip()
+            else None
+        )
+        prefix = (
+            prefix_text.strip()
+            if isinstance(prefix_text, str) and prefix_text.strip()
+            else None
+        )
+        segments_gen, info = model.transcribe(
+            audio_array,
+            language=language if language != "auto" else None,
+            # decoding
+            beam_size=beam_size,
+            temperature=temperature,
+            best_of=best_of,
+            patience=patience,
+            length_penalty=length_penalty,
+            no_repeat_ngram_size=no_repeat_ngram_size,
+            # prompting
+            prefix=prefix,
+            # QA / Stability
+            condition_on_previous_text=condition_on_previous_text,
+            no_speech_threshold=no_speech_threshold,
+            log_prob_threshold=log_prob_threshold,
+            compression_ratio_threshold=compression_ratio_threshold,
+            word_timestamps=True,
+            # VAD
+            vad_filter=use_vad,
+            vad_parameters=vad_parameters,
+            initial_prompt=prompt,
+        )
+        words = []
+        full_text = []
+        for seg in segments_gen:
+            if seg.text:
+                full_text.append(seg.text.strip())
+            if hasattr(seg, "words") and seg.words:
+                for w in seg.words:
+                    if not w.word.strip():
+                        continue
+                    words.append({
+                        "word": w.word.strip(),
+                        "start": float(w.start),
+                        "end": float(w.end),
+                    })
+        return {
+            "text": " ".join(full_text).strip(),
+            "words": words,
+            "info": info,
+        }
+    @classmethod
+    async def transcribe_with_words_async(
+        cls,
+        audio_array: np.ndarray,
+        model_name: str = None,
+        language: str = "vi",
+        vad_options: Optional[dict | bool] = None,
+        beam_size: int = 5,
+        temperature: float = 0.0,
+        best_of: int = 5,
+        patience: float = 1.0,
+        length_penalty: float = 1.0,
+        no_repeat_ngram_size: int = 3,
+        initial_prompt: Optional[str] = None,
+        prefix_text: Optional[str] = None,
+        condition_on_previous_text: bool = False,
+        no_speech_threshold: float = 0.70,
+        log_prob_threshold: float = -1.0,
+        # text repetitive / nonsense
+        compression_ratio_threshold: float = 2.4
+    ) -> Dict:
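+        """
+        Async wrapper for transcription (runs in the default thread pool).
+        Note: every argument is forwarded explicitly, so the defaults declared
+        here (beam_size=5, initial_prompt=None) take precedence over those of
+        transcribe_with_words (beam_size=3, Vietnamese initial prompt).
+        """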
+        import asyncio
+        loop = asyncio.get_running_loop()
+        return await loop.run_in_executor(
+            None,
+            lambda: cls.transcribe_with_words(
+                audio_array=audio_array,
+                model_name=model_name,
+                language=language,
+                vad_options=vad_options,
+                beam_size=beam_size,
+                temperature=temperature,
+                best_of=best_of,
+                patience=patience,
+                length_penalty=length_penalty,
+                no_repeat_ngram_size=no_repeat_ngram_size,
+                initial_prompt=initial_prompt,
+                prefix_text=prefix_text,
+                condition_on_previous_text=condition_on_previous_text,
+                no_speech_threshold=no_speech_threshold,
+                log_prob_threshold=log_prob_threshold,
+                compression_ratio_threshold=compression_ratio_threshold
+            )
+        )
+    @classmethod
+    def get_available_models(cls) -> Dict[str, str]:
+        """Return a copy of the available models mapping (display name -> model repo)."""
+        return AVAILABLE_MODELS.copy()
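
The `vad_options` handling in `transcribe_with_words` accepts three shapes (`True`, a dict, or `None`/`False`) and reduces them to the `vad_filter` flag and `vad_parameters` value passed to `model.transcribe`. The sketch below isolates that branching so it can be checked on its own. `resolve_vad` and `DEFAULT_VAD` are hypothetical names, and the numbers in `DEFAULT_VAD` are placeholders: the real values come from `settings.vad_*` in `app/core/config.py`, which is not shown in this section.

```python
from typing import Optional, Tuple, Union

# Placeholder values (assumption); the service reads these from settings.
DEFAULT_VAD = {
    "threshold": 0.5,
    "min_speech_duration_ms": 250,
    "min_silence_duration_ms": 500,
}


def resolve_vad(
    vad_options: Union[dict, bool, None],
) -> Tuple[bool, Optional[dict]]:
    """Map vad_options to a (use_vad, vad_parameters) pair."""
    if vad_options is True:
        # Plain True: enable VAD with the configured defaults.
        return True, dict(DEFAULT_VAD)
    if isinstance(vad_options, dict):
        # Caller-supplied parameters are passed through unchanged.
        return True, vad_options
    # None, False, or any unsupported type: VAD disabled.
    return False, None


print(resolve_vad(True))
print(resolve_vad({"threshold": 0.3}))
print(resolve_vad(None))
```

Checking `is True` before the `isinstance` test keeps the boolean case separate from the dict case, and anything unrecognised falls through to "VAD off", which matches the final `else` branch in the service.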

app/services/vocal_separator.py ADDED

	@@ -0,0 +1,118 @@

+"""
+Vocal Separation Service using MDX-Net (via audio-separator).
+Isolates vocals from audio files using pretrained MDX-Net (UVR) models.
+"""
+import os
+import asyncio
+import logging
+from pathlib import Path
+from typing import Optional
+from app.core.config import get_settings
+logger = logging.getLogger(__name__)
+settings = get_settings()
+class VocalSeparationError(Exception):
+    """Custom exception for vocal separation errors."""
+    pass
+class VocalSeparator:
+    """
+    Service for separating vocals from audio using MDX-Net.
+    Uses the audio-separator library which supports UVR models.
+    """
+    _separator = None
+    _model_name: Optional[str] = None
+    @classmethod
+    def _get_separator(cls):
+        """Lazy load the Audio Separator."""
+        if cls._separator is None or cls._model_name != settings.mdx_model:
+            from audio_separator.separator import Separator
+            logger.debug(f"Initializing MDX-Net separator with model: {settings.mdx_model}")
+            # Initialize separator
+            # Note: audio-separator expects output_dir to exist
+            settings.processed_dir.mkdir(parents=True, exist_ok=True)
+            separator = Separator(
+                output_dir=str(settings.processed_dir),
+                output_format="WAV",
+                normalization_threshold=0.9
+            )
+            # Load model
+            separator.load_model(settings.mdx_model)
+            cls._separator = separator
+            cls._model_name = settings.mdx_model
+            logger.debug(f"MDX-Net model loaded on {settings.resolved_device}")
+        return cls._separator
+    @classmethod
+    async def separate_vocals(cls, input_path: Path) -> Path:
+        """
+        Separate vocals from audio file using MDX-Net.
+        Args:
+            input_path: Path to input audio file
+        Returns:
+            Path to separated vocals WAV file
+        """
+        if not settings.enable_vocal_separation:
+            logger.debug("Vocal separation disabled, skipping...")
+            return input_path
+        logger.debug(f"Starting vocal separation for: {input_path.name}")
+        try:
+            # Run separation in an executor so the event loop is not blocked
+            loop = asyncio.get_running_loop()
+            vocals_path = await loop.run_in_executor(
+                None,
+                lambda: cls._run_separation(input_path)
+            )
+            logger.info(f"Vocal separation complete: {vocals_path.name}")
+            return vocals_path
+        except Exception as e:
+            logger.error(f"Vocal separation failed: {e}")
+            # Fallback to original
+            logger.warning("Falling back to original audio.")
+            return input_path
+    @classmethod
+    def _run_separation(cls, input_path: Path) -> Path:
+        """Run the actual separation (blocking)."""
+        separator = cls._get_separator()
+        # separate() returns a list of output filenames
+        output_files = separator.separate(str(input_path))
+        # audio-separator usually produces multiple files (Vocals, Instrumental)
+        # We need to find the vocals one.
+        # It typically names them like {input_stem}_(Vocals)_{model}.wav
+        vocals_file = None
+        for file in output_files:
+            if "Vocals" in file:
+                vocals_file = settings.processed_dir / file
+                break
+        if not vocals_file:
+            # If we can't find the vocals file specifically, just take the first one or fail
+            logger.warning("Could not identify vocals stem in output files.")
+            if output_files:
+                vocals_file = settings.processed_dir / output_files[0]
+            else:
+                raise VocalSeparationError("No output files generated by separator.")
+        return vocals_file
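
The stem selection in `_run_separation` is a small fallback chain: prefer the file whose name contains `Vocals`, otherwise take the first output, otherwise fail. A standalone sketch of that chain is given below so it can be tested without running the separator. `pick_vocals` is a hypothetical helper name, the file names are invented examples of the `{input_stem}_(Vocals)_{model}.wav` pattern mentioned in the comments, and a plain `RuntimeError` stands in for `VocalSeparationError`.

```python
from pathlib import Path
from typing import List


def pick_vocals(output_files: List[str], out_dir: Path) -> Path:
    """Choose the vocals stem from the separator's output file names."""
    # Preferred: the first file whose name marks it as the vocals stem.
    for name in output_files:
        if "Vocals" in name:
            return out_dir / name
    # Fallback: no vocals marker found, use the first output if there is one.
    if output_files:
        return out_dir / output_files[0]
    # Nothing was produced at all.
    raise RuntimeError("No output files generated by separator.")


outputs = ["call_(Instrumental)_model.wav", "call_(Vocals)_model.wav"]
chosen = pick_vocals(outputs, Path("processed"))
print(chosen)
```

The substring match is case-sensitive, so a model that names its stem `vocals` in lower case would silently hit the first-file fallback; that is worth keeping in mind when switching `settings.mdx_model`.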

app/static/css/style.css ADDED

	@@ -0,0 +1,673 @@

+/* ================================
+   PrecisionVoice - Modern Dark Theme
+   ================================ */
+:root {
+    /* Color Palette */
+    --bg-primary: #0a0a0f;
+    --bg-secondary: #12121a;
+    --bg-card: rgba(255, 255, 255, 0.03);
+    --bg-card-hover: rgba(255, 255, 255, 0.05);
+    --text-primary: #ffffff;
+    --text-secondary: #a0a0b0;
+    --text-muted: #606070;
+    --accent-primary: #6366f1;
+    --accent-secondary: #8b5cf6;
+    --accent-gradient: linear-gradient(135deg, #6366f1 0%, #8b5cf6 50%, #a855f7 100%);
+    --success: #10b981;
+    --error: #ef4444;
+    --warning: #f59e0b;
+    --border-color: rgba(255, 255, 255, 0.08);
+    --border-glow: rgba(99, 102, 241, 0.3);
+    /* Spacing */
+    --spacing-xs: 0.25rem;
+    --spacing-sm: 0.5rem;
+    --spacing-md: 1rem;
+    --spacing-lg: 1.5rem;
+    --spacing-xl: 2rem;
+    --spacing-2xl: 3rem;
+    /* Border Radius */
+    --radius-sm: 0.375rem;
+    --radius-md: 0.75rem;
+    --radius-lg: 1rem;
+    --radius-xl: 1.5rem;
+    /* Shadows */
+    --shadow-sm: 0 2px 8px rgba(0, 0, 0, 0.3);
+    --shadow-md: 0 4px 16px rgba(0, 0, 0, 0.4);
+    --shadow-lg: 0 8px 32px rgba(0, 0, 0, 0.5);
+    --shadow-glow: 0 0 40px rgba(99, 102, 241, 0.15);
+    /* Transitions */
+    --transition-fast: 0.15s ease;
+    --transition-normal: 0.3s ease;
+    --transition-slow: 0.5s ease;
+}
+/* ================================
+   Base Styles
+   ================================ */
+*,
+*::before,
+*::after {
+    box-sizing: border-box;
+    margin: 0;
+    padding: 0;
+}
+html {
+    font-size: 16px;
+    scroll-behavior: smooth;
+}
+body {
+    font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
+    background: var(--bg-primary);
+    color: var(--text-primary);
+    line-height: 1.6;
+    min-height: 100vh;
+    -webkit-font-smoothing: antialiased;
+    -moz-osx-font-smoothing: grayscale;
+}
+/* Animated background gradient */
+body::before {
+    content: '';
+    position: fixed;
+    top: 0;
+    left: 0;
+    right: 0;
+    bottom: 0;
+    background:
+        radial-gradient(ellipse at 20% 20%, rgba(99, 102, 241, 0.08) 0%, transparent 50%),
+        radial-gradient(ellipse at 80% 80%, rgba(139, 92, 246, 0.06) 0%, transparent 50%),
+        radial-gradient(ellipse at 50% 50%, rgba(168, 85, 247, 0.04) 0%, transparent 70%);
+    pointer-events: none;
+    z-index: -1;
+}
+/* ================================
+   Layout
+   ================================ */
+.app-container {
+    max-width: 800px;
+    margin: 0 auto;
+    padding: var(--spacing-lg);
+    min-height: 100vh;
+    display: flex;
+    flex-direction: column;
+}
+/* ================================
+   Header
+   ================================ */
+.header {
+    text-align: center;
+    padding: var(--spacing-2xl) 0;
+}
+.logo {
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    gap: var(--spacing-md);
+    margin-bottom: var(--spacing-sm);
+}
+.logo-icon {
+    width: 48px;
+    height: 48px;
+    background: var(--accent-gradient);
+    border-radius: var(--radius-lg);
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    box-shadow: var(--shadow-glow);
+}
+.logo-icon svg {
+    width: 28px;
+    height: 28px;
+    color: white;
+}
+.logo h1 {
+    font-size: 2rem;
+    font-weight: 700;
+    background: var(--accent-gradient);
+    -webkit-background-clip: text;
+    -webkit-text-fill-color: transparent;
+    background-clip: text;
+}
+.tagline {
+    color: var(--text-secondary);
+    font-size: 1rem;
+    font-weight: 400;
+}
+/* ================================
+   Cards
+   ================================ */
+.card {
+    background: var(--bg-card);
+    backdrop-filter: blur(20px);
+    border: 1px solid var(--border-color);
+    border-radius: var(--radius-xl);
+    padding: var(--spacing-xl);
+    margin-bottom: var(--spacing-lg);
+    transition: var(--transition-normal);
+}
+.card:hover {
+    border-color: var(--border-glow);
+    box-shadow: var(--shadow-glow);
+}
+.card-header {
+    display: flex;
+    align-items: center;
+    justify-content: space-between;
+    margin-bottom: var(--spacing-lg);
+    flex-wrap: wrap;
+    gap: var(--spacing-sm);
+}
+.card-header h2 {
+    font-size: 1.25rem;
+    font-weight: 600;
+}
+/* ================================
+   Badge
+   ================================ */
+.badge {
+    display: inline-block;
+    padding: var(--spacing-xs) var(--spacing-sm);
+    background: rgba(99, 102, 241, 0.15);
+    color: var(--accent-primary);
+    border-radius: var(--radius-sm);
+    font-size: 0.75rem;
+    font-weight: 500;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+}
+/* ================================
+   Upload Zone
+   ================================ */
+.upload-zone {
+    border: 2px dashed var(--border-color);
+    border-radius: var(--radius-lg);
+    padding: var(--spacing-2xl);
+    text-align: center;
+    cursor: pointer;
+    transition: var(--transition-normal);
+    margin-bottom: var(--spacing-lg);
+}
+.upload-zone:hover,
+.upload-zone.dragover {
+    border-color: var(--accent-primary);
+    background: rgba(99, 102, 241, 0.05);
+}
+.upload-zone.dragover {
+    transform: scale(1.02);
+}
+.upload-icon {
+    width: 64px;
+    height: 64px;
+    margin: 0 auto var(--spacing-md);
+    background: var(--accent-gradient);
+    border-radius: 50%;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    opacity: 0.8;
+}
+.upload-icon svg {
+    width: 32px;
+    height: 32px;
+    color: white;
+}
+.upload-text {
+    font-size: 1.125rem;
+    font-weight: 500;
+    color: var(--text-primary);
+    margin-bottom: var(--spacing-xs);
+}
+.upload-subtext {
+    color: var(--text-muted);
+    font-size: 0.875rem;
+}
+/* ================================
+   File Info
+   ================================ */
+.file-info {
+    display: flex;
+    align-items: center;
+    justify-content: space-between;
+    padding: var(--spacing-md);
+    background: rgba(99, 102, 241, 0.1);
+    border-radius: var(--radius-md);
+    margin-bottom: var(--spacing-lg);
+}
+.file-details {
+    display: flex;
+    flex-direction: column;
+    gap: var(--spacing-xs);
+}
+.file-name {
+    font-weight: 500;
+    color: var(--text-primary);
+}
+.file-size {
+    font-size: 0.875rem;
+    color: var(--text-secondary);
+}
+/* ================================
+   Buttons
+   ================================ */
+.btn {
+    display: inline-flex;
+    align-items: center;
+    justify-content: center;
+    gap: var(--spacing-sm);
+    padding: var(--spacing-md) var(--spacing-xl);
+    border: none;
+    border-radius: var(--radius-md);
+    font-family: inherit;
+    font-size: 1rem;
+    font-weight: 500;
+    cursor: pointer;
+    transition: var(--transition-fast);
+    text-decoration: none;
+}
+.btn:disabled {
+    opacity: 0.5;
+    cursor: not-allowed;
+}
+.btn svg {
+    width: 20px;
+    height: 20px;
+}
+.btn-primary {
+    width: 100%;
+    background: var(--accent-gradient);
+    color: white;
+    box-shadow: var(--shadow-md);
+}
+.btn-primary:hover:not(:disabled) {
+    transform: translateY(-2px);
+    box-shadow: var(--shadow-lg), var(--shadow-glow);
+}
+.btn-primary:active:not(:disabled) {
+    transform: translateY(0);
+}
+.btn-secondary {
+    background: var(--bg-card);
+    color: var(--text-primary);
+    border: 1px solid var(--border-color);
+}
+.btn-secondary:hover:not(:disabled) {
+    background: var(--bg-card-hover);
+    border-color: var(--accent-primary);
+}
+.btn-outline {
+    background: transparent;
+    color: var(--text-primary);
+    border: 1px solid var(--border-color);
+    padding: var(--spacing-sm) var(--spacing-md);
+}
+.btn-outline:hover {
+    background: var(--bg-card);
+    border-color: var(--accent-primary);
+}
+.btn-clear {
+    width: 36px;
+    height: 36px;
+    padding: 0;
+    background: transparent;
+    color: var(--text-muted);
+}
+.btn-clear:hover {
+    color: var(--error);
+}
+/* ================================
+   Processing Section
+   ================================ */
+.processing-content {
+    text-align: center;
+    padding: var(--spacing-xl) 0;
+}
+.spinner {
+    width: 56px;
+    height: 56px;
+    margin: 0 auto var(--spacing-lg);
+    border: 3px solid var(--border-color);
+    border-top-color: var(--accent-primary);
+    border-radius: 50%;
+    animation: spin 1s linear infinite;
+}
+@keyframes spin {
+    to {
+        transform: rotate(360deg);
+    }
+}
+.processing-content h3 {
+    font-size: 1.25rem;
+    margin-bottom: var(--spacing-sm);
+}
+.processing-content p {
+    color: var(--text-secondary);
+    margin-bottom: var(--spacing-lg);
+}
+.progress-bar {
+    height: 6px;
+    background: var(--bg-secondary);
+    border-radius: var(--radius-sm);
+    overflow: hidden;
+    margin-bottom: var(--spacing-md);
+}
+.progress-fill {
+    height: 100%;
+    width: 0%;
+    background: var(--accent-gradient);
+    border-radius: var(--radius-sm);
+    transition: width 0.3s ease;
+    animation: pulse 2s ease-in-out infinite;
+}
+@keyframes pulse {
+    0%,
+    100% {
+        opacity: 1;
+    }
+    50% {
+        opacity: 0.7;
+    }
+}
+.processing-hint {
+    font-size: 0.875rem;
+    color: var(--text-muted);
+}
+.timer-display {
+    font-size: 2rem;
+    font-weight: 700;
+    color: var(--accent-primary);
+    margin: var(--spacing-md) 0;
+    font-family: monospace;
+    text-shadow: 0 0 10px rgba(99, 102, 241, 0.3);
+}
+/* ================================
+   Results Section
+   ================================ */
+.result-meta {
+    display: flex;
+    gap: var(--spacing-sm);
+    flex-wrap: wrap;
+}
+.download-buttons {
+    display: flex;
+    gap: var(--spacing-md);
+    margin-bottom: var(--spacing-lg);
+    flex-wrap: wrap;
+}
+.transcript-container {
+    max-height: 400px;
+    overflow-y: auto;
+    padding-right: var(--spacing-sm);
+    margin-bottom: var(--spacing-lg);
+}
+.transcript-container::-webkit-scrollbar {
+    width: 6px;
+}
+.transcript-container::-webkit-scrollbar-track {
+    background: var(--bg-secondary);
+    border-radius: var(--radius-sm);
+}
+.transcript-container::-webkit-scrollbar-thumb {
+    background: var(--border-color);
+    border-radius: var(--radius-sm);
+}
+.transcript-container::-webkit-scrollbar-thumb:hover {
+    background: var(--text-muted);
+}
+/* Transcript Segment */
+.segment {
+    padding: var(--spacing-md);
+    border-radius: var(--radius-md);
+    margin-bottom: var(--spacing-sm);
+    background: var(--bg-secondary);
+    border-left: 3px solid var(--accent-primary);
+    transition: var(--transition-fast);
+}
+.segment:hover {
+    background: var(--bg-card-hover);
+}
+.segment-header {
+    display: flex;
+    align-items: center;
+    gap: var(--spacing-md);
+    margin-bottom: var(--spacing-xs);
+    flex-wrap: wrap;
+}
+.segment-speaker {
+    font-weight: 600;
+    color: var(--accent-primary);
+}
+.segment-time {
+    font-size: 0.75rem;
+    color: var(--text-muted);
+    font-family: monospace;
+}
+.segment-text {
+    color: var(--text-primary);
+    line-height: 1.7;
+}
+/* Speaker Colors */
+.speaker-1 {
+    border-left-color: #6366f1;
+}
+.speaker-1 .segment-speaker {
+    color: #6366f1;
+}
+.speaker-2 {
+    border-left-color: #10b981;
+}
+.speaker-2 .segment-speaker {
+    color: #10b981;
+}
+.speaker-3 {
+    border-left-color: #f59e0b;
+}
+.speaker-3 .segment-speaker {
+    color: #f59e0b;
+}
+.speaker-4 {
+    border-left-color: #ec4899;
+}
+.speaker-4 .segment-speaker {
+    color: #ec4899;
+}
+.speaker-5 {
+    border-left-color: #8b5cf6;
+}
+.speaker-5 .segment-speaker {
+    color: #8b5cf6;
+}
+/* ================================
+   Error Section
+   ================================ */
+.error-content {
+    text-align: center;
+    padding: var(--spacing-xl) 0;
+}
+.error-icon {
+    width: 64px;
+    height: 64px;
+    margin: 0 auto var(--spacing-lg);
+    background: rgba(239, 68, 68, 0.15);
+    border-radius: 50%;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+}
+.error-icon svg {
+    width: 32px;
+    height: 32px;
+    color: var(--error);
+}
+.error-content h3 {
+    color: var(--error);
+    margin-bottom: var(--spacing-sm);
+}
+.error-content p {
+    color: var(--text-secondary);
+    margin-bottom: var(--spacing-lg);
+}
+/* ================================
+   Footer
+   ================================ */
+.footer {
+    margin-top: auto;
+    padding: var(--spacing-xl) 0;
+    text-align: center;
+    color: var(--text-muted);
+    font-size: 0.875rem;
+}
+.footer strong {
+    color: var(--text-secondary);
+}
+.footer-note {
+    margin-top: var(--spacing-xs);
+    font-size: 0.75rem;
+}
+/* ================================
+   Utility Classes
+   ================================ */
+.hidden {
+    display: none !important;
+}
+/* ================================
+   Responsive
+   ================================ */
+@media (max-width: 640px) {
+    :root {
+        font-size: 14px;
+    }
+    .app-container {
+        padding: var(--spacing-md);
+    }
+    .card {
+        padding: var(--spacing-lg);
+    }
+    .upload-zone {
+        padding: var(--spacing-xl);
+    }
+    .card-header {
+        flex-direction: column;
+        align-items: flex-start;
+    }
+    .result-meta {
+        width: 100%;
+    }
+    .download-buttons {
+        flex-direction: column;
+    }
+    .download-buttons .btn {
+        width: 100%;
+    }
+}

app/static/js/app.js ADDED Viewed

	@@ -0,0 +1,338 @@

+/**
+ * PrecisionVoice - Frontend Application Logic
+ * Handles file upload, transcription requests, and result display.
+ */
+document.addEventListener('DOMContentLoaded', () => {
+    // DOM Elements
+    const elements = {
+        // Upload
+        dropZone: document.getElementById('drop-zone'),
+        fileInput: document.getElementById('file-input'),
+        fileInfo: document.getElementById('file-info'),
+        fileName: document.getElementById('file-name'),
+        fileSize: document.getElementById('file-size'),
+        clearBtn: document.getElementById('clear-btn'),
+        transcribeBtn: document.getElementById('transcribe-btn'),
+        // Sections
+        uploadSection: document.getElementById('upload-section'),
+        processingSection: document.getElementById('processing-section'),
+        resultsSection: document.getElementById('results-section'),
+        errorSection: document.getElementById('error-section'),
+        // Processing
+        processingStatus: document.getElementById('processing-status'),
+        progressFill: document.getElementById('progress-fill'),
+        processingTimer: document.getElementById('processing-timer'),
+        // Results
+        speakerCount: document.getElementById('speaker-count'),
+        durationInfo: document.getElementById('duration-info'),
+        processingTime: document.getElementById('processing-time'),
+        transcriptContainer: document.getElementById('transcript-container'),
+        downloadTxt: document.getElementById('download-txt'),
+        downloadCsv: document.getElementById("download-csv"),
+        newUploadBtn: document.getElementById('new-upload-btn'),
+        // Error
+        errorMessage: document.getElementById('error-message'),
+        retryBtn: document.getElementById('retry-btn')
+    };
+    let selectedFile = null;
+    // =====================
+    // Event Listeners
+    // =====================
+    // Click to upload
+    elements.dropZone.addEventListener('click', () => {
+        elements.fileInput.click();
+    });
+    // File input change
+    elements.fileInput.addEventListener('change', (e) => {
+        if (e.target.files.length > 0) {
+            handleFileSelection(e.target.files[0]);
+        }
+    });
+    // Drag and drop
+    elements.dropZone.addEventListener('dragover', (e) => {
+        e.preventDefault();
+        elements.dropZone.classList.add('dragover');
+    });
+    elements.dropZone.addEventListener('dragleave', () => {
+        elements.dropZone.classList.remove('dragover');
+    });
+    elements.dropZone.addEventListener('drop', (e) => {
+        e.preventDefault();
+        elements.dropZone.classList.remove('dragover');
+        if (e.dataTransfer.files.length > 0) {
+            handleFileSelection(e.dataTransfer.files[0]);
+        }
+    });
+    // Clear file
+    elements.clearBtn.addEventListener('click', (e) => {
+        e.stopPropagation();
+        clearFileSelection();
+    });
+    // Transcribe button
+    elements.transcribeBtn.addEventListener('click', () => {
+        if (selectedFile) {
+            startTranscription();
+        }
+    });
+    // New upload button
+    elements.newUploadBtn.addEventListener('click', resetToUpload);
+    // Retry button
+    elements.retryBtn.addEventListener('click', resetToUpload);
+    // =====================
+    // File Handling
+    // =====================
+    function handleFileSelection(file) {
+        const allowedTypes = ['audio/mpeg', 'audio/wav', 'audio/x-wav', 'audio/mp4', 'audio/x-m4a',
+            'audio/ogg', 'audio/flac', 'audio/webm', 'video/webm'];
+        const allowedExtensions = ['mp3', 'wav', 'm4a', 'ogg', 'flac', 'webm'];
+        // Check file extension
+        const ext = file.name.split('.').pop().toLowerCase();
+        if (!allowedExtensions.includes(ext)) {
+            showError(`Unsupported file type: .${ext}. Supported: ${allowedExtensions.join(', ')}`);
+            return;
+        }
+        // Check file size (100MB limit)
+        const maxSize = 100 * 1024 * 1024;
+        if (file.size > maxSize) {
+            showError(`File too large. Maximum size: 100MB`);
+            return;
+        }
+        selectedFile = file;
+        // Update UI
+        elements.fileName.textContent = file.name;
+        elements.fileSize.textContent = formatFileSize(file.size);
+        elements.fileInfo.classList.remove('hidden');
+        elements.transcribeBtn.disabled = false;
+        // Hide the drop zone while a file is selected
+        elements.dropZone.style.display = 'none';
+    }
+    function clearFileSelection() {
+        selectedFile = null;
+        elements.fileInput.value = '';
+        elements.fileInfo.classList.add('hidden');
+        elements.transcribeBtn.disabled = true;
+        elements.dropZone.style.display = 'block';
+    }
+    function formatFileSize(bytes) {
+        if (bytes === 0) return '0 Bytes';
+        const k = 1024;
+        const sizes = ['Bytes', 'KB', 'MB', 'GB'];
+        const i = Math.floor(Math.log(bytes) / Math.log(k));
+        return parseFloat((bytes / Math.pow(k, i)).toFixed(2)) + ' ' + sizes[i];
+    }
+    // =====================
+    // Transcription
+    // =====================
+    async function startTranscription() {
+        if (!selectedFile) return;
+        // Show processing UI
+        showSection('processing');
+        updateProgress(100, 'Processing audio... (Check server logs for details)');
+        // Reset and start timer
+        let seconds = 0;
+        elements.processingTimer.textContent = '00:00';
+        const timerInterval = setInterval(() => {
+            seconds++;
+            const m = Math.floor(seconds / 60);
+            const s = seconds % 60;
+            elements.processingTimer.textContent = `${m.toString().padStart(2, '0')}:${s.toString().padStart(2, '0')}`;
+        }, 1000);
+        try {
+            const formData = new FormData();
+            formData.append('file', selectedFile);
+            const response = await fetch('/api/transcribe', {
+                method: 'POST',
+                body: formData
+            });
+            clearInterval(timerInterval);
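+            if (!response.ok) {
+                // The error body may not be JSON (e.g. a proxy error page), so parse defensively
+                const errorData = await response.json().catch(() => ({}));
+                throw new Error(errorData.detail || 'Processing failed');
+            }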
+            const result = await response.json();
+            displayResults(result);
+        } catch (error) {
+            clearInterval(timerInterval);
+            console.error('Processing error:', error);
+            showError(error.message || 'An error occurred during processing');
+        }
+    }
+    function updateProgress(percent, status) {
+        elements.progressFill.style.width = `${percent}%`;
+        if (status) {
+            elements.processingStatus.textContent = status;
+        }
+    }
+    // =====================
+    // Results Display
+    // =====================
+    function displayResults(result) {
+        // ===== Metadata =====
+        // If only ROLE is used, speaker_count should not be displayed
+        if (result.roles) {
+            const roleCount = Object.keys(result.roles).length;
+            elements.speakerCount.textContent = `${roleCount} role${roleCount !== 1 ? 's' : ''}`;
+        } else {
+            elements.speakerCount.textContent = '';
+        }
+        elements.durationInfo.textContent = formatDuration(result.duration || 0);
+        elements.processingTime.textContent = `${result.processing_time || 0}s`;
+        // ===== Download links =====
+        if (result.download_txt) {
+            elements.downloadTxt.href = result.download_txt;
+            elements.downloadTxt.style.display = 'inline-block';
+        }
+        if (result.download_csv) {
+            elements.downloadCsv.href = result.download_csv;
+            elements.downloadCsv.style.display = 'inline-block';
+        }
+        // ===== Render transcript =====
+        renderTranscript(result.segments || []);
+        // ===== Show results =====
+        showSection('results');
+    }
+    function renderTranscript(segments) {
+        elements.transcriptContainer.innerHTML = '';
+        const roleColors = {};
+        let colorIndex = 0;
+        segments.forEach((segment) => {
+            const role = segment.role || 'UNKNOWN';
+            if (!(role in roleColors)) {
+                colorIndex++;
+                roleColors[role] = `speaker-${Math.min(colorIndex, 5)}`;
+            }
+            const segmentEl = document.createElement('div');
+            segmentEl.className = `segment ${roleColors[role]}`;
+            const start = typeof segment.start === 'number' ? segment.start : 0;
+            const end = typeof segment.end === 'number' ? segment.end : 0;
+            const text = segment.text ? escapeHtml(segment.text) : '';
+            segmentEl.innerHTML = `
+                <div class="segment-header">
+                    <span class="segment-speaker">
+                        ${escapeHtml(role)}
+                    </span>
+                    <span class="segment-time">
+                        ${formatTime(start)} - ${formatTime(end)}
+                    </span>
+                </div>
+                <p class="segment-text">${text}</p>
+            `;
+            elements.transcriptContainer.appendChild(segmentEl);
+        });
+    }
+    function formatTime(seconds) {
+        const h = Math.floor(seconds / 3600);
+        const m = Math.floor((seconds % 3600) / 60);
+        const s = Math.floor(seconds % 60);
+        if (h > 0) {
+            return `${h}:${m.toString().padStart(2, '0')}:${s.toString().padStart(2, '0')}`;
+        }
+        return `${m}:${s.toString().padStart(2, '0')}`;
+    }
+    function formatDuration(seconds) {
+        const m = Math.floor(seconds / 60);
+        const s = Math.floor(seconds % 60);
+        return `${m}:${s.toString().padStart(2, '0')}`;
+    }
+    function escapeHtml(text) {
+        const div = document.createElement('div');
+        div.textContent = text;
+        return div.innerHTML;
+    }
+    // =====================
+    // UI State Management
+    // =====================
+    function showSection(section) {
+        elements.uploadSection.classList.add('hidden');
+        elements.processingSection.classList.add('hidden');
+        elements.resultsSection.classList.add('hidden');
+        elements.errorSection.classList.add('hidden');
+        switch (section) {
+            case 'upload':
+                elements.uploadSection.classList.remove('hidden');
+                break;
+            case 'processing':
+                elements.processingSection.classList.remove('hidden');
+                break;
+            case 'results':
+                elements.resultsSection.classList.remove('hidden');
+                break;
+            case 'error':
+                elements.errorSection.classList.remove('hidden');
+                break;
+        }
+    }
+    function showError(message) {
+        elements.errorMessage.textContent = message;
+        showSection('error');
+    }
+    function resetToUpload() {
+        clearFileSelection();
+        showSection('upload');
+        updateProgress(0, 'Uploading file...');
+    }
+});
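
Note on app.js: the extension and size checks in `handleFileSelection()` run only in the browser, so a request sent directly to `/api/transcribe` bypasses them. The sketch below shows what an equivalent server-side check could look like, together with a Python counterpart of `formatTime()` for text exports. It is a minimal illustration and not the code in `app/api/routes.py`; the names `validate_upload`, `format_time`, `ALLOWED_EXTENSIONS` and `MAX_UPLOAD_BYTES` are hypothetical, and the constants are copied from app.js.

```python
from pathlib import Path
from typing import Optional

# Assumption: these mirror the constants hard-coded in handleFileSelection().
ALLOWED_EXTENSIONS = {"mp3", "wav", "m4a", "ogg", "flac", "webm"}
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # 100 MB, the same limit as the client


def validate_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None if the upload passes both checks."""
    ext = Path(filename).suffix.lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"Unsupported file type: .{ext}"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "File too large. Maximum size: 100MB"
    return None


def format_time(seconds: float) -> str:
    """Format seconds as M:SS, or H:MM:SS once the value reaches one hour."""
    total = int(seconds)  # drop the fractional part, like Math.floor for positives
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours > 0:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"
```

A route handler could call `validate_upload(file.filename, size)` before starting any processing and return the message as the `detail` field that app.js already reads from error responses.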

app/templates/index.html ADDED Viewed

	@@ -0,0 +1,162 @@

+<!DOCTYPE html>
+<html lang="vi">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <meta name="description" content="PrecisionVoice - Speech-to-Text and Speaker Diarization powered by AI">
+    <title>PrecisionVoice | AI Speech Transcription</title>
+    <link rel="preconnect" href="https://fonts.googleapis.com">
+    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
+    <link rel="stylesheet" href="/static/css/style.css">
+</head>
+<body>
+    <div class="app-container">
+        <!-- Header -->
+        <header class="header">
+            <div class="logo">
+                <div class="logo-icon">
+                    <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                        <path d="M12 1a3 3 0 0 0-3 3v8a3 3 0 0 0 6 0V4a3 3 0 0 0-3-3z" />
+                        <path d="M19 10v2a7 7 0 0 1-14 0v-2" />
+                        <line x1="12" y1="19" x2="12" y2="23" />
+                        <line x1="8" y1="23" x2="16" y2="23" />
+                    </svg>
+                </div>
+                <h1>PrecisionVoice</h1>
+            </div>
+            <p class="tagline">AI-Powered Speech Transcription with Speaker Detection</p>
+        </header>
+        <!-- Main Content -->
+        <main class="main-content">
+            <!-- Upload Section -->
+            <section id="upload-section" class="card upload-card">
+                <div class="card-header">
+                    <h2>Upload Audio</h2>
+                    <span class="badge">Supported: {{ allowed_formats }}</span>
+                </div>
+                <div class="upload-zone" id="drop-zone">
+                    <div class="upload-icon">
+                        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                            <path d="M21 15v4a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2v-4" />
+                            <polyline points="17 8 12 3 7 8" />
+                            <line x1="12" y1="3" x2="12" y2="15" />
+                        </svg>
+                    </div>
+                    <p class="upload-text">Drag & drop audio file here</p>
+                    <p class="upload-subtext">or click to browse</p>
+                    <input type="file" id="file-input" accept=".mp3,.wav,.m4a,.ogg,.flac,.webm" hidden>
+                </div>
+                <div id="file-info" class="file-info hidden">
+                    <div class="file-details">
+                        <span class="file-name" id="file-name">audio.mp3</span>
+                        <span class="file-size" id="file-size">0 MB</span>
+                    </div>
+                    <button class="btn btn-clear" id="clear-btn" title="Remove file">
+                        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                            <line x1="18" y1="6" x2="6" y2="18" />
+                            <line x1="6" y1="6" x2="18" y2="18" />
+                        </svg>
+                    </button>
+                </div>
+                <button class="btn btn-primary" id="transcribe-btn" disabled>
+                    <span class="btn-text">Transcribe</span>
+                    <span class="btn-icon">
+                        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                            <polygon points="5 3 19 12 5 21 5 3" />
+                        </svg>
+                    </span>
+                </button>
+            </section>
+            <!-- Processing Section -->
+            <section id="processing-section" class="card processing-card hidden">
+                <div class="processing-content">
+                    <div class="spinner"></div>
+                    <h3>Processing Audio</h3>
+                    <p id="processing-status">Uploading file...</p>
+                    <div class="progress-bar">
+                        <div class="progress-fill" id="progress-fill"></div>
+                    </div>
+                    <div class="timer-display" id="processing-timer">00:00</div>
+                    <p class="processing-hint">This may take a few minutes depending on audio length</p>
+                </div>
+            </section>
+            <!-- Results Section -->
+            <section id="results-section" class="card results-card hidden">
+                <div class="card-header">
+                    <h2>Transcription Results</h2>
+                    <div class="result-meta">
+                        <span id="speaker-count" class="badge">0 speakers</span>
+                        <span id="duration-info" class="badge">0:00</span>
+                        <span id="processing-time" class="badge">0.0s</span>
+                    </div>
+                </div>
+                <div class="download-buttons">
+                    <a href="#" id="download-txt" class="btn btn-outline" download>
+                        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                            <path d="M21 15v4a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2v-4" />
+                            <polyline points="7 10 12 15 17 10" />
+                            <line x1="12" y1="15" x2="12" y2="3" />
+                        </svg>
+                        Download TXT
+                    </a>
+                    <a href="#" id="download-csv" class="btn btn-outline" download>
+                        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                            <path d="M21 15v4a2 2 0 0 1-2 2H5a2 2 0 0 1-2-2v-4" />
+                            <polyline points="7 10 12 15 17 10" />
+                            <line x1="12" y1="15" x2="12" y2="3" />
+                        </svg>
+                        Download CSV
+                    </a>
+                </div>
+                <div class="transcript-container" id="transcript-container">
+                    <!-- Transcript segments will be rendered here -->
+                </div>
+                <button class="btn btn-secondary" id="new-upload-btn">
+                    <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                        <polyline points="1 4 1 10 7 10" />
+                        <path d="M3.51 15a9 9 0 1 0 2.13-9.36L1 10" />
+                    </svg>
+                    New Transcription
+                </button>
+            </section>
+            <!-- Error Section -->
+            <section id="error-section" class="card error-card hidden">
+                <div class="error-content">
+                    <div class="error-icon">
+                        <svg viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2">
+                            <circle cx="12" cy="12" r="10" />
+                            <line x1="15" y1="9" x2="9" y2="15" />
+                            <line x1="9" y1="9" x2="15" y2="15" />
+                        </svg>
+                    </div>
+                    <h3>Error</h3>
+                    <p id="error-message">An error occurred during processing.</p>
+                    <button class="btn btn-secondary" id="retry-btn">Try Again</button>
+                </div>
+            </section>
+        </main>
+        <!-- Footer -->
+        <footer class="footer">
+            <p>Powered by <strong>faster-whisper</strong> & <strong>pyannote.audio</strong></p>
+            <p class="footer-note">Max file size: {{ max_upload_mb }}MB</p>
+        </footer>
+    </div>
+    <script src="/static/js/app.js"></script>
+</body>
+</html>

data/processed/.gitkeep ADDED Viewed

File without changes

data/uploads/.gitkeep ADDED Viewed

File without changes

docker-compose.yml ADDED Viewed

	@@ -0,0 +1,60 @@

+services:
+  app:
+    build:
+      context: .
+      dockerfile: Dockerfile
+      args:
+        - PORT=${PORT:-7860}
+    container_name: precisionvoice
+    ports:
+      - "${PORT:-7860}:${PORT:-7860}"
+    volumes:
+      # Persist uploaded/processed files
+      - ./data:/app/data
+      # Cache models to avoid re-downloading
+      - model_cache_hf:/root/.cache/huggingface
+      - model_cache_torch:/root/.cache/torch
+      - model_cache_mdx:/root/.audio-separator-models
+    environment:
+      # HuggingFace token (required for pyannote.audio)
+      - HF_TOKEN=${HF_TOKEN:-}
+      # Model settings
+      - WHISPER_MODEL=${WHISPER_MODEL:-erax-ai/EraX-WoW-Turbo-V1.1-CT2}
+      - DIARIZATION_MODEL=${DIARIZATION_MODEL:-pyannote/speaker-diarization-3.1}
+      # Device (auto, cuda, cpu)
+      - DEVICE=${DEVICE:-auto}
+      # Speech Enhancement (SpeechBrain SepFormer)
+      - ENABLE_SPEECH_ENHANCEMENT=${ENABLE_SPEECH_ENHANCEMENT:-True}
+      - ENHANCEMENT_MODEL=${ENHANCEMENT_MODEL:-speechbrain/sepformer-dns4-16k-enhancement}
+      # MDX-Net Vocal Separation
+      - ENABLE_VOCAL_SEPARATION=${ENABLE_VOCAL_SEPARATION:-True}
+      - MDX_MODEL=${MDX_MODEL:-UVR-MDX-NET-Voc_FT}
+      # Upload settings
+      - MAX_UPLOAD_SIZE_MB=${MAX_UPLOAD_SIZE_MB:-100}
+      # Optimization settings
+      - ENABLE_LOUDNORM=${ENABLE_LOUDNORM:-True}
+      - ENABLE_NOISE_REDUCTION=${ENABLE_NOISE_REDUCTION:-True}
+      # VAD settings
+      - VAD_THRESHOLD=${VAD_THRESHOLD:-0.5}
+      - VAD_MIN_SPEECH_DURATION_MS=${VAD_MIN_SPEECH_DURATION_MS:-250}
+      - VAD_MIN_SILENCE_DURATION_MS=${VAD_MIN_SILENCE_DURATION_MS:-500}
+      # Clustering settings
+      - MERGE_THRESHOLD_S=${MERGE_THRESHOLD_S:-0.5}
+      - MIN_SEGMENT_DURATION_S=${MIN_SEGMENT_DURATION_S:-0.3}
+    restart: unless-stopped
+    # GPU support (uncomment for NVIDIA GPU)
+    # deploy:
+    #   resources:
+    #     reservations:
+    #       devices:
+    #         - driver: nvidia
+    #           count: all
+    #           capabilities: [gpu]
+volumes:
+  model_cache_hf:
+    name: precisionvoice_hf_cache
+  model_cache_torch:
+    name: precisionvoice_torch_cache
+  model_cache_mdx:
+    name: precisionvoice_mdx_cache
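
Note on docker-compose.yml: every setting is passed to the container as a string, so the application has to convert values such as `True`, `100` and `0.5` itself. The sketch below shows one way that parsing could be done with the defaults listed above. It is an assumption for illustration, not the contents of `app/core/config.py`; `ComposeSettings`, `load_settings` and `_as_bool` are hypothetical names, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings such as 'True', 'true' or '1'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ComposeSettings:
    whisper_model: str
    device: str
    max_upload_size_mb: int
    vad_threshold: float
    enable_vocal_separation: bool


def load_settings(env: Mapping[str, str] = os.environ) -> ComposeSettings:
    """Build typed settings from environment strings.

    Defaults repeat the `${VAR:-default}` values in docker-compose.yml. Unlike
    Compose, `env.get` falls back to the default only when the variable is
    unset, not when it is set to an empty string.
    """
    return ComposeSettings(
        whisper_model=env.get("WHISPER_MODEL", "erax-ai/EraX-WoW-Turbo-V1.1-CT2"),
        device=env.get("DEVICE", "auto"),
        max_upload_size_mb=int(env.get("MAX_UPLOAD_SIZE_MB", "100")),
        vad_threshold=float(env.get("VAD_THRESHOLD", "0.5")),
        enable_vocal_separation=_as_bool(env.get("ENABLE_VOCAL_SEPARATION", "True")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to exercise without touching the real environment, for example `load_settings({"DEVICE": "cpu"})`.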

docker/.gitkeep ADDED Viewed

File without changes

precision_voice_eval_ASR.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff

precision_voice_simple.ipynb ADDED Viewed

	@@ -0,0 +1,672 @@

+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# 🎙️ PrecisionVoice - Vietnamese Speech-to-Text\n",
+        "\n",
+        "A simple notebook for transcribing Vietnamese audio using **faster-whisper** and **pyannote** (diarization).\n",
+        "\n",
+        "### Instructions\n",
+        "1. **Select a GPU**: `Runtime` → `Change runtime type` → **T4 GPU**\n",
+        "2. **Set up Secrets**: Add `HF_TOKEN` to Colab Secrets (key icon on the left) to use Pyannote.\n",
+        "3. **Run each cell** in order from top to bottom\n",
+        "4. **Use the Gradio link** in the last cell to access the UI"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# @title 1. 🔍 Check GPU\n",
+        "import torch\n",
+        "\n",
+        "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+        "if device == \"cuda\":\n",
+        "    gpu_name = torch.cuda.get_device_name(0)\n",
+        "    gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
+        "    print(f\"✅ GPU Detected: {gpu_name}\")\n",
+        "    print(f\"   VRAM: {gpu_mem:.1f} GB\")\n",
+        "else:\n",
+        "    print(\"⚠️ NO GPU FOUND!\")\n",
+        "    print(\"👉 Go to Runtime → Change runtime type → T4 GPU\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# @title 2. 📦 Install Dependencies\n",
+        "print(\"Installing dependencies...\")\n",
+        "!pip install --upgrade torch torchvision torchaudio \"pyannote.audio>=3.3.1\" faster-whisper gradio librosa nest_asyncio lightning torchmetrics\n",
+        "!apt-get install -y -qq ffmpeg > /dev/null 2>&1\n",
+        "print(\"✅ Dependencies installed successfully!\")"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# @title 3. 🤖 Load Models (Whisper & Pyannote)\n",
+        "import torch\n",
+        "import time\n",
+        "import os\n",
+        "import librosa\n",
+        "import numpy as np\n",
+        "from google.colab import userdata\n",
+        "from faster_whisper import WhisperModel\n",
+        "from pyannote.audio import Pipeline\n",
+        "\n",
+        "try:\n",
+        "    from pyannote.audio.core.task import Specifications, Problem, Resolution\n",
+        "    torch.serialization.add_safe_globals([Specifications, Problem, Resolution])\n",
+        "except Exception as e:\n",
+        "    print(f\"Could not add custom globals: {e}\")\n",
+        "\n",
+        "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+        "compute_type = \"float16\" if device == \"cuda\" else \"int8\"\n",
+        "\n",
+        "# List of supported Whisper models\n",
+        "AVAILABLE_MODELS = {\n",
+        "    \"EraX-WoW-Turbo (Whisper Large V3 Turbo - Tiếng Việt)\": \"erax-ai/EraX-WoW-Turbo-V1.1-CT2\",\n",
+        "    \"PhoWhisper Large (Tiếng Việt)\": \"kiendt/PhoWhisper-large-ct2\"\n",
+        "}\n",
+        "\n",
+        "# Cache models\n",
+        "loaded_whisper_models = {}\n",
+        "diarization_pipeline = None\n",
+        "\n",
+        "# Get HF_TOKEN (Colab Secrets first, then the environment)\n",
+        "try:\n",
+        "    hf_token = userdata.get('HF_TOKEN')\n",
+        "except Exception:\n",
+        "    hf_token = os.environ.get('HF_TOKEN')\n",
+        "\n",
+        "# ==================== LOAD ALL WHISPER MODELS ====================\n",
+        "print(\"=\"*50)\n",
+        "print(\"🔄 Pre-downloading ALL Whisper Models...\")\n",
+        "print(\"=\"*50)\n",
+        "\n",
+        "total_start = time.time()\n",
+        "for model_name, model_path in AVAILABLE_MODELS.items():\n",
+        "    print(f\"\\n📥 Loading: {model_name}\")\n",
+        "    start = time.time()\n",
+        "    try:\n",
+        "        model = WhisperModel(\n",
+        "            model_path,\n",
+        "            device=device,\n",
+        "            compute_type=compute_type\n",
+        "        )\n",
+        "        loaded_whisper_models[f\"{model_name}_{compute_type}\"] = model\n",
+        "        print(f\"   ✅ Loaded in {time.time() - start:.1f}s\")\n",
+        "    except Exception as e:\n",
+        "        print(f\"   ❌ Failed to load: {e}\")\n",
+        "\n",
+        "print(f\"\\n✅ All models loaded in {time.time() - total_start:.1f}s\")\n",
+        "print(f\"   Total models: {len(loaded_whisper_models)}\")\n",
+        "print(f\"   Device: {device}, Compute: {compute_type}\")\n",
+        "\n",
+        "# ==================== LOAD PYANNOTE ====================\n",
+        "print(\"\\n\" + \"=\"*50)\n",
+        "print(\"🔄 Loading Pyannote Diarization...\")\n",
+        "print(\"=\"*50)\n",
+        "\n",
+        "if not hf_token:\n",
+        "    print(\"⚠️ WARNING: HF_TOKEN not found!\")\n",
+        "    print(\"   Diarization will be disabled.\")\n",
+        "    print(\"   Please set HF_TOKEN in Colab Secrets.\")\n",
+        "else:\n",
+        "    start = time.time()\n",
+        "    try:\n",
+        "        diarization_pipeline = Pipeline.from_pretrained(\n",
+        "            \"pyannote/speaker-diarization-community-1\",\n",
+        "            token=hf_token\n",
+        "        )\n",
+        "        diarization_pipeline.to(torch.device(device))\n",
+        "        print(f\"✅ Pyannote loaded in {time.time() - start:.1f}s\")\n",
+        "    except Exception as e:\n",
+        "        print(f\"❌ Failed to load Pyannote: {e}\")\n",
+        "\n",
+        "print(\"\\n\" + \"=\"*50)\n",
+        "print(\"🎉 Model loading finished; check the messages above for any failures.\")\n",
+        "print(\"=\"*50)\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# @title 4. 🛠️ Utilities & Helpers\n",
+        "import gradio as gr\n",
+        "import time\n",
+        "import nest_asyncio\n",
+        "import subprocess\n",
+        "import os\n",
+        "\n",
+        "nest_asyncio.apply()\n",
+        "\n",
+        "def convert_audio_to_wav(audio_path):\n",
+        "    \"\"\"Normalize audio to 16 kHz mono WAV.\"\"\"\n",
+        "    try:\n",
+        "        # Temporary output file\n",
+        "        output_path = \"temp_processed_audio.wav\"\n",
+        "        \n",
+        "        # Remove the old file if it exists\n",
+        "        if os.path.exists(output_path):\n",
+        "            os.remove(output_path)\n",
+        "            \n",
+        "        # ffmpeg command line\n",
+        "        # -i input: input file\n",
+        "        # -ar 16000: sample rate 16 kHz\n",
+        "        # -ac 1: mono channel (Pyannote works best with mono)\n",
+        "        # -y: overwrite output\n",
+        "        command = [\n",
+        "            \"ffmpeg\", \n",
+        "            \"-i\", audio_path,\n",
+        "            \"-ar\", \"16000\",\n",
+        "            \"-ac\", \"1\",\n",
+        "            \"-y\",\n",
+        "            output_path\n",
+        "        ]\n",
+        "        \n",
+        "        subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)\n",
+        "        return output_path\n",
+        "    except Exception as e:\n",
+        "        print(f\"Error converting audio: {e}\")\n",
+        "        # Fallback: return the original file if conversion fails (risky)\n",
+        "        return audio_path\n",
+        "\n",
+        "def load_whisper_model(model_name, comp_type):\n",
+        "    \"\"\"Load a Whisper model on demand, with caching.\"\"\"\n",
+        "    global loaded_whisper_models\n",
+        "    cache_key = f\"{model_name}_{comp_type}\"\n",
+        "    \n",
+        "    if cache_key in loaded_whisper_models:\n",
+        "        return loaded_whisper_models[cache_key]\n",
+        "    \n",
+        "    model_path = AVAILABLE_MODELS[model_name]\n",
+        "    print(f\"Loading {model_name}...\")\n",
+        "    start = time.time()\n",
+        "    \n",
+        "    model = WhisperModel(\n",
+        "        model_path,\n",
+        "        device=device,\n",
+        "        compute_type=comp_type\n",
+        "    )\n",
+        "    \n",
+        "    loaded_whisper_models[cache_key] = model\n",
+        "    print(f\"✅ Loaded in {time.time() - start:.1f}s\")\n",
+        "    return model\n",
+        "\n",
+        "def format_timestamp(seconds):\n",
+        "    \"\"\"Format seconds as MM:SS.ss, or HH:MM:SS.ss from one hour up.\"\"\"\n",
+        "    hours = int(seconds // 3600)\n",
+        "    minutes = int((seconds % 3600) // 60)\n",
+        "    secs = seconds % 60\n",
+        "    if hours > 0:\n",
+        "        return f\"{hours:02d}:{minutes:02d}:{secs:05.2f}\"\n",
+        "    return f\"{minutes:02d}:{secs:05.2f}\"\n",
+        "\n",
+        "def assign_speaker_to_segment(seg_start, seg_end, diarization_result):\n",
+        "    \"\"\"Assign a speaker to a segment when the overlap ratio is >= 30%.\"\"\"\n",
+        "    if diarization_result is None:\n",
+        "        return \"SPEAKER_00\"\n",
+        "    \n",
+        "    seg_duration = seg_end - seg_start\n",
+        "    if seg_duration <= 0:\n",
+        "        return \"SPEAKER_00\"\n",
+        "    \n",
+        "    speaker_overlaps = {}\n",
+        "    \n",
+        "    for turn, _, speaker in diarization_result.speaker_diarization.itertracks(yield_label=True):\n",
+        "        overlap_start = max(seg_start, turn.start)\n",
+        "        overlap_end = min(seg_end, turn.end)\n",
+        "        overlap = max(0, overlap_end - overlap_start)\n",
+        "        \n",
+        "        if overlap > 0:\n",
+        "            if speaker not in speaker_overlaps:\n",
+        "                speaker_overlaps[speaker] = 0\n",
+        "            speaker_overlaps[speaker] += overlap\n",
+        "    \n",
+        "    if not speaker_overlaps:\n",
+        "        return \"SPEAKER_00\"\n",
+        "    \n",
+        "    best_speaker = max(speaker_overlaps, key=speaker_overlaps.get)\n",
+        "    best_overlap = speaker_overlaps[best_speaker]\n",
+        "    \n",
+        "    if best_overlap / seg_duration >= 0.3:\n",
+        "        return best_speaker\n",
+        "    \n",
+        "    return \"SPEAKER_00\"\n",
+        "\n",
+        "def merge_consecutive_segments(segments, max_gap=0.5):\n",
+        "    \"\"\"Merge consecutive segments from the same speaker.\"\"\"\n",
+        "    if not segments:\n",
+        "        return []\n",
+        "    \n",
+        "    merged = []\n",
+        "    current = segments[0].copy()\n",
+        "    \n",
+        "    for seg in segments[1:]:\n",
+        "        if seg['speaker'] == current['speaker'] and (seg['start'] - current['end']) <= max_gap:\n",
+        "            current['end'] = seg['end']\n",
+        "            current['text'] += ' ' + seg['text']\n",
+        "        else:\n",
+        "            merged.append(current)\n",
+        "            current = seg.copy()\n",
+        "    \n",
+        "    merged.append(current)\n",
+        "    return merged"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# @title 5. ⚙️ Processing Logic\n",
+        "def process_audio(audio_path, model_name, language, beam_size, vad_filter, vad_min_silence, vad_speech_pad, vad_min_speech, vad_threshold, temperature, best_of, patience, length_penalty, initial_prompt, prefix, condition_on_previous_text, no_speech_threshold, log_prob_threshold, compression_ratio_threshold, comp_type, merge_segs, p=gr.Progress()):\n",
+        "    \"\"\"\n",
+        "    New pipeline:\n",
+        "    0. Normalize the audio (convert mp3 -> wav 16k).\n",
+        "    1. Run diarization to split out each speaker's segments.\n",
+        "    2. Slice the audio along these segments.\n",
+        "    3. Transcribe each audio slice.\n",
+        "    4. Combine the results.\n",
+        "    \"\"\"\n",
+        "    if audio_path is None:\n",
+        "        msg = \"⚠️ Please upload or record audio!\"\n",
+        "        return msg, msg\n",
+        "    \n",
+        "    total_start_time = time.time()\n",
+        "    \n",
+        "    # Check Pyannote\n",
+        "    if diarization_pipeline is None:\n",
+        "        return \"❌ Error: Pyannote is not loaded (check HF_TOKEN).\", \"❌ Error: Pyannote is not loaded.\"\n",
+        "\n",
+        "    # 0. Preprocessing Audio (Standardize)\n",
+        "    p(0.05, desc=\"Normalizing audio (16kHz WAV)...\")\n",
+        "    try:\n",
+        "        # Always convert to 16 kHz mono WAV to avoid Pyannote sample-rate mismatch errors\n",
+        "        clean_audio_path = convert_audio_to_wav(audio_path)\n",
+        "    except Exception as e:\n",
+        "        msg = f\"❌ Audio conversion error: {e}\"\n",
+        "        return msg, msg\n",
+        "        \n",
+        "    # 1. Load Standardized Audio for slicing later\n",
+        "    p(0.08, desc=\"Reading audio file...\")\n",
+        "    try:\n",
+        "        y, sr = librosa.load(clean_audio_path, sr=16000)\n",
+        "        # sr is exactly 16000 here because it is requested explicitly\n",
+        "    except Exception as e:\n",
+        "        return f\"❌ Audio read error: {e}\", f\"❌ Audio read error: {e}\"\n",
+        "\n",
+        "    # 2. DIARIZATION\n",
+        "    p(0.1, desc=\"Separating speakers (Diarization)...\")\n",
+        "    \n",
+        "    try:\n",
+        "        # Use the normalized file\n",
+        "        diarization = diarization_pipeline(clean_audio_path)\n",
+        "    except Exception as e:\n",
+        "        return f\"❌ Diarization error: {e}\", f\"❌ Diarization error: {e}\"\n",
+        "        \n",
+        "    diarization_segments = []\n",
+        "    # Reuse the earlier user fix (in case the model returns a different object).\n",
+        "    # By default the community pipeline returns an Annotation directly, but the user fix reads diarization.speaker_diarization.\n",
+        "    # Use try/except to support both structures, to be safe.\n",
+        "    try:\n",
+        "        # Case 1: standard Annotation\n",
+        "        iterator = diarization.itertracks(yield_label=True)\n",
+        "        # Consume it once to check that it works; a non-Annotation raises AttributeError\n",
+        "        _ = list(iterator)\n",
+        "        # Recreate the iterator after the check\n",
+        "        iterator = diarization.itertracks(yield_label=True)\n",
+        "    except Exception:\n",
+        "        # Case 2: structure reported by the user (possibly a wrapper object)\n",
+        "        try:\n",
+        "            iterator = diarization.speaker_diarization.itertracks(yield_label=True)\n",
+        "        except Exception:\n",
+        "            return \"❌ Unexpected diarization result format\", \"❌ Unexpected diarization result format\"\n",
+        "\n",
+        "    for turn, _, speaker in iterator:\n",
+        "        diarization_segments.append({\n",
+        "            \"start\": turn.start,\n",
+        "            \"end\": turn.end,\n",
+        "            \"speaker\": speaker\n",
+        "        })\n",
+        "    \n",
+        "    # Sort segments by start time\n",
+        "    diarization_segments.sort(key=lambda x: x['start'])\n",
+        "    \n",
+        "    # Merge consecutive segments if requested\n",
+        "    if merge_segs and diarization_segments:\n",
+        "        p(0.3, desc=\"Merging consecutive segments...\")\n",
+        "        merged = []\n",
+        "        current = diarization_segments[0].copy()\n",
+        "        for seg in diarization_segments[1:]:\n",
+        "            if seg['speaker'] == current['speaker'] and (seg['start'] - current['end']) <= 0.5:\n",
+        "                current['end'] = seg['end']\n",
+        "            else:\n",
+        "                merged.append(current)\n",
+        "                current = seg.copy()\n",
+        "        merged.append(current)\n",
+        "        diarization_segments = merged\n",
+        "    \n",
+        "    # 3. TRANSCRIPTION LOOP\n",
+        "    p(0.4, desc=\"Loading Whisper model...\")\n",
+        "    model = load_whisper_model(model_name, comp_type)\n",
+        "    \n",
+        "    processed_segments = []\n",
+        "    \n",
+        "    total_segs = len(diarization_segments)\n",
+        "    \n",
+        "    # Prepare VAD options\n",
+        "    if vad_filter:\n",
+        "        vad_options = dict(\n",
+        "            min_silence_duration_ms=vad_min_silence,\n",
+        "            speech_pad_ms=vad_speech_pad,\n",
+        "            min_speech_duration_ms=vad_min_speech,\n",
+        "            threshold=vad_threshold\n",
+        "        )\n",
+        "    else:\n",
+        "        vad_options = False\n",
+        "        \n",
+        "    prompt = initial_prompt.strip() if (initial_prompt and initial_prompt.strip()) else None\n",
+        "    prefix_text = prefix.strip() if (prefix and prefix.strip()) else None\n",
+        "\n",
+        "    print(f\"Processing {total_segs} segments...\")\n",
+        "    \n",
+        "    for idx, seg in enumerate(diarization_segments):\n",
+        "        start_sec = seg['start']\n",
+        "        end_sec = seg['end']\n",
+        "        speaker = seg['speaker']\n",
+        "        \n",
+        "        # UI Progress\n",
+        "        progress_val = 0.4 + (0.5 * (idx / total_segs))\n",
+        "        p(progress_val, desc=f\"Transcribing {idx+1}/{total_segs} ({speaker})...\")\n",
+        "        \n",
+        "        # Audio slicing\n",
+        "        start_sample = int(start_sec * sr)\n",
+        "        end_sample = int(end_sec * sr)\n",
+        "        \n",
+        "        # Avoid empty slice\n",
+        "        if end_sample <= start_sample:\n",
+        "            continue\n",
+        "            \n",
+        "        y_seg = y[start_sample:end_sample]\n",
+        "        \n",
+        "        # Whisper Transcribe for this chunk\n",
+        "        try:\n",
+        "            # Note: We pass the numpy array 'y_seg' directly\n",
+        "            segments_gen, _ = model.transcribe(\n",
+        "                y_seg, \n",
+        "                language=language if language != \"auto\" else None,\n",
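+        "                beam_size=int(beam_size), \n",
+        "                # vad_filter is a boolean switch; the VAD settings go in vad_parameters\n",
+        "                vad_filter=bool(vad_filter),\n",
+        "                vad_parameters=vad_options if vad_filter else None,\n",
+        "                temperature=temperature,\n",
+        "                best_of=int(best_of),\n",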
+        "                patience=patience,\n",
+        "                length_penalty=length_penalty,\n",
+        "                initial_prompt=prompt,\n",
+        "                prefix=prefix_text,\n",
+        "                condition_on_previous_text=condition_on_previous_text,\n",
+        "                no_speech_threshold=no_speech_threshold,\n",
+        "                log_prob_threshold=log_prob_threshold,\n",
+        "                compression_ratio_threshold=compression_ratio_threshold,\n",
+        "                word_timestamps=False \n",
+        "            )\n",
+        "            \n",
+        "            # Collect text\n",
+        "            seg_text_parts = []\n",
+        "            for s in segments_gen:\n",
+        "                seg_text_parts.append(s.text.strip())\n",
+        "            \n",
+        "            final_text = \" \".join(seg_text_parts).strip()\n",
+        "            \n",
+        "            if final_text:\n",
+        "                # Store Result\n",
+        "                processed_segments.append({\n",
+        "                    \"start\": start_sec,\n",
+        "                    \"end\": end_sec,\n",
+        "                    \"speaker\": speaker,\n",
+        "                    \"text\": final_text\n",
+        "                })\n",
+        "                \n",
+        "        except Exception as e:\n",
+        "            print(f\"Error transcribing segment {idx}: {e}\")\n",
+        "            continue\n",
+        "\n",
+        "    total_elapsed = time.time() - total_start_time\n",
+        "    \n",
+        "    p(0.95, desc=\"Exporting results...\")\n",
+        "    \n",
+        "    # ========== OUTPUT GENERATION ==========\n",
+        "    \n",
+        "    # Speaker colors\n",
+        "    speaker_colors = {\n",
+        "        'SPEAKER_00': '🔵',\n",
+        "        'SPEAKER_01': '🟢', \n",
+        "        'SPEAKER_02': '🟡',\n",
+        "        'SPEAKER_03': '🟠',\n",
+        "        'SPEAKER_04': '🔴',\n",
+        "        'SPEAKER_05': '🟣',\n",
+        "    }\n",
+        "    \n",
+        "    # 1. Plain Transcription Output\n",
+        "    transcribe_lines = []\n",
+        "    for item in processed_segments:\n",
+        "        ts = f\"[{format_timestamp(item['start'])} → {format_timestamp(item['end'])}]\"\n",
+        "        transcribe_lines.append(f\"{ts} {item['text']}\")\n",
+        "        \n",
+        "    transcribe_header = f\"\"\"## 📝 Transcription Results\n",
+        "\n",
+        "| Info | Value |\n",
+        "|------|-------|\n",
+        "| ⏱️ Total processing time | {total_elapsed:.1f}s |\n",
+        "| 📊 Total segments | {len(processed_segments)} |\n",
+        "\n",
+        "---\n",
+        "\n",
+        "\"\"\"\n",
+        "    transcribe_output = transcribe_header + \"\\n\".join(transcribe_lines)\n",
+        "    \n",
+        "    # 2. Diarization + Transcription Output\n",
+        "    diarize_lines = []\n",
+        "    unique_speakers = set()\n",
+        "    \n",
+        "    for item in processed_segments:\n",
+        "        unique_speakers.add(item['speaker'])\n",
+        "        ts = f\"[{format_timestamp(item['start'])} → {format_timestamp(item['end'])}]\"\n",
+        "        icon = speaker_colors.get(item['speaker'], '⚪')\n",
+        "        diarize_lines.append(f\"{ts} {icon} **{item['speaker']}**: {item['text']}\")\n",
+        "        \n",
+        "    diarize_header = f\"\"\"## 🎭 Transcription + Diarization Results\n",
+        "\n",
+        "| Info | Value |\n",
+        "|------|-------|\n",
+        "| 👥 Number of speakers | {len(unique_speakers)} |\n",
+        "| ⏱️ Total processing time | {total_elapsed:.1f}s |\n",
+        "| 📊 Total segments | {len(processed_segments)} |\n",
+        "\n",
+        "---\n",
+        "\n",
+        "\"\"\"\n",
+        "    diarize_output = diarize_header + \"\\n\".join(diarize_lines)\n",
+        "    \n",
+        "    return transcribe_output, diarize_output"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# @title 6. 🚀 Gradio UI\n",
+        "css = \"\"\"\n",
+        ".gradio-container { max-width: 1200px !important; }\n",
+        ".output-markdown { font-family: 'JetBrains Mono', monospace !important; }\n",
+        "\"\"\"\n",
+        "\n",
+        "with gr.Blocks(title=\"PrecisionVoice\", theme=gr.themes.Soft(), css=css) as demo:\n",
+        "    gr.Markdown(\"\"\"# 🎙️ PrecisionVoice - Vietnamese Speech-to-Text\n",
+        "    \n",
+        "Uses **Whisper** for speech recognition and **Pyannote** for speaker diarization.\n",
+        "\"\"\")\n",
+        "    \n",
+        "    with gr.Row():\n",
+        "        with gr.Column(scale=1):\n",
+        "            audio_input = gr.Audio(\n",
+        "                sources=[\"upload\", \"microphone\"], \n",
+        "                type=\"filepath\", \n",
+        "                label=\"🔊 Audio Input\"\n",
+        "            )\n",
+        "            \n",
+        "            gr.Markdown(\"### ⚙️ Model Settings\")\n",
+        "            model_select = gr.Dropdown(\n",
+        "                choices=list(AVAILABLE_MODELS.keys()),\n",
+        "                value=list(AVAILABLE_MODELS.keys())[0],\n",
+        "                label=\"🤖 Whisper Model\"\n",
+        "            )\n",
+        "            \n",
+        "            language = gr.Dropdown(\n",
+        "                choices=[\"auto\", \"vi\", \"en\", \"zh\", \"ja\", \"ko\"],\n",
+        "                value=\"vi\",\n",
+        "                label=\"🌐 Language\"\n",
+        "            )\n",
+        "            \n",
+        "            comp_type_select = gr.Dropdown(\n",
+        "                choices=[\"float16\", \"float32\", \"int8\", \"int8_float16\"],\n",
+        "                value=compute_type,\n",
+        "                label=\"⚡ Compute Type\"\n",
+        "            )\n",
+        "            \n",
+        "            with gr.Accordion(\"🔧 Advanced options\", open=False):\n",
+        "                beam_size = gr.Slider(\n",
+        "                    minimum=1, maximum=10, value=5, step=1,\n",
+        "                    label=\"Beam Size\",\n",
+        "                    info=\"Higher = more accurate but slower\"\n",
+        "                )\n",
+        "                vad_filter = gr.Checkbox(\n",
+        "                    value=True, \n",
+        "                    label=\"VAD Filter\",\n",
+        "                    info=\"Automatically filter out silence\"\n",
+        "                )\n",
+        "                with gr.Row():\n",
+        "                    vad_min_silence = gr.Number(value=1000, label=\"Min Silence (ms)\", info=\"min_silence_duration_ms\")\n",
+        "                    vad_speech_pad = gr.Number(value=400, label=\"Speech Pad (ms)\", info=\"speech_pad_ms\")\n",
+        "                with gr.Row():\n",
+        "                    vad_min_speech = gr.Number(value=250, label=\"Min Speech (ms)\", info=\"min_speech_duration_ms\")\n",
+        "                    vad_threshold = gr.Slider(minimum=0, maximum=1, value=0.5, step=0.05, label=\"VAD Threshold\")\n",
+        "            \n",
+        "            with gr.Accordion(\"🧠 Generation Parameters (Whisper)\", open=False):\n",
+        "                with gr.Row():\n",
+        "                    temperature = gr.Slider(0.0, 1.0, value=0.0, step=0.1, label=\"Temperature\")\n",
+        "                    best_of = gr.Number(value=5, label=\"Best Of\")\n",
+        "                with gr.Row():\n",
+        "                    patience = gr.Number(value=1.0, label=\"Patience\", step=0.1)\n",
+        "                    length_penalty = gr.Number(value=1.0, label=\"Length Penalty\", step=0.1)\n",
+        "                initial_prompt = gr.Textbox(label=\"Initial Prompt\", placeholder=\"Context or vocabulary...\")\n",
+        "                prefix = gr.Textbox(label=\"Prefix\", placeholder=\"Start the sentence with...\")\n",
+        "                condition_on_previous_text = gr.Checkbox(value=True, label=\"Condition on previous text\")\n",
+        "                \n",
+        "                gr.Markdown(\"**Filter Thresholds**\")\n",
+        "                with gr.Row():\n",
+        "                    no_speech_threshold = gr.Slider(0.0, 1.0, value=0.6, step=0.05, label=\"No Speech Threshold\")\n",
+        "                    log_prob_threshold = gr.Slider(-5.0, 0.0, value=-1.0, step=0.1, label=\"Log Prob Threshold\")\n",
+        "                    compression_ratio_threshold = gr.Number(value=2.4, label=\"Compression Ratio Threshold\")\n",
+        "            \n",
+        "            merge_segments = gr.Checkbox(\n",
+        "                value=True,\n",
+        "                label=\"Merge segments from the same speaker\",\n",
+        "                info=\"Merge consecutive sentences from the same speaker\"\n",
+        "            )\n",
+        "            \n",
+        "            btn_process = gr.Button(\"🚀 Process Audio\", variant=\"primary\", size=\"lg\")\n",
+        "        \n",
+        "        with gr.Column(scale=2):\n",
+        "            with gr.Tabs():\n",
+        "                with gr.Tab(\"📝 Transcription\"):\n",
+        "                    output_transcribe = gr.Markdown(\n",
+        "                        value=\"*Transcription results will appear here...*\",\n",
+        "                        elem_classes=[\"output-markdown\"]\n",
+        "                    )\n",
+        "                with gr.Tab(\"🎭 Transcription + Diarization\"):\n",
+        "                    output_diarize = gr.Markdown(\n",
+        "                        value=\"*Transcription + diarization results will appear here...*\",\n",
+        "                        elem_classes=[\"output-markdown\"]\n",
+        "                    )\n",
+        "    \n",
+        "    btn_process.click(\n",
+        "        process_audio,\n",
+        "        inputs=[\n",
+        "            audio_input, model_select, language, beam_size, vad_filter, \n",
+        "            vad_min_silence, vad_speech_pad, vad_min_speech, vad_threshold,\n",
+        "            temperature, best_of, patience, length_penalty, \n",
+        "            initial_prompt, prefix, condition_on_previous_text,\n",
+        "            no_speech_threshold, log_prob_threshold, compression_ratio_threshold,\n",
+        "            comp_type_select, merge_segments\n",
+        "        ],\n",
+        "        outputs=[output_transcribe, output_diarize]\n",
+        "    )\n",
+        "    \n",
+        "    gr.Markdown(\"\"\"---\n",
+        "    \n",
+        "### 📖 Usage guide\n",
+        "\n",
+        "1. **Upload audio** or record directly\n",
+        "2. **Choose a model**:\n",
+        "   - `EraX-WoW-Turbo`: Whisper Large V3 Turbo, optimized for Vietnamese\n",
+        "   - `PhoWhisper Large`: a model trained specifically for Vietnamese\n",
+        "3. **Advanced settings**:\n",
+        "   - Adjust `temperature` if you want the model to be more creative (less deterministic).\n",
+        "   - Add an `Initial Prompt` to hint at domain-specific vocabulary.\n",
+        "4. **Click the 🚀 process button** to get results in both tabs\n",
+        "\"\"\")\n",
+        "\n",
+        "# Launch\n",
+        "import os\n",
+        "if \"COLAB_GPU\" in os.environ or \"google.colab\" in str(get_ipython()):\n",
+        "    demo.queue().launch(share=True, debug=True)\n",
+        "else:\n",
+        "    demo.launch(share=False)"
+      ]
+    }
+  ],
+  "metadata": {
+    "accelerator": "GPU",
+    "colab": {
+      "gpuType": "T4",
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.12.12"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 4
+}
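
The overlap rule used by the notebook's `assign_speaker_to_segment` helper can be tried without loading Pyannote. The following is a minimal sketch, not the notebook's code: it assumes speaker turns are plain `(start, end, speaker)` tuples instead of a Pyannote result, and the function name `assign_speaker`, the `min_ratio` parameter, and the sample turns are all hypothetical.

```python
def assign_speaker(seg_start, seg_end, turns, min_ratio=0.3, default="SPEAKER_00"):
    """Pick the speaker with the largest total overlap with the segment.

    The speaker is accepted only if that overlap covers at least
    `min_ratio` of the segment duration; otherwise `default` is returned.
    `turns` is an iterable of (start, end, speaker) tuples (an assumption
    made for this sketch, standing in for a diarization result).
    """
    duration = seg_end - seg_start
    if duration <= 0:
        return default

    overlaps = {}
    for turn_start, turn_end, speaker in turns:
        # Length of the intersection of [seg_start, seg_end] and the turn
        overlap = max(0.0, min(seg_end, turn_end) - max(seg_start, turn_start))
        if overlap > 0:
            overlaps[speaker] = overlaps.get(speaker, 0.0) + overlap

    if not overlaps:
        return default

    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] / duration >= min_ratio else default


# Hypothetical speaker turns: (start, end, speaker)
turns = [(0.0, 2.0, "SPEAKER_01"), (2.0, 6.0, "SPEAKER_02")]

a = assign_speaker(1.0, 5.0, turns)   # 1.0 s with SPEAKER_01, 3.0 s with SPEAKER_02
b = assign_speaker(5.5, 10.0, turns)  # only 0.5 s of 4.5 s overlaps any turn
c = assign_speaker(7.0, 8.0, turns)   # no overlap at all
```

In this sketch the first segment goes to `SPEAKER_02` (3.0 s of 4.0 s overlaps), while the other two fall back to the default label because their best overlap is below the 30% threshold or absent.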

requirements.txt ADDED Viewed

	@@ -0,0 +1,48 @@

+# Core framework
+fastapi>=0.109.0
+uvicorn[standard]>=0.27.0
+python-multipart>=0.0.6
+jinja2>=3.1.2
+aiofiles>=23.2.1
+# AI/ML - Speech-to-Text
+faster-whisper>=1.0.0
+ctranslate2>=4.0.0
+# AI/ML - Speaker Diarization (from notebook cell #2)
+pyannote.audio>=3.3.1
+torch>=2.1.0
+torchaudio>=2.1.0
+torchvision
+lightning
+torchmetrics
+# Transformers Whisper + LoRA
+transformers>=4.39.0,<5
+accelerate>=0.26.0
+peft>=0.8.0
+huggingface-hub>=0.20.0
+safetensors>=0.4.0
+# AI/ML - Vocal Separation
+audio-separator[cpu]>=0.17.0
+denoiser>=0.1.4
+# Audio processing
+librosa>=0.10.0
+ffmpeg-python>=0.2.0
+pydub>=0.25.1
+# Configuration
+pydantic-settings>=2.1.0
+python-dotenv>=1.0.0
+# Utilities
+numpy>=1.24.0

scripts/verify_model_config.py ADDED Viewed

	@@ -0,0 +1,18 @@

+import os
+from app.core.config import get_settings
+from app.services.transcription import TranscriptionService
+def verify_stt_model():
+    settings = get_settings()
+    print(f"Current Whisper Model: {settings.whisper_model}")
+    print(f"Device: {settings.resolved_device}")
+    print(f"Compute Type: {settings.resolved_compute_type}")
+    expected_model = "kiendt/PhoWhisper-large-ct2"
+    if settings.whisper_model == expected_model:
+        print("✅ SUCCESS: Model configuration updated correctly.")
+    else:
+        print(f"❌ FAILURE: Expected {expected_model}, got {settings.whisper_model}")
+if __name__ == "__main__":
+    verify_stt_model()
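
The script above only prints the result, so a caller such as a CI job cannot tell success from failure. A minimal sketch of a variant that returns an exit status follows; `FakeSettings` and `check_model` are hypothetical names introduced here, and `FakeSettings` merely stands in for whatever `get_settings()` returns in the app.

```python
from dataclasses import dataclass

EXPECTED_MODEL = "kiendt/PhoWhisper-large-ct2"


@dataclass
class FakeSettings:
    """Hypothetical stand-in for the settings object from get_settings()."""
    whisper_model: str


def check_model(settings, expected=EXPECTED_MODEL):
    """Return 0 when the configured model matches `expected`, 1 otherwise."""
    if settings.whisper_model == expected:
        print("SUCCESS: model configuration matches.")
        return 0
    print(f"FAILURE: expected {expected}, got {settings.whisper_model}")
    return 1


ok = check_model(FakeSettings(whisper_model="kiendt/PhoWhisper-large-ct2"))
bad = check_model(FakeSettings(whisper_model="some/other-model"))
```

In a real script the return value could be passed to `sys.exit()` so that a mismatch fails the calling process.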