---
license: apache-2.0
title: all in one transcribe
sdk: docker
emoji: 🏆
colorFrom: gray
colorTo: red
short_description: multiple file transcription
---
# Quick-Start Medical Transcription (Whisper small) — Multi-file -> Merged DOCX

This repository provides a simple, user-friendly transcription service:
- Web UI for uploading multiple audio files
- Background model loading, so the server becomes responsive quickly
- Transcription with Whisper (default: `small`) plus light medical postprocessing
- Returns a single merged Word document (.docx) containing per-file transcripts

## Prerequisites
- Docker with docker-compose, or Python 3.10+ and pip
- CPU-only by default: this setup installs the CPU PyTorch wheel. If you have a GPU, swap in a CUDA-enabled PyTorch wheel in requirements.txt and set WHISPER_MODEL to a larger model if desired.
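As a sketch, a CPU-only requirements.txt can point pip at PyTorch's CPU wheel index like this (the package list and pins below are illustrative — keep whatever this repo's requirements.txt actually lists):

```text
--extra-index-url https://download.pytorch.org/whl/cpu
torch
openai-whisper
flask
python-docx
```

For a GPU build, drop the `--extra-index-url` line (the default PyPI wheel bundles CUDA on Linux) or point it at the index matching your CUDA version.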

## Quick start: Run with Docker Compose (recommended)
1) Build and start (no preloading, smaller image):
   docker-compose up --build

   The model will be downloaded on first container run; check readiness:
   curl -i http://localhost:5000/ready   # 503 until model is loaded

2) Preload model into image (optional; faster startup, larger image):
   # Change PRELOAD_MODEL to true when building to include model weights in the image:
   docker-compose build --build-arg PRELOAD_MODEL=true
   docker-compose up -d

3) Use the UI:
   Open http://localhost:5000 in your browser, upload multiple audio files and click "Upload and Merge".
   Or call the API:
   curl -F "files=@a.wav" -F "files=@b.mp3" http://localhost:5000/transcribe --output merged_transcripts.docx
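For scripted use, the same upload can be done from Python with only the standard library. This is a sketch assuming `/transcribe` accepts multipart form data under the field name `files`, as in the curl example above; `BASE_URL` is whatever host your deployment uses:

```python
import io
import mimetypes
import os
import urllib.request
import uuid

BASE_URL = "http://localhost:5000"  # adjust for your deployment

def build_multipart(paths):
    """Encode files as a multipart/form-data body, one "files" part each."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    for path in paths:
        ctype = mimetypes.guess_type(path)[0] or "application/octet-stream"
        body.write(f"--{boundary}\r\n".encode())
        body.write(
            f'Content-Disposition: form-data; name="files"; '
            f'filename="{os.path.basename(path)}"\r\n'.encode()
        )
        body.write(f"Content-Type: {ctype}\r\n\r\n".encode())
        with open(path, "rb") as f:
            body.write(f.read())
        body.write(b"\r\n")
    body.write(f"--{boundary}--\r\n".encode())
    return body.getvalue(), f"multipart/form-data; boundary={boundary}"

def transcribe(paths, out_path="merged_transcripts.docx"):
    """POST the audio files to /transcribe and save the merged .docx."""
    data, content_type = build_multipart(paths)
    req = urllib.request.Request(
        f"{BASE_URL}/transcribe",
        data=data,
        headers={"Content-Type": content_type},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

if __name__ == "__main__":
    transcribe(["a.wav", "b.mp3"])
```

Poll `/ready` first (it returns 503 until the model is loaded) if you script this right after container start.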

## Run locally without Docker
1) Create a virtual environment and install dependencies:
   python -m venv venv
   source venv/bin/activate   # on Windows: venv\Scripts\activate
   pip install -r requirements.txt

2) Run:
   python app.py
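The WHISPER_MODEL environment variable mentioned in the notes is typically read with a pattern like this (a sketch — the actual code in app.py may differ):

```python
import os

def pick_model(default="small"):
    """Read the Whisper model name from the environment, with a fallback."""
    return os.environ.get("WHISPER_MODEL", default)
```

So, for example, `WHISPER_MODEL=medium python app.py` selects the medium model for that run without editing any code.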

## Notes and next steps
- The default model is `small` for fast startup; set the WHISPER_MODEL environment variable to `medium` for slightly higher accuracy.
- For production / PHI handling:
  - Use TLS, authentication, and private networking.
  - Use a vetted PHI de-identification pipeline (e.g., medspaCy or a transformer-based NER model).
  - Consider preloading the model at build time to avoid long first-start delays.
- To improve accuracy for medical terms:
  - Add or expand `medical_vocab.txt`.
  - Later, upgrade to a fine-tuned wav2vec2 model with a KenLM language model if you collect labeled medical audio.
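The vocabulary idea can be sketched with a simple fuzzy match: snap out-of-vocabulary words onto the closest term from `medical_vocab.txt`. This is illustrative only — the repo's actual postprocessing may differ — and it assumes the vocabulary file holds one lowercase term per line:

```python
import difflib
import re

def load_vocab(path="medical_vocab.txt"):
    # One term per line; blank lines are ignored.
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]

def correct_terms(text, vocab, cutoff=0.85):
    """Snap near-miss words in a transcript onto the closest vocabulary term."""
    known = set(vocab)

    def fix(match):
        word = match.group(0)
        lower = word.lower()
        # Leave short words and exact vocabulary hits untouched.
        if lower in known or len(lower) < 5:
            return word
        close = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        return close[0] if close else word

    return re.sub(r"[A-Za-z]+", fix, text)
```

The `cutoff` keeps ordinary words from being rewritten; tune it against real transcripts, since too low a value will "correct" words that were right.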

Possible extensions:
- A variant using faster-whisper/ONNX for much faster CPU inference.
- medspaCy-based PHI redaction integrated into the pipeline.
- Speaker diarization (pyannote) with physician vs. patient labels in the merged .docx.