Speech2Invoice / README.md
ducdatit2002
library_name: pytorch
license: mit
pipeline_tag: automatic-speech-recognition
language:
  - vi
  - en
tags:
  - automatic-speech-recognition
  - invoice-extraction
  - speech

ASR + Invoice Extraction Server

A standalone package of Server_conformer.py that transcribes audio and extracts invoice JSON from the transcript text. This folder includes a copy of the trained RNNT checkpoint for convenience.

What’s inside

  • Server_conformer.py, Speech2text.py, InformationExtractor.py
  • chunkformer/ (model code)
  • chunkformer-model/ (RNNT checkpoint)
  • requirements.txt

Prerequisites

  • Python 3.9+ and a CUDA GPU (needed in practice for Qwen invoice extraction; CPU inference will be extremely slow)
  • A Hugging Face token (HF_TOKEN) with access to the models you use
  • Chunkformer RNNT checkpoint available at chunkformer-model (copied into this folder). Update CHUNKFORMER_MODEL_PATH if you place it elsewhere.
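
The prerequisites above can be checked before starting the server. The following is a minimal sketch, not part of the shipped code: the function name preflight is hypothetical, and it only covers the Python version, HF_TOKEN, and the checkpoint directory. It does not check for a GPU, since that would require importing torch.

```python
import os
import sys
from pathlib import Path
from typing import List, Mapping


def preflight(env: Mapping[str, str], min_python=(3, 9)) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")
    # Falls back to the folder name used in this README when the variable is unset.
    model_dir = Path(env.get("CHUNKFORMER_MODEL_PATH", "chunkformer-model"))
    if not model_dir.is_dir():
        problems.append(f"checkpoint directory not found: {model_dir}")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ):
        print("preflight:", problem)
```

Passing the environment as a mapping keeps the check easy to run against a parsed .env file as well as os.environ.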

Setup

cd Speech2Invoice
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Configure environment

Create a .env file (or export the equivalent environment variables) with at least:

PORT=8000
USE_NGROK=false
HF_TOKEN=your_hf_token_here
CHUNKFORMER_MODEL_PATH=chunkformer-model
LOG_LEVEL=DEBUG
DEBUG=true

# Optional ngrok
NGROK_AUTHTOKEN=
NGROK_REGION=ap

# Optional invoice LLM overrides (defaults are fast)
IE_LLM_MODEL_ID=Qwen/Qwen1.5-7B-Chat
IE_MAX_NEW_TOKENS=256
IE_DO_SAMPLE=false
IE_TEMPERATURE=0.0
IE_TOP_P=0.8

If you move the model elsewhere, set CHUNKFORMER_MODEL_PATH to that directory.
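
Every value in a .env file is a string, so settings such as PORT, USE_NGROK, and IE_TEMPERATURE need converting before use. This README does not say how Server_conformer.py loads the file (it may well use a library such as python-dotenv), so the sketch below is only an illustration of the format. The helpers parse_env and as_bool are hypothetical names.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def as_bool(value: str) -> bool:
    """Treat common truthy spellings as True; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


# A shortened copy of the example configuration above.
sample = """
PORT=8000
USE_NGROK=false
# Optional ngrok
NGROK_AUTHTOKEN=
IE_TEMPERATURE=0.0
"""
cfg = parse_env(sample)
port = int(cfg["PORT"])
use_ngrok = as_bool(cfg["USE_NGROK"])
temperature = float(cfg["IE_TEMPERATURE"])
```

Note that an empty assignment such as NGROK_AUTHTOKEN= parses to an empty string rather than a missing key, so code that tests for the token should check for a non-empty value.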

Run

python3 Server_conformer.py

Endpoints

  • POST /transcribe: multipart/form-data upload of an audio file (wav, mp3, m4a, ogg, webm). Returns JSON with final_result and full_transcription.
  • POST /ticket: JSON body {"full_transcription": "<text>"}. Returns the invoice JSON inferred by Qwen.
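
A client for the two endpoints above might look like the following sketch, which uses only the Python standard library. Several details are assumptions rather than documented behaviour: the multipart field name ("file"), the audio/wav content type, and the helper names build_ticket_request, build_transcribe_request, and send. Check Server_conformer.py for the field name the server actually expects.

```python
import json
import uuid
import urllib.request


def build_ticket_request(base_url: str, transcript: str) -> urllib.request.Request:
    """Prepare POST /ticket with the JSON body described above."""
    body = json.dumps({"full_transcription": transcript}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ticket",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_transcribe_request(base_url, filename, audio_bytes,
                             field="file", mime="audio/wav"):
    """Prepare POST /transcribe as multipart/form-data.

    `field` is an assumed form field name, not taken from the server code.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/transcribe",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request, timeout=300):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally on PORT=8000, a typical flow would be to call send(build_transcribe_request("http://localhost:8000", "call.wav", audio_bytes)), read full_transcription from the result, and pass it to send(build_ticket_request("http://localhost:8000", text)). The generous default timeout reflects that the first /ticket call may wait on a model download.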

Notes

  • The invoice extractor needs a GPU and downloads its model from Hugging Face on first run. For faster inference, point IE_LLM_MODEL_ID at a smaller model.
  • Model weights for the RNNT checkpoint are included in chunkformer-model/. For large files, consider git-lfs if you plan to push to a remote.

Contact

For questions or controlled access requests to Speech2Invoice: