---
library_name: pytorch
license: mit
pipeline_tag: automatic-speech-recognition
language:
- vi
- en
tags:
- automatic-speech-recognition
- invoice-extraction
- speech
---

# ASR + Invoice Extraction Server

A standalone package of `Server_conformer.py` that transcribes audio and extracts invoice JSON from the transcript text. A copy of the trained RNNT checkpoint is included in this folder for convenience.

## What’s inside

- `Server_conformer.py`, `Speech2text.py`, `InformationExtractor.py`
- `chunkformer/`: Chunkformer code
- `chunkformer-model/`: the RNNT checkpoint
- `requirements.txt`

## Prerequisites

- Python 3.9+ and a CUDA GPU (needed for Qwen invoice extraction; CPU inference will be extremely slow)
- A Hugging Face token (`HF_TOKEN`) with access to the models you use
- The Chunkformer RNNT checkpoint at `chunkformer-model` (already copied into this folder). Update `CHUNKFORMER_MODEL_PATH` if you place it elsewhere.

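The prerequisites above can be checked before starting the server. The following is a minimal sketch, not part of this package: `preflight` and `cuda_available` are hypothetical helper names, and the fallback path `chunkformer-model` is assumed from the sample configuration in this README.

```python
import os
import sys
from pathlib import Path


def preflight(env=None, min_python=(3, 9)):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []

    # Python version listed under Prerequisites.
    if sys.version_info < min_python:
        problems.append("Python %d.%d+ is required" % min_python)

    # The Hugging Face token must be present and non-empty.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set")

    # Assumed default: the checkpoint directory shipped in this folder.
    model_dir = Path(env.get("CHUNKFORMER_MODEL_PATH", "chunkformer-model"))
    if not model_dir.is_dir():
        problems.append(f"checkpoint directory not found: {model_dir}")

    return problems


def cuda_available():
    """Report whether PyTorch can see a CUDA device (False if torch is missing)."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()
```

Calling `preflight()` with no arguments reads the current process environment; printing the returned list before launching gives an early, readable failure instead of an error deep inside model loading.
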
## Setup

```bash
cd Speech2Invoice
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Configure environment

Create a `.env` file (or export the environment variables) with at least:

```
PORT=8000
USE_NGROK=false
HF_TOKEN=your_hf_token_here
CHUNKFORMER_MODEL_PATH=chunkformer-model
LOG_LEVEL=DEBUG
DEBUG=true

# Optional ngrok
NGROK_AUTHTOKEN=
NGROK_REGION=ap

# Optional invoice LLM overrides (defaults are fast)
IE_LLM_MODEL_ID=Qwen/Qwen1.5-7B-Chat
IE_MAX_NEW_TOKENS=256
IE_DO_SAMPLE=false
IE_TEMPERATURE=0.0
IE_TOP_P=0.8
```

If you move the model elsewhere, set `CHUNKFORMER_MODEL_PATH` to that directory.

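To illustrate how these variables map to typed settings, here is a rough sketch of a loader. It is not the code in `Server_conformer.py`: `ServerConfig`, `load_config`, and `as_bool` are hypothetical names, and the fallback values simply mirror the sample above, since the server's real defaults are not documented here.

```python
import os
from dataclasses import dataclass


def as_bool(value):
    """Interpret common truthy spellings; anything else is False."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ServerConfig:
    port: int
    use_ngrok: bool
    model_path: str
    llm_model_id: str
    max_new_tokens: int
    do_sample: bool
    temperature: float
    top_p: float


def load_config(env=None):
    """Build a ServerConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Fallbacks below are assumptions copied from the sample .env.
    return ServerConfig(
        port=int(env.get("PORT", "8000")),
        use_ngrok=as_bool(env.get("USE_NGROK", "false")),
        model_path=env.get("CHUNKFORMER_MODEL_PATH", "chunkformer-model"),
        llm_model_id=env.get("IE_LLM_MODEL_ID", "Qwen/Qwen1.5-7B-Chat"),
        max_new_tokens=int(env.get("IE_MAX_NEW_TOKENS", "256")),
        do_sample=as_bool(env.get("IE_DO_SAMPLE", "false")),
        temperature=float(env.get("IE_TEMPERATURE", "0.0")),
        top_p=float(env.get("IE_TOP_P", "0.8")),
    )
```

Parsing everything in one place means a malformed value such as `PORT=abc` fails at startup with a clear `ValueError` rather than later during a request.
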
## Run

```bash
python3 Server_conformer.py
```

## Endpoints

- `POST /transcribe`: accepts `multipart/form-data` with an audio file (`wav`, `mp3`, `m4a`, `ogg`, or `webm`). Returns JSON with `final_result` and `full_transcription`.
- `POST /ticket`: accepts a JSON body `{"full_transcription": "<text>"}`. Returns the invoice JSON inferred by Qwen.

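A client for the two endpoints above might look like the following sketch, which uses only the Python standard library. Several details are assumptions rather than documented behavior: the multipart field name `file` is a guess (check `Server_conformer.py` for the name the server expects), and `build_ticket_request`, `build_transcribe_request`, and `send` are hypothetical helper names.

```python
import json
import urllib.request
import uuid


def build_ticket_request(base_url, transcript):
    """Prepare POST /ticket with the documented JSON body."""
    body = json.dumps({"full_transcription": transcript}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/ticket",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_transcribe_request(base_url, filename, audio_bytes, field="file"):
    """Prepare POST /transcribe as multipart/form-data.

    The field name "file" is an assumption, not taken from the server code.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/transcribe",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def send(request, timeout=300):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running locally on the sample port, a typical flow would be to call `send(build_transcribe_request("http://localhost:8000", "call.wav", audio_bytes))`, read `full_transcription` from the result, and pass that text to `send(build_ticket_request("http://localhost:8000", text))`. The generous timeout is a guess to allow for slow LLM inference.
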
## Notes

- The invoice extractor requires a GPU and downloads its model from Hugging Face on first run. Set `IE_LLM_MODEL_ID` to a smaller model for faster inference.
- The RNNT checkpoint weights are included in `chunkformer-model/`. If you plan to push to a remote, consider Git LFS for the large files.

## Contact

For questions or controlled access requests to Speech2Invoice:

* Duc Dat Pham
* Email: [ducdatit2002@gmail.com](mailto:ducdatit2002@gmail.com)