---
license: apache-2.0
language:
- ru
- en
tags:
- automatic-speech-recognition
- speaker-diarization
- named-entity-recognition
- text-summarization
- onnx
- russian
- english
- asr
- gigaam
- whisper
- 3d-speaker
- camplus
- eres2net
- slovnet
- natasha
- navec
- mobile
- offline
library_name: onnx
---
# ProtocolVoice models
Offline models for the [ProtocolVoice](https://github.com/conwerter1/protocolvoice) Android app: voice transcription, speaker diarization, and on-device interview summarization.
All models run **on the device**, with no cloud calls.
## Contents
### Russian ASR
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `gigaam_v3_e2e_ctc_int8.onnx` | 305 MB | Russian ASR with built-in punctuation | [Sber/SaluteDevices GigaAM](https://github.com/salute-developers/GigaAM) (v3, e2e CTC, int8-quantized) | MIT |
### English ASR
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `en/whisper_base_en_encoder_int8.onnx` | 28 MB | Whisper base.en encoder | [openai/whisper](https://github.com/openai/whisper) via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) | MIT |
| `en/whisper_base_en_decoder_int8.onnx` | 125 MB | Whisper base.en decoder | OpenAI Whisper via sherpa-onnx | MIT |
| `en/whisper_base_en_tokens.txt` | 0.8 MB | Whisper tokens vocab | OpenAI Whisper | MIT |
### Speaker diarization (works for any language)
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `speaker_embedding_camplus.onnx` | 27 MB | Speaker embedding (CAM++), recommended default | [modelscope/3D-Speaker](https://github.com/modelscope/3D-Speaker) | Apache-2.0 |
| `speaker_embedding.onnx` | 111 MB | Speaker embedding (ERes2Net V1), best quality | [modelscope/3D-Speaker](https://github.com/modelscope/3D-Speaker) | Apache-2.0 |
| `speaker_embedding_v2.onnx` | 68 MB | Speaker embedding (ERes2NetV2) | [modelscope/3D-Speaker](https://github.com/modelscope/3D-Speaker) | Apache-2.0 |
### Russian summarization (Default tier: NER-based, no LLM)
| File | Size | Purpose | Original source | License |
|---|---|---|---|---|
| `summary/navec_news.tar` | 25 MB | Navec quantized word embeddings (250K Russian words, 300-dim, PQ-100) | [natasha/navec](https://github.com/natasha/navec) | MIT |
| `summary/slovnet_ner.tar` | 2.3 MB | Slovnet NER weights (WordCNN + CRF, PER/LOC/ORG) | [natasha/slovnet](https://github.com/natasha/slovnet) | MIT |
These two files together (28 MB total) enable offline Russian named entity recognition and LexRank-based extractive summarization. ProtocolVoice uses them to extract names, organizations, locations, and key quotes from interview transcripts. No LLM is required: the extraction is fully deterministic and works only with text that is already in the transcript.
### Manifest
| File | Size | Purpose |
|---|---|---|
| `manifest.json` | < 2 KB | SHA-256 hashes and metadata for all models |
## Important: attribution
These are NOT new models. This repository **redistributes existing models** in formats convenient for mobile delivery. The original authors retain all credit and copyright. We did not train, fine-tune, or modify the model weights.
**Please cite the original projects, not this redistribution:**
- **GigaAM-v3** (Russian ASR): Sber AI, SaluteDevices - https://github.com/salute-developers/GigaAM
- **Whisper** (English ASR): OpenAI - https://github.com/openai/whisper
- **3D-Speaker** (CAM++, ERes2Net, ERes2NetV2): ModelScope, Alibaba - https://github.com/modelscope/3D-Speaker
- **Slovnet NER + Navec**: Natasha project, Alexander Kukushkin - https://github.com/natasha/slovnet, https://github.com/natasha/navec
- **sherpa-onnx** (speech inference toolkit built on ONNX Runtime): Next-gen Kaldi (k2-fsa) - https://github.com/k2-fsa/sherpa-onnx
## Why this redistribution
The ProtocolVoice mobile app needs to download these models on first run from a mirror that:
- supports files larger than 100 MB without git-lfs limits,
- has a fast CDN that is reachable from Russia,
- is the conventional hosting platform for ML models.
All redistributed files retain their original licenses. This README serves as the required attribution under those licenses.
## How the app uses these models
ASR + diarization (loaded via [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx)):
1. App downloads `.onnx` files from `https://huggingface.co/protocolvoice/asr-models/resolve/main/{filename}`
2. Verifies SHA-256 against `manifest.json`
3. Loads via sherpa-onnx for offline inference
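Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the app's Kotlin code: the helper names `model_url`, `sha256_of_file`, and `is_intact` are hypothetical, and the expected hash is passed in as a plain string because the layout of `manifest.json` is not described here.

```python
import hashlib

# Base URL taken from step 1 above.
BASE_URL = "https://huggingface.co/protocolvoice/asr-models/resolve/main"


def model_url(filename: str) -> str:
    """Build the download URL for a file in this repository (hypothetical helper)."""
    return f"{BASE_URL}/{filename}"


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a large model is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_sha256: str) -> bool:
    """Compare the local file's hash with the expected value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Hashing in chunks keeps memory use flat even for the 305 MB GigaAM file, which matters more on a phone than on a desktop.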
Summarization (Default tier, custom Kotlin port):
1. App downloads `summary/navec_news.tar` and `summary/slovnet_ner.tar`
2. Extracts both `.tar` archives into the app's private files directory
3. Loads weights into a pure-Kotlin reimplementation of Slovnet NER (no PyTorch, no Python, just FloatArray math): WordEmbedding → ShapeEmbedding → 3-layer Conv1D → Linear → CRF Viterbi
4. Combines NER output with TF-IDF + LexRank to extract top quotes, named entities, risks, and numerical data
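The final CRF Viterbi stage in step 3 is a standard dynamic-programming decode. The sketch below shows the general technique in Python with plain lists. It is not the app's Kotlin code: the function name `viterbi` and the score layout are assumptions made for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence for a linear-chain CRF.

    emissions[t][j]   -- score of tag j at token position t
    transitions[i][j] -- score of moving from tag i to tag j
    (Both layouts are assumptions for this sketch.)
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []           # backpointers[t][j] = best previous tag for tag j
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            best_prev = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Walk back from the best final tag to recover the full path.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path
```

The returned indices would then be mapped back to tag labels (for example PER, LOC, ORG spans) by whatever tag vocabulary the weights were trained with.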
Inference performance on Xiaomi 12T: ~6 seconds for a 17,900-word transcript (default tier, NER + LexRank, no LLM).
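The TF-IDF + LexRank ranking from step 4 of the summarization pipeline can be illustrated with a small Python sketch. It assumes sentences are already tokenized and uses a simplified, thresholded variant (smoothed IDF, no self-loops, PageRank-style power iteration). The function names and default parameters are hypothetical and may differ from what the app does.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Turn tokenized sentences into sparse TF-IDF vectors (dict: term -> weight)."""
    n = len(sentences)
    doc_freq = Counter(term for sent in sentences for term in set(sent))
    vectors = []
    for sent in sentences:
        counts = Counter(sent)
        vectors.append({
            term: (count / len(sent)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by centrality in a thresholded similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = tfidf_vectors(sentences)
    # Undirected graph: connect two different sentences if they are similar enough.
    adjacency = [
        [1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in adjacency]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(adjacency[j][i] * scores[j] / degree[j]
                            for j in range(n) if degree[j])
            for i in range(n)
        ]
    return scores
```

Sentences with the highest scores are the ones most similar to many other sentences, which is why they make reasonable candidates for "top quotes".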
You can also use these files directly with the upstream libraries (sherpa-onnx, slovnet, navec) in any project that respects the original licenses.
## Verifying integrity
```python
import hashlib

# Print the SHA-256 of a downloaded model so it can be compared with manifest.json.
with open("gigaam_v3_e2e_ctc_int8.onnx", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())
# expected: 0aacb41f70f0f5aaac4b45dd430337b9e16b180f22c72af04db8516e7609c3c0
```
Hashes for all files are in `manifest.json`.
## Optional: Pro tier (QVikhr 1.5B)
ProtocolVoice has an optional **PRO tier** that produces a literary, narrative summary using [QVikhr-2.5-1.5B-Instruct-r](https://huggingface.co/Vikhrmodels/QVikhr-2.5-1.5B-Instruct-r) (1.0 GB GGUF, runs via llama.cpp on-device). The PRO tier is layered on top of the Default tier: Default extracts facts, and PRO turns them into a coherent narrative.
The QVikhr GGUF is **not hosted in this repo**; users download it on demand, directly from the Vikhrmodels HF org or from a separate mirror. The QVikhr authors retain copyright; please cite them, not us.
## License
This repository's metadata, README, and packaging scripts are released under **Apache-2.0**. Each model file remains under its original license (see the tables above). By using a model, you accept its original license, not just this repository's.
## Removal request
If you are an author of one of the upstream projects and have any concerns about this redistribution (attribution, hosting, anything else), please open a discussion on this Hugging Face repo or email the maintainers. The files will be amended or removed as requested.