drdraq committed
Commit c12ea79 · verified · 1 Parent(s): e3ac728

Upload folder using huggingface_hub

Files changed (6):
  1. .gitattributes +2 -0
  2. README.md +257 -3
  3. config.json +144 -0
  4. model.qora-stt +3 -0
  5. qora-stt.exe +3 -0
  6. tokenizer.json +0 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model.qora-stt filter=lfs diff=lfs merge=lfs -text
+ qora-stt.exe filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,257 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ - zh
+ - de
+ - es
+ - ru
+ - ko
+ - fr
+ - ja
+ - pt
+ - tr
+ - pl
+ - nl
+ - ar
+ - sv
+ - it
+ - id
+ - hi
+ - fi
+ - vi
+ - he
+ - uk
+ - el
+ - cs
+ - ro
+ - da
+ - hu
+ - ta
+ - "no"
+ - th
+ - ur
+ - hr
+ - bg
+ - lt
+ - la
+ - ml
+ - cy
+ - sk
+ - te
+ - fa
+ - lv
+ - bn
+ - sr
+ - az
+ - sl
+ - kn
+ - et
+ - mk
+ - br
+ - eu
+ - is
+ - hy
+ - ne
+ - mn
+ - bs
+ - kk
+ - sq
+ - sw
+ - gl
+ - mr
+ - pa
+ - si
+ - km
+ - sn
+ - yo
+ - so
+ - af
+ - oc
+ - ka
+ - be
+ - tg
+ - sd
+ - gu
+ - am
+ - yi
+ - lo
+ - uz
+ - fo
+ - ht
+ - ps
+ - tk
+ - nn
+ - mt
+ - sa
+ - lb
+ - my
+ - bo
+ - tl
+ - mg
+ - as
+ - tt
+ - haw
+ - ln
+ - ha
+ - ba
+ - jw
+ - su
+ license: mit
+ tags:
+ - speech-to-text
+ - stt
+ - whisper
+ - rust
+ - cpu-inference
+ - pure-rust
+ - no-python
+ - no-cuda
+ - automatic-speech-recognition
+ base_model: openai/whisper-tiny
+ pipeline_tag: automatic-speech-recognition
+ library_name: qora
+ ---
+
+ # QORA-STT - Pure Rust Speech-to-Text
+
+ Pure Rust inference engine for OpenAI's Whisper Tiny. No Python, no CUDA, no external dependencies. Single executable + binary weights = portable speech-to-text on any machine.
+
+ Based on **openai/whisper-tiny** (MIT License).
+
+ ## Quick Start
+
+ ```bash
+ # Transcribe an audio file (English)
+ qora-stt.exe --model-path . --load model.qora-stt --audio recording.wav
+
+ # Specify language
+ qora-stt.exe --model-path . --load model.qora-stt --audio recording.wav --language french
+
+ # Save transcription to file
+ qora-stt.exe --model-path . --load model.qora-stt --audio recording.wav --output transcript.txt
+ ```
+
+ ## Files
+
+ ```
+ model/
+   qora-stt.exe      2.5 MB   Inference engine (single binary)
+   model.qora-stt    144 MB   F32 weights (encoder + decoder)
+   config.json       2.0 KB   Model configuration
+   tokenizer.json    2.4 MB   Tokenizer (51,865 vocab)
+   README.md                  This file
+ ```
+
+ **No safetensors needed.** Everything loads from `model.qora-stt`.
+
+ ## Model Info
+
+ | Property | Value |
+ |----------|-------|
+ | **Base Model** | openai/whisper-tiny |
+ | **Parameters** | 39 million |
+ | **Type** | Encoder-decoder transformer |
+ | **Weights** | F32 (no quantization needed at 39M params) |
+ | **Weights File Size** | 144 MB |
+ | **Input** | WAV audio (any sample rate, auto-resampled to 16 kHz) |
+ | **Output** | Transcribed text |
+ | **Max Duration** | 30 seconds per chunk |
+ | **Languages** | 99 languages supported |
+
+ ## Architecture
+
+ | Component | Details |
+ |-----------|---------|
+ | **Encoder** | Conv1D stem (80 -> 384, stride 2) + 4 transformer layers |
+ | **Decoder** | 4 transformer layers with cross-attention to encoder |
+ | **Hidden Size** | 384 |
+ | **Attention Heads** | 6 (head_dim = 64) |
+ | **FFN Dimension** | 1,536 |
+ | **Vocabulary** | 51,865 tokens (BPE) |
+ | **Activation** | GELU |
+ | **Normalization** | LayerNorm with bias |
+ | **Mel Spectrogram** | 80 bins, n_fft=400, hop=160, 16 kHz |
+ | **Position Encoding** | Encoder: sinusoidal (stored), Decoder: learned |
+
+ ### Encoder
+ 1. **Conv1D stem**: Conv1 (80 -> 384, k=3, s=1) -> GELU -> Conv2 (384 -> 384, k=3, s=2) -> GELU
+ 2. Input: mel spectrogram `[80, 3000]` -> output `[1500, 384]`
+ 3. 4 transformer layers: LayerNorm -> self-attention (6 heads, full) -> residual -> LayerNorm -> FFN -> residual
+ 4. Final LayerNorm
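The shapes above can be sanity-checked with a little arithmetic: at 16 kHz with a hop of 160 samples, 30 s of audio gives 3000 mel frames, and the stride-2 second conv halves that to the 1500 encoder positions. A minimal sketch of that arithmetic (not engine code; conv padding of 1 is assumed, as in standard Whisper):

```rust
// Output length of a 1-D convolution for a given kernel, stride, and padding.
fn conv1d_out_len(len: usize, kernel: usize, stride: usize, pad: usize) -> usize {
    (len + 2 * pad - kernel) / stride + 1
}

fn main() {
    let sample_rate = 16_000;
    let hop = 160;
    let mel_frames = 30 * sample_rate / hop; // 30 s of 16 kHz audio -> 3000 frames
    // Conv1 (k=3, s=1, pad=1) preserves the length; Conv2 (k=3, s=2, pad=1) halves it.
    let after_conv1 = conv1d_out_len(mel_frames, 3, 1, 1);
    let after_conv2 = conv1d_out_len(after_conv1, 3, 2, 1);
    println!("{} mel frames -> {} encoder positions", mel_frames, after_conv2);
}
```

The result, 1500 positions, matches `max_source_positions` in config.json.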
+
+ ### Decoder (Autoregressive)
+ 1. Token + positional embedding
+ 2. 4 transformer layers, each with:
+    - Causal self-attention (with KV cache)
+    - Cross-attention to encoder output (cached once)
+    - FFN (384 -> 1536 -> 384)
+ 3. Output projection (tied with token embeddings)
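The self-attention KV cache in step 2 can be sketched as follows. This is an illustrative structure, not the engine's actual types; the 384-wide vectors mirror the hidden size above:

```rust
// Sketch of an incrementally growing self-attention KV cache (one per layer).
struct KvCache {
    keys: Vec<Vec<f32>>,   // one key vector per generated token
    values: Vec<Vec<f32>>, // one value vector per generated token
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    // Each decoding step appends only the new token's K/V instead of
    // recomputing the projections for the whole prefix.
    fn push(&mut self, k: Vec<f32>, v: Vec<f32>) {
        self.keys.push(k);
        self.values.push(v);
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::new();
    for step in 0..3 {
        cache.push(vec![0.0; 384], vec![0.0; 384]);
        // Causal attention at this step attends over `step + 1` cached positions.
        assert_eq!(cache.len(), step + 1);
    }
}
```

Cross-attention K/V work differently: the encoder output is fixed, so those projections are computed once up front and reused at every step.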
+
+ ## CLI Arguments
+
+ | Flag | Default | Description |
+ |------|---------|-------------|
+ | `--model-path <dir>` | `.` | Directory with config.json + tokenizer.json |
+ | `--load <path>` | -- | Load binary model (.qora-stt) |
+ | `--audio <wav>` | -- | Input WAV file to transcribe |
+ | `--language <name>` | english | Language name or code (e.g., "french", "fr") |
+ | `--output <path>` | -- | Write transcription to a text file |
+ | `--save <path>` | -- | Save binary model (for converting from safetensors) |
+ | `--help` | -- | Show help |
+
+ ## Supported Languages
+
+ 99 languages including: English, Chinese, German, Spanish, Russian, Korean, French, Japanese, Portuguese, Turkish, Polish, Dutch, Arabic, Swedish, Italian, Indonesian, Hindi, Finnish, Vietnamese, Hebrew, Ukrainian, Greek, Czech, Romanian, Danish, Hungarian, Tamil, Norwegian, Thai, Urdu, Croatian, Bulgarian, Lithuanian, Latin, Malayalam, Welsh, Slovak, Telugu, Persian, Latvian, Bengali, Serbian, Azerbaijani, Slovenian, Kannada, Estonian, Macedonian, Breton, Basque, Icelandic, Armenian, Nepali, Mongolian, Bosnian, Kazakh, Albanian, Swahili, Galician, Marathi, Punjabi, Sinhala, Khmer, Shona, Yoruba, Somali, Afrikaans, Occitan, Georgian, Belarusian, Tajik, Sindhi, Gujarati, Amharic, Yiddish, Lao, Uzbek, Faroese, Haitian, Pashto, Turkmen, Nynorsk, Maltese, Sanskrit, Luxembourgish, Myanmar, Tibetan, Tagalog, Malagasy, Assamese, Tatar, Hawaiian, Lingala, Hausa, Bashkir, Javanese, Sundanese.
+
+ ## Performance (i5-11500, 16 GB RAM, CPU-only)
+
+ | Phase | Time |
+ |-------|------|
+ | Model Load (binary) | ~92 ms |
+ | Mel Extraction | ~108 ms |
+ | Encoder (4 layers) | ~2.6 s |
+ | Cross-attention Cache | ~32 ms |
+ | Decoding | ~26 ms/token |
+ | **Total (6 s audio, 21 tokens)** | **~3.5 s** |
+ | Memory | ~144 MB |
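The total follows from the per-phase numbers. A quick arithmetic check (values copied from the table; the small remaining gap to ~3.5 s is presumably I/O and other overhead):

```rust
// Sum the per-phase timings from the performance table above.
fn main() {
    let load_ms = 92.0;
    let mel_ms = 108.0;
    let encoder_ms = 2600.0;
    let cross_cache_ms = 32.0;
    let per_token_ms = 26.0;
    let tokens = 21.0;
    let total_ms = load_ms + mel_ms + encoder_ms + cross_cache_ms + per_token_ms * tokens;
    // 92 + 108 + 2600 + 32 + 546 = 3378 ms, consistent with the reported ~3.5 s.
    println!("estimated total: {:.1} s", total_ms / 1000.0);
}
```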
+
+ ### Optimizations
+
+ - **Rayon parallelism**: GEMM rows parallelized across all CPU cores
+ - **Cache-friendly GEMM**: i-p-j loop order for sequential memory access
+ - **Parallel attention heads**: 6 heads computed concurrently
+ - **KV caching**: cross-attention K/V computed once, reused at every decoder step
+ - **Self-attention cache**: grows incrementally, no recomputation
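The i-p-j loop order can be illustrated in plain Rust. This is a sketch without the Rayon parallelism (in the engine, the outer row loop is presumably what gets parallelized); the point is that the inner `j` loop walks a row of B and a row of C sequentially, which is cache-friendly for row-major storage:

```rust
// C[i][j] += A[i][p] * B[p][j] in i-p-j order. For row-major matrices the
// innermost loop streams through B's row p and C's row i contiguously,
// unlike the naive i-j-p order, which strides down B's columns.
fn gemm_ipj(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
}

fn main() {
    // 2x2 example: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
    let a = [1.0, 2.0, 3.0, 4.0];
    let b = [5.0, 6.0, 7.0, 8.0];
    let mut c = [0.0f32; 4];
    gemm_ipj(&a, &b, &mut c, 2, 2, 2);
    assert_eq!(c, [19.0, 22.0, 43.0, 50.0]);
}
```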
+
+ ## Converting from Safetensors
+
+ If you have the original `openai/whisper-tiny` safetensors:
+
+ ```bash
+ # Download model
+ huggingface-cli download openai/whisper-tiny --local-dir whisper-tiny
+
+ # Convert to binary (runs one dummy transcription to trigger the save)
+ qora-stt.exe --model-path whisper-tiny --save model.qora-stt --audio some.wav
+ ```
+
+ After conversion, the safetensors files are no longer needed.
+
+ ## QORA Model Family
+
+ | Engine | Model | Params | Size | Purpose |
+ |--------|-------|--------|------|---------|
+ | **QORA** | SmolLM3-3B | 3.07B | 1.68 GB (Q4) | Text generation, reasoning, chat |
+ | **QORA-TTS** | Qwen3-TTS-12Hz | 0.6B/1.7B | 971 MB (Q4) | Text-to-speech synthesis |
+ | **QORA-STT** | Whisper Tiny | 39M | 144 MB (F32) | Speech-to-text transcription |
+ | **QORA-Image** | SDXS-512 | 350M | 350 MB | Text-to-image generation |
+
+ All engines are pure Rust, CPU-only, single-binary executables with no Python dependencies.
+
+ ## License
+
+ The QORA-STT inference engine is custom-built. The Whisper Tiny model weights are released under the [MIT License](https://github.com/openai/whisper/blob/main/LICENSE) by OpenAI.
+
+ ---
+
+ *Built with QORA - Pure Rust AI Inference*
config.json ADDED
@@ -0,0 +1,144 @@
+ {
+   "_name_or_path": "openai/whisper-tiny",
+   "activation_dropout": 0.0,
+   "activation_function": "gelu",
+   "architectures": ["WhisperForConditionalGeneration"],
+   "attention_dropout": 0.0,
+   "begin_suppress_tokens": [220, 50257],
+   "bos_token_id": 50257,
+   "d_model": 384,
+   "decoder_attention_heads": 6,
+   "decoder_ffn_dim": 1536,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 4,
+   "decoder_start_token_id": 50258,
+   "dropout": 0.0,
+   "encoder_attention_heads": 6,
+   "encoder_ffn_dim": 1536,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 4,
+   "eos_token_id": 50257,
+   "forced_decoder_ids": [[1, 50259], [2, 50359], [3, 50363]],
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "max_length": 448,
+   "max_source_positions": 1500,
+   "max_target_positions": 448,
+   "model_type": "whisper",
+   "num_hidden_layers": 4,
+   "num_mel_bins": 80,
+   "pad_token_id": 50257,
+   "scale_embedding": false,
+   "suppress_tokens": [1, 2, 7, 8, 9, 10, 14, 25, 26, 27, 28, 29, 31, 58, 59,
+     60, 61, 62, 63, 90, 91, 92, 93, 359, 503, 522, 542, 873, 893, 902, 918,
+     922, 931, 1350, 1853, 1982, 2460, 2627, 3246, 3253, 3268, 3536, 3846,
+     3961, 4183, 4667, 6585, 6647, 7273, 9061, 9383, 10428, 10929, 11938,
+     12033, 12331, 12562, 13793, 14157, 14635, 15265, 15618, 16553, 16604,
+     18362, 18956, 20075, 21675, 22520, 26130, 26161, 26435, 28279, 29464,
+     31650, 32302, 32470, 36865, 42863, 47425, 49870, 50254, 50258, 50358,
+     50359, 50360, 50361, 50362],
+   "torch_dtype": "float32",
+   "transformers_version": "4.27.0.dev0",
+   "use_cache": true,
+   "vocab_size": 51865
+ }
model.qora-stt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0cd03ac530d679e40f92141756ddb10a8ba744676405979c289c225b908af253
+ size 151043268
qora-stt.exe ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:95097c10edccf113defd53bee20fb8be3b04836186fbc5ac97ad8c319001ce7b
+ size 2615808
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff