feat: pin encoder-only artifact for whisper-tiny-encoder (#1149)
README.md
ADDED
---
license: mit
tags:
- automatic-speech-recognition
- audio
- whisper
- ferrotorch
---

# `ferrotorch/whisper-tiny-encoder`
Encoder-only mirror of the `openai/whisper-tiny` audio encoder: a 4-layer, 6-head Transformer with `d_model=384`, `encoder_ffn_dim=1536`, `num_mel_bins=80`, and `max_source_positions=1500`. MIT-licensed. The artifact is pinned and encoder-only; the decoder and `proj_out` weights are dropped from this mirror. It serves as a real-artifact baseline for audio-encoder parity against `transformers` (#1149).

## Provenance

* Upstream: `openai/whisper-tiny` (MIT).
* Conversion script:
  [`ferrotorch/scripts/pin_pretrained_whisper_weights.py`](https://github.com/dollspace-gay/ferrotorch/blob/main/scripts/pin_pretrained_whisper_weights.py).
* Ferrotorch issue: <https://github.com/dollspace-gay/ferrotorch/issues/1149>.
* SHA-256 of `model.safetensors` (this file is pinned in
  `ferrotorch-hub/src/registry.rs`): `4ce29194b87ef05385203f8b09914f5c3b060200c2b503d6d420459ffb80a294`.
* Number of trainable parameters in the encoder slice: **8,208,384**.
* Config snapshot: `d_model=384`, `encoder_layers=4`, `encoder_attention_heads=6`,
  `encoder_ffn_dim=1536`, `num_mel_bins=80`, `max_source_positions=1500`,
  `activation_function='gelu'`.
* Non-encoder keys dropped from the upstream checkpoint (this mirror is
  encoder-only): 100 total, first few:
  `['model.decoder.embed_positions.weight', 'model.decoder.embed_tokens.weight', 'model.decoder.layer_norm.bias']`.
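The quoted parameter count can be cross-checked arithmetically. The sketch below is plain arithmetic, not ferrotorch code, and assumes the standard Hugging Face Whisper encoder layout: two front-end convolutions with bias, learned position embeddings, pre-norm Transformer layers whose `k_proj` carries no bias, and a final layer norm.

```rust
// Cross-check of the encoder parameter count for the config above.
// The function name is illustrative; this is not a ferrotorch API.
fn whisper_tiny_encoder_params() -> u64 {
    let (d, layers, ffn, mels, max_pos) = (384u64, 4u64, 1536u64, 80u64, 1500u64);

    let conv1 = mels * d * 3 + d; // Conv1d(80 -> 384, kernel 3) + bias
    let conv2 = d * d * 3 + d; // Conv1d(384 -> 384, kernel 3) + bias
    let positions = max_pos * d; // learned position embeddings

    // Per layer: q/k/v/out projections (k_proj has no bias in Whisper),
    // two-layer MLP with biases, and two LayerNorms (weight + bias each).
    let attn = 4 * d * d + 3 * d;
    let mlp = (d * ffn + ffn) + (ffn * d + d);
    let norms = 2 * (2 * d);

    // Front-end + positions + layers + final LayerNorm.
    conv1 + conv2 + positions + layers * (attn + mlp + norms) + 2 * d
}

fn main() {
    assert_eq!(whisper_tiny_encoder_params(), 8_208_384);
    println!("whisper-tiny encoder params: {}", whisper_tiny_encoder_params());
}
```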

## Value-parity probe

Three extra files are uploaded so the ferrotorch-side harness can
reproduce the parity verdict without re-running the upstream
Whisper model:

* `_value_parity_audio.bin` — deterministic synthetic 30-second
  audio (a sum of three sine waves under a slow envelope),
  16 kHz mono float32, shape `[1, 480000]`.
* `_value_parity_mel.bin` — `WhisperFeatureExtractor(audio)`
  output, `[1, 80, 3000]` float32, from the upstream feature
  extractor. The Rust-side `ferrotorch_whisper::audio` is
  compared against this.
* `_value_parity_encoder_output.bin` — float32 encoder hidden
  states `[1, 1500, 384]` from
  `WhisperModel.encoder(input_features=mel).last_hidden_state`.
  Format: `[u32 ndim][u32 × ndim shape][f32 × prod(shape)]`,
  little-endian (matches every other ferrotorch dump).
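As an illustration of the dump layout above, here is a self-contained, std-only round-trip sketch; `write_dump` and `read_dump` are hypothetical helper names for this example, not ferrotorch APIs.

```rust
// Round-trip the ferrotorch dump format:
// [u32 ndim][u32 × ndim shape][f32 × prod(shape)], all little-endian.
use std::io::{Read, Write};

fn write_dump<W: Write>(w: &mut W, shape: &[u32], data: &[f32]) -> std::io::Result<()> {
    w.write_all(&(shape.len() as u32).to_le_bytes())?; // ndim header
    for &dim in shape {
        w.write_all(&dim.to_le_bytes())?; // shape entries
    }
    for &x in data {
        w.write_all(&x.to_le_bytes())?; // flat f32 payload
    }
    Ok(())
}

fn read_dump<R: Read>(r: &mut R) -> std::io::Result<(Vec<u32>, Vec<f32>)> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    let ndim = u32::from_le_bytes(buf) as usize;
    let mut shape = Vec::with_capacity(ndim);
    for _ in 0..ndim {
        r.read_exact(&mut buf)?;
        shape.push(u32::from_le_bytes(buf));
    }
    let len: usize = shape.iter().map(|&d| d as usize).product();
    let mut data = Vec::with_capacity(len);
    for _ in 0..len {
        r.read_exact(&mut buf)?;
        data.push(f32::from_le_bytes(buf));
    }
    Ok((shape, data))
}

fn main() {
    let mut buf = Vec::new();
    write_dump(&mut buf, &[1, 2, 3], &[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]).unwrap();
    // 4 bytes ndim + 3 * 4 bytes shape + 6 * 4 bytes payload
    assert_eq!(buf.len(), 4 + 12 + 24);
    let (shape, data) = read_dump(&mut buf.as_slice()).unwrap();
    assert_eq!(shape, vec![1, 2, 3]);
    assert_eq!(data.len(), 6);
    println!("roundtrip ok: shape {:?}, {} values", shape, data.len());
}
```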

## How to load

```rust
use ferrotorch_whisper::{
    HfWhisperConfig, WhisperConfig, load_whisper_encoder,
};
use ferrotorch_hub::{HubCache, hf_download_model};

let cache = HubCache::with_default_dir();
let repo_dir = hf_download_model("ferrotorch/whisper-tiny-encoder", "main", &cache)?;
let hf_cfg = HfWhisperConfig::from_file(repo_dir.join("config.json"))?;
let cfg = WhisperConfig::from_hf(&hf_cfg)?;
let (encoder, _drop_report) = load_whisper_encoder::<f32>(
    &repo_dir.join("model.safetensors"),
    cfg,
    /* strict = */ false,
)?;
```

## Upstream license

```
MIT License

Copyright (c) 2022 OpenAI

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
```