Upload omniASR-W2V-1B converted from fairseq2
Files changed:
- README.md +5 -3
- config.json +1 -1
README.md CHANGED
@@ -21,14 +21,16 @@ This is the **pre-trained encoder backbone without a CTC head**, suitable for fe
 | HF class | `Wav2Vec2Model` |
 | Encoder layers | 48 |
 | Hidden size | 1280 |
-| Attention heads |
+| Attention heads | 16 |
 | FFN intermediate | 5120 |
 | Source framework | fairseq2 |
 | Source card | `omniASR_W2V_1B` |
 
 
-
+Numerical parity against the original fairseq2 checkpoint has been confirmed: outputs match to within `atol=1e-4` on a held-out audio sample.
+
+Embedding statistics on the held-out audio clip: embedding shape (1, 175, 1280), max_abs_diff=0.00e+00, mean_diff=0.00e+00, std_diff=0.00e+00
 
 ## Usage
 
|
config.json CHANGED
@@ -67,7 +67,7 @@
   "mask_time_prob": 0.05,
   "model_type": "wav2vec2",
   "num_adapter_layers": 3,
-  "num_attention_heads":
+  "num_attention_heads": 16,
   "num_codevector_groups": 2,
   "num_codevectors_per_group": 320,
   "num_conv_pos_embedding_groups": 16,
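The parity figures added to the README (max_abs_diff, mean_diff, std_diff against the fairseq2 reference, checked at `atol=1e-4`) can be computed with a small comparison helper. The sketch below is an illustration of that check, not the conversion script's actual code; the `parity_stats` helper is hypothetical, and the `(1, 175, 1280)` shape is taken from the embedding statistics on the card.

```python
import numpy as np

def parity_stats(ref: np.ndarray, out: np.ndarray, atol: float = 1e-4) -> dict:
    """Compare converted-model outputs against a reference encoder output.

    Returns the same statistics reported on the model card.
    """
    diff = np.abs(ref - out)
    return {
        "max_abs_diff": float(diff.max()),
        "mean_diff": float(diff.mean()),
        "std_diff": float(diff.std()),
        "within_atol": bool(np.allclose(ref, out, atol=atol)),
    }

# Shape mirrors the card's held-out clip: (batch=1, frames=175, hidden=1280)
ref = np.random.randn(1, 175, 1280).astype(np.float32)
stats = parity_stats(ref, ref.copy())
print(stats)  # identical inputs give max_abs_diff == 0.0 and within_atol == True
```

In a real parity run, `ref` would be the fairseq2 encoder output and `out` the `Wav2Vec2Model` output on the same 16 kHz waveform.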