DewiBrynJones committed 78e4da6 (verified) · 1 parent: 358bfc0

Update README.md

README.md CHANGED (+70 −99)
@@ -1,113 +1,84 @@
  ---
  license: apache-2.0
- base_model: DewiBrynJones/wav2vec2-xlsr-53-ft-btb-cv-cy
  tags:
  - automatic-speech-recognition
- - ./data-configs/cv.json
  - generated_from_trainer
  metrics:
  - wer
  model-index:
  - name: wav2vec2-btb-cv-ft-cv-cy
    results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
  # wav2vec2-btb-cv-ft-cv-cy

- This model is a fine-tuned version of [DewiBrynJones/wav2vec2-xlsr-53-ft-btb-cv-cy](https://huggingface.co/DewiBrynJones/wav2vec2-xlsr-53-ft-btb-cv-cy) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.2516
- - Wer: 0.2403
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.0003
- - train_batch_size: 4
- - eval_batch_size: 64
- - seed: 42
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 1000
- - training_steps: 10000
- - mixed_precision_training: Native AMP
-
- ### Training results
-
- | Training Loss | Epoch | Step | Validation Loss | Wer |
- |:-------------:|:------:|:-----:|:---------------:|:------:|
- | No log | 0.1004 | 200 | 0.3807 | 0.2514 |
- | No log | 0.2008 | 400 | 0.2540 | 0.2643 |
- | 2.4874 | 0.3012 | 600 | 0.2642 | 0.3038 |
- | 2.4874 | 0.4016 | 800 | 0.3125 | 0.3905 |
- | 0.3991 | 0.5020 | 1000 | 0.3531 | 0.3939 |
- | 0.3991 | 0.6024 | 1200 | 0.3572 | 0.4039 |
- | 0.3991 | 0.7028 | 1400 | 0.3679 | 0.4053 |
- | 0.4512 | 0.8032 | 1600 | 0.3590 | 0.3877 |
- | 0.4512 | 0.9036 | 1800 | 0.3733 | 0.4007 |
- | 0.4333 | 1.0040 | 2000 | 0.3771 | 0.4243 |
- | 0.4333 | 1.1044 | 2200 | 0.3604 | 0.3867 |
- | 0.4333 | 1.2048 | 2400 | 0.3431 | 0.3814 |
- | 0.3468 | 1.3052 | 2600 | 0.3290 | 0.3779 |
- | 0.3468 | 1.4056 | 2800 | 0.3341 | 0.3647 |
- | 0.3503 | 1.5060 | 3000 | 0.3248 | 0.3615 |
- | 0.3503 | 1.6064 | 3200 | 0.3312 | 0.3551 |
- | 0.3503 | 1.7068 | 3400 | 0.3411 | 0.3836 |
- | 0.3418 | 1.8072 | 3600 | 0.3117 | 0.3375 |
- | 0.3418 | 1.9076 | 3800 | 0.3197 | 0.3432 |
- | 0.3181 | 2.0080 | 4000 | 0.3068 | 0.3340 |
- | 0.3181 | 2.1084 | 4200 | 0.3138 | 0.3358 |
- | 0.3181 | 2.2088 | 4400 | 0.3139 | 0.3334 |
- | 0.2423 | 2.3092 | 4600 | 0.3192 | 0.3285 |
- | 0.2423 | 2.4096 | 4800 | 0.2929 | 0.3168 |
- | 0.2327 | 2.5100 | 5000 | 0.2921 | 0.3103 |
- | 0.2327 | 2.6104 | 5200 | 0.2802 | 0.3037 |
- | 0.2327 | 2.7108 | 5400 | 0.2812 | 0.2962 |
- | 0.2374 | 2.8112 | 5600 | 0.2887 | 0.3042 |
- | 0.2374 | 2.9116 | 5800 | 0.2740 | 0.2927 |
- | 0.2136 | 3.0120 | 6000 | 0.2662 | 0.2830 |
- | 0.2136 | 3.1124 | 6200 | 0.2829 | 0.2890 |
- | 0.2136 | 3.2129 | 6400 | 0.2729 | 0.2869 |
- | 0.167 | 3.3133 | 6600 | 0.2777 | 0.2889 |
- | 0.167 | 3.4137 | 6800 | 0.2712 | 0.2810 |
- | 0.1614 | 3.5141 | 7000 | 0.2688 | 0.2709 |
- | 0.1614 | 3.6145 | 7200 | 0.2589 | 0.2663 |
- | 0.1614 | 3.7149 | 7400 | 0.2651 | 0.2670 |
- | 0.1529 | 3.8153 | 7600 | 0.2507 | 0.2637 |
- | 0.1529 | 3.9157 | 7800 | 0.2494 | 0.2568 |
- | 0.1496 | 4.0161 | 8000 | 0.2582 | 0.2580 |
- | 0.1496 | 4.1165 | 8200 | 0.2650 | 0.2575 |
- | 0.1496 | 4.2169 | 8400 | 0.2656 | 0.2560 |
- | 0.1128 | 4.3173 | 8600 | 0.2543 | 0.2512 |
- | 0.1128 | 4.4177 | 8800 | 0.2587 | 0.2499 |
- | 0.1109 | 4.5181 | 9000 | 0.2540 | 0.2460 |
- | 0.1109 | 4.6185 | 9200 | 0.2546 | 0.2425 |
- | 0.1109 | 4.7189 | 9400 | 0.2580 | 0.2420 |
- | 0.1028 | 4.8193 | 9600 | 0.2514 | 0.2404 |
- | 0.1028 | 4.9197 | 9800 | 0.2510 | 0.2403 |
- | 0.1069 | 5.0201 | 10000 | 0.2516 | 0.2403 |
-
- ### Framework versions
-
- - Transformers 4.44.0
- - Pytorch 2.4.0+cu121
- - Datasets 2.21.0
- - Tokenizers 0.19.1
  ---
  license: apache-2.0
+ base_model:
+ - techiaith/wav2vec2-xlsr-53-ft-btb-cv-cy
  tags:
  - automatic-speech-recognition
  - generated_from_trainer
  metrics:
  - wer
  model-index:
  - name: wav2vec2-btb-cv-ft-cv-cy
    results: []
+ datasets:
+ - techiaith/commonvoice_18_0_cy
+ language:
+ - cy
+ pipeline_tag: automatic-speech-recognition
  ---
  # wav2vec2-btb-cv-ft-cv-cy

+ This model is a version of [techiaith/wav2vec2-xlsr-53-ft-btb-cv-cy](https://huggingface.co/techiaith/wav2vec2-xlsr-53-ft-btb-cv-cy),
+ fine-tuned, with its encoder frozen, on the training set of [commonvoice_18_0_cy](https://huggingface.co/datasets/techiaith/commonvoice_18_0_cy).
+
+ It achieves the following results on the Welsh Common Voice version 18 standard test set:
+
+ - WER: 24.93
+ - CER: 6.55
+
+ However, when the accompanying KenLM language model is used, it achieves the following results on the same test set:
+
+ - WER: 15.30
+ - CER: 4.57
+
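For reference, WER and CER are word- and character-level edit-distance rates. The helper below is a minimal stdlib-only sketch of how such scores are computed; it is illustrative only and is not the evaluation code behind the figures above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("mae hi yn braf", "mae hi braf"))  # 0.25: one deleted word out of four
```

In practice a tested library such as `jiwer` or `evaluate` is preferable to hand-rolled metrics.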
+ ## Usage
+
+ ### wav2vec2 acoustic model only
+
+ ```python
+ import torch
+ import librosa
+
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ processor = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-btb-cv-ft-cv-cy")
+ model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-btb-cv-ft-cv-cy")
+
+ # load the recording and resample it to the 16 kHz rate the model expects
+ audio_file = "speech.wav"  # replace with the path to your own recording
+ audio, rate = librosa.load(audio_file, sr=16000)
+
+ inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ # greedy decoding: pick the most likely token at every frame
+ predicted_ids = torch.argmax(logits, dim=-1)
+
+ print("Prediction:", processor.batch_decode(predicted_ids))
+ ```
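`processor.batch_decode` turns the frame-level `predicted_ids` into text by applying the standard CTC collapse rule: merge consecutive repeated ids, then drop the blank token. A small illustrative sketch of that rule, using a made-up blank id of 0 (the real vocabulary and blank id live in the processor's tokenizer):

```python
def ctc_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame CTC label sequence: merge repeats, drop blanks."""
    collapsed, previous = [], None
    for current in frame_ids:
        # a label is emitted only when it differs from the previous frame
        # and is not the blank token
        if current != previous and current != blank_id:
            collapsed.append(current)
        previous = current
    return collapsed

# the blank between the two 7s lets CTC emit a genuinely doubled label
print(ctc_collapse([0, 7, 7, 0, 7, 3, 3, 0]))  # [7, 7, 3]
```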
+
+ ### With the language model
+
+ ```python
+ import torch
+ import librosa
+
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
+
+ processor = Wav2Vec2ProcessorWithLM.from_pretrained("techiaith/wav2vec2-btb-cv-ft-cv-cy")
+ model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-btb-cv-ft-cv-cy")
+
+ # load the recording and resample it to the 16 kHz rate the model expects
+ audio_file = "speech.wav"  # replace with the path to your own recording
+ audio, rate = librosa.load(audio_file, sr=16000)
+
+ inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)
+
+ with torch.no_grad():
+     logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+
+ # CTC beam-search decoding with the bundled KenLM language model
+ print("Prediction:", processor.batch_decode(logits.numpy()).text[0])
+ ```
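The WER improvement from the KenLM model comes from the beam-search decoder weighing each hypothesis by both its acoustic score and its language-model score. The toy sketch below illustrates only that scoring idea, with invented log-probabilities and a hypothetical LM weight `alpha`; the actual decoder (pyctcdecode, used internally by `Wav2Vec2ProcessorWithLM`) searches over many partial hypotheses rather than scoring whole sentences:

```python
# invented acoustic and LM log-probabilities for two candidate transcriptions
candidates = {
    "mae hi braf":    {"acoustic": -4.0, "lm": -9.0},  # acoustically likelier
    "mae hi yn braf": {"acoustic": -4.3, "lm": -5.0},  # likelier under the LM
}
alpha = 0.5  # hypothetical LM weight

def fused_score(text):
    """Shallow-fusion score: acoustic log-prob plus weighted LM log-prob."""
    s = candidates[text]
    return s["acoustic"] + alpha * s["lm"]

best = max(candidates, key=fused_score)
print(best)  # the LM tips the choice to the grammatical "mae hi yn braf"
```

This is why the language model roughly halves the WER: it rescues word sequences that are common in Welsh text even when the acoustics alone slightly favour another string.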