---
license: mpl-2.0
language:
- en
base_model:
- KoelLabs/xlsr-timit-b0
tags:
- phoneme
---
This model is a conversion of [KoelLabs/xlsr-timit-b0](https://huggingface.co/KoelLabs/xlsr-timit-b0).

The original model card follows below.

---
# XLSR-TIMIT-B0: Fine-tuned on TIMIT for Phonemic Transcription

This model leverages the pretrained checkpoint [ginic/data_seed_4_wav2vec2-large-xlsr-buckeye-ipa](https://huggingface.co/ginic/data_seed_4_wav2vec2-large-xlsr-buckeye-ipa) and is fine-tuned on the [TIMIT DARPA English Corpus](https://github.com/philipperemy/timit) to transcribe English audio into phonemic representations.

All code is available on [GitHub](https://github.com/KoelLabs/ML/blob/main/notebooks/TIMIT_Finetune_Ginic.ipynb).

This model outperforms all current XLSR-based IPA transcription models for English.
**Performance**
- Training Loss: 1.254
- Validation Loss: 0.267
- Test Results (TIMIT test set):
  - Average Weighted Distance: 13.309375
  - Standard Deviation (Weighted Distance): 9.87
  - Average Character Error Rate (CER): 0.113
  - Standard Deviation (CER): 0.06

![image/png](https://cdn-uploads.huggingface.co/production/uploads/61dd07bafdc070745eed96fd/yXgLgGzlYvTMejmrhu2Y9.png)

**Model Information**
- Number of Epochs: 40
- Learning Rate: 8e-5
- Optimizer: Adam
- Dataset Used: TIMIT (DARPA English Corpus)

**Example Outputs**
1. **Prediction**: `lizteɪkðɪsdɹɾiteɪbklɔθiðiklinizfɹmi`  
   **Ground Truth**: `lizteɪkðɪsdɹɾiteɪbəklɔtiðiklinizfɹmi`  
   **Weighted Feature Edit Distance**: 7.875  
   **CER**: 0.0556

2. **Prediction**: `ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiɾimpɛɾikoʊts`  
   **Ground Truth**: `ɹænmʌðɹʔaʊtɹuhɹʔʌpɹɪŋiŋinpɛɾikoʊts`  
   **Weighted Feature Edit Distance**: 2.375  
   **CER**: 0.0588

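The CER values reported here are character-level Levenshtein edit distance divided by the length of the reference transcription. A minimal sketch of that metric (the authors' exact implementation may differ):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    # Dynamic-programming edit distance over characters
    prev = list(range(len(reference) + 1))
    for i, h in enumerate(hypothesis, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (h != r),  # substitution (0 if match)
            ))
        prev = curr
    return prev[-1] / len(reference)

print(round(cer("abcd", "abed"), 2))  # → 0.25
```

Applied to the first example output above, with the ground truth as reference, the prediction differs by one deletion and one substitution over 36 characters, i.e. 2/36 ≈ 0.0556, matching the reported value.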
## Limitations

This phonemic transcription model is fine-tuned on an English speech corpus that does not encompass all dialects and languages. We acknowledge that it may significantly underperform for any unseen languages. We aim to release models and datasets that better serve all populations and languages in the future.

---

# Usage

To transcribe audio files, this model can be used as follows:

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

# Load model and processor
model = AutoModelForCTC.from_pretrained("KoelLabs/xlsr-timit-b0")
processor = AutoProcessor.from_pretrained("KoelLabs/xlsr-timit-b0")

# Load the audio and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("path_to_your_audio_file.wav")  # Replace with your file
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Prepare input (the processor expects raw samples, not a file path)
input_values = processor(
    waveform.squeeze().numpy(), return_tensors="pt", sampling_rate=16000
).input_values

# Retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
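The `argmax` plus `batch_decode` step above amounts to greedy CTC decoding: repeated ids are collapsed, blank tokens are dropped, and the surviving ids are mapped to phoneme symbols. A standalone sketch of that collapse rule, using a toy vocabulary rather than the model's real one:

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Toy phoneme vocabulary; id 0 plays the role of the CTC blank token
vocab = {1: "k", 2: "æ", 3: "t"}
ids = [1, 1, 0, 2, 2, 2, 0, 0, 3]
print("".join(vocab[i] for i in ctc_greedy_collapse(ids)))  # → kæt
```

Note that a blank between two identical ids keeps both (e.g. `[1, 0, 1]` decodes to `[1, 1]`), which is how CTC represents genuinely repeated phonemes.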