Added Usage section to README

The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m).
Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.

Usage
=====

A pre-trained model can be loaded with the [`allophant`](https://github.com/kgnlp/allophant) package from a Hugging Face checkpoint or a local file:
```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant-shared", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios:
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database:
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```
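A multilingual inventory covers the phonemes of all selected languages, which is what makes it useful for code-switched speech. The idea can be sketched with plain Python set operations; the toy per-language inventories below are hypothetical illustrations, not entries from the Allophoible database:

```python
# Hypothetical toy inventories (illustrative only, not from Allophoible)
spanish = ['a', 'e', 'i', 'o', 'u', 'ɾ', 'ɲ']
italian = ['a', 'e', 'i', 'o', 'u', 'ʎ', 't̠ʃ']

# A multilingual inventory contains every phoneme of every selected language
combined = sorted(set(spanish) | set(italian))
print(combined)
```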
Audio files can then be loaded, resampled, and transcribed with the given inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of zero-padded single-channel audio, lengths, and language IDs
# The language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
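The one-token offset mentioned in the snippet above can be illustrated in isolation. The following is a minimal, self-contained sketch in plain Python (not the allophant API), assuming the `<BLANK>` label occupies index 0 so that CTC output token `i` maps to inventory entry `i - 1`:

```python
# Toy inventory; in practice this would be the decoded feature's value set
inventory = ['a', 'b', 'e', 'l', 'm']

def tokens_to_phonemes(tokens, inventory):
    """Map 1-based CTC token indices (0 = <BLANK>) to phoneme strings."""
    return [inventory[t - 1] for t in tokens if t != 0]

# A hypothetical decoded hypothesis: indices are offset by one
hypothesis_tokens = [2, 1, 4]
print("".join(tokens_to_phonemes(hypothesis_tokens, inventory)))  # prints "bal"
```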
Citation
========