Update README.md

README.md CHANGED

@@ -2,6 +2,8 @@
 license: mit
 ---
 
+We propose SAMU-XLSR: a Semantically-Aligned Multimodal Utterance-level Cross-Lingual Speech Representation learning framework. Unlike previous works on speech representation learning, which learn multilingual contextual speech embeddings at the resolution of an acoustic frame (10-20ms), this work focuses on learning multimodal (speech-text) multilingual speech embeddings at the resolution of a sentence (5-10s), such that the embedding vector space is semantically aligned across different languages. We combine the state-of-the-art multilingual acoustic frame-level speech representation learning model XLS-R with the Language-Agnostic BERT Sentence Embedding (LaBSE) model to create an utterance-level multimodal multilingual speech encoder, SAMU-XLSR. Although we train SAMU-XLSR with only multilingual transcribed speech data, cross-lingual speech-text and speech-speech associations emerge in its learned representation space. To substantiate our claims, we use the SAMU-XLSR speech encoder in combination with a pre-trained LaBSE text sentence encoder for cross-lingual speech-to-text translation retrieval, and SAMU-XLSR alone for cross-lingual speech-to-speech translation retrieval. We highlight these applications by performing several cross-lingual text and speech translation retrieval tasks across several datasets.
+
 # Citation
 
 Please cite as:
@@ -17,5 +19,14 @@ Please cite as:
 pages={1493-1504},
 keywords={Training;Speech processing;Machine translation;Representation learning;Cross-lingual speech representation learning;Language-agnostic speech embedding;zero-shot speech-to-text translation retrieval;zero-shot speech-to-speech translation retrieval},
 doi={10.1109/JSTSP.2022.3192714}}
+
+@inproceedings{khurana2024cross,
+title={Cross-lingual transfer learning for low-resource speech translation},
+author={Khurana, Sameer and Dawalatabad, Nauman and Laurent, Antoine and Vicente, Luis and Gimeno, Pablo and Mingote, Victoria and Glass, James},
+booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
+pages={670--674},
+year={2024},
+organization={IEEE}
+}
 ```
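The zero-shot translation retrieval described in the added abstract reduces, once the encoders have done their work, to nearest-neighbour search in the shared embedding space: SAMU-XLSR embeds a spoken utterance, LaBSE (or SAMU-XLSR itself, for speech-to-speech) embeds the candidate translations, and the best candidate is the one with the highest cosine similarity. A minimal sketch of that retrieval step, using toy vectors as stand-ins for the real model embeddings:

```python
# Sketch of the retrieval step only: real SAMU-XLSR / LaBSE embeddings are
# replaced here by small hand-written vectors for illustration.
import numpy as np

def cosine_retrieve(query_emb: np.ndarray, candidate_embs: np.ndarray) -> int:
    """Return the index of the candidate with highest cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))

# Toy example: the second candidate points in nearly the same direction
# as the query, so it is retrieved.
query = np.array([1.0, 0.0, 0.0])            # stand-in for a speech embedding
candidates = np.array([[0.0, 1.0, 0.0],      # stand-ins for text embeddings
                       [0.9, 0.1, 0.0],
                       [0.0, 0.0, 1.0]])
best = cosine_retrieve(query, candidates)    # -> 1
```

Because both modalities land in the same semantically aligned space, the same function serves speech-to-text and speech-to-speech retrieval; only the encoder producing `candidate_embs` changes.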