---
license: apache-2.0
datasets:
- google/cvss
language:
- en
- fr
metrics:
- bleu
---
# NAST-S2X: A Fast and End-to-End Simultaneous Speech-to-Any Translation Model

<p align="center">
  <img src="https://github.com/ictnlp/NAST-S2x/assets/43530347/02d6dea6-5887-459e-9938-bc510b6c850c"/>
</p>

## Features
* 🤖 **An end-to-end model without intermediate text decoding**
* 💪 **Supports offline and streaming decoding of all modalities**
* ⚡️ **28× faster inference compared to autoregressive models**
## Examples
#### We present examples of French-to-English translation with chunk sizes of 320 ms and 2560 ms, and under offline conditions.
* With chunk sizes of 320 ms and 2560 ms, the model starts generating the English translation before the source speech is complete.
* In the simultaneous-interpretation examples, the left audio channel carries the input streaming speech and the right audio channel carries the simultaneous translation.

> [!NOTE]
> For a better experience, please wear headphones.

Chunk Size 320ms | Chunk Size 2560ms | Offline
:-------------------------:|:-------------------------:|:-------------------------:
<video src="https://github.com/ictnlp/NAST-S2x/assets/43530347/52f2d5c4-43ad-49cb-844f-09575ef048e0" width="100"></video> | <video src="https://github.com/ictnlp/NAST-S2x/assets/43530347/56475dee-1649-40d9-9cb6-9fe033f6bb32"></video> | <video src="https://github.com/ictnlp/NAST-S2x/assets/43530347/b6fb1d09-b418-45f0-84e9-e6ed3a2cea48"></video>

Source Speech Transcript | Reference Text Translation
:-------------------------:|:-------------------------:
Avant la fusion des communes, Rouge-Thier faisait partie de la commune de Louveigné. | before the fusion of the towns rouge thier was a part of the town of louveigne

> [!NOTE]
> For more examples, please check https://nast-s2x.github.io/.
## Performance
* ⚡️ **Lightning fast**: 28× faster inference with competitive quality in offline speech-to-speech translation
* 👩‍💼 **Simultaneous**: Achieves high-quality simultaneous interpretation within a delay of less than 3 seconds
* 🤖 **Unified framework**: Supports end-to-end text and speech generation in one model

**Check Details** 👇

Result figures for Offline-S2S, Simul-S2S, and Simul-S2T are available on the [project page](https://nast-s2x.github.io/).
## Architecture
<p align="center">
  <img src="https://github.com/ictnlp/NAST-S2x/assets/43530347/404cdd56-a9d9-4c10-96aa-64f0c7605248" width="800" />
</p>

* **Fully non-autoregressive:** Trained with the **CTC-based non-monotonic latent alignment loss [(Shao and Feng, 2022)](https://arxiv.org/abs/2210.03953)** and the **glancing mechanism [(Qian et al., 2021)](https://arxiv.org/abs/2008.07905)**.
* **Minimal human design:** Switches seamlessly between offline translation and simultaneous interpretation **by adjusting the chunk size**.
* **End-to-end:** Generates target speech **without** target text decoding.
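The chunk size is the single knob that moves the model between offline and simultaneous modes: the input speech is consumed in fixed-duration chunks, and a larger chunk trades latency for quality. As a minimal sketch of what fixed-duration chunking means (the function name is illustrative, not from the repo; 16 kHz input is assumed):

```python
def split_into_chunks(samples, chunk_ms=320, sample_rate=16000):
    """Split a waveform (sequence of samples) into fixed-duration chunks.

    The final chunk may be shorter than chunk_ms when the audio length
    is not a multiple of the chunk duration.
    """
    chunk_len = sample_rate * chunk_ms // 1000  # samples per chunk
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# One second of 16 kHz audio in 320 ms chunks: 5120, 5120, 5120, 640 samples.
chunks = split_into_chunks([0.0] * 16000, chunk_ms=320)
```

A very large chunk (longer than any utterance) degenerates to offline decoding, which is why no separate offline architecture is needed.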
# Sources and Usage
## Model
> [!NOTE]
> We release French-to-English speech-to-speech translation models trained on the CVSS-C dataset to reproduce the results in our paper. You can train models for your desired languages by following the instructions below.

[🤗 Model card](https://huggingface.co/ICTNLP/NAST-S2X)

| Chunk Size | Checkpoint | ASR-BLEU | ASR-BLEU (Silence Removed) | Average Lagging |
| ---------- | ---------- | -------- | -------------------------- | --------------- |
| 320ms | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/chunk_320ms.pt) | 19.67 | 24.90 | -393ms |
| 1280ms | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/chunk_1280ms.pt) | 20.20 | 25.71 | 3330ms |
| 2560ms | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/chunk_2560ms.pt) | 24.88 | 26.14 | 4976ms |
| Offline | [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/blob/main/Offline.pt) | 25.82 | - | - |

| Vocoder |
| ------- |
| [checkpoint](https://huggingface.co/ICTNLP/NAST-S2X/tree/main/vocoder) |
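The Average Lagging column measures how far generation trails an ideal interpreter that emits target tokens evenly across the source speech; a negative value (as in the 320 ms row) means generation finishes ahead of that oracle schedule. A simplified sketch of the metric for speech input, assuming per-token emission delays in milliseconds (the authoritative implementation is the latency scorer in the SimulEval fork referenced below):

```python
def average_lagging(delays_ms, source_duration_ms):
    """Simplified Average Lagging (in the style of Ma et al., 2019).

    delays_ms[i] is when target token i was emitted, measured in ms of
    source audio consumed. The average stops at the first token emitted
    at or after the end of the source speech.
    """
    num_tokens = len(delays_ms)
    # An ideal wait-free interpreter emits tokens evenly over the source.
    ms_per_token = source_duration_ms / num_tokens
    total, cutoff = 0.0, num_tokens
    for i, d in enumerate(delays_ms):
        total += d - i * ms_per_token  # lag behind the oracle schedule
        if d >= source_duration_ms:
            cutoff = i + 1
            break
    return total / cutoff

# Tokens emitted every 1000 ms of a 4000 ms utterance lag by 1000 ms on average.
al = average_lagging([1000, 2000, 3000, 4000], 4000)
```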
## Inference
> [!WARNING]
> Before executing the provided shell scripts, replace the variables in each file with the paths specific to your machine.

### Offline Inference
* **Data preprocessing**: Follow the instructions in the [document](https://github.com/ictnlp/NAST-S2x/blob/main/Preprocessing.md).
* **Generate acoustic units**: Execute [``offline_s2u_infer.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/test_scripts/offline_s2u_infer.sh).
* **Generate waveforms**: Execute [``offline_wav_infer.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/test_scripts/offline_wav_infer.sh).
* **Evaluation**: Use Fairseq's [ASR-BLEU evaluation toolkit](https://github.com/facebookresearch/fairseq/tree/main/examples/speech_to_speech/asr_bleu).
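ASR-BLEU works in two stages: the generated speech is transcribed with an ASR model (handled by the Fairseq toolkit linked above), and the transcript is scored against the reference translation with BLEU. As a self-contained illustration of the scoring stage only, here is a minimal corpus-level BLEU without smoothing (use sacrebleu or the toolkit itself for reported numbers):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Minimal corpus BLEU: brevity penalty times the geometric mean of
    modified n-gram precisions for n = 1..max_n (no smoothing)."""
    hyp_len = ref_len = 0
    matches = [0] * max_n
    totals = [0] * max_n
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in totals or 0 in matches:
        return 0.0  # some n-gram order has no candidates or no matches
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

# A perfect transcript scores 100.
score = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
```

Note that the ASR-BLEU numbers in the table above are lower-bounded by ASR errors as well as translation quality, which is why the silence-removed column differs.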
### Simultaneous Inference
* We use our customized fork of [``SimulEval: b43a7c``](https://github.com/Paulmzr/SimulEval/tree/b43a7c7a9f20bb4c2ff48cf1bc573b4752d7081e) to evaluate the model in simultaneous inference. This repository is built upon the official [``SimulEval: a1435b``](https://github.com/facebookresearch/SimulEval/tree/a1435b65331cac9d62ea8047fe3344153d7e7dac) and adds latency scorers.
* **Data preprocessing**: Follow the instructions in the [document](https://github.com/ictnlp/NAST-S2x/blob/main/Preprocessing.md).
* **Streaming generation and evaluation**: Execute [``streaming_infer.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/test_scripts/streaming_infer.sh).
## Train your own NAST-S2X
* **Data preprocessing**: Follow the instructions in the [document](https://github.com/ictnlp/NAST-S2x/blob/main/Preprocessing.md).
* **CTC pretraining**: Execute [``train_ctc.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/train_scripts/train_ctc.sh).
* **NMLA training**: Execute [``train_nmla.sh``](https://github.com/ictnlp/NAST-S2x/blob/main/train_scripts/train_nmla.sh).
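CTC pretraining optimizes an alignment-free objective: the model predicts a label per frame, and any frame labeling that collapses to the target counts as correct. As a minimal reminder of the CTC output convention this relies on (an illustrative sketch, not repo code), decoding merges consecutive repeats and then drops blanks:

```python
def ctc_collapse(frame_labels, blank=0):
    """Map per-frame CTC labels to an output unit sequence:
    merge consecutive repeated labels, then remove blanks."""
    out, prev = [], None
    for label in frame_labels:
        # Emit only on a change of label, and never emit the blank.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frames [1, 1, 0, 2, 0, 2, 2] collapse to [1, 2, 2]; the blank (0)
# between the 2s is what allows a genuine repeat in the output.
units = ctc_collapse([1, 1, 0, 2, 0, 2, 2])
```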
## Citing

Please kindly cite us if you find our papers or code useful.

```
@inproceedings{ma2024nonautoregressive,
  title={A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation},
  author={Ma, Zhengrui and Fang, Qingkai and Zhang, Shaolei and Guo, Shoutao and Feng, Yang and Zhang, Min},
  booktitle={Proceedings of ACL 2024},
  year={2024}
}

@inproceedings{fang2024ctcs2ut,
  title={CTC-based Non-autoregressive Textless Speech-to-Speech Translation},
  author={Fang, Qingkai and Ma, Zhengrui and Zhou, Yan and Zhang, Min and Feng, Yang},
  booktitle={Findings of ACL 2024},
  year={2024}
}
```