---
license: mit
---

## SparQLe – Speech Queries to Text via Instruction‑Tuned LLM ⚡

**What it does:**
SparQLe (Speech Routing to Query LLMs) enables direct speech-to-text understanding by aligning self‑supervised speech representations (e.g., HuBERT-like features) with instruction‑tuned Large Language Models (LLMs). A lightweight *modality adapter* bridges the two modalities, so the LLM does not have to be retrained in full.

**Key strengths:**

* **Preserves semantic content** of spoken input in the produced text
* **Efficiently leverages frozen SSL models**, avoiding heavy ASR backbones like Whisper
* **Modular design** with a query‑former (Q‑former) adapter and LLM backend

**Architecture:**

1. **Speech encoder** (SSL) transforms raw input into latent features.
2. **Modality adapter / Q‑former** aligns these with the LLM’s text embedding space.
3. **Instruction‑tuned LLM** processes the adapted input to generate semantic text.
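
The three stages above can be illustrated with a minimal, dependency-free sketch of the adapter idea: a fixed set of query vectors cross-attends over a variable-length sequence of speech features and is then projected to the LLM's embedding width. Everything in it is an assumption made for illustration, not the released implementation: the class name `ToyQueryAdapter`, the single attention head, the dimensions (16 and 32), the number of queries (4), and the random weights that stand in for trained parameters and for real SSL encoder output.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for trained weights or real features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class ToyQueryAdapter:
    """Hypothetical single-head cross-attention adapter, not the released model."""

    def __init__(self, num_queries, speech_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        self.queries = random_matrix(num_queries, speech_dim, rng)  # learned queries
        self.w_key = random_matrix(speech_dim, speech_dim, rng)
        self.w_value = random_matrix(speech_dim, speech_dim, rng)
        self.w_out = random_matrix(speech_dim, llm_dim, rng)  # projection to LLM width
        self.scale = 1.0 / math.sqrt(speech_dim)

    def __call__(self, speech_feats):
        keys = matmul(speech_feats, self.w_key)        # (frames, speech_dim)
        values = matmul(speech_feats, self.w_value)    # (frames, speech_dim)
        keys_t = [list(col) for col in zip(*keys)]     # (speech_dim, frames)
        scores = matmul(self.queries, keys_t)          # (num_queries, frames)
        attn = [softmax([s * self.scale for s in row]) for row in scores]
        pooled = matmul(attn, values)                  # (num_queries, speech_dim)
        return matmul(pooled, self.w_out)              # (num_queries, llm_dim)


# Step 1 stand-in: pretend a frozen SSL encoder produced 50 frames of 16-dim features.
speech_feats = random_matrix(50, 16, random.Random(1))

# Step 2: compress the variable-length sequence into 4 LLM-sized vectors.
adapter = ToyQueryAdapter(num_queries=4, speech_dim=16, llm_dim=32)
soft_prompt = adapter(speech_feats)

# Step 3 (not modeled): these vectors would be passed to the LLM as input embeddings.
print(len(soft_prompt), len(soft_prompt[0]))  # 4 32
```

The point of the sketch is the shape contract: however many frames the encoder emits, the adapter returns the same small number of vectors at the LLM's width, which is what lets the speech encoder and the LLM stay frozen while only the adapter is trained.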

## Citation

If you use SparQLe in your research, please cite:

```bibtex
@misc{djanibekov2025sparqlespeechqueriestext,
  title={SparQLe: Speech Queries to Text Translation Through LLMs},
  author={Amirbek Djanibekov and Hanan Aldarmaki},
  year={2025},
  eprint={2502.09284},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.09284},
}
```

📄 Read the full paper on arXiv: [https://arxiv.org/abs/2502.09284](https://arxiv.org/abs/2502.09284)

---

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## Acknowledgments

- This work builds upon [fairseq](https://github.com/facebookresearch/fairseq) 💙
- The Q‑former architecture is inspired by [BLIP-2](https://github.com/salesforce/BLIP-2) ✨