---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
language: en
datasets:
- YirongSun/LLaSO-Align
- YirongSun/LLaSO-Instruct
- YirongSun/LLaSO-Eval
---
# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models
This repository contains LLaSO-Base-3.8B-Instruct, a 3.8B-parameter reference model from the LLaSO framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs).
LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.
- **Paper:** [LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model](https://arxiv.org/abs/2508.15418)
- **Code & Project Page:** https://github.com/EIT-NLP/LLaSO
## 🔍 What is LLaSO?
LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.
The framework provides three essential resources:
- LLaSO-Align (12.0M): An ASR-based alignment corpus for grounding speech in textual semantic space.
- LLaSO-Instruct (13.5M / 20 tasks / 3 modality configs): A multi-task instruction-tuning dataset across linguistic, semantic, and paralinguistic objectives.
- LLaSO-Eval (15,044): A reproducible benchmark for standardized evaluation, particularly for instruction-following and cross-modality generalization.
- LLaSO-Base (3.8B): This model, a two-stage trained reference model adapted from LLaVA-style architectures for robust compositional understanding.
LLaSO-Base achieves a normalized overall score of 0.72 on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.
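The normalized overall score aggregates per-task results into a single number. Below is a minimal sketch of one plausible aggregation, a macro-average of per-task scores already scaled to [0, 1]; the paper's exact normalization may differ, and the task names and scores shown are purely illustrative:

```python
def normalized_overall(task_scores: dict[str, float]) -> float:
    """Macro-average per-task scores that are already normalized to [0, 1].

    This is an illustrative sketch, not the official LLaSO-Eval scorer.
    """
    if not task_scores:
        raise ValueError("no task scores given")
    return sum(task_scores.values()) / len(task_scores)

# Illustrative per-task scores (not real LLaSO-Eval results)
scores = {"asr": 0.85, "emotion": 0.60, "speaker_id": 0.71}
print(round(normalized_overall(scores), 2))  # 0.72
```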
## ✨ Key Features
- Fully Open, End-to-End Stack: Unified release of corpus, benchmark, and model enabling open-source research and fair comparison in speech-language modeling.
- 25.5M Samples, 20 Tasks, 3 Modality Configurations: Supports all major text ↔ audio combinations (text + audio, audio + text, pure audio), covering linguistic, semantic, and paralinguistic tasks.
- Stratified Evaluation (15,044): Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
- Robust Reference Model (3.8B): Two-stage training (ASR alignment → instruction tuning), easily reproducible and extensible for further research.
- Empirical Insights: Broader task and modality coverage consistently leads to stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can bridge some gaps.
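The two-stage recipe (ASR alignment → instruction tuning) can be sketched as a freeze/unfreeze schedule. This is a hedged illustration assuming a LLaVA-style stack (audio encoder → projector → LLM) in which stage 1 trains only the projector on LLaSO-Align and stage 2 also unfreezes the LLM on LLaSO-Instruct; the module names and exact unfreezing policy here are assumptions for illustration, not the repository's actual API:

```python
# Illustrative sketch of a two-stage, LLaVA-style training schedule.
# Module names and the freezing policy are assumptions, not LLaSO's real API.

STACK = ["audio_encoder", "projector", "llm"]

def trainable_modules(stage: int) -> list[str]:
    """Stage 1 (ASR alignment): train only the projector, keeping the
    audio encoder and LLM frozen. Stage 2 (instruction tuning): unfreeze
    the LLM together with the projector."""
    if stage == 1:
        return ["projector"]
    if stage == 2:
        return ["projector", "llm"]
    raise ValueError(f"unknown stage: {stage}")

for stage, corpus in [(1, "LLaSO-Align"), (2, "LLaSO-Instruct")]:
    frozen = [m for m in STACK if m not in trainable_modules(stage)]
    print(f"stage {stage}: train {trainable_modules(stage)} on {corpus}, freeze {frozen}")
```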

## Architecture & Two-Stage Training

## 🚀 Usage
You can use this model with the `transformers` library. Here's a quick example for inference:
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
import librosa
import soundfile as sf
import os
import numpy as np

# Load model and processor
model_path = "YirongSun/LLaSO-Base-3.8B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Example audio input (replace with your own audio file).
# For demonstration, create a 5-second dummy waveform at 16 kHz.
dummy_audio_path = "dummy_audio.wav"
sr = 16000
duration = 5  # seconds
dummy_audio_data = (np.random.rand(sr * duration) * 0.5).astype(np.float32)
sf.write(dummy_audio_path, dummy_audio_data, sr)

# Load and preprocess the audio
audio, rate = librosa.load(dummy_audio_path, sr=sr)
audio_inputs = processor(audio=audio, sampling_rate=rate, return_tensors="pt")

# Build the text prompt. LLaSO models are Llama-3-based, so use the
# corresponding chat template. The audio is passed separately via
# `audio_values`, as is common in multimodal models, so the prompt
# carries only the text turn.
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Transcribe the audio.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
text_inputs = processor(text=prompt, return_tensors="pt")

# Combine inputs
inputs = {
    "input_ids": text_inputs.input_ids.to(model.device),
    "attention_mask": text_inputs.attention_mask.to(model.device),
    "audio_values": audio_inputs.audio_values.to(model.device),
}

# Generate a response
with torch.inference_mode():
    outputs = model.generate(
        **inputs, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9
    )

# Decode and print
decoded_output = processor.decode(outputs[0], skip_special_tokens=True)
print(f"Generated Text: {decoded_output}")

# Clean up the dummy audio file
os.remove(dummy_audio_path)
```
For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the [LLaSO GitHub repository](https://github.com/EIT-NLP/LLaSO).
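The example above assumes 16 kHz mono input. If your recordings are stereo or at a different sample rate, here is a dependency-light sketch of downmixing and resampling via linear interpolation; for real use, prefer `librosa.resample` or `torchaudio`, which apply proper anti-aliasing filters:

```python
import numpy as np

def to_16k_mono(audio: np.ndarray, orig_sr: int, target_sr: int = 16000) -> np.ndarray:
    """Downmix (channels, samples) audio to mono and resample via linear
    interpolation. A minimal numpy-only sketch; production code should use
    librosa.resample or torchaudio for anti-aliased resampling."""
    if audio.ndim == 2:  # (channels, samples) -> mono
        audio = audio.mean(axis=0)
    if orig_sr == target_sr:
        return audio.astype(np.float32)
    duration = audio.shape[0] / orig_sr
    n_target = int(round(duration * target_sr))
    t_old = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
    t_new = np.linspace(0.0, duration, num=n_target, endpoint=False)
    return np.interp(t_new, t_old, audio).astype(np.float32)

stereo = np.random.rand(2, 44100).astype(np.float32)  # 1 s of 44.1 kHz stereo
mono16k = to_16k_mono(stereo, 44100)
print(mono16k.shape)  # (16000,)
```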
## 📑 How to Cite
If you use LLaSO in your research or applications, please cite our paper:
```bibtex
@misc{sun2025llaso,
  title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
  author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
  year={2025},
  eprint={2508.15418},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.15418},
}
```