---
license: apache-2.0
pipeline_tag: audio-text-to-text
library_name: transformers
language: en
datasets:
  - YirongSun/LLaSO-Align
  - YirongSun/LLaSO-Instruct
  - YirongSun/LLaSO-Eval
---

# LLaSO-Base-3.8B-Instruct: A Foundational Framework for Reproducible Research in Large Language and Speech Models

This repository contains LLaSO-Base-3.8B-Instruct, a 3.8B-parameter reference model from the LLaSO framework. LLaSO is introduced as the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling to advance reproducible research in the field of Large Speech-Language Models (LSLMs).

LLaSO-Base is trained exclusively on public data provided by the LLaSO framework, achieving a strong, reproducible baseline (normalized score of 0.72) for compositional speech-language understanding across 20 tasks.


## 🔍 What is LLaSO?

LLaSO is the first fully open, end-to-end stack for large-scale speech–language modeling, unifying data, evaluation, and modeling in one framework.

The framework provides three essential resources, plus this reference model:

- **LLaSO-Align** (12.0M): An ASR-based alignment corpus for grounding speech in the textual semantic space.
- **LLaSO-Instruct** (13.5M / 20 tasks / 3 modality configs): A multi-task instruction-tuning dataset covering linguistic, semantic, and paralinguistic objectives.
- **LLaSO-Eval** (15,044 instances): A reproducible benchmark for standardized evaluation, particularly of instruction following and cross-modality generalization.
- **LLaSO-Base** (3.8B): This model, a two-stage-trained reference model adapted from LLaVA-style architectures for robust compositional understanding.


LLaSO-Base achieves a strong normalized overall score on LLaSO-Eval across 20 tasks spanning linguistic, semantic, and paralinguistic categories.

## ✨ Key Features

- **Fully Open, End-to-End Stack**: Unified release of corpus, benchmark, and model, enabling open research and fair comparison in speech-language modeling.
- **25.5M Samples, 20 Tasks, 3 Modality Configurations**: Supports all major text ↔ audio combinations (text + audio, audio + text, pure audio) across linguistic, semantic, and paralinguistic tasks.
- **Stratified Evaluation (15,044 instances)**: Cohesive design between training and test sets enables systematic assessment of instruction following, cross-modality generalization, abstention rate, and stability.
- **Robust Reference Model (3.8B)**: Two-stage training (ASR alignment → instruction tuning) that is easy to reproduce and extend for further research.
- **Empirical Insights**: Broader task and modality coverage consistently yields stronger overall performance, but unseen modality/task configurations (especially pure audio) remain challenging; interleaving and parallel decoding strategies can close part of the gap.

## Architecture & Two-Stage Training

The architecture and two-stage training pipeline are illustrated in Figure 6 of the paper.

## 🚀 Usage

You can use this model with the transformers library. Here's a quick example for inference:

```python
import os

import librosa
import numpy as np
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model and processor. The llava_llama architecture is custom,
# so trust_remote_code=True may be required depending on your setup.
model_path = "YirongSun/LLaSO-Base-3.8B-Instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

# Example audio input (replace with your own audio file).
# For demonstration, create a 5-second dummy clip at 16 kHz.
dummy_audio_path = "dummy_audio.wav"
sr = 16000
duration = 5  # seconds
dummy_audio_data = (np.random.rand(sr * duration) * 0.5).astype(np.float32)
sf.write(dummy_audio_path, dummy_audio_data, sr)

# Load and preprocess the audio.
audio, rate = librosa.load(dummy_audio_path, sr=sr)
audio_inputs = processor(audio=audio, sampling_rate=rate, return_tensors="pt")

# Example text prompt. LLaSO models are Llama-3-based, so use the
# Llama-3 chat markup. The model's internal audio placeholder tokens are
# handled for you when `audio_values` are passed separately, as is common
# for multimodal models.
prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
    "Transcribe the audio.<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n"
)
text_inputs = processor(text=prompt, return_tensors="pt")

# Combine inputs.
inputs = {
    "input_ids": text_inputs.input_ids.to(model.device),
    "attention_mask": text_inputs.attention_mask.to(model.device),
    "audio_values": audio_inputs.audio_values.to(model.device),
}

# Generate a response.
with torch.inference_mode():
    outputs = model.generate(
        **inputs, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9
    )

# Decode and print.
decoded_output = processor.decode(outputs[0], skip_special_tokens=True)
print(f"Generated Text: {decoded_output}")

# Clean up the dummy audio file.
os.remove(dummy_audio_path)
```
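The Llama-3 chat markup in the prompt above can be assembled with a small helper; `llama3_prompt` is a hypothetical convenience function for illustration, not part of the repository:

```python
def llama3_prompt(user_message: str) -> str:
    """Build a single-turn Llama-3-style prompt ending at the assistant header,
    so generation continues as the assistant's reply."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n"
    )

prompt = llama3_prompt("Transcribe the audio.")
print(prompt)
```

Keeping the markup in one place avoids the easy-to-miss mistakes of dropping the trailing newline after the assistant header or forgetting the `<|eot_id|>` that closes the user turn.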

For more detailed usage, training instructions, and advanced evaluation scenarios, please refer to the LLaSO GitHub repository.
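The example above assumes 16 kHz input; `librosa.load(path, sr=16000)` already resamples for you. As an illustrative sketch of what that step does (naive linear interpolation, not a production resampler), assuming a 1-second 44.1 kHz clip:

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampling; fine as a sketch, but real
    resamplers (librosa, torchaudio) apply anti-aliasing filters first."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    x_old = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, audio).astype(np.float32)

one_second_44k = np.zeros(44100, dtype=np.float32)
audio_16k = resample_linear(one_second_44k, 44100, 16000)
print(len(audio_16k))  # 16000
```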

## 📑 How to Cite

If you use LLaSO in your research or applications, please cite our paper:

```bibtex
@misc{sun2025llaso,
      title={LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model},
      author={Yirong Sun and Yizhong Geng and Peidong Wei and Yanjun Chen and Jinghan Yang and Rongfei Chen and Wei Zhang and Xiaoyu Shen},
      year={2025},
      eprint={2508.15418},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.15418},
}
```