lysanderism/TimeAudio

Large Audio Language Models

chukewang committed on Nov 13, 2025

Commit 574bfcc · 1 Parent(s): 2d5c348

Init

Files changed (1)

README.md +2 -2

@@ -31,13 +31,13 @@ To address this, we introduce TimeAudio, a novel method that empowers LALMs to c
 TimeAudio is based on the fundamental architecture of SALMONN. Specifically, TimeAudio is consists of four components: a sliding audio encoder, a window Q-former, a segment-level token merging module, and an LLM to process raw audio.
-<div align=center><img src="img/overview.png" height="100%" width="90%"/></div>
+<div align=center><img src="img/overview.png" height="100%" width="92%"/></div>
 ## Compare
 Compared with traditional speech and audio processing tasks such as speech recognition and audio caption, Example of failed cases by Qwen2-Audio and Qwen2-Audio-R1 on fine-grained tasks that require both semantics and timestamps as output.
-<div align=center><img src="img/case.png" height="100%" width="80%"/></div>
+<div align=center><img src="img/case.png" height="100%" width="75%"/></div>
 ## How to inference in CLI