chukewang
committed on
Commit 574bfcc · 1 Parent(s): 2d5c348
Init
README.md
CHANGED
@@ -31,13 +31,13 @@ To address this, we introduce TimeAudio, a novel method that empowers LALMs to c
 
 TimeAudio is based on the fundamental architecture of SALMONN. Specifically, TimeAudio is consists of four components: a sliding audio encoder, a window Q-former, a segment-level token merging module, and an LLM to process raw audio.
 
-<div align=center><img src="img/overview.png" height="100%" width="
+<div align=center><img src="img/overview.png" height="100%" width="92%"/></div>
 
 ## Compare
 
 Compared with traditional speech and audio processing tasks such as speech recognition and audio caption, Example of failed cases by Qwen2-Audio and Qwen2-Audio-R1 on fine-grained tasks that require both semantics and timestamps as output.
 
-<div align=center><img src="img/case.png" height="100%" width="
+<div align=center><img src="img/case.png" height="100%" width="75%"/></div>
 
 ## How to inference in CLI
 