Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +1 -1
- assets/arc.png +3 -0
.gitattributes CHANGED

@@ -38,3 +38,4 @@ assets/moss-audio-2.png filter=lfs diff=lfs merge=lfs -text
 assets/moss-audio-image.png filter=lfs diff=lfs merge=lfs -text
 assets/moss-audio-logo.png filter=lfs diff=lfs merge=lfs -text
 assets/speech_caption_radar.png filter=lfs diff=lfs merge=lfs -text
+assets/arc.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED

@@ -85,7 +85,7 @@ Understanding audio requires more than simply transcribing words — it demands
 ## Model Architecture
 
 <p align="center">
-<img src="./assets/
+<img src="./assets/arc.png" width="95%" />
 </p>
 
 MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by **MOSS-Audio-Encoder** into continuous temporal representations at **12.5 Hz**, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.
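The encoder → adapter → LLM data flow described in the README paragraph above can be sketched as follows. This is a shape-level illustration only: the 12.5 Hz frame rate comes from the README, but the sample rate, feature dimensions, and random projections are hypothetical placeholders, not the real MOSS-Audio components.

```python
import numpy as np

# The 12.5 Hz rate is stated in the README; every other dimension here
# is an illustrative assumption, not the actual model configuration.
SAMPLE_RATE = 16_000                         # assumed input sample rate
FRAME_RATE = 12.5                            # encoder output rate (README)
DOWNSAMPLE = int(SAMPLE_RATE / FRAME_RATE)   # 1280 audio samples per frame
ENC_DIM, LLM_DIM = 512, 1024                 # placeholder feature sizes

rng = np.random.default_rng(0)

def encode(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the audio encoder: one ENC_DIM vector per 12.5 Hz frame."""
    n_frames = len(audio) // DOWNSAMPLE
    frames = audio[: n_frames * DOWNSAMPLE].reshape(n_frames, DOWNSAMPLE)
    # A real encoder is a deep network; a fixed random projection stands in.
    w = rng.standard_normal((DOWNSAMPLE, ENC_DIM)) / np.sqrt(DOWNSAMPLE)
    return frames @ w

def adapt(features: np.ndarray) -> np.ndarray:
    """Modality adapter: project encoder features into the LLM embedding space."""
    w = rng.standard_normal((ENC_DIM, LLM_DIM)) / np.sqrt(ENC_DIM)
    return features @ w

audio = rng.standard_normal(SAMPLE_RATE * 2)  # 2 seconds of synthetic audio
embeddings = adapt(encode(audio))
print(embeddings.shape)  # 2 s * 12.5 Hz = 25 frames, each LLM_DIM wide
```

The resulting `(25, 1024)` matrix is what would be interleaved with text-token embeddings before the LLM's auto-regressive decoding step.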
assets/arc.png ADDED (stored with Git LFS)