Upload folder using huggingface_hub
- .gitattributes +1 -0
- README.md +1 -1
- assets/arc.png +3 -0
.gitattributes CHANGED

@@ -38,3 +38,4 @@ assets/moss-audio-2.png filter=lfs diff=lfs merge=lfs -text
 assets/moss-audio-image.png filter=lfs diff=lfs merge=lfs -text
 assets/moss-audio-logo.png filter=lfs diff=lfs merge=lfs -text
 assets/speech_caption_radar.png filter=lfs diff=lfs merge=lfs -text
+assets/arc.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED

@@ -85,7 +85,7 @@ Understanding audio requires more than simply transcribing words — it demands
 ## Model Architecture
 
 <p align="center">
-<img src="./assets/
+<img src="./assets/arc.png" width="95%" />
 </p>
 
 MOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by **MOSS-Audio-Encoder** into continuous temporal representations at **12.5 Hz**, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation.
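The encoder → adapter → LLM data flow described in the README paragraph above can be sketched as follows. This is a shape-level illustration only: the 12.5 Hz frame rate comes from the README, but the sample rate, feature dimensions, and random projections are hypothetical placeholders, not the real MOSS-Audio components.

```python
import numpy as np

# The 12.5 Hz rate is stated in the README; every other dimension here
# is an illustrative assumption, not the actual model configuration.
SAMPLE_RATE = 16_000                         # assumed input sample rate
FRAME_RATE = 12.5                            # encoder output rate (README)
DOWNSAMPLE = int(SAMPLE_RATE / FRAME_RATE)   # 1280 audio samples per frame
ENC_DIM, LLM_DIM = 512, 1024                 # placeholder feature sizes

rng = np.random.default_rng(0)

def encode(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the audio encoder: one ENC_DIM vector per 12.5 Hz frame."""
    n_frames = len(audio) // DOWNSAMPLE
    frames = audio[: n_frames * DOWNSAMPLE].reshape(n_frames, DOWNSAMPLE)
    # A real encoder is a deep network; a fixed random projection stands in.
    w = rng.standard_normal((DOWNSAMPLE, ENC_DIM)) / np.sqrt(DOWNSAMPLE)
    return frames @ w

def adapt(features: np.ndarray) -> np.ndarray:
    """Modality adapter: project encoder features into the LLM embedding space."""
    w = rng.standard_normal((ENC_DIM, LLM_DIM)) / np.sqrt(ENC_DIM)
    return features @ w

audio = rng.standard_normal(SAMPLE_RATE * 2)  # 2 seconds of synthetic audio
embeddings = adapt(encode(audio))
print(embeddings.shape)  # 2 s * 12.5 Hz = 25 frames, each LLM_DIM wide
```

The resulting `(25, 1024)` matrix is what would be interleaved with text-token embeddings before the LLM's auto-regressive decoding step.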
assets/arc.png ADDED (stored with Git LFS)