Improve model card for AudioStory: add metadata and paper link
This PR improves the model card for AudioStory by adding key metadata and essential links.
Changes include:
- Adding `license: apache-2.0` based on the repository's license information.
- Adding `pipeline_tag: text-to-audio` as the model is designed for text-to-audio generation.
- Adding `library_name: transformers`, since the `config.json` and `tokenizer_config.json` files indicate compatibility with the Hugging Face Transformers library (e.g., Qwen2 and T5 architectures, the `Qwen2Tokenizer` class).
- Adding a direct link to the Hugging Face paper page: [AudioStory: Generating Long-Form Narrative Audio with Large Language Models](https://huggingface.co/papers/2508.20088).
- Updating relative image paths to absolute GitHub URLs for `audiostory.png` and `audiostory_framework.png` to ensure proper display on the Hub.
These enhancements will make the model more discoverable and easier to understand for users on the Hugging Face Hub.
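
For reviewers who want to reproduce the two mechanical changes locally (prepending the YAML metadata block and making image paths absolute), here is a minimal sketch. It is not the tooling used to produce this PR. `BASE_URL`, the helper names, and the sample README text are hypothetical placeholders; the actual GitHub URLs are the ones in the diff.

```python
import re

# Hypothetical base URL, for illustration only; substitute the real raw-file URL.
BASE_URL = "https://example.com/AudioStory/raw/main/"

# The three metadata keys this PR adds to the model card.
FRONT_MATTER = {
    "license": "apache-2.0",
    "pipeline_tag": "text-to-audio",
    "library_name": "transformers",
}

# Matches markdown images of the form ![alt](path)
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
# Matches paths that already carry a URL scheme such as https://
SCHEME_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def add_front_matter(readme: str, metadata: dict) -> str:
    """Prepend a YAML front-matter block unless the README already starts with one."""
    if readme.startswith("---\n"):
        return readme
    lines = [f"{key}: {value}" for key, value in metadata.items()]
    return "---\n" + "\n".join(lines) + "\n---\n\n" + readme


def absolutize_images(readme: str, base_url: str) -> str:
    """Rewrite relative markdown image paths to absolute URLs under base_url."""

    def replace(match: re.Match) -> str:
        alt, path = match.group(1), match.group(2)
        if SCHEME_PATTERN.match(path):
            return match.group(0)  # already absolute, leave untouched
        if path.startswith("./"):
            path = path[2:]  # drop a leading "./" only
        return f"![{alt}]({base_url}{path})"

    return IMAGE_PATTERN.sub(replace, readme)


if __name__ == "__main__":
    # Hypothetical README excerpt; the alt text is an assumption.
    sample = "# AudioStory\n\n![audiostory](audiostory.png)\n"
    print(absolutize_images(add_front_matter(sample, FRONT_MATTER), BASE_URL))
```

The sketch only handles markdown image syntax; HTML `<img>` or `<video>` tags would need separate handling.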
|
@@ -1,5 +1,13 @@
| 1 |
# AudioStory: Generating Long-Form Narrative Audio with Large Language Models
|
| 2 |
|
| 3 |
[[github]](https://github.com/TencentARC/AudioStory/)
|
| 4 |
|
| 5 |
✨ TL; DR: We propose a model for long-form narrative audio generation built upon a unified understanding–generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.
|
|
@@ -23,7 +31,7 @@
|
|
| 23 |
|
| 24 |
## 🔎 Introduction
|
| 25 |
|
| 26 |
-

|
| 27 |
|
| 28 |
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features:
|
| 29 |
|
|
@@ -51,17 +59,21 @@ Extensive experiments show the superiority of AudioStory on both single-audio ge
|
|
| 51 |
### 2. Cross-domain Video Dubbing (Tom & Jerry style)
|
| 52 |
|
| 53 |
<table class="center">
|
| 54 |
<td><video src="https://github.com/user-attachments/assets/e62d0c09-cdf0-4e51-b550-0a2c23f8d68d"></video></td>
|
| 55 |
-
<td><video src="https://github.com/user-attachments/assets/
|
| 56 |
<td><video src="https://github.com/user-attachments/assets/f2f7c94c-7f72-4cc0-8edc-290910980b04"></video></td>
|
| 57 |
<tr>
|
| 58 |
<td><video src="https://github.com/user-attachments/assets/d3e58dd4-31ae-4e32-aef1-03f1e649cb0c"></video></td>
|
| 59 |
-
<td><video src="https://github.com/user-attachments/assets/
|
| 60 |
<td><video src="https://github.com/user-attachments/assets/062236c3-1d26-4622-b843-cc0cd0c58053"></video></td>
|
| 61 |
<tr>
|
| 62 |
<td><video src="https://github.com/user-attachments/assets/8931f428-dd4d-430f-9927-068f2912dd36"></video></td>
|
| 63 |
-
<td><video src="https://github.com/user-attachments/assets/
|
| 64 |
-
<td><video src="https://github.com/user-attachments/assets/
|
| 65 |
<tr>
|
| 66 |
</table >
|
| 67 |
|
|
@@ -86,7 +98,7 @@ Extensive experiments show the superiority of AudioStory on both single-audio ge
|
|
| 86 |
|
| 87 |
## 🔎 Methods
|
| 88 |
|
| 89 |
-

|
| 90 |
|
| 91 |
To achieve effective instruction-following audio generation, the ability to understand the input instruction or audio stream and reason about relevant audio sub-events is essential. To this end, AudioStory adopts a unified understanding-generation framework (Fig.). Specifically, given textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs **interleaved reasoning generation**, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. These two types of tokens are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.
|
| 92 |
|
|
@@ -169,4 +181,4 @@ This repository is under the [Apache 2 License](https://github.com/mashijie1028/
|
|
| 169 |
|
| 170 |
If you have further questions, feel free to contact me: guoyuxin2021@ia.ac.cn
|
| 171 |
|
| 172 |
-
Discussions and potential collaborations are also welcome.
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
pipeline_tag: text-to-audio
|
| 4 |
+
library_name: transformers
|
| 5 |
+
---
|
| 6 |
+
|
| 7 |
# AudioStory: Generating Long-Form Narrative Audio with Large Language Models
|
| 8 |
|
| 9 |
+
This model is presented in the paper [AudioStory: Generating Long-Form Narrative Audio with Large Language Models](https://huggingface.co/papers/2508.20088).
|
| 10 |
+
|
| 11 |
[[github]](https://github.com/TencentARC/AudioStory/)
|
| 12 |
|
| 13 |
✨ TL; DR: We propose a model for long-form narrative audio generation built upon a unified understanding–generation framework, capable of handling video dubbing, audio continuation, and long-form narrative audio synthesis.
|
|
|
|
| 31 |
|
| 32 |
## 🔎 Introduction
|
| 33 |
|
| 34 |
+

|
| 35 |
|
| 36 |
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features:
|
| 37 |
|
|
|
|
| 59 |
### 2. Cross-domain Video Dubbing (Tom & Jerry style)
|
| 60 |
|
| 61 |
<table class="center">
|
| 62 |
+
<td><video src="https://github.com/user-attachments/assets/4089493c-2a26-4093-9709-0827c6dafcde"></video></td>
|
| 63 |
+
<td><video src="https://github.com/user-attachments/assets/67fafed1-2547-49ba-afaa-75fc7f9d58ca"></video></td>
|
| 64 |
+
<td><video src="https://github.com/user-attachments/assets/abbc9192-894c-49a2-9b55-8cc4852483c2"></video></td>
|
| 65 |
+
<tr>
|
| 66 |
<td><video src="https://github.com/user-attachments/assets/e62d0c09-cdf0-4e51-b550-0a2c23f8d68d"></video></td>
|
| 67 |
+
<td><video src="https://github.com/user-attachments/assets/38339d5b-b96a-4ffd-8607-c94eb254beb6"></video></td>
|
| 68 |
<td><video src="https://github.com/user-attachments/assets/f2f7c94c-7f72-4cc0-8edc-290910980b04"></video></td>
|
| 69 |
<tr>
|
| 70 |
<td><video src="https://github.com/user-attachments/assets/d3e58dd4-31ae-4e32-aef1-03f1e649cb0c"></video></td>
|
| 71 |
+
<td><video src="https://github.com/user-attachments/assets/ab7e46d5-f42c-472e-b66e-df786b658210"></video></td>
|
| 72 |
<td><video src="https://github.com/user-attachments/assets/062236c3-1d26-4622-b843-cc0cd0c58053"></video></td>
|
| 73 |
<tr>
|
| 74 |
<td><video src="https://github.com/user-attachments/assets/8931f428-dd4d-430f-9927-068f2912dd36"></video></td>
|
| 75 |
+
<td><video src="https://github.com/user-attachments/assets/4f68199f-e48a-4be7-b6dc-1acb8d377a6e"></video></td>
|
| 76 |
+
<td><video src="https://github.com/user-attachments/assets/736d22ca-6636-4ef0-99f3-768e4dfb112a"></video></td>
|
| 77 |
<tr>
|
| 78 |
</table >
|
| 79 |
|
|
|
|
| 98 |
|
| 99 |
## 🔎 Methods
|
| 100 |
|
| 101 |
+

|
| 102 |
|
| 103 |
To achieve effective instruction-following audio generation, the ability to understand the input instruction or audio stream and reason about relevant audio sub-events is essential. To this end, AudioStory adopts a unified understanding-generation framework (Fig.). Specifically, given textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs **interleaved reasoning generation**, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. These two types of tokens are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.
|
| 104 |
|
|
|
|
| 181 |
|
| 182 |
If you have further questions, feel free to contact me: guoyuxin2021@ia.ac.cn
|
| 183 |
|
| 184 |
+
Discussions and potential collaborations are also welcome.
|