niobures committed on
Commit
abd74e2
·
verified ·
1 Parent(s): 5bd661b

Step-Audio (models: Step-Audio-AQAA)

models/Step-Audio-AQAA/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
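
The `.gitattributes` patterns route large binary artifacts through Git LFS. As a rough illustration, the sketch below checks which of the files added in this commit would match. It uses Python's `fnmatch` on the file's base name, which only approximates Git's attribute matching (an assumption; the directory pattern `saved_model/**/*` is deliberately left out). `LFS_PATTERNS` and `is_lfs_tracked` are hypothetical names introduced for this example.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the basename patterns from the .gitattributes diff.
LFS_PATTERNS = ["*.bin", "*.model", "*.pt", "*.safetensors", "*.tar.*", "*.zip", "*tfevents*"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: does the file's base name match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


for path in [
    "models/Step-Audio-AQAA/model-00001.safetensors",
    "models/Step-Audio-AQAA/step2_tokenizer_241028.model",
    "models/Step-Audio-AQAA/config.json",
]:
    print(path, is_lfs_tracked(path))
```

Under these assumptions the weight shards and the tokenizer model match, which is consistent with them appearing as three-line LFS pointers in this commit, while `config.json` is stored as plain text.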
models/Step-Audio-AQAA/README.md ADDED
@@ -0,0 +1,66 @@
+ ---
+ license: apache-2.0
+ ---
+ # Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model
+ **📚 Paper:** [Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model](https://arxiv.org/abs/2506.08967)
+
+ **🚀 Live Demo:** [![Try the Demo](https://img.shields.io/badge/StepFun-Audio-AQAA)](https://www.stepfun.com/docs/zh/step-audio-aqaa?studio_code=step-audio-aqaa&studio_id=121368403356246016&studio_type=1)
+
+ ## Model Overview
+ Step-Audio-AQAA is a fully end-to-end Large Audio-Language Model (LALM) designed for Audio Query-Audio Answer (AQAA) tasks. It directly processes audio inputs and generates natural, accurate speech responses without relying on traditional ASR and TTS modules, eliminating cascading errors and simplifying the system architecture.
+
+ ## Key Capabilities
+ - **Fully End-to-End Audio Interaction**: Generates speech outputs directly from raw audio inputs without ASR/TTS intermediates.
+ - **Fine-Grained Voice Control**: Supports sentence-level adjustments of emotional tone, speech rate, and other vocal features.
+ - **Multilingual & Dialect Support**: Covers Chinese (including the Sichuanese and Cantonese dialects), English, Japanese, and other languages.
+ - **Complex Task Handling**: Excels in speech emotion control, role-playing, logical reasoning, and other complex audio interactions.
+
+ ## Model Architecture
+ Step-Audio-AQAA consists of three core modules:
+
+ ### Dual-Codebook Audio Tokenizer
+ - **Linguistic Tokenizer**: Based on the Paraformer encoder; extracts phonemic and linguistic attributes using a 1,024-entry codebook at 16.7 Hz.
+ - **Semantic Tokenizer**: Follows the CosyVoice 1.0 design; captures acoustic features using a 4,096-entry codebook at 25 Hz.
+ - **Temporal Alignment**: Interleaves linguistic and semantic tokens at a 2:3 ratio, matching their 16.7 Hz and 25 Hz rates, to keep the two token streams temporally consistent.
+
+ ### Backbone LLM
+ - **Parameter Scale**: 130-billion-parameter multi-modal LLM (Step-Omni).
+ - **Architecture**: Decoder-only Transformer with RMSNorm layers and grouped-query attention.
+ - **Vocabulary Expansion**: Adds 5,120 audio tokens (1,024 linguistic + 4,096 semantic) to the text vocabulary for interleaved text-audio output.
+
+ ### Neural Vocoder
+ - **Architecture**: Flow-matching model based on CosyVoice, using U-Net and ResNet-1D layers.
+ - **Conditional Generation**: Generates high-fidelity speech waveforms conditioned solely on audio tokens.
+
+ ## Training Approach
+ ### Multi-Stage Training Pipeline
+ 1. **Pretraining**: Multi-modal pretraining on text, audio, and image data.
+ 2. **Supervised Fine-Tuning (SFT)**:
+    - Stage 1: Full-parameter update on the AQTA and AQTAA datasets.
+    - Stage 2: Optimizes specific capabilities with high-quality AQTAA data.
+ 3. **Direct Preference Optimization (DPO)**: Uses audio token masking to avoid degradation of speech generation.
+ 4. **Model Merging**: Weighted combination of SFT and DPO models to enhance overall performance.
+
+ ### Training Data
+ - **Multi-Modal Pretraining Data**: 800 billion text tokens and audio-text interleaved data.
+ - **AQTA Dataset**: Pairs of an audio query and a text answer.
+ - **AQTAA Dataset**: Triplets of an audio query, a text answer, and an audio answer, generated from AQTA.
+
+
+ ## Citation
+ ```bibtex
+ @misc{huang2025stepaudioaqaa,
+   title={Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model},
+   author={Ailin Huang and Boyong Wu and Bruce Wang and Chao Yan and Chen Hu and Chengli Feng and Fei Tian and Feiyu Shen and Jingbei Li and Mingrui Chen and others},
+   year={2025},
+   eprint={2506.08967},
+   archivePrefix={arXiv},
+   primaryClass={cs.SD}
+ }
+ ```
+
+ ## Team & Contributions
+ Step-Audio-AQAA is developed by the StepFun team, with contributions from multiple researchers and engineers. For technical support or collaboration, contact the corresponding authors: Daxin Jiang (djiang@stepfun.com), Shuchang Zhou (scotzhou@stepfun.com), Chen Hu (hatcher@stepfun.com).
+
+ ## License
+ This model is released under the Apache 2.0 license. For more details, please refer to the license file.
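
The 2:3 interleaving described in the README's tokenizer section follows from the two token rates: at 16.7 Hz a linguistic token spans about 60 ms and at 25 Hz a semantic token spans 40 ms, so two of the former and three of the latter both cover about 120 ms. A minimal sketch of such a merge is shown here. The order within each group (linguistic first) and the id offset applied to semantic tokens are assumptions made for illustration, and `interleave` is a hypothetical helper, not part of any released code.

```python
from typing import List, Tuple

LINGUISTIC_CODEBOOK = 1024  # codebook size quoted in the README
SEMANTIC_CODEBOOK = 4096    # codebook size quoted in the README


def interleave(linguistic: List[int], semantic: List[int],
               ratio: Tuple[int, int] = (2, 3)) -> List[int]:
    """Merge two token streams in repeating groups of 2 linguistic + 3 semantic ids.

    ASSUMPTION: semantic ids are shifted past the linguistic codebook so the
    two id ranges do not collide in a shared audio vocabulary.
    """
    n_ling, n_sem = ratio
    merged: List[int] = []
    i = j = 0
    while i < len(linguistic) or j < len(semantic):
        merged.extend(linguistic[i:i + n_ling])
        merged.extend(LINGUISTIC_CODEBOOK + t for t in semantic[j:j + n_sem])
        i += n_ling
        j += n_sem
    return merged


# Four linguistic and six semantic tokens cover the same ~240 ms of audio.
print(interleave([1, 2, 3, 4], [10, 20, 30, 40, 50, 60]))
# The combined audio vocabulary matches the 5,120 tokens the README mentions.
print(LINGUISTIC_CODEBOOK + SEMANTIC_CODEBOOK)
```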
models/Step-Audio-AQAA/config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "architectures": [
+     "MMGPTStep1ForCausalLMV4"
+   ],
+   "model_type": "mmgpt_step1_v2",
+   "hidden_size": 12288,
+   "intermediate_size": 31232,
+   "num_attention_heads": 96,
+   "num_attention_groups": 8,
+   "num_hidden_layers": 88,
+   "max_seq_len": 999999,
+   "vocab_size": 74752,
+   "rms_norm_eps": 1e-05,
+   "torch_dtype": "bfloat16",
+   "im_end_token": "<im_end>",
+   "im_patch_token": "<im_patch>",
+   "im_start_token": "<im_start>",
+   "image_token_len": 169,
+   "use_im_start_end": true,
+   "vision_select_layer": -1,
+   "understand_projector_stride": 2,
+   "vit_scale": 1.0,
+   "projector_bias": false,
+   "vision_tower_config": {
+     "hidden_size": 1792,
+     "output_hidden_size": 4096,
+     "image_size": 728,
+     "intermediate_size": 15360,
+     "num_attention_heads": 16,
+     "num_hidden_layers": 63,
+     "patch_size": 14
+   }
+ }
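
As a sanity check, the language-model fields in `config.json` can be turned into a rough parameter count and compared with the 130 billion quoted in the README. This is a back-of-the-envelope sketch, not the model's actual layout: it assumes a gated MLP with three projection matrices and separate input and output embeddings, and it ignores normalization weights, any biases, and the vision tower.

```python
# Values copied from the config.json diff.
hidden_size = 12288
intermediate_size = 31232
num_attention_heads = 96
num_attention_groups = 8
num_hidden_layers = 88
vocab_size = 74752

head_dim = hidden_size // num_attention_heads   # per-head width
kv_dim = num_attention_groups * head_dim        # shared K/V width under grouped-query attention

# Attention: Q and O are hidden x hidden; K and V are hidden x kv_dim.
attn_params = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim

# ASSUMPTION: gated MLP with three projections (gate, up, down).
mlp_params = 3 * hidden_size * intermediate_size

# ASSUMPTION: untied input embedding and output head.
embed_params = 2 * vocab_size * hidden_size

total = num_hidden_layers * (attn_params + mlp_params) + embed_params
print(f"head_dim={head_dim}, kv_dim={kv_dim}")
print(f"approx. language-model parameters: {total / 1e9:.1f}B")
```

Under these assumptions the estimate comes out at about 132 billion, in line with the README's figure.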
models/Step-Audio-AQAA/model-00001.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:401119d809d338614caa2adabe6e55f7d5aad1a7395bf624bdcdc14605de6181
+ size 9434733376
models/Step-Audio-AQAA/model-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:799f7fad3eb75123046be5a822337ed28a405c8136c5243e380d59241a6cd85b
+ size 9940651088
models/Step-Audio-AQAA/model-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bb21dfe9f67bde6d78db684419372fd245dcea5e167ad7b2b94ba613db78303b
+ size 9940676016
models/Step-Audio-AQAA/model-00004.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e3f6caa5db5606bed09b08a53b51fac44f94701880d5d0cf3cdda42c0370c46b
+ size 9991007912
models/Step-Audio-AQAA/model-00005.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:660af7ae96e6369b92a987327c343adc88706047d991338d890ab3fdde174850
+ size 9638661344
models/Step-Audio-AQAA/model-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2434a2806cc231904ee07bc7575c49c28e4f74f318fb041e37fb8bf2d540b1af
+ size 9940676048
models/Step-Audio-AQAA/model-00007.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:205072bb59dafe89f248968a26298b7f91a06e0150a66690ce99cbcd8bae686b
+ size 9991007944
models/Step-Audio-AQAA/model-00008.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c476603a3cd719dfc2832cfe9c312c459260f90dd6af8125b75e45a0604c8ae7
+ size 9638661352
models/Step-Audio-AQAA/model-00009.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d29cbfce6f0738a4431c6a885ddc4a01ca0831f56d7748d8277971ba268efda0
+ size 9940676048
models/Step-Audio-AQAA/model-00010.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f14f17011d5c36e84b0393611d62c9a26e6fb3a403bae1e26028b6377c183f63
+ size 9991007944
models/Step-Audio-AQAA/model-00011.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8ca4a5d29e943b4b4a7e6c11de86e44d40940c739f0fb8801030e86ac54f55d7
+ size 9638661352
models/Step-Audio-AQAA/model-00012.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bb1c87ec0c8bc78feba6bb39432bb009b040d2e9eb15d068fd96e824cb88e957
+ size 9940676048
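
Each shard is committed as a three-line Git LFS pointer (spec version, SHA-256 object id, byte size) rather than the weights themselves. The sketch below shows one way to parse such a pointer and check a downloaded file against it. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the example uses only the Python standard library.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Split a Git LFS pointer into its hash algorithm, hex digest, and byte size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file against the pointer's size and SHA-256 digest."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap size check before hashing gigabytes
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer text for model-00001.safetensors, copied from this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:401119d809d338614caa2adabe6e55f7d5aad1a7395bf624bdcdc14605de6181
size 9434733376
"""
pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

Reading the file in chunks keeps memory use flat, which matters for shards of roughly 9 to 10 GB each.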
models/Step-Audio-AQAA/model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
models/Step-Audio-AQAA/preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "do_center_crop": false,
+   "do_normalize": true,
+   "do_resize": true,
+   "feature_extractor_type": "CLIPFeatureExtractor",
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "resample": 3,
+   "size": 728,
+   "mode": "bilinear"
+ }
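
The `do_normalize`, `image_mean`, and `image_std` fields describe a per-channel normalization of the image input. The sketch below applies it to a single RGB pixel in plain Python. It assumes pixel values are first rescaled from 0-255 to 0-1, which is the usual CLIP-style convention but is not stated in this config, and `normalize_pixel` is a hypothetical helper. Note also that `"resample": 3` is the value Pillow uses for bicubic resampling, while `"mode"` says `bilinear`; which one the loader honors is not visible in this diff.

```python
from typing import Tuple

# Values copied from the preprocessor_config.json diff.
IMAGE_MEAN = (0.48145466, 0.4578275, 0.40821073)
IMAGE_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Rescale an 8-bit RGB pixel to [0, 1], then subtract the mean and divide by the std."""
    # ASSUMPTION: inputs are 0-255 integers and are rescaled by 1/255 first.
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD))


print(normalize_pixel((255, 255, 255)))  # brightest pixel, positive in every channel
print(normalize_pixel((0, 0, 0)))        # darkest pixel, negative in every channel
```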
models/Step-Audio-AQAA/source.txt ADDED
@@ -0,0 +1 @@
+ https://huggingface.co/stepfun-ai/Step-Audio-AQAA
models/Step-Audio-AQAA/step2_tokenizer_241028.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:22f3dc190340b02713be9cb43d5fcf7e3c01c302d407c4470def4bac5eac9fdc
+ size 1264015