Step-Audio (models: Step-Audio-AQAA)
- models/Step-Audio-AQAA/.gitattributes +35 -0
- models/Step-Audio-AQAA/README.md +66 -0
- models/Step-Audio-AQAA/config.json +33 -0
- models/Step-Audio-AQAA/model-00001.safetensors +3 -0
- models/Step-Audio-AQAA/model-00002.safetensors +3 -0
- models/Step-Audio-AQAA/model-00003.safetensors +3 -0
- models/Step-Audio-AQAA/model-00004.safetensors +3 -0
- models/Step-Audio-AQAA/model-00005.safetensors +3 -0
- models/Step-Audio-AQAA/model-00006.safetensors +3 -0
- models/Step-Audio-AQAA/model-00007.safetensors +3 -0
- models/Step-Audio-AQAA/model-00008.safetensors +3 -0
- models/Step-Audio-AQAA/model-00009.safetensors +3 -0
- models/Step-Audio-AQAA/model-00010.safetensors +3 -0
- models/Step-Audio-AQAA/model-00011.safetensors +3 -0
- models/Step-Audio-AQAA/model-00012.safetensors +3 -0
- models/Step-Audio-AQAA/model.safetensors.index.json +0 -0
- models/Step-Audio-AQAA/preprocessor_config.json +19 -0
- models/Step-Audio-AQAA/source.txt +1 -0
- models/Step-Audio-AQAA/step2_tokenizer_241028.model +3 -0
models/Step-Audio-AQAA/.gitattributes
ADDED
@@ -0,0 +1,35 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
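Each pattern above routes matching files through Git LFS, so only a small pointer file is committed to the repository while the payload is stored out of band. As a rough illustration (not the exact gitattributes matcher — Python's `fnmatch` only approximates these simple globs), one can check which of this commit's files fall under the rules:

```python
from fnmatch import fnmatch

# A subset of the patterns from the .gitattributes above. Note that
# gitattributes globbing differs from fnmatch for patterns like
# "saved_model/**/*", but agrees for these simple suffix globs.
lfs_patterns = ["*.safetensors", "*.model", "*.bin", "*tfevents*"]

def routed_through_lfs(filename):
    """Return True if the filename matches any LFS-tracked pattern."""
    return any(fnmatch(filename, p) for p in lfs_patterns)

print(routed_through_lfs("model-00001.safetensors"))       # True
print(routed_through_lfs("step2_tokenizer_241028.model"))  # True
print(routed_through_lfs("README.md"))                     # False
```

This is why the `.safetensors` and tokenizer `.model` entries below contain only three-line pointer files instead of binary weights.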
models/Step-Audio-AQAA/README.md
ADDED
@@ -0,0 +1,66 @@
+---
+license: apache-2.0
+---
+# Step-Audio-AQAA: A Fully End-to-End Expressive Large Audio Language Model
+**📚 Paper:** [Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model](https://arxiv.org/abs/2506.08967)
+
+**🚀 Live Demo:** <https://www.stepfun.com/docs/zh/step-audio-aqaa?studio_code=step-audio-aqaa&studio_id=121368403356246016&studio_type=1>
+
+## Model Overview
+Step-Audio-AQAA is a fully end-to-end Large Audio-Language Model (LALM) designed for Audio Query-Audio Answer (AQAA) tasks. It processes audio inputs directly and generates natural, accurate speech responses without relying on separate ASR and TTS modules, eliminating cascading errors and simplifying the system architecture.
+
+## Key Capabilities
+- **Fully End-to-End Audio Interaction**: Generates speech outputs directly from raw audio inputs without ASR/TTS intermediates.
+- **Fine-Grained Voice Control**: Supports sentence-level adjustment of emotional tone, speech rate, and other vocal features.
+- **Multilingual & Dialect Support**: Covers Chinese (including Sichuanese and Cantonese), English, Japanese, and more.
+- **Complex Task Handling**: Excels at speech emotion control, role-playing, logical reasoning, and other complex audio interactions.
+
+## Model Architecture
+Step-Audio-AQAA consists of three core modules:
+
+### Dual-Codebook Audio Tokenizer
+- **Linguistic Tokenizer**: Based on the Paraformer encoder; extracts phonemic and linguistic attributes with a 1,024-entry codebook at 16.7 Hz.
+- **Semantic Tokenizer**: Based on CosyVoice 1.0; captures acoustic features with a 4,096-entry codebook at 25 Hz.
+- **Temporal Alignment**: Uses a 2:3 interleaving ratio to keep the two token streams temporally consistent.
+
+### Backbone LLM
+- **Parameter Scale**: 130-billion-parameter multi-modal LLM (Step-Omni).
+- **Architecture**: Decoder-only Transformer with RMSNorm layers and grouped-query attention.
+- **Vocabulary Expansion**: Adds 5,120 audio tokens to the text vocabulary to enable text-audio interleaved output.
+
+### Neural Vocoder
+- **Architecture**: Flow-matching model based on CosyVoice, using U-Net and ResNet-1D layers.
+- **Conditional Generation**: Generates high-fidelity speech waveforms conditioned solely on audio tokens.
+
+## Training Approach
+### Multi-Stage Training Pipeline
+1. **Pretraining**: Multi-modal pretraining on text, audio, and image data.
+2. **Supervised Fine-Tuning (SFT)**:
+   - Stage 1: Full-parameter update on AQTA and AQTAA datasets.
+   - Stage 2: Optimizes specific capabilities with high-quality AQTAA data.
+3. **Direct Preference Optimization (DPO)**: Uses audio token masking to avoid degrading speech generation.
+4. **Model Merging**: Weighted combination of the SFT and DPO models to improve overall performance.
+
+### Training Data
+- **Multi-Modal Pretraining Data**: 800 billion text tokens plus audio-text interleaved data.
+- **AQTA Dataset**: Audio query-text answer pairs.
+- **AQTAA Dataset**: Audio query-text answer-audio answer triplets generated from AQTA.
+
+## Citation
+```bibtex
+@misc{huang2025stepaudioaqaa,
+  title={Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model},
+  author={Ailin Huang and Boyong Wu and Bruce Wang and Chao Yan and Chen Hu and Chengli Feng and Fei Tian and Feiyu Shen and Jingbei Li and Mingrui Chen and et al.},
+  year={2025},
+  eprint={2506.08967},
+  archivePrefix={arXiv},
+  primaryClass={cs.SD}
+}
+```
+
+## Team & Contributions
+Step-Audio-AQAA is developed by the StepFun team, with contributions from many researchers and engineers. For technical support or collaboration, contact the corresponding authors: Daxin Jiang (djiang@stepfun.com), Shuchang Zhou (scotzhou@stepfun.com), or Chen Hu (hatcher@stepfun.com).
+
+## License
+This model is released under the Apache 2.0 license; see the license file for details.
models/Step-Audio-AQAA/config.json
ADDED
@@ -0,0 +1,33 @@
+{
+  "architectures": [
+    "MMGPTStep1ForCausalLMV4"
+  ],
+  "model_type": "mmgpt_step1_v2",
+  "hidden_size": 12288,
+  "intermediate_size": 31232,
+  "num_attention_heads": 96,
+  "num_attention_groups": 8,
+  "num_hidden_layers": 88,
+  "max_seq_len": 999999,
+  "vocab_size": 74752,
+  "rms_norm_eps": 1e-05,
+  "torch_dtype": "bfloat16",
+  "im_end_token": "<im_end>",
+  "im_patch_token": "<im_patch>",
+  "im_start_token": "<im_start>",
+  "image_token_len": 169,
+  "use_im_start_end": true,
+  "vision_select_layer": -1,
+  "understand_projector_stride": 2,
+  "vit_scale": 1.0,
+  "projector_bias": false,
+  "vision_tower_config": {
+    "hidden_size": 1792,
+    "output_hidden_size": 4096,
+    "image_size": 728,
+    "intermediate_size": 15360,
+    "num_attention_heads": 16,
+    "num_hidden_layers": 63,
+    "patch_size": 14
+  }
+}
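A few derived quantities follow directly from this config. The sketch below assumes `num_attention_groups` counts the KV groups of the grouped-query attention mentioned in the README, which is a plausible but unconfirmed reading of this custom config:

```python
import json

# The attention-related subset of the config.json above.
config = json.loads("""{
  "hidden_size": 12288,
  "num_attention_heads": 96,
  "num_attention_groups": 8,
  "num_hidden_layers": 88,
  "vocab_size": 74752
}""")

# Per-head dimension: hidden size split evenly across query heads.
head_dim = config["hidden_size"] // config["num_attention_heads"]
# Query heads sharing each KV group, assuming groups == KV heads.
heads_per_group = config["num_attention_heads"] // config["num_attention_groups"]
print(head_dim, heads_per_group)  # 128 12
```

The 74,752-entry vocabulary is consistent with a text vocabulary extended by the 5,120 audio tokens described in the README.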
models/Step-Audio-AQAA/model-00001.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:401119d809d338614caa2adabe6e55f7d5aad1a7395bf624bdcdc14605de6181
+size 9434733376
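Each shard committed here is a Git LFS pointer file: three `key value` lines giving the spec version, the SHA-256 of the actual payload, and its size in bytes. A minimal parser sketch (the helper name is ours, not part of any library):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:401119d809d338614caa2adabe6e55f7d5aad1a7395bf624bdcdc14605de6181
size 9434733376
"""
info = parse_lfs_pointer(pointer)
print(info["size"])             # 9434733376 (bytes)
print(int(info["size"]) / 1e9)  # roughly 9.43 GB for this shard
```

Summing the `size` lines of the twelve shards below gives the total on-disk weight footprint, which at bfloat16 is in line with the 130B-parameter scale stated in the README.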
models/Step-Audio-AQAA/model-00002.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:799f7fad3eb75123046be5a822337ed28a405c8136c5243e380d59241a6cd85b
+size 9940651088
models/Step-Audio-AQAA/model-00003.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bb21dfe9f67bde6d78db684419372fd245dcea5e167ad7b2b94ba613db78303b
+size 9940676016
models/Step-Audio-AQAA/model-00004.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e3f6caa5db5606bed09b08a53b51fac44f94701880d5d0cf3cdda42c0370c46b
+size 9991007912
models/Step-Audio-AQAA/model-00005.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:660af7ae96e6369b92a987327c343adc88706047d991338d890ab3fdde174850
+size 9638661344
models/Step-Audio-AQAA/model-00006.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2434a2806cc231904ee07bc7575c49c28e4f74f318fb041e37fb8bf2d540b1af
+size 9940676048
models/Step-Audio-AQAA/model-00007.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:205072bb59dafe89f248968a26298b7f91a06e0150a66690ce99cbcd8bae686b
+size 9991007944
models/Step-Audio-AQAA/model-00008.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c476603a3cd719dfc2832cfe9c312c459260f90dd6af8125b75e45a0604c8ae7
+size 9638661352
models/Step-Audio-AQAA/model-00009.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d29cbfce6f0738a4431c6a885ddc4a01ca0831f56d7748d8277971ba268efda0
+size 9940676048
models/Step-Audio-AQAA/model-00010.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f14f17011d5c36e84b0393611d62c9a26e6fb3a403bae1e26028b6377c183f63
+size 9991007944
models/Step-Audio-AQAA/model-00011.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:8ca4a5d29e943b4b4a7e6c11de86e44d40940c739f0fb8801030e86ac54f55d7
+size 9638661352
models/Step-Audio-AQAA/model-00012.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bb1c87ec0c8bc78feba6bb39432bb009b040d2e9eb15d068fd96e824cb88e957
+size 9940676048
models/Step-Audio-AQAA/model.safetensors.index.json
ADDED
(The diff for this file is too large to render.)
models/Step-Audio-AQAA/preprocessor_config.json
ADDED
@@ -0,0 +1,19 @@
+{
+  "do_center_crop": false,
+  "do_normalize": true,
+  "do_resize": true,
+  "feature_extractor_type": "CLIPFeatureExtractor",
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "resample": 3,
+  "size": 728,
+  "mode": "bilinear"
+}
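These are the standard CLIP normalization statistics: pixel values are scaled to [0, 1] and then normalized per channel as (x - mean) / std. A sketch of that arithmetic in pure Python (no actual image I/O, just one RGB pixel):

```python
# image_mean / image_std copied from the preprocessor_config.json above.
image_mean = [0.48145466, 0.4578275, 0.40821073]
image_std = [0.26862954, 0.26130258, 0.27577711]

def normalize_pixel(rgb_0_255):
    """Apply the config's normalization to a single [0, 255] RGB pixel."""
    scaled = [v / 255.0 for v in rgb_0_255]
    return [(s - m) / d for s, m, d in zip(scaled, image_mean, image_std)]

# Mid-gray lands close to zero in each channel, since the means are near 0.5.
mid_gray = normalize_pixel([128, 128, 128])
print([round(v, 4) for v in mid_gray])
```

With `"size": 728` and `"patch_size": 14` from config.json, a 728x728 input yields a 52x52 patch grid before the stride-2 projector, consistent with `"image_token_len": 169` (13x13) after downsampling.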
models/Step-Audio-AQAA/source.txt
ADDED
@@ -0,0 +1 @@
+https://huggingface.co/stepfun-ai/Step-Audio-AQAA
models/Step-Audio-AQAA/step2_tokenizer_241028.model
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:22f3dc190340b02713be9cb43d5fcf7e3c01c302d407c4470def4bac5eac9fdc
+size 1264015