File size: 1,785 Bytes
04afb46 d1b27d2 04d7e86 04afb46 4aaa430 cffb5e1 4aaa430 026f52a 4aaa430 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |
---
license: apache-2.0
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Instruct
pipeline_tag: audio-classification
---
Leveraging the multi-dimensional fine-grained annotations produced by our pipeline, we introduce **FM-Speech**, built upon the frontier Qwen3-Omni (30B MoE) architecture.
> ๐๏ธ **Input:** Raw Speech Audio   โ   ๐ **Output:** 14-Dimension Fine-Grained Speech Attributes (Structured JSON)
To overcome modality gaps and text-conditioned hallucinations, FM-Speech is trained using a **Progressive Curriculum Fine-Tuning** framework, decoupling complex auditory comprehension into three incremental stages: Warm-up (MCQ/QA) --> Capability Ramp-up --> Final Alignment (Full JSON).
### ๐ Usage & Environment Setup
Our model is built upon the Qwen3-Omni architecture. We strongly recommend using **vLLM** for the inference and deployment of FM-Speech.
**Step 1: Create a fresh Python environment** to avoid runtime conflicts and incompatibilities.
```bash
conda create -n fmspeech python=3.12
conda activate fmspeech
```
**Step 2: Install required packages**
```bash
# Install vLLM (Specifically version 0.13.0)
pip install vllm==0.13.0
# Note: If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1,
# please use "pip install -e . -v" to build vLLM from source.
# Install Transformers and Accelerate
pip install transformers==4.57.3
pip install accelerate
# Install Qwen Omni utilities and Flash Attention
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```
**Step 3: Run Inference**
Prepare a sample audio file and run the inference script to generate the 14-dimension JSON output.
```bash
python infer.py
```
*(See `infer.py` in our repository for detailed loading and inference examples).*
--- |