---
license: apache-2.0
base_model:
- Qwen/Qwen3-Omni-30B-A3B-Instruct
pipeline_tag: audio-classification
---
Leveraging the multi-dimensional, fine-grained annotations produced by our pipeline, we introduce **FM-Speech**, built on the Qwen3-Omni (30B MoE) architecture.

> **Input:** Raw Speech Audio → **Output:** 14-Dimension Fine-Grained Speech Attributes (Structured JSON)

To overcome modality gaps and text-conditioned hallucinations, FM-Speech is trained with a **Progressive Curriculum Fine-Tuning** framework that decouples complex auditory comprehension into three incremental stages: Warm-up (MCQ/QA) → Capability Ramp-up → Final Alignment (Full JSON).

### Usage & Environment Setup

FM-Speech is built on the Qwen3-Omni architecture. We strongly recommend **vLLM** for inference and deployment.

**Step 1: Create a fresh Python environment** to avoid runtime conflicts and incompatibilities.
```bash
conda create -n fmspeech python=3.12
conda activate fmspeech
```

**Step 2: Install required packages**
```bash
# Install vLLM (pinned to version 0.13.0)
pip install vllm==0.13.0
# Note: if you hit an "undefined symbol" error while using VLLM_USE_PRECOMPILED=1,
# build vLLM from source instead by running "pip install -e . -v" in the vLLM source tree.

# Install Transformers (pinned) and Accelerate
pip install transformers==4.57.3
pip install accelerate

# Install Qwen Omni utilities and Flash Attention
pip install -U qwen-omni-utils
pip install -U flash-attn --no-build-isolation
```
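
Before running inference, it can help to confirm that the environment actually matches the pins above. The following is a minimal sketch, not part of the FM-Speech repository: the helper name `check_environment` and the `EXPECTED` table are hypothetical, and the check only compares installed distribution versions against the two pinned packages (`vllm`, `transformers`) while confirming that the unpinned ones are present.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above; None means "any version is fine".
EXPECTED = {
    "vllm": "0.13.0",
    "transformers": "4.57.3",
    "accelerate": None,
    "qwen-omni-utils": None,
    "flash-attn": None,
}


def check_environment(expected=EXPECTED, get_version=version):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, pinned in expected.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Only pinned packages are compared against an exact version.
        if pinned is not None and installed != pinned:
            problems.append(f"{name}: found {installed}, expected {pinned}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("\n".join(issues) if issues else "Environment matches the pinned versions.")
```

Running the script inside the `fmspeech` environment prints one line per mismatch, so a stale `transformers` or a missing `flash-attn` shows up before the model is loaded.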

**Step 3: Run Inference**
Prepare a sample audio file and run the inference script to generate the 14-dimension JSON output.
```bash
python infer.py
```
*(See `infer.py` in our repository for detailed loading and inference examples).*
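
Since the model's output is described as structured JSON with 14 dimensions, a small post-processing step can catch malformed generations early. The sketch below is an illustration under stated assumptions rather than code from `infer.py`: it assumes the generated text contains a single flat JSON object (possibly wrapped in a Markdown code fence), the helper name `parse_attributes` is hypothetical, and because the attribute names are not listed in this section it checks only the number of fields. The `dim_N` keys in the demo are placeholders, not the real attribute names.

```python
import json


def parse_attributes(raw_text, expected_fields=14):
    """Extract the JSON object from model output and check its field count.

    Raises ValueError if no object is found, the JSON is invalid, or the
    number of top-level keys differs from expected_fields.
    """
    # Take everything between the first "{" and the last "}" so that
    # surrounding prose or a ```json fence is ignored.
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw_text[start:end + 1])  # JSONDecodeError is a ValueError
    if len(data) != expected_fields:
        raise ValueError(f"expected {expected_fields} attributes, got {len(data)}")
    return data


if __name__ == "__main__":
    # Placeholder payload: the keys below are NOT the real FM-Speech attributes.
    fake_output = "```json\n" + json.dumps({f"dim_{i}": "n/a" for i in range(14)}) + "\n```"
    attributes = parse_attributes(fake_output)
    print(f"parsed {len(attributes)} attributes")
```

Once the actual attribute names are known, the count check could be tightened into a check against an explicit set of expected keys.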

---