ASLP-lab
/

FM-Speech

Audio Classification

Model card Files Files and versions

FM-Speech / README.md

ASLP-lab's picture

Update README.md

026f52a verified 7 days ago

|

history blame contribute delete

1.79 kB


	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen3-Omni-30B-A3B-Instruct
	pipeline_tag: audio-classification
	---
	Leveraging the multi-dimensional fine-grained annotations produced by our pipeline, we introduce FM-Speech, built upon the frontier Qwen3-Omni (30B MoE) architecture.

	> 🎙️ Input: Raw Speech Audio &emsp; ➔ &emsp; 📊 Output: 14-Dimension Fine-Grained Speech Attributes (Structured JSON)

	To overcome modality gaps and text-conditioned hallucinations, FM-Speech is trained using a Progressive Curriculum Fine-Tuning framework, decoupling complex auditory comprehension into three incremental stages: Warm-up (MCQ/QA) --> Capability Ramp-up --> Final Alignment (Full JSON).

	### 🚀 Usage & Environment Setup

	Our model is built upon the Qwen3-Omni architecture. We strongly recommend using vLLM for the inference and deployment of FM-Speech.

	Step 1: Create a fresh Python environment to avoid runtime conflicts and incompatibilities.
	```bash
	conda create -n fmspeech python=3.12
	conda activate fmspeech
	```

	Step 2: Install required packages
	```bash
	# Install vLLM (Specifically version 0.13.0)
	pip install vllm==0.13.0
	# Note: If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1,
	# please use "pip install -e . -v" to build vLLM from source.

	# Install Transformers and Accelerate
	pip install transformers==4.57.3
	pip install accelerate

	# Install Qwen Omni utilities and Flash Attention
	pip install qwen-omni-utils -U
	pip install -U flash-attn --no-build-isolation
	```

	Step 3: Run Inference
	Prepare a sample audio file and run the inference script to generate the 14-dimension JSON output.
	```bash
	python infer.py
	```
	(See `infer.py` in our repository for detailed loading and inference examples).

	---