AI & ML interests

Audio Tokenizer Speech enhancement Audio Generation

Recent Activity

Metacebertrunk  updated a Space about 12 hours ago
QuarkAudio/README
opsam  updated a model about 14 hours ago
QuarkAudio/HCodec-1.5-adaptive
bingong11  published a Space about 14 hours ago
QuarkAudio/README
View all activity

QuarkAudio: An Open-Source Project to Unify Audio Processing and Generation

Paper GitHub Demo Hugging Face ModelScope

Introduction

This project contains a series of works developed for audio (including speech, music, and general audio events) processing and generation, which helps reproducible research in the field of audio. The target of QuarkAudio is to explore a unified framework to handle different audio processing and generation tasks, including:

🚀 Key Highlights:

  • Unified & Prompt-Free: Handles multiple tasks without explicit instruction.
  • ⚙️ Decoder-only AR-LM Backbone: Leverages LLM-style autoregressive generation for speech token prediction.
  • 🔄 End-to-End Compatible: Integrates WavLM/Hubert (feature extractor), H-Codec (discrete codec), and LM into one pipeline.
  • 🌍 Multitask Support: SE, SR, TSE, SS, EDIT, VC, LASS, TTA, and more — all in a single model.

📄 Paper: arXiv:2510.20441 | 🎤 Listen: Demo Page | 🤗 Model: Hugging Face Spaces

📋 Supported Tasks

Task Full Name Status Description
SR Speech Restoration ⛳ supported Recover clean speech from corrupted inputs (e.g., noise, reverb, packet loss)
TSE Target Speaker Extraction ⛳ supported Extract target speaker using reference enrollment audio
SS Speech Separation ⛳ supported Separate mixed speakers or sound sources
VC Voice Conversion ⛳ supported Convert the speaker identity of input speech while preserving linguistic content
LASS Language-Queried Audio Source Separatio ⛳ supported Separate sound sources based on natural language queries (e.g., "remove the man's voice")
CODEC Audio Tokenization ⛳ supported Encode speech into compact discrete tokens and reconstruct high-fidelity audio via decoding
AE Audio Editing ⛳ supported Edit spoken content by inserting, deleting, or substituting words/phrases in the audio domain
TTA Text to Audio ⏳ Developing Generate speech or environmental sounds directly from text prompts (upcoming in next release)
AEC Acoustic Echo Cancellation ⏳ Developing Remove echo artifacts in teleconferencing scenarios (upcoming in next release)
  • more...

In addition to the frameworks for specific audio tasks, QuarkAudio also provides works involving neural audio codec (NAC), which is the fundamental module to combine audio modality with language models.

🚀 News

  • 2025/12/24: We release QuarkAudio, an Open-Source Project to Unify Audio Processing and Generation.arXiv. The code is publicly available at: QuarkAudio-HCodec, along with pretrained models and inference examples.
  • 2025/10/26: We release UniTok-Audio, The system supports target speaker extraction, universal speech enhancement, Speech Restoration, Voice Conversion, Language-Queried Audio Source Separation, Audio Tokenization,demo, arXiv. Code will comming soon.
  • 2025/09/22: We release UniSE, a foundation model for unified speech generation. The system supports target speaker extraction, universal speech enhancement.demo, arXiv. The code is publicly available at: QuarkAudio-UniSE, along with pretrained models and inference examples.

datasets 0

None public yet