Universal Audio Annotation Pipeline — self-contained model mirror

This repository is a complete, self-contained mirror of the LAION Universal Audio Annotation Pipeline: all model weights for every stage plus all the code needed to run it. If any of the upstream model repositories ever disappears, cloning this single repo gives you everything required to reproduce the pipeline end-to-end.

💻 Code (GitHub): https://github.com/LAION-AI/univeral-audio-annotation-pipeline
🌐 Live example predictions: https://laion-ai.github.io/univeral-audio-annotation-pipeline/predictions/ (also bundled here under predictions/)

What it does

Given any audio file (a movie scene, a podcast, a field recording…), the pipeline produces a single structured JSON annotation of everything audible, second by second across the whole clip:

- Speech: transcription, speaker diarization (who speaks when), language, accent, speaking rate, age, gender, voice timbre, and expressive captions for the speaker's emotion and speaking style (e.g. "clearly intense anger laced with wounded disappointment", "low conspiratorial whisper"); singing is flagged explicitly.
- Vocal bursts: laughs, gasps, sighs, screams, scoffs, etc., with emotion.
- Sound events: every non-speech sound and the musical/ambient background, with timestamps and loudness, covering the full timeline (silence is labelled too).

What it's good for

Building rich audio datasets and captions for TTS / audio-LM training, content understanding, media indexing, accessibility, emotion & paralinguistic research, and soundscape analysis.

What it produces

A JSON array of segments (speech, vocal_burst, sound_event, music) — see the output schema. A self-contained HTML report (base64 audio + annotations) can be generated for inspection; an example is in predictions/.
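
A quick way to inspect a prediction file is to group its segments by kind. The sketch below assumes each segment is a JSON object carrying its kind in a `type` field; that field name is a guess for illustration and should be checked against the output schema. The helper names are hypothetical.

```python
import json
from collections import defaultdict

# The four segment kinds named in this README.
SEGMENT_TYPES = ("speech", "vocal_burst", "sound_event", "music")


def load_prediction(path):
    """Load a prediction file and check that it is a JSON array."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of segments")
    return data


def group_segments(segments, type_key="type"):
    """Group segments by kind.

    `type_key` is an ASSUMED field name, not confirmed by the schema.
    Segments without that field land under "unknown".
    """
    groups = defaultdict(list)
    for seg in segments:
        groups[seg.get(type_key, "unknown")].append(seg)
    return dict(groups)


# Made-up segments, only to show the call shape.
example = json.loads('[{"type": "speech"}, {"type": "music"}, {"type": "speech"}]')
counts = {kind: len(items) for kind, items in group_segments(example).items()}
print(counts)
```

For a real file, `group_segments(load_prediction("clip_pred.json"))` would give one list per segment kind, provided the assumed field name matches.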

Architecture

┌───────────────────────────────────────────────────────────────────────┐
│                    INPUT: Audio File (any length)                      │
└───────────────────────────────────┬───────────────────────────────────┘
                                     │
        ┌──────────────────────────┴──────────────────┐
        ▼                                              ▼
  ┌──────────────┐                          ┌────────────────────┐
  │ VibeVoice    │                          │ Nemotron 3.5 ASR + │
  │ ASR          │                          │ Sortformer         │
  │ (diarization │                          │ (words + secondary │
  │  & timing    │                          │  diarization)      │
  │  authority)  │                          │                    │
  └──────┬───────┘                          └─────────┬──────────┘
         │   diarization / timing      words / what is said
         └──────────────┬──────────────────────┬───────┘
              ▼                        ▼
  ┌──────────────────────┐  ┌──────────────────────────────┐
  │ Whisper experts (x3) │  │ Specialist sound-event prepass│
  │ emotion · timbre ·   │  │ • SFX LoRA (MOSS-8B-Instruct  │
  │ speaking-style        │  │   + laion sfx-lora r=128)     │
  │ (per utterance)      │  │ • Vocal-burst locator @0.7    │
  └──────────┬───────────┘  │   + sound-effect captioner    │
             │              └───────────────┬──────────────┘
             └───────────────┬──────────────┘
                             ▼
        ┌──────────────────────────────────────────────┐
        │   Gemma-4-12B — TEXT-ONLY fusion (no audio)   │
        │  (+ pyannote overlap + DiCoW per-speaker ASR) │
        │  Nemotron words on VibeVoice timeline         │
        │  DETAILED sound-event + dedicated music caps  │
        │  legacy: MOSS-Audio-8B (audio, --fusion moss) │
        └───────────────────────┬──────────────────────┘
                                │  + deterministic gap-fill
                                ▼     (non-speech background only)
        ┌──────────────────────────────────────────────┐
        │   OUTPUT: Structured JSON (covers full clip)  │
        │   [speech · vocal_burst · sound_event · music]│
        └──────────────────────────────────────────────┘

Default configuration (Gemma-12B + DiCoW): Nemotron 3.5 supplies the words, VibeVoice and Sortformer supply diarization, pyannote detects overlapping speech, and DiCoW performs overlap-aware per-speaker ASR; a text-only Gemma-4-12B LLM then fuses these outputs (no audio reaches the final step). This is the highest-Reward pipeline configuration on SoundScape-Bench (Reward 0.253, rank 3 of all systems, roughly on par with Gemini 3.5 Flash). It trades precision for recall: its hallucination rate is about 43%, versus 27% for the audio MOSS configuration. The legacy audio MOSS-Audio-8B annotator stays available via --fusion moss. 🎧 Audio demo (20 samples vs ground truth) · 📊 Full model comparison
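
The diagram's deterministic gap-fill step is what lets the output cover the full clip. As a rough illustration of the idea only, the sketch below finds the stretches of the timeline that no segment covers and inserts filler entries there. The field names (`start`, `end`, `type`, `label`) and the filler label are assumptions, and the pipeline's actual gap-fill rule may differ.

```python
def find_gaps(segments, duration, min_gap=0.0):
    """Return (start, end) intervals of [0, duration] that no segment covers."""
    gaps = []
    cursor = 0.0  # end of the region covered so far
    for start, end in sorted((s["start"], s["end"]) for s in segments):
        if start - cursor > min_gap:
            gaps.append((cursor, min(start, duration)))
        cursor = max(cursor, end)
        if cursor >= duration:
            break
    if duration - cursor > min_gap:
        gaps.append((cursor, duration))
    return gaps


def fill_gaps(segments, duration, label="background"):
    """Add one filler sound_event per uncovered interval (hypothetical shape)."""
    fillers = [
        {"type": "sound_event", "start": s, "end": e, "label": label}
        for s, e in find_gaps(segments, duration)
    ]
    return sorted(list(segments) + fillers, key=lambda seg: seg["start"])


# Made-up timeline: two overlapping segments inside a 6-second clip.
example = [
    {"type": "speech", "start": 1.0, "end": 2.5},
    {"type": "sound_event", "start": 2.0, "end": 4.0},
]
gaps = find_gaps(example, 6.0)
filled = fill_gaps(example, 6.0)
```

Because the filler is computed from interval arithmetic alone, the step is deterministic: the same segments always yield the same gaps.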

What is mirrored here

`models/` — all weights

Default (Gemma-12B + DiCoW) models marked ⭐:

| Subfolder | Stage | Original repo |
|---|---|---|
| `vibevoice-asr` ⭐ | diarization / timing authority | `microsoft/VibeVoice-ASR` |
| `nemotron-3.5-asr-streaming-0.6b` ⭐ | words (default ASR) | `nvidia/nemotron-3.5-asr-streaming-0.6b` |
| `diar_sortformer_4spk-v1` ⭐ | diarization (Nemotron) | `nvidia/diar_sortformer_4spk-v1` |
| `pyannote-segmentation-3.0` ⭐ | overlap/segmentation | `pyannote/segmentation-3.0` |
| `pyannote-speaker-diarization-3.1` ⭐ | diarization pipeline config | `pyannote/speaker-diarization-3.1` |
| `pyannote-wespeaker-voxceleb-resnet34-LM` ⭐ | speaker embedding (for pyannote) | `pyannote/wespeaker-voxceleb-resnet34-LM` |
| `dicow-v3_3` ⭐ | overlap-aware per-speaker ASR | `BUT-FIT/DiCoW_v3_3` |
| `gemma-4-12b-it-gguf` ⭐ | TEXT-only final fusion (Q8 GGUF) | `unsloth/gemma-4-12b-it-GGUF` |
| `bud-e-whisper` ⭐ | voice: emotion | `laion/BUD-E-Whisper` |
| `timbre-whisper` ⭐ | voice: timbre | `laion/timbre-whisper` |
| `voice-tagging-whisper` ⭐ | voice: speaking style | `laion/voice-tagging-whisper` |
| `moss-audio-8b-instruct` ⭐ | SFX LoRA base | `OpenMOSS-Team/MOSS-Audio-8B-Instruct` |
| `moss-audio-sfx-lora-v4` ⭐ | SFX LoRA adapter | `laion/moss-audio-sfx-lora-v4` |
| `vocalburst-locator` ⭐ | vocal-burst locator | `laion/vocalburst-locator` |
| `sound-effect-captioning-whisper` ⭐ | SFX captioner | `laion/sound-effect-captioning-whisper` |
| `whisper-small` ⭐ | base for locator/captioner/experts | `openai/whisper-small` |
| `moss-audio-8b-thinking` | legacy final annotator (`--fusion moss`) | `OpenMOSS-Team/MOSS-Audio-8B-Thinking` |
| `parakeet-tdt-0.6b-v3` | legacy ensemble ASR | `nvidia/parakeet-tdt-0.6b-v3` |
| `qwen3-asr-1.7b` / `qwen3-forcedaligner-0.6b` | legacy ensemble ASR | `Qwen/Qwen3-ASR-1.7B` (+aligner) |

`code/` — everything needed to run

| Subfolder | Contents |
|---|---|
| `universal-audio-annotation-pipeline` | the pipeline repo (`default_pipeline/`, `pipeline/`, docs) |
| `MOSS-Audio` | MOSS-Audio source (`src.*`, required by the MOSS stages) |
| `VibeVoice` | VibeVoice source (the ASR modeling code) |

`predictions/` — an example report (open `predictions/index.html`)

How to run inference

Full setup and run instructions: docs/default_pipeline.md.

The three ASR packages pin incompatible transformers versions, so each ASR stage runs in its own virtual-env; the Whisper experts, SFX LoRA, vocal-burst pre-pass and MOSS annotator share a base env. The helper script builds all of them:

# from code/universal-audio-annotation-pipeline/default_pipeline
bash setup_environments.sh ./envs                 # 4 venvs + clones of the sources
export UAAP_MOSS_SRC="$(pwd)/envs/MOSS-Audio"     # (or code/MOSS-Audio from this mirror)
huggingface-cli login                             # only needed if pulling the gated LoRA upstream

bash run_all.sh --audio /path/to/clips --workdir ./uaap_work --envs ./envs
#   --no-sfx          skip the SFX LoRA stage

To run entirely offline from this mirror, point each stage at the local models/<name> folder instead of the HF hub id, and set UAAP_MOSS_SRC=<this repo>/code/MOSS-Audio.
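
One way to picture the offline switch is a small lookup from hub id to the mirrored subfolder. The mapping below copies a subset of the `models/` table; the `resolve_model` helper is hypothetical and is not part of the pipeline code, which may expose its own way of overriding model paths.

```python
from pathlib import Path

# Hub id -> subfolder under models/ (subset of the table in this README).
HUB_TO_LOCAL = {
    "microsoft/VibeVoice-ASR": "vibevoice-asr",
    "nvidia/nemotron-3.5-asr-streaming-0.6b": "nemotron-3.5-asr-streaming-0.6b",
    "nvidia/diar_sortformer_4spk-v1": "diar_sortformer_4spk-v1",
    "BUT-FIT/DiCoW_v3_3": "dicow-v3_3",
    "unsloth/gemma-4-12b-it-GGUF": "gemma-4-12b-it-gguf",
    "openai/whisper-small": "whisper-small",
}


def resolve_model(hub_id, mirror_root):
    """Return the local mirror folder if it exists, else the hub id unchanged.

    Hypothetical helper: `mirror_root` is the folder where this repo was cloned.
    """
    subfolder = HUB_TO_LOCAL.get(hub_id)
    if subfolder is not None:
        local = Path(mirror_root) / "models" / subfolder
        if local.is_dir():
            return str(local)
    return hub_id
```

Falling back to the hub id keeps the same call usable online, while a cloned mirror takes precedence whenever the folder is present.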

Outputs: <audio>_pred.json next to every input, all intermediates under uaap_work/<stem>/, and a self-contained uaap_work/report.html.
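
To check a finished run, the expected output locations can be derived from the input path. This sketch assumes the prediction file is named after the input's stem (file name without extension); the text above does not spell that out, so verify it against a real run. The function name is hypothetical.

```python
from pathlib import Path


def expected_outputs(audio_path, workdir="./uaap_work"):
    """Return where one clip's outputs should appear after a run.

    ASSUMPTION: the prediction file uses the input's stem, e.g.
    scene01.wav -> scene01_pred.json in the same folder.
    """
    audio = Path(audio_path)
    work = Path(workdir)
    return {
        "prediction": audio.with_name(audio.stem + "_pred.json"),
        "intermediates": work / audio.stem,
        "report": work / "report.html",
    }


def missing_outputs(audio_path, workdir="./uaap_work"):
    """List the expected outputs that do not exist on disk yet."""
    return [
        name
        for name, path in expected_outputs(audio_path, workdir).items()
        if not path.exists()
    ]
```

Calling `missing_outputs` for every clip after `run_all.sh` finishes gives a simple completeness check.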

Key parameters

| Parameter | Default | Meaning |
|---|---|---|
| fusion | `--fusion gemma` (default) | Gemma-4-12B text-only fusion + pyannote/DiCoW overlap (best Reward); `--fusion moss` = legacy audio MOSS-Audio-8B (higher precision) |
| decoding | greedy (`do_sample=False`) | ASR reconciliation + final annotation |
| vocal-burst threshold | `0.7` | locator confidence cutoff (`UAAP_VB_THRESHOLD`) |
| `--no-sfx` | off | skip the gated SFX LoRA stage |
| GPUs | 2× 24 GB recommended | VibeVoice-ASR is sharded across both |
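
The vocal-burst threshold acts as a confidence cutoff that an environment variable can override. The sketch below shows one plausible reading of that behaviour; the `confidence` field, the inclusive comparison, and the 0 to 1 range check are assumptions rather than details taken from the pipeline code.

```python
import os


def vb_threshold(default=0.7):
    """Read UAAP_VB_THRESHOLD, falling back to the documented default of 0.7."""
    raw = os.environ.get("UAAP_VB_THRESHOLD")
    if raw is None or raw.strip() == "":
        return default
    value = float(raw)
    if not 0.0 <= value <= 1.0:  # assumed valid range for a confidence
        raise ValueError(f"UAAP_VB_THRESHOLD out of range: {value}")
    return value


def keep_confident(candidates, threshold):
    """Keep candidates at or above the cutoff (inclusive comparison assumed)."""
    return [c for c in candidates if c["confidence"] >= threshold]


# Made-up locator output, only to show the filtering step.
candidates = [
    {"label": "laugh", "confidence": 0.91},
    {"label": "sigh", "confidence": 0.42},
]
kept = keep_confident(candidates, 0.7)
```

Raising the cutoff drops more low-confidence detections; lowering it keeps more candidates at the cost of more false positives.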

Models load/unload sequentially → peak VRAM ≈ one large model (~18–23 GB) at a time.

Efficiency: with ≥2 GPUs the heavy stages (Gemma fusion ~30 s/clip, SFX, ASR…) are auto-sharded across GPUs over disjoint clips (--gpus 0,1), ≈N× faster with identical output. A quality-neutral lower-VRAM fuser is available via export GEMMA_FILE=gemma-4-12b-it-UD-Q6_K_XL.gguf.
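
Sharding over disjoint clips means each clip is processed on exactly one GPU, which is why the output stays identical. A minimal round-robin sketch of that idea follows; it is an illustration under that assumption, not the pipeline's actual scheduler, and both function names are hypothetical.

```python
def parse_gpus(spec):
    """Turn a spec such as "0,1" into a list of GPU indices."""
    return [int(part) for part in spec.split(",") if part.strip()]


def shard_clips(clips, gpu_ids):
    """Assign each clip to exactly one GPU, round-robin over sorted names."""
    if not gpu_ids:
        raise ValueError("at least one GPU id is required")
    shards = {gpu: [] for gpu in gpu_ids}
    for index, clip in enumerate(sorted(clips)):
        shards[gpu_ids[index % len(gpu_ids)]].append(clip)
    return shards


shards = shard_clips(["c.wav", "a.wav", "b.wav"], parse_gpus("0,1"))
```

Sorting first makes the assignment reproducible across runs, and because no clip appears in two shards, N GPUs can work through roughly 1/N of the clips each.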

Hardware

~68 GB disk for the weights; 2× 24 GB GPUs recommended (VibeVoice-ASR is sharded across two GPUs; every other model fits on a single 24 GB GPU).

Licensing & attribution

This is a mirror for resilience. Each mirrored model keeps the license of its original repository (linked in the table above) — please consult and comply with each. The pipeline code is Apache-2.0. Mirrored by LAION.

