	---
	license: gpl-3.0
	language:
	- en
	library_name: onnx
	tags:
	- audio
	- music
	- music-recommendation
	- clap
	- onnx
	- on-device
	- android
	- mobile
	pipeline_tag: feature-extraction
	---

	# LatentJam Models

	ONNX model weights for [LatentJam](https://github.com/Nikita-sud/latentjam) — a privacy-first Android music player that recommends what to play next entirely on-device. These models live here because [`clap_audio.onnx`](./clap_audio.onnx) is 116 MB and exceeds GitHub's 100 MB per-file cap.

	The Android build downloads these files via [`scripts/download-models.sh`](https://github.com/Nikita-sud/latentjam/blob/main/scripts/download-models.sh) and bundles them into `app/src/main/assets/ml/`. At runtime, inference uses [ONNX Runtime](https://onnxruntime.ai/) with the [Qualcomm QNN execution provider](https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html) to offload work to the Hexagon NPU on Snapdragon devices, and falls back to CPU on everything else.
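The NPU-first, CPU-fallback behavior described above can be sketched as a provider priority list. This is a minimal illustration in Python, not the app's code (the app uses ONNX Runtime on Android). The helper name `pick_providers` is hypothetical; the two provider strings are the names ONNX Runtime uses for its QNN and CPU execution providers.

```python
# Illustrative sketch only: the app itself runs ONNX Runtime on Android.
QNN = "QNNExecutionProvider"  # ONNX Runtime's name for the Qualcomm QNN provider
CPU = "CPUExecutionProvider"  # CPU provider, used as the fallback


def pick_providers(available):
    """Return providers in priority order: QNN if present, then CPU.

    `pick_providers` is a hypothetical helper, not part of any library.
    """
    providers = [QNN] if QNN in available else []
    providers.append(CPU)
    return providers


# With the onnxruntime package installed, `available` would come from
# onnxruntime.get_available_providers(), and the result would be passed as
# onnxruntime.InferenceSession(model_path, providers=...).
snapdragon = pick_providers([QNN, CPU])
other_device = pick_providers([CPU])
```

ONNX Runtime tries providers in list order, so keeping CPU last lets any operator the NPU provider cannot handle still run.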

	## Files

	| File | Size | Role |
	|---|---|---|
	| [`clap_audio.onnx`](./clap_audio.onnx) | 116 MB | Audio encoder derived from CLAP. Consumes a 15 s mono PCM chunk at 48 kHz and produces a 512-d L2-normalized embedding per track. Runs once per track during library indexing; the embedding is then cached in the app's Room database. |
	| [`predictor_state.onnx`](./predictor_state.onnx) | 32 MB | Transformer-style state encoder. Reads a sequence of recent listening events (skip / listen-through / replay, weighted by recency) and produces a user-state vector. |
	| [`predictor_scorer_n100.onnx`](./predictor_scorer_n100.onnx) | 5 MB | Top-100 candidate scorer. Given the predictor state and 100 candidate embeddings (chosen by approximate-nearest-neighbor retrieval against the user state), scores each candidate. The highest-scoring candidate becomes the next track in smart-shuffle mode. |
	| [`embedding_version.txt`](./embedding_version.txt) | 69 B | Bumps when the encoder changes. The app re-extracts all embeddings on mismatch. |
	| [`predictor_version.txt`](./predictor_version.txt) | 20 B | Bumps when the predictor changes. The app drops the predictor cache on mismatch. |
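The version-file check in the last two rows amounts to a string comparison between the bundled version and the one stored alongside the cache. The sketch below assumes the file contents are treated as an opaque string (the actual format of the `.txt` files is not documented here), and the function names are hypothetical.

```python
def parse_version(text):
    """Treat the version file as an opaque string; only strip whitespace.

    Assumption: the real file format is not specified, so no parsing is done.
    """
    return text.strip()


def needs_rebuild(bundled_version, cached_version):
    """True when the cache is missing or was built by a different model version."""
    return cached_version is None or cached_version != bundled_version


# Hypothetical values for illustration; they are not the real file contents.
bundled = parse_version("encoder-v2\n")
stale = needs_rebuild(bundled, "encoder-v1")    # cache built by an older encoder
fresh = needs_rebuild(bundled, "encoder-v2")    # cache matches the bundled model
first_run = needs_rebuild(bundled, None)        # nothing cached yet
```

For `embedding_version.txt` a rebuild means re-extracting every embedding; for `predictor_version.txt` it means dropping the predictor cache.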

	## Intended use

	- Powering the smart-shuffle feature in the LatentJam Android app: cycling the shuffle button to `SMART` picks the next track using these models.
	- Experimenting with on-device music recommendation on mobile. The encoder and predictor are deliberately small: the entire pipeline (audio decode → encoder → state encoder → scorer) runs end-to-end in under a second on a Snapdragon 8 Gen 3 with the Hexagon NPU enabled.

	These models are not intended for:
	- Server-side recommendation (use a bigger CLAP variant and a proper retrieval index)
	- Music classification or tagging
	- Generating audio

	## Pipeline overview

	```
	Library indexing (one-time, in background)
	                                ┌──────────────────────────────────────┐
	mp3 / flac / opus / m4a / ogg ──┤ native C++ decoder (in LatentJam)    │
	                                │                  ↓                   │
	                                │ 15 s mono PCM at 48 kHz              │
	                                │                  ↓                   │
	                                │ clap_audio.onnx (this repo)          │
	                                │                  ↓                   │
	                                │ 512-d embedding, L2-normalized       │
	                                │                  ↓                   │
	                                │ Room (on-device cache)               │
	                                └──────────────────────────────────────┘

	Smart-shuffle inference (on demand)
	                           ┌──────────────────────────────────────┐
	listening history (Room) ──┤ predictor_state.onnx (this repo)     │
	                           │                  ↓                   │
	                           │ user-state vector                    │
	                           │                  ↓                   │
	                           │ ANN retrieval over cached embeddings │
	                           │                  ↓                   │
	                           │ 100 candidate tracks                 │
	                           │                  ↓                   │
	                           │ predictor_scorer_n100.onnx           │
	                           │                  ↓                   │
	                           │ next track                           │
	                           └──────────────────────────────────────┘
	```
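The smart-shuffle half of the diagram can be sketched in plain Python. Several things here are assumptions made for illustration: brute-force cosine similarity stands in for the ANN index, the user-state vector is assumed to live in the same space as the track embeddings, and `scorer` is a stand-in callable rather than `predictor_scorer_n100.onnx` (which scores all 100 candidates in one call). All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_candidates(user_state, embeddings, k=100):
    """Brute-force stand-in for ANN retrieval: top-k track ids by cosine similarity.

    `embeddings` maps track id -> L2-normalized vector, so a dot product with
    the normalized query equals cosine similarity.
    """
    query = l2_normalize(user_state)
    ranked = sorted(embeddings, key=lambda tid: dot(query, embeddings[tid]), reverse=True)
    return ranked[:k]


def pick_next_track(user_state, embeddings, scorer, k=100):
    """Retrieve k candidates, score each one, and return the highest-scoring id."""
    candidates = retrieve_candidates(user_state, embeddings, k)
    if not candidates:
        return None
    scores = [scorer(user_state, embeddings[tid]) for tid in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


# Toy 2-d library; real embeddings are 512-d.
library = {
    "a": l2_normalize([1.0, 0.0]),
    "b": l2_normalize([0.0, 1.0]),
    "c": l2_normalize([1.0, 1.0]),
}
next_track = pick_next_track([0.9, 0.1], library, scorer=dot, k=2)
```

Splitting retrieval from scoring keeps the expensive model call bounded: the scorer only ever sees a fixed-size candidate set, regardless of library size.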

	## Privacy

	- All inference runs on-device. No audio, embeddings, or listening history is ever transmitted anywhere.
	- The LatentJam Android app does not request the `INTERNET` permission for the recommender. The only network access is the build-time download from this repo onto the developer's machine.

	## Limitations

	- Smart mode requires an embedding for every track. Indexing a large library for the first time takes a while, because the encoder runs in the background only when the device is charging and idle (scheduled via WorkManager), which avoids thermal throttling and battery drain.
	- The encoder is CLAP-derived but distilled to fit on-device. Genre/mood discrimination is good for popular Western genres and weaker for genres that are underrepresented in CLAP's training data.
	- The predictor was trained on a closed user-history dataset and may not generalize perfectly to your taste right away. On-device fine-tuning is planned but not yet shipped (see [`ml/retrain/RetrainWorker.kt`](https://github.com/Nikita-sud/latentjam/blob/main/app/src/main/java/io/github/nikitasud/latentjam/ml/retrain/RetrainWorker.kt) in the app repo — currently a stub).

	## License

	GPL-3.0-or-later, matching the [LatentJam Android app](https://github.com/Nikita-sud/latentjam).

	The CLAP audio encoder is derived from [LAION's CLAP](https://github.com/LAION-AI/CLAP) (CC0/MIT), then quantized and exported to ONNX for on-device use. The state encoder and scorer were trained from scratch for this project.

	## Links

	- 📱 Android app: https://github.com/Nikita-sud/latentjam
	- 📐 Architecture notes: https://github.com/Nikita-sud/latentjam/blob/main/ARCHITECTURE_NOTES.md
	- 📜 Fork notice & attribution: https://github.com/Nikita-sud/latentjam/blob/main/FORK_NOTICE.md