---
license: gpl-3.0
language:
- en
library_name: onnx
tags:
- audio
- music
- music-recommendation
- clap
- onnx
- on-device
- android
- mobile
pipeline_tag: feature-extraction
---

# LatentJam Models

ONNX model weights for [LatentJam](https://github.com/Nikita-sud/latentjam), a privacy-first Android music player that recommends what to play next entirely on-device. These models live here because [`clap_audio.onnx`](./clap_audio.onnx) is 116 MB and exceeds GitHub's 100 MB per-file cap.

The app's build downloads these files via [`scripts/download-models.sh`](https://github.com/Nikita-sud/latentjam/blob/main/scripts/download-models.sh) and bundles them into `app/src/main/assets/ml/`. At runtime, inference uses [ONNX Runtime](https://onnxruntime.ai/) with the [Qualcomm QNN execution provider](https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html) for Hexagon NPU offload on Snapdragon devices, falling back to CPU on everything else.
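
The QNN-then-CPU fallback described above can be sketched in Python. This is only an illustration, not the app's own code: `pick_providers` is a hypothetical helper, while `QNNExecutionProvider` and `CPUExecutionProvider` are the provider names ONNX Runtime uses.

```python
def pick_providers(available):
    """Return execution providers in preference order: QNN first, CPU as fallback.

    `available` is the list reported by the runtime, for example the result of
    onnxruntime.get_available_providers().
    """
    preferred = ["QNNExecutionProvider", "CPUExecutionProvider"]
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("no supported execution provider available")
    return chosen


# A runtime built with the QNN provider (Snapdragon devices):
print(pick_providers(["QNNExecutionProvider", "CPUExecutionProvider"]))
# Any other device:
print(pick_providers(["CPUExecutionProvider"]))
```

With the `onnxruntime` Python package installed, the returned list could be passed as the `providers` argument of `onnxruntime.InferenceSession`. QNN usually also needs provider options (such as the backend library path), which are omitted here.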

## Files

| File | Size | Role |
|---|---|---|
| [`clap_audio.onnx`](./clap_audio.onnx) | 116 MB | Audio encoder derived from CLAP. Consumes a 15 s mono PCM chunk at 48 kHz and produces a 512-d L2-normalized embedding per track. Runs once per track during library indexing; the embedding is then cached in the app's Room database. |
| [`predictor_state.onnx`](./predictor_state.onnx) | 32 MB | Transformer-style state encoder. Reads a sequence of recent listening events (skip / listen-through / replay, weighted by recency) and produces a user-state vector. |
| [`predictor_scorer_n100.onnx`](./predictor_scorer_n100.onnx) | 5 MB | Top-100 candidate scorer. Given the predictor state and 100 candidate embeddings (chosen by approximate-nearest-neighbor retrieval against the user state), scores each candidate. The highest-scoring candidate becomes the next track in smart-shuffle mode. |
| [`embedding_version.txt`](./embedding_version.txt) | 69 B | Bumps when the encoder changes. The app re-extracts all embeddings on mismatch. |
| [`predictor_version.txt`](./predictor_version.txt) | 20 B | Bumps when the predictor changes. The app drops the predictor cache on mismatch. |
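
The version files drive cache invalidation. A minimal sketch of that check follows, assuming each file holds an opaque version string that is compared for equality; `needs_rebuild` and the version strings are hypothetical, and the app's actual comparison may differ.

```python
def needs_rebuild(bundled_version, cached_version):
    """Return True when cached data came from a different model version.

    `bundled_version` is the text of the version file shipped with the app;
    `cached_version` is the string stored next to the cache, or None if
    nothing has been cached yet.
    """
    if cached_version is None:
        return True
    # Ignore surrounding whitespace such as a trailing newline in the file.
    return bundled_version.strip() != cached_version.strip()


# Hypothetical version strings, for illustration only.
print(needs_rebuild("encoder-v2\n", "encoder-v1"))  # True: re-extract embeddings
print(needs_rebuild("encoder-v2\n", "encoder-v2"))  # False: cache is still valid
```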

## Intended use

- Powering the **smart-shuffle** feature in the LatentJam Android app: cycling the shuffle button to `SMART` picks the next track using these models.
- Experimenting with on-device music recommendation on mobile. The encoder and predictor are deliberately small: the entire pipeline (audio decode → encoder → state encoder → scorer) runs end-to-end in under a second on a Snapdragon 8 Gen 3 with the Hexagon NPU enabled.

These models are **not** intended for:
- Server-side recommendation (use a bigger CLAP variant and a proper retrieval index)
- Music classification or tagging
- Generating audio

## Pipeline overview

```
Library indexing (one-time, in background)
                                ┌────────────────────────────────────────┐
mp3 / flac / opus / m4a / ogg ──┤  native C++ decoder (in LatentJam)     │
                                │                   ↓                    │
                                │  15 s mono PCM at 48 kHz               │
                                │                   ↓                    │
                                │  clap_audio.onnx (this repo)           │
                                │                   ↓                    │
                                │  512-d embedding, L2-normalized        │
                                │                   ↓                    │
                                │  Room (on-device cache)                │
                                └────────────────────────────────────────┘

Smart-shuffle inference (on demand)
                                ┌────────────────────────────────────────┐
     listening history (Room) ──┤  predictor_state.onnx (this repo)      │
                                │                   ↓                    │
                                │  user-state vector                     │
                                │                   ↓                    │
                                │  ANN retrieval over cached embeddings  │
                                │                   ↓                    │
                                │  100 candidate tracks                  │
                                │                   ↓                    │
                                │  predictor_scorer_n100.onnx            │
                                │                   ↓                    │
                                │  next track                            │
                                └────────────────────────────────────────┘
```
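
The retrieval and selection steps of the smart-shuffle path can be sketched in plain Python. This is a rough illustration under several assumptions: an exact top-k search stands in for the ANN index, the user-state vector is assumed to share the embedding space, toy 3-d vectors replace the 512-d embeddings, and hand-written scores replace the output of `predictor_scorer_n100.onnx`. All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_candidates(query, embeddings, k=100):
    """Exact top-k by cosine similarity (a stand-in for ANN retrieval).

    `embeddings` maps track id to an L2-normalized embedding.
    """
    q = l2_normalize(query)
    ranked = sorted(embeddings, key=lambda tid: dot(q, embeddings[tid]), reverse=True)
    return ranked[:k]


def pick_next_track(candidates, scores):
    """Return the candidate whose score is highest."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Toy library with 3-d embeddings.
library = {
    "track_a": l2_normalize([1.0, 0.0, 0.0]),
    "track_b": l2_normalize([0.9, 0.1, 0.0]),
    "track_c": l2_normalize([0.0, 1.0, 0.0]),
}
candidates = retrieve_candidates([1.0, 0.05, 0.0], library, k=2)
print(candidates)
# Scores would come from the scorer model; these are made up.
print(pick_next_track(candidates, scores=[0.2, 0.7]))
```

Because the cached embeddings are L2-normalized, ranking by dot product is equivalent to ranking by cosine similarity, which is why the sketch normalizes only the query.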

## Privacy

- All inference is on-device. No audio, no embeddings, no listening history is ever transmitted anywhere.
- The LatentJam Android app does not request the `INTERNET` permission for the recommender. The only network access is the build-time download from this repo onto the developer's machine.

## Limitations

- Smart mode requires that an embedding has been computed for every track. The first time you index a large library this takes a while: the encoder runs in the background only when the device is **charging + idle** (via WorkManager) to avoid thermal throttling and battery drain.
- The encoder is CLAP-derived but distilled to fit on-device. Genre/mood discrimination is good for popular Western genres and weaker for genres underrepresented in CLAP's training data.
- The predictor was trained on a closed user-history dataset and may not generalize perfectly to your taste right away. On-device fine-tuning is planned but not yet shipped (see [`ml/retrain/RetrainWorker.kt`](https://github.com/Nikita-sud/latentjam/blob/main/app/src/main/java/io/github/nikitasud/latentjam/ml/retrain/RetrainWorker.kt) in the app repo, currently a stub).

## License

GPL-3.0-or-later, matching the [LatentJam Android app](https://github.com/Nikita-sud/latentjam).

The CLAP audio encoder is derived from [LAION's CLAP](https://github.com/LAION-AI/CLAP) (CC0/MIT) and was quantized and exported to ONNX for on-device use. The state encoder and scorer were trained from scratch for this project.

## Links

- **Android app**: https://github.com/Nikita-sud/latentjam
- **Architecture notes**: https://github.com/Nikita-sud/latentjam/blob/main/ARCHITECTURE_NOTES.md
- **Fork notice & attribution**: https://github.com/Nikita-sud/latentjam/blob/main/FORK_NOTICE.md