AILOVER3000 committed on
Commit
9631395
·
verified ·
1 Parent(s): d80e0e9
Files changed (1)
  1. README.md +94 -0
README.md CHANGED
@@ -1,3 +1,97 @@
  ---
  license: gpl-3.0
+ language:
+ - en
+ library_name: onnx
+ tags:
+ - audio
+ - music
+ - music-recommendation
+ - clap
+ - onnx
+ - on-device
+ - android
+ - mobile
+ pipeline_tag: feature-extraction
  ---
+
+ # LatentJam Models
+
+ ONNX model weights for [LatentJam](https://github.com/Nikita-sud/latentjam), a privacy-first Android music player that recommends what to play next entirely on-device. The models are hosted here because [`clap_audio.onnx`](./clap_audio.onnx) is 116 MB, which exceeds GitHub's 100 MB per-file cap.
+
+ The Android app's build downloads these files via [`scripts/download-models.sh`](https://github.com/Nikita-sud/latentjam/blob/main/scripts/download-models.sh) and bundles them into `app/src/main/assets/ml/`. At runtime, inference uses [ONNX Runtime](https://onnxruntime.ai/) with the [Qualcomm QNN execution provider](https://onnxruntime.ai/docs/execution-providers/QNN-ExecutionProvider.html) for Hexagon NPU offload on Snapdragon devices and falls back to CPU everywhere else.
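
The QNN-then-CPU fallback described above can be sketched with ONNX Runtime's Python API. This is an illustration only, not the app's Android code: `pick_providers` and `open_session` are hypothetical helpers, and a real QNN setup usually needs provider options (for example a backend library path) that are left out here.

```python
from typing import List

# Provider names as registered by ONNX Runtime.
QNN = "QNNExecutionProvider"
CPU = "CPUExecutionProvider"


def pick_providers(available: List[str]) -> List[str]:
    """Prefer QNN when the runtime reports it; always keep CPU as the fallback."""
    return [QNN, CPU] if QNN in available else [CPU]


def open_session(model_path: str):
    """Open a session with the preferred providers (hypothetical helper).

    onnxruntime is imported lazily so that pick_providers stays usable
    on machines where the package is not installed.
    """
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

Listing CPU after QNN lets ONNX Runtime assign any operators the NPU backend cannot handle to the CPU provider instead of failing outright.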
+
+ ## Files
+
+ | File | Size | Role |
+ |---|---|---|
+ | [`clap_audio.onnx`](./clap_audio.onnx) | 116 MB | Audio encoder derived from CLAP. Consumes a 15 s mono PCM chunk at 48 kHz and produces a 512-d L2-normalized embedding per track. Runs once per track during library indexing; the embedding is then cached in the app's Room database. |
+ | [`predictor_state.onnx`](./predictor_state.onnx) | 32 MB | Transformer-style state encoder. Reads a sequence of recent listening events (skip / listen-through / replay, weighted by recency) and produces a user-state vector. |
+ | [`predictor_scorer_n100.onnx`](./predictor_scorer_n100.onnx) | 5 MB | Top-100 candidate scorer. Given the predictor state and 100 candidate embeddings (chosen by approximate-nearest-neighbor retrieval against the user state), scores each candidate. The highest-scoring candidate becomes the next track in smart-shuffle mode. |
+ | [`embedding_version.txt`](./embedding_version.txt) | 69 B | Bumped when the encoder changes. The app re-extracts all embeddings on mismatch. |
+ | [`predictor_version.txt`](./predictor_version.txt) | 20 B | Bumped when the predictor changes. The app drops the predictor cache on mismatch. |
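
The version-mismatch rule in the last two rows can be sketched as a plain string comparison. The contents of the version files are not documented here, so this sketch treats them as opaque text; `read_version` and `cache_is_stale` are hypothetical names, not functions from the app.

```python
from pathlib import Path
from typing import Optional


def read_version(path: Path) -> Optional[str]:
    """Return the stripped contents of a version file, or None if it is missing."""
    try:
        return path.read_text(encoding="utf-8").strip()
    except FileNotFoundError:
        return None


def cache_is_stale(bundled: Optional[str], cached: Optional[str]) -> bool:
    """A cache is stale when no version was recorded or the two strings differ."""
    return cached is None or bundled != cached
```

Under this assumption, a stale result for `embedding_version.txt` would trigger re-extraction of every embedding, while a stale result for `predictor_version.txt` would only drop the predictor cache.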
+
+ ## Intended use
+
+ - Powering the **smart-shuffle** feature in the LatentJam Android app: cycling the shuffle button to `SMART` picks the next track using these models.
+ - Experimenting with on-device music recommendation on mobile. The encoder and predictor are deliberately small: the entire pipeline (audio decode → encoder → state encoder → scorer) runs end-to-end in under a second on a Snapdragon 8 Gen 3 with the Hexagon NPU enabled.
+
+ These models are **not** intended for:
+ - Server-side recommendation (use a bigger CLAP variant and a proper retrieval index)
+ - Music classification or tagging
+ - Generating audio
+
+ ## Pipeline overview
+
+ ```
+ Library indexing (one-time, in background)
+ ┌─────────────────────────────────────────────────────────────────────┐
+ │ mp3 / flac / opus / m4a / ogg ─── native C++ decoder (in LatentJam) │
+ │                                   ↓                                 │
+ │                                   15 s mono PCM at 48 kHz           │
+ │                                   ↓                                 │
+ │                                   clap_audio.onnx (this repo)       │
+ │                                   ↓                                 │
+ │                                   512-d embedding, L2-normalized    │
+ │                                   ↓                                 │
+ │                                   Room (on-device cache)            │
+ └─────────────────────────────────────────────────────────────────────┘
+
+ Smart-shuffle inference (on demand)
+ ┌─────────────────────────────────────────────────────────────────────┐
+ │ listening history (Room) ─── predictor_state.onnx (this repo)       │
+ │                              ↓                                      │
+ │                              user-state vector                      │
+ │                              ↓                                      │
+ │                              ANN retrieval over cached embeddings   │
+ │                              ↓                                      │
+ │                              100 candidate tracks                   │
+ │                              ↓                                      │
+ │                              predictor_scorer_n100.onnx             │
+ │                              ↓                                      │
+ │                              next track                             │
+ └─────────────────────────────────────────────────────────────────────┘
+ ```
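
The smart-shuffle path in the diagram can be sketched in a few lines of Python. Several things here are assumptions made for illustration: the user-state vector is taken to be directly comparable to the cached embeddings by dot product, a brute-force ranking stands in for the real approximate-nearest-neighbor index, and `scorer` is a hypothetical callable standing in for `predictor_scorer_n100.onnx`.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector, List[Vector]], List[float]]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(state: Vector, cache: Dict[str, Vector], k: int = 100) -> List[str]:
    """Brute-force stand-in for ANN: ids of the k embeddings closest to the state."""
    ranked = sorted(cache, key=lambda tid: dot(state, cache[tid]), reverse=True)
    return ranked[:k]


def next_track(state: Vector, cache: Dict[str, Vector], scorer: Scorer, k: int = 100) -> str:
    """Retrieve k candidates, score them, and return the highest-scoring id.

    Assumes the cache holds at least one embedding.
    """
    candidates = retrieve(state, cache, k)
    scores = scorer(state, [cache[tid] for tid in candidates])
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

Splitting retrieval from scoring keeps the expensive model call bounded: the scorer only ever sees `k` candidates, however large the library is.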
+
+ ## Privacy
+
+ - All inference is on-device. No audio, embeddings, or listening history is ever transmitted anywhere.
+ - The LatentJam Android app does not request the `INTERNET` permission for the recommender. The only network access is the build-time download from this repo onto the developer's machine.
+
+ ## Limitations
+
+ - Smart mode requires that an embedding has been computed for every track. Indexing a large library for the first time takes a while, because the encoder runs in the background only when the device is **charging and idle** (scheduled via WorkManager) to avoid thermal throttling and battery drain.
+ - The encoder is CLAP-derived but distilled to fit on-device. Genre/mood discrimination is good for popular Western genres and weaker for genres that are underrepresented in CLAP's training data.
+ - The predictor was trained on a closed user-history dataset and may not generalize perfectly to your taste right away. On-device fine-tuning is planned but not yet shipped (see [`ml/retrain/RetrainWorker.kt`](https://github.com/Nikita-sud/latentjam/blob/main/app/src/main/java/io/github/nikitasud/latentjam/ml/retrain/RetrainWorker.kt) in the app repo, currently a stub).
+
+ ## License
+
+ GPL-3.0-or-later, matching the [LatentJam Android app](https://github.com/Nikita-sud/latentjam).
+
+ The CLAP audio encoder is derived from [LAION's CLAP](https://github.com/LAION-AI/CLAP) (CC0/MIT) and was quantized and exported to ONNX for on-device use. The state encoder and scorer were trained from scratch for this project.
+
+ ## Links
+
+ - 📱 **Android app**: https://github.com/Nikita-sud/latentjam
+ - 📐 **Architecture notes**: https://github.com/Nikita-sud/latentjam/blob/main/ARCHITECTURE_NOTES.md
+ - 📜 **Fork notice & attribution**: https://github.com/Nikita-sud/latentjam/blob/main/FORK_NOTICE.md