	---
	license: apache-2.0
	tags:
	- adreno
	- android
	- opencl
	- mobile
	- on-device-inference
	---

	# adreno-llms-weights

	Pre-converted fp16 weights for the model ports in [adreno-llms](https://github.com/a8nova/adreno-llms) — small language models hand-tuned for Adreno 6xx GPUs on non-flagship Android phones.

	These binaries are NOT directly compatible with HuggingFace `transformers` or PyTorch. They use a custom layout produced by [NNOpt](mailto:a8nova@gmail.com) and are consumed by the C++/OpenCL inference binaries in the GitHub repo above.

	## Usage

	```bash
	git clone https://github.com/a8nova/adreno-llms.git
	cd adreno-llms
	./scripts/fetch_weights.sh smollm2-135m-instruct # pulls from this repo
	cd src/models/smollm2-135m-instruct
	NNOPT_DTYPE=fp16 ./scripts/build.sh --release
	NNOPT_DTYPE=fp16 ./scripts/deploy_android.sh
	NNOPT_DTYPE=fp16 ./scripts/run_android.sh "Once upon a time" 64
	```
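If you only need one weight file and want to skip the helper script, a direct download might look like the sketch below. It assumes this repo's id is `a8nova/adreno-llms-weights` and relies on the Hub's standard `resolve/main` file URL pattern. `weights_url` is a hypothetical helper, not part of the adreno-llms scripts, and `fetch_weights.sh` may do more than this.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the direct-download URL for one model's fp16 weights.
# Assumes the repo id a8nova/adreno-llms-weights and the Hub's resolve/main URL pattern.
weights_url() {
  local model="$1"
  printf 'https://huggingface.co/a8nova/adreno-llms-weights/resolve/main/%s/model.fp16.bin\n' "$model"
}

# Example (requires network access): download the SmolLM2 weights with curl.
# curl -L --fail -o model.fp16.bin "$(weights_url smollm2-135m-instruct)"
```

The model name passed in should match a directory from the table of hosted models, for example `mamba2-130m` or `qwen2-5-0-5b`.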

	See [the GitHub repo README](https://github.com/a8nova/adreno-llms) for full setup, hardware requirements, and per-model performance numbers (5-run warm median on Motorola Razr 2020 / Adreno 618).

	## Models in this repo

	Decode tok/s is the median of 5 warm runs at fp16 with greedy decoding (`temperature=0, seed=42`) and 32-token generation on a Motorola Razr 2020 (Adreno 618), measured 2026-05-06.

	| Path | Upstream | Params | Decode tok/s | Upstream license |
	|---|---|---:|---:|---|
	| `mamba2-130m/model.fp16.bin` | [state-spaces/mamba2-130m](https://huggingface.co/state-spaces/mamba2-130m) | 130M | 23.18 | Apache 2.0 |
	| `mamba-130m/model.fp16.bin` | [state-spaces/mamba-130m-hf](https://huggingface.co/state-spaces/mamba-130m-hf) | 130M | 22.15 | Apache 2.0 |
	| `smollm2-135m-instruct/model.fp16.bin` | [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) | 135M | 14.57 | Apache 2.0 |
	| `lfm2-5-350m/model.fp16.bin` | [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) | 350M | 10.20 | Liquid AI Open License |
	| `qwen2-5-0-5b/model.fp16.bin` | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | 500M | 8.45 | Apache 2.0 |
	| `openelm-270m/` (companion files only) | [apple/OpenELM-270M](https://huggingface.co/apple/OpenELM-270M) | 270M | 4.47 | Apple ASCL (fetch and convert locally) |
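Each decode figure is a median over five warm runs. As an illustration of that aggregation step only, the sketch below computes the median of a list of tok/s samples. `median` is a hypothetical helper, and the sample values are made up for the example; they are not measurements from the benchmark.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the median of the numbers given as arguments.
median() {
  [ "$#" -gt 0 ] || return 1
  # Sort numerically (C locale so "." is the decimal point), then pick the middle.
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Illustrative samples only (not real benchmark data).
median 14.2 14.9 14.6 13.8 14.7   # prints 14.6
```

A median is less sensitive than a mean to a single slow run, which is a common reason to report it for on-device timings.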

	OpenELM-270M is partially hosted here. Under `openelm-270m/` you'll find only the small companion files:

	```
	openelm-270m/model.fp16.meta.json # tensor layout for the C++ runtime
	openelm-270m/tokenizer.json # HuggingFace tokenizer config
	openelm-270m/tokenizer_vocab.bin # vocab + merges (binary)
	```

	The actual `model.fp16.bin` is NOT redistributed because the [Apple Sample Code License](https://huggingface.co/apple/OpenELM-270M/blob/main/LICENSE) restricts that. Instead, `scripts/fetch_openelm_weights.sh` in the GitHub repo pulls the safetensors for `apple/OpenELM-270M` directly from Apple's Hugging Face repo and runs `scripts/convert_openelm_weights.py` to produce the binary locally, using the layout described in `model.fp16.meta.json`.
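After the local conversion, a quick sanity check can confirm that the converted binary and the three companion files are all in place. The sketch below assumes they sit together in one directory. `check_openelm_dir` is a hypothetical helper, and the directory you pass is wherever your local setup places the OpenELM files (the scripts' actual output location is not stated here).

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which OpenELM files are missing from a directory.
# Assumes the converted binary sits next to the three companion files.
check_openelm_dir() {
  local dir="$1" missing=0 f
  for f in model.fp16.bin model.fp16.meta.json tokenizer.json tokenizer_vocab.bin; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}

# Usage: check_openelm_dir path/to/openelm-270m && echo "all files present"
```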

	## License

	- These conversion artifacts: Apache 2.0 (re-publish freely, attribute the upstream model).
	- Underlying model weights: each carries its upstream license (see the table above). Users are responsible for compliance.