adreno-llms-weights

Pre-converted fp16 weights for the model ports in adreno-llms: small language models hand-tuned for Adreno 6xx GPUs on non-flagship Android phones.

These binaries are NOT directly compatible with HuggingFace transformers or PyTorch. They use a custom layout produced by NNOpt and are consumed by the C++/OpenCL inference binaries in the GitHub repo above.
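As a rough illustration of what "custom layout" means here, a flat fp16 blob plus a JSON sidecar can be sliced back into named tensors. The schema below is hypothetical; the real layout is whatever NNOpt emits in each model's `model.fp16.meta.json`, and the field names are assumptions for this sketch only:

```python
import struct

# Hypothetical sidecar describing tensor order in the flat fp16 blob.
# The real model.fp16.meta.json schema is defined by NNOpt and may differ.
meta = {
    "tensors": [
        {"name": "embed.weight", "shape": [4, 2], "offset": 0},
        {"name": "lm_head.weight", "shape": [2, 4], "offset": 16},
    ]
}

def load_fp16_blob(blob: bytes, meta: dict) -> dict:
    """Slice each tensor out of a flat fp16 buffer using the sidecar."""
    tensors = {}
    for t in meta["tensors"]:
        n = 1
        for d in t["shape"]:
            n *= d
        start = t["offset"]
        raw = blob[start : start + 2 * n]                  # 2 bytes per fp16 value
        tensors[t["name"]] = struct.unpack(f"<{n}e", raw)  # 'e' = IEEE half
    return tensors

# Tiny fake blob standing in for model.fp16.bin: 8 + 8 fp16 values.
blob = struct.pack("<16e", *[float(i) for i in range(16)])
loaded = load_fp16_blob(blob, meta)
```

The point is only that the binary carries no self-describing header a generic loader could parse: without the sidecar (or the C++/OpenCL runtime that embeds the same layout logic), the blob is opaque, which is why HuggingFace transformers and PyTorch cannot open these files directly.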

Usage

git clone https://github.com/a8nova/adreno-llms.git
cd adreno-llms
./scripts/fetch_weights.sh smollm2-135m-instruct   # pulls from this repo
cd src/models/smollm2-135m-instruct
NNOPT_DTYPE=fp16 ./scripts/build.sh --release
NNOPT_DTYPE=fp16 ./scripts/deploy_android.sh
NNOPT_DTYPE=fp16 ./scripts/run_android.sh "Once upon a time" 64

See the GitHub repo README for full setup, hardware requirements, and per-model performance numbers (5-run warm median on Motorola Razr 2020 / Adreno 618).

Models in this repo

Decode tok/s = 5-run warm median, fp16, greedy (temperature=0, seed=42), 32-token generation, on Motorola Razr 2020 (Adreno 618), measured 2026-05-06.

| Path | Upstream | Params | Decode tok/s | License of upstream weights |
|------|----------|--------|--------------|------------------------------|
| mamba2-130m/model.fp16.bin | state-spaces/mamba2-130m | 130M | 23.18 | Apache 2.0 |
| mamba-130m/model.fp16.bin | state-spaces/mamba-130m-hf | 130M | 22.15 | Apache 2.0 |
| smollm2-135m-instruct/model.fp16.bin | HuggingFaceTB/SmolLM2-135M-Instruct | 135M | 14.57 | Apache 2.0 |
| lfm2-5-350m/model.fp16.bin | LiquidAI/LFM2.5-350M-Base | 350M | 10.20 | Liquid AI Open License |
| qwen2-5-0-5b/model.fp16.bin | Qwen/Qwen2.5-0.5B | 500M | 8.45 | Apache 2.0 |
| openelm-270m/ (companion files only) | apple/OpenELM-270M | 270M | 4.47 | Apple ASCL (fetch + convert locally) |
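The "5-run warm median" figure above is straightforward to reproduce from raw timings: time each warm decode of a fixed-length generation, take the median, and divide tokens by seconds. A sketch with made-up timings (the numbers are illustrative, not measurements from the table):

```python
import statistics

def decode_toks_per_s(run_times_s, tokens_generated=32, warmup_runs=0):
    """Median tokens/second over warm runs (any warm-up runs excluded)."""
    warm = run_times_s[warmup_runs:]
    return tokens_generated / statistics.median(warm)

# Five illustrative warm decode timings (seconds) for a 32-token generation.
times = [2.21, 2.18, 2.25, 2.19, 2.20]
rate = decode_toks_per_s(times)  # 32 tokens / median(times)
```

A median over warm runs is less noisy than a mean here: it discards outlier runs caused by thermal throttling or background load, which are common on phones.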

OpenELM-270M is partially hosted here. Under openelm-270m/ you'll find only the small companion files:

openelm-270m/model.fp16.meta.json     # tensor layout for the C++ runtime
openelm-270m/tokenizer.json           # HuggingFace tokenizer config
openelm-270m/tokenizer_vocab.bin      # vocab + merges (binary)

The actual model.fp16.bin is NOT redistributed here; the Apple Sample Code License (ASCL) restricts that. Instead, scripts/fetch_openelm_weights.sh in the GitHub repo pulls apple/OpenELM-270M's safetensors directly from Apple's Hugging Face repo and runs scripts/convert_openelm_weights.py to produce the binary locally, using the layout described in model.fp16.meta.json.
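The general shape of such a conversion is: cast each upstream tensor to fp16 and concatenate them into one flat blob, recording name/shape/offset in the meta sidecar. The real scripts/convert_openelm_weights.py lives in the GitHub repo and may work differently; this is a minimal sketch with a hypothetical meta schema and toy data in place of real safetensors weights:

```python
import struct

def convert_to_flat_fp16(tensors: dict) -> tuple:
    """Pack float tensors into one flat fp16 blob, recording
    name/shape/offset in a meta dict (hypothetical schema)."""
    blob = bytearray()
    meta = {"tensors": []}
    for name, (shape, values) in tensors.items():
        meta["tensors"].append(
            {"name": name, "shape": shape, "offset": len(blob)}
        )
        blob += struct.pack(f"<{len(values)}e", *values)  # 'e' = IEEE fp16
    return bytes(blob), meta

# Toy stand-in for weights loaded from the upstream safetensors file.
weights = {"attn.q_proj": ([2, 2], [0.5, -1.0, 2.0, 0.25])}
blob, meta = convert_to_flat_fp16(weights)
```

Because the conversion runs locally, only Apple's upstream download terms apply to the resulting binary; nothing ASCL-restricted ever ships from this repo.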

How were these produced?

Every binary in this repo was generated by NNOpt, a coding agent for porting and optimizing neural networks for Android embedded targets. None of the kernels, layouts, or build tooling in the consumer repo was hand-written.

If you have a model you want running on Adreno, Snapdragon, Mali, or any Android device with this kind of polish, email a8nova@gmail.com for early access.

License

  • These conversion artifacts: Apache 2.0 (re-publish freely, attribute the upstream model).
  • Underlying model weights: each carries its upstream license (see the table above). Users are responsible for compliance.