---
license: apache-2.0
tags:
- adreno
- android
- opencl
- mobile
- on-device-inference
---

# adreno-llms-weights

Pre-converted **fp16 weights** for the model ports in [adreno-llms](https://github.com/a8nova/adreno-llms), a set of small language models hand-tuned for **Adreno 6xx GPUs on non-flagship Android phones**.

These binaries are NOT directly compatible with Hugging Face `transformers` or PyTorch. They use a custom layout produced by [NNOpt](mailto:a8nova@gmail.com) and are consumed by the C++/OpenCL inference binaries in the GitHub repo above.

## Usage

```bash
git clone https://github.com/a8nova/adreno-llms.git
cd adreno-llms
./scripts/fetch_weights.sh smollm2-135m-instruct # pulls from this repo
cd src/models/smollm2-135m-instruct
NNOPT_DTYPE=fp16 ./scripts/build.sh --release
NNOPT_DTYPE=fp16 ./scripts/deploy_android.sh
NNOPT_DTYPE=fp16 ./scripts/run_android.sh "Once upon a time" 64
```

See [the GitHub repo README](https://github.com/a8nova/adreno-llms) for full setup, hardware requirements, and per-model performance numbers (5-run warm median on Motorola Razr 2020 / Adreno 618).

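
If you only want a single file without cloning the GitHub repo, Hugging Face serves repository files under a `resolve/main/` URL. The sketch below is one way to build that URL and download a file with `curl`. It is not what `fetch_weights.sh` does internally, and the helper names `hf_file_url` and `fetch_one`, the `weights` output directory, and the `OWNER` placeholder are all assumptions made for illustration. This README does not state where the runtime expects the file, so prefer `fetch_weights.sh` for a working setup.

```shell
# Placeholder: set HF_REPO to the real owner/name of this weights repo.
HF_REPO="${HF_REPO:-OWNER/adreno-llms-weights}"

# Build the direct-download URL for one file on the main branch.
hf_file_url() {
  local repo="$1" path="$2"
  printf 'https://huggingface.co/%s/resolve/main/%s\n' "$repo" "$path"
}

# Download one file into a local directory (hypothetical helper).
# Needs network access and a real HF_REPO value.
fetch_one() {
  local path="$1" out_dir="${2:-weights}"
  mkdir -p "$out_dir/$(dirname "$path")"
  curl -L --fail -o "$out_dir/$path" "$(hf_file_url "$HF_REPO" "$path")"
}
```

For example, `fetch_one smollm2-135m-instruct/model.fp16.bin` would save the file to `weights/smollm2-135m-instruct/model.fp16.bin`.
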

## Models in this repo

Decode tok/s is the 5-run warm median with fp16 weights, greedy decoding (`temperature=0, seed=42`), and 32-token generation, on a Motorola Razr 2020 (Adreno 618), measured 2026-05-06.

| Path | Upstream | Params | Decode tok/s | License of upstream weights |
|---|---|---:|---:|---|
| `mamba2-130m/model.fp16.bin` | [state-spaces/mamba2-130m](https://huggingface.co/state-spaces/mamba2-130m) | 130M | 23.18 | Apache 2.0 |
| `mamba-130m/model.fp16.bin` | [state-spaces/mamba-130m-hf](https://huggingface.co/state-spaces/mamba-130m-hf) | 130M | 22.15 | Apache 2.0 |
| `smollm2-135m-instruct/model.fp16.bin` | [HuggingFaceTB/SmolLM2-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct) | 135M | 14.57 | Apache 2.0 |
| `lfm2-5-350m/model.fp16.bin` | [LiquidAI/LFM2.5-350M-Base](https://huggingface.co/LiquidAI/LFM2.5-350M-Base) | 350M | 10.20 | Liquid AI Open License |
| `qwen2-5-0-5b/model.fp16.bin` | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | 500M | 8.45 | Apache 2.0 |
| `openelm-270m/` (companion files only) | [apple/OpenELM-270M](https://huggingface.co/apple/OpenELM-270M) | 270M | 4.47 | Apple Sample Code License (fetch and convert locally) |
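
The "Decode tok/s" figures are medians over five runs. A minimal sketch of that aggregation step follows, assuming you have already collected one tok/s value per timed run (and, as a further assumption, that "warm" means the timed runs follow an untimed warm-up). The `median` helper is hypothetical and the sample values are illustrative, not measurements from this repo.

```shell
# Print the median of the numeric arguments (one tok/s value per timed run).
median() {
  [ "$#" -gt 0 ] || return 1
  # LC_ALL=C keeps "." as the decimal separator regardless of locale.
  printf '%s\n' "$@" | LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Illustrative values only.
median 14.2 14.9 14.6 13.8 14.7   # prints 14.6
```

A median is less sensitive than a mean to a single slow run, for example one affected by thermal throttling.
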

**OpenELM-270M is only partially hosted here.** Under `openelm-270m/` you'll find only the small companion files:

```
openelm-270m/model.fp16.meta.json # tensor layout for the C++ runtime
openelm-270m/tokenizer.json # HuggingFace tokenizer config
openelm-270m/tokenizer_vocab.bin # vocab + merges (binary)
```

The actual `model.fp16.bin` is **NOT redistributed**, because Apple's [Apple Sample Code License](https://huggingface.co/apple/OpenELM-270M/blob/main/LICENSE) restricts that. Instead, `scripts/fetch_openelm_weights.sh` in the GitHub repo pulls the `apple/OpenELM-270M` safetensors directly from Apple's Hugging Face repo and runs `scripts/convert_openelm_weights.py` to produce the binary locally, using the layout described in `model.fp16.meta.json`.
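
After the local conversion, a quick check that the converted binary and the companion files are all present can save a failed deploy. The sketch below is a hypothetical helper, not part of the GitHub repo. It assumes the converted `model.fp16.bin` sits in the same directory as the three companion files, which this README does not confirm, so pass whichever directory your setup actually uses.

```shell
# Return 0 if the directory holds every file the OpenELM port needs, else 1.
# File names come from this README; the directory layout is an assumption.
check_openelm_dir() {
  local dir="$1" missing=0 f
  for f in model.fp16.bin model.fp16.meta.json tokenizer.json tokenizer_vocab.bin; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Usage, with a placeholder path: `check_openelm_dir path/to/openelm-270m && echo "all files present"`.
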

## License

- These conversion artifacts: Apache 2.0 (republish freely, with attribution to the upstream model).
- Underlying model weights: each carries its upstream license (see the table above). Users are responsible for compliance.