---
license: apache-2.0
datasets:
- moca-embed/dclm_20b
- openbmb/UltraChat
language:
- en
library_name: transformers
---
# Model Card for Lulu245M-Mobile
## Model Details
- 245M parameters
- 4 layers
- Model dimension (d_size): 1280
- 16 MoE experts
- 8 KV heads
- FP32 ONNX export: 2.3 GB

Trained on only 20B tokens of web-text data.
Fine-tuned on 80K UltraChat examples, with no LoRA or similar tricks.
### Model Description
#### Lulu Local Android Demo
**Lulu Local** is an offline Android AI demo by **Open Machine**.
This release runs a local Lulu language model directly on an Android phone using **ONNX Runtime CPU inference**.
No cloud. No server. No GPU. No NPU. No internet required after install.

Runs on a Samsung A25 5G. This is a raw, early proof that a custom local model can run directly on consumer Android hardware.
For the record, this is a deliberately unoptimized model: a naive Python-style generation loop and a pure 2.3 GB FP32 ONNX export. It currently runs on the CPU; we haven't touched the NPU, Vulkan, or anything else yet.

Generation currently takes about three minutes (a full forward pass over the 128-token context; as mentioned, it's unoptimized). The APK file is here, and GitHub repositories for the ONNX model and the Android app will follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory.

This runs on the phone's Exynos CPU. After we chatted for ten minutes, the battery level didn't move and the device didn't heat up.

We completed everything in the last two days: training, benchmarks, fine-tuning, and the ONNX runtime integration, all for less than €1000.
#### Why this is interesting
Most mobile LLM demos rely on one or more of the following:
- heavily quantized models
- GPU acceleration
- NPU acceleration
- server-side inference
- vendor SDKs
- cloud APIs
This demo is intentionally simple and direct:
- Android app
- ONNX Runtime
- local tokenizer
- local ONNX model
- CPU only
The current model is not small, not heavily optimized, and not using mobile accelerator tricks.
That is the point of the demo.
#### Model architecture note
The Android build uses a stateful single-token step ONNX export.
The runtime loop is: token_id + position + cache tensors → ONNX step model → logits + updated cache tensors → sample next token → repeat.
This replaced the earlier full-sequence ONNX path, which re-ran the model over the entire sequence for every generated token and therefore was much slower and used much more memory during generation.
Current ONNX interface:

Inputs:
- `token_id`: `[1, 1]` int64
- `pos`: `[1]` int64
- `k_0, v_0 ... k_23, v_23`: cache tensors

Outputs:
- `logits`: `[1, 32000]` float32
- `out_k_0, out_v_0 ... out_k_23, out_v_23`: updated cache tensors

Cache shape per K/V tensor: `[1, 16, 128, 80]`. Total runtime cache is about 31 MB.
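That figure checks out: 24 layers × 2 tensors × (16·128·80) float32 values × 4 bytes ≈ 31.5 MB. For concreteness, here is a minimal sketch of the decode loop against the interface above, using the Python `onnxruntime` package rather than the Android app's actual Kotlin code. The file name `lulu_step.onnx`, the `generate` helper, and the sampling defaults are illustrative assumptions; the input/output names and shapes are the ones listed above.

```python
import numpy as np
import onnxruntime as ort

NUM_LAYERS = 24                 # k_0..k_23 / v_0..v_23
CACHE_SHAPE = (1, 16, 128, 80)  # per K/V tensor, as listed above
MAX_CTX = 128                   # the fixed-shape cache holds at most 128 positions

# "lulu_step.onnx" is a placeholder name, not confirmed by this card.
sess = ort.InferenceSession("lulu_step.onnx",
                            providers=["CPUExecutionProvider"])

def generate(prompt_ids, max_new_tokens=32, temperature=0.8):
    # Zero-initialised KV cache, filled in by the step model as pos advances.
    cache = {}
    for i in range(NUM_LAYERS):
        cache[f"k_{i}"] = np.zeros(CACHE_SHAPE, dtype=np.float32)
        cache[f"v_{i}"] = np.zeros(CACHE_SHAPE, dtype=np.float32)

    out = []
    pos = 0
    while pos < MAX_CTX and len(out) < max_new_tokens:
        # Feed the next prompt token, or the last sampled token once the
        # (non-empty) prompt is exhausted.
        token = prompt_ids[pos] if pos < len(prompt_ids) else out[-1]
        feeds = {"token_id": np.array([[token]], dtype=np.int64),
                 "pos": np.array([pos], dtype=np.int64),
                 **cache}
        results = sess.run(None, feeds)   # [logits, out_k_0, out_v_0, ...]
        logits = results[0][0]            # shape [32000]
        for i in range(NUM_LAYERS):       # carry the updated cache forward
            cache[f"k_{i}"] = results[1 + 2 * i]
            cache[f"v_{i}"] = results[2 + 2 * i]
        pos += 1
        if pos >= len(prompt_ids):        # prompt consumed: start sampling
            z = logits.astype(np.float64) / temperature
            probs = np.exp(z - z.max())
            probs /= probs.sum()
            out.append(int(np.random.choice(probs.size, p=probs)))
    return out
```

The important property is that each `sess.run` call performs a single-token step, so memory stays bounded by the fixed-shape cache instead of growing with the generated sequence.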
- **Developed by:** The Open Machine
- **Model type:** The Open Machine Transformers version
- **Language(s) (NLP):** English
- **License:** Apache 2.0
### Model Sources
- **Repository:** Will be provided in the coming days
- **Paper:** Coming soon
- **Demo:** More information needed
## Uses

### Demo highlights
- Fully offline Android assistant
- Runs on mobile CPU only
- Stateful single-token ONNX generation
- Live token streaming UI
- Battery / RAM / speed display
- Cool / Turbo mode (thread counts illustrated in the sketch after this list):
  - Cool: 2 CPU threads
  - Turbo: 4 CPU threads
- No GPU acceleration
- No NPU acceleration
- No network calls required for inference
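As a rough illustration of how a Cool/Turbo toggle can map onto ONNX Runtime, the sketch below sets the intra-op thread count on the session options. This uses the standard Python `onnxruntime` API rather than the app's actual Kotlin code; `make_session` and the model path are illustrative.

```python
import onnxruntime as ort

def make_session(model_path: str, turbo: bool = False) -> ort.InferenceSession:
    opts = ort.SessionOptions()
    # Cool mode: 2 CPU threads; Turbo mode: 4 CPU threads.
    opts.intra_op_num_threads = 4 if turbo else 2
    return ort.InferenceSession(model_path, sess_options=opts,
                                providers=["CPUExecutionProvider"])
```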
### Tested device
Early demo testing was done on a Samsung A25-class Android phone.

Observed behavior:
- Model loads locally from app storage
- Generation works fully offline
- CPU-only generation is slow but usable for demo purposes
- Observed speed was around 0.20 tok/s, depending on temperature, prompt length, and thread mode

This is not yet optimized.
### Install
Download the APK: LuluLocal-Android-CPU-fp32.apk

On Android:
1. Open the APK file.
2. Allow installs from unknown sources if Android asks.
3. Install.
4. Open Lulu.
5. Wait for the model to load.
6. Ask a question.

First load may take longer because the app prepares the local ONNX model.
### Direct Use

**Privacy:** inference is local. The demo is designed so prompts are processed on-device; no cloud inference is required. If you build or modify the app, review the source code and Android permissions yourself.
### Out-of-Scope Use

**Important warning:** this is an experimental local AI demo. The model may:
- hallucinate
- answer incorrectly
- repeat itself
- generate incomplete text
- be slow on low-end hardware
- consume significant battery and RAM

Do not use it for medical, legal, financial, emergency, or safety-critical decisions.
## Bias, Risks, and Limitations

Current limitations:
- CPU only
- the FP32 ONNX model is large
- no NPU backend yet
- no GPU/Vulkan backend yet
- no quantization yet
- context length currently limited
- large APK size
- generation quality is still experimental
## Model Card Authors

Built by Open Machine.
Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.
## Model Card Contact
Open Machine
info@theopenmachine.com