---
license: apache-2.0
datasets:
- moca-embed/dclm_20b
- openbmb/UltraChat
language:
- en
library_name: transformers
---
# Model Card for Lulu Local

**Lulu Local** is a 245M-parameter transformer language model by **Open Machine**, packaged as a fully offline Android demo that runs CPU-only ONNX inference on the device.

## Model Details
- 245M parameters
- 4 layers
- Hidden size (d_size) 1280
- 16 MoE experts
- 8 KV heads
- FP32 ONNX export, 2.3 GB

Trained on only 20B tokens of web-text data.

Fine-tuned on 80K UltraChat examples as a full-parameter fine-tune; no LoRA or similar tricks.

### Model Description

**Lulu Local Android Demo**

**Lulu Local** is an offline Android AI demo by **Open Machine**.

This release runs a local Lulu language model directly on an Android phone using **ONNX Runtime CPU inference**.

No cloud.
No server.
No GPU.
No NPU.
No internet required after install.

Runs on the Samsung A25 5G.

This is a raw early proof that a custom local model can run directly on consumer Android hardware.

For the record, this is a literally unoptimized model: heavy Python loops in the export path and a pure 2.3 GB FP32 ONNX export. It currently runs on the CPU; we haven't touched the NPU, Vulkan, or anything else yet.

Generation currently takes about three minutes (a full forward pass over the 128-token context; as mentioned, it is unoptimized). The APK file is here, and GitHub repositories for the ONNX model and the Android app will follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory.

This runs on the phone's Exynos chip, and notably, after we chatted for 10 minutes the battery level didn't move and no heating occurred.

We completed everything in the last two days: training, benchmarks, fine-tuning, and the ONNX runtime integration, all for less than €1000.

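For readers curious what a single-token-step ONNX export can look like in practice, here is a minimal, hypothetical sketch using `torch.onnx.export`. The `ToyStep` module is a stand-in (an embedding plus a head, no real attention), and the file name `lulu_step.onnx` is invented; only the input/output names and tensor shapes follow the interface documented under "Model architecture note" below.

```python
# Hypothetical export sketch; NOT Open Machine's actual training code.
import torch
import torch.nn as nn

NUM_LAYERS = 24
HEADS, MAX_CTX, HEAD_DIM = 16, 128, 80   # cache shape [1, 16, 128, 80]
VOCAB, D = 32000, 1280

class ToyStep(nn.Module):
    """Stand-in with the same I/O signature as a stateful step model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, token_id, pos, *caches):
        # A real step model would attend over `caches` at position `pos`
        # and write fresh K/V entries; this stub only produces logits.
        h = self.embed(token_id).squeeze(1)   # [1, 1] int64 -> [1, D]
        h = h + 0.0 * pos.to(h.dtype)         # keep `pos` in the traced graph
        logits = self.head(h)                 # [1, VOCAB]
        return (logits, *[c.clone() for c in caches])

model = ToyStep().eval()
token_id = torch.zeros(1, 1, dtype=torch.int64)
pos = torch.zeros(1, dtype=torch.int64)
caches = tuple(torch.zeros(1, HEADS, MAX_CTX, HEAD_DIM)
               for _ in range(2 * NUM_LAYERS))

in_names = ["token_id", "pos"] + [f"{kv}_{i}" for i in range(NUM_LAYERS)
                                  for kv in ("k", "v")]
out_names = ["logits"] + [f"out_{kv}_{i}" for i in range(NUM_LAYERS)
                          for kv in ("k", "v")]

torch.onnx.export(
    model, (token_id, pos, *caches), "lulu_step.onnx",
    input_names=in_names, output_names=out_names, opset_version=17,
)
```
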
**Why this is interesting**

Most mobile LLM demos rely on one or more of the following:

- heavily quantized models
- GPU acceleration
- NPU acceleration
- server-side inference
- vendor SDKs
- cloud APIs

This demo is intentionally simple and direct:

- Android app
- ONNX Runtime
- local tokenizer
- local ONNX model
- CPU only

The current model is not small, not heavily optimized, and not using mobile accelerator tricks.
That is the point of the demo.

**Model architecture note**

The Android build uses a stateful single-token step ONNX export.

The runtime loop is: token_id + position + cache tensors → ONNX step model → logits + updated cache tensors → sample next token → repeat.

This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation.

Current ONNX interface:

Inputs:
- token_id: [1, 1] int64
- pos: [1] int64
- k_0, v_0 ... k_23, v_23

Outputs:
- logits: [1, 32000] float32
- out_k_0, out_v_0 ... out_k_23, out_v_23

Cache shape per K/V tensor: [1, 16, 128, 80]

Total runtime cache is about 31 MB (16 × 128 × 80 × 4 bytes ≈ 0.66 MB per tensor, times 48 tensors ≈ 31 MB).

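As a concrete illustration, a minimal decode loop against this interface could look like the following sketch (shown with the Python `onnxruntime` API; the file name `lulu_step.onnx` and the use of greedy sampling are assumptions for illustration only):

```python
import numpy as np
import onnxruntime as ort

NUM_LAYERS = 24
CACHE_SHAPE = (1, 16, 128, 80)  # documented cache shape per K/V tensor

# "lulu_step.onnx" is a placeholder name for the local step model.
sess = ort.InferenceSession("lulu_step.onnx",
                            providers=["CPUExecutionProvider"])

# Zero-initialised K/V caches, one pair per layer.
cache = {}
for i in range(NUM_LAYERS):
    cache[f"k_{i}"] = np.zeros(CACHE_SHAPE, dtype=np.float32)
    cache[f"v_{i}"] = np.zeros(CACHE_SHAPE, dtype=np.float32)

token = 1          # starting token id (e.g. BOS); prompt tokens are fed the same way
generated = []
for pos in range(128):  # context is capped at 128 positions
    feed = {
        "token_id": np.array([[token]], dtype=np.int64),
        "pos": np.array([pos], dtype=np.int64),
        **cache,
    }
    # Outputs come back in model order: logits, out_k_0, out_v_0, ...
    out = sess.run(None, feed)
    logits = out[0]  # [1, 32000]
    for i in range(NUM_LAYERS):
        cache[f"k_{i}"] = out[1 + 2 * i]
        cache[f"v_{i}"] = out[2 + 2 * i]
    token = int(np.argmax(logits[0]))  # greedy here; the app samples with temperature
    generated.append(token)
```
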
- **Developed by:** Open Machine
- **Model type:** Open Machine custom transformer
- **Language(s) (NLP):** English
- **License:** Apache 2.0

### Model Sources

- **Repository:** Will be provided in the coming days
- **Paper:** Coming soon
- **Demo:** LuluLocal-Android-CPU-fp32.apk (see Install below)

## Uses

**Demo highlights**

- Fully offline Android assistant
- Runs on mobile CPU only
- Stateful single-token ONNX generation
- Live token streaming UI
- Battery / RAM / speed display
- Cool / Turbo mode (see the thread-count sketch after this list)
  - Cool: 2 CPU threads
  - Turbo: 4 CPU threads
- No GPU acceleration
- No NPU acceleration
- No network calls required for inference

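A natural way to implement the Cool/Turbo switch is through ONNX Runtime's intra-op thread count. The sketch below uses the Python API for illustration; the Android app would set the equivalent option through ONNX Runtime's Java/Kotlin session options, and the model file name is a placeholder.

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 2  # "Cool" mode; use 4 for "Turbo"

# "lulu_step.onnx" is a placeholder name for the local step model.
sess = ort.InferenceSession(
    "lulu_step.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
```
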
**Tested device**

Early demo testing was done on a Samsung A25-class Android phone.

Observed behavior:

- Model loads locally from app storage
- Generation works fully offline
- CPU-only generation is slow but usable for demo purposes
- Example speed observed: around 0.20 tok/s, depending on temperature, prompt length, and thread mode

This is not yet optimized.

**Install**

Download the APK: LuluLocal-Android-CPU-fp32.apk

On Android:

1. Open the APK file.
2. Allow install from unknown sources if Android asks.
3. Install.
4. Open Lulu.
5. Wait for the model to load.
6. Ask a question.

First load may take longer because the app prepares the local ONNX model.

### Direct Use

**Privacy**

Inference is local.

The demo is designed so prompts are processed on-device.
No cloud inference is required.

If you build or modify the app, review the source code and Android permissions yourself.

### Out-of-Scope Use

**Important warning**

This is an experimental local AI demo.

The model may:

- hallucinate
- answer incorrectly
- repeat itself
- generate incomplete text
- be slow on low-end hardware
- consume significant battery and RAM

Do not use this for medical, legal, financial, emergency, or safety-critical decisions.

## Bias, Risks, and Limitations

**Current limitations**

- CPU only
- fp32 ONNX model is large
- no NPU backend yet
- no GPU/Vulkan backend yet
- no quantization yet
- context length currently limited
- APK size is large
- generation quality is still experimental

## Model Card Authors

**Credits**

Built by Open Machine.

Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.

## Model Card Contact

Open Machine
info@theopenmachine.com