---
license: apache-2.0
datasets:
- moca-embed/dclm_20b
- openbmb/UltraChat
language:
- en
library_name: transformers
---
# Model Card for Lulu



## Model Details
- 245M parameters
- 4 layers
- Hidden size: 1280
- 16 MoE experts
- 8 KV heads
- FP32 ONNX export: 2.3 GB

Trained on only 20B tokens of web text data.

Fine-tuned on 80K UltraChat samples, with no LoRA or similar tricks.

### Model Description

# Lulu Local Android Demo

**Lulu Local** is an offline Android AI demo by **Open Machine**.

This release runs a local Lulu language model directly on an Android phone using **ONNX Runtime CPU inference**.

No cloud.
No server.
No GPU.
No NPU.
No internet required after install.

Runs on the Samsung A25 5G.

This is a raw early proof that a custom local model can run directly on consumer Android hardware.

For the record, this is a deliberately unoptimized model: heavy Python loops and a pure ONNX export of 2.3 GB in FP32. It currently runs on the CPU only; we have not touched the NPU, Vulkan, or anything else yet.

Generation currently takes about three minutes (a full pass over the 128-token context; as mentioned, it is unoptimized). The APK file is here, and GitHub repositories for the ONNX model and the Android app will follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory.

It runs on the phone's Exynos CPU. After chatting for 10 minutes, the battery level did not move and the device did not heat up.

We completed everything in the last two days: training, benchmarks, fine-tuning, and the ONNX runtime integration, all for less than €1,000.

**Why this is interesting**

Most mobile LLM demos rely on one or more of the following:

- heavily quantized models
- GPU acceleration
- NPU acceleration
- server-side inference
- vendor SDKs
- cloud APIs

This demo is intentionally simple and direct:

Android app
+ ONNX Runtime
+ local tokenizer
+ local ONNX model
+ CPU only

The current model is not small, not heavily optimized, and not using mobile accelerator tricks.
That is the point of the demo.

**Model architecture note**

The Android build uses a stateful single-token step ONNX export.

The runtime loop is:

```
token_id + position + cache tensors
  → ONNX step model
  → logits + updated cache tensors
  → sample next token
  → repeat
```

This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation.

Current ONNX interface:

```
Inputs:
  token_id: [1, 1] int64
  pos: [1] int64
  k_0, v_0 ... k_23, v_23

Outputs:
  logits: [1, 32000] float32
  out_k_0, out_v_0 ... out_k_23, out_v_23
```

Cache shape per K/V tensor: `[1, 16, 128, 80]` (float32).

Total runtime cache is about 31 MB.
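The step interface above could be driven from Python roughly as follows. This is a minimal sketch, not the app's actual Android code: the tensor names come from the interface listed above, while `generate`, `run_step`, and the greedy sampling are illustrative assumptions.

```python
import numpy as np

NUM_LAYERS = 24                  # k_0..k_23 / v_0..v_23 in the export
CACHE_SHAPE = (1, 16, 128, 80)   # per K/V cache tensor, float32

# Sanity check: 48 float32 cache tensors of this shape is about 31 MB.
cache_bytes = NUM_LAYERS * 2 * int(np.prod(CACHE_SHAPE)) * 4
assert round(cache_bytes / 1e6) == 31

def generate(run_step, prompt_ids, max_new_tokens=16, ctx=128):
    """Drive the stateful single-token step model.

    run_step(feed) -> dict of output tensors; e.g. a thin wrapper
    around onnxruntime.InferenceSession.run for the step model.
    """
    # Zero-initialized KV cache, one K and one V tensor per layer.
    cache = {f"{kv}_{i}": np.zeros(CACHE_SHAPE, dtype=np.float32)
             for i in range(NUM_LAYERS) for kv in ("k", "v")}
    out_ids = list(prompt_ids)
    pos = 0
    while pos < len(out_ids) and pos < ctx:
        feed = {"token_id": np.array([[out_ids[pos]]], dtype=np.int64),
                "pos": np.array([pos], dtype=np.int64),
                **cache}
        out = run_step(feed)
        for i in range(NUM_LAYERS):          # carry the cache forward
            cache[f"k_{i}"] = out[f"out_k_{i}"]
            cache[f"v_{i}"] = out[f"out_v_{i}"]
        pos += 1
        # Once past the prompt, greedily append the next token.
        if pos == len(out_ids) and len(out_ids) < len(prompt_ids) + max_new_tokens:
            out_ids.append(int(np.argmax(out["logits"][0])))
    return out_ids
```

A sampling step with temperature would replace the `argmax` line; the cache-carrying structure stays the same.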

- **Developed by:** The Open Machine
- **Model type:** The Open Machine Transformers version
- **Language(s) (NLP):** English
- **License:** Apache 2.0

### Model Sources


- **Repository:** Will be provided in the coming days
- **Paper:** Coming soon
- **Demo:** [More Information Needed]

## Uses

**Demo highlights**

- Fully offline Android assistant
- Runs on mobile CPU only
- Stateful single-token ONNX generation
- Live token streaming UI
- Battery / RAM / speed display
- Cool / Turbo mode (Cool: 2 CPU threads; Turbo: 4 CPU threads)
- No GPU acceleration
- No NPU acceleration
- No network calls required for inference
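As a sketch of how the Cool/Turbo modes could map onto ONNX Runtime's threading controls. The `THREADS` mapping and function name are hypothetical; `intra_op_num_threads` is a real `SessionOptions` field, shown in comments to keep the snippet dependency-free.

```python
# Hypothetical mapping of the app's Cool/Turbo modes to CPU thread counts.
THREADS = {"cool": 2, "turbo": 4}

def session_thread_count(mode: str) -> int:
    """Return the CPU thread count for a given power mode."""
    return THREADS[mode.lower()]

# With onnxruntime installed, this could be applied as:
#   import onnxruntime as ort
#   opts = ort.SessionOptions()
#   opts.intra_op_num_threads = session_thread_count("turbo")
#   sess = ort.InferenceSession("lulu_step.onnx", opts)
```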

**Tested device**

Early demo testing was done on a Samsung A25-class Android phone.

Observed behavior:

- Model loads locally from app storage
- Generation works fully offline
- CPU-only generation is slow but usable for demo purposes
- Observed speed is around 0.20 tok/s, depending on temperature, prompt length, and thread mode

This is not yet optimized.

**Install**

Download the APK:

`LuluLocal-Android-CPU-fp32.apk`

On Android:

1. Open the APK file.
2. Allow install from unknown sources if Android asks.
3. Install.
4. Open Lulu.
5. Wait for the model to load.
6. Ask a question.

First load may take longer because the app prepares the local ONNX model.

### Direct Use


**Privacy**

Inference is local.

The demo is designed so prompts are processed on-device. No cloud inference is required.

If you build or modify the app, review the source code and Android permissions yourself.



### Out-of-Scope Use

**Important warning**

This is an experimental local AI demo. The model may:

- hallucinate
- answer incorrectly
- repeat itself
- generate incomplete text
- be slow on low-end hardware
- consume significant battery and RAM

Do not use this for medical, legal, financial, emergency, or safety-critical decisions.

## Bias, Risks, and Limitations


Current limitations:

- CPU only
- FP32 ONNX model is large
- no NPU backend yet
- no GPU/Vulkan backend yet
- no quantization yet
- context length currently limited
- APK size is large
- generation quality is still experimental


## Model Card Authors

Built by Open Machine.

Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.

## Model Card Contact

Open Machine
info@theopenmachine.com