---
license: apache-2.0
datasets:
- moca-embed/dclm_20b
- openbmb/UltraChat
language:
- en
library_name: transformers
---
# Model Card for Lulu Local

**Lulu Local** is a 245M-parameter transformer language model by **Open Machine**, packaged as a fully offline Android demo that runs CPU-only ONNX inference on the device.

## Model Details
- 245M parameters
- 4 layers
- Hidden size (d_size) 1280
- 16 MoE experts
- 8 KV heads
- FP32 ONNX export, 2.3 GB

Trained on only 20B tokens of web-text data.

Fine-tuned on 80K UltraChat examples as a full-parameter fine-tune; no LoRA or similar tricks.

### Model Description

**Lulu Local Android Demo**

**Lulu Local** is an offline Android AI demo by **Open Machine**.

This release runs a local Lulu language model directly on an Android phone using **ONNX Runtime CPU inference**.

No cloud.
No server.
No GPU.
No NPU.
No internet required after install.

Runs on the Samsung A25 5G.

This is a raw early proof that a custom local model can run directly on consumer Android hardware.

For the record, this is a literally unoptimized model: heavy Python loops in the export path and a pure 2.3 GB FP32 ONNX export. It currently runs on the CPU; we haven't touched the NPU, Vulkan, or anything else yet.

Generation currently takes about three minutes (a full forward pass over the 128-token context; as mentioned, it is unoptimized). The APK file is here, and GitHub repositories for the ONNX model and the Android app will follow. Again, no custom runtimes: just the standard ONNX format loaded straight into Android memory.

This runs on the phone's Exynos chip, and notably, after we chatted for 10 minutes the battery level didn't move and no heating occurred.

We completed everything in the last two days: training, benchmarks, fine-tuning, and the ONNX runtime integration, all for less than €1000.

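For readers curious what a single-token-step ONNX export can look like in practice, here is a minimal, hypothetical sketch using `torch.onnx.export`. The `ToyStep` module is a stand-in (an embedding plus a head, no real attention), and the file name `lulu_step.onnx` is invented; only the input/output names and tensor shapes follow the interface documented under "Model architecture note" below.

```python
# Hypothetical export sketch; NOT Open Machine's actual training code.
import torch
import torch.nn as nn

NUM_LAYERS = 24
HEADS, MAX_CTX, HEAD_DIM = 16, 128, 80   # cache shape [1, 16, 128, 80]
VOCAB, D = 32000, 1280

class ToyStep(nn.Module):
    """Stand-in with the same I/O signature as a stateful step model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, token_id, pos, *caches):
        # A real step model would attend over `caches` at position `pos`
        # and write fresh K/V entries; this stub only produces logits.
        h = self.embed(token_id).squeeze(1)   # [1, 1] int64 -> [1, D]
        h = h + 0.0 * pos.to(h.dtype)         # keep `pos` in the traced graph
        logits = self.head(h)                 # [1, VOCAB]
        return (logits, *[c.clone() for c in caches])

model = ToyStep().eval()
token_id = torch.zeros(1, 1, dtype=torch.int64)
pos = torch.zeros(1, dtype=torch.int64)
caches = tuple(torch.zeros(1, HEADS, MAX_CTX, HEAD_DIM)
               for _ in range(2 * NUM_LAYERS))

in_names = ["token_id", "pos"] + [f"{kv}_{i}" for i in range(NUM_LAYERS)
                                  for kv in ("k", "v")]
out_names = ["logits"] + [f"out_{kv}_{i}" for i in range(NUM_LAYERS)
                          for kv in ("k", "v")]

torch.onnx.export(
    model, (token_id, pos, *caches), "lulu_step.onnx",
    input_names=in_names, output_names=out_names, opset_version=17,
)
```
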
**Why this is interesting**

Most mobile LLM demos rely on one or more of the following:

- heavily quantized models
- GPU acceleration
- NPU acceleration
- server-side inference
- vendor SDKs
- cloud APIs

This demo is intentionally simple and direct:

- Android app
- ONNX Runtime
- local tokenizer
- local ONNX model
- CPU only

The current model is not small, not heavily optimized, and not using mobile accelerator tricks.
That is the point of the demo.

**Model architecture note**

The Android build uses a stateful single-token step ONNX export.

The runtime loop is: token_id + position + cache tensors → ONNX step model → logits + updated cache tensors → sample next token → repeat.

This replaced the earlier full-sequence ONNX path, which was much slower and used much more memory during generation.

Current ONNX interface:

Inputs:
- token_id: [1, 1] int64
- pos: [1] int64
- k_0, v_0 ... k_23, v_23

Outputs:
- logits: [1, 32000] float32
- out_k_0, out_v_0 ... out_k_23, out_v_23

Cache shape per K/V tensor: [1, 16, 128, 80]

Total runtime cache is about 31 MB (16 × 128 × 80 × 4 bytes ≈ 0.66 MB per tensor, times 48 tensors ≈ 31 MB).

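As a concrete illustration, a minimal decode loop against this interface could look like the following sketch (shown with the Python `onnxruntime` API; the file name `lulu_step.onnx` and the use of greedy sampling are assumptions for illustration only):

```python
import numpy as np
import onnxruntime as ort

NUM_LAYERS = 24
CACHE_SHAPE = (1, 16, 128, 80)  # documented cache shape per K/V tensor

# "lulu_step.onnx" is a placeholder name for the local step model.
sess = ort.InferenceSession("lulu_step.onnx",
                            providers=["CPUExecutionProvider"])

# Zero-initialised K/V caches, one pair per layer.
cache = {}
for i in range(NUM_LAYERS):
    cache[f"k_{i}"] = np.zeros(CACHE_SHAPE, dtype=np.float32)
    cache[f"v_{i}"] = np.zeros(CACHE_SHAPE, dtype=np.float32)

token = 1          # starting token id (e.g. BOS); prompt tokens are fed the same way
generated = []
for pos in range(128):  # context is capped at 128 positions
    feed = {
        "token_id": np.array([[token]], dtype=np.int64),
        "pos": np.array([pos], dtype=np.int64),
        **cache,
    }
    # Outputs come back in model order: logits, out_k_0, out_v_0, ...
    out = sess.run(None, feed)
    logits = out[0]  # [1, 32000]
    for i in range(NUM_LAYERS):
        cache[f"k_{i}"] = out[1 + 2 * i]
        cache[f"v_{i}"] = out[2 + 2 * i]
    token = int(np.argmax(logits[0]))  # greedy here; the app samples with temperature
    generated.append(token)
```
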
- **Developed by:** Open Machine
- **Model type:** Open Machine custom transformer
- **Language(s) (NLP):** English
- **License:** Apache 2.0

### Model Sources

- **Repository:** Will be provided in the coming days
- **Paper:** Coming soon
- **Demo:** LuluLocal-Android-CPU-fp32.apk (see Install below)

## Uses

**Demo highlights**

- Fully offline Android assistant
- Runs on mobile CPU only
- Stateful single-token ONNX generation
- Live token streaming UI
- Battery / RAM / speed display
- Cool / Turbo mode (see the thread-count sketch after this list)
  - Cool: 2 CPU threads
  - Turbo: 4 CPU threads
- No GPU acceleration
- No NPU acceleration
- No network calls required for inference

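A natural way to implement the Cool/Turbo switch is through ONNX Runtime's intra-op thread count. The sketch below uses the Python API for illustration; the Android app would set the equivalent option through ONNX Runtime's Java/Kotlin session options, and the model file name is a placeholder.

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 2  # "Cool" mode; use 4 for "Turbo"

# "lulu_step.onnx" is a placeholder name for the local step model.
sess = ort.InferenceSession(
    "lulu_step.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
```
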
**Tested device**

Early demo testing was done on a Samsung A25-class Android phone.

Observed behavior:

- Model loads locally from app storage
- Generation works fully offline
- CPU-only generation is slow but usable for demo purposes
- Example speed observed: around 0.20 tok/s, depending on temperature, prompt length, and thread mode

This is not yet optimized.

**Install**

Download the APK: LuluLocal-Android-CPU-fp32.apk

On Android:

1. Open the APK file.
2. Allow install from unknown sources if Android asks.
3. Install.
4. Open Lulu.
5. Wait for the model to load.
6. Ask a question.

First load may take longer because the app prepares the local ONNX model.

### Direct Use

**Privacy**

Inference is local.

The demo is designed so prompts are processed on-device.
No cloud inference is required.

If you build or modify the app, review the source code and Android permissions yourself.

### Out-of-Scope Use

**Important warning**

This is an experimental local AI demo.

The model may:

- hallucinate
- answer incorrectly
- repeat itself
- generate incomplete text
- be slow on low-end hardware
- consume significant battery and RAM

Do not use this for medical, legal, financial, emergency, or safety-critical decisions.

## Bias, Risks, and Limitations

**Current limitations**

- CPU only
- fp32 ONNX model is large
- no NPU backend yet
- no GPU/Vulkan backend yet
- no quantization yet
- context length currently limited
- APK size is large
- generation quality is still experimental

## Model Card Authors

**Credits**

Built by Open Machine.

Lulu is an experimental local AI assistant project focused on running useful AI directly on personal devices.

## Model Card Contact

Open Machine
info@theopenmachine.com