Spaces:

Cactus-Compute
/

README

Configuration error

App Files Files Community

README / README.md

hmunachii

Update README.md

3e9e700 verified 3 months ago

preview code

raw

history blame contribute delete

7.99 kB

	Energy-efficient kernels & inference engine for phones.

	## Why Cactus?
	- Phones run on battery, GPUs drain energy and heat the devices.
	- 70% of phones today don't ship NPUs which most frameworks optimse for.
	- Cactus is optimsed for old and new ARM-CPU first, with NPU/DSP/ISP coming.
	- Fast on all phones with less battery drain and heating.

	## Performance (CPU only)

	- Speed for various sizes can be estimated proportionally
	- INT4 wiil give 30% gains when merged
	- GPUs yield gains but drain battery, will be passed on for NPUs

	\| Device \| Qwen3-INT8-600m (toks/sec) \|
	\|:------------------------------\|:------------------------:\|
	\| iPhone 17 Pro \| 74 \|
	\| Galaxy S25 Ultra / 16 Pro \| 58 \|
	\| iPhone 16 / Galaxy S25 / Nothing 3 \| 52 \|
	\| iPhone 15 Pro \| 48 \|
	\| iPhone 14 Pro / OnePlus 13 5G \| 47 \|
	\| Galaxy S24 Ultra / iPhone 15 \| 42 \|
	\| OnePlus Open / Galaxy S23 \| 41 \|
	\| iPhone 13 Pro / OnePlus 12 \| 38 \|
	\| iPhone 13 mini / Redmi K70 Ultra / Xiaomi 13 / OnePlus 11 \| 27 \|
	\| Pixel 6a / Nothing 3a / iPhone X / Galaxy S21 \| 16 \|

	## File Size Comparison

	\| Format \| Size (Qwen3-0.6B-INT8) \|
	\|--------\|------------------------\|
	\| Cactus \| 370-420 MB \|
	\| ONNX/TFLite/MLX \| 600 MB \|
	\| GGUF \| 800 MB \|
	\| Executorch \| 944 MB \|

	## Battery drain

	- Newer devices have bigger battery
	- NPUs are designed for less drain (2-10x)
	- Apple Intelligence drain 0.6 percent/min on iPhone 16 Pro Max

	\| Device \| Qwen3-INT8-600m (percent/min) \|
	\|:------------------------------\|:------------------------:\|
	\| OnePlus 13 5G \| 0.33 \|
	\| Redmi K70 Ultra / OnePlus 12 \| 0.41 \|
	\| Galaxy S25 Ultra / Iphone 17 Pro / Nothing 3 \| 0.44 \|
	\| Galaxy S24 Ultra / Nothing 3a / Pixel 6a \| 0.48 \|
	\| iPhone 16 Pro Max / Xiaomi 13 \| 0.50 \|

	## Design
	```
	┌─────────────────┐
	│ Cactus FFI │ ←── OpenAI compatible C API for integration
	└─────────────────┘
	│
	┌─────────────────┐
	│ Cactus Engine │ ←── High-level transformer engine
	└─────────────────┘
	│
	┌─────────────────┐
	│ Cactus Graph │ ←── Unified zero-copy computation graph
	└─────────────────┘
	│
	┌─────────────────┐
	│ Cactus Kernels │ ←── Low-level ARM-specific SIMD operations
	└─────────────────┘
	```

	## Cactus Graph & Kernels
	Cactus Graph is a general numerical computing framework that runs on Cactus Kernels.
	Great for implementing custom models and scientific computing, like JAX for phones.

	```cpp
	#include cactus.h

	CactusGraph graph;

	auto a = graph.input({2, 3}, Precision::FP16);
	auto b = graph.input({3, 4}, Precision::INT8);

	auto x1 = graph.matmul(a, b, false);
	auto x2 = graph.transpose(x1);
	auto result = graph.matmul(b, x2, true);

	float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
	float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

	graph.set_input(a, a_data, Precision::FP16);
	graph.set_input(b, b_data, Precision::INT8);
	graph.execute();

	void* output_data = graph.get_output(result);
	graph.hard_reset();

	```

	## Cactus Engine & APIs
	Cactus Engine is a transformer inference engine built on top of Cactus Graphs.
	It is abstracted via Cactus Foreign Function Interface APIs.
	Header files are self-documenting but documentation contributions are welcome.

	```cpp
	#include cactus.h

	const char* model_path = "path/to/weight/folder";
	cactus_model_t model = cactus_init(model_path, 2048);

	const char* messages = R"([
	{"role": "system", "content": "You are a helpful assistant."},
	{"role": "user", "content": "/nothink My name is Henry Ndubuaku"}
	])";

	const char* options = R"({
	"max_tokens": 50,
	"stop_sequences": ["<\|im_end\|>"]
	})";

	char response[1024];
	int result = cactus_complete(model, messages, response, sizeof(response), options, nullptr, nullptr, nullptr);
	```

	With tool support:
	```cpp
	const char* tools = R"([
	{
	"function": {
	"name": "get_weather",
	"description": "Get weather for a location",
	"parameters": {
	"properties": {
	"location": {
	"type": "string",
	"description": "City name",
	"required": true
	}
	},
	"required": ["location"]
	}
	}
	}
	])";

	int result = cactus_complete(model, messages, response, sizeof(response), options, tools, nullptr, nullptr);
	```

	## Using Cactus in your apps
	Cactus SDKs run 500k+ weekly inference tasks in production today, try them!

	<a href="https://github.com/cactus-compute/cactus-flutter" target="_blank">
	<img alt="Flutter" src="https://img.shields.io/badge/Flutter-grey.svg?style=for-the-badge&logo=Flutter&logoColor=white">
	</a> <a href="https://github.com/cactus-compute/cactus-react" target="_blank">
	<img alt="React Native" src="https://img.shields.io/badge/React%20Native-grey.svg?style=for-the-badge&logo=react&logoColor=%2361DAFB">
	</a> <a href="https://github.com/cactus-compute/cactus-kotlin" target="_blank">
	<img alt="Kotlin" src="https://img.shields.io/badge/Kotlin_MP-grey.svg?style=for-the-badge&logo=kotlin&logoColor=white">
	</a>

	## Getting started
	<a href="https://cactuscompute.com/docs" target="_blank">
	<img alt="Documentation" src="https://img.shields.io/badge/Documentation-4A90E2?style=for-the-badge&logo=gitbook&logoColor=white">
	</a> <a href="https://discord.gg/bNurx3AXTJ" target="_blank">
	<img alt="Discord" src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white">
	</a>

	## Demo
	<a href="https://apps.apple.com/gb/app/cactus-chat/id6744444212" target="_blank">
	<img alt="Download iOS App" src="https://img.shields.io/badge/Try_iOS_Demo-grey?style=for-the-badge&logo=apple&logoColor=white">
	</a> <a href="https://play.google.com/store/apps/details?id=com.rshemetsubuser.myapp&pcampaignid=web_share" target="_blank">
	<img alt="Download Android App" src="https://img.shields.io/badge/Try_Android_Demo-grey?style=for-the-badge&logo=android&logoColor=white">
	</a>

	## Using this repo
	You can run these codes directly on M-series Macbooks since they are ARM-based.
	Vanilla M3 CPU-only can run Qwen3-600m-INT8 at 60-70 toks/sec, just run the following:

	```bash
	./tests/run.sh # chmod +x first time
	```

	## Generating weights from HuggingFace
	Use any of the (270m, 350m, 360m, 600m, 750m, 1B, 1.2B, 1.7B activated params):
	```bash
	# Language models
	python3 tools/convert_hf.py google/gemma-3-270m-it weights/gemma3-270m/ --precision INT8
	python3 tools/convert_hf.py LiquidAI/LFM2-350M weights/lfm2-350m/ --precision INT8
	python3 tools/convert_hf.py HuggingFaceTB/SmolLM2-360m-Instruct weights/smollm2-360m/ --precision INT8
	python3 tools/convert_hf.py Qwen/Qwen3-0.6B weights/qwen3-600m/ --precision INT8
	python3 tools/convert_hf.py LiquidAI/LFM2-700M weights/lfm2-700m/ --precision INT8
	python3 tools/convert_hf.py google/gemma-3-1b-it weights/gemma3-1b/ --precision INT8
	python3 tools/convert_hf.py LiquidAI/LFM2-1.2B weights/lfm2-1.2B/ --precision INT8
	python3 tools/convert_hf.py Qwen/Qwen3-1.7B weights/qwen3-1.7B/ --precision INT8

	# Embedding models
	python3 tools/convert_hf.py Qwen/Qwen3-Embedding-0.6B weights/qwen3-embed-600m/ --precision INT8
	python3 tools/convert_hf.py nomic-ai/nomic-embed-text-v2-moe weights/nomic/ --precision INT8
	```

	Simply replace the weight path `tests/test_engine.cpp` with your choice.

	## Limitations
	While Cactus can be used for all Apple devices including Macbooks, for computers/AMD/Intel/Nvidia generally,
	please use HuggingFace, Llama.cpp, Ollama, vLLM, MLX. They're built for those, support x86, and are all