jakmro commited on
Commit
e5f27a4
·
verified ·
1 Parent(s): 3e9e700

Update README.md

Files changed (1)
  1. README.md +134 -146
README.md CHANGED
@@ -1,81 +1,37 @@
1
- Energy-efficient kernels & inference engine for phones.
2
-
3
- ## Why Cactus?
4
- - Phones run on battery, GPUs drain energy and heat the devices.
5
- 70% of phones today don't ship NPUs, which most frameworks optimise for.
6
- Cactus is optimised for old and new ARM CPUs first, with NPU/DSP/ISP support coming.
7
- - Fast on all phones with less battery drain and heating.
8
-
9
- ## Performance (CPU only)
10
-
11
- - Speed for various sizes can be estimated proportionally
12
- INT4 will give 30% gains when merged
13
- GPUs yield speed gains but drain battery, so they are passed over in favour of NPUs
14
-
15
- | Device | Qwen3-INT8-600m (toks/sec) |
16
- |:------------------------------|:------------------------:|
17
- | iPhone 17 Pro | 74 |
18
- | Galaxy S25 Ultra / 16 Pro | 58 |
19
- | iPhone 16 / Galaxy S25 / Nothing 3 | 52 |
20
- | iPhone 15 Pro | 48 |
21
- | iPhone 14 Pro / OnePlus 13 5G | 47 |
22
- | Galaxy S24 Ultra / iPhone 15 | 42 |
23
- | OnePlus Open / Galaxy S23 | 41 |
24
- | iPhone 13 Pro / OnePlus 12 | 38 |
25
- | iPhone 13 mini / Redmi K70 Ultra / Xiaomi 13 / OnePlus 11 | 27 |
26
- | Pixel 6a / Nothing 3a / iPhone X / Galaxy S21 | 16 |
27
-
28
- ## File Size Comparison
29
-
30
- | Format | Size (Qwen3-0.6B-INT8) |
31
- |--------|------------------------|
32
- | Cactus | 370-420 MB |
33
- | ONNX/TFLite/MLX | 600 MB |
34
- | GGUF | 800 MB |
35
- | Executorch | 944 MB |
36
-
37
- ## Battery drain
38
-
39
- Newer devices have bigger batteries
40
- - NPUs are designed for less drain (2-10x)
41
- Apple Intelligence drains 0.6 percent/min on iPhone 16 Pro Max
42
-
43
- | Device | Qwen3-INT8-600m (percent/min) |
44
- |:------------------------------|:------------------------:|
45
- | OnePlus 13 5G | 0.33 |
46
- | Redmi K70 Ultra / OnePlus 12 | 0.41 |
47
- | Galaxy S25 Ultra / iPhone 17 Pro / Nothing 3 | 0.44 |
48
- | Galaxy S24 Ultra / Nothing 3a / Pixel 6a | 0.48 |
49
- | iPhone 16 Pro Max / Xiaomi 13 | 0.50 |
50
-
51
- ## Design
52
 ```
53
 ┌─────────────────┐
54
- │ Cactus FFI │ ←── OpenAI compatible C API for integration
55
 └─────────────────┘
56
 │
57
 ┌─────────────────┐
58
- │ Cactus Engine │ ←── High-level transformer engine
59
 └─────────────────┘
60
 │
61
 ┌─────────────────┐
62
- │ Cactus Graph │ ←── Unified zero-copy computation graph
63
 └─────────────────┘
64
 │
65
 ┌─────────────────┐
66
- │ Cactus Kernels │ ←── Low-level ARM-specific SIMD operations
67
 └─────────────────┘
68
 ```
69
 
70
- ## Cactus Graph & Kernels
71
- Cactus Graph is a general numerical computing framework that runs on Cactus Kernels.
72
- Great for implementing custom models and scientific computing, like JAX for phones.
73
-
74
  ```cpp
75
 #include "cactus.h"
76
 
77
  CactusGraph graph;
78
-
79
  auto a = graph.input({2, 3}, Precision::FP16);
80
  auto b = graph.input({3, 4}, Precision::INT8);
81
 
@@ -88,27 +44,28 @@ float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
88
 
89
  graph.set_input(a, a_data, Precision::FP16);
90
  graph.set_input(b, b_data, Precision::INT8);
91
- graph.execute();
92
 
 
93
  void* output_data = graph.get_output(result);
 
94
  graph.hard_reset();
95
 
96
  ```
97
 
98
- ## Cactus Engine & APIs
99
- Cactus Engine is a transformer inference engine built on top of Cactus Graphs.
100
- It is abstracted via Cactus Foreign Function Interface APIs.
101
- Header files are self-documenting but documentation contributions are welcome.
102
-
103
  ```cpp
104
 #include "cactus.h"
105
 
106
- const char* model_path = "path/to/weight/folder";
107
- cactus_model_t model = cactus_init(model_path, 2048);
108
 
109
  const char* messages = R"([
110
  {"role": "system", "content": "You are a helpful assistant."},
111
- {"role": "user", "content": "/nothink My name is Henry Ndubuaku"}
112
  ])";
113
 
114
  const char* options = R"({
@@ -116,87 +73,118 @@ const char* options = R"({
116
  "stop_sequences": ["<|im_end|>"]
117
  })";
118
 
119
- char response[1024];
120
- int result = cactus_complete(model, messages, response, sizeof(response), options, nullptr, nullptr, nullptr);
121
  ```
122
-
123
- With tool support:
124
- ```cpp
125
- const char* tools = R"([
126
- {
127
- "function": {
128
- "name": "get_weather",
129
- "description": "Get weather for a location",
130
- "parameters": {
131
- "properties": {
132
- "location": {
133
- "type": "string",
134
- "description": "City name",
135
- "required": true
136
- }
137
- },
138
- "required": ["location"]
139
- }
140
- }
141
- }
142
- ])";
143
-
144
- int result = cactus_complete(model, messages, response, sizeof(response), options, tools, nullptr, nullptr);
145
  ```
146
 
147
- ## Using Cactus in your apps
148
- Cactus SDKs run 500k+ weekly inference tasks in production today, try them!
149
-
150
- <a href="https://github.com/cactus-compute/cactus-flutter" target="_blank">
151
- <img alt="Flutter" src="https://img.shields.io/badge/Flutter-grey.svg?style=for-the-badge&logo=Flutter&logoColor=white">
152
- </a> <a href="https://github.com/cactus-compute/cactus-react" target="_blank">
153
- <img alt="React Native" src="https://img.shields.io/badge/React%20Native-grey.svg?style=for-the-badge&logo=react&logoColor=%2361DAFB">
154
- </a> <a href="https://github.com/cactus-compute/cactus-kotlin" target="_blank">
155
- <img alt="Kotlin" src="https://img.shields.io/badge/Kotlin_MP-grey.svg?style=for-the-badge&logo=kotlin&logoColor=white">
156
- </a>
157
-
158
- ## Getting started
159
- <a href="https://cactuscompute.com/docs" target="_blank">
160
- <img alt="Documentation" src="https://img.shields.io/badge/Documentation-4A90E2?style=for-the-badge&logo=gitbook&logoColor=white">
161
- </a> <a href="https://discord.gg/bNurx3AXTJ" target="_blank">
162
- <img alt="Discord" src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white">
163
- </a>
164
-
165
- ## Demo
166
- <a href="https://apps.apple.com/gb/app/cactus-chat/id6744444212" target="_blank">
167
- <img alt="Download iOS App" src="https://img.shields.io/badge/Try_iOS_Demo-grey?style=for-the-badge&logo=apple&logoColor=white">
168
- </a> <a href="https://play.google.com/store/apps/details?id=com.rshemetsubuser.myapp&pcampaignid=web_share" target="_blank">
169
- <img alt="Download Android App" src="https://img.shields.io/badge/Try_Android_Demo-grey?style=for-the-badge&logo=android&logoColor=white">
170
- </a>
171
-
172
- ## Using this repo
173
- You can run this code directly on M-series MacBooks since they are ARM-based.
174
- A vanilla M3, CPU-only, can run Qwen3-600m-INT8 at 60-70 toks/sec; just run the following:
175
 
176
  ```bash
177
- ./tests/run.sh  # chmod +x tests/run.sh the first time
178
  ```
179
 
180
- ## Generating weights from HuggingFace
181
- Use any of the following models (270m, 350m, 360m, 600m, 750m, 1B, 1.2B, 1.7B activated params):
182
- ```bash
183
- # Language models
184
- python3 tools/convert_hf.py google/gemma-3-270m-it weights/gemma3-270m/ --precision INT8
185
- python3 tools/convert_hf.py LiquidAI/LFM2-350M weights/lfm2-350m/ --precision INT8
186
- python3 tools/convert_hf.py HuggingFaceTB/SmolLM2-360m-Instruct weights/smollm2-360m/ --precision INT8
187
- python3 tools/convert_hf.py Qwen/Qwen3-0.6B weights/qwen3-600m/ --precision INT8
188
- python3 tools/convert_hf.py LiquidAI/LFM2-700M weights/lfm2-700m/ --precision INT8
189
- python3 tools/convert_hf.py google/gemma-3-1b-it weights/gemma3-1b/ --precision INT8
190
- python3 tools/convert_hf.py LiquidAI/LFM2-1.2B weights/lfm2-1.2B/ --precision INT8
191
- python3 tools/convert_hf.py Qwen/Qwen3-1.7B weights/qwen3-1.7B/ --precision INT8
192
-
193
- # Embedding models
194
- python3 tools/convert_hf.py Qwen/Qwen3-Embedding-0.6B weights/qwen3-embed-600m/ --precision INT8
195
- python3 tools/convert_hf.py nomic-ai/nomic-embed-text-v2-moe weights/nomic/ --precision INT8
196
- ```
197
-
198
- Simply replace the weight path in `tests/test_engine.cpp` with your choice.
199
-
200
- ## Limitations
201
- While Cactus can be used on all Apple devices including MacBooks, for desktop AMD/Intel/Nvidia machines generally,
202
- please use HuggingFace, Llama.cpp, Ollama, vLLM, or MLX. They're built for those, support x86, and are all
1
+ ---
2
+ thumbnail: >-
3
+ https://cdn-uploads.huggingface.co/production/uploads/6690e4cacadc8dd5b9008614/cD9bvFNEmDvjWeKzd2Gtu.jpeg
4
+ ---
5
+
6
+ Cross-platform & energy-efficient kernels, runtime and AI inference engine for mobile devices.
7
+
8
 ```
9
 ┌─────────────────┐
10
+ │ Cactus FFI │ ←── OpenAI compatible C API for integration (tools, RAG, cloud handoff)
11
+ └─────────────────┘
12
+ │
13
+ ┌─────────────────┐
14
+ │ Cactus Engine │ ←── High-level transformer engine (NPU support, INT4/INT8/FP16/MIXED)
15
 └─────────────────┘
16
 │
17
 ┌─────────────────┐
18
+ │ Cactus Models │ ←── Implements SOTA models using Cactus Graphs
19
 └─────────────────┘
20
 │
21
 ┌─────────────────┐
22
+ │ Cactus Graph │ ←── Unified zero-copy computation graph (think NumPy for mobile)
23
 └─────────────────┘
24
 │
25
 ┌─────────────────┐
26
+ │ Cactus Kernels │ ←── Low-level ARM-specific SIMD operations (think CUDA for mobile)
27
 └─────────────────┘
28
 ```
29
 
30
+ # Cactus Graph & Kernels
31
  ```cpp
32
 #include "cactus.h"
33
 
34
  CactusGraph graph;
 
35
  auto a = graph.input({2, 3}, Precision::FP16);
36
  auto b = graph.input({3, 4}, Precision::INT8);
37
 
 
44
 
45
  graph.set_input(a, a_data, Precision::FP16);
46
  graph.set_input(b, b_data, Precision::INT8);
 
47
 
48
+ graph.execute();
49
  void* output_data = graph.get_output(result);
50
+
51
  graph.hard_reset();
52
 
53
  ```
54
 
55
+ # Cactus Engine & FFI
56
  ```cpp
57
 #include "cactus.h"
58
 
59
+ cactus_set_pro_key(""); // email founders@cactuscompute.com for optional key
60
+
61
+ cactus_model_t model = cactus_init(
62
+ "path/to/weight/folder", // see the section on generating weights below
63
+ "txt/or/md/file/or/dir/with/many" // nullptr if none, cactus does automatic fast RAG
64
+ );
65
 
66
  const char* messages = R"([
67
  {"role": "system", "content": "You are a helpful assistant."},
68
+ {"role": "user", "content": "My name is Henry Ndubuaku"}
69
  ])";
70
 
71
  const char* options = R"({
 
73
  "stop_sequences": ["<|im_end|>"]
74
  })";
75
 
76
+ char response[4096];
77
+ int result = cactus_complete(
78
+ model, // model handle from cactus_init
79
+ messages, // JSON array of chat messages
80
+ response, // buffer to store response JSON
81
+ sizeof(response), // size of response buffer
82
+ options, // optional: generation options (nullptr for defaults)
83
+ nullptr, // optional: tools JSON for function calling
84
+ nullptr, // optional: streaming callback fn(token, id, user_data)
85
+ nullptr // optional: user data passed to callback
86
+ );
87
  ```
88
+ Example response from Gemma3-270m:
89
+ ```json
90
+ {
91
+ "success": true, // when successfully generated locally
92
+ "error": null, // returns specific errors if success = false
93
+ "cloud_handoff": false, // true when model is unconfident, simply route to cloud
94
+ "response": "Hi there!", // null when error is not null or cloud_handoff = true
95
+ "function_calls": [], // parsed to [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
96
+ "confidence": 0.8193, // how confident the model is with its response
97
+ "time_to_first_token_ms": 45.23, // latency (time to first token)
98
+ "total_time_ms": 163.67, // total execution time
99
+ "prefill_tps": 1621.89, // prefill tokens per second
100
+ "decode_tps": 168.42, // decode tokens per second
101
+ "ram_usage_mb": 245.67, // current process RAM usage in MB
102
+ "prefill_tokens": 28,
103
+ "decode_tokens": 50,
104
+ "total_tokens": 78
105
+ }
106
  ```
107
 
108
+ # Performance
109
+
110
+ - <sub>**Models:** LFM2-VL-450m & Whisper-Small</sub>
111
+ - <sub>**Precision:** Cactus smartly blends INT4, INT8 and F16 for all weights.</sub>
112
+ - <sub>**Decode** = toks/sec, **P/D** = prefill/decode, **VLM** = 256×256 image, **STT** = 30s audio</sub>
113
+ - <sub>**Cactus Pro**: Uses NPU for realtime and large context (Apple for now), scores are marked with *</sub>
114
+
115
+ | Device | Short Decode | 4k-P/D | VLM-TTFT | VLM-Dec | STT-TTFT | STT-Dec |
116
+ |--------|--------|--------|----------|---------|----------|---------|
117
+ | Mac M4 Pro | 170 | 989/150 | 0.2s/0.1s* | 168 | 1.0s/0.2s* | 92 |
118
+ | Mac M3 Pro | 140 | 890/123 | 0.3s/0.1s* | 149 | 1.5s/0.4s* | 81 |
119
+ | iPad/Mac M4 | 134 | 603/106 | 0.3s/0.1s* | 129 | 1.8s/0.3s* | 70 |
120
+ | iPad/Mac M3 | 117 | 525/93 | 0.4s/0.1s* | 111 | 2.8s/0.7s* | 61 |
121
+ | iPhone 17 Pro | 126 | 428/84 | 0.5s/0.1s* | 120 | 3.0s/0.6s* | 80 |
122
+ | iPhone 16 Pro | 106 | 380/81 | 0.6s/0.2s* | 101 | 4.3s/0.7s* | 75 |
123
+ | iPhone 15 Pro | 90 | 330/75 | 0.7s/0.3s* | 92 | 4.5s/0.8s* | 70 |
124
+ | Galaxy S25 Ultra | 80 | 355/52 | 0.7s | 70 | 3.6s/- | 32 |
125
+ | Nothing 3 | 56 | 320/46 | 0.8s | 54 | 4.5s | 55 |
126
+ | Pixel 6a | 25 | 108/24 | 2.3s | 25 | 9.6s | 15 |
127
+ | Raspberry Pi 5 | 20 | 292/18 | 1.7s | 23 | 15s | 16 |
128
+
129
+
130
+ # Supported models
131
+
132
+ - <sub>Cactus smartly and compactly blends INT4, INT8 and F16 for all weights.</sub>
133
+ - <sub>You could still quantize everything with one precision, but mixed is optimal.</sub>
134
+
135
+ | Model | Zipped Size | Completion | Tools | Vision | Embed | Speech | Pro |
136
+ |-------|------------------|------------|-------|--------|-------|--------|-----|
137
+ | google/gemma-3-270m-it | 252MB | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
138
+ | google/functiongemma-270m-it | 252MB | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
139
+ | openai/whisper-small | 283MB | ✗ | ✗ | ✗ | ✓ | ✓ | Apple |
140
+ | LiquidAI/LFM2-350M | 244MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
141
+ | LiquidAI/LFM2-VL-450M | 448MB | ✓ | ✗ | ✓ | ✓ | ✗ | Apple |
142
+ | nomic-ai/nomic-embed-text-v2-moe | 451MB | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
143
+ | Qwen/Qwen3-0.6B | 514MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
144
+ | Qwen/Qwen3-Embedding-0.6B | 514MB | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
145
+ | LiquidAI/LFM2-700M | 498MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
146
+ | google/gemma-3-1b-it | 642MB | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
147
+ | LiquidAI/LFM2.5-1.2B-Instruct | 474MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
148
+ | LiquidAI/LFM2-1.2B-RAG | 474MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
149
+ | LiquidAI/LFM2-1.2B-Tool | 474MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
150
+ | openai/whisper-medium | 658MB | ✗ | ✗ | ✗ | ✓ | ✓ | Apple |
151
+ | LiquidAI/LFM2.5-VL-1.6B | 954MB | ✓ | ✗ | ✓ | ✓ | ✗ | Apple |
152
+ | Qwen/Qwen3-1.7B | 749MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
153
+
154
+ # Using this repo on a Mac
155
 
156
  ```bash
157
+ git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
158
  ```
159
 
160
+ - <sub> `[model]` is a HuggingFace name from the table above (default: `google/gemma-3-270m-it`)</sub>
161
+ - <sub> Common flags: `--precision INT4|INT8|FP16` (default: INT4), `--token <hf_token>`</sub>
162
+ - <sub>Always run `source ./setup` in any new terminal.</sub>
163
+
164
+ | Command | Description |
165
+ |---------|-------------|
166
+ | `cactus run [model]` | Opens playground (auto downloads model) |
167
+ | `cactus download [model]` | Downloads model to `./weights` |
168
+ | `cactus convert [model] [dir]` | Converts model, supports LoRA merging (`--lora <path>`) |
169
+ | `cactus build` | Builds for ARM (`--apple` or `--android`) |
170
+ | `cactus test` | Runs tests (`--ios` / `--android`, `--model [name/path]`) |
171
+ | `cactus clean` | Removes build artifacts |
172
+ | `cactus --help` | Shows all commands and flags |
173
+
174
+ # Using in your apps
175
+
176
+ - [Python for Mac](/python/)
177
+ - [React Native SDK](https://github.com/cactus-compute/cactus-react-native)
178
+ - [Swift Multiplatform SDK](https://github.com/mhayes853/swift-cactus)
179
+ - [Kotlin Multiplatform SDK](https://github.com/cactus-compute/cactus-kotlin)
180
+ - [Flutter SDK](https://github.com/cactus-compute/cactus-flutter)
181
+ - [Rust SDK](https://github.com/mrsarac/cactus-rs)
182
+
183
+ # Try demo apps
184
+
185
+ - [iOS Demo](https://apps.apple.com/gb/app/cactus-chat/id6744444212)
186
+ - [Android Demo](https://play.google.com/store/apps/details?id=com.rshemetsubuser.myapp)
187
+
188
+ # Maintaining Organisations
189
+ 1. [Cactus Compute, Inc](https://cactuscompute.com/)
190
+ 2. [UCLA's BruinAI](https://bruinai.org/)