hmunachii committed · verified
Commit 070e145 · Parent(s): e0add3b

Update README.md (1 file changed: README.md, +122 −121)
![banner](https://cdn-uploads.huggingface.co/production/uploads/6690e4cacadc8dd5b9008614/Iw9rww1TDVTT9LS7kGLiv.jpeg)

```
┌─────────────────┐  Energy-efficient inference engine for running AI on mobile devices
│  Cactus Engine  │ ←── OpenAI-compatible APIs for C/C++, Swift, Kotlin, Flutter & React Native
└─────────────────┘  Supports tool calls, auto RAG, NPU, INT4 and cloud handoff for complex tasks
         │
┌─────────────────┐  Zero-copy computation graph; think PyTorch for mobile devices
│  Cactus Graph   │ ←── You can implement custom models directly with it
└─────────────────┘  Highly optimised for RAM & lossless weight quantisation
         │
┌─────────────────┐  Low-level ARM-specific SIMD kernels (Apple, Snapdragon, Google, Exynos, MediaTek & Raspberry Pi)
│ Cactus Kernels  │ ←── Optimised matrix multiplication
└─────────────────┘  Custom attention kernels with KV-cache quantisation, chunked prefill, streaming LLM, etc.
```

## Cactus Engine

```cpp
#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag"
);

const char* messages = R"([
…

char response[4096];
int result = cactus_complete(
    model,            // model handle from cactus_init
    messages,         // JSON array of chat messages
    response,         // buffer to store response JSON
    sizeof(response), // size of response buffer
    options,          // optional: generation options (nullptr for defaults)
    nullptr,          // optional: tools JSON for function calling
    nullptr,          // optional: streaming callback fn(token, id, user_data)
    nullptr           // optional: user data passed to callback
);
```
Example response from Gemma3-270m:
```json
{
  "success": true,                  // true when generated successfully on-device
  "error": null,                    // specific error message when success = false
  "cloud_handoff": false,           // true when the model is unconfident; simply route to cloud
  "response": "Hi there!",          // null when error is non-null or cloud_handoff = true
  "function_calls": [],             // parsed, e.g. [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
  "confidence": 0.8193,             // how confident the model is in its response
  "time_to_first_token_ms": 45.23,  // latency to the first token
  "total_time_ms": 163.67,          // total execution time
  "prefill_tps": 1621.89,           // prefill tokens per second
  "decode_tps": 168.42,             // decode tokens per second
  "ram_usage_mb": 245.67,           // current process RAM usage in MB
  "prefill_tokens": 28,
  "decode_tokens": 50,
  "total_tokens": 78
}
```

## Cactus Graph

```cpp
#include "cactus.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset();
```

## Benchmark (INT8)

| Device | LFM2.5-1.2B<br>(1k-Prefill/100-Decode) | LFM2.5-VL-1.6B<br>(256px Latency & Decode) | Whisper-Small<br>(30s-audio Latency & Decode) |
|--------|--------|--------|----------|
| Mac M4 Pro | 582/77 tps | 0.2s & 76 tps | 0.1s & 111 tps |
| iPad/Mac M4 | - | - | - |
| iPhone 17 Pro | 300/33 tps | 0.3s & 33 tps | 0.6s & 114 tps |
| Galaxy S25 Ultra | 226/36 tps | 2.6s & 33 tps | 2.3s & 90 tps |
| Pixel 10 Pro | - | - | - |
| Vivo X200 Pro | - | - | - |

| Device | LFM2-350m<br>(1k-Prefill/100-Decode) | LFM2-VL-450m<br>(256px Latency & Decode) | Moonshine-Base<br>(30s-audio Latency & Decode) |
|--------|--------|--------|----------|
| iPad/Mac M1 | - | - | - |
| iPhone 13 Mini | - | - | - |
| Galaxy A56 | - | - | - |
| Pixel 6a | 218/44 tps | 2.5s & 36 tps | 1.5s & 189 tps |
| Nothing CMF | - | - | - |
| Raspberry Pi 5 | - | - | - |

## Supported Models

| Model | Features |
|-------|----------|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | completion, tools |
| LiquidAI/LFM2-350M | completion, tools, embed |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| google/gemma-3-1b-it | completion |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| UsefulSensors/moonshine-base | transcription, speech embed |
| openai/whisper-small | transcription, speech embed, Apple NPU |
| openai/whisper-medium | transcription, speech embed, Apple NPU |
| nomic-ai/nomic-embed-text-v2-moe | embed |
| Qwen/Qwen3-Embedding-0.6B | embed |

## Using this repo on Mac
```bash
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
```

## Using this repo on Linux (Ubuntu/Debian)
```bash
sudo apt-get install python3 python3-venv python3-pip cmake build-essential libcurl4-openssl-dev
git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
```

| Command | Description |
|---------|-------------|
| `cactus download [model]` | Downloads a model to `./weights` |
| `cactus convert [model] [dir]` | Converts a model; supports LoRA merging (`--lora <path>`) |
| `cactus build` | Builds for ARM (`--apple` or `--android`) |
| `cactus test` | Runs tests (`--ios` / `--android`, `--model [name/path]`, `--precision`) |
| `cactus transcribe [model]` | Transcribes an audio file (`--file`) or live microphone input |
| `cactus clean` | Removes build artifacts |
| `cactus --help` | Shows all commands and flags (always run this first) |

## Using in your apps

- [Python for Mac](/python/)
- [React Native SDK](https://github.com/cactus-compute/cactus-react-native)
- [Flutter SDK](https://github.com/cactus-compute/cactus-flutter)
- [Rust SDK](https://github.com/mrsarac/cactus-rs)

## Try demo apps

- [iOS Demo](https://apps.apple.com/gb/app/cactus-chat/id6744444212)
- [Android Demo](https://play.google.com/store/apps/details?id=com.rshemetsubuser.myapp)
## Maintaining Organisations

1. [Cactus Compute, Inc](https://cactuscompute.com/)
2. [UCLA's BruinAI](https://bruinai.org/)
3. [Yale's AI Society](https://www.yale-ai.org/team)
4. [National University of Singapore's AI Society](https://www.nusaisociety.org/)
5. [UC Irvine's AI@UCI](https://aiclub.ics.uci.edu/)
6. [Imperial College's AI Society](https://www.imperialcollegeunion.org/csp/1391)
7. [University of Pennsylvania's AI@Penn](https://ai-at-penn-main-105.vercel.app/)
8. [University of Michigan Ann Arbor's MSAIL](https://msail.github.io/)
9. [University of Colorado Boulder's AI Club](https://www.cuaiclub.org/)

## Join The Community

- [Reddit Channel](https://www.reddit.com/r/cactuscompute/)