Energy-efficient kernels & inference engine for phones.
## Why Cactus?
- Phones run on battery; GPUs drain energy and heat the device.
- 70% of phones today don't ship NPUs, which most frameworks optimise for.
- Cactus is optimised for old and new ARM CPUs first, with NPU/DSP/ISP support coming.
- Fast on all phones, with less battery drain and heating.
## Performance (CPU only)
- Speed for other model sizes can be estimated roughly in proportion to parameter count
- INT4 will give ~30% gains when merged
- GPUs yield gains but drain the battery, so they are being passed over in favour of NPUs
| Device | Qwen3-INT8-600m (toks/sec) |
|:------------------------------|:------------------------:|
| iPhone 17 Pro | 74 |
| Galaxy S25 Ultra / 16 Pro | 58 |
| iPhone 16 / Galaxy S25 / Nothing 3 | 52 |
| iPhone 15 Pro | 48 |
| iPhone 14 Pro / OnePlus 13 5G | 47 |
| Galaxy S24 Ultra / iPhone 15 | 42 |
| OnePlus Open / Galaxy S23 | 41 |
| iPhone 13 Pro / OnePlus 12 | 38 |
| iPhone 13 mini / Redmi K70 Ultra / Xiaomi 13 / OnePlus 11 | 27 |
| Pixel 6a / Nothing 3a / iPhone X / Galaxy S21 | 16 |
## File Size Comparison
| Format | Size (Qwen3-0.6B-INT8) |
|--------|------------------------|
| Cactus | 370-420 MB |
| ONNX/TFLite/MLX | 600 MB |
| GGUF | 800 MB |
| Executorch | 944 MB |
## Battery drain
- Newer devices have bigger batteries
- NPUs are designed for lower drain (2-10x less)
- For reference, Apple Intelligence drains 0.6 percent/min on an iPhone 16 Pro Max
| Device | Qwen3-INT8-600m (percent/min) |
|:------------------------------|:------------------------:|
| OnePlus 13 5G | 0.33 |
| Redmi K70 Ultra / OnePlus 12 | 0.41 |
| Galaxy S25 Ultra / iPhone 17 Pro / Nothing 3 | 0.44 |
| Galaxy S24 Ultra / Nothing 3a / Pixel 6a | 0.48 |
| iPhone 16 Pro Max / Xiaomi 13 | 0.50 |
## Design
```
┌─────────────────┐
│   Cactus FFI    │ ←── OpenAI compatible C API for integration
└─────────────────┘
         │
┌─────────────────┐
│  Cactus Engine  │ ←── High-level transformer engine
└─────────────────┘
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Unified zero-copy computation graph
└─────────────────┘
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── Low-level ARM-specific SIMD operations
└─────────────────┘
```
## Cactus Graph & Kernels
Cactus Graph is a general numerical computing framework that runs on Cactus Kernels.
It is great for implementing custom models and scientific computing; think JAX for phones.
```cpp
#include "cactus.h"
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();
```
## Cactus Engine & APIs
Cactus Engine is a transformer inference engine built on top of Cactus Graphs.
It is exposed through the Cactus Foreign Function Interface (FFI) APIs.
Header files are self-documenting but documentation contributions are welcome.
```cpp
#include "cactus.h"
const char* model_path = "path/to/weight/folder";
cactus_model_t model = cactus_init(model_path, 2048);
const char* messages = R"([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "/nothink My name is Henry Ndubuaku"}
])";
const char* options = R"({
"max_tokens": 50,
"stop_sequences": ["<|im_end|>"]
})";
char response[1024];
int result = cactus_complete(model, messages, response, sizeof(response), options, nullptr, nullptr, nullptr);
```
With tool support:
```cpp
const char* tools = R"([
{
"function": {
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"properties": {
"location": {
"type": "string",
"description": "City name",
"required": true
}
},
"required": ["location"]
}
}
}
])";
int result = cactus_complete(model, messages, response, sizeof(response), options, tools, nullptr, nullptr);
```
## Using Cactus in your apps
Cactus SDKs run 500k+ weekly inference tasks in production today. Try them!
<a href="https://github.com/cactus-compute/cactus-flutter" target="_blank">
<img alt="Flutter" src="https://img.shields.io/badge/Flutter-grey.svg?style=for-the-badge&logo=Flutter&logoColor=white">
</a> <a href="https://github.com/cactus-compute/cactus-react" target="_blank">
<img alt="React Native" src="https://img.shields.io/badge/React%20Native-grey.svg?style=for-the-badge&logo=react&logoColor=%2361DAFB">
</a> <a href="https://github.com/cactus-compute/cactus-kotlin" target="_blank">
<img alt="Kotlin" src="https://img.shields.io/badge/Kotlin_MP-grey.svg?style=for-the-badge&logo=kotlin&logoColor=white">
</a>
## Getting started
<a href="https://cactuscompute.com/docs" target="_blank">
<img alt="Documentation" src="https://img.shields.io/badge/Documentation-4A90E2?style=for-the-badge&logo=gitbook&logoColor=white">
</a> <a href="https://discord.gg/bNurx3AXTJ" target="_blank">
<img alt="Discord" src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white">
</a>
## Demo
<a href="https://apps.apple.com/gb/app/cactus-chat/id6744444212" target="_blank">
<img alt="Download iOS App" src="https://img.shields.io/badge/Try_iOS_Demo-grey?style=for-the-badge&logo=apple&logoColor=white">
</a> <a href="https://play.google.com/store/apps/details?id=com.rshemetsubuser.myapp&pcampaignid=web_share" target="_blank">
<img alt="Download Android App" src="https://img.shields.io/badge/Try_Android_Demo-grey?style=for-the-badge&logo=android&logoColor=white">
</a>
## Using this repo
You can run this code directly on M-series MacBooks since they are ARM-based.
A vanilla M3, CPU-only, runs Qwen3-600m-INT8 at 60-70 toks/sec. Just run the following:
```bash
./tests/run.sh # may need chmod +x on first run
```
## Generating weights from HuggingFace
Use any of the supported models (270m, 350m, 360m, 600m, 700m, 1B, 1.2B, 1.7B activated params):
```bash
# Language models
python3 tools/convert_hf.py google/gemma-3-270m-it weights/gemma3-270m/ --precision INT8
python3 tools/convert_hf.py LiquidAI/LFM2-350M weights/lfm2-350m/ --precision INT8
python3 tools/convert_hf.py HuggingFaceTB/SmolLM2-360m-Instruct weights/smollm2-360m/ --precision INT8
python3 tools/convert_hf.py Qwen/Qwen3-0.6B weights/qwen3-600m/ --precision INT8
python3 tools/convert_hf.py LiquidAI/LFM2-700M weights/lfm2-700m/ --precision INT8
python3 tools/convert_hf.py google/gemma-3-1b-it weights/gemma3-1b/ --precision INT8
python3 tools/convert_hf.py LiquidAI/LFM2-1.2B weights/lfm2-1.2B/ --precision INT8
python3 tools/convert_hf.py Qwen/Qwen3-1.7B weights/qwen3-1.7B/ --precision INT8
# Embedding models
python3 tools/convert_hf.py Qwen/Qwen3-Embedding-0.6B weights/qwen3-embed-600m/ --precision INT8
python3 tools/convert_hf.py nomic-ai/nomic-embed-text-v2-moe weights/nomic/ --precision INT8
```
Simply replace the weight path in `tests/test_engine.cpp` with your chosen model.
## Limitations
While Cactus can be used on all Apple devices including MacBooks, for computers generally (AMD/Intel/Nvidia),
please use HuggingFace, Llama.cpp, Ollama, vLLM, or MLX. They're built for those platforms and support x86.