jakmro commited on
Commit
e5f27a4
·
verified ·
1 Parent(s): 3e9e700

Update README.md

Files changed (1)
  1. README.md +134 -146
README.md CHANGED
@@ -1,81 +1,37 @@
1
- Energy-efficient kernels & inference engine for phones.
2
-
3
- ## Why Cactus?
4
- - Phones run on battery, GPUs drain energy and heat the devices.
5
- 70% of phones today don't ship NPUs, which most frameworks optimise for.
6
- Cactus is optimised for old and new ARM CPUs first, with NPU/DSP/ISP support coming.
7
- - Fast on all phones with less battery drain and heating.
8
-
9
- ## Performance (CPU only)
10
-
11
- - Speed for various sizes can be estimated proportionally
12
- INT4 will give 30% gains when merged
13
- GPUs yield speed gains but drain battery, so they are passed over in favour of NPUs
14
-
15
- | Device | Qwen3-INT8-600m (toks/sec) |
16
- |:------------------------------|:------------------------:|
17
- | iPhone 17 Pro | 74 |
18
- | Galaxy S25 Ultra / 16 Pro | 58 |
19
- | iPhone 16 / Galaxy S25 / Nothing 3 | 52 |
20
- | iPhone 15 Pro | 48 |
21
- | iPhone 14 Pro / OnePlus 13 5G | 47 |
22
- | Galaxy S24 Ultra / iPhone 15 | 42 |
23
- | OnePlus Open / Galaxy S23 | 41 |
24
- | iPhone 13 Pro / OnePlus 12 | 38 |
25
- | iPhone 13 mini / Redmi K70 Ultra / Xiaomi 13 / OnePlus 11 | 27 |
26
- | Pixel 6a / Nothing 3a / iPhone X / Galaxy S21 | 16 |
27
-
28
- ## File Size Comparison
29
-
30
- | Format | Size (Qwen3-0.6B-INT8) |
31
- |--------|------------------------|
32
- | Cactus | 370-420 MB |
33
- | ONNX/TFLite/MLX | 600 MB |
34
- | GGUF | 800 MB |
35
- | Executorch | 944 MB |
36
-
37
- ## Battery drain
38
-
39
- Newer devices have bigger batteries
40
- - NPUs are designed for less drain (2-10x)
41
- Apple Intelligence drains 0.6 percent/min on iPhone 16 Pro Max
42
-
43
- | Device | Qwen3-INT8-600m (percent/min) |
44
- |:------------------------------|:------------------------:|
45
- | OnePlus 13 5G | 0.33 |
46
- | Redmi K70 Ultra / OnePlus 12 | 0.41 |
47
- | Galaxy S25 Ultra / iPhone 17 Pro / Nothing 3 | 0.44 |
48
- | Galaxy S24 Ultra / Nothing 3a / Pixel 6a | 0.48 |
49
- | iPhone 16 Pro Max / Xiaomi 13 | 0.50 |
50
-
51
- ## Design
52
 ```
53
 ┌─────────────────┐
54
- │ Cactus FFI │ ←── OpenAI compatible C API for integration
55
 └─────────────────┘
56
 │
57
 ┌─────────────────┐
58
- │ Cactus Engine │ ←── High-level transformer engine
59
 └─────────────────┘
60
 │
61
 ┌─────────────────┐
62
- │ Cactus Graph │ ←── Unified zero-copy computation graph
63
 └─────────────────┘
64
 │
65
 ┌─────────────────┐
66
- │ Cactus Kernels │ ←── Low-level ARM-specific SIMD operations
67
 └─────────────────┘
68
 ```
69
 
70
- ## Cactus Graph & Kernels
71
- Cactus Graph is a general numerical computing framework that runs on Cactus Kernels.
72
- Great for implementing custom models and scientific computing, like JAX for phones.
73
-
74
  ```cpp
75
 #include "cactus.h"
76
 
77
  CactusGraph graph;
78
-
79
  auto a = graph.input({2, 3}, Precision::FP16);
80
  auto b = graph.input({3, 4}, Precision::INT8);
81
 
@@ -88,27 +44,28 @@ float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
88
 
89
  graph.set_input(a, a_data, Precision::FP16);
90
  graph.set_input(b, b_data, Precision::INT8);
91
- graph.execute();
92
 
 
93
  void* output_data = graph.get_output(result);
 
94
  graph.hard_reset();
95
 
96
  ```
97
 
98
- ## Cactus Engine & APIs
99
- Cactus Engine is a transformer inference engine built on top of Cactus Graphs.
100
- It is abstracted via Cactus Foreign Function Interface APIs.
101
- Header files are self-documenting but documentation contributions are welcome.
102
-
103
  ```cpp
104
 #include "cactus.h"
105
 
106
- const char* model_path = "path/to/weight/folder";
107
- cactus_model_t model = cactus_init(model_path, 2048);
108
 
109
  const char* messages = R"([
110
  {"role": "system", "content": "You are a helpful assistant."},
111
- {"role": "user", "content": "/nothink My name is Henry Ndubuaku"}
112
  ])";
113
 
114
  const char* options = R"({
@@ -116,87 +73,118 @@ const char* options = R"({
116
  "stop_sequences": ["<|im_end|>"]
117
  })";
118
 
119
- char response[1024];
120
- int result = cactus_complete(model, messages, response, sizeof(response), options, nullptr, nullptr, nullptr);
121
  ```
122
-
123
- With tool support:
124
- ```cpp
125
- const char* tools = R"([
126
- {
127
- "function": {
128
- "name": "get_weather",
129
- "description": "Get weather for a location",
130
- "parameters": {
131
- "properties": {
132
- "location": {
133
- "type": "string",
134
- "description": "City name",
135
- "required": true
136
- }
137
- },
138
- "required": ["location"]
139
- }
140
- }
141
- }
142
- ])";
143
-
144
- int result = cactus_complete(model, messages, response, sizeof(response), options, tools, nullptr, nullptr);
145
  ```
146
 
147
- ## Using Cactus in your apps
148
- Cactus SDKs run 500k+ weekly inference tasks in production today, try them!
149
-
150
- <a href="https://github.com/cactus-compute/cactus-flutter" target="_blank">
151
- <img alt="Flutter" src="https://img.shields.io/badge/Flutter-grey.svg?style=for-the-badge&logo=Flutter&logoColor=white">
152
- </a> <a href="https://github.com/cactus-compute/cactus-react" target="_blank">
153
- <img alt="React Native" src="https://img.shields.io/badge/React%20Native-grey.svg?style=for-the-badge&logo=react&logoColor=%2361DAFB">
154
- </a> <a href="https://github.com/cactus-compute/cactus-kotlin" target="_blank">
155
- <img alt="Kotlin" src="https://img.shields.io/badge/Kotlin_MP-grey.svg?style=for-the-badge&logo=kotlin&logoColor=white">
156
- </a>
157
-
158
- ## Getting started
159
- <a href="https://cactuscompute.com/docs" target="_blank">
160
- <img alt="Documentation" src="https://img.shields.io/badge/Documentation-4A90E2?style=for-the-badge&logo=gitbook&logoColor=white">
161
- </a> <a href="https://discord.gg/bNurx3AXTJ" target="_blank">
162
- <img alt="Discord" src="https://img.shields.io/badge/Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white">
163
- </a>
164
-
165
- ## Demo
166
- <a href="https://apps.apple.com/gb/app/cactus-chat/id6744444212" target="_blank">
167
- <img alt="Download iOS App" src="https://img.shields.io/badge/Try_iOS_Demo-grey?style=for-the-badge&logo=apple&logoColor=white">
168
- </a> <a href="https://play.google.com/store/apps/details?id=com.rshemetsubuser.myapp&pcampaignid=web_share" target="_blank">
169
- <img alt="Download Android App" src="https://img.shields.io/badge/Try_Android_Demo-grey?style=for-the-badge&logo=android&logoColor=white">
170
- </a>
171
-
172
- ## Using this repo
173
- You can run this code directly on M-series MacBooks since they are ARM-based.
174
- A vanilla M3, CPU-only, can run Qwen3-600m-INT8 at 60-70 toks/sec; just run the following:
175
 
176
  ```bash
177
- ./tests/run.sh  # chmod +x tests/run.sh the first time
178
  ```
179
 
180
- ## Generating weights from HuggingFace
181
- Use any of the following models (270m, 350m, 360m, 600m, 750m, 1B, 1.2B, 1.7B activated params):
182
- ```bash
183
- # Language models
184
- python3 tools/convert_hf.py google/gemma-3-270m-it weights/gemma3-270m/ --precision INT8
185
- python3 tools/convert_hf.py LiquidAI/LFM2-350M weights/lfm2-350m/ --precision INT8
186
- python3 tools/convert_hf.py HuggingFaceTB/SmolLM2-360m-Instruct weights/smollm2-360m/ --precision INT8
187
- python3 tools/convert_hf.py Qwen/Qwen3-0.6B weights/qwen3-600m/ --precision INT8
188
- python3 tools/convert_hf.py LiquidAI/LFM2-700M weights/lfm2-700m/ --precision INT8
189
- python3 tools/convert_hf.py google/gemma-3-1b-it weights/gemma3-1b/ --precision INT8
190
- python3 tools/convert_hf.py LiquidAI/LFM2-1.2B weights/lfm2-1.2B/ --precision INT8
191
- python3 tools/convert_hf.py Qwen/Qwen3-1.7B weights/qwen3-1.7B/ --precision INT8
192
-
193
- # Embedding models
194
- python3 tools/convert_hf.py Qwen/Qwen3-Embedding-0.6B weights/qwen3-embed-600m/ --precision INT8
195
- python3 tools/convert_hf.py nomic-ai/nomic-embed-text-v2-moe weights/nomic/ --precision INT8
196
- ```
197
-
198
- Simply replace the weight path in `tests/test_engine.cpp` with your choice.
199
-
200
- ## Limitations
201
- While Cactus can be used on all Apple devices including MacBooks, for desktop AMD/Intel/Nvidia machines generally,
202
- please use HuggingFace, Llama.cpp, Ollama, vLLM, or MLX. They're built for those, support x86, and are all
1
+ ---
2
+ thumbnail: >-
3
+ https://cdn-uploads.huggingface.co/production/uploads/6690e4cacadc8dd5b9008614/cD9bvFNEmDvjWeKzd2Gtu.jpeg
4
+ ---
5
+
6
+ Cross-platform & energy-efficient kernels, runtime and AI inference engine for mobile devices.
7
+
8
 ```
9
 ┌─────────────────┐
10
+ │ Cactus FFI │ ←── OpenAI compatible C API for integration (tools, RAG, cloud handoff)
11
+ └─────────────────┘
12
+ │
13
+ ┌─────────────────┐
14
+ │ Cactus Engine │ ←── High-level transformer engine (NPU support, INT4/INT8/FP16/MIXED)
15
 └─────────────────┘
16
 │
17
 ┌─────────────────┐
18
+ │ Cactus Models │ ←── Implements SOTA models using Cactus Graphs
19
 └─────────────────┘
20
 │
21
 ┌─────────────────┐
22
+ │ Cactus Graph │ ←── Unified zero-copy computation graph (think NumPy for mobile)
23
 └─────────────────┘
24
 │
25
 ┌─────────────────┐
26
+ │ Cactus Kernels │ ←── Low-level ARM-specific SIMD operations (think CUDA for mobile)
27
 └─────────────────┘
28
 ```
29
 
30
+ # Cactus Graph & Kernels
31
  ```cpp
32
 #include "cactus.h"
33
 
34
  CactusGraph graph;
 
35
  auto a = graph.input({2, 3}, Precision::FP16);
36
  auto b = graph.input({3, 4}, Precision::INT8);
37
 
 
44
 
45
  graph.set_input(a, a_data, Precision::FP16);
46
  graph.set_input(b, b_data, Precision::INT8);
 
47
 
48
+ graph.execute();
49
  void* output_data = graph.get_output(result);
50
+
51
  graph.hard_reset();
52
 
53
  ```
54
 
55
+ # Cactus Engine & FFI
56
  ```cpp
57
 #include "cactus.h"
58
 
59
+ cactus_set_pro_key(""); // email founders@cactuscompute.com for optional key
60
+
61
+ cactus_model_t model = cactus_init(
62
+ "path/to/weight/folder", // see the section on generating weights below
63
+ "txt/or/md/file/or/dir/with/many" // nullptr if none, cactus does automatic fast RAG
64
+ );
65
 
66
  const char* messages = R"([
67
  {"role": "system", "content": "You are a helpful assistant."},
68
+ {"role": "user", "content": "My name is Henry Ndubuaku"}
69
  ])";
70
 
71
  const char* options = R"({
 
73
  "stop_sequences": ["<|im_end|>"]
74
  })";
75
 
76
+ char response[4096];
77
+ int result = cactus_complete(
78
+ model, // model handle from cactus_init
79
+ messages, // JSON array of chat messages
80
+ response, // buffer to store response JSON
81
+ sizeof(response), // size of response buffer
82
+ options, // optional: generation options (nullptr for defaults)
83
+ nullptr, // optional: tools JSON for function calling
84
+ nullptr, // optional: streaming callback fn(token, id, user_data)
85
+ nullptr // optional: user data passed to callback
86
+ );
87
  ```
88
+ Example response from Gemma3-270m:
89
+ ```json
90
+ {
91
+ "success": true, // when successfully generated locally
92
+ "error": null, // returns specific errors if success = false
93
+ "cloud_handoff": false, // true when model is unconfident, simply route to cloud
94
+ "response": "Hi there!", // null when error is not null or cloud_handoff = true
95
+ "function_calls": [], // parsed to [{"name":"set_alarm","arguments":{"hour":"10","minute":"0"}}]
96
+ "confidence": 0.8193, // how confident the model is with its response
97
+ "time_to_first_token_ms": 45.23, // latency (time to first token)
98
+ "total_time_ms": 163.67, // total execution time
99
+ "prefill_tps": 1621.89, // prefill tokens per second
100
+ "decode_tps": 168.42, // decode tokens per second
101
+ "ram_usage_mb": 245.67, // current process RAM usage in MB
102
+ "prefill_tokens": 28,
103
+ "decode_tokens": 50,
104
+ "total_tokens": 78
105
+ }
106
  ```
107
 
108
+ # Performance
109
+
110
+ - <sub>**Models:** LFM2-VL-450m & Whisper-Small</sub>
111
+ - <sub>**Precision:** Cactus smartly blends INT4, INT8 and F16 for all weights.</sub>
112
+ - <sub>**Decode** = toks/sec, **P/D** = prefill/decode, **VLM** = 256×256 image, **STT** = 30s audio</sub>
113
+ - <sub>**Cactus Pro**: Uses NPU for realtime and large context (Apple for now), scores are marked with *</sub>
114
+
115
+ | Device | Short Decode | 4k-P/D | VLM-TTFT | VLM-Dec | STT-TTFT | STT-Dec |
116
+ |--------|--------|--------|----------|---------|----------|---------|
117
+ | Mac M4 Pro | 170 | 989/150 | 0.2s/0.1s* | 168 | 1.0s/0.2s* | 92 |
118
+ | Mac M3 Pro | 140 | 890/123 | 0.3s/0.1s* | 149 | 1.5s/0.4s* | 81 |
119
+ | iPad/Mac M4 | 134 | 603/106 | 0.3s/0.1s* | 129 | 1.8s/0.3s* | 70 |
120
+ | iPad/Mac M3 | 117 | 525/93 | 0.4s/0.1s* | 111 | 2.8s/0.7s* | 61 |
121
+ | iPhone 17 Pro | 126 | 428/84 | 0.5s/0.1s* | 120 | 3.0s/0.6s* | 80 |
122
+ | iPhone 16 Pro | 106 | 380/81 | 0.6s/0.2s* | 101 | 4.3s/0.7s* | 75 |
123
+ | iPhone 15 Pro | 90 | 330/75 | 0.7s/0.3s* | 92 | 4.5s/0.8s* | 70 |
124
+ | Galaxy S25 Ultra | 80 | 355/52 | 0.7s | 70 | 3.6s/- | 32 |
125
+ | Nothing 3 | 56 | 320/46 | 0.8s | 54 | 4.5s | 55 |
126
+ | Pixel 6a | 25 | 108/24 | 2.3s | 25 | 9.6s | 15 |
127
+ | Raspberry Pi 5 | 20 | 292/18 | 1.7s | 23 | 15s | 16 |
128
+
129
+
130
+ # Supported models
131
+
132
+ - <sub>Cactus smartly and compactly blends INT4, INT8 and F16 for all weights.</sub>
133
+ - <sub>You could still quantize everything with one precision, but mixed is optimal.</sub>
134
+
135
+ | Model | Zipped Size | Completion | Tools | Vision | Embed | Speech | Pro |
136
+ |-------|------------------|------------|-------|--------|-------|--------|-----|
137
+ | google/gemma-3-270m-it | 252MB | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
138
+ | google/functiongemma-270m-it | 252MB | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
139
+ | openai/whisper-small | 283MB | ✗ | ✗ | ✗ | ✓ | ✓ | Apple |
140
+ | LiquidAI/LFM2-350M | 244MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
141
+ | LiquidAI/LFM2-VL-450M | 448MB | ✓ | ✗ | ✓ | ✓ | ✗ | Apple |
142
+ | nomic-ai/nomic-embed-text-v2-moe | 451MB | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
143
+ | Qwen/Qwen3-0.6B | 514MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
144
+ | Qwen/Qwen3-Embedding-0.6B | 514MB | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
145
+ | LiquidAI/LFM2-700M | 498MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
146
+ | google/gemma-3-1b-it | 642MB | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
147
+ | LiquidAI/LFM2.5-1.2B-Instruct | 474MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
148
+ | LiquidAI/LFM2-1.2B-RAG | 474MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
149
+ | LiquidAI/LFM2-1.2B-Tool | 474MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
150
+ | openai/whisper-medium | 658MB | ✗ | ✗ | ✗ | ✓ | ✓ | Apple |
151
+ | LiquidAI/LFM2.5-VL-1.6B | 954MB | ✓ | ✗ | ✓ | ✓ | ✗ | Apple |
152
+ | Qwen/Qwen3-1.7B | 749MB | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
153
+
154
+ # Using this repo on a Mac
155
 
156
  ```bash
157
+ git clone https://github.com/cactus-compute/cactus && cd cactus && source ./setup
158
  ```
159
 
160
+ - <sub> `[model]` is a HuggingFace name from the table above (default: `google/gemma-3-270m-it`)</sub>
161
+ - <sub> Common flags: `--precision INT4|INT8|FP16` (default: INT4), `--token <hf_token>`</sub>
162
+ - <sub>Always run `source ./setup` in any new terminal.</sub>
163
+
164
+ | Command | Description |
165
+ |---------|-------------|
166
+ | `cactus run [model]` | Opens playground (auto downloads model) |
167
+ | `cactus download [model]` | Downloads model to `./weights` |
168
+ | `cactus convert [model] [dir]` | Converts model, supports LoRA merging (`--lora <path>`) |
169
+ | `cactus build` | Builds for ARM (`--apple` or `--android`) |
170
+ | `cactus test` | Runs tests (`--ios` / `--android`, `--model [name/path]`) |
171
+ | `cactus clean` | Removes build artifacts |
172
+ | `cactus --help` | Shows all commands and flags |
173
+
174
+ # Using in your apps
175
+
176
+ - [Python for Mac](/python/)
177
+ - [React Native SDK](https://github.com/cactus-compute/cactus-react-native)
178
+ - [Swift Multiplatform SDK](https://github.com/mhayes853/swift-cactus)
179
+ - [Kotlin Multiplatform SDK](https://github.com/cactus-compute/cactus-kotlin)
180
+ - [Flutter SDK](https://github.com/cactus-compute/cactus-flutter)
181
+ - [Rust SDK](https://github.com/mrsarac/cactus-rs)
182
+
183
+ # Try demo apps
184
+
185
+ - [iOS Demo](https://apps.apple.com/gb/app/cactus-chat/id6744444212)
186
+ - [Android Demo](https://play.google.com/store/apps/details?id=com.rshemetsubuser.myapp)
187
+
188
+ # Maintaining Organisations
189
+ 1. [Cactus Compute, Inc](https://cactuscompute.com/)
190
+ 2. [UCLA's BruinAI](https://bruinai.org/)