
Cactus

A hybrid, low-latency, energy-efficient AI engine for mobile devices and wearables.

┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘     Chat, vision, STT, RAG, tool call, cloud handoff
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘     Custom models, optimised for RAM & quantisation
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc.)
└─────────────────┘     Custom attention, KV-cache quant, chunked prefill

Quick Demo

Step 1: brew install cactus-compute/cactus/cactus
Step 2: cactus transcribe or cactus run

Cactus Engine

#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",                   // model weights directory
    "path to txt or dir of txts for auto-rag"  // corpus for auto-RAG
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,            // model handle
    messages,         // JSON chat messages
    response,         // response buffer
    sizeof(response), // buffer size
    options,          // generation options
    nullptr,          // tools JSON
    nullptr,          // streaming callback
    nullptr           // user data
);
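The last two arguments above (streaming callback and user data) follow the usual C pattern of a function pointer plus an opaque `void*`. The sketch below illustrates that pattern only. `token_callback_t`, `collect_token`, and `fake_generate` are hypothetical names invented for this example; the real callback type is declared in `cactus.h` and may take different parameters.

```cpp
#include <cassert>
#include <string>

// Hypothetical callback type, for illustration only.
// The actual signature is whatever cactus.h declares.
using token_callback_t = void (*)(const char* token, void* user_data);

// Appends each streamed token to the std::string passed through user_data.
void collect_token(const char* token, void* user_data) {
    assert(user_data != nullptr);
    auto* out = static_cast<std::string*>(user_data);
    out->append(token);
}

// Stand-in for a generation loop: emits fixed tokens through the callback.
// A null callback means "no streaming", mirroring the nullptr in the call above.
void fake_generate(token_callback_t on_token, void* user_data) {
    const char* tokens[] = {"Hi", " ", "there", "!"};
    for (const char* t : tokens) {
        if (on_token) on_token(t, user_data);
    }
}
```

Passing `&text` (a `std::string`) as user data and `collect_token` as the callback accumulates the streamed tokens without any global state.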

Example response from Gemma3-270m

{
    "success": true,        // generation succeeded
    "error": null,          // error details if failed
    "cloud_handoff": false, // true if cloud model used
    "response": "Hi there!",
    "function_calls": [],   // parsed tool calls
    "confidence": 0.8193,   // model confidence
    "time_to_first_token_ms": 45.23,
    "total_time_ms": 163.67,
    "prefill_tps": 1621.89,
    "decode_tps": 168.42,
    "ram_usage_mb": 245.67,
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
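Since the response arrives as a JSON string in a caller-owned buffer, the numeric fields have to be read out of it. The helper below is a minimal sketch that assumes a flat object like the one above. `json_number` is a hypothetical helper, not part of Cactus, and a real application would more likely use a proper JSON library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Naive lookup of a numeric field in a flat JSON object.
// Returns `fallback` if the key is missing or its value is not a number.
// This is not a general JSON parser (no nesting or escape handling).
double json_number(const std::string& json, const std::string& key, double fallback) {
    assert(!key.empty());
    const std::string quoted = "\"" + key + "\"";
    std::size_t pos = json.find(quoted);
    if (pos == std::string::npos) return fallback;
    pos = json.find(':', pos + quoted.size());
    if (pos == std::string::npos) return fallback;
    const char* start = json.c_str() + pos + 1;
    char* end = nullptr;
    const double value = std::strtod(start, &end);  // skips leading whitespace
    return end == start ? fallback : value;
}
```

For the example response, `json_number(response, "prefill_tokens", 0)` plus `json_number(response, "decode_tokens", 0)` gives 28 + 50 = 78, which matches `total_tokens`.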

Cactus Graph

#include "cactus.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset();
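To make the shapes in the first two graph operations concrete, here is a plain C++ reference for matrix multiplication and transposition. It does not use the Cactus Graph API: `Matrix`, `matmul`, and `transpose` are illustrative names, everything is computed in `float`, and the boolean flag passed to `graph.matmul` is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix with an explicit shape. Plain C++, not the Cactus Graph API.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// (m x k) times (k x n) gives (m x n).
Matrix matmul(const Matrix& a, const Matrix& b) {
    assert(a.cols == b.rows);
    Matrix out{a.rows, b.cols, std::vector<float>(a.rows * b.cols, 0.0f)};
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                out.at(i, j) += a.at(i, k) * b.at(k, j);
    return out;
}

// (m x n) becomes (n x m).
Matrix transpose(const Matrix& m) {
    Matrix out{m.cols, m.rows, std::vector<float>(m.data.size())};
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            out.at(c, r) = m.at(r, c);
    return out;
}
```

With the `a_data` and `b_data` values above, the {2, 3} by {3, 4} product has shape {2, 4}, and its transpose has shape {4, 2}.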

API & SDK References

Reference	Language	Description / Platforms
Engine API	C	Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff
Graph API	C++	Tensor operations, matrix multiplication, attention, normalization, activation functions
Python SDK	Python	Mac, Linux
Swift SDK	Swift	iOS, macOS, tvOS, watchOS, Android
Kotlin SDK	Kotlin	Android, iOS (via KMP)
Flutter SDK	Dart	iOS, macOS, Android
Rust SDK	Rust	Mac, Linux
React Native	JavaScript	iOS, Android

Benchmarks

All weights are INT4 quantised; tps = tokens per second.
LFM: 1k-token prefill / 100-token decode; values are prefill tps / decode tps.
LFM-VL: 256px input; values are latency / decode tps.
Parakeet: 30s audio input; values are latency / decode tps.
A missing latency (-) means no NPU support yet.

Device	LFM 1.2B	LFM-VL 1.6B	Parakeet 1.1B	RAM
Mac M4 Pro	582/100	0.2s/98	0.1s/900k+	76MB
iPad/Mac M3	350/60	0.3s/69	0.3s/800k+	70MB
iPhone 17 Pro	327/48	0.3s/48	0.3s/300k+	108MB
iPhone 13 Mini	148/34	0.3s/35	0.7s/90k+	1GB
Galaxy S25 Ultra	255/37	-/34	-/250k+	1.5GB
Pixel 6a	70/15	-/15	-/17k+	1GB
Galaxy A17 5G	32/10	-/11	-/40k+	727MB
CMF Phone 2 Pro	-	-	-	-
Raspberry Pi 5	69/11	13.3s/11	4.5s/180k+	869MB
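The LFM column can be turned into a rough end-to-end time for the benchmark workload. The sketch below assumes that "1k" means 1000 prefill tokens and that total time is simply prefill time plus decode time, ignoring model loading and any other overhead. `estimate_seconds` is a hypothetical helper for this arithmetic only.

```cpp
#include <cassert>

// Rough request time in seconds, assuming
//   time = prefill_tokens / prefill_tps + decode_tokens / decode_tps
// and no other overhead (model load, tokenisation, scheduling).
double estimate_seconds(double prefill_tokens, double decode_tokens,
                        double prefill_tps, double decode_tps) {
    assert(prefill_tps > 0.0 && decode_tps > 0.0);
    return prefill_tokens / prefill_tps + decode_tokens / decode_tps;
}
```

Under these assumptions, the Mac M4 Pro row (582/100) works out to about 1000/582 + 100/100 ≈ 2.7 s, and the Pixel 6a row (70/15) to about 1000/70 + 100/15 ≈ 21 s.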

Roadmap

Date	Status	Milestone
Sep 2025	Done	Released v1
Oct 2025	Done	Chunked prefill, KVCache Quant (2x prefill)
Nov 2025	Done	Cactus Attention (10 & 1k prefill = same decode)
Dec 2025	Done	Team grows to +6 Research Engineers
Jan 2026	Done	Apple NPU/RAM, 5-11x faster iOS/Mac
Feb 2026	Done	Hybrid inference, INT4, lossless Quant (1.5x)
Mar 2026	Coming	Qualcomm/Google NPUs, 5-11x faster Android
Apr 2026	Coming	Mediatek/Exynos NPUs, Cactus@ICLR
May 2026	Coming	Kernel→C++, Graph/Engine→Rust, Mac GPU & VR
Jun 2026	Coming	Torch/JAX model transpilers
Jul 2026	Coming	Wearables optimisations, Cactus@ICML
Aug 2026	Coming	Orchestration
Sep 2026	Coming	Full Cactus paper, chip manufacturer partners

Using this repo

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│ Step 0: if on Linux (Ubuntu/Debian)                                          │
│ sudo apt-get install python3 python3-venv python3-pip cmake                  │
│   build-essential libcurl4-openssl-dev                                       │
│                                                                              │
│ Step 1: clone and setup                                                      │
│ git clone https://github.com/cactus-compute/cactus && cd cactus              │
│ source ./setup                                                               │
│                                                                              │
│ Step 2: use the commands                                                     │
│──────────────────────────────────────────────────────────────────────────────│
│                                                                              │
│  cactus auth                         manage Cloud API key                    │
│    --status                          show key status                         │
│    --clear                           remove saved key                        │
│                                                                              │
│  cactus run <model>                  opens playground (auto downloads)       │
│    --precision INT4|INT8|FP16        quantization (default: INT4)            │
│    --token <token>                   HF token (gated models)                 │
│    --reconvert                       force reconversion from source          │
│                                                                              │
│  cactus transcribe [model]           live mic transcription (parakeet-1.1b)  │
│    --file <audio.wav>                transcribe file instead of mic          │
│    --precision INT4|INT8|FP16        quantization (default: INT4)            │
│    --token <token>                   HF token (gated models)                 │
│    --reconvert                       force reconversion from source          │
│                                                                              │
│  cactus download <model>             downloads model to ./weights            │
│    --precision INT4|INT8|FP16        quantization (default: INT4)            │
│    --token <token>                   HuggingFace API token                   │
│    --reconvert                       force reconversion from source          │
│                                                                              │
│  cactus convert <model> [dir]        convert model, supports LoRA merge      │
│    --precision INT4|INT8|FP16        quantization (default: INT4)            │
│    --lora <path>                     LoRA adapter to merge                   │
│    --token <token>                   HuggingFace API token                   │
│                                                                              │
│  cactus build                        build for ARM → build/libcactus.a       │
│    --apple                           Apple (iOS/macOS)                       │
│    --android                         Android                                 │
│    --flutter                         Flutter (all platforms)                 │
│    --python                          shared lib for Python FFI               │
│                                                                              │
│  cactus test                         run unit tests and benchmarks           │
│    --model <model>                   default: LFM2-VL-450M                   │
│    --transcribe_model <model>        default: moonshine-base                 │
│    --benchmark                       use larger models                       │
│    --precision INT4|INT8|FP16        regenerate weights with precision       │
│    --reconvert                       force reconversion from source          │
│    --no-rebuild                      skip building library                   │
│    --only <test>                     specific test (llm, vlm, stt, etc)      │
│    --ios                             run on connected iPhone                 │
│    --android                         run on connected Android                │
│                                                                              │
│  cactus clean                        remove all build artifacts              │
│  cactus --help                       show all commands and flags             │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Maintaining Organisations

Citation

If you use Cactus in your research, please cite it as follows:

@software{cactus,
  title        = {Cactus: AI Inference Engine for Phones & Wearables},
  author       = {Ndubuaku, Henry and Cactus Team},
  url          = {https://github.com/cactus-compute/cactus},
  year         = {2025}
}

N.B. Scroll all the way up and click the shields links for resources!