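# Running on a cell phone (on-device or thin-client)

"Runs on a cell phone" can mean two things: the model runs **on the device** itself, or the phone is a **thin client** that calls a model served elsewhere. The app supports both through a single environment variable.
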
## The inference switch

`server/model.py` reads `INFERENCE_BASE_URL`:

- **Unset (default):** the GGUF is loaded in-process via `llama-cpp-python` (as on the Space or a laptop).
- **Set:** generation is delegated to an **OpenAI-compatible / llama.cpp server** at that URL, which can be remote or on the same device.

Either way the agent code is the same; only the place where inference runs changes.

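A quick way to check which path a given shell environment will take is to restate the rule as a tiny helper. This is a sketch: `inference_mode` is hypothetical and not part of the repo, and it assumes an empty `INFERENCE_BASE_URL` behaves like an unset one, which `server/model.py` may or may not do.

```bash
# Hypothetical helper (not in the repo): restate the documented rule, treating an
# empty INFERENCE_BASE_URL the same as an unset one (an assumption).
inference_mode() {
  if [ -n "${INFERENCE_BASE_URL:-}" ]; then
    echo "remote ${INFERENCE_BASE_URL}"
  else
    echo "local"
  fi
}

unset INFERENCE_BASE_URL
inference_mode   # prints: local

INFERENCE_BASE_URL="http://127.0.0.1:8080/v1"
inference_mode   # prints: remote http://127.0.0.1:8080/v1
```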

```bash
export INFERENCE_BASE_URL="http://127.0.0.1:8080/v1"  # a llama-server on the phone
export INFERENCE_API_KEY="..."                         # optional
export INFERENCE_MODEL="gemma-e4b"                     # optional label
```

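Before starting the agent, one request to the server's OpenAI-compatible chat endpoint confirms that the URL, key, and model label line up. A sketch, assuming `curl` is installed: `chat_completions_url` and `ask_model` are hypothetical helpers, the prompt is fixed so it needs no JSON escaping, and the `Authorization: Bearer` header is only added when `INFERENCE_API_KEY` is set.

```bash
# Hypothetical helper: append the OpenAI-compatible chat path to a base URL that
# already ends in /v1, dropping one trailing slash if present.
chat_completions_url() {
  printf '%s/chat/completions\n' "${1%/}"
}

# Hypothetical smoke test: send one fixed prompt to the configured server.
# Only call this once something is listening at INFERENCE_BASE_URL.
ask_model() {
  local url
  local -a auth=()
  url="$(chat_completions_url "$INFERENCE_BASE_URL")"
  if [ -n "${INFERENCE_API_KEY:-}" ]; then
    auth=(-H "Authorization: Bearer ${INFERENCE_API_KEY}")
  fi
  # ${auth[@]+...} expands to nothing for an empty array, even under set -u.
  curl -sS "$url" \
    -H "Content-Type: application/json" \
    ${auth[@]+"${auth[@]}"} \
    -d "{\"model\": \"${INFERENCE_MODEL:-gemma-e4b}\", \"messages\": [{\"role\": \"user\", \"content\": \"Reply with the word ok.\"}]}"
}
```

With the variables above exported, `ask_model` should print a chat-completion JSON object whose `choices[0].message.content` holds the reply.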

So "on the phone" means running a `llama-server` **on the device** itself and pointing the agent at `127.0.0.1`.

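Because the agent talks to the device-local server over HTTP, it helps to wait until `llama-server` has finished loading the model before sending requests. `llama-server` exposes `GET /health` at its root (outside `/v1`), which returns an error status (503) while the model is loading and 200 once it is ready. A sketch: `server_root` and `wait_for_server` are hypothetical helpers, and the 60-second limit is arbitrary.

```bash
# Hypothetical helper: turn the OpenAI base URL (".../v1") into the server root,
# since llama-server serves /health there rather than under /v1.
server_root() {
  local url="${1%/}"
  printf '%s\n' "${url%/v1}"
}

# Poll /health once per second; curl -f treats HTTP errors (such as the 503
# returned while the model loads) as failures. Gives up after about 60 seconds.
wait_for_server() {
  local root i
  root="$(server_root "$INFERENCE_BASE_URL")"
  for ((i = 0; i < 60; i++)); do
    if curl -fsS "${root}/health" >/dev/null 2>&1; then
      echo "ready: ${root}"
      return 0
    fi
    sleep 1
  done
  echo "no healthy llama-server at ${root}" >&2
  return 1
}
```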

## On-device model profile (Gemma E4B edge)

A 31B Q4 GGUF (~18–20 GB) needs a GPU and will not run on a phone. Use the lightweight **Gemma E4B**
edge variant instead (see [PLAN.md](../PLAN.md) and the README *Accuracy upgrade* section), with a small
context window:

```bash
export MODEL_REPO="<your own or a community Gemma E4B GGUF repo>"
export MODEL_FILE="<gemma-e4b-*-Q4_K_M.gguf>"
export N_CTX=4096       # keep the KV cache small on a phone
export N_GPU_LAYERS=0   # CPU only; on a Mac with a Metal build, a non-zero value offloads layers to the GPU
```

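A small `N_CTX` matters because the KV cache grows linearly with context length. For a plain transformer with an f16 cache, a rough size is 2 (keys and values) × layers × context × KV heads × head dimension × 2 bytes. The sketch below computes that in MiB. The layer and head counts in the example calls are placeholders, not Gemma E4B's real configuration (read those from the GGUF metadata), and the formula ignores sliding-window or shared-KV layers, so treat the result as an upper-bound estimate under those assumptions.

```bash
# Rough KV-cache size in MiB for an f16 cache (2 bytes per element):
#   2 (K and V) x layers x n_ctx x kv_heads x head_dim x 2 bytes
# The dimensions passed below are placeholders, not Gemma E4B's config.
kv_cache_mib() {
  local layers=$1 n_ctx=$2 kv_heads=$3 head_dim=$4
  echo $(( 2 * layers * n_ctx * kv_heads * head_dim * 2 / 1024 / 1024 ))
}

kv_cache_mib 32 4096 8 128    # prints 512
kv_cache_mib 32 32768 8 128   # prints 4096
```

Whatever the real dimensions are, going from 4096 to 32768 tokens multiplies the cache by eight.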

## Android (Termux): genuinely on-device

```bash
pkg install python git cmake clang
git clone <this repo> && cd imessage-calendar-agent
pip install -r requirements-ci.txt llama-cpp-python   # CPU build
# Option 1: run the whole app (UI + /agent) on the phone
USE_STUB_EXTRACTOR=0 python app.py                    # http://127.0.0.1:7860
# Option 2: run only a llama-server and point a client/app at it
# llama-server -m <gemma-e4b.gguf> --port 8080
# then set INFERENCE_BASE_URL=http://127.0.0.1:8080/v1
```

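For Option 2, `llama-server` does not read `N_CTX`; it takes its context size from its own `-c` flag. A sketch of a launcher that keeps the two in sync and exports the matching `INFERENCE_BASE_URL`, assuming a `llama-server` binary is already on `PATH` (for example one built from llama.cpp; the `llama-cpp-python` wheel from Option 1 is not assumed to provide it). `start_phone_server` is hypothetical.

```bash
# Hypothetical launcher: start llama-server in the background on the phone with the
# same context size as the on-device profile, then point the agent at it.
# llama-server does not read N_CTX itself, so pass it explicitly via -c.
start_phone_server() {
  local gguf=$1 port=${2:-8080}
  llama-server -m "$gguf" -c "${N_CTX:-4096}" --host 127.0.0.1 --port "$port" &
  export INFERENCE_BASE_URL="http://127.0.0.1:${port}/v1"
}
```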

Expect multi-second latency per request on a phone CPU. Keep `N_CTX` small and keep the message threads you pass to the agent short, since prompt processing time grows with input length.

## iOS: the honest limit

iOS does **not** allow background message access or a persistent background LLM server, so you cannot
run an autonomous on-device iMessage agent on an iPhone. The supported iOS path is the foreground
**Shortcut** described in [automations.md](./automations.md), optionally pointed at a remote
`INFERENCE_BASE_URL` for the model.
