| --- |
| sidebar_position: 14 |
| title: "API Server" |
| description: "Expose hermes-agent as an OpenAI-compatible API for any frontend" |
| --- |
| |
| # API Server |
|
|
| The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format (Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and many others) can connect to hermes-agent and use it as a backend. |
|
|
| Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. Tool calls execute invisibly server-side. |
|
|
| ## Quick Start |
|
|
| ### 1. Enable the API server |
|
|
| Add to `~/.hermes/.env`: |
|
|
| ```bash |
| API_SERVER_ENABLED=true |
| API_SERVER_KEY=change-me-local-dev |
| # Optional: only if a browser must call Hermes directly |
| # API_SERVER_CORS_ORIGINS=http://localhost:3000 |
| ``` |
|
|
| ### 2. Start the gateway |
|
|
| ```bash |
| hermes gateway |
| ``` |
|
|
| You'll see: |
|
|
| ``` |
| [API Server] API server listening on http://127.0.0.1:8642 |
| ``` |
|
|
| ### 3. Connect a frontend |
|
|
| Point any OpenAI-compatible client at `http://localhost:8642/v1`: |
|
|
| ```bash |
| # Test with curl |
| curl http://localhost:8642/v1/chat/completions \ |
|   -H "Authorization: Bearer change-me-local-dev" \ |
|   -H "Content-Type: application/json" \ |
|   -d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}' |
| ``` |
|
|
| Or connect Open WebUI, LobeChat, or any other frontend. See the [Open WebUI integration guide](/docs/user-guide/messaging/open-webui) for step-by-step instructions. |
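
For scripted checks, the curl test above can also be sketched in Python with only the standard library. This is a minimal sketch that assumes the default host and port and the placeholder key from step 1; `build_chat_request` and `ask` are illustrative helper names, not part of hermes-agent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8642/v1"  # default host/port from this guide
API_KEY = "change-me-local-dev"        # must match API_SERVER_KEY in ~/.hermes/.env


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build the same POST /v1/chat/completions request as the curl test."""
    payload = {
        "model": "hermes-agent",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the gateway running, `print(ask("Hello!"))` should print the agent's reply.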
|
|
| ## Endpoints |
|
|
| ### POST /v1/chat/completions |
|
|
| Standard OpenAI Chat Completions format. The endpoint is stateless: each request carries the full conversation in the `messages` array. |
|
|
| **Request:** |
| ```json |
| { |
|   "model": "hermes-agent", |
|   "messages": [ |
|     {"role": "system", "content": "You are a Python expert."}, |
|     {"role": "user", "content": "Write a fibonacci function"} |
|   ], |
|   "stream": false |
| } |
| ``` |
|
|
| **Response:** |
| ```json |
| { |
|   "id": "chatcmpl-abc123", |
|   "object": "chat.completion", |
|   "created": 1710000000, |
|   "model": "hermes-agent", |
|   "choices": [{ |
|     "index": 0, |
|     "message": {"role": "assistant", "content": "Here's a fibonacci function..."}, |
|     "finish_reason": "stop" |
|   }], |
|   "usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250} |
| } |
| ``` |
|
|
| **Streaming** (`"stream": true`): Returns Server-Sent Events (SSE). When streaming is enabled in the server config, tokens are emitted live as the LLM generates them. When it is disabled, the full response is sent as a single SSE chunk. |
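
A client reads the stream by collecting the text fragment from each SSE `data:` line. The sketch below assumes the chunks follow OpenAI's `chat.completion.chunk` shape (`choices[].delta.content`) and that the stream ends with a `data: [DONE]` sentinel. Both are OpenAI conventions that this page does not spell out, so check them against your server's actual output. `iter_stream_text` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from the SSE lines of a streamed chat completion."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel (OpenAI convention)
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # Assumed chunk shape: the new text lives in choices[].delta.content
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Because the function only needs an iterable of decoded lines, the same code handles both cases described above: many small chunks when streaming is enabled, or one large chunk when it is not.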
|
|
| ### POST /v1/responses |
|
|
| OpenAI Responses API format. Supports server-side conversation state via `previous_response_id`. The server stores the full conversation history (including tool calls and results), so multi-turn context is preserved without the client managing it. |
|
|
| **Request:** |
| ```json |
| { |
|   "model": "hermes-agent", |
|   "input": "What files are in my project?", |
|   "instructions": "You are a helpful coding assistant.", |
|   "store": true |
| } |
| ``` |
|
|
| **Response:** |
| ```json |
| { |
|   "id": "resp_abc123", |
|   "object": "response", |
|   "status": "completed", |
|   "model": "hermes-agent", |
|   "output": [ |
|     {"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"}, |
|     {"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"}, |
|     {"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]} |
|   ], |
|   "usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250} |
| } |
| ``` |
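
Because `output` mixes tool activity with the final message, a client usually wants to separate the two. The following sketch walks the item types shown in the example response above. `summarize_response` is an illustrative name, and only the `function_call` and `message` shapes shown on this page are handled.

```python
from typing import List, Tuple


def summarize_response(response: dict) -> Tuple[str, List[str]]:
    """Return (assistant text, names of tools called) from a Responses API body."""
    text_parts: List[str] = []
    tools: List[str] = []
    for item in response.get("output", []):
        kind = item.get("type")
        if kind == "function_call":
            tools.append(item["name"])
        elif kind == "message":
            # A message holds a list of content parts; keep only the text parts.
            for part in item.get("content", []):
                if part.get("type") == "output_text":
                    text_parts.append(part["text"])
    return "".join(text_parts), tools
```

For the example response above, this returns the text `"Your project has..."` together with the tool list `["terminal"]`.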
|
|
| #### Multi-turn with previous_response_id |
|
|
| Chain responses to maintain full context (including tool calls) across turns: |
|
|
| ```json |
| { |
|   "input": "Now show me the README", |
|   "previous_response_id": "resp_abc123" |
| } |
| ``` |
|
|
| The server reconstructs the full conversation from the stored response chain, so all previous tool calls and results are preserved. |
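
On the client side, chaining amounts to remembering the last response ID and sending it with the next turn. Here is a minimal sketch: `ResponseChain` is a hypothetical helper, and `send` stands for whatever function POSTs a body to `/v1/responses` and returns the parsed JSON. Sending `model` and `store` on every turn mirrors the first request example and is an assumption, not a documented requirement.

```python
from typing import Callable, Optional


class ResponseChain:
    """Track the latest response ID so each turn chains to the previous one."""

    def __init__(self, send: Callable[[dict], dict]):
        self._send = send  # caller-supplied: POSTs to /v1/responses, returns JSON
        self.last_id: Optional[str] = None

    def say(self, text: str) -> dict:
        body = {"model": "hermes-agent", "input": text, "store": True}
        if self.last_id is not None:
            # Every turn after the first points at the previous stored response.
            body["previous_response_id"] = self.last_id
        response = self._send(body)
        self.last_id = response["id"]
        return response
```

Injecting `send` keeps the chaining logic independent of the HTTP library, which also makes it easy to exercise with a stub.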
|
|
| #### Named conversations |
|
|
| Use the `conversation` parameter instead of tracking response IDs: |
|
|
| ```json |
| {"input": "Hello", "conversation": "my-project"} |
| {"input": "What's in src/?", "conversation": "my-project"} |
| {"input": "Run the tests", "conversation": "my-project"} |
| ``` |
|
|
| The server automatically chains each request to the latest response in that conversation. This works like the `/title` command for gateway sessions. |
|
|
| ### GET /v1/responses/\{id\} |
|
|
| Retrieve a previously stored response by ID. |
|
|
| ### DELETE /v1/responses/\{id\} |
|
|
| Delete a stored response. |
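
Both stored-response endpoints take the response ID in the path and differ only in HTTP method. The sketch below builds either request with the standard library. It assumes the default address and the Quick Start placeholder key, and `stored_response_request` is an illustrative name.

```python
import urllib.request
from urllib.parse import quote


def stored_response_request(
    response_id: str,
    method: str = "GET",
    base_url: str = "http://localhost:8642/v1",
    api_key: str = "change-me-local-dev",
) -> urllib.request.Request:
    """Build a GET or DELETE request for /v1/responses/{id}."""
    if method not in ("GET", "DELETE"):
        raise ValueError("method must be GET or DELETE")
    # Percent-encode the ID so it cannot alter the URL path.
    url = f"{base_url}/responses/{quote(response_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method=method,
    )
```

Pass the result to `urllib.request.urlopen` to perform the call, for example `urlopen(stored_response_request("resp_abc123", method="DELETE"))`.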
|
|
| ### GET /v1/models |
|
|
| Lists `hermes-agent` as an available model. Required by most frontends for model discovery. |
|
|
| ### GET /health |
|
|
| Health check. Returns `{"status": "ok"}`. |
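
A startup script can poll this endpoint until the gateway is ready. This is a sketch under the assumption that `/health` is served at the server root on the default port, as the heading suggests. `fetch_health` and `wait_until_healthy` are hypothetical helpers.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_health(url: str = "http://localhost:8642/health") -> dict:
    """GET the health endpoint and return the parsed JSON body."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return json.load(resp)


def wait_until_healthy(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,
    delay: float = 0.5,
) -> bool:
    """Poll until the server reports {"status": "ok"} or attempts run out."""
    for _ in range(attempts):
        try:
            if fetch().get("status") == "ok":
                return True
        except (OSError, ValueError):
            pass  # connection refused/timeout, or the body was not valid JSON
        time.sleep(delay)
    return False
```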
|
|
| ## System Prompt Handling |
|
|
| When a frontend sends a `system` message (Chat Completions) or `instructions` field (Responses API), hermes-agent **layers it on top** of its core system prompt. Your agent keeps all its tools, memory, and skills; the frontend's system prompt only adds extra instructions. |
|
|
| This means you can customize behavior per-frontend without losing capabilities: |
| - Open WebUI system prompt: "You are a Python expert. Always include type hints." |
| - The agent still has terminal, file tools, web search, memory, etc. |
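
The same extra instructions travel in a different field depending on the endpoint. The two hypothetical builders below show where they go in each request body; the field names come from the request examples earlier on this page.

```python
def chat_completions_body(extra_instructions: str, user_text: str) -> dict:
    """Chat Completions: extra instructions ride in a `system` message."""
    return {
        "model": "hermes-agent",
        "messages": [
            {"role": "system", "content": extra_instructions},
            {"role": "user", "content": user_text},
        ],
    }


def responses_body(extra_instructions: str, user_text: str) -> dict:
    """Responses API: extra instructions ride in the `instructions` field."""
    return {
        "model": "hermes-agent",
        "input": user_text,
        "instructions": extra_instructions,
    }
```

In both cases the text is layered on top of the agent's core prompt rather than replacing it, so neither body needs to describe the agent's tools.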
|
|
| ## Authentication |
|
|
| Bearer token auth via the `Authorization` header: |
|
|
| ``` |
| Authorization: Bearer <your API_SERVER_KEY value> |
| ``` |
|
|
| Configure the key via the `API_SERVER_KEY` environment variable. If you need a browser to call Hermes directly, also set `API_SERVER_CORS_ORIGINS` to an explicit allowlist. |
|
|
| :::warning Security |
| The API server gives full access to hermes-agent's toolset, **including terminal commands**. If you change the bind address to `0.0.0.0` (network-accessible), **always set `API_SERVER_KEY`** and keep `API_SERVER_CORS_ORIGINS` narrow. Otherwise, remote callers may be able to execute arbitrary commands on your machine. |
|
|
| The default bind address (`127.0.0.1`) is for local-only use. Browser access is disabled by default; enable it only for explicit trusted origins. |
| ::: |
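
The rule in the warning above can be turned into a pre-flight check for deployment scripts. The sketch below is not part of hermes-agent. `exposure_warnings` is a hypothetical helper that flags a non-loopback bind address combined with a missing key or the Quick Start placeholder key.

```python
import ipaddress
from typing import List

PLACEHOLDER_KEY = "change-me-local-dev"  # sample value from the Quick Start


def is_loopback(host: str) -> bool:
    """True if the bind address is reachable only from the local machine."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # unknown hostnames are treated as network-reachable


def exposure_warnings(host: str, api_key: str) -> List[str]:
    """Return human-readable warnings for risky host/key combinations."""
    if is_loopback(host):
        return []
    if not api_key:
        return [f"{host} is network-accessible but API_SERVER_KEY is not set"]
    if api_key == PLACEHOLDER_KEY:
        return [f"{host} is network-accessible but API_SERVER_KEY is the sample value"]
    return []
```

Calling `exposure_warnings("0.0.0.0", "")` returns one warning, while the default `127.0.0.1` returns none.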
|
|
| ## Configuration |
|
|
| ### Environment Variables |
|
|
| | Variable | Default | Description | |
| |----------|---------|-------------| |
| | `API_SERVER_ENABLED` | `false` | Enable the API server | |
| | `API_SERVER_PORT` | `8642` | HTTP server port | |
| | `API_SERVER_HOST` | `127.0.0.1` | Bind address (localhost only by default) | |
| | `API_SERVER_KEY` | _(none)_ | Bearer token for auth | |
| | `API_SERVER_CORS_ORIGINS` | _(none)_ | Comma-separated allowed browser origins | |
|
|
| ### config.yaml |
|
|
| ```yaml |
| # Not yet supported; use environment variables. |
| # config.yaml support is planned for a future release. |
| ``` |
|
|
| ## CORS |
|
|
| The API server does **not** enable browser CORS by default. |
|
|
| For direct browser access, set an explicit allowlist: |
|
|
| ```bash |
| API_SERVER_CORS_ORIGINS=http://localhost:3000,http://127.0.0.1:3000 |
| ``` |
|
|
| Most documented frontends such as Open WebUI connect server-to-server and do not need CORS at all. |
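
An explicit allowlist means the server echoes a browser's `Origin` back only on an exact match. The sketch below shows that general CORS technique with hypothetical helpers (`parse_origins`, `cors_headers`). It describes how allowlist-based CORS is commonly implemented, not how hermes-agent implements it internally.

```python
from typing import Dict, List


def parse_origins(value: str) -> List[str]:
    """Split a comma-separated API_SERVER_CORS_ORIGINS value into origins."""
    return [origin.strip() for origin in value.split(",") if origin.strip()]


def cors_headers(request_origin: str, allowlist: List[str]) -> Dict[str, str]:
    """Return CORS response headers for an allowed origin, else nothing."""
    if request_origin in allowlist:
        return {
            "Access-Control-Allow-Origin": request_origin,
            "Vary": "Origin",  # responses differ per origin, so caches must key on it
        }
    return {}  # no CORS headers: the browser blocks the cross-origin read
```

With an empty allowlist, which is the default, every origin falls through to the empty result, matching the statement that browser access is off unless configured.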
|
|
| ## Compatible Frontends |
|
|
| Any frontend that supports the OpenAI API format works. Tested/documented integrations: |
|
|
| | Frontend | Stars | Connection | |
| |----------|-------|------------| |
| | [Open WebUI](/docs/user-guide/messaging/open-webui) | 126k | Full guide available | |
| | LobeChat | 73k | Custom provider endpoint | |
| | LibreChat | 34k | Custom endpoint in librechat.yaml | |
| | AnythingLLM | 56k | Generic OpenAI provider | |
| | NextChat | 87k | BASE_URL env var | |
| | ChatBox | 39k | API Host setting | |
| | Jan | 26k | Remote model config | |
| | HF Chat-UI | 8k | OPENAI_BASE_URL | |
| | big-AGI | 7k | Custom endpoint | |
| | OpenAI Python SDK | n/a | `OpenAI(base_url="http://localhost:8642/v1")` | |
| | curl | n/a | Direct HTTP requests | |
|
|
| ## Limitations |
|
|
| - **Response storage**: stored responses (for `previous_response_id`) are persisted in SQLite and survive gateway restarts. At most 100 responses are stored (LRU eviction). |
| - **No file upload**: vision/document analysis via uploaded files is not yet supported through the API. |
| - **Model field is cosmetic**: the `model` field in requests is accepted, but the actual LLM used is configured server-side in config.yaml. |
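
To show what LRU eviction at a fixed cap means for long response chains, here is an in-memory sketch of the policy. The real store is SQLite-backed, so this models only the eviction behavior. Whether a read refreshes an entry's recency on the server is an assumption, and `LRUResponseStore` is a hypothetical class.

```python
from collections import OrderedDict
from typing import Optional


class LRUResponseStore:
    """Keep at most `max_items` responses, evicting the least recently used."""

    def __init__(self, max_items: int = 100):
        self.max_items = max_items
        self._items: "OrderedDict[str, dict]" = OrderedDict()

    def put(self, response_id: str, response: dict) -> None:
        self._items[response_id] = response
        self._items.move_to_end(response_id)  # newest entries sit at the end
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)   # drop the oldest entry

    def get(self, response_id: str) -> Optional[dict]:
        if response_id not in self._items:
            return None  # evicted or never stored
        self._items.move_to_end(response_id)  # assumed: reads refresh recency
        return self._items[response_id]
```

The practical consequence is that a `previous_response_id` pointing at an old, evicted response can no longer be resolved, so long-lived clients should be prepared to start a fresh chain.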
|
|