Lightweight Protocol-Translation Bridges for Heterogeneous LLM Tool-Calling APIs
A Case Study on Codex-Ollama Interoperation
Authors: xuanyuan (xrz5785@gmail.com)
Date: May 2026
Status: Technical Report v1.0
Code: /Users/x/ai-assets/codex-proxy/
Abstract
Large Language Model (LLM) inference frameworks widely claim "OpenAI API compatibility," yet this compatibility often exists only at the syntactic level: HTTP endpoints accept request payloads and return 200 but fail to preserve semantic contracts around tool-calling (function-calling) behavior. We document a concrete failure case: Ollama's /v1/responses endpoint, when used as a provider for Codex CLI (OpenAI's coding agent), returns malformed responses that cause all tool-call attempts to fail with "unsupported call" errors, despite the same models producing correct tool_calls through Ollama's /v1/chat/completions endpoint.
We propose the Protocol-Translation Bridge (PTB) pattern: a lightweight, zero-configuration proxy that performs three core transformations. These are (1) request format translation from Responses API to Chat Completions API, (2) tool schema simplification to fit local models' attention budgets, and (3) reverse synthesis of SSE event streams from non-streaming upstream responses. Our implementation (~800 lines of Python) restores tool-calling functionality across the tested local models (qwen3:14b, huihui4:8b-a4b, qwen2.5-coder:3b) with zero modifications to either the client framework or the inference engine.
The key insight is counterintuitive: downgrading from the newer Responses API to the older Chat Completions API increases semantic fidelity, because the older endpoint has more mature native implementations within inference frameworks.
1. Introduction
1.1 Background
LLM-based coding agents (Codex, Cursor, Copilot, Aider) rely on tool-calling APIs to execute shell commands, read and write files, spawn sub-agents, and interact with users. These agents typically target OpenAI's API contract, either the Chat Completions API (/v1/chat/completions) or the newer Responses API (/v1/responses). The Responses API is the recommended modern endpoint for agent frameworks because it provides structured multi-output responses, native streaming SSE events, and a unified interface for text + tool-call outputs.
Meanwhile, the local-model ecosystem (Ollama, vLLM, llama.cpp) has rapidly evolved to support tool-calling. Ollama's documentation states that its /v1/responses endpoint is "OpenAI-compatible." Codex CLI v0.130.0 added --oss --local-provider ollama flags to support local models through this endpoint.
1.2 The Problem
When Codex connects to Ollama's /v1/responses endpoint with a local model, tool calls fail with:
unsupported call: call_<id> (exec_command)
The error is silent at the HTTP level: Ollama returns 200 OK. The failure occurs at the semantic level: the response body does not contain the structured function_call output items that Codex's agent runtime expects. The same model, queried via /v1/chat/completions with identical tool definitions, produces correct tool_calls in its response.
This paper documents the diagnosis, solution, and generalization of this problem.
2. Problem Analysis
2.1 Root Cause
Ollama's /v1/responses endpoint is a thin wrapper around its native /api/chat endpoint. The internal path is:
Codex → POST /v1/responses (Ollama's OpenAI-compat layer)
→ Ollama rewraps as /api/chat request
→ Model outputs text (not structured function_call)
→ Ollama returns {output: [{type: "message", content: "..."}]}
→ Codex finds no function_call → "unsupported call"
When the same model is queried via /v1/chat/completions:
Client → POST /v1/chat/completions (Ollama's OpenAI-compat layer)
→ Ollama rewraps as /api/chat request WITH tool definitions properly nested
→ Model outputs {tool_calls: [{function: {name, arguments}}]}
→ Ollama returns {choices: [{message: {tool_calls: [...]}}]}
The critical difference: both paths end at Ollama's /api/chat endpoint, which supports tool calling natively. The /v1/chat/completions wrapper passes the tool definitions through correctly, whereas the /v1/responses wrapper loses them during format translation.
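The difference between the two response shapes can be checked mechanically. The following is a minimal sketch, assuming the response bodies have already been parsed from JSON; the helper names are hypothetical and are not taken from proxy.py.

```python
from typing import Any


def responses_has_function_call(body: dict[str, Any]) -> bool:
    """True if a Responses API body carries at least one function_call item."""
    return any(
        isinstance(item, dict) and item.get("type") == "function_call"
        for item in body.get("output", [])
    )


def chat_has_tool_calls(body: dict[str, Any]) -> bool:
    """True if a Chat Completions body carries at least one tool call."""
    for choice in body.get("choices", []):
        message = choice.get("message") or {}
        if message.get("tool_calls"):
            return True
    return False


# Shapes taken from the two traces above (contents abbreviated).
broken = {"output": [{"type": "message", "content": "..."}]}
working = {"choices": [{"message": {"tool_calls": [
    {"function": {"name": "exec_command", "arguments": "{}"}}
]}}]}

print(responses_has_function_call(broken))  # False
print(chat_has_tool_calls(working))         # True
```

A check of this kind separates "HTTP 200" from "semantically usable", which is the distinction the rest of this report relies on.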
2.2 Why Not Fix Upstream?
- Ollama's issue tracker already has reports about /v1/responses tool-calling gaps
- Fix timeline unknown: the /v1/responses endpoint is not Ollama's priority
- A proxy is zero-friction: no need to modify Codex or Ollama; works instantly
- Generalizes: the same pattern applies to any pair of incompatible LLM API surfaces
2.3 Diagnosis Methodology
We employed a 12-step iterative diagnosis:
| Step | Hypothesis | Test | Result |
|---|---|---|---|
| 1 | Direct /v1/responses works | Codex with Ollama | "unsupported call" |
| 2 | Model capability issue | Manual /v1/chat/completions | Tool calls work |
| 3 | Ollama responses impl broken | Read source | Thin wrapper, loses tools |
| 4 | Tool choice mode | required vs auto | required causes loops |
| 5 | Too many tools | 11 tools → 3 tools | Better, still inconsistent |
| 6 | Tool param overload | 10 params → 2-3 params | Model selects correctly |
| 7 | Model outputs JSON text | Check response format | Text JSON, not tool_calls |
| 8 | System prompt too weak | Add CRITICAL directive | Model starts calling tools |
| 9 | Usage field mismatch | Check input_tokens | Missing, causes disconnect |
| 10 | output_index hardcoded | Multi-output responses | Text overwrites tool_call |
| 11 | Pull fails for custom models | POST /api/pull | "file does not exist" |
| 12 | GGUF arch compatibility | gemma4 in Ollama | Needs specific arch support |
3. The Protocol-Translation Bridge (PTB) Pattern
3.1 Architecture
┌──────────┐   /v1/responses    ┌───────────┐   /v1/chat/completions   ┌────────┐
│  Codex   │ ─────────────────▶ │ PTB Proxy │ ───────────────────────▶ │ Ollama │
│ (Client) │ ◀── SSE events ─── │  :11434   │ ◀──── JSON response ──── │ :11433 │
└──────────┘                    └───────────┘                          └────────┘
The proxy is transparent for non-Responses requests (pass-through proxy for /api/tags, /api/chat, etc.), and only intervenes for two specific paths:
- POST /v1/responses: full protocol translation
- POST /api/pull: intercept and short-circuit for locally-available models
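The dispatch rule above fits in a few lines. This is a sketch only: the function name `route` and its return labels are hypothetical, and the report does not say how proxy.py normalizes paths, so stripping the query string and trailing slash is an assumption.

```python
def route(method: str, path: str) -> str:
    """Classify a request as 'translate', 'intercept_pull', or 'passthrough'."""
    # Assumption: query strings and trailing slashes are ignored when matching.
    clean = path.split("?", 1)[0].rstrip("/")
    if method.upper() == "POST" and clean == "/v1/responses":
        return "translate"
    if method.upper() == "POST" and clean == "/api/pull":
        return "intercept_pull"
    # Everything else (/api/tags, /api/chat, ...) is forwarded unchanged.
    return "passthrough"


for method, path in [("POST", "/v1/responses"), ("POST", "/api/pull"), ("GET", "/api/tags")]:
    print(method, path, "->", route(method, path))
```

Keeping the intercepted set this small means every other Ollama client continues to see the upstream server's behavior unchanged.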
3.2 Transformation 1: Request Format Translation
The Responses API request is converted to Chat Completions format:
Input (Responses API):
{
"model": "qwen3:14b",
"input": "list files in /tmp",
"instructions": "You are a coding agent.",
"tools": [
{"type": "function", "name": "exec_command", ...}
],
"stream": true
}
Output (Chat Completions API):
{
"model": "qwen3:14b",
"messages": [
{"role": "system", "content": "You are a coding agent.\n\nCRITICAL: You MUST call..."},
{"role": "user", "content": "list files in /tmp"}
],
"tools": [
{"type": "function", "function": {"name": "exec_command", ...}}
],
"tool_choice": "auto",
"stream": false
}
Key design decisions:
- stream: false. We always request non-streaming output from Ollama, then synthesize SSE events ourselves. This avoids the complexity of real-time SSE-to-SSE translation and gives us complete control over event ordering.
- tool_choice: "auto". Using "required" causes infinite tool-call loops with some models; "auto" combined with a strong system prompt provides the right balance.
- instructions becomes part of the system message, augmented with tool-usage directives.
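The translation step can be sketched as a pure function. This is an illustration under stated assumptions, not the code in proxy.py: the directive text is left elided exactly as the report gives it, list-valued `input` items are assumed to already look like role/content messages, and the helper `to_chat_tool` (hypothetical) accepts function tools in either the flat or the nested layout.

```python
from typing import Any

# The report elides the full directive; the string is kept as-is, not completed.
TOOL_DIRECTIVE = "CRITICAL: You MUST call..."


def to_chat_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """Return a function tool in the nested Chat Completions layout."""
    if "function" in tool:
        return tool  # already nested
    fields = {k: v for k, v in tool.items() if k != "type"}
    return {"type": "function", "function": fields}


def responses_to_chat(req: dict[str, Any]) -> dict[str, Any]:
    """Convert a Responses API request into a Chat Completions request."""
    tools = [to_chat_tool(t) for t in req.get("tools") or []]
    system = req.get("instructions", "")
    if tools:
        system = f"{system}\n\n{TOOL_DIRECTIVE}".strip()

    messages: list[dict[str, Any]] = []
    if system:
        messages.append({"role": "system", "content": system})

    user_input = req.get("input", "")
    if isinstance(user_input, str):
        messages.append({"role": "user", "content": user_input})
    else:
        # Assumption: each item is a dict with "role" and "content" keys.
        messages.extend(
            {"role": item.get("role", "user"), "content": item.get("content", "")}
            for item in user_input
        )

    # Always non-streaming upstream; SSE is synthesized afterwards.
    chat: dict[str, Any] = {"model": req["model"], "messages": messages, "stream": False}
    if tools:
        chat["tools"] = tools
        chat["tool_choice"] = "auto"
    return chat
```

Because the function has no I/O, it can be unit-tested against captured Codex requests without a running Ollama instance.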
3.3 Transformation 2: Tool Schema Simplification
Codex's internal tools are complex. exec_command alone has 10 parameters. Across 11 tools, the total tool definition is approximately 4,100 tokens, which exceeds the effective attention budget of 8B-class models.
We reduce each tool to its essential parameters:
| Tool | Original Params | Essential Params |
|---|---|---|
| exec_command | 10 (cmd, workdir, timeout, env, stdin, ...) | 2 (cmd, workdir) |
| write_stdin | 6 | 2 (session_id, chars) |
| spawn_agent | 8 | 3 (agent_type, items, message) |
| view_image | 3 | 1 (path) |
| ... | ... | ... |
After simplification: ~800 tokens total. Codex's tool executor fills in sensible defaults for omitted parameters.
This is a zero-cost accuracy improvement: the model selects the correct tool with higher probability because the signal-to-noise ratio in the tool definitions is higher.
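A parameter filter of this kind might look as follows. It is a sketch, assuming tools arrive in the nested Chat Completions layout with a JSON Schema object under `function.parameters`; the `ESSENTIAL_PARAMS` map only covers the four tools listed in the table, and its name is hypothetical.

```python
import copy
from typing import Any

# Essential parameters per tool, copied from the table above (other tools omitted).
ESSENTIAL_PARAMS: dict[str, set[str]] = {
    "exec_command": {"cmd", "workdir"},
    "write_stdin": {"session_id", "chars"},
    "spawn_agent": {"agent_type", "items", "message"},
    "view_image": {"path"},
}


def simplify_tools(tools: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the tools with non-essential parameters removed."""
    simplified = []
    for tool in tools:
        tool = copy.deepcopy(tool)  # never mutate the caller's request
        fn = tool.get("function", {})
        keep = ESSENTIAL_PARAMS.get(fn.get("name", ""))
        schema = fn.get("parameters")
        if keep is not None and isinstance(schema, dict):
            props = schema.get("properties", {})
            schema["properties"] = {k: v for k, v in props.items() if k in keep}
            if "required" in schema:
                # Drop "required" entries that no longer have a property.
                schema["required"] = [k for k in schema["required"] if k in keep]
        simplified.append(tool)
    return simplified
```

Tools without an entry in the map pass through untouched, so an unknown tool is never silently truncated.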
3.4 Transformation 3: Response Event Synthesis
From a non-streaming Chat Completions JSON response, we synthesize the SSE event stream that Codex expects:
Chat Completion JSON:
{
"choices": [{
"message": {
"tool_calls": [{
"id": "call_abc",
"function": {"name": "exec_command", "arguments": "{\"cmd\":\"ls /tmp\"}"}
}],
"content": ""
},
"finish_reason": "tool_calls"
}],
"usage": {"prompt_tokens": 99, "completion_tokens": 79, "total_tokens": 178}
}
── synthesized as ──▶
SSE Events:
event: response.created
event: response.in_progress
event: response.output_item.added ← function_call item, output_index=0
event: response.function_call_arguments.delta
event: response.function_call_arguments.done
event: response.output_item.done
event: response.completed ← normalized usage {input_tokens, output_tokens}
Critical details that caused failures:
- output_index: must be assigned per output item in emission order (0 for the first function_call, 1 for a text message that follows it). Hardcoding 0 for every item makes the text message overwrite the tool call.
- sequence_number: must increase monotonically across all events in the stream.
- usage: Ollama returns {prompt_tokens, completion_tokens}, but Codex expects {input_tokens, output_tokens}.
- Event ordering: the sequence created → in_progress → output_item.added → ... → output_item.done → completed is a strict protocol; deviations cause client disconnections.
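The constraints above can be illustrated with a small generator. This is a sketch rather than the SSEResponseBuilder from proxy.py: it handles tool calls only (text items are omitted), the payload fields are modeled on the public Responses API streaming events, and the `resp_1` / `fc_N` identifiers are placeholders. Which fields Codex strictly requires is not specified in this report.

```python
import json
from typing import Any, Iterator


def normalize_usage(usage: dict[str, int]) -> dict[str, int]:
    """Map Chat Completions usage names to the Responses API names."""
    inp = usage.get("prompt_tokens", 0)
    out = usage.get("completion_tokens", 0)
    return {"input_tokens": inp, "output_tokens": out,
            "total_tokens": usage.get("total_tokens", inp + out)}


def synthesize_sse(chat: dict[str, Any], response_id: str = "resp_1") -> Iterator[str]:
    """Yield SSE frames for the tool calls of one non-streaming chat response."""
    seq = 0

    def frame(event: str, **payload: Any) -> str:
        nonlocal seq
        data = {"type": event, "sequence_number": seq, **payload}
        seq += 1  # one counter shared by every event in the stream
        return f"event: {event}\ndata: {json.dumps(data)}\n\n"

    message = chat["choices"][0]["message"]
    base = {"id": response_id, "object": "response"}
    yield frame("response.created", response={**base, "status": "in_progress"})
    yield frame("response.in_progress", response={**base, "status": "in_progress"})

    output: list[dict[str, Any]] = []
    for call in message.get("tool_calls") or []:
        idx = len(output)  # output_index follows emission order, never hardcoded
        fn = call["function"]
        item = {"type": "function_call", "id": f"fc_{idx}",
                "call_id": call.get("id", f"call_{idx}"),
                "name": fn["name"], "arguments": ""}
        yield frame("response.output_item.added", output_index=idx, item=item)
        yield frame("response.function_call_arguments.delta",
                    output_index=idx, item_id=item["id"], delta=fn["arguments"])
        yield frame("response.function_call_arguments.done",
                    output_index=idx, item_id=item["id"], arguments=fn["arguments"])
        item = {**item, "arguments": fn["arguments"], "status": "completed"}
        yield frame("response.output_item.done", output_index=idx, item=item)
        output.append(item)

    yield frame("response.completed", response={
        **base, "status": "completed", "output": output,
        "usage": normalize_usage(chat.get("usage", {}))})
```

Since the whole upstream response is available before the first frame is written, ordering and index assignment reduce to a single pass over `tool_calls`.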
3.5 Pull Interception
Codex calls POST /api/pull for every model before first use. Custom GGUF models (e.g., huihui4-8b-a4b) are not in Ollama's registry, causing pull failures. The proxy intercepts this endpoint, checks if the model exists locally via /api/tags, and returns {"status":"success"}\n (NDJSON format) for locally-available models.
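The short-circuit decision can be sketched as a pure function over an already-fetched /api/tags body. The function names are hypothetical, and treating a bare model name as its ":latest" tag is an assumption about name matching, not something this report specifies.

```python
import json
from typing import Optional


def local_model_names(tags_body: dict) -> set[str]:
    """Collect model names from a parsed /api/tags response body."""
    return {m["name"] for m in tags_body.get("models", []) if "name" in m}


def pull_short_circuit(requested: str, tags_body: dict) -> Optional[bytes]:
    """Return an NDJSON success body if the model is local, else None."""
    names = local_model_names(tags_body)
    # Assumption: a name without a tag refers to the ":latest" tag.
    candidates = {requested} if ":" in requested else {requested, f"{requested}:latest"}
    if names & candidates:
        # The line must end with a real newline byte, not the two characters "\" and "n".
        return json.dumps({"status": "success"}, separators=(",", ":")).encode() + b"\n"
    return None  # caller forwards the pull to Ollama unchanged


tags = {"models": [{"name": "huihui4-8b-a4b:latest"}, {"name": "qwen3:14b"}]}
print(pull_short_circuit("huihui4-8b-a4b", tags))
print(pull_short_circuit("llama3.1:8b", tags))
```

Returning None for unknown models keeps ordinary registry pulls working through the proxy.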
4. Implementation
4.1 Technology Stack
- Language: Python 3.12 (standard library + aiohttp)
- Lines of code: ~480 (effective, excluding comments/whitespace), 807 total
- Dependencies: aiohttp only
- Deployment: macOS launchd (KeepAlive daemon) or manual python3 proxy.py
4.2 Code Organization
proxy.py
├── normalize_usage()       # Usage field mapping (Ollama → OpenAI)
├── simplify_tools()        # Tool parameter reduction
├── responses_to_chat()     # Request format translation
├── SSEResponseBuilder      # SSE event synthesis engine
│   ├── start() / in_progress() / complete() / error()
│   ├── add_text_delta()
│   └── add_tool_call_start() / add_tool_args_delta() / finish_tool_call()
├── proxy_handler()         # Main request dispatcher
├── _synthesize_sse()       # SSE stream construction
├── health_handler()        # Health check endpoint
└── main()                  # CLI, signal handling, startup check
4.3 Deployment
Production (launchd):
<!-- ~/Library/LaunchAgents/com.x.codex-bridge.plist -->
<key>RunAtLoad</key><true/>
<key>KeepAlive</key><true/>
Control script:
codex-bridge-ctl.sh start|stop|restart|status|logs
Codex aliases:
alias cx14='codex --oss --local-provider ollama -m qwen3:14b'
alias cx14e='codex exec --skip-git-repo-check --oss --local-provider ollama -m qwen3:14b'
5. Experimental Validation
5.1 Test Setup
| Component | Version |
|---|---|
| Codex CLI | v0.130.0 |
| Ollama | 0.23.4 |
| Bridge | v1.1.0 |
| OS | macOS 26.4 (Apple Silicon) |
5.2 Model Compatibility
| Model | Size | Tool Calling | Text Response | Chinese | Notes |
|---|---|---|---|---|---|
| qwen3:14b | 9.3GB | ✅ Stable | β | ✅ Native | Flagship |
| huihui4:8b-a4b | 5.4GB | ✅ Good | β | β | MoE, 4/8.1B active |
| Qwen2.5-Coder-7B-GGUF | 7B | ⚠️ Moderate | β | β | Backup |
| qwen2.5-coder:3b | 1.9GB | ⚠️ Weak | β | β | Lightweight text |
| gpt-oss:20b | 13GB | Not tested | β | β | Too resource-heavy |
| llama3.1:8b | 4.9GB | ⚠️ Weak | β | β | English only |
| deepseek-r1:14b | 9.0GB | Not tested | β | β | Reasoning model |
5.3 End-to-End Test
$ codex exec --skip-git-repo-check --ephemeral --oss \
--local-provider ollama -m "huihui4-8b-a4b:latest" \
"list files in /Users/x/ai-assets/codex-proxy/"
exec
/bin/zsh -lc 'ls -R /Users/x/ai-assets/codex-proxy/'
succeeded in 0ms:
__pycache__
proxy.py
/Users/x/ai-assets/codex-proxy/__pycache__:
proxy.cpython-312.pyc
The model correctly called exec_command({"cmd":"ls -R /Users/x/ai-assets/codex-proxy/"}), Codex executed it, and the result was returned.
5.4 Failure Analysis
During development, we encountered and resolved 10 distinct failure modes:
| # | Failure | Root Cause | Fix |
|---|---|---|---|
| 1 | Port conflict | Ollama launchd auto-restart | Manual port management |
| 2 | 400 Bad Request | Content-Length not updated after body modification | Recalculate header |
| 3 | Content-Length off by 1 | stream:false changed after header calc | Reorder operations |
| 4 | Empty function name | 4100-token tool definition overloads model | simplify_tools() |
| 5 | Text JSON instead of tool_calls | Model outputs {"command": "ls"} as text | Enhanced system prompt |
| 6 | Missing input_tokens | Usage field name mismatch | normalize_usage() |
| 7 | Transport closed | Codex disconnects on malformed completed | Fix usage normalization |
| 8 | output_index=0 for text | Multi-output ordering broken | Dynamic output_index |
| 9 | Pull failure for custom models | Model not in Ollama registry | Pull interception |
| 10 | Literal \n in pull response | Escaped newline written instead of a real newline | Emit a real newline byte |
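Failures 2 and 3 share one cause: Content-Length was computed before the body reached its final form. The following sketch shows the safe ordering; `rewrite_json_body` is a hypothetical helper and not the code in proxy.py.

```python
import json


def rewrite_json_body(headers: dict[str, str], body: bytes,
                      **overrides) -> tuple[dict[str, str], bytes]:
    """Apply overrides to a JSON body, then recompute Content-Length last."""
    payload = json.loads(body)
    payload.update(overrides)  # finish every body modification first
    new_body = json.dumps(payload).encode("utf-8")
    # Drop any stale header regardless of its capitalization.
    new_headers = {k: v for k, v in headers.items() if k.lower() != "content-length"}
    new_headers["Content-Length"] = str(len(new_body))  # length in bytes, not characters
    return new_headers, new_body


headers, body = rewrite_json_body(
    {"content-length": "16", "Content-Type": "application/json"},
    b'{"stream": true}',
    stream=False,
)
print(headers["Content-Length"], body)
```

Serializing once and measuring the resulting bytes avoids the off-by-one seen when "true" (4 characters) became "false" (5 characters) after the header had already been set.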
6. Discussion
6.1 The "Newer API is Better" Fallacy
A counterintuitive finding: the newer /v1/responses endpoint (introduced by OpenAI in 2025) performed worse than the older /v1/chat/completions endpoint for tool-calling through Ollama. This is because the Chat Completions API has been the primary integration target for inference frameworks for years, receiving more testing and native optimization. The Responses API, being newer, has thinner compatibility wrappers.
Lesson: When debugging API compatibility issues, try downgrading to an older API surface before assuming the model or framework is broken.
6.2 Attention Budget as a First-Class Constraint
Tool definitions consume prompt tokens. For an 8B model with a 32K context window, 4,100 tokens of tool definitions represent ~13% of the total budget. But the effective attention budget for tool selection is much smaller, because the model must attend to the system prompt, conversation history, AND tool definitions simultaneously.
Our simplification from 4,100 to 800 tokens (roughly a 5× reduction) was the single most impactful change for model accuracy. This suggests that tool definition design for local models should be treated as a prompt engineering problem, not just an API integration problem.
6.3 SSE Synthesis vs. Real-Time Translation
We chose to synthesize SSE from non-streaming responses rather than translate SSE to SSE in real time. This is a deliberate trade-off:
| Approach | Pros | Cons |
|---|---|---|
| Real-time SSE → SSE | Lower latency, true streaming | Complex state machine, event reordering |
| Non-streaming → SSE | Simple, correct, ~480 LOC | First-byte latency = model generation time |
For local models where generation latency is typically 5-30 seconds, the first-byte latency of non-streaming is acceptable. For production deployments with faster models, real-time translation would be the next optimization.
6.4 Generalizability
The PTB pattern applies beyond Codex-Ollama. Any pair of LLM API surfaces with syntactic-but-not-semantic compatibility can be bridged:
- Cursor + Ollama: Cursor uses a different tool-calling format
- Continue.dev + vLLM: Continue's API expectations vs vLLM's implementation
- LangChain agents + llama.cpp: any agent framework + any inference engine
The core principle is always: identify the API surface where tool-calling works natively, then translate requests to that surface and responses back to the client's expected surface.
6.5 Limitations
- No real streaming: First-byte latency equals full model generation time
- Single-model focus: No load balancing across multiple Ollama instances
- No auth: Assumes local-only deployment
- No automated tests: Manual end-to-end testing only
- No response caching: Repeated queries re-generate
7. Related Work
- LiteLLM (BerriAI, 2024): Universal LLM proxy supporting ~100 providers with format translation. Primarily targets Chat Completions API; Responses API support is nascent.
- OpenRouter: Commercial routing service providing unified Chat Completions interface across providers. Does not address Responses API.
- vLLM OpenAI-compatible server: Built-in API compatibility layer. Focused on serving, not protocol translation between API surfaces.
- OpenAI Agents SDK: Official agent framework using Responses API as primary interface. Our work enables running these agents with local models.
- porter (porter.sh): Lightweight LLM API proxy. Focuses on authentication and routing, not protocol translation.
The unique contribution of this work is the combination of: (a) Responses API to Chat Completions semantic translation, (b) tool schema simplification for local models, and (c) reverse SSE synthesis from non-streaming responses.
8. Conclusion
We presented the Protocol-Translation Bridge (PTB) pattern, a lightweight solution to the problem of heterogeneous LLM API tool-calling compatibility. Our implementation for Codex-Ollama interoperation (~800 lines of Python) successfully restores tool-calling functionality for local models, with zero modifications to either the client framework or the inference engine.
The key findings are:
- API "compatibility" claims require semantic-level verification, not just HTTP-level validation
- Downgrading to older API surfaces can increase semantic fidelity
- Tool schema simplification is a zero-cost optimization for local models
- SSE synthesis from non-streaming responses is a viable alternative to real-time SSE translation
- Usage field naming varies across implementations and requires normalization
The code, deployment configuration, and this report are released as open-source at:
/Users/x/ai-assets/codex-proxy/
├── proxy.py                     # Protocol bridge (v1.1.0)
├── paper/
│   ├── technical-report.md      # This document
│   └── paper.tex                # LaTeX preprint (Level B)
├── com.x.codex-bridge.plist     # macOS launchd configuration
└── codex-bridge-ctl.sh          # Control script
Acknowledgments
Thanks to the Ollama and Codex teams for building the tools that made this work possible. The 12-step diagnosis benefited from rapid iteration enabled by Claude Code's agent capabilities.
References
- OpenAI. "Responses API Reference." https://platform.openai.com/docs/api-reference/responses
- Ollama. "OpenAI Compatibility." https://ollama.com/blog/openai-compatibility
- Codex CLI. "OpenAI Codex CLI." https://github.com/openai/codex
- LiteLLM. "LiteLLM: Call all LLM APIs using the OpenAI format." https://github.com/BerriAI/litellm
- vLLM. "OpenAI-Compatible Server." https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
- Huihui-ai. "Huihui4-8B-A4B: Mixture-of-Experts Language Model." https://huggingface.co/huihui-ai/Huihui4-8B-A4B-GGUF
- Qwen Team. "Qwen3: Technical Report." 2025.
- Anthropic. "Claude Code: Agentic coding tool." https://docs.anthropic.com/en/docs/claude-code