---
title: Anthropic Compatible API
emoji: 🤖
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: apache-2.0
---

Anthropic-Compatible API

A production-ready, self-hosted API that provides full Anthropic Messages API compatibility, serving the Qwen2.5-Coder-7B model on a llama.cpp backend.

Live Dashboard: https://likhonsheikh-anthropic-compatible-api.hf.space

Features

| Feature | Description |
|---------|-------------|
| Full Anthropic API | Complete Messages API compatibility |
| OpenAI API | Dual compatibility with the OpenAI Chat API |
| Streaming (SSE) | Real-time token streaming |
| Tool Use | Function calling / tool use support |
| Extended Thinking | `<thinking>` block support for reasoning |
| Request Queue | Concurrency control with priority |
| Prompt Caching | LRU cache for system prompts |
| Multi-Model | Hot-swap between models |
| Live Dashboard | Built-in web UI with playground |
| Logs Viewer | Real-time API logs |

Quick Start

1. Claude Code CLI

The easiest way to use this API with Claude Code:

# Set environment variables
export ANTHROPIC_API_KEY="any-key"
export ANTHROPIC_BASE_URL="https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic"

# Run Claude Code
claude "Write a Python script that reads a CSV file"

# Or with explicit model
claude --model qwen2.5-coder-7b "Explain this code"

Persistent Configuration (add to ~/.bashrc or ~/.zshrc):

# Anthropic-Compatible API Configuration
export ANTHROPIC_API_KEY="any-key"
export ANTHROPIC_BASE_URL="https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic"

2. Python SDK

import anthropic

client = anthropic.Anthropic(
    api_key="any-key",
    base_url="https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic"
)

# Basic message
message = client.messages.create(
    model="qwen2.5-coder-7b",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello! Write a hello world in Python."}]
)
print(message.content[0].text)

# With system prompt
message = client.messages.create(
    model="qwen2.5-coder-7b",
    max_tokens=1024,
    system="You are a helpful coding assistant. Always include comments in your code.",
    messages=[{"role": "user", "content": "Write a function to calculate factorial"}]
)
print(message.content[0].text)
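
Multi-turn conversations follow the standard Anthropic pattern: keep the history and append the assistant's reply before the next request. A minimal sketch (the prompts are placeholders):

import anthropic

client = anthropic.Anthropic(
    api_key="any-key",
    base_url="https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic"
)

# Keep the full history; append the assistant's reply before the next turn
history = [{"role": "user", "content": "Write a function to reverse a string."}]
first = client.messages.create(model="qwen2.5-coder-7b", max_tokens=512, messages=history)

history.append({"role": "assistant", "content": first.content[0].text})
history.append({"role": "user", "content": "Now add type hints and a docstring."})
second = client.messages.create(model="qwen2.5-coder-7b", max_tokens=512, messages=history)
print(second.content[0].text)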

3. Streaming Response

import anthropic

client = anthropic.Anthropic(
    api_key="any-key",
    base_url="https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic"
)

with client.messages.stream(
    model="qwen2.5-coder-7b",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a detailed explanation of recursion"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
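
Once the stream is exhausted, the SDK's get_final_message() returns the assembled message, which is handy for reading stop_reason and token usage; whether this server reports exact usage counts is not guaranteed, so treat the numbers as indicative:

# Reuses the client defined above; get_final_message() must be called
# before the `with` block exits
with client.messages.stream(
    model="qwen2.5-coder-7b",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize recursion in one sentence."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    final = stream.get_final_message()

print()
print("stop_reason:", final.stop_reason)
print("output_tokens:", final.usage.output_tokens)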

4. Tool Use / Function Calling

import anthropic
import json

client = anthropic.Anthropic(
    api_key="any-key",
    base_url="https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic"
)

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a location",
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

message = client.messages.create(
    model="qwen2.5-coder-7b",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

if message.stop_reason == "tool_use":
    for block in message.content:
        if block.type == "tool_use":
            print(f"Tool: {block.name}")
            print(f"Input: {json.dumps(block.input, indent=2)}")

5. Extended Thinking

import anthropic

client = anthropic.Anthropic(
    api_key="any-key",
    base_url="https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic"
)

message = client.messages.create(
    model="qwen2.5-coder-7b",
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{"role": "user", "content": "Solve step by step: What is 15% of 240?"}]
)

for block in message.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking)
    elif block.type == "text":
        print("=== ANSWER ===")
        print(block.text)

6. TypeScript/JavaScript

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  apiKey: 'any-key',
  baseURL: 'https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic'
});

const message = await client.messages.create({
  model: 'qwen2.5-coder-7b',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Hello!' }]
});

console.log(message.content[0].text);

7. cURL

curl -X POST "https://likhonsheikh-anthropic-compatible-api.hf.space/anthropic/v1/messages" \
  -H "Content-Type: application/json" \
  -H "x-api-key: any-key" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "qwen2.5-coder-7b",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

8. OpenAI SDK (Alternative)

from openai import OpenAI

client = OpenAI(
    api_key="any-key",
    base_url="https://likhonsheikh-anthropic-compatible-api.hf.space/v1"
)

response = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=1024
)
print(response.choices[0].message.content)
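
Streaming also works through the OpenAI-compatible route; this sketch assumes /v1/chat/completions honors stream=True the same way the Anthropic endpoint does:

# Streaming with the OpenAI SDK (reuses the client above)
stream = client.chat.completions.create(
    model="qwen2.5-coder-7b",
    messages=[{"role": "user", "content": "Write a haiku about code."}],
    max_tokens=256,
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)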

API Reference

Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | / | Dashboard with status & playground |
| GET | /health | Health check with queue/cache stats |
| GET | /logs?lines=100 | View API logs |
| GET | /queue/status | Request queue statistics |
| GET | /models/status | Loaded models information |
| POST | /models/{id}/load | Manually load a model |
| POST | /models/{id}/unload | Unload a model |
| GET | /anthropic/v1/models | List models (Anthropic format) |
| POST | /anthropic/v1/messages | Create message (Anthropic API) |
| POST | /anthropic/v1/messages/count_tokens | Count tokens |
| GET | /v1/models | List models (OpenAI format) |
| POST | /v1/chat/completions | Chat completion (OpenAI API) |
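
The operational endpoints return plain JSON, so they are easy to script; the exact response fields are server-defined, so this sketch just prints whatever comes back:

import requests

BASE = "https://likhonsheikh-anthropic-compatible-api.hf.space"

# Poll the status endpoints; field names are whatever the server returns
for path in ("/health", "/queue/status", "/models/status"):
    resp = requests.get(BASE + path, timeout=30)
    print(path, resp.status_code, resp.json())

# Fetch the most recent 50 log lines
print(requests.get(BASE + "/logs", params={"lines": 50}, timeout=30).text)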

Request Format

{
  "model": "qwen2.5-coder-7b",
  "max_tokens": 1024,
  "messages": [{"role": "user", "content": "Hello!"}],
  "system": "You are a helpful assistant.",
  "temperature": 0.7,
  "stream": false,
  "tools": [...],
  "thinking": {"type": "enabled", "budget_tokens": 1024}
}

Response Format

{
  "id": "msg_abc123",
  "type": "message",
  "role": "assistant",
  "content": [{"type": "text", "text": "Hello!"}],
  "model": "qwen2.5-coder-7b",
  "stop_reason": "end_turn",
  "usage": {"input_tokens": 10, "output_tokens": 25}
}
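
The same request and response shapes can be exercised without an SDK; this sketch posts the documented body and reads the documented fields, assuming the headers shown in the cURL example above:

import requests

BASE = "https://likhonsheikh-anthropic-compatible-api.hf.space"

body = {
    "model": "qwen2.5-coder-7b",
    "max_tokens": 1024,
    "system": "You are a helpful assistant.",
    "temperature": 0.7,
    "messages": [{"role": "user", "content": "Hello!"}]
}

resp = requests.post(
    BASE + "/anthropic/v1/messages",
    headers={"x-api-key": "any-key", "anthropic-version": "2023-06-01"},
    json=body,
    timeout=120
).json()

# Fields documented in the response format above
print(resp["content"][0]["text"])
print(resp["stop_reason"], resp["usage"])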

Model Info

| Property | Value |
|----------|-------|
| Model | Qwen2.5-Coder-7B-Instruct |
| Format | GGUF (Q4_K_M quantization) |
| Parameters | 7 billion |
| Context Length | 8,192 tokens |
| Backend | llama.cpp |
| Optimized For | Code, tool use, agent workflows |
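
With an 8,192-token context window, long prompts are worth measuring against the count_tokens endpoint before sending; the input_tokens field below follows the Anthropic response shape and is assumed to match this server:

import requests

BASE = "https://likhonsheikh-anthropic-compatible-api.hf.space"

counted = requests.post(
    BASE + "/anthropic/v1/messages/count_tokens",
    headers={"x-api-key": "any-key", "anthropic-version": "2023-06-01"},
    json={
        "model": "qwen2.5-coder-7b",
        "messages": [{"role": "user", "content": "A long prompt to measure..."}]
    },
    timeout=60
).json()

# Keep input_tokens + max_tokens under the 8,192-token context window
print("input_tokens:", counted.get("input_tokens"))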

Troubleshooting

| Issue | Solution |
|-------|----------|
| Connection Timeout | The Space may be sleeping; the first request wakes it (~30s) |
| 503 Queue Full | Too many concurrent requests; retry in a few seconds |
| Slow Response | Inference runs on CPU; expect ~10-30 tokens/second |
| Tool Use Issues | Make sure each tool's input_schema is a valid JSON Schema |
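
For the first two issues, a small retry loop with exponential backoff usually suffices; this is a generic sketch, not part of the API:

import time
import requests

BASE = "https://likhonsheikh-anthropic-compatible-api.hf.space"

def post_with_retry(path, body, attempts=5):
    # Retry on 503 (queue full) and on timeouts while the Space wakes up
    for attempt in range(attempts):
        try:
            r = requests.post(
                BASE + path,
                headers={"x-api-key": "any-key", "anthropic-version": "2023-06-01"},
                json=body,
                timeout=120
            )
            if r.status_code != 503:
                r.raise_for_status()
                return r.json()
        except requests.exceptions.Timeout:
            pass
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("API did not respond after retries")

reply = post_with_retry("/anthropic/v1/messages", {
    "model": "qwen2.5-coder-7b",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Hello!"}]
})
print(reply["content"][0]["text"])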

License

Apache 2.0 | Built with llama.cpp + FastAPI by Matrix Agent