---
title: LLM Chat API
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---

# LLM Chat API

OpenAI-compatible chat API running Qwen 2.5 3B on CPU with optimized llama-cpp build.

## SillyTavern Connection

API Connections → Chat Completion → Custom (OpenAI-compatible):
- **Server URL**: `https://YOUR-SPACE-NAME.hf.space`
- **Model**: `qwen-3b`
- **API Key**: anything (not validated)

## Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/` | Status page |
| GET | `/health` | Health check |
| GET | `/v1/models` | List models |
| POST | `/v1/chat/completions` | Chat (streaming supported) |

## Notes

- First boot downloads the model (~2.5GB) into persistent storage `/data/models/`
- Subsequent boots load from cache instantly
- Built with OpenBLAS + AVX2 for best CPU throughput