Spaces:
Paused
Paused
metadata
title: LLM Chat API
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
LLM Chat API
OpenAI-compatible chat API running Qwen 2.5 3B on CPU with optimized llama-cpp build.
SillyTavern Connection
API Connections → Chat Completion → Custom (OpenAI-compatible):
- Server URL:
https://YOUR-SPACE-NAME.hf.space - Model:
qwen-3b - API Key: anything (not validated)
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | / |
Status page |
| GET | /health |
Health check |
| GET | /v1/models |
List models |
| POST | /v1/chat/completions |
Chat (streaming supported) |
Notes
- First boot downloads the model (~2.5GB) into persistent storage
/data/models/ - Subsequent boots load from cache instantly
- Built with OpenBLAS + AVX2 for best CPU throughput