---
title: Fallback Module Trial
emoji: π
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen3-14B: OpenAI-compatible API (CPU · Docker)
A lightweight FastAPI server that wraps Qwen3-14B (GGUF quantised) and exposes an OpenAI-compatible REST API so tools like Paperclip can talk to it as a drop-in OpenAI provider.
## Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/v1/models` | List available models |
| `POST` | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| `GET` | `/health` | Docker health check |
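As a rough illustration of how a client might address these endpoints from Python, the sketch below builds a chat request with only the standard library. The helper names `build_chat_request` and `send_json` are hypothetical, and the payload fields simply mirror the curl examples in this README; it is a sketch under those assumptions, not the server's own client.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # address used throughout this README


def build_chat_request(messages, model="qwen3-14b", max_tokens=256,
                       stream=False, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": stream,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request, timeout=300):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `send_json(build_chat_request([{"role": "user", "content": "Hello!"}]))` should return the decoded completion. The generous timeout is a guess that allows for slow CPU inference.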
## 1. Download the model
You need a GGUF quantised version of Qwen3-14B. The recommended variant for CPU is Q4_K_M (~9 GB RAM at runtime).

    mkdir -p models

    # Option A: Hugging Face CLI
    pip install huggingface_hub
    huggingface-cli download \
      bartowski/Qwen3-14B-GGUF \
      Qwen3-14B-Q4_K_M.gguf \
      --local-dir ./models

    # Option B: wget (find the direct URL on the HF repo)
    # wget -O models/qwen3-14b-q4_k_m.gguf <url>

Option A saves the file as Qwen3-14B-Q4_K_M.gguf. Rename it so it matches the filename in MODEL_PATH in docker-compose.yml (default: qwen3-14b-q4_k_m.gguf), or change the env var to match your filename.
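If you prefer to script the rename, something along these lines could work. The function name `match_model_filename` is hypothetical, and the sketch assumes the model sits in a local `models` directory as in the download commands above.

```python
from pathlib import Path


def match_model_filename(models_dir, downloaded_name, expected_name):
    """Rename the downloaded GGUF so its name matches the MODEL_PATH filename."""
    models_dir = Path(models_dir)
    source = models_dir / downloaded_name
    target = models_dir / expected_name
    if source.exists():
        # Refuse to clobber a different file that already has the target name.
        if target.exists() and not source.samefile(target):
            raise FileExistsError(f"{target} already exists")
        source.rename(target)
        return target
    if target.exists():
        return target  # already renamed, nothing to do
    raise FileNotFoundError(f"expected a downloaded model at {source}")
```

For example, `match_model_filename("models", "Qwen3-14B-Q4_K_M.gguf", "qwen3-14b-q4_k_m.gguf")` is safe to run twice: the second call finds the target in place and returns it.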
## 2. Build & run

    # Build the image (one-time, ~5-10 min; compiles llama-cpp from source)
    docker compose build

    # Start the server
    docker compose up -d

    # Tail logs
    docker compose logs -f

The API will be available at http://localhost:8000 once the model has loaded (allow ~1-3 min for the GGUF to map into memory on first start).
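Because the model takes a while to load, a script that talks to the API may want to wait first. The sketch below polls until a probe succeeds. It assumes that `/health` answers HTTP 200 once the server is ready, which this README does not spell out, and the names `http_ok` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or an error status


def wait_until_ready(probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call `probe()` until it returns True; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"server not ready after {attempts} attempts")
```

A typical call would be `wait_until_ready(lambda: http_ok("http://localhost:8000/health"))`. The defaults retry for about five minutes, which leaves some margin over the load time quoted above.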
## 3. Connect Paperclip
In Paperclip's settings, add a new OpenAI-compatible provider:
| Setting | Value |
|---|---|
| Base URL | http://localhost:8000/v1 |
| API Key | (any non-empty string, e.g. local) |
| Model | qwen3-14b |
Paperclip will call /v1/models to verify the connection and /v1/chat/completions for inference; both are implemented.
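To check by hand what such a verification step would see, you can test whether the configured model name appears in the model list. This sketch assumes the response follows OpenAI's list shape (a `data` array of objects with an `id`); the sample body is hypothetical, not captured from the server.

```python
import json


def model_is_listed(models_response, model_id="qwen3-14b"):
    """Return True if `model_id` appears in a decoded /v1/models response."""
    entries = models_response.get("data", [])
    return any(entry.get("id") == model_id for entry in entries)


# Hypothetical response body, shaped like OpenAI's model list.
sample = json.loads(
    '{"object": "list", "data": [{"id": "qwen3-14b", "object": "model"}]}'
)
print(model_is_listed(sample))  # True
```

If this returns False against the real server, the Model setting in Paperclip probably does not match `MODEL_ID`.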
## 4. Environment variables
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `/models/qwen3-14b-q4_k_m.gguf` | Path to GGUF file inside container |
| `MODEL_ID` | `qwen3-14b` | Model name returned by the API |
| `N_CTX` | `4096` | Context window size (tokens) |
| `N_THREADS` | `8` | CPU threads; set to your physical core count |
| `N_BATCH` | `512` | Prompt processing batch size |
| `VERBOSE` | `false` | Enable llama.cpp verbose logging |
Override in docker-compose.yml or pass with -e flags to docker run.
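For readers who want to see how these variables might be consumed, here is a minimal sketch that reads them with the defaults from the table. The `Settings` class and `load_settings` function are hypothetical and are not taken from the server's source; in particular, the accepted spellings for `VERBOSE` are an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    # Defaults copied from the table above.
    model_path: str = "/models/qwen3-14b-q4_k_m.gguf"
    model_id: str = "qwen3-14b"
    n_ctx: int = 4096
    n_threads: int = 8
    n_batch: int = 512
    verbose: bool = False


def load_settings(env=None):
    """Read the six variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    defaults = Settings()
    return Settings(
        model_path=env.get("MODEL_PATH", defaults.model_path),
        model_id=env.get("MODEL_ID", defaults.model_id),
        n_ctx=int(env.get("N_CTX", defaults.n_ctx)),
        n_threads=int(env.get("N_THREADS", defaults.n_threads)),
        n_batch=int(env.get("N_BATCH", defaults.n_batch)),
        # Assumption: these spellings count as "on"; the real server may differ.
        verbose=str(env.get("VERBOSE", "false")).strip().lower() in {"1", "true", "yes"},
    )


print(load_settings({"N_CTX": "2048", "N_THREADS": "4"}))
```

Passing a plain dict instead of `os.environ` makes it easy to try overrides without touching the real environment.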
## 5. Performance tips (CPU)
- Set `N_THREADS` to your physical core count (not hyper-threaded). On a modern 8-core machine `N_THREADS=8` is a good start.
- Expect ~3-8 tokens/sec on a modern laptop; a server with many cores does better.
- If you have more RAM, try Q5_K_M or Q6_K for better quality.
- Reduce `N_CTX` to `2048` if you hit memory pressure.
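To put the throughput figure in perspective, the arithmetic below estimates how long a reply of a given length takes at the quoted 3 to 8 tokens/sec. It covers generation only and ignores prompt processing, so treat it as a rough lower bound; `generation_seconds` is a hypothetical helper.

```python
def generation_seconds(tokens, tokens_per_second):
    """Time to generate `tokens` at a steady rate (prompt processing excluded)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


for rate in (3, 8):
    seconds = generation_seconds(256, rate)
    print(f"{rate} tok/s -> about {seconds:.0f} s for 256 tokens")
```

At these rates a 256-token reply takes roughly 32 to 85 seconds, which is worth keeping in mind when choosing `max_tokens` and client timeouts.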
## Quick test (curl)

    # List models
    curl http://localhost:8000/v1/models

    # Non-streaming chat
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3-14b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256
      }'

    # Streaming chat
    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "qwen3-14b",
        "messages": [{"role": "user", "content": "Tell me a joke."}],
        "max_tokens": 256,
        "stream": true
      }'
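
The streaming response can also be consumed from Python. The sketch below assumes the server emits OpenAI-style server-sent events: `data:` lines carrying JSON chunks with `choices[0].delta.content`, terminated by `data: [DONE]`. That format is inferred from the "OpenAI-compatible" claim rather than confirmed here, and `iter_stream_text` and the sample lines are hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an OpenAI-style SSE stream, line by line."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hypothetical stream, shaped like OpenAI chat completion chunks.
sample_lines = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_stream_text(sample_lines)))  # Hello
```

The response object returned by `urllib.request.urlopen` iterates over lines, so it can be passed to `iter_stream_text` directly when the request sets `"stream": true`.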