---
title: Fallback Module Trial
emoji: 🌖
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen3-14B — OpenAI-compatible API (CPU · Docker)

A lightweight FastAPI server that wraps **Qwen3-14B** (GGUF quantised) and
exposes an **OpenAI-compatible REST API** so tools like **Paperclip** can talk
to it as a drop-in OpenAI provider.

## Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET`  | `/v1/models` | List available models |
| `POST` | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| `GET`  | `/health` | Docker health check |

---

## 1 — Download the model

You need a **GGUF quantised** version of Qwen3-14B.
The recommended variant for CPU is **Q4_K_M** (~9 GB RAM at runtime).

```bash
mkdir -p models

# Option A — Hugging Face CLI
pip install huggingface_hub
huggingface-cli download \
  bartowski/Qwen3-14B-GGUF \
  Qwen3-14B-Q4_K_M.gguf \
  --local-dir ./models

# Option B — wget (find the direct URL on the HF repo)
# wget -O models/qwen3-14b-q4_k_m.gguf <url>
```

Rename the downloaded file so its name matches the filename in `MODEL_PATH`
(default `qwen3-14b-q4_k_m.gguf`), or change `MODEL_PATH` in
`docker-compose.yml` to match the name you downloaded.
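
Before building the image, it can help to confirm that the download really is a GGUF file and not, say, an HTML error page saved under the wrong name. The sketch below checks only the four-byte `GGUF` magic at the start of the file. The helper name `looks_like_gguf` and the host-side path are illustrative assumptions, not part of this project.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files begin with these four bytes


def looks_like_gguf(path: Path) -> bool:
    """Return True if `path` is a file that starts with the GGUF magic bytes."""
    if not path.is_file():
        return False
    with path.open("rb") as fh:
        return fh.read(4) == GGUF_MAGIC


if __name__ == "__main__":
    # Assumed host-side location; adjust to your volume mount and MODEL_PATH.
    model = Path("models/qwen3-14b-q4_k_m.gguf")
    status = "ok" if looks_like_gguf(model) else "missing or not a GGUF file"
    print(f"{model}: {status}")
```

This is a cheap sanity check only; it does not validate the quantisation type or detect a truncated download.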

---

## 2 — Build & run

```bash
# Build the image (one-time, ~5–10 min — compiles llama-cpp from source)
docker compose build

# Start the server
docker compose up -d

# Tail logs
docker compose logs -f
```

The API will be available at **http://localhost:8000** once the model has
loaded (allow ~1–3 min for the GGUF to map into memory on first start).
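
Because the model takes a while to load, scripts that call the API right after `docker compose up -d` may want to wait first. A minimal polling sketch follows. It assumes `/health` answers with HTTP 200 only once the server is ready, which this README does not state explicitly, and the function names are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(
    url: str = "http://localhost:8000/health",
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `url` until `probe` succeeds or roughly `timeout_s` seconds pass."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() + interval_s > deadline:
            return False  # the next attempt would land past the deadline
        time.sleep(interval_s)
```

Calling `wait_until_ready()` with the defaults blocks for up to five minutes, which leaves some margin over the 1–3 min load time mentioned above. The `probe` parameter exists so the loop can be exercised without a running server.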

---

## 3 — Connect Paperclip

In Paperclip's settings, add a new **OpenAI-compatible** provider:

| Setting | Value |
|---------|-------|
| **Base URL** | `http://localhost:8000/v1` |
| **API Key** | *(any non-empty string, e.g. `local`)* |
| **Model** | `qwen3-14b` |

Paperclip will call `/v1/models` to verify the connection and
`/v1/chat/completions` for inference — both are implemented.
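
For clients other than Paperclip, the same three settings are all that is needed. The sketch below builds a non-streaming request with the Python standard library and extracts the reply. It assumes the server returns the usual OpenAI chat completion shape (`choices[0].message.content`) and accepts a `Bearer` token the way OpenAI clients send it; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # Base URL from the table above
API_KEY = "local"                      # any non-empty string
MODEL = "qwen3-14b"                    # must match MODEL_ID on the server


def build_chat_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Return the assistant text from an OpenAI-style chat completion body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]
```

To send it, pass the request to `urllib.request.urlopen(...)` and feed `resp.read()` to `extract_reply`. On CPU a generous timeout is advisable, since a 256-token reply can take a minute or more.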

---

## 4 — Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | `/models/qwen3-14b-q4_k_m.gguf` | Path to GGUF file inside container |
| `MODEL_ID` | `qwen3-14b` | Model name returned by the API |
| `N_CTX` | `4096` | Context window size (tokens) |
| `N_THREADS` | `8` | CPU threads — set to your physical core count |
| `N_BATCH` | `512` | Prompt processing batch size |
| `VERBOSE` | `false` | Enable llama.cpp verbose logging |

Override any of these in `docker-compose.yml`, or pass them to `docker run`
with `-e` flags.
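
To illustrate how the table maps to typed values, here is one way a server could read these variables with the documented defaults. This is a sketch, not the code shipped in this image: the `Settings` class, `load_settings`, and the set of strings treated as true for `VERBOSE` are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    model_id: str
    n_ctx: int
    n_threads: int
    n_batch: int
    verbose: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables from the table, falling back to the listed defaults."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/qwen3-14b-q4_k_m.gguf"),
        model_id=env.get("MODEL_ID", "qwen3-14b"),
        n_ctx=int(env.get("N_CTX", "4096")),
        n_threads=int(env.get("N_THREADS", "8")),
        n_batch=int(env.get("N_BATCH", "512")),
        # Assumed convention: "1", "true" or "yes" (any case) enables logging.
        verbose=env.get("VERBOSE", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Parsing integers up front means a typo such as `N_CTX=4k` fails at startup with a `ValueError` rather than later during inference.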

---

## 5 — Performance tips (CPU)

- Set `N_THREADS` to the number of **physical** cores, not the number of
  logical (hyper-threaded) CPUs. On a modern 8-core machine `N_THREADS=8` is a
  good start.
- Expect ~3–8 tokens/sec on a modern laptop; a server with many cores does better.
- If you have more RAM, try **Q5_K_M** or **Q6_K** for better quality.
- Reduce `N_CTX` to `2048` if you hit memory pressure.
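
The reason a smaller `N_CTX` relieves memory pressure is that the key/value cache grows linearly with the context length. The rough estimate below uses the standard formula for grouped-query attention with 16-bit cache entries. The layer count, KV head count, and head dimension are assumed values for Qwen3-14B and should be checked against the model's own config; the function name is illustrative.

```python
def kv_cache_bytes(
    n_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> int:
    """Estimate KV cache size: one K and one V tensor per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_ctx


# ASSUMED architecture values; verify against the model's config before relying on them.
ASSUMED_ARCH = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 2048):
    mib = kv_cache_bytes(n_ctx, **ASSUMED_ARCH) / 2**20
    print(f"N_CTX={n_ctx}: ~{mib:.0f} MiB KV cache")
```

Under these assumptions, halving `N_CTX` from 4096 to 2048 halves the cache from about 640 MiB to about 320 MiB. The model weights themselves are unaffected, so the saving is modest next to the size of the GGUF file.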

---

## Quick test (curl)

```bash
# List models
curl http://localhost:8000/v1/models

# Non-streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'

# Streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "max_tokens": 256,
    "stream": true
  }'
```
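
With `"stream": true`, OpenAI-compatible servers reply with server-sent events: one `data: {...}` line per chunk and a final `data: [DONE]`. The sketch below turns such lines into text fragments. It assumes this server follows that convention (text in `choices[0].delta.content`); the function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming `data:` lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        # The first chunk often carries only the role, with no content.
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

When reading from a live HTTP response, decode each line to `str` first, then print the fragments as they arrive, for example with `print(fragment, end="", flush=True)`.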