---
title: Fallback Module Trial
emoji: 🌖
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen3-14B: OpenAI-compatible API (CPU · Docker)

A lightweight FastAPI server that wraps **Qwen3-14B** (GGUF quantised) and exposes an **OpenAI-compatible REST API** so tools like **Paperclip** can talk to it as a drop-in OpenAI provider.

## Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/v1/models` | List available models |
| `POST` | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| `GET` | `/health` | Docker health check |

---

## 1. Download the model

You need a **GGUF quantised** version of Qwen3-14B. The recommended variant for CPU is **Q4_K_M** (~9 GB RAM at runtime).

```bash
mkdir -p models

# Option A: Hugging Face CLI
pip install huggingface_hub
huggingface-cli download \
  bartowski/Qwen3-14B-GGUF \
  Qwen3-14B-Q4_K_M.gguf \
  --local-dir ./models

# Option B: wget (find the direct URL on the HF repo)
# wget -O models/qwen3-14b-q4_k_m.gguf
```

Option A saves the file as `Qwen3-14B-Q4_K_M.gguf`, while the default `MODEL_PATH` expects `qwen3-14b-q4_k_m.gguf`. Rename the file so it matches `MODEL_PATH` in `docker-compose.yml`, or change the environment variable to match your filename.

---

## 2. Build & run

```bash
# Build the image (one-time, ~5–10 min; compiles llama-cpp from source)
docker compose build

# Start the server
docker compose up -d

# Tail logs
docker compose logs -f
```

The API will be available at **http://localhost:8000** once the model has loaded (allow ~1–3 min for the GGUF to map into memory on first start).

---

## 3. Connect Paperclip

In Paperclip's settings, add a new **OpenAI-compatible** provider:

| Setting | Value |
|---------|-------|
| **Base URL** | `http://localhost:8000/v1` |
| **API Key** | *(any non-empty string, e.g. `local`)* |
| **Model** | `qwen3-14b` |

Paperclip will call `/v1/models` to verify the connection and `/v1/chat/completions` for inference; both are implemented.

---

## 4. Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | `/models/qwen3-14b-q4_k_m.gguf` | Path to GGUF file inside container |
| `MODEL_ID` | `qwen3-14b` | Model name returned by the API |
| `N_CTX` | `4096` | Context window size (tokens) |
| `N_THREADS` | `8` | CPU threads; set to your physical core count |
| `N_BATCH` | `512` | Prompt processing batch size |
| `VERBOSE` | `false` | Enable llama.cpp verbose logging |

Override in `docker-compose.yml` or pass with `-e` flags to `docker run`.

---

## 5. Performance tips (CPU)

- Set `N_THREADS` to your **physical** core count, not the hyper-threaded (logical) count. On a modern 8-core machine `N_THREADS=8` is a good start.
- Expect ~3–8 tokens/sec on a modern laptop; a server with many cores does better.
- If you have more RAM, try **Q5_K_M** or **Q6_K** for better quality.
- Reduce `N_CTX` to `2048` if you hit memory pressure.

---

## Quick test (curl)

```bash
# List models
curl http://localhost:8000/v1/models

# Non-streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'

# Streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "max_tokens": 256,
    "stream": true
  }'
```
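
The same requests can be made from Python. The following is a minimal sketch using only the standard library, not part of this repository. It assumes the server follows the usual OpenAI wire format: non-streaming replies carry the text in `choices[0].message.content`, and streaming replies arrive as `data: {...}` lines with text in `choices[0].delta.content`, ending with `data: [DONE]`. The helper names (`build_chat_request`, `iter_stream_text`, `chat`) are hypothetical.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Same base URL as in the Paperclip settings table.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt: str, stream: bool = False, max_tokens: int = 256) -> dict:
    """Build the JSON body used by the curl examples."""
    return {
        "model": "qwen3-14b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    }


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data line
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


def chat(prompt: str, stream: bool = False) -> str:
    """POST to /v1/chat/completions and return the full reply text."""
    body = json.dumps(build_chat_request(prompt, stream=stream)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        if stream:
            # Iterating the response yields one line of bytes at a time.
            return "".join(iter_stream_text(resp))
        return json.load(resp)["choices"][0]["message"]["content"]
```

With the container running, `chat("Hello!")` mirrors the non-streaming curl call and `chat("Tell me a joke.", stream=True)` mirrors the streaming one. Keeping the line parser separate from the network call makes it easy to check against recorded output without a running server.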