---
title: Fallback Module Trial
emoji: π
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen3-14B - OpenAI-compatible API (CPU · Docker)

A lightweight FastAPI server that wraps **Qwen3-14B** (GGUF quantised) and
exposes an **OpenAI-compatible REST API** so tools like **Paperclip** can talk
to it as a drop-in OpenAI provider.

## Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/v1/models` | List available models |
| `POST` | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| `GET` | `/health` | Docker health check |

---

## 1 - Download the model

You need a **GGUF quantised** version of Qwen3-14B.
The recommended variant for CPU is **Q4_K_M** (~9 GB RAM at runtime).

```bash
mkdir -p models

# Option A - Hugging Face CLI
pip install huggingface_hub
huggingface-cli download \
  bartowski/Qwen3-14B-GGUF \
  Qwen3-14B-Q4_K_M.gguf \
  --local-dir ./models

# Option B - wget (find the direct URL on the HF repo)
# wget -O models/qwen3-14b-q4_k_m.gguf <url>
```
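
Option A saves the file as `Qwen3-14B-Q4_K_M.gguf`, while the default
`MODEL_PATH` expects `qwen3-14b-q4_k_m.gguf`. Rename the file so it matches
`MODEL_PATH` in `docker-compose.yml`, or change the env var to match your filename.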
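
The rename step can also be scripted. The following is a minimal sketch, assuming the file was downloaded into `./models`; the helper name `place_model` is hypothetical and not part of this project.

```python
from pathlib import Path


def place_model(models_dir: str, downloaded_name: str, target_name: str) -> Path:
    """Rename a downloaded GGUF file so its name matches the basename of MODEL_PATH.

    Hypothetical helper for illustration only.
    """
    src = Path(models_dir) / downloaded_name
    dst = Path(models_dir) / target_name
    if src.exists():
        # Rename in place; this also covers case-only renames.
        src.rename(dst)
        return dst
    if dst.exists():
        # Already renamed on an earlier run; nothing to do.
        return dst
    raise FileNotFoundError(f"expected a downloaded model at {src}")
```

For the defaults in this README the call would be `place_model("models", "Qwen3-14B-Q4_K_M.gguf", "qwen3-14b-q4_k_m.gguf")`.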

---

## 2 - Build & run

```bash
# Build the image (one-time, ~5–10 min; compiles llama-cpp from source)
docker compose build

# Start the server
docker compose up -d

# Tail logs
docker compose logs -f
```

The API will be available at **http://localhost:8000** once the model has
loaded (allow ~1–3 min for the GGUF to map into memory on first start).
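
Because the model takes a while to load, a script that depends on the API may want to poll `/health` before sending requests. This is a minimal sketch using only the standard library. It assumes `/health` answers with HTTP 200 once the server is ready, which this README does not state explicitly; the function names and retry defaults are illustrative.

```python
import time
import urllib.request
from typing import Callable


def health_ok(url: str = "http://localhost:8000/health", timeout: float = 5.0) -> bool:
    """Return True if GET `url` answers with HTTP 200 (assumed readiness signal)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP errors all count as "not ready".
        return False


def wait_until_ready(
    probe: Callable[[], bool] = health_ok,
    attempts: int = 60,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

With the defaults, `wait_until_ready()` waits up to about five minutes, which leaves headroom over the 1–3 minute load time mentioned above.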

---

## 3 - Connect Paperclip

In Paperclip's settings, add a new **OpenAI-compatible** provider:

| Setting | Value |
|---------|-------|
| **Base URL** | `http://localhost:8000/v1` |
| **API Key** | *(any non-empty string, e.g. `local`)* |
| **Model** | `qwen3-14b` |

Paperclip will call `/v1/models` to verify the connection and
`/v1/chat/completions` for inference; both are implemented.
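
Any other OpenAI-style client can use the same three settings. The sketch below builds the equivalent request with the Python standard library. It assumes the response follows the usual OpenAI chat-completion shape (`choices[0].message.content`); the function names are illustrative, and the `Authorization` header is sent only because OpenAI clients conventionally do so.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"


def build_chat_request(
    prompt: str,
    model: str = "qwen3-14b",
    api_key: str = "local",
    max_tokens: int = 256,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-shaped JSON response body.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Hello!")` against a running server should return the reply as a plain string; on CPU the call can block for a while.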

---

## 4 - Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | `/models/qwen3-14b-q4_k_m.gguf` | Path to GGUF file inside container |
| `MODEL_ID` | `qwen3-14b` | Model name returned by the API |
| `N_CTX` | `4096` | Context window size (tokens) |
| `N_THREADS` | `8` | CPU threads; set to your physical core count |
| `N_BATCH` | `512` | Prompt processing batch size |
| `VERBOSE` | `false` | Enable llama.cpp verbose logging |

Override in `docker-compose.yml` or pass with `-e` flags to `docker run`.
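
To make the defaults concrete, here is a sketch of how a server could read these variables at startup. It is not this project's actual code: the `Settings` class, the `load_settings` function, and the accepted truthy spellings for `VERBOSE` are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_path: str
    model_id: str
    n_ctx: int
    n_threads: int
    n_batch: int
    verbose: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, falling back to the defaults in the table."""
    env = os.environ if env is None else env
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/qwen3-14b-q4_k_m.gguf"),
        model_id=env.get("MODEL_ID", "qwen3-14b"),
        n_ctx=int(env.get("N_CTX", "4096")),
        n_threads=int(env.get("N_THREADS", "8")),
        n_batch=int(env.get("N_BATCH", "512")),
        # Assumed truthy spellings; the real server may parse this differently.
        verbose=env.get("VERBOSE", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Passing a plain dict, for example `load_settings({"N_CTX": "2048"})`, makes it easy to check an override without touching the real environment.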

---

## 5 - Performance tips (CPU)

- Set `N_THREADS` to your **physical** core count, not the logical (hyper-threaded) count.
  On a modern 8-core machine, `N_THREADS=8` is a good start.
- Expect ~3–8 tokens/sec on a modern laptop; a server with more cores should do better.
- If you have more RAM, try **Q5_K_M** or **Q6_K** for better quality.
- Reduce `N_CTX` to `2048` if you hit memory pressure.
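
The throughput figure above translates directly into how long a reply takes. A rough estimate, which ignores prompt processing time and assumes the model uses the full `max_tokens` budget:

```python
def generation_seconds(max_tokens: int, tokens_per_sec: float) -> float:
    """Estimate generation time as tokens divided by throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return max_tokens / tokens_per_sec


# The 3-8 tokens/sec laptop range quoted above, for a 256-token reply.
for rate in (3.0, 8.0):
    print(f"{rate:.0f} tok/s -> {generation_seconds(256, rate):.0f} s for 256 tokens")
# prints:
# 3 tok/s -> 85 s for 256 tokens
# 8 tok/s -> 32 s for 256 tokens
```

A 256-token answer can therefore take from roughly half a minute to a minute and a half on a laptop, so set client timeouts accordingly or use streaming.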

---

## Quick test (curl)

```bash
# List models
curl http://localhost:8000/v1/models

# Non-streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'

# Streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "max_tokens": 256,
    "stream": true
  }'
```
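
With `"stream": true`, an OpenAI-compatible server normally replies with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below extracts the text from such lines. It assumes this server follows that convention and puts each text fragment in `choices[].delta.content`; the function name is illustrative.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the assumed format, not captured server output.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

To use it against the live server, decode each line of the HTTP response body as UTF-8 and pass the resulting strings to `iter_stream_text`.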