---
title: Fallback Module Trial
emoji: 🌖
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen3-14B: OpenAI-compatible API (CPU · Docker)
A lightweight FastAPI server that wraps **Qwen3-14B** (GGUF quantised) and
exposes an **OpenAI-compatible REST API** so tools like **Paperclip** can talk
to it as a drop-in OpenAI provider.
## Endpoints
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/v1/models` | List available models |
| `POST` | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| `GET` | `/health` | Docker health check |
---
## 1. Download the model
You need a **GGUF quantised** version of Qwen3-14B.
The recommended variant for CPU is **Q4_K_M** (~9 GB RAM at runtime).
```bash
mkdir -p models
# Option A: Hugging Face CLI
pip install huggingface_hub
huggingface-cli download \
  bartowski/Qwen3-14B-GGUF \
  Qwen3-14B-Q4_K_M.gguf \
  --local-dir ./models

# Option B: wget (find the direct URL on the HF repo)
# wget -O models/qwen3-14b-q4_k_m.gguf <url>
```
Rename the downloaded file so it matches `MODEL_PATH` in `docker-compose.yml`
(the default expects `qwen3-14b-q4_k_m.gguf`), or change the env var to match
your filename.
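
The rename step can also be scripted. The following is a minimal sketch, not part of this repository: `match_model_path` is a hypothetical helper, and it assumes the host directory you pass in (for example `./models`) is the one mounted at the parent directory of `MODEL_PATH` inside the container, which depends on your `docker-compose.yml`. Example call: `match_model_path(Path("models/Qwen3-14B-Q4_K_M.gguf"), "/models/qwen3-14b-q4_k_m.gguf", Path("models"))`.

```python
from pathlib import Path


def match_model_path(downloaded: Path, model_path: str, host_dir: Path) -> Path:
    """Rename `downloaded` so its filename equals the basename of MODEL_PATH.

    Assumption: `host_dir` is the host directory that the container sees as
    the parent directory of `model_path`.
    """
    target = host_dir / Path(model_path).name
    # Only rename when the file is not already at the expected location.
    if downloaded.resolve() != target.resolve():
        downloaded.rename(target)
    return target
```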
---
## 2. Build & run
```bash
# Build the image (one-time, ~5–10 min; compiles llama-cpp from source)
docker compose build
# Start the server
docker compose up -d
# Tail logs
docker compose logs -f
```
The API will be available at **http://localhost:8000** once the model has
loaded (allow ~1–3 min for the GGUF to map into memory on first start).
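
Because loading takes a while, scripts that depend on the API may want to wait for it. Here is a minimal sketch using only the standard library. It assumes `/health` answers with HTTP 200 once the server is usable, which this README does not state explicitly; `is_healthy` and `wait_until_ready` are illustrative names, and calling `wait_until_ready()` with the defaults polls for up to five minutes.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_until_ready(probe: Callable[[], bool] = is_healthy,
                     timeout_s: float = 300.0,
                     interval_s: float = 5.0) -> bool:
    """Call `probe` repeatedly until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```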
---
## 3. Connect Paperclip
In Paperclip's settings, add a new **OpenAI-compatible** provider:
| Setting | Value |
|---------|-------|
| **Base URL** | `http://localhost:8000/v1` |
| **API Key** | *(any non-empty string, e.g. `local`)* |
| **Model** | `qwen3-14b` |

Paperclip will call `/v1/models` to verify the connection and
`/v1/chat/completions` for inference; both are implemented.
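
To check the same thing by hand before configuring Paperclip, you can list the models and confirm that `qwen3-14b` is present. This is a sketch that assumes the response follows the OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`); the helper names are illustrative, and whether the server inspects the `Authorization` header is not stated here. A typical check would be `"qwen3-14b" in model_ids(fetch_models())`.

```python
import json
import urllib.request


def model_ids(models_response: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style `/v1/models` response body."""
    return [m["id"] for m in models_response.get("data", [])]


def fetch_models(base_url: str = "http://localhost:8000/v1",
                 api_key: str = "local") -> dict:
    """GET `{base_url}/models`, sending the API key as a Bearer token."""
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```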
---
## 4. Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | `/models/qwen3-14b-q4_k_m.gguf` | Path to GGUF file inside container |
| `MODEL_ID` | `qwen3-14b` | Model name returned by the API |
| `N_CTX` | `4096` | Context window size (tokens) |
| `N_THREADS` | `8` | CPU threads; set to your physical core count |
| `N_BATCH` | `512` | Prompt processing batch size |
| `VERBOSE` | `false` | Enable llama.cpp verbose logging |

Override in `docker-compose.yml` or pass with `-e` flags to `docker run`.
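
For orientation, this is roughly how a server could read the variables above with the documented defaults. It is a hypothetical sketch, not the code shipped in this image: the `Settings` class, `load_settings`, and the accepted truthy spellings for `VERBOSE` are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    model_id: str
    n_ctx: int
    n_threads: int
    n_batch: int
    verbose: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/qwen3-14b-q4_k_m.gguf"),
        model_id=env.get("MODEL_ID", "qwen3-14b"),
        n_ctx=int(env.get("N_CTX", "4096")),
        n_threads=int(env.get("N_THREADS", "8")),
        n_batch=int(env.get("N_BATCH", "512")),
        # Assumption: these spellings count as "true"; anything else is false.
        verbose=env.get("VERBOSE", "false").strip().lower() in {"1", "true", "yes"},
    )
```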
---
## 5. Performance tips (CPU)
- Set `N_THREADS` to your **physical** core count, not the logical (hyper-threaded) count.
  On a modern 8-core machine, `N_THREADS=8` is a good start.
- Expect ~3–8 tokens/sec on a modern laptop; a server with many cores does better.
- If you have more RAM, try **Q5_K_M** or **Q6_K** for better quality.
- Reduce `N_CTX` to `2048` if you hit memory pressure.
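
To get a feel for how much memory lowering `N_CTX` frees, you can estimate the key/value cache size. The sketch below is back-of-the-envelope arithmetic under assumptions: the architecture numbers (40 layers, 8 KV heads, head dimension 128) are assumed values for Qwen3-14B that you should check against the model's `config.json`, and it assumes a 16-bit cache (2 bytes per value). Under those assumptions the cache is about 640 MiB at 4096 tokens and 320 MiB at 2048, so the saving is modest next to the model weights.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int = 40, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: one key and one value vector per layer,
    per KV head, per context position. Defaults are assumptions."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


for n_ctx in (4096, 2048):
    print(f"N_CTX={n_ctx}: ~{kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
```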
---
## Quick test (curl)
```bash
# List models
curl http://localhost:8000/v1/models
# Non-streaming chat
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-14b",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 256
}'
# Streaming chat
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-14b",
"messages": [{"role": "user", "content": "Tell me a joke."}],
"max_tokens": 256,
"stream": true
}'
```
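
The streaming request can also be consumed from Python without extra dependencies. This sketch assumes the server uses the OpenAI streaming format: server-sent event lines of the form `data: {json}`, text in `choices[].delta.content`, and a final `data: [DONE]`. The function names are illustrative.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (`data: {...}`)."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


def stream_chat(prompt: str, base_url: str = "http://localhost:8000/v1",
                model: str = "qwen3-14b", max_tokens: int = 256) -> Iterator[str]:
    """POST a streaming chat request and yield the reply text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # Iterating an HTTP response yields its body line by line as bytes.
        yield from iter_stream_text(resp)
```

With the server running, `for piece in stream_chat("Tell me a joke."): print(piece, end="", flush=True)` prints the reply incrementally. Keeping the parser separate from the HTTP call makes it easy to test against canned lines.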