	---
	title: Fallback Module Trial
	emoji: 🌖
	colorFrom: indigo
	colorTo: gray
	sdk: docker
	pinned: false
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

	# Qwen3-14B — OpenAI-compatible API (CPU · Docker)

	A lightweight FastAPI server that wraps Qwen3-14B (GGUF quantised) and
	exposes an OpenAI-compatible REST API so tools like Paperclip can talk
	to it as a drop-in OpenAI provider.

	## Endpoints

	| Method | Path | Description |
	|--------|------|-------------|
	| `GET` | `/v1/models` | List available models |
	| `POST` | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
	| `GET` | `/health` | Docker health check |

	---

	## 1 — Download the model

	You need a GGUF quantised version of Qwen3-14B.
	The recommended variant for CPU is Q4_K_M (~9 GB RAM at runtime).

	```bash
	mkdir -p models

	# Option A — Hugging Face CLI
	pip install huggingface_hub
	huggingface-cli download \
	  bartowski/Qwen3-14B-GGUF \
	  Qwen3-14B-Q4_K_M.gguf \
	  --local-dir ./models

	# Option B — wget (find the direct URL on the HF repo)
	# wget -O models/qwen3-14b-q4_k_m.gguf <url>
	```

	Option A saves the file as `Qwen3-14B-Q4_K_M.gguf`, while the default
	`MODEL_PATH` expects `qwen3-14b-q4_k_m.gguf`. Rename the file to match,
	or change `MODEL_PATH` in `docker-compose.yml` to your filename.

	---

	## 2 — Build & run

	```bash
	# Build the image (one-time, ~5–10 min — compiles llama-cpp from source)
	docker compose build

	# Start the server
	docker compose up -d

	# Tail logs
	docker compose logs -f
	```

	The API will be available at http://localhost:8000 once the model has
	loaded (allow ~1–3 min for the GGUF to map into memory on first start).
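
Because loading takes a while, scripts that depend on the API may want to wait for `/health` before sending requests. The following is a minimal sketch, assuming `/health` answers with HTTP 200 once the server is ready (this README only states that it backs the Docker health check). The names `http_probe` and `wait_for_health`, and the timeout values, are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_health(url="http://localhost:8000/health", timeout_s=300.0,
                    interval_s=5.0, probe=http_probe, sleep=time.sleep):
    """Poll `url` until the probe succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_for_health()` returns `True` as soon as the probe succeeds and `False` if the server never became healthy within the timeout.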

	---

	## 3 — Connect Paperclip

	In Paperclip's settings, add a new OpenAI-compatible provider:

	| Setting | Value |
	|---------|-------|
	| Base URL | `http://localhost:8000/v1` |
	| API Key | (any non-empty string, e.g. `local`) |
	| Model | `qwen3-14b` |

	Paperclip will call `/v1/models` to verify the connection and
	`/v1/chat/completions` for inference — both are implemented.
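
Any other OpenAI-compatible client can use the same settings. As an illustration, here is a standard-library sketch of the request such a client sends. It assumes the response follows the OpenAI chat-completions shape (`choices[0].message.content`) and that the key is sent as a `Bearer` token, as OpenAI clients conventionally do. Whether this server inspects the key is not stated here. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # Base URL from the settings table
API_KEY = "local"                      # any non-empty string
MODEL = "qwen3-14b"


def build_chat_request(prompt, max_tokens=256, base_url=BASE_URL):
    """Build (but do not send) a POST request for /chat/completions."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        payload = json.load(resp)
    # Assumes the OpenAI response shape: choices[0].message.content
    return payload["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply.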

	---

	## 4 — Environment variables

	| Variable | Default | Description |
	|----------|---------|-------------|
	| `MODEL_PATH` | `/models/qwen3-14b-q4_k_m.gguf` | Path to GGUF file inside container |
	| `MODEL_ID` | `qwen3-14b` | Model name returned by the API |
	| `N_CTX` | `4096` | Context window size (tokens) |
	| `N_THREADS` | `8` | CPU threads (set to your physical core count) |
	| `N_BATCH` | `512` | Prompt processing batch size |
	| `VERBOSE` | `false` | Enable llama.cpp verbose logging |

	Override in `docker-compose.yml` or pass with `-e` flags to `docker run`.
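
The server's own configuration code is not shown in this README. For orientation, the sketch below shows one common way such variables are read, using the defaults from the table. `Settings` and `load_settings` are hypothetical names, and the rule for parsing `VERBOSE` is an assumption rather than the project's actual behaviour.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_path: str
    model_id: str
    n_ctx: int
    n_threads: int
    n_batch: int
    verbose: bool


def load_settings(env=None):
    """Read settings from `env` (defaults to os.environ), applying the table's defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/qwen3-14b-q4_k_m.gguf"),
        model_id=env.get("MODEL_ID", "qwen3-14b"),
        n_ctx=int(env.get("N_CTX", "4096")),
        n_threads=int(env.get("N_THREADS", "8")),
        n_batch=int(env.get("N_BATCH", "512")),
        # Assumption: "1", "true" or "yes" (any case) enables verbose logging.
        verbose=env.get("VERBOSE", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Passing a plain dict, for example `load_settings({"N_CTX": "2048"})`, makes it easy to check overrides without touching the real environment.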

	---

	## 5 — Performance tips (CPU)

	- Set `N_THREADS` to the number of physical cores, not the logical (hyper-threaded) count.
	On an 8-core machine, `N_THREADS=8` is a good starting point.
	- Expect ~3–8 tokens/sec on a modern laptop; a server with many cores does better.
	- If you have more RAM, try Q5_K_M or Q6_K for better quality.
	- Reduce `N_CTX` to `2048` if you hit memory pressure.
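
To see why lowering `N_CTX` relieves memory pressure, note that the KV cache grows linearly with the context length. The estimate below is a rough sketch that assumes an f16 cache and architecture figures (40 layers, 8 KV heads, head dimension 128) that should be checked against the model's own config. It excludes the model weights themselves.

```python
def kv_cache_bytes(n_ctx, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size: keys + values for every layer and position."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


for n_ctx in (4096, 2048):
    mib = kv_cache_bytes(n_ctx) / (1024 ** 2)
    print(f"N_CTX={n_ctx}: ~{mib:.0f} MiB of KV cache")
```

Under these assumptions, halving `N_CTX` from 4096 to 2048 halves the cache from roughly 640 MiB to 320 MiB.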

	---

	## Quick test (curl)

	```bash
	# List models
	curl http://localhost:8000/v1/models

	# Non-streaming chat
	curl http://localhost:8000/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "qwen3-14b",
	    "messages": [{"role": "user", "content": "Hello!"}],
	    "max_tokens": 256
	  }'

	# Streaming chat
	curl http://localhost:8000/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "qwen3-14b",
	    "messages": [{"role": "user", "content": "Tell me a joke."}],
	    "max_tokens": 256,
	    "stream": true
	  }'
	```
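
In the OpenAI API, a streaming response arrives as server-sent events: each `data: ` line carries a JSON chunk, and `data: [DONE]` marks the end. The sketch below assumes this server follows that format (the raw stream is not shown in this README). `iter_stream_text` is a hypothetical helper name.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        # Each chunk carries an incremental "delta"; only some contain text.
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

The response object returned by `urllib.request.urlopen(...)` iterates over lines, so it can be passed to `iter_stream_text` directly and the fragments printed as they arrive.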