metadata

title: Fallback Module Trial
emoji: 🌖
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen3-14B: OpenAI-compatible API (CPU · Docker)

A lightweight FastAPI server that wraps Qwen3-14B (GGUF quantised) and exposes an OpenAI-compatible REST API so tools like Paperclip can talk to it as a drop-in OpenAI provider.

Endpoints

| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/v1/models` | List available models |
| `POST` | `/v1/chat/completions` | Chat completions (streaming + non-streaming) |
| `GET` | `/health` | Docker health check |

1 — Download the model

You need a GGUF quantised version of Qwen3-14B. The recommended variant for CPU is Q4_K_M (~9 GB RAM at runtime).

mkdir -p models

# Option A — Hugging Face CLI
pip install huggingface_hub
huggingface-cli download \
  bartowski/Qwen3-14B-GGUF \
  Qwen3-14B-Q4_K_M.gguf \
  --local-dir ./models

# Option B — wget (find the direct URL on the HF repo)
# wget -O models/qwen3-14b-q4_k_m.gguf <url>

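Option A saves the file under the name given on the command line (Qwen3-14B-Q4_K_M.gguf). Rename it so it matches the filename in MODEL_PATH in docker-compose.yml (default: qwen3-14b-q4_k_m.gguf), or change the env var to match your filename.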
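The rename step can also be scripted. The following is a minimal sketch, assuming the models directory holds exactly one .gguf file and that the target name is the default from MODEL_PATH; the helper name `match_model_filename` is hypothetical and not part of this project.

```python
from pathlib import Path


def match_model_filename(models_dir: str, target_name: str = "qwen3-14b-q4_k_m.gguf") -> Path:
    """Rename the single .gguf file in models_dir to target_name and return its path.

    target_name is assumed to be the filename part of MODEL_PATH.
    """
    directory = Path(models_dir)
    target = directory / target_name
    gguf_files = sorted(directory.glob("*.gguf"))

    # Compare names exactly so a case-only difference still triggers a rename.
    if any(path.name == target_name for path in gguf_files):
        return target

    if len(gguf_files) != 1:
        raise FileNotFoundError(
            f"expected exactly one .gguf file in {directory}, found {len(gguf_files)}"
        )

    gguf_files[0].rename(target)
    return target
```

Calling `match_model_filename("models")` after the download leaves a single file whose name matches the default MODEL_PATH. If you keep several quantisations side by side, changing the env var is the simpler route.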

2 — Build & run

# Build the image (one-time, ~5–10 min — compiles llama-cpp from source)
docker compose build

# Start the server
docker compose up -d

# Tail logs
docker compose logs -f

The API will be available at http://localhost:8000 once the model has loaded (allow ~1–3 min for the GGUF to map into memory on first start).
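Because the model takes a while to load, scripts that depend on the API may want to wait for it. Below is a rough sketch of a polling loop against the /health endpoint. It assumes that /health answers with HTTP 200 once the server is ready, which this README does not state explicitly; the function names and the timeout values are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_ok(url: str = "http://localhost:8000/health") -> bool:
    """Return True if GET url answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_ready(
    probe: Callable[[], bool] = health_ok,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe every interval_s seconds until it succeeds or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running container; in normal use `wait_until_ready()` with no arguments is enough.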

3 — Connect Paperclip

In Paperclip's settings, add a new OpenAI-compatible provider:

| Setting | Value |
|---------|-------|
| Base URL | `http://localhost:8000/v1` |
| API Key | (any non-empty string, e.g. `local`) |
| Model | `qwen3-14b` |

Paperclip will call /v1/models to verify the connection and /v1/chat/completions for inference — both are implemented.
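Any other OpenAI-compatible client can use the same three settings. As an illustration, here is a small standard-library sketch that builds the chat request a client would send. It assumes the response follows the usual OpenAI shape (`choices[0].message.content`) and sends the placeholder key as a Bearer token, although the README does not say whether the server inspects it. The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # same value as the Base URL setting


def build_chat_request(
    prompt: str,
    model: str = "qwen3-14b",
    api_key: str = "local",
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    # CPU inference is slow, so allow a generous timeout.
    with urllib.request.urlopen(build_chat_request(prompt), timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the container running, `print(ask("Hello!"))` should print the model's reply.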

4 — Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_PATH` | `/models/qwen3-14b-q4_k_m.gguf` | Path to GGUF file inside container |
| `MODEL_ID` | `qwen3-14b` | Model name returned by the API |
| `N_CTX` | `4096` | Context window size (tokens) |
| `N_THREADS` | `8` | CPU threads; set to your physical core count |
| `N_BATCH` | `512` | Prompt processing batch size |
| `VERBOSE` | `false` | Enable llama.cpp verbose logging |

Override in docker-compose.yml or pass with -e flags to docker run.
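For readers curious how such variables are typically consumed, the sketch below reads them with the defaults from the table. This is an illustration only, not the server's actual code: the `Settings` class, `load_settings`, and the set of strings accepted as true for VERBOSE are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    model_id: str
    n_ctx: int
    n_threads: int
    n_batch: int
    verbose: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables from env, falling back to the table defaults."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/qwen3-14b-q4_k_m.gguf"),
        model_id=env.get("MODEL_ID", "qwen3-14b"),
        n_ctx=int(env.get("N_CTX", "4096")),
        n_threads=int(env.get("N_THREADS", "8")),
        n_batch=int(env.get("N_BATCH", "512")),
        # Assumption: "1", "true", and "yes" (any case) enable verbose logging.
        verbose=env.get("VERBOSE", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Passing a plain dict, for example `load_settings({"N_CTX": "2048"})`, makes it easy to check an override without touching the real environment.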

5 — Performance tips (CPU)

- Set N_THREADS to the number of physical cores, not the number of logical (hyper-threaded) cores. On a modern 8-core machine, N_THREADS=8 is a good starting point.
- Expect roughly 3–8 tokens/sec on a modern laptop; a server with more cores will be faster.
- If you have RAM to spare, try Q5_K_M or Q6_K for better output quality.
- Reduce N_CTX to 2048 if you run into memory pressure.
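To put the throughput figure in perspective, the arithmetic below converts a token budget into an approximate wait. It is a back-of-the-envelope sketch that uses only the 3–8 tokens/sec range quoted above and ignores prompt processing time; the function name is made up for this example.

```python
def generation_time_range(
    tokens: int, slow_tps: float = 3.0, fast_tps: float = 8.0
) -> tuple[float, float]:
    """Return (best_case_seconds, worst_case_seconds) for generating `tokens` tokens."""
    return tokens / fast_tps, tokens / slow_tps


best, worst = generation_time_range(256)
print(f"256 tokens: about {best:.0f} to {worst:.0f} seconds")
```

For the `max_tokens` value of 256 used in the curl examples, that works out to roughly half a minute to a minute and a half per full-length reply, so client timeouts should be set accordingly.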

Quick test (curl)

# List models
curl http://localhost:8000/v1/models

# Non-streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'

# Streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "max_tokens": 256,
    "stream": true
  }'
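
The streaming call returns its output incrementally rather than as one JSON body. The sketch below shows how a client might reassemble the text, assuming the server follows the OpenAI streaming convention of server-sent events: lines of the form `data: {json chunk}`, each carrying a `choices[].delta.content` fragment, terminated by `data: [DONE]`. The function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Given the decoded lines of a streaming response, `"".join(iter_stream_text(lines))` rebuilds the full reply, and printing each fragment as it arrives gives the familiar typing effect.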