title: Fallback Module Trial
emoji: 🌖
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Qwen3-14B: OpenAI-compatible API (CPU · Docker)

A lightweight FastAPI server that wraps Qwen3-14B (GGUF quantised) and exposes an OpenAI-compatible REST API so tools like Paperclip can talk to it as a drop-in OpenAI provider.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /v1/models | List available models |
| POST | /v1/chat/completions | Chat completions (streaming + non-streaming) |
| GET | /health | Docker health check |
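
As a rough illustration of the model-listing endpoint, the sketch below fetches /v1/models with the Python standard library. It assumes the server returns the usual OpenAI-style list (`{"data": [{"id": ...}, ...]}`), which this README implies but does not show; `extract_model_ids` and `list_model_ids` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the server is reachable at the default address used in this README.
BASE_URL = "http://localhost:8000"


def extract_model_ids(payload: dict) -> list[str]:
    """Pull the model ids out of an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_model_ids(base_url: str = BASE_URL, timeout: float = 10.0) -> list[str]:
    """GET /v1/models and return the ids the server reports."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))
```

With the container running, `list_model_ids()` should return a list containing the configured MODEL_ID.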

1 - Download the model

You need a GGUF quantised version of Qwen3-14B. The recommended variant for CPU is Q4_K_M (~9 GB RAM at runtime).

mkdir -p models

# Option A: Hugging Face CLI
pip install huggingface_hub
huggingface-cli download \
  bartowski/Qwen3-14B-GGUF \
  Qwen3-14B-Q4_K_M.gguf \
  --local-dir ./models

# Option B: wget (find the direct URL on the HF repo)
# wget -O models/qwen3-14b-q4_k_m.gguf <url>

Rename the downloaded file so that its name matches MODEL_PATH in docker-compose.yml (the default is /models/qwen3-14b-q4_k_m.gguf), or change the env var to match your filename.
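
The rename step can also be scripted. The sketch below is one possible approach, not part of the project: it derives the expected filename from the MODEL_PATH default listed in the environment variable table and renames the downloaded file inside ./models. The helper names `target_for` and `align_model_file` are made up for illustration.

```python
from pathlib import Path, PurePosixPath

# Default container path taken from the environment variable table in this README.
DEFAULT_MODEL_PATH = "/models/qwen3-14b-q4_k_m.gguf"


def target_for(downloaded: Path, model_path: str = DEFAULT_MODEL_PATH) -> Path:
    """Return the path the downloaded file should have so its name matches MODEL_PATH."""
    return downloaded.with_name(PurePosixPath(model_path).name)


def align_model_file(downloaded: Path, model_path: str = DEFAULT_MODEL_PATH) -> Path:
    """Rename the downloaded GGUF in place (same directory) and return the new path."""
    target = target_for(downloaded, model_path)
    if downloaded != target:
        downloaded.rename(target)
    return target
```

For the CLI download above, `align_model_file(Path("models/Qwen3-14B-Q4_K_M.gguf"))` would leave the file at models/qwen3-14b-q4_k_m.gguf.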


2 - Build & run

# Build the image (one-time, ~5–10 min; compiles llama-cpp from source)
docker compose build

# Start the server
docker compose up -d

# Tail logs
docker compose logs -f

The API will be available at http://localhost:8000 once the model has loaded (allow ~1–3 min for the GGUF to map into memory on first start).
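
Because the model takes a while to load, scripts that depend on the API may want to wait for it. The following is a minimal sketch that polls /health until it answers. It assumes the endpoint returns HTTP 200 once the server is usable, which this README does not state explicitly; the timeout and interval values are arbitrary.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8000/health"


def http_ok(url: str) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str = HEALTH_URL,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until the probe succeeds or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.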


3 - Connect Paperclip

In Paperclip's settings, add a new OpenAI-compatible provider:

| Setting | Value |
| --- | --- |
| Base URL | http://localhost:8000/v1 |
| API Key | (any non-empty string, e.g. local) |
| Model | qwen3-14b |

Paperclip will call /v1/models to verify the connection and /v1/chat/completions for inference; both are implemented.
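
To see roughly what an OpenAI-compatible client sends with these settings, here is a standard-library sketch. It assumes the key is passed as a Bearer token, as OpenAI clients normally do, and that the reply follows the OpenAI chat completion shape; whether the server actually inspects the key is not documented here. `build_chat_request` and `chat` are hypothetical helpers.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000/v1",
    api_key: str = "local",  # any non-empty string, per the settings table
    model: str = "qwen3-14b",
    max_tokens: int = 256,
) -> urllib.request.Request:
    """Build a POST request for /chat/completions under the given base URL."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    # Generous timeout: CPU inference can take a while.
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=600) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```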


4 - Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| MODEL_PATH | /models/qwen3-14b-q4_k_m.gguf | Path to GGUF file inside container |
| MODEL_ID | qwen3-14b | Model name returned by the API |
| N_CTX | 4096 | Context window size (tokens) |
| N_THREADS | 8 | CPU threads; set to your physical core count |
| N_BATCH | 512 | Prompt processing batch size |
| VERBOSE | false | Enable llama.cpp verbose logging |

Override in docker-compose.yml or pass with -e flags to docker run.
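
For readers who want to see how such variables are typically consumed, the sketch below reads them with the defaults from the table. It is illustrative only and not the server's actual code; the `Settings` class, `load_settings`, and the accepted spellings of a true VERBOSE value are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    model_id: str
    n_ctx: int
    n_threads: int
    n_batch: int
    verbose: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the defaults in the table."""
    return Settings(
        model_path=env.get("MODEL_PATH", "/models/qwen3-14b-q4_k_m.gguf"),
        model_id=env.get("MODEL_ID", "qwen3-14b"),
        n_ctx=int(env.get("N_CTX", "4096")),
        n_threads=int(env.get("N_THREADS", "8")),
        n_batch=int(env.get("N_BATCH", "512")),
        # Assumption: "1", "true" and "yes" (any case) count as enabled.
        verbose=env.get("VERBOSE", "false").strip().lower() in {"1", "true", "yes"},
    )
```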


5 - Performance tips (CPU)

  • Set N_THREADS to the number of physical cores, not the number of logical (hyper-threaded) CPUs. On a modern 8-core machine N_THREADS=8 is a good start.
  • Expect ~3–8 tokens/sec on a modern laptop; a server with many cores does better.
  • If you have more RAM, try Q5_K_M or Q6_K for better quality.
  • Reduce N_CTX to 2048 if you hit memory pressure.
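
One way to pick a starting value for N_THREADS is sketched below. It uses the third-party psutil package when it is installed, since `os.cpu_count()` reports logical CPUs only. The fallback of halving the logical count assumes two hardware threads per core, which is not true of every machine, so treat the result as a starting point.

```python
import os
from typing import Optional


def suggest_threads(physical: Optional[int], logical: Optional[int]) -> int:
    """Prefer the physical core count; otherwise guess half the logical CPUs."""
    if physical:
        return physical
    if logical:
        # Assumption: two hardware threads per physical core.
        return max(1, logical // 2)
    return 1


def detect_threads() -> int:
    """Suggest a value for N_THREADS on the current machine."""
    try:
        import psutil  # optional third-party dependency

        physical = psutil.cpu_count(logical=False)  # may be None on some platforms
    except ImportError:
        physical = None
    return suggest_threads(physical, os.cpu_count())


if __name__ == "__main__":
    print(f"N_THREADS={detect_threads()}")
```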

Quick test (curl)

# List models
curl http://localhost:8000/v1/models

# Non-streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'

# Streaming chat
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "max_tokens": 256,
    "stream": true
  }'
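
The streaming response can be consumed from Python as well. The sketch below assumes the server emits OpenAI-style server-sent events: `data: {json}` lines whose chunks carry text in `choices[0].delta.content`, terminated by `data: [DONE]`. That format is implied by the OpenAI compatibility claim but is not shown in this README; `iter_stream_text` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragments from an OpenAI-style SSE chat stream."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text
```

The response object returned by `urllib.request.urlopen` iterates over lines as bytes, so it can be passed to `iter_stream_text` directly once a request with `"stream": true` has been sent.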