---
title: RWKV HF Space
emoji: 🐦‍⬛
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
---
# Simple RWKV OpenAI-Compatible API
## Quick Windows Setup (no Docker)
This repository was originally packaged with a Dockerfile. It now also provides a `setup_windows.ps1` script that mirrors the Dockerfile's steps and sets up the service locally on Windows: it installs Python dependencies, builds the frontend, and downloads the 0.1B model.
Prerequisites:
- Python 3.10+ installed and in PATH
- Node.js + npm (only required if building the frontend)
- (Optional) NVIDIA GPU and CUDA (for GPU runtime)
To set up locally on Windows (CPU-only):

```powershell
.\setup_windows.ps1 -gpu:$false -buildFrontend:$true -CONFIG_FILE config.production.yaml
```
If you have a compatible NVIDIA GPU and prefer GPU-enabled dependencies, run with the `-gpu` switch.
After setup, run the API:

```powershell
$env:CONFIG_FILE = 'config.production.yaml'
python app.py
```
The default production config in `config.production.yaml` now contains a single model, `rwkv7-g1a-0.1b-20250728-ctx4096` (0.1B), set as the default chat and reasoning model.
To download the models defined in any config:

```powershell
python download_models.py --config config.production.yaml
```
This will store the downloaded `.pth` files under the `DOWNLOAD_MODEL_DIR` specified in the YAML (default: `./models`).
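For orientation, a config in roughly the shape described above might look like this. Only `DOWNLOAD_MODEL_DIR` and the model id come from this README; the other keys are illustrative assumptions, not the actual contents of `config.production.yaml`:

```yaml
# Hypothetical sketch -- check config.production.yaml for the real schema.
DOWNLOAD_MODEL_DIR: ./models
models:
  - name: rwkv-latest
    file: rwkv7-g1a-0.1b-20250728-ctx4096.pth
    default_chat: true
    default_reasoning: true
```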
Advanced features:

- `reasoning` is performed in-process by the same model (no external reasoning model is used). Request a model like `rwkv-latest:thinking`, or set the reasoning suffix, and reasoning will run in the same model.
- `web_search` functionality is available at the request level: set `web_search: true` and optionally `search_top_k` to inject search results from DuckDuckGo into the prompt. The search is executed by the server and the results are provided to the same model as context.
- `tools` are executed server-side and their results injected into the prompt for the same model. Supported tools: `web_search` and `calc` (calculator).

Example of `tools` usage:
```json
{
  "model": "rwkv-latest",
  "prompt": "Calculate 2+3*4 and tell me the result",
  "tools": [{"name": "calc", "args": {"expression": "2+3*4"}}]
}
```
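As a sketch of what a server-side `calc` tool could do, here is a minimal safe arithmetic evaluator. This is an illustrative assumption, not the repository's actual implementation:

```python
import ast
import operator

# Whitelisted operators for a minimal, safe arithmetic evaluator
# (avoids eval() on untrusted model/tool input).
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calc(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result as a string."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(_eval(ast.parse(expression, mode="eval")))

print(calc("2+3*4"))  # -> 14
```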
API endpoints and model listing:

- `GET /api/v1/models` returns a JSON list of configured models, sampler defaults, and `ALLOW_*` flags. This lets clients build per-model UI toggles (web search, tools, reasoning) based on server-provided capabilities.
Examples:

```shell
curl http://127.0.0.1:7860/api/v1/models
```

This will show configured models and their sampler defaults.

Example: POST with `web_search` and reasoning enabled:
```json
{
  "model": "rwkv-latest:thinking",
  "prompt": "Who is the current president of France?",
  "max_tokens": 32,
  "web_search": true,
  "search_top_k": 3
}
```
The server will perform a web search for the prompt, aggregate the top 3 results, inject them into the prompt, and then run the model with reasoning enabled, all using the same model instead of an external reasoning or search model.
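The injection step might look something like the sketch below. The function name and the exact prompt format are assumptions for illustration; the README does not specify how the server formats search context:

```python
def inject_search_results(prompt: str, results: list[dict], top_k: int = 3) -> str:
    """Prepend the top-k search results to the user prompt as plain-text context."""
    lines = ["Web search results:"]
    for i, r in enumerate(results[:top_k], start=1):
        lines.append(f"{i}. {r['title']}: {r['snippet']}")
    lines.append("")  # blank line separating context from the question
    lines.append(prompt)
    return "\n".join(lines)

augmented = inject_search_results(
    "Who is the current president of France?",
    [{"title": "Example page", "snippet": "Example snippet."}],
    top_k=3,
)
```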
Universal tool and model-initiated tool calls:
- The `universal` tool returns a structured JSON/dict with the following fields: `action` (`calc`/`web_search`), `result` (string), and `metadata` (a dict with `confidence`, the query/expression, etc.).
- Example `universal` result:
```json
{
  "action": "calc",
  "result": "14",
  "metadata": {"expression": "2+3*4", "confidence": 0.98}
}
```
- The model can also request tools mid-generation by emitting a sentinel tag, e.g.:
```
<tool-call>{"name":"calc","args":{"expression":"40+2"}}</tool-call>
```
When the model emits such a sentinel, the server will execute the requested tool, inject the results into the prompt, and continue streaming output. The server will also emit a metadata-only streaming chunk so the client is aware a tool was executed mid-stream.
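Detecting the sentinel during generation can be sketched with a regex over the emitted text. This is a minimal illustration, not the server's actual parser:

```python
import json
import re

# Matches the first <tool-call>...</tool-call> sentinel and captures its JSON body.
TOOL_CALL_RE = re.compile(r"<tool-call>(\{.*?\})</tool-call>", re.DOTALL)

def extract_tool_call(text: str):
    """Return (name, args) for the first sentinel tag in text, or None."""
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return None
    call = json.loads(m.group(1))
    return call["name"], call.get("args", {})

call = extract_tool_call(
    'The answer is <tool-call>{"name":"calc","args":{"expression":"40+2"}}</tool-call>'
)
```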
Streaming behavior:
- The API streams responses token-by-token by default (`stream: true`) and persists a `state_name` for the generation if requested (or will generate one). Provide `state_name` to resume continuation from where the previous stream stopped. The server stores model state in memory under `(model, state_name)`, so subsequent requests with the same `state_name` can continue generation from that exact point.