---
title: RWKV HF Space
emoji: 🐦‍⬛
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
---

# Simple RWKV OpenAI-Compatible API

## Quick Windows Setup (no Docker)

This repository was originally packaged with a Dockerfile. It now also provides a `setup_windows.ps1` script that mirrors the Dockerfile's actions and sets up the service locally on Windows: it installs the Python dependencies, builds the frontend, and downloads the 0.1B model.

Prerequisites:

- Python 3.10+ installed and on PATH
- Node.js + npm (optional; required for building the frontend)
- NVIDIA GPU and CUDA (optional; required for the GPU runtime)

To set up locally on Windows (CPU-only):

```powershell
.\setup_windows.ps1 -gpu:$false -buildFrontend:$true -CONFIG_FILE config.production.yaml
```

If you have a compatible NVIDIA GPU and prefer to install GPU-enabled dependencies, run the script with the `-gpu` switch instead.

After setup, run the API:

```powershell
# Optionally select a config file first:
# $env:CONFIG_FILE = 'config.production.yaml'
python app.py
```

The default production config in `config.production.yaml` now contains a single model, the 0.1B `rwkv7-g1a-0.1b-20250728-ctx4096`, set as the default chat and reasoning model.
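
For reference, a hypothetical sketch of that config's shape (the `DOWNLOAD_MODEL_DIR` key is mentioned below in this README; the other field names here are assumptions, not the actual schema):

```yaml
# Hypothetical sketch -- only DOWNLOAD_MODEL_DIR is confirmed by this README
DOWNLOAD_MODEL_DIR: ./models
models:
  - name: rwkv7-g1a-0.1b-20250728-ctx4096   # default chat and reasoning model
```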

To download the models defined in any config:

```powershell
python download_models.py --config config.production.yaml
```

This stores the downloaded `.pth` files under the `DOWNLOAD_MODEL_DIR` specified in the YAML (default: `./models`).

Advanced features:

- `reasoning` is performed in-process by the same model; no external reasoning model is used. Request a model such as `rwkv-latest:thinking` (or append the reasoning suffix) and the request will run with reasoning enabled on the same model.
- `web_search` is available at the request level: set `web_search: true` and optionally `search_top_k` to inject DuckDuckGo search results into the prompt. The search is executed by the server and the results are provided to the same model as context.
- `tools` are executed server-side and their results are injected into the prompt for the same model. Supported tools: `web_search` and `calc` (calculator). Example of `tools` usage:

```json
{
  "model": "rwkv-latest",
  "prompt": "Calculate 2+3*4 and tell me the result",
  "tools": [{"name": "calc", "args": {"expression": "2+3*4"}}]
}
```
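
Server-side, the `calc` tool amounts to safely evaluating an arithmetic expression. As an illustrative sketch (not necessarily this repository's implementation), Python's `ast` module can restrict evaluation to arithmetic operators:

```python
import ast
import operator

# Operators this sketch allows; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calc(expression: str) -> str:
    """Safely evaluate an arithmetic expression such as '2+3*4'."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return str(_eval(ast.parse(expression, mode="eval")))

# calc("2+3*4") -> "14"
```

Walking the AST this way avoids calling `eval` on arbitrary model-generated strings.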

API endpoints and model listing:

- `GET /api/v1/models` returns a JSON list of the configured models, their sampler defaults, and the `ALLOW_*` flags. This lets clients build per-model UI toggles (web search, tools, reasoning) from server-provided capabilities.

Examples:

- `curl http://127.0.0.1:7860/api/v1/models` shows the configured models and their sampler defaults.

Example: POST with `web_search` and reasoning enabled:

```json
{
  "model": "rwkv-latest:thinking",
  "prompt": "Who is the current president of France?",
  "max_tokens": 32,
  "web_search": true,
  "search_top_k": 3
}
```

The server performs a web search for the prompt, aggregates the top 3 results, injects them into the prompt, and then runs the model with reasoning enabled, all with the same model rather than an external reasoning or search model.
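
A request like the one above can be sent from a short Python client. This sketch uses only the standard library; the `/api/v1/completions` route is an assumption inferred from the `/api/v1/models` endpoint, so adjust it to the server's actual completion route:

```python
import json
import urllib.request

def build_payload(prompt, model="rwkv-latest:thinking", max_tokens=32,
                  web_search=True, search_top_k=3):
    """Assemble the request body shown above."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "web_search": web_search,
        "search_top_k": search_top_k,
    }

def post_completion(base_url, payload):
    """POST the payload to the (assumed) completion route and return parsed JSON."""
    req = urllib.request.Request(
        base_url + "/api/v1/completions",  # assumed route; check the server's API
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

payload = build_payload("Who is the current president of France?")
# post_completion("http://127.0.0.1:7860", payload)  # requires a running server
```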

Universal tool and model-initiated tool calls:

- The `universal` tool returns a structured JSON/dict with the fields `action` (`calc`/`web_search`), `result` (string), and `metadata` (a dict with `confidence`, the query or expression, etc.).
- Example `universal` result:

```json
{
  "action": "calc",
  "result": "14",
  "metadata": {"expression": "2+3*4", "confidence": 0.98}
}
```
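
A sketch of a dispatcher producing this structure (the field names follow the example above; the `confidence` value is a placeholder, and the arithmetic evaluation is deliberately simplistic):

```python
def universal(action: str, **args) -> dict:
    """Wrap a tool result in the {action, result, metadata} structure shown above."""
    if action == "calc":
        expression = args["expression"]
        # Sketch only: a real server should use a proper safe expression evaluator.
        result = str(eval(expression, {"__builtins__": {}}, {}))
        metadata = {"expression": expression, "confidence": 0.98}
        return {"action": action, "result": result, "metadata": metadata}
    raise ValueError(f"action not covered by this sketch: {action}")
```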

- The model can also request tools mid-generation by emitting a sentinel tag, e.g.:

```
<tool-call>{"name":"calc","args":{"expression":"40+2"}}</tool-call>
```

When the model emits such a sentinel, the server executes the requested tool, injects the result into the prompt, and continues streaming output. The server also emits a metadata-only streaming chunk so the client knows a tool was executed mid-stream.
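
Detecting the sentinel in generated text is a small parsing step. A sketch (the tag format follows the example above; the function name is illustrative):

```python
import json
import re

# Matches the <tool-call>...</tool-call> sentinel shown above.
TOOL_CALL_RE = re.compile(r"<tool-call>(\{.*?\})</tool-call>", re.DOTALL)

def extract_tool_calls(text: str):
    """Return (text with sentinels removed, list of parsed tool calls)."""
    calls = [json.loads(body) for body in TOOL_CALL_RE.findall(text)]
    return TOOL_CALL_RE.sub("", text), calls

cleaned, calls = extract_tool_calls(
    'Let me check. <tool-call>{"name":"calc","args":{"expression":"40+2"}}</tool-call>'
)
```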

Streaming behavior:

- The API streams responses token-by-token by default (`stream: true`) and persists a `state_name` for the generation if requested (or generates one). Provide the same `state_name` to resume continuation from where the previous stream stopped: the server stores model state in memory under `(model, state_name)`, so subsequent requests with that `state_name` continue generation from that exact point.
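
A minimal sketch of such an in-memory store, keyed by `(model, state_name)` (the class and method names are illustrative, not the server's actual internals):

```python
from typing import Any

class StateStore:
    """Hold generation state in memory under (model, state_name) keys."""

    def __init__(self) -> None:
        self._states: dict[tuple[str, str], Any] = {}

    def save(self, model: str, state_name: str, state: Any) -> None:
        self._states[(model, state_name)] = state

    def load(self, model: str, state_name: str, default: Any = None) -> Any:
        # A later request with the same state_name resumes from this state.
        return self._states.get((model, state_name), default)

store = StateStore()
store.save("rwkv-latest", "sess-1", {"tokens": [1, 2, 3]})
resumed = store.load("rwkv-latest", "sess-1")
```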