---
title: RWKV HF Space
emoji: 🐦‍⬛
colorFrom: purple
colorTo: pink
sdk: docker
pinned: false
---

# Simple RWKV OpenAI-Compatible API



## Quick Windows Setup (no Docker)

This repository was originally packaged with a Dockerfile. It now also provides a `setup_windows.ps1` script that mirrors the Dockerfile's steps and sets up the service locally on Windows: it installs Python dependencies, builds the frontend, and downloads the 0.1B model.

Prerequisites:

- Python 3.10+ installed and in PATH
- Node.js + npm (optional, required for building the frontend)
- (Optional) NVIDIA GPU and CUDA (for GPU runtime)

To set up locally on Windows (CPU-only):

```powershell
.\setup_windows.ps1 -gpu:$false -buildFrontend:$true -CONFIG_FILE config.production.yaml
```

If you have a compatible NVIDIA GPU and prefer to install GPU-enabled dependencies, run the script with the `-gpu` switch instead.

After setup, run the API:

```powershell
# $env:CONFIG_FILE = 'config.production.yaml'
python app.py
```

The default production config in `config.production.yaml` now contains a single model, the 0.1B `rwkv7-g1a-0.1b-20250728-ctx4096`, set as the default chat and reasoning model.
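For orientation, a minimal sketch of what such a config might look like. Only the model name and `DOWNLOAD_MODEL_DIR` are confirmed by this README; the other key names are assumptions, so check the actual file for the real schema:

```yaml
# Hypothetical sketch of config.production.yaml (key names are assumptions)
DOWNLOAD_MODEL_DIR: ./models
models:
  - name: rwkv7-g1a-0.1b-20250728-ctx4096
    default_chat: true
    default_reasoning: true
```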

To download models defined in any config:

```shell
python download_models.py --config config.production.yaml
```

This will store the downloaded `.pth` files under the `DOWNLOAD_MODEL_DIR` specified in the YAML (defaults to `./models`).

## Advanced features

- Reasoning is performed in-process by the same model; no external reasoning model is used. Request a model such as `rwkv-latest:thinking` (or set the reasoning suffix) and the requested model will run the reasoning itself.
- `web_search` is available at the request level: set `web_search: true` (and optionally `search_top_k`) to have the server fetch DuckDuckGo results and inject them into the prompt as context for the same model.
- Tools are executed server-side and their results injected into the prompt for the same model. Supported tools: `web_search` and `calc` (calculator). Example of tools usage:

```json
{
    "model": "rwkv-latest",
    "prompt": "Calculate 2+3*4 and tell me the result",
    "tools": [{"name": "calc", "args": {"expression": "2+3*4"}}]
}
```
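Server-side execution of a `calc` tool can be sketched as a safe arithmetic evaluator. This is an illustrative sketch of the technique, not the repository's actual implementation:

```python
import ast
import operator

# Map AST operator nodes to their arithmetic functions; anything else is rejected,
# so arbitrary code in the expression cannot execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calc(expression: str) -> str:
    """Evaluate a basic arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")
    return str(_eval(ast.parse(expression, mode="eval").body))

print(calc("2+3*4"))  # -> 14
```

Walking the AST rather than calling `eval()` keeps the tool safe even though the expression originates from an untrusted request.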

## API endpoints and model listing

- `GET /api/v1/models` returns a JSON list of configured models, sampler defaults, and `ALLOW_*` flags. This lets clients build per-model UI toggles (web search, tools, reasoning) based on server-provided capabilities.
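A client might derive its UI toggles from those flags roughly as follows. The `ALLOW_*` key names and the model-entry shape are assumptions for illustration, not the endpoint's documented schema:

```python
def allowed_features(model_entry: dict, server_flags: dict) -> list:
    """Translate hypothetical ALLOW_* server flags into per-model UI toggles."""
    toggles = []
    if server_flags.get("ALLOW_WEB_SEARCH"):
        toggles.append("web_search")
    if server_flags.get("ALLOW_TOOLS"):
        toggles.append("tools")
    # Reasoning is only offered when the server allows it and the model supports it.
    if server_flags.get("ALLOW_REASONING") and model_entry.get("reasoning", False):
        toggles.append("reasoning")
    return toggles

flags = {"ALLOW_WEB_SEARCH": True, "ALLOW_TOOLS": True, "ALLOW_REASONING": True}
model = {"id": "rwkv-latest", "reasoning": True}
print(allowed_features(model, flags))  # -> ['web_search', 'tools', 'reasoning']
```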

Examples:

- `curl http://127.0.0.1:7860/api/v1/models` shows configured models and their sampler defaults.
- POST with `web_search` and reasoning enabled:

```json
{
    "model": "rwkv-latest:thinking",
    "prompt": "Who is the current president of France?",
    "max_tokens": 32,
    "web_search": true,
    "search_top_k": 3
}
```

The server will perform a web search for the prompt, aggregate the top 3 results, inject them into the prompt, and then run the model with reasoning enabled, all using the same model instead of an external reasoning or search model.
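The injection step described above can be sketched as a pure function. The prompt template and the result field names (`title`, `snippet`) are assumptions, not the server's actual formatting:

```python
def inject_search_results(prompt: str, results: list, top_k: int = 3) -> str:
    """Aggregate the top-k search hits into a numbered context block
    placed before the user's question."""
    lines = [f"[{i + 1}] {r['title']}: {r['snippet']}"
             for i, r in enumerate(results[:top_k])]
    return "Web search results:\n" + "\n".join(lines) + "\n\nQuestion: " + prompt

hits = [{"title": "Wikipedia", "snippet": "..."},
        {"title": "Reuters", "snippet": "..."}]
print(inject_search_results("Who is the current president of France?", hits))
```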

## Universal tool and model-initiated tool calls

- The universal tool returns a structured JSON/dict with the following fields: `action` (`calc`/`web_search`), `result` (string), and `metadata` (dict with confidence, query/expression, etc.).
- Example universal result:

```json
{
    "action": "calc",
    "result": "14",
    "metadata": {"expression": "2+3*4", "confidence": 0.98}
}
```
- The model can also request tools mid-generation by emitting a sentinel tag, e.g.:

```
<tool-call>{"name":"calc","args":{"expression":"40+2"}}</tool-call>
```

When the model emits such a sentinel, the server will execute the requested tool, inject the results into the prompt, and continue streaming output. The server will also emit a metadata-only streaming chunk so the client is aware a tool was executed mid-stream.
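Detecting the sentinel in generated text might look like the sketch below; the regex-based parsing is an assumption for illustration, not the server's actual implementation:

```python
import json
import re

# Matches <tool-call>{...json...}</tool-call> sentinels in generated text.
TOOL_CALL_RE = re.compile(r"<tool-call>(\{.*?\})</tool-call>", re.DOTALL)

def extract_tool_calls(text: str) -> list:
    """Parse the JSON payload of every tool-call sentinel found in text."""
    return [json.loads(payload) for payload in TOOL_CALL_RE.findall(text)]

calls = extract_tool_calls(
    'Let me check. <tool-call>{"name":"calc","args":{"expression":"40+2"}}</tool-call>'
)
print(calls)  # -> [{'name': 'calc', 'args': {'expression': '40+2'}}]
```

In a real streaming loop the server would buffer tokens until a closing `</tool-call>` arrives, run the tool, and then resume generation with the result appended to the context.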

## Streaming behavior

- The API streams responses token-by-token by default (`stream: true`) and persists a `state_name` for the generation if requested (or will generate one). The server stores model state in memory under `(model, state_name)`, so subsequent requests with the same `state_name` can continue generation from the exact point where the previous stream stopped.
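The in-memory keying described above can be sketched as follows; `StateStore` is a hypothetical helper, not the repository's actual code:

```python
import uuid

class StateStore:
    """Minimal in-memory store of model states keyed by (model, state_name)."""

    def __init__(self):
        self._states = {}

    def save(self, model: str, state, state_name=None) -> str:
        # Generate a name when the client did not supply one.
        name = state_name or uuid.uuid4().hex
        self._states[(model, name)] = state
        return name

    def load(self, model: str, state_name: str):
        return self._states.get((model, state_name))

store = StateStore()
name = store.save("rwkv-latest", {"tokens": [1, 2, 3]}, "sess-1")
assert store.load("rwkv-latest", "sess-1") == {"tokens": [1, 2, 3]}
```

Keying by the pair rather than by `state_name` alone keeps states from different models from colliding.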