---
title: QUEST
emoji: 🔎
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# QUEST DeepResearch Space

An interactive Hugging Face Space for the QUEST DeepResearch agent. The app can either talk to `osunlp/QUEST-35B` (our own fine-tuned research model, routed through a private HF Inference Endpoint) or fall back to open-weights models through the shared HF Inference API.

Supported tools:

- `search` (DuckDuckGo, multi-query)
- `visit` (HTTP fetch + text extraction, multi-URL)
- a lightweight research-state summary to cut repeated work
- `<answer>` extraction for the final response
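As a concrete illustration, the array-based tool calls described in the architecture notes might look like the payloads below. The field names (`name`, `goal`, `queries`, `urls`) are assumptions for illustration, not the exact schema QUEST-35B was trained on:

```python
import json

# Hypothetical tool-call payloads in the array-based style the system
# prompt uses: multi-query search, multi-URL visit, explicit goal.
# Field names here are illustrative assumptions.
search_call = {
    "name": "search",
    "goal": "find the release year of the paper under discussion",
    "queries": ["QUEST deep research agent", "QUEST-35B model card"],
}
visit_call = {
    "name": "visit",
    "goal": "confirm the claim on the primary source",
    "urls": ["https://example.com/source"],
}

def parse_tool_call(raw: str) -> dict:
    """Parse a JSON tool call emitted by the model."""
    return json.loads(raw)

print(parse_tool_call(json.dumps(search_call))["name"])  # search
```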

## 1) Use our own osunlp/QUEST-35B model (recommended)

Because the model is private during the beta, it is not available on the free Inference API. Instead, you host it yourself on a dedicated HF Inference Endpoint (pay-as-you-go, scale-to-zero) and point this Space at it.

### 1a) Create the endpoint once

1. Open https://ui.endpoints.huggingface.co/ and click "New endpoint".
2. Model repository: `osunlp/QUEST-35B` (use a token with access).
3. Hardware: 1× Nvidia L4 (24GB) is usually the sweet spot for a 35B model. Nvidia T4 small (16GB) works too and is cheaper.
4. Advanced → Container Type: keep Text Generation Inference (TGI) or pick vLLM. Both expose an OpenAI-compatible `/v1/` route.
5. Autoscaling → Scale-to-Zero: enable it so you only pay when the endpoint is serving traffic.
6. Hit Create endpoint. After ~1–2 minutes it turns Running and shows a base URL like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.
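Once the endpoint shows Running, you can smoke-test it from any Python shell. This is a sketch assuming the `huggingface_hub` client and the secrets described in step 1b; `normalize_base_url` is a hypothetical helper, not part of app.py:

```python
import os

def normalize_base_url(url: str) -> str:
    """Coerce an endpoint base URL to the trailing-/v1/ form the app expects."""
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url + "/"

# Only attempt a live call when an endpoint is actually configured.
if os.environ.get("QUEST_BASE_URL") and os.environ.get("HF_TOKEN"):
    from huggingface_hub import InferenceClient

    client = InferenceClient(
        base_url=normalize_base_url(os.environ["QUEST_BASE_URL"]),
        api_key=os.environ["HF_TOKEN"],
    )
    out = client.chat_completion(
        messages=[{"role": "user", "content": "ping"}],
        model="tgi",  # TGI containers accept this placeholder model name
        max_tokens=8,
    )
    print(out.choices[0].message.content)
```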

### 1b) Tell the Space how to reach it

In this Space's Settings → Secrets / Variables:

| Name | Value | Why |
| --- | --- | --- |
| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
| `QUEST_BASE_URL` | the endpoint URL ending with `/v1/` (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |

Click Restart this Space. The Model dropdown now shows osunlp/QUEST-35B at the top; selecting it routes requests through your endpoint.
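The routing the restart picks up can be sketched roughly like this. It is a simplified guess at app.py's logic, not a copy of it; `client_kwargs` is a hypothetical helper:

```python
import os

QUEST_MODEL = "osunlp/QUEST-35B"

def client_kwargs(selected_model: str) -> dict:
    """Decide how to construct the InferenceClient for the selected model.

    Sketch only: the private-endpoint path uses base_url + api_key and the
    QUEST_ENDPOINT_MODEL name; everything else goes to the shared API.
    """
    base_url = os.environ.get("QUEST_BASE_URL")
    if selected_model == QUEST_MODEL and base_url:
        return {
            "base_url": base_url,
            "api_key": os.environ.get("HF_TOKEN"),
            "model": os.environ.get("QUEST_ENDPOINT_MODEL", "tgi"),
        }
    # Shared Inference API path: a model id plus HF_TOKEN is enough.
    return {"model": selected_model, "token": os.environ.get("HF_TOKEN")}

print(client_kwargs("Qwen/Qwen3-8B"))
```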

**Cost reality-check:** on a 1× L4 at $0.80/hr with Scale-to-Zero, a small internal beta (a handful of testers, dozens of queries per day) typically stays under $100/month. You can stop the endpoint manually from the UI any time to freeze costs.
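The arithmetic behind that estimate, assuming scale-to-zero means you pay only for active hours (the usage numbers below are illustrative):

```python
RATE_PER_HOUR = 0.80       # 1x L4 rate from the estimate above
active_hours_per_day = 3   # assumed: a handful of testers, dozens of queries
days_per_month = 30

monthly_cost = RATE_PER_HOUR * active_hours_per_day * days_per_month
print(f"${monthly_cost:.2f}/month")  # $72.00/month, under the $100 budget
```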


## 2) Fallback: free open-weights models

If you just want to try the UI without spinning up an endpoint, pick any of these in the dropdown. They run through the shared HF Inference API.

- `Qwen/Qwen3-8B`
- `google/gemma-3-12b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`

Only `HF_TOKEN` is required for this path.
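How the dropdown default might be resolved from the `DEFAULT_MODEL` variable can be sketched as follows; `default_model` is a hypothetical helper, and app.py may do this differently:

```python
import os

# Fallback models from the list above, served via the shared Inference API.
FALLBACK_MODELS = [
    "Qwen/Qwen3-8B",
    "google/gemma-3-12b-it",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "Qwen/Qwen2.5-7B-Instruct",
    "meta-llama/Llama-3.1-8B-Instruct",
]

def default_model(available: list) -> str:
    """Honor DEFAULT_MODEL when it is one of the available choices."""
    wanted = os.environ.get("DEFAULT_MODEL")
    return wanted if wanted in available else available[0]

print(default_model(FALLBACK_MODELS))
```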


## 3) Share the beta with org members (without paying for Team)

**Option A** (simplest, $0 for access, Space hardware stays on free CPU):

1. Keep the Space under your personal account.
2. Settings → Visibility → Private.
3. Settings → Collaborators → add each tester by HF username.
4. The endpoint lives under your personal namespace too, so the bill goes to your personal payment method (you can expense invoices from https://huggingface.co/settings/billing).

**Option B** (org-level billing): upgrade the organization to a Team plan and recreate both the Space and the endpoint under the org namespace.


## 4) Local development

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=...                      # required
export QUEST_BASE_URL=https://.../v1/    # optional; only if testing against the endpoint
python app.py
```

## 5) Architecture notes

- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)` for the private-endpoint path, and the same client without `base_url` for the shared-API path.
- The system prompt matches the schema QUEST-35B was trained on (array-based `search` / `visit` calls with an explicit goal), so the private model stays in-distribution. The open-weights fallbacks follow the same schema.
- Visited URLs and search queries are cached in-process so repeated tool calls don't re-hit the network.
- `<answer>...</answer>` terminates the ReAct loop.
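The last two notes above can be sketched in a few lines. The regex and the cache below are illustrative, not app.py's exact code:

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(text: str):
    """Return the final answer if present, else None (keep looping)."""
    m = ANSWER_RE.search(text)
    return m.group(1).strip() if m else None

# In-process cache: repeated visits to the same URL skip the network.
_page_cache: dict = {}

def visit_cached(url: str, fetch) -> str:
    """Fetch a page once per process; later calls return the cached text."""
    if url not in _page_cache:
        _page_cache[url] = fetch(url)
    return _page_cache[url]

print(extract_answer("thinking... <answer>42</answer>"))  # 42
```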