---
title: QUEST
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---
# DeepResearch Space
An interactive Hugging Face Space for a Quest DeepResearch agent. The app
can either talk to osunlp/QUEST-35B (our own fine-tuned research model,
routed through a private HF Inference Endpoint) or fall back to open-weights
models through the shared HF Inference API.
Supported tools:

- `search` (DuckDuckGo, multi-query)
- `visit` (HTTP fetch + text extraction, multi-URL)
- lightweight research-state summary to cut repeated work
- `<answer>` extraction for the final response
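The `visit` tool's fetch-and-extract step can be sketched with only the standard library. This is a hypothetical, minimal text extractor, not the Space's actual implementation (which may use a dedicated extraction library):

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)


def extract_text(html: str) -> str:
    """Strip markup and collapse whitespace, keeping only visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join("".join(parser._chunks).split())
```

In practice the fetched page would come from an HTTP client; here the function takes raw HTML so the extraction logic stays testable offline.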
## 1) Use our own osunlp/QUEST-35B model (recommended)
Because the model is private during the beta, it is not on the free Inference API. You host it yourself on a dedicated HF Inference Endpoint (pay-as-you-go, scale-to-zero), and point this Space at it.
### 1a) Create the endpoint once
- Open https://ui.endpoints.huggingface.co/ and click "New endpoint".
- Model repository: `osunlp/QUEST-35B` (use a token with access).
- Hardware: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B model. `Nvidia T4 small (16GB)` works too and is cheaper.
- Advanced → Container Type: keep `Text Generation Inference` (TGI) or pick `vLLM`. Both expose an OpenAI-compatible `/v1/` route.
- Autoscaling → Scale-to-Zero: enable it so you only pay when the endpoint is serving traffic.
- Hit Create endpoint. After ~1–2 minutes it turns **Running** and shows a base URL like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.
### 1b) Tell the Space how to reach it
In this Space's Settings → Secrets / Variables:
| Name | Value | Why |
|---|---|---|
| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
| `QUEST_BASE_URL` | the endpoint URL ending with `/v1/` (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |
Click Restart this Space. The Model dropdown now shows `osunlp/QUEST-35B` at the top; selecting it routes requests through your endpoint.
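Given those variables, the routing decision can be sketched as the following hypothetical helper (an illustration of the described behavior, not the actual `app.py` code):

```python
def resolve_route(env: dict) -> tuple:
    """Return (base_url, model) from the Space's environment variables.

    Hypothetical sketch: when QUEST_BASE_URL is set, chat completions go to
    the private endpoint; otherwise they use the shared Inference API.
    """
    base_url = env.get("QUEST_BASE_URL")
    if base_url:
        # TGI containers accept the placeholder name "tgi";
        # vLLM deployments need the original repo id instead.
        return base_url, env.get("QUEST_ENDPOINT_MODEL", "tgi")
    # Shared-API path: no base_url, use the default open-weights model.
    return None, env.get("DEFAULT_MODEL", "Qwen/Qwen3-8B")
```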
Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small internal beta (a handful of testers, dozens of queries per day) typically stays under $100/month. You can stop the endpoint manually from the UI any time to freeze costs.
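The arithmetic behind that estimate, under an assumed duty cycle (the warm-hours figure is an assumption, not a measurement):

```python
HOURLY_RATE = 0.80       # 1x Nvidia L4, USD/hr (check the endpoints UI for current pricing)
WARM_HOURS_PER_DAY = 3   # assumption: Scale-to-Zero keeps the endpoint down most of the day
DAYS = 30

monthly_cost = HOURLY_RATE * WARM_HOURS_PER_DAY * DAYS
print(f"${monthly_cost:.2f}/month")  # $72.00/month
```

Heavier usage scales linearly: double the warm hours, double the bill.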
## 2) Fallback: free open-weights models
If you just want to try the UI without spinning up an endpoint, pick any of these in the dropdown. They run through the shared HF Inference API.
- `Qwen/Qwen3-8B`
- `google/gemma-3-12b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
Only `HF_TOKEN` is required for this path.
## 3) Share the beta with org members (without paying for Team)
Option A (simplest, $0 for access, Space Hardware stays on free CPU):
- Keep the Space under your personal account.
- Settings → Visibility → Private.
- Settings → Collaborators → add each tester by HF username.
- Endpoint lives under your personal namespace too, so the bill goes to your personal payment method (you can expense invoices from https://huggingface.co/settings/billing).
Option B (org-level billing): upgrade the organization to a Team plan and recreate both the Space and the endpoint under the org namespace.
## 4) Local development
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=...                     # required
export QUEST_BASE_URL=https://.../v1/   # optional; only if testing against the endpoint
python app.py
```
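If you want to fail fast on a missing token before launching, a tiny preflight check could look like this (a hypothetical helper, not part of the repo):

```python
import os


def preflight(env=None) -> list:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("HF_TOKEN",)  # QUEST_BASE_URL stays optional
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = preflight()
    if missing:
        raise SystemExit(f"Missing required env vars: {', '.join(missing)}")
```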
## 5) Architecture notes
- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)` for the private-endpoint path and the same client without `base_url` for the shared API path.
- The system prompt matches the schema QUEST-35B was trained on (array-based `search`/`visit` with an explicit `goal`), so the private model stays in-distribution. The open-weights fallbacks also follow the same schema.
- Visited URLs and search queries are cached in-process so repeated tool calls don't re-hit the network.
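The in-process caching of tool calls can be sketched with `functools.lru_cache`; this is a stand-in for the app's actual cache, with the network fetch stubbed out so the dedup behavior is visible:

```python
from functools import lru_cache

CALLS = {"fetch": 0}  # counts how often the (stubbed) network is actually hit


@lru_cache(maxsize=256)
def visit(url: str) -> str:
    """Fetch a URL's text; repeated calls with the same URL hit the cache."""
    CALLS["fetch"] += 1
    # Real code would perform the HTTP fetch + extraction here.
    return f"<page text for {url}>"
```

The same pattern applies to search queries: cache keyed on the query string, so re-asking an identical question within one session costs nothing.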
- `<answer>...</answer>` terminates the ReAct loop.
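The final-answer extraction might look like the following sketch (the actual parsing in `app.py` may differ):

```python
import re


def extract_answer(completion: str):
    """Return the text inside <answer>...</answer>, or None if absent.

    A None result means the ReAct loop should continue with more tool calls.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else None
```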