---
title: QUEST
emoji: 🔎
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---
# DeepResearch Space
An interactive Hugging Face Space for a **QUEST DeepResearch** agent. The app
can either talk to **`osunlp/QUEST-35B`** (our own fine-tuned research model,
routed through a private HF Inference Endpoint) or fall back to open-weights
models through the shared HF Inference API.
Supported tools:
- `search` (DuckDuckGo, multi-query)
- `visit` (HTTP fetch + text extraction, multi-URL)
- lightweight research-state summary to cut repeated work
- `<answer>` extraction for the final response
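For orientation, here is what a pair of tool calls might look like in the
array-based schema mentioned in the architecture notes below. The field names
are illustrative assumptions, not the exact QUEST training format:

```python
# Hypothetical tool-call payloads: field names are assumptions, not the
# exact QUEST schema. Both tools take an array of inputs plus a `goal`.
search_call = {
    "tool": "search",
    "goal": "identify the paper that introduced the ReAct prompting pattern",
    "queries": ["ReAct reasoning and acting language models", "ReAct Yao et al. paper"],
}
visit_call = {
    "tool": "visit",
    "goal": "confirm the venue on the paper's arXiv page",
    "urls": ["https://arxiv.org/abs/2210.03629"],
}
```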
---
## 1) Use our own `osunlp/QUEST-35B` model (recommended)
Because the model is **private** during the beta, it is not on the free
Inference API. You host it yourself on a dedicated HF Inference Endpoint
(pay-as-you-go, scale-to-zero), and point this Space at it.
### 1a) Create the endpoint once
1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
3. **Hardware**: a 35B model needs roughly 70 GB of VRAM in fp16, so pick a
   multi-GPU option such as `4x Nvidia L4 (96GB)` or a single
   `Nvidia A100 (80GB)`. A single `Nvidia L4 (24GB)` only fits a 4-bit
   quantized variant, and `Nvidia T4 small (16GB)` is too small for this
   model.
4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
   pick `vLLM`. Both expose an OpenAI-compatible `/v1/` route.
5. **Autoscaling → Scale-to-Zero**: enable it so you only pay when the
   endpoint is serving traffic.
6. Hit **Create endpoint**. After a few minutes (a 35B model takes a while to
   download and load) the status turns `Running` and the page shows a base URL
   like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.
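If you prefer scripting the setup, `huggingface_hub` can create the same
endpoint programmatically. A sketch, assuming the hardware values from step 3;
the endpoint name is hypothetical, and the instance type/size strings should
be checked against what your account actually offers:

```python
# Sketch: create the Inference Endpoint from code instead of the web UI.
# The name is hypothetical; instance_type/instance_size are assumptions.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "quest-35b-beta",
    repository="osunlp/QUEST-35B",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_type="nvidia-l4",
    instance_size="x4",
    min_replica=0,          # scale-to-zero while idle
    max_replica=1,
)
endpoint.wait()             # blocks until the endpoint reports "running"
print(endpoint.url)         # append /v1/ to get the OpenAI-compatible base
```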
### 1b) Tell the Space how to reach it
In this Space's **Settings → Secrets / Variables**:
| Name | Value | Why |
|---|---|---|
| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
| `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |
Click **Restart this Space**. The `Model` dropdown now shows
`osunlp/QUEST-35B` at the top; selecting it routes requests through your
endpoint.
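Under the hood, the routing boils down to something like the following sketch
(variable names taken from the table above; the prompt is just a placeholder):

```python
# Minimal sketch of the private-endpoint path using the secrets above.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    base_url=os.environ["QUEST_BASE_URL"],   # must end with /v1/
    api_key=os.environ["HF_TOKEN"],
)
resp = client.chat_completion(
    model=os.environ.get("QUEST_ENDPOINT_MODEL", "tgi"),
    messages=[{"role": "user", "content": "What is the capital of Ohio?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```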
> Cost reality-check: a single L4 bills `$0.80/hr`, so \$100 buys about 125
> endpoint-hours a month, i.e. roughly 4 warm hours per day (a 4x L4
> deployment costs proportionally more). With Scale-to-Zero enabled, a small
> internal beta (a handful of testers, dozens of queries per day) typically
> stays under **\$100/month**. You can stop the endpoint manually from the UI
> any time to freeze costs.
---
## 2) Fallback: free open-weights models
If you just want to try the UI without spinning up an endpoint, pick any of
these in the dropdown. They run through the shared HF Inference API.
- `Qwen/Qwen3-8B`
- `google/gemma-3-12b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
Only `HF_TOKEN` is required for this path.
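The fallback path is the same client without `base_url`, e.g. (a sketch):

```python
# Sketch: shared Inference API path; only HF_TOKEN is required.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(api_key=os.environ["HF_TOKEN"])
resp = client.chat_completion(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain the ReAct loop in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```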
---
## 3) Share the beta with org members (without paying for Team)
Option A (simplest, **\$0** for access, Space Hardware stays on free CPU):
1. Keep the Space under your personal account.
2. **Settings → Visibility → Private**.
3. **Settings → Collaborators** → add each tester by HF username.
4. The endpoint also lives under your personal namespace, so the bill goes to
   your personal payment method (you can download invoices to expense from
   <https://huggingface.co/settings/billing>).
Option B (org-level billing): upgrade the organization to a Team plan and
recreate both the Space and the endpoint under the org namespace.
---
## 4) Local development
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=... # required
export QUEST_BASE_URL=https://.../v1/ # optional; only if testing against the endpoint
python app.py
```
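Before launching the app against the private endpoint, you can smoke-test it
with a raw request to its OpenAI-compatible route (a sketch; assumes
`QUEST_BASE_URL` ends with `/v1/` as configured above):

```python
# Quick smoke test of the endpoint's OpenAI-compatible chat route.
import os
import requests

url = os.environ["QUEST_BASE_URL"].rstrip("/") + "/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
payload = {
    "model": os.environ.get("QUEST_ENDPOINT_MODEL", "tgi"),
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}
r = requests.post(url, headers=headers, json=payload, timeout=120)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])
```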
---
## 5) Architecture notes
- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
for the private-endpoint path and the same client without `base_url` for the
shared API path.
- The system prompt matches the schema QUEST-35B was trained on (array-based
`search` / `visit` with an explicit `goal`), so the private model stays
in-distribution. The open-weights fallbacks also follow the same schema.
- Visited URLs and search queries are cached in-process so repeated tool
calls don't re-hit the network.
- `<answer>...</answer>` terminates the ReAct loop (see the sketch below).
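Putting these notes together, the control flow is roughly the following. This
is a simplified sketch; the function and variable names are illustrative, not
the actual `app.py` internals:

```python
# Simplified ReAct loop sketch; names are illustrative, not app.py's API.
import re

def run_agent(ask_model, run_tool, question, max_turns=10):
    history = [{"role": "user", "content": question}]
    cache = {}  # in-process cache so repeated tool calls skip the network
    for _ in range(max_turns):
        reply = ask_model(history)
        # An <answer>...</answer> block terminates the loop.
        match = re.search(r"<answer>(.*?)</answer>", reply, re.DOTALL)
        if match:
            return match.group(1).strip()
        history.append({"role": "assistant", "content": reply})
        # A real implementation would parse the tool call and key the cache
        # on (tool, arguments); keying on the raw reply keeps the sketch short.
        if reply not in cache:
            cache[reply] = run_tool(reply)
        history.append({"role": "user", "content": cache[reply]})
    return "No <answer> produced within the turn budget."
```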