---
title: QUEST
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# DeepResearch Space

An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app
either talks to **`osunlp/QUEST-35B`** (our own fine-tuned research model,
routed through a private HF Inference Endpoint) or falls back to open-weights
models served through the shared HF Inference API.

Supported tools and behaviors:

- `search` (DuckDuckGo, multi-query)
- `visit` (HTTP fetch + text extraction, multi-URL)
- a lightweight research-state summary to cut repeated work
- `<answer>` extraction for the final response
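
To make the list above concrete, here is a purely hypothetical sketch of what
an array-based tool call could look like. The field names (`query`, `url`,
`goal`) are assumptions for illustration, not the exact schema QUEST-35B was
trained on (see the architecture notes below):

```python
# Hypothetical illustration only -- field names are assumptions,
# not the exact schema QUEST-35B emits.
search_call = {
    "name": "search",
    "arguments": {
        "query": [                         # multi-query: several searches
            "QUEST deep research agent",   # in a single tool round
            "ReAct tool-use agents",
        ],
        "goal": "Collect background on DeepResearch-style agents.",
    },
}

visit_call = {
    "name": "visit",
    "arguments": {
        "url": [                           # multi-URL: fetch several pages
            "https://example.com/a",       # in a single tool round
            "https://example.com/b",
        ],
        "goal": "Extract concrete details from the top search hits.",
    },
}
```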

---

## 1) Use our own `osunlp/QUEST-35B` model (recommended)

Because the model is **private** during the beta, it is not available on the
free Inference API. You host it yourself on a dedicated HF Inference Endpoint
(pay-as-you-go, scale-to-zero) and point this Space at it.

### 1a) Create the endpoint once

1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B
   model. `Nvidia T4 small (16GB)` works too and is cheaper.
4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
   pick `vLLM`. Both expose an OpenAI-compatible `/v1/` route.
5. **Autoscaling → Scale-to-Zero**: enable it so you only pay while the
   endpoint is actively serving traffic.
6. Hit **Create endpoint**. After ~1-2 minutes it turns `Running` and shows a
   base URL like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.
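
If you prefer to script this instead of clicking through the UI,
`huggingface_hub` can create the endpoint programmatically. A minimal sketch,
assuming an L4 on AWS `us-east-1`; the exact `instance_type` / `instance_size`
strings available to your account are listed in the endpoint UI:

```python
from huggingface_hub import create_inference_endpoint

# Sketch only: instance names, region, and quota vary by account.
endpoint = create_inference_endpoint(
    "quest-35b-beta",                  # endpoint name (your choice)
    repository="osunlp/QUEST-35B",     # private repo; token needs read access
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",                  # callable only with a valid HF token
    instance_size="x1",
    instance_type="nvidia-l4",
    min_replica=0,                     # scale-to-zero: no traffic, no charge
    max_replica=1,
)
endpoint.wait()                        # block until the status is "running"
print(endpoint.url)                    # append /v1/ for the OpenAI-style route
```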

### 1b) Tell the Space how to reach it

In this Space's **Settings → Secrets / Variables**, set:

| Name | Value | Why |
|---|---|---|
| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
| `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set it to the repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |
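
You can also set these from a script via `HfApi.add_space_secret` instead of
the web UI. A small sketch; the Space repo id below is an assumption, so
substitute your own `<user>/<space>`:

```python
from huggingface_hub import HfApi

api = HfApi()  # uses your cached login or the HF_TOKEN env var

# Hypothetical Space id -- replace with your actual repo.
space_id = "your-username/quest-deepresearch"

api.add_space_secret(space_id, "HF_TOKEN", "hf_...")  # your token (elided)
api.add_space_secret(
    space_id,
    "QUEST_BASE_URL",
    "https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/",
)
api.add_space_secret(space_id, "QUEST_ENDPOINT_MODEL", "tgi")
api.add_space_secret(space_id, "DEFAULT_MODEL", "osunlp/QUEST-35B")
```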

Click **Restart this Space**. The `Model` dropdown now shows
`osunlp/QUEST-35B` at the top; selecting it routes requests through your
endpoint.
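
Before restarting, you can sanity-check the endpoint from any machine with
`huggingface_hub`'s `InferenceClient`, the same client the app uses (see the
architecture notes). A minimal check, assuming a TGI container (hence
`model="tgi"`):

```python
import os

from huggingface_hub import InferenceClient

# Point the client at the endpoint's OpenAI-compatible /v1/ route.
client = InferenceClient(
    base_url=os.environ["QUEST_BASE_URL"],  # must end with /v1/
    token=os.environ["HF_TOKEN"],
)

resp = client.chat_completion(
    messages=[{"role": "user", "content": "Reply with the word: ready"}],
    model="tgi",        # use "osunlp/QUEST-35B" for a vLLM container
    max_tokens=16,
)
print(resp.choices[0].message.content)
```

With Scale-to-Zero enabled, the first call after an idle period takes extra
time (and may return a transient error) while the endpoint scales back up.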

> Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, staying
> under **\$100/month** means roughly 125 billed hours, i.e. about 4 hours of
> active traffic per day. A small internal beta (a handful of testers, dozens
> of queries per day) typically stays well inside that. You can also stop the
> endpoint manually from the UI at any time to freeze costs.

---

## 2) Fallback: free open-weights models

If you just want to try the UI without spinning up an endpoint, pick any of
these in the dropdown. They run through the shared HF Inference API.

- `Qwen/Qwen3-8B`
- `google/gemma-3-12b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`

Only `HF_TOKEN` is required for this path.
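
This path is the same client call without a `base_url`, so requests go to the
shared Inference API instead of your endpoint. For instance (a sketch; any
model id from the list above works):

```python
import os

from huggingface_hub import InferenceClient

# No base_url: the client talks to the shared HF Inference API.
client = InferenceClient(token=os.environ["HF_TOKEN"])

resp = client.chat_completion(
    messages=[{"role": "user", "content": "Say hello."}],
    model="Qwen/Qwen2.5-7B-Instruct",
    max_tokens=32,
)
print(resp.choices[0].message.content)
```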

---

## 3) Share the beta with org members (without paying for Team)

Option A (simplest, **\$0** for access; the Space's hardware stays on free CPU):

1. Keep the Space under your personal account.
2. **Settings → Visibility → Private**.
3. **Settings → Collaborators** → add each tester by HF username.
4. The endpoint lives under your personal namespace too, so the bill goes to
   your personal payment method (you can expense invoices from
   <https://huggingface.co/settings/billing>).

Option B (org-level billing): upgrade the organization to a Team plan and
recreate both the Space and the endpoint under the org namespace.

---

## 4) Local development

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=...                     # required
export QUEST_BASE_URL=https://.../v1/   # optional; only if testing against the endpoint
python app.py
```

---

## 5) Architecture notes

- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
  for the private-endpoint path and the same client without `base_url` for the
  shared API path.
- The system prompt matches the schema QUEST-35B was trained on (array-based
  `search` / `visit` with an explicit `goal`), so the private model stays
  in-distribution. The open-weights fallbacks also follow the same schema.
- Visited URLs and search queries are cached in-process, so repeated tool
  calls don't re-hit the network.
- `<answer>...</answer>` terminates the ReAct loop.
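
As a rough mental model of how those pieces fit together, here is a
much-simplified sketch of the loop. This is not the actual `app.py` code: the
`run_tool` stub, the placeholder system prompt, and the turn budget are
illustrative assumptions.

```python
import re

from huggingface_hub import InferenceClient

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

# Stand-in for the real system prompt that encodes the trained tool schema.
SYSTEM_PROMPT = (
    "You are a research agent. Emit tool calls, or wrap the final "
    "response in <answer>...</answer> when done."
)

# In-process caches: repeated queries/URLs are served from memory.
search_cache: dict[str, str] = {}
visit_cache: dict[str, str] = {}


def run_tool(assistant_text: str) -> str:
    """Illustrative stub: parse the tool call in `assistant_text`, consult
    the caches, hit DuckDuckGo / fetch URLs on a miss, and return the
    observation text fed back to the model."""
    return "(tool output would go here)"


def react_loop(client: InferenceClient, model: str, question: str,
               max_turns: int = 10) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    for _ in range(max_turns):
        resp = client.chat_completion(messages=messages, model=model,
                                      max_tokens=1024)
        text = resp.choices[0].message.content
        # <answer>...</answer> terminates the loop.
        if match := ANSWER_RE.search(text):
            return match.group(1).strip()
        # No final answer yet: run the tool call and feed the observation
        # back in for the next ReAct turn.
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": run_tool(text)})
    return "No answer produced within the turn budget."
```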