Spaces:

osunlp
/

QUEST

Running

File size: 4,638 Bytes

---
title: QUEST
emoji: 🔎
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# DeepResearch Space

An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app
can either talk to **`osunlp/QUEST-35B`** (our own fine-tuned research model,
routed through a private HF Inference Endpoint) or fall back to open-weights
models through the shared HF Inference API.

Supported tools:
- `search` (DuckDuckGo, multi-query)
- `visit` (HTTP fetch + text extraction, multi-URL)
- lightweight research-state summary to cut repeated work
- `<answer>` extraction for the final response

---

## 1) Use our own `osunlp/QUEST-35B` model (recommended)

Because the model is **private** during the beta, it is not on the free
Inference API. You host it yourself on a dedicated HF Inference Endpoint
(pay-as-you-go, scale-to-zero), and point this Space at it.

### 1a) Create the endpoint once

1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B
   model. `Nvidia T4 small (16GB)` works too and is cheaper.
4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
   pick `vLLM`. Both expose an OpenAI-compatible `/v1/` route.
5. **Autoscaling → Scale-to-Zero**: enable it so you only pay when the
   endpoint is serving traffic.
6. Hit **Create endpoint**. After ~1–2 minutes it turns `Running` and shows a
   base URL like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.

### 1b) Tell the Space how to reach it

In this Space's **Settings → Secrets / Variables**:

| Name | Value | Why |
|---|---|---|
| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
| `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |

Click **Restart this Space**. The `Model` dropdown now shows
`osunlp/QUEST-35B` at the top; selecting it routes requests through your
endpoint.

> Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small
> internal beta (a handful of testers, dozens of queries per day) typically
> stays under **\$100/month**. You can stop the endpoint manually from the UI
> any time to freeze costs.

---

## 2) Fallback: free open-weights models

If you just want to try the UI without spinning up an endpoint, pick any of
these in the dropdown. They run through the shared HF Inference API.

- `Qwen/Qwen3-8B`
- `google/gemma-3-12b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`

Only `HF_TOKEN` is required for this path.

---

## 3) Share the beta with org members (without paying for Team)

Option A (simplest, **\$0** for access, Space Hardware stays on free CPU):

1. Keep the Space under your personal account.
2. **Settings → Visibility → Private**.
3. **Settings → Collaborators** → add each tester by HF username.
4. Endpoint lives under your personal namespace too, so the bill goes to
   your personal payment method (you can expense invoices from
   <https://huggingface.co/settings/billing>).

Option B (org-level billing): upgrade the organization to a Team plan and
recreate both the Space and the endpoint under the org namespace.

---

## 4) Local development

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=...                      # required
export QUEST_BASE_URL=https://.../v1/    # optional; only if testing against the endpoint
python app.py
```

---

## 5) Architecture notes

- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
  for the private-endpoint path and the same client without `base_url` for the
  shared API path.
- The system prompt matches the schema QUEST-35B was trained on (array-based
  `search` / `visit` with an explicit `goal`), so the private model stays
  in-distribution. The open-weights fallbacks also follow the same schema.
- Visited URLs and search queries are cached in-process so repeated tool
  calls don't re-hit the network.
- `<answer>...</answer>` terminates the ReAct loop.