---
title: QUEST
emoji: 🔎
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# DeepResearch Space

An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app
can either talk to **`osunlp/QUEST-35B`** (our own fine-tuned research model,
routed through a private HF Inference Endpoint) or fall back to open-weights
models through the shared HF Inference API.

Supported tools:
- `search` (DuckDuckGo, multi-query)
- `visit` (HTTP fetch + text extraction, multi-URL)
- lightweight research-state summary to cut repeated work
- `<answer>` extraction for the final response
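
As a rough illustration of the last item, `<answer>` extraction can be done with a single regular expression. This is a minimal sketch, not the code in `app.py`; the helper name `extract_answer` is hypothetical, and it assumes the first complete tag pair is the one that counts.

```python
import re
from typing import Optional

# Non-greedy match so only the text between one tag pair is captured;
# DOTALL lets the answer span multiple lines.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(model_output: str) -> Optional[str]:
    """Return the text inside the first <answer>...</answer> pair, or None."""
    match = ANSWER_RE.search(model_output)
    return match.group(1).strip() if match else None


print(extract_answer("Reasoning...\n<answer>\nParis\n</answer>"))  # prints: Paris
print(extract_answer("Still searching"))  # prints: None
```

Returning `None` rather than an empty string lets the caller tell "no final answer yet" apart from an answer that happens to be empty.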

---

## 1) Use our own `osunlp/QUEST-35B` model (recommended)

Because the model is **private** during the beta, it is not on the free
Inference API. You host it yourself on a dedicated HF Inference Endpoint
(pay-as-you-go, scale-to-zero), and point this Space at it.

### 1a) Create the endpoint once

1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
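3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B
   model. `Nvidia T4 small (16GB)` is cheaper. Both choices assume quantized
   weights: at 16-bit precision, 35B parameters take roughly 70 GB, so check
   that the checkpoint you deploy fits in GPU memory before picking a tier.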
4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
   pick `vLLM`. Both expose an OpenAI-compatible `/v1/` route.
5. **Autoscaling → Scale-to-Zero**: enable it so you only pay when the
   endpoint is serving traffic.
6. Hit **Create endpoint**. After ~1–2 minutes it turns `Running` and shows a
   base URL like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.

### 1b) Tell the Space how to reach it

In this Space's **Settings → Secrets / Variables**:

| Name | Value | Why |
|---|---|---|
| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
| `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |

Click **Restart this Space**. The `Model` dropdown now shows
`osunlp/QUEST-35B` at the top; selecting it routes requests through your
endpoint.
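
The way these variables combine can be sketched as a small pure function. This is an assumption about the routing, not the actual logic in `app.py`: `resolve_route` and `QUEST_REPO_ID` are hypothetical names, and `example.invalid` is a placeholder host.

```python
import os
from typing import Mapping, Optional, Tuple

QUEST_REPO_ID = "osunlp/QUEST-35B"


def resolve_route(
    selected_model: str, env: Mapping[str, str] = os.environ
) -> Tuple[Optional[str], str]:
    """Return (base_url, model_name) for a chat-completion request.

    base_url is None when the request should go to the shared Inference API.
    """
    base_url = env.get("QUEST_BASE_URL", "").strip()
    if selected_model == QUEST_REPO_ID and base_url:
        # Private-endpoint path: the container decides which model name it accepts.
        return base_url, env.get("QUEST_ENDPOINT_MODEL", "tgi")
    # Shared-API path: the repo id itself is the model name.
    return None, selected_model


demo_env = {"QUEST_BASE_URL": "https://example.invalid/v1/"}
print(resolve_route("osunlp/QUEST-35B", demo_env))
print(resolve_route("Qwen/Qwen3-8B", demo_env))
```

With `QUEST_BASE_URL` unset, selecting `osunlp/QUEST-35B` falls through to the shared-API branch in this sketch; a real app would more likely hide the option or surface an error.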

> Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small
> internal beta (a handful of testers, dozens of queries per day) typically
> stays under **\$100/month**. You can stop the endpoint manually from the UI
> any time to freeze costs.
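
The arithmetic behind that estimate can be checked with a few lines. This is a back-of-the-envelope sketch that assumes a flat hourly rate, a 30-day month, and billing only while the endpoint is up; the function names are hypothetical.

```python
HOURLY_RATE_USD = 0.80  # rate quoted above; check current pricing before relying on it


def monthly_cost(active_hours_per_day: float, days: int = 30) -> float:
    """Cost if the endpoint is billed only for the hours it is actually up."""
    return HOURLY_RATE_USD * active_hours_per_day * days


def hours_within_budget(budget_usd: float) -> float:
    """Billed hours per month that fit inside a given budget."""
    return budget_usd / HOURLY_RATE_USD


print(f"{monthly_cost(4):.2f}")           # 4 h/day for 30 days
print(f"{monthly_cost(24):.2f}")          # always on, no scale-to-zero
print(f"{hours_within_budget(100):.1f}")  # billed hours covered by a 100 USD budget
```

Under these assumptions, 4 active hours a day costs about \$96/month, an always-on endpoint about \$576/month, and a \$100 budget covers about 125 billed hours.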

---

## 2) Fallback: free open-weights models

If you just want to try the UI without spinning up an endpoint, pick any of
these in the dropdown. They run through the shared HF Inference API.

- `Qwen/Qwen3-8B`
- `google/gemma-3-12b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`

Only `HF_TOKEN` is required for this path.

---

## 3) Share the beta with org members (without paying for Team)

Option A (simplest, **\$0** for access, Space Hardware stays on free CPU):

1. Keep the Space under your personal account.
2. **Settings → Visibility → Private**.
3. **Settings → Collaborators** → add each tester by HF username.
4. The endpoint lives under your personal namespace too, so the bill goes to
   your personal payment method (you can expense invoices from
   <https://huggingface.co/settings/billing>).

Option B (org-level billing): upgrade the organization to a Team plan and
recreate both the Space and the endpoint under the org namespace.

---

## 4) Local development

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=...                      # required
export QUEST_BASE_URL=https://.../v1/    # optional; only if testing against the endpoint
python app.py
```

---

## 5) Architecture notes

- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
  for the private-endpoint path and the same client without `base_url` for the
  shared API path.
- The system prompt matches the schema QUEST-35B was trained on (array-based
  `search` / `visit` with an explicit `goal`), so the private model stays
  in-distribution. The open-weights fallbacks also follow the same schema.
- Visited URLs and search queries are cached in-process so repeated tool
  calls don't re-hit the network.
- `<answer>...</answer>` terminates the ReAct loop.
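
The in-process caching mentioned above can be sketched as a thin wrapper around a tool function. This is an illustration under assumptions, not the implementation in `app.py`: `CachedTool` and `fake_visit` are hypothetical names, and the cache is a plain dictionary keyed by the tool argument.

```python
from typing import Callable, Dict


class CachedTool:
    """Wrap a tool so a repeated argument is served from memory."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn
        self._cache: Dict[str, str] = {}
        self.network_calls = 0  # counts real (uncached) invocations

    def __call__(self, key: str) -> str:
        if key not in self._cache:
            self.network_calls += 1
            self._cache[key] = self._fn(key)
        return self._cache[key]


def fake_visit(url: str) -> str:
    # Stand-in for the real HTTP fetch so the sketch runs offline.
    return f"contents of {url}"


visit = CachedTool(fake_visit)
visit("https://example.invalid/a")
visit("https://example.invalid/a")  # second call is a cache hit
visit("https://example.invalid/b")
print(visit.network_calls)  # prints: 2
```

A cache like this lives only as long as the process, so restarting the Space clears it.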