---
title: QUEST
emoji: 🔎
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.29.0
app_file: app.py
pinned: false
---

# DeepResearch Space

An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app can either talk to **`osunlp/QUEST-35B`** (our own fine-tuned research model, routed through a private HF Inference Endpoint) or fall back to open-weights models through the shared HF Inference API.

Supported tools:

- `search` (DuckDuckGo, multi-query)
- `visit` (HTTP fetch + text extraction, multi-URL)
- lightweight research-state summary to cut repeated work
- `` extraction for the final response

---

## 1) Use our own `osunlp/QUEST-35B` model (recommended)

Because the model is **private** during the beta, it is not available on the free Inference API. Instead, you host it yourself on a dedicated HF Inference Endpoint (pay-as-you-go, scale-to-zero) and point this Space at it.

### 1a) Create the endpoint once

1. Open the HF Inference Endpoints console and click **"New endpoint"**.
2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B model. `Nvidia T4 small (16GB)` works too and is cheaper. Both sizes assume quantized weights: 35B parameters at 16-bit precision need roughly 70 GB for the weights alone, so pick a larger instance if the container fails to load the model.
4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or pick `vLLM`. Both expose an OpenAI-compatible `/v1/` route.
5. **Autoscaling → Scale-to-Zero**: enable it so you only pay when the endpoint is serving traffic.
6. Hit **Create endpoint**. After ~1–2 minutes it turns `Running` and shows a base URL like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.

### 1b) Tell the Space how to reach it

In this Space's **Settings → Secrets / Variables**:

| Name | Value | Why |
|---|---|---|
| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
| `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |

Click **Restart this Space**. The `Model` dropdown now shows `osunlp/QUEST-35B` at the top; selecting it routes requests through your endpoint.

> Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small
> internal beta (a handful of testers, dozens of queries per day) typically
> stays under **\$100/month**. You can stop the endpoint manually from the UI
> any time to freeze costs.

---

## 2) Fallback: free open-weights models

If you just want to try the UI without spinning up an endpoint, pick any of these in the dropdown. They run through the shared HF Inference API.

- `Qwen/Qwen3-8B`
- `google/gemma-3-12b-it`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
- `Qwen/Qwen2.5-7B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`

Only `HF_TOKEN` is required for this path.

---

## 3) Share the beta with org members (without paying for Team)

Option A (simplest, **\$0** for access, Space Hardware stays on free CPU):

1. Keep the Space under your personal account.
2. **Settings → Visibility → Private**.
3. **Settings → Collaborators** → add each tester by HF username.
4. The endpoint lives under your personal namespace too, so the bill goes to your personal payment method (you can expense the invoices).

Option B (org-level billing): upgrade the organization to a Team plan and recreate both the Space and the endpoint under the org namespace.

---

## 4) Local development

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN=...                      # required
export QUEST_BASE_URL=https://.../v1/    # optional; only if testing against the endpoint
python app.py
```

---

## 5) Architecture notes

- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)` for the private-endpoint path and the same client without `base_url` for the shared API path.
- The system prompt matches the schema QUEST-35B was trained on (array-based `search` / `visit` with an explicit `goal`), so the private model stays in-distribution. The open-weights fallbacks follow the same schema.
- Visited URLs and search queries are cached in-process so repeated tool calls don't re-hit the network.
- `...` terminates the ReAct loop.
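
The routing between the private endpoint and the shared API can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual contents of `app.py`: the helper name `client_settings` and the shape of its return value are hypothetical, and the function only assembles settings so it runs without network access. The variable names `HF_TOKEN`, `QUEST_BASE_URL`, and `QUEST_ENDPOINT_MODEL`, and the `tgi` default, come from the table in section 1b.

```python
import os
from typing import Mapping, Optional

PRIVATE_MODEL = "osunlp/QUEST-35B"


def client_settings(selected_model: str,
                    env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: decide which route a chat completion should take.

    Returns a dict with keyword arguments for the client constructor
    ("client") and the model name to send with the request ("model").
    """
    env = os.environ if env is None else env

    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required for both paths")

    if selected_model == PRIVATE_MODEL:
        base_url = env.get("QUEST_BASE_URL", "").strip()
        if not base_url:
            raise RuntimeError("QUEST_BASE_URL must be set to use the private model")
        # Private-endpoint path: the container decides which model name it accepts.
        return {
            "client": {"base_url": base_url, "token": token},
            "model": env.get("QUEST_ENDPOINT_MODEL", "tgi"),
        }

    # Shared-API path: no base_url, the repo id itself is the model name.
    return {"client": {"token": token}, "model": selected_model}
```

With settings like these, the `client` entry would be unpacked into `huggingface_hub.InferenceClient(**settings["client"])` and the `model` entry passed along with each chat-completion request. Keeping the decision in one pure function makes it easy to check both paths without touching the network.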
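
The in-process caching of tool calls could look something like the sketch below. The class name `ToolCache`, its method, and the whitespace-only key normalization are assumptions made for illustration; the real app may key or evict its cache differently. `fake_visit` stands in for the actual HTTP fetch.

```python
from typing import Callable, Dict, List, Tuple


class ToolCache:
    """Hypothetical cache: remember tool results for the lifetime of the process."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def get_or_call(self, tool: str, arg: str, fn: Callable[[str], str]) -> str:
        # Key on (tool name, trimmed argument) so "search" and "visit"
        # results for the same string never collide.
        key = (tool, arg.strip())
        if key not in self._store:
            self._store[key] = fn(arg)  # only reached on a cache miss
        return self._store[key]


# Usage: the second identical call is served from memory.
calls: List[str] = []


def fake_visit(url: str) -> str:
    calls.append(url)  # record each simulated network hit
    return f"text of {url}"


cache = ToolCache()
cache.get_or_call("visit", "https://example.com", fake_visit)
cache.get_or_call("visit", "https://example.com", fake_visit)
print(len(calls))  # 1
```

Because the dictionary lives in process memory, the cache is emptied whenever the Space restarts, which matches the "in-process" wording above.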