IQuestLab/HOTE-8B

Add a complete model card based on arXiv:2606.13710, including intended use, checkpoint layout, training details, benchmark results, limitations, and citation.

Files changed (1)

README.md +183 -1

@@ -7,4 +7,186 @@ base_model:
 datasets:
 - rl-research/dr-tulu-sft-data
 - rl-research/dr-tulu-rl-data
----
+library_name: transformers
+pipeline_tag: text-generation
+tags:
+- deep-research
+- agent
+- reinforcement-learning
+- tool-use
+- open-ended-evolution
+- qwen3
+model-index:
+- name: HOTE-8B
+  results:
+  - task:
+      type: text-generation
+      name: Long-form deep research
+    dataset:
+      name: HealthBench
+      type: HealthBench
+    metrics:
+    - type: score
+      value: 54.4
+      name: HealthBench score
+  - task:
+      type: text-generation
+      name: Long-form deep research
+    dataset:
+      name: DeepResearchBench
+      type: DeepResearchBench
+    metrics:
+    - type: score
+      value: 76.9
+      name: DRB Overall
+    - type: score
+      value: 45.9
+      name: DRB Average
+  - task:
+      type: text-generation
+      name: Long-form deep research
+    dataset:
+      name: ResearchQA
+      type: ResearchQA
+    metrics:
+    - type: score
+      value: 59.1
+      name: ResearchQA score
+---
+# HOTE-8B
+HOTE-8B is an 8B-parameter deep research model trained with **Hybrid Open-Ended Tri-Evolution (HOTE)**, a reinforcement-learning framework for open-ended research agents. The model is introduced in [Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher](https://arxiv.org/abs/2606.13710) (arXiv:2606.13710v2, 2026-06-15).
+HOTE trains a deep research system through the co-evolution of three roles:
+- **Solver**: plans, searches, integrates retrieved evidence, and writes long-form research reports with citations.
+- **Judge**: generates and updates rubrics, evaluates multiple solver responses, and provides rewards beyond deterministic-answer tasks.
+- **Proposer**: searches for weaknesses identified by the judge and proposes challenging but learnable research tasks.
+The framework uses a dual-mode strategy with both tool-use and no-tool training. According to the paper, this improves training efficiency while allowing the tool-use and no-tool modes to benefit each other.
+## Repository Contents
+This repository contains the following checkpoint folders:
+- `step_700/`: HOTE-8B deep research model checkpoint.
+- `step_700_query/`: query/proposer checkpoint used in the HOTE framework.
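One way to fetch a single checkpoint folder rather than the whole repository is to filter the download by path. The sketch below assumes `huggingface_hub` is installed. The `CHECKPOINTS` mapping, its role names, and both helper functions are illustrative names, not part of the repository.

```python
from typing import List, Optional

REPO_ID = "IQuestLab/HOTE-8B"

# Illustrative role names mapped to the folders listed above.
CHECKPOINTS = {
    "solver": "step_700",
    "proposer": "step_700_query",
}


def allow_patterns(role: str) -> List[str]:
    """Return glob patterns that select only one checkpoint folder."""
    return [f"{CHECKPOINTS[role]}/*"]


def download_checkpoint(role: str, local_dir: Optional[str] = None) -> str:
    """Download one checkpoint folder and return the local snapshot path."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=allow_patterns(role),
        local_dir=local_dir,
    )
```

Calling `download_checkpoint("proposer")` should return the local snapshot directory, with the files under `step_700_query/` inside it.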
+## Intended Use
+HOTE-8B is intended for research on long-form deep research agents, search-augmented report generation, open-ended agent evolution, and reinforcement learning for non-verifiable tasks.
+The model is most useful when integrated with a search-enabled agent runtime. In the paper, the solver operates with ReAct-style actions including thinking, tool calls, final answers, and citations. The model weights alone do not provide web search, browsing, paper search, citation validation, or tool execution.
+## Quick Start
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+repo_id = "IQuestLab/HOTE-8B"
+subfolder = "step_700"  # deep research model checkpoint
+
+tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
+model = AutoModelForCausalLM.from_pretrained(
+    repo_id,
+    subfolder=subfolder,
+    torch_dtype="auto",
+    device_map="auto",
+)
+
+messages = [
+    {
+        "role": "user",
+        "content": "Write a concise research report on recent progress in search-augmented language agents.",
+    }
+]
+
+# return_dict=True yields both input_ids and attention_mask.
+inputs = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt",
+).to(model.device)
+
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=4096,
+    do_sample=True,  # temperature and top_p only take effect when sampling
+    temperature=0.7,
+    top_p=0.95,
+)
+
+# Decode only the newly generated tokens, not the prompt.
+prompt_length = inputs["input_ids"].shape[-1]
+print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
+```
+For full deep-research behavior, connect the model to an agent loop that parses tool-call actions, executes search/browse/paper-search tools, appends observations to the context, and validates cited sources.
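The loop described above can be sketched as follows. This is a minimal illustration, not the runtime used in the paper. The `<call_tool>`, `<observation>`, and `<answer>` tags, the `run_agent` function, and the `generate` and `tools` arguments are all assumptions made for the example, and citation validation is left out.

```python
import re
from typing import Callable, Dict

# Hypothetical action syntax, chosen for illustration only. Use the format
# that your agent runtime and the checkpoint actually emit.
TOOL_CALL = re.compile(r'<call_tool name="(\w+)">(.*?)</call_tool>', re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

MAX_TOOL_USES = 10  # per-response cap reported in Training Details


def run_agent(
    generate: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_tool_uses: int = MAX_TOOL_USES,
) -> str:
    """Alternate between model generation and tool execution."""
    context = question
    tool_uses = 0
    while True:
        step = generate(context)
        context += step

        # A final answer ends the episode.
        answer = ANSWER.search(step)
        if answer:
            return answer.group(1).strip()

        # Fallback: no parseable action, or the tool budget is exhausted.
        call = TOOL_CALL.search(step)
        if call is None or tool_uses >= max_tool_uses:
            return step.strip()

        # Execute the requested tool and append its observation.
        name, argument = call.group(1), call.group(2).strip()
        tool = tools.get(name)
        observation = tool(argument) if tool else f"unknown tool: {name}"
        context += f"\n<observation>{observation}</observation>\n"
        tool_uses += 1
```

In practice `generate` would wrap `model.generate` from the Quick Start, and `tools` would map names such as `search` to real search or browsing clients.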
+## Training Details
+The paper reports the following HOTE-8B setup:
+- Proposer initialization: `Qwen3-8B`.
+- Solver initialization: `DR Tulu-8B-SFT`.
+- Judge model during training: `Qwen3-235B-A22B-Instruct-FP8`.
+- Original RL training set: DR Tulu training data, 9K samples, licensed under ODC-BY.
+- Batch size: 48.
+- Solver group size: 8.
+- Proposer group size: 6.
+- Learning rate: `5e-7`.
+- Maximum tool uses per response: 10.
+- Training temperature: 1.
+- Response length: 16,384 tokens.
+- Schedule: 600 no-tool steps followed by 700 hybrid-mode steps.
+The paper states that benchmark data was not added to the training set and that search tools were blocked from accessing benchmark websites during evaluation.
+## Evaluation
+The paper evaluates HOTE-8B on three long-form, open-ended deep research benchmarks:
+| Benchmark | Score |
+| --- | ---: |
+| HealthBench | 54.4 |
+| DeepResearchBench Overall | 76.9 |
+| DeepResearchBench Average | 45.9 |
+| ResearchQA | 59.1 |
+DeepResearchBench aspect scores reported for HOTE-8B:
+| Aspect | Score |
+| --- | ---: |
+| Comprehensiveness | 44.9 |
+| Insight | 45.4 |
+| Instruction Following | 47.8 |
+| Readability | 45.8 |
+Average training time per step reported in the paper:
+| Method | Wall-clock seconds/step | GPU hours/step |
+| --- | ---: | ---: |
+| HOTE no-tool | 382.0 | 1.5 |
+| HOTE hybrid | 753.3 | 2.6 |
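As a rough worked example, the per-step averages above can be combined with the schedule from Training Details to estimate total training cost. The estimate assumes the reported averages hold for every step of each phase, which the paper does not state. The `schedule_cost` function and the constant names are illustrative.

```python
from typing import Tuple

# Schedule from Training Details: 600 no-tool steps, then 700 hybrid steps.
STEPS = {"no_tool": 600, "hybrid": 700}

# Per-step averages from the table above.
SECONDS_PER_STEP = {"no_tool": 382.0, "hybrid": 753.3}
GPU_HOURS_PER_STEP = {"no_tool": 1.5, "hybrid": 2.6}


def schedule_cost() -> Tuple[float, float]:
    """Return (wall-clock hours, GPU hours) for the full schedule."""
    wall_seconds = sum(STEPS[m] * SECONDS_PER_STEP[m] for m in STEPS)
    gpu_hours = sum(STEPS[m] * GPU_HOURS_PER_STEP[m] for m in STEPS)
    return wall_seconds / 3600.0, gpu_hours


wall_hours, gpu_hours = schedule_cost()
print(f"{wall_hours:.1f} wall-clock hours, {gpu_hours:.0f} GPU hours")
```

Under that assumption the schedule comes to about 210 wall-clock hours and 2,720 GPU hours.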
+See the paper for the full comparison against closed deep research systems, open deep research models, fixed-pipeline systems, RL baselines, and evolving-agent baselines.
+## Limitations
+- The model is designed for deep research workflows and should be paired with robust tool execution, citation validation, and source-quality checks.
+- The model may generate inaccurate, incomplete, outdated, or unsupported claims, especially without retrieval tools.
+- The paper notes that evolution slows as training progresses and that the upper bound may still be constrained by model scale.
+- The HOTE method still relies on initial training data; fully data-free open-ended deep research evolution is left for future work.
+- Research outputs in sensitive domains such as healthcare, law, finance, or public policy should be reviewed by qualified experts.
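A basic form of the citation validation mentioned above is to check that every cited source was actually retrieved during the run. The sketch below assumes a hypothetical `<cite id="...">` tag format and hypothetical source identifiers. It only checks that an identifier exists, not that the source supports the claim.

```python
import re
from typing import List, Set

# Hypothetical citation syntax, for illustration only.
CITE = re.compile(r'<cite id="([^"]+)">')


def unsupported_citations(report: str, retrieved_ids: Set[str]) -> List[str]:
    """Return cited source IDs that never appeared among retrieved sources."""
    cited = {match.group(1) for match in CITE.finditer(report)}
    return sorted(cited - retrieved_ids)
```

A non-empty result indicates citations that should be removed or re-checked before the report is used.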
+## Citation
+```bibtex
+@misc{piao2026hybridopenendedtrievolutionmakes,
+  title = {Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher},
+  author = {Hongming Piao and Chi Liu and Mengzhuo Chen and Yan Shu and Xidong Wang and Derek Li and Ying Wei and Bryan Dai},
+  year = {2026},
+  eprint = {2606.13710},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.AI},
+  url = {https://arxiv.org/abs/2606.13710}
+}
+```