HongmingPIAO committed on
Commit
15533dd
·
verified ·
1 Parent(s): eb93874

Add HOTE-8B model card

Add a complete model card based on arXiv:2606.13710, including intended use, checkpoint layout, training details, benchmark results, limitations, and citation.

Files changed (1)
  1. README.md +183 -1
README.md CHANGED
@@ -7,4 +7,186 @@ base_model:
  datasets:
  - rl-research/dr-tulu-sft-data
  - rl-research/dr-tulu-rl-data
- ---
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - deep-research
+ - agent
+ - reinforcement-learning
+ - tool-use
+ - open-ended-evolution
+ - qwen3
+ model-index:
+ - name: HOTE-8B
+   results:
+   - task:
+       type: text-generation
+       name: Long-form deep research
+     dataset:
+       name: HealthBench
+       type: HealthBench
+     metrics:
+     - type: score
+       value: 54.4
+       name: HealthBench score
+   - task:
+       type: text-generation
+       name: Long-form deep research
+     dataset:
+       name: DeepResearchBench
+       type: DeepResearchBench
+     metrics:
+     - type: score
+       value: 76.9
+       name: DRB Overall
+     - type: score
+       value: 45.9
+       name: DRB Average
+   - task:
+       type: text-generation
+       name: Long-form deep research
+     dataset:
+       name: ResearchQA
+       type: ResearchQA
+     metrics:
+     - type: score
+       value: 59.1
+       name: ResearchQA score
+ ---
+
+ # HOTE-8B
+
+ HOTE-8B is an 8B-parameter deep research model trained with **Hybrid Open-Ended Tri-Evolution (HOTE)**, a reinforcement-learning framework for open-ended research agents. The model is introduced in [Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher](https://arxiv.org/abs/2606.13710) (arXiv:2606.13710v2, 2026-06-15).
+
+ HOTE trains a deep research system through the co-evolution of three roles:
+
+ - **Solver**: plans, searches, integrates retrieved evidence, and writes long-form research reports with citations.
+ - **Judge**: generates and updates rubrics, evaluates multiple solver responses, and provides rewards beyond deterministic-answer tasks.
+ - **Proposer**: searches for weaknesses identified by the judge and proposes challenging but learnable research tasks.
+
+ The framework uses a dual-mode strategy with both tool-use and no-tool training. According to the paper, this improves training efficiency while allowing the tool-use and no-tool modes to benefit each other.
+
+ ## Repository Contents
+
+ This repository contains the following checkpoint folders:
+
+ - `step_700/`: HOTE-8B deep research model checkpoint.
+ - `step_700_query/`: query/proposer checkpoint used in the HOTE framework.
+
+ ## Intended Use
+
+ HOTE-8B is intended for research on long-form deep research agents, search-augmented report generation, open-ended agent evolution, and reinforcement learning for non-verifiable tasks.
+
+ The model is most useful when integrated with a search-enabled agent runtime. In the paper, the solver operates with ReAct-style actions including thinking, tool calls, final answers, and citations. The model weights alone do not provide web search, browsing, paper search, citation validation, or tool execution.
+
+ ## Quick Start
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ repo_id = "IQuestLab/HOTE-8B"
+ subfolder = "step_700"  # deep research model checkpoint folder
+
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
+ model = AutoModelForCausalLM.from_pretrained(
+     repo_id,
+     subfolder=subfolder,
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [
+     {
+         "role": "user",
+         "content": "Write a concise research report on recent progress in search-augmented language agents.",
+     }
+ ]
+
+ # return_dict=True returns input_ids and attention_mask together
+ inputs = tokenizer.apply_chat_template(
+     messages,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=4096,
+     do_sample=True,  # sample explicitly so temperature/top_p are applied
+     temperature=0.7,
+     top_p=0.95,
+ )
+
+ # Decode only the newly generated tokens, skipping the prompt
+ prompt_length = inputs["input_ids"].shape[-1]
+ print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
+ ```
+
+ For full deep-research behavior, connect the model to an agent loop that parses tool-call actions, executes search/browse/paper-search tools, appends observations to the context, and validates cited sources.
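
The agent loop described above could look roughly like the following minimal sketch. It is not the runtime used in the paper: the `<tool name="...">`, `<answer>`, and `<observation>` tags, the `run_agent` function, and the `fake_generate` stand-in are all hypothetical names chosen for illustration, and the real action syntax depends on the model's chat template and the surrounding agent framework. Citation validation is left out entirely.

```python
import re
from typing import Callable, Dict, List

# Hypothetical action syntax, chosen only for this sketch; the real format
# is defined by the model's chat template and the agent runtime.
TOOL_CALL = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_agent(
    question: str,
    generate: Callable[[List[Dict[str, str]]], str],
    tools: Dict[str, Callable[[str], str]],
    max_tool_uses: int = 10,  # same cap as reported under Training Details
) -> str:
    """Alternate model turns and tool observations until an answer appears."""
    messages = [{"role": "user", "content": question}]
    tool_uses = 0
    while True:
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})

        answer = ANSWER.search(reply)
        if answer:
            return answer.group(1).strip()

        call = TOOL_CALL.search(reply)
        if call is None or tool_uses >= max_tool_uses:
            # No parsable action, or tool budget spent: return the raw reply.
            return reply.strip()

        name, argument = call.group(1), call.group(2).strip()
        tool = tools.get(name)
        observation = tool(argument) if tool else f"unknown tool: {name}"
        tool_uses += 1
        # Feed the observation back so the next turn can use it.
        messages.append(
            {"role": "user", "content": f"<observation>{observation}</observation>"}
        )


def fake_generate(messages: List[Dict[str, str]]) -> str:
    """Scripted stand-in for the model: search once, then answer."""
    if len(messages) == 1:
        return '<tool name="search">search-augmented agents</tool>'
    return "<answer>Report citing [1].</answer>"


report = run_agent(
    "What is new in search-augmented agents?",
    fake_generate,
    {"search": lambda query: f"[1] result for {query}"},
)
print(report)  # prints "Report citing [1]."
```

In practice, `generate` would wrap the tokenizer and `model.generate` call from the Quick Start, and `tools` would map to real search, browse, or paper-search clients.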
+
+ ## Training Details
+
+ The paper reports the following HOTE-8B setup:
+
+ - Proposer initialization: `Qwen3-8B`.
+ - Solver initialization: `DR Tulu-8B-SFT`.
+ - Judge model during training: `Qwen3-235B-A22B-Instruct-FP8`.
+ - Original RL training set: DR Tulu training data, 9K samples, licensed under ODC-BY.
+ - Batch size: 48.
+ - Solver group size: 8.
+ - Proposer group size: 6.
+ - Learning rate: `5e-7`.
+ - Maximum tool uses per response: 10.
+ - Training temperature: 1.
+ - Response length: 16,384 tokens.
+ - Schedule: 600 no-tool steps followed by 700 hybrid-mode steps.
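
For readers who want these settings in machine-readable form, the list above can be collected into a small config object. This is only an illustrative sketch: the class and field names are made up here and do not come from the paper or its code, and the per-step rollout count assumes that the batch size counts prompts, each sampled once per solver group member.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HoteTrainingConfig:
    """Values copied from the Training Details list; names are illustrative."""

    proposer_init: str = "Qwen3-8B"
    solver_init: str = "DR Tulu-8B-SFT"
    judge_model: str = "Qwen3-235B-A22B-Instruct-FP8"
    batch_size: int = 48
    solver_group_size: int = 8
    proposer_group_size: int = 6
    learning_rate: float = 5e-7
    max_tool_uses: int = 10
    temperature: float = 1.0
    max_response_tokens: int = 16_384
    no_tool_steps: int = 600
    hybrid_steps: int = 700

    @property
    def total_steps(self) -> int:
        # Both phases of the reported schedule, run back to back.
        return self.no_tool_steps + self.hybrid_steps

    @property
    def solver_rollouts_per_step(self) -> int:
        # Assumption: batch_size counts prompts, each sampled
        # solver_group_size times. The paper may define it differently.
        return self.batch_size * self.solver_group_size


config = HoteTrainingConfig()
print(config.total_steps, config.solver_rollouts_per_step)  # prints "1300 384"
```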
+
+ The paper states that benchmark data was not added to the training set and that search tools were blocked from accessing benchmark websites during evaluation.
+
+ ## Evaluation
+
+ The paper evaluates HOTE-8B on three long-form, open-ended deep research benchmarks:
+
+ | Benchmark | Score |
+ | --- | ---: |
+ | HealthBench | 54.4 |
+ | DeepResearchBench Overall | 76.9 |
+ | DeepResearchBench Average | 45.9 |
+ | ResearchQA | 59.1 |
+
+ DeepResearchBench aspect scores reported for HOTE-8B:
+
+ | Aspect | Score |
+ | --- | ---: |
+ | Comprehensiveness | 44.9 |
+ | Insight | 45.4 |
+ | Instruction Following | 47.8 |
+ | Readability | 45.8 |
+
+ Average training time per step reported in the paper:
+
+ | Method | Wall-clock seconds/step | GPU hours/step |
+ | --- | ---: | ---: |
+ | HOTE no-tool | 382.0 | 1.5 |
+ | HOTE hybrid | 753.3 | 2.6 |
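
As a rough back-of-the-envelope check, the per-step averages in the table above can be combined with the schedule listed under Training Details (600 no-tool steps, then 700 hybrid steps). The totals below are an estimate derived here, not figures from the paper, and they assume each per-step average holds uniformly across its whole phase.

```python
# Per-step averages from the table above, paired with the reported schedule.
PHASES = [
    # (name, steps, wall-clock seconds per step, GPU hours per step)
    ("no-tool", 600, 382.0, 1.5),
    ("hybrid", 700, 753.3, 2.6),
]


def estimate_cost(phases):
    """Return (wall-clock hours, GPU hours) summed over all phases."""
    wall_hours = sum(steps * seconds / 3600 for _, steps, seconds, _ in phases)
    gpu_hours = sum(steps * gpu for _, steps, _, gpu in phases)
    return wall_hours, gpu_hours


wall_hours, gpu_hours = estimate_cost(PHASES)
print(f"wall-clock: {wall_hours:.1f} h, GPU: {gpu_hours:.0f} GPU-hours")
# prints "wall-clock: 210.1 h, GPU: 2720 GPU-hours"
```

Under these assumptions, the hybrid phase accounts for roughly two thirds of the estimated GPU hours (1820 of 2720).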
+
+ See the paper for the full comparison against closed deep research systems, open deep research models, fixed-pipeline systems, RL baselines, and evolving-agent baselines.
+
+ ## Limitations
+
+ - The model is designed for deep research workflows and should be paired with robust tool execution, citation validation, and source-quality checks.
+ - The model may generate inaccurate, incomplete, outdated, or unsupported claims, especially without retrieval tools.
+ - The paper notes that evolution slows as training progresses and that the upper bound may still be constrained by model scale.
+ - The HOTE method still relies on initial training data; fully data-free open-ended deep research evolution is left for future work.
+ - Research outputs in sensitive domains such as healthcare, law, finance, or public policy should be reviewed by qualified experts.
+
+ ## Citation
+
+ ```bibtex
+ @misc{piao2026hybridopenendedtrievolutionmakes,
+   title = {Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher},
+   author = {Hongming Piao and Chi Liu and Mengzhuo Chen and Yan Shu and Xidong Wang and Derek Li and Ying Wei and Bryan Dai},
+   year = {2026},
+   eprint = {2606.13710},
+   archivePrefix = {arXiv},
+   primaryClass = {cs.AI},
+   url = {https://arxiv.org/abs/2606.13710}
+ }
+ ```