---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-8B
datasets:
- rl-research/dr-tulu-sft-data
- rl-research/dr-tulu-rl-data
library_name: transformers
pipeline_tag: text-generation
tags:
- deep-research
- agent
- reinforcement-learning
- tool-use
- open-ended-evolution
- qwen3
model-index:
- name: HOTE-8B
  results:
  - task:
      type: text-generation
      name: Long-form deep research
    dataset:
      name: HealthBench
      type: HealthBench
    metrics:
    - type: score
      value: 54.4
      name: HealthBench score
  - task:
      type: text-generation
      name: Long-form deep research
    dataset:
      name: ResearchQA
      type: ResearchQA
    metrics:
    - type: score
      value: 76.9
      name: ResearchQA score
  - task:
      type: text-generation
      name: Long-form deep research
    dataset:
      name: DeepResearchBench
      type: DeepResearchBench
    metrics:
    - type: score
      value: 45.9
      name: DeepResearchBench score
---

# HOTE-8B

HOTE-8B is an 8B-parameter deep research model trained with **Hybrid Open-Ended Tri-Evolution (HOTE)**, a reinforcement-learning framework for open-ended research agents.

The model is introduced in [Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher](https://arxiv.org/abs/2606.13710) (arXiv:2606.13710v2, 2026-06-15).

HOTE trains a deep research system through the co-evolution of three roles:

- **Solver**: plans, searches, integrates retrieved evidence, and writes long-form research reports with citations.
- **Judge**: generates and updates rubrics, evaluates multiple solver responses, and provides rewards beyond deterministic-answer tasks.
- **Proposer**: searches for weaknesses identified by the judge and proposes challenging but learnable research tasks.

The framework uses a dual-mode strategy with both tool-use and no-tool training. According to the paper, this improves training efficiency while allowing the tool-use and no-tool modes to benefit each other.

## Repository Contents

This repository contains the following checkpoint folders:

- `step_700/`: HOTE-8B deep research model checkpoint.
- `step_700_query/`: proposer checkpoint used in the HOTE framework.

## Intended Use

HOTE-8B is intended for research on long-form deep research agents, search-augmented report generation, open-ended agent evolution, and reinforcement learning for non-verifiable tasks.

The model is most useful when integrated with a search-enabled agent runtime. In the paper, the solver operates with ReAct-style actions including thinking, tool calls, final answers, and citations. The model weights alone do not provide web search, browsing, paper search, citation validation, or tool execution.

## Limitations

- The model is designed for deep research workflows and should be paired with robust tool execution, citation validation, and source-quality checks.
- The model may generate inaccurate, incomplete, outdated, or unsupported claims, especially without retrieval tools.
- The paper notes that evolution slows as training progresses and that the upper bound may still be constrained by model scale.
- The HOTE method still relies on initial training data; fully data-free open-ended deep research evolution is left for future work.
- Research outputs in sensitive domains such as healthcare, law, finance, or public policy should be reviewed by qualified experts.

## Citation

```bibtex
@misc{piao2026hybridopenendedtrievolutionmakes,
  title = {Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher},
  author = {Hongming Piao and Chi Liu and Mengzhuo Chen and Yan Shu and Xidong Wang and Derek Li and Ying Wei and Bryan Dai},
  year = {2026},
  eprint = {2606.13710},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url = {https://arxiv.org/abs/2606.13710}
}
```
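
## Example: Loading a Checkpoint

The Repository Contents section lists two checkpoint folders, and the card metadata names `transformers` as the library. The following is a minimal sketch of how one might select a folder and load it, not an official recipe. It assumes each folder holds a complete checkpoint (config, weights, and tokenizer files), which this card does not confirm. `REPO_ID` is a hypothetical placeholder because the Hub repository id is not stated here, and the role keys `deep_research` and `proposer` are illustrative names.

```python
# Hypothetical placeholder: replace with the actual Hub repository id.
REPO_ID = "your-namespace/HOTE-8B"

# Checkpoint folders as listed under "Repository Contents".
# The dictionary keys are illustrative names, not identifiers from the paper.
CHECKPOINT_SUBFOLDERS = {
    "deep_research": "step_700",
    "proposer": "step_700_query",
}


def checkpoint_subfolder(role: str) -> str:
    """Return the checkpoint folder for a role, or raise ValueError."""
    if role not in CHECKPOINT_SUBFOLDERS:
        expected = ", ".join(sorted(CHECKPOINT_SUBFOLDERS))
        raise ValueError(f"unknown role {role!r}; expected one of: {expected}")
    return CHECKPOINT_SUBFOLDERS[role]


def load_checkpoint(role: str = "deep_research", repo_id: str = REPO_ID):
    """Load the tokenizer and model from the chosen folder (downloads weights).

    Assumes the folder contains tokenizer files alongside the model weights.
    """
    # Imported lazily so the folder lookup works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = checkpoint_subfolder(role)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder)
    return tokenizer, model
```

Calling `load_checkpoint("proposer")` would target `step_700_query/` instead. As noted under Intended Use, loading the weights this way gives text generation only; search, browsing, and citation validation still have to come from a separate agent runtime.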