---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen3-8B
datasets:
- rl-research/dr-tulu-sft-data
- rl-research/dr-tulu-rl-data
library_name: transformers
pipeline_tag: text-generation
tags:
- deep-research
- agent
- reinforcement-learning
- tool-use
- open-ended-evolution
- qwen3
model-index:
- name: HOTE-8B
  results:
  - task:
      type: text-generation
      name: Long-form deep research
    dataset:
      name: HealthBench
      type: HealthBench
    metrics:
    - type: score
      value: 54.4
      name: HealthBench score
  - task:
      type: text-generation
      name: Long-form deep research
    dataset:
      name: ResearchQA
      type: ResearchQA
    metrics:
    - type: score
      value: 76.9
      name: ResearchQA score
  - task:
      type: text-generation
      name: Long-form deep research
    dataset:
      name: DeepResearchBench
      type: DeepResearchBench
    metrics:
    - type: score
      value: 45.9
      name: DeepResearchBench score
---

# HOTE-8B

HOTE-8B is an 8B-parameter deep research model trained with Hybrid Open-Ended Tri-Evolution (HOTE), a reinforcement-learning framework for open-ended research agents. The model is introduced in [Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher](https://arxiv.org/abs/2606.13710) (arXiv:2606.13710v2, 2026-06-15).

HOTE trains a deep research system through the co-evolution of three roles:

- Solver: plans, searches, integrates retrieved evidence, and writes long-form research reports with citations.
- Judge: generates and updates rubrics, evaluates multiple solver responses, and provides reward signals for tasks that lack a single deterministic answer.
- Proposer: searches for weaknesses identified by the judge and proposes challenging but learnable research tasks.
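To make the division of labor concrete, here is a toy sketch of one proposer, solver, and judge iteration. It is not the paper's algorithm: `RubricItem`, `rubric_reward`, `is_learnable`, `co_evolution_step`, the weighted-rubric reward, and the 0.2 to 0.8 "learnable" band are all hypothetical, and for simplicity the judge here scores each response on its own rather than comparing several responses jointly.

```python
# Toy sketch of one proposer -> solver -> judge iteration (hypothetical, not
# the paper's algorithm). Real roles are LLM calls; here they are callables.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RubricItem:
    criterion: str   # e.g. "cites primary sources" (judge-generated in HOTE)
    weight: float    # assumed: relative importance of the criterion


def rubric_reward(satisfied: Dict[str, bool], rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric criteria a response meets (assumed scheme)."""
    total = sum(item.weight for item in rubric)
    if total <= 0:
        return 0.0
    earned = sum(item.weight for item in rubric if satisfied.get(item.criterion, False))
    return earned / total


def is_learnable(rewards: List[float], low: float = 0.2, high: float = 0.8) -> bool:
    """Keep tasks that are neither already solved nor hopeless (hypothetical band)."""
    if not rewards:
        return False
    mean = sum(rewards) / len(rewards)
    return low <= mean <= high


def co_evolution_step(
    propose: Callable[[List[str]], List[str]],       # weaknesses -> new tasks
    solve: Callable[[str], str],                      # task -> report
    judge: Callable[[str, str], Dict[str, bool]],     # (task, report) -> verdict per criterion
    make_rubric: Callable[[str], List[RubricItem]],   # task -> rubric
    weaknesses: List[str],
    samples_per_task: int = 4,
) -> List[dict]:
    """Run one iteration and return the tasks kept for the solver's RL update."""
    kept = []
    for task in propose(weaknesses):
        rubric = make_rubric(task)
        responses = [solve(task) for _ in range(samples_per_task)]
        rewards = [rubric_reward(judge(task, r), rubric) for r in responses]
        if is_learnable(rewards):
            kept.append({"task": task, "responses": responses, "rewards": rewards})
    return kept
```

In a real system each callable would wrap a model call, and the kept tasks with their rewards would feed an RL update of the solver; tasks that every sample solves or fails give little learning signal, which is why the sketch filters them out.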

The framework uses a dual-mode strategy with both tool-use and no-tool training. According to the paper, this improves training efficiency while allowing the tool-use and no-tool modes to benefit each other.
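One possible reading of the dual-mode strategy, sketched below, is that each training batch mixes rollouts with search tools enabled and rollouts without them, drawn from the same tasks and updating the same policy. The `plan_rollouts` helper and its `no_tool_fraction` knob are hypothetical; the paper's actual mixing schedule may differ.

```python
# Hypothetical sketch: assign each rollout of each task to tool-use or no-tool mode.
import random
from typing import List, Tuple

TOOL = "tool"
NO_TOOL = "no_tool"


def plan_rollouts(tasks: List[str], rollouts_per_task: int,
                  no_tool_fraction: float, seed: int = 0) -> List[Tuple[str, str]]:
    """Return (task, mode) pairs; `no_tool_fraction` is an assumed knob, not from the paper."""
    if not 0.0 <= no_tool_fraction <= 1.0:
        raise ValueError("no_tool_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the plan is reproducible
    n_no_tool = round(rollouts_per_task * no_tool_fraction)
    plan = []
    for task in tasks:
        modes = [NO_TOOL] * n_no_tool + [TOOL] * (rollouts_per_task - n_no_tool)
        rng.shuffle(modes)  # interleave modes within each task's group
        plan.extend((task, mode) for mode in modes)
    return plan
```

Because no-tool rollouts make no search calls, raising `no_tool_fraction` lowers generation cost per batch, which is one plausible source of the efficiency gain described above.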

## Repository Contents

This repository contains the following checkpoint folders:

- `step_700/`: HOTE-8B deep research model checkpoint.
- `step_700_query/`: proposer checkpoint used in the HOTE framework.
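A minimal loading sketch with Hugging Face Transformers, assuming each folder holds a complete checkpoint (config, weights, and tokenizer files) that can be loaded through the `subfolder` argument of `from_pretrained`. `REPO_ID` is a placeholder for this repository's Hub id, and `checkpoint_subfolder` and `load` are hypothetical helpers. If a folder lacks tokenizer files, the base `Qwen/Qwen3-8B` tokenizer is a reasonable fallback to try.

```python
# Sketch: load the HOTE-8B solver or proposer checkpoint with transformers.
from typing import Tuple

REPO_ID = "<namespace>/HOTE-8B"  # hypothetical placeholder: use this repo's actual Hub id
CHECKPOINTS = {"solver": "step_700", "proposer": "step_700_query"}  # from Repository Contents


def checkpoint_subfolder(role: str) -> str:
    """Map a HOTE role to its checkpoint folder in this repository."""
    try:
        return CHECKPOINTS[role]
    except KeyError:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(CHECKPOINTS)}") from None


def load(role: str, repo_id: str = REPO_ID) -> Tuple[object, object]:
    """Load tokenizer and model for `role` (needs `transformers` and network access)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = checkpoint_subfolder(role)
    tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder=subfolder, torch_dtype="auto")
    return tokenizer, model
```

For example, `tokenizer, model = load("solver")` loads the research model; prompts would then be formatted with `tokenizer.apply_chat_template`, as for the Qwen3 base model.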

## Intended Use

HOTE-8B is intended for research on long-form deep research agents, search-augmented report generation, open-ended agent evolution, and reinforcement learning for non-verifiable tasks.

The model is most useful when integrated with a search-enabled agent runtime. In the paper, the solver operates with ReAct-style actions including thinking, tool calls, final answers, and citations. The model weights alone do not provide web search, browsing, paper search, citation validation, or tool execution.
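Because the weights do not execute tools, the surrounding runtime has to parse actions, run searches, feed results back, and check citations. The loop below sketches that pattern around a plain `generate(prompt) -> str` callable and omits chat templating. The tag format (`<call_tool name=...>`, `<tool_output>`, `<snippet id=...>`, `<answer>`, `<cite id=...>`) and the helpers `run_agent` and `invalid_citations` are hypothetical conventions for illustration, not necessarily what HOTE-8B was trained on; check the paper's prompts before relying on them.

```python
# Minimal ReAct-style runtime loop (hypothetical tag format, for illustration only).
import re
from typing import Callable, Dict, List

CALL_RE = re.compile(r'<call_tool name="(\w+)">(.*?)</call_tool>', re.S)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.S)
CITE_RE = re.compile(r'<cite id="([^"]+)">')


def run_agent(
    generate: Callable[[str], str],
    tools: Dict[str, Callable[[str], Dict[str, str]]],  # tool name -> (query -> {snippet id: text})
    question: str,
    max_steps: int = 8,
) -> dict:
    """Alternate model steps and tool calls until the model emits an answer."""
    transcript = question
    sources: Dict[str, str] = {}  # every snippet id returned by a tool so far
    for _ in range(max_steps):
        step = generate(transcript)
        transcript += step
        answer = ANSWER_RE.search(step)
        if answer:
            text = answer.group(1).strip()
            return {"answer": text, "sources": sources,
                    "invalid_citations": invalid_citations(text, sources)}
        call = CALL_RE.search(step)
        if call is None:
            break  # neither a tool call nor an answer: stop instead of looping
        name, query = call.group(1), call.group(2).strip()
        tool = tools.get(name)
        results = tool(query) if tool is not None else {}  # unknown tools return nothing
        sources.update(results)
        snippets = "".join(f'<snippet id="{sid}">{txt}</snippet>' for sid, txt in results.items())
        transcript += f"<tool_output>{snippets}</tool_output>"
    return {"answer": None, "sources": sources, "invalid_citations": []}


def invalid_citations(answer: str, sources: Dict[str, str]) -> List[str]:
    """Citation ids in the answer that no tool call ever returned."""
    return sorted(set(CITE_RE.findall(answer)) - set(sources))
```

Here `invalid_citations` performs only the simplest form of citation validation (every cited id must refer to a retrieved snippet); it does not check that the snippet actually supports the claim.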


## Limitations

- The model is designed for deep research workflows and should be paired with robust tool execution, citation validation, and source-quality checks.
- The model may generate inaccurate, incomplete, outdated, or unsupported claims, especially when run without retrieval tools.
- The paper notes that evolution slows as training progresses and that the achievable performance ceiling may still be limited by model scale.
- The HOTE method still relies on initial training data; fully data-free open-ended deep research evolution is left for future work.
- Research outputs in sensitive domains such as healthcare, law, finance, or public policy should be reviewed by qualified experts.

## Citation

```bibtex
@misc{piao2026hybridopenendedtrievolutionmakes,
  title         = {Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher},
  author        = {Hongming Piao and Chi Liu and Mengzhuo Chen and Yan Shu and Xidong Wang and Derek Li and Ying Wei and Bryan Dai},
  year          = {2026},
  eprint        = {2606.13710},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2606.13710}
}
```