	---
	license: apache-2.0
	language:
	- en
	tags:
	- science
	- hypothesis-generation
	- biomedical
	- deepseek
	- qwen2
	base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
	pipeline_tag: text-generation
	---

	# MOOSE-Star-HC-R1D-7B

	MOOSE-Star Hypothesis Composition (HC) model: a 7B model fine-tuned to generate scientific hypotheses from a research question, a background survey, and inspirations drawn from other papers.

	> Note: This model is referred to as MS-HC-7B (w/ 1x bounded) in the paper. The full name includes "R1D" to indicate it is fine-tuned from DeepSeek-R1-Distill-Qwen-7B; the SFT data can be used to train other base models as well.

	## Model Description

	- Base Model: [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
	- Training Method: Full-parameter SFT (ZeRO-3)
	- Training Data: [TOMATO-Star-SFT-Data-R1D-32B](https://huggingface.co/datasets/ZonglinY/TOMATO-Star-SFT-Data-R1D-32B) HC split (114,548 samples = 96,879 normal + 17,669 bounded, mixed 1x)
	- Teacher Model: Training data generated via rejection sampling with [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
	- Paper: [MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier](https://arxiv.org/abs/2603.03756)

	## Training Configuration

	| Parameter | Value |
	|-----------|-------|
	| Base Model | DeepSeek-R1-Distill-Qwen-7B |
	| Chat Template | deepseekr1 |
	| Cutoff Length | 8192 |
	| Learning Rate | 1e-5 |
	| Epochs | 1 |
	| Batch Size | 64 (via gradient accumulation) |
	| Training | Full-parameter, ZeRO-3, bf16 |
	| GPUs | 64x (multi-node) |
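
As a rough sanity check on the configuration above, the sketch below estimates the number of optimizer steps in the single training epoch and the share of bounded samples. It assumes one sample per sequence (no packing) and that the last partial batch is kept; neither detail is stated in this card, so treat the step count as an approximation.

```python
import math

# Figures taken from this model card.
num_normal = 96_879
num_bounded = 17_669
num_samples = num_normal + num_bounded  # 114,548 HC samples
global_batch_size = 64                  # effective batch size
epochs = 1

# Assumption: no sequence packing, and the final partial batch is not dropped.
steps_per_epoch = math.ceil(num_samples / global_batch_size)
total_steps = steps_per_epoch * epochs

# Share of bounded samples in the 1x mix.
bounded_fraction = num_bounded / num_samples

print(f"samples={num_samples}, steps={total_steps}, bounded={bounded_fraction:.1%}")
```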

	## Task Description

	Given a research question, a background survey, a previous hypothesis (if any), and a new inspiration paper, the model outputs a delta hypothesis, that is, only the contribution of this inspiration, structured as:

	- Inspiration: Key concept derived from the inspiration paper
	- Motivation (WHY): Why this addresses a gap in existing methods or the current hypothesis
	- Mechanism (HOW IT WORKS): Core scientific mechanism or theoretical framework
	- Methodology (HOW IT'S INTEGRATED): Proposed implementation approach

	The model is designed for incremental hypothesis composition: hypotheses are built one inspiration at a time, making it suitable for hierarchical search where inspirations are selected and composed step by step.
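
The incremental loop described above can be sketched as follows. This is a minimal illustration, not the MOOSE-Star search code: `compose_incrementally` and `fake_generate` are hypothetical names, and joining the deltas with blank lines to form the "previous hypothesis" is an assumption made here for the example.

```python
from typing import Callable, Dict, List, Optional


def compose_incrementally(
    inspirations: List[Dict[str, str]],
    generate_delta: Callable[[Optional[str], Dict[str, str]], str],
) -> str:
    """Build a hypothesis one inspiration at a time.

    generate_delta(previous_hypothesis, inspiration) stands in for a model
    call and returns the delta hypothesis text for that inspiration.
    """
    deltas: List[str] = []
    for inspiration in inspirations:
        # Assumption: the previous hypothesis is the concatenation of all
        # deltas so far; None signals that we are starting from scratch.
        previous = "\n\n".join(deltas) if deltas else None
        deltas.append(generate_delta(previous, inspiration).strip())
    return "\n\n".join(deltas)


def fake_generate(previous: Optional[str], inspiration: Dict[str, str]) -> str:
    """Stub in place of the model, so the sketch runs without a GPU."""
    step = 1 if previous is None else previous.count("Inspiration:") + 1
    return f"Inspiration: {inspiration['title']} (step {step})"


result = compose_incrementally([{"title": "A"}, {"title": "B"}], fake_generate)
print(result)
```

In practice `generate_delta` would wrap the prompt construction and `model.generate` call from the Usage section.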

	## Prompt Format

	The HC task uses a single structured prompt. Everything goes in the user message; there is no system prompt:

	```
	[Task instruction preamble — delta hypothesis composition guidelines]

	## Information Provided

	Research Question (the specific problem to solve):
	{research_question}

	Background Survey (existing methods and their limitations):
	{background_survey}

	Previous Hypothesis (if any - the current state of your solution to build upon):
	{previous_hypothesis_or_none}

	New Inspiration Paper Title (external work to incorporate):
	{inspiration_title}

	New Inspiration Paper Abstract:
	{inspiration_abstract}

	## Your Response

	<think>
	[reasoning process]
	</think>

	Inspiration: [Key concept from the inspiration paper]
	- Motivation (WHY): [Why this addresses a gap]
	- Mechanism (HOW IT WORKS): [How the concept works in this context]
	- Methodology (HOW IT'S INTEGRATED): [How to integrate it]
	```

	See [TOMATO-Star-SFT-Data-R1D-32B](https://huggingface.co/datasets/ZonglinY/TOMATO-Star-SFT-Data-R1D-32B) `HC/bounded_composition.jsonl` for complete prompt examples.

	## Usage

	```python
	# Clone the MOOSE-Star repo to access official prompt templates:
	# git clone https://github.com/ZonglinY/MOOSE-Star.git
	# Then run from the repo root:

	import sys
	sys.path.insert(0, "./utils")
	from prompt_store import instruction_prompts
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "ZonglinY/MOOSE-Star-HC-R1D-7B"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

	# Bounded/delta composition (for use with hierarchical search):
	p = instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")

	# Fill in your inputs:
	research_question = "Your research question here"
	background_survey = "Your background survey here"
	inspiration_title = "Inspiration paper title"
	inspiration_abstract = "Inspiration paper abstract"

	prompt = (
	    p[0] + research_question
	    + p[1] + background_survey
	    + p[2] + "No previous hypothesis."  # or the previous hypothesis string
	    + p[3] + inspiration_title
	    + p[4] + inspiration_abstract
	    + p[5]
	)

	messages = [{"role": "user", "content": prompt}]
	# Note: use add_generation_prompt=False and manually append <｜Assistant｜>
	# to avoid a double <think> tag in the DeepSeek R1-Distill chat template
	formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
	formatted += "<｜Assistant｜>"

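	# The chat template already prepends the BOS token, so do not add it a second time here
	inputs = tokenizer(formatted, return_tensors="pt", add_special_tokens=False).to(model.device)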
	outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
	response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
	print(response)
	# Output: <think>...</think> followed by structured delta hypothesis
	# (Inspiration / Motivation / Mechanism / Methodology)
	```
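
The prompt assembly in the snippet above can be wrapped in a small helper that also handles the "no previous hypothesis" case. This is a sketch under the assumption, taken from the usage code, that the template is a sequence of six text segments interleaved with five fields; `assemble_hc_prompt` is a hypothetical name, and the demo segments are placeholders rather than the official template.

```python
from typing import Optional, Sequence


def assemble_hc_prompt(
    segments: Sequence[str],
    research_question: str,
    background_survey: str,
    inspiration_title: str,
    inspiration_abstract: str,
    previous_hypothesis: Optional[str] = None,
) -> str:
    """Interleave six template segments with the five input fields."""
    if len(segments) != 6:
        raise ValueError(f"expected 6 template segments, got {len(segments)}")
    # Same placeholder string as in the usage snippet when nothing exists yet.
    if previous_hypothesis and previous_hypothesis.strip():
        previous = previous_hypothesis.strip()
    else:
        previous = "No previous hypothesis."
    fields = [
        research_question,
        background_survey,
        previous,
        inspiration_title,
        inspiration_abstract,
    ]
    parts = []
    for segment, field in zip(segments, fields):
        parts.extend([segment, field])
    parts.append(segments[5])  # closing segment after the last field
    return "".join(parts)


# Placeholder segments for demonstration only.
demo_segments = ["Q: ", "\nB: ", "\nP: ", "\nT: ", "\nA: ", "\n[end]"]
demo = assemble_hc_prompt(demo_segments, "rq", "bg", "title", "abs")
print(demo)
```

With the official template, `segments` would be the value returned by `instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")`.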

	<details>
	<summary>Full prompt template (contents of <code>instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")</code>)</summary>

	```
	You are a scientific hypothesis composer. Given a research question, background survey, potentially a previous hypothesis to build upon, and a new inspiration paper, you will reason through how to adapt concepts from the inspiration to advance the solution, then formulate a DELTA HYPOTHESIS - the specific contribution from THIS inspiration paper (not the full cumulative hypothesis).

	## Your Task

	Analyze the provided research context and inspiration paper to:
	1. Identify the key conceptual innovation from the inspiration paper (Note: the paper may directly provide a concept that can be adapted, OR it may contain related ideas/transferable mechanisms that inspire what we need - look beyond exact concept names)
	2. Determine how this innovation addresses gaps (either in existing methods or in your previous hypothesis)
	3. Reason through adaptation and integration into your solution
	4. Formulate a delta hypothesis describing ONLY what THIS inspiration contributes

	## Key Principles

	Reasoning Process:
	- Start by understanding the problem and what current methods lack
	- If there's a previous hypothesis, understand what it already addresses and what gaps/limitations remain
	- Analyze the inspiration paper to identify relevant concepts that could be adapted
	- Reason through: What specific knowledge/technique from this paper could serve as an inspiration?
	- Connect the dots: How does this potential inspiration address the identified gaps?
	- Work through the mechanism: How would this inspiration actually function in our context?
	- Develop the methodology: Detail the specific implementation and integration
	- Don't just identify concepts - reason through their practical application and adaptation

	Delta Hypothesis Requirements:
	- Output ONLY what THIS inspiration adds (delta), NOT the full cumulative hypothesis
	- Don't repeat what's already in the previous hypothesis
	- Must clearly explain WHY the inspiration addresses the problem (Motivation)
	- Must detail HOW the inspiration works in this context (Mechanism)
	- Must specify HOW to implement it methodologically (Methodology)
	- Follow the exact structured format shown below

	## Information Provided

	Research Question (the specific problem to solve):
	{research_question}

	Background Survey (existing methods and their limitations):
	{background_survey}

	Previous Hypothesis (if any - the current state of your solution to build upon):
	{previous_hypothesis}

	New Inspiration Paper Title (external work to incorporate):
	{inspiration_title}

	New Inspiration Paper Abstract:
	{inspiration_abstract}

	## Your Response

	Analyze how this inspiration paper's concepts can advance your solution:

	1. If starting from scratch (no previous hypothesis):
	- Identify how the inspiration addresses the core gaps in existing methods
	- This becomes your first conceptual building block beyond the baseline approach

	2. If building on a previous hypothesis:
	- First understand what the previous hypothesis already accomplishes
	- Identify remaining limitations or opportunities for enhancement
	- Determine how this new inspiration specifically addresses those gaps

	Show your reasoning process as you:
	- Extract relevant concepts from the inspiration paper (may not be obvious - reason through what could be useful)
	- Identify what specific technique/knowledge could serve as the inspiration
	- Connect this inspiration to your problem: Why is this relevant? How does it address gaps?
	- Work through adaptation: How to modify this concept for your specific context?
	- Detail the motivation, mechanism, and methodology for THIS inspiration's contribution
	- Reason through implementation details

	Then formulate a delta hypothesis that captures ONLY what THIS inspiration adds.

	## Output Format

	IMPORTANT: Structure your response exactly as follows:

	<think>
	[Your reasoning process here - explore all aspects thoroughly]
	</think>

	Delta Hypothesis starts:
	Inspiration: [Key concept derived from or inspired by the inspiration paper]
	- Motivation (WHY): [Why this addresses a gap - what specific limitation does it solve?]
	- Mechanism (HOW IT WORKS): [How the concept works in this context]
	- Methodology (HOW IT'S INTEGRATED): [How to integrate it - specific implementation steps]
	Delta Hypothesis ends

	⚠️ CRITICAL: The delta hypothesis is the ONLY part that gets evaluated!
	- Include ALL components you FINALIZED in your reasoning (not early ideas you later revised)
	- Be COMPREHENSIVE - every technical detail, mechanism, and methodology step you reasoned through should appear in the delta hypothesis
	- Don't assume the reader saw your reasoning - the delta hypothesis must be SELF-CONTAINED and COMPLETE
	- Focus on THIS inspiration's contribution only - don't repeat previous hypothesis content
	```

	</details>
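
Since the template asks for the delta hypothesis between `Delta Hypothesis starts:` and `Delta Hypothesis ends`, a response can be split into its four fields with a small parser. The sketch below is illustrative and not part of the MOOSE-Star repo: `parse_delta_hypothesis` is a hypothetical helper, and it assumes the model follows the output format closely (it returns `None` otherwise).

```python
import re
from typing import Dict, Optional

START_MARKER = "Delta Hypothesis starts:"
END_MARKER = "Delta Hypothesis ends"

FIELD_PATTERN = re.compile(
    r"Inspiration:\s*(?P<inspiration>.*?)\s*"
    r"-\s*Motivation \(WHY\):\s*(?P<motivation>.*?)\s*"
    r"-\s*Mechanism \(HOW IT WORKS\):\s*(?P<mechanism>.*?)\s*"
    r"-\s*Methodology \(HOW IT['’]S INTEGRATED\):\s*(?P<methodology>.*)",
    re.DOTALL,
)


def parse_delta_hypothesis(response: str) -> Optional[Dict[str, str]]:
    """Extract the four delta-hypothesis fields from a model response."""
    # Drop the reasoning: keep only what follows the closing think tag.
    if "</think>" in response:
        response = response.split("</think>", 1)[1]
    # Prefer the text between the markers; fall back to the whole remainder.
    start = response.find(START_MARKER)
    if start != -1:
        response = response[start + len(START_MARKER):]
        end = response.find(END_MARKER)
        if end != -1:
            response = response[:end]
    match = FIELD_PATTERN.search(response)
    if match is None:
        return None
    return {name: text.strip() for name, text in match.groupdict().items()}


sample = (
    "<think>\nsome reasoning\n</think>\n\n"
    "Delta Hypothesis starts:\n"
    "Inspiration: X\n"
    "- Motivation (WHY): Y\n"
    "- Mechanism (HOW IT WORKS): Z\n"
    "- Methodology (HOW IT'S INTEGRATED): W\n"
    "Delta Hypothesis ends\n"
)
parsed = parse_delta_hypothesis(sample)
print(parsed)
```

Returning `None` on a format mismatch lets the caller decide whether to retry generation or discard the sample.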

	## Evaluation Results

	### Normal Composition (given ground-truth inspirations, from paper Table 2)

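	Scores are rubric-based. "Total" is the sum of Motivation (Mot), Mechanism (Mec), and Methodology (Met); differences of 0.01 between the reported total and the sum of the columns are consistent with rounding.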

	| Model | Total | Mot | Mec | Met | Length |
	|-------|-------|-----|-----|-----|--------|
	| R1-Distilled-Qwen-7B (base) | 4.34 | 1.96 | 1.40 | 0.97 | 231.02 |
	| MS-HC-7B | 5.08 | 2.21 | 1.58 | 1.29 | 204.12 |
	| MS-HC-7B w/ 1x bounded (this model) | 5.16 | 2.24 | 1.60 | 1.31 | 203.84 |
	| MS-HC-7B w/ 2x bounded | 5.15 | 2.25 | 1.58 | 1.32 | 205.17 |
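
To make the table easier to read, the short sketch below recomputes each row's component sum and its relative gain in Total over the base model. The numbers are copied from the table; the relative-gain figure is a derived quantity computed here, not one reported in the paper.

```python
# (Total, Mot, Mec, Met) per model, copied from the table above.
rows = {
    "R1-Distilled-Qwen-7B (base)": (4.34, 1.96, 1.40, 0.97),
    "MS-HC-7B": (5.08, 2.21, 1.58, 1.29),
    "MS-HC-7B w/ 1x bounded": (5.16, 2.24, 1.60, 1.31),
    "MS-HC-7B w/ 2x bounded": (5.15, 2.25, 1.58, 1.32),
}

base_total = rows["R1-Distilled-Qwen-7B (base)"][0]
component_sums = {}
relative_gains = {}

for name, (total, mot, mec, met) in rows.items():
    # Sum of the three rubric components, for comparison with the reported Total.
    component_sums[name] = mot + mec + met
    # Relative improvement in Total over the base model, in percent.
    relative_gains[name] = (total - base_total) / base_total * 100
    print(
        f"{name}: reported={total:.2f}, "
        f"sum={component_sums[name]:.2f}, gain={relative_gains[name]:+.1f}%"
    )
```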

	## Citation

	```bibtex
	@article{yang2025moosestar,
	  title={MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier},
	  author={Yang, Zonglin and Bing, Lidong},
	  journal={arXiv preprint arXiv:2603.03756},
	  year={2025}
	}
	```

	## License

	This model is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.