---
license: apache-2.0
language:
- en
tags:
- science
- hypothesis-generation
- biomedical
- deepseek
- qwen2
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
pipeline_tag: text-generation
---

# MOOSE-Star-HC-R1D-7B

**MOOSE-Star Hypothesis Composition model**: a 7B model fine-tuned to generate scientific hypotheses from research questions, background surveys, and cross-paper inspirations.

> **Note**: This model is referred to as **MS-HC-7B (w/ 1x bounded)** in the paper. The "R1D" in the full name indicates that it is fine-tuned from DeepSeek-R1-Distill-Qwen-7B; the SFT data can be used to train other base models as well.

## Model Description

- **Base Model**: [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- **Training Method**: Full-parameter SFT (ZeRO-3)
- **Training Data**: [TOMATO-Star-SFT-Data-R1D-32B](https://huggingface.co/datasets/ZonglinY/TOMATO-Star-SFT-Data-R1D-32B) HC split (114,548 samples = 96,879 normal + 17,669 bounded, mixed 1x)
- **Teacher Model**: Training data generated via rejection sampling with [DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)
- **Paper**: [MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier](https://arxiv.org/abs/2603.03756)

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | DeepSeek-R1-Distill-Qwen-7B |
| Chat Template | deepseekr1 |
| Cutoff Length | 8192 |
| Learning Rate | 1e-5 |
| Epochs | 1 |
| Batch Size | 64 (via gradient accumulation) |
| Training | Full-parameter, ZeRO-3, bf16 |
| GPUs | 64x (multi-node) |

## Task Description

Given a research question, a background survey, a previous hypothesis (if any), and a new inspiration paper, the model outputs a **delta hypothesis**: the specific contribution of this inspiration, structured as four fields:

- **Inspiration**: Key concept derived from the inspiration paper
- **Motivation (WHY)**: Why this addresses a gap in existing methods or the current hypothesis
- **Mechanism (HOW IT WORKS)**: Core scientific mechanism or theoretical framework
- **Methodology (HOW IT'S INTEGRATED)**: Proposed implementation approach

The model is designed for **incremental hypothesis composition**: a hypothesis is built one inspiration at a time, which suits hierarchical search where inspirations are selected and composed step by step.

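The incremental composition described above can be sketched as a simple loop. This is an illustration only: `compose_incrementally`, `generate_delta`, and `fake_delta` are hypothetical names, and representing the cumulative hypothesis as the concatenation of earlier deltas is an assumption, not something this card specifies.

```python
from typing import Callable, List, Optional


def compose_incrementally(
    inspirations: List[str],
    generate_delta: Callable[[Optional[str], str], str],
) -> str:
    """Build a hypothesis one inspiration at a time, one delta per step."""
    deltas: List[str] = []
    for inspiration in inspirations:
        # Assumption: the "previous hypothesis" is all deltas composed so far
        # (None at the first step, when nothing has been composed yet).
        previous = "\n\n".join(deltas) if deltas else None
        deltas.append(generate_delta(previous, inspiration))
    return "\n\n".join(deltas)


def fake_delta(previous: Optional[str], inspiration: str) -> str:
    """Stub standing in for a real model call so the sketch runs offline."""
    step = 1 if previous is None else previous.count("Inspiration:") + 1
    return f"Inspiration: {inspiration} (step {step})"


print(compose_incrementally(["paper A", "paper B"], fake_delta))
```

In practice `generate_delta` would wrap a call to the model with the prompt described in the following sections.
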
## Prompt Format

The HC task uses a single structured prompt, sent entirely as the user message with no system prompt:

```
[Task instruction preamble: delta hypothesis composition guidelines]

## Information Provided

**Research Question** (the specific problem to solve):
{research_question}

**Background Survey** (existing methods and their limitations):
{background_survey}

**Previous Hypothesis** (if any - the current state of your solution to build upon):
{previous_hypothesis_or_none}

**New Inspiration Paper Title** (external work to incorporate):
{inspiration_title}

**New Inspiration Paper Abstract**:
{inspiration_abstract}

## Your Response

<think>
[reasoning process]
</think>

Inspiration: [Key concept from the inspiration paper]
- Motivation (WHY): [Why this addresses a gap]
- Mechanism (HOW IT WORKS): [How the concept works in this context]
- Methodology (HOW IT'S INTEGRATED): [How to integrate it]
```

See [TOMATO-Star-SFT-Data-R1D-32B](https://huggingface.co/datasets/ZonglinY/TOMATO-Star-SFT-Data-R1D-32B) `HC/bounded_composition.jsonl` for complete prompt examples.

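As a rough illustration of how the placeholders in the layout above get filled, the sketch below substitutes them in one pass. `TEMPLATE` is an abbreviated stand-in rather than the official prompt text, `fill_prompt` is a hypothetical helper, and the fallback string `"No previous hypothesis."` is borrowed from the usage example in this card.

```python
import re
from typing import Optional

# Abbreviated stand-in for the real template; only the placeholder layout matters.
TEMPLATE = """**Research Question** (the specific problem to solve):
{research_question}

**Background Survey** (existing methods and their limitations):
{background_survey}

**Previous Hypothesis** (if any - the current state of your solution to build upon):
{previous_hypothesis_or_none}

**New Inspiration Paper Title** (external work to incorporate):
{inspiration_title}

**New Inspiration Paper Abstract**:
{inspiration_abstract}
"""


def fill_prompt(
    template: str,
    research_question: str,
    background_survey: str,
    inspiration_title: str,
    inspiration_abstract: str,
    previous_hypothesis: Optional[str] = None,
) -> str:
    """Substitute the known placeholders, leaving any other braces untouched."""
    values = {
        "research_question": research_question,
        "background_survey": background_survey,
        "previous_hypothesis_or_none": previous_hypothesis or "No previous hypothesis.",
        "inspiration_title": inspiration_title,
        "inspiration_abstract": inspiration_abstract,
    }
    # A single regex pass avoids re-substituting braces inside user-supplied text.
    return re.sub(r"\{(\w+)\}", lambda m: values.get(m.group(1), m.group(0)), template)


print(fill_prompt(TEMPLATE, "Your question", "Your survey", "A title", "An abstract"))
```

A single-pass substitution is used instead of `str.format` because abstracts and surveys can themselves contain curly braces (for example, LaTeX fragments).
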
## Usage

```python
# Clone the MOOSE-Star repo to access official prompt templates:
#   git clone https://github.com/ZonglinY/MOOSE-Star.git
# Then run from the repo root:

import sys
sys.path.insert(0, "./utils")
from prompt_store import instruction_prompts
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZonglinY/MOOSE-Star-HC-R1D-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

# Bounded/delta composition (for use with hierarchical search):
p = instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")

# Fill in your inputs:
research_question = "Your research question here"
background_survey = "Your background survey here"
inspiration_title = "Inspiration paper title"
inspiration_abstract = "Inspiration paper abstract"

prompt = (p[0] + research_question
        + p[1] + background_survey
        + p[2] + "No previous hypothesis."  # or previous hypothesis string
        + p[3] + inspiration_title
        + p[4] + inspiration_abstract
        + p[5])

messages = [{"role": "user", "content": prompt}]
# Note: use add_generation_prompt=False and append the assistant token manually
# to avoid a double <think> tag in the DeepSeek R1-Distill chat template.
# The token is written with full-width bars (｜, U+FF5C), not ASCII pipes.
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
formatted += "<｜Assistant｜>"

inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
# Output: <think>...</think> followed by structured delta hypothesis
# (Inspiration / Motivation / Mechanism / Methodology)
```

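Once a response has been generated, the reasoning trace and the four delta-hypothesis fields usually need to be separated. The parser below is a minimal sketch, not part of the MOOSE-Star repo: `parse_delta_hypothesis` is a hypothetical helper, and it assumes the response follows the structured format shown in this card, optionally wrapped in the `**Delta Hypothesis starts:**` / `**Delta Hypothesis ends**` marker lines that the full prompt template requests.

```python
import re
from typing import Dict

FIELDS = ("Inspiration", "Motivation", "Mechanism", "Methodology")


def parse_delta_hypothesis(response: str) -> Dict[str, str]:
    """Split a response into its reasoning trace and delta-hypothesis fields."""
    parsed: Dict[str, str] = {}
    # Keep the reasoning inside <think>...</think>, if both tags are present.
    think = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    parsed["reasoning"] = think.group(1).strip() if think else ""
    # Everything after </think> is treated as the answer.
    answer = response.split("</think>", 1)[-1]

    current = None
    for line in answer.splitlines():
        # Drop list dashes and bold markers before matching a field label.
        stripped = line.strip().lstrip("-").strip().strip("*").strip()
        if not stripped or stripped.startswith("Delta Hypothesis"):
            continue  # skip blank lines and the start/end marker lines
        label = next((f for f in FIELDS if stripped.startswith(f)), None)
        if label and ":" in stripped:
            current = label.lower()
            parsed[current] = stripped.split(":", 1)[1].strip()
        elif current:
            # Continuation line of the field that is currently open.
            parsed[current] += " " + stripped
    return parsed


example = """<think>
Some reasoning.
</think>

**Delta Hypothesis starts:**
Inspiration: Attention pooling
- Motivation (WHY): Mean pooling loses signal
- Mechanism (HOW IT WORKS): Learned weights per token
- Methodology (HOW IT'S INTEGRATED): Replace the pooling layer
**Delta Hypothesis ends**"""

fields = parse_delta_hypothesis(example)
print(fields["motivation"])
```

Fields missing from a malformed response are simply absent from the returned dictionary, so callers may want to check for all four keys before scoring or composing further.
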
<details>
<summary>Full prompt template (contents of <code>instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")</code>)</summary>

```
You are a scientific hypothesis composer. Given a research question, background survey, potentially a previous hypothesis to build upon, and a new inspiration paper, you will reason through how to adapt concepts from the inspiration to advance the solution, then formulate a DELTA HYPOTHESIS - the specific contribution from THIS inspiration paper (not the full cumulative hypothesis).

## Your Task

Analyze the provided research context and inspiration paper to:
1. Identify the key conceptual innovation from the inspiration paper (Note: the paper may directly provide a concept that can be adapted, OR it may contain related ideas/transferable mechanisms that inspire what we need - look beyond exact concept names)
2. Determine how this innovation addresses gaps (either in existing methods or in your previous hypothesis)
3. Reason through adaptation and integration into your solution
4. Formulate a delta hypothesis describing ONLY what THIS inspiration contributes

## Key Principles

**Reasoning Process:**
- Start by understanding the problem and what current methods lack
- If there's a previous hypothesis, understand what it already addresses and what gaps/limitations remain
- Analyze the inspiration paper to identify relevant concepts that could be adapted
- Reason through: What specific knowledge/technique from this paper could serve as an inspiration?
- Connect the dots: How does this potential inspiration address the identified gaps?
- Work through the mechanism: How would this inspiration actually function in our context?
- Develop the methodology: Detail the specific implementation and integration
- Don't just identify concepts - reason through their practical application and adaptation

**Delta Hypothesis Requirements:**
- Output ONLY what THIS inspiration adds (delta), NOT the full cumulative hypothesis
- Don't repeat what's already in the previous hypothesis
- Must clearly explain WHY the inspiration addresses the problem (Motivation)
- Must detail HOW the inspiration works in this context (Mechanism)
- Must specify HOW to implement it methodologically (Methodology)
- Follow the exact structured format shown below

## Information Provided

**Research Question** (the specific problem to solve):
{research_question}

**Background Survey** (existing methods and their limitations):
{background_survey}

**Previous Hypothesis** (if any - the current state of your solution to build upon):
{previous_hypothesis}

**New Inspiration Paper Title** (external work to incorporate):
{inspiration_title}

**New Inspiration Paper Abstract**:
{inspiration_abstract}

## Your Response

Analyze how this inspiration paper's concepts can advance your solution:

1. **If starting from scratch** (no previous hypothesis):
- Identify how the inspiration addresses the core gaps in existing methods
- This becomes your first conceptual building block beyond the baseline approach

2. **If building on a previous hypothesis**:
- First understand what the previous hypothesis already accomplishes
- Identify remaining limitations or opportunities for enhancement
- Determine how this new inspiration specifically addresses those gaps

Show your reasoning process as you:
- Extract relevant concepts from the inspiration paper (may not be obvious - reason through what could be useful)
- Identify what specific technique/knowledge could serve as the inspiration
- Connect this inspiration to your problem: Why is this relevant? How does it address gaps?
- Work through adaptation: How to modify this concept for your specific context?
- Detail the motivation, mechanism, and methodology for THIS inspiration's contribution
- Reason through implementation details

Then formulate a delta hypothesis that captures ONLY what THIS inspiration adds.

## Output Format

**IMPORTANT**: Structure your response exactly as follows:

<think>
[Your reasoning process here - explore all aspects thoroughly]
</think>

**Delta Hypothesis starts:**
Inspiration: [Key concept derived from or inspired by the inspiration paper]
- Motivation (WHY): [Why this addresses a gap - what specific limitation does it solve?]
- Mechanism (HOW IT WORKS): [How the concept works in this context]
- Methodology (HOW IT'S INTEGRATED): [How to integrate it - specific implementation steps]
**Delta Hypothesis ends**

⚠️ CRITICAL: The delta hypothesis is the ONLY part that gets evaluated!
- Include ALL components you FINALIZED in your reasoning (not early ideas you later revised)
- Be COMPREHENSIVE - every technical detail, mechanism, and methodology step you reasoned through should appear in the delta hypothesis
- Don't assume the reader saw your reasoning - the delta hypothesis must be SELF-CONTAINED and COMPLETE
- Focus on THIS inspiration's contribution only - don't repeat previous hypothesis content
```

</details>

## Evaluation Results

### Normal Composition (given ground-truth inspirations, from paper Table 2)

Scores are rubric-based. "Total" is the sum of Motivation (Mot), Mechanism (Mec), and Methodology (Met), up to rounding. Bold marks this model's row, not the best value in each column.

| Model | Total | Mot | Mec | Met | Length |
|-------|-------|-----|-----|-----|--------|
| R1-Distilled-Qwen-7B (base) | 4.34 | 1.96 | 1.40 | 0.97 | 231.02 |
| MS-HC-7B | 5.08 | 2.21 | 1.58 | 1.29 | 204.12 |
| **MS-HC-7B w/ 1x bounded** (this model) | **5.16** | **2.24** | **1.60** | **1.31** | 203.84 |
| MS-HC-7B w/ 2x bounded | 5.15 | 2.25 | 1.58 | 1.32 | 205.17 |

## Citation

```bibtex
@article{yang2025moosestar,
  title={MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier},
  author={Yang, Zonglin and Bing, Lidong},
  year={2025}
}
```

## License

This model is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.
