---
language:
- en
license: apache-2.0
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
tags:
- scientific-discovery
- hypothesis-generation
- inspiration-retrieval
- multi-task
datasets:
- ZonglinY/TOMATO-Star-SFT-Data-R1D-32B
library_name: transformers
pipeline_tag: text-generation
---

# MOOSE-Star-R1D-7B Model Card

## Overview

**MOOSE-Star-R1D-7B** (referred to as **MS-7B** in the paper) is a 7B-parameter multi-task language model fine-tuned for both **inspiration retrieval** (IR) and **hypothesis composition** (HC) in scientific discovery workflows. As a single unified model, it matches the IR performance of the single-task IR model ([MOOSE-Star-IR-R1D-7B](https://huggingface.co/ZonglinY/MOOSE-Star-IR-R1D-7B)) while significantly outperforming the single-task HC model ([MOOSE-Star-HC-R1D-7B](https://huggingface.co/ZonglinY/MOOSE-Star-HC-R1D-7B)).

- **Paper**: [MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier](https://arxiv.org/abs/2603.03756) (arXiv:2603.03756)
- **Base Model**: [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- **License**: Apache 2.0
- **Code**: [ZonglinY/MOOSE-Star](https://github.com/ZonglinY/MOOSE-Star)

## Model Description

| Parameter | Value |
|-----------|-------|
| **Base Model** | DeepSeek-R1-Distill-Qwen-7B |
| **Training Method** | Full-parameter SFT (ZeRO-3) |
| **Training Data** | TOMATO-Star-SFT-Data-R1D-32B: IR split (150,218 samples) + HC split with 1x bounded (114,548 samples) |
| **Chat Template** | deepseekr1 |
| **Cutoff Length** | 16384 |
| **Learning Rate** | 1e-5 |
| **Epochs** | 1 |
| **Batch Size** | 128 |
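
As a rough sanity check on the training budget, the sample counts and batch size in the table can be combined into an approximate number of optimizer steps. This is a sketch under stated assumptions, not a figure from the paper: it assumes that 128 is the global (effective) batch size, that each sample is one sequence (no packing), and that the final partial batch is kept.

```python
import math

# Sample counts and batch size are taken from the table above.
ir_samples = 150_218
hc_samples = 114_548
batch_size = 128  # ASSUMPTION: global (effective) batch size
epochs = 1

total_samples = ir_samples + hc_samples

# ASSUMPTION: one sequence per sample (no packing), last partial batch kept.
steps_per_epoch = math.ceil(total_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"Total samples: {total_samples:,}")
print(f"Approximate optimizer steps: {total_steps:,}")
print(f"IR share of the training mixture: {ir_samples / total_samples:.1%}")
```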

## Task 1: Inspiration Retrieval (IR)

The model selects the most relevant **cross-paper inspiration** from 15 candidates (A-O), comprising 1 correct inspiration and 14 hard negatives.

### IR Prompt Format (Simplified Overview)

The full prompt template is constructed via `instruction_prompts()` in the code examples below. The general structure is:

```
[Task instruction preamble]

## Context

**Research Question:**
{research_question}

**Background Survey (existing methods for THIS task):**
{background_survey}

**Previous Hypothesis (if any):**
{previous_hypothesis_or_none}

## Candidate Inspiration Papers

### Candidate [A]
**Title:** {title_A}
**Abstract:** {abstract_A}

... (15 candidates total, A through O)

## Output Format

<think>
[reasoning process]
</think>

**Selected ID starts:** [X] **Selected ID ends**

**Selection Reason starts:** [reason] **Selection Reason ends**
```
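
The output format above can be parsed with a small helper. The following is a minimal sketch, assuming the response follows the marker layout shown; `IRSelection` and `parse_ir_response` are hypothetical names introduced here for illustration and are not part of the MOOSE-Star repository.

```python
import re
from typing import NamedTuple, Optional


class IRSelection(NamedTuple):
    label: Optional[str]   # candidate letter, e.g. "C", or None if not found
    reason: Optional[str]  # free-text selection reason, or None if not found


_ID_RE = re.compile(
    r"\*\*Selected ID starts:\*\*\s*\[([A-O])\]\s*\*\*Selected ID ends\*\*"
)
_REASON_RE = re.compile(
    r"\*\*Selection Reason starts:\*\*\s*(.*?)\s*\*\*Selection Reason ends\*\*",
    re.DOTALL,
)


def parse_ir_response(response: str) -> IRSelection:
    """Extract the selected label and reason from a model response."""
    # Keep only the text after the reasoning block, so markers quoted
    # inside <think>...</think> are not matched by accident.
    answer = response.split("</think>")[-1]
    id_match = _ID_RE.search(answer)
    reason_match = _REASON_RE.search(answer)
    return IRSelection(
        label=id_match.group(1) if id_match else None,
        reason=reason_match.group(1) if reason_match else None,
    )


example = (
    "<think>\nCandidate C looks closest.\n</think>\n\n"
    "**Selected ID starts:** [C] **Selected ID ends**\n\n"
    "**Selection Reason starts:** It supplies the missing mechanism. "
    "**Selection Reason ends**"
)
selection = parse_ir_response(example)
print(selection.label, "|", selection.reason)
```

Returning `None` for a missing field, rather than raising, lets the caller decide whether to retry generation when the model does not follow the format.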

### IR Usage

**Prerequisites**: Clone the [MOOSE-Star repo](https://github.com/ZonglinY/MOOSE-Star) for prompt templates and inference utilities:
```bash
git clone https://github.com/ZonglinY/MOOSE-Star.git && cd MOOSE-Star
# See requirements.txt for full dependencies; at minimum: pip install transformers torch
```

#### Option A: SGLang Deployment (Recommended)

```bash
# SGLang requires a separate environment; see https://github.com/sgl-project/sglang for installation
# Start the server
python -m sglang.launch_server --model-path ZonglinY/MOOSE-Star-R1D-7B --port 1235
```

```python
import sys
sys.path.insert(0, "./Inference")
from ir_probability_extractor import IRProbabilityExtractor

extractor = IRProbabilityExtractor(base_urls=["http://localhost:1235/v1"])
result = extractor.get_selection_probabilities(
    research_question="Your research question",
    background_survey="Your background survey",
    candidates=[
        {"title": "Candidate A title", "abstract": "Candidate A abstract"},
        {"title": "Candidate B title", "abstract": "Candidate B abstract"},
        # ... up to 15 candidates (labeled A-O)
    ],
)
print(f"Selected: [{result.selected_label}]")
print(f"Probabilities: {result.probabilities}")
```

#### Option B: Direct HuggingFace Inference

```python
import sys
sys.path.insert(0, "./utils")
from prompt_store import instruction_prompts
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_name = "ZonglinY/MOOSE-Star-R1D-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

p = instruction_prompts("inspiration_retrieval_with_reasoning_with_alphabetical_candidates")

candidates = [
    {"title": "...", "abstract": "..."},
    # ... add the remaining candidates here (up to 15, labeled A-O)
]
candidates_text = "".join(
    f"### Candidate [{chr(ord('A') + i)}]\n**Title:** {c['title']}\n**Abstract:** {c['abstract']}\n\n"
    for i, c in enumerate(candidates)
)

research_question = "Your research question"
background_survey = "Your background survey"
prompt = (p[0] + research_question
        + p[1] + background_survey
        + p[2] + "No previous hypothesis."
        + p[3] + candidates_text
        + p[4])

messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
formatted += "<\uff5cAssistant\uff5c>"

inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

match = re.search(r"\*\*Selected ID starts:\*\*\s*\[(\w)\]\s*\*\*Selected ID ends\*\*", response)
if match:
    print(f"Selected: [{match.group(1)}]")
```
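
Because the model answers with a letter rather than a title, it can help to keep the labeling and the reverse lookup in one place. The sketch below mirrors the candidate layout used in Option B; `format_candidates` and `candidate_for_label` are hypothetical helper names, and the 15-candidate cap is taken from the A-O labeling described above.

```python
import string

MAX_CANDIDATES = 15
LABELS = string.ascii_uppercase[:MAX_CANDIDATES]  # "ABCDEFGHIJKLMNO"


def format_candidates(candidates: list) -> str:
    """Render candidates in the '### Candidate [X]' layout used in the prompt."""
    if not 1 <= len(candidates) <= MAX_CANDIDATES:
        raise ValueError(f"Expected 1 to {MAX_CANDIDATES} candidates, got {len(candidates)}")
    return "".join(
        f"### Candidate [{LABELS[i]}]\n"
        f"**Title:** {c['title']}\n"
        f"**Abstract:** {c['abstract']}\n\n"
        for i, c in enumerate(candidates)
    )


def candidate_for_label(candidates: list, label: str) -> dict:
    """Map a selected letter (e.g. 'B') back to the candidate it refers to."""
    # Require a single letter: str.find("") would otherwise return index 0.
    index = LABELS.find(label) if len(label) == 1 else -1
    if index == -1 or index >= len(candidates):
        raise ValueError(f"Label {label!r} does not match any candidate")
    return candidates[index]


demo_candidates = [
    {"title": "Paper one", "abstract": "First abstract."},
    {"title": "Paper two", "abstract": "Second abstract."},
]
print(format_candidates(demo_candidates))
print(candidate_for_label(demo_candidates, "B")["title"])
```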

## Task 2: Hypothesis Composition (HC)

The model generates **delta hypotheses** from inspiration papers. Given a research question, a background survey, and a new inspiration paper, it outputs structured hypothesis components.

### HC Prompt Format (Simplified Overview)

The full prompt template is constructed via `instruction_prompts()` in the code examples below. The general structure is:

```
[Task instruction preamble]

## Information Provided

**Research Question**:
{research_question}

**Background Survey**:
{background_survey}

**Previous Hypothesis**:
{previous_hypothesis_or_none}

**New Inspiration Paper Title**:
{inspiration_title}

**New Inspiration Paper Abstract**:
{inspiration_abstract}

## Your Response

<think>
[reasoning process]
</think>

Inspiration: [Key concept]
- Motivation (WHY): [Why this addresses a gap]
- Mechanism (HOW IT WORKS): [How the concept works]
- Methodology (HOW IT'S INTEGRATED): [Implementation steps]
```
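
The structured response above can be split into its four components for downstream use. This is a minimal sketch that assumes the model reproduces the field labels exactly as shown; real outputs may deviate (extra formatting, reordered fields), in which case the function returns `None`. `parse_hc_response` is a hypothetical name, not a function from the MOOSE-Star repository.

```python
import re
from typing import Dict, Optional

_HC_RE = re.compile(
    r"Inspiration:\s*(?P<inspiration>.*?)\s*"
    r"-\s*Motivation \(WHY\):\s*(?P<motivation>.*?)\s*"
    r"-\s*Mechanism \(HOW IT WORKS\):\s*(?P<mechanism>.*?)\s*"
    # Accept both a straight and a curly apostrophe in "IT'S".
    r"-\s*Methodology \(HOW IT['’]S INTEGRATED\):\s*(?P<methodology>.*)",
    re.DOTALL,
)


def parse_hc_response(response: str) -> Optional[Dict[str, str]]:
    """Split a delta hypothesis into inspiration, motivation, mechanism, methodology."""
    # Ignore the reasoning block; only the final answer is parsed.
    answer = response.split("</think>")[-1]
    match = _HC_RE.search(answer)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


example = (
    "<think>\nreasoning\n</think>\n\n"
    "Inspiration: Graph attention\n"
    "- Motivation (WHY): Existing methods ignore structure.\n"
    "- Mechanism (HOW IT WORKS): Attention weights neighbours.\n"
    "- Methodology (HOW IT'S INTEGRATED): Replace the pooling layer."
)
parsed = parse_hc_response(example)
print(parsed)
```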

### HC Usage

```python
import sys
sys.path.insert(0, "./utils")
from prompt_store import instruction_prompts
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZonglinY/MOOSE-Star-R1D-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

p = instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")

research_question = "Your research question here"
background_survey = "Your background survey here"
inspiration_title = "Inspiration paper title"
inspiration_abstract = "Inspiration paper abstract"

prompt = (p[0] + research_question
        + p[1] + background_survey
        + p[2] + "No previous hypothesis."
        + p[3] + inspiration_title
        + p[4] + inspiration_abstract
        + p[5])

messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
formatted += "<\uff5cAssistant\uff5c>"

inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

## Evaluation Results

### Inspiration Retrieval (Table 1)

| Model | Accuracy |
|-------|----------|
| Random Selection | 6.70% |
| R1-Distilled-Qwen-7B (base) | 28.42% |
| MS-IR-7B (single-task) | 54.37% |
| **MS-7B (this model)** | **54.34%** |
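
To put these accuracies in context: with 15 candidates and exactly one correct answer, uniform random guessing succeeds with probability 1/15, which is close to the random-selection figure reported in the table. The short calculation below uses only the numbers from the table; the variable names are illustrative.

```python
# Accuracies copied from Table 1 above (as fractions).
num_candidates = 15
base_accuracy = 0.2842
ms_7b_accuracy = 0.5434

# Uniform guessing over 15 candidates with a single correct answer.
chance_accuracy = 1 / num_candidates

gain_over_base = (ms_7b_accuracy - base_accuracy) * 100  # percentage points
ratio_to_chance = ms_7b_accuracy / chance_accuracy

print(f"Chance level: {chance_accuracy:.2%}")
print(f"MS-7B gain over base model: {gain_over_base:.2f} points")
print(f"MS-7B accuracy relative to chance: {ratio_to_chance:.1f}x")
```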

### Hypothesis Composition - Normal (Table 2)

Rubric-based evaluation with ground-truth inspirations (Judge: GPT-4o):

| Model | Total | Motivation (Mot) | Mechanism (Mec) | Methodology (Met) | Length |
|-------|-------|-----|-----|-----|--------|
| R1-Distilled-Qwen-7B (base) | 4.05 | 1.96 | 1.30 | 0.80 | 231.02 |
| MS-HC-7B (single-task) | 4.68 | 2.13 | 1.46 | 1.09 | 204.12 |
| MS-HC-7B w/ 1x bounded | 4.74 | 2.16 | 1.48 | 1.10 | 203.84 |
| **MS-7B (this model)** | **5.02** | **2.22** | **1.59** | **1.20** | 208.98 |

### Hypothesis Composition - Bounded (Table 3)

Performance under varying levels of inspiration noise (Judge: GPT-4o):

| Model | Easy Total | Medium Total | Hard Total |
|-------|-----------|-------------|-----------|
| R1-Distilled-Qwen-7B (base) | 2.72 | 2.27 | 2.00 |
| MS-HC-7B w/ 2x bounded | 3.18 | 2.74 | 2.56 |
| **MS-7B (this model)** | **3.37** | **2.86** | **2.78** |

## Key Findings

- **IR performance preserved**: Multi-task training essentially maintains IR accuracy (54.34% vs. 54.37% for the single-task model)
- **HC significantly improved**: On normal composition, MS-7B reaches a rubric total of 5.02, ahead of both single-task variants (4.68, and 4.74 with 1x bounded composition augmentation)
- **Robust under noise**: On bounded composition, MS-7B leads at every noise level, and its margin is largest on the Hard setting, suggesting IR reasoning skills transfer to HC
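
The noise-robustness observation can be checked directly from the Table 3 totals. The snippet below is a small worked calculation on the reported numbers only; the dictionary keys are illustrative labels.

```python
# Bounded-composition totals copied from Table 3 above.
totals = {
    "base": {"Easy": 2.72, "Medium": 2.27, "Hard": 2.00},
    "single_task_2x": {"Easy": 3.18, "Medium": 2.74, "Hard": 2.56},
    "ms_7b": {"Easy": 3.37, "Medium": 2.86, "Hard": 2.78},
}


def gains(reference: str) -> dict:
    """Absolute gain of MS-7B over a reference model, per difficulty level."""
    return {
        level: round(totals["ms_7b"][level] - totals[reference][level], 2)
        for level in totals["ms_7b"]
    }


for reference in ("base", "single_task_2x"):
    per_level = gains(reference)
    largest = max(per_level, key=per_level.get)
    print(f"MS-7B vs {reference}: {per_level} (largest gain: {largest})")
```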

## Citation

```bibtex
@article{yang2025moosestar,
  title={MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier},
  author={Yang, Zonglin and Bing, Lidong},
  journal={arXiv preprint arXiv:2603.03756},
  year={2026}
}
```