---
language:
- en
license: apache-2.0
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
tags:
- scientific-discovery
- hypothesis-generation
- inspiration-retrieval
- multi-task
datasets:
- ZonglinY/TOMATO-Star-SFT-Data-R1D-32B
library_name: transformers
pipeline_tag: text-generation
---

# MOOSE-Star-R1D-7B Model Card

## Overview

**MOOSE-Star-R1D-7B** (referred to as **MS-7B** in the paper) is a 7B-parameter multi-task language model fine-tuned for both **inspiration retrieval** (IR) and **hypothesis composition** (HC) in scientific discovery workflows. As a single unified model, it matches the IR performance of the single-task IR model ([MOOSE-Star-IR-R1D-7B](https://huggingface.co/ZonglinY/MOOSE-Star-IR-R1D-7B)) and significantly outperforms the single-task HC model ([MOOSE-Star-HC-R1D-7B](https://huggingface.co/ZonglinY/MOOSE-Star-HC-R1D-7B)).

- **Paper**: [MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier](https://arxiv.org/abs/2603.03756) (arXiv:2603.03756)
- **Base Model**: [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- **License**: Apache 2.0
- **Code**: [ZonglinY/MOOSE-Star](https://github.com/ZonglinY/MOOSE-Star)

## Model Description

| Parameter | Value |
|-----------|-------|
| **Base Model** | DeepSeek-R1-Distill-Qwen-7B |
| **Training Method** | Full-parameter SFT (ZeRO-3) |
| **Training Data** | TOMATO-Star-SFT-Data-R1D-32B: IR split (150,218 samples) + HC split with 1x bounded (114,548 samples) |
| **Chat Template** | deepseekr1 |
| **Cutoff Length** | 16384 |
| **Learning Rate** | 1e-5 |
| **Epochs** | 1 |
| **Batch Size** | 128 |
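
As a rough sanity check on the table above, the number of optimizer steps can be estimated from the sample counts. This is a minimal sketch that assumes 128 is the global batch size and that every sample is seen exactly once with no sequence packing or filtering; the actual training run may differ.

```python
import math

# Values copied from the table above.
ir_samples = 150_218
hc_samples = 114_548
batch_size = 128  # assumption: global (effective) batch size
epochs = 1

total_samples = ir_samples + hc_samples
# Assumption: one sample per sequence, last partial batch kept.
steps = math.ceil(total_samples / batch_size) * epochs

print(f"total samples: {total_samples}")
print(f"estimated optimizer steps: {steps}")
```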

## Task 1: Inspiration Retrieval (IR)

The model selects the most relevant **cross-paper inspiration** from 15 candidates (labeled A-O): 1 correct inspiration and 14 hard negatives.

### IR Prompt Format (Simplified Overview)

The full prompt template is constructed via `instruction_prompts()` in the code examples below. The general structure is:

```
[Task instruction preamble]

## Context

**Research Question:**
{research_question}

**Background Survey (existing methods for THIS task):**
{background_survey}

**Previous Hypothesis (if any):**
{previous_hypothesis_or_none}

## Candidate Inspiration Papers

### Candidate [A]
**Title:** {title_A}
**Abstract:** {abstract_A}

... (15 candidates total, A through O)

## Output Format

<think>
[reasoning process]
</think>

**Selected ID starts:** [X] **Selected ID ends**

**Selection Reason starts:** [reason] **Selection Reason ends**
```
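
The output format above can be parsed with two regular expressions. The following is a minimal sketch, not part of the MOOSE-Star repo: `parse_ir_response` and `IRSelection` are hypothetical names, and the parser assumes the model follows the marker format shown above.

```python
import re
from typing import NamedTuple, Optional


class IRSelection(NamedTuple):
    label: Optional[str]   # candidate letter A-O, or None if not found
    reason: Optional[str]  # free-text selection reason, or None if not found


_ID_RE = re.compile(
    r"\*\*Selected ID starts:\*\*\s*\[?([A-O])\]?\s*\*\*Selected ID ends\*\*"
)
_REASON_RE = re.compile(
    r"\*\*Selection Reason starts:\*\*\s*(.*?)\s*\*\*Selection Reason ends\*\*",
    re.DOTALL,
)


def parse_ir_response(response: str) -> IRSelection:
    """Extract the selected label and reason from a model response."""
    # Keep only the text after the reasoning block, so markers that the
    # model quotes inside <think>...</think> are not matched by mistake.
    answer = response.split("</think>")[-1]
    id_match = _ID_RE.search(answer)
    reason_match = _REASON_RE.search(answer)
    return IRSelection(
        label=id_match.group(1) if id_match else None,
        reason=reason_match.group(1) if reason_match else None,
    )
```

Returning `None` for missing fields lets the caller decide whether to retry generation, for example when the response was truncated before the final answer.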

### IR Usage

**Prerequisites**: Clone the [MOOSE-Star repo](https://github.com/ZonglinY/MOOSE-Star) for prompt templates and inference utilities:
```bash
git clone https://github.com/ZonglinY/MOOSE-Star.git && cd MOOSE-Star
# See requirements.txt for full dependencies; at minimum: pip install transformers torch
```

#### Option A: SGLang Deployment (Recommended)

```bash
# SGLang requires a separate environment; see https://github.com/sgl-project/sglang for installation
# Start the server
python -m sglang.launch_server --model-path ZonglinY/MOOSE-Star-R1D-7B --port 1235
```

```python
import sys
sys.path.insert(0, "./Inference")
from ir_probability_extractor import IRProbabilityExtractor

extractor = IRProbabilityExtractor(base_urls=["http://localhost:1235/v1"])
result = extractor.get_selection_probabilities(
    research_question="Your research question",
    background_survey="Your background survey",
    candidates=[
        {"title": "Candidate A title", "abstract": "Candidate A abstract"},
        {"title": "Candidate B title", "abstract": "Candidate B abstract"},
        # ... up to 15 candidates (labeled A-O)
    ],
)
print(f"Selected: [{result.selected_label}]")
print(f"Probabilities: {result.probabilities}")
```

#### Option B: Direct HuggingFace Inference

```python
import sys
sys.path.insert(0, "./utils")
from prompt_store import instruction_prompts
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_name = "ZonglinY/MOOSE-Star-R1D-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

p = instruction_prompts("inspiration_retrieval_with_reasoning_with_alphabetical_candidates")

candidates = [{"title": "...", "abstract": "..."}]  # add more dicts here, up to 15 candidates (labeled A-O)
candidates_text = "".join(
    f"### Candidate [{chr(ord('A') + i)}]\n**Title:** {c['title']}\n**Abstract:** {c['abstract']}\n\n"
    for i, c in enumerate(candidates)
)

research_question = "Your research question"
background_survey = "Your background survey"
prompt = (p[0] + research_question
        + p[1] + background_survey
        + p[2] + "No previous hypothesis."
        + p[3] + candidates_text
        + p[4])

messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
formatted += "<\uff5cAssistant\uff5c>"

inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

match = re.search(r"\*\*Selected ID starts:\*\*\s*\[(\w)\]\s*\*\*Selected ID ends\*\*", response)
if match:
    print(f"Selected: [{match.group(1)}]")
```

## Task 2: Hypothesis Composition (HC)

The model generates **delta hypotheses** from inspiration papers. Given a research question, a background survey, and a new inspiration paper, it outputs structured hypothesis components.

### HC Prompt Format (Simplified Overview)

The full prompt template is constructed via `instruction_prompts()` in the code examples below. The general structure is:

```
[Task instruction preamble]

## Information Provided

**Research Question**:
{research_question}

**Background Survey**:
{background_survey}

**Previous Hypothesis**:
{previous_hypothesis_or_none}

**New Inspiration Paper Title**:
{inspiration_title}

**New Inspiration Paper Abstract**:
{inspiration_abstract}

## Your Response

<think>
[reasoning process]
</think>

Inspiration: [Key concept]
- Motivation (WHY): [Why this addresses a gap]
- Mechanism (HOW IT WORKS): [How the concept works]
- Methodology (HOW IT'S INTEGRATED): [Implementation steps]
```
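
The structured response above can be split into its four fields with a small line-based parser. This is a minimal sketch under the assumption that the model keeps the field labels shown above; `parse_hc_response` is a hypothetical helper, not a function from the MOOSE-Star repo.

```python
import re
from typing import Dict, Optional

# Matches "Inspiration: ..." and "- Motivation (WHY): ..." style headers.
_HEADER_RE = re.compile(
    r"^(?:-\s*)?(Inspiration|Motivation|Mechanism|Methodology)\b[^:]*:\s*(.*)$"
)


def parse_hc_response(response: str) -> Dict[str, str]:
    """Return the hypothesis fields found in a response, keyed by lowercase name."""
    # Drop the reasoning block; only the final answer is parsed.
    answer = response.split("</think>")[-1]
    fields: Dict[str, str] = {}
    current: Optional[str] = None
    for line in answer.splitlines():
        stripped = line.strip()
        header = _HEADER_RE.match(stripped)
        if header:
            current = header.group(1).lower()
            fields[current] = header.group(2).strip()
        elif current and stripped:
            # Continuation line: append to the field currently being read.
            fields[current] = (fields[current] + " " + stripped).strip()
    return fields
```

Fields the model omits are simply absent from the returned dictionary, so a caller can check for `"motivation"`, `"mechanism"`, and `"methodology"` before scoring or storing a hypothesis.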

### HC Usage

```python
import sys
sys.path.insert(0, "./utils")
from prompt_store import instruction_prompts
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZonglinY/MOOSE-Star-R1D-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

p = instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")

research_question = "Your research question here"
background_survey = "Your background survey here"
inspiration_title = "Inspiration paper title"
inspiration_abstract = "Inspiration paper abstract"

prompt = (p[0] + research_question
        + p[1] + background_survey
        + p[2] + "No previous hypothesis."
        + p[3] + inspiration_title
        + p[4] + inspiration_abstract
        + p[5])

messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
formatted += "<\uff5cAssistant\uff5c>"

inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

## Evaluation Results

### Inspiration Retrieval (Table 1)

| Model | Accuracy |
|-------|----------|
| Random Selection | 6.70% |
| R1-Distilled-Qwen-7B (base) | 28.42% |
| MS-IR-7B (single-task) | 54.37% |
| **MS-7B (this model)** | **54.34%** |

### Hypothesis Composition - Normal (Table 2)

Rubric-based evaluation with ground-truth inspirations (Judge: GPT-4o):

| Model | Total | Mot | Mec | Met | Length |
|-------|-------|-----|-----|-----|--------|
| R1-Distilled-Qwen-7B (base) | 4.05 | 1.96 | 1.30 | 0.80 | 231.02 |
| MS-HC-7B (single-task) | 4.68 | 2.13 | 1.46 | 1.09 | 204.12 |
| MS-HC-7B w/ 1x bounded | 4.74 | 2.16 | 1.48 | 1.10 | 203.84 |
| **MS-7B (this model)** | **5.02** | **2.22** | **1.59** | **1.20** | 208.98 |

### Hypothesis Composition - Bounded (Table 3)

Performance under varying levels of inspiration noise (Judge: GPT-4o):

| Model | Easy Total | Medium Total | Hard Total |
|-------|-----------|-------------|-----------|
| R1-Distilled-Qwen-7B (base) | 2.72 | 2.27 | 2.00 |
| MS-HC-7B w/ 2x bounded | 3.18 | 2.74 | 2.56 |
| **MS-7B (this model)** | **3.37** | **2.86** | **2.78** |

## Key Findings

- **IR performance preserved**: Multi-task training leaves IR accuracy essentially unchanged (54.34% vs. 54.37% for the single-task model)
- **HC significantly improved**: The multi-task model outperforms all single-task HC variants, including those trained with bounded composition augmentation (Total 5.02 vs. 4.74 in Table 2)
- **Robust under noise**: The largest improvement in bounded composition is on the Hard setting (2.78 vs. 2.56 for MS-HC-7B w/ 2x bounded), suggesting that IR reasoning skills transfer to HC
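
The gains behind the last finding can be recomputed directly from Table 3. The snippet below is only a worked-arithmetic sketch: the scores are copied from the table, and the dictionary keys (`base`, `hc_2x`, `ms_7b`) are illustrative names.

```python
# Total scores from Table 3 (bounded hypothesis composition).
bounded = {
    "Easy": {"base": 2.72, "hc_2x": 3.18, "ms_7b": 3.37},
    "Medium": {"base": 2.27, "hc_2x": 2.74, "ms_7b": 2.86},
    "Hard": {"base": 2.00, "hc_2x": 2.56, "ms_7b": 2.78},
}

# Gain of MS-7B over each baseline, rounded to the table's precision.
gains = {
    level: {
        "vs_base": round(s["ms_7b"] - s["base"], 2),
        "vs_hc_2x": round(s["ms_7b"] - s["hc_2x"], 2),
    }
    for level, s in bounded.items()
}

for level, g in gains.items():
    print(f"{level}: +{g['vs_base']:.2f} vs base, +{g['vs_hc_2x']:.2f} vs MS-HC-7B w/ 2x bounded")

# Difficulty level with the largest gain over the single-task HC model.
largest = max(gains, key=lambda level: gains[level]["vs_hc_2x"])
print(f"Largest gain over MS-HC-7B w/ 2x bounded: {largest}")
```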

## Citation

```bibtex
@article{yang2025moosestar,
  title={MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier},
  author={Yang, Zonglin and Bing, Lidong},
  journal={arXiv preprint arXiv:2603.03756},
  year={2026}
}
```