---
language:
- en
license: apache-2.0
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
tags:
- scientific-discovery
- hypothesis-generation
- inspiration-retrieval
- multi-task
datasets:
- ZonglinY/TOMATO-Star-SFT-Data-R1D-32B
library_name: transformers
pipeline_tag: text-generation
---

# MOOSE-Star-R1D-7B Model Card

## Overview

**MOOSE-Star-R1D-7B** (referred to as **MS-7B** in the paper) is a 7B-parameter multi-task language model fine-tuned for both **inspiration retrieval** and **hypothesis composition** in scientific discovery workflows. It matches the IR performance of the single-task model ([MOOSE-Star-IR-R1D-7B](https://huggingface.co/ZonglinY/MOOSE-Star-IR-R1D-7B)) while significantly outperforming the single-task HC model ([MOOSE-Star-HC-R1D-7B](https://huggingface.co/ZonglinY/MOOSE-Star-HC-R1D-7B)), all in a single unified model.

- **Paper**: [MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier](https://arxiv.org/abs/2603.03756) (arXiv:2603.03756)
- **Base Model**: [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)
- **License**: Apache 2.0
- **Code**: [ZonglinY/MOOSE-Star](https://github.com/ZonglinY/MOOSE-Star)

## Model Description

| Parameter | Value |
|-----------|-------|
| **Base Model** | DeepSeek-R1-Distill-Qwen-7B |
| **Training Method** | Full-parameter SFT (ZeRO-3) |
| **Training Data** | TOMATO-Star-SFT-Data-R1D-32B: IR split (150,218 samples) + HC split with 1x bounded (114,548 samples) |
| **Chat Template** | deepseekr1 |
| **Cutoff Length** | 16384 |
| **Learning Rate** | 1e-5 |
| **Epochs** | 1 |
| **Batch Size** | 128 |

## Task 1: Inspiration Retrieval (IR)

The model selects the most relevant **cross-paper inspiration** from 15 candidates (A-O), comprising 1 correct inspiration and 14 hard negatives.

### IR Prompt Format (Simplified Overview)

The full prompt template is constructed via `instruction_prompts()` in the code examples below.
The general structure is:

```
[Task instruction preamble]

## Context
**Research Question:** {research_question}
**Background Survey (existing methods for THIS task):** {background_survey}
**Previous Hypothesis (if any):** {previous_hypothesis_or_none}

## Candidate Inspiration Papers
### Candidate [A]
**Title:** {title_A}
**Abstract:** {abstract_A}
... (15 candidates total, A through O)

## Output Format
[reasoning process]
**Selected ID starts:** [X] **Selected ID ends**
**Selection Reason starts:** [reason] **Selection Reason ends**
```

### IR Usage

**Prerequisites**: Clone the [MOOSE-Star repo](https://github.com/ZonglinY/MOOSE-Star) for prompt templates and inference utilities:

```bash
git clone https://github.com/ZonglinY/MOOSE-Star.git && cd MOOSE-Star
# See requirements.txt for full dependencies; at minimum:
pip install transformers torch
```

#### Option A: SGLang Deployment (Recommended)

```bash
# SGLang requires a separate environment; see https://github.com/sgl-project/sglang for installation
# Start the server
python -m sglang.launch_server --model-path ZonglinY/MOOSE-Star-R1D-7B --port 1235
```

```python
import sys
sys.path.insert(0, "./Inference")
from ir_probability_extractor import IRProbabilityExtractor

extractor = IRProbabilityExtractor(base_urls=["http://localhost:1235/v1"])
result = extractor.get_selection_probabilities(
    research_question="Your research question",
    background_survey="Your background survey",
    candidates=[
        {"title": "Candidate A title", "abstract": "Candidate A abstract"},
        {"title": "Candidate B title", "abstract": "Candidate B abstract"},
        # ... up to 15 candidates (labeled A-O)
    ],
)
print(f"Selected: [{result.selected_label}]")
print(f"Probabilities: {result.probabilities}")
```

#### Option B: Direct HuggingFace Inference

```python
import sys
sys.path.insert(0, "./utils")
from prompt_store import instruction_prompts
from transformers import AutoModelForCausalLM, AutoTokenizer
import re

model_name = "ZonglinY/MOOSE-Star-R1D-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

p = instruction_prompts("inspiration_retrieval_with_reasoning_with_alphabetical_candidates")

candidates = [{"title": "...", "abstract": "..."}, ...]
candidates_text = "".join(
    f"### Candidate [{chr(ord('A') + i)}]\n**Title:** {c['title']}\n**Abstract:** {c['abstract']}\n\n"
    for i, c in enumerate(candidates)
)

research_question = "Your research question"
background_survey = "Your background survey"
prompt = (p[0] + research_question + p[1] + background_survey + p[2]
          + "No previous hypothesis." + p[3] + candidates_text + p[4])

messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
formatted += "<\uff5cAssistant\uff5c>"

inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

match = re.search(r"\*\*Selected ID starts:\*\*\s*\[(\w)\]\s*\*\*Selected ID ends\*\*", response)
if match:
    print(f"Selected: [{match.group(1)}]")
```

## Task 2: Hypothesis Composition (HC)

The model generates **delta hypotheses** from inspiration papers. Given a research question, background survey, and new inspiration paper, it outputs structured hypothesis components.
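Because the response follows a fixed set of field markers, the components can be recovered from the raw text with simple pattern matching. The sketch below is illustrative only and not part of the MOOSE-Star codebase: `parse_hc_response` is a hypothetical helper, and its regex patterns assume the Inspiration/Motivation/Mechanism/Methodology markers shown in the HC prompt format below.

```python
import re

def parse_hc_response(response: str) -> dict:
    """Hypothetical helper: extract structured HC fields from a model response.

    Patterns assume the documented output markers; a field that is absent
    (or formatted differently by the model) maps to None.
    """
    patterns = {
        "inspiration": r"Inspiration:\s*(.+)",
        "motivation": r"-\s*Motivation \(WHY\):\s*(.+)",
        "mechanism": r"-\s*Mechanism \(HOW IT WORKS\):\s*(.+)",
        "methodology": r"-\s*Methodology \(HOW IT'S INTEGRATED\):\s*(.+)",
    }
    return {
        key: (m.group(1).strip() if (m := re.search(pat, response)) else None)
        for key, pat in patterns.items()
    }

# Toy response in the documented format (content is made up for illustration)
example = (
    "Inspiration: Contrastive pretraining\n"
    "- Motivation (WHY): Labeled data is scarce in this domain.\n"
    "- Mechanism (HOW IT WORKS): Pulls paired views together in embedding space.\n"
    "- Methodology (HOW IT'S INTEGRATED): Add a contrastive loss during pretraining.\n"
)
parsed = parse_hc_response(example)
print(parsed["inspiration"])  # Contrastive pretraining
```

Real generations may interleave reasoning text before the structured block, so searching the full response (rather than splitting on lines) keeps the sketch robust to that.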
### HC Prompt Format (Simplified Overview)

The full prompt template is constructed via `instruction_prompts()` in the code examples below. The general structure is:

```
[Task instruction preamble]

## Information Provided
**Research Question**: {research_question}
**Background Survey**: {background_survey}
**Previous Hypothesis**: {previous_hypothesis_or_none}
**New Inspiration Paper Title**: {inspiration_title}
**New Inspiration Paper Abstract**: {inspiration_abstract}

## Your Response
[reasoning process]
Inspiration: [Key concept]
- Motivation (WHY): [Why this addresses a gap]
- Mechanism (HOW IT WORKS): [How the concept works]
- Methodology (HOW IT'S INTEGRATED): [Implementation steps]
```

### HC Usage

```python
import sys
sys.path.insert(0, "./utils")
from prompt_store import instruction_prompts
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ZonglinY/MOOSE-Star-R1D-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")

p = instruction_prompts("prepare_HC_sft_data_to_go_comprehensive_v2_delta")

research_question = "Your research question here"
background_survey = "Your background survey here"
inspiration_title = "Inspiration paper title"
inspiration_abstract = "Inspiration paper abstract"

prompt = (p[0] + research_question + p[1] + background_survey + p[2] + "No previous hypothesis."
          + p[3] + inspiration_title + p[4] + inspiration_abstract + p[5])

messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
formatted += "<\uff5cAssistant\uff5c>"

inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=8192, temperature=0.6, top_p=0.9, do_sample=True)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

## Evaluation Results

### Inspiration Retrieval (Table 1)

| Model | Accuracy |
|-------|----------|
| Random Selection | 6.70% |
| R1-Distilled-Qwen-7B (base) | 28.42% |
| MS-IR-7B (single-task) | 54.37% |
| **MS-7B (this model)** | **54.34%** |

### Hypothesis Composition - Normal (Table 2)

Rubric-based evaluation with ground-truth inspirations (Judge: GPT-4o). Mot/Mec/Met are the Motivation, Mechanism, and Methodology rubric scores:

| Model | Total | Mot | Mec | Met | Length |
|-------|-------|-----|-----|-----|--------|
| R1-Distilled-Qwen-7B (base) | 4.05 | 1.96 | 1.30 | 0.80 | 231.02 |
| MS-HC-7B (single-task) | 4.68 | 2.13 | 1.46 | 1.09 | 204.12 |
| MS-HC-7B w/ 1x bounded | 4.74 | 2.16 | 1.48 | 1.10 | 203.84 |
| **MS-7B (this model)** | **5.02** | **2.22** | **1.59** | **1.20** | 208.98 |

### Hypothesis Composition - Bounded (Table 3)

Performance under varying levels of inspiration noise (Judge: GPT-4o):

| Model | Easy Total | Medium Total | Hard Total |
|-------|-----------|-------------|-----------|
| R1-Distilled-Qwen-7B (base) | 2.72 | 2.27 | 2.00 |
| MS-HC-7B w/ 2x bounded | 3.18 | 2.74 | 2.56 |
| **MS-7B (this model)** | **3.37** | **2.86** | **2.78** |

## Key Findings

- **IR performance preserved**: Multi-task training maintains full IR accuracy (54.34% vs 54.37% single-task)
- **HC significantly improved**: Multi-task HC outperforms all single-task variants, including those with bounded composition augmentation
- **Robust under noise**: Largest improvements on Hard bounded composition, suggesting IR reasoning skills transfer to HC

## Citation

```bibtex
@article{yang2025moosestar,
  title={MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier},
  author={Yang, Zonglin and Bing, Lidong},
  journal={arXiv preprint arXiv:2603.03756},
  year={2026}
}
```