wshuai190 committed · Commit 704f246 · verified · 1 Parent(s): 8ae6601

Upload README.md with huggingface_hub

  1. README.md +152 -0
README.md ADDED
@@ -0,0 +1,152 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-4B
+ tags:
+ - boolean-queries
+ - systematic-review
+ - information-retrieval
+ - pubmed
+ - reinforcement-learning
+ - grpo
+ - chain-of-thought
+ library_name: transformers
+ ---
+
+ # AutoBool-Qwen4b-Reasoning
+
+ This model is part of the **AutoBool** framework, a reinforcement learning approach for training large language models to generate high-quality Boolean queries for systematic literature reviews.
+
+ ## Model Description
+
+ This variant uses **explicit chain-of-thought reasoning**. The model is instructed to reason in detail about the query construction process inside `<think></think>` tags before generating the final Boolean query.
+
+ - **Base Model:** Qwen/Qwen2.5-4B
+ - **Training Method:** GRPO (Group Relative Policy Optimization) with LoRA fine-tuning
+ - **Prompt Strategy:** Chain-of-thought reasoning
+   - System instruction: "Your reasoning process should be enclosed within `<think></think>`, and the final Boolean query must be enclosed within `<answer></answer>` tags"
+   - Output format: `<think>[Detailed step-by-step reasoning explaining the query construction process]</think><answer>[Boolean query]</answer>`
+   - Provides a full explanation of term selection, MeSH terms, field tags, wildcards, and Boolean logic
+ - **Domain:** Biomedical literature search (PubMed)
+ - **Task:** Boolean query generation for high-recall retrieval
+
+ ## Training Details
+
+ The model was trained using:
+ - **Optimization:** GRPO (Group Relative Policy Optimization)
+ - **Fine-tuning:** LoRA (Low-Rank Adaptation)
+ - **Dataset:** PubMed systematic review queries (version 1.2)
+ - **Reward Function:** Combines syntactic validity, format correctness, and retrieval effectiveness
+ - **Reasoning Approach:** Explicit thinking process with structured tags
+
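The README does not spell out how the three reward components are combined, so the following is only a toy sketch of the idea, not the reward used to train this model. The gating scheme, the balanced-parentheses check as a stand-in for syntactic validity, the use of recall as the effectiveness measure, and every function name here are assumptions made for illustration.

```python
import re

# Assumed format check: the whole response is one <think> block followed by one <answer> block.
RESPONSE_PATTERN = re.compile(
    r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*", re.DOTALL
)


def has_balanced_parentheses(query: str) -> bool:
    """Crude stand-in for syntactic validity: parentheses must nest correctly."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def recall(retrieved, relevant) -> float:
    """Fraction of relevant document ids that the query retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def toy_reward(response: str, retrieved, relevant) -> float:
    """Hypothetical reward: zero unless format and syntax pass, otherwise recall."""
    match = RESPONSE_PATTERN.fullmatch(response)
    if match is None:
        return 0.0  # format check failed
    query = match.group(2).strip()
    if not query or not has_balanced_parentheses(query):
        return 0.0  # syntactic check failed
    return recall(retrieved, relevant)
```

Gating on format and syntax before scoring retrieval is one simple design; a weighted sum of the three components would be an equally plausible reading of the description above.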
+ ## Intended Use
+
+ This model is designed for:
+ - Generating Boolean queries for systematic literature reviews
+ - High-recall biomedical information retrieval
+ - Supporting evidence synthesis in healthcare and biomedical research
+ - Applications where reasoning transparency is valuable
+
+ ## How to Use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import re
+
+ model_name = "ielabgroup/Autobool-Qwen4b-Reasoning"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Define your systematic review topic
+ topic = "Diagnostic accuracy of endoscopic ultrasonography (EUS) for the preoperative locoregional staging of primary gastric cancer"
+
+ # Construct the prompt with system and user messages
+ # (triple-quoted strings keep the multi-line prompt text valid Python)
+ messages = [
+     {"role": "system", "content": """You are an expert systematic review information specialist.
+ You are tasked to formulate a systematic review Boolean query in response to a research topic.
+ Your reasoning process should be enclosed within <think></think>, and the final Boolean query must be enclosed within <answer></answer> tags. Do not include anything outside of these tags."""},
+     {"role": "user", "content": f"""You are given a systematic review research topic, with the topic title "{topic}".
+ Your task is to generate a highly effective Boolean query in MEDLINE format for PubMed.
+ The query should balance **high recall** (capturing all relevant studies) with **reasonable precision** (avoiding irrelevant results):
+ - Use both free-text terms and MeSH terms (e.g., chronic pain[tiab], Pain[mh]).
+ - **Do not wrap terms or phrases in double quotes**, as this disables automatic term mapping (ATM).
+ - Combine synonyms or related terms within a concept using OR.
+ - Combine different concepts using AND.
+ - Use wildcards (*) to capture word variants (e.g., vaccin* → vaccine, vaccination):
+ - Terms must have ≥4 characters before the * (e.g., colo*)
+ - Wildcards work with field tags (e.g., breastfeed*[tiab]).
+ - Field tags limit the search to specific fields and disable ATM.
+ - Do not include date limits.
+ - Tag terms using appropriate fields (e.g., covid-19[ti] vaccine[ti] children[ti]) when needed.
+ **Only use the following allowed field tags:**
+ Title: [ti], Abstract: [ab], Title/Abstract: [tiab]
+ MeSH: [mh], Major MeSH: [majr], Supplementary Concept: [nm]
+ Text Words: [tw], All Fields: [all]
+ Publication Type: [pt], Language: [la]
+
+ Output your full reasoning inside <think></think>.
+ Output the final Boolean query inside <answer></answer>.
+ Do not include any content outside these tags."""},
+ ]
+
+ # Generate the query
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_length=4096)
+
+ # Decode only the newly generated tokens: the prompt itself contains empty
+ # <think></think> and <answer></answer> tags that would otherwise match first
+ generated = outputs[0][inputs["input_ids"].shape[1]:]
+ response = tokenizer.decode(generated, skip_special_tokens=True)
+
+ # Extract reasoning and query
+ reasoning_match = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
+ query_match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
+
+ if reasoning_match and query_match:
+     reasoning = reasoning_match.group(1).strip()
+     query = query_match.group(1).strip()
+     print("Reasoning:", reasoning)
+     print("\nQuery:", query)
+ ```
+
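The user prompt above lists several constraints on the generated query: a fixed set of allowed field tags, no double quotes, and at least four characters before a wildcard. A minimal sketch of a post-hoc check for those three rules might look like the following. The function name `check_query` and the regular expressions are illustrative assumptions rather than part of AutoBool, and lowercasing tags before comparison is a choice made for this sketch.

```python
import re

# Field tags allowed by the prompt shown in the usage example.
ALLOWED_TAGS = {"ti", "ab", "tiab", "mh", "majr", "nm", "tw", "all", "pt", "la"}


def check_query(query: str) -> list[str]:
    """Return a list of human-readable rule violations (empty if none found)."""
    problems = []

    # Rule 1: every [tag] must be one of the allowed field tags.
    for tag in re.findall(r"\[([^\[\]]*)\]", query):
        if tag.strip().lower() not in ALLOWED_TAGS:
            problems.append(f"disallowed field tag: [{tag}]")

    # Rule 2: double quotes are not permitted by the prompt.
    if '"' in query:
        problems.append("double quotes are not allowed")

    # Rule 3: a wildcard needs at least 4 characters before the *.
    for stem in re.findall(r"([^\s()\[\]*]*)\*", query):
        if len(stem) < 4:
            problems.append(f"wildcard stem too short: {stem}*")

    return problems
```

For example, `check_query("endosonograph*[tiab] AND Stomach Neoplasms[mh]")` returns an empty list, while a query that uses `[title]`, quoted phrases, or `eus*` is flagged. This only checks the surface rules from the prompt and does not verify that PubMed would accept the query.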
+ The model will generate output with reasoning:
+ ```
+ <think>
+ [Detailed step-by-step reasoning explaining the query construction process,
+ including term selection, MeSH terms, field tags, wildcards, and Boolean logic]
+ </think>
+ <answer>
+ [Final Boolean query]
+ </answer>
+ ```
+
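Once the Boolean query has been extracted from the `<answer>` block, it can be submitted to PubMed. The sketch below only builds a request URL for the NCBI E-utilities ESearch endpoint and does not send it. The helper name `build_esearch_url` and the default `retmax` value are assumptions for illustration; NCBI's usage guidelines (rate limits, API keys) are not handled here.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query: str, retmax: int = 100) -> str:
    """Build an ESearch URL that would return matching PubMed ids as JSON."""
    params = {
        "db": "pubmed",     # search the PubMed database
        "term": query,      # the Boolean query produced by the model
        "retmax": retmax,   # maximum number of ids to return
        "retmode": "json",  # ask for a JSON response
    }
    # urlencode percent-encodes brackets, wildcards, and spaces in the query.
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url("endosonograph*[tiab] AND Stomach Neoplasms[mh]")
print(url)
```

The resulting URL can be fetched with any HTTP client to inspect how many records the query retrieves.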
+ ## Advantages
+
+ - Provides an interpretable reasoning process
+ - Helps explain query construction decisions
+ - May improve query quality through structured thinking
+
+ ## Limitations
+
+ - Optimized specifically for PubMed Boolean query syntax
+ - Performance may vary in non-biomedical domains
+ - Requires domain knowledge for effective prompt engineering
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @inproceedings{autobool2025,
+   title={AutoBool: Reinforcement Learning for Boolean Query Generation in Systematic Reviews},
+   author={[]},
+   booktitle={Proceedings of the 2025 Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
+   year={2025}
+ }
+ ```
+
+ ## More Information
+
+ - **GitHub Repository:** [https://github.com/ielab/AutoBool](https://github.com/ielab/AutoBool)
+ - **Paper:** Accepted at EACL 2025
+
+ ## License
+
+ Apache 2.0