---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- boolean-queries
- systematic-review
- information-retrieval
- pubmed
- reinforcement-learning
- grpo
- few-shot
library_name: transformers
---

# AutoBool-Qwen4b-Reasoning-objective

This model is part of the **AutoBool** framework, a reinforcement learning approach for training large language models to generate high-quality Boolean queries for systematic literature reviews.

## Model Description

This variant uses the **objective method** grounded in domain expertise and structured logic. The model simulates a relevant article and extracts key terms to construct the Boolean query, following a systematic 6-step process.

- **Base Model:** Qwen/Qwen3-4B
- **Training Method:** GRPO (Group Relative Policy Optimization) with LoRA fine-tuning
- **Prompt Strategy:** Objective method (hypothetical article simulation)
  - **Step 1:** Simulate a concise title and abstract (2-3 sentences) of a relevant and focused article
  - **Step 2:** Identify key informative terms or phrases from the simulated text
  - **Step 3:** Categorize each term into: (A) Health conditions/populations, (B) Treatments/interventions, (C) Study designs, or (N/A)
  - **Steps 4-5:** Build the Boolean query from the categorized terms, using appropriate field tags and wildcards
  - **Step 6:** Combine all category blocks using AND
  - Output format: `<think>[Simulated abstract + term extraction + categorization + query construction]</think><answer>[Boolean query]</answer>`
- **Domain:** Biomedical literature search (PubMed)
- **Task:** Boolean query generation for high-recall retrieval
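
The block-combination logic in Steps 4-6 can be illustrated with a minimal sketch. It is not taken from the AutoBool codebase: the helper names `build_block` and `build_query` and the example terms are hypothetical, and the terms are assumed to be tagged already.

```python
def build_block(terms):
    """OR together the tagged terms of one category and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(categories):
    """AND together the non-empty category blocks (Step 6)."""
    blocks = [build_block(terms) for terms in categories.values() if terms]
    return "(" + " AND ".join(blocks) + ")"


# Hypothetical categorised terms, for illustration only
categories = {
    "A": ["pancrea*[tiab]", "Pancreatic Neoplasms[mh]"],  # health conditions
    "B": ["imaging[tiab]", "Diagnostic Imaging[mh]"],     # interventions
    "C": [],                                              # no study-design terms
}

query = build_query(categories)
print(query)
```

Empty categories are skipped here so that no `()` block ends up in the query; whether the model itself omits empty categories is an assumption.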

## Training Details

The model was trained using:
- **Optimization:** GRPO (Group Relative Policy Optimization)
- **Fine-tuning:** LoRA (Low-Rank Adaptation)
- **Dataset:** wshuai190/pubmed-pmc-sr-filtered
- **Reward Function:** Combines syntactic validity, format correctness, and retrieval effectiveness
- **Learning Approach:** Example-based pattern recognition
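
To give a rough idea of what "group relative" means in GRPO, the sketch below normalises the rewards of several completions sampled for the same prompt against the group mean and standard deviation. The reward values, the `eps` constant, and the use of the population standard deviation are assumptions for illustration; this is not the AutoBool training code, and the actual reward weighting is not specified here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion relative to the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four queries sampled for one review topic
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group average receive a positive advantage and are reinforced; those below average receive a negative one.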

## 🚀 Interactive Demo

Try our query generation models directly in your browser. The demo lets you test the different reasoning strategies (Standard, Conceptual, Objective, and No-Reasoning) in real time.

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/wshuai190/AutoBool-Demo)

- **Live Demo:** [AutoBool on Hugging Face Spaces](https://huggingface.co/spaces/wshuai190/AutoBool-Demo)


## Intended Use

This model is designed for:
- Generating Boolean queries for systematic literature reviews
- High-recall biomedical information retrieval
- Supporting evidence synthesis in healthcare and biomedical research
- Applications benefiting from example-guided generation

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import re

model_name = "ielabgroup/Autobool-Qwen4b-Reasoning-objective"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Define your systematic review topic
topic = "Imaging modalities for characterising focal pancreatic lesions"

# Construct the prompt with system and user messages
messages = [
    # Triple-quoted strings are needed because both prompts span multiple lines
    {"role": "system", "content": """You are an expert systematic review information specialist.
You are tasked to formulate a systematic review Boolean query step by step as a reasoning process within <think> </think>, and provide the Boolean query formulated <answer> </answer>."""},
    {"role": "user", "content": f"""You are given a systematic review research topic, with the topic title "{topic}".
You need to simulate a Boolean query construction process using the **objective method**, which is grounded in domain expertise and structured logic.

**Step 1**: Simulate a concise title and abstract (2–3 sentences) of a *relevant and focused* article clearly aligned with the topic. This is a hypothetical but plausible example.

**Step 2**: Based on the simulated text, identify *key informative terms or phrases* that best represent the article's core concepts. Prioritise specificity and informativeness. Avoid overly broad or ambiguous terms.

**Step 3**: Categorise each term into one of the following:
- (A) Health conditions or populations (e.g., diabetes, adolescents)
- (B) Treatments, interventions, or exposures (e.g., insulin therapy, air pollution)
- (C) Study designs or methodologies (e.g., randomized controlled trial, cohort study)
- (N/A) Not applicable to any of the above categories

**Step 4**: Using the categorised terms, build a Boolean query in MEDLINE format for PubMed:
- Combine synonyms or related terms within each category using OR
- Use both free-text terms and MeSH terms (e.g., chronic pain[tiab], Pain[mh])
- **Do not wrap terms or phrases in double quotes**, as this disables automatic term mapping (ATM)
- Tag each term individually when needed (e.g., covid-19[ti] vaccine[ti] children[ti])
- Field tags limit the search to specific fields and disable ATM

**Step 5**: Use wildcards (*) to capture word variants (e.g., vaccin* → vaccine, vaccination):
  - Terms must have ≥4 characters before the * (e.g., colo*)
  - Wildcards work with field tags (e.g., breastfeed*[tiab]).

**Step 6**: Combine all category blocks using AND:
((itemA1[tiab] OR itemA2[tiab] OR itemA3[mh]) AND (itemB1[tiab] OR ...) AND (itemC1[tiab] OR ...))

**Only use the following allowed field tags:**
Title: [ti], Abstract: [ab], Title/Abstract: [tiab]
MeSH: [mh], Major MeSH: [majr], Supplementary Concept: [nm]
Text Words: [tw], All Fields: [all]
Publication Type: [pt], Language: [la]

Place your full reasoning (including simulated abstract, term list, classification, and query construction) inside <think></think>.
Output the final Boolean query inside <answer></answer>.
Do not include anything outside the <think> and <answer> tags.
Do not include date restrictions."""}
]

# Generate the query
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=4096)
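# Decode only the newly generated tokens: the prompt itself contains <think> and
# <answer> tags, which the regexes below would otherwise match first
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)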

# Extract reasoning and query
reasoning_match = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
query_match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)

if reasoning_match and query_match:
    reasoning = reasoning_match.group(1).strip()
    query = query_match.group(1).strip()
    print("Objective method reasoning (simulated article + term extraction):", reasoning)
    print("\nQuery:", query)
```
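
Before running a generated query against PubMed, it can help to sanity-check it against the constraints stated in the prompt. The following is a minimal sketch, not part of AutoBool: the function name `check_query` is hypothetical, and it only checks parenthesis balance, the allowed field tags, and the no-double-quotes rule. It does not validate MeSH headings or PubMed syntax in general.

```python
import re

# Field tags permitted by the prompt above
ALLOWED_TAGS = {"ti", "ab", "tiab", "mh", "majr", "nm", "tw", "all", "pt", "la"}


def check_query(query):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Parentheses must be balanced and never close before they open
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched closing parenthesis")
                break
    if depth > 0:
        problems.append("unmatched opening parenthesis")

    # Every [tag] must come from the allowed list
    for tag in re.findall(r"\[([^\[\]]*)\]", query):
        if tag.lower() not in ALLOWED_TAGS:
            problems.append(f"field tag not in allowed list: [{tag}]")

    # The prompt asks the model not to quote terms
    if '"' in query:
        problems.append("double quotes disable automatic term mapping")

    return problems


print(check_query("((pancrea*[tiab] OR Pancreatic Neoplasms[mh]) AND (imaging[tiab]))"))
print(check_query("(cancer[au]"))
```

A non-empty result suggests regenerating or manually repairing the query before using it for retrieval.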

## Advantages

- Simulates real-world article abstracts to ground query construction
- Systematic categorization of terms (Health conditions, Interventions, Study designs)
- Grounded in domain expertise and structured logic
- May identify more relevant and specific search terms through hypothetical article simulation

## Limitations

- Optimized specifically for PubMed Boolean query syntax
- Performance may vary on non-biomedical domains
- Requires domain knowledge for effective prompt engineering

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{autobool2026,
  title={AutoBool: Reinforcement Learning for Boolean Query Generation in Systematic Reviews},
  author={Wang, Shuai and Scells, Harrisen and Koopman, Bevan and Zuccon, Guido},
  booktitle={Proceedings of the 2026 Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2026}
}
```

## More Information

- **GitHub Repository:** [https://github.com/ielab/AutoBool](https://github.com/ielab/AutoBool)
- **Paper:** Accepted at EACL 2026

## License

Apache 2.0