---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- boolean-queries
- systematic-review
- information-retrieval
- pubmed
- reinforcement-learning
- grpo
- few-shot
library_name: transformers
---
# AutoBool-Qwen4b-Reasoning-objective
This model is part of the **AutoBool** framework, a reinforcement learning approach for training large language models to generate high-quality Boolean queries for systematic literature reviews.
## Model Description
This variant uses the **objective method** grounded in domain expertise and structured logic. The model simulates a relevant article and extracts key terms to construct the Boolean query, following a systematic 6-step process.
- **Base Model:** Qwen/Qwen3-4B
- **Training Method:** GRPO (Group Relative Policy Optimization) with LoRA fine-tuning
- **Prompt Strategy:** Objective method (hypothetical article simulation)
- **Step 1:** Simulate a concise title and abstract (2-3 sentences) of a relevant and focused article
- **Step 2:** Identify key informative terms or phrases from the simulated text
- **Step 3:** Categorize each term into: (A) Health conditions/populations, (B) Treatments/interventions, (C) Study designs, or (N/A)
- **Step 4-5:** Build Boolean query using categorized terms with appropriate field tags and wildcards
- **Step 6:** Combine all category blocks using AND
- Output format: `<think>[Simulated abstract + term extraction + categorization + query construction]</think><answer>[Boolean query]</answer>`
- **Domain:** Biomedical literature search (PubMed)
- **Task:** Boolean query generation for high-recall retrieval
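The assembly performed in Steps 4–6 can be sketched in a few lines of Python. This is an illustrative sketch only; the helper functions and term lists below are hypothetical and not part of the model's code:

```python
# Hypothetical sketch of Steps 4-6: OR-join the tagged terms within each
# category, then AND-combine the non-empty category blocks.

def build_block(terms):
    """OR-join the tagged terms of one category into a parenthesised block."""
    return "(" + " OR ".join(terms) + ")"

def build_query(category_blocks):
    """AND-combine the non-empty category blocks into the final query."""
    blocks = [build_block(t) for t in category_blocks if t]
    return "(" + " AND ".join(blocks) + ")"

query = build_query([
    ["pancreatic neoplasms[mh]", "pancreatic lesion*[tiab]"],          # (A) conditions
    ["magnetic resonance imaging[mh]", "computed tomography[tiab]"],   # (B) interventions
    [],                                                                # (C) empty, so dropped
])
print(query)
```

The model produces the full query directly inside its `<answer>` tags; this sketch only shows the target structure.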
## Training Details
The model was trained using:
- **Optimization:** GRPO (Group Relative Policy Optimization)
- **Fine-tuning:** LoRA (Low-Rank Adaptation)
- **Dataset:** wshuai190/pubmed-pmc-sr-filtered
- **Reward Function:** Combines syntactic validity, format correctness, and retrieval effectiveness
- **Learning Approach:** Example-based pattern recognition
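The reward function is described only at a high level. A minimal sketch of how such a composite reward could be assembled is shown below; the component names, weights, and the simple weighted-sum form are assumptions for illustration, not the values used in training:

```python
def composite_reward(syntax_valid, format_ok, retrieval_score,
                     w_syntax=0.2, w_format=0.2, w_retrieval=0.6):
    """Weighted sum of a syntactic-validity check, a format check, and a
    retrieval effectiveness score in [0, 1]. Weights are illustrative only."""
    return (w_syntax * float(syntax_valid)
            + w_format * float(format_ok)
            + w_retrieval * retrieval_score)

# A well-formed query with moderate retrieval effectiveness:
print(composite_reward(True, True, 0.5))
```

In GRPO, a scalar reward like this is computed per sampled query and advantages are taken relative to the group of samples for the same topic.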
## 🚀 Interactive Demo
Try out our query generation models directly in your browser! The demo allows you to test our different reasoning strategies (Standard, Conceptual, Objective, and No-Reasoning) in real-time.
* **Live Demo:** [AutoBool on Hugging Face Spaces](https://huggingface.co/spaces/wshuai190/AutoBool-Demo)
## Intended Use
This model is designed for:
- Generating Boolean queries for systematic literature reviews
- High-recall biomedical information retrieval
- Supporting evidence synthesis in healthcare and biomedical research
- Applications benefiting from example-guided generation
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import re
model_name = "ielabgroup/Autobool-Qwen4b-Reasoning-objective"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Define your systematic review topic
topic = "Imaging modalities for characterising focal pancreatic lesions"
# Construct the prompt with system and user messages
system_prompt = (
    "You are an expert systematic review information specialist. "
    "You are tasked to formulate a systematic review Boolean query step by step "
    "as a reasoning process within <think> </think>, and provide the formulated "
    "Boolean query within <answer> </answer>."
)

user_prompt = f"""You are given a systematic review research topic, with the topic title "{topic}".
You need to simulate a Boolean query construction process using the **objective method**, which is grounded in domain expertise and structured logic.
**Step 1**: Simulate a concise title and abstract (2–3 sentences) of a *relevant and focused* article clearly aligned with the topic. This is a hypothetical but plausible example.
**Step 2**: Based on the simulated text, identify *key informative terms or phrases* that best represent the article's core concepts. Prioritise specificity and informativeness. Avoid overly broad or ambiguous terms.
**Step 3**: Categorise each term into one of the following:
- (A) Health conditions or populations (e.g., diabetes, adolescents)
- (B) Treatments, interventions, or exposures (e.g., insulin therapy, air pollution)
- (C) Study designs or methodologies (e.g., randomized controlled trial, cohort study)
- (N/A) Not applicable to any of the above categories
**Step 4**: Using the categorised terms, build a Boolean query in MEDLINE format for PubMed:
- Combine synonyms or related terms within each category using OR
- Use both free-text terms and MeSH terms (e.g., chronic pain[tiab], Pain[mh])
- **Do not wrap terms or phrases in double quotes**, as this disables automatic term mapping (ATM)
- Tag each term individually when needed (e.g., covid-19[ti] vaccine[ti] children[ti])
- Field tags limit the search to specific fields and disable ATM
**Step 5**: Use wildcards (*) to capture word variants (e.g., vaccin* → vaccine, vaccination):
- Terms must have ≥4 characters before the * (e.g., colo*)
- Wildcards work with field tags (e.g., breastfeed*[tiab]).
**Step 6**: Combine all category blocks using AND:
((itemA1[tiab] OR itemA2[tiab] OR itemA3[mh]) AND (itemB1[tiab] OR ...) AND (itemC1[tiab] OR ...))
**Only use the following allowed field tags:**
Title: [ti], Abstract: [ab], Title/Abstract: [tiab]
MeSH: [mh], Major MeSH: [majr], Supplementary Concept: [nm]
Text Words: [tw], All Fields: [all]
Publication Type: [pt], Language: [la]
Place your full reasoning (including simulated abstract, term list, classification, and query construction) inside <think></think>.
Output the final Boolean query inside <answer></answer>.
Do not include anything outside the <think> and <answer> tags.
Do not include date restrictions."""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
# Generate the query
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=4096)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract reasoning and query
reasoning_match = re.search(r'<think>(.*?)</think>', response, re.DOTALL)
query_match = re.search(r'<answer>(.*?)</answer>', response, re.DOTALL)
if reasoning_match and query_match:
    reasoning = reasoning_match.group(1).strip()
    query = query_match.group(1).strip()
    print("Objective method reasoning (simulated article + term extraction):", reasoning)
    print("\nQuery:", query)
```
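Before submitting a generated query to PubMed, it can be worth checking it against the constraints the prompt imposes (balanced parentheses, the allowed field tags, ≥4 characters before a wildcard). The checks below are an illustrative sketch, not part of AutoBool; the tag list mirrors the allowed tags in the prompt above:

```python
import re

# Allowed field tags, mirroring the prompt's list.
ALLOWED_TAGS = {"ti", "ab", "tiab", "mh", "majr", "nm", "tw", "all", "pt", "la"}

def check_query(query):
    """Return a list of human-readable violations for a candidate query."""
    problems = []
    # Parentheses must balance.
    if query.count("(") != query.count(")"):
        problems.append("unbalanced parentheses")
    # Only the allowed field tags may appear.
    for tag in re.findall(r"\[([a-z]+)\]", query):
        if tag not in ALLOWED_TAGS:
            problems.append(f"disallowed field tag [{tag}]")
    # Wildcards need at least 4 characters before the *.
    for stem in re.findall(r"(\w+)\*", query):
        if len(stem) < 4:
            problems.append(f"wildcard stem too short: {stem}*")
    return problems

print(check_query("(vac*[tiab] AND covid-19[xx])"))
```

An empty list means the query passed all three checks; otherwise each violation is reported separately.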
## Advantages
- Simulates real-world article abstracts to ground query construction
- Systematic categorization of terms (Health conditions, Interventions, Study designs)
- Grounded in domain expertise and structured logic
- May identify more relevant and specific search terms through hypothetical article simulation
## Limitations
- Optimized specifically for PubMed Boolean query syntax
- Performance may vary on non-biomedical domains
- Requires domain knowledge for effective prompt engineering
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{autobool2026,
title={AutoBool: Reinforcement Learning for Boolean Query Generation in Systematic Reviews},
  author={Shuai Wang and Harrisen Scells and Bevan Koopman and Guido Zuccon},
booktitle={Proceedings of the 2026 Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
  year={2026}
}
```
## More Information
- **GitHub Repository:** [https://github.com/ielab/AutoBool](https://github.com/ielab/AutoBool)
- **Paper:** Accepted at EACL 2026
## License
Apache 2.0