| # Boolean Search Query Model | |
| This model is fine-tuned to convert natural language queries into boolean search expressions, optimized for academic and research database searching. | |
| ## Model Details | |
| - **Base Model**: Meta-Llama-3.1-8B | |
| - **Training Type**: LoRA fine-tuning | |
| - **Task**: Converting natural language to boolean search queries | |
| - **Languages**: English | |
| - **License**: Same as base model | |
| ## Intended Use | |
| - Converting natural language search requests into proper boolean expressions | |
| - Academic and research database searching | |
| - Information retrieval query formulation | |
| ## Performance | |
| ### Test Results | |
| Base Model vs Fine-tuned Model comparison: | |
| ``` | |
| Natural Query: "Studies examining the relationship between exercise and mental health" | |
| Base: exercise AND mental health | |
| Fine-tuned: exercise AND "mental health" # Properly handles multi-word terms | |
| Natural Query: "Articles about artificial intelligence ethics and regulation or policy" | |
| Base: "artificial intelligence ethics" AND ("regulation" OR "policy") # Treats AI ethics as one concept | |
| Fine-tuned: "artificial intelligence" AND (ethics OR regulation OR policy) # Properly splits concepts | |
| ``` | |
| ### Key Improvements | |
| 1. Meta-term Removal | |
| - Automatically removes terms like "articles", "papers", "research", "studies" | |
| - Focuses on actual search concepts | |
| 2. Proper Term Quoting | |
| - Only quotes multi-word phrases | |
| - Single words remain unquoted | |
| 3. Logical Grouping | |
| - Appropriate use of parentheses for OR groups | |
| - Clear operator precedence | |
| 4. Minimal Formatting | |
| - No unnecessary parentheses | |
| - No duplicate terms | |
| ## Limitations | |
| - English language only | |
| - May not handle specialized domain terminology optimally | |
| - Limited to boolean operators (AND, OR, NOT) | |
| - Designed for academic/research context | |
| ## Training Data | |
| The model was trained on a curated dataset of natural language queries paired with their correct boolean translations. Dataset characteristics: | |
| - Size: 135 examples | |
| - Format: Natural query β Boolean expression pairs | |
| - Source: Manually curated academic search examples | |
| - Validation: Expert-reviewed for accuracy | |
| ## Training Process | |
| - **Method**: LoRA fine-tuning | |
| - **Hardware**: NVIDIA GeForce RTX 4070 Ti SUPER | |
| ## How to Use | |
| ```python | |
| from unsloth import FastLanguageModel | |
| # Load model | |
| model, tokenizer = FastLanguageModel.from_pretrained( | |
| "Zwounds/boolean-search-model", | |
| max_seq_length=2048, | |
| dtype=None, # Auto-detect | |
| load_in_4bit=True | |
| ) | |
| FastLanguageModel.for_inference(model) | |
| # Format query | |
| query = "Find papers about climate change and renewable energy" | |
| formatted = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. | |
| ### Instruction: | |
| Convert this natural language query into a boolean search query by following these rules: | |
| 1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output): | |
| - articles, papers, research, studies | |
| - examining, investigating, analyzing | |
| - findings, documents, literature | |
| - publications, journals, reviews | |
| Example: "Research examining X" β just "X" | |
| 2. SECOND: Remove generic implied terms that don't add search value: | |
| - Remove words like "practices," "techniques," "methods," "approaches," "strategies" | |
| - Remove words like "impacts," "effects," "influences," "role," "applications" | |
| - For example: "sustainable agriculture practices" β "sustainable agriculture" | |
| - For example: "teaching methodologies" β "teaching" | |
| - For example: "leadership styles" β "leadership" | |
| 3. THEN: Format the remaining terms: | |
| CRITICAL QUOTING RULES: | |
| - Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS | |
| - Examples of correct quoting: | |
| - Wrong: machine learning AND deep learning | |
| - Right: "machine learning" AND "deep learning" | |
| - Wrong: natural language processing | |
| - Right: "natural language processing" | |
| - Single words must NEVER have quotes (e.g., science, research, learning) | |
| - Use AND to connect required concepts | |
| - Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity)) | |
| Example conversions showing proper quoting: | |
| "Research on machine learning for natural language processing" | |
| β "machine learning" AND "natural language processing" | |
| "Studies examining anxiety depression stress in workplace" | |
| β (anxiety OR depression OR stress) AND workplace | |
| "Articles about deep learning impact on computer vision" | |
| β "deep learning" AND "computer vision" | |
| "Research on sustainable agriculture practices and their impact on soil health or biodiversity" | |
| β "sustainable agriculture" AND ("soil health" OR biodiversity) | |
| "Articles about effective teaching methods for second language acquisition" | |
| β teaching AND "second language acquisition" | |
| ### Input: | |
| {query} | |
| ### Response: | |
| """ | |
| # Generate boolean query | |
| inputs = tokenizer(formatted, return_tensors="pt") | |
| outputs = model.generate(**inputs, max_new_tokens=100) | |
| result = tokenizer.decode(outputs[0], skip_special_tokens=True) | |
| print(result) # "climate change" AND "renewable energy" | |
| ``` | |
| ## Citation | |
| If you use this model in your research, please cite: | |
| ```bibtex | |
| @misc{boolean-search-llm, | |
| title={Boolean Search Query LLM}, | |
| author={Stephen Zweibel}, | |
| year={2025}, | |
| publisher={Hugging Face}, | |
| url={https://huggingface.co/Zwounds/boolean-search-model} | |
| } | |