	# Boolean Search Query Model

	This model is a fine-tune of Meta-Llama-3.1-8B that converts natural language queries into boolean search expressions for academic and research databases.

	## Model Details

	- Base Model: Meta-Llama-3.1-8B
	- Training Type: LoRA fine-tuning
	- Task: Converting natural language to boolean search queries
	- Languages: English
	- License: Same as base model (Llama 3.1 Community License)

	## Intended Use

	- Converting natural language search requests into proper boolean expressions
	- Academic and research database searching
	- Information retrieval query formulation

	## Performance

	### Test Results

	Base Model vs Fine-tuned Model comparison:

	```
	Natural Query: "Studies examining the relationship between exercise and mental health"
	Base: exercise AND mental health
	Fine-tuned: exercise AND "mental health" # Properly handles multi-word terms

	Natural Query: "Articles about artificial intelligence ethics and regulation or policy"
	Base: "artificial intelligence ethics" AND ("regulation" OR "policy") # Treats AI ethics as one concept
	Fine-tuned: "artificial intelligence" AND (ethics OR regulation OR policy) # Properly splits concepts
	```

	### Key Improvements

	1. Meta-term Removal
	   - Automatically removes terms like "articles", "papers", "research", "studies"
	   - Focuses on actual search concepts

	2. Proper Term Quoting
	   - Only quotes multi-word phrases
	   - Single words remain unquoted

	3. Logical Grouping
	   - Appropriate use of parentheses for OR groups
	   - Clear operator precedence

	4. Minimal Formatting
	   - No unnecessary parentheses
	   - No duplicate terms
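
The rules above can be checked mechanically on a generated query. The following is a minimal sketch, not part of the model or its training code: the function name `check_boolean_query`, the placeholder token, and the short meta-term set are assumptions made for illustration, and the check is a heuristic rather than a full boolean-query parser.

```python
import re

# Meta-terms named in this model card; the set is illustrative, not exhaustive.
META_TERMS = {"articles", "papers", "research", "studies"}
OPERATORS = {"AND", "OR", "NOT"}
PHRASE_TOKEN = "<PHRASE>"  # hypothetical placeholder standing in for a quoted phrase


def check_boolean_query(query: str) -> list[str]:
    """Return a list of rule violations found in a boolean query string."""
    problems = []

    # Quoting rule: a quoted span should contain more than one word.
    for phrase in re.findall(r'"([^"]*)"', query):
        if len(phrase.split()) < 2:
            problems.append(f'single word in quotes: "{phrase}"')

    # Replace quoted phrases with a placeholder, drop parentheses, then tokenize.
    bare = re.sub(r'"[^"]*"', f" {PHRASE_TOKEN} ", query)
    tokens = bare.replace("(", " ").replace(")", " ").split()

    previous_was_term = False
    for token in tokens:
        if token in OPERATORS:
            previous_was_term = False
            continue
        # Meta-term rule: these words should not survive into the output.
        if token.lower() in META_TERMS:
            problems.append(f"meta-term present: {token}")
        # Two terms in a row mean an unquoted phrase or a missing operator.
        if previous_was_term:
            problems.append(f"adjacent terms without quotes or operator near: {token}")
        previous_was_term = True

    # Grouping rule: parentheses should at least be balanced.
    if query.count("(") != query.count(")"):
        problems.append("unbalanced parentheses")

    return problems
```

Applied to the comparison above, `check_boolean_query('exercise AND mental health')` reports one adjacent-term problem, while the fine-tuned output `exercise AND "mental health"` returns an empty list.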

	## Limitations

	- English language only
	- May not handle specialized domain terminology optimally
	- Limited to boolean operators (AND, OR, NOT)
	- Designed for academic/research context

	## Training Data

	The model was trained on a curated dataset of natural language queries paired with their correct boolean translations. Dataset characteristics:

	- Size: 135 examples
	- Format: Natural query → Boolean expression pairs
	- Source: Manually curated academic search examples
	- Validation: Expert-reviewed for accuracy
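
To make the pair format concrete, here is a small sketch of how such pairs could be stored as JSON Lines. The on-disk format of the actual dataset is not documented here, so the field names `query` and `boolean` and both helper functions are assumptions. The two records reuse example conversions from the prompt in the usage section of this card.

```python
import json

# Hypothetical record layout; the field names "query" and "boolean" are assumptions.
PAIRS = [
    {
        "query": "Research on machine learning for natural language processing",
        "boolean": '"machine learning" AND "natural language processing"',
    },
    {
        "query": "Studies examining anxiety depression stress in workplace",
        "boolean": "(anxiety OR depression OR stress) AND workplace",
    },
]


def to_jsonl(pairs: list[dict]) -> str:
    """Serialize pairs as JSON Lines, one record per line."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


def from_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

One record per line keeps a small, manually curated set easy to review and diff during expert validation.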

	## Training Process

	- Method: LoRA fine-tuning
	- Hardware: NVIDIA GeForce RTX 4070 Ti SUPER

	## How to Use

	```python
	from unsloth import FastLanguageModel

	# Load model
	model, tokenizer = FastLanguageModel.from_pretrained(
	"Zwounds/boolean-search-model",
	max_seq_length=2048,
	dtype=None, # Auto-detect
	load_in_4bit=True
	)
	FastLanguageModel.for_inference(model)

	# Format query
	query = "Find papers about climate change and renewable energy"
	formatted = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

	### Instruction:
	Convert this natural language query into a boolean search query by following these rules:

	1. FIRST: Remove all meta-terms from this list (they should NEVER appear in output):
	- articles, papers, research, studies
	- examining, investigating, analyzing
	- findings, documents, literature
	- publications, journals, reviews
	Example: "Research examining X" → just "X"

	2. SECOND: Remove generic implied terms that don't add search value:
	- Remove words like "practices," "techniques," "methods," "approaches," "strategies"
	- Remove words like "impacts," "effects," "influences," "role," "applications"
	- For example: "sustainable agriculture practices" → "sustainable agriculture"
	- For example: "teaching methodologies" → "teaching"
	- For example: "leadership styles" → "leadership"

	3. THEN: Format the remaining terms:
	CRITICAL QUOTING RULES:
	- Multi-word phrases MUST ALWAYS be in quotes - NO EXCEPTIONS
	- Examples of correct quoting:
	- Wrong: machine learning AND deep learning
	- Right: "machine learning" AND "deep learning"
	- Wrong: natural language processing
	- Right: "natural language processing"
	- Single words must NEVER have quotes (e.g., science, research, learning)
	- Use AND to connect required concepts
	- Use OR with parentheses for alternatives (e.g., ("soil health" OR biodiversity))

	Example conversions showing proper quoting:
	"Research on machine learning for natural language processing"
	→ "machine learning" AND "natural language processing"

	"Studies examining anxiety depression stress in workplace"
	→ (anxiety OR depression OR stress) AND workplace

	"Articles about deep learning impact on computer vision"
	→ "deep learning" AND "computer vision"

	"Research on sustainable agriculture practices and their impact on soil health or biodiversity"
	→ "sustainable agriculture" AND ("soil health" OR biodiversity)

	"Articles about effective teaching methods for second language acquisition"
	→ teaching AND "second language acquisition"

	### Input:
	{query}

	### Response:
	"""

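	# Generate boolean query
	inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=100)
	# outputs[0] contains the prompt followed by the completion; decode only the new tokens
	prompt_len = inputs["input_ids"].shape[1]
	result = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
	print(result) # e.g. "climate change" AND "renewable energy"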
	```
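
Depending on how the output is decoded, the text may still contain the full prompt, and the model may keep generating after the query itself. The helper below is a minimal sketch for isolating the boolean expression. The name `extract_boolean_query` is hypothetical, and the helper assumes the answer is the first non-empty line after the `### Response:` marker.

```python
RESPONSE_MARKER = "### Response:"


def extract_boolean_query(decoded: str) -> str:
    """Return the first non-empty line after the response marker.

    If the marker is absent (the prompt was already stripped),
    the whole string is searched instead.
    """
    head, sep, tail = decoded.partition(RESPONSE_MARKER)
    text = tail if sep else head
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return ""
```

Calling `extract_boolean_query(result)` on the decoded string then yields only the query line, whether or not the prompt was included in `result`.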

	## Citation

	If you use this model in your research, please cite:
	```bibtex
	@misc{boolean-search-llm,
	title={Boolean Search Query LLM},
	author={Stephen Zweibel},
	year={2025},
	publisher={Hugging Face},
	url={https://huggingface.co/Zwounds/boolean-search-model}
	}
	```