---
license: mit
datasets:
- alexdong/query-reformulation
language:
- en
metrics:
- accuracy
base_model:
- google-t5/t5-small
pipeline_tag: text2text-generation
---

# QRKB: A Synthetic Query Reformulation Model for Knowledge Graphs

[License: MIT](https://opensource.org/licenses/MIT)
[Hugging Face Model](https://huggingface.co/alexdong/query-reformulation-knowledge-base-t5-small/)
[Dataset](https://huggingface.co/datasets/alexdong/query-reformulation)
[Source Code](https://github.com/alexdong/query-reformulation)

This repository contains the model and code for training and evaluating a query reformulation model on the QRKB-16k dataset. The model is designed to transform natural language queries into a set of structured subqueries suitable for retrieval from knowledge graphs like DBpedia and Wikidata. This is particularly useful for Retrieval-Augmented Generation (RAG) applications.

## Overview

The model takes a natural language query as input and outputs a sequence of subqueries, each representing a semantic triple (subject-predicate-object). These subqueries can be used to directly query a knowledge graph using SPARQL. The model also predicts paraphrased variations of the input query.

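For instance, a chaining query could plausibly decompose into a sequence of triples like the following (a hypothetical illustration of the output format only; the actual phrasing of the model's subqueries may differ):

```
Input:  What is the capital of the country that contains the administrative
        division where the national park housing Mount Aspiring is located?

Output: Mount Aspiring located-in ?national_park
        ?national_park part-of ?administrative_division
        ?administrative_division part-of ?country
        ?country capital ?capital
```
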
**Key Features:**

* **Query Decomposition:** Breaks down complex queries into smaller, manageable subqueries.
* **Knowledge Graph Compatibility:** Outputs subqueries that can be easily executed against knowledge graphs like DBpedia and Wikidata.
* **Sentence Similarity:** Trained on paraphrased variations of the input query, making it robust to different phrasings.
* **Three Reformulation Categories:** Handles comparison, chaining, and expansion query types.
* **Fine-tuned on QRKB-16k:** Trained on a high-quality, synthetically generated dataset designed specifically for this task.

## Model Architecture

The model is based on the [sequence-to-sequence architecture](https://arxiv.org/abs/1409.3215) with a pre-trained [Transformer](https://arxiv.org/abs/1706.03762) backbone. Specifically, we fine-tune [T5-small](https://huggingface.co/google-t5/t5-small), which provides both the encoder and the decoder.

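As a quick sanity check, the backbone can be loaded on its own and its size inspected (t5-small has roughly 60M parameters):

```python
from transformers import AutoModelForSeq2SeqLM

backbone = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
print(f"{sum(p.numel() for p in backbone.parameters()):,} parameters")  # roughly 60M
```
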
## Installation

```bash
pip install -r requirements.txt
```

**`requirements.txt`**

```
transformers
torch
datasets
sentencepiece
rouge_score # For ROUGE evaluation
nltk
```

## Usage

### 1. Loading the Model and Tokenizer

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alexdong/query-reformulation-knowledge-base-t5-small"  # or a local checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```

### 2. Inference

```python
def reformulate_query(query):
    inputs = tokenizer(query, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)  # Adjust generation parameters as needed
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Subqueries are newline-separated, matching the dataset format;
    # adjust if you trained with a special separator token instead
    subqueries = decoded_output.split("\n")
    return subqueries


query = "What is the capital of the country that contains the administrative division where the national park housing Mount Aspiring is located?"
subqueries = reformulate_query(query)
print(f"Original Query: {query}")
print(f"Subqueries: {subqueries}")
```
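
Each decoded subquery is meant to map onto a SPARQL pattern. As a minimal sketch of that final step, here is how one triple might be executed against DBpedia (this assumes the `SPARQLWrapper` package, which is not in `requirements.txt`; the entity and property names are illustrative, not produced by the model):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL rendering of a single subquery triple,
# e.g. "Mount Aspiring located-in ?park"
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?park WHERE {
        dbr:Mount_Aspiring dbo:locatedInArea ?park .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["park"]["value"])
```
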
### 3. Training (Example)

```python
import numpy as np
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load the dataset (a DatasetDict; or load from local files)
dataset = load_dataset("alexdong/query-reformulation")

# NOTE: the splits must be tokenized before training;
# see the preprocessing sketch under "Important Notes on Training" below.

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=3,
    predict_with_generate=True,  # Important for generation tasks
    fp16=True,  # Use mixed precision if your GPU supports it
    # Add other arguments as needed
)

# Define a data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# ROUGE-L scorer, reused across evaluation calls
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # With predict_with_generate=True the predictions are generated token IDs
    # (not logits); replace any -100 padding before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Labels use -100 for padded positions; restore the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # --- ROUGE-L ---
    rouge_l_scores = [
        scorer.score(label, pred)["rougeL"].fmeasure
        for pred, label in zip(decoded_preds, decoded_labels)
    ]

    return {"rouge_l": sum(rouge_l_scores) / len(rouge_l_scores)}


# Create the Trainer (model and tokenizer were loaded in section 1;
# assumes the dataset provides train/validation splits)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
```
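
After training finishes, the checkpoint can be saved locally and, optionally, published (pushing to the Hub assumes a configured Hugging Face access token):

```python
trainer.save_model("./qrkb-t5-small")         # saves model weights and config
tokenizer.save_pretrained("./qrkb-t5-small")  # saves the tokenizer alongside
# Optionally publish the checkpoint:
# trainer.push_to_hub()
```
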
**Important Notes on Training:**

* **Replace Placeholders:** If you are training on your own data, replace the dataset name and output paths in the example above with your own.
* **Data Preprocessing:** You'll likely need to preprocess your data before training (see the first sketch below this list). This usually involves:
    * Tokenizing the input queries and subqueries.
    * Padding sequences to a maximum length.
    * Creating attention masks.
    * Converting the subqueries string (with `\n` separators) into a format suitable for your model (e.g., using a special separator token). You *must* handle this consistently at training and inference time.
* **Evaluation Metrics:** The example shows how to use ROUGE-L. You should also consider other metrics like BLEU, METEOR, and potentially custom metrics specific to knowledge graph query evaluation, e.g., precision and recall of retrieved entities and relationships (see the second sketch below).
* **Hyperparameter Tuning:** The provided training arguments are just an example. You'll need to tune these hyperparameters to achieve optimal performance.
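
A minimal preprocessing sketch, assuming the dataset exposes `query` and `subqueries` text columns (the actual column names may differ):

```python
max_length = 128

def preprocess(batch):
    # Tokenize inputs and newline-separated targets; DataCollatorForSeq2Seq handles padding
    model_inputs = tokenizer(batch["query"], max_length=max_length, truncation=True)
    labels = tokenizer(text_target=batch["subqueries"], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess, batched=True,
                                remove_columns=dataset["train"].column_names)
```

Pass `tokenized_dataset["train"]` and `tokenized_dataset["validation"]` to the `Seq2SeqTrainer` in place of the raw splits.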
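
And a sketch of one possible knowledge-graph-oriented metric, scoring predictions against references by exact overlap of their newline-separated subqueries (an illustration, not part of the released code):

```python
def subquery_precision_recall(predicted: str, gold: str) -> dict:
    """Exact-match precision/recall over newline-separated subqueries."""
    pred_set = {s.strip() for s in predicted.split("\n") if s.strip()}
    gold_set = {s.strip() for s in gold.split("\n") if s.strip()}
    overlap = len(pred_set & gold_set)
    return {
        "precision": overlap / len(pred_set) if pred_set else 0.0,
        "recall": overlap / len(gold_set) if gold_set else 0.0,
    }
```
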
## Citation

```
@misc{dong2025queryreformulation,
  title = {Synthetic Query Reformulation Dataset for Knowledge Graph Retrieval},
  author = {Alex Dong},
  year = {2025},
  howpublished = {Online Dataset},
  note = {Available at https://huggingface.co/datasets/alexdong/query-reformulation/. Contact: me@alexdong.com},
  keywords = {dbpedia, wikidata, kb, query-understanding, query-expansion, query-decomposition, query-rewriting, text2text-generation, question-answering, sentence-similarity},
}
```
## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.