---
license: mit
datasets:
- alexdong/query-reformulation
language:
- en
metrics:
- accuracy
base_model:
- google-t5/t5-small
pipeline_tag: text2text-generation
---

# QRKB: A Synthetic Query Reformulation Model for Knowledge Graphs

[License: MIT](https://opensource.org/licenses/MIT)
[Hugging Face Model](https://huggingface.co/alexdong/query-reformulation-knowledge-base-t5-small/)
[Dataset](https://huggingface.co/datasets/alexdong/query-reformulation)
[Source Code](https://github.com/alexdong/query-reformulation)

This repository contains the model and code for training and evaluating a query reformulation model on the QRKB-16k dataset. The model transforms natural language queries into a set of structured subqueries suitable for retrieval from knowledge graphs such as DBpedia and Wikidata, which is particularly useful for Retrieval-Augmented Generation (RAG) applications.

## Overview

The model takes a natural language query as input and outputs a sequence of subqueries, each representing a semantic triple (subject-predicate-object). These subqueries can be used to query a knowledge graph directly via SPARQL. The model also predicts paraphrased variations of the input query.

**Key Features:**

* **Query Decomposition:** Breaks down complex queries into smaller, manageable subqueries.
* **Knowledge Graph Compatibility:** Outputs subqueries that can be executed directly against knowledge graphs like DBpedia and Wikidata.
* **Sentence Similarity:** Trained on variations of the input query, making it robust to different phrasings.
* **Three Reformulation Categories:** Handles comparison, chaining, and expansion query types.
* **Fine-tuned on QRKB-16k:** Trained on a high-quality, synthetically generated dataset designed specifically for this task.

## Model Architecture

The model is based on the [sequence-to-sequence architecture](https://arxiv.org/abs/1409.3215) using a pre-trained [Transformer model](https://arxiv.org/abs/1706.03762) as its backbone.
Specifically, we use [T5-small](https://huggingface.co/google-t5/t5-small) as the encoder and decoder.

## Installation

```bash
pip install -r requirements.txt
```

**`requirements.txt`**

```
transformers
torch
datasets
sentencepiece
rouge_score  # For ROUGE evaluation
nltk
```

## Usage

### 1. Loading the Model and Tokenizer

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "your_model_name_or_path"  # e.g., "your_username/your_model_name" or "./your_local_model_directory"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```

### 2. Inference

```python
def reformulate_query(query):
    inputs = tokenizer(query, return_tensors="pt").to(device)
    # Adjust generation parameters as needed
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    subqueries = decoded_output.split("\n")  # Assuming "\n" is your separator token
    return subqueries

query = "What is the capital of the country that contains the administrative division where the national park housing Mount Aspiring is located?"
subqueries = reformulate_query(query)
print(f"Original Query: {query}")
print(f"Subqueries: {subqueries}")
```

### 3. Training (Example)

```python
import numpy as np
from rouge_score import rouge_scorer
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import load_dataset

# Load the dataset (expects train/validation splits)
dataset = load_dataset("your_dataset_name")  # Or load from local files

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=3,
    predict_with_generate=True,  # Important for generation tasks
    fp16=True,  # Use mixed precision if your GPU supports it
    # Add other arguments as needed
)

# Define a data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # With predict_with_generate=True, predictions are generated token IDs
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 (ignored positions) with the pad token ID before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # --- ROUGE-L ---
    rouge_l_scores = [
        scorer.score(label, pred)["rougeL"].fmeasure
        for pred, label in zip(decoded_preds, decoded_labels)
    ]
    avg_rouge_l = sum(rouge_l_scores) / len(rouge_l_scores)

    return {"rouge_l": avg_rouge_l}

# Create the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
```
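Because the model emits all subqueries as a single string, the training targets and the decoded outputs must share one separator convention (a newline, matching the `split("\n")` call in the inference example). A minimal sketch of that round trip, assuming your dataset stores each example's subqueries as a list of strings:

```python
SEPARATOR = "\n"  # must match the separator used at inference time

def format_target(subqueries):
    """Join a list of subqueries into a single training target string."""
    return SEPARATOR.join(sq.strip() for sq in subqueries)

def split_output(decoded_output):
    """Split a decoded model output back into individual subqueries,
    dropping any empty fragments left by trailing separators."""
    return [sq for sq in (s.strip() for s in decoded_output.split(SEPARATOR)) if sq]
```

If you switch to a dedicated separator token instead, only `SEPARATOR` needs to change, but it must be applied consistently in both preprocessing and decoding.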
**Important Notes on Training:**

* **Replace Placeholders:** Replace `"your_dataset_name"` and other placeholders with your actual dataset name and paths.
* **Data Preprocessing:** You'll likely need to preprocess your data before training. This usually involves:
  * Tokenizing the input queries and subqueries.
  * Padding sequences to a maximum length.
  * Creating attention masks.
  * Converting the subqueries string (with `\n` separators) into a format suitable for your model (e.g., using a special separator token). The example above uses `\n`. You *must* handle this consistently at training and inference time.
* **Evaluation Metrics:** The example shows how to use ROUGE. You should also consider other metrics such as BLEU, METEOR, and custom metrics specific to knowledge graph query evaluation (e.g., precision and recall of retrieved entities and relationships).
* **Hyperparameter Tuning:** The provided training arguments are only an example. You'll need to tune these hyperparameters to achieve optimal performance.

## Citation

```
@misc{dong2025queryreformulation,
  title = {Synthetic Query Reformulation Dataset for Knowledge Graph Retrieval},
  author = {Alex Dong},
  year = {2025},
  howpublished = {Online Dataset},
  note = {Available at https://huggingface.co/datasets/alexdong/query-reformulation/. Contact: me@alexdong.com},
  keywords = {dbpedia, wikidata, kb, query-understanding, query-expansion, query-decomposition, query-rewriting, text2text-generation, question-answering, sentence-similarity},
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.