---
license: mit
datasets:
- alexdong/query-reformulation
language:
- en
metrics:
- accuracy
base_model:
- google-t5/t5-small
pipeline_tag: text2text-generation
---

# QRKB: A Synthetic Query Reformulation Model for Knowledge Graphs

[License: MIT](https://opensource.org/licenses/MIT)
[Hugging Face Model](https://huggingface.co/alexdong/query-reformulation-knowledge-base-t5-small/)
[Dataset](https://huggingface.co/datasets/alexdong/query-reformulation)
[Source Code](https://github.com/alexdong/query-reformulation)

This repository contains the model and code for training and evaluating a query reformulation model on the QRKB-16k dataset. The model is designed to transform natural language queries into a set of structured subqueries suitable for retrieval from knowledge graphs like DBpedia and Wikidata. This is particularly useful for Retrieval-Augmented Generation (RAG) applications.

## Overview

The model takes a natural language query as input and outputs a sequence of subqueries, each representing a semantic triple (subject-predicate-object). These subqueries can be used to directly query a knowledge graph using SPARQL. The model also predicts paraphrased variations of the input query.

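For instance, a chaining query could plausibly decompose into a sequence of triples like the following (a hypothetical illustration of the output format only; the actual phrasing of the model's subqueries may differ):

```
Input:  What is the capital of the country that contains the administrative
        division where the national park housing Mount Aspiring is located?

Output: Mount Aspiring located-in ?national_park
        ?national_park part-of ?administrative_division
        ?administrative_division part-of ?country
        ?country capital ?capital
```
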
**Key Features:**

* **Query Decomposition:** Breaks down complex queries into smaller, manageable subqueries.
* **Knowledge Graph Compatibility:** Outputs subqueries that can be easily executed against knowledge graphs like DBpedia and Wikidata.
* **Sentence Similarity:** Trained on paraphrased variations of the input query, making it robust to different phrasings.
* **Three Reformulation Categories:** Handles comparison, chaining, and expansion query types.
* **Fine-tuned on QRKB-16k:** Trained on a high-quality, synthetically generated dataset designed specifically for this task.

## Model Architecture

The model is based on the [sequence-to-sequence architecture](https://arxiv.org/abs/1409.3215) with a pre-trained [Transformer](https://arxiv.org/abs/1706.03762) backbone. Specifically, we fine-tune [T5-small](https://huggingface.co/google-t5/t5-small), which provides both the encoder and the decoder.

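As a quick sanity check, the backbone can be loaded on its own and its size inspected (t5-small has roughly 60M parameters):

```python
from transformers import AutoModelForSeq2SeqLM

backbone = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-small")
print(f"{sum(p.numel() for p in backbone.parameters()):,} parameters")  # roughly 60M
```
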
## Installation

```bash
pip install -r requirements.txt
```

**`requirements.txt`**

```
transformers
torch
datasets
sentencepiece
rouge_score # For ROUGE evaluation
nltk
```

## Usage

### 1. Loading the Model and Tokenizer

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "alexdong/query-reformulation-knowledge-base-t5-small"  # or a local checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```

### 2. Inference

```python
def reformulate_query(query):
    inputs = tokenizer(query, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)  # Adjust generation parameters as needed
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Subqueries are newline-separated, matching the dataset format;
    # adjust if you trained with a special separator token instead
    subqueries = decoded_output.split("\n")
    return subqueries


query = "What is the capital of the country that contains the administrative division where the national park housing Mount Aspiring is located?"
subqueries = reformulate_query(query)
print(f"Original Query: {query}")
print(f"Subqueries: {subqueries}")
```
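
Each decoded subquery is meant to map onto a SPARQL pattern. As a minimal sketch of that final step, here is how one triple might be executed against DBpedia (this assumes the `SPARQLWrapper` package, which is not in `requirements.txt`; the entity and property names are illustrative, not produced by the model):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical SPARQL rendering of a single subquery triple,
# e.g. "Mount Aspiring located-in ?park"
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?park WHERE {
        dbr:Mount_Aspiring dbo:locatedInArea ?park .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["park"]["value"])
```
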
### 3. Training (Example)

```python
import numpy as np
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load the dataset (a DatasetDict; or load from local files)
dataset = load_dataset("alexdong/query-reformulation")

# NOTE: the splits must be tokenized before training;
# see the preprocessing sketch under "Important Notes on Training" below.

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=3,
    predict_with_generate=True,  # Important for generation tasks
    fp16=True,  # Use mixed precision if your GPU supports it
    # Add other arguments as needed
)

# Define a data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# ROUGE-L scorer, reused across evaluation calls
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    if isinstance(predictions, tuple):
        predictions = predictions[0]

    # With predict_with_generate=True the predictions are generated token IDs
    # (not logits); replace any -100 padding before decoding
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Labels use -100 for padded positions; restore the pad token before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # --- ROUGE-L ---
    rouge_l_scores = [
        scorer.score(label, pred)["rougeL"].fmeasure
        for pred, label in zip(decoded_preds, decoded_labels)
    ]

    return {"rouge_l": sum(rouge_l_scores) / len(rouge_l_scores)}


# Create the Trainer (model and tokenizer were loaded in section 1;
# assumes the dataset provides train/validation splits)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
```
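
After training finishes, the checkpoint can be saved locally and, optionally, published (pushing to the Hub assumes a configured Hugging Face access token):

```python
trainer.save_model("./qrkb-t5-small")         # saves model weights and config
tokenizer.save_pretrained("./qrkb-t5-small")  # saves the tokenizer alongside
# Optionally publish the checkpoint:
# trainer.push_to_hub()
```
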
**Important Notes on Training:**

* **Replace Placeholders:** If you are training on your own data, replace the dataset name and output paths in the example above with your own.
* **Data Preprocessing:** You'll likely need to preprocess your data before training (see the first sketch below this list). This usually involves:
    * Tokenizing the input queries and subqueries.
    * Padding sequences to a maximum length.
    * Creating attention masks.
    * Converting the subqueries string (with `\n` separators) into a format suitable for your model (e.g., using a special separator token). You *must* handle this consistently at training and inference time.
* **Evaluation Metrics:** The example shows how to use ROUGE-L. You should also consider other metrics like BLEU, METEOR, and potentially custom metrics specific to knowledge graph query evaluation, e.g., precision and recall of retrieved entities and relationships (see the second sketch below).
* **Hyperparameter Tuning:** The provided training arguments are just an example. You'll need to tune these hyperparameters to achieve optimal performance.
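
A minimal preprocessing sketch, assuming the dataset exposes `query` and `subqueries` text columns (the actual column names may differ):

```python
max_length = 128

def preprocess(batch):
    # Tokenize inputs and newline-separated targets; DataCollatorForSeq2Seq handles padding
    model_inputs = tokenizer(batch["query"], max_length=max_length, truncation=True)
    labels = tokenizer(text_target=batch["subqueries"], max_length=max_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess, batched=True,
                                remove_columns=dataset["train"].column_names)
```

Pass `tokenized_dataset["train"]` and `tokenized_dataset["validation"]` to the `Seq2SeqTrainer` in place of the raw splits.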
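
And a sketch of one possible knowledge-graph-oriented metric, scoring predictions against references by exact overlap of their newline-separated subqueries (an illustration, not part of the released code):

```python
def subquery_precision_recall(predicted: str, gold: str) -> dict:
    """Exact-match precision/recall over newline-separated subqueries."""
    pred_set = {s.strip() for s in predicted.split("\n") if s.strip()}
    gold_set = {s.strip() for s in gold.split("\n") if s.strip()}
    overlap = len(pred_set & gold_set)
    return {
        "precision": overlap / len(pred_set) if pred_set else 0.0,
        "recall": overlap / len(gold_set) if gold_set else 0.0,
    }
```
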
## Citation

```
@misc{dong2025queryreformulation,
  title = {Synthetic Query Reformulation Dataset for Knowledge Graph Retrieval},
  author = {Alex Dong},
  year = {2025},
  howpublished = {Online Dataset},
  note = {Available at https://huggingface.co/datasets/alexdong/query-reformulation/. Contact: me@alexdong.com},
  keywords = {dbpedia, wikidata, kb, query-understanding, query-expansion, query-decomposition, query-rewriting, text2text-generation, question-answering, sentence-similarity},
}
```
## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.