---
license: mit
datasets:
- alexdong/query-reformulation
language:
- en
metrics:
- accuracy
base_model:
- google-t5/t5-small
pipeline_tag: text2text-generation
---

# QRKB: A Synthetic Query Reformulation Model for Knowledge Graphs

[License: MIT](https://opensource.org/licenses/MIT)
[Hugging Face Model](https://huggingface.co/alexdong/query-reformulation-knowledge-base-t5-small/)
[Dataset](https://huggingface.co/datasets/alexdong/query-reformulation)
[Source Code](https://github.com/alexdong/query-reformulation)

This repository contains the model and code for training and evaluating a query reformulation model on the QRKB-16k dataset. The model transforms natural language queries into a set of structured subqueries suitable for retrieval from knowledge graphs such as DBpedia and Wikidata, which is particularly useful for Retrieval-Augmented Generation (RAG) applications.

## Overview

The model takes a natural language query as input and outputs a sequence of subqueries, each representing a semantic triple (subject-predicate-object). These subqueries can be used to query a knowledge graph directly via SPARQL. The model also predicts paraphrased variations of the input query.

**Key Features:**

* **Query Decomposition:** Breaks down complex queries into smaller, manageable subqueries.
* **Knowledge Graph Compatibility:** Outputs subqueries that can be executed directly against knowledge graphs like DBpedia and Wikidata.
* **Sentence Similarity:** Trained on variations of the input query, making it robust to different phrasings.
* **Three Reformulation Categories:** Handles comparison, chaining, and expansion query types.
* **Fine-tuned on QRKB-16k:** Trained on a high-quality, synthetically generated dataset designed specifically for this task.

## Model Architecture

The model is based on the [sequence-to-sequence architecture](https://arxiv.org/abs/1409.3215) using a pre-trained [Transformer model](https://arxiv.org/abs/1706.03762) as its backbone.
Specifically, we use [T5-small](https://huggingface.co/google-t5/t5-small) as the encoder and decoder.

## Installation

```bash
pip install -r requirements.txt
```

**`requirements.txt`**

```
transformers
torch
datasets
sentencepiece
rouge_score  # For ROUGE evaluation
nltk
```

## Usage

### 1. Loading the Model and Tokenizer

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "your_model_name_or_path"  # e.g., "your_username/your_model_name" or "./your_local_model_directory"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
```

### 2. Inference

```python
def reformulate_query(query):
    inputs = tokenizer(query, return_tensors="pt").to(device)
    # Adjust generation parameters as needed
    outputs = model.generate(**inputs, max_length=128, num_beams=5, early_stopping=True)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    subqueries = decoded_output.split("\n")  # Assuming "\n" is your separator token
    return subqueries

query = "What is the capital of the country that contains the administrative division where the national park housing Mount Aspiring is located?"
subqueries = reformulate_query(query)
print(f"Original Query: {query}")
print(f"Subqueries: {subqueries}")
```

### 3. Training (Example)

```python
import numpy as np
from rouge_score import rouge_scorer
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from datasets import load_dataset

# Load the dataset (expects train/validation splits)
dataset = load_dataset("your_dataset_name")  # Or load from local files

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=3,
    predict_with_generate=True,  # Important for generation tasks
    fp16=True,  # Use mixed precision if your GPU supports it
    # Add other arguments as needed
)

# Define a data collator for dynamic padding
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # With predict_with_generate=True, predictions are generated token IDs
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 (ignored positions) with the pad token ID before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # --- ROUGE-L ---
    rouge_l_scores = [
        scorer.score(label, pred)["rougeL"].fmeasure
        for pred, label in zip(decoded_preds, decoded_labels)
    ]
    avg_rouge_l = sum(rouge_l_scores) / len(rouge_l_scores)

    return {"rouge_l": avg_rouge_l}

# Create the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
```
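Because the model emits all subqueries as a single string, the training targets and the decoded outputs must share one separator convention (a newline, matching the `split("\n")` call in the inference example). A minimal sketch of that round trip, assuming your dataset stores each example's subqueries as a list of strings:

```python
SEPARATOR = "\n"  # must match the separator used at inference time

def format_target(subqueries):
    """Join a list of subqueries into a single training target string."""
    return SEPARATOR.join(sq.strip() for sq in subqueries)

def split_output(decoded_output):
    """Split a decoded model output back into individual subqueries,
    dropping any empty fragments left by trailing separators."""
    return [sq for sq in (s.strip() for s in decoded_output.split(SEPARATOR)) if sq]
```

If you switch to a dedicated separator token instead, only `SEPARATOR` needs to change, but it must be applied consistently in both preprocessing and decoding.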
**Important Notes on Training:**

* **Replace Placeholders:** Replace `"your_dataset_name"` and other placeholders with your actual dataset name and paths.
* **Data Preprocessing:** You'll likely need to preprocess your data before training. This usually involves:
  * Tokenizing the input queries and subqueries.
  * Padding sequences to a maximum length.
  * Creating attention masks.
  * Converting the subqueries string (with `\n` separators) into a format suitable for your model (e.g., using a special separator token). The example above uses `\n`. You *must* handle this consistently at training and inference time.
* **Evaluation Metrics:** The example shows how to use ROUGE. You should also consider other metrics such as BLEU, METEOR, and custom metrics specific to knowledge graph query evaluation (e.g., precision and recall of retrieved entities and relationships).
* **Hyperparameter Tuning:** The provided training arguments are only an example. You'll need to tune these hyperparameters to achieve optimal performance.

## Citation

```
@misc{dong2025queryreformulation,
  title = {Synthetic Query Reformulation Dataset for Knowledge Graph Retrieval},
  author = {Alex Dong},
  year = {2025},
  howpublished = {Online Dataset},
  note = {Available at https://huggingface.co/datasets/alexdong/query-reformulation/. Contact: me@alexdong.com},
  keywords = {dbpedia, wikidata, kb, query-understanding, query-expansion, query-decomposition, query-rewriting, text2text-generation, question-answering, sentence-similarity},
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.