---
license: cc-by-4.0
language:
- en
base_model:
- google-bert/bert-base-uncased
pipeline_tag: question-answering
---

# WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation

[![License](https://img.shields.io/badge/License-CC%20BY%204.0-blue)](https://creativecommons.org/licenses/by/4.0/)

WikiHint is a **human-annotated dataset** for **automatic hint generation and ranking** for factoid questions. The dataset is based on Wikipedia, contains **5,000 hints for 1,000 questions**, and supports research in **hint evaluation, ranking, and generation**.

## 🗂 Overview

- **1,000 questions** with **5,000 manually created hints**.
- Hints ranked by **human annotators** based on helpfulness.
- Evaluated with **LLMs (LLaMA, GPT-4)** and **human performance studies**.
- Supports **hint ranking** and **automatic hint evaluation**.

## 🔬 Research Contributions

✅ **First human-annotated dataset** for hint generation and ranking.
✅ **HintRank:** a lightweight method for automatic hint ranking.
✅ **Human study** evaluating how effectively hints help users answer questions.
✅ **Fine-tuned open-source LLMs** (LLaMA-3.1) for hint generation, compared against GPT-4.

## 📈 Key Insights

- **Answer-aware generation** produces more effective hints.
- **Fine-tuned LLaMA models** generate better hints than vanilla models.
- **Shorter hints** tend to be **more effective** than longer ones.
- **Human-written hints** outperform LLM-generated hints in clarity and ranking.

## 🚀 Getting Started

### 1️⃣ Clone the Repository

```sh
git clone https://github.com/DataScienceUIBK/WikiHint.git
cd WikiHint
```

### 2️⃣ Load the Dataset

```python
import json

with open("./WikiHint/training.json", "r") as f:
    training_data = json.load(f)

with open("./WikiHint/test.json", "r") as f:
    test_data = json.load(f)

print(f"Training set: {len(training_data)} questions")
print(f"Test set: {len(test_data)} questions")
```
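To get oriented, you can peek at a single record. This is a minimal sketch: the field names used below (`question`, `answer`, `hints`) are assumptions for illustration only, so inspect the JSON files for the actual schema.

```python
# Inspect the first training record to discover the real schema.
# NOTE: "question", "answer", and "hints" are assumed field names used
# purely for illustration -- print record.keys() to see the actual keys.
record = training_data[0]
print(record.keys())
print("Q:", record.get("question"))
print("A:", record.get("answer"))
for hint in record.get("hints", []):
    print("Hint:", hint)
```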

## 🏆 HintRank: A Lightweight Hint Ranking Method

HintRank is an **automatic ranking method** for hints built on **BERT-based models**. It operates on **pairwise comparisons**, determining the **relative helpfulness** of two hints at a time.

### ✨ Features:

✔ **Lightweight**: runs locally without requiring massive computational resources.
✔ **LLM-free evaluation**: works without relying on **large-scale generative models**.
✔ **Human-aligned ranking**: correlates strongly with **human-assigned hint rankings**.

### 🔍 How It Works:

1. **Concatenates the question and two hints** → converts them into a BERT-compatible input.
2. **Computes hint quality** → determines which of the two hints is **more useful**.
3. **Generates hint rankings** → assigns ranks based on the pairwise comparisons (a minimal sketch of this pairwise step follows below).
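For intuition, here is a minimal sketch of that pairwise step, assuming a generic BERT sequence-pair classifier from Hugging Face `transformers`. The checkpoint and input encoding below are placeholders, not HintRank's actual model or format:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint with a freshly initialized classification head,
# for illustration only -- HintRank ships its own trained comparator.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
model.eval()

def compare(question: str, hint_a: str, hint_b: str) -> int:
    """Return 1 if hint_a looks more helpful than hint_b, else 0."""
    # Encode the question together with both hints as a BERT sentence pair.
    inputs = tokenizer(f"{question} {hint_a}", hint_b,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(torch.argmax(logits, dim=-1).item())
```

Counting wins from such a comparator across all hint pairs is one simple way to turn pairwise judgments into a full ranking, which is conceptually what the `listwise_compare` method shown below provides.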
## 📌 Using `HintRank` for Hint Ranking

The `HintRank` module **automatically ranks hints** by helpfulness using **BERT-based models**.

### 🚀 Run the HintRank Demo in Google Colab

You can try **HintRank** in your browser via **Google Colab**, with no local installation required. Simply **[launch the Colab notebook](https://colab.research.google.com/github/DataScienceUIBK/WikiHint/blob/main/HintRank/Demo.ipynb)** to explore **HintRank** interactively.

### 1️⃣ Install Dependencies

If running locally, install the required dependencies:

```sh
pip install transformers torch numpy scipy
```

### 2️⃣ Import and Initialize HintRank

Navigate to the `HintRank` directory and import the `hint_rank` module:

```python
from HintRank.hint_rank import HintRank

# Initialize the HintRank model
ranker = HintRank()
```

### 3️⃣ Rank Hints for a Given Question

```python
question = "What is the capital of Austria?"
answer = "Vienna"
hints = [
    "Mozart and Beethoven once lived here.",
    "It is a big city in Europe.",
    "Austria’s largest city."
]

# Pairwise comparison: 1 means the first hint is better, 0 means the second
better_hint_answer_aware = ranker.pairwise_compare(question, hints[1], hints[2], answer)
better_hint_answer_agnostic = ranker.pairwise_compare(question, hints[0], hints[1])

print(f"Answer-Aware: Hint {2 if better_hint_answer_aware == 1 else 3} is better than Hint {3 if better_hint_answer_aware == 1 else 2}.")
print(f"Answer-Agnostic: Hint {1 if better_hint_answer_agnostic == 1 else 2} is better than Hint {2 if better_hint_answer_agnostic == 1 else 1}.")
```

### 4️⃣ Listwise Hint Ranking

You can also rank multiple hints at once:

```python
print("\nAnswer-Aware Ranked Hints:")
ranked_hints_answer_aware = ranker.listwise_compare(question, hints, answer)
for i, (hint, _) in enumerate(ranked_hints_answer_aware):
    print(f"Rank {i + 1}: {hint}")

print("\nAnswer-Agnostic Ranked Hints:")
ranked_hints_answer_agnostic = ranker.listwise_compare(question, hints)
for i, (hint, _) in enumerate(ranked_hints_answer_agnostic):
    print(f"Rank {i + 1}: {hint}")
```

### 📌 Expected Output

```
Pairwise Hint Comparison
Answer-Aware: Hint 3 is better than Hint 2.
Answer-Agnostic: Hint 2 is better than Hint 1.

Listwise Hint Ranking
Answer-Aware Ranked Hints:
Rank 1: Austria’s largest city.
Rank 2: Mozart and Beethoven once lived here.
Rank 3: It is a big city in Europe.

Answer-Agnostic Ranked Hints:
Rank 1: It is a big city in Europe.
Rank 2: Austria’s largest city.
Rank 3: Mozart and Beethoven once lived here.
```
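You can also point the ranker at the dataset loaded earlier. A short sketch, assuming each record exposes the `question`, `answer`, and `hints` fields used in the loading example above (adjust to the actual schema):

```python
# Rank the human-written hints for a few test questions.
# Field names are assumptions carried over from the loading example.
for entry in test_data[:3]:
    ranked = ranker.listwise_compare(entry["question"], entry["hints"], entry["answer"])
    print(entry["question"])
    for i, (hint, _) in enumerate(ranked):
        print(f"  Rank {i + 1}: {hint}")
```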
---

## 📊 🆚 WikiHint vs. TriviaHG Dataset Comparison

The table below compares **WikiHint** with **TriviaHG**, the largest previous dataset for hint generation. WikiHint shows **better convergence**, **shorter hints**, and **higher-quality hints** across multiple evaluation metrics.

| **Dataset** | **Subset** | **Relevance** | **Readability** | **Convergence** | **Familiarity** | **Length** | **Answer Leakage (Avg.)** | **Answer Leakage (Max.)** |
|------------|-----------|--------------|----------------|--------------|--------------|---------|----------------|----------------|
| TriviaHG | Entire | 0.95 | 0.71 | 0.57 | 0.77 | 20.82 | 0.23 | 0.44 |
| WikiHint | Entire | 0.98 | 0.72 | 0.73 | 0.75 | 17.82 | 0.24 | 0.49 |
| TriviaHG | Train | 0.95 | 0.73 | 0.57 | 0.75 | 21.19 | 0.22 | 0.44 |
| WikiHint | Train | 0.98 | 0.71 | 0.74 | 0.76 | 17.77 | 0.24 | 0.49 |
| TriviaHG | Test | 0.95 | 0.73 | 0.60 | 0.77 | 20.97 | 0.23 | 0.44 |
| WikiHint | Test | 0.98 | 0.83 | 0.72 | 0.73 | 18.32 | 0.24 | 0.47 |

📌 **Key Findings**:
- **WikiHint outperforms TriviaHG in convergence**, meaning its hints help users **arrive at answers more effectively**.
- **WikiHint’s hints are shorter**, giving **more concise and effective guidance**.

## 📊🤖 Evaluation of Generated Hints

This table evaluates hints generated by different **LLMs (LLaMA-3.1, GPT-4)** on **relevance, readability, convergence, familiarity, hint length, and answer leakage**, showing how **fine-tuning (FT)** and **answer-awareness (wA)** affect hint quality.

| **Model** | **Config** | **Use Answer?** | **Rel** | **Read** | **Conv (LLaMA-8B)** | **Conv (LLaMA-70B)** | **Fam** | **Len** | **AnsLkg (Avg.)** | **AnsLkg (Max.)** |
|-----------|----------|---------------|--------------|----------------|------------------|------------------|--------------|---------|----------------|----------------|
| **GPT-4** | Vanilla | ✅ | 0.91 | 1.00 | 0.14 | 0.48 | 0.84 | 26.36 | 0.23 | 0.51 |
| **GPT-4** | Vanilla | ❌ | 0.92 | 1.10 | 0.12 | 0.47 | 0.81 | 26.93 | 0.24 | 0.52 |
| **LLaMA-3.1-405b** | Vanilla | ✅ | 0.94 | 1.49 | 0.11 | 0.47 | 0.76 | 41.81 | 0.23 | 0.50 |
| **LLaMA-3.1-405b** | Vanilla | ❌ | 0.92 | 1.53 | 0.10 | 0.45 | 0.78 | 50.91 | 0.23 | 0.50 |
| **LLaMA-3.1-70b** | FTwA | ✅ | 0.88 | 1.50 | 0.09 | 0.42 | 0.84 | 43.69 | 0.22 | 0.48 |
| **LLaMA-3.1-70b** | Vanilla | ✅ | 0.86 | 1.53 | 0.05 | 0.42 | 0.80 | 45.51 | 0.23 | 0.50 |
| **LLaMA-3.1-70b** | FTwoA | ❌ | 0.86 | 1.50 | 0.08 | 0.38 | 0.80 | 51.07 | 0.22 | 0.51 |
| **LLaMA-3.1-70b** | Vanilla | ❌ | 0.87 | 1.56 | 0.06 | 0.38 | 0.76 | 53.24 | 0.22 | 0.50 |
| **LLaMA-3.1-8b** | FTwA | ✅ | 0.78 | 1.63 | 0.05 | 0.37 | 0.79 | 50.33 | 0.22 | 0.52 |
| **LLaMA-3.1-8b** | Vanilla | ✅ | 0.81 | 1.72 | 0.05 | 0.32 | 0.80 | 54.38 | 0.22 | 0.50 |
| **LLaMA-3.1-8b** | FTwoA | ❌ | 0.76 | 1.70 | 0.03 | 0.32 | 0.80 | 55.02 | 0.22 | 0.51 |
| **LLaMA-3.1-8b** | Vanilla | ❌ | 0.78 | 1.76 | 0.04 | 0.30 | 0.83 | 52.99 | 0.22 | 0.50 |

📌 **Key Takeaways**:
- **Relevance**: larger models (405b, 70b) provide better hints than smaller (8b) models.
- **Readability**: GPT-4 produces the most readable hints.
- **Convergence**: answer-aware (wA) generation helps LLMs produce better hints.
- **Familiarity**: larger models generate hints grounded in more familiar, common knowledge.
- **Hint Length**: fine-tuned models (FTwA, FTwoA) generate shorter and better hints.

## 📜 License

This project is licensed under the **Creative Commons Attribution 4.0 International License (CC BY 4.0)**. You are free to use, share, and adapt the dataset with proper attribution.

## 📑 Citation

If you find this work useful, please cite [📜our paper](https://doi.org/10.48550/arXiv.2412.01626):

> Mozafari, J., Gerhold, F., & Jatowt, A. (2024). WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation. arXiv preprint arXiv:2412.01626.

### 📄 BibTeX:

```bibtex
@article{mozafari2025wikihinthumanannotateddatasethint,
  title={WikiHint: A Human-Annotated Dataset for Hint Ranking and Generation},
  author={Jamshid Mozafari and Florian Gerhold and Adam Jatowt},
  year={2025},
  eprint={2412.01626},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  doi={10.48550/arXiv.2412.01626},
}
```

## 🙏 Acknowledgments

Thanks to our contributors and the University of Innsbruck for supporting this project.