Beginner Golfer’s RAG Assistant

Introduction

This system uses a RAG pipeline to help beginner golfers improve their knowledge of swing mechanics, practice techniques, rules, and general etiquette. Golf is a sport shrouded in etiquette and unspoken rules; it is also difficult to learn and often counterintuitive, yet it has exploded in popularity in recent years. A great deal of golf knowledge is scattered across the internet (blog posts, Twitter, YouTube, etc.) and 20th century books, but there is no unified source beginners can turn to when getting started. Current LLMs struggle with this task because golf instruction requires precise rules and context-driven practice guidance, and the casual sources used to train general LLMs lead to hallucinations that make them unreliable for beginners. The RAG pipeline grounds responses in authoritative resources such as the USGA Rules of Golf and instructional guides by retrieving relevant content from a curated knowledge base before generation. The system was implemented with Llama 3.1 8B combined with MiniLM semantic embeddings for retrieval, achieving substantial improvements in accuracy over the base model across rules, etiquette, and technique questions.

Data

The RAG pipeline uses a custom dataset containing 95 documents of reliable, authoritative information tailored to beginners. The data is organized into three distinct categories:

  • Rules (30 documents from the official USGA Rules of Golf providing definitive guidance)

  • Etiquette (30 documents from golf handbooks and instructional websites describing common course behavior)

  • Technique (35 documents from handbooks and coaching materials covering context-dependent golf instruction)

Each document in the dataset includes the full instructional text along with metadata specifying the category (Rules, Etiquette, or Technique), source, and topic. The system was evaluated on a test set of 40 custom questions spread across the three categories (10 rules, 10 etiquette, 20 technique) to assess the model's ability to provide accurate, beginner-friendly golf instruction grounded in authoritative sources.
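As a rough illustration of the dataset layout, the snippet below loads the knowledge base and filters it by category. The dataset ID is taken from the Usage section further down; the exact column names ('text', 'category', 'source', 'topic') are assumed to match the metadata described above rather than confirmed.

from datasets import load_dataset

# Load the golf knowledge base (dataset ID taken from the Usage section below)
kb = load_dataset("chriswikoff/Beginner-Golf-Knowledge", split="train")

# Column names are assumed to mirror the metadata described above
print(kb.column_names)  # e.g. ['text', 'category', 'source', 'topic']

# Filter to a single category, e.g. the 30 Rules documents
rules_docs = kb.filter(lambda row: row["category"] == "Rules")
print(len(rules_docs), "Rules documents")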

Methodology

RAG was chosen for this task because golf instruction (especially rules, etiquette, and technique) requires precise, authoritative information rather than general reasoning. Early experiments showed that keyword-based retrieval methods (such as TF-IDF) failed when synonyms or informal language were used (e.g., “golf sticks” instead of “clubs”), showing the need for semantic embeddings. Three retrieval configurations were tested:

  • TF-IDF keyword retrieval (baseline)

  • MiniLM semantic embeddings (all-MiniLM-L6-v2)

  • MPNet semantic embeddings

Both semantic models outperformed TF-IDF, with MiniLM achieving Precision@3 of 73.3% (measuring how many of the top 3 retrieved documents were relevant) and offering the best balance between accuracy and efficiency. The final pipeline retrieves the top 3 most relevant documents, then feeds them to Llama-3.1-8B to generate grounded responses.
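The synonym failure described above can be reproduced with a small, self-contained comparison. The snippet below is a sketch rather than the original experiment: the two documents and the "golf sticks" query are made up for illustration, but they show why a keyword method scores near zero on synonym queries while MiniLM embeddings still rank the relevant document highest.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Two toy documents and an informal query with no keyword overlap with "clubs"
docs = [
    "You may carry a maximum of 14 clubs during a round under USGA Rule 4.1.",
    "Repair your ball marks on the green before putting.",
]
query = "How many golf sticks can I bring?"

# Keyword baseline: TF-IDF similarity collapses when the query uses synonyms
tfidf = TfidfVectorizer().fit(docs + [query])
tfidf_scores = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]

# Semantic retrieval: MiniLM embeddings capture the "golf sticks" / "clubs" relationship
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb_scores = cosine_similarity(embedder.encode([query]), embedder.encode(docs))[0]

print("TF-IDF:", tfidf_scores)  # near zero for both documents
print("MiniLM:", emb_scores)    # typically much higher for the 14-clubs document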

Evaluation

The RAG Beginner Golfer’s Assistant was evaluated on three benchmark tasks:

  1. Retrieval precision: MiniLM, MPNet, and TF-IDF were first compared using Precision@3 to measure how often the top 3 retrieved documents contained relevant information. MiniLM achieved a Precision@3 of 73.3%, outperforming TF-IDF and offering a better compute/accuracy tradeoff than MPNet (see the metric sketch after this list).

  2. Answer accuracy: The final RAG pipeline (Llama-3.1-8B + MiniLM) was then evaluated on 40 custom test questions across Rules, Etiquette, and Technique. The reported metrics were overall accuracy (an answer counts as correct when it exceeds 30% word overlap with the reference) and ROUGE-L. Accuracy assesses overall output quality, while ROUGE-L measures similarity between generated responses and the authoritative source text.

  3. Category-level performance: Next, the performance for each category was measured separately to highlight strengths and weaknesses specific to Rules, Etiquette, or Technique.
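The evaluation above only names the metrics, so the sketch below shows one plausible way to compute them. The exact tokenization, thresholds, document IDs, and example strings used in the original evaluation are not specified here; everything in this snippet is an illustrative assumption.

# Illustrative metric sketches; tokenization, IDs, and the example strings are assumptions
def precision_at_k(retrieved_ids, relevant_ids, k=3):
    # Fraction of the top-k retrieved documents that are labeled relevant
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / k

def word_overlap_accuracy(generated, reference, threshold=0.30):
    # Counts an answer as correct if more than 30% of the reference's words
    # appear in the generated response (the ">30% word overlap" rule above)
    ref_words = set(reference.lower().split())
    gen_words = set(generated.lower().split())
    return len(ref_words & gen_words) / max(len(ref_words), 1) > threshold

gold = "You may carry a maximum of 14 clubs during a round under USGA Rule 4.1."
answer = "Under USGA Rule 4.1 you can carry at most 14 clubs in a round."
print(precision_at_k(["rules_04", "etiquette_07", "technique_12"], {"rules_04"}))  # ~0.33
print(word_overlap_accuracy(answer, gold))  # True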

The Rules, Etiquette, and Technique categories were chosen to cover the full scope of beginner golf instruction needs. Accuracy and ROUGE-L were used to assess response quality and similarity to "gold standard" answers. The final RAG pipeline was compared to three baselines: Gemma-2B, Falcon3-10B, and base Llama-3.1-8B. Gemma-2B is optimized for instruction-following tasks while being small enough to run on limited hardware, while Falcon3-10B is post-trained on 1.2M samples of conversational and instructional data. The results show that the RAG pipeline outperforms all comparison models, particularly in Rules and Etiquette, which indicates that grounding answers in authoritative sources through RAG improves response correctness.

| Model | Overall Accuracy | ROUGE-L | Rules Acc. | Etiquette Acc. | Technique Acc. | Technique ROUGE-L | Benchmark Task |
|---|---|---|---|---|---|---|---|
| Gemma-2B | 42.5% (17/40) | 0.135 | 40% (4/10) | 60% (6/10) | 35% (7/20) | 0.133 | Test Set Evaluation |
| Falcon3-10B | 47.5% (19/40) | 0.194 | 60% (6/10) | 30% (3/10) | 50% (10/20) | 0.183 | Test Set Evaluation |
| Llama-3.1-8B Base | 77.5% (31/40) | 0.154 | 90% (9/10) | 90% (9/10) | 65% (13/20) | 0.145 | Test Set Evaluation |
| RAG Assistant (Final) | 95.0% (38/40) | 0.292 | 100% (10/10) | 100% (10/10) | 90% (18/20) | 0.281 | Retrieval + Test Set + Category Performance |
| Improvement | +17.5 pts | +0.138 | +10 pts | +10 pts | +25 pts | +0.136 | Relative to Base Llama |

Usage and Intended Uses

This model uses a RAG pipeline combining Llama 3.1 8B with semantic retrieval using sentence-transformers.

from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
import numpy as np

#Load the golf knowledge base
knowledge_data = load_dataset("chriswikoff/Beginner-Golf-Knowledge", split="train")
knowledge_texts = knowledge_data['text']

#Load embedding model for retrieval
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
knowledge_embeddings = embed_model.encode(knowledge_texts)

#Load base LLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")

# Retrieve the top-k most relevant documents for a question
def retrieve_context(question, k=3):
    query_embedding = embed_model.encode([question])
    # Rank knowledge-base documents by dot-product similarity to the query embedding
    similarities = np.dot(knowledge_embeddings, query_embedding.T).squeeze()
    top_k_indices = np.argsort(similarities)[-k:][::-1]
    return "\n\n".join([knowledge_texts[i] for i in top_k_indices])

#Generate response
question = "What is the maximum number of clubs I can carry?"
context = retrieve_context(question)
prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150, pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Intended Uses:

This RAG system is designed for beginner golfers seeking accurate, authoritative guidance on rules, etiquette, and technique. It provides educational support for newcomers learning USGA rules, proper course behavior, and fundamental swing mechanics. The system grounds responses in authoritative sources (USGA rules, instructional handbooks, coaching materials) to reduce hallucinations and provide trustworthy advice. This model is best suited for answering specific instructional questions rather than open-ended conversation.

Prompt Format

The system uses a prompt format where retrieved documents are provided before the question.

Context:
Under USGA Rule 4.1, the maximum number of clubs you may carry during a round is 14.

Question: What is the maximum number of clubs I can carry?

Answer:

Expected Output Format

The model generates responses that directly answer the question while also citing rules or sources from the retrieved context. Responses are concise, beginner-friendly, and grounded in reliable golf instruction documentation.

You may carry a maximum of 14 clubs during a round according to USGA Rule 4.1. If you start with fewer than 14, you may add clubs during the round up to this limit. Carrying more than 14 clubs results in a penalty.

Limitations

This system has several limitations to consider. The custom knowledge base contains only 95 documents, a small subset of the golf instruction content available, and may not cover edge cases or advanced topics. The near-perfect accuracy on the test set (95% overall, 100% on Rules and Etiquette) likely reflects the limited scope of the 40-question evaluation rather than perfect performance; a harder test set with more edge cases would likely show lower scores. The system's performance also depends heavily on retrieval quality: with MiniLM at 73.3% Precision@3, roughly one in four retrieved documents may not be relevant to the question. Finally, the knowledge base requires manual updates when the USGA revises its rules or new instructional content becomes available.
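Keeping the knowledge base current is therefore a manual maintenance task. The snippet below is a minimal sketch of that workflow under stated assumptions: the dataset ID comes from the Usage section, the appended text is a placeholder rather than a real rules update, and the embedding cache file name is hypothetical.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import numpy as np

# Reload the current knowledge base and append a newly written document (placeholder text)
knowledge_texts = list(load_dataset("chriswikoff/Beginner-Golf-Knowledge", split="train")["text"])
knowledge_texts.append("Placeholder text for a new USGA rules clarification.")

# Re-encode the full corpus so retrieval sees the updated knowledge base,
# then cache the embeddings so they are not recomputed at startup
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
knowledge_embeddings = embed_model.encode(knowledge_texts)
np.save("knowledge_embeddings.npy", knowledge_embeddings)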
