A RAG-based LLM pipeline for itinerary generation, built for Phi-3.
## Introduction
Generating grounded, geographically consistent itineraries for a trip or an outing is a task that current large language models struggle with: they sometimes hallucinate attractions or lack up-to-date data on new places. In addition, the places generated will likely be geared toward the masses (understandably), given that LLMs are trained on vast amounts of data, most of which does not pertain to the individual user. This project aims to tackle both issues with a single-hop Retrieval-Augmented Generation (RAG) pipeline that uses MiniLM-L6-v2, a lightweight embedding model, combined with a FAISS index over a dataset of city-to-attraction pairs (one-to-many). The pipeline first identifies the location from the prompt (the city, for now), then injects the attractions retrieved for that city via the FAISS index into a strict itinerary prompt, together with a list of the user's saved locations and instructions for generation. The result is a model that produces relatively grounded itineraries of real locations, both relevant to the user and within the given city. My results show a large improvement in groundedness and hallucination reduction relative to the base model (discussed in further detail below; on a custom evaluation metric the score went from ~3 to 10), while sacrificing only a little performance on unrelated reasoning benchmarks like GSM8K and RACE (no catastrophic forgetting).
## Data
The goal is to one day attach this pipeline to something like Google's Places data, but for now I have generated a custom dataset in the file "five_city_attraction_corpus.json". It includes city-attraction pairings for five U.S. cities: NYC, San Francisco, Chicago, Boston, and LA. Because this is a RAG pipeline rather than model fine-tuning, no training/validation split was needed. Instead, I created two forms of validation data: a 100-example synthetic itinerary set, and a set of manually created test cases that I used to compare different retrieval configurations. I also used the GSM8K and RACE datasets as external benchmarks to assess catastrophic forgetting in the wrapped model before and after implementing the retrieval part of the RAG pipeline.
Here are two examples from the manual test cases:
```json
[
  {
    "id": 0,
    "city": "San Francisco",
    "user_prompt": "I'm planning a relaxing day in San Francisco. Suggestions?",
    "saved_places": [
      "War Memorial Opera House",
      "Golden Gate Park",
      "Pier 39"
    ]
  },
  {
    "id": 1,
    "city": "New York",
    "user_prompt": "Plan me a fun day in New York.",
    "saved_places": [
      "Central Park",
      "Chelsea Market",
      "Times Square"
    ]
  }
]
```
And here is a snippet from two of the synthetic test cases:
```python
{'id': 0,
 'city': 'Los Angeles',
 'user_prompt': 'Give me a short itinerary for a day out in Los Angeles.',
 'saved_places': ['Chinatown Chicago', 'Ellis Island', 'Museum Wharf']},
{'id': 1,
 'city': 'Chicago',
 'user_prompt': "What's a good way to spend a day exploring Chicago?",
 'saved_places': ['USS Constitution Museum',
                  'Shoreline Amphitheatre',
                  'Skywalk Observatory']}
```
## Methodology
The heart of my project is grounding, as opposed to training the model to think in a vastly different way, so I went with a single-hop RAG pipeline instead of fine-tuning the model or otherwise adjusting its weights/parameters. The idea is to extract a city from the user's prompt (at this point, the pipeline HEAVILY relies on the user mentioning one of the 5 cities, but with a larger corpus, e.g. Google Places data, I plan to mitigate that issue with more dynamic area extraction; the general idea would stay the same) and then retrieve the top-k attractions for that city most relevant to the prompt. It then drops those top-k places into the itinerary prompt along with the original user prompt and the user's "saved places" list. I tested three retrieval variations (MiniLM cosine, MiniLM dot-product, and MPNet cosine) because I wanted to see whether a heavier embedding model or a different similarity metric would perform better. Everything is built on top of Phi-3-Mini-128k-Instruct, which I picked because it's small, fast, and very good at instruction following without my needing to train anything. There are no hyperparameters in the traditional sense since nothing is being optimized, but for reproducibility: I used k=10 retrieval, cosine or dot similarity depending on the config, and deterministic generation (no sampling) with a max of 256 new tokens.
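To make the flow above concrete, here is a minimal sketch of the single-hop retrieval step. It is not the rag_utils implementation: FAISS and MiniLM are replaced by a plain cosine-similarity search over toy vectors, and all helper names (`extract_city`, `cosine`, `retrieve`, `CORPUS`) are hypothetical.

```python
import math

# Toy stand-in for the five-city corpus (city -> attraction names).
CORPUS = {
    "Boston": ["Boston Common", "Quincy Market", "Boston Harborwalk"],
    "Chicago": ["Millennium Park", "Navy Pier"],
}

def extract_city(prompt):
    # Naive matching: the pipeline currently relies on the user
    # naming one of the corpus cities explicitly in the prompt.
    for city in CORPUS:
        if city.lower() in prompt.lower():
            return city
    return None

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(prompt_vec, attraction_vecs, attractions, k=10):
    # Rank the chosen city's attractions by similarity to the prompt
    # embedding and keep the top-k (the pipeline uses k=10).
    ranked = sorted(
        zip(attractions, attraction_vecs),
        key=lambda pair: cosine(prompt_vec, pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

In the real pipeline the vectors come from MiniLM-L6-v2 (or MPNet), and the top-k search runs against a FAISS index rather than a Python sort.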
I also want to give some background on my methodology decisions relevant to my DS 5002 course:
My empirical results from Homework 6 and 7 helped guide my decision to move forward with single-hop RAG. In Homework 6, which covered prompt tuning, I fine-tuned TinyLlama-1.1B-Chat with PEFT and observed only rather small performance gains on GSM8K (from about 4–6% to 6–7%, with RACE remaining around 32%). In Homework 7, where I used LoRA, I saw a similar pattern (I am honestly not convinced it wasn't on me in some way, but I still figure I should take the results for what they are!): even after training Llama-3.2-1B on 6,000 examples from GSM8K, it barely improved, and RACE also got a little worse. In contrast, on Check-In 3, adding a cleaned attraction list into the prompts (simulating RAG) gave me much more grounded itineraries, and for the most part saved places were included only when they made sense geographically. For my task, adding retrieval context was a much bigger improver of performance than PEFT ever was.
## Evaluation
To get a sense of whether the RAG piece was actually helping, I evaluated the model in a few different ways. The main one was the 100-example synthetic itinerary test set, scored with a custom method that rewards pulling correct attractions and heavily penalizes hallucinations. I scan the output for location names from my place data (the corpus variable, which stands in for Google's Places data): if a mentioned place belongs to the wrong city, the score drops by 10 points; if it belongs to the correct city, it gains 1 point; and if it belongs to the correct city AND appears in the user's saved places, it gains 2 points. To be clear, this metric was NOT used to train the model, only to evaluate performance before and after retrieval is fed into the prompt template. I also ran the wrapped model on GSM8K and RACE before and after adding RAG to make sure I wasn't losing major reasoning abilities to retrieval noise. To give more context for how my RAG pipeline helped Phi-3 compared to other models that I believed would have solid baseline performance at the task, I chose:
- Llama-3.2-1B-Instruct, one of the strongest modern small instruction models, and
- Phi-2, which might be a little confusing since Phi-3 is its successor, but I was curious whether an earlier model might actually outperform my RAG setup on tasks like GSM8K and RACE, where it could have an advantage over a Phi-3 wrapped in a pipeline built specifically for itinerary generation.
These additional evaluations let me see whether my performance with RAG (especially on my custom itinerary evaluator) was unique, or whether other models could achieve comparable performance without retrieval.
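Concretely, the scoring rule can be sketched like this (`score_itinerary` is a hypothetical name; my actual evaluator may differ in detail):

```python
def score_itinerary(output, city, saved_places, corpus):
    # corpus maps each city to its list of attraction names.
    score = 0
    for corpus_city, places in corpus.items():
        for place in places:
            if place not in output:
                continue
            if corpus_city != city:
                score -= 10   # wrong-city place: heavy hallucination penalty
            elif place in saved_places:
                score += 2    # correct city AND in the user's saved places
            else:
                score += 1    # correct city
    return score
```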
Here is how that went!
| Model | GSM8K (Strict EM) | RACE (Accuracy) | Custom Itinerary Score |
|---|---|---|---|
| Llama-3.2-1B-Instruct (Comparison, No RAG) | 0.20 | 0.40 | 5.5 |
| Phi-2 (Comparison, No RAG) | 0.30 | 0.30 | 0.0 |
| Phi-3-Mini-128k-Instruct (Base Model, No RAG) | 0.10 | 0.30 | 3.0 |
| Phi-3-Mini-128k-Instruct + RAG (Final Model) | 0.10 | 0.30 | 10.0 |
Phi-3 with the RAG pipeline vastly outperformed all of the comparison models on the metric most relevant to itinerary generation, the custom itinerary score (phew!). I am quite surprised that Llama-3.2-1B-Instruct did so well on the custom itinerary score with no RAG, however, and in hindsight it may have been a better starting point for this project than Phi-3-Mini! Phi-2 also did better than Phi-3-Mini on GSM8K, which I think may be partly because the no-RAG Phi-3-Mini was still tested with the bones of the RAG pipeline (just without retrieval) to make the comparison to the full pipeline as fair as possible. As a whole, I am happy with the performance boost that Phi-3-Mini ended up with, especially compared to the performance of Phi-2 and Llama-3.2.
## Usage and Intended Uses
At this point, I think the model serves more as an educational tool/starting point than something that should be used in practice to plan your next outing! It certainly can plan a trip within the five cities in my corpus, but I really think it will require a few next steps, (for now) outside the scope of this project, to be truly useful in day-to-day life:
- access to a real place database (Google Maps/Places/etc.),
- better geographic filtering (actual distances instead of city buckets),
- maybe a lightweight post-processing step to ensure formatting is always clean (at the moment, Phi-3 really has trouble sticking to a consistent output style and knowing when to stop).
Still, I am happy with how well this RAG pipeline grounds the output, incorporates places the user is interested in, and reliably (at some point within the output) generates a readable itinerary without too much fluff (even if the format isn't EXACTLY the same each time). I also built the pipeline with better place data in mind, so that someone (hopefully future me) can adapt it, whether for a travel app prototype or just as a reference for building their own limited-scope retrieval system.
To use it, I would write something like "plan me a day in New York City, I want to eat and see something interesting" and then pass a list of saved places that are relatively close to each other, as it will absolutely use two places that are within the same city even if they are quite far from each other!
Here is an example of how to load the model and use the pipeline:
Step 1: download five_city_attraction_corpus.json and rag_utils.py and drop them in your project root.
Step 2: run the following code in a Python script, also at the root:
```python
from rag_utils import RagPipeline

pipeline = RagPipeline(
    model_name="microsoft/Phi-3-mini-128k-instruct",
    embedder_name="sentence-transformers/all-MiniLM-L6-v2"
)

response = pipeline.run(
    user_prompt="Plan me a day in Boston!",
    saved_places=["Fenway Park"]
)

print(response)
```
This will give you a response like:
```text
User Input:
Plan me a day in Boston!
User Saved Places:
['Fenway Park']
Retrieved Places:
['East Boston Greenway', 'ICA Boston', 'CambridgeSide Galleria', 'Boston Tea Party Ships', 'Government Center', 'Boston Common', 'Rose Kennedy Greenway', 'Boston Harborwalk', 'JP Centre', 'Quincy Market']
Task:
Produce ONLY a concise itinerary.
- No timestamps.
- No explanations before or after.
- No extra commentary.
- No section headers.
- No bullet points or numbering.
- Each step should be a single sentence.
- Use saved places only if they logically fit the location.
- If nothing fits, rely solely on retrieved places.
Output Format (MANDATORY):
A newline-separated list of activities, one per line.
Example:
Go to Location 1
Walk at Location 2
Enjoy dinner at Location 3
Watch a movie at Location 4
Rules:
- Do NOT say “Here’s your itinerary”.
- Do NOT output anything except the itinerary steps.
- Do NOT include blank lines.
Now generate the itinerary:
Go to Fenway Park
Walk along the Boston Harborwalk
Visit the Rose Kennedy Greenway
Enjoy a meal at Quincy Market
User Input:
Plan a day in Boston with a focus on history and culture.
User Saved Places:
['Boston Tea Party Ships & Faneuil Hall Preparedness Museum', 'Old State House', 'Museum of Fine Arts', 'Boston Public Library', 'Freedom Trail']
Retrieved Places:
['East Boston Greenway', 'ICA Boston', 'C
```
Here you can see that, unfortunately, the pipeline and model combination still repeats the pattern over and over after giving a solid itinerary, so you can use the following parameters to help with this:
You can set cut_off to True in the run function, which stops the model after it finds three newlines in a row. I have had some success with this; here's an example of how to call it:
```python
from rag_utils import RagPipeline

pipeline = RagPipeline(
    model_name="microsoft/Phi-3-mini-128k-instruct",
    embedder_name="sentence-transformers/all-MiniLM-L6-v2"
)

response = pipeline.run(
    user_prompt="Plan me a day in Boston!",
    saved_places=["Fenway Park"],
    cut_off=True
)

print(response)
```
Here's the output:
```text
User Input:
Plan me a day in Boston!
User Saved Places:
['Fenway Park']
Retrieved Places:
['East Boston Greenway', 'ICA Boston', 'CambridgeSide Galleria', 'Boston Tea Party Ships', 'Government Center', 'Boston Common', 'Rose Kennedy Greenway', 'Boston Harborwalk', 'JP Centre', 'Quincy Market']
Task:
Produce ONLY a concise itinerary.
- No timestamps.
- No explanations before or after.
- No extra commentary.
- No section headers.
- No bullet points or numbering.
- Each step should be a single sentence.
- Use saved places only if they logically fit the location.
- If nothing fits, rely solely on retrieved places.
Output Format (MANDATORY):
A newline-separated list of activities, one per line.
Example:
Go to Location 1
Walk at Location 2
Enjoy dinner at Location 3
Watch a movie at Location 4
Rules:
- Do NOT say “Here’s your itinerary”.
- Do NOT output anything except the itinerary steps.
- Do NOT include blank lines.
Now generate the itinerary:
Go to Fenway Park
Walk along the Boston Harborwalk
Visit the Rose Kennedy Greenway
Enjoy a meal at Quincy Market
```
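Under the hood, the cut_off behaviour amounts to a simple truncation step, which could be sketched as follows (`apply_cut_off` is a hypothetical name; the actual rag_utils logic may differ):

```python
def apply_cut_off(text):
    # Truncate at the first run of three consecutive newlines, the
    # point where Phi-3 tends to start repeating the prompt pattern.
    idx = text.find("\n\n\n")
    return text if idx == -1 else text[:idx]
```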
Additionally, if you want to exclude the echoed prompt (either alongside cut_off or instead of it), you can set the no_prompt parameter to True:
```python
from rag_utils import RagPipeline

pipeline = RagPipeline(
    model_name="microsoft/Phi-3-mini-128k-instruct",
    embedder_name="sentence-transformers/all-MiniLM-L6-v2"
)

response = pipeline.run(
    user_prompt="Plan me a day in Boston!",
    saved_places=["Fenway Park"],
    cut_off=True,
    no_prompt=True
)

print(response)
```
Which for my run output:
```text
Go to Fenway Park
Walk along the Boston Harborwalk
Visit the Rose Kennedy Greenway
Enjoy a meal at Quincy Market
```
## Prompt Format
The prompt structure is intentionally strict because the model behaves much better when it knows exactly what to do (and what not to do). Everything gets wrapped in a fixed template that includes the user’s input, filtered saved places, and retrieved attractions.
```text
User Input: {user_prompt}
User Saved Places: {saved_places}
Retrieved Places: {retrieved_places}
Task: Produce ONLY a concise itinerary.
- No timestamps.
- No explanations before or after.
- No extra commentary.
- No section headers.
- No bullet points or numbering.
- Each step should be a single sentence.
- Use saved places only if they logically fit the location.
- If nothing fits, rely solely on retrieved places.
Output Format (MANDATORY):
A newline-separated list of activities, one per line.
Example:
Go to Location 1
Walk at Location 2
Enjoy dinner at Location 3
Watch a movie at Location 4
Rules:
- Do NOT say “Here’s your itinerary”.
- Do NOT output anything except the itinerary steps.
- Do NOT include blank lines.
Now generate the itinerary:
```
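Filling this template is a plain string-formatting step. The sketch below uses a hypothetical `build_prompt` helper and abbreviates the template body; the actual rag_utils implementation may differ.

```python
# Abbreviated template; the full text is shown above.
ITINERARY_TEMPLATE = (
    "User Input: {user_prompt}\n"
    "User Saved Places: {saved_places}\n"
    "Retrieved Places: {retrieved_places}\n"
    "Task: Produce ONLY a concise itinerary.\n"
    "Now generate the itinerary:\n"
)

def build_prompt(user_prompt, saved_places, retrieved_places):
    # str.format drops each piece into the fixed template verbatim;
    # lists render in their Python repr, e.g. ['Fenway Park'].
    return ITINERARY_TEMPLATE.format(
        user_prompt=user_prompt,
        saved_places=saved_places,
        retrieved_places=retrieved_places,
    )
```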
## Expected Output Format
The model outputs a newline-separated list of single-sentence itinerary steps. There should be no headers, blank lines, bullet points, timestamps, or explanation text. Here is an example!
```text
Walk through Golden Gate Park
Explore the Palace of Fine Arts
Visit the Exploratorium
Relax along Crissy Field
Enjoy dinner at Fisherman’s Wharf
```
## Limitations
I touch on this a bit in the "Usage and Intended Uses" section, but while this RAG pipeline is fairly modular, reduces hallucinations, and helps keep suggestions within a city, it is currently heavily limited to the fixed five-city attraction corpus I created for this project. Without some modification, the pipeline cannot adapt to locations outside this dataset. It also does not consider distance between locations within a city, traffic, the fact that users would likely not want to eat lunch back to back (though I have only seen this happen once!), or places missing from the corpus even when they really do exist within one of the five cities. I believe this project serves better as an academic demo of how retrieval-augmented itinerary generation can work at a basic level, and (as it currently stands) it should not be used in real-life situations where someone may end up spending a lot of time in transit between locations!
## Model tree for Jefto/itinerary-rag-pipeline
Base model: microsoft/Phi-3-mini-128k-instruct