CalebMaresca committed
Commit e1ee8e8 · 1 Parent(s): f16141d

add notebooks for creating testset, evaluation, and embedding fine-tuning and write README
.gitignore CHANGED
@@ -3,6 +3,9 @@ docs/
  .cursor/
  .chainlit/
  .files/
+ wandb/
+ finetuned_arctic_ft/
+ checkpoints/

  # Byte-compiled / optimized / DLL files
  __pycache__/
README.md CHANGED
@@ -1 +1,111 @@
- # matrix-game-rag
+ # Matrix Wargame RAG Agent - Certification Challenge
+
+ This repository contains my work for the Certification Challenge: an application that helps users understand and design matrix wargames.
+
+ ## Introduction
+
+ This project develops an AI-powered application that answers questions about matrix wargames and supports users in designing new ones. It leverages Retrieval Augmented Generation (RAG) and agentic capabilities to provide accurate, contextually relevant information.
+
+ ## Task 1: Defining your Problem and Audience
+
+ **Problem Statement:**
+ * Users such as game designers, researchers, and hobbyists often find it difficult to quickly access specific information about matrix wargame mechanics, historical examples, and design principles, or need assistance brainstorming and structuring new game designs.
+
+ **Why this is a problem for your specific user:**
+ * While matrix games are designed to be accessible, newcomers can find it challenging to navigate the rules and gameplay without a knowledgeable facilitator or without frequently pausing to consult rulebooks. This application provides a centralized, interactive knowledge base that answers their specific questions in an accessible way.
+ * Game designers may find it time-consuming to build new games and may struggle to incorporate all the necessary details and nuances. This application's game designer tool helps them create detailed, well-structured games more efficiently.
+ * Researchers might need a tool to quickly compare and contrast different wargame designs or identify trends in the field.
+
+ **Potential questions that your user is likely to ask:**
+ * "What are the core mechanics of a matrix wargame?"
+ * "Can you give me examples of matrix wargames used for training military personnel?"
+ * "How can I design a matrix wargame to simulate [specific scenario, e.g., a cybersecurity incident]?"
+ * "What are common pitfalls to avoid when designing a matrix wargame?"
+ * "Suggest some adjudication mechanisms for a diplomatic conflict in a matrix game."
+ * "What data sources can I use to inform the design of a historical matrix wargame?"
+ * "How do I balance a matrix wargame for multiple players with different objectives?"
+
+ ## Task 2: Propose a Solution
+
+ **Proposed Solution:**
+ * I will build an agentic RAG application that allows users to ask natural language questions about matrix wargames. The system will retrieve relevant information from a curated knowledge base of documents, articles, and potentially game rulebooks. It will also leverage agentic reasoning to help users brainstorm design elements, structure game phases, and consider different mechanics.
+ * The user will interact with a simple interface (e.g., a web chat) where they can input their queries. The application will provide concise answers, cite sources, and offer follow-up suggestions or design considerations. This will save users time, provide targeted information, and act as a creative partner in the design process.
+
+ **Tooling Choices:**
+ * **LLM:** GPT-4.1-mini - Why: Newer model with a good balance between intelligence and cost
+ * **Embedding Model:** snowflake-arctic-embed-l - Why: Open source and easy to fine-tune
+ * **Orchestration:** LangChain/LangGraph - Why: Easy to connect the agent with many tools, and enables integration with the more complex multi-agent graphs I want to build in the future
+ * **Vector Database:** Qdrant - Why: Efficient similarity search and ease of integration with LangChain/LangGraph
+ * **Monitoring:** LangSmith - Why: To track performance, identify issues, and understand application usage
+ * **Evaluation:** RAGAS - Why: To build custom user personas and evaluate RAG metrics
+ * **User Interface:** Chainlit - Why: For rapid prototyping. Currently building a Next.js frontend
+ * **(Optional) Serving & Inference:** HF Spaces - Why: Ease of implementation and access
+
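The core retrieval step the vector database performs is nearest-neighbor search over embedding vectors. As a dependency-free illustration of that idea (the toy 2-d vectors and document names below are made up; the real app uses Qdrant over snowflake-arctic-embed-l embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "embedded" documents (illustrative only).
docs = {"rules": [1.0, 0.0], "history": [0.0, 1.0], "design": [0.7, 0.7]}
query = [0.9, 0.1]

# Retrieve the document whose embedding is most similar to the query.
best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # → rules
```

A real vector database does the same ranking with approximate-nearest-neighbor indexes so it scales to millions of chunks.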
+ **Agent Usage:**
+ * An agent will be used to understand complex user queries that may require multi-step reasoning or access to multiple tools (e.g., a RAG retriever for the knowledge base and a Wikipedia search tool for information on recent/historical events or broader context).
+ * The game designer tool enables the agent to design high-quality matrix games within a standardized Pydantic format (future features will allow users to save these games and submit them to be played by AI agents).
+
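A standardized Pydantic game format could look roughly like the sketch below. The model and field names here are illustrative assumptions, not the project's actual schema:

```python
# Illustrative sketch only: the README does not show the real schema,
# so the models and fields here are assumptions.
from pydantic import BaseModel, Field

class Actor(BaseModel):
    name: str
    objectives: list[str] = Field(default_factory=list)

class MatrixGame(BaseModel):
    title: str
    scenario: str
    actors: list[Actor]
    adjudication: str = "Facilitator judges each argued action's success."

game = MatrixGame(
    title="Strait Standoff",
    scenario="Two coalitions contest access to a strategic strait.",
    actors=[Actor(name="Blue", objectives=["Keep the strait open"]),
            Actor(name="Red", objectives=["Close the strait"])],
)
print(game.title)  # → Strait Standoff
```

Constraining the agent's output to a schema like this is what makes the saved games machine-readable for the planned AI-vs-AI play.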
+ ## Task 3: Dealing with the Data
+
+ **Data Sources and External APIs:**
+ * **Primary Data Source:** A collection of PDFs, articles, and web content related to matrix wargaming theory, design principles, and existing game examples.
+ * **External API:** The Wikipedia API, which lets the agent look up events or other information it might need to design a game about a specific topic.
+
+ **Default Chunking Strategy:**
+ * RecursiveCharacterTextSplitter with a chunk size of 300 characters
+ * **Why:** Smaller chunks reduce token usage while maintaining high retrieval quality (according to the tests below)
+
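The idea behind recursive character splitting can be sketched without dependencies: try coarse separators first (paragraphs, then lines, then words) and only hard-cut when nothing fits. This is a simplification of LangChain's RecursiveCharacterTextSplitter, which additionally merges adjacent pieces back up to the chunk size and supports overlap:

```python
def recursive_split(text, chunk_size=300, seps=("\n\n", "\n", " ", "")):
    """Simplified recursive character splitting (illustrative sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = seps[0], seps[1:] if len(seps) > 1 else seps
    if sep == "":
        # No separator left: hard-cut into fixed-size chunks.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Piece still too big: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
        else:
            chunks.append(piece)
    return chunks

parts = recursive_split("para one\n\npara two is a bit longer\n\npara three",
                        chunk_size=30)
print(parts)  # → ['para one', 'para two is a bit longer', 'para three']
```

The benefit over naive fixed-size cuts is that chunk boundaries tend to fall on natural units (paragraphs, sentences), which keeps retrieved contexts coherent.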
+ ## Task 4: Building a Quick End-to-End Prototype
+
+ **Deliverables:**
+ * Link to deployed prototype: [Link to Hugging Face Space or other endpoint]
+
+ ## Task 5: Creating a Golden Test Data Set
+
+ See `ragas_create_testset.ipynb`
+
+ **Deliverables:**
+ * **RAGAS Evaluation Results:**
+ | Metric | Score |
+ |-----------------------|--------|
+ | Context Recall | 0.9361 |
+ | Faithfulness | 0.9439 |
+ | Factual Correctness | 0.7825 |
+ | Answer Relevance | 0.8933 |
+ | Context Entity Recall | 0.1368 |
+ | Noise Sensitivity | 0.1838 |
+
+ * **Conclusions on Performance:**
+ * The first four scores look strong; I am not sure why the last two are so low. As discussed in class, though, these numbers do not mean much in isolation. In Task 7 below, I use them to compare performance before and after fine-tuning the embeddings.
+
+ ## Task 6: Fine-Tuning Open-Source Embeddings
+
+ See `embeddings_fine_tune.ipynb`
+
+ **Deliverables:**
+ * Link to fine-tuned embedding model on Hugging Face Hub: [Link to model]
+
+ ## Task 7: Assessing Performance
+
+ **Deliverables:**
+ * **Performance Comparison (RAGAS):**
+ | Metric | Original RAG Score | Fine-Tuned RAG Score |
+ |-----------------------|--------------------|----------------------|
+ | Context Recall | 0.9361 | 0.9681 |
+ | Faithfulness | 0.9439 | 0.9744 |
+ | Factual Correctness | 0.7825 | 0.8032 |
+ | Answer Relevance | 0.8933 | 0.9221 |
+ | Context Entity Recall | 0.1368 | 0.1456 |
+ | Noise Sensitivity | 0.1838 | 0.1580 |
+
+ * Performance improved across the board: every score rose except noise sensitivity, which fell, and lower is better for that metric.
+
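As a quick sanity check, the per-metric deltas implied by the table above can be recomputed directly:

```python
# Before/after scores copied from the RAGAS comparison table above.
before = {"context_recall": 0.9361, "faithfulness": 0.9439,
          "factual_correctness": 0.7825, "answer_relevance": 0.8933,
          "context_entity_recall": 0.1368, "noise_sensitivity": 0.1838}
after = {"context_recall": 0.9681, "faithfulness": 0.9744,
         "factual_correctness": 0.8032, "answer_relevance": 0.9221,
         "context_entity_recall": 0.1456, "noise_sensitivity": 0.1580}

deltas = {metric: round(after[metric] - before[metric], 4) for metric in before}
# Five scores rose; noise sensitivity dropped, which is the desired direction.
print(deltas)
```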
+ * **Changes for Second Half of Course:**
+ * I hope to develop this into a platform where people can play matrix games against AI agent players, or even simulate games played entirely by AI. This would make it possible to run many wargames in a short amount of time, exploring possible futures and preparing planned actions for the most likely or most important contingencies.
+
+ ## Links
+
+ * **Loom Video (Demo & Use Case - 5 mins MAX):** [Link to Loom video]
+ * **Written Document:** This README.md file serves as the written document.
+ * **Public Application Link:** [Link to final Hugging Face Space or other deployment]
+ * **Public Fine-Tuned Embedding Model Link:** [Link to Hugging Face Hub model]
embeddings_fine_tune.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
rag_test.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
ragas_create_testset.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
ragas_eval.ipynb ADDED
The diff for this file is too large to render. See raw diff