Commit · e1ee8e8
Parent(s): f16141d
add notebooks for creating testset, evaluation, and embedding fine-tuning and write README
- .gitignore +3 -0
- README.md +111 -1
- embeddings_fine_tune.ipynb +0 -0
- rag_test.ipynb +0 -0
- ragas_create_testset.ipynb +0 -0
- ragas_eval.ipynb +0 -0
.gitignore
CHANGED
@@ -3,6 +3,9 @@ docs/
 .cursor/
 .chainlit/
 .files/
+wandb/
+finetuned_arctic_ft/
+checkpoints/
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
README.md
CHANGED
@@ -1 +1,111 @@
-#
+# Matrix Wargame RAG Agent - Certification Challenge
+
+This repository contains the work for the Certification Challenge, focusing on an application to assist with understanding and designing matrix wargames.
+
+## Introduction
+
+This project aims to develop an AI-powered application that helps users answer questions about matrix wargames and supports them in the design process of new matrix wargames. It leverages Retrieval Augmented Generation (RAG) and agentic capabilities to provide accurate and contextually relevant information.
+
+## Task 1: Defining your Problem and Audience
+
+**Problem Statement:**
+* Users, such as game designers, researchers, and hobbyists, often find it challenging to quickly access specific information about matrix wargame mechanics, historical examples, and design principles, or need assistance in brainstorming and structuring new game designs.
+
+**Why this is a problem for your specific user:**
+* While matrix games are designed to be accessible, newcomers can find it challenging to navigate the rules and gameplay without a knowledgeable facilitator or by frequently pausing to consult rulebooks. This application provides a centralized, interactive knowledge base to answer their specific questions in an accessible way.
+* Game designers may find it time-consuming to build new games and struggle to incorporate all the necessary details and nuances. This application's game designer tool can help them create detailed and well-structured games more efficiently.
+* Researchers might need a tool to quickly compare and contrast different wargame designs or identify trends in the field.
+
+**Potential questions that your user is likely to ask:**
+* "What are the core mechanics of a matrix wargame?"
+* "Can you give me examples of matrix wargames used for training military personnel?"
+* "How can I design a matrix wargame to simulate [specific scenario, e.g., a cybersecurity incident]?"
+* "What are common pitfalls to avoid when designing a matrix wargame?"
+* "Suggest some adjudication mechanisms for a diplomatic conflict in a matrix game."
+* "What data sources can I use to inform the design of a historical matrix wargame?"
+* "How do I balance a matrix wargame for multiple players with different objectives?"
+
+## Task 2: Propose a Solution
+
+**Proposed Solution:**
+* I will build an agentic RAG application that allows users to ask natural language questions about matrix wargames. The system will retrieve relevant information from a curated knowledge base of documents, articles, and potentially game rulebooks. It will also leverage agentic reasoning to help users brainstorm design elements, structure game phases, and consider different mechanics.
+* The user will interact with a simple interface (e.g., a web chat) where they can input their queries. The application will provide concise answers, cite sources, and offer follow-up suggestions or design considerations. This will save users time, provide targeted information, and act as a creative partner in the design process.
+
+**Tooling Choices:**
+* **LLM:** GPT-4.1 mini - Why: Newer model with a good balance between intelligence and cost
+* **Embedding Model:** snowflake-arctic-embed-l - Why: Open source and easy to fine-tune
+* **Orchestration:** LangChain/LangGraph - Why: Easy to connect the agent with many tools. Enables easy integration with more complex multi-agent graphs that I want to build in the future.
+* **Vector Database:** Qdrant - Why: Efficient similarity search and ease of integration with LangChain/LangGraph
+* **Monitoring:** LangSmith - Why: To track performance, identify issues, and understand application usage
+* **Evaluation:** RAGAS - Why: To build custom personas of users and evaluate RAG metrics
+* **User Interface:** Chainlit - Why: For rapid prototyping. Currently building a Next.js frontend
+* **(Optional) Serving & Inference:** HF Spaces - Why: Ease of implementation and access
+
+**Agent Usage:**
+* An agent will be used to understand complex user queries that may require multi-step reasoning or access to multiple tools (e.g., a RAG retriever for the knowledge base and a Wikipedia search tool for information on recent/historical events or broader context).
+* The game designer tool enables the agent to design high-quality matrix games within a standardized Pydantic format (future features will allow users to save these games and submit them to be played by AI agents).
+
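To make the idea of a standardized game format concrete, here is a minimal sketch of what such a schema could look like. It is not the schema used in this repository: every class and field name (`Actor`, `MatrixGame`, `objectives`, `num_turns`) is hypothetical, and standard-library dataclasses stand in for Pydantic so the example runs without extra dependencies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Actor:
    """One player or faction in the game (hypothetical field names)."""
    name: str
    objectives: List[str]

    def __post_init__(self) -> None:
        # Reject obviously incomplete actors at construction time.
        if not self.name.strip():
            raise ValueError("actor name must not be empty")
        if not self.objectives:
            raise ValueError(f"actor {self.name!r} needs at least one objective")


@dataclass
class MatrixGame:
    """A structured game design the agent could be asked to fill in (assumed shape)."""
    title: str
    scenario: str
    actors: List[Actor]
    num_turns: int = 6  # arbitrary illustrative default

    def __post_init__(self) -> None:
        if len(self.actors) < 2:
            raise ValueError("a matrix game needs at least two actors")
        if self.num_turns < 1:
            raise ValueError("num_turns must be positive")


# Example: a two-actor design passes validation.
example_game = MatrixGame(
    title="Port Outage",
    scenario="A cybersecurity incident disrupts a regional shipping hub.",
    actors=[
        Actor("Port Authority", ["Restore operations"]),
        Actor("Attacker", ["Prolong the outage"]),
    ],
)
```

Constraining the agent's output to a validated structure like this is what would make it practical to save designs or hand them to other agents later; a real Pydantic model would add type coercion and JSON schema export on top of the checks shown here.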
+## Task 3: Dealing with the Data
+
+**Data Sources and External APIs:**
+* **Primary Data Source:** A collection of PDFs, articles, and web content related to matrix wargaming theory, design principles, and existing game examples.
+* **External API:** Wikipedia API to allow the agent to look up information on events or other information it might need to design a game about a specific topic.
+
+**Default Chunking Strategy:**
+* RecursiveCharacterTextSplitter with a chunk size of 300 characters
+* **Why:** Smaller chunks reduce token usage while maintaining high functionality (according to the tests below)
+
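The chunking strategy above can be illustrated with a simplified, dependency-free sketch of recursive character splitting. This is not LangChain's implementation: the function name `recursive_split` is made up for illustration, chunk overlap is omitted (the README does not state an overlap value), and only the 300-character limit comes from the text above. The separator order mirrors LangChain's documented defaults.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int = 300,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators (paragraphs) are tried first; a piece that is still
    too long is split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Matrix games use structured arguments. " * 30
sample_chunks = recursive_split(sample, chunk_size=300)
```

Splitting on paragraph and line boundaries before falling back to spaces keeps each 300-character chunk as semantically intact as the limit allows, which is the usual reason to prefer this approach over fixed-width slicing.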
+## Task 4: Building a Quick End-to-End Prototype
+
+**Deliverables:**
+* Link to deployed prototype: [Link to Hugging Face Space or other endpoint]
+
+## Task 5: Creating a Golden Test Data Set
+
+See `ragas_create_testset.ipynb`
+
+**Deliverables:**
+* **RAGAS Evaluation Results:**
+| Metric | Score |
+|--------------------|-------|
+| Context Recall | 0.9361 |
+| Faithfulness | 0.9439 |
+| Factual Correctness | 0.7825 |
+| Answer Relevance | 0.8933 |
+| Context Entity Recall | 0.1368 |
+| Noise Sensitivity | 0.1838 |
+
+* **Conclusions on Performance:**
+* The first four scores look good. Noise sensitivity is a lower-is-better metric in RAGAS, so its low score is also a good result; I am not sure why context entity recall is so low. As discussed in class, though, these numbers don't have much meaning in isolation. Below, in Task 7, I use them to compare performance before and after fine-tuning the embeddings.
+
+## Task 6: Fine-Tuning Open-Source Embeddings
+
+See `embeddings_fine_tune.ipynb`
+
+**Deliverables:**
+* Link to fine-tuned embedding model on Hugging Face Hub: [Link to model]
+
+## Task 7: Assessing Performance
+
+**Deliverables:**
+* **Performance Comparison (RAGAS):**
+| Metric | Original RAG Score | Fine-Tuned RAG Score |
+|--------------------|--------------------|------------------------|
+| Context Recall | 0.9361 | 0.9681 |
+| Faithfulness | 0.9439 | 0.9744 |
+| Factual Correctness | 0.7825 | 0.8032 |
+| Answer Relevance | 0.8933 | 0.9221 |
+| Context Entity Recall | 0.1368 | 0.1456 |
+| Noise Sensitivity | 0.1838 | 0.1580 |
+
+* Performance improved on all six metrics, including noise sensitivity, which is lower-is-better in RAGAS and dropped from 0.1838 to 0.1580.
+
+* **Changes for Second Half of Course:**
+* I hope to develop this into a platform that people can use to play matrix games against AI agent players, or even simulate games played entirely by AI. This will allow for simulating many wargames in a short amount of time to explore possible futures and prepare planned actions to take in the most likely or most important contingencies.
+
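The before/after comparison in the Task 7 table can be checked with a few lines of arithmetic. The sketch below uses only the scores reported in that table; the helper name `compare` and the `LOWER_IS_BETTER` set are illustrative choices, with noise sensitivity treated as the one metric where a decrease counts as an improvement.

```python
from typing import Dict, List, Tuple

# (original score, fine-tuned score), copied from the Task 7 table.
SCORES: Dict[str, Tuple[float, float]] = {
    "Context Recall": (0.9361, 0.9681),
    "Faithfulness": (0.9439, 0.9744),
    "Factual Correctness": (0.7825, 0.8032),
    "Answer Relevance": (0.8933, 0.9221),
    "Context Entity Recall": (0.1368, 0.1456),
    "Noise Sensitivity": (0.1838, 0.1580),
}

# Metrics where a smaller value means better behavior.
LOWER_IS_BETTER = {"Noise Sensitivity"}


def compare(scores: Dict[str, Tuple[float, float]]) -> List[Tuple[str, float, float, bool]]:
    """Return (metric, absolute change, relative change in %, improved?) per metric."""
    rows = []
    for metric, (before, after) in scores.items():
        delta = after - before
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        rows.append((metric, round(delta, 4), round(100 * delta / before, 1), improved))
    return rows


for metric, delta, pct, improved in compare(SCORES):
    print(f"{metric:<22} {delta:+.4f} ({pct:+.1f}%) improved={improved}")
```

Tracking the direction of each metric explicitly avoids misreading a drop in a lower-is-better score as a regression. The changes are small, so they are best read as a directional signal rather than a precise measurement.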
+## Links
+
+* **Loom Video (Demo & Use Case - 5 mins MAX):** [Link to Loom video]
+* **Written Document:** This README.md file serves as the written document.
+* **Public Application Link:** [Link to final Hugging Face Space or other deployment]
+* **Public Fine-Tuned Embedding Model Link:** [Link to Hugging Face Hub model]
embeddings_fine_tune.ipynb
ADDED
The diff for this file is too large to render.
rag_test.ipynb
DELETED
The diff for this file is too large to render.
ragas_create_testset.ipynb
ADDED
The diff for this file is too large to render.
ragas_eval.ipynb
ADDED
The diff for this file is too large to render.