Upload 11 files
- README.md +76 -60
- aimakerspace/__init__.py +15 -0
- aimakerspace/openai_utils/__init__.py +10 -0
- aimakerspace/openai_utils/chatmodel.py +25 -0
- aimakerspace/openai_utils/embedding.py +28 -0
- aimakerspace/openai_utils/prompts.py +34 -0
- aimakerspace/text_utils.py +34 -0
- aimakerspace/vectordatabase.py +29 -0
- app.py +0 -31
- requirements.txt +1 -1
README.md
CHANGED
@@ -1,60 +1,76 @@
----
-title: RAG Implementation Notebook
-emoji: 🔍
-colorFrom: blue
-colorTo: purple
-sdk: gradio
-sdk_version:
-app_file: app.py
-pinned: false
----
-
-# RAG Implementation Notebook
+---
+title: RAG Implementation Notebook
+emoji: 🔍
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: 3.50.2
+app_file: app.py
+pinned: false
+---
+
+# RAG Implementation Notebook
+
+This space contains a Jupyter notebook demonstrating a Retrieval Augmented Generation (RAG) implementation using OpenAI's API and Hugging Face models.
+
+## Features
+
+- PDF document processing
+- Text chunking and embedding
+- Vector database implementation
+- RAG pipeline with context-aware responses
+
+## How to Use
+
+1. Clone this repository
+2. Install the requirements: `pip install -r requirements.txt`
+3. Open the notebook: `jupyter notebook Pythonic_RAG_Assignment.ipynb`
+
+## Requirements
+
+See `requirements.txt` for the complete list of dependencies.
+
+# 🧑‍💻 What is [AI Engineering](https://maven.com/aimakerspace/ai-eng-bootcamp)?
+
+AI Engineering refers to the industry-relevant skills that data science and engineering teams need to successfully **build, deploy, operate, and improve Large Language Model (LLM) applications in production environments**.
+
+In practice, this requires understanding both prototyping and production deployments.
+
+During the *prototyping* phase, Prompt Engineering, Retrieval Augmented Generation (RAG), Agents, and Fine-Tuning are all necessary tools to understand and leverage. Prototyping includes:
+
+1. Building RAG Applications
+2. Building with Agent and Multi-Agent Frameworks
+3. Fine-Tuning LLMs & Embedding Models
+4. Deploying LLM Prototype Applications to Users
+
+When *productionizing* LLM application prototypes, there are many important aspects of ensuring helpful, harmless, honest, reliable, and scalable solutions for your customers or stakeholders. Productionizing includes:
+
+1. Evaluating RAG and Agent Applications
+2. Improving Search and Retrieval Pipelines for Production
+3. Monitoring Production KPIs for LLM Applications
+4. Setting up Inference Servers for LLMs and Embedding Models
+5. Building LLM Applications with Scalable, Production-Grade Components
+
+This bootcamp builds on our two previous courses, [LLM Engineering](https://maven.com/aimakerspace/llm-engineering) and [LLM Operations](https://maven.com/aimakerspace/llmops) 👇
+
+- Large Language Model Engineering (LLM Engineering) refers to the emerging best practices and tools for pretraining, post-training, and optimizing LLMs prior to production deployment. Pre- and post-training techniques include unsupervised pretraining, supervised fine-tuning, alignment, model merging, distillation, quantization, and others.
+
+- Large Language Model Ops (LLM Ops, or LLMOps, as used by [WandB](https://docs.wandb.ai/guides/prompts) and [a16z](https://a16z.com/emerging-architectures-for-llm-applications/)) refers to the emerging best practices, tooling, and improvement processes used to manage production LLM applications throughout the AI product lifecycle. LLM Ops is a subset of Machine Learning Operations (MLOps) that focuses on the LLM-specific infrastructure and ops capabilities required to build, deploy, monitor, and scale complex LLM applications in production environments. _This term is used much less in industry these days._
+
+# 🏆 **Grading and Certification**
+
+To become **AI-Makerspace Certified**, which will open you up to additional opportunities for full and part-time work within our community and network, you must:
+
+1. Complete all project assignments.
+2. Complete a project and present during Demo Day.
+3. Receive at least an 85% total grade in the course.
+
+If you do not complete all assignments, present during Demo Day, or maintain a high-quality standard of work, you may still be eligible for a *certificate of completion*, provided you miss no more than 2 live sessions.
+
+# 📚 About
+
+This GitHub repository is your gateway to mastering the art of AI Engineering. ***All assignments for the course will be released here for your building, shipping, and sharing adventures!***
+
+# 🙏 Contributions
+
+We believe in the power of collaboration. Contributions, ideas, and feedback are highly encouraged! Let's build the ultimate resource for AI Engineering together.
+
+Please reach out with any questions or suggestions.
+
+Happy coding! 🚀🚀🚀
aimakerspace/__init__.py
ADDED
@@ -0,0 +1,15 @@
+from .text_utils import PDFLoader, CharacterTextSplitter
+from .vectordatabase import VectorDatabase
+from .openai_utils.prompts import SystemRolePrompt, UserRolePrompt
+from .openai_utils.chatmodel import ChatOpenAI
+from .openai_utils.embedding import EmbeddingModel
+
+__all__ = [
+    'PDFLoader',
+    'CharacterTextSplitter',
+    'VectorDatabase',
+    'SystemRolePrompt',
+    'UserRolePrompt',
+    'ChatOpenAI',
+    'EmbeddingModel'
+]
aimakerspace/openai_utils/__init__.py
ADDED
@@ -0,0 +1,10 @@
+from .prompts import SystemRolePrompt, UserRolePrompt
+from .chatmodel import ChatOpenAI
+from .embedding import EmbeddingModel
+
+__all__ = [
+    'SystemRolePrompt',
+    'UserRolePrompt',
+    'ChatOpenAI',
+    'EmbeddingModel'
+]
aimakerspace/openai_utils/chatmodel.py
ADDED
@@ -0,0 +1,25 @@
+import os
+from typing import List, Dict, Union
+
+from openai import OpenAI
+
+class ChatOpenAI:
+    def __init__(self, model_name: str = "gpt-4"):
+        self.model_name = model_name
+        self.openai_api_key = os.getenv("OPENAI_API_KEY")
+        if self.openai_api_key is None:
+            raise ValueError("OPENAI_API_KEY is not set")
+        # requirements.txt pins openai>=1.0.0, which uses a client object;
+        # the old module-level openai.ChatCompletion.create was removed in 1.0
+        self.client = OpenAI(api_key=self.openai_api_key)
+
+    def run(self, messages: List[Dict[str, str]], text_only: bool = True) -> Union[str, Dict]:
+        if not isinstance(messages, list):
+            raise ValueError("messages must be a list")
+
+        response = self.client.chat.completions.create(
+            model=self.model_name,
+            messages=messages,
+        )
+
+        if text_only:
+            return response.choices[0].message.content
+        return response
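As a quick illustration of the input contract, `ChatOpenAI.run` takes the standard chat-completions message list: role/content dicts. The system prompt and question below are made-up examples, not from this repo.

```python
# Illustrative payload for ChatOpenAI.run: a list of role/content dicts.
messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer only from the provided context."},
    {"role": "user", "content": "Context: RAG retrieves relevant chunks.\n\nQuestion: What does RAG retrieve?"},
]

# Every message carries exactly these two keys, with a valid role.
assert all(set(m) == {"role", "content"} for m in messages)
assert all(m["role"] in {"system", "user", "assistant"} for m in messages)
```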
aimakerspace/openai_utils/embedding.py
ADDED
@@ -0,0 +1,28 @@
+import os
+from typing import List
+
+import numpy as np
+from openai import OpenAI, AsyncOpenAI
+
+class EmbeddingModel:
+    def __init__(self, model_name: str = "text-embedding-3-small"):
+        self.model_name = model_name
+        self.openai_api_key = os.getenv("OPENAI_API_KEY")
+        if self.openai_api_key is None:
+            raise ValueError("OPENAI_API_KEY is not set")
+        # openai>=1.0.0 client objects; openai.Embedding.create/acreate was removed in 1.0
+        self.client = OpenAI(api_key=self.openai_api_key)
+        self.async_client = AsyncOpenAI(api_key=self.openai_api_key)
+
+    def get_embedding(self, text: str) -> np.ndarray:
+        response = self.client.embeddings.create(
+            model=self.model_name,
+            input=text,
+        )
+        return np.array(response.data[0].embedding)
+
+    async def async_get_embeddings(self, list_of_text: List[str]) -> List[np.ndarray]:
+        response = await self.async_client.embeddings.create(
+            model=self.model_name,
+            input=list_of_text,
+        )
+        return [np.array(item.embedding) for item in response.data]
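The async path is typically driven with `asyncio.run`. A minimal sketch of that call pattern, with a stand-in coroutine instead of the real OpenAI call so it runs without an API key — `fake_get_embeddings` is hypothetical:

```python
import asyncio

import numpy as np

# Stand-in with the same shape as EmbeddingModel.async_get_embeddings:
# one vector per input text (here, a dummy 3-dim vector of the text length).
async def fake_get_embeddings(list_of_text):
    return [np.full(3, float(len(t))) for t in list_of_text]

vectors = asyncio.run(fake_get_embeddings(["short text", "a somewhat longer text"]))
assert len(vectors) == 2 and vectors[0].shape == (3,)
```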
aimakerspace/openai_utils/prompts.py
ADDED
@@ -0,0 +1,34 @@
+import re
+from typing import Dict
+
+class BasePrompt:
+    def __init__(self, prompt: str):
+        self.prompt = prompt
+        self._pattern = re.compile(r"\{([^}]+)\}")
+
+    def format_prompt(self, **kwargs) -> str:
+        matches = self._pattern.findall(self.prompt)
+        return self.prompt.format(**{match: kwargs.get(match, "") for match in matches})
+
+    def get_input_variables(self) -> list:
+        return self._pattern.findall(self.prompt)
+
+class RolePrompt(BasePrompt):
+    def __init__(self, prompt: str, role: str):
+        super().__init__(prompt)
+        self.role = role
+
+    def create_message(self, **kwargs) -> Dict[str, str]:
+        return {"role": self.role, "content": self.format_prompt(**kwargs)}
+
+class SystemRolePrompt(RolePrompt):
+    def __init__(self, prompt: str):
+        super().__init__(prompt, "system")
+
+class UserRolePrompt(RolePrompt):
+    def __init__(self, prompt: str):
+        super().__init__(prompt, "user")
+
+class AssistantRolePrompt(RolePrompt):
+    def __init__(self, prompt: str):
+        super().__init__(prompt, "assistant")
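The template mechanics above can be sketched standalone: the `\{([^}]+)\}` pattern pulls out the input variables, and missing keyword arguments default to an empty string. The template text is a made-up example.

```python
import re

# Same pattern BasePrompt compiles: capture anything between braces.
pattern = re.compile(r"\{([^}]+)\}")
prompt = "Context: {context}\n\nQuestion: {question}"

variables = pattern.findall(prompt)
# Missing kwargs fall back to "", mirroring format_prompt's kwargs.get(match, "")
kwargs = {"question": "What is RAG?"}
filled = prompt.format(**{v: kwargs.get(v, "") for v in variables})

assert variables == ["context", "question"]
assert filled == "Context: \n\nQuestion: What is RAG?"
```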
aimakerspace/text_utils.py
ADDED
@@ -0,0 +1,34 @@
+from typing import List
+
+import PyPDF2
+
+class PDFLoader:
+    def __init__(self, path: str):
+        self.path = path
+
+    def load_documents(self) -> List[str]:
+        documents = []
+        with open(self.path, 'rb') as file:
+            pdf_reader = PyPDF2.PdfReader(file)
+            for page in pdf_reader.pages:
+                documents.append(page.extract_text())
+        return documents
+
+class CharacterTextSplitter:
+    def __init__(self, chunk_size: int = 1500, chunk_overlap: int = 300):
+        self.chunk_size = chunk_size
+        self.chunk_overlap = chunk_overlap
+
+    def split_texts(self, texts: List[str]) -> List[str]:
+        split_texts = []
+        for text in texts:
+            split_texts.extend(self._split_text(text))
+        return split_texts
+
+    def _split_text(self, text: str) -> List[str]:
+        chunks = []
+        start = 0
+        while start < len(text):
+            end = min(start + self.chunk_size, len(text))
+            chunks.append(text[start:end])
+            if end == len(text):
+                break  # without this, start = end - overlap re-emits the final chunk forever
+            start = end - self.chunk_overlap
+        return chunks
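The overlap arithmetic is easiest to see on a toy string: each chunk restarts `chunk_overlap` characters before the previous one ended, so consecutive chunks share that many characters. A minimal sketch of the same sliding-window logic, with tiny sizes chosen for illustration:

```python
def split_text(text, chunk_size=10, chunk_overlap=3):
    # Sliding window: step forward by chunk_size - chunk_overlap each iteration,
    # stopping once the final chunk has been emitted.
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - chunk_overlap
    return chunks

chunks = split_text("abcdefghijklmnopqrst", chunk_size=10, chunk_overlap=3)
# Adjacent chunks share 3 characters of overlap.
assert chunks == ["abcdefghij", "hijklmnopq", "opqrst"]
assert chunks[0][-3:] == chunks[1][:3]
```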
aimakerspace/vectordatabase.py
ADDED
@@ -0,0 +1,29 @@
+import numpy as np
+from typing import List, Tuple, Dict
+from .openai_utils.embedding import EmbeddingModel
+
+class VectorDatabase:
+    def __init__(self, embedding_model: EmbeddingModel = None):
+        self.vectors: Dict[str, np.ndarray] = {}
+        self.texts: List[str] = []
+        self.embedding_model = embedding_model or EmbeddingModel()
+
+    async def abuild_from_list(self, list_of_text: List[str]) -> 'VectorDatabase':
+        embeddings = await self.embedding_model.async_get_embeddings(list_of_text)
+        for text, embedding in zip(list_of_text, embeddings):
+            self.insert(text, np.array(embedding))
+        return self
+
+    def insert(self, text: str, vector: np.ndarray):
+        self.texts.append(text)
+        self.vectors[text] = vector
+
+    def search_by_text(self, query: str, k: int = 4) -> List[Tuple[str, float]]:
+        query_embedding = self.embedding_model.get_embedding(query)
+        similarities = []
+
+        for text, vector in self.vectors.items():
+            # cosine similarity between the query and each stored vector
+            similarity = np.dot(query_embedding, vector) / (np.linalg.norm(query_embedding) * np.linalg.norm(vector))
+            similarities.append((text, similarity))
+
+        return sorted(similarities, key=lambda x: x[1], reverse=True)[:k]
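The cosine ranking in `search_by_text` can be demonstrated offline with toy 3-dim vectors standing in for real embeddings — all texts and numbers below are made up:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity, as in VectorDatabase.search_by_text
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": nearby topics point in similar directions.
vectors = {
    "dogs are loyal": np.array([1.0, 0.2, 0.0]),
    "cats are aloof": np.array([0.8, 0.3, 0.2]),
    "stocks fell today": np.array([0.0, 0.1, 1.0]),
}
query = np.array([1.0, 0.15, 0.05])  # a pet-like query vector

# Rank all stored texts by similarity and keep the top k=2.
ranked = sorted(((t, cosine(query, v)) for t, v in vectors.items()),
                key=lambda x: x[1], reverse=True)[:2]
# The off-topic finance text falls outside the top 2.
assert "stocks fell today" not in [t for t, _ in ranked]
```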
app.py
CHANGED
@@ -7,37 +7,6 @@ from aimakerspace.openai_utils.chatmodel import ChatOpenAI
 from aimakerspace.openai_utils.embedding import EmbeddingModel
 import asyncio
 
-def load_notebook():
-    notebook_path = "Pythonic_RAG_Assignment.ipynb"
-    if os.path.exists(notebook_path):
-        with open(notebook_path, "r", encoding="utf-8") as f:
-            return f.read()
-    return "Notebook not found"
-
-with gr.Blocks() as demo:
-    gr.Markdown("# RAG Implementation Notebook")
-    gr.Markdown("This space contains a Jupyter notebook demonstrating a Retrieval Augmented Generation (RAG) implementation.")
-
-    with gr.Tabs():
-        with gr.TabItem("Notebook Preview"):
-            notebook_content = gr.Markdown(load_notebook())
-
-        with gr.TabItem("About"):
-            gr.Markdown("""
-            ## About This Space
-
-            This space contains a Jupyter notebook that demonstrates:
-            - PDF document processing
-            - Text chunking and embedding
-            - Vector database implementation
-            - RAG pipeline with context-aware responses
-
-            To run the notebook locally:
-            1. Clone this repository
-            2. Install requirements: `pip install -r requirements.txt`
-            3. Run: `jupyter notebook Pythonic_RAG_Assignment.ipynb`
-            """)
-
 # Initialize the RAG pipeline
 def initialize_rag():
     # Load the PDF
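The retrieval-then-prompt flow that `initialize_rag` sets up can be sketched offline: pre-scored chunks stand in for the vector database, and the LLM call is omitted. All strings and scores below are made up for illustration.

```python
# Hypothetical chunk -> similarity scores, as search_by_text would return them.
chunks = {
    "RAG augments an LLM with retrieved context.": 0.92,
    "The vector database ranks chunks by cosine similarity.": 0.85,
    "Unrelated text about cooking.": 0.10,
}

# Keep the top-2 chunks and fold them into the user prompt.
top = [t for t, s in sorted(chunks.items(), key=lambda x: x[1], reverse=True)[:2]]
context = "\n".join(top)
prompt = f"Context:\n{context}\n\nQuestion: What is RAG?"

# The low-scoring chunk never reaches the model.
assert "cooking" not in prompt
```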
requirements.txt
CHANGED
@@ -10,5 +10,5 @@ datasets
 huggingface_hub
 openai>=1.0.0
 python-dotenv>=1.0.0
-
+PyPDF2>=3.0.0
 asyncio>=3.4.3