Spaces: angelesteban00/angelesteban00_hg

angelesteban00 committed on Jan 11, 2024

Commit 8b091a4

1 Parent(s): 168d589

.

Files changed (3)

app.py +2 -2
load_data_from_PDF.py +34 -0
requirements.txt +2 -0

app.py CHANGED

@@ -21,11 +21,11 @@ Demo based on https://www.mongodb.com/developer/products/atlas/rag-atlas-vector-
 ## Prerequisites:
  create a free DB called "langchain_demo" and a collection called "collection_of_text_blobs" in MongoDB Atlas (https://cloud.mongodb.com). After that, you have two options:
- **option1**) execute locally "load_data.py" to create new documents and their embeddings in MongoDB<br>
  **option2**) import the JSON file "langchain_demo.collection_of_text_blobs.json"
 ## Dataset
-The JSON documents in MongoDB looks like:
 ```
 {
   "_id": {

 ## Prerequisites:
  create a free DB called "langchain_demo" and a collection called "collection_of_text_blobs" in MongoDB Atlas (https://cloud.mongodb.com). After that, you have two options:
+ **option1**) execute locally "load_data.py"/"load_data_from_PDF.py" to create new documents and their embeddings in MongoDB<br>
  **option2**) import the JSON file "langchain_demo.collection_of_text_blobs.json"
 ## Dataset
+The JSON documents in MongoDB looks like (also was splitted and embebed this PDF https://arxiv.org/pdf/2303.08774.pdf):
 ```
 {
   "_id": {

load_data_from_PDF.py ADDED

@@ -0,0 +1,34 @@

+from pymongo import MongoClient
+# error since Jan 2024, from langchain.embeddings.openai import OpenAIEmbeddings
+from langchain_openai import OpenAIEmbeddings
+# error since Jan 2024, from langchain.vectorstores import MongoDBAtlasVectorSearch
+from langchain_community.vectorstores import MongoDBAtlasVectorSearch
+# error since Jan 2024, from langchain.document_loaders import PyPDFLoader
+from langchain_community.document_loaders import PyPDFLoader
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+import os
+mongo_uri = os.getenv("MONGO_URI")
+openai_api_key = os.getenv("OPENAI_API_KEY")
+client = MongoClient(mongo_uri)
+dbName = "langchain_demo"
+collectionName = "collection_of_text_blobs"
+collection = client[dbName][collectionName]
+#loader = DirectoryLoader( './sample_files', glob="./*.txt", show_progress=True)
+loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
+data = loader.load()
+text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
+docs = text_splitter.split_documents(data)
+#embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
+#vectorStore = MongoDBAtlasVectorSearch.from_documents( data, embeddings, collection=collection, index_name="default" )
+# insert the documents in MongoDB Atlas Vector Search
+x = MongoDBAtlasVectorSearch.from_documents(
+    documents=docs,
+    embedding=OpenAIEmbeddings(openai_api_key=openai_api_key, disallowed_special=()),
+    collection=collection,
+    index_name="default"
+    )
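
Review note: the script reads `MONGO_URI` and `OPENAI_API_KEY` with `os.getenv`, which returns `None` when a variable is unset, so a missing setting would likely surface later as a less obvious connection or authentication error. The following is a minimal sketch of a fail-fast check, not part of this commit; the helper name `require_env` is hypothetical and only the standard library is used.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper (not in the commit): raises RuntimeError when the
    variable is unset or empty, instead of silently returning None.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Under this assumption, the two lookups in the script could become `mongo_uri = require_env("MONGO_URI")` and `openai_api_key = require_env("OPENAI_API_KEY")`, so the script stops before any PDF is downloaded or any embedding request is made.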

requirements.txt CHANGED

@@ -1,4 +1,6 @@
 langchain
 langchain-openai
 pymongo[srv]==4.1.1
 bs4

 langchain
+pypdf
+python-dotenv
 langchain-openai
 pymongo[srv]==4.1.1
 bs4