Yaser Abdelaziz committed on May 2, 2024

Commit

b5acc22

1 Parent(s): c3adec4

Initial commit


Files changed (16)

.gitignore +163 -0
README.md +32 -13
app.py +58 -0
create_vector_store.py +21 -0
data/Actual Budget Report 2022.pdf +0 -0
data/Press Release - 2022 Results (Stock Market).pdf +0 -0
eval/eval_results.xlsx +0 -0
eval/eval_set.csv +8 -0
eval/eval_set_results.csv +18 -0
evaluate.py +48 -0
ingest.py +20 -0
prompts.py +45 -0
rag.py +53 -0
requirements.txt +3 -0
text_splitter.py +36 -0
utils.py +52 -0

.gitignore ADDED

	@@ -0,0 +1,163 @@

+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+# C extensions
+*.so
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+# Translations
+*.mo
+*.pot
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+# Flask stuff:
+instance/
+.webassets-cache
+# Scrapy stuff:
+.scrapy
+# Sphinx documentation
+docs/_build/
+# PyBuilder
+.pybuilder/
+target/
+# Jupyter Notebook
+.ipynb_checkpoints
+# IPython
+profile_default/
+ipython_config.py
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/#use-with-ide
+.pdm.toml
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+# SageMath parsed files
+*.sage.py
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+# Spyder project settings
+.spyderproject
+.spyproject
+# Rope project settings
+.ropeproject
+# mkdocs documentation
+/site
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+# Pyre type checker
+.pyre/
+# pytype static type analyzer
+.pytype/
+# Cython debug symbols
+cython_debug/
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+# Custom
+montaai/

README.md CHANGED

@@ -1,13 +1,32 @@
----
-title: AraRAG
-emoji: 📈
-colorFrom: green
-colorTo: green
-sdk: gradio
-sdk_version: 4.28.3
-app_file: app.py
-pinned: false
-license: apache-2.0
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# AraRAG
+This project is a chatbot application that can process Arabic PDF files. It uses retrieval-augmented generation (RAG) to answer in Arabic and Gradio for the user interface. A separate script evaluates the chatbot's answers using the Cohere API.
+## Getting Started
+These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
+### Prerequisites
+The project requires the following Python packages:
+- gradio
+- PyMuPDF
+- cohere
+You can install these packages using pip:
+```bash
+pip install -r requirements.txt
+```
+### Environment Variables
+The project uses environment variables to manage sensitive information. Make sure to set these variables in your environment or in a .env file:
+- `SPELLBOOK_BASE_URL`: The base URL for the Spellbook API.
+- `SPELLBOOK_API_KEY`: Your Spellbook API key.
+- `VECTOR_STORE_NAME`: The name of the vector store to be created.
+- `OPENAI_API_KEY`: Your OpenAI API key for the RAG model.
+- `COHERE_API_KEY`: Your Cohere API key for evaluating the chatbot's performance.
+## Creating a Vector Store
+Before running the chatbot, you need to create a vector store using the `create_vector_store.py` script. This script uses the Spellbook API to create a vector store with the name specified in the `VECTOR_STORE_NAME` environment variable.
+## Running the Chatbot
+You can run the chatbot using the `app.py` script. This script uses Gradio to create a chatbot interface. The chatbot can accept text input or a PDF file. If a PDF file is uploaded, the script will extract the text from the PDF and use it as input to the RAG model.
+## Evaluating the Chatbot
+You can evaluate the chatbot's performance using the `evaluate.py` script. This script uses the Cohere API to evaluate the correctness of the chatbot's answers. It reads a CSV file with a set of questions and their correct answers, uses the chatbot to answer the questions, and then judges each answer against the expected one.
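
The environment variables listed in the README can be checked before anything else runs. The following is a minimal sketch, not part of the commit: `REQUIRED_VARS` and `missing_env_vars` are hypothetical names, and the list is simply copied from the README, so trim it to the services you actually use.

```python
import os

# Variable names copied from the README; adjust to your deployment.
REQUIRED_VARS = [
    "SPELLBOOK_BASE_URL",
    "SPELLBOOK_API_KEY",
    "VECTOR_STORE_NAME",
    "OPENAI_API_KEY",
    "COHERE_API_KEY",
]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env_vars({"COHERE_API_KEY": "demo"}))
```

Calling `missing_env_vars()` with no arguments inspects `os.environ`, so it could be run right after `load_dotenv(override=True)` to fail early with a readable message.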

app.py ADDED

	@@ -0,0 +1,58 @@

+import fitz
+import gradio as gr
+from dotenv import load_dotenv
+load_dotenv(override=True)
+from rag import RAG
+from utils import format_page
+def add_message(history, message):
+    if message["files"] is not None and len(message["files"]) > 0:
+        file_path = message["files"][0]
+        history.append([file_path, ""])
+    else:
+        if message["text"] is not None:
+            history.append([message["text"], None])
+    return history
+def bot(history, message, pdf_file_content):
+    rag = RAG()
+    if message["files"] is not None and len(message["files"]) > 0:
+        file_path = message["files"][0]
+        if file_path.endswith(".pdf"):
+            with fitz.open(file_path) as doc:
+                history[-1][1] = ""
+                for page in doc:
+                    cont, page_content = format_page(page)
+                    if cont:
+                        continue
+                    pdf_file_content += "\n\n" + page_content
+                    history[-1][1] += "\n\n" + page_content
+                    yield history, gr.MultimodalTextbox(value=None, interactive=False), pdf_file_content
+                rag.ingest_text(pdf_file_content)
+    else:
+        for answer in rag(message["text"], stream=True):
+            history[-1][1] = answer
+            yield history, gr.MultimodalTextbox(value=None, interactive=False), pdf_file_content
+with gr.Blocks() as demo:
+    pdf_file_content = gr.State("")
+    chatbot = gr.Chatbot(
+        [],
+        elem_id="chatbot",
+        bubble_full_width=False
+    )
+    chat_input = gr.MultimodalTextbox(interactive=True, file_types=[".pdf"], placeholder="Enter message or upload file...", show_label=False)
+    chat_msg = chat_input.submit(add_message, [chatbot, chat_input], chatbot)
+    bot_msg = chat_msg.then(bot, [chatbot, chat_input, pdf_file_content], [chatbot, chat_input, pdf_file_content], api_name="bot_response")
+    bot_msg.then(lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input])
+demo.queue()
+demo.launch()

create_vector_store.py ADDED

	@@ -0,0 +1,21 @@

+import json
+import os
+from dotenv import load_dotenv
+load_dotenv(override=True)
+from utils import call_spellbook_api
+def main():
+    vector_store_name = os.environ.get("VECTOR_STORE_NAME")
+    payload = {"name": vector_store_name}
+    response = call_spellbook_api(endpoint="api/v1/vector-stores", payload=payload)
+    print(json.dumps(response, indent=2))
+if __name__ == "__main__":
+    main()

data/Actual Budget Report 2022.pdf ADDED

Binary file (827 kB).

data/Press Release - 2022 Results (Stock Market).pdf ADDED

Binary file (518 kB).

eval/eval_results.xlsx ADDED

Binary file (8.61 kB).

eval/eval_set.csv ADDED

	@@ -0,0 +1,8 @@

+question,answer
+ما هو تقرير وزارة المالية للأداء؟, تقوم وزارة المالية بإعداد تقارير دورية عن الأداء المالي للميزانية العامة للدولة، وتعزيز الشفافية والإفصاح عن السياسات والمبادرات التي تنفذها. وتطوير مستوى الإفصاح عن المبادرات والسياسات التي تقوم بتنفيذها. وتعد تقارير أداء ربع سنوية ونصف سنوية، بالإضافة إلى بيان الميزانية السنوي والأولي.
+اعطيني نبذة عن مجموعة تداول ولخص لي اهم مؤشرات الاداء المالي في 2021,هو تقرير يصدر عن وزارة المالية السعودية، ويكون ذلك في نهاية العام المالي، ويضم التقرير بيانات ومؤشرات الأداء الفعلي للمالية العامة والاقتصاد خلال العام المالي المنصرم، كما يُوضح أسباب الاختلاف بين التقديرات المعتمدة في الميزانية وأدائها الفعلي. ويهدف هذا التقرير إلى دعم مبادرات الشفافية والإفصاح المالي التي تنتهجها الحكومة السعودية، بالإضافة إلى تعزيز الشفافية والإفصاح في المالية العامة، مع تقديم شرح للسياسات والمبادرات والبرامج التي تنفذها الوزارة.
+ما هو صافي الربح الذي حققته مجموعة تداول السعودية في العام المالي 2022 بعد الزكاة؟, حققت مجموعة تداول السعودية صافي ربح قدره 424.6 مليون ريال سعودي في العام المالي 2022 بعد الزكاة.
+ما الشركة الجديدة التي تم تأسيسها بالتعاون مع صندوق الاستثمارات العامة؟, تم تأسيس شركة سوق الكربون الطوعي الإقليمية بالتعاون مع صندوق الاستثمارات العامة.
+ما هي الأهداف الرئيسية لإطلاق شركة سوق الكربون الطوعي الإقليمية؟, الأهداف الرئيسية لإطلاق شركة سوق الكربون الطوعي الإقليمية تشمل دعم الشركات والقطاعات في المنطقة لتمكينها من الوصول إلى الحياد الكربوني وضمان شراء أرصدة الكربون لتخفيض الانبعاثات الكربونية في سلاسل القيمة.
+كيف تؤثر المبادرات التي أطلقتها مجموعة تداول السعودية على تحقيق أهداف رؤية المملكة 2030؟, المبادرات التي أطلقتها مجموعة تداول السعودية تساهم في تعزيز كفاءة وسيولة سوق المال السعودي، وتعزيز الأداء التشغيلي والمالي، وتوسيع قاعدة العملاء المحلية والإقليمية والعالمية، مما يدعم تحقيق أهداف رؤية المملكة 2030 لتطوير قطاع مالي متقدم يسهم في تنمية الاقتصاد الوطني.
+ما هي النتائج الرئيسية للتعاون بين مجموعة تداول السعودية وسوق أبوظبي للأوراق المالية؟, نتج عن التعاون بين مجموعة تداول السعودية وسوق أبوظبي للأوراق المالية أول عملية إدراج مزدوج ومتزامن، مما يمثل مرحلة جديدة من التعاون بين السوق المالية السعودية والأسواق المالية الخليجية والدولية.

eval/eval_set_results.csv ADDED

	@@ -0,0 +1,18 @@

+question,answer,agent_answer,correctness,correctness_reason
+ما هو تقرير وزارة المالية للأداء؟, تقوم وزارة المالية بإعداد تقارير دورية عن الأداء المالي للميزانية العامة للدولة، وتعزيز الشفافية والإفصاح عن السياسات والمبادرات التي تنفذها. وتطوير مستوى الإفصاح عن المبادرات والسياسات التي تقوم بتنفيذها. وتعد تقارير أداء ربع سنوية ونصف سنوية، بالإضافة إلى بيان الميزانية السنوي والأولي.,تقرير وزارة المالية للأداء هو وثيقة تصدرها وزارة المالية وتوفر نظرة عامة على أداء الوزارة على مدار عام معين. يغطي التقرير جوانب مختلفة مثل الإيرادات والإنفاق والرؤية والاستراتيجية والتحديات المستقبلية. يهدف التقرير إلى توفير تحديث شفاف للجمهور حول عمليات الوزارة وإنجازاتها.,True,"The agent's answer provides a valid explanation of what the Ministry of Finance Performance Report is, and matches the essence of the ground truth answer."
+اعطيني نبذة عن مجموعة تداول ولخص لي اهم مؤشرات الاداء المالي في 2021,هو تقرير يصدر عن وزارة المالية السعودية، ويكون ذلك في نهاية العام المالي، ويضم التقرير بيانات ومؤشرات الأداء الفعلي للمالية العامة والاقتصاد خلال العام المالي المنصرم، كما يُوضح أسباب الاختلاف بين التقديرات المعتمدة في الميزانية وأدائها الفعلي. ويهدف هذا التقرير إلى دعم مبادرات الشفافية والإفصاح المالي التي تنتهجها الحكومة السعودية، بالإضافة إلى تعزيز الشفافية والإفصاح في المالية العامة، مع تقديم شرح للسياسات والمبادرات والبرامج التي تنفذها الوزارة.,"شركة تداول القابضة هي شركة مساهمة عامة مدرجة في السوق المالية السعودية (تداول). تعمل الشركة في مجال تقديم الخدمات المالية، وتمتلك عدة شركات تابعة لها تعمل في قطاعات مختلفة مرتبطة بالقطاع المالي. من أهم مؤشرات الأداء المالي للشركة لعام 2021:
+1. حققت إيرادات بلغت 1,166.1 مليون ريال سعودي، بزيادة قدرها 8.1% عن العام السابق 2020.
+2. بلغ صافي الدخل 587.7 مليون ريال سعودي، بزيادة قدرها 27.8% عن عام 2020.
+3. حققت الشركة أرباحاً قبل الفوائد والضرائب والاستهلاك والإطفاء (EBITDA) بلغت 1,071.4 مليون ريال سعودي، بزيادة قدرها 117% عن العام الماضي.
+4. بلغت أصول الشركة 7,690 مليون ريال سعودي، بينما بلغت حقوق المساهمين 4,510 مليون ريال سعودي.
+هذه هي أهم مؤشرات الأداء المالي لشركة تداول القابضة خلال عام 2021 كما وردت في تقاريرهم المالية.",False,"The agent's answer doesn't match the ground truth. While the provided information is about a company named 'تداول القابضة', which is a Saudi financial services company, the ground truth mentions a report issued by the Ministry of Finance, which is different from the requested information."
+ما هو صافي الربح الذي حققته مجموعة تداول السعودية في العام المالي 2022 بعد الزكاة؟, حققت مجموعة تداول السعودية صافي ربح قدره 424.6 مليون ريال سعودي في العام المالي 2022 بعد الزكاة.,424.6 مليون ريال سعودي,True,The agent's answer is identical to the ground truth.
+ما الشركة الجديدة التي تم تأسيسها بالتعاون مع صندوق الاستثمارات العامة؟, تم تأسيس شركة سوق الكربون الطوعي الإقليمية بالتعاون مع صندوق الاستثمارات العامة.,تداول جروب.,False,The agent's answer 'تداول جروب' does not correspond to the ground truth answer 'شركة سوق الكربون الطوعي الإقليمية'. They don't share any key information.
+ما هي الأهداف الرئيسية لإطلاق شركة سوق الكربون الطوعي الإقليمية؟, الأهداف الرئيسية لإطلاق شركة سوق الكربون الطوعي الإقليمية تشمل دعم الشركات والقطاعات في المنطقة لتمكينها من الوصول إلى الحياد الكربوني وضمان شراء أرصدة الكربون لتخفيض الانبعاثات الكربونية في سلاسل القيمة.,سوق الكربون الطوعي الإقليمية أطلقت بهدف تعزيز التنمية المستدامة ودعم الانتقال إلى اقتصاد أخضر في المنطقة. ستعمل السوق على إنشاء إطار موحد لتجارة الكربون، مما يسهل تخفيض الانبعاثات وتوفير فرص جديدة للاستثمار والشراكات في مجال الطاقة النظيفة. الهدف الرئيسي هو تشجيع وتسهيل مشاركة القطاع الخاص في مبادرات الحد من الكربون، والمساهمة في الجهود المبذولة لمواجهة تغير المناخ.,False,"The agent answer doesn't mention anything about supporting companies in the region or enabling them to access carbon neutrality. Also, the purchase of carbon credits to reduce emissions within the value chains is not included in the agent's response."
+كيف تؤثر المبادرات التي أطلقتها مجموعة تداول السعودية على تحقيق أهداف رؤية المملكة 2030؟, المبادرات التي أطلقتها مجموعة تداول السعودية تساهم في تعزيز كفاءة وسيولة سوق المال السعودي، وتعزيز الأداء التشغيلي والمالي، وتوسيع قاعدة العملاء المحلية والإقليمية والعالمية، مما يدعم تحقيق أهداف رؤية المملكة 2030 لتطوير قطاع مالي متقدم يسهم في تنمية الاقتصاد الوطني.,تدعم مبادرات مجموعة تداول السعودية رؤية المملكة 2030 في تحفيز وتنمية سوق المال السعودي، حيث تساهم في زيادة كفاءة السوق المالية وتعزيز مكانة المملكة كمركز مالي رائد في المنطقة. كما تساعد في تطوير السوق وتوفير فرص استثمارية جديدة، مما يؤدي إلى جذب المزيد من الاستثمارات المحلية والأجنبية. بالإضافة إلى ذلك، فإن هذه المبادرات ستعمل على تحسين الأداء التشغيلي للسوق المالية السعودية وتوفير بيئة تجارية أكثر كفاءة وفعالية.,True,"The agent's answer mentions how the initiatives launched by Tadawul contribute to the vision of enhancing the efficiency of the Saudi stock market and its role as a leading financial center in the region. It also talks about the impact on operational and financial performance, which is implied in the ground truth answer."
+ما هي النتائج الرئيسية للتعاون بين مجموعة تداول السعودية وسوق أبوظبي للأوراق المالية؟, نتج عن التعاون بين مجموعة تداول السعودية وسوق أبوظبي للأوراق المالية أول عملية إدراج مزدوج ومتزامن، مما يمثل مرحلة جديدة من التعاون بين السوق المالية السعودية والأسواق المالية الخليجية والدولية.,نجحت مجموعة تداول السعودية وسوق أبوظبي للأوراق المالية في تحقيق شراكة ناجحة تم على إثرها أول إدراج مزدوج لشركة سعودية في سوق أبوظبي للوراق المالية، وهي خطوة هامة نحو الأمام لتعزيز التعاون المالي بين البلدين. كما تم إدراج شركة أخرى في سوق أبوظبي أيضاً، مما سيساهم في إظهار فرص جديدة للاستثمار والتبادل التجاري.,False,"The agent's answer doesn't mention the simultaneous dual listing, which is the main outcome of the collaboration between the two companies, as stated in the ground truth answer."

evaluate.py ADDED

	@@ -0,0 +1,48 @@

+import os
+import json
+import cohere
+import pandas as pd
+from dotenv import load_dotenv
+load_dotenv(override=True)
+from rag import RAG
+from prompts import eval_preamble as preamble, eval_message as message
+def evaluate(co, question, agent_answer, ground_truth):
+    response = co.chat(
+        model='command-r',
+        message=message.format(question=question, agent_answer=agent_answer, ground_truth=ground_truth),
+        temperature=0.0,
+        chat_history=[{"role": "system", "message": preamble}],
+        prompt_truncation='AUTO',
+        connectors=[]
+    )
+    json_response = json.loads(response.text[8:-4])
+    correctness = json_response['correctness']
+    correctness_reason = json_response['correctness_reason']
+    return correctness, correctness_reason
+if __name__ == '__main__':
+    co = cohere.Client(os.getenv('COHERE_API_KEY'))
+    rag = RAG()
+    def df_evaluate(row):
+        correctness, correctness_reason = evaluate(
+            co,
+            row["question"],
+            row["agent_answer"],
+            row["answer"]
+        )
+        row["correctness"] = correctness
+        row["correctness_reason"] = correctness_reason
+        return row
+    eval_df = pd.read_csv("eval/eval_set.csv")
+    eval_df["agent_answer"] = eval_df["question"].apply(lambda x: list(rag(x))[0])
+    eval_df = eval_df.apply(df_evaluate, axis=1)
+    eval_df.to_csv("eval/eval_set_results.csv", index=False)
+    accuracy = eval_df["correctness"].mean()
+    print(f"Accuracy: {accuracy}")
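
The slice `response.text[8:-4]` in `evaluate()` appears to assume that the model always wraps its reply in a Markdown `json` code fence of a fixed width. A more tolerant parser could look roughly like the sketch below. `extract_json` is a hypothetical helper, not part of the commit, and it assumes the reply contains exactly one JSON object.

```python
import json
import re


def extract_json(text: str) -> dict:
    """Parse the JSON object embedded in a model reply.

    Works for bare JSON and for JSON wrapped in a Markdown code fence,
    because it only looks at the span from the first '{' to the last '}'.
    """
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
reply = fence + 'json\n{"correctness": true, "correctness_reason": "match"}\n' + fence
print(extract_json(reply)["correctness"])
```

With such a helper, `json.loads(response.text[8:-4])` could become `extract_json(response.text)`, which also survives replies that omit the fence.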

ingest.py ADDED

	@@ -0,0 +1,20 @@

+import json
+from utils import generate_embeddings, call_spellbook_api
+def ingest_data(vector_store_name, splits):
+    embeddings = generate_embeddings(splits)
+    document_items = [
+        {
+            "embedding": embedding,
+            "text": split
+        }
+        for split, embedding in zip(splits, embeddings)
+    ]
+    payload = {"items": document_items}
+    response = call_spellbook_api(endpoint="api/v1/vector-stores/" + vector_store_name + "/documents", payload=payload)
+    print(json.dumps(response, indent=2))
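
`ingest_data` sends every split to the embedding call and to the vector store in a single request. For long PDFs it may be safer to work in batches. The sketch below assumes the embedding endpoint caps the number of texts per request; the value 96 is an assumption to verify against the Cohere documentation, and `batched` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Assumed per-request limit; check the provider's documentation.
EMBED_BATCH_SIZE = 96

splits = [f"chunk {i}" for i in range(200)]
print([len(batch) for batch in batched(splits, EMBED_BATCH_SIZE)])
```

Each batch could then be passed to `ingest_data(vector_store_name, batch)` in turn.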

prompts.py ADDED

	@@ -0,0 +1,45 @@

+prompt = """Use the following context as your learned knowledge, inside <context></context> XML tags. Notice that I extracted the following context from a pdf file.
+<context>
+{pdf_file_content}
+</context>
+When answering the user:
+- If you don't know, just say that you don't know.
+- If you are not sure, ask for clarification.
+- Avoid mentioning that you obtained the information from the context.
+- Be concise and to the point.
+- Avoid providing information that is not in the context.
+- And answer according to the language of the user's question.
+{message}"""
+prompt_v2 = """I extracted the following Arabic text from a pdf file:
+\"\"\"
+{pdf_file_content}
+\"\"\"
+Please answer the following question in Arabic using the Arabic text above only:
+{message}"""
+eval_preamble = """Your role is to test AI agents. Your task is to assess whether an agent's output correctly answers a question.
+You are provided with the ground truth answer to the question. Your task is then to evaluate if the agent answer is close to the ground truth answer.
+Think step by step and consider the agent output in its entirety. Remember: you need to have a strong and sound reason to support your evaluation.
+If the agent answer is correct, return True. If the agent answer is incorrect, return False along with the reason.
+You must output a single JSON object with keys 'correctness' and 'correctness_reason'. Make sure you return a valid JSON object.
+The question that was asked to the agent, its output, and the expected ground truth answer will be delimited with XML tags."""
+eval_message = """<question>
+{question}
+</question>
+<agent_answer>
+{agent_answer}
+</agent_answer>
+<ground_truth>
+{ground_truth}
+</ground_truth>"""

rag.py ADDED

	@@ -0,0 +1,53 @@

+import os
+import cohere
+from prompts import prompt
+from ingest import ingest_data
+from text_splitter import split_text_on_tokens, tokenizer
+from utils import generate_embeddings, call_spellbook_api
+class RAG:
+    vector_store_name = os.environ.get("VECTOR_STORE_NAME")
+    def ingest_text(self, text: str):
+        ingest_data(self.vector_store_name, split_text_on_tokens(text, tokenizer))
+    def __call__(self, message: str, stream=False):
+        payload = {"queryEmbedding": generate_embeddings([str(message)])[0], "k": 10}
+        response = call_spellbook_api(
+            endpoint="api/v1/vector-stores/" + self.vector_store_name + "/similarity-search", payload=payload
+        )
+        print(response)
+        docs = [item["text"] for item in response["data"]["items"]]
+        co = cohere.Client(os.getenv('COHERE_API_KEY'))
+        # print("Message: ", str(message))
+        # response = co.rerank(model="rerank-multilingual-v3.0", query=str(message), documents=docs, top_n=10, return_documents=True)
+        # print(response)
+        # docs = [doc.document.text for doc in response.results if doc.relevance_score > 0.1]
+        information = "\n\n".join(docs)
+        print(information)
+        answer = ""
+        print(prompt.format(pdf_file_content=information, message=message))
+        for event in co.chat_stream(
+            model='command-r',
+            message=prompt.format(pdf_file_content=information, message=message),
+            temperature=0.0,
+            chat_history=[],
+            prompt_truncation='AUTO',
+            connectors=[]
+        ):
+            if event.event_type == "text-generation":
+                answer += event.text
+                if stream:
+                    yield answer
+            elif event.event_type == "stream-end":
+                break
+        if not stream:
+            yield answer
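
`RAG.__call__` is a generator in both modes: with `stream=True` it yields the growing answer after every text event, and with `stream=False` it yields the complete answer once. That is why `evaluate.py` reads the result with `list(rag(x))[0]`. The toy function below mirrors only that control flow. `accumulate` is a hypothetical stand-in that takes plain strings instead of Cohere stream events.

```python
from typing import Iterable, Iterator


def accumulate(pieces: Iterable[str], stream: bool = False) -> Iterator[str]:
    """Yield the running text per piece when streaming, else the final text once."""
    answer = ""
    for piece in pieces:
        answer += piece
        if stream:
            yield answer
    if not stream:
        yield answer


print(list(accumulate(["424.6", " million", " SAR"], stream=True)))
print(list(accumulate(["424.6", " million", " SAR"])))
```

In streaming mode the caller overwrites the chat bubble with each yielded value, as `bot()` does with `history[-1][1] = answer`.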

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+gradio
+PyMuPDF
+cohere

text_splitter.py ADDED

	@@ -0,0 +1,36 @@

+from dataclasses import dataclass
+from typing import Callable, List
+from transformers import AutoTokenizer
+@dataclass
+class Tokenizer:
+    chunk_overlap: int
+    tokens_per_chunk: int
+    decode: Callable[[List[int]], str]
+    encode: Callable[[str], List[int]]
+def split_text_on_tokens(text: str, tokenizer: Tokenizer) -> List[str]:
+    """Split incoming text and return chunks using tokenizer."""
+    splits: list[str] = []
+    input_ids = tokenizer.encode(text)[1:]
+    start_idx = 0
+    cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
+    chunk_ids = input_ids[start_idx:cur_idx]
+    while start_idx < len(input_ids):
+        splits.append(tokenizer.decode(chunk_ids))
+        start_idx += tokenizer.tokens_per_chunk - tokenizer.chunk_overlap
+        cur_idx = min(start_idx + tokenizer.tokens_per_chunk, len(input_ids))
+        chunk_ids = input_ids[start_idx:cur_idx]
+    return splits
+tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+tokenizer = Tokenizer(
+    chunk_overlap=50,
+    tokens_per_chunk=500,
+    decode=tokenizer.decode,
+    encode=tokenizer.encode,
+)
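
The windowing in `split_text_on_tokens` is easier to follow on small numbers. The sketch below applies the same stepping rule (advance by `tokens_per_chunk - chunk_overlap`) to a plain list of integers. `split_ids` is a hypothetical helper, and the guard against an overlap that is too large is an addition not present in the commit.

```python
from typing import List


def split_ids(ids: List[int], tokens_per_chunk: int, chunk_overlap: int) -> List[List[int]]:
    """Slide a window of tokens_per_chunk ids, stepping by chunk size minus overlap."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap
    return [ids[start:start + tokens_per_chunk] for start in range(0, len(ids), step)]


chunks = split_ids(list(range(10)), tokens_per_chunk=4, chunk_overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Note the final chunk `[9]`: with this stepping rule the last window can consist entirely of tokens already covered by the previous chunk, which the original loop produces as well.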

utils.py ADDED

	@@ -0,0 +1,52 @@

+import os
+import requests
+import cohere
+# from openai import OpenAI
+def generate_embeddings(texts: list[str]):
+    # embedding_res = OpenAI().embeddings.create(input=text, model="text-embedding-ada-002")
+    # embedding = embedding_res.data[0].embedding
+    co = cohere.Client(os.getenv('COHERE_API_KEY'))
+    response = co.embed(texts=texts, input_type='classification', embedding_types=['float'], model='embed-multilingual-v3.0')
+    embeddings = response.embeddings.float
+    return embeddings
+def call_spellbook_api(endpoint: str, payload: dict):
+    spellbook_base_url = os.environ.get("SPELLBOOK_BASE_URL")
+    spellbook_api_key = os.environ.get("SPELLBOOK_API_KEY")
+    headers = {
+        "accept": "application/json",
+        "content-type": "application/json",
+        "authorization": f"Bearer {spellbook_api_key}",
+    }
+    url = spellbook_base_url + endpoint if spellbook_base_url else endpoint
+    response = requests.request("POST", url, json=payload, headers=headers)
+    return response.json()
+def format_page(page):
+    # reference (modified version for Arabic): https://stackoverflow.com/questions/78200728/how-to-avoid-pymupdf-fitz-interpreting-large-gaps-between-words-as-a-newline-c
+    page_content = ""
+    words = page.get_text("words", sort=True)  # words sorted vertical, then horizontal
+    if len(words) == 0:
+        return True, page_content
+    line = [words[0]]  # list of words in same line
+    for w in words[1:]:
+        w0 = line[-1]  # get previous word
+        if abs(w0[3] - w[3]) <= 3:  # same line (approx. same bottom coord)
+            line.append(w)
+        else:  # new line starts
+            line.sort(key=lambda w: w[0], reverse=True)  # sort words in line right-to-left
+            # print text of line
+            text = " ".join([w[4] for w in line])
+            page_content += text + "\n"
+            line = [w]  # init line list again
+    # emit the last line, sorted right-to-left like the lines above
+    text = " ".join(w[4] for w in sorted(line, key=lambda w: w[0], reverse=True))
+    page_content += text + "\n"
+    page_content += chr(12) + "\n"
+    return False, page_content
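
The line-grouping idea in `format_page` can be tried without a PDF. The sketch below uses hand-made tuples shaped like the first five fields of a PyMuPDF "words" entry, `(x0, y0, x1, y1, text)`. `words_to_rtl_lines` is a hypothetical helper, and the tolerance of 3 points is taken from the code above.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): the first five fields of a PyMuPDF "words" tuple.
Word = Tuple[float, float, float, float, str]


def words_to_rtl_lines(words: List[Word], tolerance: float = 3.0) -> List[str]:
    """Group words by bottom coordinate, then order each line right to left."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w[3], w[0])):
        # Same line if the bottom edge is close to the previous word's bottom edge.
        if lines and abs(lines[-1][-1][3] - word[3]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w[4] for w in sorted(line, key=lambda w: w[0], reverse=True))
        for line in lines
    ]


sample = [
    (50.0, 0.0, 90.0, 10.0, "first"),
    (10.0, 0.0, 40.0, 11.0, "second"),
    (30.0, 20.0, 60.0, 30.0, "third"),
]
print(words_to_rtl_lines(sample))  # ['first second', 'third']
```

The word with the larger `x0` comes first in each line, which matches reading order for right-to-left scripts such as Arabic.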