Spaces: SilenWang/ReviewGPT


Silen Wang committed on Apr 1, 2023

Commit

c4ad2cf

1 Parent(s): 9b8e6f6

v0.2.1 demo update

Files changed (19)

.gitignore +4 -0
README.md +23 -7
README.zh-cn.md +25 -6
img/study_other.png +0 -0
img/study_paper.png +0 -0
img/summarise.png +0 -0
lang/Review.md +11 -0
lang/Review.zh_cn.md +11 -0
lang/Setting.md +7 -0
lang/Setting.zh_cn.md +6 -0
lang/Study.md +7 -0
lang/Study.zh_cn.md +6 -0
lang/Title.md +2 -0
requirements.txt +5 -1
reviewGPT.py +105 -15
utils/config_sample.py +33 -5
utils/pdf_parser.py +58 -0
utils/review.py +49 -11
utils/task.py +31 -1

.gitignore ADDED

	@@ -0,0 +1,4 @@

+conda/
+utils/config.py
+*test*
+*__pycache__*

README.md CHANGED

@@ -16,6 +16,8 @@ Researchers need to read a large amount of literature every day to keep up with
 ## Demo
 - Screen:
 ![demo](img/screen.png)
@@ -24,6 +26,10 @@ Researchers need to read a large amount of literature every day to keep up with
 ![demo](img/summarise.png)
 ## ToDo
@@ -33,8 +39,8 @@ Researchers need to read a large amount of literature every day to keep up with
   + [ ] Add a download button for raw parsing data(json)
   + [x] Implementation of content summarise function
   + [ ] The About page
-  + [ ] Add usage instructions
-  + [ ] Add a page for reading single paper
 - Backend:
   + [x] Call the chatGPT API for content summarization
   + [x] Call the chatGPT API for literature content access judgment (for meta-analysis)
@@ -47,20 +53,30 @@ Researchers need to read a large amount of literature every day to keep up with
     + [ ] chatGLM
     + [ ] moss
     + [ ] LLaMA
-  + [ ] Add the function of reading single paper
   + [ ] Add APIs for existing feature
 - Reference learning:
-  + [ ] Study [ResearchGPT](https://github.com/mukulpatnaik/researchgpt) and add similar functionality
   + [ ] Study [chatPaper](https://github.com/kaixindelele/ChatPaper) and add similar functionality
-  + [ ] Try to build something like [chatPDF](https://www.chatpdf.com/)
 - Others:
   - [x] English README
   - [ ] Dockerfile for container building
-  - [ ] A HuggingFace demo
 ## Problems
 - This project was initially developed using Pynecone, but encountered several problems that affected its use/appearance, so it was finally switched to Gradio.
   + pynecone continues to occupy the CPU after startup.
   + Currently, the file upload function is not very user-friendly, and you must use buttons or other content to trigger the upload (I have not found how to implement drag and drop upload).
-  + After uploading the file, performing other operations will cause the displayed file name to be lost.

 ## Demo
+A [Demo](https://huggingface.co/spaces/SilenWang/ReviewGPT) is available on Hugging Face; an OpenAI API key is required.
 - Screen:
 ![demo](img/screen.png)
 ![demo](img/summarise.png)
+- Study:
+![demo](img/study_paper.png)
+![demo](img/study_other.png)
 ## ToDo
   + [ ] Add a download button for raw parsing data(json)
   + [x] Implementation of content summarise function
   + [ ] The About page
+  + [x] Add usage instructions
+  + [x] Paper Reading Page
 - Backend:
   + [x] Call the chatGPT API for content summarization
   + [x] Call the chatGPT API for literature content access judgment (for meta-analysis)
     + [ ] chatGLM
     + [ ] moss
     + [ ] LLaMA
+  + [x] Add the function of reading single paper
   + [ ] Add APIs for existing feature
+  + [ ] Improve the PDF parsing module in the Study feature, with the goal of changing the unit from page to paragraph.
 - Reference learning:
+  + [x] Study [ResearchGPT](https://github.com/mukulpatnaik/researchgpt) and add similar functionality
   + [ ] Study [chatPaper](https://github.com/kaixindelele/ChatPaper) and add similar functionality
+  + [ ] ~~Try to build something like [chatPDF](https://www.chatpdf.com/)~~
 - Others:
   - [x] English README
   - [ ] Dockerfile for container building
+  - [x] A HuggingFace demo
+  - [ ] Add error handling for network tasks, following the example of [chatPaper](https://github.com/kaixindelele/ChatPaper)
+## Code Interpretation
+- According to ChatGPT's reading of the code, ResearchGPT works as follows:
+  + Convert the file contents into text **page by page**
+  + Call `text-embedding-ada-002` to compute an embedding vector for each page
+  + Embed the question the same way and compute its similarity to each page's embedding
+  + Send the 3 pages most similar to the question, together with the question itself, to the ChatGPT API to interpret the paper
 ## Problems
 - This project was initially developed using Pynecone, but encountered several problems that affected its use/appearance, so it was finally switched to Gradio.
   + pynecone continues to occupy the CPU after startup.
   + Currently, the file upload function is not very user-friendly, and you must use buttons or other content to trigger the upload (I have not found how to implement drag and drop upload).
+  + After uploading the file, performing other operations will cause the displayed file name to be lost.
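
The page-level retrieval listed under Code Interpretation can be sketched as follows. This is a minimal illustration and not the code of ResearchGPT or ReviewGPT: `toy_embed` is a hypothetical word-count stand-in for a real embedding model such as `text-embedding-ada-002`, chosen so the example runs offline, and `top_pages` is a name introduced only for this sketch.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_pages(pages, question, top=3):
    """Return the numbers of the pages most similar to the question, best first."""
    q = toy_embed(question)
    scored = [(cosine_similarity(toy_embed(text), q), number)
              for number, text in pages]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [number for _, number in scored[:top]]


pages = [
    (1, "Introduction to obesity and pregnancy outcomes"),
    (2, "Methods: logistic regression on cohort data"),
    (3, "Results: risk of gestational diabetes increased with obesity"),
]
best = top_pages(pages, "Which risk increased with obesity?", top=2)
print(best)  # [3, 1]
```

With a real embedding model the vectors would be dense lists of floats, but the ranking step stays the same: score every page against the question, sort, and keep the first `top` pages for the prompt.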

README.zh-cn.md CHANGED

@@ -4,6 +4,8 @@
 ## 运行展示
 - 文献准入判断(Screen):
 ![demo](img/screen.png)
@@ -12,6 +14,11 @@
 ![demo](img/summarise.png)
 ## 规划内容
@@ -21,8 +28,8 @@
   + [ ] 增加原始解析数据下载按钮(json)
   + [x] 内容综述功能实装
   + [ ] 增加About页
-  + [ ] 增加使用说明(具体怎么加没想好)
-  + [ ] 增加单文献阅读的页面
 - 后端:
   + [x] 调用chatGPT的API进行内容综述
   + [x] 调用chatGPT的API进行文献内容准入判断(Meta分析用)
@@ -35,16 +42,28 @@
     + [ ] chatGLM
     + [ ] moss
     + [ ] LLaMA
-  + [ ] 增加单文献阅读的功能
   + [ ] 增加现有功能的API
 - 参考学习:
-  + [ ] 学习[ResearchGPT](https://github.com/mukulpatnaik/researchgpt)的内容, 增加类似的功能
   + [ ] 学习[chatPaper](https://github.com/kaixindelele/ChatPaper)的内容, 增加类似的功能
-  + [ ] 尝试构建个[chatPDF](https://www.chatpdf.com/)类似的功能
 - 杂项:
   - [x] 英文README
   - [ ] 准备Dockfile, 构建容器
-  - [ ] 准备HuggingFace Demo
 ## 问题记录

 ## 运行展示
+现在可以在Huggingface上试用本程序的[Demo](https://huggingface.co/spaces/SilenWang/ReviewGPT)(需要自备OpenAI API Key)
 - 文献准入判断(Screen):
 ![demo](img/screen.png)
 ![demo](img/summarise.png)
+- 文献辅助阅读(Study):
+![demo](img/study_paper.png)
+![demo](img/study_other.png)
 ## 规划内容
   + [ ] 增加原始解析数据下载按钮(json)
   + [x] 内容综述功能实装
   + [ ] 增加About页
+  + [x] 增加使用说明
+  + [x] 增加单文献阅读的页面
 - 后端:
   + [x] 调用chatGPT的API进行内容综述
   + [x] 调用chatGPT的API进行文献内容准入判断(Meta分析用)
     + [ ] chatGLM
     + [ ] moss
     + [ ] LLaMA
+  + [x] 增加单文献阅读的功能
   + [ ] 增加现有功能的API
+  + [ ] 改进Study功能中的PDF解析模块, 目标是将单位由页变为段落
 - 参考学习:
+  + [x] 学习[ResearchGPT](https://github.com/mukulpatnaik/researchgpt)的内容, 增加类似的功能
   + [ ] 学习[chatPaper](https://github.com/kaixindelele/ChatPaper)的内容, 增加类似的功能
+  + [ ] ~~尝试构建个[chatPDF](https://www.chatpdf.com/)类似的功能~~
 - 杂项:
   - [x] 英文README
   - [ ] 准备Dockfile, 构建容器
+  - [x] 准备HuggingFace Demo
+  - [ ] 效仿[chatPaper](https://github.com/kaixindelele/ChatPaper)增加网络任务的错误处理(tenacity)
+## 学习内容
+### ResearchGPT功能实现原理
+- 借助chatGPT解读可知, ResearchGPT的实现方式为:
+  + 将文件内容**按页**转换为文本
+  + 调用`text-embedding-ada-002`进行文本embedding矩阵计算
+  + 将问题也转换为矩阵, 与每个页面的矩阵计算相似性
+  + 将相似性最高的3个页面与提出的问题一起给chatGPT接口, 实现文献解读
 ## 问题记录

img/study_other.png ADDED

img/study_paper.png ADDED

img/summarise.png CHANGED

lang/Review.md ADDED

	@@ -0,0 +1,11 @@

+### Usage
+This page provides two functions: Summarise and Screen.
+- Summarise: Read the abstracts of multiple articles, summarise their content, and compare the similarities and differences among the studies. Further questions can be asked through the prompts.
+- Screen: Read the abstracts one by one and judge whether each article meets the conditions given in the prompts. This can be used for batch screening of literature.
+Currently, two input methods are provided:
+- PMID: Enter PMIDs directly; the PubMed API provided by Biopython is called to fetch and parse the abstracts.
+- RIS format file: Export records in RIS format from reference management software or an online database, then upload the file for parsing.
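
To make the RIS input concrete, here is a simplified, hand-rolled reader for the tagged-line layout of RIS files. It is only a sketch: the project lists `rispy` in its requirements for this job, the function name `read_ris` is hypothetical, the sample record (including its DOI) is made up, and continuation lines without a tag are ignored.

```python
def read_ris(text):
    """Split RIS text into records, each a dict mapping a tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        # A RIS line looks like "AB  - some text":
        # a two-character tag, two spaces, a hyphen, then the value.
        if len(line) < 5 or line[2:5] != "  -":
            continue
        tag, value = line[:2], line[5:].strip()
        if tag == "ER":  # "ER" marks the end of a record
            records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records


sample = "\n".join([
    "TY  - JOUR",
    "TI  - Severe obesity and reproductive health",
    "AB  - Little is known about reproductive health in severely obese women.",
    "DO  - 10.1000/example",
    "ER  - ",
])
records = read_ris(sample)
abstract = records[0]["AB"][0]
```

Each record then carries what the Review page needs: an identifier (`DO`) and an abstract (`AB`).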

lang/Review.zh_cn.md ADDED

	@@ -0,0 +1,11 @@

+### 使用说明
+这个页面提供Summarise和Screen两种功能:
+- Summarise: 阅读多篇文章的摘要, 总结摘要的内容并比较多个研究的异同, 可以通过Prompts进行进一步提问
+- Screen: 逐一阅读文章的摘要, 判断文章是否符合Prompts中给出的条件, 可用于批量筛选文献
+目前提供两种输入方式:
+- PMID: 直接输入PMID号, 将调用biopython提供的PubMed API获取文献摘要并解析
+- RIS格式文件: 可从各种文献管理软件, 或者在线数据库以RIS格式导出题录后上传解析

lang/Setting.md ADDED

	@@ -0,0 +1,7 @@

+### Setting Instructions
+The currently available settings are:
+- `OpenAI API Key`: All features currently rely on the OpenAI API, so this key is required. The key you enter is not written to local storage; it is kept in a `gradio.State` object. If you deploy the app yourself, you can put the key in `utils/config.py` (follow the format of `utils/config_sample.py`) to avoid entering it again each time the page is refreshed.
+- `Email`: The API provided by NCBI requires an email address, so this setting is also required. For self-deployment, set it the same way as the `OpenAI API Key`.
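
The self-deployment lookup (a private `utils/config.py` if it exists, otherwise `utils/config_sample.py`) can be sketched with `importlib`. This mirrors the try/except import at the top of `reviewGPT.py` but is not the app's code; `load_config` is a hypothetical helper, and the module name used in the demo call is made up so that the fallback path is exercised.

```python
import importlib


def load_config(primary="utils.config", fallback="utils.config_sample"):
    """Return the first module that imports successfully, or None."""
    for name in (primary, fallback):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # try the next candidate
    return None


# Demo with a made-up module name: the first candidate is missing,
# so the standard-library module given as the fallback is returned.
module = load_config("missing_private_config_for_demo", "json")
```

Because `utils/config.py` is listed in `.gitignore`, a key stored there stays out of version control.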

lang/Setting.zh_cn.md ADDED

	@@ -0,0 +1,6 @@

+### 设置说明
+目前开放的设置项包括:
+- `OpenAI API Key`: 目前所有的功能都基于OpenAI 提供的API实现, 所以该Key是必填项. 填写的Key不直接存储到本地, 而是在`gradio.State`对象中. 如果进行自部署, 则可以将内容写到`utils/config.py`文件中(参考`utils/config_sample.py`的格式)
+- `Email`: 调用NCBI提供的API必须要提供邮箱, 因此也需要进行设置, 自部署的处理同`OpenAI API Key`

lang/Study.md ADDED

	@@ -0,0 +1,7 @@

+### Usage
+This page uses AI to assist in reading papers and currently has two modes:
+- Paper: Upload a PDF file for parsing, then ask questions that are answered from the content of the document. Make sure to upload the PDF before asking questions in Paper mode.
+- Other: Ask questions directly without a document. This is equivalent to using ChatGPT directly, but without conversational context (which saves tokens).

lang/Study.zh_cn.md ADDED

	@@ -0,0 +1,6 @@

+### 使用说明
+这个页面中可利用AI辅助阅读文献, 目前有两种模式:
+- Paper: 上传PDF文件进行解析, 然后根据文档的内容回答问题, 请确保先上传PDF后再用Paper模式提问
+- Other: 不基于文档直接提问, 等同于直接使用chatGPT, 但是没有上下文对话能力(节省token)

lang/Title.md ADDED

	@@ -0,0 +1,2 @@


+# ReviewGPT
+ReviewGPT is an app that uses the ChatGPT API to perform paper summarization and aggregation. My goal is to use AI to accelerate the reading and retrieval of papers.

requirements.txt CHANGED

@@ -2,4 +2,8 @@ gradio>=3.21.0
 openai==0.27.0
 pandas>=1.2.4
 biopython
-rispy

 openai==0.27.0
 pandas>=1.2.4
 biopython
+rispy
+pdfplumber
+plotly
+scipy
+scikit-learn

reviewGPT.py CHANGED

@@ -1,6 +1,7 @@
 import gradio as gr
 from datetime import datetime
 from utils import task
 try:
     import utils.config as conf
@@ -8,10 +9,13 @@ except ImportError:
     import utils.config_sample as conf
-# Definitions
-TITLE = "# ReviewGPT"
-# Description shown under the title; Markdown is supported
-DESCRIPTION = "ReviewGPT is an app that use the ChatGPT API to perform paper summarization and aggregation. My goal is to use AI to accelerate the reading and retrieval of papers."
 def load_setting():
@@ -24,19 +28,76 @@ def load_setting():
     return setting
 # Build the Blocks context
 with gr.Blocks() as reviewGPT:
     share_settings = gr.State(load_setting()) # already converted to a dict
-    gr.Markdown(TITLE)
-    gr.Markdown(DESCRIPTION)
     with gr.Tab("Review"):
         with gr.Row():       # row layout
             with gr.Column():    # column layout
                 gr.Markdown('### Data Input')
                 input_form = gr.Radio(
-                    ["PMID", "RIS File"], label="Select Input Form"
                 )
                 pmids = gr.Textbox(label="Input PMIDs", lines=9, visible=True, interactive=True)
                 ris_file = gr.File(label="Please Select Ris File", visible=False, file_types=['.ris'], type='binary')
@@ -46,24 +107,53 @@ with gr.Blocks() as reviewGPT:
             with gr.Column():    # column layout
                 gr.Markdown('### Task Setting')
                 task_choice = gr.Radio(
-                    ["Screen", "Summarise"], label="Select a Task"
                 )
                 prompts = gr.Textbox(label="Input Prompts", lines=9, interactive=True)
         start = gr.Button(value='REVIEW START')
         out = gr.Markdown()
-        def run_review(input_form, task_choice, prompts, pmids, ris_file, share_settings):
-            '''
-            Click handler: run the literature review
-            '''
-            paper_info = task.get_paper_info(input_form, share_settings['EMAIL'], pmids, ris_file)
-            return task.review(task_choice, paper_info, prompts, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
         start.click(fn=run_review, inputs=[
             input_form, task_choice, prompts, pmids, ris_file, share_settings
         ], outputs=out)
     with gr.Tab("Setting"):
         # gr.Text('Model Selection(not effective now)')
         # model = gr.Textbox(label="Select a model", interactive=True)

 import gradio as gr
 from datetime import datetime
 from utils import task
+import pandas as pd
 try:
     import utils.config as conf
     import utils.config_sample as conf
+def load_markdown(base):
+    '''
+    Load the description text shown on a page
+    '''
+    with open(f'lang/{base}.md', encoding='utf-8') as f:
+        content = f.read()
+    return content
 def load_setting():
     return setting
+def run_review(input_form, task_choice, prompts, pmids, ris_file, share_settings):
+    '''
+    Click handler: run the literature review
+    '''
+    paper_info = task.get_paper_info(input_form, share_settings['EMAIL'], pmids, ris_file)
+    return task.review(task_choice, paper_info, prompts, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
+def pdf_preprocess(history):
+    '''
+    Parsing the PDF takes a while, so post a notice in the chat box
+    asking the user not to interact with the page yet (enforcement to be added later)
+    '''
+    return history + [[None, 'File uploaded, now pre-processing. Please do not ask questions in Paper mode until processing is done.']] # the first field is the user message
+def pdf_process(pdf_data, share_settings):
+    '''
+    Parse the PDF file into paper data
+    '''
+    paper_data = task.parse_pdf_info(pdf_data, share_settings['OPENAI_KEY'])
+    return paper_data
+def pdf_postprocess(history):
+    '''
+    Parsing has finished, so post a notice in the chat box
+    telling the user that questions can now be asked
+    '''
+    return history + [[None, 'PDF pre-processing done. You can now ask questions about this paper.']] # the first field is the user message
+def user(user_message, history):
+    '''
+    Handle the text submitted from the input box
+    '''
+    return "", history + [[user_message, None]] # the first field is the user message
+def bot(history, paper_data, mode, share_settings):
+    '''
+    Answer the latest question according to the query mode and return the updated history
+    '''
+    question = history[-1][0]
+    if mode == 'Paper':
+        bot_message = task.study(paper_data, question, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
+        history[-1][1] = bot_message # attach the reply to the latest exchange
+    elif mode == 'Other':
+        bot_message = task.query(question, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
+        history[-1][1] = bot_message # attach the reply to the latest exchange
+    else:
+        raise ValueError('Invalid Query Mode')
+    return history
 # Build the Blocks context
 with gr.Blocks() as reviewGPT:
     share_settings = gr.State(load_setting()) # already converted to a dict
+    pdf_data = gr.State(pd.DataFrame()) # parsed PDF data
+    gr.Markdown(load_markdown('Title'))
     with gr.Tab("Review"):
+        with gr.Box():
+            gr.Markdown(load_markdown('Review'))
         with gr.Row():       # row layout
             with gr.Column():    # column layout
                 gr.Markdown('### Data Input')
                 input_form = gr.Radio(
+                    ["PMID", "RIS File"], value = "PMID", label="Select Input Form",
                 )
                 pmids = gr.Textbox(label="Input PMIDs", lines=9, visible=True, interactive=True)
                 ris_file = gr.File(label="Please Select Ris File", visible=False, file_types=['.ris'], type='binary')
             with gr.Column():    # column layout
                 gr.Markdown('### Task Setting')
                 task_choice = gr.Radio(
+                    ["Screen", "Summarise"], value = "Screen", label="Select a Task"
                 )
                 prompts = gr.Textbox(label="Input Prompts", lines=9, interactive=True)
         start = gr.Button(value='REVIEW START')
         out = gr.Markdown()
         start.click(fn=run_review, inputs=[
             input_form, task_choice, prompts, pmids, ris_file, share_settings
         ], outputs=out)
+    with gr.Tab("Study"): # the paper-reading page uses a chatbot layout
+        ###### Layout ######
+        with gr.Box():
+            gr.Markdown(load_markdown('Study'))
+        with gr.Row():
+            with gr.Column(scale=0.2):
+                pdf_file = gr.File(label="Please Select PDF File", visible=True, file_types=['.pdf'], type='binary')
+                mode = gr.Radio(
+                    ["Paper", 'Other'], value = "Paper", label="Query Mode", interactive=True
+                )
+                clear = gr.Button("Clear Dialog")
+            with gr.Column(scale=0.8):
+                chatbot = gr.Chatbot(
+                    value=[[None, "Hi, I'm a bot that helps you read papers. Please upload a PDF file in the box on the left first."]]
+                ).style(height=350)
+                with gr.Row():
+                    msg = gr.Textbox(label='Query')
+            ###### Event handlers ######
+            pdf_file.change(
+                pdf_preprocess, chatbot, chatbot
+            ).then(
+                pdf_process, [pdf_file, share_settings], pdf_data
+            ).then(
+                pdf_postprocess, chatbot, chatbot
+            )
+            clear.click(lambda: None, None, chatbot, queue=False)
+            msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(# .then() runs the next action
+                bot, [chatbot, pdf_data, mode, share_settings], chatbot
+            )
     with gr.Tab("Setting"):
+        with gr.Box():
+            gr.Markdown(load_markdown('Setting'))
         # gr.Text('Model Selection(not effective now)')
         # model = gr.Textbox(label="Select a model", interactive=True)

utils/config_sample.py CHANGED

@@ -9,19 +9,28 @@ EMAIL = "YOUR@MAIL.COM"
 # Preset prompts, currently used to implement two features
 Prompts = {
-    'Summarize': '小结 #1-#{idx} 的内容, 并小结不同参考文献的异同, 请用中文给出回答',
     'Summarize_Unit': '''
-        将编号为{idx}的文献标记为参考文献 #1, 将后续的文段标记为 #1 的摘要.
         {abstract}
     ''',
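     # The criteria for a meta-analysis will certainly vary, so this prompt mainly constrains how the feature behaves
     # 1. Restrict the answer to a JSON-formatted string so it is easy to parse
     # 2. Make chatGPT check each criterion one by one and, when a paper fails, state which criterion it fails
     # 3. Keep chatGPT from piecing together answers that read well but are factually wrong; when it cannot decide, have it say so and leave the call to a human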
     'Screen': '''
-        请你扮演一位研究者, 你接下来将要进行一向Meta分析, 因此需要逐一篇文献摘要, 并对摘要内容进行理解,
-        以判断文献是否满足Meta分析的准入标准, 可以用于后续分析. 在阅读时, 你需要对后面给出的每一条准入标准逐
         一进行判断, 如果文献的摘要不满足任意一条标准, 那么需要给出你认为摘要不满足标准的原因.
         返回的结果请按照json字串的方式给出, 请不要附带任何其他的多余内容.
@@ -46,4 +55,23 @@ Prompts = {
         下面是文献摘要的内容:
         {abstract}
     ''',
 }

 # Preset prompts, currently used to implement two features
 Prompts = {
+    'Summarize': '''
+        {papers}
+        小结 #1-#{idx} 的内容, 并小结不同参考文献的异同
+        如果后面还有其他问题, 请基于#1-#{idx} 的内容逐一进行回答
+        所有问题请中文作答, 但对一些专有名词及其缩写可不翻译.
+        {questions}
+    ''',
     'Summarize_Unit': '''
+        将编号为{paper_id}的文献标记为参考文献 #{idx}, 将后续的文段标记为 #{idx} 的摘要.
         {abstract}
     ''',
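     # The criteria for a meta-analysis will certainly vary, so this prompt mainly constrains how the feature behaves
     # 1. Restrict the answer to a JSON-formatted string so it is easy to parse
     # 2. Make chatGPT check each criterion one by one and, when a paper fails, state which criterion it fails
     # 3. Keep chatGPT from piecing together answers that read well but are factually wrong; when it cannot decide, have it say so and leave the call to a human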
     'Screen': '''
+        请你扮演一位研究者, 你接下来将要进行一项Meta分析, 因此需要逐一阅读并理解文献摘要,
+        以判断文献是否满足Meta分析的准入标准. 在阅读时, 你需要对后面给出的每一条准入标准逐
         一进行判断, 如果文献的摘要不满足任意一条标准, 那么需要给出你认为摘要不满足标准的原因.
         返回的结果请按照json字串的方式给出, 请不要附带任何其他的多余内容.
         下面是文献摘要的内容:
         {abstract}
     ''',
+    'Review': '''
+        请你扮演一位研究者, 你接下来将要阅读一篇论文的部分页, 并基于这部分的内容回答一个或多个问题.
+        回答问题时, 请完全基于提供的论文内容回答, 这具体指: 按照论文的思路, 用论文的表述方式进行回答.
+        不要在回答中添加论文不存在的内容. 所有回答请使用中文, 但是对于一些专有名词及其缩写可以不翻译.
+        下面是这篇论文的部分页:
+        {pages}
+        下面是待回答的问题, 如果有多个问题, 请逐一回答他们, 在回答的时候请说明基于前面哪一页的内容进行了回答,
+        如: 1. 基于#1和#3的内容, 我的回答是...
+        {question}
+    ''',
+    'Review_Unit': '''
+        将后续的文段标记为这篇文献的 #{page}页, 这一页中的内容是:
+        {content}
+    '''
 }
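
How the per-paper template and the summary template fit together can be shown with plain `str.format`. The sketch below uses short English stand-ins for the prompt wording (the real templates above are in Chinese), and `build_summary_prompt` is a hypothetical helper name. Papers are numbered from 1 so that the labels match the `#1-#{idx}` range named in the summary template.

```python
# English stand-ins for the templates; the wording is illustrative only.
UNIT = "Label paper {paper_id} as reference #{idx}; its abstract follows.\n{abstract}"
SUMMARY = (
    "{papers}\n"
    "Summarise #1-#{idx} and compare the references.\n"
    "{questions}"
)


def build_summary_prompt(papers, questions):
    """Number the abstracts from 1 and wrap them in the summary template."""
    units = [
        UNIT.format(idx=idx, paper_id=paper_id, abstract=abstract)
        for idx, (paper_id, abstract) in enumerate(papers, start=1)
    ]
    return SUMMARY.format(papers="\n".join(units), idx=len(papers), questions=questions)


prompt = build_summary_prompt(
    [("12345", "Abstract A"), ("45456", "Abstract B")],
    "Which study is larger?",
)
```

The same pattern applies to the `Review` and `Review_Unit` pair, with page numbers in place of reference numbers.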

utils/pdf_parser.py ADDED

	@@ -0,0 +1,58 @@

+import pdfplumber
+import pandas as pd
+from openai.embeddings_utils import get_embedding
+import openai
+try:
+    from utils.config import OPENAI_KEY
+except ImportError:
+    from utils.config_sample import OPENAI_KEY
+class PdfFile:
+    '''
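+    1. PDF parser, currently a simple implementation on top of pdfplumber
+    2. A better ML-based module is needed later to extract the content more precisely (down to sections) and to locate positions more accurately
+    3. No settings are exposed for this part until a new extraction method is available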
+    '''
+    def __init__(self, file, api_key=None):
+        self.pdf_file = pdfplumber.open(file)
+        self.api_key = api_key if api_key else OPENAI_KEY
+    def __del__(self):
+        # pdf_file is missing if pdfplumber.open() raised in __init__
+        if getattr(self, 'pdf_file', None):
+            self.pdf_file.close()
+    def parse_info(self):
+        '''
+        Store the text of each page as one record, then compute an embedding for every record with the model
+        '''
+        all_text = []
+        for page in self.pdf_file.pages:
+            page_text = page.dedupe_chars().extract_text(layout=True)
+            if page_text:
+                all_text.append({
+                    'page': page.page_number,
+                    'text': page_text
+                })
+        text_data = pd.DataFrame(all_text)
+        openai.api_key = self.api_key
+        text_data['embedding'] = text_data['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
+        # text_data.to_csv('debug.data.csv', index=0)
+        return text_data
+    @property
+    def metadata(self):
+        '''
+        Return the metadata fields that pdfplumber could read from the PDF
+        '''
+        return self.pdf_file.metadata
+if __name__ == '__main__':
+    pdf = PdfFile(file='/home/silen/git_proj/ReviewGPT/test/demo_paper.pdf')
+    print(pdf.parse_info())
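
The ToDo list mentions moving the parsing unit from page to paragraph. One possible approach, sketched below, splits each page's extracted text on blank lines. This is an assumption-laden illustration, not part of the commit: `split_paragraphs` is a hypothetical helper, and real PDF text (especially with `layout=True`) may not separate paragraphs with blank lines.

```python
import re


def split_paragraphs(pages):
    """Turn (page_number, text) pairs into one record per paragraph."""
    records = []
    for page_number, text in pages:
        # A line that is empty or holds only whitespace ends a paragraph.
        for block in re.split(r"\n\s*\n", text):
            paragraph = " ".join(block.split())  # collapse line breaks and runs of spaces
            if paragraph:
                records.append({"page": page_number, "text": paragraph})
    return records


pages = [
    (1, "Background line one\nline two\n\nMethods here"),
    (2, "\n\nResults only\n"),
]
records = split_paragraphs(pages)
```

Each record keeps its page number, so an answer can still cite the page it was drawn from.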

utils/review.py CHANGED

@@ -2,7 +2,9 @@
 # import tiktoken
 import openai
 from json import dump, loads
 try:
     from utils.config import Prompts, OPENAI_KEY, REVIEW_MODEL
@@ -44,16 +46,22 @@ class Reviewer:
         self.messages = None
-    def query(self):
         '''
         Send the request and return the response from chatGPT
         '''
         # Set the API key
         openai.api_key = self.api_key
-        response = openai.ChatCompletion.create(
-            model = self.model,
-            messages = self.messages
-        )
         return response
@@ -71,17 +79,45 @@ class Reviewer:
         return self.query()
-    def summarise(self, papers):
         '''
         Summarise the abstracts of several papers
         and compare the studies
         '''
         units = []
-        for idx, abstract in papers:
-            units.append(Prompts['Summarize_Unit'].format(idx=idx, abstract=abstract))
-        units.append(Prompts['Summarize'].format(idx=len(papers)))
-        message = '\n'.join(units)
         self.messages = [
             {"role": "user", "content": message},
         ]
@@ -117,10 +153,12 @@ def screen_demo():
         dump(answer, aJson)
-if __name__ == '__main__':
     reviewer = Reviewer()
     papers = [
         ('12345', 'Background: Little is known about reproductive health in severely obese women. In this study, we present associations between different levels of severe obesity and a wide range of health outcomes in the mother and child. Method(s): From the Danish National Birth Cohort, we obtained self-reported information about prepregnant body mass index (BMI) for 2451 severely obese women and 2450 randomly selected women from the remaining cohort who served as a comparison group. Information about maternal and infant outcomes was also self-reported or came from registers. Logistic regression was used to estimate the association between different levels of severe obesity and reproductive outcomes. Principal Findings: Subfecundity was more frequent in severely obese women, and during pregnancy, they had an excess risk of urinary tract infections, gestational diabetes, preeclampsia and other hypertensive disorders which increased with severity of obesity. They tended to have a higher risk of both pre- and post-term birth, and risk of cesarean and instrumental deliveries increased across obesity categories. After birth, severely obese women more often failed to initiate or sustain breastfeeding. Risk of weight retention 1.5 years after birth was similar to that of other women, but after adjustment for gestational weight gain, the risk was increased, especially in women in the lowest obesity category. In infants, increasing maternal obesity was associated with decreased risk of a low birth weight and increased risk of a high birth weight. Estimates for ponderal index showed the same pattern indicating an increasing risk of neonatal fatness with severity of obesity. Infant obesity measured one year after birth was also increased in children of severely obese mothers. Conclusion(s): Severe obesity is correlated with a substantial disease burden in reproductive health. 
Although the causal mechanisms remain elusive, these findings are useful for making predictions and planning health care at the individual level. © 2009 Nohr et al.'),
         ('45456', 'Background: Preeclampsia is one of the leading causes of maternal and perinatal morbidity and mortality world-wide. The risk for developing preeclampsia varies depending on the underlying mechanism. Because the disorder is heterogeneous, the pathogenesis can differ in women with various risk factors. Understanding these mechanisms of disease responsible for preeclampsia as well as risk assessment is still a major challenge. The aim of this study was to determine the risk factors associated with preeclampsia, in healthy women in maternity hospitals of Karachi and Rawalpindi. Method(s): We conducted a hospital based matched case-control study to assess the factors associated with preeclampsia in Karachi and Rawalpindi, from January 2006 to December 2007. 131 hospital-reported cases of PE and 262 controls without history of preeclampsia were enrolled within 3 days of delivery. Cases and controls were matched on the hospital, day of delivery and parity. Potential risk factors for preeclampsia were ascertained during in-person postpartum interviews using a structured questionnaire and by medical record abstraction. Conditional logistic regression was used to estimate matched odds ratios (ORs) and 95% confidence intervals (95% CIs). Result(s): In multivariate analysis, women having a family history of hypertension (adjusted OR 2.06, 95% CI; 1.27-3.35), gestational diabetes (adjusted OR 6.57, 95% CI; 1.94 -22.25), pre-gestational diabetes (adjusted OR 7.36, 95% CI; 1.37-33.66) and mental stress during pregnancy (adjusted OR 1.32; 95% CI; 1.19-1.46, for each 5 unit increase in Perceived stress scale score) were at increased risk of preeclampsia. However, high body mass index, maternal age, urinary tract infection, use of condoms prior to index pregnancy and sociodemographic factors were not associated with higher risk of having preeclampsia. 
Conclusion(s): Development of preeclampsia was associated with gestational diabetes, pregestational diabetes, family history of hypertension and mental stress during pregnancy. These factors can be used as a screening tool for preeclampsia prediction. Identification of the above mentioned predictors would enhance the ability to diagnose and monitor women likely to develop preeclampsia before the onset of disease for timely interventions and better maternal and fetal outcomes. © 2010 Shamsi et al; licensee BioMed Central Ltd.'),
     ]
     reviewer.summarise(papers)

 # import tiktoken
 import openai
+from openai.embeddings_utils import get_embedding, cosine_similarity
 from json import dump, loads
+import pandas as pd
 try:
     from utils.config import Prompts, OPENAI_KEY, REVIEW_MODEL
         self.messages = None
+    def query(self, msg=None):
         '''
         Send the request and return the response from chatGPT
         '''
         # Set the API key
         openai.api_key = self.api_key
+        if msg:
+            response = openai.ChatCompletion.create(
+                model = self.model,
+                messages =  [{"role": "user", "content": msg}]
+            )
+        else:
+            response = openai.ChatCompletion.create(
+                model = self.model,
+                messages = self.messages
+            )
         return response
         return self.query()
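+    def summarise(self, papers, prompts):
         '''
         Summarise the abstracts of several papers, compare the studies,
         and answer any further questions given in the prompts
         '''
         units = []
+        # number from 1 so the labels match the "#1-#{idx}" range in the Summarize prompt
+        for idx, (paper_id, abstract) in enumerate(papers, start=1):
+            units.append(Prompts['Summarize_Unit'].format(
+                idx=idx,
+                paper_id=paper_id,
+                abstract=abstract
+            ))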
+        paper_text = '\n'.join(units)
+        message = Prompts['Summarize'].format(idx=len(papers), papers=paper_text, questions=prompts)
+        self.messages = [
+            {"role": "user", "content": message},
+        ]
+        return self.query()
+    def study(self, question: str, paper_data: pd.DataFrame, top: int = 2):
+        '''
+        Read part of a paper and answer a question about it
+        step 1: compute the embedding of the question
+        step 2: find the top N most similar pieces of text
+        step 3: build the prompt and request an answer
+        '''
+        question_embedding = get_embedding(question, engine='text-embedding-ada-002')
+        paper_data['similarity'] = paper_data['embedding'].apply(lambda x: cosine_similarity(x, question_embedding))
+        # Pick the `top` best-matching texts (default 2, mainly to save tokens)
+        top_texts = paper_data.sort_values(by='similarity', ascending=False).head(top)[['page', 'text']].values.tolist()
+        units = []
+        for page, text in top_texts:
+            units.append(Prompts['Review_Unit'].format(page=page, content=text))
+        message = Prompts['Review'].format(pages='\n'.join(units), question=question)
         self.messages = [
             {"role": "user", "content": message},
         ]
         dump(answer, aJson)
+def summarise_demo():
     reviewer = Reviewer()
     papers = [
         ('12345', 'Background: Little is known about reproductive health in severely obese women. In this study, we present associations between different levels of severe obesity and a wide range of health outcomes in the mother and child. Method(s): From the Danish National Birth Cohort, we obtained self-reported information about prepregnant body mass index (BMI) for 2451 severely obese women and 2450 randomly selected women from the remaining cohort who served as a comparison group. Information about maternal and infant outcomes was also self-reported or came from registers. Logistic regression was used to estimate the association between different levels of severe obesity and reproductive outcomes. Principal Findings: Subfecundity was more frequent in severely obese women, and during pregnancy, they had an excess risk of urinary tract infections, gestational diabetes, preeclampsia and other hypertensive disorders which increased with severity of obesity. They tended to have a higher risk of both pre- and post-term birth, and risk of cesarean and instrumental deliveries increased across obesity categories. After birth, severely obese women more often failed to initiate or sustain breastfeeding. Risk of weight retention 1.5 years after birth was similar to that of other women, but after adjustment for gestational weight gain, the risk was increased, especially in women in the lowest obesity category. In infants, increasing maternal obesity was associated with decreased risk of a low birth weight and increased risk of a high birth weight. Estimates for ponderal index showed the same pattern indicating an increasing risk of neonatal fatness with severity of obesity. Infant obesity measured one year after birth was also increased in children of severely obese mothers. Conclusion(s): Severe obesity is correlated with a substantial disease burden in reproductive health. 
Although the causal mechanisms remain elusive, these findings are useful for making predictions and planning health care at the individual level. © 2009 Nohr et al.'),
         ('45456', 'Background: Preeclampsia is one of the leading causes of maternal and perinatal morbidity and mortality world-wide. The risk for developing preeclampsia varies depending on the underlying mechanism. Because the disorder is heterogeneous, the pathogenesis can differ in women with various risk factors. Understanding these mechanisms of disease responsible for preeclampsia as well as risk assessment is still a major challenge. The aim of this study was to determine the risk factors associated with preeclampsia, in healthy women in maternity hospitals of Karachi and Rawalpindi. Method(s): We conducted a hospital based matched case-control study to assess the factors associated with preeclampsia in Karachi and Rawalpindi, from January 2006 to December 2007. 131 hospital-reported cases of PE and 262 controls without history of preeclampsia were enrolled within 3 days of delivery. Cases and controls were matched on the hospital, day of delivery and parity. Potential risk factors for preeclampsia were ascertained during in-person postpartum interviews using a structured questionnaire and by medical record abstraction. Conditional logistic regression was used to estimate matched odds ratios (ORs) and 95% confidence intervals (95% CIs). Result(s): In multivariate analysis, women having a family history of hypertension (adjusted OR 2.06, 95% CI; 1.27-3.35), gestational diabetes (adjusted OR 6.57, 95% CI; 1.94 -22.25), pre-gestational diabetes (adjusted OR 7.36, 95% CI; 1.37-33.66) and mental stress during pregnancy (adjusted OR 1.32; 95% CI; 1.19-1.46, for each 5 unit increase in Perceived stress scale score) were at increased risk of preeclampsia. However, high body mass index, maternal age, urinary tract infection, use of condoms prior to index pregnancy and sociodemographic factors were not associated with higher risk of having preeclampsia. 
Conclusion(s): Development of preeclampsia was associated with gestational diabetes, pregestational diabetes, family history of hypertension and mental stress during pregnancy. These factors can be used as a screening tool for preeclampsia prediction. Identification of the above mentioned predictors would enhance the ability to diagnose and monitor women likely to develop preeclampsia before the onset of disease for timely interventions and better maternal and fetal outcomes. © 2010 Shamsi et al; licensee BioMed Central Ltd.'),
     ]
     reviewer.summarise(papers)

utils/task.py CHANGED Viewed

@@ -1,3 +1,7 @@
 from json import loads
 import pandas as pd
 import io
@@ -5,6 +9,7 @@ import io
 from utils.review  import Reviewer
 from utils.pubmed  import PubMedFetcher
 from utils.ris_parser  import RisFile
 def get_paper_info(inputMethod, email=None, pmids=None, ris_data=None):
@@ -70,5 +75,30 @@ def review(task, paper_info, prompts, openai_key, review_model):
                 papers.append((rec['DOI'], rec['Abstract']))
             else:
                 raise Exception('No PMID nor DOI in record.')
-        response = reviewer.summarise(papers)
         return response['choices'][0]['message']['content']

+'''
+Everything in this file should really be trimmed down and moved under ReviewerBot.
+'''
 from json import loads
 import pandas as pd
 import io
 from utils.review  import Reviewer
 from utils.pubmed  import PubMedFetcher
 from utils.ris_parser  import RisFile
+from utils.pdf_parser  import PdfFile
 def get_paper_info(inputMethod, email=None, pmids=None, ris_data=None):
                 papers.append((rec['DOI'], rec['Abstract']))
             else:
                 raise Exception('No PMID nor DOI in record.')
+        response = reviewer.summarise(papers, prompts)
         return response['choices'][0]['message']['content']
+def parse_pdf_info(pdf_data, openai_key):
+    '''
+    Parse the PDF with the parser module. Kept as a separate function because parsing only needs to happen once.
+    '''
+    fileHandle = io.BytesIO(pdf_data)
+    pdf = PdfFile(file=fileHandle, api_key=openai_key)
+    paper_data = pdf.parse_info()
+    return paper_data
+def study(paper_data, prompts, openai_key, review_model):
+    '''
+    Read (study) the paper based on the already-parsed PDF data.
+    '''
+    reviewer = Reviewer(api_key=openai_key, model=review_model)
+    response = reviewer.study(prompts, paper_data)
+    return response['choices'][0]['message']['content']
+def query(prompts, openai_key, review_model):
+    reviewer = Reviewer(api_key=openai_key, model=review_model)
+    response = reviewer.query(prompts)
+    return response['choices'][0]['message']['content']
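`review`, `study`, and `query` in this diff all unpack the reply with the same expression, `response['choices'][0]['message']['content']`. A minimal sketch of how that lookup could be factored out follows. It assumes the response behaves like a plain dict with that shape. `extract_content` and `fake_response` are hypothetical names used for illustration and are not part of this commit.

```python
from typing import Any, Mapping


def extract_content(response: Mapping[str, Any]) -> str:
    """Return the message text of the first choice in a chat-style response.

    Hypothetical helper: it mirrors the lookup repeated in utils/task.py and
    raises a clearer error when the response does not have the expected shape.
    """
    try:
        return response['choices'][0]['message']['content']
    except (KeyError, IndexError, TypeError) as err:
        raise ValueError('Unexpected response shape: no message content found.') from err


# Hand-built stand-in for an API reply; no network call is made here.
fake_response = {
    'choices': [
        {'message': {'role': 'assistant', 'content': 'Summary text'}}
    ]
}

print(extract_content(fake_response))
```

With a helper like this, each task function could end with `return extract_content(response)`, so a malformed or empty reply would fail with a single descriptive error instead of a bare `KeyError` or `IndexError`.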