Silen Wang committed on
Commit
c4ad2cf
·
1 Parent(s): 9b8e6f6

v0.2.1 demo update

.gitignore ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ conda/
2
+ utils/config.py
3
+ *test*
4
+ *__pycache__*
README.md CHANGED
@@ -16,6 +16,8 @@ Researchers need to read a large amount of literature every day to keep up with
16
 
17
  ## Demo
18
 
 
 
19
  - Screen:
20
 
21
  ![demo](img/screen.png)
@@ -24,6 +26,10 @@ Researchers need to read a large amount of literature every day to keep up with
24
 
25
  ![demo](img/summarise.png)
26
 
 
 
 
 
27
 
28
  ## ToDo
29
 
@@ -33,8 +39,8 @@ Researchers need to read a large amount of literature every day to keep up with
33
  + [ ] Add a download button for raw parsing data(json)
34
  + [x] Implementation of content summarise function
35
  + [ ] The About page
36
- + [ ] Add usage instructions
37
- + [ ] Add a page for reading single paper
38
  - Backend:
39
  + [x] Call the chatGPT API for content summarization
40
  + [x] Call the chatGPT API for literature content access judgment (for meta-analysis)
@@ -47,20 +53,30 @@ Researchers need to read a large amount of literature every day to keep up with
47
  + [ ] chatGLM
48
  + [ ] moss
49
  + [ ] LLaMA
50
- + [ ] Add the function of reading single paper
51
  + [ ] Add APIs for existing feature
 
52
  - Reference learning:
53
- + [ ] Learn the content of[ResearchGPT](https://github.com/mukulpatnaik/researchgpt) and add similar function
54
  + [ ] Learn the content of[chatPaper](https://github.com/kaixindelele/ChatPaper) and add similar function
55
- + [ ] Try build something like [chatPDF](https://www.chatpdf.com/)
56
  - Others:
57
  - [x] English README
58
  - [ ] Dockerfile for container building
59
- - [ ] A HuggingFace demo
 
 
 
 
 
 
 
 
 
60
 
61
  ## Problems
62
 
63
  - This project was initially developed using Pynecone, but encountered several problems that affected its use/appearance, so it was finally switched to Gradio.
64
  + pynecone continues to occupy the CPU after startup.
65
  + Currently, the file upload function is not very user-friendly, and you must use buttons or other content to trigger the upload (I have not found how to implement drag and drop upload).
66
- + After uploading the file, performing other operations will cause the displayed file name to be lost.
 
16
 
17
  ## Demo
18
 
19
+ A [demo](https://huggingface.co/spaces/SilenWang/ReviewGPT) is available on Hugging Face; an OpenAI API key is required.
20
+
21
  - Screen:
22
 
23
  ![demo](img/screen.png)
 
26
 
27
  ![demo](img/summarise.png)
28
 
29
+ - Study:
30
+
31
+ ![demo](img/study_paper.png)
32
+ ![demo](img/study_other.png)
33
 
34
  ## ToDo
35
 
 
39
  + [ ] Add a download button for raw parsing data(json)
40
  + [x] Implementation of content summarise function
41
  + [ ] The About page
42
+ + [x] Add usage instructions
43
+ + [x] Paper Reading Page
44
  - Backend:
45
  + [x] Call the chatGPT API for content summarization
46
  + [x] Call the chatGPT API for literature content access judgment (for meta-analysis)
 
53
  + [ ] chatGLM
54
  + [ ] moss
55
  + [ ] LLaMA
56
+ + [x] Add the function of reading single paper
57
  + [ ] Add APIs for existing feature
58
+ + [ ] Improve the PDF parsing module in the Study feature, with the goal of changing the unit from page to paragraph.
59
  - Reference learning:
60
+ + [x] Study [ResearchGPT](https://github.com/mukulpatnaik/researchgpt) and add similar functionality
61
  + [ ] Study [chatPaper](https://github.com/kaixindelele/ChatPaper) and add similar functionality
62
+ + [ ] ~~Try to build something like [chatPDF](https://www.chatpdf.com/)~~
63
  - Others:
64
  - [x] English README
65
  - [ ] Dockerfile for container building
66
+ - [x] A HuggingFace demo
67
+ - [ ] Add error handling for network tasks, following the example of [chatPaper](https://github.com/kaixindelele/ChatPaper)
68
+
69
+ ## Code Interpretation
70
+
71
+ - According to chatGPT, the implementation of ResearchGPT is as follows:
72
+ + Convert the file contents into text **page by page**
73
+ + Call `text-embedding-ada-002` to compute an embedding vector for each page
74
+ + Embed the question in the same way and compute its similarity to each page's embedding
75
+ + Send the 3 pages most similar to the question, together with the question, to the chatGPT interface for interpretation
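The retrieval steps listed under Code Interpretation can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_embedding` is a hypothetical bag-of-words stand-in for `text-embedding-ada-002` so that the example runs offline, and `top_pages` is an illustrative name, not a function from ResearchGPT or this repository.

```python
import math
import re
from collections import Counter


def toy_embedding(text):
    """Hypothetical stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_pages(pages, question, top=3):
    """Return the `top` (page_number, text) pairs most similar to the question."""
    question_vec = toy_embedding(question)
    scored = [
        (cosine_similarity(toy_embedding(text), question_vec), number, text)
        for number, text in pages
    ]
    # Highest similarity first; ties keep their original page order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(number, text) for _, number, text in scored[:top]]


pages = [
    (1, "Introduction to obesity and pregnancy outcomes"),
    (2, "Methods: logistic regression on the birth cohort"),
    (3, "Results: risk of preeclampsia increased with obesity"),
    (4, "Acknowledgements and funding"),
]
selected = top_pages(pages, "What regression method was used?", top=2)
```

With a real embedding model the vectors would be dense lists of floats, but the ranking logic (score every page, sort, keep the best few) would stay the same.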
76
 
77
  ## Problems
78
 
79
  - This project was initially developed using Pynecone, but encountered several problems that affected its use/appearance, so it was finally switched to Gradio.
80
  + pynecone continues to occupy the CPU after startup.
81
  + Currently, the file upload function is not very user-friendly, and you must use buttons or other content to trigger the upload (I have not found how to implement drag and drop upload).
82
+ + After uploading the file, performing other operations will cause the displayed file name to be lost.
README.zh-cn.md CHANGED
@@ -4,6 +4,8 @@
4
 
5
  ## 运行展示
6
 
 
 
7
  - 文献准入判断(Screen):
8
 
9
  ![demo](img/screen.png)
@@ -12,6 +14,11 @@
12
 
13
  ![demo](img/summarise.png)
14
 
 
 
 
 
 
15
  ## 规划内容
16
 
17
 
@@ -21,8 +28,8 @@
21
  + [ ] 增加原始解析数据下载按钮(json)
22
  + [x] 内容综述功能实装
23
  + [ ] 增加About页
24
- + [ ] 增加使用说明(具体怎么加没想好)
25
- + [ ] 增加单文献阅读的页面
26
  - 后端:
27
  + [x] 调用chatGPT的API进行内容综述
28
  + [x] 调用chatGPT的API进行文献内容准入判断(Meta分析用)
@@ -35,16 +42,28 @@
35
  + [ ] chatGLM
36
  + [ ] moss
37
  + [ ] LLaMA
38
- + [ ] 增加单文献阅读的功能
39
  + [ ] 增加现有功能的API
 
40
  - 参考学习:
41
- + [ ] 学习[ResearchGPT](https://github.com/mukulpatnaik/researchgpt)的内容, 增加类似的功能
42
  + [ ] 学习[chatPaper](https://github.com/kaixindelele/ChatPaper)的内容, 增加类似的功能
43
- + [ ] 尝试构建个[chatPDF](https://www.chatpdf.com/)类似的功能
44
  - 杂项:
45
  - [x] 英文README
46
  - [ ] 准备Dockfile, 构建容器
47
- - [ ] 准备HuggingFace Demo
 
 
 
 
 
 
 
 
 
 
 
48
 
49
  ## 问题记录
50
 
 
4
 
5
  ## 运行展示
6
 
7
+ 现在可以在Huggingface上试用本程序的[Demo](https://huggingface.co/spaces/SilenWang/ReviewGPT)(需要自备OpenAI API Key)
8
+
9
  - 文献准入判断(Screen):
10
 
11
  ![demo](img/screen.png)
 
14
 
15
  ![demo](img/summarise.png)
16
 
17
+ - 文献辅助阅读(Study):
18
+
19
+ ![demo](img/study_paper.png)
20
+ ![demo](img/study_other.png)
21
+
22
  ## 规划内容
23
 
24
 
 
28
  + [ ] 增加原始解析数据下载按钮(json)
29
  + [x] 内容综述功能实装
30
  + [ ] 增加About页
31
+ + [x] 增加使用说明
32
+ + [x] 增加单文献阅读的页面
33
  - 后端:
34
  + [x] 调用chatGPT的API进行内容综述
35
  + [x] 调用chatGPT的API进行文献内容准入判断(Meta分析用)
 
42
  + [ ] chatGLM
43
  + [ ] moss
44
  + [ ] LLaMA
45
+ + [x] 增加单文献阅读的功能
46
  + [ ] 增加现有功能的API
47
+ + [ ] 改进Study功能中的PDF解析模块, 目标是将单位由页变为段落
48
  - 参考学习:
49
+ + [x] 学习[ResearchGPT](https://github.com/mukulpatnaik/researchgpt)的内容, 增加类似的功能
50
  + [ ] 学习[chatPaper](https://github.com/kaixindelele/ChatPaper)的内容, 增加类似的功能
51
+ + [ ] ~~尝试构建个[chatPDF](https://www.chatpdf.com/)类似的功能~~
52
  - 杂项:
53
  - [x] 英文README
54
  - [ ] 准备Dockfile, 构建容器
55
+ - [x] 准备HuggingFace Demo
56
+ - [ ] 效仿[chatPaper](https://github.com/kaixindelele/ChatPaper)增加网络任务的错误处理(tenacity)
57
+
58
+ ## 学习内容
59
+
60
+ ### ResearchGPT功能实现原理
61
+
62
+ - 借助chatGPT解读可知, ResearchGPT的实现方式为:
63
+ + 将文件内容**按页**转换为文本
64
+ + 调用`text-embedding-ada-002`进行文本embedding矩阵计算
65
+ + 将问题也转换为矩阵, 与每个页面的矩阵计算相似性
66
+ + 将相似性最高的3个页面与提出的问题一起给chatGPT接口, 实现文献解读
67
 
68
  ## 问题记录
69
 
img/study_other.png ADDED
img/study_paper.png ADDED
img/summarise.png CHANGED
lang/Review.md ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ### Usage
2
+
3
+ This page provides two functions: Summarize and Screen.
4
+
5
+ - Summarize: Reads the abstracts of multiple articles, summarizes their content, and compares the similarities and differences among the studies. Follow-up questions can be asked through the prompts.
6
+ - Screen: Reads article abstracts one by one and determines whether each article meets the conditions given in the prompts; useful for batch screening of literature.
7
+
8
+ Currently, two input methods are provided:
9
+
10
+ - PMID: Enter PMIDs directly; the PubMed API is called through Biopython to fetch and parse the abstracts.
11
+ - RIS file: Export records in RIS format from a reference manager or an online database, then upload the file for parsing.
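As a rough illustration of the PMID input method, the sketch below turns free-form textbox content into a clean list of PMIDs before any PubMed request is made. `parse_pmids` is a hypothetical helper, not part of this commit, and the accepted separators (whitespace, commas, semicolons) are an assumption about what users might type.

```python
import re


def parse_pmids(raw_text):
    """Split free-form textbox input into a de-duplicated list of PMIDs."""
    seen = set()
    pmids = []
    for token in re.split(r"[\s,;]+", raw_text.strip()):
        # PMIDs are plain positive integers; skip anything else.
        if re.fullmatch(r"[0-9]+", token) and token not in seen:
            seen.add(token)
            pmids.append(token)
    return pmids


# The IDs below are the placeholder IDs used in the repository's demo data.
example = parse_pmids("12345, 45456\n12345\nnot-a-pmid")
```

Keeping the IDs as strings avoids losing leading characters and matches how they would be passed on to a fetcher.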
lang/Review.zh_cn.md ADDED
@@ -0,0 +1,11 @@
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ### 使用说明
2
+
3
+ 这个页面提供Summarise和Screen两种功能:
4
+
5
+ - Summarise: 阅读多篇文章的摘要, 总结摘要的内容并比较多个研究的异同, 可以通过Prompts进行进一步提问
6
+ - Screen: 逐一阅读文章的摘要, 判断文章是否符合Prompts中给出的条件, 可用于批量筛选文献
7
+
8
+ 目前提供两种输入方式:
9
+
10
+ - PMID: 直接输入PMID号, 将调用biopython提供的Pumbed API获取文献摘要并解析
11
+ - RIS格式文件: 可从各种文献管理软件, 或者在线数据库以RIS格式导出题录后上传解析
lang/Setting.md ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ ### Setting Instructions
2
+
3
+ The currently available setup options include:
4
+
5
+ - `OpenAI API Key`: All features currently rely on the OpenAI API, so this key is required. The key you enter is not written to disk; it is held in a `gradio.State` object for the session. If you deploy the app yourself, you can put the key in `utils/config.py` (following the format of `utils/config_sample.py`) to avoid re-entering it every time the page is refreshed.
6
+
7
+ - `Email`: To use the API provided by NCBI, an email address must be provided, so this setting is also required. The process for self-deployment is the same as that for the `OpenAI API Key`.
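The self-deployment note relies on a simple fallback: use `utils/config.py` when it exists, otherwise fall back to `utils/config_sample.py`. A generalised sketch of that pattern follows; `load_first_available` is a hypothetical helper, and standard-library module names are used in place of the project's modules so the example runs anywhere.

```python
import importlib


def load_first_available(module_names):
    """Import and return the first module in `module_names` that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {module_names} could be imported")


# In the project the list would be ["utils.config", "utils.config_sample"];
# "no_such_private_config" simulates a missing private config file.
conf = load_first_available(["no_such_private_config", "json"])
```

Because `utils/config.py` is listed in `.gitignore`, a private key stored there stays out of version control while the sample file keeps the app importable.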
lang/Setting.zh_cn.md ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ ### 设置说明
2
+
3
+ 目前开放的设置项包括:
4
+
5
+ - `OpenAI API Key`: 目前所有的功能都基于OpenAI 提供的API实现, 所以该Key是必填项. 填写的Key不直接存储到本地, 而是在`gradio.States`对象中. 如果进行自部署, 则可以将内容写到`utils/config.py`文件中(参考`utils/config_sample.py`的格式)
6
+ - `Email`: 调用NCBI提供的API必须要提供邮箱, 因此也需要进行设置, 自部署的处理同`OpenAI API Key`
lang/Study.md ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ ### Usage
2
+
3
+ This page uses AI to assist in reading papers, and currently has two modes:
4
+
5
+ - Paper: Upload a PDF file for parsing, and then answer questions based on the content of the document. Please make sure to upload the PDF before asking questions in Paper mode.
6
+
7
+ - Other: Ask questions directly without relying on documents, which is equivalent to using ChatGPT directly, but without contextual conversation ability (saving tokens).
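To make the "no contextual conversation" point concrete, here is a minimal sketch of how a chat payload could be built with and without history. `build_messages` and its `keep_context` flag are hypothetical; the commit itself only ever sends a single user message per request.

```python
def build_messages(question, history=None, keep_context=False):
    """Build a chat payload; by default only the newest question is sent."""
    messages = []
    if keep_context and history:
        for user_text, bot_text in history:
            if user_text is not None:
                messages.append({"role": "user", "content": user_text})
            if bot_text is not None:
                messages.append({"role": "assistant", "content": bot_text})
    messages.append({"role": "user", "content": question})
    return messages


history = [["What is a PMID?", "A PubMed identifier."]]
stateless = build_messages("And what is RIS?", history)
contextual = build_messages("And what is RIS?", history, keep_context=True)
```

The stateless payload stays the same size however long the dialog gets, which is where the token saving comes from; the cost is that follow-up questions such as "And what is RIS?" lose their context.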
lang/Study.zh_cn.md ADDED
@@ -0,0 +1,6 @@
 
 
 
 
 
 
 
1
+ ### 使用说明
2
+
3
+ 这个页面中可利用AI辅助阅读文献, 目前有两种模式:
4
+
5
+ - Paper: 上传PDF文件进行解析, 然后根据文档的内容回答问题, 请确保先上传PDF后再用Paper模式提问
6
+ - Other: 不基于文档直接提问, 等同于直接使用chatGPT, 但是没有上下文对话能力(节省token)
lang/Title.md ADDED
@@ -0,0 +1,2 @@
 
 
 
1
+ # ReviewGPT
2
+ ReviewGPT is an app that uses the ChatGPT API to perform paper summarization and aggregation. My goal is to use AI to accelerate the reading and retrieval of papers.
requirements.txt CHANGED
@@ -2,4 +2,8 @@ gradio>=3.21.0
2
  openai==0.27.0
3
  pandas>=1.2.4
4
  biopython
5
- rispy
 
 
 
 
 
2
  openai==0.27.0
3
  pandas>=1.2.4
4
  biopython
5
+ rispy
6
+ pdfplumber
7
+ plotly
8
+ scipy
9
+ scikit-learn
reviewGPT.py CHANGED
@@ -1,6 +1,7 @@
1
  import gradio as gr
2
  from datetime import datetime
3
  from utils import task
 
4
 
5
  try:
6
  import utils.config as conf
@@ -8,10 +9,13 @@ except ImportError:
8
  import utils.config_sample as conf
9
 
10
 
11
- # 定义
12
- TITLE = "# ReviewGPT"
13
- # 标题下的描述,支持md格式
14
- DESCRIPTION = "ReviewGPT is an app that use the ChatGPT API to perform paper summarization and aggregation. My goal is to use AI to accelerate the reading and retrieval of papers."
 
 
 
15
 
16
 
17
  def load_setting():
@@ -24,19 +28,76 @@ def load_setting():
24
  return setting
25
 
26
 
27
  # 构建Blocks上下文
28
  with gr.Blocks() as reviewGPT:
29
  share_settings = gr.State(load_setting()) # 已经转换成字典了
 
30
 
31
- gr.Markdown(TITLE)
32
- gr.Markdown(DESCRIPTION)
33
 
34
  with gr.Tab("Review"):
 
 
35
  with gr.Row(): # 行排列
36
  with gr.Column(): # 列排列
37
  gr.Markdown('### Data Input')
38
  input_form = gr.Radio(
39
- ["PMID", "RIS File"], label="Select Input Form"
40
  )
41
  pmids = gr.Textbox(label="Input PMIDs", lines=9, visible=True, interactive=True)
42
  ris_file = gr.File(label="Please Select Ris File", visible=False, file_types=['.ris'], type='binary')
@@ -46,24 +107,53 @@ with gr.Blocks() as reviewGPT:
46
  with gr.Column(): # 列排列
47
  gr.Markdown('### Task Setting')
48
  task_choice = gr.Radio(
49
- ["Screen", "Summarise"], label="Select a Task"
50
  )
51
  prompts = gr.Textbox(label="Input Prompts", lines=9, interactive=True)
52
  start = gr.Button(value='REVIEW START')
53
  out = gr.Markdown()
54
 
55
- def run_review(input_form, task_choice, prompts, pmids, ris_file, share_settings):
56
- '''
57
- click触发进行文献阅读
58
- '''
59
- paper_info = task.get_paper_info(input_form, share_settings['EMAIL'], pmids, ris_file)
60
- return task.review(task_choice, paper_info, prompts, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
61
-
62
  start.click(fn=run_review, inputs=[
63
  input_form, task_choice, prompts, pmids, ris_file, share_settings
64
  ], outputs=out)
65
 
66
  with gr.Tab("Setting"):
 
 
67
 
68
  # gr.Text('Model Selection(not effective now)')
69
  # model = gr.Textbox(label="Select a model", interactive=True)
 
1
  import gradio as gr
2
  from datetime import datetime
3
  from utils import task
4
+ import pandas as pd
5
 
6
  try:
7
  import utils.config as conf
 
9
  import utils.config_sample as conf
10
 
11
 
+ def load_markdown(base):
+     '''
+     Load the description text shown on a page.
+     '''
+     with open(f'lang/{base}.md', encoding='utf-8') as f:
+         content = f.read()
+     return content
19
 
20
 
21
  def load_setting():
 
28
  return setting
29
 
30
 
+ def run_review(input_form, task_choice, prompts, pmids, ris_file, share_settings):
+     '''
+     Click handler: fetch the paper records, then run the selected review task.
+     '''
+     paper_info = task.get_paper_info(input_form, share_settings['EMAIL'], pmids, ris_file)
+     return task.review(task_choice, paper_info, prompts, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
+
+
+ def pdf_preprocess(history):
+     '''
+     Parsing a PDF takes a while, so post a notice in the chat window asking
+     the user not to interact with the page yet (a hard lock will be added later).
+     '''
+     return history + [[None, 'File uploaded, now pre-processing. Please do not ask questions in Paper mode until processing is done.']]  # the first field is the user input
+
+
+ def pdf_process(pdf_data, share_settings):
+     '''
+     Convert the uploaded PDF file into paper data.
+     '''
+     paper_data = task.parse_pdf_info(pdf_data, share_settings['OPENAI_KEY'])
+     return paper_data
+
+
+ def pdf_postprocess(history):
+     '''
+     Post a notice in the chat window once PDF parsing has finished,
+     so the user knows questions can now be asked.
+     '''
+     return history + [[None, 'PDF pre-processing done. You can now ask questions about this paper.']]  # the first field is the user input
+
+
+ def user(user_message, history):
+     '''
+     Handle the text submitted in the input box.
+     '''
+     return "", history + [[user_message, None]]  # the first field is the user input
+
+
+ def bot(history, paper_data, mode, share_settings):
+     '''
+     Answer the latest question and write the reply into the chat history.
+     '''
+     question = history[-1][0]
+     if mode == 'Paper':
+         bot_message = task.study(paper_data, question, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
+         history[-1][1] = bot_message  # store the reply in the last turn
+     elif mode == 'Other':
+         bot_message = task.query(question, share_settings['OPENAI_KEY'], share_settings['REVIEW_MODEL'])
+         history[-1][1] = bot_message  # store the reply in the last turn
+     else:
+         raise ValueError('Invalid Query Mode')
+     return history
+
85
+
86
  # 构建Blocks上下文
87
  with gr.Blocks() as reviewGPT:
88
  share_settings = gr.State(load_setting()) # 已经转换成字典了
89
+ pdf_data = gr.State(pd.DataFrame()) # parsed PDF data
90
 
91
+ gr.Markdown(load_markdown('Title'))
 
92
 
93
  with gr.Tab("Review"):
94
+ with gr.Box():
95
+ gr.Markdown(load_markdown('Review'))
96
  with gr.Row(): # 行排列
97
  with gr.Column(): # 列排列
98
  gr.Markdown('### Data Input')
99
  input_form = gr.Radio(
100
+ ["PMID", "RIS File"], value = "PMID", label="Select Input Form",
101
  )
102
  pmids = gr.Textbox(label="Input PMIDs", lines=9, visible=True, interactive=True)
103
  ris_file = gr.File(label="Please Select Ris File", visible=False, file_types=['.ris'], type='binary')
 
107
  with gr.Column(): # 列排列
108
  gr.Markdown('### Task Setting')
109
  task_choice = gr.Radio(
110
+ ["Screen", "Summarise"], value = "Screen", label="Select a Task"
111
  )
112
  prompts = gr.Textbox(label="Input Prompts", lines=9, interactive=True)
113
  start = gr.Button(value='REVIEW START')
114
  out = gr.Markdown()
115
 
 
 
 
 
 
 
 
116
  start.click(fn=run_review, inputs=[
117
  input_form, task_choice, prompts, pmids, ris_file, share_settings
118
  ], outputs=out)
119
 
120
+
121
+ with gr.Tab("Study"): # the paper-reading page uses a chatbot layout
122
+ ###### Layout ######
123
+ with gr.Box():
124
+ gr.Markdown(load_markdown('Study'))
125
+ with gr.Row():
126
+ with gr.Column(scale=0.2):
127
+ pdf_file = gr.File(label="Please Select PDF File", visible=True, file_types=['.pdf'], type='binary')
128
+ mode = gr.Radio(
129
+ ["Paper", 'Other'], value = "Paper", label="Query Mode", interactive=True
130
+ )
131
+ clear = gr.Button("Clear Dialog")
132
+
133
+ with gr.Column(scale=0.8):
134
+ chatbot = gr.Chatbot(
135
+ value=[[None, "Hi, I'm a bot to help you read papers, please first upload a pdf file in the left box"]]
136
+ ).style(height=350)
137
+ with gr.Row():
138
+ msg = gr.Textbox(label='Query')
139
+
140
+ ###### Event handlers ######
141
+ pdf_file.change(
142
+ pdf_preprocess, chatbot, chatbot
143
+ ).then(
144
+ pdf_process, [pdf_file, share_settings], pdf_data
145
+ ).then(
146
+ pdf_postprocess, chatbot, chatbot
147
+ )
148
+ clear.click(lambda: None, None, chatbot, queue=False)
149
+ msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then( # then() chains the next action
150
+ bot, [chatbot, pdf_data, mode, share_settings], chatbot
151
+ )
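The submit chain above runs `user` first and `bot` second, passing the chat history between them. The sketch below replays that flow without Gradio so the shape of the history is easy to see. It is a simplification: `bot` here takes a hypothetical `answer_fn` callback instead of the paper data, mode, and settings, and `fake_answer` stands in for `task.study` / `task.query`.

```python
def user(user_message, history):
    """Clear the textbox and append the question with an empty answer slot."""
    return "", history + [[user_message, None]]


def bot(history, answer_fn):
    """Fill the answer slot of the newest turn using `answer_fn`."""
    question = history[-1][0]
    history[-1][1] = answer_fn(question)
    return history


def fake_answer(question):
    # Hypothetical stand-in for task.study / task.query.
    return f"(answer to: {question})"


# The initial greeting has no user part, hence None in the first field.
history = [[None, "Hi, please upload a PDF first"]]
textbox, history = user("What is the main finding?", history)
history = bot(history, fake_answer)
```

Splitting the work this way lets the question appear in the chat window immediately, while the slower model call fills in the answer afterwards.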
152
+
153
+
154
  with gr.Tab("Setting"):
155
+ with gr.Box():
156
+ gr.Markdown(load_markdown('Setting'))
157
 
158
  # gr.Text('Model Selection(not effective now)')
159
  # model = gr.Textbox(label="Select a model", interactive=True)
utils/config_sample.py CHANGED
@@ -9,19 +9,28 @@ EMAIL = "YOUR@MAIL.COM"
9
 
10
  # 预设的Promots, 目前用来实现的功能有两个
11
  Prompts = {
12
- 'Summarize': '小结 #1-#{idx} 的内容, 并小结不同参考文献的异同, 请用中文给出回答',
 
 
 
 
 
 
 
 
 
 
13
  'Summarize_Unit': '''
14
- 将编号为{idx}的文献标记为参考文献 #1, 将后续的文段标记为 #1 的摘要.
15
  {abstract}
16
  ''',
17
-
18
  # 进行meta时的标准肯定是变化的, 因此这个功能的Promot主要是用于功能上的限定
19
  # 1. 限制回答方式为json格式的字符串, 方便解析
20
  # 2. 限制chatGPT逐一检查要检查的问题, 并在认为不符合标准时给出不符合哪一条
21
  # 3. 限制chatGPT尝试拼接语义正确但是内容不正确的回答, 让它在无法判定时直接说明难以判断, 交给人工判断
22
  'Screen': '''
23
- 请你扮演一位研究者, 你接下来将要进行一Meta分析, 因此需要逐一文献摘要, 并对摘要内容进行理解,
24
- 以判断文献是否满足Meta分析的准入标准, 可以用于后续分析. 在阅读时, 你需要对后面给出的每一条准入标准逐
25
  一进行判断, 如果文献的摘要不满足任意一条标准, 那么需要给出你认为摘要不满足标准的原因.
26
  返回的结果请按照json字串的方式给出, 请不要附带任何其他的多余内容.
27
 
@@ -46,4 +55,23 @@ Prompts = {
46
  下面是文献摘要的内容:
47
  {abstract}
48
  ''',
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
49
  }
 
9
 
10
  # 预设的Promots, 目前用来实现的功能有两个
11
  Prompts = {
12
+ 'Summarize': '''
13
+ {papers}
14
+
15
+ 小结 #1-#{idx} 的内容, 并小结不同参考文献的异同
16
+
17
+ 如果后面还有其他问题, 请基于#1-#{idx} 的内容逐一进行回答
18
+
19
+ 所有问题请中文作答, 但对一些专有名词及其缩写可不翻译.
20
+
21
+ {questions}
22
+ ''',
23
  'Summarize_Unit': '''
24
+ 将编号为{paper_id}的文献标记为参考文献 #{idx}, 将后续的文段标记为 #{idx} 的摘要.
25
  {abstract}
26
  ''',
 
27
  # 进行meta时的标准肯定是变化的, 因此这个功能的Promot主要是用于功能上的限定
28
  # 1. 限制回答方式为json格式的字符串, 方便解析
29
  # 2. 限制chatGPT逐一检查要检查的问题, 并在认为不符合标准时给出不符合哪一条
30
  # 3. 限制chatGPT尝试拼接语义正确但是内容不正确的回答, 让它在无法判定时直接说明难以判断, 交给人工判断
31
  'Screen': '''
32
+ 请你扮演一位研究者, 你接下来将要进行一Meta分析, 因此需要逐一阅读并理解文献摘要,
33
+ 以判断文献是否满足Meta分析的准入标准. 在阅读时, 你需要对后面给出的每一条准入标准逐
34
  一进行判断, 如果文献的摘要不满足任意一条标准, 那么需要给出你认为摘要不满足标准的原因.
35
  返回的结果请按照json字串的方式给出, 请不要附带任何其他的多余内容.
36
 
 
55
  下面是文献摘要的内容:
56
  {abstract}
57
  ''',
58
+ 'Review': '''
59
+ 请你扮演一位研究者, 你接下来将要阅读一篇论文的部分页, 并基于这部分的内容回答一个或多个问题.
60
+ 回答问题时, 请完全基于提供的论文内容回答, 这具体指: 按照论文的思路, 用论文的表述方式进行回答.
61
+ 不要在回答中添加论文不存在的内容. 所有回答请使用中文, 但是对于一些专有名词及其缩写可以不翻译.
62
+
63
+ 下面是这篇论文的部分页:
64
+ {pages}
65
+
66
+ 下面是待回答的问题, 如果有多个问题, 请逐一回答他们, 在回答的时候请说明基于前面哪一页的内容进行了回答,
67
+
68
+ 如: 1. 基于#1和#3的内容, 我的回答是...
69
+
70
+
71
+ {question}
72
+ ''',
73
+ 'Review_Unit': '''
74
+ 将后续的文段标记为这篇文献的 #{page}页, 这一页中的内容是:
75
+ {content}
76
+ '''
77
  }
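The `Summarize` and `Summarize_Unit` templates are combined with `str.format`: each abstract is wrapped in a unit that assigns it a reference number, and the joined units are placed into the outer template together with the user's questions. A minimal sketch follows, with short English stand-ins for the Chinese templates (the wording of `UNIT_TEMPLATE` and `SUMMARY_TEMPLATE` is an assumption, as is the helper name `build_summary_prompt`).

```python
# English stand-ins for the Chinese templates stored in `Prompts`.
UNIT_TEMPLATE = "Label paper {paper_id} as reference #{idx}; its abstract is:\n{abstract}"
SUMMARY_TEMPLATE = (
    "{papers}\n\n"
    "Summarise references #1-#{idx} and compare them.\n\n"
    "{questions}"
)


def build_summary_prompt(papers, questions=""):
    """Number each (paper_id, abstract) pair from 1 and fill the summary template."""
    units = [
        UNIT_TEMPLATE.format(idx=idx, paper_id=paper_id, abstract=abstract)
        for idx, (paper_id, abstract) in enumerate(papers, start=1)
    ]
    return SUMMARY_TEMPLATE.format(
        papers="\n".join(units), idx=len(papers), questions=questions
    )


prompt = build_summary_prompt(
    [("12345", "Abstract A"), ("45456", "Abstract B")],
    questions="Which study is larger?",
)
```

Numbering from 1 matters because the outer template refers to the range `#1-#{idx}`; numbering the units from 0 would leave the last reference in that range undefined.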
utils/pdf_parser.py ADDED
@@ -0,0 +1,58 @@
+ import pdfplumber
+ import pandas as pd
+ from openai.embeddings_utils import get_embedding
+ import openai
+
+ try:
+     from utils.config import OPENAI_KEY
+ except ImportError:
+     from utils.config_sample import OPENAI_KEY
+
+ class PdfFile:
+     '''
+     1. PDF parser, currently a simple implementation on top of pdfplumber
+     2. A better ML-based module is needed later to extract content more accurately (down to sections) with more precise positions
+     3. No settings are exposed for this part until a new extraction method is available
+     '''
+
+     def __init__(self, file, api_key=None):
+         self.pdf_file = pdfplumber.open(file)
+         self.api_key = api_key if api_key else OPENAI_KEY
+
+
+     def __del__(self):
+         if getattr(self, 'pdf_file', None):  # __init__ may have failed before pdf_file was set
+             self.pdf_file.close()
+
+
+     def parse_info(self):
+         '''
+         Store one record per page, then compute an embedding for every page with the model
+         '''
+         all_text = []
+         for page in self.pdf_file.pages:
+             page_text = page.dedupe_chars().extract_text(layout=True)
+             if page_text:
+                 all_text.append({
+                     'page': page.page_number,
+                     'text': page_text
+                 })
+         text_data = pd.DataFrame(all_text)
+         openai.api_key = self.api_key
+         text_data['embedding'] = text_data['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
+         # text_data.to_csv('debug.data.csv', index=0)
+         return text_data
+
+
+     @property
+     def metadata(self):
+         '''
+         Return the PDF metadata as reported by pdfplumber
+         '''
+         return self.pdf_file.metadata
+
+
+ if __name__ == '__main__':
+     pdf = PdfFile(file='/home/silen/git_proj/ReviewGPT/test/demo_paper.pdf')
+     print(pdf.parse_info())
+
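The parser above stores one record per page, and the ToDo list mentions moving from pages to paragraphs. One possible direction, sketched under the assumption that paragraphs are separated by blank lines in the extracted text, is shown below. `split_paragraphs` and its `min_chars` threshold are hypothetical and not part of this commit; real PDF text may need a more robust splitter.

```python
import re


def split_paragraphs(page_number, page_text, min_chars=20):
    """Split one page of text into paragraph records on blank lines."""
    records = []
    for block in re.split(r"\n\s*\n", page_text):
        # Collapse layout whitespace (line breaks, padding) inside a paragraph.
        paragraph = " ".join(block.split())
        # Very short blocks are usually page numbers or running headers.
        if len(paragraph) >= min_chars:
            records.append({
                "page": page_number,
                "paragraph": len(records) + 1,
                "text": paragraph,
            })
    return records


sample = "Background: little is known\nabout this topic.\n\n12\n\nMethods: we used logistic regression."
chunks = split_paragraphs(3, sample)
```

Each record keeps its page number, so answers could still cite the page while the embedding is computed on a smaller, more focused unit.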
utils/review.py CHANGED
@@ -2,7 +2,9 @@
2
 
3
  # import tiktoken
4
  import openai
 
5
  from json import dump, loads
 
6
 
7
  try:
8
  from utils.config import Prompts, OPENAI_KEY, REVIEW_MODEL
@@ -44,16 +46,22 @@ class Reviewer:
44
  self.messages = None
45
 
46
 
47
- def query(self):
48
  '''
49
  发送请求并获取chatGPT给的结果
50
  '''
51
  # 设置email和搜索关键词
52
  openai.api_key = self.api_key
53
- response = openai.ChatCompletion.create(
54
- model = self.model,
55
- messages = self.messages
56
- )
 
 
 
 
 
 
57
 
58
  return response
59
 
@@ -71,17 +79,45 @@ class Reviewer:
71
  return self.query()
72
 
73
 
74
- def summarise(self, papers):
75
  '''
76
  meta分析时用的方法, 读取标准内容和文献摘要后,
77
  判断文章是否符合准入标准
78
  '''
79
  units = []
80
- for idx, abstract in papers:
81
- units.append(Prompts['Summarize_Unit'].format(idx=idx, abstract=abstract))
82
- units.append(Prompts['Summarize'].format(idx=len(papers)))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
83
 
84
- message = '\n'.join(units)
 
85
  self.messages = [
86
  {"role": "user", "content": message},
87
  ]
@@ -117,10 +153,12 @@ def screen_demo():
117
  dump(answer, aJson)
118
 
119
 
120
- if __name__ == '__main__':
121
  reviewer = Reviewer()
122
  papers = [
123
  ('12345', 'Background: Little is known about reproductive health in severely obese women. In this study, we present associations between different levels of severe obesity and a wide range of health outcomes in the mother and child. Method(s): From the Danish National Birth Cohort, we obtained self-reported information about prepregnant body mass index (BMI) for 2451 severely obese women and 2450 randomly selected women from the remaining cohort who served as a comparison group. Information about maternal and infant outcomes was also self-reported or came from registers. Logistic regression was used to estimate the association between different levels of severe obesity and reproductive outcomes. Principal Findings: Subfecundity was more frequent in severely obese women, and during pregnancy, they had an excess risk of urinary tract infections, gestational diabetes, preeclampsia and other hypertensive disorders which increased with severity of obesity. They tended to have a higher risk of both pre- and post-term birth, and risk of cesarean and instrumental deliveries increased across obesity categories. After birth, severely obese women more often failed to initiate or sustain breastfeeding. Risk of weight retention 1.5 years after birth was similar to that of other women, but after adjustment for gestational weight gain, the risk was increased, especially in women in the lowest obesity category. In infants, increasing maternal obesity was associated with decreased risk of a low birth weight and increased risk of a high birth weight. Estimates for ponderal index showed the same pattern indicating an increasing risk of neonatal fatness with severity of obesity. Infant obesity measured one year after birth was also increased in children of severely obese mothers. Conclusion(s): Severe obesity is correlated with a substantial disease burden in reproductive health. 
Although the causal mechanisms remain elusive, these findings are useful for making predictions and planning health care at the individual level. © 2009 Nohr et al.'),
124
  ('45456', 'Background: Preeclampsia is one of the leading causes of maternal and perinatal morbidity and mortality world-wide. The risk for developing preeclampsia varies depending on the underlying mechanism. Because the disorder is heterogeneous, the pathogenesis can differ in women with various risk factors. Understanding these mechanisms of disease responsible for preeclampsia as well as risk assessment is still a major challenge. The aim of this study was to determine the risk factors associated with preeclampsia, in healthy women in maternity hospitals of Karachi and Rawalpindi. Method(s): We conducted a hospital based matched case-control study to assess the factors associated with preeclampsia in Karachi and Rawalpindi, from January 2006 to December 2007. 131 hospital-reported cases of PE and 262 controls without history of preeclampsia were enrolled within 3 days of delivery. Cases and controls were matched on the hospital, day of delivery and parity. Potential risk factors for preeclampsia were ascertained during in-person postpartum interviews using a structured questionnaire and by medical record abstraction. Conditional logistic regression was used to estimate matched odds ratios (ORs) and 95% confidence intervals (95% CIs). Result(s): In multivariate analysis, women having a family history of hypertension (adjusted OR 2.06, 95% CI; 1.27-3.35), gestational diabetes (adjusted OR 6.57, 95% CI; 1.94 -22.25), pre-gestational diabetes (adjusted OR 7.36, 95% CI; 1.37-33.66) and mental stress during pregnancy (adjusted OR 1.32; 95% CI; 1.19-1.46, for each 5 unit increase in Perceived stress scale score) were at increased risk of preeclampsia. However, high body mass index, maternal age, urinary tract infection, use of condoms prior to index pregnancy and sociodemographic factors were not associated with higher risk of having preeclampsia. 
Conclusion(s): Development of preeclampsia was associated with gestational diabetes, pregestational diabetes, family history of hypertension and mental stress during pregnancy. These factors can be used as a screening tool for preeclampsia prediction. Identification of the above mentioned predictors would enhance the ability to diagnose and monitor women likely to develop preeclampsia before the onset of disease for timely interventions and better maternal and fetal outcomes. © 2010 Shamsi et al; licensee BioMed Central Ltd.'),
125
  ]
126
  reviewer.summarise(papers)
 
 
 
2
 
3
  # import tiktoken
4
  import openai
5
+ from openai.embeddings_utils import get_embedding, cosine_similarity
6
  from json import dump, loads
7
+ import pandas as pd
8
 
9
  try:
10
  from utils.config import Prompts, OPENAI_KEY, REVIEW_MODEL
 
46
  self.messages = None
47
 
48
 
+     def query(self, msg=None):
          '''
          Send the request and return the response from chatGPT
          '''
          # set the OpenAI API key
          openai.api_key = self.api_key
+         if msg:
+             response = openai.ChatCompletion.create(
+                 model = self.model,
+                 messages = [{"role": "user", "content": msg}]
+             )
+         else:
+             response = openai.ChatCompletion.create(
+                 model = self.model,
+                 messages = self.messages
+             )
65
 
66
  return response
67
 
 
79
  return self.query()
80
 
81
 
+     def summarise(self, papers, prompts):
          '''
          Summarise the abstracts of several papers, compare the studies,
          and answer any follow-up questions given in `prompts`
          '''
          units = []
+         for idx, (paper_id, abstract) in enumerate(papers, start=1):  # start at 1 to match "#1-#{idx}" in the Summarize prompt
+             units.append(Prompts['Summarize_Unit'].format(
+                 idx=idx,
+                 paper_id=paper_id,
+                 abstract=abstract
+             ))
+         paper_text = '\n'.join(units)
+         message = Prompts['Summarize'].format(idx=len(papers), papers=paper_text, questions=prompts)
+         self.messages = [
+             {"role": "user", "content": message},
+         ]
+
+         return self.query()
101
+
102
+
+     def study(self, question: str, paper_data: pd.DataFrame, top: int = 2):
+         '''
+         Read part of a paper and answer a question about it
+         step1, compute the embedding of the question
+         step2, find the top N most similar pages
+         step3, build the prompt and request an answer
+         '''
+         question_embedding = get_embedding(question, engine='text-embedding-ada-002')
+         paper_data['similarity'] = paper_data['embedding'].apply(lambda x: cosine_similarity(x, question_embedding))
+         # pick the `top` most similar pages (default 2, mainly to save tokens)
+         top_texts = paper_data.sort_values(by='similarity', ascending=False).head(top)[['page', 'text']].values.tolist()
+
+         units = []
+         for page, text in top_texts:
+             units.append(Prompts['Review_Unit'].format(page=page, content=text))
 
+         message = Prompts['Review'].format(pages='\n'.join(units), question=question)
120
+
121
  self.messages = [
122
  {"role": "user", "content": message},
123
  ]
 
153
  dump(answer, aJson)
154
 
155
 
156
+ def summarise_demo():
157
  reviewer = Reviewer()
158
  papers = [
159
  ('12345', 'Background: Little is known about reproductive health in severely obese women. In this study, we present associations between different levels of severe obesity and a wide range of health outcomes in the mother and child. Method(s): From the Danish National Birth Cohort, we obtained self-reported information about prepregnant body mass index (BMI) for 2451 severely obese women and 2450 randomly selected women from the remaining cohort who served as a comparison group. Information about maternal and infant outcomes was also self-reported or came from registers. Logistic regression was used to estimate the association between different levels of severe obesity and reproductive outcomes. Principal Findings: Subfecundity was more frequent in severely obese women, and during pregnancy, they had an excess risk of urinary tract infections, gestational diabetes, preeclampsia and other hypertensive disorders which increased with severity of obesity. They tended to have a higher risk of both pre- and post-term birth, and risk of cesarean and instrumental deliveries increased across obesity categories. After birth, severely obese women more often failed to initiate or sustain breastfeeding. Risk of weight retention 1.5 years after birth was similar to that of other women, but after adjustment for gestational weight gain, the risk was increased, especially in women in the lowest obesity category. In infants, increasing maternal obesity was associated with decreased risk of a low birth weight and increased risk of a high birth weight. Estimates for ponderal index showed the same pattern indicating an increasing risk of neonatal fatness with severity of obesity. Infant obesity measured one year after birth was also increased in children of severely obese mothers. Conclusion(s): Severe obesity is correlated with a substantial disease burden in reproductive health. 
Although the causal mechanisms remain elusive, these findings are useful for making predictions and planning health care at the individual level. © 2009 Nohr et al.'),
160
  ('45456', 'Background: Preeclampsia is one of the leading causes of maternal and perinatal morbidity and mortality world-wide. The risk for developing preeclampsia varies depending on the underlying mechanism. Because the disorder is heterogeneous, the pathogenesis can differ in women with various risk factors. Understanding these mechanisms of disease responsible for preeclampsia as well as risk assessment is still a major challenge. The aim of this study was to determine the risk factors associated with preeclampsia, in healthy women in maternity hospitals of Karachi and Rawalpindi. Method(s): We conducted a hospital based matched case-control study to assess the factors associated with preeclampsia in Karachi and Rawalpindi, from January 2006 to December 2007. 131 hospital-reported cases of PE and 262 controls without history of preeclampsia were enrolled within 3 days of delivery. Cases and controls were matched on the hospital, day of delivery and parity. Potential risk factors for preeclampsia were ascertained during in-person postpartum interviews using a structured questionnaire and by medical record abstraction. Conditional logistic regression was used to estimate matched odds ratios (ORs) and 95% confidence intervals (95% CIs). Result(s): In multivariate analysis, women having a family history of hypertension (adjusted OR 2.06, 95% CI; 1.27-3.35), gestational diabetes (adjusted OR 6.57, 95% CI; 1.94 -22.25), pre-gestational diabetes (adjusted OR 7.36, 95% CI; 1.37-33.66) and mental stress during pregnancy (adjusted OR 1.32; 95% CI; 1.19-1.46, for each 5 unit increase in Perceived stress scale score) were at increased risk of preeclampsia. However, high body mass index, maternal age, urinary tract infection, use of condoms prior to index pregnancy and sociodemographic factors were not associated with higher risk of having preeclampsia. 
Conclusion(s): Development of preeclampsia was associated with gestational diabetes, pregestational diabetes, family history of hypertension and mental stress during pregnancy. These factors can be used as a screening tool for preeclampsia prediction. Identification of the above mentioned predictors would enhance the ability to diagnose and monitor women likely to develop preeclampsia before the onset of disease for timely interventions and better maternal and fetal outcomes. © 2010 Shamsi et al; licensee BioMed Central Ltd.'),
161
  ]
162
  reviewer.summarise(papers)
163
+
164
+
utils/task.py CHANGED
@@ -1,3 +1,7 @@
1
  from json import loads
2
  import pandas as pd
3
  import io
@@ -5,6 +9,7 @@ import io
5
  from utils.review import Reviewer
6
  from utils.pubmed import PubMedFetcher
7
  from utils.ris_parser import RisFile
 
8
 
9
 
10
  def get_paper_info(inputMethod, email=None, pmids=None, ris_data=None):
@@ -70,5 +75,30 @@ def review(task, paper_info, prompts, openai_key, review_model):
70
  papers.append((rec['DOI'], rec['Abstract']))
71
  else:
72
  raise Exception('No PMID nor DOI in record.')
73
- response = reviewer.summarise(papers)
74
  return response['choices'][0]['message']['content']
1
+ '''
2
+ Everything in this file should really be trimmed down and moved under ReviewerBot
3
+ '''
4
+
5
  from json import loads
6
  import pandas as pd
7
  import io
 
9
  from utils.review import Reviewer
10
  from utils.pubmed import PubMedFetcher
11
  from utils.ris_parser import RisFile
12
+ from utils.pdf_parser import PdfFile
13
 
14
 
15
  def get_paper_info(inputMethod, email=None, pmids=None, ris_data=None):
 
75
  papers.append((rec['DOI'], rec['Abstract']))
76
  else:
77
  raise Exception('No PMID nor DOI in record.')
78
+ response = reviewer.summarise(papers, prompts)
79
  return response['choices'][0]['message']['content']
80
+
81
+
82
+ def parse_pdf_info(pdf_data, openai_key):
83
+ '''
84
+ Call the parsing module on the PDF; kept as a separate function because parsing only needs to happen once
85
+ '''
86
+ fileHandle = io.BytesIO(pdf_data)
87
+ pdf = PdfFile(file=fileHandle, api_key=openai_key)
88
+ paper_data = pdf.parse_info()
89
+ return paper_data
90
+
91
+
92
+ def study(paper_data, prompts, openai_key, review_model):
93
+ '''
94
+ Read the paper using the already-parsed PDF data
95
+ '''
96
+ reviewer = Reviewer(api_key=openai_key, model=review_model)
97
+ response = reviewer.study(prompts, paper_data)
98
+ return response['choices'][0]['message']['content']
99
+
100
+
101
+ def query(prompts, openai_key, review_model):
102
+ reviewer = Reviewer(api_key=openai_key, model=review_model)
103
+ response = reviewer.query(prompts)
104
+ return response['choices'][0]['message']['content']
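
The `review`, `study`, and `query` helpers above all unpack the reply with the same `response['choices'][0]['message']['content']` expression. A minimal sketch of how that shared step might be factored out, assuming the response is a plain dict shaped like a chat-completion reply. `extract_content` and `FakeReviewer` are hypothetical names used only for illustration; they are not part of the repository, and `FakeReviewer` merely stands in for `utils.review.Reviewer` so the example runs offline.

```python
# Hypothetical sketch: not part of utils/task.py.
def extract_content(response):
    """Return the text of the first choice in a chat-completion-style dict."""
    try:
        return response['choices'][0]['message']['content']
    except (KeyError, IndexError, TypeError) as err:
        # Surface a clearer error than a bare KeyError deep in the caller.
        raise ValueError('Unexpected response shape') from err


class FakeReviewer:
    """Offline stand-in for utils.review.Reviewer (assumption, for illustration)."""

    def query(self, prompts):
        # Echo the prompt back in the nested shape the helpers index into.
        return {'choices': [{'message': {'content': f'echo: {prompts}'}}]}


def query(prompts, reviewer):
    """Same flow as the query() helper in the diff, with the reviewer injected."""
    response = reviewer.query(prompts)
    return extract_content(response)


if __name__ == '__main__':
    print(query('hello', FakeReviewer()))
```

Passing the reviewer in as an argument, rather than constructing it inside each helper, is one possible way to keep the functions testable without an API key.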