Add subjects and add more tokens for the model to digest
Files changed:
- README.md +8 -13
- config.yaml +2 -2
- readme_images/example_custom_1.png +0 -0
- src/action.py +1 -1
- src/download_new_papers.py +1 -1
- src/relevancy.py +4 -3
- src/relevancy_prompt.txt +1 -1
- src/utils.py +1 -1
README.md
CHANGED

@@ -45,22 +45,16 @@ You can also send yourself an email of the digest by creating a SendGrid account
 
 #### Digest Configuration:
 - Subject/Topic: Computer Science
-- Categories: Artificial Intelligence, Computation and Language
+- Categories: Artificial Intelligence, Computation and Language, Machine Learning
 - Interest:
-
-
-
-
+  1. Large language model pretraining and finetunings
+  2. Multimodal machine learning
+  3. RAGs, Information retrieval
+  4. Optimization of LLM and GenAI
+  5. Do not care about specific application, for example, information extraction, summarization, etc.
 
 #### Result:
-<p align="left"><img src="./readme_images/
-
-#### Digest Configuration:
-- Subject/Topic: Quantitative Finance
-- Interest: "making lots of money"
-
-#### Result:
-<p align="left"><img src="./readme_images/example_2.png" width=580 /></p>
+<p align="left"><img src="./readme_images/example_custom_1.png" width=580 /></p>
 
 ## 💡 Usage
 

@@ -96,6 +90,7 @@ To locally run the same UI as the Huggign Face space:
 
 - [x] Support personalized paper recommendation using LLM.
 - [x] Send emails for daily digest.
+- [x] Further read from the paper itself via its HTML format (.pdf version will be implemented in the next phase)
 - [ ] Implement a ranking factor to prioritize content from specific authors.
 - [ ] Support open-source models, e.g., LLaMA, Vicuna, MPT etc.
 - [ ] Fine-tune an open-source model to better support paper ranking and stay updated with the latest research concepts..

config.yaml
CHANGED

@@ -3,7 +3,7 @@ topic: "Computer Science"
 # An empty list here will include all categories in a topic
 # Use the natural language names of the topics, found here: https://arxiv.org
 # Including more categories will result in more calls to the large language model
-categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning"]
+categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning", "Information Retrieval"]
 
 # Relevance score threshold. abstracts that receive a score less than this from the large language model
 # will have their papers filtered out.

@@ -23,6 +23,6 @@ threshold: 6
 interest: |
   1. Large language model pretraining and finetunings
   2. Multimodal machine learning
-  3. RAGs
+  3. RAGs, Information retrieval
   4. Optimization of LLM and GenAI
   5. Do not care about specific application, for example, information extraction, summarization, etc.

readme_images/example_custom_1.png
ADDED

src/action.py
CHANGED

@@ -251,7 +251,7 @@ def generate_body(topic, categories, interest, threshold):
         )
     body = "<br><br>".join(
         [
-            f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
+            f'<b>Subject: </b>{paper["subjects"]}<br><b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
             f'<b>Score:</b> {paper["Relevancy score"]}<br><b>Reason:</b> {paper["Reasons for match"]}<br>'
             f'<b>Goal:</b> {paper["Goal"]}<br><b>Data</b>: {paper["Data"]}<br><b>Methodology:</b> {paper["Methodology"]}<br>'
             f'<b>Experiments & Results</b>: {paper["Experiments & Results"]}<br><b>Git</b>: {paper["Git"]}<br>'

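The effect of the change above is that each digest entry now leads with the paper's arXiv subjects. A minimal standalone sketch of that join, using a made-up sample paper dict (the URL and field values are placeholders, not real data):

```python
# Standalone sketch of the email-body assembly; the paper record is fabricated
# for illustration only.
papers = [
    {
        "subjects": "cs.CL; cs.IR",
        "main_page": "https://arxiv.org/abs/0000.00000",
        "title": "An Example Paper",
        "authors": "A. Author, B. Author",
    },
]

body = "<br><br>".join(
    # With the change, each entry now begins with the paper's subjects.
    f'<b>Subject: </b>{paper["subjects"]}<br>'
    f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br>'
    f'<b>Authors:</b> {paper["authors"]}<br>'
    for paper in papers
)
```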
src/download_new_papers.py
CHANGED

@@ -22,7 +22,7 @@ def crawl_html_version(html_link):
 
     for each in para_list:
         main_content.append(each.text.strip())
-    return ' '.join(main_content)[:
+    return ' '.join(main_content)[:10000]
     #if len(main_content >)
     #return ''.join(main_content) if len(main_content) < 20000 else ''.join(main_content[:20000])
 def _download_new_papers(field_abbr):

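The hunk above caps the crawled HTML text at 10,000 characters so the downstream prompt stays within the model's context budget. A sketch of that join-and-clip step, with fake stand-in paragraphs (not real scraped HTML):

```python
# Fake paragraph data standing in for the scraped <p> contents.
para_list = ["lorem ipsum " * 500, "dolor sit amet " * 500]

def join_and_clip(paragraphs, limit=10000):
    # Mirror the change: join all paragraphs, then clip to the character cap.
    return ' '.join(p.strip() for p in paragraphs)[:limit]

clipped = join_and_clip(para_list)
```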
src/relevancy.py
CHANGED

@@ -39,7 +39,7 @@ def encode_prompt(query, prompt_papers):
 def is_json(myjson):
     try:
         json.loads(myjson)
-    except
+    except Exception as e:
         return False
     return True
 

@@ -97,7 +97,8 @@ def post_process_chat_gpt_response(paper_data, response, threshold_score=7):
         # if the decoding stops due to length, the last example is likely truncated so we discard it
         if scores[idx] < threshold_score:
             continue
-        output_str = "
+        output_str = "Subject: " + paper_data[idx]["subjects"] + "\n"
+        output_str += "Title: " + paper_data[idx]["title"] + "\n"
         output_str += "Authors: " + paper_data[idx]["authors"] + "\n"
         output_str += "Link: " + paper_data[idx]["main_page"] + "\n"
         for key, value in inst.items():

@@ -166,7 +167,7 @@ def generate_relevance_score(
     return ans_data, hallucination
 
 def run_all_day_paper(
-    query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence"]},
+    query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence", "Information Retrieval"]},
     date=None,
     data_dir="../data",
     model_name="gpt-3.5-turbo-16k",

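The first hunk above completes the `is_json` guard so any parse failure is swallowed and reported as `False`. As a self-contained sketch:

```python
import json

def is_json(myjson):
    # Returns True only if the string parses as JSON; any exception
    # (malformed JSON, wrong input type) yields False.
    try:
        json.loads(myjson)
    except Exception:
        return False
    return True
```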
src/relevancy_prompt.txt
CHANGED

@@ -5,4 +5,4 @@ Please keep the paper order the same as in the input list, with one json format
 
 1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings", "Goal": "What kind of pain points the paper is trying to solve?", "Data": "Summary of the data source used in the paper", "Methodology": "Summary of methodologies used in the paper", "Git": "Link to the code repo (if available)", "Experiments & Results": "Summary of any experiments & its results", "Discussion & Next steps": "Further discussion and next steps of the research"}
 
-My research interests are: NLP, RAGs, LLM, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...
+My research interests are: NLP, RAGs, LLM, Information Retrieval, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...

src/utils.py
CHANGED

@@ -25,7 +25,7 @@ if openai_org is not None:
 @dataclasses.dataclass
 class OpenAIDecodingArguments(object):
     #max_tokens: int = 1800
-    max_tokens: int =
+    max_tokens: int = 5400
     temperature: float = 0.2
     top_p: float = 1.0
     n: int = 1
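The utils.py hunk raises the completion budget to 5400 tokens, which leaves room for the longer per-paper summaries now that subjects and full-text excerpts feed the prompt. A sketch of the resulting dataclass defaults:

```python
import dataclasses

@dataclasses.dataclass
class OpenAIDecodingArguments(object):
    # Completion budget raised from the old (commented-out) 1800 tokens.
    max_tokens: int = 5400
    temperature: float = 0.2
    top_p: float = 1.0
    n: int = 1

args = OpenAIDecodingArguments()
```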