Add subjects and add more tokens for the model to digest
Files changed:
- README.md +8 -13
- config.yaml +2 -2
- readme_images/example_custom_1.png +0 -0
- src/action.py +1 -1
- src/download_new_papers.py +1 -1
- src/relevancy.py +4 -3
- src/relevancy_prompt.txt +1 -1
- src/utils.py +1 -1
README.md
CHANGED

@@ -45,22 +45,16 @@ You can also send yourself an email of the digest by creating a SendGrid account
 
 #### Digest Configuration:
 - Subject/Topic: Computer Science
-- Categories: Artificial Intelligence, Computation and Language
+- Categories: Artificial Intelligence, Computation and Language, Machine Learning
 - Interest:
-
-
-
-
+  1. Large language model pretraining and finetunings
+  2. Multimodal machine learning
+  3. RAGs, Information retrieval
+  4. Optimization of LLM and GenAI
+  5. Do not care about specific application, for example, information extraction, summarization, etc.
 
 #### Result:
-<p align="left"><img src="./readme_images/
-
-#### Digest Configuration:
-- Subject/Topic: Quantitative Finance
-- Interest: "making lots of money"
-
-#### Result:
-<p align="left"><img src="./readme_images/example_2.png" width=580 /></p>
+<p align="left"><img src="./readme_images/example_custom_1.png" width=580 /></p>
 
 ## 💡 Usage
 

@@ -96,6 +90,7 @@ To locally run the same UI as the Huggign Face space:
 
 - [x] Support personalized paper recommendation using LLM.
 - [x] Send emails for daily digest.
+- [x] Further read from the paper itself via its HTML format (.pdf version will be implemented in the next phase)
 - [ ] Implement a ranking factor to prioritize content from specific authors.
 - [ ] Support open-source models, e.g., LLaMA, Vicuna, MPT etc.
 - [ ] Fine-tune an open-source model to better support paper ranking and stay updated with the latest research concepts..

config.yaml
CHANGED

@@ -3,7 +3,7 @@ topic: "Computer Science"
 # An empty list here will include all categories in a topic
 # Use the natural language names of the topics, found here: https://arxiv.org
 # Including more categories will result in more calls to the large language model
-categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning"]
+categories: ["Artificial Intelligence", "Computation and Language", "Machine Learning", "Information Retrieval"]
 
 # Relevance score threshold. abstracts that receive a score less than this from the large language model
 # will have their papers filtered out.

@@ -23,6 +23,6 @@ threshold: 6
 interest: |
   1. Large language model pretraining and finetunings
   2. Multimodal machine learning
-  3. RAGs
+  3. RAGs, Information retrieval
   4. Optimization of LLM and GenAI
   5. Do not care about specific application, for example, information extraction, summarization, etc.

readme_images/example_custom_1.png
ADDED

src/action.py
CHANGED

@@ -251,7 +251,7 @@ def generate_body(topic, categories, interest, threshold):
         )
     body = "<br><br>".join(
         [
-            f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
+            f'<b>Subject: </b>{paper["subjects"]}<br><b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br><b>Authors:</b> {paper["authors"]}<br>'
             f'<b>Score:</b> {paper["Relevancy score"]}<br><b>Reason:</b> {paper["Reasons for match"]}<br>'
             f'<b>Goal:</b> {paper["Goal"]}<br><b>Data</b>: {paper["Data"]}<br><b>Methodology:</b> {paper["Methodology"]}<br>'
             f'<b>Experiments & Results</b>: {paper["Experiments & Results"]}<br><b>Git</b>: {paper["Git"]}<br>'

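The effect of the change above is that each digest entry now leads with the paper's arXiv subjects. A minimal standalone sketch of that join, using a made-up sample paper dict (the URL and field values are placeholders, not real data):

```python
# Standalone sketch of the email-body assembly; the paper record is fabricated
# for illustration only.
papers = [
    {
        "subjects": "cs.CL; cs.IR",
        "main_page": "https://arxiv.org/abs/0000.00000",
        "title": "An Example Paper",
        "authors": "A. Author, B. Author",
    },
]

body = "<br><br>".join(
    # With the change, each entry now begins with the paper's subjects.
    f'<b>Subject: </b>{paper["subjects"]}<br>'
    f'<b>Title:</b> <a href="{paper["main_page"]}">{paper["title"]}</a><br>'
    f'<b>Authors:</b> {paper["authors"]}<br>'
    for paper in papers
)
```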
src/download_new_papers.py
CHANGED

@@ -22,7 +22,7 @@ def crawl_html_version(html_link):
 
     for each in para_list:
         main_content.append(each.text.strip())
-    return ' '.join(main_content)[:
+    return ' '.join(main_content)[:10000]
     #if len(main_content >)
     #return ''.join(main_content) if len(main_content) < 20000 else ''.join(main_content[:20000])
 def _download_new_papers(field_abbr):

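The hunk above caps the crawled HTML text at 10,000 characters so the downstream prompt stays within the model's context budget. A sketch of that join-and-clip step, with fake stand-in paragraphs (not real scraped HTML):

```python
# Fake paragraph data standing in for the scraped <p> contents.
para_list = ["lorem ipsum " * 500, "dolor sit amet " * 500]

def join_and_clip(paragraphs, limit=10000):
    # Mirror the change: join all paragraphs, then clip to the character cap.
    return ' '.join(p.strip() for p in paragraphs)[:limit]

clipped = join_and_clip(para_list)
```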
src/relevancy.py
CHANGED

@@ -39,7 +39,7 @@ def encode_prompt(query, prompt_papers):
 def is_json(myjson):
     try:
         json.loads(myjson)
-    except
+    except Exception as e:
         return False
     return True
 

@@ -97,7 +97,8 @@ def post_process_chat_gpt_response(paper_data, response, threshold_score=7):
         # if the decoding stops due to length, the last example is likely truncated so we discard it
         if scores[idx] < threshold_score:
             continue
-        output_str = "
+        output_str = "Subject: " + paper_data[idx]["subjects"] + "\n"
+        output_str += "Title: " + paper_data[idx]["title"] + "\n"
         output_str += "Authors: " + paper_data[idx]["authors"] + "\n"
         output_str += "Link: " + paper_data[idx]["main_page"] + "\n"
         for key, value in inst.items():

@@ -166,7 +167,7 @@ def generate_relevance_score(
     return ans_data, hallucination
 
 def run_all_day_paper(
-    query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence"]},
+    query={"interest":"Computer Science", "subjects":["Machine Learning", "Computation and Language", "Artificial Intelligence", "Information Retrieval"]},
     date=None,
     data_dir="../data",
     model_name="gpt-3.5-turbo-16k",

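The first hunk above completes the `is_json` guard so any parse failure is swallowed and reported as `False`. As a self-contained sketch:

```python
import json

def is_json(myjson):
    # Returns True only if the string parses as JSON; any exception
    # (malformed JSON, wrong input type) yields False.
    try:
        json.loads(myjson)
    except Exception:
        return False
    return True
```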
src/relevancy_prompt.txt
CHANGED

@@ -5,4 +5,4 @@ Please keep the paper order the same as in the input list, with one json format
 
 1. {"Relevancy score": "an integer score out of 10", "Reasons for match": "1-2 sentence short reasonings", "Goal": "What kind of pain points the paper is trying to solve?", "Data": "Summary of the data source used in the paper", "Methodology": "Summary of methodologies used in the paper", "Git": "Link to the code repo (if available)", "Experiments & Results": "Summary of any experiments & its results", "Discussion & Next steps": "Further discussion and next steps of the research"}
 
-My research interests are: NLP, RAGs, LLM, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...
+My research interests are: NLP, RAGs, LLM, Information Retrieval, Optmization in Machine learning, Data science, Generative AI, Optimization in LLM, Finance modelling ...

src/utils.py
CHANGED

@@ -25,7 +25,7 @@ if openai_org is not None:
 @dataclasses.dataclass
 class OpenAIDecodingArguments(object):
     #max_tokens: int = 1800
-    max_tokens: int =
+    max_tokens: int = 5400
     temperature: float = 0.2
     top_p: float = 1.0
     n: int = 1
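The utils.py hunk raises the completion budget to 5400 tokens, which leaves room for the longer per-paper summaries now that subjects and full-text excerpts feed the prompt. A sketch of the resulting dataclass defaults:

```python
import dataclasses

@dataclasses.dataclass
class OpenAIDecodingArguments(object):
    # Completion budget raised from the old (commented-out) 1800 tokens.
    max_tokens: int = 5400
    temperature: float = 0.2
    top_p: float = 1.0
    n: int = 1

args = OpenAIDecodingArguments()
```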