livctr committed on
Commit d03087e · 1 Parent(s): cf59f7d

update README

Former-commit-id: 5720fe395f13496c4239779dac4d300b47126489

Files changed (1)
  1. README.md +55 -26

README.md CHANGED
@@ -1,18 +1,37 @@
  # U.S. ML PhD Recomendation System
 
- Disclaimer: results are not 100% accurate and there is likely some bias to how papers / professors are filtered.
 
- ### Data Pipeline
 
- First, a list of authors are gathered from recent conference proceedings. A batched RAG pipeline is used to determine which persons are U.S. professors (unsure how accurate the LLM here is). This can be reproduced as follows:
 
- #### Repeat scrape until satisfactory
 
- ```python
- # Scrape top conferences for potential U.S.-based professors, ~45 mins
- python -m data_pipeline.conference_scraper
- ```
- **Selected conferences**
  - NeurIPS: 2022, 2023
  - ICML: 2023, 2024
  - AISTATS: 2023, 2024
@@ -21,18 +40,30 @@ python -m data_pipeline.conference_scraper
  - EMNLP: 2023, 2024
  - CVPR: 2023, 2024
 
- ```python
- # Search authors and locally store search results. Uses Bing web search API.
- python -m data_pipeline.us_professor_verifier --batch_search
  ```
 
- NOTE 1: you may encounter caught exceptions due to HTTPError or invalid JSON outputs from the LLM. Would suggest to run the above multiple times until results are satisfactory.
 
- NOTE 2: This pipeline does not handle name collisions, name changes, initials.
 
- #### Create file containing U.S. professor data
 
- ```python
  # Use locally stored search results as input to an LLM.
  # Sends as batches, each one waiting for the previous to finish.
  python -m data_pipeline.us_professor_verifier --batch_analyze
@@ -41,16 +72,14 @@ python -m data_pipeline.us_professor_verifier --batch_analyze
  python -m data_pipeline.us_professor_verifier --batch_retrieve
  ```
 
- #### Extract embeddings for the relevant papers
- ```python
- # Fetch arxiv data and extract embeddings
  python -m data_pipeline.paper_embeddings_extractor
  ```
 
- ### Run streamlit
-
- ```python
-
- streamlit run streamlit.py
-
- ```
 
# U.S. ML PhD Recommendation System

## Usage

**Disclaimer**: This system should only be used for informational and exploratory purposes. The data pipeline is based on data from selected ML conferences, OpenAI GPT-4o-mini, and heuristics, and it is guaranteed that *IT MISSES MANY VERY WELL-QUALIFIED PROFESSORS*. Recommendations are not definitive and should not replace personal research, discussions with professors at your current institution, and direct communication with universities. Further, they are only a proxy for research alignment; many other factors (e.g., work style, career goals, location) need to be considered when making a decision about PhD programs. The system may rely on incomplete or biased data (the pipeline is described below). Users should independently verify that the information provided is accurate, that potential advisors are looking for PhD students, and that they are applying to the correct advisor at the correct institution (there are a few name collisions, which the pipeline does not handle). Again, the recommendations should serve as an exploratory tool, and it is the responsibility of the user to do their due diligence when deciding where and how to apply, as well as when making their final decision.

To get started, click on the Streamlit demo!

To run locally, `git clone` the repo, then `pip install -r requirements.txt`.

## Overview of Data Pipeline

### Steps
1) Authors and papers are gathered from recent conference proceedings, listed [below](#selected-conferences). While these are intended to cover some of the major ML conferences, the list is not exhaustive, and a number of top conferences are missing, which can bias the recommendations.
2) Each author is searched via the Bing web search API, and the returned search results are used as input to OpenAI GPT-4o-mini to determine whether the author is a U.S.-based professor. A search result may contain irrelevant information, one person may hold multiple titles, or multiple people may share the same name.
3) Recent papers of the U.S.-based professors are fetched from arXiv.
4) Finally, when a query is searched, the system uses semantic similarity to find similar papers and their corresponding U.S.-based professors.
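
Step 4 can be sketched as a cosine-similarity lookup over precomputed paper embeddings. This is an illustrative assumption of how such a lookup works, not the project's actual code; the names `recommend`, `paper_embs`, and `paper_profs` are hypothetical:

```python
import numpy as np

def recommend(query_emb: np.ndarray,
              paper_embs: np.ndarray,
              paper_profs: list[str],
              top_k: int = 3) -> list[str]:
    """Rank professors by cosine similarity between a query embedding
    and precomputed paper embeddings (illustrative sketch)."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    P = paper_embs / np.linalg.norm(paper_embs, axis=1, keepdims=True)
    scores = P @ q
    best = np.argsort(scores)[::-1][:top_k]
    # Deduplicate professors while preserving score order.
    seen, ranked = set(), []
    for i in best:
        if paper_profs[i] not in seen:
            seen.add(paper_profs[i])
            ranked.append(paper_profs[i])
    return ranked
```

Each hit maps back to the professor who wrote the paper, so one professor with several similar papers is only listed once.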

### "Mistakes" (among things I am aware of)
- The pipeline does not handle name collisions, name changes, or different spellings of the same person. Please click on the arXiv link and PDF to verify the institution.
- An LLM judges whether an author is a U.S.-based professor. No rigorous analysis has been done on the accuracy of this classification, but I can verify that there are mistakes.

### Current Heuristics
- Authors are filtered to those with at least 3 non-first-author papers in the selected conferences (past 2 years).
- Papers with only one author or more than 20 authors are ignored.
- The first author of each paper is ignored; normally, these are students.
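
A minimal sketch of these filters, assuming a simple `Paper` structure (the structure and function name are illustrative, not the pipeline's actual code):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Paper:
    authors: list[str]  # ordered; authors[0] is the first author

def eligible_authors(papers: list[Paper], min_papers: int = 3) -> set[str]:
    """Apply the heuristics above: drop single-author and >20-author
    papers, skip first authors, keep authors with >= min_papers left."""
    counts: Counter = Counter()
    for p in papers:
        if len(p.authors) == 1 or len(p.authors) > 20:
            continue  # single-author papers and huge collaborations are ignored
        counts.update(p.authors[1:])  # first author (usually a student) is skipped
    return {a for a, c in counts.items() if c >= min_papers}
```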

### Possibilities for improvement
- Better embeddings: The current method embeds each paper by simply packaging the title and abstract as input to the embedding model. An extra step could have an LLM extract the topics, insights, methodologies, etc. in each paper so that the embedding focuses more on content.
- Model: At the time of writing, `gte-Qwen2-7B-instruct` appears to be the best at clustering on the arXiv section of the MTEB benchmark ([leaderboard](https://huggingface.co/spaces/mteb/leaderboard)). More powerful models could be used if they can also be deployed.
- Narrow scope: Due to limited financial resources, only a select number of conferences and professors are explored.
- Handling name collisions.
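
The "better embeddings" idea can be sketched as follows; both function names and the prompt wording are hypothetical, and only the title-plus-abstract packaging reflects the current pipeline:

```python
def embedding_input(title: str, abstract: str) -> str:
    """Current approach: package the title and abstract into one string
    that is fed directly to the embedding model."""
    return f"Title: {title}\nAbstract: {abstract}"

def extraction_prompt(title: str, abstract: str) -> str:
    """Proposed improvement: ask an LLM for content-focused bullet
    points first, then embed the LLM's answer instead of the raw text."""
    return ("Extract the topic, key insights, and methodology of this "
            "paper as short bullet points.\n\n"
            + embedding_input(title, abstract))
```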

### Selected conferences
- NeurIPS: 2022, 2023
- ICML: 2023, 2024
- AISTATS: 2023, 2024
- …
- EMNLP: 2023, 2024
- CVPR: 2023, 2024

## Reproducing the Project

The `data_pipeline` requires additional packages. Please install them:

```bash
cd data_pipeline
pip install -r requirements-data-pipeline.txt
cd ..
```

```bash
# Scrape top conferences, ~45 mins (most of the time is spent on AAAI)
python -m data_pipeline.conference_scraper
```

```bash
# Search authors and locally store the search results. Uses the Bing web search API.
python -m data_pipeline.us_professor_verifier --batch_search
```

NOTE: You may encounter caught exceptions (e.g., due to HTTPError). Rerun the commands above until all conferences are scraped and all persons have been verified.
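
Since failures are caught and the steps are safe to rerun, the repeated runs can be automated with a small retry wrapper. This helper is not part of the pipeline; it is a sketch assuming each step exits non-zero on failure and skips already-stored results when rerun:

```python
import subprocess
import time

def run_until_clean(cmd: list[str], max_attempts: int = 5,
                    wait_s: float = 30.0) -> bool:
    """Rerun a pipeline step until it exits cleanly or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        print(f"attempt {attempt} failed, retrying in {wait_s}s...")
        time.sleep(wait_s)
    return False

# Example (hypothetical usage): retry the Bing search step.
# run_until_clean(["python", "-m", "data_pipeline.us_professor_verifier", "--batch_search"])
```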

```bash
# Use locally stored search results as input to an LLM.
# Sends requests in batches, each waiting for the previous to finish.
python -m data_pipeline.us_professor_verifier --batch_analyze

python -m data_pipeline.us_professor_verifier --batch_retrieve
```

NOTE: The LLM is not guaranteed to always produce valid JSON, though it does so very frequently.
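
One way to guard against occasional invalid JSON is to validate each LLM reply and queue failures for a rerun. This is a sketch; the reply shape (a JSON object per author) is an assumption, not the pipeline's actual format:

```python
import json

def parse_llm_json(raw: str) -> "dict | None":
    """Return the parsed object if the LLM reply is a valid JSON object,
    otherwise None so the caller can retry that item later."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

# Hypothetical usage: collect the items that need another LLM pass.
replies = {"a": '{"is_us_professor": true}', "b": "not json"}
parsed = {k: parse_llm_json(v) for k, v in replies.items()}
retry_queue = [k for k, v in parsed.items() if v is None]  # rerun these
```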

```bash
# Fetch arXiv data and extract embeddings (may need a GPU)
python -m data_pipeline.paper_embeddings_extractor
```

```bash
# Run the Streamlit application
streamlit run USMLPhDRecommender.py
```