livctr committed on
Commit d03087e · 1 Parent(s): cf59f7d

update README

Former-commit-id: 5720fe395f13496c4239779dac4d300b47126489

Files changed (1)
  1. README.md +55 -26

README.md CHANGED
@@ -1,18 +1,37 @@
  # U.S. ML PhD Recomendation System
 
- Disclaimer: results are not 100% accurate and there is likely some bias to how papers / professors are filtered.
 
- ### Data Pipeline
 
- First, a list of authors are gathered from recent conference proceedings. A batched RAG pipeline is used to determine which persons are U.S. professors (unsure how accurate the LLM here is). This can be reproduced as follows:
 
- #### Repeat scrape until satisfactory
 
- ```python
- # Scrape top conferences for potential U.S.-based professors, ~45 mins
- python -m data_pipeline.conference_scraper
- ```
- **Selected conferences**
  - NeurIPS: 2022, 2023
  - ICML: 2023, 2024
  - AISTATS: 2023, 2024
@@ -21,18 +40,30 @@ python -m data_pipeline.conference_scraper
  - EMNLP: 2023, 2024
  - CVPR: 2023, 2024
 
- ```python
- # Search authors and locally store search results. Uses Bing web search API.
- python -m data_pipeline.us_professor_verifier --batch_search
  ```
 
- NOTE 1: you may encounter caught exceptions due to HTTPError or invalid JSON outputs from the LLM. Would suggest to run the above multiple times until results are satisfactory.
 
- NOTE 2: This pipeline does not handle name collisions, name changes, initials.
 
- #### Create file containing U.S. professor data
 
- ```python
  # Use locally stored search results as input to an LLM.
  # Sends as batches, each one waiting for the previous to finish.
  python -m data_pipeline.us_professor_verifier --batch_analyze
@@ -41,16 +72,14 @@ python -m data_pipeline.us_professor_verifier --batch_analyze
  python -m data_pipeline.us_professor_verifier --batch_retrieve
  ```
 
- #### Extract embeddings for the relevant papers
- ```python
- # Fetch arxiv data and extract embeddings
  python -m data_pipeline.paper_embeddings_extractor
  ```
 
- ### Run streamlit
-
- ```python
-
- streamlit run streamlit.py
-
- ```
 
# U.S. ML PhD Recommendation System

## Usage

**Disclaimer**: This system should only be used for informational and exploratory purposes. The data pipeline is based on data from selected ML conferences, OpenAI GPT-4o-mini, and heuristics, and it is guaranteed that *IT MISSES MANY VERY WELL-QUALIFIED PROFESSORS*. Recommendations are not definitive and should not replace personal research, discussions with professors at your current institution, and direct communication with universities. Further, they are only a proxy for research alignment; many other factors (e.g., work style, career goals, location) need to be considered when making a decision about PhD programs. The system may rely on incomplete or biased data (the pipeline is described below). Users should independently verify that the information provided is accurate, that potential advisors are looking for PhD students, and that they are applying to the correct advisor at the correct institution (there are a few name collisions, which the pipeline does not handle). Again, the recommendations should serve as an exploratory tool, and it is the responsibility of the user to do their due diligence when deciding where and how to apply, as well as when making their final decision.

To get started, click on the Streamlit demo!

To run locally, `git clone` the repo, then `pip install -r requirements.txt`.

## Overview of Data Pipeline

### Steps
1) Authors and papers are gathered from recent conference proceedings, listed [below](#selected-conferences). While these are intended to cover some of the major ML conferences, the list is not exhaustive, and a number of top conferences are missing, which can bias the recommendations.
2) Each author is searched via the Bing web search API, and the returned search results are used as input to OpenAI GPT-4o-mini to determine whether the author is a U.S.-based professor. A search result may contain irrelevant information, one person may hold multiple titles, or multiple people may share the same name.
3) Recent papers of the U.S.-based professors are fetched from arXiv.
4) Finally, when a query is searched, the system uses semantic similarity to find similar papers and their corresponding U.S.-based professors.
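
Step 4 can be sketched as a cosine-similarity lookup over precomputed paper embeddings. This is an illustrative assumption of how such a lookup works, not the project's actual code; the names `recommend`, `paper_embs`, and `paper_profs` are hypothetical:

```python
import numpy as np

def recommend(query_emb: np.ndarray,
              paper_embs: np.ndarray,
              paper_profs: list[str],
              top_k: int = 3) -> list[str]:
    """Rank professors by cosine similarity between a query embedding
    and precomputed paper embeddings (illustrative sketch)."""
    # Normalize so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    P = paper_embs / np.linalg.norm(paper_embs, axis=1, keepdims=True)
    scores = P @ q
    best = np.argsort(scores)[::-1][:top_k]
    # Deduplicate professors while preserving score order.
    seen, ranked = set(), []
    for i in best:
        if paper_profs[i] not in seen:
            seen.add(paper_profs[i])
            ranked.append(paper_profs[i])
    return ranked
```

Each hit maps back to the professor who wrote the paper, so one professor with several similar papers is only listed once.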

### "Mistakes" (among things I am aware of)
- The pipeline does not handle name collisions, name changes, or different spellings of the same person. Please click on the arXiv link and PDF to verify the institution.
- An LLM judges whether an author is a U.S.-based professor. No rigorous analysis has been done on the accuracy of this classification, but I can verify that there are mistakes.

### Current Heuristics
- Authors are filtered to those with at least 3 non-first-author papers in the selected conferences (past 2 years).
- Papers with only one author or more than 20 authors are ignored.
- The first author of each paper is ignored; normally, these are students.
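
A minimal sketch of these filters, assuming a simple `Paper` structure (the structure and function name are illustrative, not the pipeline's actual code):

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Paper:
    authors: list[str]  # ordered; authors[0] is the first author

def eligible_authors(papers: list[Paper], min_papers: int = 3) -> set[str]:
    """Apply the heuristics above: drop single-author and >20-author
    papers, skip first authors, keep authors with >= min_papers left."""
    counts: Counter = Counter()
    for p in papers:
        if len(p.authors) == 1 or len(p.authors) > 20:
            continue  # single-author papers and huge collaborations are ignored
        counts.update(p.authors[1:])  # first author (usually a student) is skipped
    return {a for a, c in counts.items() if c >= min_papers}
```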

### Possibilities for improvement
- Better embeddings: The current method embeds each paper by simply packaging the title and abstract as input to the embedding model. An extra step could have an LLM extract the topics, insights, methodologies, etc. in each paper so that the embedding focuses more on content.
- Model: At the time of writing, `gte-Qwen2-7B-instruct` appears to be the best at clustering on the arXiv section of the MTEB benchmark ([leaderboard](https://huggingface.co/spaces/mteb/leaderboard)). More powerful models could be used if they can also be deployed.
- Narrow scope: Due to limited financial resources, only a select number of conferences and professors are explored.
- Handling name collisions.
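
The "better embeddings" idea can be sketched as follows; both function names and the prompt wording are hypothetical, and only the title-plus-abstract packaging reflects the current pipeline:

```python
def embedding_input(title: str, abstract: str) -> str:
    """Current approach: package the title and abstract into one string
    that is fed directly to the embedding model."""
    return f"Title: {title}\nAbstract: {abstract}"

def extraction_prompt(title: str, abstract: str) -> str:
    """Proposed improvement: ask an LLM for content-focused bullet
    points first, then embed the LLM's answer instead of the raw text."""
    return ("Extract the topic, key insights, and methodology of this "
            "paper as short bullet points.\n\n"
            + embedding_input(title, abstract))
```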

### Selected conferences
- NeurIPS: 2022, 2023
- ICML: 2023, 2024
- AISTATS: 2023, 2024
- …
- EMNLP: 2023, 2024
- CVPR: 2023, 2024

## Reproducing the Project

The `data_pipeline` requires additional packages. Please install them:

```bash
cd data_pipeline
pip install -r requirements-data-pipeline.txt
cd ..
```

```bash
# Scrape top conferences, ~45 mins (most of the time is spent on AAAI)
python -m data_pipeline.conference_scraper
```

```bash
# Search authors and locally store the search results. Uses the Bing web search API.
python -m data_pipeline.us_professor_verifier --batch_search
```

NOTE: You may encounter caught exceptions (e.g., due to HTTPError). Rerun the commands above until all conferences are scraped and all persons have been verified.
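
Since failures are caught and the steps are safe to rerun, the repeated runs can be automated with a small retry wrapper. This helper is not part of the pipeline; it is a sketch assuming each step exits non-zero on failure and skips already-stored results when rerun:

```python
import subprocess
import time

def run_until_clean(cmd: list[str], max_attempts: int = 5,
                    wait_s: float = 30.0) -> bool:
    """Rerun a pipeline step until it exits cleanly or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        print(f"attempt {attempt} failed, retrying in {wait_s}s...")
        time.sleep(wait_s)
    return False

# Example (hypothetical usage): retry the Bing search step.
# run_until_clean(["python", "-m", "data_pipeline.us_professor_verifier", "--batch_search"])
```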

```bash
# Use locally stored search results as input to an LLM.
# Sends requests in batches, each waiting for the previous to finish.
python -m data_pipeline.us_professor_verifier --batch_analyze

python -m data_pipeline.us_professor_verifier --batch_retrieve
```

NOTE: The LLM is not guaranteed to always produce valid JSON, though it does so very frequently.
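
One way to guard against occasional invalid JSON is to validate each LLM reply and queue failures for a rerun. This is a sketch; the reply shape (a JSON object per author) is an assumption, not the pipeline's actual format:

```python
import json

def parse_llm_json(raw: str) -> "dict | None":
    """Return the parsed object if the LLM reply is a valid JSON object,
    otherwise None so the caller can retry that item later."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None

# Hypothetical usage: collect the items that need another LLM pass.
replies = {"a": '{"is_us_professor": true}', "b": "not json"}
parsed = {k: parse_llm_json(v) for k, v in replies.items()}
retry_queue = [k for k, v in parsed.items() if v is None]  # rerun these
```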

```bash
# Fetch arXiv data and extract embeddings (may need a GPU)
python -m data_pipeline.paper_embeddings_extractor
```

```bash
# Run the Streamlit application
streamlit run USMLPhDRecommender.py
```