app_file: app.py
pinned: false
license: apache-2.0
---

# Auto-Research

![Auto-Research][logo]

[logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png

A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft-paper format) and other interesting artifacts from a single research query.

Data Provider: [arXiv](https://arxiv.org/) Open Archive Initiative (OAI)

Requirements:
- Python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
- the packages listed in `requirements.txt` - `cat requirements.txt | xargs pip install`
- 8 GB disk space
- 13 GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)
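
A hypothetical preflight helper (not part of this repo) can sanity-check the Python-version and disk-space requirements above before kicking off a long run; the CUDA memory requirement is deliberately left unchecked here:

```python
import shutil
import sys

def preflight(min_python=(3, 7), min_free_gb=8, path="."):
    """Return True when the interpreter and free disk space meet the
    documented minimums (Python 3.7+, ~8 GB disk).

    GPU memory is NOT checked; verify the ~13 GB CUDA requirement
    separately (e.g. with nvidia-smi).
    """
    ok_python = sys.version_info >= min_python
    free_gb = shutil.disk_usage(path).free / 1e9
    return ok_python and free_gb >= min_free_gb

print(preflight())
```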

#### Demo:

Video demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle re-usable demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

(`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)

#### Installation:

```
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```

#### Run Survey (CLI):

```
python survey.py [options] <your_research_query>
```

#### Run Survey (Streamlit web interface - new):

```
streamlit run app.py
```

#### Run Survey (Python API):

```
from survey import Surveyor
mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```

### Research tools:

These are independent tools for your research or document text handling needs.

```
*[Tip]* : models can be changed in the defaults, or passed during init along with `refresh_models=True`
```

- `abstractive_summary` - takes a long text document (`string`) and returns a one-paragraph abstract, or "abstractive" summary (`string`)

  Input:

  `longtext` : string

  Returns:

  `summary` : string

- `extractive_summary` - takes a long text document (`string`) and returns a one-paragraph set of extracted highlights, or "extractive" summary (`string`)

  Input:

  `longtext` : string

  Returns:

  `summary` : string

- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)

  Input:

  `longtext` : string

  Returns:

  `title` : string

- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and a list of key phrases (`[string]`)

  Input:

  `longtext` : string

  Returns:

  `highlights` : [string]
  `keywords` : [string]
  `keyphrases` : [string]

- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`)

  Input:

  `pdf_file` : string

  Returns:

  `images_files` : [string]

- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`)

  Input:

  `pdf_file` : string

  Returns:

  `csv_files` : [string]

- `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)

  Input:

  `lines` : [string]

  Returns:

  `sections` : dict(generated_title: [cluster_abstract])
  `clusters` : dict(cluster_id: [cluster_lines])

- `extract_headings` - *[for scientific texts - assumes an 'abstract' heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`)

  `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)

  `[Tip 2]` : write the word 'abstract' at the start of the file text to get an extraction for non-scientific texts as well!

  Input:

  `text_file` : string

  Returns:

  `refined` : [string]
  `headings` : [string]
  `sectioned_doc` : dict(heading: text) (optional - wrapper case)
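
Tip 1's wrapper pattern amounts to grouping the refined lines under their detected headings. A minimal sketch of that grouping with made-up data - it mirrors the documented `dict(heading: text)` return shape, not the library's actual `extract_sections` code:

```python
# Hypothetical stand-ins for extract_headings output; real values
# come from the library, not from this stub.
headings = ["abstract", "introduction", "methods"]
refined = [
    "abstract", "This paper studies X.",
    "introduction", "Prior work on X is limited.",
    "methods", "We apply Y to Z.",
]

def to_sections(headings, refined):
    """Group each refined line under the most recently seen heading,
    yielding the documented dict(heading: text) shape."""
    sectioned, current = {}, None
    for line in refined:
        if line in headings:
            current = line
            sectioned.setdefault(current, [])
        elif current is not None:
            sectioned[current].append(line)
    return {h: " ".join(lines) for h, lines in sectioned.items()}

doc = to_sections(headings, refined)
```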

## Access/Modify defaults:

- inside code

```
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```

or,

- modify the static config file - `defaults.py`

or,

- at runtime (utility)

```
python survey.py --help
```

```
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximum number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximum number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all output dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model (for abstractive summary) name/tag in
                        hugging-face, defaults to
                        'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model (for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        refresh model downloads with given names (needs at
                        least one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False
```

- at runtime (code)

  > during surveyor object initialization with `surveyor_obj = Surveyor()`
  - `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
  - `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
  - `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
  - `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
  - `dump_dir`: String, all output dir - defaults to `arxiv_dumps/`
  - `models_dir`: String, directory to save the huge models - defaults to `saved_models/`
  - `title_model_name`: String, title model name/tag in hugging-face - defaults to `Callidior/bert2bert-base-arxiv-titlegen`
  - `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face - defaults to `allenai/scibert_scivocab_uncased`
  - `ledmodel_name`: String, led model (for abstractive summary) name/tag in hugging-face - defaults to `allenai/led-large-16384-arxiv`
  - `embedder_name`: String, sentence embedder name/tag in hugging-face - defaults to `paraphrase-MiniLM-L6-v2`
  - `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior) - defaults to `en_core_sci_scibert`
  - `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior) - defaults to `en_core_sci_lg`
  - `kw_model_name`: String, keyword extraction model name/tag in hugging-face - defaults to `distilbert-base-nli-mean-tokens`
  - `high_gpu`: Bool, High GPU usage permitted - defaults to `False`
  - `refresh_models`: Bool, refresh model downloads with given names (needs at least one model name param above) - defaults to `False`

  > during survey generation with `surveyor_obj.survey(query="my_research_query")`
  - `max_search`: int, maximum number of papers to gaze at - defaults to `100`
  - `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
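
The override behaviour described above can be pictured as a plain dict overlay on the defaults. The keys below are a small subset of the documented ones, and the unknown-key check is an illustrative assumption, not necessarily what `Surveyor.__init__` does:

```python
# A few keys mirroring the documented defaults; the real values live
# in the package's defaults.py / DEFAULTS.
DEFAULTS = {
    "pdf_dir": "arxiv_data/tarpdfs/",
    "num_papers": 25,
    "high_gpu": False,
}

def resolve_config(overrides, defaults=DEFAULTS):
    """Overlay caller-supplied options on the defaults, rejecting
    unknown keys early (an assumed convenience, not library behaviour)."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return {**defaults, **overrides}

cfg = resolve_config({"num_papers": 10, "high_gpu": True})
```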

#### Artifacts generated (zipped):
- Detailed survey draft paper as a txt file
- A curated list of the top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps, etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure-python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-python joblib dump

Please cite this repo if it helped you :)