[Home](../README.md)
# Data Processing
## Data Extraction
Activate the virtual environment for your OS. For example, on macOS, run the following
command:
`source .venv/bin/activate`
Change to the data directory:
`cd /data`
Ensure that you have a Firecrawl API key stored in a file named `.env` in the
following format:
`FIRECRAWL_API_KEY=<YOUR_FIRECRAWL_API_KEY>`
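Before running the scraper, it can help to confirm that the key is actually present. The following is a minimal sketch, not part of the repo: the helper names `read_env_value` and `has_firecrawl_key` are hypothetical, and it assumes the `.env` file uses plain `KEY=VALUE` lines. How `data_scraper.py` itself reads the key is not shown here.

```python
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value assigned to `key` in a simple KEY=VALUE file, or None."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Drop optional surrounding quotes around the value.
            return value.strip().strip("'\"")
    return None


def has_firecrawl_key(env_path=".env"):
    """Return True if FIRECRAWL_API_KEY is set to something other than the placeholder."""
    value = read_env_value(env_path, "FIRECRAWL_API_KEY")
    # Treat a missing or empty value, or the unedited <...> placeholder, as not set.
    return bool(value) and not (value.startswith("<") and value.endswith(">"))
```

Calling `has_firecrawl_key()` from the directory that contains `.env` returns `False` if the file is missing or still contains the `<YOUR_FIRECRAWL_API_KEY>` placeholder.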
Run the following command to execute the data scraper script:
`python3 data_scraper.py`
## Post Processing (Optional)
The downloaded data is approximately 200 MB and is stored on a single line, so it needs to be formatted before you can review it.
Run the following command from your terminal:
`cat scout_information.json | python -m json.tool > pretty_scout_information.json`
After reviewing the data for completeness, delete the `pretty_scout_information.json` file, as it is not needed for processing.
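If you prefer to do the formatting from Python, the following is a rough equivalent of the `json.tool` command, written as a sketch. The function name `pretty_print_json` is hypothetical, and the returned count of top-level entries is only a quick sanity check, since the structure of `scout_information.json` is not described here.

```python
import json


def pretty_print_json(src_path, dest_path, indent=4):
    """Rewrite a JSON file with indentation and return its top-level entry count."""
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)
    with open(dest_path, "w", encoding="utf-8") as dest:
        json.dump(data, dest, indent=indent)
        dest.write("\n")
    # Lists and dicts report their length; any other top-level value counts as 1.
    return len(data) if isinstance(data, (list, dict)) else 1
```

For example, `pretty_print_json("scout_information.json", "pretty_scout_information.json")` writes the formatted copy. Note that the whole file is loaded into memory, which should be manageable at roughly 200 MB but is worth keeping in mind.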
The second step in the data scraper is creating a vector store. Before this repo was created, this vector store was uploaded to
Hugging Face Datasets, where it can be found as `marty331/scouts_dataset_vector_store`.