[Home](../README.md) # Data Processing ## Data Extraction Activate the virtual environment for your OS. For example on Mac run the following command: `source .venv/bin/activate` Change to the data directory: `cd /data` Ensure that you have a Firecrawl API key stored in a file named `.env' in the following format: ` FIRECRAWL_API_KEY= ` Run the following command to execute the data scraper script: `python3 data_scraper.py` ## Post Processing - optional The data that is downloaded is approximately 200 mb, to review the data it needs to be formatted, otherwise it will all be on a single line. Run the following command from your terminal: `cat scout_information.json | python -m json.tool > pretty_scout_information.json` After reviewing the data for completeness delete the pretty_scout_information.json file as it is not needed for processing. The second step in the data scraper is creating a vector store. Prior to this repo being created this vector store was uploaded to Huggingface Datasets and can be found here: `marty331/scouts_dataset_vector_store`.