ai_scout / data /data.md
marty331's picture
Update readme information
d4ffb8f

A newer version of the Gradio SDK is available: 6.2.0

Upgrade

Home

Data Processing

Data Extraction

Activate the virtual environment for your OS. For example on Mac run the following command: source .venv/bin/activate

Change to the data directory: cd /data

Ensure that you have a Firecrawl API key stored in a file named `.env' in the following format:

FIRECRAWL_API_KEY=<YOUR_FIRECRAWL_API_KEY>

Run the following command to execute the data scraper script:

python3 data_scraper.py

Post Processing - optional

The data that is downloaded is approximately 200 mb, to review the data it needs to be formatted, otherwise it will all be on a single line.

Run the following command from your terminal:

cat scout_information.json | python -m json.tool > pretty_scout_information.json

After reviewing the data for completeness delete the pretty_scout_information.json file as it is not needed for processing.

The second step in the data scraper is creating a vector store. Prior to this repo being created this vector store was uploaded to Huggingface Datasets and can be found here: marty331/scouts_dataset_vector_store.