Spaces:

marty331
/

ai_scout

Sleeping

ai_scout / data /data.md

Update readme information

d4ffb8f 12 months ago

1.11 kB

	[Home](../README.md)
	# Data Processing

	## Data Extraction

	Activate the virtual environment for your OS. For example on Mac run the following
	command:
	`source .venv/bin/activate`

	Change to the data directory:
	`cd /data`

	Ensure that you have a Firecrawl API key stored in a file named `.env' in the
	following format:


	`
	FIRECRAWL_API_KEY=<YOUR_FIRECRAWL_API_KEY>
	`

	Run the following command to execute the data scraper script:

	`python3 data_scraper.py`



	## Post Processing - optional

	The data that is downloaded is approximately 200 mb, to review the data it needs to be formatted, otherwise it will all be on a single line.

	Run the following command from your terminal:

	`cat scout_information.json \| python -m json.tool > pretty_scout_information.json`

	After reviewing the data for completeness delete the pretty_scout_information.json file as it is not needed for processing.

	The second step in the data scraper is creating a vector store. Prior to this repo being created this vector store was uploaded to
	Huggingface Datasets and can be found here: `marty331/scouts_dataset_vector_store`.