[Home](../README.md)
# Data Processing

## Data Extraction

Activate the virtual environment for your OS. For example, on macOS run the following
command:
`source .venv/bin/activate`

Change to the data directory:
`cd /data`

Ensure that you have a Firecrawl API key stored in a file named `.env` in the
following format:

`FIRECRAWL_API_KEY=<YOUR_FIRECRAWL_API_KEY>`
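
Before starting a long scrape, it can help to confirm that the `.env` file is readable and that the key is no longer the placeholder. The following is a minimal sketch using only the standard library. The helper names `load_env_file` and `check_firecrawl_key` are hypothetical and are not part of `data_scraper.py`, which may read the key differently.

```python
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_firecrawl_key(path=".env"):
    """Return True if the file defines a non-empty, non-placeholder FIRECRAWL_API_KEY."""
    key = load_env_file(path).get("FIRECRAWL_API_KEY", "")
    return bool(key) and not key.startswith("<")
```

Calling `check_firecrawl_key()` from the data directory returns `False` if the key is missing or still set to `<YOUR_FIRECRAWL_API_KEY>`.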

Run the following command to execute the data scraper script:

`python3 data_scraper.py`



## Post Processing - optional

The downloaded data is approximately 200 MB. To review it, format it first; otherwise it will all be on a single line.

Run the following command from your terminal:

`cat scout_information.json | python -m json.tool > pretty_scout_information.json`
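
If a shell pipeline is not convenient (for example on Windows), the same formatting can be sketched in Python with the standard `json` module. The function name `pretty_print_json` is hypothetical. The file names match the command above, and the 4-space indent mirrors the default of `json.tool`.

```python
import json


def pretty_print_json(src_path, dst_path, indent=4):
    """Rewrite a single-line JSON file as indented JSON (hypothetical helper).

    Returns the number of top-level items, as a quick completeness check.
    """
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)  # loads the whole file into memory
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(data, dst, indent=indent)
        dst.write("\n")
    return len(data) if isinstance(data, (list, dict)) else 1
```

For example, `pretty_print_json("scout_information.json", "pretty_scout_information.json")` writes the formatted copy and returns the top-level item count.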

After reviewing the data for completeness, delete the `pretty_scout_information.json` file, as it is not needed for processing.

The second step in the data scraper is creating a vector store. Before this repo was created, the vector store was uploaded to
Hugging Face Datasets, where it can be found as `marty331/scouts_dataset_vector_store`.
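
As a rough sketch, the published vector store could be fetched with the `huggingface_hub` package instead of being rebuilt locally. This assumes that `huggingface_hub` is installed and that the dataset repository is publicly accessible. The function name `download_vector_store` and the `vector_store` target directory are hypothetical.

```python
REPO_ID = "marty331/scouts_dataset_vector_store"


def download_vector_store(local_dir="vector_store"):
    """Download the dataset repository and return the local path (hypothetical helper)."""
    # Imported lazily so that defining this function does not require the package.
    from huggingface_hub import snapshot_download

    # repo_type="dataset" is required because the default repo type is "model".
    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)
```

Calling `download_vector_store()` requires network access and writes the files under the chosen directory.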