open-navigator / docs /EBOARD_COOKIE_GUIDE.md
jcbowyer's picture
Deploy: Consolidated gold tables, fixed nginx docs routing
896453f verified
# eBoard Cookie Extraction Guide
## Quick Start (10 Minutes)
This guide shows you how to bypass Incapsula bot protection using **manual session cookies**. This is the fastest no-cost workaround to scrape Tuscaloosa school district data.
---
## Step 1: Export Cookies from Your Browser
### Option A: Using EditThisCookie Extension (Recommended)
1. **Install Extension:**
- Chrome: https://chrome.google.com/webstore/detail/editthiscookie/fngmhnnpilhplaeedifhccceomclgfbg
- Edge: https://microsoftedge.microsoft.com/addons/detail/editthiscookie/ajfboaconbpkglpfanbmlfgojgndmhmc
2. **Visit eBoard Site:**
```
https://simbli.eboardsolutions.com/SB_Meetings/SB_MeetingListing.aspx?S=2088
```
3. **Solve Any CAPTCHA:**
- Wait for "Verifying you are human" screen to complete
- Click around the page (view a few meetings) to ensure cookies are fully populated
4. **Export Cookies:**
- Click the EditThisCookie icon in your browser
- Click the "Export" button (looks like a download icon)
- Cookies are copied to clipboard
5. **Save to File:**
```bash
cd /home/developer/projects/open-navigator
nano eboard_cookies.json
```
- Paste the copied cookies
- Save and exit (Ctrl+X, then Y, then Enter)
### Option B: Using Browser DevTools (Manual)
1. **Visit eBoard Site:**
```
https://simbli.eboardsolutions.com/SB_Meetings/SB_MeetingListing.aspx?S=2088
```
2. **Open DevTools:**
- Press F12
- Go to **Application** tab (Chrome) or **Storage** tab (Firefox)
- Click **Cookies** β†’ `https://simbli.eboardsolutions.com`
3. **Find Key Cookies:**
Look for these cookie names (the numbers will vary):
- `incap_ses_XXXXX_2088`
- `visid_incap_XXXXX_2088`
- `nlbi_XXXXX`
4. **Create JSON File:**
```bash
cd /home/developer/projects/open-navigator
nano eboard_cookies.json
```
5. **Format as JSON:**
```json
[
{
"name": "incap_ses_7050_2088",
"value": "YOUR_ACTUAL_VALUE_FROM_BROWSER",
"domain": ".eboardsolutions.com",
"path": "/"
},
{
"name": "visid_incap_2227783",
"value": "YOUR_ACTUAL_VALUE_FROM_BROWSER",
"domain": ".eboardsolutions.com",
"path": "/"
},
{
"name": "nlbi_2227783",
"value": "YOUR_ACTUAL_VALUE_FROM_BROWSER",
"domain": ".eboardsolutions.com",
"path": "/"
}
]
```
---
## Step 2: Verify Cookie File
```bash
cd /home/developer/projects/open-navigator
# Check file exists
ls -la eboard_cookies.json
# Verify JSON format
python -c "import json; print(f'Loaded {len(json.load(open(\"eboard_cookies.json\")))} cookies')"
```
Should output: `Loaded 3 cookies` (or however many you exported)
---
## Step 3: Run the Scraper
The scraper will automatically detect and use `eboard_cookies.json`:
### Tuscaloosa City Schools
```bash
source .venv/bin/activate
python main.py scrape \
--state AL \
--municipality "Tuscaloosa City Schools" \
--url http://simbli.eboardsolutions.com/index.aspx?s=2088 \
--platform eboard \
--max-events 0 \
--start-year 0 \
--no-include-social
```
### Tuscaloosa County Schools
```bash
python main.py scrape \
--state AL \
--municipality "Tuscaloosa County Schools" \
--url http://simbli.eboardsolutions.com/index.aspx?s=2092 \
--platform eboard \
--max-events 0 \
--start-year 0 \
--no-include-social
```
---
## Expected Output
### Without Cookies (Blocked):
```
INFO | agents.scraper:_scrape_eboard - No cookie file found
INFO | agents.scraper:_scrape_eboard - Loading Meeting Listing page...
ERROR | agents.scraper:_scrape_eboard - Still blocked by Incapsula (964 bytes)
```
### With Cookies (Success):
```
SUCCESS | agents.scraper:_scrape_eboard - βœ“ Loaded 3 cookies from eboard_cookies.json
SUCCESS | agents.scraper:_scrape_eboard - βœ“ Cookies injected into browser session
SUCCESS | agents.scraper:_scrape_eboard - βœ“ Bypassed Incapsula! Got 246327 bytes
INFO | agents.scraper:_scrape_eboard - Found 47 meeting/document links
```
---
## Troubleshooting
### Problem: "Still blocked by Incapsula"
**Cause:** Cookies expired or User-Agent mismatch
**Solution:**
1. Re-export cookies (they expire every few hours)
2. Ensure you're using the same browser as cookie export:
- If you exported from **Chrome 123**, the script uses Chrome 123 UA βœ“
- If you exported from **Firefox**, you need to update the User-Agent in the code
### Problem: "Found 0 meeting links"
**Cause:** Page structure changed or still being challenged
**Solution:**
1. Check if cookies are still valid (re-export)
2. Try visiting the site manually first, then immediately run scraper
3. Increase wait time in script (already randomized 5-7 seconds)
### Problem: "Cookies expired after 10 meetings"
**Cause:** Incapsula's "Advanced Mode" detected automated pattern
**Solution:**
- Scraper already implements:
- βœ… Randomized delays (3-7 seconds between requests)
- βœ… Mouse movements to simulate human behavior
- βœ… Varied User-Agent fingerprinting
- If still detected, try:
1. Reduce number of meetings (`--max-events 25`)
2. Run multiple smaller batches instead of one large batch
3. Wait 10-15 minutes between batches
---
## Cookie Lifespan
- **Typical Duration:** 2-4 hours
- **Activity Extension:** Each page view extends expiration
- **Re-export Needed:** When scraper gets blocked again
**Pro Tip:** For daily scraping, just re-export cookies each morning before running the scraper.
---
## Security Notes
- **Keep cookies private:** They grant access to the site as "you"
- **Single machine:** Don't share cookies between different IP addresses
- **Browser match:** Use same browser for export and scraping
- **.gitignore:** The file `eboard_cookies.json` is already in `.gitignore` (won't be committed)
---
## Advanced: Multiple School Districts
To scrape both Tuscaloosa City and County schools:
```bash
# 1. Export cookies while visiting EITHER school's site
# (cookies work for all eboardsolutions.com sites)
# 2. Scrape City Schools
python main.py scrape --platform eboard \
--url http://simbli.eboardsolutions.com/index.aspx?s=2088 \
--municipality "Tuscaloosa City Schools" --state AL
# Wait 30 seconds (let cookies settle)
sleep 30
# 3. Scrape County Schools (same cookies)
python main.py scrape --platform eboard \
--url http://simbli.eboardsolutions.com/index.aspx?s=2092 \
--municipality "Tuscaloosa County Schools" --state AL
```
---
## Success Metrics
You'll know it's working when you see:
- βœ… `Bypassed Incapsula! Got 200000+ bytes`
- βœ… `Found XX meeting/document links` (where XX > 0)
- βœ… `βœ“ Scraped PDF: ...` (individual documents being downloaded)
Typical results for Tuscaloosa:
- **City Schools (S=2088):** 30-50 meetings
- **County Schools (S=2092):** 40-60 meetings