| # eBoard Cookie Extraction Guide | |
| ## Quick Start (10 Minutes) | |
| This guide shows you how to bypass Incapsula bot protection by reusing **session cookies exported manually from your browser**. It is the fastest no-cost workaround for scraping Tuscaloosa school district data. | |
| --- | |
| ## Step 1: Export Cookies from Your Browser | |
| ### Option A: Using EditThisCookie Extension (Recommended) | |
| 1. **Install Extension:** | |
| - Chrome: https://chrome.google.com/webstore/detail/editthiscookie/fngmhnnpilhplaeedifhccceomclgfbg | |
| - Edge: https://microsoftedge.microsoft.com/addons/detail/editthiscookie/ajfboaconbpkglpfanbmlfgojgndmhmc | |
| 2. **Visit eBoard Site:** | |
| ``` | |
| https://simbli.eboardsolutions.com/SB_Meetings/SB_MeetingListing.aspx?S=2088 | |
| ``` | |
| 3. **Solve Any CAPTCHA:** | |
| - Wait for the "Verifying you are human" screen to complete | |
| - Click around the page (view a few meetings) so that all cookies are set | |
| 4. **Export Cookies:** | |
| - Click the EditThisCookie icon in your browser | |
| - Click the "Export" button (looks like a download icon) | |
| - The cookies are copied to your clipboard | |
| 5. **Save to File:** | |
| ```bash | |
| cd /home/developer/projects/open-navigator | |
| nano eboard_cookies.json | |
| ``` | |
| - Paste the copied cookies | |
| - Save and exit (Ctrl+X, then Y, then Enter) | |
| ### Option B: Using Browser DevTools (Manual) | |
| 1. **Visit eBoard Site:** | |
| ``` | |
| https://simbli.eboardsolutions.com/SB_Meetings/SB_MeetingListing.aspx?S=2088 | |
| ``` | |
| 2. **Open DevTools:** | |
| - Press F12 | |
| - Go to **Application** tab (Chrome) or **Storage** tab (Firefox) | |
| - Click **Cookies** → `https://simbli.eboardsolutions.com` | |
| 3. **Find Key Cookies:** | |
| Look for cookies whose names start with these prefixes (the numeric suffixes vary by site and session): | |
| - `incap_ses_` | |
| - `visid_incap_` | |
| - `nlbi_` | |
| 4. **Create JSON File:** | |
| ```bash | |
| cd /home/developer/projects/open-navigator | |
| nano eboard_cookies.json | |
| ``` | |
| 5. **Format as JSON:** | |
| ```json | |
| [ | |
| { | |
| "name": "incap_ses_7050_2088", | |
| "value": "YOUR_ACTUAL_VALUE_FROM_BROWSER", | |
| "domain": ".eboardsolutions.com", | |
| "path": "/" | |
| }, | |
| { | |
| "name": "visid_incap_2227783", | |
| "value": "YOUR_ACTUAL_VALUE_FROM_BROWSER", | |
| "domain": ".eboardsolutions.com", | |
| "path": "/" | |
| }, | |
| { | |
| "name": "nlbi_2227783", | |
| "value": "YOUR_ACTUAL_VALUE_FROM_BROWSER", | |
| "domain": ".eboardsolutions.com", | |
| "path": "/" | |
| } | |
| ] | |
| ``` | |
| --- | |
| ## Step 2: Verify Cookie File | |
| ```bash | |
| cd /home/developer/projects/open-navigator | |
| # Check file exists | |
| ls -la eboard_cookies.json | |
| # Verify JSON format | |
| python -c "import json; print(f'Loaded {len(json.load(open(\"eboard_cookies.json\")))} cookies')" | |
| ``` | |
| Expected output: `Loaded 3 cookies` (the count matches however many cookies you exported) | |
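The one-liner above only confirms that the file parses. A slightly stricter check is sketched below. It assumes the scraper expects the four keys shown in the Option B example (`name`, `value`, `domain`, `path`); the function name `check_cookie_file` is hypothetical and not part of the project.

```python
import json
from pathlib import Path

# Keys taken from the Option B example; an assumption about what the scraper needs.
REQUIRED_KEYS = {"name", "value", "domain", "path"}
PLACEHOLDER = "YOUR_ACTUAL_VALUE_FROM_BROWSER"


def check_cookie_file(path):
    """Return a list of human-readable problems found in the cookie file."""
    try:
        cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    except FileNotFoundError:
        return [f"{path}: file not found"]
    except json.JSONDecodeError as exc:
        return [f"{path}: invalid JSON ({exc})"]

    if not isinstance(cookies, list):
        return [f"{path}: expected a JSON array of cookie objects"]

    problems = []
    for i, cookie in enumerate(cookies):
        if not isinstance(cookie, dict):
            problems.append(f"entry {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - cookie.keys()
        if missing:
            problems.append(f"entry {i}: missing {sorted(missing)}")
        # Catch values that were never filled in after copying the template.
        if cookie.get("value") in ("", PLACEHOLDER):
            problems.append(f"entry {i}: value is empty or still the placeholder")
    return problems


if __name__ == "__main__":
    issues = check_cookie_file("eboard_cookies.json")
    print("\n".join(issues) if issues else "Cookie file looks well-formed")
```

An empty result means the file is structurally sound; it does not tell you whether the server still accepts the cookies.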
| --- | |
| ## Step 3: Run the Scraper | |
| The scraper will automatically detect and use `eboard_cookies.json`: | |
| ### Tuscaloosa City Schools | |
| ```bash | |
| source .venv/bin/activate | |
| python main.py scrape \ | |
| --state AL \ | |
| --municipality "Tuscaloosa City Schools" \ | |
| --url "http://simbli.eboardsolutions.com/index.aspx?s=2088" \ | |
| --platform eboard \ | |
| --max-events 0 \ | |
| --start-year 0 \ | |
| --no-include-social | |
| ``` | |
| ### Tuscaloosa County Schools | |
| ```bash | |
| python main.py scrape \ | |
| --state AL \ | |
| --municipality "Tuscaloosa County Schools" \ | |
| --url "http://simbli.eboardsolutions.com/index.aspx?s=2092" \ | |
| --platform eboard \ | |
| --max-events 0 \ | |
| --start-year 0 \ | |
| --no-include-social | |
| ``` | |
| --- | |
| ## Expected Output | |
| ### Without Cookies (Blocked): | |
| ``` | |
| INFO | agents.scraper:_scrape_eboard - No cookie file found | |
| INFO | agents.scraper:_scrape_eboard - Loading Meeting Listing page... | |
| ERROR | agents.scraper:_scrape_eboard - Still blocked by Incapsula (964 bytes) | |
| ``` | |
| ### With Cookies (Success): | |
| ``` | |
| SUCCESS | agents.scraper:_scrape_eboard - ✓ Loaded 3 cookies from eboard_cookies.json | |
| SUCCESS | agents.scraper:_scrape_eboard - ✓ Cookies injected into browser session | |
| SUCCESS | agents.scraper:_scrape_eboard - ✓ Bypassed Incapsula! Got 246327 bytes | |
| INFO | agents.scraper:_scrape_eboard - Found 47 meeting/document links | |
| ``` | |
| --- | |
| ## Troubleshooting | |
| ### Problem: "Still blocked by Incapsula" | |
| **Cause:** The cookies have expired, or the scraper's User-Agent does not match the browser that exported them | |
| **Solution:** | |
| 1. Re-export the cookies (they expire every few hours) | |
| 2. Make sure the scraper's User-Agent matches the browser you exported the cookies from: | |
| - If you exported from **Chrome 123**, the script's Chrome 123 User-Agent already matches ✓ | |
| - If you exported from **Firefox**, update the User-Agent in the code to match | |
| ### Problem: "Found 0 meeting links" | |
| **Cause:** The page structure changed, or the request is still being challenged | |
| **Solution:** | |
| 1. Check whether the cookies are still valid (re-export if in doubt) | |
| 2. Visit the site manually first, then run the scraper immediately afterwards | |
| 3. Increase the wait time in the script (currently randomized between 5 and 7 seconds) | |
| ### Problem: "Cookies expired after 10 meetings" | |
| **Cause:** Incapsula's "Advanced Mode" detected an automated pattern | |
| **Solution:** | |
| - Scraper already implements: | |
| - ✓ Randomized delays (3-7 seconds between requests) | |
| - ✓ Mouse movements to simulate human behavior | |
| - ✓ Varied User-Agent fingerprinting | |
| - If still detected, try: | |
| 1. Reduce number of meetings (`--max-events 25`) | |
| 2. Run multiple smaller batches instead of one large batch | |
| 3. Wait 10-15 minutes between batches | |
| --- | |
| ## Cookie Lifespan | |
| - **Typical Duration:** 2-4 hours | |
| - **Activity Extension:** Each page view extends expiration | |
| - **Re-export Needed:** When scraper gets blocked again | |
| **Pro Tip:** For daily scraping, just re-export cookies each morning before running the scraper. | |
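Before a run, you can check whether any exported cookie has already passed its own expiry time. The sketch below assumes the file came from a browser-extension export that includes an `expirationDate` field in Unix seconds (EditThisCookie exports usually do; the hand-written Option B format does not). The function name `expired_cookies` is hypothetical.

```python
import json
import time


def expired_cookies(cookies, now=None):
    """Return names of cookies whose expirationDate (Unix seconds) has passed.

    Entries without a numeric expirationDate (session cookies or
    hand-written entries) are skipped, since the file does not say
    when they lapse.
    """
    now = time.time() if now is None else now
    return [
        c.get("name", "<unnamed>")
        for c in cookies
        if isinstance(c.get("expirationDate"), (int, float))
        and c["expirationDate"] <= now
    ]


if __name__ == "__main__":
    try:
        with open("eboard_cookies.json", encoding="utf-8") as fh:
            stale = expired_cookies(json.load(fh))
        print("Expired:", stale if stale else "none")
    except FileNotFoundError:
        print("eboard_cookies.json not found")
```

A cookie that has not reached its `expirationDate` can still be rejected by the server, so treat an empty result as "not obviously stale" rather than "guaranteed valid".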
| --- | |
| ## Security Notes | |
| - **Keep cookies private:** They grant access to the site as "you" | |
| - **Single machine:** Don't share cookies between different IP addresses | |
| - **Browser match:** Use same browser for export and scraping | |
| - **.gitignore:** The file `eboard_cookies.json` is already in `.gitignore` (won't be committed) | |
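Because the cookie file acts as a credential, it can also help to make it readable only by your own user account. This is a minimal sketch for POSIX systems such as the Linux paths used in this guide; `restrict_to_owner` is a hypothetical helper, not part of the project.

```python
import os
import stat


def restrict_to_owner(path):
    """Set the file mode to 0600 (owner read/write only) and return the new mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    if os.path.exists("eboard_cookies.json"):
        mode = restrict_to_owner("eboard_cookies.json")
        print(f"eboard_cookies.json mode is now {oct(mode)}")
```

On Windows, `os.chmod` only toggles the read-only flag, so this sketch does not restrict access there.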
| --- | |
| ## Advanced: Multiple School Districts | |
| To scrape both Tuscaloosa City and County schools: | |
| ```bash | |
| # 1. Export cookies while visiting EITHER school's site | |
| # (cookies work for all eboardsolutions.com sites) | |
| # 2. Scrape City Schools | |
| python main.py scrape --platform eboard \ | |
| --url "http://simbli.eboardsolutions.com/index.aspx?s=2088" \ | |
| --municipality "Tuscaloosa City Schools" --state AL | |
| # Wait 30 seconds (let cookies settle) | |
| sleep 30 | |
| # 3. Scrape County Schools (same cookies) | |
| python main.py scrape --platform eboard \ | |
| --url "http://simbli.eboardsolutions.com/index.aspx?s=2092" \ | |
| --municipality "Tuscaloosa County Schools" --state AL | |
| ``` | |
| --- | |
| ## Success Metrics | |
| You'll know it's working when you see: | |
| - ✓ `Bypassed Incapsula! Got 200000+ bytes` | |
| - ✓ `Found XX meeting/document links` (where XX > 0) | |
| - ✓ `✓ Scraped PDF: ...` (individual documents being downloaded) | |
| Typical results for Tuscaloosa: | |
| - **City Schools (S=2088):** 30-50 meetings | |
| - **County Schools (S=2092):** 40-60 meetings | |