Spaces:
Running on CPU Upgrade
Running on CPU Upgrade
File size: 4,598 Bytes
896453f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 | # eBoard Platform Manual Download Guide
## Issue: Incapsula Bot Protection
eBoard Solutions (https://simbli.eboardsolutions.com) uses **Incapsula** anti-bot protection that blocks automated scraping, even with advanced tools like Playwright. The platform requires manual interaction to access meeting documents.
## Affected School Districts
### Tuscaloosa City Schools
- **URL**: http://simbli.eboardsolutions.com/index.aspx?s=2088
- **Meetings**: http://simbli.eboardsolutions.com/SB_Meetings/SB_MeetingListing.aspx?S=2088
### Tuscaloosa County Schools
- **URL**: https://simbli.eboardsolutions.com/SB_Meetings/SB_MeetingListing.aspx?S=2092
- **Website**: https://www.tcss.net/board-of-education (links to eBoard)
## Manual Download Steps
### 1. Access Meeting Listings
1. Visit the meetings URL above in your browser
2. You'll see a calendar or list of board meetings
3. Each meeting shows the date and has document links
### 2. Download Documents
For each meeting:
- Click on the meeting date to view details
- Look for:
- **Agenda** (usually PDF)
- **Minutes** (usually PDF)
- **Packets** (supporting materials)
- Right-click each document → "Save As"
### 3. Organize Downloads
Save files with naming pattern:
```
tuscaloosa_city_schools_YYYY-MM-DD_agenda.pdf
tuscaloosa_city_schools_YYYY-MM-DD_minutes.pdf
```
### 4. Import into System
Once downloaded, you can import them manually:
```python
from pipeline.delta_lake import DeltaLakePipeline
from agents.scraper import ScraperAgent
import asyncio
async def import_manual_pdfs(pdf_directory: str):
"""Import manually downloaded PDFs into the system."""
scraper = ScraperAgent()
async with scraper:
documents = []
for pdf_path in Path(pdf_directory).glob("*.pdf"):
# Extract content from PDF
content = await scraper._scrape_pdf_document(str(pdf_path))
if content:
# Parse filename for metadata
parts = pdf_path.stem.split('_')
date_str = parts[2] if len(parts) > 2 else ""
doc_type = parts[3] if len(parts) > 3 else "document"
doc = {
'document_id': hashlib.md5(str(pdf_path).encode()).hexdigest(),
'source_url': f'file://{pdf_path}',
'municipality': 'Tuscaloosa City Schools',
'state': 'AL',
'meeting_date': date_str,
'meeting_type': 'Board Meeting',
'title': pdf_path.stem,
'content': content,
'metadata': {'source': 'manual_download', 'platform': 'eboard'}
}
documents.append(doc)
# Write to Delta Lake
pipeline = DeltaLakePipeline()
pipeline.write_raw_documents(documents)
return documents
# Usage:
# asyncio.run(import_manual_pdfs('/path/to/downloaded/pdfs'))
```
## Alternative: RSS Feeds
Some eBoard installations offer RSS feeds or calendar exports:
1. Look for RSS icon on meetings page
2. Look for "Subscribe" or "Export to Calendar" options
3. These may bypass the web interface restrictions
## Future Enhancement Ideas
1. **Browser Extension**: Create a Chrome extension that scrapes while you browse
2. **API Discovery**: Research if eBoard has any undocumented APIs
3. **Selenium Grid**: Use residential proxy services for more sophisticated bot evasion
4. **Contact District**: Request bulk export of meeting documents directly
## Why Automation Fails
eBoard's Incapsula protection includes:
- Browser fingerprinting (detects headless browsers)
- IP reputation checking
- JavaScript challenges (requires full browser execution)
- Session tracking (blocks rapid sequential requests)
- Rate limiting per IP address
Even with Playwright running in visible mode, subsequent page navigations get blocked once the system detects automated patterns.
## Recommended Approach
For comprehensive school district data:
1. **Prioritize**: Focus on city government data (working well)
2. **Manual collection**: Download key school board meetings manually
3. **Selective import**: Import only the most relevant documents
4. **Direct contact**: Reach out to school district IT for data sharing agreement
## Status
- ✅ **Tuscaloosa City Government**: Automated scraping works (SuiteOne Media platform)
- ❌ **Tuscaloosa City Schools**: Manual download required (eBoard + Incapsula)
- ❌ **Tuscaloosa County Schools**: Manual download required (eBoard + Incapsula)
|