Here's a ready-to-use "meta-prompt" you can feed into your AI agent to kick off the local build of your Flashscore scraper:

You are a Senior JavaScript Automation Engineer. Your task is to scaffold and implement, step by step, a local Flashscore data-scraping tool in Node.js, using Playwright (or Puppeteer) and Cheerio. Follow these requirements exactly:

1. **Project Initialization**
   - Create a new npm project (`npm init -y`).
   - Install dependencies:
     ```bash
     npm install playwright cheerio axios dotenv fs-extra node-cron
     ```

2. **File Structure**
   Build this directory tree:
   ```
   flashscore-scraper/
   ├── src/
   │   ├── scrapers/
   │   │   ├── base-scraper.js    # launches browser, handles sessions, stealth
   │   │   ├── match-summary.js   # extracts match info & events
   │   │   └── lineups.js         # extracts formations & lineups
   │   ├── utils/
   │   │   ├── browser-manager.js # singleton browser/context manager
   │   │   ├── data-processor.js  # cleans & normalizes scraped data
   │   │   └── proxy-manager.js   # rotates proxies & delays
   │   ├── models/
   │   │   ├── match-data.js      # JS class/schema for match summary
   │   │   └── team-data.js       # JS class/schema for lineup data
   │   └── index.js               # CLI entrypoint & cron scheduler
   ├── config/
   │   └── settings.js            # base URL, selectors, proxy list, cron schedule
   ├── data/
   │   ├── matches/               # JSON output files
   │   └── cache/                 # temporary HTML snapshots
   └── package.json
   ```

3. **Stealth & Throttling**
   In `base-scraper.js`, implement:
   - A realistic `User-Agent` and random delays (2–8 s) between actions.
   - The Puppeteer extra stealth plugin or Playwright stealth options.
   - Proxy rotation every 50 requests.
   - Blocking of images & ads via request interception.

4. **Scraper Modules**
   - **match-summary.js**: Navigate to a match URL, wait for the `.match-summary` selector, and scrape:
     - Teams, final score, date & time, half-time score.
     - An events array: goals (scorer/time/assist), cards, substitutions, injuries.
   - **lineups.js**: Navigate to `/lineups`, wait for the lineup container, and scrape:
     - Starting XI, substitutes, coaching staff, formation map.

5. **Data Models & Processing**
   - Define `MatchData` and `TeamData` classes with clear fields.
   - In `data-processor.js`, normalize timestamps, convert date strings to ISO, and validate numeric scores.

6. **Scheduling & CLI**
   - In `index.js`, read a match URL from the CLI or `.env`.
   - Schedule daily runs via `node-cron` (configurable cron expression).
   - Save JSON to `data/matches/<matchId>.json`.

7. **Error Handling & Logging**
   - Retry up to 3 times on network or selector errors, with exponential backoff.
   - Log successes and failures to a rotating log file in `data/logs/`.

8. **Next Steps (after MVP)**
   - Add an Express API wrapper (`/api/match/:id`).
   - Build a simple dashboard to visualize scraped stats.
   - Integrate a caching layer (Redis or file-based) for repeated queries.

Please generate all boilerplate code accordingly, with comments explaining each major section. Start by creating `src/utils/browser-manager.js` and `src/scrapers/base-scraper.js`. Proceed one module at a time, and after each file run a quick example invocation to verify connectivity to Flashscore.com.
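To make the throttling requirement in step 3 concrete, here is a minimal sketch of the randomized-delay helper that `base-scraper.js` could expose between page actions. The names `randomDelayMs` and `humanPause` are illustrative, not part of any existing API:

```javascript
// Hypothetical throttling helpers for base-scraper.js (names are illustrative).

// Pick a random whole-millisecond delay inside the 2-8 s window from step 3.
function randomDelayMs(minMs = 2000, maxMs = 8000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Awaitable pause to insert between scraping actions; resolves with the delay used.
function humanPause(minMs = 2000, maxMs = 8000) {
  const ms = randomDelayMs(minMs, maxMs);
  return new Promise((resolve) => setTimeout(() => resolve(ms), ms));
}

module.exports = { randomDelayMs, humanPause };
```

In the scraper you would `await humanPause()` before each navigation or click so request timing never looks machine-regular.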
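Step 5's normalization duties in `data-processor.js` can be sketched as a pure function. The input field names (`rawDate`, `homeScore`, `awayScore`) are assumptions for illustration, not Flashscore's actual markup:

```javascript
// Hypothetical normalizer for data-processor.js; field names are illustrative.
// Converts a raw scraped record into an ISO date plus validated integer scores.
function normalizeMatch(raw) {
  const date = new Date(raw.rawDate);
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Unparseable date: ${raw.rawDate}`);
  }

  // Scores arrive as strings from the DOM; coerce and reject anything
  // that is not a non-negative integer.
  const toScore = (value) => {
    const n = Number(value);
    if (!Number.isInteger(n) || n < 0) {
      throw new Error(`Invalid score: ${value}`);
    }
    return n;
  };

  return {
    date: date.toISOString(),
    homeScore: toScore(raw.homeScore),
    awayScore: toScore(raw.awayScore),
  };
}

module.exports = { normalizeMatch };
```

Failing fast on bad dates or scores keeps malformed records out of `data/matches/` instead of silently writing garbage JSON.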
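The retry policy from step 7 (3 attempts, exponential backoff) is also easy to sketch as a generic wrapper. `withRetry` is a hypothetical helper name; any scraper call can be passed through it:

```javascript
// Hypothetical retry wrapper for step 7 (name is illustrative).
// Runs `task` up to `attempts` times, doubling the wait after each failure.
async function withRetry(task, { attempts = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        const delay = baseDelayMs * 2 ** (attempt - 1); // 1 s, 2 s, 4 s, ...
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // all attempts exhausted; surface the final failure
}

module.exports = { withRetry };
```

Usage would look like `await withRetry(() => page.goto(matchUrl))`, so transient network or selector timeouts are absorbed while persistent failures still propagate to the logger.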