Here’s a ready-to-use “meta-prompt” you can feed into your AI agent to kick off the local build of your Flashscore scraper:

You are a Senior JavaScript Automation Engineer. Your task is to scaffold and implement, step by step, a local Flashscore data-scraping tool in Node.js, using Playwright (or Puppeteer) and Cheerio. Follow these requirements exactly:

1. **Project Initialization**
   - Create a new npm project (`npm init -y`).
   - Install dependencies:
     ```bash
     npm install playwright cheerio axios dotenv fs-extra node-cron
     ```
2. **File Structure**
   Build this directory tree:

       flashscore-scraper/
       ├── src/
       │   ├── scrapers/
       │   │   ├── base-scraper.js      # launches browser, handles sessions, stealth
       │   │   ├── match-summary.js     # extracts match info & events
       │   │   └── lineups.js           # extracts formations & lineups
       │   ├── utils/
       │   │   ├── browser-manager.js   # singleton browser/context manager
       │   │   ├── data-processor.js    # cleans & normalizes scraped data
       │   │   └── proxy-manager.js     # rotates proxies & delays
       │   ├── models/
       │   │   ├── match-data.js        # JS class/schema for match summary
       │   │   └── team-data.js         # JS class/schema for lineup data
       │   └── index.js                 # CLI entrypoint & cron scheduler
       ├── config/
       │   └── settings.js              # base URL, selectors, proxy list, cron schedule
       ├── data/
       │   ├── matches/                 # JSON output files
       │   └── cache/                   # temporary HTML snapshots
       └── package.json

3. **Stealth & Throttling**
   In `base-scraper.js`, implement:
   - Realistic `User-Agent`, random delays (2–8 s) between actions.
   - Puppeteer extra stealth plugin or Playwright stealth options.
   - Proxy rotation every 50 requests.
   - Block images & ads via request interception.
4. **Scraper Modules**
   - **match-summary.js**: Navigate to a match URL, wait for `.match-summary` selector, scrape:
     - Teams, final score, date & time, half-time score.
     - Events array: goals (scorer/time/assist), cards, substitutions, injuries.
   - **lineups.js**: Navigate to `/lineups`, wait for lineup container, scrape:
     - Starting XI, substitutes, coaching staff, formation map.
5. **Data Models & Processing**
   - Define `MatchData` and `TeamData` classes with clear fields.
   - In `data-processor.js`, normalize time stamps, convert date strings to ISO, validate numeric scores.
6. **Scheduling & CLI**
   - In `index.js`, read a match URL from CLI or `.env`.
   - Schedule daily runs via `node-cron` (configurable cron expression).
   - Save JSON to `data/matches/<matchId>.json`.
7. **Error Handling & Logging**
   - Retry up to 3 times on network or selector errors with exponential backoff.
   - Log successes and failures to a rotating log file in `data/logs/`.
8. **Next Steps (after MVP)**
   - Add an Express API wrapper (`/api/match/:id`).
   - Build a simple dashboard to visualize scraped stats.
   - Integrate a caching layer (Redis or file-based) for repeated queries.

Please generate all boilerplate code accordingly, with comments explaining each major section. Start by creating `src/utils/browser-manager.js` and `src/scrapers/base-scraper.js`. Proceed one module at a time, and after each file, run a quick example invocation to verify connectivity to Flashscore.com.

- Initial Deployment
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Flashscore Scraper Project | Documentation</title>
<script src="https://cdn.tailwindcss.com"></script>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.0/css/all.min.css">
<script>
tailwind.config = {
theme: {
extend: {
colors: {
primary: '#0d9488',
secondary: '#134e4a',
accent: '#14b8a6'
}
}
}
}
</script>
<style>
.file-tree li {
position: relative;
padding-left: 1.5rem;
}
.file-tree li:before {
content: '';
position: absolute;
left: 0;
top: 0;
bottom: 0;
width: 1px;
background-color: #cbd5e1;
}
.file-tree li:after {
content: '';
position: absolute;
left: 0;
top: 12px;
height: 1px;
width: 10px;
background-color: #cbd5e1;
}
.file-tree .folder:before {
font-family: "Font Awesome 6 Free";
content: "\f07b";
position: absolute;
left: -1.5rem;
top: 0;
font-weight: 900;
color: #0d9488;
}
.file-tree .file:before {
font-family: "Font Awesome 6 Free";
content: "\f15b";
position: absolute;
left: -1.5rem;
top: 0;
font-weight: 400;
color: #94a3b8;
}
.code-header {
font-family: ui-monospace, SFMono-Regular, Menlo, Monaco, Consolas, "Liberation Mono", "Courier New", monospace;
}
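pre {
font-size: 0.875rem;
line-height: 1.5rem;
overflow-x: auto;
padding: 1rem;
/* Light text: the page default (slate-800) is the same colour as this background */
color: #e2e8f0;
background-color: #1e293b;
border-radius: 0.5rem;
margin: 0;
}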
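.code-container {
display: none;
}
/* The file-selection script toggles this class to show one code view at a time */
.code-container.active {
display: block;
}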
.active-file {
background-color: #f8fafc;
}
@media (max-width: 768px) {
.sidebar-container {
max-height: 300px;
overflow-y: auto;
}
}
</style>
</head>
<body class="bg-slate-50 text-slate-800">
<!-- Navigation -->
<nav class="bg-gradient-to-r from-secondary to-primary text-white p-4 shadow-lg">
<div class="container mx-auto flex justify-between items-center">
<div class="flex items-center">
<i class="fas fa-database text-2xl mr-3"></i>
<h1 class="text-2xl font-bold">Flashscore Scraper</h1>
</div>
<div>
<span class="bg-white text-primary px-3 py-1 rounded-full text-sm font-bold">Node.js v18</span>
</div>
</div>
</nav>
<!-- Header -->
<header class="bg-gradient-to-r from-secondary/5 to-primary/5 py-12">
<div class="container mx-auto px-4">
<div class="max-w-4xl mx-auto text-center">
<h2 class="text-4xl font-bold text-slate-800 mb-4">Sports Data Extraction Tool</h2>
<p class="text-lg text-slate-600 mb-6">
Comprehensive Flashscore.com scraper built with Playwright and Cheerio.
Collects match data, lineups, statistics and schedules.
</p>
<div class="flex flex-wrap justify-center gap-3">
<span class="bg-white px-3 py-1 rounded-full text-sm font-medium border border-primary/30">
<i class="fas fa-play mr-1"></i> Playwright
</span>
<span class="bg-white px-3 py-1 rounded-full text-sm font-medium border border-primary/30">
<i class="fas fa-filter mr-1"></i> Cheerio
</span>
<span class="bg-white px-3 py-1 rounded-full text-sm font-medium border border-primary/30">
<i class="fas fa-clock mr-1"></i> CRON Scheduling
</span>
<span class="bg-white px-3 py-1 rounded-full text-sm font-medium border border-primary/30">
<i class="fas fa-robot mr-1"></i> Stealth Mode
</span>
</div>
</div>
</div>
</header>
<div class="container mx-auto px-4 py-8">
<!-- Project Structure -->
<section class="bg-white rounded-xl shadow-lg overflow-hidden mb-8">
<div class="border-b border-slate-200 py-4 px-6 flex items-center">
<i class="fas fa-sitemap text-primary mr-3"></i>
<h3 class="text-xl font-bold text-slate-800">Project Structure</h3>
</div>
<div class="p-6">
<div class="grid grid-cols-1 md:grid-cols-12 gap-6">
<div class="md:col-span-4 sidebar-container">
<div class="bg-slate-50 p-4 rounded-lg">
<h4 class="font-bold text-primary mb-3">Project Files</h4>
<ul class="file-tree text-sm space-y-1">
<li class="folder">flashscore-scraper
<ul class="pl-4 space-y-1">
<li class="folder">config
<ul class="pl-4 space-y-1">
<li class="file" data-file="settings.js">settings.js</li>
</ul>
</li>
<li class="folder">data
<ul class="pl-4 space-y-1">
<li class="folder">matches</li>
<li class="folder">cache</li>
</ul>
</li>
<li class="folder">src
<ul class="pl-4 space-y-1">
<li class="folder">models
<ul class="pl-4 space-y-1">
<li class="file" data-file="match-data.js">match-data.js</li>
<li class="file" data-file="team-data.js">team-data.js</li>
</ul>
</li>
<li class="folder">scrapers
<ul class="pl-4 space-y-1">
<li class="file" data-file="base-scraper.js">base-scraper.js</li>
<li class="file" data-file="match-summary.js">match-summary.js</li>
<li class="file" data-file="lineups.js">lineups.js</li>
</ul>
</li>
<li class="folder">utils
<ul class="pl-4 space-y-1">
<li class="file" data-file="browser-manager.js">browser-manager.js</li>
<li class="file" data-file="data-processor.js">data-processor.js</li>
<li class="file" data-file="proxy-manager.js">proxy-manager.js</li>
</ul>
</li>
<li class="file" data-file="index.js">index.js</li>
</ul>
</li>
<li class="file" data-file="package.json">package.json</li>
</ul>
</li>
</ul>
</div>
</div>
<div class="md:col-span-8">
<!-- Code View Tabs -->
<div class="code-container active" id="base-scraper.js">
<div class="code-header bg-slate-800 text-slate-200 px-4 py-2 rounded-t-lg flex justify-between">
<div>
<i class="far fa-file-code mr-2"></i>
<span class="font-mono">src/scrapers/base-scraper.js</span>
</div>
<div>
<span class="text-green-400">•</span>
<span class="text-xs ml-1">JavaScript</span>
</div>
</div>
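<pre class="rounded-b-lg"><code class="language-javascript">// Extra packages used here: playwright-extra, puppeteer-extra-plugin-stealth, user-agents
const { chromium } = require('playwright-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
const UserAgent = require('user-agents');
const ProxyManager = require('../utils/proxy-manager');

// The Puppeteer stealth plugin cannot patch a Playwright page directly.
// playwright-extra applies it to every browser launched through `chromium`.
chromium.use(StealthPlugin());

class BaseScraper {
  constructor() {
    this.proxyManager = new ProxyManager();
    // Desktop user agents only, to stay consistent with the 1920x1080 viewport
    this.userAgent = new UserAgent({ deviceCategory: 'desktop' });
    this.browser = null;
  }

  async launchBrowser() {
    const proxy = this.proxyManager.getNextProxy();
    this.browser = await chromium.launch({
      headless: true,
      // Launch without a proxy when the pool is empty
      proxy: proxy ? { server: proxy } : undefined,
      args: [
        '--disable-blink-features=AutomationControlled',
        '--no-sandbox'
      ],
    });
    this.context = await this.browser.newContext({
      userAgent: this.userAgent.toString(),
      viewport: { width: 1920, height: 1080 },
    });
    this.page = await this.context.newPage();

    // Block images and requests to common ad hosts
    await this.page.route('**/*', route => {
      const request = route.request();
      const isImage = request.resourceType() === 'image';
      const isAd = /(adservice|googleadservices|doubleclick)\./.test(request.url());
      return isImage || isAd ? route.abort() : route.continue();
    });
  }

  async navigateTo(url) {
    // Live-score pages keep polling, so 'networkidle' may never be reached
    await this.page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60000 });
    await this.randomDelay(2000, 8000);
  }

  // Waits a random number of milliseconds between min and max (inclusive)
  async randomDelay(min, max) {
    const delay = Math.floor(Math.random() * (max - min + 1)) + min;
    await new Promise(resolve => setTimeout(resolve, delay));
  }

  async closeBrowser() {
    // Guard against a failed launch so cleanup never masks the original error
    if (this.browser) {
      await this.browser.close();
      this.browser = null;
    }
  }
}

module.exports = BaseScraper;</code></pre>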
</div>
<div class="code-container" id="browser-manager.js">
<div class="code-header bg-slate-800 text-slate-200 px-4 py-2 rounded-t-lg flex justify-between">
<div>
<i class="far fa-file-code mr-2"></i>
<span class="font-mono">src/utils/browser-manager.js</span>
</div>
<div>
<span class="text-green-400">•</span>
<span class="text-xs ml-1">JavaScript</span>
</div>
</div>
<pre class="rounded-b-lg"><code class="language-javascript">const { chromium } = require('playwright');

const singleton = Symbol();
const singletonEnforcer = Symbol();

class BrowserManager {
  constructor(enforcer) {
    if (enforcer !== singletonEnforcer) {
      throw new Error('Cannot construct singleton');
    }
    this.browser = null;
  }

  static get instance() {
    if (!this[singleton]) {
      this[singleton] = new BrowserManager(singletonEnforcer);
    }
    return this[singleton];
  }

  async launch() {
    if (!this.browser || !this.browser.isConnected()) {
      this.browser = await chromium.launch({
        headless: true,
        args: [
          '--disable-blink-features=AutomationControlled',
          '--no-sandbox'
        ],
      });
    }
    return this.browser;
  }

  async newContext() {
    const browser = await this.launch();
    return browser.newContext({
      viewport: { width: 1920, height: 1080 },
      userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    });
  }

  async close() {
    if (this.browser) {
      await this.browser.close();
      this.browser = null;
    }
  }
}

module.exports = BrowserManager;</code></pre>
</div>
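<div class="code-container" id="proxy-manager.js">
<p class="text-slate-600 mb-4">The base scraper calls <code class="bg-slate-100 px-2 py-1 rounded">new ProxyManager()</code> and <code class="bg-slate-100 px-2 py-1 rounded">getNextProxy()</code>, but this page does not show that module. What follows is a minimal sketch, not the project's actual file. It assumes proxies arrive as a comma-separated <code class="bg-slate-100 px-2 py-1 rounded">PROXY_SERVERS</code> variable and that one proxy serves a fixed number of requests before the next one takes over. The <code class="bg-slate-100 px-2 py-1 rounded">parseProxyList</code> helper and the <code class="bg-slate-100 px-2 py-1 rounded">requestsPerProxy</code> parameter are hypothetical names.</p>
<div class="code-header bg-slate-800 text-slate-200 px-4 py-2 rounded-t-lg flex justify-between">
<div>
<i class="far fa-file-code mr-2"></i>
<span class="font-mono">src/utils/proxy-manager.js (sketch)</span>
</div>
<div>
<span class="text-green-400">•</span>
<span class="text-xs ml-1">JavaScript</span>
</div>
</div>
<pre class="rounded-b-lg text-slate-200 p-4"><code class="language-javascript">
```javascript
// Hypothetical sketch of src/utils/proxy-manager.js (an assumption, not an original project file).

// Splits "http://a:1, http://b:2" into a clean array of proxy URLs.
function parseProxyList(raw = '') {
  return raw
    .split(',')
    .map(entry => entry.trim())
    .filter(Boolean);
}

class ProxyManager {
  // proxies: array of proxy URLs; requestsPerProxy: calls served before rotating.
  constructor(proxies = parseProxyList(process.env.PROXY_SERVERS), requestsPerProxy = 50) {
    this.proxies = proxies;
    this.requestsPerProxy = requestsPerProxy;
    this.requestCount = 0;
  }

  // Returns the proxy for the current request, or undefined when the pool is empty.
  getNextProxy() {
    if (this.proxies.length === 0) {
      return undefined;
    }
    const index = Math.floor(this.requestCount / this.requestsPerProxy) % this.proxies.length;
    this.requestCount += 1;
    return this.proxies[index];
  }
}

module.exports = ProxyManager;
```
</code></pre>
<p class="text-slate-600 mt-4">With the default of 50, the first 50 calls return the first proxy, the next 50 return the second, and the sequence wraps around at the end of the list. The counter lives on the instance, so rotation across several scrapers only works if they share one manager.</p>
</div>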
<div class="code-container" id="match-summary.js">
<div class="code-header bg-slate-800 text-slate-200 px-4 py-2 rounded-t-lg flex justify-between">
<div>
<i class="far fa-file-code mr-2"></i>
<span class="font-mono">src/scrapers/match-summary.js</span>
</div>
<div>
<span class="text-green-400">•</span>
<span class="text-xs ml-1">JavaScript</span>
</div>
</div>
<pre class="rounded-b-lg"><code class="language-javascript">const BaseScraper = require('./base-scraper');
const cheerio = require('cheerio');
const MatchData = require('../models/match-data');
const DataProcessor = require('../utils/data-processor');

// NOTE: every CSS selector in this file is a placeholder. Verify each one
// against the live page markup before relying on the output.
class MatchSummaryScraper extends BaseScraper {
  constructor(matchUrl) {
    super();
    this.matchUrl = matchUrl;
    this.matchData = new MatchData();
  }

  async scrape() {
    try {
      await this.launchBrowser();
      await this.navigateTo(this.matchUrl);
      await this.page.waitForSelector('.match-summary', { timeout: 10000 });
      const html = await this.page.content();
      this.matchData = this.parseMatchSummary(html);
      return this.matchData;
    } catch (error) {
      console.error('Error scraping match summary:', error);
      throw error;
    } finally {
      // Runs on success and on failure, so the browser is closed exactly once
      await this.closeBrowser();
    }
  }

  parseMatchSummary(html) {
    const $ = cheerio.load(html);
    const match = new MatchData();

    // Parse teams and scores
    match.homeTeam = $('.home-team-name').text().trim();
    match.awayTeam = $('.away-team-name').text().trim();
    match.score = $('.score').text().trim();
    match.halfTimeScore = $('.half-time-score').text().trim();

    // Parse date and time
    match.date = $('.match-date').attr('data-date');
    match.time = $('.match-time').attr('data-time');

    // Parse match events
    $('.event-row').each((i, element) => {
      // attr() returns undefined when the row has no class attribute
      const rowClass = $(element).attr('class') || '';
      match.addEvent({
        type: $(element).find('.event-type').text().trim(),
        time: $(element).find('.event-time').text().trim(),
        player: $(element).find('.event-player').text().trim(),
        team: rowClass.includes('home') ? 'home' : 'away'
      });
    });

    // Data normalization
    match.date = DataProcessor.normalizeDate(match.date);
    match.events = DataProcessor.normalizeEvents(match.events);
    return match;
  }
}

module.exports = MatchSummaryScraper;</code></pre>
</div>
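<div class="code-container" id="data-processor.js">
<p class="text-slate-600 mb-4">The match summary scraper relies on <code class="bg-slate-100 px-2 py-1 rounded">DataProcessor.normalizeDate</code> and <code class="bg-slate-100 px-2 py-1 rounded">DataProcessor.normalizeEvents</code>, which are not shown on this page. The sketch below is one possible shape for them, under stated assumptions: dates arrive in a format the built-in <code class="bg-slate-100 px-2 py-1 rounded">Date</code> parser accepts (ISO 8601 is the only format parsed consistently across runtimes), and event times look like <code class="bg-slate-100 px-2 py-1 rounded">45+2'</code>. The <code class="bg-slate-100 px-2 py-1 rounded">normalizeMinute</code> and <code class="bg-slate-100 px-2 py-1 rounded">parseScore</code> helpers are hypothetical additions.</p>
<div class="code-header bg-slate-800 text-slate-200 px-4 py-2 rounded-t-lg flex justify-between">
<div>
<i class="far fa-file-code mr-2"></i>
<span class="font-mono">src/utils/data-processor.js (sketch)</span>
</div>
<div>
<span class="text-green-400">•</span>
<span class="text-xs ml-1">JavaScript</span>
</div>
</div>
<pre class="rounded-b-lg text-slate-200 p-4"><code class="language-javascript">
```javascript
// Hypothetical sketch of src/utils/data-processor.js; the input formats are assumptions.
class DataProcessor {
  // Converts a date string to ISO 8601, or returns null when it cannot be parsed.
  static normalizeDate(value) {
    if (!value) {
      return null;
    }
    const parsed = new Date(value);
    return Number.isNaN(parsed.getTime()) ? null : parsed.toISOString();
  }

  // Splits a clock label such as "45+2'" into regular and added minutes.
  static normalizeMinute(label) {
    const match = /^(\d+)(?:\+(\d+))?/.exec(String(label).trim());
    if (!match) {
      return { minute: null, addedTime: 0 };
    }
    return { minute: Number(match[1]), addedTime: Number(match[2] || 0) };
  }

  // Returns cleaned copies of the events; the input array is not mutated.
  static normalizeEvents(events = []) {
    return events.map(event => ({
      ...event,
      type: String(event.type || '').trim().toLowerCase(),
      player: String(event.player || '').trim(),
      ...DataProcessor.normalizeMinute(event.time),
    }));
  }

  // Parses a score label such as "2 - 1"; returns null unless both sides are numeric.
  static parseScore(label) {
    const match = /^(\d+)\s*[-:]\s*(\d+)$/.exec(String(label).trim());
    return match ? { home: Number(match[1]), away: Number(match[2]) } : null;
  }
}

module.exports = DataProcessor;
```
</code></pre>
<p class="text-slate-600 mt-4">Returning <code class="bg-slate-100 px-2 py-1 rounded">null</code> for unparseable input, rather than throwing, lets a single malformed field pass through as missing data without failing the whole match record.</p>
</div>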
<div class="code-container" id="match-data.js">
<div class="code-header bg-slate-800 text-slate-200 px-4 py-2 rounded-t-lg flex justify-between">
<div>
<i class="far fa-file-code mr-2"></i>
<span class="font-mono">src/models/match-data.js</span>
</div>
<div>
<span class="text-green-400">•</span>
<span class="text-xs ml-1">JavaScript</span>
</div>
</div>
<pre class="rounded-b-lg"><code class="language-javascript">class MatchData {
  constructor() {
    this.id = '';
    this.homeTeam = '';
    this.awayTeam = '';
    this.competition = '';
    this.status = '';
    this.date = '';
    this.time = '';
    this.score = '';
    this.halfTimeScore = '';
    this.venue = '';
    this.attendance = 0;
    this.referee = '';
    this.events = [];
    this.statistics = {};
    this.lastUpdated = new Date();
  }

  addEvent(event) {
    this.events.push(event);
  }

  addStatistic(type, value) {
    this.statistics[type] = value;
  }

  toJSON() {
    return {
      id: this.id,
      homeTeam: this.homeTeam,
      awayTeam: this.awayTeam,
      competition: this.competition,
      status: this.status,
      date: this.date,
      time: this.time,
      score: this.score,
      halfTimeScore: this.halfTimeScore,
      venue: this.venue,
      attendance: this.attendance,
      referee: this.referee,
      events: this.events,
      statistics: this.statistics,
      lastUpdated: this.lastUpdated.toISOString()
    };
  }
}

module.exports = MatchData;</code></pre>
</div>
</div>
</div>
</div>
</section>
<!-- Features -->
<section class="mb-12">
<h3 class="text-2xl font-bold text-center mb-8 text-slate-800">Project Features</h3>
<div class="grid grid-cols-1 md:grid-cols-3 gap-6">
<!-- Feature 1 -->
<div class="bg-white rounded-xl shadow-lg p-6 transition-all hover:shadow-xl">
<div class="w-16 h-16 bg-primary/10 rounded-full flex items-center justify-center mb-4">
<i class="fas fa-user-secret text-primary text-2xl"></i>
</div>
<h4 class="text-xl font-bold mb-2 text-slate-800">Stealth Mode</h4>
<p class="text-slate-600">Avoid detection with advanced techniques like randomized user agents, request delays, and proxy rotation.</p>
</div>
<!-- Feature 2 -->
<div class="bg-white rounded-xl shadow-lg p-6 transition-all hover:shadow-xl">
<div class="w-16 h-16 bg-primary/10 rounded-full flex items-center justify-center mb-4">
<i class="fas fa-history text-primary text-2xl"></i>
</div>
<h4 class="text-xl font-bold mb-2 text-slate-800">Scheduled Scraping</h4>
<p class="text-slate-600">Regularly collect data using cron scheduling and automated retries with exponential backoff.</p>
</div>
<!-- Feature 3 -->
<div class="bg-white rounded-xl shadow-lg p-6 transition-all hover:shadow-xl">
<div class="w-16 h-16 bg-primary/10 rounded-full flex items-center justify-center mb-4">
<i class="fas fa-th-large text-primary text-2xl"></i>
</div>
<h4 class="text-xl font-bold mb-2 text-slate-800">Modular Architecture</h4>
<p class="text-slate-600">Clean separation of concerns with independent modules for scraping, data processing, and utilities.</p>
</div>
</div>
</section>
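<section class="bg-white rounded-xl shadow-lg p-6 mb-12">
<h4 class="text-xl font-bold mb-2 text-slate-800">Retries With Exponential Backoff</h4>
<p class="text-slate-600 mb-4">The feature list mentions automated retries with exponential backoff, and none of the files shown on this page implement them. A minimal sketch of the idea follows, assuming up to three retries and a delay that doubles after each failure. The names <code class="bg-slate-100 px-2 py-1 rounded">withRetry</code>, <code class="bg-slate-100 px-2 py-1 rounded">backoffDelay</code>, <code class="bg-slate-100 px-2 py-1 rounded">retries</code> and <code class="bg-slate-100 px-2 py-1 rounded">baseDelayMs</code> are hypothetical.</p>
<pre class="bg-slate-800 text-green-400 rounded-lg p-4"><code class="language-javascript">
```javascript
// Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ...
function backoffDelay(attempt, baseDelayMs = 1000) {
  return baseDelayMs * 2 ** (attempt - 1);
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Runs `task` until it resolves, allowing up to `retries` retries after the first failure.
// The last error is rethrown once the retries are used up.
async function withRetry(task, { retries = 3, baseDelayMs = 1000 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await task();
    } catch (error) {
      if (attempt >= retries) {
        throw error;
      }
      await sleep(backoffDelay(attempt + 1, baseDelayMs));
    }
  }
}
```
</code></pre>
<p class="text-slate-600 mt-4">A scraper call could then be wrapped as <code class="bg-slate-100 px-2 py-1 rounded">withRetry(() => scraper.scrape())</code>. With the defaults, the waits between attempts are 1 s, 2 s and 4 s.</p>
</section>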
<!-- Installation -->
<section class="bg-gradient-to-r from-primary/10 to-secondary/10 rounded-xl p-8 mb-12">
<div class="max-w-4xl mx-auto">
<h3 class="text-2xl font-bold text-center mb-6 text-slate-800">Installation & Usage</h3>
<div class="bg-white rounded-xl shadow-lg p-6 mb-6">
<div class="flex items-start">
<div class="w-10 h-10 rounded-full bg-primary text-white flex items-center justify-center mr-4 flex-shrink-0">
1
</div>
<div>
<h4 class="text-xl font-bold mb-2 text-slate-800">Initialize Project</h4>
<pre class="bg-slate-800 text-green-400 rounded-lg p-4"><code class="language-bash"># Create project directory
mkdir flashscore-scraper
cd flashscore-scraper
# Initialize npm project
npm init -y
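# Install dependencies (playwright-extra, puppeteer-extra-plugin-stealth and user-agents back the stealth setup in base-scraper.js)
npm install playwright cheerio axios dotenv fs-extra node-cron playwright-extra puppeteer-extra-plugin-stealth user-agents
# Download the Chromium build that Playwright drives
npx playwright install chromium</code></pre>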
</div>
</div>
</div>
<div class="bg-white rounded-xl shadow-lg p-6 mb-6">
<div class="flex items-start">
<div class="w-10 h-10 rounded-full bg-primary text-white flex items-center justify-center mr-4 flex-shrink-0">
2
</div>
<div>
<h4 class="text-xl font-bold mb-2 text-slate-800">Configure Environment</h4>
<p class="text-slate-600 mb-4">Create a <code class="bg-slate-100 px-2 py-1 rounded">.env</code> file with your configuration:</p>
<pre class="bg-slate-800 text-yellow-300 rounded-lg p-4"><code class="language-bash"># Proxy configuration
PROXY_SERVERS="http://user:pass@proxy1.com:8080,http://user:pass@proxy2.com:8080"
# Flashscore base URL
BASE_URL="https://www.flashscore.com"
# Schedule - every day at midnight
CRON_SCHEDULE="0 0 * * *"
# Output directory
DATA_DIR="./data"</code></pre>
</div>
</div>
</div>
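<div class="bg-white rounded-xl shadow-lg p-6 mb-6">
<p class="text-slate-600 mb-4">To show how the variables above might be consumed, here is a rough sketch of a settings loader. It is an assumption about what <code class="bg-slate-100 px-2 py-1 rounded">config/settings.js</code> could contain, not the project's actual file, and <code class="bg-slate-100 px-2 py-1 rounded">loadSettings</code> is a hypothetical name. In a real run, <code class="bg-slate-100 px-2 py-1 rounded">require('dotenv').config()</code> would be called first so that the <code class="bg-slate-100 px-2 py-1 rounded">.env</code> values reach <code class="bg-slate-100 px-2 py-1 rounded">process.env</code>.</p>
<pre class="bg-slate-800 text-green-400 rounded-lg p-4"><code class="language-javascript">
```javascript
// Hypothetical settings loader. Call require('dotenv').config() beforehand
// so that values from .env are present in process.env.
function loadSettings(env = process.env) {
  // "http://a:1, http://b:2" becomes ['http://a:1', 'http://b:2']
  const proxies = (env.PROXY_SERVERS || '')
    .split(',')
    .map(entry => entry.trim())
    .filter(Boolean);

  // Fallbacks mirror the sample .env shown above.
  return {
    baseUrl: env.BASE_URL || 'https://www.flashscore.com',
    cronSchedule: env.CRON_SCHEDULE || '0 0 * * *',
    dataDir: env.DATA_DIR || './data',
    proxies,
  };
}

module.exports = { loadSettings };
```
</code></pre>
<p class="text-slate-600 mt-4">Passing the environment in as an argument keeps the function easy to exercise with a plain object instead of real environment variables.</p>
</div>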
<div class="bg-white rounded-xl shadow-lg p-6">
<div class="flex items-start">
<div class="w-10 h-10 rounded-full bg-primary text-white flex items-center justify-center mr-4 flex-shrink-0">
3
</div>
<div>
<h4 class="text-xl font-bold mb-2 text-slate-800">Run Scraper</h4>
<p class="text-slate-600 mb-4">Run the script once for a single match, or start it in scheduled mode:</p>
<pre class="bg-slate-800 text-amber-200 rounded-lg p-4"><code class="language-bash"># Run once for a specific match
node src/index.js --matchId "123456"
# Or keep the process running and scrape on the CRON_SCHEDULE from .env
node src/index.js --cron</code></pre>
</div>
</div>
</div>
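<div class="bg-white rounded-xl shadow-lg p-6 mt-6">
<p class="text-slate-600 mb-4">The two invocations above imply some argument handling in <code class="bg-slate-100 px-2 py-1 rounded">src/index.js</code>, which is not shown on this page. One way it could look, as a sketch only: <code class="bg-slate-100 px-2 py-1 rounded">parseArgs</code> is a hypothetical helper that recognises the <code class="bg-slate-100 px-2 py-1 rounded">--matchId</code> and <code class="bg-slate-100 px-2 py-1 rounded">--cron</code> flags and ignores everything else.</p>
<pre class="bg-slate-800 text-green-400 rounded-lg p-4"><code class="language-javascript">
```javascript
// Hypothetical argument parser for `--matchId VALUE` and `--cron`.
function parseArgs(argv = process.argv.slice(2)) {
  const options = { matchId: null, cron: false };
  argv.forEach((arg, index) => {
    if (arg === '--cron') {
      options.cron = true;
    }
    if (arg === '--matchId') {
      const value = argv[index + 1];
      // A missing value, or another flag in its place, leaves matchId unset.
      const missing = value === undefined || value.startsWith('--');
      options.matchId = missing ? null : value;
    }
  });
  return options;
}

module.exports = { parseArgs };
```
</code></pre>
<p class="text-slate-600 mt-4">The entry point could then branch on the result: start the scheduler when <code class="bg-slate-100 px-2 py-1 rounded">cron</code> is true, or scrape one match when <code class="bg-slate-100 px-2 py-1 rounded">matchId</code> is set.</p>
</div>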
</div>
</section>
<!-- Footer -->
<footer class="bg-slate-900 text-white rounded-xl p-8">
<div class="max-w-6xl mx-auto">
<div class="flex flex-col md:flex-row justify-between items-center">
<div class="mb-6 md:mb-0">
<h3 class="text-2xl font-bold mb-4">Flashscore Scraper</h3>
<p class="text-slate-400">
Robust and scalable data extraction solution<br>
Built for developers by developers
</p>
</div>
<div class="flex space-x-6">
<div>
<h4 class="font-bold text-primary mb-2">Technology Stack</h4>
<ul class="text-slate-400 text-sm space-y-1">
<li><i class="fas fa-play mr-2"></i> Node.js</li>
<li><i class="fas fa-window-restore mr-2"></i> Playwright</li>
<li><i class="fas fa-filter mr-2"></i> Cheerio</li>
</ul>
</div>
<div>
<h4 class="font-bold text-primary mb-2">Documentation</h4>
<ul class="text-slate-400 text-sm space-y-1">
<li><i class="fas fa-book mr-2"></i> GitHub</li>
<li><i class="fas fa-code mr-2"></i> API Reference</li>
<li><i class="fas fa-exclamation-circle mr-2"></i> FAQ</li>
</ul>
</div>
</div>
</div>
<div class="border-t border-slate-800 mt-8 pt-8 text-center text-slate-500">
<p>&copy; 2023 Flashscore Scraper Project. All rights reserved.</p>
</div>
</div>
</footer>
</div>
<script>
document.addEventListener('DOMContentLoaded', function() {
// File selection logic
const fileItems = document.querySelectorAll('.file');
const codeContainers = document.querySelectorAll('.code-container');
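fileItems.forEach(file => {
  file.addEventListener('click', function() {
    const fileId = this.getAttribute('data-file');
    // Ignore files that have no code view on this page (for example settings.js)
    if (!document.getElementById(fileId)) {
      return;
    }
    // Update active file styling
    fileItems.forEach(item => item.classList.remove('active-file'));
    this.classList.add('active-file');
    // Show selected code container
    codeContainers.forEach(container => {
      container.classList.toggle('active', container.id === fileId);
    });
  });
});
// Highlight the file whose code view is marked active in the markup
const activeContainer = document.querySelector('.code-container.active');
if (activeContainer) {
  const activeItem = document.querySelector('.file[data-file="' + activeContainer.id + '"]');
  if (activeItem) {
    activeItem.click();
  }
}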
});
</script>
</body>
</html>