
CyberScraper-2077 Project Scan & Integration Analysis

1. Project Overview

CyberScraper-2077 is an AI-powered web scraping tool with a futuristic cyberpunk theme. It uses Large Language Models (LLMs) from providers such as OpenAI, Gemini, and Ollama to extract structured data from websites. It has a Streamlit-based UI, supports Tor for .onion sites, and currently fetches pages with a custom PlaywrightScraper (based on patchright) for stealthy scraping.

2. Architecture Analysis

Core Components

  • Frontend (main.py, app/): Built with Streamlit. Handles user interaction, chat interface, and displays scraped data. Manages session state and API keys.
  • Orchestrator (src/web_extractor.py): The central logic (WebExtractor class). It coordinates:
    • Fetching: Calls PlaywrightScraper or TorScraper to get raw HTML.
    • Preprocessing: Cleans HTML using BeautifulSoup.
    • Extraction: Uses LangChain and LLMs to parse content and answer user queries.
    • Agentic Loop: Experimental support for iterative investigation using browser_tools.
  • Scraping Engine (src/scrapers/):
    • PlaywrightScraper: A custom implementation using patchright (undetected Playwright). Features:
      • Stealth: Uses persistent contexts and patchright's built-in stealth.
      • Concurrency: scrape_multiple_pages handles basic pagination.
      • CAPTCHA: Manual handle_captcha method.
    • TorScraper: Uses requests with SOCKS proxy for .onion sites.
  • Utilities:
    • src/models.py / src/ollama_models.py: Model management.
    • src/utils/: Error handling, Google Sheets integration.
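
The split between PlaywrightScraper and TorScraper described above could be sketched as a small routing step. This is a minimal sketch, not the project's actual code: the helper names (is_onion, tor_proxies, pick_scraper) are hypothetical, routing by the .onion suffix is an assumption, and the proxy address assumes a local Tor daemon on its default SOCKS port (9050).

```python
from urllib.parse import urlparse

# Assumption: a local Tor daemon listening on its default SOCKS port.
TOR_SOCKS_HOST = "127.0.0.1"
TOR_SOCKS_PORT = 9050


def is_onion(url: str) -> bool:
    """Return True when the URL's hostname ends in .onion."""
    host = urlparse(url).hostname or ""
    return host.endswith(".onion")


def tor_proxies(host: str = TOR_SOCKS_HOST, port: int = TOR_SOCKS_PORT) -> dict:
    """Build a requests-style proxies mapping for Tor.

    The socks5h scheme asks the proxy to resolve hostnames, which
    .onion addresses need because they have no public DNS record.
    """
    proxy = f"socks5h://{host}:{port}"
    return {"http": proxy, "https": proxy}


def pick_scraper(url: str) -> str:
    """Hypothetical routing rule: name the scraper that should fetch the URL."""
    return "TorScraper" if is_onion(url) else "PlaywrightScraper"
```

With requests installed together with its SOCKS extra, the mapping would be passed as requests.get(url, proxies=tor_proxies()).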

Data Flow

  1. User inputs URL/Query in Streamlit.
  2. StreamlitWebScraperChat initializes WebExtractor.
  3. WebExtractor determines whether the input is a new URL or a follow-up chat message.
  4. If new URL:
    • PlaywrightScraper.fetch_content is called.
    • Content is preprocessed and cached.
    • LLM is called with context to answer query or extract data.
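
The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the real WebExtractor: the class name ExtractorSketch, the handle method, and the three injected callables (fetch, preprocess, ask_llm) are hypothetical stand-ins for PlaywrightScraper.fetch_content, the BeautifulSoup cleanup, and the LangChain call.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class ExtractorSketch:
    """Hypothetical stand-in for the fetch -> preprocess -> cache -> LLM flow."""

    def __init__(
        self,
        fetch: Callable[[str], Awaitable[str]],
        preprocess: Callable[[str], str],
        ask_llm: Callable[[str, str], Awaitable[str]],
    ) -> None:
        self._fetch = fetch
        self._preprocess = preprocess
        self._ask_llm = ask_llm
        self._cache: Dict[str, str] = {}  # url -> preprocessed content
        self._current_url: Optional[str] = None

    async def handle(self, query: str, url: Optional[str] = None) -> str:
        # A new URL triggers a fetch unless its content is already cached.
        if url is not None and url != self._current_url:
            if url not in self._cache:
                raw_html = await self._fetch(url)
                self._cache[url] = self._preprocess(raw_html)
            self._current_url = url
        if self._current_url is None:
            raise ValueError("no URL has been loaded yet")
        # Both new-URL requests and follow-up chat reuse the cached context.
        return await self._ask_llm(self._cache[self._current_url], query)
```

A follow-up question passes only the query, so the page is not fetched again.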

3. Scrapling Integration Analysis

The Scrapling folder contains a new, sophisticated scraping library. Integrating it will enhance CyberScraper-2077 by providing:

  • Adaptive Parsing: Robustness against layout changes.
  • Advanced Fetching: StealthyFetcher and DynamicFetcher (wrapping Playwright) for better evasion.
  • Spider Framework: A Scrapy-like Spider class for efficient, concurrent, and persistent crawling of multiple pages.

Key Integration Points

  • Replacement/Augmentation of PlaywrightScraper:
    • Scrapling's StealthyFetcher can replace the manual patchright setup in PlaywrightScraper.
    • Scrapling's Spider can replace the manual scrape_multiple_pages loop, offering better concurrency and state management.
  • Dependencies:
    • Scrapling requires playwright (implied by imports in Scrapling/scrapling/engines/_browsers/_controllers.py).
    • The current repo uses patchright, so the integration must confirm that Scrapling works with the installed environment. Scrapling's pyproject.toml lists both playwright and patchright as optional dependencies, but the code imports playwright. I will need to ensure playwright is installed, or alias the import if patchright is to be used exclusively, since the patchright module name differs from playwright.
  • Async Compatibility:
    • Scrapling supports async fetchers and spiders, compatible with WebExtractor's async nature.
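
Before choosing between installing playwright and aliasing patchright, it may help to check which of the two modules is importable in the target environment. The sketch below uses only the standard library; the function names are hypothetical, and it checks importability only, not version compatibility.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return find_spec(module_name) is not None


def detect_browser_backends() -> dict:
    """Report which browser automation packages are present."""
    return {name: is_importable(name) for name in ("playwright", "patchright")}


if __name__ == "__main__":
    for name, present in detect_browser_backends().items():
        print(f"{name}: {'found' if present else 'missing'}")
```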

4. Proposed Integration Plan

  1. Dependency Management:
    • Install Scrapling in editable mode (pip install -e "Scrapling[fetchers]"; the quotes keep shells such as zsh from expanding the brackets).
    • Ensure compatible versions of playwright/patchright.
  2. Adapter Implementation (src/scrapers/scrapling_adapter.py):
    • Create a ScraplingAdapter class implementing the interface expected by WebExtractor (matching PlaywrightScraper's fetch_content signature).
    • Implement fetch_content to use:
      • StealthyFetcher (or DynamicFetcher) for single URLs.
      • A dynamic Spider subclass for multi-page/crawl tasks.
  3. WebExtractor Update:
    • Modify src/web_extractor.py to instantiate ScraplingAdapter instead of PlaywrightScraper.
    • Map existing config (headless, proxy) to Scrapling configuration.
  4. Verification:
    • Test single page extraction.
    • Test multi-page crawling.
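
One possible shape for the adapter in step 2 is sketched below. It makes several assumptions: fetch_content is taken to be an async method that accepts a URL and returns HTML as a string (the real PlaywrightScraper signature is not shown here), ScraperConfig is a hypothetical container for the headless and proxy settings, and the actual Scrapling call is injected as fetch_fn rather than written out, so no Scrapling API is assumed. The fetch_many method uses plain asyncio.gather with a semaphore as a stand-in for the multi-page path; it is not Scrapling's Spider.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List, Optional


@dataclass
class ScraperConfig:
    """Hypothetical subset of the existing scraper settings."""
    headless: bool = True
    proxy: Optional[str] = None


class ScraplingAdapter:
    """Sketch of an adapter exposing a PlaywrightScraper-like fetch_content."""

    def __init__(
        self,
        fetch_fn: Callable[..., Awaitable[str]],
        config: Optional[ScraperConfig] = None,
    ) -> None:
        # fetch_fn is expected to wrap the chosen Scrapling fetcher and
        # return the page HTML; its keyword names are assumptions.
        self._fetch_fn = fetch_fn
        self.config = config or ScraperConfig()

    def _fetch_kwargs(self) -> Dict[str, Any]:
        """Map the existing config onto keyword arguments for fetch_fn."""
        kwargs: Dict[str, Any] = {"headless": self.config.headless}
        if self.config.proxy:
            kwargs["proxy"] = self.config.proxy
        return kwargs

    async def fetch_content(self, url: str) -> str:
        """Fetch a single URL and return its HTML."""
        return await self._fetch_fn(url, **self._fetch_kwargs())

    async def fetch_many(self, urls: List[str], max_concurrency: int = 3) -> List[str]:
        """Fetch several URLs with bounded concurrency, preserving input order."""
        semaphore = asyncio.Semaphore(max_concurrency)

        async def fetch_one(url: str) -> str:
            async with semaphore:
                return await self.fetch_content(url)

        return await asyncio.gather(*(fetch_one(u) for u in urls))
```

Injecting fetch_fn also keeps the adapter testable without launching a browser, which suits the verification step.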

5. Potential Issues & Risks

  • Dependency Conflict: Scrapling might pull a different version of playwright or patchright.
  • Browser Management: Scrapling manages its own browser instances. I need to ensure this works cleanly with Streamlit's lifecycle (though WebExtractor creates a new scraper instance).
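
One way to reduce the lifecycle risk is to run each async scraping call on its own event loop in a worker thread, so the call does not depend on whatever loop state the Streamlit script thread has. This is a sketch of that general technique, not something the project is known to do; the helper name run_coroutine_blocking is hypothetical.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine


def run_coroutine_blocking(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine to completion on a fresh event loop in a worker thread.

    asyncio.run creates and closes its own loop inside the worker, so the
    caller's thread is only blocked, never asked to host the loop.
    Exceptions raised by the coroutine propagate to the caller.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

The trade-off is that each call pays for a new thread and loop, and any browser instance must be created and closed within the same call.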