scrapling / PROJECT_SCAN.md

GraziePrego

Upload original Scraper_hub repo as-is

eb37804 verified about 2 months ago

CyberScraper-2077 Project Scan & Integration Analysis

1. Project Overview

CyberScraper-2077 is an AI-powered web scraping tool with a cyberpunk theme. It uses Large Language Model (LLM) backends such as OpenAI, Gemini, and Ollama to extract structured data from websites. It has a Streamlit-based UI, supports Tor for .onion sites, and currently fetches pages with a custom PlaywrightScraper (built on patchright) for stealthy scraping.

2. Architecture Analysis

Core Components

Frontend (main.py, app/): Built with Streamlit. Handles user interaction, chat interface, and displays scraped data. Manages session state and API keys.
Orchestrator (src/web_extractor.py): The central logic (WebExtractor class). It coordinates:
- Fetching: Calls PlaywrightScraper or TorScraper to get raw HTML.
- Preprocessing: Cleans HTML using BeautifulSoup.
- Extraction: Uses LangChain and LLMs to parse content and answer user queries.
- Agentic Loop: An experimental loop that investigates a site iteratively using browser_tools.
Scraping Engine (src/scrapers/):
- PlaywrightScraper: A custom implementation using patchright (undetected Playwright). Features:
  - Stealth: Uses persistent contexts and patchright's built-in stealth.
  - Concurrency: scrape_multiple_pages handles basic pagination.
  - CAPTCHA: Manual handle_captcha method.
- TorScraper: Uses requests with SOCKS proxy for .onion sites.
Utilities:
- src/models.py / src/ollama_models.py: Model management.
- src/utils/: Error handling, Google Sheets integration.
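
To make the preprocessing step concrete, here is a minimal sketch of reducing raw HTML to visible text before it reaches the LLM. The project itself uses BeautifulSoup for this; the sketch swaps in the standard library's html.parser so it has no dependencies. The names preprocess_html and _TextExtractor are hypothetical and do not come from the repository.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>, <style>, and <noscript> bodies."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self) -> str:
        return "\n".join(self._chunks)


def preprocess_html(html: str) -> str:
    """Hypothetical stand-in for the BeautifulSoup cleaning step."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Stripping scripts and styles before extraction keeps the prompt smaller, which is presumably why the orchestrator cleans the HTML before calling the LLM.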

Data Flow

User inputs URL/Query in Streamlit.
StreamlitWebScraperChat initializes WebExtractor.
WebExtractor determines whether the input is a new URL or a follow-up chat message.
If new URL:
- PlaywrightScraper.fetch_content is called.
- Content is preprocessed and cached.
- LLM is called with context to answer query or extract data.
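
The flow above can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the actual WebExtractor code: the class name ExtractorFlowSketch, the helper looks_like_url, and the three injected callables (fetch, preprocess, ask_llm) are all hypothetical, and the URL check is deliberately simplistic.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional
from urllib.parse import urlparse


def looks_like_url(text: str) -> bool:
    """Treat input as a URL only if it has an http(s) scheme and a host."""
    parsed = urlparse(text.strip())
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


class ExtractorFlowSketch:
    """Hypothetical model of the URL-vs-chat decision with a content cache."""

    def __init__(
        self,
        fetch: Callable[[str], Awaitable[str]],
        preprocess: Callable[[str], str],
        ask_llm: Callable[[str, str], Awaitable[str]],
    ) -> None:
        self._fetch = fetch
        self._preprocess = preprocess
        self._ask_llm = ask_llm
        self._cache: Dict[str, str] = {}
        self._current_url: Optional[str] = None

    async def handle(self, user_input: str) -> str:
        text = user_input.strip()
        if looks_like_url(text):
            # New URL: fetch and preprocess once, then reuse the cached result.
            if text not in self._cache:
                raw_html = await self._fetch(text)
                self._cache[text] = self._preprocess(raw_html)
            self._current_url = text
        # Chat messages reuse the context of the most recent URL, if any.
        context = self._cache.get(self._current_url, "") if self._current_url else ""
        return await self._ask_llm(context, text)
```

Injecting the fetcher as a callable means the same flow works whether the page comes from PlaywrightScraper, TorScraper, or a replacement.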

3. Scrapling Integration Analysis

The Scrapling folder contains a new, sophisticated scraping library. Integrating it will enhance CyberScraper-2077 by providing:

Adaptive Parsing: Robustness against layout changes.
Advanced Fetching: StealthyFetcher and DynamicFetcher (wrapping Playwright) for better evasion.
Spider Framework: A Scrapy-like Spider class for efficient, concurrent, and persistent crawling of multiple pages.
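
To illustrate what concurrent multi-page crawling involves, here is a minimal sketch of bounded-concurrency fetching with asyncio. It is not Scrapling's Spider API; fetch_many and its fetch parameter are hypothetical names, and a real spider would add link discovery, retries, and persisted state on top of this.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple


async def fetch_many(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> Dict[str, str]:
    """Fetch all URLs with at most max_concurrency requests in flight."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")

    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Tuple[str, str]:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return url, await fetch(url)

    # gather returns results in the same order as the input URLs.
    pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(pairs)
```

A cap on in-flight requests matters for browser-based fetchers in particular, since each open page costs memory and unbounded parallelism makes scraping easier to detect.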

Key Integration Points

Replacement/Augmentation of PlaywrightScraper:
- Scrapling's StealthyFetcher can replace the manual patchright setup in PlaywrightScraper.
- Scrapling's Spider can replace the manual scrape_multiple_pages loop, offering better concurrency and state management.
Dependencies:
- Scrapling requires playwright (implied by imports in Scrapling/scrapling/engines/_browsers/_controllers.py).
- The current repo uses patchright, so the integration must confirm that Scrapling works with the installed environment. Scrapling's pyproject.toml lists both playwright and patchright as optional dependencies, but the code imports playwright. Either playwright must be installed as well, or the import must be aliased if patchright is to be used exclusively (note that patchright's module name is different).
Async Compatibility:
- Scrapling supports async fetchers and spiders, compatible with WebExtractor's async nature.
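
Because the playwright versus patchright question depends on what is actually installed, a quick check can be run before integration. This is a minimal sketch using only the standard library; the function names are hypothetical, and it only reports whether a module is importable, not whether its version is compatible.

```python
from importlib.util import find_spec
from typing import Dict, Iterable, List


def available_modules(names: Iterable[str] = ("playwright", "patchright")) -> Dict[str, bool]:
    """Report which top-level modules can be imported in this environment."""
    return {name: find_spec(name) is not None for name in names}


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the subset of names that cannot be imported, in input order."""
    status = available_modules(names)
    return [name for name, found in status.items() if not found]
```

Running missing_modules(["playwright", "patchright"]) in the project's virtual environment shows at a glance whether Scrapling's playwright import would fail.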

4. Proposed Integration Plan

Dependency Management:
- Install Scrapling in editable mode (pip install -e "Scrapling[fetchers]"; the quotes stop shells such as zsh from treating the brackets as a glob pattern).
- Ensure compatible versions of playwright/patchright.
Adapter Implementation (src/scrapers/scrapling_adapter.py):
- Create a ScraplingAdapter class implementing the interface expected by WebExtractor (matching PlaywrightScraper's fetch_content signature).
- Implement fetch_content to use:
  - StealthyFetcher (or DynamicFetcher) for single URLs.
  - A dynamic Spider subclass for multi-page/crawl tasks.
WebExtractor Update:
- Modify src/web_extractor.py to instantiate ScraplingAdapter instead of PlaywrightScraper.
- Map existing config (headless, proxy) to Scrapling configuration.
Verification:
- Test single page extraction.
- Test multi-page crawling.
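
A minimal sketch of the adapter and config mapping described in this plan follows. Several things here are assumptions rather than facts from the repository or from Scrapling: the async fetch_content(url) signature, the ScraperConfig dataclass, and the option names headless and proxy passed to the fetcher. The fetcher is injected as a plain async callable so the sketch does not depend on Scrapling's actual API; in a real adapter that callable would wrap StealthyFetcher or DynamicFetcher.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, Optional


@dataclass(frozen=True)
class ScraperConfig:
    """Hypothetical holder for the existing scraper settings."""

    headless: bool = True
    proxy: Optional[str] = None


def to_fetch_options(config: ScraperConfig) -> Dict[str, Any]:
    """Map existing config to fetcher keyword options (names are assumed)."""
    options: Dict[str, Any] = {"headless": config.headless}
    if config.proxy:
        options["proxy"] = config.proxy
    return options


class ScraplingAdapter:
    """Exposes a fetch_content method so WebExtractor can swap scrapers."""

    def __init__(
        self,
        fetch_html: Callable[..., Awaitable[str]],
        config: ScraperConfig = ScraperConfig(),
    ) -> None:
        # fetch_html is a hypothetical wrapper around a Scrapling fetcher
        # that takes a URL plus keyword options and returns the page HTML.
        self._fetch_html = fetch_html
        self._config = config

    async def fetch_content(self, url: str) -> str:
        return await self._fetch_html(url, **to_fetch_options(self._config))
```

Keeping the option mapping in one function means that if Scrapling's parameter names differ from the assumed ones, only to_fetch_options needs to change. The same injection point also makes the single-page verification step easy to run against a fake fetcher.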

5. Potential Issues & Risks

Dependency Conflict: Scrapling might pull a different version of playwright or patchright.
Browser Management: Scrapling manages its own browser instances, so it must be verified that they start and shut down cleanly within Streamlit's lifecycle (WebExtractor already creates a new scraper instance, which may reduce the risk).
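
One way to address the browser management risk is to scope each scraper to a single request and guarantee cleanup even when the fetch fails. The sketch below assumes a scraper object with an async close() method; that method, the ClosableScraper protocol, and scraper_session are hypothetical and are not taken from Scrapling or the repository.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, Callable, Protocol


class ClosableScraper(Protocol):
    """Assumed minimal interface: fetch a page and release browser resources."""

    async def fetch_content(self, url: str) -> str: ...

    async def close(self) -> None: ...


@asynccontextmanager
async def scraper_session(
    factory: Callable[[], ClosableScraper],
) -> AsyncIterator[ClosableScraper]:
    """Create a scraper for one unit of work and always close it afterwards."""
    scraper = factory()
    try:
        yield scraper
    finally:
        # Runs on success and on error, so no browser is left behind.
        await scraper.close()
```

Usage would look like `async with scraper_session(make_scraper) as scraper: html = await scraper.fetch_content(url)`, where make_scraper is whatever factory WebExtractor uses to build its scraper.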