scrapling / PROJECT_SCAN.md

	# CyberScraper-2077 Project Scan & Integration Analysis

	## 1. Project Overview
	CyberScraper-2077 is an AI-powered web scraping tool with a cyberpunk theme. It uses Large Language Models (LLMs) from providers such as OpenAI, Gemini, and Ollama to extract structured data from websites. It has a Streamlit-based UI, supports Tor for .onion sites, and currently uses a custom `PlaywrightScraper` (based on `patchright`) for stealthy scraping.

	## 2. Architecture Analysis

	### Core Components
	* Frontend (`main.py`, `app/`): Built with Streamlit. Handles user interaction, chat interface, and displays scraped data. Manages session state and API keys.
	* Orchestrator (`src/web_extractor.py`): The central logic (`WebExtractor` class). It coordinates:
	    * Fetching: Calls `PlaywrightScraper` or `TorScraper` to get raw HTML.
	    * Preprocessing: Cleans HTML using BeautifulSoup.
	    * Extraction: Uses LangChain and LLMs to parse content and answer user queries.
	    * Agentic Loop: Has an experimental agentic loop for iterative investigation using `browser_tools`.
	* Scraping Engine (`src/scrapers/`):
	    * `PlaywrightScraper`: A custom implementation using `patchright` (undetected Playwright). Features:
	        * Stealth: Uses persistent contexts and `patchright`'s built-in stealth.
	        * Concurrency: `scrape_multiple_pages` handles basic pagination.
	        * CAPTCHA: Manual `handle_captcha` method.
	    * `TorScraper`: Uses `requests` with SOCKS proxy for .onion sites.
	* Utilities:
	    * `src/models.py` / `src/ollama_models.py`: Model management.
	    * `src/utils/`: Error handling, Google Sheets integration.

	### Data Flow
	1. User inputs URL/Query in Streamlit.
	2. `StreamlitWebScraperChat` initializes `WebExtractor`.
	3. `WebExtractor` determines if it's a new URL or chat.
	4. If new URL:
	    * `PlaywrightScraper.fetch_content` is called.
	    * Content is preprocessed and cached.
	    * LLM is called with context to answer query or extract data.
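
The flow above can be sketched roughly as follows. This is a minimal illustration and not the actual `WebExtractor` code: `ExtractorSketch`, `preprocess`, and the injected `fetch` and `ask_llm` callables are hypothetical names, and the standard-library `HTMLParser` stands in for the BeautifulSoup cleaning step so the sketch has no third-party dependencies.

```python
import asyncio
from html.parser import HTMLParser
from typing import Awaitable, Callable, Dict, List, Optional


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def preprocess(html: str) -> str:
    """Stand-in for the BeautifulSoup step: reduce raw HTML to plain text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


class ExtractorSketch:
    """Hypothetical orchestrator: fetch and cache on a new URL, then ask the LLM."""

    def __init__(
        self,
        fetch: Callable[[str], Awaitable[str]],
        ask_llm: Callable[[str, str], Awaitable[str]],
    ) -> None:
        self._fetch = fetch
        self._ask_llm = ask_llm
        self._cache: Dict[str, str] = {}
        self._current_url: Optional[str] = None

    async def process(self, query: str, url: Optional[str] = None) -> str:
        if url is not None:
            if url not in self._cache:
                # New URL: fetch raw HTML, preprocess it, and cache the result.
                self._cache[url] = preprocess(await self._fetch(url))
            self._current_url = url
        # Chat turn or new URL: answer using the cached page text as context.
        context = self._cache.get(self._current_url, "") if self._current_url else ""
        return await self._ask_llm(context, query)
```

Passing the fetcher in as a callable keeps the orchestration independent of which scraper produces the HTML, which is the property the integration plan relies on.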

	## 3. Scrapling Integration Analysis
	The `Scrapling` folder contains a new, sophisticated scraping library. Integrating it will enhance CyberScraper-2077 by providing:

	* Adaptive Parsing: Robustness against layout changes.
	* Advanced Fetching: `StealthyFetcher` and `DynamicFetcher` (wrapping Playwright) for better evasion.
	* Spider Framework: A Scrapy-like `Spider` class for efficient, concurrent, and persistent crawling of multiple pages.

	### Key Integration Points
	* Replacement/Augmentation of `PlaywrightScraper`:
	    * `Scrapling`'s `StealthyFetcher` can replace the manual `patchright` setup in `PlaywrightScraper`.
	    * `Scrapling`'s `Spider` can replace the manual `scrape_multiple_pages` loop, offering better concurrency and state management.
	* Dependencies:
	    * `Scrapling` requires `playwright` (implied by imports in `Scrapling/scrapling/engines/_browsers/_controllers.py`).
	    * The current repo uses `patchright`, so the integration must confirm that `Scrapling` works in the installed environment. `Scrapling`'s `pyproject.toml` lists both `playwright` and `patchright` as optional dependencies, but the code imports `playwright`. I will need to either make sure `playwright` is installed or alias it if `patchright` is to be used exclusively. Note that `patchright` installs under its own module name, so `import playwright` does not resolve to it.
	* Async Compatibility:
	    * `Scrapling` supports async fetchers and spiders, compatible with `WebExtractor`'s async nature.
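
To make the module-name concern concrete, a small check like the one below could report which of the two browser packages is importable before any `Scrapling` code runs. It is only a sketch: `detect_browser_modules` and `import_warning` are hypothetical helpers, and the only library call used is the standard `importlib.util.find_spec`.

```python
from importlib.util import find_spec
from typing import Callable, Dict, Iterable, Optional

BROWSER_MODULES = ("playwright", "patchright")


def detect_browser_modules(
    names: Iterable[str] = BROWSER_MODULES,
    finder: Callable[[str], object] = find_spec,
) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be imported."""
    return {name: finder(name) is not None for name in names}


def import_warning(available: Dict[str, bool]) -> Optional[str]:
    """Return a warning if code that runs `import playwright` would fail."""
    if available.get("playwright"):
        return None
    if available.get("patchright"):
        return (
            "patchright is installed but playwright is not; "
            "`import playwright` will fail unless playwright is installed too."
        )
    return "Neither playwright nor patchright is importable."
```

Calling `import_warning(detect_browser_modules())` at startup would surface the mismatch early instead of at the first fetch.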

	## 4. Proposed Integration Plan

	1. Dependency Management:
	    * Install `Scrapling` in editable mode (`pip install -e "Scrapling[fetchers]"`).
	    * Ensure compatible versions of `playwright`/`patchright`.
	2. Adapter Implementation (`src/scrapers/scrapling_adapter.py`):
	    * Create a `ScraplingAdapter` class implementing the interface expected by `WebExtractor` (matching `PlaywrightScraper`'s `fetch_content` signature).
	    * Implement `fetch_content` to use:
	        * `StealthyFetcher` (or `DynamicFetcher`) for single URLs.
	        * A dynamic `Spider` subclass for multi-page/crawl tasks.
	3. WebExtractor Update:
	    * Modify `src/web_extractor.py` to instantiate `ScraplingAdapter` instead of `PlaywrightScraper`.
	    * Map existing config (headless, proxy) to `Scrapling` configuration.
	4. Verification:
	    * Test single page extraction.
	    * Test multi-page crawling.
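
A first draft of the adapter from step 2 might look like the sketch below. Several things here are assumptions rather than confirmed details: the real `fetch_content` signature of `PlaywrightScraper` is not shown in this scan, so a single `url` argument returning an HTML string is assumed; `StealthyFetcher.async_fetch` with `headless` and `proxy` options and the `html_content` attribute on its response should be checked against the vendored `Scrapling` version; and `fetch_many` is a plain bounded `asyncio.gather` stand-in, not `Scrapling`'s `Spider`.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable, List, Optional

FetchFn = Callable[..., Awaitable[Any]]


async def _scrapling_stealthy_fetch(url: str, **options: Any) -> Any:
    # Lazy import so this module still loads when Scrapling is not installed.
    # Assumption: the vendored Scrapling exposes StealthyFetcher.async_fetch.
    from scrapling.fetchers import StealthyFetcher

    return await StealthyFetcher.async_fetch(url, **options)


class ScraplingAdapter:
    """Hypothetical drop-in for PlaywrightScraper, backed by a Scrapling fetcher."""

    def __init__(
        self,
        headless: bool = True,
        proxy: Optional[str] = None,
        fetch: FetchFn = _scrapling_stealthy_fetch,
        max_concurrency: int = 3,
    ) -> None:
        self.headless = headless
        self.proxy = proxy
        self._fetch = fetch
        self._max_concurrency = max_concurrency

    def _options(self) -> Dict[str, Any]:
        # Map the existing config (headless, proxy) onto fetcher keyword options.
        options: Dict[str, Any] = {"headless": self.headless}
        if self.proxy:
            options["proxy"] = self.proxy
        return options

    async def fetch_content(self, url: str) -> str:
        page = await self._fetch(url, **self._options())
        # Assumption: the response object exposes the page HTML as `html_content`.
        return str(page.html_content)

    async def fetch_many(self, urls: Iterable[str]) -> List[str]:
        # Stand-in for the Spider-based crawl: bounded concurrent single fetches.
        semaphore = asyncio.Semaphore(self._max_concurrency)

        async def fetch_one(url: str) -> str:
            async with semaphore:
                return await self.fetch_content(url)

        return list(await asyncio.gather(*(fetch_one(u) for u in urls)))
```

Injecting the `fetch` callable lets the verification step in item 4 run against a fake fetcher first, without launching a browser, before switching to the real `StealthyFetcher`.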

	## 5. Potential Issues & Risks
	* Dependency Conflict: `Scrapling` might pull a different version of `playwright` or `patchright`.
	* Browser Management: `Scrapling` manages its own browser instances, so I need to confirm that they start and shut down cleanly within Streamlit's lifecycle (this may be less of a concern because `WebExtractor` creates a new scraper instance).
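
For the dependency-conflict risk, one low-effort safeguard is to record the installed versions before and after installing `Scrapling` and compare them. The helper below is a sketch with a hypothetical name (`installed_versions`); it relies only on the standard `importlib.metadata` module and assumes the PyPI distribution names `scrapling`, `playwright`, and `patchright`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Dict, Iterable, Optional

PACKAGES = ("scrapling", "playwright", "patchright")


def installed_versions(
    packages: Iterable[str] = PACKAGES,
    get_version: Callable[[str], str] = version,
) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is absent."""
    report: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            report[name] = get_version(name)
        except PackageNotFoundError:
            report[name] = None
    return report
```

Comparing the two reports shows whether the install upgraded, downgraded, or newly added `playwright` or `patchright`.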