# CyberScraper-2077 Project Scan & Integration Analysis
## 1. Project Overview
CyberScraper-2077 is an AI-powered web scraping tool with a futuristic cyberpunk theme. It uses Large Language Models (LLMs) from providers such as OpenAI, Gemini, and Ollama to extract structured data from websites. It features a Streamlit-based UI, supports Tor for .onion sites, and currently uses a custom `PlaywrightScraper` (based on `patchright`) for stealthy scraping.
## 2. Architecture Analysis
### Core Components
* **Frontend (`main.py`, `app/`):** Built with Streamlit. Handles user interaction, chat interface, and displays scraped data. Manages session state and API keys.
* **Orchestrator (`src/web_extractor.py`):** The central logic (`WebExtractor` class). It coordinates:
    * **Fetching:** Calls `PlaywrightScraper` or `TorScraper` to get raw HTML.
    * **Preprocessing:** Cleans HTML using BeautifulSoup.
    * **Extraction:** Uses LangChain and LLMs to parse content and answer user queries.
    * **Agentic Loop:** Has an experimental agentic loop for iterative investigation using `browser_tools`.
* **Scraping Engine (`src/scrapers/`):**
    * **`PlaywrightScraper`:** A custom implementation using `patchright` (undetected Playwright). Features:
        * **Stealth:** Uses persistent contexts and `patchright`'s built-in stealth.
        * **Concurrency:** `scrape_multiple_pages` handles basic pagination.
        * **CAPTCHA:** Manual `handle_captcha` method.
    * **`TorScraper`:** Uses `requests` with SOCKS proxy for .onion sites.
* **Utilities:**
    * `src/models.py` / `src/ollama_models.py`: Model management.
    * `src/utils/`: Error handling, Google Sheets integration.
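
The `TorScraper` entry above mentions `requests` with a SOCKS proxy. A minimal sketch of that setup follows, assuming a local Tor daemon listening on its default SOCKS port 9050 and the `requests[socks]` extra installed. The function names `tor_proxies` and `fetch_onion` are hypothetical and not taken from the repository.

```python
from typing import Dict


def tor_proxies(host: str = "127.0.0.1", port: int = 9050) -> Dict[str, str]:
    """Build a requests-style proxy mapping that routes traffic through Tor.

    The socks5h scheme asks the proxy to resolve hostnames, which .onion
    addresses need because ordinary DNS cannot resolve them.
    """
    proxy = f"socks5h://{host}:{port}"
    return {"http": proxy, "https": proxy}


def fetch_onion(url: str, timeout: float = 60.0) -> str:
    """Fetch a page through Tor and return its HTML (hypothetical helper)."""
    import requests  # imported lazily; SOCKS support needs requests[socks]

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text
```

Using `socks5h` rather than `socks5` matters here: with `socks5` the hostname would be resolved locally, which fails for .onion addresses.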
### Data Flow
1. User inputs URL/Query in Streamlit.
2. `StreamlitWebScraperChat` initializes `WebExtractor`.
3. `WebExtractor` determines if it's a new URL or chat.
4. If new URL:
    * `PlaywrightScraper.fetch_content` is called.
    * Content is preprocessed and cached.
    * LLM is called with context to answer query or extract data.
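
The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `WebExtractor` code: the class `FlowSketch`, the three injected callables, the URL check by prefix, and the default query used for a bare URL are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class FlowSketch:
    """Hypothetical stand-in for the orchestrator's URL-or-chat branching."""

    def __init__(
        self,
        fetch: Callable[[str], Awaitable[str]],
        preprocess: Callable[[str], str],
        ask_llm: Callable[[str, str], str],
    ) -> None:
        self._fetch = fetch
        self._preprocess = preprocess
        self._ask_llm = ask_llm
        self._cache: Dict[str, str] = {}
        self._current_url: Optional[str] = None

    async def handle(self, user_input: str) -> str:
        if user_input.startswith(("http://", "https://")):
            # New URL: fetch raw HTML, preprocess it, and cache the result.
            if user_input not in self._cache:
                raw_html = await self._fetch(user_input)
                self._cache[user_input] = self._preprocess(raw_html)
            self._current_url = user_input
            query = "Summarize this page."  # assumed default query
        else:
            # Chat message: reuse the cached content of the current URL.
            query = user_input
        context = self._cache.get(self._current_url or "", "")
        return self._ask_llm(context, query)
```

Injecting the fetcher as a callable keeps the branching logic testable without a browser, and it is also the seam where a different scraping backend could be swapped in.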
## 3. Scrapling Integration Analysis
The `Scrapling` folder contains a new, sophisticated scraping library. Integrating it will enhance CyberScraper-2077 by providing:
* **Adaptive Parsing:** Robustness against layout changes.
* **Advanced Fetching:** `StealthyFetcher` and `DynamicFetcher` (wrapping Playwright) for better evasion.
* **Spider Framework:** A Scrapy-like `Spider` class for efficient, concurrent, and persistent crawling of multiple pages.
### Key Integration Points
* **Replacement/Augmentation of `PlaywrightScraper`:**
    * `Scrapling`'s `StealthyFetcher` can replace the manual `patchright` setup in `PlaywrightScraper`.
    * `Scrapling`'s `Spider` can replace the manual `scrape_multiple_pages` loop, offering better concurrency and state management.
* **Dependencies:**
    * `Scrapling` requires `playwright` (implied by imports in `Scrapling/scrapling/engines/_browsers/_controllers.py`).
    * The current repo uses `patchright`, so the integration must ensure `Scrapling` works with the installed environment. `Scrapling`'s `pyproject.toml` lists both `playwright` and `patchright` as optional dependencies, but the code imports `playwright`. I will need to ensure `playwright` is installed, or alias the import if `patchright` is to be used exclusively (the `patchright` module name differs from `playwright`, so it is not a drop-in import).
* **Async Compatibility:**
    * `Scrapling` supports async fetchers and spiders, compatible with `WebExtractor`'s async nature.
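
Before changing any imports, it may help to check which of the two browser packages the environment actually provides. The sketch below uses only the standard library. The helper names `installed_version` and `browser_backend_report` are hypothetical, and it assumes the distribution names match the module names `playwright` and `patchright`.

```python
from importlib import metadata, util
from typing import Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed distribution version, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def browser_backend_report() -> Dict[str, Dict[str, object]]:
    """Report importability and version for both browser automation packages."""
    report: Dict[str, Dict[str, object]] = {}
    for name in ("playwright", "patchright"):
        report[name] = {
            # find_spec locates the module without importing it.
            "importable": util.find_spec(name) is not None,
            "version": installed_version(name),
        }
    return report


if __name__ == "__main__":
    for name, info in browser_backend_report().items():
        print(name, info)
```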
## 4. Proposed Integration Plan
1. **Dependency Management:**
    * Install `Scrapling` in editable mode (`pip install -e "Scrapling[fetchers]"`; the quotes keep shells such as zsh from expanding the brackets).
    * Ensure compatible versions of `playwright`/`patchright`.
2. **Adapter Implementation (`src/scrapers/scrapling_adapter.py`):**
    * Create a `ScraplingAdapter` class implementing the interface expected by `WebExtractor` (matching `PlaywrightScraper`'s `fetch_content` signature).
    * Implement `fetch_content` to use:
        * `StealthyFetcher` (or `DynamicFetcher`) for single URLs.
        * A dynamic `Spider` subclass for multi-page/crawl tasks.
3. **WebExtractor Update:**
    * Modify `src/web_extractor.py` to instantiate `ScraplingAdapter` instead of `PlaywrightScraper`.
    * Map existing config (headless, proxy) to `Scrapling` configuration.
4. **Verification:**
    * Test single page extraction.
    * Test multi-page crawling.
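
Steps 2 and 3 of the plan can be sketched as below. This is a minimal outline under loud assumptions. The fetcher is injected as a callable rather than imported from `Scrapling`, because the exact `Scrapling` call signature and response attributes are not confirmed here. `ScraperConfig`, `to_fetch_kwargs`, and the keyword names `headless` and `proxy` are hypothetical, and the real `fetch_content` signature may take more parameters than shown.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, Optional


@dataclass
class ScraperConfig:
    """Hypothetical subset of the existing scraper settings."""

    headless: bool = True
    proxy: Optional[str] = None


def to_fetch_kwargs(config: ScraperConfig) -> Dict[str, Any]:
    """Map existing settings to keyword arguments for the fetch callable.

    The keyword names are assumptions; verify them against Scrapling's docs.
    """
    kwargs: Dict[str, Any] = {"headless": config.headless}
    if config.proxy:
        kwargs["proxy"] = config.proxy
    return kwargs


class ScraplingAdapter:
    """Exposes fetch_content(url) -> str on top of an injected async fetcher."""

    def __init__(
        self,
        fetch: Callable[..., Awaitable[Any]],
        to_html: Callable[[Any], str] = str,
        config: Optional[ScraperConfig] = None,
    ) -> None:
        self._fetch = fetch          # e.g. an async Scrapling fetch function
        self._to_html = to_html      # converts the fetcher's response to HTML
        self._config = config or ScraperConfig()

    async def fetch_content(self, url: str) -> str:
        page = await self._fetch(url, **to_fetch_kwargs(self._config))
        return self._to_html(page)
```

Keeping the fetcher and the response-to-HTML conversion as constructor arguments means the single-page verification step can run against a fake fetcher first, then against the real `StealthyFetcher` once its interface has been checked.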
## 5. Potential Issues & Risks
* **Dependency Conflict:** `Scrapling` might pull a different version of `playwright` or `patchright`.
* **Browser Management:** `Scrapling` manages its own browser instances. I need to ensure this plays nicely with Streamlit's lifecycle, although `WebExtractor` creating a new scraper instance should limit how much state is shared.