# CyberScraper-2077 Project Scan & Integration Analysis

## 1. Project Overview
CyberScraper-2077 is an AI-powered web scraping tool with a futuristic cyberpunk theme. It uses Large Language Models (LLMs) served through OpenAI, Gemini, and Ollama to extract structured data from websites. It features a Streamlit-based UI, supports Tor for .onion sites, and currently uses a custom `PlaywrightScraper` (based on `patchright`) for stealthy scraping.

## 2. Architecture Analysis

### Core Components
*   **Frontend (`main.py`, `app/`):** Built with Streamlit. Handles user interaction, chat interface, and displays scraped data. Manages session state and API keys.
*   **Orchestrator (`src/web_extractor.py`):** The central logic (`WebExtractor` class). It coordinates:
    *   **Fetching:** Calls `PlaywrightScraper` or `TorScraper` to get raw HTML.
    *   **Preprocessing:** Cleans HTML using BeautifulSoup.
    *   **Extraction:** Uses LangChain and LLMs to parse content and answer user queries.
    *   **Agentic Loop:** Has an experimental agentic loop for iterative investigation using `browser_tools`.
*   **Scraping Engine (`src/scrapers/`):**
    *   **`PlaywrightScraper`:** A custom implementation using `patchright` (undetected Playwright). Features:
        *   **Stealth:** Uses persistent contexts and `patchright`'s built-in stealth.
        *   **Concurrency:** `scrape_multiple_pages` handles basic pagination.
        *   **CAPTCHA:** Manual `handle_captcha` method.
    *   **`TorScraper`:** Uses `requests` with SOCKS proxy for .onion sites.
*   **Utilities:**
    *   `src/models.py` / `src/ollama_models.py`: Model management.
    *   `src/utils/`: Error handling, Google Sheets integration.
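
The preprocessing step listed under the orchestrator amounts to stripping non-content markup before the page text reaches the LLM. The sketch below is illustrative only: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs without extra dependencies, and `clean_html` and `SKIPPED_TAGS` are hypothetical names, not code taken from `src/web_extractor.py`.

```python
from html.parser import HTMLParser
from typing import List

# Assumption: the contents of these tags are noise for the LLM.
SKIPPED_TAGS = {"script", "style", "noscript"}


class _TextCollector(HTMLParser):
    """Collects visible text and ignores everything inside SKIPPED_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(raw_html: str) -> str:
    """Return the visible text of raw_html joined by single spaces."""
    collector = _TextCollector()
    collector.feed(raw_html)
    collector.close()
    return " ".join(collector.chunks)
```

The real preprocessing may keep more structure (links, tables, attributes); this only shows the general shape of the cleaning step.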

### Data Flow
1.  User inputs URL/Query in Streamlit.
2.  `StreamlitWebScraperChat` initializes `WebExtractor`.
3.  `WebExtractor` determines whether the input is a new URL or a chat message.
4.  If new URL:
    *   `PlaywrightScraper.fetch_content` is called.
    *   Content is preprocessed and cached.
    *   LLM is called with context to answer query or extract data.
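
The flow above can be sketched as a small router. This is a minimal sketch under stated assumptions, not the actual `WebExtractor` code: `ExtractorFlowSketch`, its constructor parameters, the URL-prefix check, and the default query string are all hypothetical, and the chat branch (answering from the cached content of the last URL) is an assumption, since the steps above only spell out the new-URL case.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


class ExtractorFlowSketch:
    """Hypothetical stand-in for the routing inside WebExtractor."""

    def __init__(
        self,
        fetch: Callable[[str], Awaitable[str]],   # e.g. a scraper's fetch_content
        preprocess: Callable[[str], str],          # e.g. HTML cleaning
        ask_llm: Callable[[str, str], str],        # (context, query) -> answer
    ) -> None:
        self._fetch = fetch
        self._preprocess = preprocess
        self._ask_llm = ask_llm
        self._cache: Dict[str, str] = {}
        self._current_url: Optional[str] = None

    async def handle(self, user_input: str) -> str:
        text = user_input.strip()
        if text.lower().startswith(("http://", "https://")):
            # New URL: fetch and preprocess once, then reuse the cached copy.
            if text not in self._cache:
                raw_html = await self._fetch(text)
                self._cache[text] = self._preprocess(raw_html)
            self._current_url = text
            return self._ask_llm(self._cache[text], "Extract the main content")
        # Chat message (assumed behavior): answer from the last cached page.
        context = self._cache.get(self._current_url or "", "")
        return self._ask_llm(context, text)
```

Injecting `fetch`, `preprocess`, and `ask_llm` as callables keeps the sketch runnable without a browser or an API key, and mirrors how the orchestrator delegates each stage.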

## 3. Scrapling Integration Analysis
The `Scrapling` folder contains a new, sophisticated scraping library. Integrating it will enhance CyberScraper-2077 by providing:

*   **Adaptive Parsing:** Robustness against layout changes.
*   **Advanced Fetching:** `StealthyFetcher` and `DynamicFetcher` (wrapping Playwright) for better evasion.
*   **Spider Framework:** A Scrapy-like `Spider` class for efficient, concurrent, and persistent crawling of multiple pages.

### Key Integration Points
*   **Replacement/Augmentation of `PlaywrightScraper`:**
    *   `Scrapling`'s `StealthyFetcher` can replace the manual `patchright` setup in `PlaywrightScraper`.
    *   `Scrapling`'s `Spider` can replace the manual `scrape_multiple_pages` loop, offering better concurrency and state management.
*   **Dependencies:**
    *   `Scrapling` requires `playwright` (implied by imports in `Scrapling/scrapling/engines/_browsers/_controllers.py`).
    *   The current repo uses `patchright`, so integration must ensure `Scrapling` works with the installed environment. `Scrapling`'s `pyproject.toml` lists both `playwright` and `patchright` as optional dependencies, but the code imports `playwright`. Because `patchright` has a different module name, it does not satisfy those imports on its own: I will need to either ensure `playwright` is installed or alias the module if `patchright` is to be used exclusively.
*   **Async Compatibility:**
    *   `Scrapling` supports async fetchers and spiders, compatible with `WebExtractor`'s async nature.
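
Before wiring anything together, it may help to confirm which of these modules the environment can actually import. The sketch below uses only the standard library; `probe_modules` is a hypothetical helper name, and the default module list simply reflects the packages named above.

```python
from importlib.util import find_spec
from typing import Dict, Iterable


def probe_modules(
    names: Iterable[str] = ("playwright", "patchright", "scrapling"),
) -> Dict[str, bool]:
    """Map each top-level module name to whether it is importable here.

    find_spec locates a top-level module without executing it, so this
    check does not launch a browser or trigger import side effects.
    """
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for module, available in probe_modules().items():
        print(f"{module}: {'found' if available else 'missing'}")
```

If `patchright` is found but `playwright` is missing, the import in `Scrapling/scrapling/engines/_browsers/_controllers.py` would be expected to fail, which is the case the dependency note above is concerned with.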

## 4. Proposed Integration Plan

1.  **Dependency Management:**
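    *   Install `Scrapling` in editable mode with the `fetchers` extra (`pip install -e "./Scrapling[fetchers]"`, run from the repo root; the quotes stop shells such as zsh from treating the brackets as a glob pattern).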
    *   Ensure compatible versions of `playwright`/`patchright`.
2.  **Adapter Implementation (`src/scrapers/scrapling_adapter.py`):**
    *   Create a `ScraplingAdapter` class implementing the interface expected by `WebExtractor` (matching `PlaywrightScraper`'s `fetch_content` signature).
    *   Implement `fetch_content` to use:
        *   `StealthyFetcher` (or `DynamicFetcher`) for single URLs.
        *   A dynamic `Spider` subclass for multi-page/crawl tasks.
3.  **WebExtractor Update:**
    *   Modify `src/web_extractor.py` to instantiate `ScraplingAdapter` instead of `PlaywrightScraper`.
    *   Map existing config (headless, proxy) to `Scrapling` configuration.
4.  **Verification:**
    *   Test single page extraction.
    *   Test multi-page crawling.
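
Step 2 of the plan could start from something like the following. It is a sketch under several assumptions rather than the final adapter: the single-argument `fetch_content(url) -> str` signature is a simplification (the real `PlaywrightScraper.fetch_content` signature should be copied instead), `StealthyFetcher.async_fetch`, its `headless`/`proxy` options, and the `html_content` attribute should be checked against the vendored `Scrapling` version, and `fetch_fn` is a hypothetical injection point added so the adapter can be tested without a browser.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

FetchFn = Callable[..., Awaitable[Any]]


async def _default_fetch(url: str, **options: Any) -> Any:
    # Assumption: scrapling.fetchers exposes StealthyFetcher.async_fetch.
    # Imported lazily so this module loads even if Scrapling is absent.
    from scrapling.fetchers import StealthyFetcher

    return await StealthyFetcher.async_fetch(url, **options)


class ScraplingAdapter:
    """Hypothetical adapter exposing a PlaywrightScraper-like interface."""

    def __init__(
        self,
        headless: bool = True,
        proxy: Optional[str] = None,
        fetch_fn: FetchFn = _default_fetch,
    ) -> None:
        self.headless = headless
        self.proxy = proxy
        self._fetch_fn = fetch_fn

    def _options(self) -> Dict[str, Any]:
        # Map the existing config (headless, proxy) to fetcher options.
        options: Dict[str, Any] = {"headless": self.headless}
        if self.proxy:
            options["proxy"] = self.proxy
        return options

    async def fetch_content(self, url: str) -> str:
        response = await self._fetch_fn(url, **self._options())
        # Assumption: the response carries the page HTML as `html_content`;
        # fall back to str(response) if that attribute is missing.
        html = getattr(response, "html_content", None)
        return html if isinstance(html, str) else str(response)
```

The multi-page path through a `Spider` subclass is deliberately left out here, since its API is not described in this document. For the verification step, the single-page test can pass a fake `fetch_fn` first and only then switch to the real fetcher.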

## 5. Potential Issues & Risks
*   **Dependency Conflict:** `Scrapling` might pull a different version of `playwright` or `patchright`.
*   **Browser Management:** `Scrapling` manages its own browser instances. I need to ensure those instances are started and closed cleanly within Streamlit's lifecycle; the risk is reduced because `WebExtractor` creates a new scraper instance rather than sharing one.
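
One way to contain the browser-management risk is to scope each scraper to a single request and always release it afterwards. The sketch below is a generic pattern, not something `Scrapling` or Streamlit provides: `scoped_scraper` is a hypothetical helper, and it assumes the scraper object may expose a `close()` method (sync or async), which should be verified against the actual classes.

```python
import asyncio
import inspect
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Callable


@asynccontextmanager
async def scoped_scraper(factory: Callable[[], Any]) -> AsyncIterator[Any]:
    """Create a scraper for one request and clean it up afterwards."""
    scraper = factory()
    try:
        yield scraper
    finally:
        # Assumption: cleanup, if any, is exposed as close().
        close = getattr(scraper, "close", None)
        if callable(close):
            result = close()
            if inspect.isawaitable(result):
                await result
```

Because the cleanup sits in a `finally` block, it also runs when the fetch raises, so a failed scrape would not leave a browser behind for the next Streamlit run.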