---
license: apache-2.0
language:
- en
tags:
- web-scraping
- html-extraction
- agent
- structured-data
- qwen2.5
- unsloth
- lora
datasets:
- sukritvemula/webscrape-agent-training-data
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
---

# 🕷️ WebScrapeAgent-7B-v1

**An autonomous web scraping agent** built on Qwen2.5-7B-Instruct, fine-tuned to extract structured data from web pages.

Give it a URL and a description of what you want, and it returns clean JSON data or a clear explanation of what went wrong.

## What It Does

| Capability | Description |
|---|---|
| **HTML Reading** | Understands page structure: tables, nested divs, lists, forms, data attributes, malformed HTML |
| **Action Sequencing** | Decides which tools to call, and in what order, to get the data |
| **Authentication** | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
| **Error Recovery** | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |

## How It Works

The model operates in an **action loop**:

```
User: "Extract product listings from example.com/shop"
  ↓
Model: Let me navigate there first.
       ACTION: NAVIGATE {"url": "example.com/shop"}
  ↓
System: HTTP 200 OK. ...
  ↓
Model: I see product cards. Let me extract the data.
       ACTION: RETURN_RESULT {"status": "success", "data": [...]}
```

Each final result includes a **status** (`success`, `partial`, or `failed`), so the caller always knows where things stand.

Each job is capped at 10 steps. If the model can't finish within that limit, it returns what it has with a clear explanation.
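
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `webscrape_agent.py`: the names `parse_action`, `run_loop`, `generate`, and `execute` are hypothetical, and the `ACTION: NAME {json}` format is inferred from the transcript shown above.

```python
import json
import re
from typing import Callable, List, Optional, Tuple

MAX_STEPS = 10  # step cap described above

# Assumed format: "ACTION: NAME {json params}" somewhere in the model reply.
ACTION_RE = re.compile(r"ACTION:\s*([A-Z_]+)\s*(\{.*\})", re.DOTALL)


def parse_action(reply: str) -> Optional[Tuple[str, dict]]:
    """Return (action_name, params), or None if no well-formed action is found."""
    match = ACTION_RE.search(reply)
    if match is None:
        return None
    try:
        params = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return match.group(1), params


def run_loop(
    generate: Callable[[List[str]], str],  # hypothetical: transcript -> model reply
    execute: Callable[[str, dict], str],   # hypothetical: action -> observation text
    task: str,
    url: str,
) -> dict:
    """Alternate model replies and tool observations until RETURN_RESULT or the cap."""
    transcript = [f"Task: {task}\nURL: {url}"]
    for _ in range(MAX_STEPS):
        reply = generate(transcript)
        transcript.append(reply)
        parsed = parse_action(reply)
        if parsed is None:
            return {"status": "failed", "data": None,
                    "message": "No parsable action in model reply."}
        name, params = parsed
        if name == "RETURN_RESULT":
            return params  # already carries status, data, and message
        transcript.append(execute(name, params))
    # Step budget exhausted: report a partial result rather than raising.
    return {"status": "partial", "data": None,
            "message": f"Stopped after {MAX_STEPS} steps without a final result."}
```

Passing `generate` and `execute` as callables keeps the sketch independent of any particular model backend or HTTP/browser executor, so the loop can be exercised with scripted replies before wiring in the real model.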

## Quick Start

### Python API

```python
from webscrape_agent import WebScrapeAgent

agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

result = agent.scrape(
    url="https://example.com/products",
    task="Extract all product names, prices, and ratings",
    # Optional JSON Schema describing the expected output shape
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "rating": {"type": "string"}
            }
        }
    }
)

print(result.status)   # "success" | "partial" | "failed"
print(result.data)     # Clean JSON data
print(result.message)  # Human-readable explanation
```

### With Authentication

```python
# Cookie-based auth
result = agent.scrape(
    url="https://dashboard.example.com/analytics",
    task="Get my usage statistics",
    auth={"method": "cookies", "cookies": {"session_id": "abc123"}}
)

# API token
result = agent.scrape(
    url="https://api.example.com/v2/data",
    task="Get all user records",
    auth={"method": "token", "token": "sk-xxx"}
)
```

### CLI

```bash
python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"
```

### Direct Model Usage

```python
import unsloth  # import unsloth before other libraries so its patches apply
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the model in 4-bit so it fits on a 16GB GPU
model, tokenizer = FastLanguageModel.from_pretrained(
    "sukritvemula/WebScrapeAgent-7B-v1",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

messages = [
    {"role": "system", "content": "You are WebScrapeAgent..."},
    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# do_sample=True is required for temperature to take effect
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.3,
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

## Available Actions

The model can output these actions in its loop:

| Action | Purpose | Example Params |
|---|---|---|
| `NAVIGATE` | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
| `CLICK` | Click an element | `{"selector": "#submit-btn"}` |
| `FILL_FORM` | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
| `WAIT` | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
| `SET_COOKIES` | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
| `SET_HEADERS` | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
| `LOAD_BROWSER_PROFILE` | Use saved browser session | `{"profile_name": "work-chrome"}` |
| `EXECUTE_JS` | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
| `SCROLL` | Scroll the page | `{"direction": "down", "amount": 500}` |
| `SWITCH_STRATEGY` | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
| `RETURN_RESULT` | Return final data | `{"status": "success", "data": [...], "message": "..."}` |

## Training Details

### Recipe

Based on three published approaches:

| Paper | Key Contribution | Result |
|---|---|---|
| [ScrapeGraphAI-100k](https://arxiv.org/abs/2602.15189) | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
| [BrowserAgent](https://arxiv.org/abs/2510.10666) | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
| [A3-Annotators](https://arxiv.org/abs/2604.07776) | Assistant-token-only loss + thought chains | 41.5% WebArena |

### Hyperparameters

| Parameter | Value | Source |
|---|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct | - |
| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
| LoRA rank | 32 | Increased from paper's 16 for structured output complexity |
| LoRA alpha | 32 | Standard (= rank) |
| LoRA targets | All linear (q,k,v,o,gate,up,down) | ScrapeGraphAI + A3 |
| Learning rate | 1e-4 | ScrapeGraphAI |
| LR schedule | Cosine with 3% warmup | A3-Annotators |
| Optimizer | AdamW 8-bit | Unsloth best practice |
| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
| Effective batch | 16 | - |
| Max seq length | 4096 | - |
| Loss | Completion-only (assistant tokens) | All three papers |
| Gradient checkpointing | Unsloth custom | - |

### Training Data

**Dataset**: [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data) (45,624 examples)

| Source | Count | % | What It Teaches |
|---|---|---|---|
| [ScrapeGraphAI-100k](https://huggingface.co/datasets/scrapegraphai/scrapegraph-100k-finetuning) | 25,244 | 55.3% | HTML→JSON extraction across real websites |
| [BrowserAgent-Data](https://huggingface.co/datasets/TIGER-Lab/BrowserAgent-Data) | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |

### Training Infrastructure

Designed for **free GPU** platforms:

- ✅ Google Colab (T4, 16GB)
- ✅ Kaggle (T4/P100, 16GB)
- ✅ Any 16GB+ GPU

## How to Train

### Option 1: Colab Notebook (Easiest)

Open `WebScrapeAgent_Training.ipynb` in Google Colab with a T4 GPU runtime. Everything is set up; just run all cells.

### Option 2: Command Line

```bash
pip install unsloth trl peft transformers accelerate datasets bitsandbytes

# Train with defaults (pushes to Hub)
python train.py

# Custom settings
python train.py \
    --model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
    --output your-username/WebScrapeAgent-7B-custom \
    --epochs 3 \
    --lr 5e-5 \
    --lora-r 64 \
    --batch-size 2 \
    --grad-accum 8

# Save locally only (no Hub push)
python train.py --no-push --save-local ./my-model
```

## Limitations

- **CAPTCHA**: Cannot solve visual CAPTCHAs. Returns partial results with an explanation.
- **JavaScript-heavy SPAs**: The default HTTP executor doesn't render JS. Use it with Playwright/Selenium for full browser support (see the `ActionExecutor` class in `webscrape_agent.py`).
- **Private networks**: Cannot access internal/intranet URLs. Returns a clear failure message.
- **Very large pages**: HTML is truncated to ~8K characters to fit the context window, so data on extremely long pages may be missed.
- **Data hallucination**: The model is trained not to invent data, but always verify critical extractions.

## Files in This Repo

| File | Purpose |
|---|---|
| `webscrape_agent.py` | Runtime inference loop: Python API and CLI |
| `train.py` | Standalone training script (CLI with args) |
| `WebScrapeAgent_Training.ipynb` | Colab/Kaggle training notebook |
| `evaluate.py` | Evaluation script testing all 4 core skills |
| `prepare_data.py` | Dataset preparation pipeline (builds the training data) |

## License

Apache 2.0 (same as base model)