---
license: apache-2.0
language:
- en
tags:
- web-scraping
- html-extraction
- agent
- structured-data
- qwen2.5
- unsloth
- lora
datasets:
- sukritvemula/webscrape-agent-training-data
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
---

# 🕷️ WebScrapeAgent-7B-v1

**An autonomous web scraping agent** built on Qwen2.5-7B-Instruct, fine-tuned to extract structured data from web pages.

Give it a URL and a description of what you want, and it returns structured JSON together with a status that tells you whether the extraction fully succeeded.

## What It Does

| Capability | Description |
|---|---|
| **HTML Reading** | Understands page structure — tables, nested divs, lists, forms, data attributes, malformed HTML |
| **Action Sequencing** | Decides what tools to call and in what order to get the data |
| **Authentication** | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
| **Error Recovery** | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |

## How It Works

The model operates in an **action loop**:

```
User: "Extract product listings from example.com/shop"
  ↓
Model: <thought>Let me navigate there first.</thought>
       ACTION: NAVIGATE {"url": "https://example.com/shop"}
  ↓
System: HTTP 200 OK. <html>...</html>
  ↓
Model: <thought>I see product cards. Let me extract the data.</thought>
       ACTION: RETURN_RESULT {"status": "success", "data": [...]}
```

Every final result (the `RETURN_RESULT` action) carries a **status** of `success`, `partial`, or `failed`, so the caller always knows where things stand.

Each job is capped at 10 steps. If the model can't finish within that budget, it returns what it has along with an explanation of what went wrong.
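
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `webscrape_agent.py`: the `ACTION: NAME {json}` wire format is inferred from the transcript, `generate` and `execute` are hypothetical callables supplied by the caller, feeding observations back in the `user` role is an assumption, and so is the `failed` result returned when the budget runs out.

```python
import json
import re

MAX_STEPS = 10  # step budget stated in this section

# Matches "ACTION: NAME {json}" as shown in the transcript (assumed format).
ACTION_RE = re.compile(r"ACTION:\s*([A-Z_]+)\s*(\{.*\})", re.DOTALL)


def parse_action(model_output):
    """Return (action_name, params), or None if no well-formed action is found."""
    match = ACTION_RE.search(model_output)
    if match is None:
        return None
    try:
        params = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    return match.group(1), params


def run_loop(generate, execute, task):
    """Drive the loop until RETURN_RESULT or until the step budget is spent.

    generate(history) -> str and execute(name, params) -> str are
    hypothetical callables: the model call and the tool executor.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        output = generate(history)
        history.append({"role": "assistant", "content": output})
        parsed = parse_action(output)
        if parsed is None:
            # Tell the model its output was unusable and let it retry.
            history.append({"role": "user", "content": "No valid ACTION found."})
            continue
        name, params = parsed
        if name == "RETURN_RESULT":
            return params
        # Feed the tool observation back as the next turn.
        history.append({"role": "user", "content": execute(name, params)})
    return {"status": "failed", "data": None, "message": "Step budget exhausted."}
```

Malformed output consumes a step in this sketch, which keeps the total number of model calls bounded by `MAX_STEPS`.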

## Quick Start

### Python API

```python
from webscrape_agent import WebScrapeAgent

agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

result = agent.scrape(
    url="https://example.com/products",
    task="Extract all product names, prices, and ratings",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "rating": {"type": "string"}
            }
        }
    }
)

print(result.status)   # "success" | "partial" | "failed"
print(result.data)     # Clean JSON data
print(result.message)  # Human-readable explanation
```

### With Authentication

```python
# Cookie-based auth
result = agent.scrape(
    url="https://dashboard.example.com/analytics",
    task="Get my usage statistics",
    auth={"method": "cookies", "cookies": {"session_id": "abc123"}}
)

# API token
result = agent.scrape(
    url="https://api.example.com/v2/data",
    task="Get all user records",
    auth={"method": "token", "token": "sk-xxx"}
)
```
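
One plausible way to connect the `auth` argument to the action vocabulary is to translate it into setup actions that run before the first `NAVIGATE`. The sketch below is an assumption about that mapping, not the behavior of `agent.scrape`: `auth_to_actions` is a hypothetical helper, and the `Bearer` scheme is a guess that some APIs will not accept.

```python
def auth_to_actions(auth):
    """Translate an `auth` dict into (action_name, params) setup actions."""
    method = auth.get("method")
    if method == "cookies":
        return [("SET_COOKIES", {"cookies": auth["cookies"]})]
    if method == "token":
        # Bearer scheme is an assumption; adjust for APIs that expect another header.
        headers = {"Authorization": f"Bearer {auth['token']}"}
        return [("SET_HEADERS", {"headers": headers})]
    raise ValueError(f"Unsupported auth method: {method!r}")


print(auth_to_actions({"method": "cookies", "cookies": {"session_id": "abc123"}}))
```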

### CLI

```bash
python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"
```

### Direct Model Usage

```python
import unsloth
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "sukritvemula/WebScrapeAgent-7B-v1",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

messages = [
    {"role": "system", "content": "You are WebScrapeAgent..."},
    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
# do_sample=True is needed for temperature to take effect; without it decoding is greedy
outputs = model.generate(input_ids=inputs, max_new_tokens=1024, do_sample=True, temperature=0.3)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

## Available Actions

The model can output these actions in its loop:

| Action | Purpose | Example Params |
|---|---|---|
| `NAVIGATE` | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
| `CLICK` | Click an element | `{"selector": "#submit-btn"}` |
| `FILL_FORM` | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
| `WAIT` | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
| `SET_COOKIES` | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
| `SET_HEADERS` | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
| `LOAD_BROWSER_PROFILE` | Use saved browser session | `{"profile_name": "work-chrome"}` |
| `EXECUTE_JS` | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
| `SCROLL` | Scroll the page | `{"direction": "down", "amount": 500}` |
| `SWITCH_STRATEGY` | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
| `RETURN_RESULT` | Return final data | `{"status": "success", "data": [...], "message": "..."}` |
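
Because the model emits actions as free text, an executor may want to check each one before running it. The sketch below validates an action against the table above. Which parameters count as required is an assumption (the table only gives example params), and `validate_action` is a hypothetical helper rather than part of this repo.

```python
# Assumed minimum params per action, derived from the example params in the table.
REQUIRED_PARAMS = {
    "NAVIGATE": ["url"],
    "CLICK": ["selector"],
    "FILL_FORM": ["selector", "fields"],
    "WAIT": ["selector"],
    "SET_COOKIES": ["cookies"],
    "SET_HEADERS": ["headers"],
    "LOAD_BROWSER_PROFILE": ["profile_name"],
    "EXECUTE_JS": ["script"],
    "SCROLL": ["direction"],
    "SWITCH_STRATEGY": ["new_strategy"],
    "RETURN_RESULT": ["status"],
}
VALID_STATUSES = {"success", "partial", "failed"}


def validate_action(name, params):
    """Return a list of problems; an empty list means the action looks well-formed."""
    if name not in REQUIRED_PARAMS:
        return [f"unknown action {name!r}"]
    problems = [
        f"{name} is missing required param {key!r}"
        for key in REQUIRED_PARAMS[name]
        if key not in params
    ]
    if name == "RETURN_RESULT" and "status" in params:
        if params["status"] not in VALID_STATUSES:
            problems.append(f"invalid status {params['status']!r}")
    return problems
```

Returning a list of problems instead of raising makes it easy to send the full error text back to the model as the next observation.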

## Training Details

### Recipe

The recipe draws on three published approaches:

| Paper | Key Contribution | Result |
|---|---|---|
| [ScrapeGraphAI-100k](https://arxiv.org/abs/2602.15189) | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
| [BrowserAgent](https://arxiv.org/abs/2510.10666) | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
| [A3-Annotators](https://arxiv.org/abs/2604.07776) | Assistant-token-only loss + thought chains | 41.5% WebArena |

### Hyperparameters

| Parameter | Value | Source |
|---|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct | — |
| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
| LoRA rank | 32 | Increased from paper's 16 for structured output complexity |
| LoRA alpha | 32 | Standard (= rank) |
| LoRA targets | All linear (q,k,v,o,gate,up,down) | ScrapeGraphAI + A3 |
| Learning rate | 1e-4 | ScrapeGraphAI |
| LR schedule | Cosine with 3% warmup | A3-Annotators |
| Optimizer | AdamW 8-bit | Unsloth best practice |
| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
| Effective batch | 16 | — |
| Max seq length | 4096 | — |
| Loss | Completion-only (assistant tokens) | All three papers |
| Gradient checkpointing | Unsloth custom | — |
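
To get a feel for what rank 32 on all linear layers means, the number of trainable LoRA parameters can be estimated by hand: each adapted layer of shape `in × out` adds `r × (in + out)` weights. The layer dimensions below are assumptions taken from the base model's published config (hidden size 3584, MLP size 18944, 28 layers, key/value projection width 512); check them against `config.json` before relying on the figure.

```python
# Assumed dimensions of Qwen2.5-7B-Instruct (verify against the model's config.json).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3584, 18944, 512, 28

# (in_features, out_features) for each LoRA target listed in the table.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}
LORA_R, LORA_ALPHA = 32, 32


def lora_params(rank):
    """Trainable weights: rank * (in + out) per adapted layer, summed over all layers."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGETS.values())
    return per_layer * NUM_LAYERS


scaling = LORA_ALPHA / LORA_R  # standard LoRA scaling factor alpha / r
print(lora_params(LORA_R))     # 80740352
print(scaling)                 # 1.0
```

Under these assumptions the adapter holds roughly 81M trainable weights, twice what rank 16 would give, while alpha = rank keeps the update scale at 1.0.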

### Training Data

**Dataset**: [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data) (45,624 examples)

| Source | Count | % | What It Teaches |
|---|---|---|---|
| [ScrapeGraphAI-100k](https://huggingface.co/datasets/scrapegraphai/scrapegraph-100k-finetuning) | 25,244 | 55.3% | HTML→JSON extraction across real websites |
| [BrowserAgent-Data](https://huggingface.co/datasets/TIGER-Lab/BrowserAgent-Data) | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |
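
Combining the dataset size with the hyperparameters gives a rough schedule length. The arithmetic below assumes every example is used for training (no held-out split), that the final partial batch is kept, and that warmup steps are rounded up; the actual `train.py` may differ on all three.

```python
import math

# Counts from the table above.
SOURCES = {
    "ScrapeGraphAI-100k": 25_244,
    "BrowserAgent-Data": 20_361,
    "Synthetic scenarios": 19,
}
NUM_EXAMPLES = sum(SOURCES.values())  # 45624
EFFECTIVE_BATCH = 16
EPOCHS = 2
WARMUP_RATIO = 0.03

steps_per_epoch = math.ceil(NUM_EXAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(steps_per_epoch, total_steps, warmup_steps)  # 2852 5704 172
for name, count in SOURCES.items():
    print(f"{name}: {100 * count / NUM_EXAMPLES:.2f}%")
```

At about 5,700 optimizer steps, the 19 synthetic scenarios are each seen only twice, so auth handling and error recovery are learned from very few examples.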

### Training Infrastructure

Designed for **free GPU** platforms:
- ✅ Google Colab (T4, 16GB)
- ✅ Kaggle (T4/P100, 16GB)
- ✅ Any 16GB+ GPU

## How to Train

### Option 1: Colab Notebook (Easiest)

Open `WebScrapeAgent_Training.ipynb` in Google Colab with a T4 GPU runtime. Everything is set up — just run all cells.

### Option 2: Command Line

```bash
pip install unsloth trl peft transformers accelerate datasets bitsandbytes

# Train with defaults (pushes to Hub)
python train.py

# Custom settings
python train.py \
    --model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
    --output your-username/WebScrapeAgent-7B-custom \
    --epochs 3 \
    --lr 5e-5 \
    --lora-r 64 \
    --batch-size 2 \
    --grad-accum 8

# Save locally only (no Hub push)
python train.py --no-push --save-local ./my-model
```

## Limitations

- **CAPTCHA**: Cannot solve visual CAPTCHAs. Returns partial results with explanation.
- **JavaScript-heavy SPAs**: The default HTTP executor doesn't render JS. Use with Playwright/Selenium for full browser support (see `ActionExecutor` class in `webscrape_agent.py`).
- **Private networks**: Cannot access internal/intranet URLs. Returns clear failure message.
- **Very large pages**: HTML truncated to ~8K chars to fit context window. May miss data on extremely long pages.
- **Data hallucination**: The model is trained not to invent data, but it can still produce values that are not on the page. Always verify critical extractions.
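
For the large-page limitation, stripping obvious noise before truncating can leave more room for actual content. The sketch below is one way to do that with the standard library; it is not necessarily what `webscrape_agent.py` does, `shrink_html` is a hypothetical helper, and regex-based stripping is a crude heuristic that can misfire on unusual markup.

```python
import re

MAX_HTML_CHARS = 8_000  # approximate budget from the limitation above

# Remove <script>...</script> and <style>...</style> blocks, which rarely hold target data.
_NOISE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
_WS_RE = re.compile(r"\s+")


def shrink_html(html, limit=MAX_HTML_CHARS):
    """Drop script/style blocks, collapse whitespace, then hard-truncate to `limit` chars."""
    cleaned = _NOISE_RE.sub("", html)
    cleaned = _WS_RE.sub(" ", cleaned).strip()
    return cleaned[:limit]


page = "<html><script>var x = 1;</script><p>Hello   world</p></html>"
print(shrink_html(page))  # <html><p>Hello world</p></html>
```

Pages that embed their data as JSON inside `<script>` tags would lose it under this approach, so the stripping step should be optional.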

## Files in This Repo

| File | Purpose |
|---|---|
| `webscrape_agent.py` | Runtime inference loop — Python API and CLI |
| `train.py` | Standalone training script (CLI with args) |
| `WebScrapeAgent_Training.ipynb` | Colab/Kaggle training notebook |
| `evaluate.py` | Evaluation script testing all 4 core skills |
| `prepare_data.py` | Dataset preparation pipeline (builds the training data) |

## License

Apache 2.0 (same as base model)