---
license: apache-2.0
language:
- en
tags:
- web-scraping
- html-extraction
- agent
- structured-data
- qwen2.5
- unsloth
- lora
datasets:
- sukritvemula/webscrape-agent-training-data
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
---
# 🕷️ WebScrapeAgent-7B-v1
**An autonomous web scraping agent** built on Qwen2.5-7B-Instruct and fine-tuned to extract structured data from web pages.
Give it a URL and a description of what you want, and it returns JSON data together with a status that tells you whether the extraction fully succeeded.
## What It Does
| Capability | Description |
|---|---|
| **HTML Reading** | Understands page structure: tables, nested divs, lists, forms, data attributes, malformed HTML |
| **Action Sequencing** | Decides what tools to call and in what order to get the data |
| **Authentication** | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
| **Error Recovery** | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |
## How It Works
The model operates in an **action loop**:
```
User: "Extract product listings from example.com/shop"
↓
Model: Let me navigate there first.
ACTION: NAVIGATE {"url": "example.com/shop"}
↓
System: HTTP 200 OK. ...
↓
Model: I see product cards. Let me extract the data.
ACTION: RETURN_RESULT {"status": "success", "data": [...]}
```
Each final result includes a **status** (`success`, `partial`, or `failed`), so the caller always knows where things stand.
Each job is capped at 10 steps. If the model cannot finish within that limit, it returns what it has collected along with an explanation.
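
The loop described above can be sketched in a few lines. This is a minimal illustration, not the code in `webscrape_agent.py`: the `ACTION: NAME {json}` format is taken from the transcript above, while `parse_action`, `run_job`, the `generate` and `execute` callables, and the use of the `user` role for tool observations are all assumptions made for this example.

```python
import json
import re
from typing import Callable, Optional, Tuple

MAX_STEPS = 10  # step cap stated in this card


def parse_action(text: str) -> Optional[Tuple[str, dict]]:
    """Return (name, params) for the first ACTION in text, or None if absent or malformed."""
    match = re.search(r"ACTION:\s*([A-Z_]+)\s*", text)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        params, _ = json.JSONDecoder().raw_decode(text, match.end())
    except json.JSONDecodeError:
        return None
    return (match.group(1), params) if isinstance(params, dict) else None


def run_job(
    task: str,
    generate: Callable[[list], str],       # hypothetical: messages -> model reply
    execute: Callable[[str, dict], str],   # hypothetical: action -> observation text
) -> dict:
    """Alternate model turns and tool observations until RETURN_RESULT or the step cap."""
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        action = parse_action(reply)
        if action is None:
            # Feed the problem back instead of crashing on unparseable output
            messages.append({"role": "user", "content": "No valid ACTION found."})
            continue
        name, params = action
        if name == "RETURN_RESULT":
            return params
        # Assumption: observations are passed back as user-role messages
        messages.append({"role": "user", "content": execute(name, params)})
    return {"status": "failed", "data": None, "message": f"Stopped after {MAX_STEPS} steps."}
```

The fallback here always reports `failed` because the sketch does not track intermediate data; a fuller executor would presumably return `partial` with whatever it had gathered.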
## Quick Start
### Python API
```python
from webscrape_agent import WebScrapeAgent

agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

result = agent.scrape(
    url="https://example.com/products",
    task="Extract all product names, prices, and ratings",
    # JSON Schema describing the desired output shape
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "rating": {"type": "string"},
            },
        },
    },
)

print(result.status)   # "success" | "partial" | "failed"
print(result.data)     # Extracted JSON data
print(result.message)  # Human-readable explanation
```
### With Authentication
```python
# Cookie-based auth
result = agent.scrape(
    url="https://dashboard.example.com/analytics",
    task="Get my usage statistics",
    auth={"method": "cookies", "cookies": {"session_id": "abc123"}},
)

# API token
result = agent.scrape(
    url="https://api.example.com/v2/data",
    task="Get all user records",
    auth={"method": "token", "token": "sk-xxx"},
)
```
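
One way to read the `auth` argument is as shorthand for the `SET_COOKIES` and `SET_HEADERS` actions listed under Available Actions. The sketch below shows that mapping. `auth_to_action` is a hypothetical helper, and sending the token as a `Bearer` header is an assumption based on the `SET_HEADERS` example, not confirmed behaviour of `webscrape_agent.py`.

```python
from typing import Tuple


def auth_to_action(auth: dict) -> Tuple[str, dict]:
    """Translate an auth dict into an (action_name, params) pair. Hypothetical helper."""
    method = auth.get("method")
    if method == "cookies":
        # Copy so later mutation of the caller's dict does not leak into the action
        return "SET_COOKIES", {"cookies": dict(auth["cookies"])}
    if method == "token":
        # Assumption: tokens are sent as a Bearer Authorization header
        return "SET_HEADERS", {"headers": {"Authorization": f"Bearer {auth['token']}"}}
    raise ValueError(f"Unsupported auth method: {method!r}")
```

Raising on an unknown method keeps a typo such as `"cookie"` from silently producing an unauthenticated request.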
### CLI
```bash
python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"
```
### Direct Model Usage
```python
# Import unsloth first so its patches apply before transformers is loaded
import unsloth
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastLanguageModel.from_pretrained(
    "sukritvemula/WebScrapeAgent-7B-v1",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

messages = [
    {"role": "system", "content": "You are WebScrapeAgent..."},
    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# temperature only takes effect when sampling is enabled
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.3,
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Available Actions
The model can output these actions in its loop:
| Action | Purpose | Example Params |
|---|---|---|
| `NAVIGATE` | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
| `CLICK` | Click an element | `{"selector": "#submit-btn"}` |
| `FILL_FORM` | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
| `WAIT` | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
| `SET_COOKIES` | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
| `SET_HEADERS` | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
| `LOAD_BROWSER_PROFILE` | Use saved browser session | `{"profile_name": "work-chrome"}` |
| `EXECUTE_JS` | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
| `SCROLL` | Scroll the page | `{"direction": "down", "amount": 500}` |
| `SWITCH_STRATEGY` | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
| `RETURN_RESULT` | Return final data | `{"status": "success", "data": [...], "message": "..."}` |
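
Because the executor acts on whatever the model emits, it may help to check each action against the table above before running it. A minimal sketch follows. The required-parameter sets are inferred from the Example Params column and are assumptions, and `validate_action` is a hypothetical helper rather than part of this repo.

```python
# Assumed minimum parameters per action, inferred from the table above
REQUIRED_PARAMS = {
    "NAVIGATE": {"url"},
    "CLICK": {"selector"},
    "FILL_FORM": {"selector", "fields"},
    "WAIT": {"selector"},
    "SET_COOKIES": {"cookies"},
    "SET_HEADERS": {"headers"},
    "LOAD_BROWSER_PROFILE": {"profile_name"},
    "EXECUTE_JS": {"script"},
    "SCROLL": {"direction"},
    "SWITCH_STRATEGY": {"new_strategy"},
    "RETURN_RESULT": {"status"},
}
VALID_STATUSES = {"success", "partial", "failed"}


def validate_action(name: str, params: dict) -> list:
    """Return a list of problems with an action; an empty list means it looks well-formed."""
    if name not in REQUIRED_PARAMS:
        return [f"unknown action: {name}"]
    errors = []
    missing = sorted(REQUIRED_PARAMS[name] - set(params))
    if missing:
        errors.append(f"{name} is missing required params: {', '.join(missing)}")
    if name == "RETURN_RESULT" and "status" in params:
        if params["status"] not in VALID_STATUSES:
            errors.append(f"invalid status: {params['status']!r}")
    return errors
```

Returning a list of messages rather than raising lets the caller feed the problems back to the model as an observation and give it a chance to correct the action.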
## Training Details
### Recipe
Based on three published approaches:
| Paper | Key Contribution | Result |
|---|---|---|
| [ScrapeGraphAI-100k](https://arxiv.org/abs/2602.15189) | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
| [BrowserAgent](https://arxiv.org/abs/2510.10666) | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
| [A3-Annotators](https://arxiv.org/abs/2604.07776) | Assistant-token-only loss + thought chains | 41.5% WebArena |
### Hyperparameters
| Parameter | Value | Source |
|---|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct | - |
| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
| LoRA rank | 32 | Increased from paper's 16 for structured output complexity |
| LoRA alpha | 32 | Standard (= rank) |
| LoRA targets | All linear (q,k,v,o,gate,up,down) | ScrapeGraphAI + A3 |
| Learning rate | 1e-4 | ScrapeGraphAI |
| LR schedule | Cosine with 3% warmup | A3-Annotators |
| Optimizer | AdamW 8-bit | Unsloth best practice |
| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
| Effective batch | 16 | - |
| Max seq length | 4096 | - |
| Loss | Completion-only (assistant tokens) | All three papers |
| Gradient checkpointing | Unsloth custom | - |
### Training Data
**Dataset**: [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data) (45,624 examples)
| Source | Count | % | What It Teaches |
|---|---|---|---|
| [ScrapeGraphAI-100k](https://huggingface.co/datasets/scrapegraphai/scrapegraph-100k-finetuning) | 25,244 | 55.3% | HTML→JSON extraction across real websites |
| [BrowserAgent-Data](https://huggingface.co/datasets/TIGER-Lab/BrowserAgent-Data) | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |
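
As a quick check of these numbers, the source counts sum to the stated 45,624 examples. Combined with the effective batch of 16, 2 epochs, and 3% warmup from the Hyperparameters table, they imply roughly the schedule computed below. The step counts are back-of-the-envelope estimates that assume every example is used once per epoch with nothing held out; the actual trainer may count slightly differently.

```python
import math

counts = {
    "ScrapeGraphAI-100k": 25_244,
    "BrowserAgent-Data": 20_361,
    "Synthetic scenarios": 19,
}
total = sum(counts.values())
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

EFFECTIVE_BATCH = 16   # from the Hyperparameters table
EPOCHS = 2
WARMUP_RATIO = 0.03

# Assumption: the last, partially filled batch still counts as one optimizer step
steps_per_epoch = math.ceil(total / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

print(f"total examples:  {total}")
print(f"dataset shares:  {shares}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup steps:    {warmup_steps}")
```

Under these assumptions a full run is on the order of 5,700 optimizer steps, which is useful when estimating whether a job fits within a free-tier session limit.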
### Training Infrastructure
Designed for **free GPU** platforms:
- ✅ Google Colab (T4, 16GB)
- ✅ Kaggle (T4/P100, 16GB)
- ✅ Any 16GB+ GPU
## How to Train
### Option 1: Colab Notebook (Easiest)
Open `WebScrapeAgent_Training.ipynb` in Google Colab with a T4 GPU runtime. Everything is preconfigured; just run all cells.
### Option 2: Command Line
```bash
pip install unsloth trl peft transformers accelerate datasets bitsandbytes
# Train with defaults (pushes to Hub)
python train.py
# Custom settings
python train.py \
--model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
--output your-username/WebScrapeAgent-7B-custom \
--epochs 3 \
--lr 5e-5 \
--lora-r 64 \
--batch-size 2 \
--grad-accum 8
# Save locally only (no Hub push)
python train.py --no-push --save-local ./my-model
```
## Limitations
- **CAPTCHA**: Cannot solve visual CAPTCHAs. Returns partial results with explanation.
- **JavaScript-heavy SPAs**: The default HTTP executor doesn't render JS. Use with Playwright/Selenium for full browser support (see `ActionExecutor` class in `webscrape_agent.py`).
- **Private networks**: Cannot access internal/intranet URLs. Returns clear failure message.
- **Very large pages**: HTML truncated to ~8K chars to fit context window. May miss data on extremely long pages.
- **Data hallucination**: The model is trained not to invent data, but it can still produce values that are not on the page. Always verify critical extractions against the source.
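
The page-size limit can be illustrated with a small pre-processing sketch. It is not the repo's implementation: `shrink_html` is a hypothetical helper that drops `<script>` blocks, `<style>` blocks, and comments before cutting at an assumed budget of 8,000 characters, so that more of the budget goes to visible content. The regex is a heuristic and will not handle every malformed page.

```python
import re
from typing import Tuple

MAX_HTML_CHARS = 8_000  # assumed budget, matching the ~8K chars mentioned above

# Heuristic: script/style blocks and comments rarely contain extractable data
_NOISE_RE = re.compile(
    r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
    re.DOTALL | re.IGNORECASE,
)


def shrink_html(html: str, max_chars: int = MAX_HTML_CHARS) -> Tuple[str, bool]:
    """Strip noisy blocks, collapse whitespace, and cut to max_chars.

    Returns the shortened HTML and a flag saying whether anything was cut off.
    """
    cleaned = _NOISE_RE.sub("", html)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    truncated = len(cleaned) > max_chars
    return cleaned[:max_chars], truncated
```

The returned flag gives the caller a way to report `partial` rather than `success` when the page did not fit, which ties in with the status values described earlier.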
## Files in This Repo
| File | Purpose |
|---|---|
| `webscrape_agent.py` | Runtime inference loop (Python API and CLI) |
| `train.py` | Standalone training script (CLI with args) |
| `WebScrapeAgent_Training.ipynb` | Colab/Kaggle training notebook |
| `evaluate.py` | Evaluation script testing all 4 core skills |
| `prepare_data.py` | Dataset preparation pipeline (builds the training data) |
## License
Apache 2.0 (same as base model)