	---
	license: apache-2.0
	language:
	- en
	tags:
	- web-scraping
	- html-extraction
	- agent
	- structured-data
	- qwen2.5
	- unsloth
	- lora
	datasets:
	- sukritvemula/webscrape-agent-training-data
	base_model: Qwen/Qwen2.5-7B-Instruct
	pipeline_tag: text-generation
	---

	# 🕷️ WebScrapeAgent-7B-v1

	An autonomous web scraping agent built on Qwen2.5-7B-Instruct and fine-tuned to extract structured data from web pages.

	Give it a URL and a description of what you want, and it returns JSON data together with a status (`success`, `partial`, or `failed`).

	## What It Does

	| Capability | Description |
	|---|---|
	| HTML Reading | Understands page structure: tables, nested divs, lists, forms, data attributes, malformed HTML |
	| Action Sequencing | Decides which tools to call, and in what order, to get the data |
	| Authentication | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
	| Error Recovery | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |

	## How It Works

	The model operates in an action loop:

	```
	User: "Extract product listings from example.com/shop"
	↓
	Model: <thought>Let me navigate there first.</thought>
	ACTION: NAVIGATE {"url": "example.com/shop"}
	↓
	System: HTTP 200 OK. <html>...</html>
	↓
	Model: <thought>I see product cards. Let me extract the data.</thought>
	ACTION: RETURN_RESULT {"status": "success", "data": [...]}
	```

	Every final result carries a status (`success`, `partial`, or `failed`), so the caller always knows where things stand.

	Each job is limited to 10 steps. If the agent can't finish within that budget, it returns what it has with a clear explanation.
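
The loop above can be sketched from the caller's side. This is a minimal illustration, not the code in `webscrape_agent.py`: the names `parse_action` and `run_job`, the `generate` and `execute` callables, and the choice to report `failed` when the step cap is hit are all assumptions made for the example.

```python
import json
import re
from typing import Callable

MAX_STEPS = 10  # step cap stated in the text above

# Matches the prefix of lines such as: ACTION: NAVIGATE {"url": "..."}
ACTION_RE = re.compile(r"ACTION:\s*([A-Z_]+)\s*")


def parse_action(text: str) -> tuple[str, dict]:
    """Extract the action name and its JSON params from one model response."""
    match = ACTION_RE.search(text)
    if match is None:
        raise ValueError("no ACTION line found in model output")
    # raw_decode parses one JSON value starting at the given index and
    # ignores anything that follows it.
    params, _ = json.JSONDecoder().raw_decode(text, match.end())
    return match.group(1), params


def run_job(
    generate: Callable[[list], str],      # hypothetical: messages -> model reply
    execute: Callable[[str, dict], str],  # hypothetical: action -> observation text
    task: str,
) -> dict:
    """Drive the loop until RETURN_RESULT or until the step cap is reached."""
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        name, params = parse_action(reply)
        if name == "RETURN_RESULT":
            return params
        # Feed the tool observation back to the model as the next turn.
        messages.append({"role": "user", "content": execute(name, params)})
    # Assumed fallback when the model never returns a result.
    return {"status": "failed", "data": None, "message": f"step limit of {MAX_STEPS} reached"}
```

Keeping the cap in the driver rather than in the prompt means a misbehaving generation cannot loop forever, whatever the model outputs.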

	## Quick Start

	### Python API

	```python
	from webscrape_agent import WebScrapeAgent

	agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

	result = agent.scrape(
	    url="https://example.com/products",
	    task="Extract all product names, prices, and ratings",
	    schema={
	        "type": "array",
	        "items": {
	            "type": "object",
	            "properties": {
	                "name": {"type": "string"},
	                "price": {"type": "string"},
	                "rating": {"type": "string"},
	            },
	        },
	    },
	)

	print(result.status)   # "success" | "partial" | "failed"
	print(result.data)     # Clean JSON data
	print(result.message)  # Human-readable explanation
	```
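
Because extractions should be verified before use, a caller may want to check the returned data against the schema it asked for. The sketch below is a hypothetical helper (`check_items` is not part of this repo) that only understands the schema shape used above, an array of objects with string properties. It is stricter than real JSON Schema in one respect: it treats every listed property as required.

```python
def check_items(data, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the data fits the schema shape."""
    if schema.get("type") != "array":
        return ["only array schemas are handled by this sketch"]
    if not isinstance(data, list):
        return ["expected a list"]

    props = schema.get("items", {}).get("properties", {})
    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i}: expected an object")
            continue
        for key, spec in props.items():
            if key not in item:
                problems.append(f"item {i}: missing '{key}'")
            elif spec.get("type") == "string" and not isinstance(item[key], str):
                problems.append(f"item {i}: '{key}' is not a string")
    return problems


PRODUCT_SCHEMA = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "string"},
            "rating": {"type": "string"},
        },
    },
}

# A well-formed item yields no problems.
print(check_items([{"name": "Mug", "price": "$9", "rating": "4.5"}], PRODUCT_SCHEMA))
```

For anything beyond this shape, a full JSON Schema validator is the better choice.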

	### With Authentication

	```python
	# Cookie-based auth
	result = agent.scrape(
	    url="https://dashboard.example.com/analytics",
	    task="Get my usage statistics",
	    auth={"method": "cookies", "cookies": {"session_id": "abc123"}},
	)

	# API token
	result = agent.scrape(
	    url="https://api.example.com/v2/data",
	    task="Get all user records",
	    auth={"method": "token", "token": "sk-xxx"},
	)
	```
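
One plausible way the `auth` argument could map onto the agent's `SET_COOKIES` and `SET_HEADERS` actions is sketched below. The function `auth_to_action` is hypothetical, and sending the token as a `Bearer` header is an assumption based on the example params in the action table; the actual handling in `webscrape_agent.py` may differ.

```python
def auth_to_action(auth: dict) -> tuple[str, dict]:
    """Translate an auth config into the first action to run (illustrative only)."""
    method = auth.get("method")
    if method == "cookies":
        return "SET_COOKIES", {"cookies": auth["cookies"]}
    if method == "token":
        # Assumed convention: send the token as a Bearer Authorization header.
        return "SET_HEADERS", {"headers": {"Authorization": f"Bearer {auth['token']}"}}
    raise ValueError(f"unsupported auth method: {method!r}")


print(auth_to_action({"method": "cookies", "cookies": {"session_id": "abc123"}}))
```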

	### CLI

	```bash
	python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"
	```

	### Direct Model Usage

	```python
	import unsloth
	from unsloth import FastLanguageModel
	from unsloth.chat_templates import get_chat_template

	model, tokenizer = FastLanguageModel.from_pretrained(
	    "sukritvemula/WebScrapeAgent-7B-v1",
	    max_seq_length=4096,
	    load_in_4bit=True,
	)
	FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
	tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

	messages = [
	    {"role": "system", "content": "You are WebScrapeAgent..."},
	    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
	]

	inputs = tokenizer.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_tensors="pt",
	).to("cuda")
	# do_sample=True makes the temperature setting take effect explicitly
	outputs = model.generate(input_ids=inputs, max_new_tokens=1024, do_sample=True, temperature=0.3)
	# Decode only the newly generated tokens, skipping the prompt
	print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
	```

	## Available Actions

	The model can output these actions in its loop:

	| Action | Purpose | Example Params |
	|---|---|---|
	| `NAVIGATE` | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
	| `CLICK` | Click an element | `{"selector": "#submit-btn"}` |
	| `FILL_FORM` | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
	| `WAIT` | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
	| `SET_COOKIES` | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
	| `SET_HEADERS` | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
	| `LOAD_BROWSER_PROFILE` | Use saved browser session | `{"profile_name": "work-chrome"}` |
	| `EXECUTE_JS` | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
	| `SCROLL` | Scroll the page | `{"direction": "down", "amount": 500}` |
	| `SWITCH_STRATEGY` | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
	| `RETURN_RESULT` | Return final data | `{"status": "success", "data": [...], "message": "..."}` |
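
Since the model emits free text, an executor will usually want to validate each action before running it. The sketch below is illustrative: the table only gives example params, so which keys count as required here is an assumption, and `validate_action` is a hypothetical helper rather than something shipped in this repo.

```python
# Assumed minimum params per action, inferred from the example column above.
REQUIRED_PARAMS = {
    "NAVIGATE": {"url"},
    "CLICK": {"selector"},
    "FILL_FORM": {"selector", "fields"},
    "WAIT": {"selector"},
    "SET_COOKIES": {"cookies"},
    "SET_HEADERS": {"headers"},
    "LOAD_BROWSER_PROFILE": {"profile_name"},
    "EXECUTE_JS": {"script"},
    "SCROLL": {"direction"},
    "SWITCH_STRATEGY": {"new_strategy"},
    "RETURN_RESULT": {"status"},
}
VALID_STATUSES = {"success", "partial", "failed"}


def validate_action(name: str, params: dict) -> list[str]:
    """Return a list of problems with a parsed action; empty means it looks runnable."""
    if name not in REQUIRED_PARAMS:
        return [f"unknown action: {name}"]
    missing = sorted(REQUIRED_PARAMS[name] - set(params))
    problems = [f"{name}: missing '{key}'" for key in missing]
    if name == "RETURN_RESULT" and "status" in params and params["status"] not in VALID_STATUSES:
        problems.append(f"{name}: invalid status '{params['status']}'")
    return problems


print(validate_action("FILL_FORM", {"selector": "#login"}))
```

Returning problems as text, instead of raising, lets the caller pass them back to the model as an observation so it can correct itself on the next step.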

	## Training Details

	### Recipe

	Based on three published approaches:

	| Paper | Key Contribution | Result |
	|---|---|---|
	| [ScrapeGraphAI-100k](https://arxiv.org/abs/2602.15189) | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
	| [BrowserAgent](https://arxiv.org/abs/2510.10666) | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
	| [A3-Annotators](https://arxiv.org/abs/2604.07776) | Assistant-token-only loss + thought chains | 41.5% WebArena |

	### Hyperparameters

	| Parameter | Value | Source |
	|---|---|---|
	| Base model | Qwen/Qwen2.5-7B-Instruct | - |
	| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
	| LoRA rank | 32 | Increased from paper's 16 for structured output complexity |
	| LoRA alpha | 32 | Standard (= rank) |
	| LoRA targets | All linear (q,k,v,o,gate,up,down) | ScrapeGraphAI + A3 |
	| Learning rate | 1e-4 | ScrapeGraphAI |
	| LR schedule | Cosine with 3% warmup | A3-Annotators |
	| Optimizer | AdamW 8-bit | Unsloth best practice |
	| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
	| Effective batch | 16 | - |
	| Max seq length | 4096 | - |
	| Loss | Completion-only (assistant tokens) | All three papers |
	| Gradient checkpointing | Unsloth custom | - |
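
The "completion-only" loss in the table means that only assistant tokens contribute to the training loss. A minimal sketch of the idea follows, assuming the assistant token spans are already known; `mask_non_assistant` is a hypothetical helper, and `train.py` presumably relies on a library utility rather than code like this.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_assistant(input_ids: list[int], assistant_spans: list[tuple[int, int]]) -> list[int]:
    """Build labels that keep only the tokens inside assistant spans.

    Spans are half-open [start, end) index pairs into input_ids.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels


# Tokens 0-1 are the prompt, 2-3 the assistant reply, 4-5 the next user turn.
print(mask_non_assistant([10, 11, 12, 13, 14, 15], [(2, 4)]))
# [-100, -100, 12, 13, -100, -100]
```

With these labels the model still sees system prompts, user turns, and tool observations as context, but it is never trained to reproduce them.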

	### Training Data

	Dataset: [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data) (45,624 examples)

	| Source | Count | % | What It Teaches |
	|---|---|---|---|
	| [ScrapeGraphAI-100k](https://huggingface.co/datasets/scrapegraphai/scrapegraph-100k-finetuning) | 25,244 | 55.3% | HTML→JSON extraction across real websites |
	| [BrowserAgent-Data](https://huggingface.co/datasets/TIGER-Lab/BrowserAgent-Data) | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
	| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |
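
Combining the dataset size with the hyperparameters above gives a rough idea of the training schedule. The arithmetic below is a sketch under stated assumptions: all 45,624 examples are used for training (no held-out split), the last partial batch is kept, and warmup is rounded up. Real trainers may round differently, so treat the numbers as approximate.

```python
import math


def schedule(num_examples: int, epochs: int, effective_batch: int, warmup_ratio: float):
    """Return (steps_per_epoch, total_steps, warmup_steps) for a simple schedule."""
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


# Effective batch = per-device batch x gradient accumulation (x number of GPUs).
# 2 x 8 on one GPU matches the effective batch of 16 in the table.
effective_batch = 2 * 8

print(schedule(45_624, epochs=2, effective_batch=effective_batch, warmup_ratio=0.03))
# (2852, 5704, 172)
```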

	### Training Infrastructure

	Designed for free GPU platforms:
	- ✅ Google Colab (T4, 16GB)
	- ✅ Kaggle (T4/P100, 16GB)
	- ✅ Any 16GB+ GPU

	## How to Train

	### Option 1: Colab Notebook (Easiest)

	Open `WebScrapeAgent_Training.ipynb` in Google Colab with a T4 GPU runtime. Everything is set up — just run all cells.

	### Option 2: Command Line

	```bash
	pip install unsloth trl peft transformers accelerate datasets bitsandbytes

	# Train with defaults (pushes to Hub)
	python train.py

	# Custom settings
	python train.py \
	  --model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
	  --output your-username/WebScrapeAgent-7B-custom \
	  --epochs 3 \
	  --lr 5e-5 \
	  --lora-r 64 \
	  --batch-size 2 \
	  --grad-accum 8

	# Save locally only (no Hub push)
	python train.py --no-push --save-local ./my-model
	```

	## Limitations

	- CAPTCHA: Cannot solve visual CAPTCHAs. Returns partial results with explanation.
	- JavaScript-heavy SPAs: The default HTTP executor doesn't render JS. Use with Playwright/Selenium for full browser support (see `ActionExecutor` class in `webscrape_agent.py`).
	- Private networks: Cannot access internal/intranet URLs. Returns clear failure message.
	- Very large pages: HTML is truncated to about 8K characters to fit the context window, so data on extremely long pages may be missed.
	- Data hallucination: The model is trained never to invent data, but it can still do so. Always verify critical extractions.
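
To make the truncation limit less painful, a caller can shrink the HTML before it reaches the model. The sketch below is a suggestion, not what `webscrape_agent.py` does: `shrink_html` is a hypothetical helper, the 8,000-character default is taken from the "~8K chars" figure above, and the regex-based stripping is lossy (it also collapses whitespace inside `<pre>` blocks, for example).

```python
import re

# Script/style blocks and HTML comments rarely contain the data being extracted.
_NOISE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>|<!--.*?-->",
    re.DOTALL | re.IGNORECASE,
)


def shrink_html(html: str, limit: int = 8_000) -> tuple[str, bool]:
    """Strip obvious noise, collapse whitespace, then cut to `limit` characters.

    Returns the shortened HTML and a flag saying whether anything was cut off.
    """
    cleaned = _NOISE_RE.sub("", html)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    truncated = len(cleaned) > limit
    return cleaned[:limit], truncated


page = "<p>a</p><script>var x = 1;</script>   <p>b</p>"
print(shrink_html(page))
# ('<p>a</p> <p>b</p>', False)
```

Returning the `truncated` flag lets the caller downgrade a `success` result to `partial` when part of the page was never shown to the model.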

	## Files in This Repo

	| File | Purpose |
	|---|---|
	| `webscrape_agent.py` | Runtime inference loop (Python API and CLI) |
	| `train.py` | Standalone training script (CLI with args) |
	| `WebScrapeAgent_Training.ipynb` | Colab/Kaggle training notebook |
	| `evaluate.py` | Evaluation script testing all 4 core skills |
	| `prepare_data.py` | Dataset preparation pipeline (builds the training data) |

	## License

	Apache 2.0 (same as base model)