---
license: apache-2.0
language:
- en
tags:
- web-scraping
- html-extraction
- agent
- structured-data
- qwen2.5
- unsloth
- lora
datasets:
- sukritvemula/webscrape-agent-training-data
base_model: Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text-generation
---
# 🕷️ WebScrapeAgent-7B-v1

**An autonomous web scraping agent** built on Qwen2.5-7B-Instruct and fine-tuned to extract structured data from web pages.

Give it a URL and a description of the data you want, and it returns JSON together with a status that reports whether the extraction succeeded.
## What It Does

| Capability | Description |
|---|---|
| **HTML Reading** | Understands page structure: tables, nested divs, lists, forms, data attributes, malformed HTML |
| **Action Sequencing** | Decides what tools to call and in what order to get the data |
| **Authentication** | Handles login pages via cookie replay, form submission, token injection, or browser profiles |
| **Error Recovery** | When something breaks (403, timeout, CAPTCHA, rate limit), switches approach instead of failing |
## How It Works

The model operates in an **action loop**:

```
User: "Extract product listings from example.com/shop"
    ↓
Model: <thought>Let me navigate there first.</thought>
       ACTION: NAVIGATE {"url": "example.com/shop"}
    ↓
System: HTTP 200 OK. <html>...</html>
    ↓
Model: <thought>I see product cards. Let me extract the data.</thought>
       ACTION: RETURN_RESULT {"status": "success", "data": [...]}
```

Each final result includes a **status** (`success`, `partial`, or `failed`), so the caller can tell whether the data is complete.

Each job is limited to 10 steps. If the model cannot finish within that budget, it returns the data gathered so far with an explanation.
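
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `webscrape_agent.py`: `parse_action`, `run_loop`, and the `generate` and `execute` callables are hypothetical names, and feeding observations back with the `user` role is an assumption.

```python
import json
import re
from typing import Callable

MAX_STEPS = 10  # step budget stated in this section

# Matches "ACTION: NAME {json}" as shown in the transcript above.
ACTION_RE = re.compile(r"ACTION:\s*([A-Z_]+)\s*(\{.*\})", re.DOTALL)


def parse_action(reply: str) -> tuple[str, dict]:
    """Extract the action name and JSON params from one model reply."""
    match = ACTION_RE.search(reply)
    if match is None:
        raise ValueError("no ACTION line found in model reply")
    return match.group(1), json.loads(match.group(2))


def run_loop(task: str,
             generate: Callable[[list[dict]], str],
             execute: Callable[[str, dict], str]) -> dict:
    """Drive the loop until RETURN_RESULT or the step budget runs out.

    `generate` wraps the model and `execute` performs one action and
    returns the observation text; both are hypothetical callables.
    """
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        name, params = parse_action(reply)
        if name == "RETURN_RESULT":
            return params
        # Assumption: observations go back to the model as user turns.
        messages.append({"role": "user", "content": execute(name, params)})
    return {"status": "partial", "data": [],
            "message": f"step budget of {MAX_STEPS} exhausted"}
```

Passing the model and the executor in as callables keeps the control flow testable without a GPU or network access.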
## Quick Start

### Python API

```python
# WebScrapeAgent is defined in webscrape_agent.py from this repo
from webscrape_agent import WebScrapeAgent

agent = WebScrapeAgent("sukritvemula/WebScrapeAgent-7B-v1")

# The schema describes the shape of the JSON you expect back
result = agent.scrape(
    url="https://example.com/products",
    task="Extract all product names, prices, and ratings",
    schema={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "string"},
                "rating": {"type": "string"}
            }
        }
    }
)

print(result.status)   # "success" | "partial" | "failed"
print(result.data)     # Extracted JSON data
print(result.message)  # Human-readable explanation
```
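
Because a result can be `partial`, it may help to check the returned data against the schema before using it. The helper below is a hypothetical sketch that covers only the small subset of JSON Schema used in the example above (`array`, `object`, `string`, `items`, `properties`); it is not part of this repo and is not a full validator.

```python
def matches_schema(value, schema: dict) -> bool:
    """Check `value` against a small subset of JSON Schema.

    Only the "array", "object", and "string" types and the "items" and
    "properties" keywords are handled; missing properties are tolerated.
    """
    kind = schema.get("type")
    if kind == "array":
        return isinstance(value, list) and all(
            matches_schema(item, schema.get("items", {})) for item in value)
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        return all(matches_schema(value[key], sub)
                   for key, sub in props.items() if key in value)
    if kind == "string":
        return isinstance(value, str)
    return True  # unknown or absent type: accept
```

A caller could run `matches_schema(result.data, schema)` whenever `result.status` is not `"failed"` and treat a mismatch as a reason to retry or inspect the page manually.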
### With Authentication

```python
# Both calls reuse the `agent` created in the Python API example

# Cookie-based auth
result = agent.scrape(
    url="https://dashboard.example.com/analytics",
    task="Get my usage statistics",
    auth={"method": "cookies", "cookies": {"session_id": "abc123"}}
)

# API token
result = agent.scrape(
    url="https://api.example.com/v2/data",
    task="Get all user records",
    auth={"method": "token", "token": "sk-xxx"}
)
```
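
One plausible way to connect the `auth` argument to the agent's `SET_COOKIES` and `SET_HEADERS` actions is to translate it into preliminary actions before the first `NAVIGATE`. The function below is a hypothetical sketch of that mapping, not the repo's implementation; the use of the `Bearer` scheme for tokens is an assumption.

```python
def auth_to_actions(auth: dict) -> list[tuple[str, dict]]:
    """Translate an `auth` argument into preliminary (action, params) pairs."""
    method = auth.get("method")
    if method == "cookies":
        # Copy the mapping so the caller's dict is not shared
        return [("SET_COOKIES", {"cookies": dict(auth["cookies"])})]
    if method == "token":
        # Assumption: tokens are sent as a Bearer Authorization header
        header = {"Authorization": f"Bearer {auth['token']}"}
        return [("SET_HEADERS", {"headers": header})]
    raise ValueError(f"unsupported auth method: {method!r}")
```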
### CLI

```bash
python webscrape_agent.py "https://example.com/pricing" "Extract all pricing tiers with features"
```
### Direct Model Usage

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the fine-tuned model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    "sukritvemula/WebScrapeAgent-7B-v1",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")

messages = [
    {"role": "system", "content": "You are WebScrapeAgent..."},
    {"role": "user", "content": "Task: Extract pricing data\nURL: https://example.com/pricing"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# do_sample=True is needed for the temperature setting to take effect
outputs = model.generate(input_ids=inputs, max_new_tokens=1024, do_sample=True, temperature=0.3)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
## Available Actions

The model can output these actions in its loop:

| Action | Purpose | Example Params |
|---|---|---|
| `NAVIGATE` | Load a URL | `{"url": "...", "method": "GET", "headers": {...}}` |
| `CLICK` | Click an element | `{"selector": "#submit-btn"}` |
| `FILL_FORM` | Submit a form | `{"selector": "#login", "fields": {"email": "...", "password": "..."}}` |
| `WAIT` | Wait for dynamic content | `{"selector": ".results", "timeout_ms": 5000}` |
| `SET_COOKIES` | Inject auth cookies | `{"cookies": {"session": "abc"}}` |
| `SET_HEADERS` | Set HTTP headers | `{"headers": {"Authorization": "Bearer ..."}}` |
| `LOAD_BROWSER_PROFILE` | Use saved browser session | `{"profile_name": "work-chrome"}` |
| `EXECUTE_JS` | Run JavaScript | `{"script": "return document.querySelector('#app').innerHTML"}` |
| `SCROLL` | Scroll the page | `{"direction": "down", "amount": 500}` |
| `SWITCH_STRATEGY` | Change approach on failure | `{"new_strategy": "headless_browser", "reason": "403 blocked"}` |
| `RETURN_RESULT` | Return final data | `{"status": "success", "data": [...], "message": "..."}` |
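
Since the model emits free text, a caller may want to reject malformed actions before executing them. The sketch below checks only what this card states: the eleven action names in the table and the three allowed statuses. `validate_action` is a hypothetical helper, and it does not check per-action parameters, because the table lists example params rather than a required schema.

```python
KNOWN_ACTIONS = {
    "NAVIGATE", "CLICK", "FILL_FORM", "WAIT", "SET_COOKIES", "SET_HEADERS",
    "LOAD_BROWSER_PROFILE", "EXECUTE_JS", "SCROLL", "SWITCH_STRATEGY",
    "RETURN_RESULT",
}
VALID_STATUSES = {"success", "partial", "failed"}


def validate_action(name: str, params: object) -> list[str]:
    """Return a list of problems; an empty list means the action looks usable."""
    problems = []
    if name not in KNOWN_ACTIONS:
        problems.append(f"unknown action: {name}")
    if not isinstance(params, dict):
        problems.append("params must be a JSON object")
    elif name == "RETURN_RESULT" and params.get("status") not in VALID_STATUSES:
        problems.append(f"invalid status: {params.get('status')!r}")
    return problems
```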
## Training Details

### Recipe

The recipe draws on three published approaches:

| Paper | Key Contribution | Result |
|---|---|---|
| [ScrapeGraphAI-100k](https://arxiv.org/abs/2602.15189) | QLoRA + completion-only loss for HTML→JSON | Key F1 = 0.887 at 1.7B params |
| [BrowserAgent](https://arxiv.org/abs/2510.10666) | Multi-turn browser SFT on Qwen2.5-7B | +20% over baselines |
| [A3-Annotators](https://arxiv.org/abs/2604.07776) | Assistant-token-only loss + thought chains | 41.5% WebArena |
### Hyperparameters

| Parameter | Value | Source |
|---|---|---|
| Base model | Qwen/Qwen2.5-7B-Instruct | - |
| Method | QLoRA (4-bit NF4) | ScrapeGraphAI |
| LoRA rank | 32 | Increased from paper's 16 for structured output complexity |
| LoRA alpha | 32 | Standard (= rank) |
| LoRA targets | All linear (q,k,v,o,gate,up,down) | ScrapeGraphAI + A3 |
| Learning rate | 1e-4 | ScrapeGraphAI |
| LR schedule | Cosine with 3% warmup | A3-Annotators |
| Optimizer | AdamW 8-bit | Unsloth best practice |
| Epochs | 2 | ScrapeGraphAI + BrowserAgent |
| Effective batch | 16 | - |
| Max seq length | 4096 | - |
| Loss | Completion-only (assistant tokens) | All three papers |
| Gradient checkpointing | Unsloth custom | - |
### Training Data

**Dataset**: [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data) (45,624 examples)

| Source | Count | % | What It Teaches |
|---|---|---|---|
| [ScrapeGraphAI-100k](https://huggingface.co/datasets/scrapegraphai/scrapegraph-100k-finetuning) | 25,244 | 55.3% | HTML→JSON extraction across real websites |
| [BrowserAgent-Data](https://huggingface.co/datasets/TIGER-Lab/BrowserAgent-Data) | 20,361 | 44.6% | Multi-turn browser interaction and reasoning |
| Synthetic scenarios | 19 | 0.04% | Auth handling, error recovery, diverse HTML |
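
As a rough cross-check of these numbers, the dataset size can be combined with the effective batch size, epoch count, and 3% warmup from the hyperparameter table to estimate the length of a run. The figures below are back-of-the-envelope estimates; an actual trainer may round steps or warmup slightly differently.

```python
import math

examples = 45_624       # dataset size stated above
effective_batch = 16    # from the hyperparameter table
epochs = 2
warmup_ratio = 0.03

# A final partial batch is counted as one more optimizer step
steps_per_epoch = math.ceil(examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = round(total_steps * warmup_ratio)

# Share of each data source, as a percentage of all examples
sources = {"ScrapeGraphAI-100k": 25_244, "BrowserAgent-Data": 20_361, "Synthetic": 19}
shares = {name: 100 * count / examples for name, count in sources.items()}

print(steps_per_epoch, total_steps, warmup_steps)  # 2852 5704 171
```

The three source counts sum to exactly 45,624, and the computed shares round to the percentages shown in the table.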
### Training Infrastructure

Designed for **free GPU** platforms:

- ✅ Google Colab (T4, 16GB)
- ✅ Kaggle (T4/P100, 16GB)
- ✅ Any 16GB+ GPU
## How to Train

### Option 1: Colab Notebook (Easiest)

Open `WebScrapeAgent_Training.ipynb` in Google Colab with a T4 GPU runtime. The notebook is preconfigured, so you only need to run all cells.

### Option 2: Command Line

```bash
pip install unsloth trl peft transformers accelerate datasets bitsandbytes

# Train with defaults (pushes to Hub)
python train.py

# Custom settings
python train.py \
    --model unsloth/Qwen2.5-7B-Instruct-bnb-4bit \
    --output your-username/WebScrapeAgent-7B-custom \
    --epochs 3 \
    --lr 5e-5 \
    --lora-r 64 \
    --batch-size 2 \
    --grad-accum 8

# Save locally only (no Hub push)
python train.py --no-push --save-local ./my-model
```
## Limitations

- **CAPTCHA**: The agent cannot solve visual CAPTCHAs. It returns partial results with an explanation.
- **JavaScript-heavy SPAs**: The default HTTP executor doesn't render JS. Use it with Playwright/Selenium for full browser support (see the `ActionExecutor` class in `webscrape_agent.py`).
- **Private networks**: The agent cannot access internal/intranet URLs. It returns a clear failure message.
- **Very large pages**: HTML is truncated to ~8K chars to fit the context window, so data on extremely long pages may be missed.
- **Data hallucination**: The model is trained not to invent data, but critical extractions should still be verified.
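
To make the page-size limit concrete, the sketch below shows one way HTML could be shrunk before it reaches the model. It is a hypothetical helper, not the logic in `webscrape_agent.py`: dropping `<script>`, `<style>`, and comment blocks before cutting at 8,000 characters is an assumption, and it would discard data embedded in scripts such as JSON-LD.

```python
import re

MAX_HTML_CHARS = 8_000  # approximate limit quoted in the list above

# Assumption: script, style, and comment blocks are treated as noise.
_NOISE_RE = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
                       re.DOTALL | re.IGNORECASE)


def shrink_html(html: str, limit: int = MAX_HTML_CHARS) -> tuple[str, bool]:
    """Strip noise, collapse whitespace, and cut to `limit` characters.

    Returns the shortened HTML and a flag telling whether content was cut.
    """
    cleaned = _NOISE_RE.sub("", html)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    truncated = len(cleaned) > limit
    return cleaned[:limit], truncated
```

Returning the `truncated` flag lets a caller report a `partial` status instead of silently missing data.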
## Files in This Repo

| File | Purpose |
|---|---|
| `webscrape_agent.py` | Runtime inference loop (Python API and CLI) |
| `train.py` | Standalone training script (CLI with args) |
| `WebScrapeAgent_Training.ipynb` | Colab/Kaggle training notebook |
| `evaluate.py` | Evaluation script testing all 4 core skills |
| `prepare_data.py` | Dataset preparation pipeline (builds the training data) |

## License

Apache 2.0 (same as base model)