{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 🕷️ WebScrapeAgent — Fine-tune Qwen2.5-7B for Autonomous Web Scraping\n", "\n", "This notebook fine-tunes **Qwen2.5-7B-Instruct** with **Unsloth + QLoRA** to create an autonomous web scraping agent that:\n", "\n", "1. **Reads HTML** and understands page structure (tables, lists, forms, nested elements)\n", "2. **Decides action sequences** to extract data (navigate, click, scroll, wait)\n", "3. **Handles authentication** (cookie replay, form login, token injection, browser profiles)\n", "4. **Recovers from failures** (403→headless browser, timeout→JS execution, rate limit→backoff)\n", "\n", "**Training recipe based on:**\n", "- ScrapeGraphAI-100k (arXiv:2602.15189): QLoRA r=16, lr=1e-4, completion-only loss → Key F1=0.887\n", "- BrowserAgent (arXiv:2510.10666): Qwen2.5-7B SFT → +20% over baselines\n", "- A3-Annotators (arXiv:2604.07776): assistant-token-only loss → 41.5% on WebArena\n", "\n", "**Free GPU**: Works on Colab T4 (16GB), Kaggle P100/T4, or any 16GB+ GPU.\n", "\n", "**Training data**: 45K examples from [sukritvemula/webscrape-agent-training-data](https://huggingface.co/datasets/sukritvemula/webscrape-agent-training-data)\n", "- 55% real-world HTML→JSON extraction (ScrapeGraphAI-100k)\n", "- 44% multi-turn browser interaction sessions (BrowserAgent)\n", "- 1% synthetic auth handling, error recovery, and diverse HTML structures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Install Dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!pip install unsloth\n", "!pip install --no-deps trl peft accelerate bitsandbytes xformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Configuration\n", "\n", "Adjust these based on your GPU. Defaults are tuned for **free Colab T4 (16GB VRAM)**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# === EDIT THESE ===\n", "HF_USERNAME = \"sukritvemula\" # Your HuggingFace username\n", "OUTPUT_MODEL = f\"{HF_USERNAME}/WebScrapeAgent-7B-v1\" # Where to push the trained model\n", "\n", "# === Training hyperparameters (from ScrapeGraphAI + BrowserAgent papers) ===\n", "MAX_SEQ_LENGTH = 4096 # Covers 90%+ of examples; increase to 8192 if you have more VRAM\n", "LORA_R = 32 # LoRA rank (higher = more capacity for structured output)\n", "LORA_ALPHA = 32 # alpha = r (standard)\n", "LEARNING_RATE = 1e-4 # QLoRA needs ~10x higher LR than full fine-tuning\n", "NUM_EPOCHS = 2 # Both reference papers use 2 epochs\n", "BATCH_SIZE = 1 # Per-device (T4-safe)\n", "GRAD_ACCUM = 16 # Effective batch = 16\n", "\n", "# === Model ===\n", "MODEL_NAME = \"unsloth/Qwen2.5-7B-Instruct-bnb-4bit\" # Pre-quantized for fast start\n", "DATASET_NAME = \"sukritvemula/webscrape-agent-training-data\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Login to HuggingFace (for pushing model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from huggingface_hub import login\n", "login() # Enter your HF token when prompted" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. 
Load Model + Apply LoRA" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# CRITICAL: import unsloth FIRST so its patches apply before transformers/trl load\n", "import unsloth\n", "\n", "import torch\n", "from unsloth import FastLanguageModel, is_bfloat16_supported\n", "from unsloth.chat_templates import get_chat_template, train_on_responses_only\n", "\n", "print(f\"GPU: {torch.cuda.get_device_name()}\")\n", "# The device-properties attribute is `total_memory` (in bytes)\n", "print(f\"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB\")\n", "\n", "# Load model\n", "model, tokenizer = FastLanguageModel.from_pretrained(\n", "    model_name=MODEL_NAME,\n", "    max_seq_length=MAX_SEQ_LENGTH,\n", "    dtype=None, # Auto-detect\n", "    load_in_4bit=True, # QLoRA\n", ")\n", "\n", "# Apply LoRA adapters to all attention + MLP layers\n", "model = FastLanguageModel.get_peft_model(\n", "    model,\n", "    r=LORA_R,\n", "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n", "                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n", "    lora_alpha=LORA_ALPHA,\n", "    lora_dropout=0.0,\n", "    bias=\"none\",\n", "    use_gradient_checkpointing=\"unsloth\", # 30% more memory efficient\n", "    random_state=42,\n", ")\n", "\n", "# Set Qwen2.5 chat template\n", "tokenizer = get_chat_template(tokenizer, chat_template=\"qwen-2.5\")\n", "\n", "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n", "total = sum(p.numel() for p in model.parameters())\n", "print(f\"Trainable: {trainable:,} / {total:,} ({trainable/total*100:.2f}%)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Load & Format Training Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "\n", "dataset = load_dataset(DATASET_NAME)\n", "train_ds = dataset[\"train\"]\n", "print(f\"Training examples: {len(train_ds)}\")\n", "\n", "# Convert messages → ChatML text\n", "def format_to_text(examples):\n", "    texts = []\n", "    for msgs in examples[\"messages\"]:\n", "        try:\n", "            text = tokenizer.apply_chat_template(\n", "                msgs, tokenize=False, add_generation_prompt=False\n", "            )\n", "            texts.append(text)\n", "        except Exception:\n", "            # Fallback for any format issues\n", "            text = \"\"\n", "            for msg in msgs:\n", "                text += f\"<|im_start|>{msg['role']}\\n{msg['content']}<|im_end|>\\n\"\n", "            texts.append(text)\n", "    return {\"text\": texts}\n", "\n", "train_ds = train_ds.map(format_to_text, batched=True, num_proc=2,\n", "                        remove_columns=train_ds.column_names)\n", "\n", "# Filter sequences that exceed max length\n", "def filter_length(example):\n", "    tokens = tokenizer(example[\"text\"], truncation=False)\n", "    return len(tokens[\"input_ids\"]) <= MAX_SEQ_LENGTH\n", "\n", "original_len = len(train_ds)\n", "train_ds = train_ds.filter(filter_length, num_proc=2)\n", "print(f\"After length filter: {len(train_ds)} / {original_len} ({len(train_ds)/original_len*100:.1f}% kept)\")\n", "\n", "# Show a sample\n", "print(f\"\\nSample (first 300 chars):\\n{train_ds[0]['text'][:300]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Train with Completion-Only Loss\n", "\n", "Key: we only train on assistant tokens (not system/user). This is critical for structured output quality (+15% schema compliance per ScrapeGraphAI paper)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from trl import SFTTrainer, SFTConfig\n", "\n", "training_args = SFTConfig(\n", "    output_dir=\"./webscrape-checkpoints\",\n", "\n", "    # Core training\n", "    num_train_epochs=NUM_EPOCHS,\n", "    per_device_train_batch_size=BATCH_SIZE,\n", "    gradient_accumulation_steps=GRAD_ACCUM,\n", "\n", "    # Optimizer\n", "    optim=\"adamw_8bit\",\n", "    learning_rate=LEARNING_RATE,\n", "    weight_decay=0.01,\n", "    lr_scheduler_type=\"cosine\",\n", "    warmup_ratio=0.03,\n", "    max_grad_norm=0.3,\n", "\n", "    # Precision\n", "    fp16=not is_bfloat16_supported(),\n", "    bf16=is_bfloat16_supported(),\n", "\n", "    # Sequence\n", "    max_seq_length=MAX_SEQ_LENGTH,\n", "    dataset_text_field=\"text\",\n", "    packing=False, # Must be False for multi-turn chat with response-only masking\n", "\n", "    # Logging\n", "    logging_steps=10,\n", "    logging_first_step=True,\n", "\n", "    # Saving\n", "    save_strategy=\"steps\",\n", "    save_steps=500,\n", "    save_total_limit=2,\n", "\n", "    # Push to Hub\n", "    push_to_hub=True,\n", "    hub_model_id=OUTPUT_MODEL,\n", "    hub_strategy=\"end\",\n", "\n", "    # Misc\n", "    seed=42,\n", "    dataset_num_proc=2,\n", ")\n", "\n", "trainer = SFTTrainer(\n", "    model=model,\n", "    tokenizer=tokenizer,\n", "    train_dataset=train_ds,\n", "    args=training_args,\n", ")\n", "\n", "# CRITICAL: Apply completion-only loss (train only on assistant tokens).\n", "# The ChatML markers tell Unsloth where user turns end and assistant turns begin.\n", "trainer = train_on_responses_only(\n", "    trainer,\n", "    instruction_part=\"<|im_start|>user\\n\",\n", "    response_part=\"<|im_start|>assistant\\n\",\n", ")\n", "\n", "print(\"Ready to train!\")\n", "print(f\" Model: {MODEL_NAME}\")\n", "print(f\" LoRA: r={LORA_R}, alpha={LORA_ALPHA}\")\n", "print(f\" LR: {LEARNING_RATE}, Epochs: {NUM_EPOCHS}\")\n", "print(f\" Effective batch: {BATCH_SIZE * GRAD_ACCUM}\")\n", "print(f\" Max seq: {MAX_SEQ_LENGTH}\")\n", "print(f\" Output: {OUTPUT_MODEL}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 🚀 TRAIN!\n", "trainer_stats = trainer.train()\n", "print(f\"\\n✅ Training 
complete! Loss: {trainer_stats.training_loss:.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Save & Push to Hub" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save LoRA adapter\n", "model.save_pretrained(\"webscrape-agent-lora\")\n", "tokenizer.save_pretrained(\"webscrape-agent-lora\")\n", "\n", "# Push merged 16-bit model to Hub\n", "print(\"Pushing merged model to Hub (this takes a few minutes)...\")\n", "model.push_to_hub_merged(\n", "    OUTPUT_MODEL,\n", "    tokenizer,\n", "    save_method=\"merged_16bit\",\n", ")\n", "\n", "# Also push LoRA adapter separately (smaller, faster to load).\n", "# push_to_hub() takes the repo id only, so the tokenizer is pushed with its own call.\n", "model.push_to_hub(OUTPUT_MODEL + \"-lora\")\n", "tokenizer.push_to_hub(OUTPUT_MODEL + \"-lora\")\n", "\n", "print(f\"\\n✅ Merged model: https://huggingface.co/{OUTPUT_MODEL}\")\n", "print(f\"✅ LoRA adapter: https://huggingface.co/{OUTPUT_MODEL}-lora\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Test the Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Switch to inference mode\n", "FastLanguageModel.for_inference(model)\n", "\n", "# Test: HTML extraction\n", "test_messages = [\n", "    {\"role\": \"system\", \"content\": \"You are WebScrapeAgent, a web data extraction assistant. Given web content and a target schema, extract clean structured JSON. Every value must exist in the source content. Never invent data. Always include extraction status.\"},\n", "    {\"role\": \"user\", \"content\": \"\"\"Extract structured data from the following web content.\n", "\n", "\n", "
\n", "\n", "Sony WH-1000XM5\n", " $348.00\n", "4.7 out of 5\n", " Available\n", "\n", "\n", "AirPods Max\n", " $549.00\n", "4.3 out of 5\n", " Only 2 left\n", "\n", "\n", "
\n", "\n", "Return as JSON array of products with name, sku, price, rating, and availability.\"\"\"}\n", "]\n", "\n", "inputs = tokenizer.apply_chat_template(\n", " test_messages, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\"\n", ").to(\"cuda\")\n", "\n", "outputs = model.generate(\n", " input_ids=inputs,\n", " max_new_tokens=512,\n", " temperature=0.3,\n", " do_sample=True,\n", ")\n", "\n", "response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)\n", "print(\"Model response:\")\n", "print(response)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Test: Multi-step scraping with error recovery\n", "test_messages_2 = [\n", " {\"role\": \"system\", \"content\": \"\"\"You are WebScrapeAgent, an autonomous web scraping and data extraction system.\n", "\n", "Available actions:\n", "- EXTRACT_JSON, NAVIGATE, FILL_FORM, CLICK, WAIT, SET_COOKIES, SET_HEADERS,\n", " LOAD_BROWSER_PROFILE, EXECUTE_JS, SCROLL, SWITCH_STRATEGY, RETURN_RESULT\n", "\n", "Rules:\n", "- NEVER invent data\n", "- ALWAYS include status in RETURN_RESULT: \\\"success\\\", \\\"partial\\\", or \\\"failed\\\"\n", "- Think step-by-step in blocks\n", "- Maximum 10 steps per job\"\"\"},\n", " {\"role\": \"user\", \"content\": \"Task: Extract product reviews\\nURL: https://reviews.example.com/product/456\"},\n", " {\"role\": \"assistant\", \"content\": \"Let me navigate to the reviews page.\\n\\nACTION: NAVIGATE\\n```json\\n{\\\"url\\\": \\\"https://reviews.example.com/product/456\\\", \\\"method\\\": \\\"GET\\\"}\\n```\"},\n", " {\"role\": \"user\", \"content\": \"Observation: HTTP 403 Forbidden\\n\\n

Access Denied Bot detection triggered.

\"},\n", "]\n", "\n", "inputs = tokenizer.apply_chat_template(\n", " test_messages_2, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\"\n", ").to(\"cuda\")\n", "\n", "outputs = model.generate(\n", " input_ids=inputs,\n", " max_new_tokens=512,\n", " temperature=0.3,\n", " do_sample=True,\n", ")\n", "\n", "response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)\n", "print(\"Model response (error recovery):\")\n", "print(response)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Optional: Export to GGUF (for llama.cpp / Ollama)\n", "\n", "Uncomment to export for local deployment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# # Export to GGUF Q4_K_M (smallest good quality)\n", "# model.save_pretrained_gguf(\n", "# \"webscrape-agent-gguf\",\n", "# tokenizer,\n", "# quantization_method=\"q4_k_m\",\n", "# )\n", "# \n", "# # Push GGUF to Hub\n", "# model.push_to_hub_gguf(\n", "# OUTPUT_MODEL + \"-GGUF\",\n", "# tokenizer,\n", "# quantization_method=\"q4_k_m\",\n", "# )" ] } ] }