Web Wizard — Playwright Automation Curriculum
What This Is
A structured, progressive learning repository for advanced web automation and scraping with Playwright for Python. It covers a 12-module curriculum, from browser automation basics through distributed, AI-integrated crawler architectures.
Problem
Playwright can do far more than most tutorials cover. Existing resources teach basic navigation and clicking, but not stealth techniques, async concurrency, database integration, Celery-based orchestration, or using Playwright as an LLM agent tool. This repo is a self-directed curriculum that covers all of these.
Curriculum Structure
- Part 1 — Foundations: Python async, HTTP internals, DOM, DevTools, Playwright core API, network interception, debugging
- Part 2 — Advanced Scraping: Anti-bot & stealth, async concurrency, Postgres/Pandas integration, Docker + pytest CI
- Part 3 — Production & AI: Infinite scroll, SPAs, multi-step auth, Celery/RabbitMQ orchestration, Playwright as an LLM agent tool
Hands-On Projects Planned
- Single-page scraper with CSV export
- Login-protected scraper
- Infinite-scroll scraper with deduplication
- XHR-intercept scraper
- Multi-user crawler writing to Postgres
- Playwright-agent connector for LLM/RAG workflows
- Capstone: Distributed, Dockerized crawler with queue + vector DB pipeline
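To give a feel for the first and third projects, the sketch below scrapes link text from a single page, drops duplicate rows, and serializes the result as CSV. It is an illustration under stated assumptions, not the repo's implementation: the function names (`scrape_links`, `dedupe`, `rows_to_csv`), the `a[href]` selector, and the `text`/`url` columns are all hypothetical choices.

```python
import csv
import io


def dedupe(rows, key="url"):
    """Keep the first row seen for each value of `key`, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def scrape_links(url):
    """Collect {text, url} for every anchor on one page (assumed selector)."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            # Evaluate in the page: map each anchor to a plain dict.
            return page.locator("a[href]").evaluate_all(
                "els => els.map(a => ({text: a.innerText.trim(), url: a.href}))"
            )
        finally:
            browser.close()
```

A full run would chain the three steps, for instance `rows_to_csv(dedupe(scrape_links(some_url)), ["text", "url"])`, and write the returned string to a file. The same `dedupe` helper is the kind of step an infinite-scroll scraper needs, since repeated scrolling tends to re-read items that are already collected.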
Tech Stack
Playwright, pytest, pandas, SQLAlchemy, PostgreSQL, Redis, Celery, Docker, ChromaDB, LangChain