diff --git a/.bandit.yml b/.bandit.yml new file mode 100644 index 0000000000000000000000000000000000000000..bd06507f49dbd7031ea5b87ce861359dbc533cec --- /dev/null +++ b/.bandit.yml @@ -0,0 +1,11 @@ +skips: +- B101 +- B311 +- B113 # `Requests call without timeout` these requests are done in the benchmark and examples scripts only +- B403 # We are using pickle for tests only +- B404 # Using subprocess library +- B602 # subprocess call with shell=True identified +- B110 # Try, Except, Pass detected. +- B104 # Possible binding to all interfaces. +- B301 # Pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue. +- B108 # Probable insecure usage of temp file/directory. \ No newline at end of file diff --git a/.dockerignore b/.dockerignore index f1d82d8383fe1f10641ef259ec870c9a820dcd05..be64db50a741519bd138ab3be215433ac5a51a96 100644 --- a/.dockerignore +++ b/.dockerignore @@ -1,12 +1,110 @@ -.git -.gitignore -__pycache__ -*.pyc -*.pyo -venv -.venv +# Github +.github/ + +# docs +docs/ +images/ +.cache/ +.claude/ + +# cached files +__pycache__/ +*.py[cod] +.cache +.DS_Store +*~ +.*.sw[po] +.build +.ve .env +.pytest +.benchmarks +.bootstrap +.appveyor.token +*.bak +*.db +*.db-* + +# installation package +*.egg-info/ +dist/ +build/ + +# environments +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ + +# C extensions +*.so + +# pycharm +.idea/ + +# vscode +*.code-workspace + +# Packages +*.egg *.egg-info -.DS_Store -chat_history.json -client_secret.json +dist +build +eggs +.eggs +parts +bin +var +sdist +wheelhouse +develop-eggs +.installed.cfg +lib +lib64 +venv*/ +.venv*/ +pyvenv*/ +pip-wheel-metadata/ +poetry.lock + +# Installer logs +pip-log.txt + +# mypy +.mypy_cache/ +.dmypy.json +dmypy.json +mypy.ini + +# test caches +.tox/ +.pytest_cache/ +.coverage +htmlcov +report.xml +nosetests.xml +coverage.xml + +# Translations +*.mo + +# Buildout +.mr.developer.cfg + +# IDE project files +.project +.pydevproject +.idea +*.iml 
+*.komodoproject + +# Complexity +output/*.html +output/*/index.html + +# Sphinx +docs/_build +public/ +web/ diff --git a/.gitattributes b/.gitattributes index 9e74bc24ca2e573b41c6bf88e1b7289affe6032b..f5b3ed77729dcbafe47b97a48d88401dcf3f0dd3 100644 --- a/.gitattributes +++ b/.gitattributes @@ -3,3 +3,8 @@ Scrapling/docs/assets/favicon.ico filter=lfs diff=lfs merge=lfs -text Scrapling/docs/assets/main_cover.png filter=lfs diff=lfs merge=lfs -text Scrapling/docs/assets/scrapling_shell_curl.png filter=lfs diff=lfs merge=lfs -text Scrapling/docs/assets/spider_architecture.png filter=lfs diff=lfs merge=lfs -text +docs/assets/cover_dark.png filter=lfs diff=lfs merge=lfs -text +docs/assets/favicon.ico filter=lfs diff=lfs merge=lfs -text +docs/assets/main_cover.png filter=lfs diff=lfs merge=lfs -text +docs/assets/scrapling_shell_curl.png filter=lfs diff=lfs merge=lfs -text +docs/assets/spider_architecture.png filter=lfs diff=lfs merge=lfs -text diff --git a/.github/FUNDING.yml b/.github/FUNDING.yml index b2b06fba1267df2bad953220ee8db5c6b7c22a58..361677ff637b5552eafe102132c13195fda9f7c6 100644 --- a/.github/FUNDING.yml +++ b/.github/FUNDING.yml @@ -1,3 +1,3 @@ -# These are supported funding model platforms - -github: itsOwen +github: D4Vinci +buy_me_a_coffee: d4vinci +ko_fi: d4vinci diff --git a/.github/ISSUE_TEMPLATE/01-bug_report.yml b/.github/ISSUE_TEMPLATE/01-bug_report.yml new file mode 100644 index 0000000000000000000000000000000000000000..0340da12c74f8eb421977d808a0e070ef0f7a3b4 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/01-bug_report.yml @@ -0,0 +1,82 @@ +name: Bug report +description: Create a bug report to help us address errors in the repository +labels: [bug] +body: + - type: checkboxes + attributes: + label: Have you searched if there is an existing issue for this? + description: Please search [existing issues](https://github.com/D4Vinci/Scrapling/labels/bug). 
+ options: + - label: I have searched the existing issues + required: true + + - type: input + attributes: + label: "Python version (python --version)" + placeholder: "Python 3.8" + validations: + required: true + + - type: input + attributes: + label: "Scrapling version (scrapling.__version__)" + placeholder: "0.1" + validations: + required: true + + - type: textarea + attributes: + label: "Dependencies version (pip3 freeze)" + description: > + This is the output of the command `pip3 freeze --all`. Note that the + actual output might be different from the placeholder text. + placeholder: | + cssselect==1.2.0 + lxml==5.3.0 + orjson==3.10.7 + ... + validations: + required: true + + - type: input + attributes: + label: "What's your operating system?" + placeholder: "Windows 10" + validations: + required: true + + - type: dropdown + attributes: + label: 'Are you using a separate virtual environment?' + description: "Please pay attention to this question" + options: + - 'No' + - 'Yes' + default: 0 + validations: + required: true + + - type: textarea + attributes: + label: "Expected behavior" + description: "Describe the behavior you expect. May include images or videos." + validations: + required: true + + - type: textarea + attributes: + label: "Actual behavior" + validations: + required: true + + - type: textarea + attributes: + label: Steps To Reproduce + description: Steps to reproduce the behavior. + placeholder: | + 1. In this environment... + 2. With this config... + 3. Run '...' + 4. See error... + validations: + required: false diff --git a/.github/ISSUE_TEMPLATE/02-feature_request.yml b/.github/ISSUE_TEMPLATE/02-feature_request.yml new file mode 100644 index 0000000000000000000000000000000000000000..8d07e864375e97ce66088c4b6bf4b9c83218c4b8 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/02-feature_request.yml @@ -0,0 +1,19 @@ +name: Feature request +description: Suggest features, propose improvements, discuss new ideas. 
+labels: [enhancement] +body: + - type: checkboxes + attributes: + label: Have you searched if there is an existing feature request for this? + description: Please search [existing requests](https://github.com/D4Vinci/Scrapling/labels/enhancement). + options: + - label: I have searched the existing requests + required: true + + - type: textarea + attributes: + label: "Feature description" + description: > + This could include new topics or improving any existing features/implementations. + validations: + required: true \ No newline at end of file diff --git a/.github/ISSUE_TEMPLATE/03-other.yml b/.github/ISSUE_TEMPLATE/03-other.yml new file mode 100644 index 0000000000000000000000000000000000000000..697c1f4dc64f129a8ad43c32eff73bba54698cde --- /dev/null +++ b/.github/ISSUE_TEMPLATE/03-other.yml @@ -0,0 +1,19 @@ +name: Other +description: Use this for any other issues. PLEASE provide as much information as possible. +labels: ["awaiting triage"] +body: + - type: textarea + id: issuedescription + attributes: + label: What would you like to share? + description: Provide a clear and concise explanation of your issue. + validations: + required: true + + - type: textarea + id: extrainfo + attributes: + label: Additional information + description: Is there anything else we should know about this issue? + validations: + required: false \ No newline at end of file diff --git a/.github/ISSUE_TEMPLATE/04-docs_issue.yml b/.github/ISSUE_TEMPLATE/04-docs_issue.yml new file mode 100644 index 0000000000000000000000000000000000000000..344537451e7deab333c1af33eba0af0f05f6a937 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/04-docs_issue.yml @@ -0,0 +1,40 @@ +name: Documentation issue +description: Report incorrect, unclear, or missing documentation. +labels: [documentation] +body: + - type: checkboxes + attributes: + label: Have you searched if there is an existing issue for this? + description: Please search [existing issues](https://github.com/D4Vinci/Scrapling/labels/documentation). 
+ options: + - label: I have searched the existing issues + required: true + + - type: input + attributes: + label: "Page URL" + description: "Link to the documentation page with the issue." + placeholder: "https://scrapling.readthedocs.io/en/latest/..." + validations: + required: true + + - type: dropdown + attributes: + label: "Type of issue" + options: + - Incorrect information + - Unclear or confusing + - Missing information + - Typo or formatting + - Broken link + - Other + default: 0 + validations: + required: true + + - type: textarea + attributes: + label: "Description" + description: "Describe what's wrong and what you expected to find." + validations: + required: true diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml new file mode 100644 index 0000000000000000000000000000000000000000..59525b7feb100992e37f396918e16ee89b820565 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -0,0 +1,10 @@ +blank_issues_enabled: false +contact_links: +- name: Discussions + url: https://github.com/D4Vinci/Scrapling/discussions + about: > + The "Discussions" forum is where you want to start. ๐Ÿ’– +- name: Ask on our discord server + url: https://discord.gg/EMgGbDceNQ + about: > + Our community chat forum. \ No newline at end of file diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md new file mode 100644 index 0000000000000000000000000000000000000000..b8eac74475b6ac545db423232b116dd2f0e45673 --- /dev/null +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -0,0 +1,51 @@ + + +## Proposed change + + + +### Type of change: + + + + +- [ ] Dependency upgrade +- [ ] Bugfix (non-breaking change which fixes an issue) +- [ ] New integration (thank you!) 
+- [ ] New feature (which adds functionality to an existing integration) +- [ ] Deprecation (breaking change to happen in the future) +- [ ] Breaking change (fix/feature causing existing functionality to break) +- [ ] Code quality improvements to existing code or addition of tests +- [ ] Add or change doctests? -- Note: Please avoid changing both code and tests in a single pull request. +- [ ] Documentation change? + +### Additional information + + +- This PR fixes or closes an issue: fixes # +- This PR is related to an issue: # +- Link to documentation pull request: ** + +### Checklist: +* [ ] I have read [CONTRIBUTING.md](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md). +* [ ] This pull request is all my own work -- I have not plagiarized. +* [ ] I know that pull requests will not be merged if they fail the automated tests. +* [ ] All new Python files are placed inside an existing directory. +* [ ] All filenames are in all lowercase characters with no spaces or dashes. +* [ ] All functions and variable names follow Python naming conventions. +* [ ] All function parameters and return values are annotated with Python [type hints](https://docs.python.org/3/library/typing.html). +* [ ] All functions have doc-strings. 
diff --git a/.github/workflows/code-quality.yml b/.github/workflows/code-quality.yml new file mode 100644 index 0000000000000000000000000000000000000000..5c32328b61557dddd7c10e348f3ffaf14fac1d77 --- /dev/null +++ b/.github/workflows/code-quality.yml @@ -0,0 +1,184 @@ +name: Code Quality + +on: + push: + branches: + - main + - dev + paths-ignore: + - '*.md' + - '**/*.md' + - 'docs/**' + - 'images/**' + - '.github/**' + - '!.github/workflows/code-quality.yml' # Always run when this workflow changes + pull_request: + branches: + - main + - dev + paths-ignore: + - '*.md' + - '**/*.md' + - 'docs/**' + - 'images/**' + workflow_dispatch: # Allow manual triggering + +concurrency: + group: ${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: true + +jobs: + code-quality: + name: Code Quality Checks + runs-on: ubuntu-latest + permissions: + contents: read + pull-requests: write # For PR annotations + + steps: + - name: Checkout code + uses: actions/checkout@v6 + with: + fetch-depth: 0 # Full history for better analysis + + - name: Set up Python + uses: actions/setup-python@v6 + with: + python-version: '3.10' + cache: 'pip' + + - name: Install dependencies + run: | + python -m pip install --upgrade pip + pip install bandit[toml] ruff vermin mypy pyright + pip install -e ".[all]" + pip install lxml-stubs + + - name: Run Bandit (Security Linter) + id: bandit + continue-on-error: true + run: | + echo "::group::Bandit - Security Linter" + bandit -r -c .bandit.yml scrapling/ -f json -o bandit-report.json + bandit -r -c .bandit.yml scrapling/ + echo "::endgroup::" + + - name: Run Ruff Linter + id: ruff-lint + continue-on-error: true + run: | + echo "::group::Ruff - Linter" + ruff check scrapling/ --output-format=github + echo "::endgroup::" + + - name: Run Ruff Formatter Check + id: ruff-format + continue-on-error: true + run: | + echo "::group::Ruff - Formatter Check" + ruff format --check scrapling/ --diff + echo "::endgroup::" + + - name: Run Vermin (Python Version 
Compatibility) + id: vermin + continue-on-error: true + run: | + echo "::group::Vermin - Python 3.10+ Compatibility Check" + vermin -t=3.10- --violations --eval-annotations --no-tips scrapling/ + echo "::endgroup::" + + - name: Run Mypy (Static Type Checker) + id: mypy + continue-on-error: true + run: | + echo "::group::Mypy - Static Type Checker" + mypy scrapling/ + echo "::endgroup::" + + - name: Run Pyright (Static Type Checker) + id: pyright + continue-on-error: true + run: | + echo "::group::Pyright - Static Type Checker" + pyright scrapling/ + echo "::endgroup::" + + - name: Check results and create summary + if: always() + run: | + echo "# Code Quality Check Results" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + + # Initialize status + all_passed=true + + # Check Bandit + if [ "${{ steps.bandit.outcome }}" == "success" ]; then + echo "โœ… **Bandit (Security)**: Passed" >> $GITHUB_STEP_SUMMARY + else + echo "โŒ **Bandit (Security)**: Failed" >> $GITHUB_STEP_SUMMARY + all_passed=false + fi + + # Check Ruff Linter + if [ "${{ steps.ruff-lint.outcome }}" == "success" ]; then + echo "โœ… **Ruff Linter**: Passed" >> $GITHUB_STEP_SUMMARY + else + echo "โŒ **Ruff Linter**: Failed" >> $GITHUB_STEP_SUMMARY + all_passed=false + fi + + # Check Ruff Formatter + if [ "${{ steps.ruff-format.outcome }}" == "success" ]; then + echo "โœ… **Ruff Formatter**: Passed" >> $GITHUB_STEP_SUMMARY + else + echo "โŒ **Ruff Formatter**: Failed" >> $GITHUB_STEP_SUMMARY + all_passed=false + fi + + # Check Vermin + if [ "${{ steps.vermin.outcome }}" == "success" ]; then + echo "โœ… **Vermin (Python 3.10+)**: Passed" >> $GITHUB_STEP_SUMMARY + else + echo "โŒ **Vermin (Python 3.10+)**: Failed" >> $GITHUB_STEP_SUMMARY + all_passed=false + fi + + # Check Mypy + if [ "${{ steps.mypy.outcome }}" == "success" ]; then + echo "โœ… **Mypy (Type Checker)**: Passed" >> $GITHUB_STEP_SUMMARY + else + echo "โŒ **Mypy (Type Checker)**: Failed" >> $GITHUB_STEP_SUMMARY + all_passed=false 
+ fi + + # Check Pyright + if [ "${{ steps.pyright.outcome }}" == "success" ]; then + echo "โœ… **Pyright (Type Checker)**: Passed" >> $GITHUB_STEP_SUMMARY + else + echo "โŒ **Pyright (Type Checker)**: Failed" >> $GITHUB_STEP_SUMMARY + all_passed=false + fi + + echo "" >> $GITHUB_STEP_SUMMARY + + if [ "$all_passed" == "true" ]; then + echo "### ๐ŸŽ‰ All checks passed!" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + echo "Your code meets all quality standards." >> $GITHUB_STEP_SUMMARY + else + echo "### โš ๏ธ Some checks failed" >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + echo "Please review the errors above and fix them." >> $GITHUB_STEP_SUMMARY + echo "" >> $GITHUB_STEP_SUMMARY + echo "**Tip**: Run \`pre-commit run --all-files\` locally to catch these issues before pushing." >> $GITHUB_STEP_SUMMARY + exit 1 + fi + + - name: Upload Bandit report + if: always() && steps.bandit.outcome != 'skipped' + uses: actions/upload-artifact@v6 + with: + name: bandit-security-report + path: bandit-report.json + retention-days: 30 diff --git a/.github/workflows/docker-build.yml b/.github/workflows/docker-build.yml new file mode 100644 index 0000000000000000000000000000000000000000..2c0948cab96d9cf9b0147e420939b28e5b037e3b --- /dev/null +++ b/.github/workflows/docker-build.yml @@ -0,0 +1,86 @@ +name: Build and Push Docker Image + +on: + pull_request: + types: [closed] + branches: + - main + workflow_dispatch: + inputs: + tag: + description: 'Docker image tag' + required: true + default: 'latest' + +env: + DOCKERHUB_IMAGE: pyd4vinci/scrapling + GHCR_IMAGE: ghcr.io/${{ github.repository_owner }}/scrapling + +jobs: + build-and-push: + runs-on: ubuntu-latest + permissions: + contents: read + packages: write + + steps: + - name: Checkout repository + uses: actions/checkout@v6 + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v3 + with: + platforms: linux/amd64,linux/arm64 + + - name: Log in to Docker Hub + uses: docker/login-action@v3 + 
with: + registry: docker.io + username: ${{ secrets.DOCKER_USERNAME }} + password: ${{ secrets.DOCKER_PASSWORD }} + + - name: Log in to GitHub Container Registry + uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.CONTAINER_TOKEN }} + + - name: Extract metadata + id: meta + uses: docker/metadata-action@v5 + with: + images: | + ${{ env.DOCKERHUB_IMAGE }} + ${{ env.GHCR_IMAGE }} + tags: | + type=ref,event=branch + type=ref,event=pr + type=semver,pattern={{version}} + type=semver,pattern={{major}}.{{minor}} + type=semver,pattern={{major}} + type=raw,value=latest,enable={{is_default_branch}} + labels: | + org.opencontainers.image.title=Scrapling + org.opencontainers.image.description=An undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be! + org.opencontainers.image.vendor=D4Vinci + org.opencontainers.image.licenses=BSD + org.opencontainers.image.url=https://scrapling.readthedocs.io/en/latest/ + org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }} + org.opencontainers.image.documentation=https://scrapling.readthedocs.io/en/latest/ + + - name: Build and push Docker image + id: build + uses: docker/build-push-action@v6 + with: + context: . 
+ platforms: linux/amd64,linux/arm64 + push: true + tags: ${{ steps.meta.outputs.tags }} + labels: ${{ steps.meta.outputs.labels }} + cache-from: type=gha + cache-to: type=gha,mode=max + build-args: | + BUILDKIT_INLINE_CACHE=1 + + - name: Image digest + run: echo ${{ steps.build.outputs.digest }} \ No newline at end of file diff --git a/.github/workflows/release-and-publish.yml b/.github/workflows/release-and-publish.yml new file mode 100644 index 0000000000000000000000000000000000000000..d4350889fc8eec35aaeeb97216ea5a8905312d00 --- /dev/null +++ b/.github/workflows/release-and-publish.yml @@ -0,0 +1,74 @@ +name: Create Release and Publish to PyPI +# Creates a GitHub release when a PR is merged to main (using PR title as version and body as release notes), then publishes to PyPI. + +on: + pull_request: + types: [closed] + branches: + - main + +jobs: + create-release-and-publish: + if: github.event.pull_request.merged == true + runs-on: ubuntu-latest + environment: + name: PyPI + url: https://pypi.org/p/scrapling + permissions: + contents: write + id-token: write + steps: + - uses: actions/checkout@v6 + with: + fetch-depth: 0 + + - name: Get PR title + id: pr_title + run: echo "title=${{ github.event.pull_request.title }}" >> $GITHUB_OUTPUT + + - name: Save PR body to file + uses: actions/github-script@v8 + with: + script: | + const fs = require('fs'); + fs.writeFileSync('pr_body.md', context.payload.pull_request.body || ''); + + - name: Extract version + id: extract_version + run: | + PR_TITLE="${{ steps.pr_title.outputs.title }}" + if [[ $PR_TITLE =~ ^v ]]; then + echo "version=$PR_TITLE" >> $GITHUB_OUTPUT + echo "Valid version format found in PR title: $PR_TITLE" + else + echo "Error: PR title '$PR_TITLE' must start with 'v' (e.g., 'v1.0.0') to create a release." 
+ exit 1 + fi + + - name: Create Release + uses: softprops/action-gh-release@v2 + with: + tag_name: ${{ steps.extract_version.outputs.version }} + name: Release ${{ steps.extract_version.outputs.version }} + body_path: pr_body.md + draft: false + prerelease: false + env: + GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + + - name: Set up Python + uses: actions/setup-python@v6 + with: + python-version: 3.12 + + - name: Upgrade pip + run: python3 -m pip install --upgrade pip + + - name: Install build + run: python3 -m pip install --upgrade build twine setuptools + + - name: Build a binary wheel and a source tarball + run: python3 -m build --sdist --wheel --outdir dist/ + + - name: Publish distribution ๐Ÿ“ฆ to PyPI + uses: pypa/gh-action-pypi-publish@release/v1 \ No newline at end of file diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml new file mode 100644 index 0000000000000000000000000000000000000000..75d6ab13444a45db9e0752e999c80bb78b2a159c --- /dev/null +++ b/.github/workflows/tests.yml @@ -0,0 +1,109 @@ +name: Tests +on: + push: + branches: + - main + - dev + paths-ignore: + - '*.md' + - '**/*.md' + - 'docs/*' + - 'images/*' + - '.github/*' + - '*.yml' + - '*.yaml' + - 'ruff.toml' + +concurrency: + group: ${{github.workflow}}-${{ github.ref }} + cancel-in-progress: true + +jobs: + tests: + timeout-minutes: 60 + runs-on: ${{ matrix.os }} + strategy: + fail-fast: false + matrix: + include: + - python-version: "3.10" + os: macos-latest + env: + TOXENV: py310 + - python-version: "3.11" + os: macos-latest + env: + TOXENV: py311 + - python-version: "3.12" + os: macos-latest + env: + TOXENV: py312 + - python-version: "3.13" + os: macos-latest + env: + TOXENV: py313 + + steps: + - uses: actions/checkout@v6 + + - name: Set up Python ${{ matrix.python-version }} + uses: actions/setup-python@v6 + with: + python-version: ${{ matrix.python-version }} + cache: 'pip' + cache-dependency-path: | + pyproject.toml + tox.ini + + - name: Install all browsers 
dependencies + run: | + python3 -m pip install --upgrade pip + python3 -m pip install playwright==1.56.0 patchright==1.56.0 + + - name: Get Playwright version + id: playwright-version + run: | + PLAYWRIGHT_VERSION=$(python3 -c "import importlib.metadata; print(importlib.metadata.version('playwright'))") + echo "version=$PLAYWRIGHT_VERSION" >> $GITHUB_OUTPUT + echo "Playwright version: $PLAYWRIGHT_VERSION" + + - name: Retrieve Playwright browsers from cache if any + id: playwright-cache + uses: actions/cache@v5 + with: + path: | + ~/.cache/ms-playwright + ~/Library/Caches/ms-playwright + ~/.ms-playwright + key: ${{ runner.os }}-playwright-${{ steps.playwright-version.outputs.version }}-v1 + restore-keys: | + ${{ runner.os }}-playwright-${{ steps.playwright-version.outputs.version }}- + ${{ runner.os }}-playwright- + + - name: Install Playwright browsers + run: | + echo "Cache hit: ${{ steps.playwright-cache.outputs.cache-hit }}" + if [ "${{ steps.playwright-cache.outputs.cache-hit }}" != "true" ]; then + python3 -m playwright install chromium + else + echo "Skipping install - using cached Playwright browsers" + fi + python3 -m playwright install-deps chromium + + # Cache tox environments + - name: Cache tox environments + uses: actions/cache@v5 + with: + path: .tox + # Include python version and os in the cache key; hashFiles() takes paths relative to the workspace + key: tox-v1-${{ runner.os }}-py${{ matrix.python-version }}-${{ hashFiles('pyproject.toml') }} + restore-keys: | + tox-v1-${{ runner.os }}-py${{ matrix.python-version }}- + tox-v1-${{ runner.os }}- + + - name: Install tox + run: pip install -U tox + + - name: Run tests + env: ${{ matrix.env }} + run: tox \ No newline at end of file diff --git a/.gitignore b/.gitignore index ca4292901bc1eeceb7d475216a6e4a97d988f3f5..77d121097b53da997aafb97816d7a91ce888d58e 100644 --- a/.gitignore +++ b/.gitignore @@ -1,75 +1,110 @@ -# Python cache files -__pycache__/ -*.py[cod] -*$py.class - -# Virtual environment -venv/ - -# Streamlit cache 
-.streamlit/ - -# PyCharm files -.idea/ - -# VS Code files -.vscode/ +# local files +site/* +local_tests/* +.mcpregistry_* -# Jupyter Notebook -.ipynb_checkpoints +# AI related files +.claude/* +CLAUDE.md -# Environment variables -.env - -# Operating system files +# cached files +__pycache__/ +*.py[cod] +.cache .DS_Store -Thumbs.db - -# Log files -*.log - -# Database files +*~ +.*.sw[po] +.build +.ve +.env +.pytest +.benchmarks +.bootstrap +.appveyor.token +*.bak *.db -*.sqlite3 +*.db-* -# Compiled Python files -*.pyc - -# Package directories +# installation package +*.egg-info/ dist/ build/ -*.egg-info/ -# Backup files -*~ -*.bak +# environments +.venv +env/ +venv/ +ENV/ +env.bak/ +venv.bak/ -# Coverage reports -htmlcov/ -.coverage -.coverage.* -coverage.xml +# C extensions +*.so -# Pytest cache -.pytest_cache/ +# pycharm +.idea/ -# mypy cache +# vscode +*.code-workspace + +# Packages +*.egg +*.egg-info +dist +build +eggs +.eggs +parts +bin +var +sdist +wheelhouse +develop-eggs +.installed.cfg +lib +lib64 +venv*/ +.venv*/ +pyvenv*/ +pip-wheel-metadata/ +poetry.lock + +# Installer logs +pip-log.txt + +# mypy .mypy_cache/ +.dmypy.json +dmypy.json +mypy.ini -# Scrapy stuff: -.scrapy +# test caches +.tox/ +.pytest_cache/ +.coverage +htmlcov +report.xml +nosetests.xml +coverage.xml -# Sphinx documentation -docs/_build/ +# Translations +*.mo -# PyBuilder -target/ +# Buildout +.mr.developer.cfg -# Google Sheets authentication token -token.json +# IDE project files +.project +.pydevproject +.idea +*.iml +*.komodoproject -# Chat history -chat_history.json +# Complexity +output/*.html +output/*/index.html -# Google OAuth client secret -client_secret.json \ No newline at end of file +# Sphinx +docs/_build +public/ +web/ diff --git a/.hfignore b/.hfignore index c92693047ef2cb76d29e2b178a100a4bd99316ae..f24e4a1339adaf7ec0851302adb001a18748e40d 100644 --- a/.hfignore +++ b/.hfignore @@ -1,9 +1,23 @@ -.git/ -.github/ -venv/ -__pycache__/ +.git +.github +.venv +__pycache__ *.pyc 
+*.pyo +*.pyd +.DS_Store +tests/ +docs/ +images/ +.coverage +htmlcov/ +pytest_cache/ +.mypy_cache/ +.tox/ +.pytest_cache/ +.ruff_cache/ +.uv/ +dist/ +build/ +*.egg-info/ .env -chat_history.json -test_patchright.py -client_secret.json diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..205e3550af8e529058a2a214c603302c2f4ab638 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,20 @@ +repos: +- repo: https://github.com/PyCQA/bandit + rev: 1.9.0 + hooks: + - id: bandit + args: [-r, -c, .bandit.yml] +- repo: https://github.com/astral-sh/ruff-pre-commit + # Ruff version. + rev: v0.14.5 + hooks: + # Run the linter. + - id: ruff + args: [ --fix ] + # Run the formatter. + - id: ruff-format +- repo: https://github.com/netromdk/vermin + rev: v1.7.0 + hooks: + - id: vermin + args: ['-t=3.10-', '--violations', '--eval-annotations', '--no-tips'] diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 0000000000000000000000000000000000000000..8d5a8d52a7881b09df9293c049d402d5e2c993ec --- /dev/null +++ b/.readthedocs.yaml @@ -0,0 +1,21 @@ +# See https://docs.readthedocs.com/platform/stable/intro/zensical.html for details +# Example: https://github.com/readthedocs/test-builds/tree/zensical + +version: 2 + +build: + os: ubuntu-24.04 + apt_packages: + - pngquant + tools: + python: "3.13" + jobs: + install: + - pip install -r docs/requirements.txt + - pip install ".[all]" + build: + html: + - zensical build + post_build: + - mkdir -p $READTHEDOCS_OUTPUT/html/ + - cp --recursive site/* $READTHEDOCS_OUTPUT/html/ diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index 8baa55e395bcf030147a3e32a9e8ba9a98d6133e..5ae54c91972c63797115ea0aee63df524ac032f3 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -60,7 +60,7 @@ representative at an online or offline event. 
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at -owensingh72@gmail.com. +karim.shoair@pm.me. All complaints will be reviewed and investigated promptly and fairly. All community leaders are obligated to respect the privacy and security of the diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 6c281a5205d6fef6aac30e6d60a0c8b100cc720f..425531fa1656af36d0a9fa1aa361d9c5027f5ac6 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,167 +1,106 @@ -# Contributing to CyberScraper 2077 +# Contributing to Scrapling -> "In 2077, what makes someone a contributor? Pushing code." - Johnny Silverhand +Thank you for your interest in contributing to Scrapling! -Thanks for considering contributing to CyberScraper 2077! This document outlines the process and guidelines for contributing to make the experience smooth for everyone involved. +Everybody is invited and welcome to contribute to Scrapling. -## ๐Ÿค Code of Conduct +Minor changes are more likely to be included promptly. Adding unit tests for new features or test cases for bugs you've fixed helps us ensure that the Pull Request (PR) is acceptable. -By participating in this project, you agree to abide by our Code of Conduct. Please read it before contributing. +There are many ways to contribute to Scrapling. Here are some of them: -## ๐Ÿš€ How to Contribute +- Report bugs and request features using the [GitHub issues](https://github.com/D4Vinci/Scrapling/issues). Please follow the issue template to help us resolve your issue quickly. +- Blog about Scrapling. Tell the world how youโ€™re using Scrapling. This will help newcomers with more examples and increase the Scrapling project's visibility. +- Join the [Discord community](https://discord.gg/EMgGbDceNQ) and share your ideas on how to improve Scrapling. Weโ€™re always open to suggestions. 
+- If you are not a developer, perhaps you would like to help with translating the [documentation](https://github.com/D4Vinci/Scrapling/tree/docs)? -### Setting Up Development Environment -1. Fork the repository -2. Clone your fork: - ```bash - git clone https://github.com/your-username/CyberScraper-2077.git - cd CyberScraper-2077 - ``` -3. Create a virtual environment: - ```bash - python -m venv venv - source venv/bin/activate # On Windows: venv\Scripts\activate - ``` -4. Install dependencies: - ```bash - pip install -r requirements.txt - playwright install - ``` +## Finding work -### Making Changes +If you have decided to make a contribution to Scrapling, but you do not know what to contribute, here are some ways to find pending work: -1. Create a new branch: - ```bash - git checkout -b feature/your-feature-name - ``` -2. Make your changes -3. Test your changes thoroughly -4. Commit your changes: - ```bash - git commit -m "feat: add new feature" +- Check out the [contribution](https://github.com/D4Vinci/Scrapling/contribute) GitHub page, which lists open issues tagged as `good first issue`. These issues provide a good starting point. +- There are also the [help wanted](https://github.com/D4Vinci/Scrapling/issues?q=is%3Aissue%20label%3A%22help%20wanted%22%20state%3Aopen) issues, but know that some may require familiarity with the Scrapling code base first. You can also target any other issue, provided it is not tagged as `invalid`, `wontfix`, or similar tags. +- If you enjoy writing automated tests, you can work on increasing our test coverage. Currently, the test coverage is around 90โ€“92%. +- Join the [Discord community](https://discord.gg/EMgGbDceNQ) and ask questions in the `#help` channel. 
+ +## Coding style +Please follow these coding conventions as we do when writing code for Scrapling: +- We use [pre-commit](https://pre-commit.com/) to automatically address simple code issues before every commit, so please install it and run `pre-commit install` to set it up. This will install hooks to run [ruff](https://docs.astral.sh/ruff/), [bandit](https://github.com/PyCQA/bandit), and [vermin](https://github.com/netromdk/vermin) on every commit. We are currently using a workflow to automatically run these tools on every PR, so if your code doesn't pass these checks, the PR will be rejected. +- We use type hints for better code clarity and [pyright](https://github.com/microsoft/pyright) for static type checking, which depends on the type hints, of course. +- We use the conventional commit messages format as [here](https://gist.github.com/qoomon/5dfcdf8eec66a051ecd85625518cfd13#types), so for example, we use the following prefixes for commit messages: + + | Prefix | When to use it | + |-------------|--------------------------| + | `feat:` | New feature added | + | `fix:` | Bug fix | + | `docs:` | Documentation change/add | + | `test:` | Tests | + | `refactor:` | Code refactoring | + | `chore:` | Maintenance tasks | + + Then include the details of the change in the commit message body/description. + + Example: ``` -5. Push to your fork: - ```bash - git push origin feature/your-feature-name + feat: add `adaptive` for similar elements + + - Added find_similar() method + - Implemented pattern matching + - Added tests and documentation ``` -6. Create a Pull Request - -## ๐Ÿ“ Commit Message Guidelines -We follow [Conventional Commits](https://www.conventionalcommits.org/). Your commit messages should be structured as follows: +> Please donโ€™t put your name in the code you contribute; git provides enough metadata to identify the author of the code. 
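For illustration only, the commit-prefix convention in the table above can be sketched as a tiny check. The helper below is hypothetical and not part of Scrapling or its tooling; it simply mirrors the prefixes the table lists.

```python
# Hypothetical sketch of the commit-message convention above;
# these names are illustrative and not part of the Scrapling code base.
ALLOWED_PREFIXES = ("feat:", "fix:", "docs:", "test:", "refactor:", "chore:")


def has_conventional_prefix(subject: str) -> bool:
    """Return True when a commit subject starts with one of the agreed prefixes."""
    return subject.startswith(ALLOWED_PREFIXES)


print(has_conventional_prefix("feat: add `adaptive` for similar elements"))  # True
print(has_conventional_prefix("Added a new feature"))  # False
```

In practice you do not need such a script yourself; the pre-commit hooks and the PR checks described above enforce the project's conventions for you.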
+## Development
+Setting the Scrapling logging level to `DEBUG` makes it easier to see what is happening in the background.
+```python
+import logging
+logging.getLogger("scrapling").setLevel(logging.DEBUG)
+```
-():
-
-[optional body]
-
-[optional footer]
+Bonus: You can install the beta of the upcoming update from the dev branch as follows:
+```commandline
+pip3 install git+https://github.com/D4Vinci/Scrapling.git@dev
+```
-Types:
-- `feat`: New feature
-- `fix`: Bug fix
-- `docs`: Documentation changes
-- `style`: Code style changes (formatting, missing semi-colons, etc)
-- `refactor`: Code refactoring
-- `test`: Adding missing tests
-- `chore`: Changes to build process or auxiliary tools
-
-Example:
-```
-feat(scraper): add support for dynamic loading websites
+## Building Documentation
+Documentation is built using [MkDocs](https://www.mkdocs.org/). You can build it locally using the following commands:
+```bash
+pip install mkdocs-material
+mkdocs serve # Local preview
+mkdocs build # Build the static site
+```
-## 🧪 Testing Guidelines
-
-- Write tests for new features
-- Ensure all tests pass before submitting PR
-- Follow existing test patterns
-- Include both unit and integration tests when applicable
-
-## 📚 Documentation Guidelines
-
-- Update README.md if adding new features
-- Add docstrings to new functions/classes
-- Include code examples when appropriate
-- Keep documentation clear and concise
-
-## 🏗️ Project Structure
+## Tests
+Scrapling includes a comprehensive test suite that can be executed with pytest. First, install all the libraries and pytest plugins listed in `tests/requirements.txt`.
Then, running the tests will result in an output like this:
+   ```bash
+   $ pytest tests -n auto
+   =============================== test session starts ===============================
+   platform darwin -- Python 3.13.8, pytest-8.4.2, pluggy-1.6.0 -- /Users//.venv/bin/python3.13
+   cachedir: .pytest_cache
+   rootdir: /Users//scrapling
+   configfile: pytest.ini
+   plugins: asyncio-1.2.0, anyio-4.11.0, xdist-3.8.0, httpbin-2.1.0, cov-7.0.0
+   asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
+   10 workers [271 items]
+   scheduling tests via LoadScheduling
+
+   ......
+
+   =============================== 271 passed in 52.68s ==============================
+   ```
+Note that we used `-n auto` in the command above to run the tests in parallel workers and speed up the run.
+Bonus: You can also measure test coverage with the `pytest-cov` plugin:
+```bash
+pytest --cov=scrapling tests/
+```
-CyberScraper-2077/
-├── app/
-│   ├── scrapers/
-│   ├── utils/
-│   └── ui_components/
-├── src/
-│   └── models/
-├── tests/
-└── docs/
-```
-
-- Place new scraper implementations in `app/scrapers/`
-- Add utility functions in `app/utils/`
-- UI components go in `app/ui_components/`
-- Model-related code goes in `src/models/`
-
-## 🎯 Feature Requests
-
-- Use GitHub Issues to propose new features
-- Tag feature requests with `enhancement`
-- Provide clear use cases
-- Discuss implementation approach
-
-## 🐛 Bug Reports
-
-When reporting bugs, include:
-- Detailed description of the issue
-- Steps to reproduce
-- Expected vs actual behavior
-- Environment details (OS, Python version, etc.)
-- Screenshots if applicable
-
-## 🔍 Pull Request Process
-
-1. Update documentation
-2. Add/update tests
-3. Ensure CI/CD pipeline passes
-4. Get at least one code review
-5. Squash commits if requested
-6. Ensure branch is up to date with main
-
-## ⚙️ Development Best Practices
-
-1.
Follow PEP 8 style guide -2. Use type hints -3. Keep functions/methods focused and small -4. Comment complex logic -5. Use meaningful variable/function names -6. Handle errors appropriately -7. Log important operations - -## ๐Ÿšซ What to Avoid - -- Breaking existing functionality -- Introducing unnecessary dependencies -- Making large, unfocused PRs -- Ignoring code review feedback -- Modifying core functionality without discussion - -## ๐Ÿ† Recognition - -Contributors will be added to our README.md and CONTRIBUTORS.md files. We value and appreciate all contributions! - -## ๐Ÿ“ž Getting Help - -- Create an issue for questions -- Join our Discord community -- Check existing documentation -- Look through closed issues - -## ๐Ÿ“œ License -By contributing, you agree that your contributions will be licensed under the project's MIT License. +## Making a Pull Request +To ensure that your PR gets accepted, please make sure that your PR is based on the latest changes from the dev branch and that it satisfies the following requirements: -Remember: In Night City - and in open source - style is everything, choom. Let's keep the code clean and the commits conventional. +- The PR should be made against the [**dev**](https://github.com/D4Vinci/Scrapling/tree/dev) branch of Scrapling. Any PR made against the main branch will be rejected. +- The code should be passing all available tests. We use tox with GitHub's CI to run the current tests on all supported Python versions for every code-related commit. +- The code should be passing all code quality checks we mentioned above. We are using GitHub's CI to enforce the code style checks performed by pre-commit. If you were using the pre-commit hooks we discussed above, you should not see any issues when committing your changes. +- Make your changes, keep the code clean with an explanation of any part that might be vague, and remember to create a separate virtual environment for this project. 
+- If you are adding a new feature, please add tests for it. +- If you are fixing a bug, please add code with the PR that reproduces the bug. \ No newline at end of file diff --git a/Dockerfile b/Dockerfile index 07222313098910c8af874a9a2aca59657e739477..43b04e4d45848b64fadd346e42b60bef6700c7a8 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,102 +1,48 @@ -# Use Python 3.12 for better performance and compatibility -FROM python:3.12-slim-bookworm +FROM python:3.12-slim-trixie -# Set environment variables -ENV PYTHONUNBUFFERED=1 \ - PYTHONDONTWRITEBYTECODE=1 \ - PORT=7860 \ - UV_SYSTEM_PYTHON=1 \ - HOME=/home/user \ - STREAMLIT_BROWSER_GATHER_USAGE_STATS=false \ - STREAMLIT_SERVER_HEADLESS=true \ - STREAMLIT_SERVER_PORT=8501 \ - STREAMLIT_SERVER_ADDRESS=0.0.0.0 - -# Install system dependencies -RUN apt-get update && apt-get install -y \ - wget \ - gnupg \ - git \ - tor \ - tor-geoipdb \ - netcat-traditional \ - curl \ - build-essential \ - python3-dev \ - libffi-dev \ - procps \ - nginx \ - # Browser dependencies for Playwright/Patchright - libglib2.0-0 \ - libnspr4 \ - libnss3 \ - libdbus-1-3 \ - libatk1.0-0 \ - libatk-bridge2.0-0 \ - libcups2 \ - libxkbcommon0 \ - libatspi2.0-0 \ - libxcomposite1 \ - libxdamage1 \ - libxfixes3 \ - libxrandr2 \ - libgbm1 \ - libcairo2 \ - libpango-1.0-0 \ - libasound2 \ - && apt-get clean \ - && rm -rf /var/lib/apt/lists/* - -# Install uv +LABEL io.modelcontextprotocol.server.name="io.github.D4Vinci/Scrapling" COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ -# Set up working directory -WORKDIR /app - -# Copy requirements and install as root -COPY requirements.txt . 
-RUN uv pip install --system -r requirements.txt -RUN uv pip install --system fastapi uvicorn - -# Install patchright browser (Chromium) -RUN patchright install chromium - -# Create a non-root user -RUN useradd -m -u 1000 user +# Set environment variables +ENV DEBIAN_FRONTEND=noninteractive \ + PYTHONUNBUFFERED=1 \ + PYTHONDONTWRITEBYTECODE=1 -# Configure Tor -RUN echo "SocksPort 9050" >> /etc/tor/torrc && \ - echo "ControlPort 9051" >> /etc/tor/torrc && \ - echo "CookieAuthentication 1" >> /etc/tor/torrc && \ - echo "DataDirectory /var/lib/tor" >> /etc/tor/torrc +WORKDIR /app -# Set permissions for Tor, app directory, and nginx -RUN mkdir -p /var/lib/tor && \ - chown -R user:user /var/lib/tor && \ - chmod 700 /var/lib/tor && \ - chown -R user:user /app && \ - mkdir -p /var/log/nginx /var/lib/nginx /tmp && \ - chown -R user:user /var/log/nginx /var/lib/nginx /tmp +# Copy dependency file first for better layer caching +COPY pyproject.toml ./ -# Pre-create streamlit config dir in home -RUN mkdir -p /home/user/.streamlit && chown -R user:user /home/user +# Install dependencies only +RUN --mount=type=cache,target=/root/.cache/uv \ + uv sync --no-install-project --all-extras --compile-bytecode -# Copy the rest of the application -COPY --chown=user:user . . +# Copy source code +COPY . . 
-# Install Scrapling in editable mode and its browser dependencies -RUN uv pip install --system -e ./Scrapling[fetchers] -RUN playwright install chromium +# Install browsers and project in one optimized layer +RUN --mount=type=cache,target=/root/.cache/uv \ + --mount=type=cache,target=/var/cache/apt \ + --mount=type=cache,target=/var/lib/apt \ + apt-get update && \ + uv run playwright install-deps chromium && \ + uv run playwright install chromium && \ + uv sync --all-extras --compile-bytecode && \ + apt-get clean && \ + rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* -# Set permissions for the start script -RUN chmod +x start.sh +# Create a non-root user +RUN useradd -m -u 1000 user && \ + chown -R user:user /app -# Switch to non-root user +# Switch to the non-root user USER user -ENV PATH="/home/user/.local/bin:$PATH" -# Expose port +# Expose port for MCP server HTTP transport EXPOSE 7860 -# Set the entrypoint -ENTRYPOINT ["./start.sh"] +# Set entrypoint to run scrapling +ENTRYPOINT ["uv", "run", "scrapling"] + +# Default command (can be overridden) +CMD ["mcp", "--http", "--port", "7860", "--host", "0.0.0.0"] diff --git a/LICENSE b/LICENSE index 331be2060a144ff736cdb9c3fa794a72323eecb0..41615aad07eb42df7f09997f207950098e853a68 100644 --- a/LICENSE +++ b/LICENSE @@ -1,21 +1,28 @@ -MIT License +BSD 3-Clause License -Copyright (c) 2024 Owen Singh +Copyright (c) 2024, Karim shoair -Permission is hereby granted, free of charge, to any person obtaining a copy -of this software and associated documentation files (the "Software"), to deal -in the Software without restriction, including without limitation the rights -to use, copy, modify, merge, publish, distribute, sublicense, and/or sell -copies of the Software, and to permit persons to whom the Software is -furnished to do so, subject to the following conditions: +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: -The above 
copyright notice and this permission notice shall be included in all -copies or substantial portions of the Software. +1. Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE -SOFTWARE. +2. Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +3. Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 0000000000000000000000000000000000000000..aa9bf232aeda83321c300908aa9475680da4ecf2 --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1,12 @@ +include LICENSE +include *.db +include *.js +include scrapling/*.db +include scrapling/*.db* +include scrapling/*.db-* +include scrapling/py.typed +include scrapling/.scrapling_dependencies_installed +include .scrapling_dependencies_installed + +recursive-exclude * __pycache__ +recursive-exclude * *.py[co] \ No newline at end of file diff --git a/README.md b/README.md index 112efaa3e60977a68e976dc6939a74f30b2593e5..be8faa846d22cc133abc3b019c9cc07664649913 100644 --- a/README.md +++ b/README.md @@ -1,406 +1,437 @@ --- title: Scraper Hub -emoji: ๐ŸŒ -colorFrom: blue -colorTo: red +emoji: ๐Ÿ•ท๏ธ +colorFrom: purple +colorTo: blue sdk: docker app_port: 7860 --- - -# ๐ŸŒ CyberScraper 2077 + + +

+ + + + Scrapling Poster + + +
+ Effortless Web Scraping for the Modern Web +

- CyberScraper 2077 Logo + D4Vinci%2FScrapling | Trendshift +
+ العربيه | Español | Deutsch | 简体中文 | 日本語 | Русский
+
+ + Tests + + PyPI version + + PyPI Downloads +
+ + Discord + + + X (formerly Twitter) Follow + +
+ + Supported Python versions

-
+ Selection methods
+ ·
+ Fetchers
+ ·
+ Spiders
+ ·
+ Proxy Rotation
+ ·
+ CLI
+ ·
+ MCP

-[![Python](https://img.shields.io/badge/Python-blue)](https://www.python.org/downloads/) -[![Streamlit](https://img.shields.io/badge/Streamlit-FF4B4B)](https://streamlit.io/) -[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) -[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com) - -> Rip data from the net, leaving no trace. Welcome to the future of web scraping. - -## ๐Ÿ” About - -CyberScraper 2077 is not just another web scraping tool โ€“ it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI, Gemini and LocalLLM Models to slice through the web's defenses, extracting the data you need with unparalleled precision and style. - -Whether you're a corpo data analyst, a street-smart netrunner, or just someone looking to pull information from the digital realm, CyberScraper 2077 has got you covered. - -

- -

- -## โœจ Features - -- **AI-Powered Extraction**: Utilizes cutting-edge AI models to understand and parse web content intelligently. -- **Sleek Streamlit Interface**: User-friendly GUI that even a chrome-armed street samurai could navigate. -- **Multi-Format Support**: Export your data in JSON, CSV, HTML, SQL or Excel โ€“ whatever fits your cyberdeck. -- **Tor Network Support**: Safely scrape .onion sites through the Tor network with automatic routing and security features. -- **Stealth Mode**: Implemented stealth mode parameters that help avoid detection as a bot. -- **Ollama Support**: Use a huge library of open source LLMs. -- **Async Operations**: Lightning-fast scraping that would make a Trauma Team jealous. -- **Smart Parsing**: Structures scraped content as if it was extracted straight from the engram of a master netrunner. -- **Caching**: Implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls. -- **Upload to Google Sheets**: Now you can easily upload your extracted CSV data to Google Sheets with one click. -- **Bypass Captcha**: Bypass captcha by using the -captcha at the end of the URL. (Currently only works natively, doesn't work on Docker) -- **Current Browser**: The current browser feature uses your local browser instance which will help you bypass 99% of bot detections. (Only use when necessary) -- **Navigate through the Pages (BETA)**: Navigate through the webpage and scrape data from different pages. - -## ๐ŸชŸ For Windows Users +Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl. -Please follow the Docker Container Guide given below, as I won't be able to maintain another version for Windows systems. +Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. 
And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.
-## 🛠 Installation
+Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
-**Note: CyberScraper 2077 requires Python 3.10 or higher.**
-
-1. Clone this repository:
-   ```bash
-   git clone https://github.com/itsOwen/CyberScraper-2077.git
-   cd CyberScraper-2077
-   ```
-
-2. Create and activate a virtual environment:
-   ```bash
-   virtualenv venv
-   source venv/bin/activate # Optional
-   ```
-
-3. Install the required packages:
-   ```bash
-   pip install -r requirements.txt
-   ```
-
-4. Install the playwright:
-   ```bash
-   playwright install
-   ```
-
-5. Set OpenAI & Gemini Key in your environment:
-   - Linux/Mac:
-   ```bash
-   export OPENAI_API_KEY="your-api-key-here"
-   export GOOGLE_API_KEY="your-api-key-here"
-   ```
-
-### Using Ollama
-
-Note: I only recommend using OpenAI and Gemini API as these models are really good at following instructions. If you are using open-source LLMs, make sure you have a good system as the speed of the data generation/presentation depends on how well your system can run the LLM. You may also have to fine-tune the prompt and add some additional filters yourself.
-
-```bash
-1. Setup Ollama using `pip install ollama`
-2. Download Ollama from the official website: https://ollama.com/download
-3. Now type: ollama pull llama3.1 or whatever LLM you want to use.
-4. Now follow the rest of the steps below.
+```python
+from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
+StealthyFetcher.adaptive = True
+p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch websites under the radar!
+products = p.css('.product', auto_save=True) # Scrape data that survives website design changes!
+products = p.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them! ``` +Or scale up to full crawls +```python +from scrapling.spiders import Spider, Response -## ๐Ÿณ Docker Installation - -If you prefer to use Docker, follow these steps to set up and run CyberScraper 2077: - -1. Ensure you have Docker installed on your system. - -2. Clone this repository: - ```bash - git clone https://github.com/itsOwen/CyberScraper-2077.git - cd CyberScraper-2077 - ``` +class MySpider(Spider): + name = "demo" + start_urls = ["https://example.com/"] -3. Build the Docker image: - ```bash - docker build -t cyberscraper-2077 . - ``` + async def parse(self, response: Response): + for item in response.css('.product'): + yield {"title": item.css('h2::text').get()} -4. Run the container: - ```bash - docker run -p 8501:8501 -e OPENAI_API_KEY="your-actual-api-key" -e GOOGLE_API_KEY="your-actual-api-key" cyberscraper-2077 - ``` +MySpider().start() +``` -### Using Ollama with Docker +# Platinum Sponsors -If you want to use Ollama with the Docker setup: +Do you want to be the first company to show up here? Click [here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646) +# Sponsors -1. Install Ollama on your host machine following the instructions at https://ollama.com/download + -2. Run Ollama on your host machine: - ```bash - ollama pull llama3.1 - ``` + + + + + + + + + -3. Find your host machine's IP address: - - On Linux/Mac: `ifconfig` or `ip addr show` - - On Windows: `ipconfig` -4. Run the Docker container with the host network and set the Ollama URL: - ```bash - docker run -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -p 8501:8501 cyberscraper-2077 - ``` + + + - Now visit the url: http://localhost:8501/ + - On Linux you might need to use this below: - ```bash - docker run -e OLLAMA_BASE_URL=http://:11434 -p 8501:8501 cyberscraper-2077 - ``` - Replace `` with your actual host machine IP address. 
+Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci) and choose the tier that suits you!
-5. In the Streamlit interface, select the Ollama model you want to use (e.g., "ollama:llama3.1").
+---
-Note: Ensure that your firewall allows connections to port 11434 for Ollama.
+## Key Features
+
+### Spiders — A Full Crawling Framework
+- 🕷️ **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
+- ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
+- 🔄 **Multi-Session Support**: Unified interface for HTTP requests and stealthy headless browsers in a single spider — route requests to different sessions by ID.
+- 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
+- 📡 **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats — ideal for UIs, pipelines, and long-running crawls.
+- 🛡️ **Blocked Request Detection**: Automatic detection and retry of blocked requests with customizable logic.
+- 📦 **Built-in Export**: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with `result.items.to_json()` / `result.items.to_jsonl()`, respectively.
+
+### Advanced Website Fetching with Session Support
+- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
+- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, supporting Playwright's Chromium and Google's Chrome.
+- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
+- **Session Management**: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
+- **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
+- **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
+- **Async Support**: Complete async support across all fetchers and dedicated async session classes.
+
+### Adaptive Scraping & AI Integration
+- 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
+- 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
+- 🔍 **Find Similar Elements**: Automatically locate elements similar to found elements.
+- 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc.), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
+
+### High-Performance & Battle-Tested Architecture
+- 🚀 **Lightning Fast**: Optimized performance outperforming most Python scraping libraries.
+- 🔋 **Memory Efficient**: Optimized data structures and lazy loading for a minimal memory footprint.
+- ⚡ **Fast JSON Serialization**: 10x faster than the standard library.
+- 🏗️ **Battle-Tested**: Not only does Scrapling have 92% test coverage and full type-hint coverage, but it has also been used daily by hundreds of Web Scrapers over the past year.
+
+### Developer/Web Scraper Friendly Experience
+- 🎯 **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up Web Scraping script development, like converting curl requests to Scrapling requests and viewing request results in your browser.
+- 🚀 **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
+- 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
+- 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
+- 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
+- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
+- 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** on each change.
+- 🔋 **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.
+
+## Getting Started
+
+Let's give you a quick glimpse of what Scrapling can do, without diving deep.
+
+### Basic Usage
+HTTP requests with session support:
+```python
+from scrapling.fetchers import Fetcher, FetcherSession
-## 🚀 Usage
+with FetcherSession(impersonate='chrome') as session: # Use the latest version of Chrome's TLS fingerprint
+    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
+    quotes = page.css('.quote .text::text').getall()
-1. Fire up the Streamlit app:
-   ```bash
-   streamlit run main.py
-   ```
+
+# Or use one-off requests
+page = Fetcher.get('https://quotes.toscrape.com/')
+quotes = page.css('.quote .text::text').getall()
+```
+Advanced stealth mode:
+```python
+from scrapling.fetchers import StealthyFetcher, StealthySession
-2.
Open your browser and navigate to `http://localhost:8501`. +with StealthySession(headless=True, solve_cloudflare=True) as session: # Keep the browser open until you finish + page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False) + data = page.css('#padded_content a').getall() -3. Enter the URL of the site you want to scrape or ask a question about the data you need. +# Or use one-off request style, it opens the browser for this request, then closes it after finishing +page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare') +data = page.css('#padded_content a').getall() +``` +Full browser automation +```python +from scrapling.fetchers import DynamicFetcher, DynamicSession -4. Ask the chatbot to extract the data in any format. Select whatever data you want to export or even everything from the webpage. +with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Keep the browser open until you finish + page = session.fetch('https://quotes.toscrape.com/', load_dom=False) + data = page.xpath('//span[@class="text"]/text()').getall() # XPath selector if you prefer it -5. Watch as CyberScraper 2077 tears through the net, extracting your data faster than you can say "flatline"! 
+# Or use one-off request style, it opens the browser for this request, then closes it after finishing +page = DynamicFetcher.fetch('https://quotes.toscrape.com/') +data = page.css('.quote .text::text').getall() +``` -Example usage with page ranges: +### Spiders +Build full crawlers with concurrent requests, multiple session types, and pause/resume: +```python +from scrapling.spiders import Spider, Request, Response + +class QuotesSpider(Spider): + name = "quotes" + start_urls = ["https://quotes.toscrape.com/"] + concurrent_requests = 10 + + async def parse(self, response: Response): + for quote in response.css('.quote'): + yield { + "text": quote.css('.text::text').get(), + "author": quote.css('.author::text').get(), + } + + next_page = response.css('.next a') + if next_page: + yield response.follow(next_page[0].attrib['href']) + +result = QuotesSpider().start() +print(f"Scraped {len(result.items)} quotes") +result.items.to_json("quotes.json") ``` -https://example.com/products 1-5 -https://example.com/search?q=cyberpunk&page={page} 1-10 +Use multiple session types in a single spider: +```python +from scrapling.spiders import Spider, Request, Response +from scrapling.fetchers import FetcherSession, AsyncStealthySession + +class MultiSessionSpider(Spider): + name = "multi" + start_urls = ["https://example.com/"] + + def configure_sessions(self, manager): + manager.add("fast", FetcherSession(impersonate="chrome")) + manager.add("stealth", AsyncStealthySession(headless=True), lazy=True) + + async def parse(self, response: Response): + for link in response.css('a::attr(href)').getall(): + # Route protected pages through the stealth session + if "protected" in link: + yield Request(link, sid="stealth") + else: + yield Request(link, sid="fast", callback=self.parse) # explicit callback ``` - -## ๐ŸŒ Multi-Page Scraping (BETA) - -> **Note**: The multi-page scraping feature is currently in beta. While functional, you may encounter occasional issues or unexpected behavior. 
We appreciate your feedback and patience as we continue to improve this feature. - -CyberScraper 2077 now supports multi-page scraping, allowing you to extract data from multiple pages of a website in one go. This feature is perfect for scraping paginated content, search results, or any site with data spread across multiple pages. - -### How to Use Multi-Page Scraping - -I suggest you enter the URL structure every time if you want to scrape multiple pages so it can detect the URL structure easily. It detects nearly all URL types. - -1. **Basic Usage**: - To scrape multiple pages, use the following format when entering the URL: - ``` - https://example.com/page 1-5 - https://example.com/p/ 1-6 - https://example.com/xample/something-something-1279?p=1 1-3 - ``` - This will scrape pages 1 through 5 of the website. - -2. **Custom Page Ranges**: - You can specify custom page ranges: - ``` - https://example.com/p/ 1-5,7,9-12 - https://example.com/xample/something-something-1279?p=1 1,7,8,9 - ``` - This will scrape pages 1 to 5, page 7, and pages 9 to 12. - -3. **URL Patterns**: - For websites with different URL structures, you can specify a pattern: - ``` - https://example.com/search?q=cyberpunk&page={page} 1-5 - ``` - Replace `{page}` with where the page number should be in the URL. - -4. **Automatic Pattern Detection**: - If you don't specify a pattern, CyberScraper 2077 will attempt to detect the URL pattern automatically. However, for best results, specifying the pattern is recommended. - -### Tips for Effective Multi-Page Scraping - -- Start with a small range of pages to test before scraping a large number. -- Be mindful of the website's load and your scraping speed to avoid overloading servers. -- Use the `simulate_human` option for more natural scraping behavior on sites with anti-bot measures. -- Regularly check the website's `robots.txt` file and terms of service to ensure compliance. 
- -### Example - -```bash -URL Example : "https://news.ycombinator.com/?p=1 1-3 or 1,2,3,4" +Pause and resume long crawls with checkpoints by running the spider like this: +```python +QuotesSpider(crawldir="./crawl_data").start() ``` +Press Ctrl+C to pause gracefully โ€” progress is saved automatically. Later, when you start the spider again, pass the same `crawldir`, and it will resume from where it stopped. -If you want to scrape a specific page, just enter the query "please scrape page number 1 or 2". If you want to scrape all pages, simply give a query like "scrape all pages in csv" or whatever format you want. - -## ๐Ÿง… Tor Network Scraping - -> **Note**: The Tor network scraping feature allows you to access and scrape .onion sites. This feature requires additional setup and should be used responsibly and legally. - -CyberScraper 2077 now supports scraping .onion sites through the Tor network, allowing you to access and extract data from the dark web safely and anonymously. This feature is perfect for researchers, security analysts, and investigators who need to gather information from Tor hidden services. - -### Prerequisites - -1. Install Tor on your system: - ```bash - # Ubuntu/Debian - sudo apt install tor - - # macOS (using Homebrew) - brew install tor - - # Start the Tor service - sudo service tor start # on Linux - brew services start tor # on macOS - ``` - -2. Install additional Python packages: - ```bash - pip install PySocks requests[socks] - ``` - -### Using Tor Scraping - -1. **Basic Usage**: - Simply enter an .onion URL, and CyberScraper will automatically detect and route it through the Tor network: - ``` - http://example123abc.onion - ``` - -2. 
**Safety Features**: - - Automatic .onion URL detection - - Built-in connection verification - - Tor Browser-like request headers - - Automatic circuit isolation +### Advanced Parsing & Navigation +```python +from scrapling.fetchers import Fetcher + +# Rich element selection and navigation +page = Fetcher.get('https://quotes.toscrape.com/') + +# Get quotes with multiple selection methods +quotes = page.css('.quote') # CSS selector +quotes = page.xpath('//div[@class="quote"]') # XPath +quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup-style +# Same as +quotes = page.find_all('div', class_='quote') +quotes = page.find_all(['div'], class_='quote') +quotes = page.find_all(class_='quote') # and so on... +# Find element by text content +quotes = page.find_by_text('quote', tag='div') + +# Advanced navigation +quote_text = page.css('.quote')[0].css('.text::text').get() +quote_text = page.css('.quote').css('.text::text').getall() # Chained selectors +first_quote = page.css('.quote')[0] +author = first_quote.next_sibling.css('.author::text') +parent_container = first_quote.parent + +# Element relationships and similarity +similar_elements = first_quote.find_similar() +below_elements = first_quote.below_elements() +``` +If you don't need to fetch a website first, you can use the parser directly, like below: +```python +from scrapling.parser import Selector -### Configuration Options +page = Selector("...") +``` +And it works precisely the same way! 
-You can customize the Tor scraping behavior by adjusting the following settings: +### Async Session Management Examples ```python -tor_config = TorConfig( - socks_port=9050, # Default Tor SOCKS port - circuit_timeout=10, # Timeout for circuit creation - auto_renew_circuit=True, # Automatically renew Tor circuit - verify_connection=True # Verify Tor connection before scraping -) +import asyncio +from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession + +async with FetcherSession(http3=True) as session: # `FetcherSession` is context-aware and can work in both sync/async patterns + page1 = session.get('https://quotes.toscrape.com/') + page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135') + +# Async session usage +async with AsyncStealthySession(max_pages=2) as session: + tasks = [] + urls = ['https://example.com/page1', 'https://example.com/page2'] + + for url in urls: + task = session.fetch(url) + tasks.append(task) + + print(session.get_pool_stats()) # Optional - The status of the browser tabs pool (busy/free/error) + results = await asyncio.gather(*tasks) + print(session.get_pool_stats()) ``` -### Security Considerations +## CLI & Interactive Shell -- Always ensure you're complying with local laws and regulations -- Use a VPN in addition to Tor for extra security -- Be patient as Tor connections can be slower than regular web scraping -- Avoid sending personal or identifying information through Tor -- Some .onion sites may be offline or unreachable +Scrapling includes a powerful command-line interface: -### Docker Support +[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339) -For Docker users, add these additional flags to enable Tor support: +Launch the interactive Web Scraping shell ```bash -docker run -p 8501:8501 \ - --network="host" \ - -e OPENAI_API_KEY="your-api-key" \ - cyberscraper-2077 +scrapling shell ``` - -### Example Usage - -

- CyberScraper 2077 Onion Scrape -

- -## ๐Ÿ” Setup Google Sheets Authentication - -1. Go to the Google Cloud Console (https://console.cloud.google.com/). -2. Select your project. -3. Navigate to "APIs & Services" > "Credentials". -4. Find your existing OAuth 2.0 Client ID and delete it. -5. Click "Create Credentials" > "OAuth client ID". -6. Choose "Web application" as the application type. -7. Name your client (e.g., "CyberScraper 2077 Web Client"). -8. Under "Authorized JavaScript origins", add: - - http://localhost:8501 - - http://localhost:8502 - - http://127.0.0.1:8501 - - http://127.0.0.1:8502 -9. Under "Authorized redirect URIs", add: - - http://localhost:8501/ - - http://127.0.0.1:8501/ - - http://localhost:8502/ - - http://127.0.0.1:8502/ -10. Click "Create" to generate the new client ID. -11. Download the new client configuration JSON file and rename it to `client_secret.json`. - -## โš™๏ธ Adjusting PlaywrightScraper Settings (optional) - -Customize the `PlaywrightScraper` settings to fit your scraping needs. If some websites are giving you issues, you might want to check the behavior of the website: - +Extract pages to a file directly without programming (Extracts the content inside the `body` tag by default). If the output file ends with `.txt`, then the text content of the target will be extracted. If it ends in `.md`, it will be a Markdown representation of the HTML content; if it ends in `.html`, it will be the HTML content itself. 
```bash -use_stealth: bool = True, -simulate_human: bool = False, -use_custom_headers: bool = True, -hide_webdriver: bool = True, -bypass_cloudflare: bool = True: +scrapling extract get 'https://example.com' content.md +scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # All elements matching the CSS selector '#fromSkipToProducts' +scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless +scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare ``` -Adjust these settings based on your target website and environment for optimal results. +> [!NOTE] +> There are many additional features, including the MCP server and the interactive Web Scraping Shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/). -You can also bypass the captcha using the ```-captcha``` parameter at the end of the URL. The browser window will pop up, complete the captcha, and go back to your terminal window. Press enter and the bot will complete its task. +## Performance Benchmarks -## ๐Ÿค Contributing +Scrapling isn't just powerful; it's also blazing fast. The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries. -We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077! +### Text Extraction Speed Test (5000 nested elements) -1. Fork the repository -2. Create your feature branch (`git checkout -b feature/amazing-feature`) -3. Commit your changes (`git commit -m 'Add some amazing feature'`) -4. Push to the branch (`git push origin feature/amazing-feature`) -5. 
Open a Pull Request +| # | Library | Time (ms) | vs Scrapling | +|---|:-----------------:|:---------:|:------------:| +| 1 | Scrapling | 2.02 | 1.0x | +| 2 | Parsel/Scrapy | 2.04 | 1.01x | +| 3 | Raw Lxml | 2.54 | 1.257x | +| 4 | PyQuery | 24.17 | ~12x | +| 5 | Selectolax | 82.63 | ~41x | +| 6 | MechanicalSoup | 1549.71 | ~767.1x | +| 7 | BS4 with Lxml | 1584.31 | ~784.3x | +| 8 | BS4 with html5lib | 3391.91 | ~1679.1x | -## ๐Ÿ”ง Troubleshooting -Ran into a glitch in the matrix? Let me know by adding the issue to this repo so that we can fix it together. +### Element Similarity & Text Search Performance -## โ“ FAQ +Scrapling's adaptive element finding capabilities significantly outperform alternatives: -**Q: Is CyberScraper 2077 legal to use?** -A: CyberScraper 2077 is designed for ethical web scraping. Always ensure you have the right to scrape a website and respect their robots.txt file. +| Library | Time (ms) | vs Scrapling | +|-------------|:---------:|:------------:| +| Scrapling | 2.39 | 1.0x | +| AutoScraper | 12.45 | 5.209x | -**Q: Can I use this for commercial purposes?** -A: Yes, under the terms of the MIT License. -## ๐Ÿ“„ License +> All benchmarks represent averages of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology. -This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. Use it, mod it, sell it โ€“ just don't blame us if you end up flatlined. +## Installation -## ๐Ÿ“ž Contact +Scrapling requires Python 3.10 or higher: -Got questions? Need support? Want to hire me for a gig? - -- Email: owensingh72@proton.me - Website: [owen.sh](https://owen.sh) - -## ๐Ÿšจ Disclaimer - -Listen up, choombas! Before you jack into this code, you better understand the risks: +```bash +pip install scrapling +``` -1. This software is provided "as is", without warranty of any kind, express or implied. 
+This installation only includes the parser engine and its dependencies, without any fetcher or command-line dependencies. + +### Optional Dependencies + +1. If you are going to use any of the extra features below, the fetchers, or their classes, you will need to install the fetchers' dependencies and their browser dependencies as follows: + ```bash + pip install "scrapling[fetchers]" + + scrapling install # normal install + scrapling install --force # force reinstall + ``` + + This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies. + + Or you can run the install from code instead of using the command, like this: + ```python + from scrapling.cli import install + + install([], standalone_mode=False) # normal install + install(["--force"], standalone_mode=False) # force reinstall + ``` + +2. Extra features: + - Install the MCP server feature: + ```bash + pip install "scrapling[ai]" + ``` + - Install shell features (Web Scraping shell and the `extract` command): + ```bash + pip install "scrapling[shell]" + ``` + - Install everything: + ```bash + pip install "scrapling[all]" + ``` + Remember that you need to install the browser dependencies with `scrapling install` after any of these extras (if you haven't already). + +### Docker +You can also pull a Docker image with all extras and browsers preinstalled from Docker Hub: +```bash +docker pull pyd4vinci/scrapling +``` +Or download it from the GitHub registry: +```bash +docker pull ghcr.io/d4vinci/scrapling:latest +``` +This image is automatically built and pushed using GitHub Actions and the repository's main branch. -2. The authors are not liable for any damages or losses resulting from the use of this software. +## Contributing -We welcome contributions! 
Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started. -4. We do not guarantee the accuracy, completeness, or reliability of any data obtained through this tool. +## Disclaimer -5. By using this software, you acknowledge that you are doing so at your own risk. +> [!CAUTION] +> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files. -6. You are responsible for complying with all applicable laws and regulations in your use of this software. +## License -7. We reserve the right to modify or discontinue the software at any time without notice. +This work is licensed under the BSD-3-Clause License. -Remember, samurai: In the dark future of the NET, knowledge is power, but it's also a double-edged sword. Use this tool wisely, and may your connection always be strong and your firewalls impenetrable. Stay frosty out there in the digital frontier. +## Acknowledgments -![Alt](https://repobeats.axiom.co/api/embed/80758496e19179f355d6d71c180db7aca66d396b.svg "Repobeats analytics image") +This project includes code adapted from: +- Parsel (BSD License), used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule --- - -

- CyberScraper 2077 โ€“ Because in 2077, what makes someone a criminal? Getting caught. -

- -

- Built with love and chrome by the streets of Night City | ยฉ 2077 Owen Singh -

+
Designed & crafted with ❤️ by Karim Shoair.

\ No newline at end of file diff --git a/ROADMAP.md b/ROADMAP.md new file mode 100644 index 0000000000000000000000000000000000000000..a9db6d615a2ae8097b652ce43acd23d56e6b5223 --- /dev/null +++ b/ROADMAP.md @@ -0,0 +1,14 @@ +## TODOs +- [x] Add more tests and increase the code coverage. +- [x] Structure the tests folder in a better way. +- [x] Add more documentation. +- [x] Add the browsing ability. +- [x] Create detailed documentation for the 'readthedocs' website, preferably add GitHub action for deploying it. +- [ ] Create a Scrapy plugin/decorator to make it replace parsel in the response argument when needed. +- [x] Need to add more functionality to `AttributesHandler` and more navigation functions to `Selector` object (ex: functions similar to map, filter, and reduce functions but here pass it to the element and the function is executed on children, siblings, next elements, etc...) +- [x] Add `.filter` method to `Selectors` object and other similar methods. +- [ ] Add functionality to automatically detect pagination URLs +- [ ] Add the ability to auto-detect schemas in pages and manipulate them. +- [ ] Add `analyzer` ability that tries to learn about the page through meta-elements and return what it learned +- [ ] Add the ability to generate a regex from a group of elements (Like for all href attributes) +- \ No newline at end of file diff --git a/benchmarks.py b/benchmarks.py new file mode 100644 index 0000000000000000000000000000000000000000..fcb089c767eb1e9edcf90c51de8fc137a305cae3 --- /dev/null +++ b/benchmarks.py @@ -0,0 +1,146 @@ +import functools +import time +import timeit +from statistics import mean + +import requests +from autoscraper import AutoScraper +from bs4 import BeautifulSoup +from lxml import etree, html +from mechanicalsoup import StatefulBrowser +from parsel import Selector +from pyquery import PyQuery as pq +from selectolax.parser import HTMLParser + +from scrapling import Selector as ScraplingSelector + +large_html = ( + "" + '
<div class="item">' * 5000 + "</div>
" * 5000 + "" +) + + +def benchmark(func): + @functools.wraps(func) + def wrapper(*args, **kwargs): + benchmark_name = func.__name__.replace("test_", "").replace("_", " ") + print(f"-> {benchmark_name}", end=" ", flush=True) + # Warm-up phase + timeit.repeat( + lambda: func(*args, **kwargs), number=2, repeat=2, globals=globals() + ) + # Measure time (1 run, repeat 100 times, take average) + times = timeit.repeat( + lambda: func(*args, **kwargs), + number=1, + repeat=100, + globals=globals(), + timer=time.process_time, + ) + min_time = round(mean(times) * 1000, 2) # Convert to milliseconds + print(f"average execution time: {min_time} ms") + return min_time + + return wrapper + + +@benchmark +def test_lxml(): + return [ + e.text + for e in etree.fromstring( + large_html, + # Scrapling and Parsel use the same parser inside, so this is just to make it fair + parser=html.HTMLParser(recover=True, huge_tree=True), + ).cssselect(".item") + ] + + +@benchmark +def test_bs4_lxml(): + return [e.text for e in BeautifulSoup(large_html, "lxml").select(".item")] + + +@benchmark +def test_bs4_html5lib(): + return [e.text for e in BeautifulSoup(large_html, "html5lib").select(".item")] + + +@benchmark +def test_pyquery(): + return [e.text() for e in pq(large_html)(".item").items()] + + +@benchmark +def test_scrapling(): + # No need to do `.extract()` like parsel to extract text + # Also, this is faster than `[t.text for t in Selector(large_html, adaptive=False).css('.item')]` + # for obvious reasons, of course. 
+ return ScraplingSelector(large_html, adaptive=False).css(".item::text").getall() + + +@benchmark +def test_parsel(): + return Selector(text=large_html).css(".item::text").extract() + + +@benchmark +def test_mechanicalsoup(): + browser = StatefulBrowser() + browser.open_fake_page(large_html) + return [e.text for e in browser.page.select(".item")] + + +@benchmark +def test_selectolax(): + return [node.text() for node in HTMLParser(large_html).css(".item")] + + +def display(results): + # Sort and display results + sorted_results = sorted(results.items(), key=lambda x: x[1]) # Sort by time + scrapling_time = results["Scrapling"] + print("\nRanked Results (fastest to slowest):") + print(f" i. {'Library tested':<18} | {'avg. time (ms)':<15} | vs Scrapling") + print("-" * 50) + for i, (test_name, test_time) in enumerate(sorted_results, 1): + compare = round(test_time / scrapling_time, 3) + print(f" {i}. {test_name:<18} | {str(test_time):<15} | {compare}") + + +@benchmark +def test_scrapling_text(request_html): + return ScraplingSelector(request_html, adaptive=False).find_by_text("Tipping the Velvet", first_match=True, clean_match=False).find_similar(ignore_attributes=["title"]) + + +@benchmark +def test_autoscraper(request_html): + # autoscraper by default returns elements text + return AutoScraper().build(html=request_html, wanted_list=["Tipping the Velvet"]) + + +if __name__ == "__main__": + print( + " Benchmark: Speed of parsing and retrieving the text content of 5000 nested elements \n" + ) + results1 = { + "Raw Lxml": test_lxml(), + "Parsel/Scrapy": test_parsel(), + "Scrapling": test_scrapling(), + "Selectolax": test_selectolax(), + "PyQuery": test_pyquery(), + "BS4 with Lxml": test_bs4_lxml(), + "MechanicalSoup": test_mechanicalsoup(), + "BS4 with html5lib": test_bs4_html5lib(), + } + + display(results1) + print("\n" + "=" * 25) + req = requests.get("https://books.toscrape.com/index.html") + print( + " Benchmark: Speed of searching for an element by text content, 
and retrieving the text of similar elements\n" + ) + results2 = { + "Scrapling": test_scrapling_text(req.text), + "AutoScraper": test_autoscraper(req.text), + } + display(results2) diff --git a/cleanup.py b/cleanup.py new file mode 100644 index 0000000000000000000000000000000000000000..eced04e2df580f647a4a81868a659f32403c74a9 --- /dev/null +++ b/cleanup.py @@ -0,0 +1,42 @@ +import shutil +from pathlib import Path + + +# Clean up after installing for local development +def clean(): + # Get the current directory + base_dir = Path.cwd() + + # Directories and patterns to clean + cleanup_patterns = [ + "build", + "dist", + "*.egg-info", + "__pycache__", + ".eggs", + ".pytest_cache", + ] + + # Clean directories + for pattern in cleanup_patterns: + for path in base_dir.glob(pattern): + try: + if path.is_dir(): + shutil.rmtree(path) + else: + path.unlink() + print(f"Removed: {path}") + except Exception as e: + print(f"Could not remove {path}: {e}") + + # Remove compiled Python files + for path in base_dir.rglob("*.py[co]"): + try: + path.unlink() + print(f"Removed compiled file: {path}") + except Exception as e: + print(f"Could not remove {path}: {e}") + + +if __name__ == "__main__": + clean() diff --git a/docs/README_AR.md b/docs/README_AR.md new file mode 100644 index 0000000000000000000000000000000000000000..d5f525c0c0e60b9ace9f092793b0367a233f67f0 --- /dev/null +++ b/docs/README_AR.md @@ -0,0 +1,426 @@ + + +

+ + + + Scrapling Poster + + +
+ Effortless Web Scraping for the Modern Web +

+ +

+ + Tests + + PyPI version + + PyPI Downloads +
+ + Discord + + + X (formerly Twitter) Follow + +
+ + Supported Python versions +

+ +

+ ุทุฑู‚ ุงู„ุงุฎุชูŠุงุฑ + · + ุงุฎุชูŠุงุฑ Fetcher + · + ุงู„ุนู†ุงูƒุจ + · + ุชุฏูˆูŠุฑ ุงู„ุจุฑูˆูƒุณูŠ + · + ูˆุงุฌู‡ุฉ ุณุทุฑ ุงู„ุฃูˆุงู…ุฑ + · + ูˆุถุน MCP +

+ +Scrapling ู‡ูˆ ุฅุทุงุฑ ุนู…ู„ ุชูƒูŠููŠ ู„ู€ Web Scraping ูŠุชุนุงู…ู„ ู…ุน ูƒู„ ุดูŠุก ู…ู† ุทู„ุจ ูˆุงุญุฏ ุฅู„ู‰ ุฒุญู ูƒุงู…ู„ ุงู„ู†ุทุงู‚. + +ู…ุญู„ู„ู‡ ูŠุชุนู„ู… ู…ู† ุชุบูŠูŠุฑุงุช ุงู„ู…ูˆุงู‚ุน ูˆูŠุนูŠุฏ ุชุญุฏูŠุฏ ู…ูˆู‚ุน ุนู†ุงุตุฑูƒ ุชู„ู‚ุงุฆูŠุงู‹ ุนู†ุฏ ุชุญุฏูŠุซ ุงู„ุตูุญุงุช. ุฌูˆุงู„ุจู‡ ุชุชุฌุงูˆุฒ ุฃู†ุธู…ุฉ ู…ูƒุงูุญุฉ ุงู„ุฑูˆุจูˆุชุงุช ู…ุซู„ Cloudflare Turnstile ู…ุจุงุดุฑุฉู‹. ูˆุฅุทุงุฑ ุนู…ู„ Spider ุงู„ุฎุงุต ุจู‡ ูŠุชูŠุญ ู„ูƒ ุงู„ุชูˆุณุน ุฅู„ู‰ ุนู…ู„ูŠุงุช ุฒุญู ู…ุชุฒุงู…ู†ุฉ ูˆู…ุชุนุฏุฏุฉ ุงู„ุฌู„ุณุงุช ู…ุน ุฅูŠู‚ุงู/ุงุณุชุฆู†ุงู ูˆุชุฏูˆูŠุฑ ุชู„ู‚ุงุฆูŠ ู„ู€ Proxy - ูƒู„ ุฐู„ูƒ ููŠ ุจุถุนุฉ ุฃุณุทุฑ ู…ู† Python. ู…ูƒุชุจุฉ ูˆุงุญุฏุฉุŒ ุจุฏูˆู† ุชู†ุงุฒู„ุงุช. + +ุฒุญู ุณุฑูŠุน ู„ู„ุบุงูŠุฉ ู…ุน ุฅุญุตุงุฆูŠุงุช ููˆุฑูŠุฉ ูˆ Streaming. ู…ุจู†ูŠ ุจูˆุงุณุทุฉ ู…ุณุชุฎุฑุฌูŠ ุงู„ูˆูŠุจ ู„ู…ุณุชุฎุฑุฌูŠ ุงู„ูˆูŠุจ ูˆุงู„ู…ุณุชุฎุฏู…ูŠู† ุงู„ุนุงุฏูŠูŠู†ุŒ ู‡ู†ุงูƒ ุดูŠุก ู„ู„ุฌู…ูŠุน. + +```python +from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher +StealthyFetcher.adaptive = True +p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # ุงุญุตู„ ุนู„ู‰ ุงู„ู…ูˆู‚ุน ุจุดูƒู„ ุฎููŠ! +products = p.css('.product', auto_save=True) # ุงุณุชุฎุฑุฌ ุจูŠุงู†ุงุช ุชู†ุฌูˆ ู…ู† ุชุบูŠูŠุฑุงุช ุชุตู…ูŠู… ุงู„ู…ูˆู‚ุน! +products = p.css('.product', adaptive=True) # ู„ุงุญู‚ุงู‹ุŒ ุฅุฐุง ุชุบูŠุฑุช ุจู†ูŠุฉ ุงู„ู…ูˆู‚ุนุŒ ู…ุฑุฑ `adaptive=True` ู„ู„ุนุซูˆุฑ ุนู„ูŠู‡ุง! 
+``` +ุฃูˆ ุชูˆุณุน ุฅู„ู‰ ุนู…ู„ูŠุงุช ุฒุญู ูƒุงู…ู„ุฉ +```python +from scrapling.spiders import Spider, Response + +class MySpider(Spider): + name = "demo" + start_urls = ["https://example.com/"] + + async def parse(self, response: Response): + for item in response.css('.product'): + yield {"title": item.css('h2::text').get()} + +MySpider().start() +``` + + +# ุงู„ุฑุนุงุฉ ุงู„ุจู„ุงุชูŠู†ูŠูˆู† + +ู‡ู„ ุชุฑูŠุฏ ุฃู† ุชูƒูˆู† ุฃูˆู„ ุดุฑูƒุฉ ุชุธู‡ุฑ ู‡ู†ุงุŸ ุงู†ู‚ุฑ [ู‡ู†ุง](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646) +# ุงู„ุฑุนุงุฉ + + + + + + + + + + + + + + + + + + + + +ู‡ู„ ุชุฑูŠุฏ ุนุฑุถ ุฅุนู„ุงู†ูƒ ู‡ู†ุงุŸ ุงู†ู‚ุฑ [ู‡ู†ุง](https://github.com/sponsors/D4Vinci) ูˆุงุฎุชุฑ ุงู„ู…ุณุชูˆู‰ ุงู„ุฐูŠ ูŠู†ุงุณุจูƒ! + +--- + +## ุงู„ู…ูŠุฒุงุช ุงู„ุฑุฆูŠุณูŠุฉ + +### Spiders โ€” ุฅุทุงุฑ ุนู…ู„ ุฒุญู ูƒุงู…ู„ +- ๐Ÿ•ท๏ธ **ูˆุงุฌู‡ุฉ Spider ุดุจูŠู‡ุฉ ุจู€ Scrapy**: ุนุฑู‘ู Spiders ู…ุน `start_urls`ุŒ ูˆ async `parse` callbacksุŒ ูˆูƒุงุฆู†ุงุช `Request`/`Response`. +- โšก **ุฒุญู ู…ุชุฒุงู…ู†**: ุญุฏูˆุฏ ุชุฒุงู…ู† ู‚ุงุจู„ุฉ ู„ู„ุชูƒูˆูŠู†ุŒ ูˆุชุญูƒู… ุจุงู„ุณุฑุนุฉ ุญุณุจ ุงู„ู†ุทุงู‚ุŒ ูˆุชุฃุฎูŠุฑุงุช ุงู„ุชู†ุฒูŠู„. +- ๐Ÿ”„ **ุฏุนู… ุงู„ุฌู„ุณุงุช ุงู„ู…ุชุนุฏุฏุฉ**: ูˆุงุฌู‡ุฉ ู…ูˆุญุฏุฉ ู„ุทู„ุจุงุช HTTPุŒ ูˆู…ุชุตูุญุงุช ุฎููŠุฉ ุจุฏูˆู† ูˆุงุฌู‡ุฉ ููŠ Spider ูˆุงุญุฏ โ€” ูˆุฌู‘ู‡ ุงู„ุทู„ุจุงุช ุฅู„ู‰ ุฌู„ุณุงุช ู…ุฎุชู„ูุฉ ุจุงู„ู…ุนุฑู‘ู. +- ๐Ÿ’พ **ุฅูŠู‚ุงู ูˆุงุณุชุฆู†ุงู**: ุงุณุชู…ุฑุงุฑูŠุฉ ุงู„ุฒุญู ุงู„ู‚ุงุฆู…ุฉ ุนู„ู‰ Checkpoint. ุงุถุบุท Ctrl+C ู„ู„ุฅูŠู‚ุงู ุจุณู„ุงุณุฉุ› ุฃุนุฏ ุงู„ุชุดุบูŠู„ ู„ู„ุงุณุชุฆู†ุงู ู…ู† ุญูŠุซ ุชูˆู‚ูุช. +- ๐Ÿ“ก **ูˆุถุน Streaming**: ุจุซ ุงู„ุนู†ุงุตุฑ ุงู„ู…ุณุชุฎุฑุฌุฉ ููˆุฑ ูˆุตูˆู„ู‡ุง ุนุจุฑ `async for item in spider.stream()` ู…ุน ุฅุญุตุงุฆูŠุงุช ููˆุฑูŠุฉ โ€” ู…ุซุงู„ูŠ ู„ูˆุงุฌู‡ุงุช ุงู„ู…ุณุชุฎุฏู… ูˆุฎุทูˆุท ุงู„ุฃู†ุงุจูŠุจ ูˆุนู…ู„ูŠุงุช ุงู„ุฒุญู ุงู„ุทูˆูŠู„ุฉ. 
+- ๐Ÿ›ก๏ธ **ูƒุดู ุงู„ุทู„ุจุงุช ุงู„ู…ุญุธูˆุฑุฉ**: ูƒุดู ุชู„ู‚ุงุฆูŠ ูˆุฅุนุงุฏุฉ ู…ุญุงูˆู„ุฉ ู„ู„ุทู„ุจุงุช ุงู„ู…ุญุธูˆุฑุฉ ู…ุน ู…ู†ุทู‚ ู‚ุงุจู„ ู„ู„ุชุฎุตูŠุต. +- ๐Ÿ“ฆ **ุชุตุฏูŠุฑ ู…ุฏู…ุฌ**: ุตุฏู‘ุฑ ุงู„ู†ุชุงุฆุฌ ุนุจุฑ ุงู„ุฎุทุงูุงุช ูˆุฎุท ุงู„ุฃู†ุงุจูŠุจ ุงู„ุฎุงุต ุจูƒ ุฃูˆ JSON/JSONL ุงู„ู…ุฏู…ุฌ ู…ุน `result.items.to_json()` / `result.items.to_jsonl()` ุนู„ู‰ ุงู„ุชูˆุงู„ูŠ. + +### ุฌู„ุจ ู…ุชู‚ุฏู… ู„ู„ู…ูˆุงู‚ุน ู…ุน ุฏุนู… ุงู„ุฌู„ุณุงุช +- **ุทู„ุจุงุช HTTP**: ุทู„ุจุงุช HTTP ุณุฑูŠุนุฉ ูˆุฎููŠุฉ ู…ุน ูุฆุฉ `Fetcher`. ูŠู…ูƒู†ู‡ุง ุชู‚ู„ูŠุฏ ุจุตู…ุฉ TLS ู„ู„ู…ุชุตูุญ ูˆุงู„ุฑุคูˆุณ ูˆุงุณุชุฎุฏุงู… HTTP/3. +- **ุงู„ุชุญู…ูŠู„ ุงู„ุฏูŠู†ุงู…ูŠูƒูŠ**: ุฌู„ุจ ุงู„ู…ูˆุงู‚ุน ุงู„ุฏูŠู†ุงู…ูŠูƒูŠุฉ ู…ุน ุฃุชู…ุชุฉ ูƒุงู…ู„ุฉ ู„ู„ู…ุชุตูุญ ู…ู† ุฎู„ุงู„ ูุฆุฉ `DynamicFetcher` ุงู„ุชูŠ ุชุฏุนู… Chromium ู…ู† Playwright ูˆ Google Chrome. +- **ุชุฌุงูˆุฒ ู…ูƒุงูุญุฉ ุงู„ุฑูˆุจูˆุชุงุช**: ู‚ุฏุฑุงุช ุชุฎููŠ ู…ุชู‚ุฏู…ุฉ ู…ุน `StealthyFetcher` ูˆุงู†ุชุญุงู„ fingerprint. ูŠู…ูƒู†ู‡ ุชุฌุงูˆุฒ ุฌู…ูŠุน ุฃู†ูˆุงุน Turnstile/Interstitial ู…ู† Cloudflare ุจุณู‡ูˆู„ุฉ ุจุงู„ุฃุชู…ุชุฉ. +- **ุฅุฏุงุฑุฉ ุงู„ุฌู„ุณุงุช**: ุฏุนู… ุงู„ุฌู„ุณุงุช ุงู„ู…ุณุชู…ุฑุฉ ู…ุน ูุฆุงุช `FetcherSession` ูˆ`StealthySession` ูˆ`DynamicSession` ู„ุฅุฏุงุฑุฉ ู…ู„ูุงุช ุชุนุฑูŠู ุงู„ุงุฑุชุจุงุท ูˆุงู„ุญุงู„ุฉ ุนุจุฑ ุงู„ุทู„ุจุงุช. +- **ุชุฏูˆูŠุฑ Proxy**: `ProxyRotator` ู…ุฏู…ุฌ ู…ุน ุงุณุชุฑุงุชูŠุฌูŠุงุช ุงู„ุชุฏูˆูŠุฑ ุงู„ุฏูˆุฑูŠ ุฃูˆ ุงู„ู…ุฎุตุตุฉ ุนุจุฑ ุฌู…ูŠุน ุฃู†ูˆุงุน ุงู„ุฌู„ุณุงุชุŒ ุจุงู„ุฅุถุงูุฉ ุฅู„ู‰ ุชุฌุงูˆุฒุงุช Proxy ู„ูƒู„ ุทู„ุจ. +- **ุญุธุฑ ุงู„ู†ุทุงู‚ุงุช**: ุญุธุฑ ุงู„ุทู„ุจุงุช ุฅู„ู‰ ู†ุทุงู‚ุงุช ู…ุญุฏุฏุฉ (ูˆู†ุทุงู‚ุงุชู‡ุง ุงู„ูุฑุนูŠุฉ) ููŠ ุงู„ุฌูˆุงู„ุจ ุงู„ู…ุนุชู…ุฏุฉ ุนู„ู‰ ุงู„ู…ุชุตูุญ. +- **ุฏุนู… Async**: ุฏุนู… async ูƒุงู…ู„ ุนุจุฑ ุฌู…ูŠุน ุงู„ุฌูˆุงู„ุจ ูˆูุฆุงุช ุงู„ุฌู„ุณุงุช async ุงู„ู…ุฎุตุตุฉ. 
+ +### ุงู„ุงุณุชุฎุฑุงุฌ ุงู„ุชูƒูŠููŠ ูˆุงู„ุชูƒุงู…ู„ ู…ุน ุงู„ุฐูƒุงุก ุงู„ุงุตุทู†ุงุนูŠ +- ๐Ÿ”„ **ุชุชุจุน ุงู„ุนู†ุงุตุฑ ุงู„ุฐูƒูŠ**: ุฅุนุงุฏุฉ ุชุญุฏูŠุฏ ู…ูˆู‚ุน ุงู„ุนู†ุงุตุฑ ุจุนุฏ ุชุบูŠูŠุฑุงุช ุงู„ู…ูˆู‚ุน ุจุงุณุชุฎุฏุงู… ุฎูˆุงุฑุฒู…ูŠุงุช ุงู„ุชุดุงุจู‡ ุงู„ุฐูƒูŠุฉ. +- ๐ŸŽฏ **ุงู„ุงุฎุชูŠุงุฑ ุงู„ู…ุฑู† ุงู„ุฐูƒูŠ**: ู…ุญุฏุฏุงุช CSSุŒ ู…ุญุฏุฏุงุช XPathุŒ ุงู„ุจุญุซ ุงู„ู‚ุงุฆู… ุนู„ู‰ ุงู„ูู„ุงุชุฑุŒ ุงู„ุจุญุซ ุงู„ู†ุตูŠุŒ ุงู„ุจุญุซ ุจุงู„ุชุนุจูŠุฑุงุช ุงู„ุนุงุฏูŠุฉ ูˆุงู„ู…ุฒูŠุฏ. +- ๐Ÿ” **ุงู„ุจุญุซ ุนู† ุนู†ุงุตุฑ ู…ุดุงุจู‡ุฉ**: ุชุญุฏูŠุฏ ุงู„ุนู†ุงุตุฑ ุงู„ู…ุดุงุจู‡ุฉ ู„ู„ุนู†ุงุตุฑ ุงู„ู…ูˆุฌูˆุฏุฉ ุชู„ู‚ุงุฆูŠุงู‹. +- ๐Ÿค– **ุฎุงุฏู… MCP ู„ู„ุงุณุชุฎุฏุงู… ู…ุน ุงู„ุฐูƒุงุก ุงู„ุงุตุทู†ุงุนูŠ**: ุฎุงุฏู… MCP ู…ุฏู…ุฌ ู„ู€ Web Scraping ุจู…ุณุงุนุฏุฉ ุงู„ุฐูƒุงุก ุงู„ุงุตุทู†ุงุนูŠ ูˆุงุณุชุฎุฑุงุฌ ุงู„ุจูŠุงู†ุงุช. ูŠุชู…ูŠุฒ ุฎุงุฏู… MCP ุจู‚ุฏุฑุงุช ู‚ูˆูŠุฉ ู…ุฎุตุตุฉ ุชุณุชููŠุฏ ู…ู† Scrapling ู„ุงุณุชุฎุฑุงุฌ ุงู„ู…ุญุชูˆู‰ ุงู„ู…ุณุชู‡ุฏู ู‚ุจู„ ุชู…ุฑูŠุฑู‡ ุฅู„ู‰ ุงู„ุฐูƒุงุก ุงู„ุงุตุทู†ุงุนูŠ (Claude/Cursor/ุฅู„ุฎ)ุŒ ูˆุจุงู„ุชุงู„ูŠ ุชุณุฑูŠุน ุงู„ุนู…ู„ูŠุงุช ูˆุชู‚ู„ูŠู„ ุงู„ุชูƒุงู„ูŠู ุนู† ุทุฑูŠู‚ ุชู‚ู„ูŠู„ ุงุณุชุฎุฏุงู… ุงู„ุฑู…ูˆุฒ. ([ููŠุฏูŠูˆ ุชูˆุถูŠุญูŠ](https://www.youtube.com/watch?v=qyFk3ZNwOxE)) + +### ุจู†ูŠุฉ ุนุงู„ูŠุฉ ุงู„ุฃุฏุงุก ูˆู…ุฎุชุจุฑุฉ ู…ูŠุฏุงู†ูŠุงู‹ +- ๐Ÿš€ **ุณุฑูŠุน ูƒุงู„ุจุฑู‚**: ุฃุฏุงุก ู…ุญุณู‘ู† ูŠุชููˆู‚ ุนู„ู‰ ู…ุนุธู… ู…ูƒุชุจุงุช Web Scraping ููŠ Python. +- ๐Ÿ”‹ **ูุนุงู„ ููŠ ุงุณุชุฎุฏุงู… ุงู„ุฐุงูƒุฑุฉ**: ู‡ูŠุงูƒู„ ุจูŠุงู†ุงุช ู…ุญุณู‘ู†ุฉ ูˆุชุญู…ูŠู„ ูƒุณูˆู„ ู„ุฃู‚ู„ ุงุณุชุฎุฏุงู… ู„ู„ุฐุงูƒุฑุฉ. +- โšก **ุชุณู„ุณู„ JSON ุณุฑูŠุน**: ุฃุณุฑุน 10 ู…ุฑุงุช ู…ู† ุงู„ู…ูƒุชุจุฉ ุงู„ู‚ูŠุงุณูŠุฉ. 
+- ๐Ÿ—๏ธ **ู…ูุฎุชุจุฑ ู…ูŠุฏุงู†ูŠุงู‹**: ู„ุง ูŠู…ุชู„ูƒ Scrapling ูู‚ุท ุชุบุทูŠุฉ ุงุฎุชุจุงุฑ ุจู†ุณุจุฉ 92ูช ูˆุชุบุทูŠุฉ ูƒุงู…ู„ุฉ ู„ุชู„ู…ูŠุญุงุช ุงู„ุฃู†ูˆุงุนุŒ ุจู„ ุชู… ุงุณุชุฎุฏุงู…ู‡ ูŠูˆู…ูŠุงู‹ ู…ู† ู‚ุจู„ ู…ุฆุงุช ู…ุณุชุฎุฑุฌูŠ ุงู„ูˆูŠุจ ุฎู„ุงู„ ุงู„ุนุงู… ุงู„ู…ุงุถูŠ. + +### ุชุฌุฑุจุฉ ุตุฏูŠู‚ุฉ ู„ู„ู…ุทูˆุฑูŠู†/ู…ุณุชุฎุฑุฌูŠ ุงู„ูˆูŠุจ +- ๐ŸŽฏ **Shell ุชูุงุนู„ูŠ ู„ู€ Web Scraping**: Shell IPython ู…ุฏู…ุฌ ุงุฎุชูŠุงุฑูŠ ู…ุน ุชูƒุงู…ู„ ScraplingุŒ ูˆุงุฎุชุตุงุฑุงุชุŒ ูˆุฃุฏูˆุงุช ุฌุฏูŠุฏุฉ ู„ุชุณุฑูŠุน ุชุทูˆูŠุฑ ุณูƒุฑูŠุจุชุงุช Web ScrapingุŒ ู…ุซู„ ุชุญูˆูŠู„ ุทู„ุจุงุช curl ุฅู„ู‰ ุทู„ุจุงุช Scrapling ูˆุนุฑุถ ู†ุชุงุฆุฌ ุงู„ุทู„ุจุงุช ููŠ ู…ุชุตูุญูƒ. +- ๐Ÿš€ **ุงุณุชุฎุฏู…ู‡ ู…ุจุงุดุฑุฉ ู…ู† ุงู„ุทุฑููŠุฉ**: ุงุฎุชูŠุงุฑูŠุงู‹ุŒ ูŠู…ูƒู†ูƒ ุงุณุชุฎุฏุงู… Scrapling ู„ุงุณุชุฎุฑุงุฌ ุนู†ูˆุงู† URL ุฏูˆู† ูƒุชุงุจุฉ ุณุทุฑ ูˆุงุญุฏ ู…ู† ุงู„ูƒูˆุฏ! +- ๐Ÿ› ๏ธ **ูˆุงุฌู‡ุฉ ุชู†ู‚ู„ ุบู†ูŠุฉ**: ุงุฌุชูŠุงุฒ DOM ู…ุชู‚ุฏู… ู…ุน ุทุฑู‚ ุงู„ุชู†ู‚ู„ ุจูŠู† ุงู„ุนู†ุงุตุฑ ุงู„ูˆุงู„ุฏูŠุฉ ูˆุงู„ุดู‚ูŠู‚ุฉ ูˆุงู„ูุฑุนูŠุฉ. +- ๐Ÿงฌ **ู…ุนุงู„ุฌุฉ ู†ุตูˆุต ู…ุญุณู‘ู†ุฉ**: ุชุนุจูŠุฑุงุช ุนุงุฏูŠุฉ ู…ุฏู…ุฌุฉ ูˆุทุฑู‚ ุชู†ุธูŠู ูˆุนู…ู„ูŠุงุช ู†ุตูŠุฉ ู…ุญุณู‘ู†ุฉ. +- ๐Ÿ“ **ุฅู†ุดุงุก ู…ุญุฏุฏุงุช ุชู„ู‚ุงุฆูŠ**: ุฅู†ุดุงุก ู…ุญุฏุฏุงุช CSS/XPath ู‚ูˆูŠุฉ ู„ุฃูŠ ุนู†ุตุฑ. +- ๐Ÿ”Œ **ูˆุงุฌู‡ุฉ ู…ุฃู„ูˆูุฉ**: ู…ุดุงุจู‡ ู„ู€ Scrapy/BeautifulSoup ู…ุน ู†ูุณ ุงู„ุนู†ุงุตุฑ ุงู„ุฒุงุฆูุฉ ุงู„ู…ุณุชุฎุฏู…ุฉ ููŠ Scrapy/Parsel. +- ๐Ÿ“˜ **ุชุบุทูŠุฉ ูƒุงู…ู„ุฉ ู„ู„ุฃู†ูˆุงุน**: ุชู„ู…ูŠุญุงุช ู†ูˆุน ูƒุงู…ู„ุฉ ู„ุฏุนู… IDE ู…ู…ุชุงุฒ ูˆุฅูƒู…ุงู„ ุงู„ูƒูˆุฏ. ูŠุชู… ูุญุต ู‚ุงุนุฏุฉ ุงู„ูƒูˆุฏ ุจุงู„ูƒุงู…ู„ ุชู„ู‚ุงุฆูŠุงู‹ ุจูˆุงุณุทุฉ **PyRight** ูˆ**MyPy** ู…ุน ูƒู„ ุชุบูŠูŠุฑ. +- ๐Ÿ”‹ **ุตูˆุฑุฉ Docker ุฌุงู‡ุฒุฉ**: ู…ุน ูƒู„ ุฅุตุฏุงุฑุŒ ูŠุชู… ุจู†ุงุก ูˆุฏูุน ุตูˆุฑุฉ Docker ุชุญุชูˆูŠ ุนู„ู‰ ุฌู…ูŠุน ุงู„ู…ุชุตูุญุงุช ุชู„ู‚ุงุฆูŠุงู‹. 
+ +## ุงู„ุจุฏุก + +ู„ู†ู„ู‚ู ู†ุธุฑุฉ ุณุฑูŠุนุฉ ุนู„ู‰ ู…ุง ูŠู…ูƒู† ู„ู€ Scrapling ูุนู„ู‡ ุฏูˆู† ุงู„ุชุนู…ู‚. + +### ุงู„ุงุณุชุฎุฏุงู… ุงู„ุฃุณุงุณูŠ +ุทู„ุจุงุช HTTP ู…ุน ุฏุนู… ุงู„ุฌู„ุณุงุช +```python +from scrapling.fetchers import Fetcher, FetcherSession + +with FetcherSession(impersonate='chrome') as session: # ุงุณุชุฎุฏู… ุฃุญุฏุซ ุฅุตุฏุงุฑ ู…ู† ุจุตู…ุฉ TLS ู„ู€ Chrome + page = session.get('https://quotes.toscrape.com/', stealthy_headers=True) + quotes = page.css('.quote .text::text').getall() + +# ุฃูˆ ุงุณุชุฎุฏู… ุทู„ุจุงุช ู„ู…ุฑุฉ ูˆุงุญุฏุฉ +page = Fetcher.get('https://quotes.toscrape.com/') +quotes = page.css('.quote .text::text').getall() +``` +ูˆุถุน ุงู„ุชุฎููŠ ุงู„ู…ุชู‚ุฏู… +```python +from scrapling.fetchers import StealthyFetcher, StealthySession + +with StealthySession(headless=True, solve_cloudflare=True) as session: # ุฃุจู‚ู ุงู„ู…ุชุตูุญ ู…ูุชูˆุญุงู‹ ุญุชู‰ ุชู†ุชู‡ูŠ + page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False) + data = page.css('#padded_content a').getall() + +# ุฃูˆ ุงุณุชุฎุฏู… ู†ู…ุท ุงู„ุทู„ุจ ู„ู…ุฑุฉ ูˆุงุญุฏุฉุŒ ูŠูุชุญ ุงู„ู…ุชุตูุญ ู„ู‡ุฐุง ุงู„ุทู„ุจุŒ ุซู… ูŠุบู„ู‚ู‡ ุจุนุฏ ุงู„ุงู†ุชู‡ุงุก +page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare') +data = page.css('#padded_content a').getall() +``` +ุฃุชู…ุชุฉ ุงู„ู…ุชุตูุญ ุงู„ูƒุงู…ู„ุฉ +```python +from scrapling.fetchers import DynamicFetcher, DynamicSession + +with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # ุฃุจู‚ู ุงู„ู…ุชุตูุญ ู…ูุชูˆุญุงู‹ ุญุชู‰ ุชู†ุชู‡ูŠ + page = session.fetch('https://quotes.toscrape.com/', load_dom=False) + data = page.xpath('//span[@class="text"]/text()').getall() # ู…ุญุฏุฏ XPath ุฅุฐุง ูƒู†ุช ุชูุถู„ู‡ + +# ุฃูˆ ุงุณุชุฎุฏู… ู†ู…ุท ุงู„ุทู„ุจ ู„ู…ุฑุฉ ูˆุงุญุฏุฉุŒ ูŠูุชุญ ุงู„ู…ุชุตูุญ ู„ู‡ุฐุง ุงู„ุทู„ุจุŒ ุซู… ูŠุบู„ู‚ู‡ ุจุนุฏ ุงู„ุงู†ุชู‡ุงุก +page = DynamicFetcher.fetch('https://quotes.toscrape.com/') +data = 
page.css('.quote .text::text').getall() +``` + +### Spiders +ุงุจู†ู ุฒูˆุงุญู ูƒุงู…ู„ุฉ ู…ุน ุทู„ุจุงุช ู…ุชุฒุงู…ู†ุฉ ูˆุฃู†ูˆุงุน ุฌู„ุณุงุช ู…ุชุนุฏุฏุฉ ูˆุฅูŠู‚ุงู/ุงุณุชุฆู†ุงู: +```python +from scrapling.spiders import Spider, Request, Response + +class QuotesSpider(Spider): + name = "quotes" + start_urls = ["https://quotes.toscrape.com/"] + concurrent_requests = 10 + + async def parse(self, response: Response): + for quote in response.css('.quote'): + yield { + "text": quote.css('.text::text').get(), + "author": quote.css('.author::text').get(), + } + + next_page = response.css('.next a') + if next_page: + yield response.follow(next_page[0].attrib['href']) + +result = QuotesSpider().start() +print(f"Scraped {len(result.items)} quotes") +result.items.to_json("quotes.json") +``` +ุงุณุชุฎุฏู… ุฃู†ูˆุงุน ุฌู„ุณุงุช ู…ุชุนุฏุฏุฉ ููŠ Spider ูˆุงุญุฏ: +```python +from scrapling.spiders import Spider, Request, Response +from scrapling.fetchers import FetcherSession, AsyncStealthySession + +class MultiSessionSpider(Spider): + name = "multi" + start_urls = ["https://example.com/"] + + def configure_sessions(self, manager): + manager.add("fast", FetcherSession(impersonate="chrome")) + manager.add("stealth", AsyncStealthySession(headless=True), lazy=True) + + async def parse(self, response: Response): + for link in response.css('a::attr(href)').getall(): + # ูˆุฌู‘ู‡ ุงู„ุตูุญุงุช ุงู„ู…ุญู…ูŠุฉ ุนุจุฑ ุฌู„ุณุฉ ุงู„ุชุฎููŠ + if "protected" in link: + yield Request(link, sid="stealth") + else: + yield Request(link, sid="fast", callback=self.parse) # callback ุตุฑูŠุญ +``` +ุฃูˆู‚ู ูˆุงุณุชุฃู†ู ุนู…ู„ูŠุงุช ุงู„ุฒุญู ุงู„ุทูˆูŠู„ุฉ ู…ุน Checkpoints ุจุชุดุบูŠู„ Spider ู‡ูƒุฐุง: +```python +QuotesSpider(crawldir="./crawl_data").start() +``` +ุงุถุบุท Ctrl+C ู„ู„ุฅูŠู‚ุงู ุจุณู„ุงุณุฉ โ€” ูŠุชู… ุญูุธ ุงู„ุชู‚ุฏู… ุชู„ู‚ุงุฆูŠุงู‹. ู„ุงุญู‚ุงู‹ุŒ ุนู†ุฏ ุชุดุบูŠู„ Spider ู…ุฑุฉ ุฃุฎุฑู‰ุŒ ู…ุฑุฑ ู†ูุณ `crawldir`ุŒ ูˆุณูŠุณุชุฃู†ู ู…ู† ุญูŠุซ ุชูˆู‚ู. 
+ +### ุงู„ุชุญู„ูŠู„ ุงู„ู…ุชู‚ุฏู… ูˆุงู„ุชู†ู‚ู„ +```python +from scrapling.fetchers import Fetcher + +# ุงุฎุชูŠุงุฑ ุนู†ุงุตุฑ ุบู†ูŠ ูˆุชู†ู‚ู„ +page = Fetcher.get('https://quotes.toscrape.com/') + +# ุงุญุตู„ ุนู„ู‰ ุงู„ุงู‚ุชุจุงุณุงุช ุจุทุฑู‚ ุงุฎุชูŠุงุฑ ู…ุชุนุฏุฏุฉ +quotes = page.css('.quote') # ู…ุญุฏุฏ CSS +quotes = page.xpath('//div[@class="quote"]') # XPath +quotes = page.find_all('div', {'class': 'quote'}) # ุจุฃุณู„ูˆุจ BeautifulSoup +# ู†ูุณ ุงู„ุดูŠุก ู…ุซู„ +quotes = page.find_all('div', class_='quote') +quotes = page.find_all(['div'], class_='quote') +quotes = page.find_all(class_='quote') # ูˆู‡ูƒุฐุง... +# ุงู„ุจุญุซ ุนู† ุนู†ุตุฑ ุจู…ุญุชูˆู‰ ุงู„ู†ุต +quotes = page.find_by_text('quote', tag='div') + +# ุงู„ุชู†ู‚ู„ ุงู„ู…ุชู‚ุฏู… +quote_text = page.css('.quote')[0].css('.text::text').get() +quote_text = page.css('.quote').css('.text::text').getall() # ู…ุญุฏุฏุงุช ู…ุชุณู„ุณู„ุฉ +first_quote = page.css('.quote')[0] +author = first_quote.next_sibling.css('.author::text') +parent_container = first_quote.parent + +# ุนู„ุงู‚ุงุช ุงู„ุนู†ุงุตุฑ ูˆุงู„ุชุดุงุจู‡ +similar_elements = first_quote.find_similar() +below_elements = first_quote.below_elements() +``` +ูŠู…ูƒู†ูƒ ุงุณุชุฎุฏุงู… ุงู„ู…ุญู„ู„ ู…ุจุงุดุฑุฉ ุฅุฐุง ูƒู†ุช ู„ุง ุชุฑูŠุฏ ุฌู„ุจ ุงู„ู…ูˆุงู‚ุน ูƒู…ุง ูŠู„ูŠ: +```python +from scrapling.parser import Selector + +page = Selector("...") +``` +ูˆู‡ูˆ ูŠุนู…ู„ ุจู†ูุณ ุงู„ุทุฑูŠู‚ุฉ ุชู…ุงู…ุงู‹! 
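ู„ู„ุชูˆุถูŠุญ ูู‚ุทุŒ ูŠู…ูƒู† ุชู‚ุฑูŠุจ ููƒุฑุฉ ุงู„ุจุญุซ ุจุงู„ู†ุต (ู…ุซู„ `find_by_text` ุฃุนู„ุงู‡) ุจุฑุณู…ุฉ ู…ุจุณุทุฉ ุชุณุชุฎุฏู… ู…ุญู„ู„ HTML ู…ู† ุงู„ู…ูƒุชุจุฉ ุงู„ู‚ูŠุงุณูŠุฉ. ู‡ุฐุง ู„ูŠุณ ุชุทุจูŠู‚ Scrapling ุงู„ูุนู„ูŠ ูˆู„ุง ุจุฏูŠู„ุงู‹ ุนู†ู‡ุŒ ู„ูƒู†ู‡ ูŠุจูŠู† ุงู„ู…ุจุฏุฃ: ุชุชุจุน ุงู„ูˆุณู… ุงู„ุญุงู„ูŠ ุฃุซู†ุงุก ุงู„ุชุญู„ูŠู„ ูˆุงู„ุชู‚ุท ุงู„ุนู†ุงุตุฑ ุงู„ุชูŠ ูŠุญุชูˆูŠ ู†ุตู‡ุง ุนู„ู‰ ุงู„ูƒู„ู…ุฉ ุงู„ู…ุทู„ูˆุจุฉ:

```python
from html.parser import HTMLParser

class TextFinder(HTMLParser):
    """ุจุญุซ ู…ุจุณุท ุนู† ุงู„ูˆุณูˆู… ุงู„ุชูŠ ูŠุญุชูˆูŠ ู†ุตู‡ุง ุนู„ู‰ ูƒู„ู…ุฉ ู…ุนูŠู†ุฉ (ุฑุณู…ุฉ ุชูˆุถูŠุญูŠุฉ)."""

    def __init__(self, needle, tag=None):
        super().__init__()
        self.needle, self.tag = needle, tag
        self.stack = []    # ุงู„ูˆุณูˆู… ุงู„ู…ูุชูˆุญุฉ ุญุงู„ูŠุงู‹
        self.matches = []  # ุฃุฒูˆุงุฌ (ุงู„ูˆุณู…ุŒ ุงู„ู†ุต) ุงู„ู…ุทุงุจู‚ุฉ

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        current = self.stack[-1] if self.stack else None
        if self.needle in data and (self.tag is None or current == self.tag):
            self.matches.append((current, data.strip()))

finder = TextFinder('quote', tag='span')
finder.feed('<div><span>a quote here</span><p>no match</p></div>')
print(finder.matches)  # [('span', 'a quote here')]
```

ุจุทุจูŠุนุฉ ุงู„ุญุงู„ุŒ ูŠุถูŠู ู…ุญู„ู„ Scrapling ููˆู‚ ู‡ุฐุง ุงู„ู…ุจุฏุฃ ุงู„ู…ุญุฏุฏุงุช ุงู„ู…ุชุณู„ุณู„ุฉ ูˆุงู„ุชู†ู‚ู„ ูˆุฎูˆุงุฑุฒู…ูŠุงุช ุงู„ุชุดุงุจู‡ ุงู„ู…ุฐูƒูˆุฑุฉ ุฃุนู„ุงู‡.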
+ +### ุฃู…ุซู„ุฉ ุฅุฏุงุฑุฉ ุงู„ุฌู„ุณุงุช ุจุดูƒู„ Async +```python +import asyncio +from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession + +async with FetcherSession(http3=True) as session: # `FetcherSession` ูˆุงุนู ุจุงู„ุณูŠุงู‚ ูˆูŠุนู…ู„ ููŠ ูƒู„ุง ุงู„ู†ู…ุทูŠู† ุงู„ู…ุชุฒุงู…ู†/async + page1 = session.get('https://quotes.toscrape.com/') + page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135') + +# ุงุณุชุฎุฏุงู… ุฌู„ุณุฉ async +async with AsyncStealthySession(max_pages=2) as session: + tasks = [] + urls = ['https://example.com/page1', 'https://example.com/page2'] + + for url in urls: + task = session.fetch(url) + tasks.append(task) + + print(session.get_pool_stats()) # ุงุฎุชูŠุงุฑูŠ - ุญุงู„ุฉ ู…ุฌู…ูˆุนุฉ ุนู„ุงู…ุงุช ุชุจูˆูŠุจ ุงู„ู…ุชุตูุญ (ู…ุดุบูˆู„/ุญุฑ/ุฎุทุฃ) + results = await asyncio.gather(*tasks) + print(session.get_pool_stats()) +``` + +## ูˆุงุฌู‡ุฉ ุณุทุฑ ุงู„ุฃูˆุงู…ุฑ ูˆุงู„ู€ Shell ุงู„ุชูุงุนู„ูŠ + +ูŠุชุถู…ู† Scrapling ูˆุงุฌู‡ุฉ ุณุทุฑ ุฃูˆุงู…ุฑ ู‚ูˆูŠุฉ: + +[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339) + +ุชุดุบูŠู„ Shell ุงู„ู€ Web Scraping ุงู„ุชูุงุนู„ูŠ +```bash +scrapling shell +``` +ุงุณุชุฎุฑุฌ ุงู„ุตูุญุงุช ุฅู„ู‰ ู…ู„ู ู…ุจุงุดุฑุฉ ุฏูˆู† ุจุฑู…ุฌุฉ (ูŠุณุชุฎุฑุฌ ุงู„ู…ุญุชูˆู‰ ุฏุงุฎู„ ูˆุณู… `body` ุงูุชุฑุงุถูŠุงู‹). ุฅุฐุง ุงู†ุชู‡ู‰ ู…ู„ู ุงู„ุฅุฎุฑุงุฌ ุจู€ `.txt`ุŒ ูุณูŠุชู… ุงุณุชุฎุฑุงุฌ ู…ุญุชูˆู‰ ุงู„ู†ุต ู„ู„ู‡ุฏู. ุฅุฐุง ุงู†ุชู‡ู‰ ุจู€ `.md`ุŒ ูุณูŠูƒูˆู† ุชู…ุซูŠู„ Markdown ู„ู…ุญุชูˆู‰ HTMLุ› ุฅุฐุง ุงู†ุชู‡ู‰ ุจู€ `.html`ุŒ ูุณูŠูƒูˆู† ู…ุญุชูˆู‰ HTML ู†ูุณู‡. 
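ู…ู†ุทู‚ ุงุฎุชูŠุงุฑ ุงู„ุตูŠุบุฉ ุญุณุจ ุงู…ุชุฏุงุฏ ู…ู„ู ุงู„ุฅุฎุฑุงุฌ ุงู„ู…ูˆุตูˆู ุฃุนู„ุงู‡ ูŠู…ูƒู† ุชู„ุฎูŠุตู‡ ููŠ ุฑุณู…ุฉ ุชูˆุถูŠุญูŠุฉ ุตุบูŠุฑุฉ (ู„ูŠุณุช ุงู„ูƒูˆุฏ ุงู„ูุนู„ูŠ ู„ู„ุฃู…ุฑ `extract`ุŒ ูˆุทุฑูŠู‚ุฉ ุงู„ุชุนุงู…ู„ ู…ุน ุงู„ุงู…ุชุฏุงุฏุงุช ุบูŠุฑ ุงู„ู…ุฏุนูˆู…ุฉ ู‡ู†ุง ุงูุชุฑุงุถ ู…ู† ุนู†ุฏู†ุง):

```python
from pathlib import Path

def choose_format(output_file: str) -> str:
    # ู„ุงุญุธ: txt = ู†ุต ูู‚ุทุŒ md = ุชู…ุซูŠู„ MarkdownุŒ html = ู…ุญุชูˆู‰ HTML ูƒู…ุง ู‡ูˆ
    formats = {".txt": "text", ".md": "markdown", ".html": "html"}
    suffix = Path(output_file).suffix.lower()
    if suffix not in formats:
        # ุงูุชุฑุงุถ ุชูˆุถูŠุญูŠ: ุงุฑูุถ ุงู„ุงู…ุชุฏุงุฏุงุช ุบูŠุฑ ุงู„ู…ุนุฑูˆูุฉ ุจุฏู„ ุงู„ุชุฎู…ูŠู†
        raise ValueError(f"Unsupported extension: {suffix}")
    return formats[suffix]

print(choose_format("content.md"))    # markdown
print(choose_format("content.txt"))   # text
print(choose_format("captchas.html")) # html
```

ูˆุงู„ุฃูˆุงู…ุฑ ุงู„ุชุงู„ูŠุฉ ุชุจูŠู† ุงู„ุงุณุชุฎุฏุงู… ุงู„ูุนู„ูŠ ู…ู† ุณุทุฑ ุงู„ุฃูˆุงู…ุฑ.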
+```bash +scrapling extract get 'https://example.com' content.md +scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # ุฌู…ูŠุน ุงู„ุนู†ุงุตุฑ ุงู„ู…ุทุงุจู‚ุฉ ู„ู…ุญุฏุฏ CSS '#fromSkipToProducts' +scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless +scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare +``` + +> [!NOTE] +> ู‡ู†ุงูƒ ุงู„ุนุฏูŠุฏ ู…ู† ุงู„ู…ูŠุฒุงุช ุงู„ุฅุถุงููŠุฉุŒ ู„ูƒู†ู†ุง ู†ุฑูŠุฏ ุฅุจู‚ุงุก ู‡ุฐู‡ ุงู„ุตูุญุฉ ู…ูˆุฌุฒุฉุŒ ุจู…ุง ููŠ ุฐู„ูƒ ุฎุงุฏู… MCP ูˆุงู„ู€ Shell ุงู„ุชูุงุนู„ูŠ ู„ู€ Web Scraping. ุชุญู‚ู‚ ู…ู† ุงู„ูˆุซุงุฆู‚ ุงู„ูƒุงู…ู„ุฉ [ู‡ู†ุง](https://scrapling.readthedocs.io/en/latest/) + +## ู…ุนุงูŠูŠุฑ ุงู„ุฃุฏุงุก + +Scrapling ู„ูŠุณ ู‚ูˆูŠุงู‹ ูุญุณุจ โ€” ุจู„ ู‡ูˆ ุฃูŠุถุงู‹ ุณุฑูŠุน ุจุดูƒู„ ู…ุฐู‡ู„. ุชู‚ุงุฑู† ุงู„ู…ุนุงูŠูŠุฑ ุงู„ุชุงู„ูŠุฉ ู…ุญู„ู„ Scrapling ู…ุน ุฃุญุฏุซ ุฅุตุฏุงุฑุงุช ุงู„ู…ูƒุชุจุงุช ุงู„ุดุงุฆุนุฉ ุงู„ุฃุฎุฑู‰. 
+ +### ุงุฎุชุจุงุฑ ุณุฑุนุฉ ุงุณุชุฎุฑุงุฌ ุงู„ู†ุต (5000 ุนู†ุตุฑ ู…ุชุฏุงุฎู„) + +| # | ุงู„ู…ูƒุชุจุฉ | ุงู„ูˆู‚ุช (ms) | vs Scrapling | +|---|:-----------------:|:----------:|:------------:| +| 1 | Scrapling | 2.02 | 1.0x | +| 2 | Parsel/Scrapy | 2.04 | 1.01 | +| 3 | Raw Lxml | 2.54 | 1.257 | +| 4 | PyQuery | 24.17 | ~12x | +| 5 | Selectolax | 82.63 | ~41x | +| 6 | MechanicalSoup | 1549.71 | ~767.1x | +| 7 | BS4 with Lxml | 1584.31 | ~784.3x | +| 8 | BS4 with html5lib | 3391.91 | ~1679.1x | + + +### ุฃุฏุงุก ุชุดุงุจู‡ ุงู„ุนู†ุงุตุฑ ูˆุงู„ุจุญุซ ุงู„ู†ุตูŠ + +ู‚ุฏุฑุงุช ุงู„ุนุซูˆุฑ ุนู„ู‰ ุงู„ุนู†ุงุตุฑ ุงู„ุชูƒูŠููŠุฉ ู„ู€ Scrapling ุชุชููˆู‚ ุจุดูƒู„ ูƒุจูŠุฑ ุนู„ู‰ ุงู„ุจุฏุงุฆู„: + +| ุงู„ู…ูƒุชุจุฉ | ุงู„ูˆู‚ุช (ms) | vs Scrapling | +|-------------|:----------:|:------------:| +| Scrapling | 2.39 | 1.0x | +| AutoScraper | 12.45 | 5.209x | + + +> ุชู…ุซู„ ุฌู…ูŠุน ุงู„ู…ุนุงูŠูŠุฑ ู…ุชูˆุณุทุงุช ุฃูƒุซุฑ ู…ู† 100 ุชุดุบูŠู„. ุงู†ุธุฑ [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) ู„ู„ู…ู†ู‡ุฌูŠุฉ. + +## ุงู„ุชุซุจูŠุช + +ูŠุชุทู„ุจ Scrapling ุฅุตุฏุงุฑ Python 3.10 ุฃูˆ ุฃุนู„ู‰: + +```bash +pip install scrapling +``` + +ูŠุชุถู…ู† ู‡ุฐุง ุงู„ุชุซุจูŠุช ูู‚ุท ู…ุญุฑูƒ ุงู„ู…ุญู„ู„ ูˆุชุจุนูŠุงุชู‡ุŒ ุจุฏูˆู† ุฃูŠ ุฌูˆุงู„ุจ ุฃูˆ ุชุจุนูŠุงุช ุณุทุฑ ุงู„ุฃูˆุงู…ุฑ. + +### ุงู„ุชุจุนูŠุงุช ุงู„ุงุฎุชูŠุงุฑูŠุฉ + +1. ุฅุฐุง ูƒู†ุช ุณุชุณุชุฎุฏู… ุฃูŠุงู‹ ู…ู† ุงู„ู…ูŠุฒุงุช ุงู„ุฅุถุงููŠุฉ ุฃุฏู†ุงู‡ุŒ ุฃูˆ ุงู„ุฌูˆุงู„ุจุŒ ุฃูˆ ูุฆุงุชู‡ุงุŒ ูุณุชุญุชุงุฌ ุฅู„ู‰ ุชุซุจูŠุช ุชุจุนูŠุงุช ุงู„ุฌูˆุงู„ุจ ูˆุชุจุนูŠุงุช ุงู„ู…ุชุตูุญ ุงู„ุฎุงุตุฉ ุจู‡ุง ุนู„ู‰ ุงู„ู†ุญูˆ ุงู„ุชุงู„ูŠ: + ```bash + pip install "scrapling[fetchers]" + + scrapling install # normal install + scrapling install --force # force reinstall + ``` + + ูŠู‚ูˆู… ู‡ุฐุง ุจุชู†ุฒูŠู„ ุฌู…ูŠุน ุงู„ู…ุชุตูุญุงุชุŒ ุฅู„ู‰ ุฌุงู†ุจ ุชุจุนูŠุงุช ุงู„ู†ุธุงู… ูˆุชุจุนูŠุงุช ู…ุนุงู„ุฌุฉ fingerprint ุงู„ุฎุงุตุฉ ุจู‡ุง. 
+ + ุฃูˆ ูŠู…ูƒู†ูƒ ุชุซุจูŠุชู‡ุง ู…ู† ุงู„ูƒูˆุฏ ุจุฏู„ุงู‹ ู…ู† ุชุดุบูŠู„ ุฃู…ุฑ ูƒุงู„ุชุงู„ูŠ: + ```python + from scrapling.cli import install + + install([], standalone_mode=False) # normal install + install(["--force"], standalone_mode=False) # force reinstall + ``` + +2. ู…ูŠุฒุงุช ุฅุถุงููŠุฉ: + - ุชุซุจูŠุช ู…ูŠุฒุฉ ุฎุงุฏู… MCP: + ```bash + pip install "scrapling[ai]" + ``` + - ุชุซุจูŠุช ู…ูŠุฒุงุช Shell (Shell ุงู„ู€ Web Scraping ูˆุฃู…ุฑ `extract`): + ```bash + pip install "scrapling[shell]" + ``` + - ุชุซุจูŠุช ูƒู„ ุดูŠุก: + ```bash + pip install "scrapling[all]" + ``` + ุชุฐูƒุฑ ุฃู†ูƒ ุชุญุชุงุฌ ุฅู„ู‰ ุชุซุจูŠุช ุชุจุนูŠุงุช ุงู„ู…ุชุตูุญ ู…ุน `scrapling install` ุจุนุฏ ุฃูŠ ู…ู† ู‡ุฐู‡ ุงู„ุฅุถุงูุงุช (ุฅุฐุง ู„ู… ุชูƒู† ู‚ุฏ ูุนู„ุช ุฐู„ูƒ ุจุงู„ูุนู„) + +### Docker +ูŠู…ูƒู†ูƒ ุฃูŠุถุงู‹ ุชุซุจูŠุช ุตูˆุฑุฉ Docker ู…ุน ุฌู…ูŠุน ุงู„ุฅุถุงูุงุช ูˆุงู„ู…ุชุตูุญุงุช ุจุงุณุชุฎุฏุงู… ุงู„ุฃู…ุฑ ุงู„ุชุงู„ูŠ ู…ู† DockerHub: +```bash +docker pull pyd4vinci/scrapling +``` +ุฃูˆ ุชู†ุฒูŠู„ู‡ุง ู…ู† ุณุฌู„ GitHub: +```bash +docker pull ghcr.io/d4vinci/scrapling:latest +``` +ูŠุชู… ุจู†ุงุก ู‡ุฐู‡ ุงู„ุตูˆุฑุฉ ูˆุฏูุนู‡ุง ุชู„ู‚ุงุฆูŠุงู‹ ุจุงุณุชุฎุฏุงู… GitHub Actions ูˆุงู„ูุฑุน ุงู„ุฑุฆูŠุณูŠ ู„ู„ู…ุณุชูˆุฏุน. + +## ุงู„ู…ุณุงู‡ู…ุฉ + +ู†ุฑุญุจ ุจุงู„ู…ุณุงู‡ู…ุงุช! ูŠุฑุฌู‰ ู‚ุฑุงุกุฉ [ุฅุฑุดุงุฏุงุช ุงู„ู…ุณุงู‡ู…ุฉ](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) ู‚ุจู„ ุงู„ุจุฏุก. + +## ุฅุฎู„ุงุก ุงู„ู…ุณุคูˆู„ูŠุฉ + +> [!CAUTION] +> ูŠุชู… ุชูˆููŠุฑ ู‡ุฐู‡ ุงู„ู…ูƒุชุจุฉ ู„ู„ุฃุบุฑุงุถ ุงู„ุชุนู„ูŠู…ูŠุฉ ูˆุงู„ุจุญุซูŠุฉ ูู‚ุท. ุจุงุณุชุฎุฏุงู… ู‡ุฐู‡ ุงู„ู…ูƒุชุจุฉุŒ ูุฅู†ูƒ ุชูˆุงูู‚ ุนู„ู‰ ุงู„ุงู…ุชุซุงู„ ู„ู‚ูˆุงู†ูŠู† ุงุณุชุฎุฑุงุฌ ุงู„ุจูŠุงู†ุงุช ูˆุงู„ุฎุตูˆุตูŠุฉ ุงู„ู…ุญู„ูŠุฉ ูˆุงู„ุฏูˆู„ูŠุฉ. ุงู„ู…ุคู„ููˆู† ูˆุงู„ู…ุณุงู‡ู…ูˆู† ุบูŠุฑ ู…ุณุคูˆู„ูŠู† ุนู† ุฃูŠ ุฅุณุงุกุฉ ุงุณุชุฎุฏุงู… ู„ู‡ุฐุง ุงู„ุจุฑู†ุงู…ุฌ. ุงุญุชุฑู… ุฏุงุฆู…ุงู‹ ุดุฑูˆุท ุฎุฏู…ุฉ ุงู„ู…ูˆุงู‚ุน ูˆู…ู„ูุงุช robots.txt. 
+ +## ุงู„ุชุฑุฎูŠุต + +ู‡ุฐุง ุงู„ุนู…ู„ ู…ุฑุฎุต ุจู…ูˆุฌุจ ุชุฑุฎูŠุต BSD-3-Clause. + +## ุงู„ุดูƒุฑ ูˆุงู„ุชู‚ุฏูŠุฑ + +ูŠุชุถู…ู† ู‡ุฐุง ุงู„ู…ุดุฑูˆุน ูƒูˆุฏุงู‹ ู…ุนุฏู„ุงู‹ ู…ู†: +- Parsel (ุชุฑุฎูŠุต BSD) โ€” ูŠูุณุชุฎุฏู… ู„ู„ูˆุญุฏุฉ ุงู„ูุฑุนูŠุฉ [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) + +--- +
ู…ุตู…ู… ูˆู…ุตู†ูˆุน ุจู€ โค๏ธ ุจูˆุงุณุทุฉ ูƒุฑูŠู… ุดุนูŠุฑ.

diff --git a/docs/README_CN.md b/docs/README_CN.md new file mode 100644 index 0000000000000000000000000000000000000000..b18bff3772d1f8809b6f0fb3b70cc96726f4ae73 --- /dev/null +++ b/docs/README_CN.md @@ -0,0 +1,426 @@ + + +

+ + + + Scrapling Poster + + +
+ Effortless Web Scraping for the Modern Web +

+ +

+ + Tests + + PyPI version + + PyPI Downloads +
+ + Discord + + + X (formerly Twitter) Follow + +
+ + Supported Python versions +

+ +

+ ้€‰ๆ‹ฉๆ–นๆณ• + · + ้€‰ๆ‹ฉFetcher + · + ็ˆฌ่™ซ + · + ไปฃ็†่ฝฎๆข + · + CLI + · + MCPๆจกๅผ +

+ +Scraplingๆ˜ฏไธ€ไธช่‡ช้€‚ๅบ”Web Scrapingๆก†ๆžถ๏ผŒ่ƒฝๅค„็†ไปŽๅ•ไธช่ฏทๆฑ‚ๅˆฐๅคง่ง„ๆจก็ˆฌๅ–็š„ไธ€ๅˆ‡้œ€ๆฑ‚ใ€‚ + +ๅฎƒ็š„่งฃๆžๅ™จ่ƒฝๅคŸไปŽ็ฝ‘็ซ™ๅ˜ๅŒ–ไธญๅญฆไน ๏ผŒๅนถๅœจ้กต้ขๆ›ดๆ–ฐๆ—ถ่‡ชๅŠจ้‡ๆ–ฐๅฎšไฝๆ‚จ็š„ๅ…ƒ็ด ใ€‚ๅฎƒ็š„Fetcher่ƒฝๅคŸๅผ€็ฎฑๅณ็”จๅœฐ็ป•่ฟ‡Cloudflare Turnstile็ญ‰ๅๆœบๅ™จไบบ็ณป็ปŸใ€‚ๅฎƒ็š„Spiderๆก†ๆžถ่ฎฉๆ‚จๅฏไปฅๆ‰ฉๅฑ•ๅˆฐๅนถๅ‘ใ€ๅคšSession็ˆฌๅ–๏ผŒๆ”ฏๆŒๆš‚ๅœ/ๆขๅคๅ’Œ่‡ชๅŠจProxy่ฝฎๆขโ€”โ€”ๅช้œ€ๅ‡ ่กŒPythonไปฃ็ ใ€‚ไธ€ไธชๅบ“๏ผŒ้›ถๅฆฅๅใ€‚ + +ๆž้€Ÿ็ˆฌๅ–๏ผŒๅฎžๆ—ถ็ปŸ่ฎกๅ’ŒStreamingใ€‚็”ฑWeb ScraperไธบWeb Scraperๅ’Œๆ™ฎ้€š็”จๆˆท่€Œๆž„ๅปบ๏ผŒๆฏไธชไบบ้ƒฝ่ƒฝๆ‰พๅˆฐ้€‚ๅˆ่‡ชๅทฑ็š„ๅŠŸ่ƒฝใ€‚ + +```python +from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher +StealthyFetcher.adaptive = True +p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # ้š็ง˜ๅœฐ่Žทๅ–็ฝ‘็ซ™๏ผ +products = p.css('.product', auto_save=True) # ๆŠ“ๅ–ๅœจ็ฝ‘็ซ™่ฎพ่ฎกๅ˜ๆ›ดๅŽไป่ƒฝๅญ˜ๆดป็š„ๆ•ฐๆฎ๏ผ +products = p.css('.product', adaptive=True) # ไน‹ๅŽ๏ผŒๅฆ‚ๆžœ็ฝ‘็ซ™็ป“ๆž„ๆ”นๅ˜๏ผŒไผ ้€’ `adaptive=True` ๆฅๆ‰พๅˆฐๅฎƒไปฌ๏ผ +``` +ๆˆ–ๆ‰ฉๅฑ•ไธบๅฎŒๆ•ด็ˆฌๅ– +```python +from scrapling.spiders import Spider, Response + +class MySpider(Spider): + name = "demo" + start_urls = ["https://example.com/"] + + async def parse(self, response: Response): + for item in response.css('.product'): + yield {"title": item.css('h2::text').get()} + +MySpider().start() +``` + + +# ้“‚้‡‘่ตžๅŠฉๅ•† + +ๆƒณๆˆไธบ็ฌฌไธ€ไธชๅ‡บ็Žฐๅœจ่ฟ™้‡Œ็š„ๅ…ฌๅธๅ—๏ผŸ็‚นๅ‡ป[่ฟ™้‡Œ](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646) +# ่ตžๅŠฉๅ•† + + + + + + + + + + + + + + + + + + + + +ๆƒณๅœจ่ฟ™้‡Œๅฑ•็คบๆ‚จ็š„ๅนฟๅ‘Šๅ—๏ผŸ็‚นๅ‡ป[่ฟ™้‡Œ](https://github.com/sponsors/D4Vinci)ๅนถ้€‰ๆ‹ฉ้€‚ๅˆๆ‚จ็š„็บงๅˆซ๏ผ + +--- + +## ไธป่ฆ็‰นๆ€ง + +### Spider โ€” ๅฎŒๆ•ด็š„็ˆฌๅ–ๆก†ๆžถ +- ๐Ÿ•ท๏ธ **็ฑปScrapy็š„Spider API**๏ผšไฝฟ็”จ`start_urls`ใ€async `parse` callbackๅ’Œ`Request`/`Response`ๅฏน่ฑกๅฎšไน‰Spiderใ€‚ +- โšก 
**ๅนถๅ‘็ˆฌๅ–**๏ผšๅฏ้…็ฝฎ็š„ๅนถๅ‘้™ๅˆถใ€ๆŒ‰ๅŸŸๅ่Š‚ๆตๅ’Œไธ‹่ฝฝๅปถ่ฟŸใ€‚ +- ๐Ÿ”„ **ๅคšSessionๆ”ฏๆŒ**๏ผš็ปŸไธ€ๆŽฅๅฃ๏ผŒๆ”ฏๆŒHTTP่ฏทๆฑ‚ๅ’Œ้š็ง˜ๆ— ๅคดๆต่งˆๅ™จๅœจๅŒไธ€ไธชSpiderไธญไฝฟ็”จโ€”โ€”้€š่ฟ‡IDๅฐ†่ฏทๆฑ‚่ทฏ็”ฑๅˆฐไธๅŒ็š„Sessionใ€‚ +- ๐Ÿ’พ **ๆš‚ๅœไธŽๆขๅค**๏ผšๅŸบไบŽCheckpoint็š„็ˆฌๅ–ๆŒไน…ๅŒ–ใ€‚ๆŒ‰Ctrl+Cไผ˜้›…ๅ…ณ้—ญ๏ผ›้‡ๅฏๅŽไปŽไธŠๆฌกๅœๆญข็š„ๅœฐๆ–น็ปง็ปญใ€‚ +- ๐Ÿ“ก **Streamingๆจกๅผ**๏ผš้€š่ฟ‡`async for item in spider.stream()`ไปฅๅฎžๆ—ถ็ปŸ่ฎกStreamingๆŠ“ๅ–็š„ๆ•ฐๆฎโ€”โ€”้žๅธธ้€‚ๅˆUIใ€็ฎก้“ๅ’Œ้•ฟๆ—ถ้—ด่ฟ่กŒ็š„็ˆฌๅ–ใ€‚ +- ๐Ÿ›ก๏ธ **่ขซ้˜ปๆญข่ฏทๆฑ‚ๆฃ€ๆต‹**๏ผš่‡ชๅŠจๆฃ€ๆต‹ๅนถ้‡่ฏ•่ขซ้˜ปๆญข็š„่ฏทๆฑ‚๏ผŒๆ”ฏๆŒ่‡ชๅฎšไน‰้€ป่พ‘ใ€‚ +- ๐Ÿ“ฆ **ๅ†…็ฝฎๅฏผๅ‡บ**๏ผš้€š่ฟ‡้’ฉๅญๅ’Œๆ‚จ่‡ชๅทฑ็š„็ฎก้“ๅฏผๅ‡บ็ป“ๆžœ๏ผŒๆˆ–ไฝฟ็”จๅ†…็ฝฎ็š„JSON/JSONL๏ผŒๅˆ†ๅˆซ้€š่ฟ‡`result.items.to_json()`/`result.items.to_jsonl()`ใ€‚ + +### ๆ”ฏๆŒSession็š„้ซ˜็บง็ฝ‘็ซ™่Žทๅ– +- **HTTP่ฏทๆฑ‚**๏ผšไฝฟ็”จ`Fetcher`็ฑป่ฟ›่กŒๅฟซ้€Ÿๅ’Œ้š็ง˜็š„HTTP่ฏทๆฑ‚ใ€‚ๅฏไปฅๆจกๆ‹Ÿๆต่งˆๅ™จ็š„TLS fingerprintใ€ๆ ‡ๅคดๅนถไฝฟ็”จHTTP/3ใ€‚ +- **ๅŠจๆ€ๅŠ ่ฝฝ**๏ผš้€š่ฟ‡`DynamicFetcher`็ฑปไฝฟ็”จๅฎŒๆ•ด็š„ๆต่งˆๅ™จ่‡ชๅŠจๅŒ–่Žทๅ–ๅŠจๆ€็ฝ‘็ซ™๏ผŒๆ”ฏๆŒPlaywright็š„Chromiumๅ’ŒGoogle Chromeใ€‚ +- **ๅๆœบๅ™จไบบ็ป•่ฟ‡**๏ผšไฝฟ็”จ`StealthyFetcher`็š„้ซ˜็บง้š็ง˜ๅŠŸ่ƒฝๅ’Œfingerprintไผช่ฃ…ใ€‚ๅฏไปฅ่ฝปๆพ่‡ชๅŠจ็ป•่ฟ‡ๆ‰€ๆœ‰็ฑปๅž‹็š„Cloudflare Turnstile/Interstitialใ€‚ +- **Session็ฎก็†**๏ผšไฝฟ็”จ`FetcherSession`ใ€`StealthySession`ๅ’Œ`DynamicSession`็ฑปๅฎž็ŽฐๆŒไน…ๅŒ–Sessionๆ”ฏๆŒ๏ผŒ็”จไบŽ่ทจ่ฏทๆฑ‚็š„cookieๅ’Œ็Šถๆ€็ฎก็†ใ€‚ +- **Proxy่ฝฎๆข**๏ผšๅ†…็ฝฎ`ProxyRotator`๏ผŒๆ”ฏๆŒ่ฝฎ่ฏขๆˆ–่‡ชๅฎšไน‰็ญ–็•ฅ๏ผŒ้€‚็”จไบŽๆ‰€ๆœ‰Session็ฑปๅž‹๏ผŒๅนถๆ”ฏๆŒๆŒ‰่ฏทๆฑ‚่ฆ†็›–Proxyใ€‚ +- **ๅŸŸๅๅฑ่”ฝ**๏ผšๅœจๅŸบไบŽๆต่งˆๅ™จ็š„Fetcherไธญๅฑ่”ฝๅฏน็‰นๅฎšๅŸŸๅ๏ผˆๅŠๅ…ถๅญๅŸŸๅ๏ผ‰็š„่ฏทๆฑ‚ใ€‚ +- **Asyncๆ”ฏๆŒ**๏ผšๆ‰€ๆœ‰Fetcherๅ’Œไธ“็”จasync Session็ฑป็š„ๅฎŒๆ•ดasyncๆ”ฏๆŒใ€‚ + +### ่‡ช้€‚ๅบ”ๆŠ“ๅ–ๅ’ŒAI้›†ๆˆ +- ๐Ÿ”„ **ๆ™บ่ƒฝๅ…ƒ็ด ่ทŸ่ธช**๏ผšไฝฟ็”จๆ™บ่ƒฝ็›ธไผผๆ€ง็ฎ—ๆณ•ๅœจ็ฝ‘็ซ™ๆ›ดๆ”นๅŽ้‡ๆ–ฐๅฎšไฝๅ…ƒ็ด ใ€‚ +- ๐ŸŽฏ 
**ๆ™บ่ƒฝ็ตๆดป้€‰ๆ‹ฉ**๏ผšCSS้€‰ๆ‹ฉๅ™จใ€XPath้€‰ๆ‹ฉๅ™จใ€ๅŸบไบŽ่ฟ‡ๆปคๅ™จ็š„ๆœ็ดขใ€ๆ–‡ๆœฌๆœ็ดขใ€ๆญฃๅˆ™่กจ่พพๅผๆœ็ดข็ญ‰ใ€‚ +- ๐Ÿ” **ๆŸฅๆ‰พ็›ธไผผๅ…ƒ็ด **๏ผš่‡ชๅŠจๅฎšไฝไธŽๅทฒๆ‰พๅˆฐๅ…ƒ็ด ็›ธไผผ็š„ๅ…ƒ็ด ใ€‚ +- ๐Ÿค– **ไธŽAIไธ€่ตทไฝฟ็”จ็š„MCPๆœๅŠกๅ™จ**๏ผšๅ†…็ฝฎMCPๆœๅŠกๅ™จ็”จไบŽAI่พ…ๅŠฉWeb Scrapingๅ’Œๆ•ฐๆฎๆๅ–ใ€‚MCPๆœๅŠกๅ™จๅ…ทๆœ‰ๅผบๅคง็š„่‡ชๅฎšไน‰ๅŠŸ่ƒฝ๏ผŒๅˆฉ็”จScraplingๅœจๅฐ†ๅ†…ๅฎนไผ ้€’็ป™AI๏ผˆClaude/Cursor็ญ‰๏ผ‰ไน‹ๅ‰ๆๅ–็›ฎๆ ‡ๅ†…ๅฎน๏ผŒไปŽ่€ŒๅŠ ๅฟซๆ“ไฝœๅนถ้€š่ฟ‡ๆœ€ๅฐๅŒ–tokenไฝฟ็”จๆฅ้™ไฝŽๆˆๆœฌใ€‚๏ผˆ[ๆผ”็คบ่ง†้ข‘](https://www.youtube.com/watch?v=qyFk3ZNwOxE)๏ผ‰ + +### ้ซ˜ๆ€ง่ƒฝๅ’Œ็ป่ฟ‡ๅฎžๆˆ˜ๆต‹่ฏ•็š„ๆžถๆž„ +- ๐Ÿš€ **้—ช็”ต่ˆฌๅฟซ้€Ÿ**๏ผšไผ˜ๅŒ–ๆ€ง่ƒฝ่ถ…่ถŠๅคงๅคšๆ•ฐPythonๆŠ“ๅ–ๅบ“ใ€‚ +- ๐Ÿ”‹ **ๅ†…ๅญ˜้ซ˜ๆ•ˆ**๏ผšไผ˜ๅŒ–็š„ๆ•ฐๆฎ็ป“ๆž„ๅ’Œๅปถ่ฟŸๅŠ ่ฝฝ๏ผŒๆœ€ๅฐๅ†…ๅญ˜ๅ ็”จใ€‚ +- โšก **ๅฟซ้€ŸJSONๅบๅˆ—ๅŒ–**๏ผšๆฏ”ๆ ‡ๅ‡†ๅบ“ๅฟซ10ๅ€ใ€‚ +- ๐Ÿ—๏ธ **็ป่ฟ‡ๅฎžๆˆ˜ๆต‹่ฏ•**๏ผšScraplingไธไป…ๆ‹ฅๆœ‰92%็š„ๆต‹่ฏ•่ฆ†็›–็އๅ’ŒๅฎŒๆ•ด็š„็ฑปๅž‹ๆ็คบ่ฆ†็›–็އ๏ผŒ่€Œไธ”ๅœจ่ฟ‡ๅŽปไธ€ๅนดไธญๆฏๅคฉ่ขซๆ•ฐ็™พๅWeb Scraperไฝฟ็”จใ€‚ + +### ๅฏนๅผ€ๅ‘่€…/Web Scraperๅ‹ๅฅฝ็š„ไฝ“้ชŒ +- ๐ŸŽฏ **ไบคไบ’ๅผWeb Scraping Shell**๏ผšๅฏ้€‰็š„ๅ†…็ฝฎIPython Shell๏ผŒๅ…ทๆœ‰Scrapling้›†ๆˆใ€ๅฟซๆทๆ–นๅผๅ’Œๆ–ฐๅทฅๅ…ท๏ผŒๅฏๅŠ ๅฟซWeb Scraping่„šๆœฌๅผ€ๅ‘๏ผŒไพ‹ๅฆ‚ๅฐ†curl่ฏทๆฑ‚่ฝฌๆขไธบScrapling่ฏทๆฑ‚ๅนถๅœจๆต่งˆๅ™จไธญๆŸฅ็œ‹่ฏทๆฑ‚็ป“ๆžœใ€‚ +- ๐Ÿš€ **็›ดๆŽฅไปŽ็ปˆ็ซฏไฝฟ็”จ**๏ผšๅฏ้€‰ๅœฐ๏ผŒๆ‚จๅฏไปฅไฝฟ็”จScraplingๆŠ“ๅ–URL่€Œๆ— ้œ€็ผ–ๅ†™ไปปไฝ•ไปฃ็ ๏ผ +- ๐Ÿ› ๏ธ **ไธฐๅฏŒ็š„ๅฏผ่ˆชAPI**๏ผšไฝฟ็”จ็ˆถ็บงใ€ๅ…„ๅผŸ็บงๅ’Œๅญ็บงๅฏผ่ˆชๆ–นๆณ•่ฟ›่กŒ้ซ˜็บงDOM้ๅކใ€‚ +- ๐Ÿงฌ **ๅขžๅผบ็š„ๆ–‡ๆœฌๅค„็†**๏ผšๅ†…็ฝฎๆญฃๅˆ™่กจ่พพๅผใ€ๆธ…็†ๆ–นๆณ•ๅ’Œไผ˜ๅŒ–็š„ๅญ—็ฌฆไธฒๆ“ไฝœใ€‚ +- ๐Ÿ“ **่‡ชๅŠจ้€‰ๆ‹ฉๅ™จ็”Ÿๆˆ**๏ผšไธบไปปไฝ•ๅ…ƒ็ด ็”Ÿๆˆๅผบๅคง็š„CSS/XPath้€‰ๆ‹ฉๅ™จใ€‚ +- ๐Ÿ”Œ **็†Ÿๆ‚‰็š„API**๏ผš็ฑปไผผไบŽScrapy/BeautifulSoup๏ผŒไฝฟ็”จไธŽScrapy/Parsel็›ธๅŒ็š„ไผชๅ…ƒ็ด ใ€‚ +- ๐Ÿ“˜ **ๅฎŒๆ•ด็š„็ฑปๅž‹่ฆ†็›–**๏ผšๅฎŒๆ•ด็š„็ฑปๅž‹ๆ็คบ๏ผŒๅ‡บ่‰ฒ็š„IDEๆ”ฏๆŒๅ’Œไปฃ็ ่กฅๅ…จใ€‚ๆ•ดไธชไปฃ็ 
ๅบ“ๅœจๆฏๆฌกๆ›ดๆ”นๆ—ถ้ƒฝไผš่‡ชๅŠจไฝฟ็”จ**PyRight**ๅ’Œ**MyPy**ๆ‰ซๆใ€‚ +- ๐Ÿ”‹ **็Žฐๆˆ็š„Docker้•œๅƒ**๏ผšๆฏๆฌกๅ‘ๅธƒๆ—ถ๏ผŒๅŒ…ๅซๆ‰€ๆœ‰ๆต่งˆๅ™จ็š„Docker้•œๅƒไผš่‡ชๅŠจๆž„ๅปบๅ’ŒๆŽจ้€ใ€‚ + +## ๅ…ฅ้—จ + +่ฎฉๆˆ‘ไปฌๅฟซ้€Ÿๅฑ•็คบScrapling็š„ๅŠŸ่ƒฝ๏ผŒๆ— ้œ€ๆทฑๅ…ฅไบ†่งฃใ€‚ + +### ๅŸบๆœฌ็”จๆณ• +ๆ”ฏๆŒSession็š„HTTP่ฏทๆฑ‚ +```python +from scrapling.fetchers import Fetcher, FetcherSession + +with FetcherSession(impersonate='chrome') as session: # ไฝฟ็”จChrome็š„ๆœ€ๆ–ฐ็‰ˆๆœฌTLS fingerprint + page = session.get('https://quotes.toscrape.com/', stealthy_headers=True) + quotes = page.css('.quote .text::text').getall() + +# ๆˆ–ไฝฟ็”จไธ€ๆฌกๆ€ง่ฏทๆฑ‚ +page = Fetcher.get('https://quotes.toscrape.com/') +quotes = page.css('.quote .text::text').getall() +``` +้ซ˜็บง้š็ง˜ๆจกๅผ +```python +from scrapling.fetchers import StealthyFetcher, StealthySession + +with StealthySession(headless=True, solve_cloudflare=True) as session: # ไฟๆŒๆต่งˆๅ™จๆ‰“ๅผ€็›ดๅˆฐๅฎŒๆˆ + page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False) + data = page.css('#padded_content a').getall() + +# ๆˆ–ไฝฟ็”จไธ€ๆฌกๆ€ง่ฏทๆฑ‚ๆ ทๅผ๏ผŒไธบๆญค่ฏทๆฑ‚ๆ‰“ๅผ€ๆต่งˆๅ™จ๏ผŒๅฎŒๆˆๅŽๅ…ณ้—ญ +page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare') +data = page.css('#padded_content a').getall() +``` +ๅฎŒๆ•ด็š„ๆต่งˆๅ™จ่‡ชๅŠจๅŒ– +```python +from scrapling.fetchers import DynamicFetcher, DynamicSession + +with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # ไฟๆŒๆต่งˆๅ™จๆ‰“ๅผ€็›ดๅˆฐๅฎŒๆˆ + page = session.fetch('https://quotes.toscrape.com/', load_dom=False) + data = page.xpath('//span[@class="text"]/text()').getall() # ๅฆ‚ๆžœๆ‚จๅๅฅฝXPath้€‰ๆ‹ฉๅ™จ + +# ๆˆ–ไฝฟ็”จไธ€ๆฌกๆ€ง่ฏทๆฑ‚ๆ ทๅผ๏ผŒไธบๆญค่ฏทๆฑ‚ๆ‰“ๅผ€ๆต่งˆๅ™จ๏ผŒๅฎŒๆˆๅŽๅ…ณ้—ญ +page = DynamicFetcher.fetch('https://quotes.toscrape.com/') +data = page.css('.quote .text::text').getall() +``` + +### Spider +ๆž„ๅปบๅ…ทๆœ‰ๅนถๅ‘่ฏทๆฑ‚ใ€ๅคš็งSession็ฑปๅž‹ๅ’Œๆš‚ๅœ/ๆขๅคๅŠŸ่ƒฝ็š„ๅฎŒๆ•ด็ˆฌ่™ซ๏ผš 
+```python +from scrapling.spiders import Spider, Request, Response + +class QuotesSpider(Spider): + name = "quotes" + start_urls = ["https://quotes.toscrape.com/"] + concurrent_requests = 10 + + async def parse(self, response: Response): + for quote in response.css('.quote'): + yield { + "text": quote.css('.text::text').get(), + "author": quote.css('.author::text').get(), + } + + next_page = response.css('.next a') + if next_page: + yield response.follow(next_page[0].attrib['href']) + +result = QuotesSpider().start() +print(f"ๆŠ“ๅ–ไบ† {len(result.items)} ๆกๅผ•็”จ") +result.items.to_json("quotes.json") +``` +ๅœจๅ•ไธชSpiderไธญไฝฟ็”จๅคš็งSession็ฑปๅž‹๏ผš +```python +from scrapling.spiders import Spider, Request, Response +from scrapling.fetchers import FetcherSession, AsyncStealthySession + +class MultiSessionSpider(Spider): + name = "multi" + start_urls = ["https://example.com/"] + + def configure_sessions(self, manager): + manager.add("fast", FetcherSession(impersonate="chrome")) + manager.add("stealth", AsyncStealthySession(headless=True), lazy=True) + + async def parse(self, response: Response): + for link in response.css('a::attr(href)').getall(): + # ๅฐ†ๅ—ไฟๆŠค็š„้กต้ข่ทฏ็”ฑๅˆฐ้š็ง˜Session + if "protected" in link: + yield Request(link, sid="stealth") + else: + yield Request(link, sid="fast", callback=self.parse) # ๆ˜พๅผcallback +``` +้€š่ฟ‡ๅฆ‚ไธ‹ๆ–นๅผ่ฟ่กŒSpiderๆฅๆš‚ๅœๅ’Œๆขๅค้•ฟๆ—ถ้—ด็ˆฌๅ–๏ผŒไฝฟ็”จCheckpoint๏ผš +```python +QuotesSpider(crawldir="./crawl_data").start() +``` +ๆŒ‰Ctrl+Cไผ˜้›…ๆš‚ๅœโ€”โ€”่ฟ›ๅบฆไผš่‡ชๅŠจไฟๅญ˜ใ€‚ไน‹ๅŽ๏ผŒๅฝ“ๆ‚จๅ†ๆฌกๅฏๅŠจSpiderๆ—ถ๏ผŒไผ ้€’็›ธๅŒ็š„`crawldir`๏ผŒๅฎƒๅฐ†ไปŽไธŠๆฌกๅœๆญข็š„ๅœฐๆ–น็ปง็ปญใ€‚ + +### ้ซ˜็บง่งฃๆžไธŽๅฏผ่ˆช +```python +from scrapling.fetchers import Fetcher + +# ไธฐๅฏŒ็š„ๅ…ƒ็ด ้€‰ๆ‹ฉๅ’Œๅฏผ่ˆช +page = Fetcher.get('https://quotes.toscrape.com/') + +# ไฝฟ็”จๅคš็ง้€‰ๆ‹ฉๆ–นๆณ•่Žทๅ–ๅผ•็”จ +quotes = page.css('.quote') # CSS้€‰ๆ‹ฉๅ™จ +quotes = page.xpath('//div[@class="quote"]') # XPath +quotes = 
page.find_all('div', {'class': 'quote'}) # BeautifulSoup้ฃŽๆ ผ +# ็ญ‰ๅŒไบŽ +quotes = page.find_all('div', class_='quote') +quotes = page.find_all(['div'], class_='quote') +quotes = page.find_all(class_='quote') # ็ญ‰็ญ‰... +# ๆŒ‰ๆ–‡ๆœฌๅ†…ๅฎนๆŸฅๆ‰พๅ…ƒ็ด  +quotes = page.find_by_text('quote', tag='div') + +# ้ซ˜็บงๅฏผ่ˆช +quote_text = page.css('.quote')[0].css('.text::text').get() +quote_text = page.css('.quote').css('.text::text').getall() # ้“พๅผ้€‰ๆ‹ฉๅ™จ +first_quote = page.css('.quote')[0] +author = first_quote.next_sibling.css('.author::text') +parent_container = first_quote.parent + +# ๅ…ƒ็ด ๅ…ณ็ณปๅ’Œ็›ธไผผๆ€ง +similar_elements = first_quote.find_similar() +below_elements = first_quote.below_elements() +``` +ๅฆ‚ๆžœๆ‚จไธๆƒณ่Žทๅ–็ฝ‘็ซ™๏ผŒๅฏไปฅ็›ดๆŽฅไฝฟ็”จ่งฃๆžๅ™จ๏ผŒๅฆ‚ไธ‹ๆ‰€็คบ๏ผš +```python +from scrapling.parser import Selector + +page = Selector("...") +``` +็”จๆณ•ๅฎŒๅ…จ็›ธๅŒ๏ผ + +### Async Session็ฎก็†็คบไพ‹ +```python +import asyncio +from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession + +async with FetcherSession(http3=True) as session: # `FetcherSession`ๆ˜ฏไธŠไธ‹ๆ–‡ๆ„Ÿ็Ÿฅ็š„๏ผŒๅฏไปฅๅœจsync/asyncๆจกๅผไธ‹ๅทฅไฝœ + page1 = session.get('https://quotes.toscrape.com/') + page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135') + +# Async Session็”จๆณ• +async with AsyncStealthySession(max_pages=2) as session: + tasks = [] + urls = ['https://example.com/page1', 'https://example.com/page2'] + + for url in urls: + task = session.fetch(url) + tasks.append(task) + + print(session.get_pool_stats()) # ๅฏ้€‰ - ๆต่งˆๅ™จๆ ‡็ญพๆฑ ็š„็Šถๆ€๏ผˆๅฟ™/็ฉบ้—ฒ/้”™่ฏฏ๏ผ‰ + results = await asyncio.gather(*tasks) + print(session.get_pool_stats()) +``` + +## CLIๅ’Œไบคไบ’ๅผShell + +ScraplingๅŒ…ๅซๅผบๅคง็š„ๅ‘ฝไปค่กŒ็•Œ้ข๏ผš + +[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339) + +ๅฏๅŠจไบคไบ’ๅผWeb Scraping Shell +```bash +scrapling shell +``` +็›ดๆŽฅๅฐ†้กต้ขๆๅ–ๅˆฐๆ–‡ไปถ่€Œๆ— 
้œ€็ผ–็จ‹๏ผˆ้ป˜่ฎคๆๅ–`body`ๆ ‡็ญพๅ†…็š„ๅ†…ๅฎน๏ผ‰ใ€‚ๅฆ‚ๆžœ่พ“ๅ‡บๆ–‡ไปถไปฅ`.txt`็ป“ๅฐพ๏ผŒๅˆ™ๅฐ†ๆๅ–็›ฎๆ ‡็š„ๆ–‡ๆœฌๅ†…ๅฎนใ€‚ๅฆ‚ๆžœไปฅ`.md`็ป“ๅฐพ๏ผŒๅฎƒๅฐ†ๆ˜ฏHTMLๅ†…ๅฎน็š„Markdown่กจ็คบ๏ผ›ๅฆ‚ๆžœไปฅ`.html`็ป“ๅฐพ๏ผŒๅฎƒๅฐ†ๆ˜ฏHTMLๅ†…ๅฎนๆœฌ่บซใ€‚ +```bash +scrapling extract get 'https://example.com' content.md +scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # ๆ‰€ๆœ‰ๅŒน้…CSS้€‰ๆ‹ฉๅ™จ'#fromSkipToProducts'็š„ๅ…ƒ็ด  +scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless +scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare +``` + +> [!NOTE] +> ่ฟ˜ๆœ‰่ฎธๅคšๅ…ถไป–ๅŠŸ่ƒฝ๏ผŒไฝ†ๆˆ‘ไปฌๅธŒๆœ›ไฟๆŒๆญค้กต้ข็ฎ€ๆด๏ผŒๅŒ…ๆ‹ฌMCPๆœๅŠกๅ™จๅ’Œไบคไบ’ๅผWeb Scraping Shellใ€‚ๆŸฅ็œ‹ๅฎŒๆ•ดๆ–‡ๆกฃ[่ฟ™้‡Œ](https://scrapling.readthedocs.io/en/latest/) + +## ๆ€ง่ƒฝๅŸบๅ‡† + +Scraplingไธไป…ๅŠŸ่ƒฝๅผบๅคงโ€”โ€”ๅฎƒ่ฟ˜้€Ÿๅบฆๆžๅฟซใ€‚ไปฅไธ‹ๅŸบๅ‡†ๆต‹่ฏ•ๅฐ†Scrapling็š„่งฃๆžๅ™จไธŽๅ…ถไป–ๆต่กŒๅบ“็š„ๆœ€ๆ–ฐ็‰ˆๆœฌ่ฟ›่กŒไบ†ๆฏ”่พƒใ€‚ + +### ๆ–‡ๆœฌๆๅ–้€Ÿๅบฆๆต‹่ฏ•๏ผˆ5000ไธชๅตŒๅฅ—ๅ…ƒ็ด ๏ผ‰ + +| # | ๅบ“ | ๆ—ถ้—ด(ms) | vs Scrapling | +|---|:-----------------:|:---------:|:------------:| +| 1 | Scrapling | 2.02 | 1.0x | +| 2 | Parsel/Scrapy | 2.04 | 1.01 | +| 3 | Raw Lxml | 2.54 | 1.257 | +| 4 | PyQuery | 24.17 | ~12x | +| 5 | Selectolax | 82.63 | ~41x | +| 6 | MechanicalSoup | 1549.71 | ~767.1x | +| 7 | BS4 with Lxml | 1584.31 | ~784.3x | +| 8 | BS4 with html5lib | 3391.91 | ~1679.1x | + + +### ๅ…ƒ็ด ็›ธไผผๆ€งๅ’Œๆ–‡ๆœฌๆœ็ดขๆ€ง่ƒฝ + +Scrapling็š„่‡ช้€‚ๅบ”ๅ…ƒ็ด ๆŸฅๆ‰พๅŠŸ่ƒฝๆ˜Žๆ˜พไผ˜ไบŽๆ›ฟไปฃๆ–นๆกˆ๏ผš + +| ๅบ“ | ๆ—ถ้—ด(ms) | vs Scrapling | +|-------------|:---------:|:------------:| +| Scrapling | 2.39 | 1.0x | +| AutoScraper | 12.45 | 5.209x | + + +> ๆ‰€ๆœ‰ๅŸบๅ‡†ๆต‹่ฏ•ไปฃ่กจ100+ๆฌก่ฟ่กŒ็š„ๅนณๅ‡ๅ€ผใ€‚่ฏทๅ‚้˜…[benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py)ไบ†่งฃๆ–นๆณ•ใ€‚ + +## 
ๅฎ‰่ฃ… + +Scrapling้œ€่ฆPython 3.10ๆˆ–ๆ›ด้ซ˜็‰ˆๆœฌ๏ผš + +```bash +pip install scrapling +``` + +ๆญคๅฎ‰่ฃ…ไป…ๅŒ…ๆ‹ฌ่งฃๆžๅ™จๅผ•ๆ“ŽๅŠๅ…ถไพ่ต–้กน๏ผŒๆฒกๆœ‰ไปปไฝ•Fetcherๆˆ–ๅ‘ฝไปค่กŒไพ่ต–้กนใ€‚ + +### ๅฏ้€‰ไพ่ต–้กน + +1. ๅฆ‚ๆžœๆ‚จ่ฆไฝฟ็”จไปฅไธ‹ไปปไฝ•้ขๅค–ๅŠŸ่ƒฝใ€Fetcherๆˆ–ๅฎƒไปฌ็š„็ฑป๏ผŒๆ‚จๅฐ†้œ€่ฆๅฎ‰่ฃ…Fetcher็š„ไพ่ต–้กนๅ’Œๅฎƒไปฌ็š„ๆต่งˆๅ™จไพ่ต–้กน๏ผŒๅฆ‚ไธ‹ๆ‰€็คบ๏ผš + ```bash + pip install "scrapling[fetchers]" + + scrapling install # normal install + scrapling install --force # force reinstall + ``` + + ่ฟ™ไผšไธ‹่ฝฝๆ‰€ๆœ‰ๆต่งˆๅ™จ๏ผŒไปฅๅŠๅฎƒไปฌ็š„็ณป็ปŸไพ่ต–้กนๅ’Œfingerprintๆ“ไฝœไพ่ต–้กนใ€‚ + + ๆˆ–่€…ไฝ ๅฏไปฅไปŽไปฃ็ ไธญๅฎ‰่ฃ…๏ผŒ่€Œไธๆ˜ฏ่ฟ่กŒๅ‘ฝไปค๏ผš + ```python + from scrapling.cli import install + + install([], standalone_mode=False) # normal install + install(["--force"], standalone_mode=False) # force reinstall + ``` + +2. ้ขๅค–ๅŠŸ่ƒฝ๏ผš + - ๅฎ‰่ฃ…MCPๆœๅŠกๅ™จๅŠŸ่ƒฝ๏ผš + ```bash + pip install "scrapling[ai]" + ``` + - ๅฎ‰่ฃ…ShellๅŠŸ่ƒฝ๏ผˆWeb Scraping Shellๅ’Œ`extract`ๅ‘ฝไปค๏ผ‰๏ผš + ```bash + pip install "scrapling[shell]" + ``` + - ๅฎ‰่ฃ…ๆ‰€ๆœ‰ๅ†…ๅฎน๏ผš + ```bash + pip install "scrapling[all]" + ``` + ่ฏท่ฎฐไฝ๏ผŒๅœจๅฎ‰่ฃ…ไปปไฝ•่ฟ™ไบ›้ขๅค–ๅŠŸ่ƒฝๅŽ๏ผˆๅฆ‚ๆžœๆ‚จ่ฟ˜ๆฒกๆœ‰ๅฎ‰่ฃ…๏ผ‰๏ผŒๆ‚จ้œ€่ฆไฝฟ็”จ`scrapling install`ๅฎ‰่ฃ…ๆต่งˆๅ™จไพ่ต–้กน + +### Docker +ๆ‚จ่ฟ˜ๅฏไปฅไฝฟ็”จไปฅไธ‹ๅ‘ฝไปคไปŽDockerHubๅฎ‰่ฃ…ๅŒ…ๅซๆ‰€ๆœ‰้ขๅค–ๅŠŸ่ƒฝๅ’Œๆต่งˆๅ™จ็š„Docker้•œๅƒ๏ผš +```bash +docker pull pyd4vinci/scrapling +``` +ๆˆ–ไปŽGitHubๆณจๅ†Œ่กจไธ‹่ฝฝ๏ผš +```bash +docker pull ghcr.io/d4vinci/scrapling:latest +``` +ๆญค้•œๅƒไฝฟ็”จGitHub Actionsๅ’Œไป“ๅบ“ไธปๅˆ†ๆ”ฏ่‡ชๅŠจๆž„ๅปบๅ’ŒๆŽจ้€ใ€‚ + +## ่ดก็Œฎ + +ๆˆ‘ไปฌๆฌข่ฟŽ่ดก็Œฎ๏ผๅœจๅผ€ๅง‹ไน‹ๅ‰๏ผŒ่ฏท้˜…่ฏปๆˆ‘ไปฌ็š„[่ดก็ŒฎๆŒ‡ๅ—](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md)ใ€‚ + +## ๅ…่ดฃๅฃฐๆ˜Ž + +> [!CAUTION] +> ๆญคๅบ“ไป…็”จไบŽๆ•™่‚ฒๅ’Œ็ 
”็ฉถ็›ฎ็š„ใ€‚ไฝฟ็”จๆญคๅบ“ๅณ่กจ็คบๆ‚จๅŒๆ„้ตๅฎˆๆœฌๅœฐๅ’Œๅ›ฝ้™…ๆ•ฐๆฎๆŠ“ๅ–ๅ’Œ้š็งๆณ•ๅพ‹ใ€‚ไฝœ่€…ๅ’Œ่ดก็Œฎ่€…ๅฏนๆœฌ่ฝฏไปถ็š„ไปปไฝ•ๆปฅ็”จไธๆ‰ฟๆ‹…่ดฃไปปใ€‚ๅง‹็ปˆๅฐŠ้‡็ฝ‘็ซ™็š„ๆœๅŠกๆกๆฌพๅ’Œrobots.txtๆ–‡ไปถใ€‚ + +## ่ฎธๅฏ่ฏ + +ๆœฌไฝœๅ“ๆ นๆฎBSD-3-Clause่ฎธๅฏ่ฏๆŽˆๆƒใ€‚ + +## ่‡ด่ฐข + +ๆญค้กน็›ฎๅŒ…ๅซๆ”น็ผ–่‡ชไปฅไธ‹ๅ†…ๅฎน็š„ไปฃ็ ๏ผš +- Parsel๏ผˆBSD่ฎธๅฏ่ฏ๏ผ‰โ€”โ€”็”จไบŽ[translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)ๅญๆจกๅ— + +--- +
็”ฑKarim Shoair็”จโค๏ธ่ฎพ่ฎกๅ’Œๅˆถไฝœใ€‚

diff --git a/docs/README_DE.md b/docs/README_DE.md new file mode 100644 index 0000000000000000000000000000000000000000..4772d7f2fe666b539e773580aab4d56364ab8734 --- /dev/null +++ b/docs/README_DE.md @@ -0,0 +1,426 @@ + + +

+ + + + Scrapling Poster + + +
+ Effortless Web Scraping for the Modern Web +

+ +

+ + Tests + + PyPI version + + PyPI Downloads +
+ + Discord + + + X (formerly Twitter) Follow + +
+ + Supported Python versions +

+ +

+ Auswahlmethoden + · + Einen Fetcher wรคhlen + · + Spiders + · + Proxy-Rotation + · + CLI + · + MCP-Modus +

+ +Scrapling ist ein adaptives Web-Scraping-Framework, das alles abdeckt -- von einer einzelnen Anfrage bis hin zu einem umfassenden Crawl. + +Sein Parser lernt aus Website-ร„nderungen und lokalisiert Ihre Elemente automatisch neu, wenn sich Seiten aktualisieren. Seine Fetcher umgehen Anti-Bot-Systeme wie Cloudflare Turnstile direkt ab Werk. Und sein Spider-Framework ermรถglicht es Ihnen, auf parallele Multi-Session-Crawls mit Pause & Resume und automatischer Proxy-Rotation hochzuskalieren -- alles in wenigen Zeilen Python. Eine Bibliothek, keine Kompromisse. + +Blitzschnelle Crawls mit Echtzeit-Statistiken und Streaming. Von Web Scrapern fรผr Web Scraper und normale Benutzer entwickelt, ist fรผr jeden etwas dabei. + +```python +from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher +StealthyFetcher.adaptive = True +p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Website unbemerkt abrufen! +products = p.css('.product', auto_save=True) # Daten scrapen, die Website-Designรคnderungen รผberleben! +products = p.css('.product', adaptive=True) # Spรคter, wenn sich die Website-Struktur รคndert, `adaptive=True` รผbergeben, um sie zu finden! +``` +Oder auf vollstรคndige Crawls hochskalieren +```python +from scrapling.spiders import Spider, Response + +class MySpider(Spider): + name = "demo" + start_urls = ["https://example.com/"] + + async def parse(self, response: Response): + for item in response.css('.product'): + yield {"title": item.css('h2::text').get()} + +MySpider().start() +``` + + +# Platin-Sponsoren + +Mรถchten Sie das erste Unternehmen sein, das hier erscheint? Klicken Sie [hier](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646) +# Sponsoren + + + + + + + + + + + + + + + + + + + + +Mรถchten Sie Ihre Anzeige hier zeigen? Klicken Sie [hier](https://github.com/sponsors/D4Vinci) und wรคhlen Sie die Stufe, die zu Ihnen passt! 
+ +--- + +## Hauptmerkmale + +### Spiders -- Ein vollstรคndiges Crawling-Framework +- ๐Ÿ•ท๏ธ **Scrapy-รคhnliche Spider-API**: Definieren Sie Spiders mit `start_urls`, async `parse` Callbacks und `Request`/`Response`-Objekten. +- โšก **Paralleles Crawling**: Konfigurierbare Parallelitรคtslimits, domainbezogenes Throttling und Download-Verzรถgerungen. +- ๐Ÿ”„ **Multi-Session-Unterstรผtzung**: Einheitliche Schnittstelle fรผr HTTP-Anfragen und heimliche Headless-Browser in einem einzigen Spider -- leiten Sie Anfragen per ID an verschiedene Sessions weiter. +- ๐Ÿ’พ **Pause & Resume**: Checkpoint-basierte Crawl-Persistenz. Drรผcken Sie Strg+C fรผr ein kontrolliertes Herunterfahren; starten Sie neu, um dort fortzufahren, wo Sie aufgehรถrt haben. +- ๐Ÿ“ก **Streaming-Modus**: Gescrapte Elemente in Echtzeit streamen รผber `async for item in spider.stream()` mit Echtzeit-Statistiken -- ideal fรผr UI, Pipelines und lang laufende Crawls. +- ๐Ÿ›ก๏ธ **Erkennung blockierter Anfragen**: Automatische Erkennung und Wiederholung blockierter Anfragen mit anpassbarer Logik. +- ๐Ÿ“ฆ **Integrierter Export**: Ergebnisse รผber Hooks und Ihre eigene Pipeline oder den integrierten JSON/JSONL-Export mit `result.items.to_json()` / `result.items.to_jsonl()` exportieren. + +### Erweitertes Website-Abrufen mit Session-Unterstรผtzung +- **HTTP-Anfragen**: Schnelle und heimliche HTTP-Anfragen mit der `Fetcher`-Klasse. Kann Browser-TLS-Fingerprints und Header imitieren und HTTP/3 verwenden. +- **Dynamisches Laden**: Dynamische Websites mit vollstรคndiger Browser-Automatisierung รผber die `DynamicFetcher`-Klasse abrufen, die Playwrights Chromium und Google Chrome unterstรผtzt. +- **Anti-Bot-Umgehung**: Erweiterte Stealth-Fรคhigkeiten mit `StealthyFetcher` und Fingerprint-Spoofing. Kann alle Arten von Cloudflares Turnstile/Interstitial einfach mit Automatisierung umgehen. 
+- **Session-Verwaltung**: Persistente Session-Unterstรผtzung mit den Klassen `FetcherSession`, `StealthySession` und `DynamicSession` fรผr Cookie- und Zustandsverwaltung รผber Anfragen hinweg. +- **Proxy-Rotation**: Integrierter `ProxyRotator` mit zyklischen oder benutzerdefinierten Rotationsstrategien รผber alle Session-Typen hinweg, plus Proxy-รœberschreibungen pro Anfrage. +- **Domain-Blockierung**: Anfragen an bestimmte Domains (und deren Subdomains) in browserbasierten Fetchern blockieren. +- **Async-Unterstรผtzung**: Vollstรคndige async-Unterstรผtzung รผber alle Fetcher und dedizierte async Session-Klassen hinweg. + +### Adaptives Scraping & KI-Integration +- ๐Ÿ”„ **Intelligente Element-Verfolgung**: Elemente nach Website-ร„nderungen mit intelligenten ร„hnlichkeitsalgorithmen neu lokalisieren. +- ๐ŸŽฏ **Intelligente flexible Auswahl**: CSS-Selektoren, XPath-Selektoren, filterbasierte Suche, Textsuche, Regex-Suche und mehr. +- ๐Ÿ” **ร„hnliche Elemente finden**: Elemente, die gefundenen Elementen รคhnlich sind, automatisch lokalisieren. +- ๐Ÿค– **MCP-Server fรผr die Verwendung mit KI**: Integrierter MCP-Server fรผr KI-unterstรผtztes Web Scraping und Datenextraktion. Der MCP-Server verfรผgt รผber leistungsstarke, benutzerdefinierte Funktionen, die Scrapling nutzen, um gezielten Inhalt zu extrahieren, bevor er an die KI (Claude/Cursor/etc.) รผbergeben wird, wodurch Vorgรคnge beschleunigt und Kosten durch Minimierung der Token-Nutzung gesenkt werden. ([Demo-Video](https://www.youtube.com/watch?v=qyFk3ZNwOxE)) + +### Hochleistungs- und praxiserprobte Architektur +- ๐Ÿš€ **Blitzschnell**: Optimierte Leistung, die die meisten Python-Scraping-Bibliotheken รผbertrifft. +- ๐Ÿ”‹ **Speichereffizient**: Optimierte Datenstrukturen und Lazy Loading fรผr einen minimalen Speicher-Footprint. +- โšก **Schnelle JSON-Serialisierung**: 10x schneller als die Standardbibliothek. 
+- ๐Ÿ—๏ธ **Praxiserprobt**: Scrapling hat nicht nur eine Testabdeckung von 92% und eine vollstรคndige Type-Hints-Abdeckung, sondern wird seit dem letzten Jahr tรคglich von Hunderten von Web Scrapern verwendet. + +### Entwickler-/Web-Scraper-freundliche Erfahrung +- ๐ŸŽฏ **Interaktive Web-Scraping-Shell**: Optionale integrierte IPython-Shell mit Scrapling-Integration, Shortcuts und neuen Tools zur Beschleunigung der Web-Scraping-Skriptentwicklung, wie das Konvertieren von Curl-Anfragen in Scrapling-Anfragen und das Anzeigen von Anfrageergebnissen in Ihrem Browser. +- ๐Ÿš€ **Direkt vom Terminal aus verwenden**: Optional kรถnnen Sie Scrapling verwenden, um eine URL zu scrapen, ohne eine einzige Codezeile zu schreiben! +- ๐Ÿ› ๏ธ **Umfangreiche Navigations-API**: Erweiterte DOM-Traversierung mit Eltern-, Geschwister- und Kind-Navigationsmethoden. +- ๐Ÿงฌ **Verbesserte Textverarbeitung**: Integrierte Regex, Bereinigungsmethoden und optimierte String-Operationen. +- ๐Ÿ“ **Automatische Selektorgenerierung**: Robuste CSS/XPath-Selektoren fรผr jedes Element generieren. +- ๐Ÿ”Œ **Vertraute API**: ร„hnlich wie Scrapy/BeautifulSoup mit denselben Pseudo-Elementen, die in Scrapy/Parsel verwendet werden. +- ๐Ÿ“˜ **Vollstรคndige Typabdeckung**: Vollstรคndige Type Hints fรผr hervorragende IDE-Unterstรผtzung und Code-Vervollstรคndigung. Die gesamte Codebasis wird bei jeder ร„nderung automatisch mit **PyRight** und **MyPy** gescannt. +- ๐Ÿ”‹ **Fertiges Docker-Image**: Mit jeder Verรถffentlichung wird automatisch ein Docker-Image erstellt und gepusht, das alle Browser enthรคlt. + +## Erste Schritte + +Hier ein kurzer รœberblick รผber das, was Scrapling kann, ohne zu sehr ins Detail zu gehen. 
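Die oben unter den Hauptmerkmalen erwรคhnte Proxy-Rotation lรคsst sich im Prinzip mit wenigen Zeilen skizzieren. Die folgende Skizze (nur Standardbibliothek) ist nicht Scraplings tatsรคchlicher `ProxyRotator`, sondern veranschaulicht lediglich die zyklische Strategie mit Pro-Anfrage-รœberschreibung:

```python
from itertools import cycle

class RoundRobinProxies:
    """Vereinfachte zyklische Proxy-Rotation (nur zur Veranschaulichung)."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next(self, override=None):
        # Eine Pro-Anfrage-รœberschreibung hat Vorrang vor der Rotation
        return override if override is not None else next(self._pool)

rotator = RoundRobinProxies(["http://p1:8080", "http://p2:8080"])
print([rotator.next() for _ in range(3)])
# ['http://p1:8080', 'http://p2:8080', 'http://p1:8080']
print(rotator.next(override="http://p3:8080"))  # http://p3:8080
```

In Scrapling wird dieselbe Idee รผber alle Session-Typen hinweg angewendet; die folgenden Beispiele zeigen zunรคchst die grundlegende Verwendung der Fetcher.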
+
+### Basic Usage
+HTTP requests with session support
+```python
+from scrapling.fetchers import Fetcher, FetcherSession
+
+with FetcherSession(impersonate='chrome') as session:  # Use the latest version of Chrome's TLS fingerprint
+    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
+    quotes = page.css('.quote .text::text').getall()
+
+# Or use one-off requests
+page = Fetcher.get('https://quotes.toscrape.com/')
+quotes = page.css('.quote .text::text').getall()
+```
+Advanced stealth mode
+```python
+from scrapling.fetchers import StealthyFetcher, StealthySession
+
+with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
+    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
+    data = page.css('#padded_content a').getall()
+
+# Or use the one-off request style: it opens the browser for this request, then closes it when done
+page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
+data = page.css('#padded_content a').getall()
+```
+Full browser automation
+```python
+from scrapling.fetchers import DynamicFetcher, DynamicSession
+
+with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
+    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
+    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector, if you prefer it
+
+# Or use the one-off request style: it opens the browser for this request, then closes it when done
+page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
+data = page.css('.quote .text::text').getall()
+```
+
+### Spiders
+Build full crawlers with concurrent requests, multiple session types, and pause & resume:
+```python
+from scrapling.spiders import Spider, Request, Response
+
+class QuotesSpider(Spider):
+    name = "quotes"
+    
start_urls = ["https://quotes.toscrape.com/"]
+    concurrent_requests = 10
+
+    async def parse(self, response: Response):
+        for quote in response.css('.quote'):
+            yield {
+                "text": quote.css('.text::text').get(),
+                "author": quote.css('.author::text').get(),
+            }
+
+        next_page = response.css('.next a')
+        if next_page:
+            yield response.follow(next_page[0].attrib['href'])
+
+result = QuotesSpider().start()
+print(f"Scraped {len(result.items)} quotes")
+result.items.to_json("quotes.json")
+```
+Use multiple session types in a single spider:
+```python
+from scrapling.spiders import Spider, Request, Response
+from scrapling.fetchers import FetcherSession, AsyncStealthySession
+
+class MultiSessionSpider(Spider):
+    name = "multi"
+    start_urls = ["https://example.com/"]
+
+    def configure_sessions(self, manager):
+        manager.add("fast", FetcherSession(impersonate="chrome"))
+        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
+
+    async def parse(self, response: Response):
+        for link in response.css('a::attr(href)').getall():
+            # Route protected pages through the stealth session
+            if "protected" in link:
+                yield Request(link, sid="stealth")
+            else:
+                yield Request(link, sid="fast", callback=self.parse)  # Explicit callback
+```
+Pause and resume long crawls with checkpoints by starting the spider like this:
+```python
+QuotesSpider(crawldir="./crawl_data").start()
+```
+Press Ctrl+C for a graceful pause -- progress is saved automatically. When you start the spider again later, pass the same `crawldir`, and it resumes from where it left off.
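The pause & resume behavior above boils down to persisting the crawl frontier to disk. Here is a minimal, self-contained sketch of that idea -- conceptual only; the function names and JSON layout are invented for this illustration and are not Scrapling's actual checkpoint format:

```python
import json
from pathlib import Path

def save_checkpoint(crawldir: str, pending: list[str], seen: set[str]) -> None:
    # Persist the crawl frontier so a later run can pick up where this one stopped
    path = Path(crawldir)
    path.mkdir(parents=True, exist_ok=True)
    state = {"pending": pending, "seen": sorted(seen)}
    (path / "checkpoint.json").write_text(json.dumps(state))

def load_checkpoint(crawldir: str) -> tuple[list[str], set[str]]:
    # Restore the frontier, or start fresh when no checkpoint exists yet
    file = Path(crawldir) / "checkpoint.json"
    if not file.exists():
        return [], set()
    state = json.loads(file.read_text())
    return state["pending"], set(state["seen"])
```

A crawler following this pattern would call `save_checkpoint` on Ctrl+C before exiting, and the next run with the same `crawldir` would call `load_checkpoint` and skip every URL already in `seen`.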
+
+### Advanced Parsing & Navigation
+```python
+from scrapling.fetchers import Fetcher
+
+# Rich element selection and navigation
+page = Fetcher.get('https://quotes.toscrape.com/')
+
+# Get quotes with several selection methods
+quotes = page.css('.quote')  # CSS selector
+quotes = page.xpath('//div[@class="quote"]')  # XPath
+quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup style
+# Same as
+quotes = page.find_all('div', class_='quote')
+quotes = page.find_all(['div'], class_='quote')
+quotes = page.find_all(class_='quote')  # and so on...
+# Find elements by text content
+quotes = page.find_by_text('quote', tag='div')
+
+# Advanced navigation
+quote_text = page.css('.quote')[0].css('.text::text').get()
+quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
+first_quote = page.css('.quote')[0]
+author = first_quote.next_sibling.css('.author::text')
+parent_container = first_quote.parent
+
+# Element relationships and similarity
+similar_elements = first_quote.find_similar()
+below_elements = first_quote.below_elements()
+```
+You can use the parser directly if you don't want to fetch websites, as shown below:
+```python
+from scrapling.parser import Selector
+
+page = Selector("...")
+```
+And it works in exactly the same way!
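The `find_similar` call above relies on comparing elements' structural context. The gist can be shown with a toy similarity score over tag/class/depth features -- a deliberately simplified illustration, not Scrapling's actual algorithm:

```python
def fingerprint(tag: str, classes: set[str], depth: int) -> set[str]:
    # Describe an element by a small set of structural features
    return {f"tag:{tag}", f"depth:{depth}", *(f"class:{c}" for c in classes)}

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard similarity: shared features over all features
    return len(a & b) / len(a | b) if a | b else 1.0

target = fingerprint("div", {"quote"}, depth=3)
candidates = {
    "sidebar": fingerprint("div", {"sidebar"}, depth=2),
    "another_quote": fingerprint("div", {"quote"}, depth=3),
}
# The candidate with the highest score is the best structural match
best = max(candidates, key=lambda name: similarity(target, candidates[name]))
```

Ranking candidates by such a score is what lets an adaptive parser re-find an element after a page redesign: the element that scores highest against the saved fingerprint wins.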
+
+### Async Session Management Examples
+```python
+import asyncio
+from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
+
+async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
+    page1 = session.get('https://quotes.toscrape.com/')
+    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
+
+# Async session usage
+async with AsyncStealthySession(max_pages=2) as session:
+    tasks = []
+    urls = ['https://example.com/page1', 'https://example.com/page2']
+
+    for url in urls:
+        task = session.fetch(url)
+        tasks.append(task)
+
+    print(session.get_pool_stats())  # Optional - the status of the browser tab pool (busy/free/error)
+    results = await asyncio.gather(*tasks)
+    print(session.get_pool_stats())
+```
+
+## CLI & Interactive Shell
+
+Scrapling includes a powerful command-line interface:
+
+[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
+
+Launch the interactive web-scraping shell
+```bash
+scrapling shell
+```
+Extract pages to a file directly, without writing any code (extracts the content inside the `body` tag by default). If the output file ends with `.txt`, the target's text content is extracted. If it ends with `.md`, you get a Markdown representation of the HTML content; if it ends with `.html`, you get the HTML content itself.
+```bash
+scrapling extract get 'https://example.com' content.md
+scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
+scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
+scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
+```
+
+> [!NOTE]
+> There are many more features, including the MCP server and the interactive web-scraping shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
+
+## Performance Benchmarks
+
+Scrapling isn't just powerful -- it's also blazing fast. The benchmarks below compare Scrapling's parser against the latest versions of other popular libraries.
+
+### Text Extraction Speed Test (5000 nested elements)
+
+| # | Library | Time (ms) | vs Scrapling |
+|---|:-----------------:|:---------:|:------------:|
+| 1 | Scrapling | 2.02 | 1.0x |
+| 2 | Parsel/Scrapy | 2.04 | 1.01x |
+| 3 | Raw Lxml | 2.54 | 1.257x |
+| 4 | PyQuery | 24.17 | ~12x |
+| 5 | Selectolax | 82.63 | ~41x |
+| 6 | MechanicalSoup | 1549.71 | ~767.1x |
+| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
+| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
+
+
+### Element Similarity & Text Search Performance
+
+Scrapling's adaptive element-finding capabilities significantly outperform the alternatives:
+
+| Library | Time (ms) | vs Scrapling |
+|-------------|:---------:|:------------:|
+| Scrapling | 2.39 | 1.0x |
+| AutoScraper | 12.45 | 5.209x |
+
+
+> All benchmarks represent averages over 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for the methodology.
+
+## Installation
+
+Scrapling requires Python 3.10 or higher:
+
+```bash
+pip install scrapling
+```
+
+This installation includes only the parsing engine and its dependencies, without any fetchers or command-line dependencies.
+
+### Optional Dependencies
+
+1. If you want to use any of the extra features below, the fetchers, or their classes, you need to install the fetchers' dependencies and their browser dependencies as follows:
+   ```bash
+   pip install "scrapling[fetchers]"
+
+   scrapling install  # normal install
+   scrapling install --force  # force reinstall
+   ```
+
+   This downloads all browsers together with their system dependencies and fingerprint-manipulation dependencies.
+
+   Or you can install them from code instead of running a command:
+   ```python
+   from scrapling.cli import install
+
+   install([], standalone_mode=False)  # normal install
+   install(["--force"], standalone_mode=False)  # force reinstall
+   ```
+
+2. Extra features:
+   - Install the MCP server feature:
+     ```bash
+     pip install "scrapling[ai]"
+     ```
+   - Install the shell features (the web-scraping shell and the `extract` command):
+     ```bash
+     pip install "scrapling[shell]"
+     ```
+   - Install everything:
+     ```bash
+     pip install "scrapling[all]"
+     ```
+   Remember that you need to install the browser dependencies with `scrapling install` after any of these extras (if you haven't already)
+
+### Docker
+You can also pull a Docker image with all extras and browsers from DockerHub with the following command:
+```bash
+docker pull pyd4vinci/scrapling
+```
+Or pull it from the GitHub registry:
+```bash
+docker pull ghcr.io/d4vinci/scrapling:latest
+```
+This image is built and pushed automatically by GitHub Actions from the repository's main branch.
+
+## Contributing
+
+We welcome contributions!
Please read our [contribution guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.
+
+## Disclaimer
+
+> [!CAUTION]
+> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data-scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files.
+
+## License
+
+This work is licensed under the BSD-3-Clause License.
+
+## Acknowledgments
+
+This project includes code adapted from:
+- Parsel (BSD License) -- used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule
+
+---
+
Designed and crafted with ❤️ by Karim Shoair.

diff --git a/docs/README_ES.md b/docs/README_ES.md new file mode 100644 index 0000000000000000000000000000000000000000..7037dd5d603f0d8b97ba3533f782c6c881b8d5cc --- /dev/null +++ b/docs/README_ES.md @@ -0,0 +1,426 @@ + + +

+ + + + Scrapling Poster + + +
+ Effortless Web Scraping for the Modern Web +

+ +

+ + Tests + + PyPI version + + PyPI Downloads +
+ + Discord + + + X (formerly Twitter) Follow + +
+ + Supported Python versions +

+ +

+    Selection Methods
+    ·
+    Choosing a Fetcher
+    ·
+    Spiders
+    ·
+    Proxy Rotation
+    ·
+    CLI
+    ·
+    MCP Mode
+

+
+Scrapling is an adaptive web-scraping framework that handles everything from a single request to a large-scale crawl.
+
+Its parser learns from website changes and automatically relocates your elements when pages are updated. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile natively. And its Spider framework lets you scale up to concurrent multi-session crawls, with pause & resume and automatic proxy rotation, all in a few lines of Python. One library, zero compromises.
+
+Blazing-fast crawls with real-time stats and streaming. Built by web scrapers for web scrapers and regular users alike; there is something for everyone.
+
+```python
+from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
+StealthyFetcher.adaptive = True
+p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website under the radar!
+products = p.css('.product', auto_save=True)  # Scrape data that survives website design changes!
+products = p.css('.product', adaptive=True)  # Later, if the website structure changes, pass `adaptive=True` to find them!
+```
+Or scale up to full crawls
+```python
+from scrapling.spiders import Spider, Response
+
+class MySpider(Spider):
+    name = "demo"
+    start_urls = ["https://example.com/"]
+
+    async def parse(self, response: Response):
+        for item in response.css('.product'):
+            yield {"title": item.css('h2::text').get()}
+
+MySpider().start()
+```
+
+
+# Platinum Sponsors
+
+Want to be the first company to show up here? Click [here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646)
+# Sponsors
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci) and pick the tier that suits you!
+
+---
+
+## Key Features
+
+### Spiders -- A Complete Crawling Framework
+- 🕷️ **Scrapy-Style Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
+- ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
+- 🔄 **Multi-Session Support**: Unified interface for HTTP requests and stealthy headless browsers in a single spider -- route requests to different sessions by ID.
+- 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to continue from where you left off.
+- 📡 **Streaming Mode**: Stream scraped items as they arrive with `async for item in spider.stream()`, with real-time stats -- ideal for UIs, pipelines, and long-running crawls.
+- 🛡️ **Blocked-Request Detection**: Automatic detection and retry of blocked requests, with customizable logic.
+- 📦 **Built-In Export**: Export results through hooks and your own pipeline, or the built-in JSON/JSONL with `result.items.to_json()` / `result.items.to_jsonl()` respectively.
+
+### Advanced Website Fetching with Session Support
+- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. It can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
+- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, which supports Playwright's Chromium and Google Chrome.
+- **Anti-Bot Evasion**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. It can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
+- **Session Management**: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
+- **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
+- **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
+- **Async Support**: Full async support across all fetchers, plus dedicated async session classes.
+
+### Adaptive Scraping & AI Integration
+- 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
+- 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
+- 🔍 **Find Similar Elements**: Automatically locate elements similar to ones you already found.
+- 🤖 **MCP Server for AI Use**: Built-in MCP server for AI-assisted web scraping and data extraction. The MCP server provides powerful custom capabilities that use Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc.), which speeds up operations and cuts costs by minimizing token usage. ([Demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
+
+### High-Performance, Battle-Tested Architecture
+- 🚀 **Lightning Fast**: Optimized performance that outperforms most Python scraping libraries.
+- 🔋 **Memory Efficient**: Optimized data structures and lazy loading for a minimal memory footprint.
+- ⚡ **Fast JSON Serialization**: 10x faster than the standard library.
+- ๐Ÿ—๏ธ **Probado en batalla**: Scrapling no solo tiene una cobertura de pruebas del 92% y cobertura completa de type hints, sino que ha sido utilizado diariamente por cientos de Web Scrapers durante el รบltimo aรฑo. + +### Experiencia Amigable para Desarrolladores/Web Scrapers +- ๐ŸŽฏ **Shell Interactivo de Web Scraping**: Shell IPython integrado opcional con integraciรณn de Scrapling, atajos y nuevas herramientas para acelerar el desarrollo de scripts de Web Scraping, como convertir solicitudes curl a solicitudes Scrapling y ver resultados de solicitudes en tu navegador. +- ๐Ÿš€ **รšsalo directamente desde la Terminal**: Opcionalmente, ยกpuedes usar Scrapling para hacer scraping de una URL sin escribir ni una sola lรญnea de cรณdigo! +- ๐Ÿ› ๏ธ **API de Navegaciรณn Rica**: Recorrido avanzado del DOM con mรฉtodos de navegaciรณn de padres, hermanos e hijos. +- ๐Ÿงฌ **Procesamiento de Texto Mejorado**: Mรฉtodos integrados de regex, limpieza y operaciones de cadena optimizadas. +- ๐Ÿ“ **Generaciรณn Automรกtica de Selectores**: Genera selectores CSS/XPath robustos para cualquier elemento. +- ๐Ÿ”Œ **API Familiar**: Similar a Scrapy/BeautifulSoup con los mismos pseudo-elementos usados en Scrapy/Parsel. +- ๐Ÿ“˜ **Cobertura Completa de Tipos**: Type hints completos para excelente soporte de IDE y autocompletado de cรณdigo. Todo el cรณdigo fuente se escanea automรกticamente con **PyRight** y **MyPy** en cada cambio. +- ๐Ÿ”‹ **Imagen Docker Lista**: Con cada lanzamiento, se construye y publica automรกticamente una imagen Docker que contiene todos los navegadores. + +## Primeros Pasos + +Aquรญ tienes un vistazo rรกpido de lo que Scrapling puede hacer sin entrar en profundidad. 
+
+### Basic Usage
+HTTP requests with session support
+```python
+from scrapling.fetchers import Fetcher, FetcherSession
+
+with FetcherSession(impersonate='chrome') as session:  # Use the latest version of Chrome's TLS fingerprint
+    page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
+    quotes = page.css('.quote .text::text').getall()
+
+# Or use one-off requests
+page = Fetcher.get('https://quotes.toscrape.com/')
+quotes = page.css('.quote .text::text').getall()
+```
+Advanced stealth mode
+```python
+from scrapling.fetchers import StealthyFetcher, StealthySession
+
+with StealthySession(headless=True, solve_cloudflare=True) as session:  # Keep the browser open until you finish
+    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
+    data = page.css('#padded_content a').getall()
+
+# Or use the one-off request style: it opens the browser for this request, then closes it when done
+page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
+data = page.css('#padded_content a').getall()
+```
+Full browser automation
+```python
+from scrapling.fetchers import DynamicFetcher, DynamicSession
+
+with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:  # Keep the browser open until you finish
+    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
+    data = page.xpath('//span[@class="text"]/text()').getall()  # XPath selector, if you prefer it
+
+# Or use the one-off request style: it opens the browser for this request, then closes it when done
+page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
+data = page.css('.quote .text::text').getall()
+```
+
+### Spiders
+Build full crawlers with concurrent requests, multiple session types, and pause & resume:
+```python
+from scrapling.spiders import Spider, Request, Response
+
+class 
QuotesSpider(Spider):
+    name = "quotes"
+    start_urls = ["https://quotes.toscrape.com/"]
+    concurrent_requests = 10
+
+    async def parse(self, response: Response):
+        for quote in response.css('.quote'):
+            yield {
+                "text": quote.css('.text::text').get(),
+                "author": quote.css('.author::text').get(),
+            }
+
+        next_page = response.css('.next a')
+        if next_page:
+            yield response.follow(next_page[0].attrib['href'])
+
+result = QuotesSpider().start()
+print(f"Scraped {len(result.items)} quotes")
+result.items.to_json("quotes.json")
+```
+Use multiple session types in a single spider:
+```python
+from scrapling.spiders import Spider, Request, Response
+from scrapling.fetchers import FetcherSession, AsyncStealthySession
+
+class MultiSessionSpider(Spider):
+    name = "multi"
+    start_urls = ["https://example.com/"]
+
+    def configure_sessions(self, manager):
+        manager.add("fast", FetcherSession(impersonate="chrome"))
+        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
+
+    async def parse(self, response: Response):
+        for link in response.css('a::attr(href)').getall():
+            # Route protected pages through the stealth session
+            if "protected" in link:
+                yield Request(link, sid="stealth")
+            else:
+                yield Request(link, sid="fast", callback=self.parse)  # Explicit callback
+```
+Pause and resume long crawls with checkpoints by starting the spider like this:
+```python
+QuotesSpider(crawldir="./crawl_data").start()
+```
+Press Ctrl+C for a graceful pause -- progress is saved automatically. When you start the spider again later, pass the same `crawldir`, and it resumes from where it left off.
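Streaming mode (`async for item in spider.stream()`) follows the async-generator pattern: items are yielded as soon as each fetch completes rather than only after the whole crawl ends. A self-contained sketch of that pattern, using a stand-in `fake_fetch` instead of real network I/O (the names here are illustrative, not Scrapling's API):

```python
import asyncio
from typing import AsyncIterator

async def fake_fetch(url: str) -> dict:
    # Stand-in for a real page fetch; a real spider would do network I/O here
    await asyncio.sleep(0)
    return {"url": url, "title": f"Title of {url}"}

async def stream(urls: list[str]) -> AsyncIterator[dict]:
    # Yield each scraped item the moment its task finishes
    tasks = [asyncio.create_task(fake_fetch(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished

async def main() -> list[dict]:
    items = []
    async for item in stream(["https://example.com/1", "https://example.com/2"]):
        items.append(item)  # consume results incrementally, e.g. feed a UI or pipeline
    return items

results = asyncio.run(main())
```

The consumer never waits for the full result set, which is what makes this shape a good fit for long-running crawls and live dashboards.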
+
+### Advanced Parsing & Navigation
+```python
+from scrapling.fetchers import Fetcher
+
+# Rich element selection and navigation
+page = Fetcher.get('https://quotes.toscrape.com/')
+
+# Get quotes with several selection methods
+quotes = page.css('.quote')  # CSS selector
+quotes = page.xpath('//div[@class="quote"]')  # XPath
+quotes = page.find_all('div', {'class': 'quote'})  # BeautifulSoup style
+# Same as
+quotes = page.find_all('div', class_='quote')
+quotes = page.find_all(['div'], class_='quote')
+quotes = page.find_all(class_='quote')  # and so on...
+# Find elements by text content
+quotes = page.find_by_text('quote', tag='div')
+
+# Advanced navigation
+quote_text = page.css('.quote')[0].css('.text::text').get()
+quote_text = page.css('.quote').css('.text::text').getall()  # Chained selectors
+first_quote = page.css('.quote')[0]
+author = first_quote.next_sibling.css('.author::text')
+parent_container = first_quote.parent
+
+# Element relationships and similarity
+similar_elements = first_quote.find_similar()
+below_elements = first_quote.below_elements()
+```
+You can use the parser directly if you don't want to fetch websites, as shown below:
+```python
+from scrapling.parser import Selector
+
+page = Selector("...")
+```
+And it works in exactly the same way!
+
+### Async Session Management Examples
+```python
+import asyncio
+from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
+
+async with FetcherSession(http3=True) as session:  # `FetcherSession` is context-aware and works in both sync and async patterns
+    page1 = session.get('https://quotes.toscrape.com/')
+    page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
+
+# Async session usage
+async with AsyncStealthySession(max_pages=2) as session:
+    tasks = []
+    urls = ['https://example.com/page1', 'https://example.com/page2']
+
+    for url in urls:
+        task = session.fetch(url)
+        tasks.append(task)
+
+    print(session.get_pool_stats())  # Optional - the status of the browser tab pool (busy/free/error)
+    results = await asyncio.gather(*tasks)
+    print(session.get_pool_stats())
+```
+
+## CLI & Interactive Shell
+
+Scrapling includes a powerful command-line interface:
+
+[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
+
+Launch the interactive web-scraping shell
+```bash
+scrapling shell
+```
+Extract pages to a file directly, without writing any code (extracts the content inside the `body` tag by default). If the output file ends with `.txt`, the target's text content is extracted. If it ends with `.md`, you get a Markdown representation of the HTML content; if it ends with `.html`, you get the HTML content itself.
+```bash
+scrapling extract get 'https://example.com' content.md
+scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome'  # All elements matching the CSS selector '#fromSkipToProducts'
+scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
+scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
+```
+
+> [!NOTE]
+> There are many more features, including the MCP server and the interactive web-scraping shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
+
+## Performance Benchmarks
+
+Scrapling isn't just powerful -- it's also blazing fast. The benchmarks below compare Scrapling's parser against the latest versions of other popular libraries.
+
+### Text Extraction Speed Test (5000 nested elements)
+
+| # | Library | Time (ms) | vs Scrapling |
+|---|:-----------------:|:---------:|:------------:|
+| 1 | Scrapling | 2.02 | 1.0x |
+| 2 | Parsel/Scrapy | 2.04 | 1.01x |
+| 3 | Raw Lxml | 2.54 | 1.257x |
+| 4 | PyQuery | 24.17 | ~12x |
+| 5 | Selectolax | 82.63 | ~41x |
+| 6 | MechanicalSoup | 1549.71 | ~767.1x |
+| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
+| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
+
+
+### Element Similarity & Text Search Performance
+
+Scrapling's adaptive element-finding capabilities significantly outperform the alternatives:
+
+| Library | Time (ms) | vs Scrapling |
+|-------------|:---------:|:------------:|
+| Scrapling | 2.39 | 1.0x |
+| AutoScraper | 12.45 | 5.209x |
+
+
+> All benchmarks represent averages over 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for the methodology.
+
+## Installation
+
+Scrapling requires Python 3.10 or higher:
+
+```bash
+pip install scrapling
+```
+
+This installation includes only the parsing engine and its dependencies, without any fetchers or command-line dependencies.
+
+### Optional Dependencies
+
+1. If you want to use any of the extra features below, the fetchers, or their classes, you need to install the fetchers' dependencies and their browser dependencies as follows:
+   ```bash
+   pip install "scrapling[fetchers]"
+
+   scrapling install  # normal install
+   scrapling install --force  # force reinstall
+   ```
+
+   This downloads all browsers together with their system dependencies and fingerprint-manipulation dependencies.
+
+   Or you can install them from code instead of running a command:
+   ```python
+   from scrapling.cli import install
+
+   install([], standalone_mode=False)  # normal install
+   install(["--force"], standalone_mode=False)  # force reinstall
+   ```
+
+2. Extra features:
+   - Install the MCP server feature:
+     ```bash
+     pip install "scrapling[ai]"
+     ```
+   - Install the shell features (the web-scraping shell and the `extract` command):
+     ```bash
+     pip install "scrapling[shell]"
+     ```
+   - Install everything:
+     ```bash
+     pip install "scrapling[all]"
+     ```
+   Remember that you need to install the browser dependencies with `scrapling install` after any of these extras (if you haven't already)
+
+### Docker
+You can also pull a Docker image with all extras and browsers from DockerHub with the following command:
+```bash
+docker pull pyd4vinci/scrapling
+```
+Or pull it from the GitHub registry:
+```bash
+docker pull ghcr.io/d4vinci/scrapling:latest
+```
+This image is built and pushed automatically by GitHub Actions from the repository's main branch.
+
+## Contributing
+
+We welcome contributions!
Please read our [contribution guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.
+
+## Disclaimer
+
+> [!CAUTION]
+> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data-scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files.
+
+## License
+
+This work is licensed under the BSD-3-Clause License.
+
+## Acknowledgments
+
+This project includes code adapted from:
+- Parsel (BSD License) -- used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule
+
+---
+
Designed and crafted with ❤️ by Karim Shoair.

diff --git a/docs/README_JP.md b/docs/README_JP.md new file mode 100644 index 0000000000000000000000000000000000000000..890e049324d6a3679045a70360a785c67690c02c --- /dev/null +++ b/docs/README_JP.md @@ -0,0 +1,426 @@ + + +

+ + + + Scrapling Poster + + +
+ Effortless Web Scraping for the Modern Web +

+ +

+ + Tests + + PyPI version + + PyPI Downloads +
+ + Discord + + + X (formerly Twitter) Follow + +
+ + Supported Python versions +

+ +

+    Selection Methods
+    ·
+    Choosing a Fetcher
+    ·
+    Spiders
+    ·
+    Proxy Rotation
+    ·
+    CLI
+    ·
+    MCP Mode
+

+
+Scrapling is an adaptive web-scraping framework that handles everything from a single request to a full-scale crawl.
+
+Its parser learns from website changes and automatically relocates elements when pages are updated. Its fetchers bypass anti-bot systems such as Cloudflare Turnstile out of the box. And its Spider framework lets you scale up to concurrent multi-session crawls with pause & resume and automatic proxy rotation -- all in just a few lines of Python. One library, zero compromises.
+
+Blazing-fast crawls with real-time stats and streaming. Built by web scrapers for web scrapers and regular users alike; there is something for everyone.
+
+```python
+from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
+StealthyFetcher.adaptive = True
+p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website under the radar!
+products = p.css('.product', auto_save=True)  # Scrape data that survives website design changes!
+products = p.css('.product', adaptive=True)  # Later, if the website structure changes, pass `adaptive=True` to find them!
+```
+Or scale up to full crawls
+```python
+from scrapling.spiders import Spider, Response
+
+class MySpider(Spider):
+    name = "demo"
+    start_urls = ["https://example.com/"]
+
+    async def parse(self, response: Response):
+        for item in response.css('.product'):
+            yield {"title": item.css('h2::text').get()}
+
+MySpider().start()
+```
+
+
+# Platinum Sponsors
+
+Want to be the first company to show up here? Click [here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646)
+# Sponsors
+
+
+
+
+
+
+
+
+ + + + + + + + + + + + +ใ“ใ“ใซๅบƒๅ‘Šใ‚’่กจ็คบใ—ใŸใ„ใงใ™ใ‹๏ผŸ[ใ“ใกใ‚‰](https://github.com/sponsors/D4Vinci)ใ‚’ใ‚ฏใƒชใƒƒใ‚ฏใ—ใฆใ€ใ‚ใชใŸใซๅˆใฃใŸใƒ†ใ‚ฃใ‚ขใ‚’้ธๆŠžใ—ใฆใใ ใ•ใ„๏ผ + +--- + +## ไธปใชๆฉŸ่ƒฝ + +### Spider โ€” ๆœฌๆ ผ็š„ใชใ‚ฏใƒญใƒผใƒซใƒ•ใƒฌใƒผใƒ ใƒฏใƒผใ‚ฏ +- ๐Ÿ•ท๏ธ **Scrapy้ขจใฎSpider API**๏ผš`start_urls`ใ€async `parse` callbackใ€`Request`/`Response`ใ‚ชใƒ–ใ‚ธใ‚งใ‚ฏใƒˆใงSpiderใ‚’ๅฎš็พฉใ€‚ +- โšก **ไธฆ่กŒใ‚ฏใƒญใƒผใƒซ**๏ผš่จญๅฎšๅฏ่ƒฝใชไธฆ่กŒๆ•ฐๅˆถ้™ใ€ใƒ‰ใƒกใ‚คใƒณใ”ใจใฎใ‚นใƒญใƒƒใƒˆใƒชใƒณใ‚ฐใ€ใƒ€ใ‚ฆใƒณใƒญใƒผใƒ‰้…ๅปถใ€‚ +- ๐Ÿ”„ **ใƒžใƒซใƒSessionใ‚ตใƒใƒผใƒˆ**๏ผšHTTPใƒชใ‚ฏใ‚จใ‚นใƒˆใจใ‚นใƒ†ใƒซใ‚นใƒ˜ใƒƒใƒ‰ใƒฌใ‚นใƒ–ใƒฉใ‚ฆใ‚ถใฎ็ตฑไธ€ใ‚คใƒณใ‚ฟใƒผใƒ•ใ‚งใƒผใ‚น โ€” IDใซใ‚ˆใฃใฆ็•ฐใชใ‚‹Sessionใซใƒชใ‚ฏใ‚จใ‚นใƒˆใ‚’ใƒซใƒผใƒ†ใ‚ฃใƒณใ‚ฐใ€‚ +- ๐Ÿ’พ **Pause & Resume**๏ผšCheckpointใƒ™ใƒผใ‚นใฎใ‚ฏใƒญใƒผใƒซๆฐธ็ถšๅŒ–ใ€‚Ctrl+Cใงๆญฃๅธธใซใ‚ทใƒฃใƒƒใƒˆใƒ€ใ‚ฆใƒณ๏ผ›ๅ†่ตทๅ‹•ใ™ใ‚‹ใจไธญๆ–ญใ—ใŸใจใ“ใ‚ใ‹ใ‚‰ๅ†้–‹ใ€‚ +- ๐Ÿ“ก **Streamingใƒขใƒผใƒ‰**๏ผš`async for item in spider.stream()`ใงใƒชใ‚ขใƒซใ‚ฟใ‚คใƒ ็ตฑ่จˆใจใจใ‚‚ใซใ‚นใ‚ฏใƒฌใ‚คใƒ—ใ•ใ‚ŒใŸใ‚ขใ‚คใƒ†ใƒ ใ‚’Streamingใงๅ—ไฟก โ€” UIใ€ใƒ‘ใ‚คใƒ—ใƒฉใ‚คใƒณใ€้•ทๆ™‚้–“ๅฎŸ่กŒใ‚ฏใƒญใƒผใƒซใซๆœ€้ฉใ€‚ +- ๐Ÿ›ก๏ธ **ใƒ–ใƒญใƒƒใ‚ฏใ•ใ‚ŒใŸใƒชใ‚ฏใ‚จใ‚นใƒˆใฎๆคœๅ‡บ**๏ผšใ‚ซใ‚นใ‚ฟใƒžใ‚คใ‚บๅฏ่ƒฝใชใƒญใ‚ธใƒƒใ‚ฏใซใ‚ˆใ‚‹ใƒ–ใƒญใƒƒใ‚ฏใ•ใ‚ŒใŸใƒชใ‚ฏใ‚จใ‚นใƒˆใฎ่‡ชๅ‹•ๆคœๅ‡บใจใƒชใƒˆใƒฉใ‚คใ€‚ +- ๐Ÿ“ฆ **็ต„ใฟ่พผใฟใ‚จใ‚ฏใ‚นใƒใƒผใƒˆ**๏ผšใƒ•ใƒƒใ‚ฏใ‚„็‹ฌ่‡ชใฎใƒ‘ใ‚คใƒ—ใƒฉใ‚คใƒณใ€ใพใŸใฏ็ต„ใฟ่พผใฟใฎJSON/JSONLใง็ตๆžœใ‚’ใ‚จใ‚ฏใ‚นใƒใƒผใƒˆใ€‚ใใ‚Œใžใ‚Œ`result.items.to_json()` / `result.items.to_jsonl()`ใ‚’ไฝฟ็”จใ€‚ + +### Sessionใ‚ตใƒใƒผใƒˆไป˜ใ้ซ˜ๅบฆใชใ‚ฆใ‚งใƒ–ใ‚ตใ‚คใƒˆๅ–ๅพ— +- **HTTPใƒชใ‚ฏใ‚จใ‚นใƒˆ**๏ผš`Fetcher`ใ‚ฏใƒฉใ‚นใง้ซ˜้€Ÿใ‹ใคใ‚นใƒ†ใƒซใ‚นใชHTTPใƒชใ‚ฏใ‚จใ‚นใƒˆใ€‚ใƒ–ใƒฉใ‚ฆใ‚ถใฎTLS fingerprintใ€ใƒ˜ใƒƒใƒ€ใƒผใ‚’ๆจกๅ€ฃใ—ใ€HTTP/3ใ‚’ไฝฟ็”จๅฏ่ƒฝใ€‚ +- **ๅ‹•็š„่ชญใฟ่พผใฟ**๏ผšPlaywrightใฎChromiumใจGoogle 
Chromeใ‚’ใ‚ตใƒใƒผใƒˆใ™ใ‚‹`DynamicFetcher`ใ‚ฏใƒฉใ‚นใซใ‚ˆใ‚‹ๅฎŒๅ…จใชใƒ–ใƒฉใ‚ฆใ‚ถ่‡ชๅ‹•ๅŒ–ใงๅ‹•็š„ใ‚ฆใ‚งใƒ–ใ‚ตใ‚คใƒˆใ‚’ๅ–ๅพ—ใ€‚ +- **ใ‚ขใƒณใƒใƒœใƒƒใƒˆๅ›ž้ฟ**๏ผš`StealthyFetcher`ใจfingerprintๅฝ่ฃ…ใซใ‚ˆใ‚‹้ซ˜ๅบฆใชใ‚นใƒ†ใƒซใ‚นๆฉŸ่ƒฝใ€‚่‡ชๅ‹•ๅŒ–ใงCloudflareใฎTurnstile/Interstitialใฎใ™ในใฆใฎใ‚ฟใ‚คใƒ—ใ‚’็ฐกๅ˜ใซๅ›ž้ฟใ€‚ +- **Session็ฎก็†**๏ผšใƒชใ‚ฏใ‚จใ‚นใƒˆ้–“ใงCookieใจ็Šถๆ…‹ใ‚’็ฎก็†ใ™ใ‚‹ใŸใ‚ใฎ`FetcherSession`ใ€`StealthySession`ใ€`DynamicSession`ใ‚ฏใƒฉใ‚นใซใ‚ˆใ‚‹ๆฐธ็ถš็š„ใชSessionใ‚ตใƒใƒผใƒˆใ€‚ +- **Proxyๅ›ž่ปข**๏ผšใ™ในใฆใฎSessionใ‚ฟใ‚คใƒ—ใซๅฏพๅฟœใ—ใŸใƒฉใ‚ฆใƒณใƒ‰ใƒญใƒ“ใƒณใพใŸใฏใ‚ซใ‚นใ‚ฟใƒ ๆˆฆ็•ฅใฎ็ต„ใฟ่พผใฟ`ProxyRotator`ใ€ใ•ใ‚‰ใซใƒชใ‚ฏใ‚จใ‚นใƒˆใ”ใจใฎProxyใ‚ชใƒผใƒใƒผใƒฉใ‚คใƒ‰ใ€‚ +- **ใƒ‰ใƒกใ‚คใƒณใƒ–ใƒญใƒƒใ‚ฏ**๏ผšใƒ–ใƒฉใ‚ฆใ‚ถใƒ™ใƒผใ‚นใฎFetcherใง็‰นๅฎšใฎใƒ‰ใƒกใ‚คใƒณ๏ผˆใŠใ‚ˆใณใใฎใ‚ตใƒ–ใƒ‰ใƒกใ‚คใƒณ๏ผ‰ใธใฎใƒชใ‚ฏใ‚จใ‚นใƒˆใ‚’ใƒ–ใƒญใƒƒใ‚ฏใ€‚ +- **asyncใ‚ตใƒใƒผใƒˆ**๏ผšใ™ในใฆใฎFetcherใŠใ‚ˆใณๅฐ‚็”จasyncSessionใ‚ฏใƒฉใ‚นๅ…จไฝ“ใงใฎๅฎŒๅ…จใชasyncใ‚ตใƒใƒผใƒˆใ€‚ + +### ้ฉๅฟœๅž‹ใ‚นใ‚ฏใƒฌใ‚คใƒ”ใƒณใ‚ฐใจAI็ตฑๅˆ +- ๐Ÿ”„ **ใ‚นใƒžใƒผใƒˆ่ฆ็ด ่ฟฝ่ทก**๏ผšใ‚คใƒณใƒ†ใƒชใ‚ธใ‚งใƒณใƒˆใช้กžไผผๆ€งใ‚ขใƒซใ‚ดใƒชใ‚บใƒ ใ‚’ไฝฟ็”จใ—ใฆใ‚ฆใ‚งใƒ–ใ‚ตใ‚คใƒˆใฎๅค‰ๆ›ดๅพŒใซ่ฆ็ด ใ‚’ๅ†้…็ฝฎใ€‚ +- ๐ŸŽฏ **ใ‚นใƒžใƒผใƒˆๆŸ”่ปŸ้ธๆŠž**๏ผšCSSใ‚ปใƒฌใ‚ฏใ‚ฟใ€XPathใ‚ปใƒฌใ‚ฏใ‚ฟใ€ใƒ•ใ‚ฃใƒซใ‚ฟใƒ™ใƒผใ‚นๆคœ็ดขใ€ใƒ†ใ‚ญใ‚นใƒˆๆคœ็ดขใ€ๆญฃ่ฆ่กจ็พๆคœ็ดขใชใฉใ€‚ +- ๐Ÿ” **้กžไผผ่ฆ็ด ใฎๆคœๅ‡บ**๏ผš่ฆ‹ใคใ‹ใฃใŸ่ฆ็ด ใซ้กžไผผใ—ใŸ่ฆ็ด ใ‚’่‡ชๅ‹•็š„ใซ็‰นๅฎšใ€‚ +- ๐Ÿค– **AIใจไฝฟ็”จใ™ใ‚‹MCPใ‚ตใƒผใƒใƒผ**๏ผšAIๆ”ฏๆดWeb Scrapingใจใƒ‡ใƒผใ‚ฟๆŠฝๅ‡บใฎใŸใ‚ใฎ็ต„ใฟ่พผใฟMCPใ‚ตใƒผใƒใƒผใ€‚MCPใ‚ตใƒผใƒใƒผใฏใ€AI๏ผˆClaude/Cursorใชใฉ๏ผ‰ใซๆธกใ™ๅ‰ใซScraplingใ‚’ๆดป็”จใ—ใฆใ‚ฟใƒผใ‚ฒใƒƒใƒˆใ‚ณใƒณใƒ†ใƒณใƒ„ใ‚’ๆŠฝๅ‡บใ™ใ‚‹ๅผทๅŠ›ใงใ‚ซใ‚นใ‚ฟใƒ ใชๆฉŸ่ƒฝใ‚’ๅ‚™ใˆใฆใŠใ‚Šใ€ๆ“ไฝœใ‚’้ซ˜้€ŸๅŒ–ใ—ใ€ใƒˆใƒผใ‚ฏใƒณไฝฟ็”จ้‡ใ‚’ๆœ€ๅฐ้™ใซๆŠ‘ใˆใ‚‹ใ“ใจใงใ‚ณใ‚นใƒˆใ‚’ๅ‰Šๆธ›ใ—ใพใ™ใ€‚๏ผˆ[ใƒ‡ใƒขๅ‹•็”ป](https://www.youtube.com/watch?v=qyFk3ZNwOxE)๏ผ‰ + +### 
้ซ˜ๆ€ง่ƒฝใงๅฎŸๆˆฆใƒ†ใ‚นใƒˆๆธˆใฟใฎใ‚ขใƒผใ‚ญใƒ†ใ‚ฏใƒใƒฃ +- ๐Ÿš€ **่ถ…้ซ˜้€Ÿ**๏ผšใปใจใ‚“ใฉใฎPythonใ‚นใ‚ฏใƒฌใ‚คใƒ”ใƒณใ‚ฐใƒฉใ‚คใƒ–ใƒฉใƒชใ‚’ไธŠๅ›žใ‚‹ๆœ€้ฉๅŒ–ใ•ใ‚ŒใŸใƒ‘ใƒ•ใ‚ฉใƒผใƒžใƒณใ‚นใ€‚ +- ๐Ÿ”‹ **ใƒกใƒขใƒชๅŠน็އ**๏ผšๆœ€ๅฐใฎใƒกใƒขใƒชใƒ•ใƒƒใƒˆใƒ—ใƒชใƒณใƒˆใฎใŸใ‚ใฎๆœ€้ฉๅŒ–ใ•ใ‚ŒใŸใƒ‡ใƒผใ‚ฟๆง‹้€ ใจ้…ๅปถ่ชญใฟ่พผใฟใ€‚ +- โšก **้ซ˜้€ŸJSONใ‚ทใƒชใ‚ขใƒซๅŒ–**๏ผšๆจ™ๆบ–ใƒฉใ‚คใƒ–ใƒฉใƒชใฎ10ๅ€ใฎ้€Ÿๅบฆใ€‚ +- ๐Ÿ—๏ธ **ๅฎŸๆˆฆใƒ†ใ‚นใƒˆๆธˆใฟ**๏ผšScraplingใฏ92%ใฎใƒ†ใ‚นใƒˆใ‚ซใƒใƒฌใƒƒใ‚ธใจๅฎŒๅ…จใชๅž‹ใƒ’ใƒณใƒˆใ‚ซใƒใƒฌใƒƒใ‚ธใ‚’ๅ‚™ใˆใฆใ„ใ‚‹ใ ใ‘ใงใชใใ€้ŽๅŽป1ๅนด้–“ใซๆ•ฐ็™พไบบใฎWeb Scraperใซใ‚ˆใฃใฆๆฏŽๆ—ฅไฝฟ็”จใ•ใ‚Œใฆใใพใ—ใŸใ€‚ + +### ้–‹็™บ่€…/Web Scraperใซใ‚„ใ•ใ—ใ„ไฝ“้จ“ +- ๐ŸŽฏ **ใ‚คใƒณใ‚ฟใƒฉใ‚ฏใƒ†ใ‚ฃใƒ–Web Scraping Shell**๏ผšScrapling็ตฑๅˆใ€ใ‚ทใƒงใƒผใƒˆใ‚ซใƒƒใƒˆใ€curlใƒชใ‚ฏใ‚จใ‚นใƒˆใ‚’Scraplingใƒชใ‚ฏใ‚จใ‚นใƒˆใซๅค‰ๆ›ใ—ใŸใ‚Šใ€ใƒ–ใƒฉใ‚ฆใ‚ถใงใƒชใ‚ฏใ‚จใ‚นใƒˆ็ตๆžœใ‚’่กจ็คบใ—ใŸใ‚Šใ™ใ‚‹ใชใฉใฎๆ–ฐใ—ใ„ใƒ„ใƒผใƒซใ‚’ๅ‚™ใˆใŸใ‚ชใƒ—ใ‚ทใƒงใƒณใฎ็ต„ใฟ่พผใฟIPython Shellใงใ€Web Scrapingใ‚นใ‚ฏใƒชใƒ—ใƒˆใฎ้–‹็™บใ‚’ๅŠ ้€Ÿใ€‚ +- ๐Ÿš€ **ใ‚ฟใƒผใƒŸใƒŠใƒซใ‹ใ‚‰็›ดๆŽฅไฝฟ็”จ**๏ผšใ‚ชใƒ—ใ‚ทใƒงใƒณใงใ€ใ‚ณใƒผใƒ‰ใ‚’ไธ€่กŒใ‚‚ๆ›ธใ‹ใšใซScraplingใ‚’ไฝฟ็”จใ—ใฆURLใ‚’ใ‚นใ‚ฏใƒฌใ‚คใƒ—ใงใใพใ™๏ผ +- ๐Ÿ› ๏ธ **่ฑŠๅฏŒใชใƒŠใƒ“ใ‚ฒใƒผใ‚ทใƒงใƒณAPI**๏ผš่ฆชใ€ๅ…„ๅผŸใ€ๅญใฎใƒŠใƒ“ใ‚ฒใƒผใ‚ทใƒงใƒณใƒกใ‚ฝใƒƒใƒ‰ใซใ‚ˆใ‚‹้ซ˜ๅบฆใชDOMใƒˆใƒฉใƒใƒผใ‚ตใƒซใ€‚ +- ๐Ÿงฌ **ๅผทๅŒ–ใ•ใ‚ŒใŸใƒ†ใ‚ญใ‚นใƒˆๅ‡ฆ็†**๏ผš็ต„ใฟ่พผใฟใฎๆญฃ่ฆ่กจ็พใ€ใ‚ฏใƒชใƒผใƒ‹ใƒณใ‚ฐใƒกใ‚ฝใƒƒใƒ‰ใ€ๆœ€้ฉๅŒ–ใ•ใ‚ŒใŸๆ–‡ๅญ—ๅˆ—ๆ“ไฝœใ€‚ +- ๐Ÿ“ **่‡ชๅ‹•ใ‚ปใƒฌใ‚ฏใ‚ฟ็”Ÿๆˆ**๏ผšไปปๆ„ใฎ่ฆ็ด ใซๅฏพใ—ใฆๅ …็‰ขใชCSS/XPathใ‚ปใƒฌใ‚ฏใ‚ฟใ‚’็”Ÿๆˆใ€‚ +- ๐Ÿ”Œ **้ฆดๆŸ“ใฟใฎใ‚ใ‚‹API**๏ผšScrapy/Parselใงไฝฟ็”จใ•ใ‚Œใฆใ„ใ‚‹ๅŒใ˜็–‘ไผผ่ฆ็ด ใ‚’ๆŒใคScrapy/BeautifulSoupใซไผผใŸ่จญ่จˆใ€‚ +- ๐Ÿ“˜ 
**ๅฎŒๅ…จใชๅž‹ใ‚ซใƒใƒฌใƒƒใ‚ธ**๏ผšๅ„ชใ‚ŒใŸIDEใ‚ตใƒใƒผใƒˆใจใ‚ณใƒผใƒ‰่ฃœๅฎŒใฎใŸใ‚ใฎๅฎŒๅ…จใชๅž‹ใƒ’ใƒณใƒˆใ€‚ใ‚ณใƒผใƒ‰ใƒ™ใƒผใ‚นๅ…จไฝ“ใŒๅค‰ๆ›ดใฎใŸใณใซ**PyRight**ใจ**MyPy**ใง่‡ชๅ‹•็š„ใซใ‚นใ‚ญใƒฃใƒณใ•ใ‚Œใพใ™ใ€‚ +- ๐Ÿ”‹ **ใ™ใใซไฝฟใˆใ‚‹Dockerใ‚คใƒกใƒผใ‚ธ**๏ผšๅ„ใƒชใƒชใƒผใ‚นใงใ€ใ™ในใฆใฎใƒ–ใƒฉใ‚ฆใ‚ถใ‚’ๅซใ‚€Dockerใ‚คใƒกใƒผใ‚ธใŒ่‡ชๅ‹•็š„ใซใƒ“ใƒซใƒ‰ใŠใ‚ˆใณใƒ—ใƒƒใ‚ทใƒฅใ•ใ‚Œใพใ™ใ€‚ + +## ใฏใ˜ใ‚ใซ + +ๆทฑใๆŽ˜ใ‚Šไธ‹ใ’ใšใซใ€Scraplingใซใงใใ‚‹ใ“ใจใฎ็ฐกๅ˜ใชๆฆ‚่ฆใ‚’ใŠ่ฆ‹ใ›ใ—ใพใ—ใ‚‡ใ†ใ€‚ + +### ๅŸบๆœฌ็š„ใชไฝฟใ„ๆ–น +Sessionใ‚ตใƒใƒผใƒˆไป˜ใHTTPใƒชใ‚ฏใ‚จใ‚นใƒˆ +```python +from scrapling.fetchers import Fetcher, FetcherSession + +with FetcherSession(impersonate='chrome') as session: # ChromeใฎTLS fingerprintใฎๆœ€ๆ–ฐใƒใƒผใ‚ธใƒงใƒณใ‚’ไฝฟ็”จ + page = session.get('https://quotes.toscrape.com/', stealthy_headers=True) + quotes = page.css('.quote .text::text').getall() + +# ใพใŸใฏไธ€ๅ›ž้™ใ‚Šใฎใƒชใ‚ฏใ‚จใ‚นใƒˆใ‚’ไฝฟ็”จ +page = Fetcher.get('https://quotes.toscrape.com/') +quotes = page.css('.quote .text::text').getall() +``` +้ซ˜ๅบฆใชใ‚นใƒ†ใƒซใ‚นใƒขใƒผใƒ‰ +```python +from scrapling.fetchers import StealthyFetcher, StealthySession + +with StealthySession(headless=True, solve_cloudflare=True) as session: # ๅฎŒไบ†ใ™ใ‚‹ใพใงใƒ–ใƒฉใ‚ฆใ‚ถใ‚’้–‹ใ„ใŸใพใพใซใ™ใ‚‹ + page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False) + data = page.css('#padded_content a').getall() + +# ใพใŸใฏไธ€ๅ›ž้™ใ‚Šใฎใƒชใ‚ฏใ‚จใ‚นใƒˆใ‚นใ‚ฟใ‚คใƒซใ€ใ“ใฎใƒชใ‚ฏใ‚จใ‚นใƒˆใฎใŸใ‚ใซใƒ–ใƒฉใ‚ฆใ‚ถใ‚’้–‹ใใ€ๅฎŒไบ†ๅพŒใซ้–‰ใ˜ใ‚‹ +page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare') +data = page.css('#padded_content a').getall() +``` +ๅฎŒๅ…จใชใƒ–ใƒฉใ‚ฆใ‚ถ่‡ชๅ‹•ๅŒ– +```python +from scrapling.fetchers import DynamicFetcher, DynamicSession + +with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # ๅฎŒไบ†ใ™ใ‚‹ใพใงใƒ–ใƒฉใ‚ฆใ‚ถใ‚’้–‹ใ„ใŸใพใพใซใ™ใ‚‹ + page = 
session.fetch('https://quotes.toscrape.com/', load_dom=False) + data = page.xpath('//span[@class="text"]/text()').getall() # ใŠๅฅฝใฟใงใ‚ใ‚ŒใฐXPathใ‚ปใƒฌใ‚ฏใ‚ฟใ‚’ไฝฟ็”จ + +# ใพใŸใฏไธ€ๅ›ž้™ใ‚Šใฎใƒชใ‚ฏใ‚จใ‚นใƒˆใ‚นใ‚ฟใ‚คใƒซใ€ใ“ใฎใƒชใ‚ฏใ‚จใ‚นใƒˆใฎใŸใ‚ใซใƒ–ใƒฉใ‚ฆใ‚ถใ‚’้–‹ใใ€ๅฎŒไบ†ๅพŒใซ้–‰ใ˜ใ‚‹ +page = DynamicFetcher.fetch('https://quotes.toscrape.com/') +data = page.css('.quote .text::text').getall() +``` + +### Spider +ไธฆ่กŒใƒชใ‚ฏใ‚จใ‚นใƒˆใ€่ค‡ๆ•ฐใฎSessionใ‚ฟใ‚คใƒ—ใ€Pause & Resumeใ‚’ๅ‚™ใˆใŸๆœฌๆ ผ็š„ใชใ‚ฏใƒญใƒผใƒฉใƒผใ‚’ๆง‹็ฏ‰๏ผš +```python +from scrapling.spiders import Spider, Request, Response + +class QuotesSpider(Spider): + name = "quotes" + start_urls = ["https://quotes.toscrape.com/"] + concurrent_requests = 10 + + async def parse(self, response: Response): + for quote in response.css('.quote'): + yield { + "text": quote.css('.text::text').get(), + "author": quote.css('.author::text').get(), + } + + next_page = response.css('.next a') + if next_page: + yield response.follow(next_page[0].attrib['href']) + +result = QuotesSpider().start() +print(f"{len(result.items)}ไปถใฎๅผ•็”จใ‚’ใ‚นใ‚ฏใƒฌใ‚คใƒ—ใ—ใพใ—ใŸ") +result.items.to_json("quotes.json") +``` +ๅ˜ไธ€ใฎSpiderใง่ค‡ๆ•ฐใฎSessionใ‚ฟใ‚คใƒ—ใ‚’ไฝฟ็”จ๏ผš +```python +from scrapling.spiders import Spider, Request, Response +from scrapling.fetchers import FetcherSession, AsyncStealthySession + +class MultiSessionSpider(Spider): + name = "multi" + start_urls = ["https://example.com/"] + + def configure_sessions(self, manager): + manager.add("fast", FetcherSession(impersonate="chrome")) + manager.add("stealth", AsyncStealthySession(headless=True), lazy=True) + + async def parse(self, response: Response): + for link in response.css('a::attr(href)').getall(): + # ไฟ่ญทใ•ใ‚ŒใŸใƒšใƒผใ‚ธใฏใ‚นใƒ†ใƒซใ‚นSessionใ‚’้€šใ—ใฆใƒซใƒผใƒ†ใ‚ฃใƒณใ‚ฐ + if "protected" in link: + yield Request(link, sid="stealth") + else: + yield Request(link, sid="fast", callback=self.parse) # ๆ˜Ž็คบ็š„ใชcallback +``` 
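上のMultiSessionSpiderの例が示す「IDによる複数Sessionへのルーティング」という考え方は、標準ライブラリだけで次のように素描できます。ここでの`SessionManager`・`dispatch`などの名前やダミーのSessionクラスは説明用の仮のものであり、Scrapling本体の実装そのものではありません。

```python
# 説明用の最小スケッチです。クラス名・メソッド名は仮のもので、
# Scraplingの実際の実装とは異なります。
class SessionManager:
    """sid(Session ID)ごとにSessionオブジェクトを保持し、
    リクエストを対応するSessionへ振り分ける最小の例。"""

    def __init__(self):
        self._sessions = {}

    def add(self, sid, session):
        self._sessions[sid] = session

    def route(self, sid):
        if sid not in self._sessions:
            raise KeyError(f"unknown session id: {sid}")  # 未知のsidは早めに失敗させる
        return self._sessions[sid]


class FakeHTTPSession:
    def fetch(self, url):  # 実際のFetcherSessionの代わりとなるダミー
        return f"http:{url}"


class FakeStealthSession:
    def fetch(self, url):  # 実際のAsyncStealthySessionの代わりとなるダミー
        return f"stealth:{url}"


manager = SessionManager()
manager.add("fast", FakeHTTPSession())
manager.add("stealth", FakeStealthSession())


def dispatch(url):
    # "protected"を含むURLだけステルス側のSessionへルーティングする
    sid = "stealth" if "protected" in url else "fast"
    return manager.route(sid).fetch(url)
```

実際のSpiderでは、この振り分けに相当する処理を`Request(link, sid=...)`の`sid`引数が担います。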
+Checkpointใ‚’ไฝฟ็”จใ—ใฆ้•ทๆ™‚้–“ใฎใ‚ฏใƒญใƒผใƒซใ‚’Pause & Resume๏ผš +```python +QuotesSpider(crawldir="./crawl_data").start() +``` +Ctrl+Cใ‚’ๆŠผใ™ใจๆญฃๅธธใซไธ€ๆ™‚ๅœๆญขใ—ใ€้€ฒๆ—ใฏ่‡ชๅ‹•็š„ใซไฟๅญ˜ใ•ใ‚Œใพใ™ใ€‚ๅพŒใงSpiderใ‚’ๅ†ๅบฆ่ตทๅ‹•ใ™ใ‚‹้š›ใซๅŒใ˜`crawldir`ใ‚’ๆธกใ™ใจใ€ไธญๆ–ญใ—ใŸใจใ“ใ‚ใ‹ใ‚‰ๅ†้–‹ใ—ใพใ™ใ€‚ + +### ้ซ˜ๅบฆใชใƒ‘ใƒผใ‚นใจใƒŠใƒ“ใ‚ฒใƒผใ‚ทใƒงใƒณ +```python +from scrapling.fetchers import Fetcher + +# ่ฑŠๅฏŒใช่ฆ็ด ้ธๆŠžใจใƒŠใƒ“ใ‚ฒใƒผใ‚ทใƒงใƒณ +page = Fetcher.get('https://quotes.toscrape.com/') + +# ่ค‡ๆ•ฐใฎ้ธๆŠžใƒกใ‚ฝใƒƒใƒ‰ใงๅผ•็”จใ‚’ๅ–ๅพ— +quotes = page.css('.quote') # CSSใ‚ปใƒฌใ‚ฏใ‚ฟ +quotes = page.xpath('//div[@class="quote"]') # XPath +quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoupใ‚นใ‚ฟใ‚คใƒซ +# ไปฅไธ‹ใจๅŒใ˜ +quotes = page.find_all('div', class_='quote') +quotes = page.find_all(['div'], class_='quote') +quotes = page.find_all(class_='quote') # ใชใฉ... +# ใƒ†ใ‚ญใ‚นใƒˆๅ†…ๅฎนใง่ฆ็ด ใ‚’ๆคœ็ดข +quotes = page.find_by_text('quote', tag='div') + +# ้ซ˜ๅบฆใชใƒŠใƒ“ใ‚ฒใƒผใ‚ทใƒงใƒณ +quote_text = page.css('.quote')[0].css('.text::text').get() +quote_text = page.css('.quote').css('.text::text').getall() # ใƒใ‚งใƒผใƒณใ‚ปใƒฌใ‚ฏใ‚ฟ +first_quote = page.css('.quote')[0] +author = first_quote.next_sibling.css('.author::text') +parent_container = first_quote.parent + +# ่ฆ็ด ใฎ้–ข้€ฃๆ€งใจ้กžไผผๆ€ง +similar_elements = first_quote.find_similar() +below_elements = first_quote.below_elements() +``` +ใ‚ฆใ‚งใƒ–ใ‚ตใ‚คใƒˆใ‚’ๅ–ๅพ—ใ›ใšใซใƒ‘ใƒผใ‚ตใƒผใ‚’ใ™ใใซไฝฟ็”จใ™ใ‚‹ใ“ใจใ‚‚ใงใใพใ™๏ผš +```python +from scrapling.parser import Selector + +page = Selector("...") +``` +ใพใฃใŸใๅŒใ˜ๆ–นๆณ•ใงๅ‹•ไฝœใ—ใพใ™๏ผ + +### ้žๅŒๆœŸSession็ฎก็†ใฎไพ‹ +```python +import asyncio +from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession + +async with FetcherSession(http3=True) as session: # 
`FetcherSession`ใฏใ‚ณใƒณใƒ†ใ‚ญใ‚นใƒˆใ‚ขใ‚ฆใ‚งใ‚ขใงใ€ๅŒๆœŸ/้žๅŒๆœŸไธกๆ–นใฎใƒ‘ใ‚ฟใƒผใƒณใงๅ‹•ไฝœๅฏ่ƒฝ + page1 = session.get('https://quotes.toscrape.com/') + page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135') + +# ้žๅŒๆœŸSessionใฎไฝฟ็”จ +async with AsyncStealthySession(max_pages=2) as session: + tasks = [] + urls = ['https://example.com/page1', 'https://example.com/page2'] + + for url in urls: + task = session.fetch(url) + tasks.append(task) + + print(session.get_pool_stats()) # ใ‚ชใƒ—ใ‚ทใƒงใƒณ - ใƒ–ใƒฉใ‚ฆใ‚ถใ‚ฟใƒ–ใƒ—ใƒผใƒซใฎใ‚นใƒ†ใƒผใ‚ฟใ‚น๏ผˆใƒ“ใ‚ธใƒผ/ใƒ•ใƒชใƒผ/ใ‚จใƒฉใƒผ๏ผ‰ + results = await asyncio.gather(*tasks) + print(session.get_pool_stats()) +``` + +## CLIใจใ‚คใƒณใ‚ฟใƒฉใ‚ฏใƒ†ใ‚ฃใƒ–Shell + +ScraplingใซใฏๅผทๅŠ›ใชใ‚ณใƒžใƒณใƒ‰ใƒฉใ‚คใƒณใ‚คใƒณใ‚ฟใƒผใƒ•ใ‚งใƒผใ‚นใŒๅซใพใ‚Œใฆใ„ใพใ™๏ผš + +[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339) + +ใ‚คใƒณใ‚ฟใƒฉใ‚ฏใƒ†ใ‚ฃใƒ–Web Scraping Shellใ‚’่ตทๅ‹• +```bash +scrapling shell +``` +ใƒ—ใƒญใ‚ฐใƒฉใƒŸใƒณใ‚ฐใ›ใšใซ็›ดๆŽฅใƒšใƒผใ‚ธใ‚’ใƒ•ใ‚กใ‚คใƒซใซๆŠฝๅ‡บ๏ผˆใƒ‡ใƒ•ใ‚ฉใƒซใƒˆใง`body`ใ‚ฟใ‚ฐๅ†…ใฎใ‚ณใƒณใƒ†ใƒณใƒ„ใ‚’ๆŠฝๅ‡บ๏ผ‰ใ€‚ๅ‡บๅŠ›ใƒ•ใ‚กใ‚คใƒซใŒ`.txt`ใง็ต‚ใ‚ใ‚‹ๅ ดๅˆใ€ใ‚ฟใƒผใ‚ฒใƒƒใƒˆใฎใƒ†ใ‚ญใ‚นใƒˆใ‚ณใƒณใƒ†ใƒณใƒ„ใŒๆŠฝๅ‡บใ•ใ‚Œใพใ™ใ€‚`.md`ใง็ต‚ใ‚ใ‚‹ๅ ดๅˆใ€HTMLใ‚ณใƒณใƒ†ใƒณใƒ„ใฎMarkdown่กจ็พใซใชใ‚Šใพใ™ใ€‚`.html`ใง็ต‚ใ‚ใ‚‹ๅ ดๅˆใ€HTMLใ‚ณใƒณใƒ†ใƒณใƒ„ใใฎใ‚‚ใฎใซใชใ‚Šใพใ™ใ€‚ +```bash +scrapling extract get 'https://example.com' content.md +scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # CSSใ‚ปใƒฌใ‚ฏใ‚ฟ'#fromSkipToProducts'ใซไธ€่‡ดใ™ใ‚‹ใ™ในใฆใฎ่ฆ็ด  +scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless +scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare +``` + +> [!NOTE] +> MCPใ‚ตใƒผใƒใƒผใ‚„ใ‚คใƒณใ‚ฟใƒฉใ‚ฏใƒ†ใ‚ฃใƒ–Web 
Scraping Shellใชใฉใ€ไป–ใซใ‚‚ๅคšใใฎ่ฟฝๅŠ ๆฉŸ่ƒฝใŒใ‚ใ‚Šใพใ™ใŒใ€ใ“ใฎใƒšใƒผใ‚ธใฏ็ฐกๆฝ”ใซไฟใกใŸใ„ใจๆ€ใ„ใพใ™ใ€‚ๅฎŒๅ…จใชใƒ‰ใ‚ญใƒฅใƒกใƒณใƒˆใฏ[ใ“ใกใ‚‰](https://scrapling.readthedocs.io/en/latest/)ใ‚’ใ”่ฆงใใ ใ•ใ„ + +## ใƒ‘ใƒ•ใ‚ฉใƒผใƒžใƒณใ‚นใƒ™ใƒณใƒใƒžใƒผใ‚ฏ + +ScraplingใฏๅผทๅŠ›ใงใ‚ใ‚‹ใ ใ‘ใงใชใใ€่ถ…้ซ˜้€Ÿใงใ™ใ€‚ไปฅไธ‹ใฎใƒ™ใƒณใƒใƒžใƒผใ‚ฏใฏใ€Scraplingใฎใƒ‘ใƒผใ‚ตใƒผใ‚’ไป–ใฎไบบๆฐ—ใƒฉใ‚คใƒ–ใƒฉใƒชใฎๆœ€ๆ–ฐใƒใƒผใ‚ธใƒงใƒณใจๆฏ”่ผƒใ—ใฆใ„ใพใ™ใ€‚ + +### ใƒ†ใ‚ญใ‚นใƒˆๆŠฝๅ‡บ้€Ÿๅบฆใƒ†ใ‚นใƒˆ๏ผˆ5000ๅ€‹ใฎใƒใ‚นใƒˆใ•ใ‚ŒใŸ่ฆ็ด ๏ผ‰ + +| # | ใƒฉใ‚คใƒ–ใƒฉใƒช | ๆ™‚้–“(ms) | vs Scrapling | +|---|:-----------------:|:---------:|:------------:| +| 1 | Scrapling | 2.02 | 1.0x | +| 2 | Parsel/Scrapy | 2.04 | 1.01 | +| 3 | Raw Lxml | 2.54 | 1.257 | +| 4 | PyQuery | 24.17 | ~12x | +| 5 | Selectolax | 82.63 | ~41x | +| 6 | MechanicalSoup | 1549.71 | ~767.1x | +| 7 | BS4 with Lxml | 1584.31 | ~784.3x | +| 8 | BS4 with html5lib | 3391.91 | ~1679.1x | + + +### ่ฆ็ด ้กžไผผๆ€งใจใƒ†ใ‚ญใ‚นใƒˆๆคœ็ดขใฎใƒ‘ใƒ•ใ‚ฉใƒผใƒžใƒณใ‚น + +Scraplingใฎ้ฉๅฟœๅž‹่ฆ็ด ๆคœ็ดขๆฉŸ่ƒฝใฏไปฃๆ›ฟๆ‰‹ๆฎตใ‚’ๅคงๅน…ใซไธŠๅ›žใ‚Šใพใ™๏ผš + +| ใƒฉใ‚คใƒ–ใƒฉใƒช | ๆ™‚้–“(ms) | vs Scrapling | +|-------------|:---------:|:------------:| +| Scrapling | 2.39 | 1.0x | +| AutoScraper | 12.45 | 5.209x | + + +> ใ™ในใฆใฎใƒ™ใƒณใƒใƒžใƒผใ‚ฏใฏ100ๅ›žไปฅไธŠใฎๅฎŸ่กŒใฎๅนณๅ‡ใ‚’่กจใ—ใพใ™ใ€‚ๆ–นๆณ•่ซ–ใซใคใ„ใฆใฏ[benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py)ใ‚’ๅ‚็…งใ—ใฆใใ ใ•ใ„ใ€‚ + +## ใ‚คใƒณใ‚นใƒˆใƒผใƒซ + +ScraplingใซใฏPython 3.10ไปฅไธŠใŒๅฟ…่ฆใงใ™๏ผš + +```bash +pip install scrapling +``` + +ใ“ใฎใ‚คใƒณใ‚นใƒˆใƒผใƒซใซใฏใƒ‘ใƒผใ‚ตใƒผใ‚จใƒณใ‚ธใƒณใจใใฎไพๅญ˜้–ขไฟ‚ใฎใฟใŒๅซใพใ‚ŒใฆใŠใ‚Šใ€Fetcherใ‚„ใ‚ณใƒžใƒณใƒ‰ใƒฉใ‚คใƒณไพๅญ˜้–ขไฟ‚ใฏๅซใพใ‚Œใฆใ„ใพใ›ใ‚“ใ€‚ + +### ใ‚ชใƒ—ใ‚ทใƒงใƒณใฎไพๅญ˜้–ขไฟ‚ + +1. 
ไปฅไธ‹ใฎ่ฟฝๅŠ ๆฉŸ่ƒฝใ€Fetcherใ€ใพใŸใฏใใ‚Œใ‚‰ใฎใ‚ฏใƒฉใ‚นใฎใ„ใšใ‚Œใ‹ใ‚’ไฝฟ็”จใ™ใ‚‹ๅ ดๅˆใฏใ€Fetcherใฎไพๅญ˜้–ขไฟ‚ใจใƒ–ใƒฉใ‚ฆใ‚ถใฎไพๅญ˜้–ขไฟ‚ใ‚’ๆฌกใฎใ‚ˆใ†ใซใ‚คใƒณใ‚นใƒˆใƒผใƒซใ™ใ‚‹ๅฟ…่ฆใŒใ‚ใ‚Šใพใ™๏ผš + ```bash + pip install "scrapling[fetchers]" + + scrapling install # normal install + scrapling install --force # force reinstall + ``` + + ใ“ใ‚Œใซใ‚ˆใ‚Šใ€ใ™ในใฆใฎใƒ–ใƒฉใ‚ฆใ‚ถใ€ใŠใ‚ˆใณใใ‚Œใ‚‰ใฎใ‚ทใ‚นใƒ†ใƒ ไพๅญ˜้–ขไฟ‚ใจfingerprintๆ“ไฝœไพๅญ˜้–ขไฟ‚ใŒใƒ€ใ‚ฆใƒณใƒญใƒผใƒ‰ใ•ใ‚Œใพใ™ใ€‚ + + ใพใŸใฏใ€ใ‚ณใƒžใƒณใƒ‰ใ‚’ๅฎŸ่กŒใ™ใ‚‹ไปฃใ‚ใ‚Šใซใ‚ณใƒผใƒ‰ใ‹ใ‚‰ใ‚คใƒณใ‚นใƒˆใƒผใƒซใ™ใ‚‹ใ“ใจใ‚‚ใงใใพใ™๏ผš + ```python + from scrapling.cli import install + + install([], standalone_mode=False) # normal install + install(["--force"], standalone_mode=False) # force reinstall + ``` + +2. ่ฟฝๅŠ ๆฉŸ่ƒฝ๏ผš + - MCPใ‚ตใƒผใƒใƒผๆฉŸ่ƒฝใ‚’ใ‚คใƒณใ‚นใƒˆใƒผใƒซ๏ผš + ```bash + pip install "scrapling[ai]" + ``` + - ShellๆฉŸ่ƒฝ๏ผˆWeb Scraping Shellใจ`extract`ใ‚ณใƒžใƒณใƒ‰๏ผ‰ใ‚’ใ‚คใƒณใ‚นใƒˆใƒผใƒซ๏ผš + ```bash + pip install "scrapling[shell]" + ``` + - ใ™ในใฆใ‚’ใ‚คใƒณใ‚นใƒˆใƒผใƒซ๏ผš + ```bash + pip install "scrapling[all]" + ``` + ใ“ใ‚Œใ‚‰ใฎ่ฟฝๅŠ ๆฉŸ่ƒฝใฎใ„ใšใ‚Œใ‹ใฎๅพŒ๏ผˆใพใ ใ‚คใƒณใ‚นใƒˆใƒผใƒซใ—ใฆใ„ใชใ„ๅ ดๅˆ๏ผ‰ใ€`scrapling install`ใงใƒ–ใƒฉใ‚ฆใ‚ถใฎไพๅญ˜้–ขไฟ‚ใ‚’ใ‚คใƒณใ‚นใƒˆใƒผใƒซใ™ใ‚‹ๅฟ…่ฆใŒใ‚ใ‚‹ใ“ใจใ‚’ๅฟ˜ใ‚Œใชใ„ใงใใ ใ•ใ„ + +### Docker +DockerHubใ‹ใ‚‰ๆฌกใฎใ‚ณใƒžใƒณใƒ‰ใงใ™ในใฆใฎ่ฟฝๅŠ ๆฉŸ่ƒฝใจใƒ–ใƒฉใ‚ฆใ‚ถใ‚’ๅซใ‚€Dockerใ‚คใƒกใƒผใ‚ธใ‚’ใ‚คใƒณใ‚นใƒˆใƒผใƒซใ™ใ‚‹ใ“ใจใ‚‚ใงใใพใ™๏ผš +```bash +docker pull pyd4vinci/scrapling +``` +ใพใŸใฏGitHubใƒฌใ‚ธใ‚นใƒˆใƒชใ‹ใ‚‰ใƒ€ใ‚ฆใƒณใƒญใƒผใƒ‰๏ผš +```bash +docker pull ghcr.io/d4vinci/scrapling:latest +``` +ใ“ใฎใ‚คใƒกใƒผใ‚ธใฏใ€GitHub Actionsใจใƒชใƒใ‚ธใƒˆใƒชใฎใƒกใ‚คใƒณใƒ–ใƒฉใƒณใƒใ‚’ไฝฟ็”จใ—ใฆ่‡ชๅ‹•็š„ใซใƒ“ใƒซใƒ‰ใŠใ‚ˆใณใƒ—ใƒƒใ‚ทใƒฅใ•ใ‚Œใพใ™ใ€‚ + +## ่ฒข็Œฎ + 
+่ฒข็Œฎใ‚’ๆญ“่ฟŽใ—ใพใ™๏ผๅง‹ใ‚ใ‚‹ๅ‰ใซ[่ฒข็Œฎใ‚ฌใ‚คใƒ‰ใƒฉใ‚คใƒณ](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md)ใ‚’ใŠ่ชญใฟใใ ใ•ใ„ใ€‚ + +## ๅ…่ฒฌไบ‹้ … + +> [!CAUTION] +> ใ“ใฎใƒฉใ‚คใƒ–ใƒฉใƒชใฏๆ•™่‚ฒใŠใ‚ˆใณ็ ”็ฉถ็›ฎ็š„ใฎใฟใงๆไพ›ใ•ใ‚Œใฆใ„ใพใ™ใ€‚ใ“ใฎใƒฉใ‚คใƒ–ใƒฉใƒชใ‚’ไฝฟ็”จใ™ใ‚‹ใ“ใจใซใ‚ˆใ‚Šใ€ๅœฐๅŸŸใŠใ‚ˆใณๅ›ฝ้š›็š„ใชใƒ‡ใƒผใ‚ฟใ‚นใ‚ฏใƒฌใ‚คใƒ”ใƒณใ‚ฐใŠใ‚ˆใณใƒ—ใƒฉใ‚คใƒใ‚ทใƒผๆณ•ใซๆบ–ๆ‹ ใ™ใ‚‹ใ“ใจใซๅŒๆ„ใ—ใŸใ‚‚ใฎใจใฟใชใ•ใ‚Œใพใ™ใ€‚่‘—่€…ใŠใ‚ˆใณ่ฒข็Œฎ่€…ใฏใ€ใ“ใฎใ‚ฝใƒ•ใƒˆใ‚ฆใ‚งใ‚ขใฎ่ชค็”จใซใคใ„ใฆ่ฒฌไปปใ‚’่ฒ ใ„ใพใ›ใ‚“ใ€‚ๅธธใซใ‚ฆใ‚งใƒ–ใ‚ตใ‚คใƒˆใฎๅˆฉ็”จ่ฆ็ด„ใจrobots.txtใƒ•ใ‚กใ‚คใƒซใ‚’ๅฐŠ้‡ใ—ใฆใใ ใ•ใ„ใ€‚ + +## ใƒฉใ‚คใ‚ปใƒณใ‚น + +ใ“ใฎไฝœๅ“ใฏBSD-3-Clauseใƒฉใ‚คใ‚ปใƒณใ‚นใฎไธ‹ใงใƒฉใ‚คใ‚ปใƒณใ‚นใ•ใ‚Œใฆใ„ใพใ™ใ€‚ + +## ่ฌ่พž + +ใ“ใฎใƒ—ใƒญใ‚ธใ‚งใ‚ฏใƒˆใซใฏๆฌกใ‹ใ‚‰้ฉๅฟœใ•ใ‚ŒใŸใ‚ณใƒผใƒ‰ใŒๅซใพใ‚Œใฆใ„ใพใ™๏ผš +- Parsel๏ผˆBSDใƒฉใ‚คใ‚ปใƒณใ‚น๏ผ‰โ€” [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)ใ‚ตใƒ–ใƒขใ‚ธใƒฅใƒผใƒซใซไฝฟ็”จ + +--- +
Karim Shoairใซใ‚ˆใฃใฆโค๏ธใงใƒ‡ใ‚ถใ‚คใƒณใŠใ‚ˆใณไฝœๆˆใ•ใ‚Œใพใ—ใŸใ€‚

diff --git a/docs/README_RU.md b/docs/README_RU.md new file mode 100644 index 0000000000000000000000000000000000000000..091244c7ddc08395c83b6a3f66e21498a02fec6d --- /dev/null +++ b/docs/README_RU.md @@ -0,0 +1,426 @@ + + +

+ + + + Scrapling Poster + + +
+ Effortless Web Scraping for the Modern Web +

+ +

+ + Tests + + PyPI version + + PyPI Downloads +
+ + Discord + + + X (formerly Twitter) Follow + +
+ + Supported Python versions +

+ +

+ ะœะตั‚ะพะดั‹ ะฒั‹ะฑะพั€ะฐ + · + ะ’ั‹ะฑะพั€ Fetcher + · + ะŸะฐัƒะบะธ + · + ะ ะพั‚ะฐั†ะธั ะฟั€ะพะบัะธ + · + CLI + · + ะ ะตะถะธะผ MCP +

+ +Scrapling โ€” ัั‚ะพ ะฐะดะฐะฟั‚ะธะฒะฝั‹ะน ั„ั€ะตะนะผะฒะพั€ะบ ะดะปั Web Scraping, ะบะพั‚ะพั€ั‹ะน ะฑะตั€ั‘ั‚ ะฝะฐ ัะตะฑั ะฒัั‘: ะพั‚ ะพะดะฝะพะณะพ ะทะฐะฟั€ะพัะฐ ะดะพ ะฟะพะปะฝะพะผะฐััˆั‚ะฐะฑะฝะพะณะพ ะพะฑั…ะพะดะฐ ัะฐะนั‚ะพะฒ. + +ะ•ะณะพ ะฟะฐั€ัะตั€ ัƒั‡ะธั‚ัั ะฝะฐ ะธะทะผะตะฝะตะฝะธัั… ัะฐะนั‚ะพะฒ ะธ ะฐะฒั‚ะพะผะฐั‚ะธั‡ะตัะบะธ ะฟะตั€ะตะผะตั‰ะฐะตั‚ ะฒะฐัˆะธ ัะปะตะผะตะฝั‚ั‹ ะฟั€ะธ ะพะฑะฝะพะฒะปะตะฝะธะธ ัั‚ั€ะฐะฝะธั†. ะ•ะณะพ Fetcher'ั‹ ะพะฑั…ะพะดัั‚ ะฐะฝั‚ะธ-ะฑะพั‚ ัะธัั‚ะตะผั‹ ะฒั€ะพะดะต Cloudflare Turnstile ะฟั€ัะผะพ ะธะท ะบะพั€ะพะฑะบะธ. ะ ะตะณะพ Spider-ั„ั€ะตะนะผะฒะพั€ะบ ะฟะพะทะฒะพะปัะตั‚ ะผะฐััˆั‚ะฐะฑะธั€ะพะฒะฐั‚ัŒัั ะดะพ ะฟะฐั€ะฐะปะปะตะปัŒะฝั‹ั…, ะผะฝะพะณะพัะตััะธะพะฝะฝั‹ั… ะพะฑั…ะพะดะพะฒ ั Pause & Resume ะธ ะฐะฒั‚ะพะผะฐั‚ะธั‡ะตัะบะพะน ั€ะพั‚ะฐั†ะธะตะน Proxy โ€” ะธ ะฒัั‘ ัั‚ะพ ะฒ ะฝะตัะบะพะปัŒะบะธั… ัั‚ั€ะพะบะฐั… Python. ะžะดะฝะฐ ะฑะธะฑะปะธะพั‚ะตะบะฐ, ะฑะตะท ะบะพะผะฟั€ะพะผะธััะพะฒ. + +ะœะพะปะฝะธะตะฝะพัะฝะพ ะฑั‹ัั‚ั€ั‹ะต ะพะฑั…ะพะดั‹ ั ะพั‚ัะปะตะถะธะฒะฐะฝะธะตะผ ัั‚ะฐั‚ะธัั‚ะธะบะธ ะฒ ั€ะตะฐะปัŒะฝะพะผ ะฒั€ะตะผะตะฝะธ ะธ Streaming. ะกะพะทะดะฐะฝะพ ะฒะตะฑ-ัะบั€ะฐะฟะตั€ะฐะผะธ ะดะปั ะฒะตะฑ-ัะบั€ะฐะฟะตั€ะพะฒ ะธ ะพะฑั‹ั‡ะฝั‹ั… ะฟะพะปัŒะทะพะฒะฐั‚ะตะปะตะน โ€” ะทะดะตััŒ ะตัั‚ัŒ ั‡ั‚ะพ-ั‚ะพ ะดะปั ะบะฐะถะดะพะณะพ. + +```python +from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher +StealthyFetcher.adaptive = True +p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # ะ—ะฐะณั€ัƒะทะธั‚ะต ัะฐะนั‚ ะฝะตะทะฐะผะตั‚ะฝะพ! +products = p.css('.product', auto_save=True) # ะกะบั€ะฐะฟัŒั‚ะต ะดะฐะฝะฝั‹ะต, ะบะพั‚ะพั€ั‹ะต ะฟะตั€ะตะถะธะฒัƒั‚ ะธะทะผะตะฝะตะฝะธั ะดะธะทะฐะนะฝะฐ ัะฐะนั‚ะฐ! +products = p.css('.product', adaptive=True) # ะŸะพะทะถะต, ะตัะปะธ ัั‚ั€ัƒะบั‚ัƒั€ะฐ ัะฐะนั‚ะฐ ะธะทะผะตะฝะธั‚ัั, ะฟะตั€ะตะดะฐะนั‚ะต `adaptive=True`, ั‡ั‚ะพะฑั‹ ะฝะฐะนั‚ะธ ะธั…! 
+``` +ะ˜ะปะธ ะผะฐััˆั‚ะฐะฑะธั€ัƒะนั‚ะต ะดะพ ะฟะพะปะฝะพะณะพ ะพะฑั…ะพะดะฐ +```python +from scrapling.spiders import Spider, Response + +class MySpider(Spider): + name = "demo" + start_urls = ["https://example.com/"] + + async def parse(self, response: Response): + for item in response.css('.product'): + yield {"title": item.css('h2::text').get()} + +MySpider().start() +``` + + +# ะŸะปะฐั‚ะธะฝะพะฒั‹ะต ัะฟะพะฝัะพั€ั‹ + +ะฅะพั‚ะธั‚ะต ัั‚ะฐั‚ัŒ ะฟะตั€ะฒะพะน ะบะพะผะฟะฐะฝะธะตะน, ะบะพั‚ะพั€ะฐั ะฟะพัะฒะธั‚ัั ะทะดะตััŒ? ะะฐะถะผะธั‚ะต [ะทะดะตััŒ](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646) +# ะกะฟะพะฝัะพั€ั‹ + + + + + + + + + + + + + + + + + + + + +ะฅะพั‚ะธั‚ะต ะฟะพะบะฐะทะฐั‚ัŒ ะทะดะตััŒ ัะฒะพัŽ ั€ะตะบะปะฐะผัƒ? ะะฐะถะผะธั‚ะต [ะทะดะตััŒ](https://github.com/sponsors/D4Vinci) ะธ ะฒั‹ะฑะตั€ะธั‚ะต ะฟะพะดั…ะพะดัั‰ะธะน ะฒะฐะผ ัƒั€ะพะฒะตะฝัŒ! + +--- + +## ะšะปัŽั‡ะตะฒั‹ะต ะพัะพะฑะตะฝะฝะพัั‚ะธ + +### Spider'ั‹ โ€” ะฟะพะปะฝะพั†ะตะฝะฝั‹ะน ั„ั€ะตะนะผะฒะพั€ะบ ะดะปั ะพะฑั…ะพะดะฐ ัะฐะนั‚ะพะฒ +- ๐Ÿ•ท๏ธ **Scrapy-ะฟะพะดะพะฑะฝั‹ะน Spider API**: ะžะฟั€ะตะดะตะปัะนั‚ะต Spider'ะพะฒ ั `start_urls`, async `parse` callback'ะฐะผะธ ะธ ะพะฑัŠะตะบั‚ะฐะผะธ `Request`/`Response`. +- โšก **ะŸะฐั€ะฐะปะปะตะปัŒะฝั‹ะน ะพะฑั…ะพะด**: ะะฐัั‚ั€ะฐะธะฒะฐะตะผั‹ะต ะปะธะผะธั‚ั‹ ะฟะฐั€ะฐะปะปะตะปะธะทะผะฐ, ะพะณั€ะฐะฝะธั‡ะตะฝะธะต ัะบะพั€ะพัั‚ะธ ะฟะพ ะดะพะผะตะฝัƒ ะธ ะทะฐะดะตั€ะถะบะธ ะทะฐะณั€ัƒะทะบะธ. +- ๐Ÿ”„ **ะŸะพะดะดะตั€ะถะบะฐ ะฝะตัะบะพะปัŒะบะธั… ัะตััะธะน**: ะ•ะดะธะฝั‹ะน ะธะฝั‚ะตั€ั„ะตะนั ะดะปั HTTP-ะทะฐะฟั€ะพัะพะฒ ะธ ัะบั€ั‹ั‚ะฝั‹ั… headless-ะฑั€ะฐัƒะทะตั€ะพะฒ ะฒ ะพะดะฝะพะผ Spider โ€” ะผะฐั€ัˆั€ัƒั‚ะธะทะธั€ัƒะนั‚ะต ะทะฐะฟั€ะพัั‹ ะบ ั€ะฐะทะฝั‹ะผ ัะตััะธัะผ ะฟะพ ID. +- ๐Ÿ’พ **Pause & Resume**: Persistence ะพะฑั…ะพะดะฐ ะฝะฐ ะพัะฝะพะฒะต Checkpoint'ะพะฒ. ะะฐะถะผะธั‚ะต Ctrl+C ะดะปั ะผัะณะบะพะน ะพัั‚ะฐะฝะพะฒะบะธ; ะฟะตั€ะตะทะฐะฟัƒัั‚ะธั‚ะต, ั‡ั‚ะพะฑั‹ ะฟั€ะพะดะพะปะถะธั‚ัŒ ั ั‚ะพะณะพ ะผะตัั‚ะฐ, ะณะดะต ะฒั‹ ะพัั‚ะฐะฝะพะฒะธะปะธััŒ. 
+- ๐Ÿ“ก **ะ ะตะถะธะผ Streaming**: ะกั‚ั€ะธะผัŒั‚ะต ะธะทะฒะปะตั‡ั‘ะฝะฝั‹ะต ัะปะตะผะตะฝั‚ั‹ ะฟะพ ะผะตั€ะต ะธั… ะฟะพัั‚ัƒะฟะปะตะฝะธั ั‡ะตั€ะตะท `async for item in spider.stream()` ัะพ ัั‚ะฐั‚ะธัั‚ะธะบะพะน ะฒ ั€ะตะฐะปัŒะฝะพะผ ะฒั€ะตะผะตะฝะธ โ€” ะธะดะตะฐะปัŒะฝะพ ะดะปั UI, ะบะพะฝะฒะตะนะตั€ะพะฒ ะธ ะดะปะธั‚ะตะปัŒะฝั‹ั… ะพะฑั…ะพะดะพะฒ. +- ๐Ÿ›ก๏ธ **ะžะฑะฝะฐั€ัƒะถะตะฝะธะต ะทะฐะฑะปะพะบะธั€ะพะฒะฐะฝะฝั‹ั… ะทะฐะฟั€ะพัะพะฒ**: ะะฒั‚ะพะผะฐั‚ะธั‡ะตัะบะพะต ะพะฑะฝะฐั€ัƒะถะตะฝะธะต ะธ ะฟะพะฒั‚ะพั€ะฝะฐั ะพั‚ะฟั€ะฐะฒะบะฐ ะทะฐะฑะปะพะบะธั€ะพะฒะฐะฝะฝั‹ั… ะทะฐะฟั€ะพัะพะฒ ั ะฝะฐัั‚ั€ะฐะธะฒะฐะตะผะพะน ะปะพะณะธะบะพะน. +- ๐Ÿ“ฆ **ะ’ัั‚ั€ะพะตะฝะฝั‹ะน ัะบัะฟะพั€ั‚**: ะญะบัะฟะพั€ั‚ะธั€ัƒะนั‚ะต ั€ะตะทัƒะปัŒั‚ะฐั‚ั‹ ั‡ะตั€ะตะท ั…ัƒะบะธ ะธ ัะพะฑัั‚ะฒะตะฝะฝั‹ะน ะบะพะฝะฒะตะนะตั€ ะธะปะธ ะฒัั‚ั€ะพะตะฝะฝั‹ะน JSON/JSONL ั `result.items.to_json()` / `result.items.to_jsonl()` ัะพะพั‚ะฒะตั‚ัั‚ะฒะตะฝะฝะพ. + +### ะŸั€ะพะดะฒะธะฝัƒั‚ะฐั ะทะฐะณั€ัƒะทะบะฐ ัะฐะนั‚ะพะฒ ั ะฟะพะดะดะตั€ะถะบะพะน Session +- **HTTP-ะทะฐะฟั€ะพัั‹**: ะ‘ั‹ัั‚ั€ั‹ะต ะธ ัะบั€ั‹ั‚ะฝั‹ะต HTTP-ะทะฐะฟั€ะพัั‹ ั ะบะปะฐััะพะผ `Fetcher`. ะœะพะถะตั‚ ะธะผะธั‚ะธั€ะพะฒะฐั‚ัŒ TLS fingerprint ะฑั€ะฐัƒะทะตั€ะฐ, ะทะฐะณะพะปะพะฒะบะธ ะธ ะธัะฟะพะปัŒะทะพะฒะฐั‚ัŒ HTTP/3. +- **ะ”ะธะฝะฐะผะธั‡ะตัะบะฐั ะทะฐะณั€ัƒะทะบะฐ**: ะ—ะฐะณั€ัƒะทะบะฐ ะดะธะฝะฐะผะธั‡ะตัะบะธั… ัะฐะนั‚ะพะฒ ั ะฟะพะปะฝะพะน ะฐะฒั‚ะพะผะฐั‚ะธะทะฐั†ะธะตะน ะฑั€ะฐัƒะทะตั€ะฐ ั‡ะตั€ะตะท ะบะปะฐัั `DynamicFetcher`, ะฟะพะดะดะตั€ะถะธะฒะฐัŽั‰ะธะน Chromium ะพั‚ Playwright ะธ Google Chrome. +- **ะžะฑั…ะพะด ะฐะฝั‚ะธ-ะฑะพั‚ะพะฒ**: ะ ะฐััˆะธั€ะตะฝะฝั‹ะต ะฒะพะทะผะพะถะฝะพัั‚ะธ ัะบั€ั‹ั‚ะฝะพัั‚ะธ ั `StealthyFetcher` ะธ ะฟะพะดะผะตะฝัƒ fingerprint'ะพะฒ. ะœะพะถะตั‚ ะปะตะณะบะพ ะพะฑะพะนั‚ะธ ะฒัะต ั‚ะธะฟั‹ Cloudflare Turnstile/Interstitial ั ะฟะพะผะพั‰ัŒัŽ ะฐะฒั‚ะพะผะฐั‚ะธะทะฐั†ะธะธ. 
+- **ะฃะฟั€ะฐะฒะปะตะฝะธะต ัะตััะธัะผะธ**: ะŸะพะดะดะตั€ะถะบะฐ ะฟะพัั‚ะพัะฝะฝั‹ั… ัะตััะธะน ั ะบะปะฐััะฐะผะธ `FetcherSession`, `StealthySession` ะธ `DynamicSession` ะดะปั ัƒะฟั€ะฐะฒะปะตะฝะธั cookie ะธ ัะพัั‚ะพัะฝะธะตะผ ะผะตะถะดัƒ ะทะฐะฟั€ะพัะฐะผะธ. +- **ะ ะพั‚ะฐั†ะธั Proxy**: ะ’ัั‚ั€ะพะตะฝะฝั‹ะน `ProxyRotator` ั ั†ะธะบะปะธั‡ะตัะบะพะน ะธะปะธ ะฟะพะปัŒะทะพะฒะฐั‚ะตะปัŒัะบะธะผะธ ัั‚ั€ะฐั‚ะตะณะธัะผะธ ะดะปั ะฒัะตั… ั‚ะธะฟะพะฒ ัะตััะธะน, ะฐ ั‚ะฐะบะถะต ะฟะตั€ะตะพะฟั€ะตะดะตะปะตะฝะธะต Proxy ะดะปั ะบะฐะถะดะพะณะพ ะทะฐะฟั€ะพัะฐ. +- **ะ‘ะปะพะบะธั€ะพะฒะบะฐ ะดะพะผะตะฝะพะฒ**: ะ‘ะปะพะบะธั€ัƒะนั‚ะต ะทะฐะฟั€ะพัั‹ ะบ ะพะฟั€ะตะดะตะปั‘ะฝะฝั‹ะผ ะดะพะผะตะฝะฐะผ (ะธ ะธั… ะฟะพะดะดะพะผะตะฝะฐะผ) ะฒ ะฑั€ะฐัƒะทะตั€ะฝั‹ั… Fetcher'ะฐั…. +- **ะŸะพะดะดะตั€ะถะบะฐ async**: ะŸะพะปะฝะฐั async-ะฟะพะดะดะตั€ะถะบะฐ ะฒะพ ะฒัะตั… Fetcher'ะฐั… ะธ ะฒั‹ะดะตะปะตะฝะฝั‹ั… async-ะบะปะฐััะฐั… ัะตััะธะน. + +### ะะดะฐะฟั‚ะธะฒะฝั‹ะน ัะบั€ะฐะฟะธะฝะณ ะธ ะธะฝั‚ะตะณั€ะฐั†ะธั ั ะ˜ะ˜ +- ๐Ÿ”„ **ะฃะผะฝะพะต ะพั‚ัะปะตะถะธะฒะฐะฝะธะต ัะปะตะผะตะฝั‚ะพะฒ**: ะŸะตั€ะตะผะตั‰ะฐะนั‚ะต ัะปะตะผะตะฝั‚ั‹ ะฟะพัะปะต ะธะทะผะตะฝะตะฝะธะน ัะฐะนั‚ะฐ ั ะฟะพะผะพั‰ัŒัŽ ะธะฝั‚ะตะปะปะตะบั‚ัƒะฐะปัŒะฝั‹ั… ะฐะปะณะพั€ะธั‚ะผะพะฒ ะฟะพะดะพะฑะธั. +- ๐ŸŽฏ **ะฃะผะฝั‹ะน ะณะธะฑะบะธะน ะฒั‹ะฑะพั€**: CSS-ัะตะปะตะบั‚ะพั€ั‹, XPath-ัะตะปะตะบั‚ะพั€ั‹, ะฟะพะธัะบ ะฝะฐ ะพัะฝะพะฒะต ั„ะธะปัŒั‚ั€ะพะฒ, ั‚ะตะบัั‚ะพะฒั‹ะน ะฟะพะธัะบ, ะฟะพะธัะบ ะฟะพ ั€ะตะณัƒะปัั€ะฝั‹ะผ ะฒั‹ั€ะฐะถะตะฝะธัะผ ะธ ะผะฝะพะณะพะต ะดั€ัƒะณะพะต. +- ๐Ÿ” **ะŸะพะธัะบ ะฟะพั…ะพะถะธั… ัะปะตะผะตะฝั‚ะพะฒ**: ะะฒั‚ะพะผะฐั‚ะธั‡ะตัะบะธ ะฝะฐั…ะพะดะธั‚ะต ัะปะตะผะตะฝั‚ั‹, ะฟะพั…ะพะถะธะต ะฝะฐ ะฝะฐะนะดะตะฝะฝั‹ะต. +- ๐Ÿค– **MCP-ัะตั€ะฒะตั€ ะดะปั ะธัะฟะพะปัŒะทะพะฒะฐะฝะธั ั ะ˜ะ˜**: ะ’ัั‚ั€ะพะตะฝะฝั‹ะน MCP-ัะตั€ะฒะตั€ ะดะปั Web Scraping ั ะฟะพะผะพั‰ัŒัŽ ะ˜ะ˜ ะธ ะธะทะฒะปะตั‡ะตะฝะธั ะดะฐะฝะฝั‹ั…. 
MCP-ัะตั€ะฒะตั€ ะพะฑะปะฐะดะฐะตั‚ ะผะพั‰ะฝั‹ะผะธ ะฟะพะปัŒะทะพะฒะฐั‚ะตะปัŒัะบะธะผะธ ะฒะพะทะผะพะถะฝะพัั‚ัะผะธ, ะบะพั‚ะพั€ั‹ะต ะธัะฟะพะปัŒะทัƒัŽั‚ Scrapling ะดะปั ะธะทะฒะปะตั‡ะตะฝะธั ั†ะตะปะตะฒะพะณะพ ะบะพะฝั‚ะตะฝั‚ะฐ ะฟะตั€ะตะด ะฟะตั€ะตะดะฐั‡ะตะน ะตะณะพ ะ˜ะ˜ (Claude/Cursor/ะธ ั‚.ะด.), ั‚ะตะผ ัะฐะผั‹ะผ ัƒัะบะพั€ัั ะพะฟะตั€ะฐั†ะธะธ ะธ ัะฝะธะถะฐั ะทะฐั‚ั€ะฐั‚ั‹ ะทะฐ ัั‡ั‘ั‚ ะผะธะฝะธะผะธะทะฐั†ะธะธ ะธัะฟะพะปัŒะทะพะฒะฐะฝะธั ั‚ะพะบะตะฝะพะฒ. ([ะดะตะผะพ-ะฒะธะดะตะพ](https://www.youtube.com/watch?v=qyFk3ZNwOxE)) + +### ะ’ั‹ัะพะบะพะฟั€ะพะธะทะฒะพะดะธั‚ะตะปัŒะฝะฐั ะธ ะฟั€ะพะฒะตั€ะตะฝะฝะฐั ะฒ ะฑะพัั… ะฐั€ั…ะธั‚ะตะบั‚ัƒั€ะฐ +- ๐Ÿš€ **ะœะพะปะฝะธะตะฝะพัะฝะฐั ัะบะพั€ะพัั‚ัŒ**: ะžะฟั‚ะธะผะธะทะธั€ะพะฒะฐะฝะฝะฐั ะฟั€ะพะธะทะฒะพะดะธั‚ะตะปัŒะฝะพัั‚ัŒ, ะฟั€ะตะฒะพัั…ะพะดัั‰ะฐั ะฑะพะปัŒัˆะธะฝัั‚ะฒะพ Python-ะฑะธะฑะปะธะพั‚ะตะบ ะดะปั ัะบั€ะฐะฟะธะฝะณะฐ. +- ๐Ÿ”‹ **ะญั„ั„ะตะบั‚ะธะฒะฝะพะต ะธัะฟะพะปัŒะทะพะฒะฐะฝะธะต ะฟะฐะผัั‚ะธ**: ะžะฟั‚ะธะผะธะทะธั€ะพะฒะฐะฝะฝั‹ะต ัั‚ั€ัƒะบั‚ัƒั€ั‹ ะดะฐะฝะฝั‹ั… ะธ ะปะตะฝะธะฒะฐั ะทะฐะณั€ัƒะทะบะฐ ะดะปั ะผะธะฝะธะผะฐะปัŒะฝะพะณะพ ะฟะพั‚ั€ะตะฑะปะตะฝะธั ะฟะฐะผัั‚ะธ. +- โšก **ะ‘ั‹ัั‚ั€ะฐั ัะตั€ะธะฐะปะธะทะฐั†ะธั JSON**: ะ’ 10 ั€ะฐะท ะฑั‹ัั‚ั€ะตะต ัั‚ะฐะฝะดะฐั€ั‚ะฝะพะน ะฑะธะฑะปะธะพั‚ะตะบะธ. +- ๐Ÿ—๏ธ **ะŸั€ะพะฒะตั€ะตะฝะพ ะฒ ะฑะพัั…**: Scrapling ะธะผะตะตั‚ ะฝะต ั‚ะพะปัŒะบะพ 92% ะฟะพะบั€ั‹ั‚ะธั ั‚ะตัั‚ะฐะผะธ ะธ ะฟะพะปะฝะพะต ะฟะพะบั€ั‹ั‚ะธะต type hints, ะฝะพ ะธ ะตะถะตะดะฝะตะฒะฝะพ ะธัะฟะพะปัŒะทะพะฒะฐะปัั ัะพั‚ะฝัะผะธ ะฒะตะฑ-ัะบั€ะฐะฟะตั€ะพะฒ ะฒ ั‚ะตั‡ะตะฝะธะต ะฟะพัะปะตะดะฝะตะณะพ ะณะพะดะฐ. 
+
+### Удобство для разработчиков и веб-скраперов
+- 🎯 **Интерактивная Web Scraping Shell**: Опциональная встроенная IPython-оболочка с интеграцией Scrapling, шорткатами и новыми инструментами для ускорения разработки скриптов Web Scraping, такими как преобразование curl-запросов в запросы Scrapling и просмотр результатов запросов в браузере.
+- 🚀 **Используйте прямо из терминала**: При желании вы можете использовать Scrapling для скрапинга URL, не написав ни одной строки кода!
+- 🛠️ **Богатый API навигации**: Расширенный обход DOM с методами навигации по родительским, соседним и дочерним элементам.
+- 🧬 **Улучшенная обработка текста**: Встроенные регулярные выражения, методы очистки и оптимизированные операции со строками.
+- 📝 **Автоматическая генерация селекторов**: Генерация надёжных CSS/XPath-селекторов для любого элемента.
+- 🔌 **Знакомый API**: Дизайн, похожий на Scrapy/BeautifulSoup, с теми же псевдоэлементами, что и в Scrapy/Parsel.
+- 📘 **Полное покрытие типами**: Полные type hints для отличной поддержки IDE и автодополнения кода. Вся кодовая база автоматически проверяется **PyRight** и **MyPy** при каждом изменении. 
+- ๐Ÿ”‹ **ะ“ะพั‚ะพะฒั‹ะน Docker-ะพะฑั€ะฐะท**: ะก ะบะฐะถะดั‹ะผ ั€ะตะปะธะทะพะผ ะฐะฒั‚ะพะผะฐั‚ะธั‡ะตัะบะธ ัะพะทะดะฐั‘ั‚ัั ะธ ะฟัƒะฑะปะธะบัƒะตั‚ัั Docker-ะพะฑั€ะฐะท, ัะพะดะตั€ะถะฐั‰ะธะน ะฒัะต ะฑั€ะฐัƒะทะตั€ั‹. + +## ะะฐั‡ะฐะปะพ ั€ะฐะฑะพั‚ั‹ + +ะ”ะฐะฒะฐะนั‚ะต ะบั€ะฐั‚ะบะพ ะฟะพะบะฐะถะตะผ, ะฝะฐ ั‡ั‚ะพ ัะฟะพัะพะฑะตะฝ Scrapling, ะฑะตะท ะณะปัƒะฑะพะบะพะณะพ ะฟะพะณั€ัƒะถะตะฝะธั. + +### ะ‘ะฐะทะพะฒะพะต ะธัะฟะพะปัŒะทะพะฒะฐะฝะธะต +HTTP-ะทะฐะฟั€ะพัั‹ ั ะฟะพะดะดะตั€ะถะบะพะน Session +```python +from scrapling.fetchers import Fetcher, FetcherSession + +with FetcherSession(impersonate='chrome') as session: # ะ˜ัะฟะพะปัŒะทัƒะนั‚ะต ะฟะพัะปะตะดะฝัŽัŽ ะฒะตั€ัะธัŽ TLS fingerprint Chrome + page = session.get('https://quotes.toscrape.com/', stealthy_headers=True) + quotes = page.css('.quote .text::text').getall() + +# ะ˜ะปะธ ะธัะฟะพะปัŒะทัƒะนั‚ะต ะพะดะฝะพั€ะฐะทะพะฒั‹ะต ะทะฐะฟั€ะพัั‹ +page = Fetcher.get('https://quotes.toscrape.com/') +quotes = page.css('.quote .text::text').getall() +``` +ะ ะฐััˆะธั€ะตะฝะฝั‹ะน ั€ะตะถะธะผ ัะบั€ั‹ั‚ะฝะพัั‚ะธ +```python +from scrapling.fetchers import StealthyFetcher, StealthySession + +with StealthySession(headless=True, solve_cloudflare=True) as session: # ะ”ะตั€ะถะธั‚ะต ะฑั€ะฐัƒะทะตั€ ะพั‚ะบั€ั‹ั‚ั‹ะผ, ะฟะพะบะฐ ะฝะต ะทะฐะบะพะฝั‡ะธั‚ะต + page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False) + data = page.css('#padded_content a').getall() + +# ะ˜ะปะธ ะธัะฟะพะปัŒะทัƒะนั‚ะต ัั‚ะธะปัŒ ะพะดะฝะพั€ะฐะทะพะฒะพะณะพ ะทะฐะฟั€ะพัะฐ โ€” ะพั‚ะบั€ั‹ะฒะฐะตั‚ ะฑั€ะฐัƒะทะตั€ ะดะปั ัั‚ะพะณะพ ะทะฐะฟั€ะพัะฐ, ะทะฐั‚ะตะผ ะทะฐะบั€ั‹ะฒะฐะตั‚ ะตะณะพ ะฟะพัะปะต ะทะฐะฒะตั€ัˆะตะฝะธั +page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare') +data = page.css('#padded_content a').getall() +``` +ะŸะพะปะฝะฐั ะฐะฒั‚ะพะผะฐั‚ะธะทะฐั†ะธั ะฑั€ะฐัƒะทะตั€ะฐ +```python +from scrapling.fetchers import DynamicFetcher, DynamicSession + +with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # 
ะ”ะตั€ะถะธั‚ะต ะฑั€ะฐัƒะทะตั€ ะพั‚ะบั€ั‹ั‚ั‹ะผ, ะฟะพะบะฐ ะฝะต ะทะฐะบะพะฝั‡ะธั‚ะต + page = session.fetch('https://quotes.toscrape.com/', load_dom=False) + data = page.xpath('//span[@class="text"]/text()').getall() # XPath-ัะตะปะตะบั‚ะพั€, ะตัะปะธ ะฒั‹ ะฟั€ะตะดะฟะพั‡ะธั‚ะฐะตั‚ะต ะตะณะพ + +# ะ˜ะปะธ ะธัะฟะพะปัŒะทัƒะนั‚ะต ัั‚ะธะปัŒ ะพะดะฝะพั€ะฐะทะพะฒะพะณะพ ะทะฐะฟั€ะพัะฐ โ€” ะพั‚ะบั€ั‹ะฒะฐะตั‚ ะฑั€ะฐัƒะทะตั€ ะดะปั ัั‚ะพะณะพ ะทะฐะฟั€ะพัะฐ, ะทะฐั‚ะตะผ ะทะฐะบั€ั‹ะฒะฐะตั‚ ะตะณะพ ะฟะพัะปะต ะทะฐะฒะตั€ัˆะตะฝะธั +page = DynamicFetcher.fetch('https://quotes.toscrape.com/') +data = page.css('.quote .text::text').getall() +``` + +### Spider'ั‹ +ะกะพะทะดะฐะฒะฐะนั‚ะต ะฟะพะปะฝะพั†ะตะฝะฝั‹ะต ะพะฑั…ะพะดั‡ะธะบะธ ั ะฟะฐั€ะฐะปะปะตะปัŒะฝั‹ะผะธ ะทะฐะฟั€ะพัะฐะผะธ, ะฝะตัะบะพะปัŒะบะธะผะธ ั‚ะธะฟะฐะผะธ ัะตััะธะน ะธ Pause & Resume: +```python +from scrapling.spiders import Spider, Request, Response + +class QuotesSpider(Spider): + name = "quotes" + start_urls = ["https://quotes.toscrape.com/"] + concurrent_requests = 10 + + async def parse(self, response: Response): + for quote in response.css('.quote'): + yield { + "text": quote.css('.text::text').get(), + "author": quote.css('.author::text').get(), + } + + next_page = response.css('.next a') + if next_page: + yield response.follow(next_page[0].attrib['href']) + +result = QuotesSpider().start() +print(f"ะ˜ะทะฒะปะตั‡ะตะฝะพ {len(result.items)} ั†ะธั‚ะฐั‚") +result.items.to_json("quotes.json") +``` +ะ˜ัะฟะพะปัŒะทัƒะนั‚ะต ะฝะตัะบะพะปัŒะบะพ ั‚ะธะฟะพะฒ ัะตััะธะน ะฒ ะพะดะฝะพะผ Spider: +```python +from scrapling.spiders import Spider, Request, Response +from scrapling.fetchers import FetcherSession, AsyncStealthySession + +class MultiSessionSpider(Spider): + name = "multi" + start_urls = ["https://example.com/"] + + def configure_sessions(self, manager): + manager.add("fast", FetcherSession(impersonate="chrome")) + manager.add("stealth", AsyncStealthySession(headless=True), lazy=True) + + async def parse(self, response: Response): + 
for link in response.css('a::attr(href)').getall(): + # Route protected pages through the stealth session + if "protected" in link: + yield Request(link, sid="stealth") + else: + yield Request(link, sid="fast", callback=self.parse) # explicit callback +``` +Pause and resume long crawls with checkpoints by launching the Spider like this: +```python +QuotesSpider(crawldir="./crawl_data").start() +``` +Press Ctrl+C for a graceful stop; progress is saved automatically. Later, when you run the Spider again, pass the same `crawldir` and it will resume from where it left off. + +### Advanced parsing and navigation +```python +from scrapling.fetchers import Fetcher + +# Rich element selection and navigation +page = Fetcher.get('https://quotes.toscrape.com/') + +# Get quotes with different selection methods +quotes = page.css('.quote') # CSS selector +quotes = page.xpath('//div[@class="quote"]') # XPath +quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup style +# The same as +quotes = page.find_all('div', class_='quote') +quotes = page.find_all(['div'], class_='quote') +quotes = page.find_all(class_='quote') # and so on...
+# ะะฐะนั‚ะธ ัะปะตะผะตะฝั‚ ะฟะพ ั‚ะตะบัั‚ะพะฒะพะผัƒ ัะพะดะตั€ะถะธะผะพะผัƒ +quotes = page.find_by_text('quote', tag='div') + +# ะŸั€ะพะดะฒะธะฝัƒั‚ะฐั ะฝะฐะฒะธะณะฐั†ะธั +quote_text = page.css('.quote')[0].css('.text::text').get() +quote_text = page.css('.quote').css('.text::text').getall() # ะฆะตะฟะพั‡ะบะฐ ัะตะปะตะบั‚ะพั€ะพะฒ +first_quote = page.css('.quote')[0] +author = first_quote.next_sibling.css('.author::text') +parent_container = first_quote.parent + +# ะกะฒัะทะธ ัะปะตะผะตะฝั‚ะพะฒ ะธ ะฟะพะดะพะฑะธะต +similar_elements = first_quote.find_similar() +below_elements = first_quote.below_elements() +``` +ะ’ั‹ ะผะพะถะตั‚ะต ะธัะฟะพะปัŒะทะพะฒะฐั‚ัŒ ะฟะฐั€ัะตั€ ะฝะฐะฟั€ัะผัƒัŽ, ะตัะปะธ ะฝะต ั…ะพั‚ะธั‚ะต ะทะฐะณั€ัƒะถะฐั‚ัŒ ัะฐะนั‚ั‹, ะบะฐะบ ะฟะพะบะฐะทะฐะฝะพ ะฝะธะถะต: +```python +from scrapling.parser import Selector + +page = Selector("...") +``` +ะ˜ ะพะฝ ั€ะฐะฑะพั‚ะฐะตั‚ ั‚ะพั‡ะฝะพ ั‚ะฐะบ ะถะต! + +### ะŸั€ะธะผะตั€ั‹ async Session +```python +import asyncio +from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession + +async with FetcherSession(http3=True) as session: # `FetcherSession` ะบะพะฝั‚ะตะบัั‚ะฝะพ-ะพัะฒะตะดะพะผะปั‘ะฝ ะธ ะผะพะถะตั‚ ั€ะฐะฑะพั‚ะฐั‚ัŒ ะบะฐะบ ะฒ sync, ั‚ะฐะบ ะธ ะฒ async-ั€ะตะถะธะผะฐั… + page1 = session.get('https://quotes.toscrape.com/') + page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135') + +# ะ˜ัะฟะพะปัŒะทะพะฒะฐะฝะธะต async-ัะตััะธะธ +async with AsyncStealthySession(max_pages=2) as session: + tasks = [] + urls = ['https://example.com/page1', 'https://example.com/page2'] + + for url in urls: + task = session.fetch(url) + tasks.append(task) + + print(session.get_pool_stats()) # ะžะฟั†ะธะพะฝะฐะปัŒะฝะพ โ€” ัั‚ะฐั‚ัƒั ะฟัƒะปะฐ ะฒะบะปะฐะดะพะบ ะฑั€ะฐัƒะทะตั€ะฐ (ะทะฐะฝัั‚/ัะฒะพะฑะพะดะตะฝ/ะพัˆะธะฑะบะฐ) + results = await asyncio.gather(*tasks) + print(session.get_pool_stats()) +``` + +## CLI ะธ ะธะฝั‚ะตั€ะฐะบั‚ะธะฒะฝะฐั Shell + +Scrapling ะฒะบะปัŽั‡ะฐะตั‚ ะผะพั‰ะฝั‹ะน 
command-line interface: + +[![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339) + +Launch the interactive Web Scraping Shell: +```bash +scrapling shell +``` +Extract pages to a file directly without writing any code (by default, it extracts the content inside the `body` tag). If the output file ends with `.txt`, the target's text content is extracted. If it ends with `.md`, you get a Markdown representation of the HTML content; if it ends with `.html`, you get the HTML content itself. +```bash +scrapling extract get 'https://example.com' content.md +scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # All elements matching the CSS selector '#fromSkipToProducts' +scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless +scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare +``` + +> [!NOTE] +> There are many more features, including the MCP server and the interactive Web Scraping Shell, but we want to keep this page concise. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/) + +## Performance Benchmarks + +Scrapling isn't just powerful; it's also blazing fast.
The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries. + +### Text Extraction Speed Test (5000 nested elements) + +| # | Library | Time (ms) | vs Scrapling | +|---|:-----------------:|:----------:|:------------:| +| 1 | Scrapling | 2.02 | 1.0x | +| 2 | Parsel/Scrapy | 2.04 | 1.01x | +| 3 | Raw Lxml | 2.54 | 1.257x | +| 4 | PyQuery | 24.17 | ~12x | +| 5 | Selectolax | 82.63 | ~41x | +| 6 | MechanicalSoup | 1549.71 | ~767.1x | +| 7 | BS4 with Lxml | 1584.31 | ~784.3x | +| 8 | BS4 with html5lib | 3391.91 | ~1679.1x | + + +### Element Similarity & Text Search Performance + +Scrapling's adaptive element-finding capabilities significantly outperform alternatives: + +| Library | Time (ms) | vs Scrapling | +|-------------|:----------:|:------------:| +| Scrapling | 2.39 | 1.0x | +| AutoScraper | 12.45 | 5.209x | + + +> All benchmarks represent averages of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology. + +## Installation + +Scrapling requires Python 3.10 or higher: + +```bash +pip install scrapling +``` + +This installation includes only the parser engine and its dependencies, without any fetchers or command-line dependencies. + +### Optional dependencies + +1.
ะ•ัะปะธ ะฒั‹ ัะพะฑะธั€ะฐะตั‚ะตััŒ ะธัะฟะพะปัŒะทะพะฒะฐั‚ัŒ ะบะฐะบะธะต-ะปะธะฑะพ ะธะท ะดะพะฟะพะปะฝะธั‚ะตะปัŒะฝั‹ั… ะฒะพะทะผะพะถะฝะพัั‚ะตะน ะฝะธะถะต, Fetcher'ั‹ ะธะปะธ ะธั… ะบะปะฐััั‹, ะฒะฐะผ ะฝะตะพะฑั…ะพะดะธะผะพ ัƒัั‚ะฐะฝะพะฒะธั‚ัŒ ะทะฐะฒะธัะธะผะพัั‚ะธ Fetcher'ะพะฒ ะธ ะฑั€ะฐัƒะทะตั€ะพะฒ ัะปะตะดัƒัŽั‰ะธะผ ะพะฑั€ะฐะทะพะผ: + ```bash + pip install "scrapling[fetchers]" + + scrapling install # normal install + scrapling install --force # force reinstall + ``` + + ะญั‚ะพ ะทะฐะณั€ัƒะทะธั‚ ะฒัะต ะฑั€ะฐัƒะทะตั€ั‹ ะฒะผะตัั‚ะต ั ะธั… ัะธัั‚ะตะผะฝั‹ะผะธ ะทะฐะฒะธัะธะผะพัั‚ัะผะธ ะธ ะทะฐะฒะธัะธะผะพัั‚ัะผะธ ะดะปั ะผะฐะฝะธะฟัƒะปัั†ะธะธ fingerprint'ะฐะผะธ. + + ะ˜ะปะธ ะฒั‹ ะผะพะถะตั‚ะต ัƒัั‚ะฐะฝะพะฒะธั‚ัŒ ะธั… ะธะท ะบะพะดะฐ ะฒะผะตัั‚ะพ ะฒั‹ะฟะพะปะฝะตะฝะธั ะบะพะผะฐะฝะดั‹: + ```python + from scrapling.cli import install + + install([], standalone_mode=False) # normal install + install(["--force"], standalone_mode=False) # force reinstall + ``` + +2. ะ”ะพะฟะพะปะฝะธั‚ะตะปัŒะฝั‹ะต ะฒะพะทะผะพะถะฝะพัั‚ะธ: + - ะฃัั‚ะฐะฝะพะฒะธั‚ัŒ ั„ัƒะฝะบั†ะธัŽ MCP-ัะตั€ะฒะตั€ะฐ: + ```bash + pip install "scrapling[ai]" + ``` + - ะฃัั‚ะฐะฝะพะฒะธั‚ัŒ ั„ัƒะฝะบั†ะธะธ Shell (Web Scraping Shell ะธ ะบะพะผะฐะฝะดะฐ `extract`): + ```bash + pip install "scrapling[shell]" + ``` + - ะฃัั‚ะฐะฝะพะฒะธั‚ัŒ ะฒัั‘: + ```bash + pip install "scrapling[all]" + ``` + ะŸะพะผะฝะธั‚ะต, ั‡ั‚ะพ ะฒะฐะผ ะฝัƒะถะฝะพ ัƒัั‚ะฐะฝะพะฒะธั‚ัŒ ะทะฐะฒะธัะธะผะพัั‚ะธ ะฑั€ะฐัƒะทะตั€ะพะฒ ั ะฟะพะผะพั‰ัŒัŽ `scrapling install` ะฟะพัะปะต ะปัŽะฑะพะณะพ ะธะท ัั‚ะธั… ะดะพะฟะพะปะฝะตะฝะธะน (ะตัะปะธ ะฒั‹ ะตั‰ั‘ ัั‚ะพะณะพ ะฝะต ัะดะตะปะฐะปะธ) + +### Docker +ะ’ั‹ ั‚ะฐะบะถะต ะผะพะถะตั‚ะต ัƒัั‚ะฐะฝะพะฒะธั‚ัŒ Docker-ะพะฑั€ะฐะท ัะพ ะฒัะตะผะธ ะดะพะฟะพะปะฝะตะฝะธัะผะธ ะธ ะฑั€ะฐัƒะทะตั€ะฐะผะธ ั ะฟะพะผะพั‰ัŒัŽ ัะปะตะดัƒัŽั‰ะตะน ะบะพะผะฐะฝะดั‹ ะธะท DockerHub: +```bash +docker pull pyd4vinci/scrapling +``` +ะ˜ะปะธ ัะบะฐั‡ะฐะนั‚ะต ะตะณะพ ะธะท ั€ะตะตัั‚ั€ะฐ GitHub: +```bash +docker pull ghcr.io/d4vinci/scrapling:latest +``` 
+This image is built and published automatically through GitHub Actions from the repository's main branch. + +## Contributing + +We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started. + +## Disclaimer + +> [!CAUTION] +> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data-scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files. + +## License + +This work is licensed under the BSD-3-Clause license. + +## Acknowledgments + +This project includes code adapted from: +- Parsel (BSD license): used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule + +--- +
Designed & crafted with ❤️ by Karim Shoair.

diff --git a/docs/ai/mcp-server.md b/docs/ai/mcp-server.md new file mode 100644 index 0000000000000000000000000000000000000000..bf59d66c0e6f45fbc2acec471dfbc0855cc8ca73 --- /dev/null +++ b/docs/ai/mcp-server.md @@ -0,0 +1,294 @@ +# Scrapling MCP Server Guide + + + +The **Scrapling MCP Server** is a new feature that brings Scrapling's powerful Web Scraping capabilities directly to your favorite AI chatbot or AI agent. This integration allows you to scrape websites, extract data, and bypass anti-bot protections conversationally through Claude's AI interface or any interface that supports MCP. + +## Features + +The Scrapling MCP Server provides six powerful tools for web scraping: + +### 🚀 Basic HTTP Scraping +- **`get`**: Fast HTTP requests with browser fingerprint impersonation, generating real browser headers matching the TLS version, HTTP/3, and more! +- **`bulk_get`**: An async version of the above tool that allows scraping of multiple URLs at the same time! + +### 🌐 Dynamic Content Scraping +- **`fetch`**: Rapidly fetch dynamic content with a Chromium/Chrome browser with complete control over the request/browser, and more! +- **`bulk_fetch`**: An async version of the above tool that allows scraping of multiple URLs in different browser tabs at the same time! + +### 🔒 Stealth Scraping +- **`stealthy_fetch`**: Uses our Stealthy browser to bypass Cloudflare Turnstile/Interstitial and other anti-bot systems with complete control over the request/browser! +- **`bulk_stealthy_fetch`**: An async version of the above tool that allows stealth scraping of multiple URLs in different browser tabs at the same time!
+ +### Key Capabilities +- **Smart Content Extraction**: Convert web pages/elements to Markdown, HTML, or extract a clean version of the text content +- **CSS Selector Support**: Use the Scrapling engine to target specific elements with precision before handing the content to the AI +- **Anti-Bot Bypass**: Handle Cloudflare Turnstile, Interstitial, and other protections +- **Proxy Support**: Use proxies for anonymity and geo-targeting +- **Browser Impersonation**: Mimic real browsers with TLS fingerprinting, real browser headers matching that version, and more +- **Parallel Processing**: Scrape multiple URLs concurrently for efficiency + +#### But why use Scrapling MCP Server instead of other available tools? + +Aside from its stealth capabilities and ability to bypass Cloudflare Turnstile/Interstitial, Scrapling's server is the only one that lets you select specific elements to pass to the AI, saving a lot of time and tokens! + +The way other servers work is that they extract the content, then pass it all to the AI to extract the fields you want. This causes the AI to consume far more tokens than needed (from irrelevant content). Scrapling solves this problem by allowing you to pass a CSS selector to narrow down the content you want before passing it to the AI, which makes the whole process much faster and more efficient. + +If you don't know how to write/use CSS selectors, don't worry. You can tell the AI in the prompt to write selectors to match possible fields for you and watch it try different combinations until it finds the right one, as we will show in the examples section. + +## Installation + +Install Scrapling with MCP Support, then double-check that the browser dependencies are installed. 
+ +```bash +# Install Scrapling with MCP server dependencies +pip install "scrapling[ai]" + +# Install browser dependencies +scrapling install +``` + +Or use the Docker image directly from the Docker registry: +```bash +docker pull pyd4vinci/scrapling +``` +Or download it from the GitHub registry: +```bash +docker pull ghcr.io/d4vinci/scrapling:latest +``` + +## Setting up the MCP Server + +Here we will explain how to add Scrapling MCP Server to [Claude Desktop](https://claude.ai/download) and [Claude Code](https://www.anthropic.com/claude-code), but the same logic applies to any other chatbot that supports MCP: + +### Claude Desktop + +1. Open Claude Desktop +2. Click the hamburger menu (☰) at the top left → Settings → Developer → Edit Config +3. Add the Scrapling MCP server configuration: +```json +"ScraplingServer": { + "command": "scrapling", + "args": [ + "mcp" + ] +} +``` +If that's the first MCP server you're adding, set the content of the file to this: +```json +{ + "mcpServers": { + "ScraplingServer": { + "command": "scrapling", + "args": [ + "mcp" + ] + } + } +} +``` +As per the [official article](https://modelcontextprotocol.io/quickstart/user), this action either creates a new configuration file if none exists or opens your existing configuration. The file is located at: + +1. **MacOS**: `~/Library/Application Support/Claude/claude_desktop_config.json` +2. **Windows**: `%APPDATA%\Claude\claude_desktop_config.json` + +To ensure it's working, use the full path to the `scrapling` executable. Open the terminal and execute the following command: + +1. **MacOS**: `which scrapling` +2.
**Windows**: `where scrapling` + +For me, on my Mac, it returned `/Users//.venv/bin/scrapling`, so the config I used in the end is: +```json +{ + "mcpServers": { + "ScraplingServer": { + "command": "/Users//.venv/bin/scrapling", + "args": [ + "mcp" + ] + } + } +} +``` +#### Docker +If you are using the Docker image, the config would look something like this: +```json +{ + "mcpServers": { + "ScraplingServer": { + "command": "docker", + "args": [ + "run", "-i", "--rm", "scrapling", "mcp" + ] + } + } +} +``` + +The same logic applies to [Cursor](https://cursor.com/docs/context/mcp), [WindSurf](https://windsurf.com/university/tutorials/configuring-first-mcp-server), and others. + +### Claude Code +This one is much simpler. If you have [Claude Code](https://www.anthropic.com/claude-code) installed, open the terminal and execute the following command: + +```bash +claude mcp add ScraplingServer "/Users//.venv/bin/scrapling" mcp +``` +Same as above, to get Scrapling's executable path, open the terminal and execute the following command: + +1. **MacOS**: `which scrapling` +2. **Windows**: `where scrapling` + +Here's the main article from Anthropic on [how to add MCP servers to Claude Code](https://docs.anthropic.com/en/docs/claude-code/mcp#option-1%3A-add-a-local-stdio-server) for further details. + + +Then, after you've added the server, you need to completely quit and restart the app you used above. In Claude Desktop, you should see an MCP server indicator (🔧) in the bottom-right corner of the chat input or see `ScraplingServer` in the `Search and tools` dropdown in the chat input box. + +### Streamable HTTP +As of version 0.3.6, we have added the ability to make the MCP server use the 'Streamable HTTP' transport mode instead of the traditional 'stdio' transport.
So instead of using the following command (the 'stdio' one): +```bash +scrapling mcp +``` +Use the following to enable the 'Streamable HTTP' transport mode: +```bash +scrapling mcp --http +``` +By default, the server listens on host '0.0.0.0' and port 8000; both can be configured as below: +```bash +scrapling mcp --http --host '127.0.0.1' --port 8000 +``` + +## Examples + +Now we will show you some examples of prompts we used while testing the MCP server, but you are probably more creative and better at prompt engineering than we are :) + +We will gradually go from simple prompts to more complex ones. We will use Claude Desktop for the examples, but the same logic applies to the rest, of course. + +1. **Basic Web Scraping** + + Extract the main content from a webpage as Markdown: + + ``` + Scrape the main content from https://example.com and convert it to markdown format. + ``` + + Claude will use the `get` tool to fetch the page and return clean, readable content. If it fails, it will continue retrying every second for 3 attempts, unless you instruct it otherwise. If it fails to retrieve content for any reason, such as protection or if it's a dynamic website, it will automatically try the other tools. If Claude didn't do that automatically for some reason, you can add that to the prompt. + + A more optimized version of the same prompt would be: + ``` + Use regular requests to scrape the main content from https://example.com and convert it to markdown format. + ``` + This tells Claude which tool to use here, so it doesn't have to guess. Sometimes it will start using normal requests on its own, and at other times, it will assume browsers are better suited for this website without any apparent reason. As a rule of thumb, you should always tell Claude which tool to use to save time and money and get consistent results. + +2.
**Targeted Data Extraction** + + Extract specific elements using CSS selectors: + + ``` + Get all product titles from https://shop.example.com using the CSS selector '.product-title'. If the request fails, retry up to 5 times every 10 seconds. + ``` + + The server will extract only the elements matching your selector and return them as a structured list. Notice that I told it to have the tool retry up to 5 times in case the website has connection issues, but the default setting should be fine for most cases. + +3. **E-commerce Data Collection** + + Here's a slightly more complex prompt: + ``` + Extract product information from these e-commerce URLs using bulk browser fetches: + - https://shop1.com/product-a + - https://shop2.com/product-b + - https://shop3.com/product-c + + Get the product names, prices, and descriptions from each page. + ``` + + Claude will use `bulk_fetch` to concurrently scrape all URLs, then analyze the extracted data. + +4. **More advanced workflow** + + Let's say I want to get all the action games available right now on the first page of the PlayStation Store. I can use the following prompt to do that: + ``` + Extract the URLs of all games in this page, then do a bulk request to them and return a list of all action games: https://store.playstation.com/en-us/pages/browse + ``` + Note that I instructed it to use a bulk request for all the URLs collected. If I hadn't mentioned it, sometimes it works as intended, and other times it makes a separate request to each URL, which takes significantly longer. This prompt takes approximately one minute to complete. + + However, because I wasn't specific enough, it actually used `stealthy_fetch` here and `bulk_stealthy_fetch` in the second step, which unnecessarily consumed a large number of tokens.
A better prompt would be: + ``` + Use normal requests to extract the URLs of all games in this page, then do a bulk request to them and return a list of all action games: https://store.playstation.com/en-us/pages/browse + ``` + And if you know how to write CSS selectors, you can instruct Claude to apply the selectors to the elements you want, and it will complete the task almost immediately. + ``` + Use normal requests to extract the URLs of all games on the page below, then perform a bulk request to them and return a list of all action games. + The selector for games on the first page is `[href*="/concept/"]` and the selector for the genre in the second request is `[data-qa="gameInfo#releaseInformation#genre-value"]`. + + URL: https://store.playstation.com/en-us/pages/browse + ``` + +5. **Get data from a website with Cloudflare protection** + + If you think the website you are targeting has Cloudflare protection, tell Claude instead of letting it discover it on its own. + ``` + What's the price of this product? Be cautious, as it utilizes Cloudflare's Turnstile protection. Make the browser visible while you work. + + https://ao.com/product/oo101uk-ninja-woodfire-outdoor-pizza-oven-brown-99357-685.aspx + ``` + +6. **Long workflow** + + You can, for example, use a prompt like this: + ``` + Extract all product URLs for the following category, then return the prices and details for the first 3 products. + + https://www.arnotts.ie/furniture/bedroom/bed-frames/ + ``` + But a better prompt would be: + ``` + Go to the following category URL and extract all product URLs using the CSS selector "a". Then, fetch the first 3 product pages in parallel and extract each product's price and details. + + Keep the output in markdown format to reduce irrelevant content. + + Category URL: + https://www.arnotts.ie/furniture/bedroom/bed-frames/ + ``` + +And so on, you get the idea. Your creativity is the key here. + +## Best Practices + +Here is some practical advice.
+ +### 1. Choose the Right Tool +- **`get`**: Fast, simple websites +- **`fetch`**: Sites with JavaScript/dynamic content +- **`stealthy_fetch`**: Protected sites, Cloudflare, anti-bot systems + +### 2. Optimize Performance +- Use bulk tools for multiple URLs +- Disable unnecessary resources +- Set appropriate timeouts +- Use CSS selectors for targeted extraction + +### 3. Handle Dynamic Content +- Use `network_idle` for SPAs +- Set `wait_selector` for specific elements +- Increase timeout for slow-loading sites + +### 4. Data Quality +- Use `main_content_only=true` to avoid navigation/ads +- Choose an appropriate `extraction_type` for your use case + +## Legal and Ethical Considerations + +⚠️ **Important Guidelines:** + +- **Check robots.txt**: Visit `https://website.com/robots.txt` to see scraping rules +- **Respect rate limits**: Don't overwhelm servers with requests +- **Terms of Service**: Read and comply with website terms +- **Copyright**: Respect intellectual property rights +- **Privacy**: Be mindful of personal data protection laws +- **Commercial use**: Ensure you have permission for business purposes + +--- + +*Built with ❤️ by the Scrapling team. Happy scraping!* \ No newline at end of file diff --git a/docs/api-reference/custom-types.md b/docs/api-reference/custom-types.md new file mode 100644 index 0000000000000000000000000000000000000000..805e0d4f1d260b6887d7160eae2bc6d2b553627e --- /dev/null +++ b/docs/api-reference/custom-types.md @@ -0,0 +1,26 @@ +--- +search: + exclude: true +--- + +# Custom Types API Reference + +Here's the reference information for all the custom type classes Scrapling implements, with all their parameters, attributes, and methods.
+ +You can import all of them directly like below: + +```python +from scrapling.core.custom_types import TextHandler, TextHandlers, AttributesHandler +``` + +## ::: scrapling.core.custom_types.TextHandler + handler: python + :docstring: + +## ::: scrapling.core.custom_types.TextHandlers + handler: python + :docstring: + +## ::: scrapling.core.custom_types.AttributesHandler + handler: python + :docstring: diff --git a/docs/api-reference/fetchers.md b/docs/api-reference/fetchers.md new file mode 100644 index 0000000000000000000000000000000000000000..cd3d90cb72c78a38506f1925776ed65018a8c450 --- /dev/null +++ b/docs/api-reference/fetchers.md @@ -0,0 +1,63 @@ +--- +search: + exclude: true +--- + +# Fetchers Classes + +Here's the reference information for all fetcher-type classes' parameters, attributes, and methods. + +You can import all of them directly like below: + +```python +from scrapling.fetchers import ( + Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher, + FetcherSession, AsyncStealthySession, StealthySession, DynamicSession, AsyncDynamicSession +) +``` + +## ::: scrapling.fetchers.Fetcher + handler: python + :docstring: + +## ::: scrapling.fetchers.AsyncFetcher + handler: python + :docstring: + +## ::: scrapling.fetchers.DynamicFetcher + handler: python + :docstring: + +## ::: scrapling.fetchers.StealthyFetcher + handler: python + :docstring: + + +## Session Classes + +### HTTP Sessions + +## ::: scrapling.fetchers.FetcherSession + handler: python + :docstring: + +### Stealth Sessions + +## ::: scrapling.fetchers.StealthySession + handler: python + :docstring: + +## ::: scrapling.fetchers.AsyncStealthySession + handler: python + :docstring: + +### Dynamic Sessions + +## ::: scrapling.fetchers.DynamicSession + handler: python + :docstring: + +## ::: scrapling.fetchers.AsyncDynamicSession + handler: python + :docstring: + diff --git a/docs/api-reference/mcp-server.md b/docs/api-reference/mcp-server.md new file mode 100644 index 
0000000000000000000000000000000000000000..03cb10227144f48277934245c52cad2017443e7d --- /dev/null +++ b/docs/api-reference/mcp-server.md @@ -0,0 +1,39 @@ +--- +search: + exclude: true +--- + +# MCP Server API Reference + +The **Scrapling MCP Server** provides six powerful tools for web scraping through the Model Context Protocol (MCP). This server integrates Scrapling's capabilities directly into AI chatbots and agents, allowing conversational web scraping with advanced anti-bot bypass features. + +You can start the MCP server by running: + +```bash +scrapling mcp +``` + +Or import the server class directly: + +```python +from scrapling.core.ai import ScraplingMCPServer + +server = ScraplingMCPServer() +server.serve(http=False, host="0.0.0.0", port=8000) +``` + +## Response Model + +The standardized response structure that's returned by all MCP server tools: + +## ::: scrapling.core.ai.ResponseModel + handler: python + :docstring: + +## MCP Server Class + +The main MCP server class that provides all web scraping tools: + +## ::: scrapling.core.ai.ScraplingMCPServer + handler: python + :docstring: \ No newline at end of file diff --git a/docs/api-reference/proxy-rotation.md b/docs/api-reference/proxy-rotation.md new file mode 100644 index 0000000000000000000000000000000000000000..61a8fad70c763bb548274cb83e18c042fe241a55 --- /dev/null +++ b/docs/api-reference/proxy-rotation.md @@ -0,0 +1,18 @@ +--- +search: + exclude: true +--- + +# Proxy Rotation + +The `ProxyRotator` class provides thread-safe proxy rotation for any fetcher or session. 
+ +You can import it directly like below: + +```python +from scrapling.fetchers import ProxyRotator +``` + +## ::: scrapling.engines.toolbelt.proxy_rotation.ProxyRotator + handler: python + :docstring: diff --git a/docs/api-reference/response.md b/docs/api-reference/response.md new file mode 100644 index 0000000000000000000000000000000000000000..26f6895453a96fd9b51a4c1405233933f9024f3b --- /dev/null +++ b/docs/api-reference/response.md @@ -0,0 +1,18 @@ +--- +search: + exclude: true +--- + +# Response Class + +The `Response` class wraps HTTP responses returned by all fetchers, providing access to status, headers, body, cookies, and a `Selector` for parsing. + +You can import the `Response` class like below: + +```python +from scrapling.engines.toolbelt.custom import Response +``` + +## ::: scrapling.engines.toolbelt.custom.Response + handler: python + :docstring: diff --git a/docs/api-reference/selector.md b/docs/api-reference/selector.md new file mode 100644 index 0000000000000000000000000000000000000000..a4be82b684851777519616d85703ca13a0d6b5ea --- /dev/null +++ b/docs/api-reference/selector.md @@ -0,0 +1,25 @@ +--- +search: + exclude: true +--- + +# Selector Class + +The `Selector` class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities. + +Here's the reference information for the `Selector` class, with all its parameters, attributes, and methods. 
+ +You can import the `Selector` class directly from `scrapling`: + +```python +from scrapling.parser import Selector +``` + +## ::: scrapling.parser.Selector + handler: python + :docstring: + +## ::: scrapling.parser.Selectors + handler: python + :docstring: + diff --git a/docs/api-reference/spiders.md b/docs/api-reference/spiders.md new file mode 100644 index 0000000000000000000000000000000000000000..0a69abbf8d696306718da67a57f99c45f822194c --- /dev/null +++ b/docs/api-reference/spiders.md @@ -0,0 +1,42 @@ +--- +search: + exclude: true +--- + +# Spider Classes + +Here's the reference information for the spider framework classes' parameters, attributes, and methods. + +You can import them directly like below: + +```python +from scrapling.spiders import Spider, Request, CrawlResult, SessionManager, Response +``` + +## ::: scrapling.spiders.Spider + handler: python + :docstring: + +## ::: scrapling.spiders.Request + handler: python + :docstring: + +## Result Classes + +## ::: scrapling.spiders.result.CrawlResult + handler: python + :docstring: + +## ::: scrapling.spiders.result.CrawlStats + handler: python + :docstring: + +## ::: scrapling.spiders.result.ItemList + handler: python + :docstring: + +## Session Management + +## ::: scrapling.spiders.session.SessionManager + handler: python + :docstring: diff --git a/docs/assets/cover_dark.png b/docs/assets/cover_dark.png new file mode 100644 index 0000000000000000000000000000000000000000..a81bdc1b485c010ba36f6db4a5396c97135539bd --- /dev/null +++ b/docs/assets/cover_dark.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:8eec59d31fa1c41f1a35ee8e08a412e975eeabf1347b1bb6ca609cd454edf044 +size 113787 diff --git a/docs/assets/cover_dark.svg b/docs/assets/cover_dark.svg new file mode 100644 index 0000000000000000000000000000000000000000..07479588fdd3612065534a885f16c8ddceb854a0 --- /dev/null +++ b/docs/assets/cover_dark.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git 
a/docs/assets/cover_light.png b/docs/assets/cover_light.png new file mode 100644 index 0000000000000000000000000000000000000000..7b8b61f0150b15a09b9044c03dfc5e7adf89998a Binary files /dev/null and b/docs/assets/cover_light.png differ diff --git a/docs/assets/cover_light.svg b/docs/assets/cover_light.svg new file mode 100644 index 0000000000000000000000000000000000000000..5ec513b51cbdb7fee49710943161b2dbc2eb9474 --- /dev/null +++ b/docs/assets/cover_light.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/assets/favicon.ico b/docs/assets/favicon.ico new file mode 100644 index 0000000000000000000000000000000000000000..ca2dbf9c38352499db237f3ddc7a22220213db39 --- /dev/null +++ b/docs/assets/favicon.ico @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:9d2643963074a37762e2f2896b3146c7601a262838cecbcac30b69baa497d4f8 +size 267230 diff --git a/docs/assets/logo.png b/docs/assets/logo.png new file mode 100644 index 0000000000000000000000000000000000000000..3c4f0cd80119976493548fbd20049122a5a1d922 Binary files /dev/null and b/docs/assets/logo.png differ diff --git a/docs/assets/main_cover.png b/docs/assets/main_cover.png new file mode 100644 index 0000000000000000000000000000000000000000..fc38ca55f9b259a7f451f9ce6ec990dc221e0a70 --- /dev/null +++ b/docs/assets/main_cover.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:a80343a3e9f04e64c08c568ff2e452cccd2b24157d24b7263fc5d677d14ccc40 +size 454701 diff --git a/docs/assets/scrapling_shell_curl.png b/docs/assets/scrapling_shell_curl.png new file mode 100644 index 0000000000000000000000000000000000000000..b2642aa7b1cd33b9fff26f76229b3b0ce6f477c1 --- /dev/null +++ b/docs/assets/scrapling_shell_curl.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:39c5c7aa963d31dc4f8584f34058600487c1941160dcfdcb8d11f1c699935c13 +size 351213 diff --git a/docs/assets/spider_architecture.png b/docs/assets/spider_architecture.png new file mode 100644 index 
0000000000000000000000000000000000000000..60387434467be7cf1ac4a7d99ae018bd78715046 --- /dev/null +++ b/docs/assets/spider_architecture.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:49bca39a1cb9a532074bc6530ec2b6b1ea625e7a9f042659d2bfffcb7dcee84a +size 129786 diff --git a/docs/benchmarks.md b/docs/benchmarks.md new file mode 100644 index 0000000000000000000000000000000000000000..40a0fdf54235ef036e70388b75080f7432a6f3aa --- /dev/null +++ b/docs/benchmarks.md @@ -0,0 +1,28 @@ +# Performance Benchmarks + +Scrapling isn't just powerful; it's also blazing fast. The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries. + +### Text Extraction Speed Test (5000 nested elements) + +| # | Library | Time (ms) | vs Scrapling | +|---|:-----------------:|:---------:|:------------:| +| 1 | Scrapling | 2.02 | 1.0x | +| 2 | Parsel/Scrapy | 2.04 | 1.01x | +| 3 | Raw Lxml | 2.54 | 1.26x | +| 4 | PyQuery | 24.17 | ~12x | +| 5 | Selectolax | 82.63 | ~41x | +| 6 | MechanicalSoup | 1549.71 | ~767.1x | +| 7 | BS4 with Lxml | 1584.31 | ~784.3x | +| 8 | BS4 with html5lib | 3391.91 | ~1679.1x | + + +### Element Similarity & Text Search Performance + +Scrapling's adaptive element finding capabilities significantly outperform alternatives: + +| Library | Time (ms) | vs Scrapling | +|-------------|:---------:|:------------:| +| Scrapling | 2.39 | 1.0x | +| AutoScraper | 12.45 | 5.21x | + +> All benchmarks represent averages of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology.
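The "vs Scrapling" column is simply each library's mean time divided by Scrapling's mean time. As a minimal illustration of how such a column is computed (the timings below are copied from the first table for illustration, not re-measured, and this is not the project's benchmark code):

```python
# Compute speedup ratios relative to a baseline, as in the table above.
# Timings (ms) are copied from the first table for illustration only.
timings_ms = {
    "Scrapling": 2.02,
    "Parsel/Scrapy": 2.04,
    "Raw Lxml": 2.54,
    "PyQuery": 24.17,
}

baseline = timings_ms["Scrapling"]
ratios = {name: round(t / baseline, 2) for name, t in timings_ms.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio}x")
```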
diff --git a/docs/cli/extract-commands.md b/docs/cli/extract-commands.md new file mode 100644 index 0000000000000000000000000000000000000000..fa622d0428d95ff2ef58812df17845a0bc21aba0 --- /dev/null +++ b/docs/cli/extract-commands.md @@ -0,0 +1,355 @@ +# Scrapling Extract Command Guide + +**Web Scraping through the terminal without requiring any programming!** + +The `scrapling extract` command lets you download and extract content from websites directly from your terminal without writing any code. Ideal for beginners, researchers, and anyone requiring rapid web data extraction. + +!!! success "Prerequisites" + + 1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use. + 2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object. + 3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class. + 4. You've completed or read at least one page from the fetchers section to use here for requests: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md). + + +## What is the Extract Command group? + +The extract command is a set of simple terminal tools that: + +- **Downloads web pages** and saves their content to files. +- **Converts HTML to readable formats** like Markdown, keeps it as HTML, or just extracts the text content of the page. +- **Supports custom CSS selectors** to extract specific parts of the page. 
+- **Handles HTTP requests and fetching through browsers** +- **Highly customizable** with custom headers, cookies, proxies, and the rest of the options. Almost all the options available through the code are also accessible through the command line. + +## Quick Start + +- **Basic Website Download** + + Download a website's text content as clean, readable text: + ```bash + scrapling extract get "https://example.com" page_content.txt + ``` + This makes an HTTP GET request and saves the webpage's text content to `page_content.txt`. + +- **Save as Different Formats** + + Choose your output format by changing the file extension: + ```bash + # Convert the HTML content to Markdown, then save it to the file (great for documentation) + scrapling extract get "https://blog.example.com" article.md + + # Save the HTML content as it is to the file + scrapling extract get "https://example.com" page.html + + # Save a clean version of the text content of the webpage to the file + scrapling extract get "https://example.com" content.txt + + # Or use the Docker image with something like this: + docker run -v $(pwd)/output:/output scrapling extract get "https://blog.example.com" /output/article.md + ``` + +- **Extract Specific Content** + + All commands can use CSS selectors to extract specific parts of the page through `--css-selector` or `-s` as you will see in the examples below. + +## Available Commands + +You can display the available commands through `scrapling extract --help` to get the following list: +```bash +Usage: scrapling extract [OPTIONS] COMMAND [ARGS]... + + Fetch web pages using various fetchers and extract full/selected HTML content as HTML, Markdown, or extract text content. + +Options: + --help Show this message and exit. + +Commands: + get Perform a GET request and save the content to a file. + post Perform a POST request and save the content to a file. + put Perform a PUT request and save the content to a file. 
+ delete Perform a DELETE request and save the content to a file. + fetch Use DynamicFetcher to fetch content with browser... + stealthy-fetch Use StealthyFetcher to fetch content with advanced... +``` + +We will go through each command in detail below. + +### HTTP Requests + +1. **GET Request** + + The most common command for downloading website content: + + ```bash + scrapling extract get [URL] [OUTPUT_FILE] [OPTIONS] + ``` + + **Examples:** + ```bash + # Basic download + scrapling extract get "https://news.site.com" news.md + + # Download with custom timeout + scrapling extract get "https://example.com" content.txt --timeout 60 + + # Extract only specific content using CSS selectors + scrapling extract get "https://blog.example.com" articles.md --css-selector "article" + + # Send a request with cookies + scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john" + + # Add user agent + scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0" + + # Add multiple headers + scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US" + ``` + Get the available options for the command with `scrapling extract get --help` as follows: + ```bash + Usage: scrapling extract get [OPTIONS] URL OUTPUT_FILE + + Perform a GET request and save the content to a file. + + The output file path can be an HTML file, a Markdown file of the HTML content, or the text content itself. Use file extensions (`.html`/`.md`/`.txt`) respectively. + + Options: + -H, --headers TEXT HTTP headers in format "Key: Value" (can be used multiple times) + --cookies TEXT Cookies string in format "name1=value1;name2=value2" + --timeout INTEGER Request timeout in seconds (default: 30) + --proxy TEXT Proxy URL in format "http://username:password@host:port" + -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches. 
+ -p, --params TEXT Query parameters in format "key=value" (can be used multiple times) + --follow-redirects / --no-follow-redirects Whether to follow redirects (default: True) + --verify / --no-verify Whether to verify SSL certificates (default: True) + --impersonate TEXT Browser to impersonate (e.g., chrome, firefox). + --stealthy-headers / --no-stealthy-headers Use stealthy browser headers (default: True) + --help Show this message and exit. + + ``` + Note that the options will work in the same way for all other request commands, so no need to repeat them. + +2. **Post Request** + + ```bash + scrapling extract post [URL] [OUTPUT_FILE] [OPTIONS] + ``` + + **Examples:** + ```bash + # Submit form data + scrapling extract post "https://api.site.com/search" results.html --data "query=python&type=tutorial" + + # Send JSON data + scrapling extract post "https://api.site.com" response.json --json '{"username": "test", "action": "search"}' + ``` + Get the available options for the command with `scrapling extract post --help` as follows: + ```bash + Usage: scrapling extract post [OPTIONS] URL OUTPUT_FILE + + Perform a POST request and save the content to a file. + + The output file path can be an HTML file, a Markdown file of the HTML content, or the text content itself. Use file extensions (`.html`/`.md`/`.txt`) respectively. + + Options: + -d, --data TEXT Form data to include in the request body (as string, ex: "param1=value1&param2=value2") + -j, --json TEXT JSON data to include in the request body (as string) + -H, --headers TEXT HTTP headers in format "Key: Value" (can be used multiple times) + --cookies TEXT Cookies string in format "name1=value1;name2=value2" + --timeout INTEGER Request timeout in seconds (default: 30) + --proxy TEXT Proxy URL in format "http://username:password@host:port" + -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
+ -p, --params TEXT Query parameters in format "key=value" (can be used multiple times) + --follow-redirects / --no-follow-redirects Whether to follow redirects (default: True) + --verify / --no-verify Whether to verify SSL certificates (default: True) + --impersonate TEXT Browser to impersonate (e.g., chrome, firefox). + --stealthy-headers / --no-stealthy-headers Use stealthy browser headers (default: True) + --help Show this message and exit. + + ``` + +3. **Put Request** + + ```bash + scrapling extract put [URL] [OUTPUT_FILE] [OPTIONS] + ``` + + **Examples:** + ```bash + # Send data + scrapling extract put "https://scrapling.requestcatcher.com/put" results.html --data "update=info" --impersonate "firefox" + + # Send JSON data + scrapling extract put "https://scrapling.requestcatcher.com/put" response.json --json '{"username": "test", "action": "search"}' + ``` + Get the available options for the command with `scrapling extract put --help` as follows: + ```bash + Usage: scrapling extract put [OPTIONS] URL OUTPUT_FILE + + Perform a PUT request and save the content to a file. + + The output file path can be an HTML file, a Markdown file of the HTML content, or the text content itself. Use file extensions (`.html`/`.md`/`.txt`) respectively. + + Options: + -d, --data TEXT Form data to include in the request body + -j, --json TEXT JSON data to include in the request body (as string) + -H, --headers TEXT HTTP headers in format "Key: Value" (can be used multiple times) + --cookies TEXT Cookies string in format "name1=value1;name2=value2" + --timeout INTEGER Request timeout in seconds (default: 30) + --proxy TEXT Proxy URL in format "http://username:password@host:port" + -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches. 
+ -p, --params TEXT Query parameters in format "key=value" (can be used multiple times) + --follow-redirects / --no-follow-redirects Whether to follow redirects (default: True) + --verify / --no-verify Whether to verify SSL certificates (default: True) + --impersonate TEXT Browser to impersonate (e.g., chrome, firefox). + --stealthy-headers / --no-stealthy-headers Use stealthy browser headers (default: True) + --help Show this message and exit. + ``` + +4. **Delete Request** + + ```bash + scrapling extract delete [URL] [OUTPUT_FILE] [OPTIONS] + ``` + + **Examples:** + ```bash + # Basic DELETE request + scrapling extract delete "https://scrapling.requestcatcher.com/delete" results.html + + # DELETE request while impersonating Chrome + scrapling extract delete "https://scrapling.requestcatcher.com/" response.txt --impersonate "chrome" + ``` + Get the available options for the command with `scrapling extract delete --help` as follows: + ```bash + Usage: scrapling extract delete [OPTIONS] URL OUTPUT_FILE + + Perform a DELETE request and save the content to a file. + + The output file path can be an HTML file, a Markdown file of the HTML content, or the text content itself. Use file extensions (`.html`/`.md`/`.txt`) respectively. + + Options: + -H, --headers TEXT HTTP headers in format "Key: Value" (can be used multiple times) + --cookies TEXT Cookies string in format "name1=value1;name2=value2" + --timeout INTEGER Request timeout in seconds (default: 30) + --proxy TEXT Proxy URL in format "http://username:password@host:port" + -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches. + -p, --params TEXT Query parameters in format "key=value" (can be used multiple times) + --follow-redirects / --no-follow-redirects Whether to follow redirects (default: True) + --verify / --no-verify Whether to verify SSL certificates (default: True) + --impersonate TEXT Browser to impersonate (e.g., chrome, firefox).
+ --stealthy-headers / --no-stealthy-headers Use stealthy browser headers (default: True) + --help Show this message and exit. + ``` + +### Browsers fetching + +1. **fetch - Handle Dynamic Content** + + For websites that load content dynamically or have light protection: + + ```bash + scrapling extract fetch [URL] [OUTPUT_FILE] [OPTIONS] + ``` + + **Examples:** + ```bash + # Wait for JavaScript to load content and finish network activity + scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle + + # Wait for specific content to appear + scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded" + + # Run in visible browser mode (helpful for debugging) + scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources + ``` + Get the available options for the command with `scrapling extract fetch --help` as follows: + ```bash + Usage: scrapling extract fetch [OPTIONS] URL OUTPUT_FILE + + Use DynamicFetcher to fetch content with browser automation. + + The output file path can be an HTML file, a Markdown file of the HTML content, or the text content itself. Use file extensions (`.html`/`.md`/`.txt`) respectively. + + Options: + --headless / --no-headless Run browser in headless mode (default: True) + --disable-resources / --enable-resources Drop unnecessary resources for speed boost (default: False) + --network-idle / --no-network-idle Wait for network idle (default: False) + --timeout INTEGER Timeout in milliseconds (default: 30000) + --wait INTEGER Additional wait time in milliseconds after page load (default: 0) + -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches. + --wait-selector TEXT CSS selector to wait for before proceeding + --locale TEXT Specify user locale. Defaults to the system default locale.
+ --real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False) + --proxy TEXT Proxy URL in format "http://username:password@host:port" + -H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times) + --help Show this message and exit. + ``` + +2. **stealthy-fetch - Bypass Protection** + + For websites with anti-bot protection or Cloudflare protection + + ```bash + scrapling extract stealthy-fetch [URL] [OUTPUT_FILE] [OPTIONS] + ``` + + **Examples:** + ```bash + # Bypass basic protection + scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md + + # Solve Cloudflare challenges + scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a" + + # Use a proxy for anonymity. + scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080" + ``` + Get the available options for the command with `scrapling extract stealthy-fetch --help` as follows: + ```bash + Usage: scrapling extract stealthy-fetch [OPTIONS] URL OUTPUT_FILE + + Use StealthyFetcher to fetch content with advanced stealth features. + + The output file path can be an HTML file, a Markdown file of the HTML content, or the text content itself. Use file extensions (`.html`/`.md`/`.txt`) respectively. 
+ + Options: + --headless / --no-headless Run browser in headless mode (default: True) + --disable-resources / --enable-resources Drop unnecessary resources for speed boost (default: False) + --block-webrtc / --allow-webrtc Block WebRTC entirely (default: False) + --solve-cloudflare / --no-solve-cloudflare Solve Cloudflare challenges (default: False) + --allow-webgl / --block-webgl Allow WebGL (default: True) + --network-idle / --no-network-idle Wait for network idle (default: False) + --real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False) + --timeout INTEGER Timeout in milliseconds (default: 30000) + --wait INTEGER Additional wait time in milliseconds after page load (default: 0) + -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches. + --wait-selector TEXT CSS selector to wait for before proceeding + --hide-canvas / --show-canvas Add noise to canvas operations (default: False) + --proxy TEXT Proxy URL in format "http://username:password@host:port" + -H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times) + --help Show this message and exit. 
+ ``` + +## When to use each command + +If you are not a Web Scraping expert and can't decide what to choose, the following rules of thumb can help you decide: + +- Use **`get`** with simple websites, blogs, or news articles +- Use **`fetch`** with modern web apps, or sites with dynamic content +- Use **`stealthy-fetch`** with protected sites, Cloudflare, or anti-bot systems + +## Legal and Ethical Considerations + +⚠️ **Important Guidelines:** + +- **Check robots.txt**: Visit `https://website.com/robots.txt` to see scraping rules +- **Respect rate limits**: Don't overwhelm servers with requests +- **Terms of Service**: Read and comply with website terms +- **Copyright**: Respect intellectual property rights +- **Privacy**: Be mindful of personal data protection laws +- **Commercial use**: Ensure you have permission for business purposes + +--- + +*Happy scraping! Remember to always respect website policies and comply with all applicable laws and regulations.* \ No newline at end of file diff --git a/docs/cli/interactive-shell.md b/docs/cli/interactive-shell.md new file mode 100644 index 0000000000000000000000000000000000000000..b897ce7970310403eb0ab9348b5f4743f00b4437 --- /dev/null +++ b/docs/cli/interactive-shell.md @@ -0,0 +1,245 @@ +# Scrapling Interactive Shell Guide + + + +**Powerful Web Scraping REPL for Developers and Data Scientists** + +The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools, such as curl command conversion. + +!!! success "Prerequisites" + + 1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use. + 2.
You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object. + 3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class. + 4. You've completed or read at least one page from the fetchers section to use here for requests: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md). + + +## Why use the Interactive Shell? + +The interactive shell transforms web scraping from a slow script-and-run cycle into a fast, exploratory experience. It's perfect for: + +- **Rapid prototyping**: Test scraping strategies instantly +- **Data exploration**: Interactively navigate and extract from websites +- **Learning Scrapling**: Experiment with features in real-time +- **Debugging scrapers**: Step through requests and inspect results +- **Converting workflows**: Transform curl commands from browser DevTools to a Fetcher request in a one-liner + +## Getting Started + +### Launch the Shell + +```bash +# Start the interactive shell +scrapling shell + +# Execute code and exit (useful for scripting) +scrapling shell -c "get('https://quotes.toscrape.com'); print(len(page.css('.quote')))" + +# Set logging level +scrapling shell --loglevel info +``` + +Once launched, you'll see the Scrapling banner and can immediately start scraping as the video above shows: + +```python +# No imports needed - everything is ready! 
+>>> get('https://news.ycombinator.com') + +>>> # Explore the page structure +>>> page.css('a')[:5] # Look at first 5 links + +>>> # Refine your selectors +>>> stories = page.css('.titleline>a') +>>> len(stories) +30 + +>>> # Extract specific data +>>> for story in stories[:3]: +... title = story.text +... url = story['href'] +... print(f"{title}: {url}") + +>>> # Try different approaches +>>> titles = page.css('.titleline>a::text') # Direct text extraction +>>> urls = page.css('.titleline>a::attr(href)') # Direct attribute extraction +``` + +## Built-in Shortcuts + +The shell provides convenient shortcuts that eliminate boilerplate code: + +- **`get(url, **kwargs)`** - HTTP GET request (instead of `Fetcher.get`) +- **`post(url, **kwargs)`** - HTTP POST request (instead of `Fetcher.post`) +- **`put(url, **kwargs)`** - HTTP PUT request (instead of `Fetcher.put`) +- **`delete(url, **kwargs)`** - HTTP DELETE request (instead of `Fetcher.delete`) +- **`fetch(url, **kwargs)`** - Browser-based fetch (instead of `DynamicFetcher.fetch`) +- **`stealthy_fetch(url, **kwargs)`** - Stealthy browser fetch (instead of `StealthyFetcher.fetch`) + +The most commonly used classes are automatically available without any import, including `Fetcher`, `AsyncFetcher`, `DynamicFetcher`, `StealthyFetcher`, and `Selector`. 
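Conceptually, these shortcuts are just thin aliases bound to the corresponding fetcher methods. Here is a rough, stdlib-only sketch of the aliasing idea; the `Fetcher` class below is a hypothetical stand-in for illustration only, not Scrapling's actual implementation:

```python
from functools import partial


class Fetcher:
    """Hypothetical stand-in for Scrapling's Fetcher, used only to illustrate aliasing."""

    @staticmethod
    def request(method: str, url: str, **kwargs) -> str:
        # A real fetcher would perform the HTTP request here.
        return f"{method.upper()} {url}"


# The shell pre-binds shortcuts like these into the REPL namespace
get = partial(Fetcher.request, "get")
post = partial(Fetcher.request, "post")

print(get("https://quotes.toscrape.com"))  # GET https://quotes.toscrape.com
```

Because the shortcuts forward every keyword argument to the underlying fetcher method, anything the underlying method accepts should work with the shortcut too.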
+ +### Smart Page Management + +The shell automatically tracks your requests and pages: + +- **Current Page Access** + + The `page` and `response` variables are automatically updated with the last fetched page: + + ```python + >>> get('https://quotes.toscrape.com') + >>> # 'page' and 'response' both refer to the last fetched page + >>> page.url + 'https://quotes.toscrape.com' + >>> response.status # Same as page.status + 200 + ``` + +- **Page History** + + The `pages` variable keeps track of the last five pages (it's a `Selectors` object): + + ```python + >>> get('https://site1.com') + >>> get('https://site2.com') + >>> get('https://site3.com') + + >>> # Access last 5 pages + >>> len(pages) # `Selectors` object with `page` history + 3 + >>> pages[0].url # First page in history + 'https://site1.com' + >>> pages[-1].url # Most recent page + 'https://site3.com' + + >>> # Work with historical pages + >>> for i, old_page in enumerate(pages): + ... print(f"Page {i}: {old_page.url} - {old_page.status}") + ``` + +## Additional helpful commands + +### Page Visualization + +View scraped pages in your browser: + +```python +>>> get('https://quotes.toscrape.com') +>>> view(page) # Opens the page HTML in your default browser +``` + +### Curl Command Integration + +The shell provides a few functions to help you convert curl commands from the browser DevTools to `Fetcher` requests: `uncurl` and `curl2fetcher`. + +First, you need to copy a request as a curl command like the following: + +*Screenshot: copying a request as a curl command from Chrome DevTools.* + +- **Convert Curl command to Request Object** + + ```python + >>> curl_cmd = '''curl 'https://scrapling.requestcatcher.com/post' \ + ... -X POST \ + ... -H 'Content-Type: application/json' \ + ...
-d '{"name": "test", "value": 123}' ''' + + >>> request = uncurl(curl_cmd) + >>> request.method + 'post' + >>> request.url + 'https://scrapling.requestcatcher.com/post' + >>> request.headers + {'Content-Type': 'application/json'} + ``` + +- **Execute Curl Command Directly** + + ```python + >>> # Convert and execute in one step + >>> curl2fetcher(curl_cmd) + >>> page.status + 200 + >>> page.json()['json'] + {'name': 'test', 'value': 123} + ``` + +### IPython Features + +The shell inherits all IPython capabilities: + +```python +>>> # Magic commands +>>> %time page = get('https://example.com') # Time execution +>>> %history # Show command history +>>> %save filename.py 1-10 # Save commands 1-10 to file + +>>> # Tab completion works everywhere +>>> page.c # Shows: css, cookies, headers, etc. +>>> Fetcher. # Shows all Fetcher methods + +>>> # Object inspection +>>> get? # Show get documentation +``` + +## Examples + +Here are a few examples generated via AI: + +#### E-commerce Data Collection + +```python +>>> # Start with product listing page +>>> catalog = get('https://shop.example.com/products') + +>>> # Find product links +>>> product_links = catalog.css('.product-link::attr(href)') +>>> print(f"Found {len(product_links)} products") + +>>> # Sample a few products first +>>> for link in product_links[:3]: +... product = get(f"https://shop.example.com{link}") +... name = product.css('.product-name::text').get('') +... price = product.css('.price::text').get('') +... print(f"{name}: {price}") + +>>> # Scale up with sessions for efficiency +>>> from scrapling.fetchers import FetcherSession +>>> with FetcherSession() as session: +... products = [] +... for link in product_links: +... product = session.get(f"https://shop.example.com{link}") +... products.append({ +... 'name': product.css('.product-name::text').get(''), +... 'price': product.css('.price::text').get(''), +... 'url': link +... 
}) +``` + +#### API Integration and Testing + +```python +>>> # Test API endpoints interactively +>>> response = get('https://jsonplaceholder.typicode.com/posts/1') +>>> response.json() +{'userId': 1, 'id': 1, 'title': 'sunt aut...', 'body': 'quia et...'} + +>>> # Test POST requests +>>> new_post = post('https://jsonplaceholder.typicode.com/posts', +... json={'title': 'Test Post', 'body': 'Test content', 'userId': 1}) +>>> new_post.json()['id'] +101 + +>>> # Test with different data +>>> updated = put(f'https://jsonplaceholder.typicode.com/posts/{new_post.json()["id"]}', +... json={'title': 'Updated Title'}) +``` + +## Getting Help + +If you need help other than what is available in-terminal, you can: + +- [Scrapling Documentation](https://scrapling.readthedocs.io/) +- [Discord Community](https://discord.gg/EMgGbDceNQ) +- [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues) + +And that's it! Happy scraping! The shell makes web scraping as easy as a conversation. \ No newline at end of file diff --git a/docs/cli/overview.md b/docs/cli/overview.md new file mode 100644 index 0000000000000000000000000000000000000000..dc4346ff87cf1e8426d8b9d380ab8151945d36e4 --- /dev/null +++ b/docs/cli/overview.md @@ -0,0 +1,30 @@ +# Command Line Interface + +Since v0.3, Scrapling includes a powerful command-line interface that provides three main capabilities: + +1. **Interactive Shell**: An interactive Web Scraping shell based on IPython that provides many shortcuts and useful tools +2. **Extract Commands**: Scrape websites from the terminal without any programming +3. 
**Utility Commands**: Installation and management tools + +```bash +# Launch interactive shell +scrapling shell + +# Convert the content of a page to markdown and save it to a file +scrapling extract get "https://example.com" content.md + +# Get help for any command +scrapling --help +scrapling extract --help +``` + +## Requirements +This section requires you to install the extra `shell` dependency group, like the following: +```bash +pip install "scrapling[shell]" +``` +and to install the fetchers' dependencies with the following command: +```bash +scrapling install +``` +This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies. \ No newline at end of file diff --git a/docs/development/adaptive_storage_system.md b/docs/development/adaptive_storage_system.md new file mode 100644 index 0000000000000000000000000000000000000000..788ad6e2bbd85bcb4da2908d09b09cca308ecc84 --- /dev/null +++ b/docs/development/adaptive_storage_system.md @@ -0,0 +1,68 @@ +# Writing your retrieval system + +Scrapling uses SQLite by default, but this tutorial shows how to write your own storage system to store element properties for the `adaptive` feature. + +You might want to use Firebase, for example, and share the database between multiple spiders on different machines. It's a great idea to use an online database like that because spiders can share adaptive data with each other. + +So first, to make your storage class work, it must do the big 3: + +1. Inherit from the abstract class `scrapling.core.storage.StorageSystemMixin` and accept a string argument, which will be the `url` argument to maintain the library logic. +2. Use the decorator `functools.lru_cache` on top of the class to follow the Singleton design pattern as the other classes do. +3.
Implement methods `save` and `retrieve`, as you can see from the type hints: + - The method `save` returns nothing and will get two arguments from the library + * The first one is of type `lxml.html.HtmlElement`, which is the element itself. It must be converted to a dictionary using the `element_to_dict` method of the `_StorageTools` class in `scrapling.core.utils` to maintain the same format, and then saved to your database as you wish. + * The second one is a string, the identifier used for retrieval. The combination of this identifier and the `url` argument from initialization must be unique for each row, or the `adaptive` data will be messed up. + - The method `retrieve` takes a string, which is the identifier; using it with the `url` passed on initialization, the element's dictionary is retrieved from the database and returned if it exists; otherwise, it returns `None`. + +> If the instructions weren't clear enough for you, you can check my implementation using SQLite3 in the [storage_adaptors](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py) file + +If your class meets these criteria, the rest is straightforward. If you plan to use the library in a threaded application, ensure your class supports it. The default class is thread-safe. + +Some helper functions are added to the abstract class if you want to use them.
It's easier to see it for yourself in the [code](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/storage.py); it's heavily commented :) + + +## Real-World Example: Redis Storage + +Here's a more practical example generated by AI using Redis: + +```python +import redis +import orjson +from functools import lru_cache +from scrapling.core.storage import StorageSystemMixin +from scrapling.core.utils import _StorageTools + +@lru_cache(None) +class RedisStorage(StorageSystemMixin): + def __init__(self, host='localhost', port=6379, db=0, url=None): + super().__init__(url) + self.redis = redis.Redis( + host=host, + port=port, + db=db, + decode_responses=False + ) + + def save(self, element, identifier: str) -> None: + # Convert element to dictionary + element_dict = _StorageTools.element_to_dict(element) + + # Create key + key = f"scrapling:{self._get_base_url()}:{identifier}" + + # Store as JSON + self.redis.set( + key, + orjson.dumps(element_dict) + ) + + def retrieve(self, identifier: str) -> dict | None: + # Get data + key = f"scrapling:{self._get_base_url()}:{identifier}" + data = self.redis.get(key) + + # Parse JSON if exists + if data: + return orjson.loads(data) + return None +``` \ No newline at end of file diff --git a/docs/development/scrapling_custom_types.md b/docs/development/scrapling_custom_types.md new file mode 100644 index 0000000000000000000000000000000000000000..2f638a98f9edc6442f3346a4d5c4887367e6b55d --- /dev/null +++ b/docs/development/scrapling_custom_types.md @@ -0,0 +1,23 @@ +# Using Scrapling's custom types + +> You can take advantage of the custom-made types for Scrapling and use them outside the library if you want. 
It's better than copying their code, after all :) + +### All current types can be imported alone, like below +```python +>>> from scrapling.core.custom_types import TextHandler, AttributesHandler + +>>> somestring = TextHandler('{}') +>>> somestring.json() +{} +>>> somedict_1 = AttributesHandler({'a': 1}) +>>> somedict_2 = AttributesHandler(a=1) +``` + +Note that `TextHandler` is a subclass of Python's `str`, so all standard operations/methods that work with Python strings will work. +If you want to check an object's type in your code, it's better to use Python's built-in `isinstance` function, since it also matches the parent type (`str` here). + +The class `AttributesHandler` is a subclass of `collections.abc.Mapping`, so it's immutable (read-only), and all operations are inherited from it. The data passed can be accessed later through the `_data` property, but be careful; it's of type `types.MappingProxyType`, so it's immutable (read-only) as well (and slightly faster to read from than a plain `Mapping`). + +To put it simply, if you are new to Python: the same operations and methods from the Python standard `dict` type will all work with the class `AttributesHandler`, except for the ones that try to modify the actual data. + +If you want to modify the data inside `AttributesHandler`, you have to convert it to a dictionary first, e.g., using the `dict` function, and then change the copy. \ No newline at end of file diff --git a/docs/donate.md b/docs/donate.md new file mode 100644 index 0000000000000000000000000000000000000000..9c8349efc431d26a453d855d1179c946216abb9f --- /dev/null +++ b/docs/donate.md @@ -0,0 +1,30 @@ +I've been creating all of these projects in my spare time and have invested considerable resources & effort in providing them to the community for free. By becoming a sponsor, you'd be directly funding my coffee reserves, helping me fulfill my responsibilities, and enabling me to continuously update existing projects and potentially create new ones. 
+ +You can sponsor me directly through the [GitHub Sponsors program](https://github.com/sponsors/D4Vinci) or [Buy Me a Coffee](https://buymeacoffee.com/d4vinci). + +Thank you, stay curious, and hack the planet! โค๏ธ + +## Advertisement +If you are looking to **advertise** your business to our target audience, check out the [available tiers](https://github.com/sponsors/D4Vinci): + +### 1. [The Silver tier](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=435496) ($50/month) +Perks: + +1. Your logo will be featured at [the top of Scrapling's project page](https://github.com/D4Vinci/Scrapling?tab=readme-ov-file#sponsors). +2. The same logo will be featured at [the top of Scrapling's PyPI page](https://pypi.org/project/scrapling/) and [the top of Docker's image page](https://hub.docker.com/r/pyd4vinci/scrapling), the same way it was placed on the project's page. + +### 2. [The Gold tier](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=435495) ($100/month) +Perks: + +1. Your logo will be featured at [the top of Scrapling's project page](https://github.com/D4Vinci/Scrapling?tab=readme-ov-file#sponsors). +2. The same logo will be featured at [the top of Scrapling's PyPI page](https://pypi.org/project/scrapling/) and [the top of Docker's image page](https://hub.docker.com/r/pyd4vinci/scrapling), the same way it was placed on the project's page. +3. Your logo will be featured as a top sponsor on [Scrapling's website](https://scrapling.readthedocs.io/en/latest/) main page. + +### 3. [The Platinum tier](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=586646) ($300/month) +Perks: + +1. Your logo will have a special placement at [the very top of Scrapling's project page](https://github.com/D4Vinci/Scrapling?tab=readme-ov-file#platinum-sponsors) with an 80-word paragraph or less. +2. 
The same logo will be featured at [the top of Scrapling's PyPI page](https://pypi.org/project/scrapling/) and [the top of Docker's image page](https://hub.docker.com/r/pyd4vinci/scrapling), the same way it was placed on the project's page. +3. Your logo will have a special placement as a top sponsor on [Scrapling's website](https://scrapling.readthedocs.io/en/latest/) main page. +4. A partner role at our Discord server. +5. A shoutout at the end of each [Release notes](https://github.com/D4Vinci/Scrapling/releases) page. \ No newline at end of file diff --git a/docs/fetching/choosing.md b/docs/fetching/choosing.md new file mode 100644 index 0000000000000000000000000000000000000000..dcba9b9be49287cb1fcd70b6ce57de3cd4bf37ae --- /dev/null +++ b/docs/fetching/choosing.md @@ -0,0 +1,85 @@ +# Fetchers basics + +## Introduction +Fetchers are classes that make requests or fetch pages for you in a single line, with many extra features, and then return a [Response](#response-object) object. Starting with v0.3, all fetchers have separate session classes to keep the session running, so, for example, a fetcher that uses a browser will keep the browser open until you finish all your requests through it instead of opening a new browser each time. Which you use depends on your use case. + +This feature was introduced because, before v0.2, Scrapling was only a parsing engine. The goal is to gradually become the one-stop shop for all Web Scraping needs. + +> Fetchers are not wrappers built on top of other libraries; they only use these libraries as engines to request/fetch pages. To further clarify this, all fetchers have features that the underlying engines don't, while still fully leveraging those engines and optimizing them for Web Scraping. + +## Fetchers Overview + +Scrapling provides three different fetcher classes, each with its session classes, and each fetcher is designed for a specific use case. + +The following table compares them and can serve as a quick guide. 
+ + +| Feature | Fetcher | DynamicFetcher | StealthyFetcher | +|--------------------|---------------------------------------------------|--------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------| +| Relative speed | ๐Ÿ‡๐Ÿ‡๐Ÿ‡๐Ÿ‡๐Ÿ‡ | ๐Ÿ‡๐Ÿ‡๐Ÿ‡ | ๐Ÿ‡๐Ÿ‡๐Ÿ‡ | +| Stealth | โญโญ | โญโญโญ | โญโญโญโญโญ | +| Anti-Bot options | โญโญ | โญโญโญ | โญโญโญโญโญ | +| JavaScript loading | โŒ | โœ… | โœ… | +| Memory Usage | โญ | โญโญโญ | โญโญโญ | +| Best used for | Basic scraping when HTTP requests alone can do it | - Dynamically loaded websites<br>- Small automation<br>- Small-Mid protections | - Dynamically loaded websites<br>- Small automation<br>- Small-Complicated protections | +| Browser(s) | โŒ | Chromium and Google Chrome | Chromium and Google Chrome | +| Browser API used | โŒ | Playwright | Playwright | +| Setup Complexity | Simple | Simple | Simple | + +In the following pages, we will talk about each one in detail. + +## Parser configuration in all fetchers +All fetchers share the same import method, as you will see in the upcoming pages: +```python +>>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher +``` +Then you can use it right away without initializing, like this, and it will use the default parser settings: +```python +>>> page = StealthyFetcher.fetch('https://example.com') +``` +If you want to configure the parser ([Selector class](../parsing/main_classes.md#selector)) that will be used on the response before it's returned to you, do this first: +```python +>>> from scrapling.fetchers import Fetcher +>>> Fetcher.configure(adaptive=True, keep_comments=False, keep_cdata=False) # and the rest +``` +or +```python +>>> from scrapling.fetchers import Fetcher +>>> Fetcher.adaptive=True +>>> Fetcher.keep_comments=False +>>> Fetcher.keep_cdata=False # and the rest +``` +Then, continue your code as usual. + +The available configuration arguments are: `adaptive`, `adaptive_domain`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `.display_config()`. + +!!! info + + The `adaptive` argument is disabled by default; you must enable it to use that feature. + +### Set parser config per request +As explained above, setting the parser config this way applies globally to all requests/fetches made through that class; it's intended for simplicity. 
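To see why these class-level settings apply to every subsequent fetch, here is a minimal, stdlib-only sketch of the same pattern (note: `MiniFetcher` and its two settings are hypothetical stand-ins for illustration, not Scrapling's actual implementation) — the configuration lives in class attributes, so every call made through the class reads the shared state:

```python
class MiniFetcher:
    # Class-level parser settings shared by every call made through the class
    # (hypothetical stand-in for the Fetcher.configure() pattern described above)
    adaptive = False
    keep_comments = True

    @classmethod
    def configure(cls, **settings):
        # Reject unknown settings, then store each one on the class itself
        for name, value in settings.items():
            if not hasattr(cls, name):
                raise AttributeError(f"Unknown setting: {name}")
            setattr(cls, name, value)

    @classmethod
    def fetch(cls, url):
        # Every fetch reads the current class-level settings at call time
        return {"url": url, "adaptive": cls.adaptive, "keep_comments": cls.keep_comments}


MiniFetcher.configure(adaptive=True, keep_comments=False)
page_one = MiniFetcher.fetch("https://example.com")
page_two = MiniFetcher.fetch("https://example.org")
# Both results reflect the same shared configuration
```

This is also why assigning the attributes directly (the `Fetcher.adaptive = True` style shown above) works the same as calling `configure()`: both mutate the class attributes that later fetches read.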
+ +If your use case requires a different configuration for each request/fetch, you can pass a dictionary of parsing arguments to the request method (`fetch`/`get`/`post`/...) through an argument named `selector_config`. + +## Response Object +The `Response` object is the same as the [Selector](../parsing/main_classes.md#selector) class, but it has additional details about the response, like response headers, status, cookies, etc., as shown below: +```python +>>> from scrapling.fetchers import Fetcher +>>> page = Fetcher.get('https://example.com') + +>>> page.status # HTTP status code +>>> page.reason # Status message +>>> page.cookies # Response cookies as a dictionary +>>> page.headers # Response headers +>>> page.request_headers # Request headers +>>> page.history # Response history of redirections, if any +>>> page.body # Raw response body as bytes +>>> page.encoding # Response encoding +>>> page.meta # Response metadata dictionary (e.g., proxy used). Mainly helpful with the spiders system. +``` +All fetchers return the `Response` object. + +!!! note + + Unlike the [Selector](../parsing/main_classes.md#selector) class, the `Response` class's body is always bytes since v0.4. \ No newline at end of file diff --git a/docs/fetching/dynamic.md b/docs/fetching/dynamic.md new file mode 100644 index 0000000000000000000000000000000000000000..31574a6eb7b2f4aa6e4865d2b5e270238b66e659 --- /dev/null +++ b/docs/fetching/dynamic.md @@ -0,0 +1,322 @@ +# Fetching dynamic websites + +Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and a few under-the-hood stealth improvements. + +As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page). + +!!! success "Prerequisites" + + 1. 
You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use. + 2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object. + 3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class. + +## Basic Usage +You have one primary way to import this Fetcher, which is the same for all fetchers. + +```python +>>> from scrapling.fetchers import DynamicFetcher +``` +Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers) + +Now, we will review most of the arguments one by one, using examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments) + +!!! abstract + + The async version of the `fetch` method is `async_fetch`, of course. + + +This fetcher currently provides three main run options that can be combined as desired. + +Which are: + +### 1. Vanilla Playwright +```python +DynamicFetcher.fetch('https://example.com') +``` +Using it in that manner will open a Chromium browser and load the page. There are optimizations for speed, and some stealth goes automatically under the hood, but other than that, there are no tricks or extra features unless you enable some; it's just a plain PlayWright API. + +### 2. Real Chrome +```python +DynamicFetcher.fetch('https://example.com', real_chrome=True) +``` +If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser you installed on your device instead of Chromium. 
This makes your requests look more authentic and thus less detectable, for better results. + +If you don't have Google Chrome installed and want to use this option, run the command below in the terminal to install it for the library instead of installing it manually: +```commandline +playwright install chrome +``` + +### 3. CDP Connection +```python +DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222') +``` +Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/). + + +!!! note "Notes:" + + * There was a `stealth` option here, but since version 0.3.13, it has been moved to the `StealthyFetcher` class with additional features, as explained on the next page.
+ * This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](../fetching/stealthy.md). + +## Full list of arguments +Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them. + +| Argument | Description | Optional | +|:-------------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| +| url | Target url | โŒ | +| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | โœ”๏ธ | +| disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | โœ”๏ธ | +| cookies | Set cookies for the next request. | โœ”๏ธ | +| useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and version.** | โœ”๏ธ | +| network_idle | Wait for the page until there are no network connections for at least 500 ms. | โœ”๏ธ | +| load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | โœ”๏ธ | +| timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | โœ”๏ธ | +| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | โœ”๏ธ | +| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. 
| โœ”๏ธ | +| wait_selector | Wait for a specific css selector to be in a specific state. | โœ”๏ธ | +| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | โœ”๏ธ | +| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | โœ”๏ธ | +| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | โœ”๏ธ | +| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | โœ”๏ธ | +| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | โœ”๏ธ | +| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | โœ”๏ธ | +| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | โœ”๏ธ | +| timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | โœ”๏ธ | +| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | โœ”๏ธ | +| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | โœ”๏ธ | +| extra_flags | A list of additional browser flags to pass to the browser on launch. 
| โœ”๏ธ | +| additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | โœ”๏ธ | +| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | โœ”๏ธ | +| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | โœ”๏ธ | +| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | โœ”๏ธ | +| retries | Number of retry attempts for failed requests. Defaults to 3. | โœ”๏ธ | +| retry_delay | Seconds to wait between retry attempts. Defaults to 1. | โœ”๏ธ | + +In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`. + +!!! note "Notes:" + + 1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading. + 2. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument. + 3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. 
The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`. + 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions. + + +## Examples +It's easier to understand with examples, so let's take a look. + +### Resource Control + +```python +# Disable unnecessary resources +page = DynamicFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc. +``` + +### Domain Blocking + +```python +# Block requests to specific domains (and their subdomains) +page = DynamicFetcher.fetch('https://example.com', blocked_domains={"ads.example.com", "tracker.net"}) +``` + +### Network Control + +```python +# Wait for network idle (Consider fetch to be finished when there are no network connections for at least 500 ms) +page = DynamicFetcher.fetch('https://example.com', network_idle=True) + +# Custom timeout (in milliseconds) +page = DynamicFetcher.fetch('https://example.com', timeout=30000) # 30 seconds + +# Proxy support (It can also be a dictionary with only the keys 'server', 'username', and 'password'.) 
+page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port') +``` + +### Proxy Rotation + +```python +from scrapling.fetchers import DynamicSession, ProxyRotator + +# Set up proxy rotation +rotator = ProxyRotator([ + "http://proxy1:8080", + "http://proxy2:8080", + "http://proxy3:8080", +]) + +# Use with session - rotates proxy automatically with each request +with DynamicSession(proxy_rotator=rotator, headless=True) as session: + page1 = session.fetch('https://example1.com') + page2 = session.fetch('https://example2.com') + + # Override rotator for a specific request + page3 = session.fetch('https://example3.com', proxy='http://specific-proxy:8080') +``` + +!!! warning + + Remember that by default, all browser-based fetchers and sessions use a persistent browser context with a pool of tabs. However, since browsers can't set a proxy per tab, when you use a `ProxyRotator`, the fetcher will automatically open a separate context for each proxy, with one tab per context. Once the tab's job is done, both the tab and its context are closed. + +### Downloading Files + +```python +page = DynamicFetcher.fetch('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/main_cover.png') + +with open(file='main_cover.png', mode='wb') as f: + f.write(page.body) +``` + +The `body` attribute of the `Response` object always returns `bytes`. + +### Browser Automation +This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues. + +This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want. 
+ +In the example below, I used the page's [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse. +```python +from playwright.sync_api import Page + +def scroll_page(page: Page): +    page.mouse.wheel(10, 0) +    page.mouse.move(100, 400) +    page.mouse.up() + +page = DynamicFetcher.fetch('https://example.com', page_action=scroll_page) +``` +Of course, if you use the async fetch version, the function must also be async. +```python +from playwright.async_api import Page + +async def scroll_page(page: Page): +    await page.mouse.wheel(10, 0) +    await page.mouse.move(100, 400) +    await page.mouse.up() + +page = await DynamicFetcher.async_fetch('https://example.com', page_action=scroll_page) +``` + +### Wait Conditions + +```python +# Wait for the selector +page = DynamicFetcher.fetch( +    'https://example.com', +    wait_selector='h1', +    wait_selector_state='visible' +) +``` +This is the last wait the fetcher performs before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you don't pass a state, the default is `attached`, which means it will wait for the element to be present in the DOM. + +After that, if `load_dom` is enabled (the default), the fetcher will check again whether all JavaScript files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for it to be fulfilled again, as explained above. + +The state the fetcher waits for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)): + +- `attached`: Wait for an element to be present in the DOM. +- `detached`: Wait for an element to not be present in the DOM. 
+- `visible`: wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible. +- `hidden`: wait for an element to be either detached from the DOM, or have an empty bounding box, or `visibility:hidden`. This is opposite to the `'visible'` option. + +### Some Stealth Features + +```python +page = DynamicFetcher.fetch( + 'https://example.com', + google_search=True, + useragent='Mozilla/5.0...', # Custom user agent + locale='en-US', # Set browser locale +) +``` + +### General example +```python +from scrapling.fetchers import DynamicFetcher + +def scrape_dynamic_content(): + # Use Playwright for JavaScript content + page = DynamicFetcher.fetch( + 'https://example.com/dynamic', + network_idle=True, + wait_selector='.content' + ) + + # Extract dynamic content + content = page.css('.content') + + return { + 'title': content.css('h1::text').get(), + 'items': [ + item.text for item in content.css('.item') + ] + } +``` + +## Session Management + +To keep the browser open until you make multiple requests with the same configuration, use `DynamicSession`/`AsyncDynamicSession` classes. Those classes can accept all the arguments that the `fetch` function can take, which enables you to specify a config for the entire session. 
+ +```python +from scrapling.fetchers import DynamicSession + +# Create a session with default configuration +with DynamicSession( + headless=True, + disable_resources=True, + real_chrome=True +) as session: + # Make multiple requests with the same browser instance + page1 = session.fetch('https://example1.com') + page2 = session.fetch('https://example2.com') + page3 = session.fetch('https://dynamic-site.com') + + # All requests reuse the same tab on the same browser instance +``` + +### Async Session Usage + +```python +import asyncio +from scrapling.fetchers import AsyncDynamicSession + +async def scrape_multiple_sites(): + async with AsyncDynamicSession( + network_idle=True, + timeout=30000, + max_pages=3 + ) as session: + # Make async requests with shared browser configuration + pages = await asyncio.gather( + session.fetch('https://spa-app1.com'), + session.fetch('https://spa-app2.com'), + session.fetch('https://dynamic-content.com') + ) + return pages +``` + +You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages that can be displayed at once. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then: + +1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal. +2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive. 
+ +This logic allows for multiple URLs to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :) + +In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the previous configuration used in the request before this one. + +### Session Benefits + +- **Browser reuse**: Much faster subsequent requests by reusing the same browser instance. +- **Cookie persistence**: Automatic cookie and session state handling as any browser does automatically. +- **Consistent fingerprint**: Same browser fingerprint across all requests. +- **Memory efficiency**: Better resource usage compared to launching new browsers with each fetch. + +## When to Use + +Use DynamicFetcher when: + +- Need browser automation +- Want multiple browser options +- Using a real Chrome browser +- Need custom browser config +- Want a few stealth options + +If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md). \ No newline at end of file diff --git a/docs/fetching/static.md b/docs/fetching/static.md new file mode 100644 index 0000000000000000000000000000000000000000..0071785d8f9229c00d2d44cc95f21cb368b3531b --- /dev/null +++ b/docs/fetching/static.md @@ -0,0 +1,439 @@ +# HTTP requests + +The `Fetcher` class provides rapid and lightweight HTTP requests using the high-performance `curl_cffi` library with a lot of stealth capabilities. + +!!! success "Prerequisites" + + 1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use. + 2. 
You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object. + 3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class. + +## Basic Usage +You have one primary way to import this Fetcher, which is the same for all fetchers. + +```python +>>> from scrapling.fetchers import Fetcher +``` +Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers) + +### Shared arguments +All methods for making requests here share some arguments, so let's discuss them first. + +- **url**: The targeted URL +- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets the referer header as if this request came from a Google search of the URL's domain. +- **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default** +- **timeout**: The number of seconds to wait for each request to be finished. **Defaults to 30 seconds**. +- **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**. +- **retry_delay**: Number of seconds to wait between retry attempts. **Defaults to 1 second**. +- **impersonate**: Impersonate specific browsers' TLS fingerprints. Accepts browser strings or a list of them like `"chrome110"`, `"firefox102"`, `"safari15_5"` to use specific versions or `"chrome"`, `"firefox"`, `"safari"`, `"edge"` to automatically use the latest version available. This makes your requests appear to come from real browsers at the TLS level. If you pass it a list of strings, it will choose a random one with each request. 
**Defaults to the latest available Chrome version.**
+- **http3**: Use HTTP/3 protocol for requests. **Defaults to False**. It might be problematic if used with `impersonate`.
+- **cookies**: Cookies to use in the request. Can be a dictionary of `name→value` or a list of dictionaries.
+- **proxy**: As the name implies, the proxy for this request is used to route all traffic (HTTP and HTTPS). The format accepted here is `http://username:password@localhost:8030`.
+- **proxy_auth**: HTTP basic auth for proxy, tuple of (username, password).
+- **proxies**: Dict of proxies to use. Format: `{"http": proxy_url, "https": proxy_url}`.
+- **proxy_rotator**: A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy` or `proxies`.
+- **headers**: Headers to include in the request. Can override any header generated by the `stealthy_headers` argument.
+- **max_redirects**: Maximum number of redirects. **Defaults to 30**, use -1 for unlimited.
+- **verify**: Whether to verify HTTPS certificates. **Defaults to True**.
+- **cert**: Tuple of (cert, key) filenames for the client certificate.
+- **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
+
+!!! note "Notes:"
+
+    1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`).
+ 2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.
+    3. If either `impersonate` or `stealthy_headers` is enabled, the fetchers will automatically generate real browser headers that match the browser version used.
+
+Other than this, for further customization, you can pass any arguments that `curl_cffi` supports for any method if that method doesn't already support them.
+
+### HTTP Methods
+There are additional arguments for each method, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
+
+Examples are the best way to explain this:
+
+> Note: `OPTIONS` and `HEAD` methods are not supported.
+#### GET
+```python
+>>> from scrapling.fetchers import Fetcher
+>>> # Basic GET
+>>> page = Fetcher.get('https://example.com')
+>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
+>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
+>>> # With parameters
+>>> page = Fetcher.get('https://example.com/search', params={'q': 'query'})
+>>>
+>>> # With headers
+>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
+>>> # Basic HTTP authentication
+>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
+>>> # Browser impersonation
+>>> page = Fetcher.get('https://example.com', impersonate='chrome')
+>>> # HTTP/3 support
+>>> page = Fetcher.get('https://example.com', http3=True)
+```
+And for asynchronous requests, it's a small adjustment
+```python
+>>> from scrapling.fetchers import AsyncFetcher
+>>> # Basic GET
+>>> page = await AsyncFetcher.get('https://example.com')
+>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
+>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
+>>> # With parameters
+>>> page = await 
AsyncFetcher.get('https://example.com/search', params={'q': 'query'})
+>>>
+>>> # With headers
+>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
+>>> # Basic HTTP authentication
+>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
+>>> # Browser impersonation
+>>> page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
+>>> # HTTP/3 support
+>>> page = await AsyncFetcher.get('https://example.com', http3=True)
+```
+Needless to say, the `page` object in all cases is a [Response](choosing.md#response-object) object, which is a [Selector](../parsing/main_classes.md#selector) as we said, so you can use it directly:
+```python
+>>> page.css('.something.something')
+
+>>> page = Fetcher.get('https://api.github.com/events')
+>>> page.json()
+[{'id': '',
+  'type': 'PushEvent',
+  'actor': {'id': '',
+   'login': '',
+   'display_login': '',
+   'gravatar_id': '',
+   'url': 'https://api.github.com/users/',
+   'avatar_url': 'https://avatars.githubusercontent.com/u/'},
+  'repo': {'id': '',
+...
+``` +#### POST +```python +>>> from scrapling.fetchers import Fetcher +>>> # Basic POST +>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, params={'q': 'query'}) +>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True) +>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome") +>>> # Another example of form-encoded data +>>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True) +>>> # JSON data +>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'}) +``` +And for asynchronous requests, it's a small adjustment +```python +>>> from scrapling.fetchers import AsyncFetcher +>>> # Basic POST +>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}) +>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True) +>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome") +>>> # Another example of form-encoded data +>>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True) +>>> # JSON data +>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'}) +``` +#### PUT +```python +>>> from scrapling.fetchers import Fetcher +>>> # Basic PUT +>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}) +>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome") +>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, 
proxy='http://username:password@localhost:8030') +>>> # Another example of form-encoded data +>>> page = Fetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']}) +``` +And for asynchronous requests, it's a small adjustment +```python +>>> from scrapling.fetchers import AsyncFetcher +>>> # Basic PUT +>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}) +>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome") +>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030') +>>> # Another example of form-encoded data +>>> page = await AsyncFetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']}) +``` + +#### DELETE +```python +>>> from scrapling.fetchers import Fetcher +>>> page = Fetcher.delete('https://example.com/resource/123') +>>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome") +>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030') +``` +And for asynchronous requests, it's a small adjustment +```python +>>> from scrapling.fetchers import AsyncFetcher +>>> page = await AsyncFetcher.delete('https://example.com/resource/123') +>>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome") +>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030') +``` + +## Session Management + +For making multiple requests with the same configuration, use the `FetcherSession` class. 
It can be used in both synchronous and asynchronous code without issue; the class automatically detects and changes the session type, without requiring a different import. + +The `FetcherSession` class can accept nearly all the arguments that the methods can take, which enables you to specify a config for the entire session and later choose a different config for one of the requests effortlessly, as you will see in the following examples. + +```python +from scrapling.fetchers import FetcherSession + +# Create a session with default configuration +with FetcherSession( + impersonate='chrome', + http3=True, + stealthy_headers=True, + timeout=30, + retries=3 +) as session: + # Make multiple requests with the same settings and the same cookies + page1 = session.get('https://scrapling.requestcatcher.com/get') + page2 = session.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}) + page3 = session.get('https://api.github.com/events') + + # All requests share the same session and connection pool +``` + +You can also use a `ProxyRotator` with `FetcherSession` for automatic proxy rotation across requests: + +```python +from scrapling.fetchers import FetcherSession, ProxyRotator + +rotator = ProxyRotator([ + 'http://proxy1:8080', + 'http://proxy2:8080', + 'http://proxy3:8080', +]) + +with FetcherSession(proxy_rotator=rotator, impersonate='chrome') as session: + # Each request automatically uses the next proxy in rotation + page1 = session.get('https://example.com/page1') + page2 = session.get('https://example.com/page2') + + # You can check which proxy was used via the response metadata + print(page1.meta['proxy']) +``` + +You can also override the session proxy (or rotator) for a specific request by passing `proxy=` directly to the request method: + +```python +with FetcherSession(proxy='http://default-proxy:8080') as session: + # Uses the session proxy + page1 = session.get('https://example.com/page1') + + # Override the proxy for this specific request + 
page2 = session.get('https://example.com/page2', proxy='http://special-proxy:9090') +``` + +And here's an async example + +```python +async with FetcherSession(impersonate='firefox', http3=True) as session: + # All standard HTTP methods available + response = await session.get('https://example.com') + response = await session.post('https://scrapling.requestcatcher.com/post', json={'data': 'value'}) + response = await session.put('https://scrapling.requestcatcher.com/put', data={'update': 'info'}) + response = await session.delete('https://scrapling.requestcatcher.com/delete') +``` +or better +```python +import asyncio +from scrapling.fetchers import FetcherSession + +# Async session usage +async with FetcherSession(impersonate="safari") as session: + urls = ['https://example.com/page1', 'https://example.com/page2'] + + tasks = [ + session.get(url) for url in urls + ] + + pages = await asyncio.gather(*tasks) +``` + +The `Fetcher` class uses `FetcherSession` to create a temporary session with each request you make. 
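The `ProxyRotator` behavior shown above can be pictured as a simple round-robin over the proxy list. Here's a minimal, illustrative stand-in, not Scrapling's actual implementation (the real class may add health checks or weighting):

```python
from itertools import cycle


class RoundRobinRotator:
    """Illustrative stand-in for ProxyRotator: hands out proxies in order, wrapping around."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("At least one proxy is required")
        self._pool = cycle(proxies)  # infinite iterator over the list

    def next_proxy(self) -> str:
        # Each call advances to the next proxy, restarting after the last one
        return next(self._pool)


rotator = RoundRobinRotator(['http://proxy1:8080', 'http://proxy2:8080', 'http://proxy3:8080'])
print([rotator.next_proxy() for _ in range(4)])
# ['http://proxy1:8080', 'http://proxy2:8080', 'http://proxy3:8080', 'http://proxy1:8080']
```

Whatever the internals, the observable contract is the one shown earlier: each request gets the next proxy, and `page.meta['proxy']` tells you which one was used.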
+ +### Session Benefits + +- **A lot faster**: 10 times faster than creating a single session for each request +- **Cookie persistence**: Automatic cookie handling across requests +- **Resource efficiency**: Better memory and CPU usage for multiple requests +- **Centralized configuration**: Single place to manage request settings + +## Examples +Some well-rounded examples to aid newcomers to Web Scraping + +### Basic HTTP Request + +```python +from scrapling.fetchers import Fetcher + +# Make a request +page = Fetcher.get('https://example.com') + +# Check the status +if page.status == 200: + # Extract title + title = page.css('title::text').get() + print(f"Page title: {title}") + + # Extract all links + links = page.css('a::attr(href)').getall() + print(f"Found {len(links)} links") +``` + +### Product Scraping + +```python +from scrapling.fetchers import Fetcher + +def scrape_products(): + page = Fetcher.get('https://example.com/products') + + # Find all product elements + products = page.css('.product') + + results = [] + for product in products: + results.append({ + 'title': product.css('.title::text').get(), + 'price': product.css('.price::text').re_first(r'\d+\.\d{2}'), + 'description': product.css('.description::text').get(), + 'in_stock': product.has_class('in-stock') + }) + + return results +``` + +### Downloading Files + +```python +from scrapling.fetchers import Fetcher + +page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/main_cover.png') +with open(file='main_cover.png', mode='wb') as f: + f.write(page.body) +``` + +### Pagination Handling + +```python +from scrapling.fetchers import Fetcher + +def scrape_all_pages(): + base_url = 'https://example.com/products?page={}' + page_num = 1 + all_products = [] + + while True: + # Get current page + page = Fetcher.get(base_url.format(page_num)) + + # Find products + products = page.css('.product') + if not products: + break + + # Process products + for product in products: + 
all_products.append({
+                'name': product.css('.name::text').get(),
+                'price': product.css('.price::text').get()
+            })
+
+        # Next page
+        page_num += 1
+
+    return all_products
+```
+
+### Form Submission
+
+```python
+from scrapling.fetchers import Fetcher
+
+# Submit login form
+response = Fetcher.post(
+    'https://example.com/login',
+    data={
+        'username': 'user@example.com',
+        'password': 'password123'
+    }
+)
+
+# Check login success
+if response.status == 200:
+    # Extract user info
+    user_name = response.css('.user-name::text').get()
+    print(f"Logged in as: {user_name}")
+```
+
+### Table Extraction
+
+```python
+from scrapling.fetchers import Fetcher
+
+def extract_table():
+    page = Fetcher.get('https://example.com/data')
+
+    # Find table
+    table = page.css('table')[0]
+
+    # Extract headers
+    headers = [
+        th.text for th in table.css('thead th')
+    ]
+
+    # Extract rows
+    rows = []
+    for row in table.css('tbody tr'):
+        cells = [td.text for td in row.css('td')]
+        rows.append(dict(zip(headers, cells)))
+
+    return rows
+```
+
+### Navigation Menu
+
+```python
+from scrapling.fetchers import Fetcher
+
+def extract_menu():
+    page = Fetcher.get('https://example.com')
+
+    # Find navigation
+    nav = page.css('nav')[0]
+
+    menu = {}
+    for item in nav.css('li'):
+        links = item.css('a')
+        if links:
+            link = links[0]
+            menu[link.text] = {
+                'url': link['href'],
+                'has_submenu': bool(item.css('.submenu'))
+            }
+
+    return menu
+```
+
+## When to Use
+
+Use `Fetcher` when:
+
+- Need rapid HTTP requests.
+- Want minimal overhead.
+- Don't need JavaScript execution (the website can be scraped through requests).
+- Need some stealth features (e.g., the targeted website is using protection but doesn't use JavaScript challenges).
+
+Use `FetcherSession` when:
+
+- Making multiple requests to the same or different sites.
+- Need to maintain cookies/authentication between requests.
+- Want connection pooling for better performance.
+- Require consistent configuration across requests.
+- Working with APIs that require a session state. + +Use other fetchers when: + +- Need browser automation. +- Need advanced anti-bot/stealth capabilities. +- Need JavaScript support or interacting with dynamic content \ No newline at end of file diff --git a/docs/fetching/stealthy.md b/docs/fetching/stealthy.md new file mode 100644 index 0000000000000000000000000000000000000000..fcbc8ea3eb3d7d0b2185f79ca626b5f473aa4847 --- /dev/null +++ b/docs/fetching/stealthy.md @@ -0,0 +1,345 @@ +# Fetching dynamic websites with hard protections + +Here, we will discuss the `StealthyFetcher` class. This class is very similar to the [DynamicFetcher](dynamic.md#introduction) class, including the browsers, the automation, and the use of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities; most of them are handled automatically under the hood, and the rest is up to you to enable. + +As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later. + +!!! success "Prerequisites" + + 1. You've completed or read the [DynamicFetcher](dynamic.md#introduction) page since this class builds upon it, and we won't repeat the same information here for that reason. + 2. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use. + 3. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object. + 4. 
You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class. + +## Basic Usage +You have one primary way to import this Fetcher, which is the same for all fetchers. + +```python +>>> from scrapling.fetchers import StealthyFetcher +``` +Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers) + +!!! abstract + + The async version of the `fetch` method is `async_fetch`, of course. + +## What does it do? + +The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynamic.md#introduction) class, and here are some of the things it does: + +1. It easily bypasses all types of Cloudflare's Turnstile/Interstitial automatically. +2. It bypasses CDP runtime leaks and WebRTC leaks. +3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do. +4. It generates canvas noise to prevent fingerprinting through canvas. +5. It automatically patches known methods to detect running in headless mode and provides an option to defeat timezone mismatch attacks. +6. It makes requests look as if they came from Google's search page of the requested website. +7. and other anti-protection options... + +## Full list of arguments +Scrapling provides many options with this fetcher and its session classes. 
Before jumping to the [examples](#examples), here's the full list of arguments:
+
+
+| Argument | Description | Optional |
+|:---:|---|:---:|
+| url | Target URL | ❌ |
+| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
+| disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
+| cookies | Set cookies for the next request. | ✔️ |
+| useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and version.** | ✔️ |
+| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+| load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
+| timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
+| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
+| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
+| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
+| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. 
| ✔️ |
+| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
+| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
+| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
+| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
+| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
+| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
+| timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
+| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
+| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** | ✔️ |
+| extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
+| solve_cloudflare | When enabled, the fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. | ✔️ |
+| block_webrtc | Forces WebRTC to respect proxy settings to prevent local IP address leak. | ✔️ |
+| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. 
| ✔️ |
+| allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
+| additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
+| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
+| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
+| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
+| retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
+| retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
+
+In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing any of the arguments that can be configured at the browser-tab level, like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, `blocked_domains`, `proxy`, and `selector_config`.
+
+!!! note "Notes:"
+
+    1. It's basically the same arguments as the [DynamicFetcher](dynamic.md#introduction) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
+    2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
+    3. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. 
So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+    4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
+
+## Examples
+It's easier to understand with examples, so we will now review most of the arguments individually. Since it's the same class as the [DynamicFetcher](dynamic.md#introduction), you can refer to that page for more examples, as we won't repeat all the examples from there.
+
+### Cloudflare and stealth options
+
+```python
+# Automatic Cloudflare solver
+page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True)
+
+# Works with other stealth options
+page = StealthyFetcher.fetch(
+    'https://protected-site.com',
+    solve_cloudflare=True,
+    block_webrtc=True,
+    real_chrome=True,
+    hide_canvas=True,
+    google_search=True,
+    proxy='http://username:password@host:port', # It can also be a dictionary with only the keys 'server', 'username', and 'password'.
+)
+```
+
+The `solve_cloudflare` parameter enables automatic detection and solving of all types of Cloudflare's Turnstile/Interstitial challenges:
+
+- JavaScript challenges (managed)
+- Interactive challenges (clicking verification boxes)
+- Invisible challenges (automatic background verification)
+
+It even solves custom pages with an embedded captcha.
+
+!!! note "**Important notes:**"
+
+    1. Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to be loaded after solving the captcha. 
Some websites can be the real definition of an edge case while we are trying to make the solver as generic as possible. + 2. The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time. + 3. This feature works seamlessly with proxies and other stealth options. + +### Browser Automation +This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues. + +This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want. + +In the example below, I used the pages' [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse. +```python +from playwright.sync_api import Page + +def scroll_page(page: Page): + page.mouse.wheel(10, 0) + page.mouse.move(100, 400) + page.mouse.up() + +page = StealthyFetcher.fetch('https://example.com', page_action=scroll_page) +``` +Of course, if you use the async fetch version, the function must also be async. +```python +from playwright.async_api import Page + +async def scroll_page(page: Page): + await page.mouse.wheel(10, 0) + await page.mouse.move(100, 400) + await page.mouse.up() + +page = await StealthyFetcher.async_fetch('https://example.com', page_action=scroll_page) +``` + +### Wait Conditions +```python +# Wait for the selector +page = StealthyFetcher.fetch( + 'https://example.com', + wait_selector='h1', + wait_selector_state='visible' +) +``` +This is the last wait the fetcher will do before returning the response (if enabled). 
You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM. + +After that, if `load_dom` is enabled (the default), the fetcher will check again to see if all JavaScript files are loaded and executed (in the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above. + +The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)): + +- `attached`: Wait for an element to be present in the DOM. +- `detached`: Wait for an element to not be present in the DOM. +- `visible`: wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible. +- `hidden`: wait for an element to be either detached from the DOM, or have an empty bounding box, or `visibility:hidden`. This is opposite to the `'visible'` option. 
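The relationships between these states can be made concrete with a tiny model: think of each state as a predicate over whether the element is present in the DOM and whether it is visible. This is an illustrative sketch only; Playwright evaluates these against the live page, and `state_satisfied` is a made-up helper name:

```python
def state_satisfied(state: str, present: bool, visible: bool) -> bool:
    """Model of the four wait states: `visible` requires presence plus visibility,
    and `hidden` is the exact complement of `visible`."""
    if state == 'attached':
        return present
    if state == 'detached':
        return not present
    if state == 'visible':
        return present and visible
    if state == 'hidden':
        return not (present and visible)
    raise ValueError(f"Unknown state: {state}")


# An element that is in the DOM but styled with `display:none`:
print(state_satisfied('attached', present=True, visible=False))  # True
print(state_satisfied('visible', present=True, visible=False))   # False
print(state_satisfied('hidden', present=True, visible=False))    # True
```

In practice, this is why `visible` is usually the safer choice for `wait_selector_state` when you intend to interact with the element: `attached` can succeed while the element is still hidden.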
+ + +### Real-world example (Amazon) +This is for educational purposes only; this example was generated by AI, which also shows how easy it is to work with Scrapling through AI +```python +def scrape_amazon_product(url): + # Use StealthyFetcher to bypass protection + page = StealthyFetcher.fetch(url) + + # Extract product details + return { + 'title': page.css('#productTitle::text').get().clean(), + 'price': page.css('.a-price .a-offscreen::text').get(), + 'rating': page.css('[data-feature-name="averageCustomerReviews"] .a-popover-trigger .a-color-base::text').get(), + 'reviews_count': page.css('#acrCustomerReviewText::text').re_first(r'[\d,]+'), + 'features': [ + li.get().clean() for li in page.css('#feature-bullets li span::text') + ], + 'availability': page.css('#availability')[0].get_all_text(strip=True), + 'images': [ + img.attrib['src'] for img in page.css('#altImages img') + ] + } +``` + +## Session Management + +To keep the browser open until you make multiple requests with the same configuration, use `StealthySession`/`AsyncStealthySession` classes. Those classes can accept all the arguments that the `fetch` function can take, which enables you to specify a config for the entire session. 
+
+```python
+from scrapling.fetchers import StealthySession
+
+# Create a session with default configuration
+with StealthySession(
+    headless=True,
+    real_chrome=True,
+    block_webrtc=True,
+    solve_cloudflare=True
+) as session:
+    # Make multiple requests with the same browser instance
+    page1 = session.fetch('https://example1.com')
+    page2 = session.fetch('https://example2.com')
+    page3 = session.fetch('https://nopecha.com/demo/cloudflare')
+
+    # All requests reuse the same tab on the same browser instance
+```
+
+### Async Session Usage
+
+```python
+import asyncio
+from scrapling.fetchers import AsyncStealthySession
+
+async def scrape_multiple_sites():
+    async with AsyncStealthySession(
+        real_chrome=True,
+        block_webrtc=True,
+        solve_cloudflare=True,
+        timeout=60000,  # 60 seconds for Cloudflare challenges
+        max_pages=3
+    ) as session:
+        # Make async requests with shared browser configuration
+        pages = await asyncio.gather(
+            session.fetch('https://site1.com'),
+            session.fetch('https://site2.com'),
+            session.fetch('https://protected-site.com')
+        )
+        return pages
+```
+
+You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of tabs that can be open at once. With each request, the library closes all tabs that have finished their task and checks whether the number of currently open tabs is below the allowed maximum, then:
+
+1. If you are within the allowed range, the fetcher will create a new tab for you, and then everything proceeds as normal.
+2. Otherwise, it will keep checking several times per second, for up to 60 seconds, whether a new tab can be created, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive. 
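The waiting behavior in step 2 above can be sketched as a simple polling loop. This is a rough illustration of the described behavior with hypothetical names; `acquire_tab_slot` and `count_open_tabs` are not part of Scrapling's API:

```python
import time


def acquire_tab_slot(count_open_tabs, max_pages: int, timeout: float = 60.0, poll: float = 0.05) -> None:
    """Block until the pool has room for another tab, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if count_open_tabs() < max_pages:
            return  # a slot is free; the caller can open a new tab
        time.sleep(poll)  # re-check several times per second
    raise TimeoutError("No tab slot became available within the timeout")


# Simulate a pool where one tab finishes on each check
state = {'open': 5}

def fake_count():
    state['open'] = max(0, state['open'] - 1)
    return state['open']

acquire_tab_slot(fake_count, max_pages=3, timeout=1.0)  # returns once fewer than 3 tabs are open
```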
+
+This logic allows multiple URLs to be fetched concurrently in the same browser, which saves a lot of resources, but most importantly, is fast :)
+
+In versions 0.3 and 0.3.1, the pool reused finished tabs to save even more resources/time. That logic proved flawed, as it's nearly impossible to protect tabs from contamination by the configuration used in the previous request.
+
+### Session Benefits
+
+- **Browser reuse**: Much faster subsequent requests by reusing the same browser instance.
+- **Cookie persistence**: Cookies and session state are carried across requests automatically, just as in a regular browser.
+- **Consistent fingerprint**: Same browser fingerprint across all requests.
+- **Memory efficiency**: Better resource usage compared to launching a new browser with each fetch.
+
+## Using Camoufox as an engine
+
+Before version 0.3.13, this fetcher used a custom version of [Camoufox](https://github.com/daijro/camoufox) as its engine; it was replaced by [patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright) for several reasons. If Camoufox is stable on your device, doesn't cause high memory usage, and you want to continue using it, you can.
+
+First, you will need to install the Camoufox library, browser, and Firefox system dependencies if you haven't already:
+```commandline
+pip install camoufox
+playwright install-deps firefox
+camoufox fetch
+```
+Then subclass `StealthySession` and override its `start` method as below:
+```python
+from scrapling.fetchers import StealthySession
+from playwright.sync_api import sync_playwright
+from camoufox.utils import launch_options as generate_launch_options

+class StealthySession(StealthySession):
+    def start(self):
+        """Create a browser for this instance and context."""
+        if not self.playwright:
+            self.playwright = sync_playwright().start()
+            # Configure the Camoufox run options here, e.g., a minimal config:
+            launch_options = generate_launch_options(**{"headless": True, "user_data_dir": ''})
+            # Or a fuller example, part of what we have been doing before v0.3.13:
+            launch_options = generate_launch_options(**{
+                "geoip": False,
+                "proxy": self._config.proxy,
+                "headless": self._config.headless,
+                "humanize": True if self._config.solve_cloudflare else False,  # Better to enable humanize for Cloudflare; otherwise it's up to you
+                "i_know_what_im_doing": True,  # To turn off warnings about the user configurations
+                "allow_webgl": self._config.allow_webgl,
+                "block_webrtc": self._config.block_webrtc,
+                "os": None,
+                "user_data_dir": self._config.user_data_dir,
+                "firefox_user_prefs": {
+                    # This is what enabling `enable_cache` does internally, so we do it from here instead
+                    "browser.sessionhistory.max_entries": 10,
+                    "browser.sessionhistory.max_total_viewers": -1,
+                    "browser.cache.memory.enable": True,
+                    "browser.cache.disk_cache_ssl": True,
+                    "browser.cache.disk.smart_size.enabled": True,
+                },
+                # etc...
+            })
+            self.context = self.playwright.firefox.launch_persistent_context(**launch_options)
+        else:
+            raise RuntimeError("Session has already been started")
+```
+After that, you can use it normally as before, even for solving Cloudflare challenges:
+```python
+with StealthySession(solve_cloudflare=True, headless=True) as session:
+    page = session.fetch('https://sergiodemo.com/security/challenge/legacy-challenge')
+    if page.css('#page-not-found-404'):
+        print('Cloudflare challenge solved successfully!')
+```
+
+The same logic applies to the `AsyncStealthySession` class with a few differences:
+```python
+from scrapling.fetchers import AsyncStealthySession
+from playwright.async_api import async_playwright
+from camoufox.utils import launch_options as generate_launch_options
+
+class AsyncStealthySession(AsyncStealthySession):
+    async def start(self):
+        """Create a browser for this instance and context."""
+        if not self.playwright:
+            self.playwright = await async_playwright().start()
+            # Configure the Camoufox run options here
+            launch_options = generate_launch_options(**{"headless": True, "user_data_dir": ''})
+            # or set the launch options as in the example above
+            self.context = await self.playwright.firefox.launch_persistent_context(**launch_options)
+        else:
+            raise RuntimeError("Session has already been started")
+
+async with AsyncStealthySession(solve_cloudflare=True, headless=True) as session:
+    page = await session.fetch('https://sergiodemo.com/security/challenge/legacy-challenge')
+    if page.css('#page-not-found-404'):
+        print('Cloudflare challenge solved successfully!')
+```
+
+Enjoy!
:)
+
+## When to Use
+
+Use StealthyFetcher when:
+
+- You need to bypass anti-bot protection
+- You need a reliable, consistent browser fingerprint
+- You need full JavaScript support
+- You want automatic stealth features
+- You need browser automation
+- You are dealing with Cloudflare protection
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000000000000000000000000000000000000..2bafd8044caa683a4f2879096dc0616b802ee780
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,213 @@
+
+
+ + +

Effortless Web Scraping for the Modern Web


+
+Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.
+
+Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.
+
+Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
+
+```python
+from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher
+StealthyFetcher.adaptive = True
+page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar!
+products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
+products = page.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them!
+```
+Or scale up to full crawls:
+```python
+from scrapling.spiders import Spider, Response
+
+class MySpider(Spider):
+    name = "demo"
+    start_urls = ["https://example.com/"]
+
+    async def parse(self, response: Response):
+        for item in response.css('.product'):
+            yield {"title": item.css('h2::text').get()}
+
+MySpider().start()
+```
+
+## Top Sponsors
+
+
+ + + + + + + + + +
+
+
+Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci/sponsorships?tier_id=435495) and enjoy the rest of the perks!
+
+## Key Features
+
+### Spiders — A Full Crawling Framework
+- 🕷️ **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
+- ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
+- 🔄 **Multi-Session Support**: Unified interface for HTTP requests and stealthy headless browsers in a single spider — route requests to different sessions by ID.
+- 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
+- 📡 **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats — ideal for UIs, pipelines, and long-running crawls.
+- 🛡️ **Blocked Request Detection**: Automatic detection and retry of blocked requests with customizable logic.
+- 📦 **Built-in Export**: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with `result.items.to_json()` / `result.items.to_jsonl()` respectively.
+
+### Advanced Website Fetching with Session Support
+- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
+- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, supporting Playwright's Chromium and Google's Chrome.
+- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
+- **Session Management**: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
+- **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
+- **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
+- **Async Support**: Complete async support across all fetchers and dedicated async session classes.
+
+### Adaptive Scraping & AI Integration
+- 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
+- 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
+- 🔍 **Find Similar Elements**: Automatically locate elements similar to the ones you've found.
+- 🤖 **MCP Server to use with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc.), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
+
+### High-Performance & Battle-Tested Architecture
+- 🚀 **Lightning Fast**: Optimized performance outperforming most Python scraping libraries.
+- 🔋 **Memory Efficient**: Optimized data structures and lazy loading for a minimal memory footprint.
+- ⚡ **Fast JSON Serialization**: 10x faster than the standard library.
+- 🏗️ **Battle-tested**: Not only does Scrapling have 92% test coverage and full type-hint coverage, but it has also been used daily by hundreds of Web Scrapers over the past year.
+
+### Developer/Web Scraper Friendly Experience
+- 🎯 **Interactive Web Scraping Shell**: An optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up Web Scraping script development, like converting curl requests to Scrapling requests and viewing request results in your browser.
+- 🚀 **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
+- 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
+- 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
+- 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
+- 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
+- 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** with each change.
+- 🔋 **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.
+
+
+## Star History
+Scrapling's GitHub stars have grown steadily since its release (see chart below).
+
+
+
+
+
+
+## Installation
+Scrapling requires Python 3.10 or higher:
+
+```bash
+pip install scrapling
+```
+
+This installation only includes the parser engine and its dependencies, without any fetcher or command-line dependencies.
+
+### Optional Dependencies
+
+1. If you are going to use any of the extra features below, the fetchers, or their classes, you will need to install the fetchers' dependencies and their browser dependencies as follows:
+    ```bash
+    pip install "scrapling[fetchers]"
+
+    scrapling install # normal install
+    scrapling install --force # force reinstall
+    ```
+
+    This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.
+
+    Or you can install them from code instead of running a command, like this:
+    ```python
+    from scrapling.cli import install
+
+    install([], standalone_mode=False) # normal install
+    install(["--force"], standalone_mode=False) # force reinstall
+    ```
+
+2. Extra features:
+
+
+    - Install the MCP server feature:
+    ```bash
+    pip install "scrapling[ai]"
+    ```
+    - Install shell features (the Web Scraping shell and the `extract` command):
+    ```bash
+    pip install "scrapling[shell]"
+    ```
+    - Install everything:
+    ```bash
+    pip install "scrapling[all]"
+    ```
+    Don't forget that you need to install the browser dependencies with `scrapling install` after any of these extras (if you didn't already).
+
+### Docker
+You can also pull a Docker image with all extras and browsers with the following command from DockerHub:
+```bash
+docker pull pyd4vinci/scrapling
+```
+Or download it from the GitHub registry:
+```bash
+docker pull ghcr.io/d4vinci/scrapling:latest
+```
+This image is automatically built and pushed using GitHub Actions and the repository's main branch.
+
+## How the documentation is organized
+Scrapling has extensive documentation, so we try to follow the [Diátaxis documentation framework](https://diataxis.fr/).
+
+## Support
+
+If you like Scrapling and want to support its development:
+
+- ⭐ Star the [GitHub repository](https://github.com/D4Vinci/Scrapling)
+- 🚀 Follow us on [Twitter](https://x.com/Scrapling_dev) and join the [Discord server](https://discord.gg/EMgGbDceNQ)
+- 💝 Consider [sponsoring the project or buying me a coffee](donate.md) :wink:
+- 🐛 Report bugs and suggest features through [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues)
+
+## License
+
+This project is licensed under the BSD-3 License. See the [LICENSE](https://github.com/D4Vinci/Scrapling/blob/main/LICENSE) file for details.
\ No newline at end of file
diff --git a/docs/overrides/main.html b/docs/overrides/main.html
new file mode 100644
index 0000000000000000000000000000000000000000..2c3c5129edb5f29c92306d52042ad511f7937f57
--- /dev/null
+++ b/docs/overrides/main.html
@@ -0,0 +1,21 @@
+{% extends "base.html" %}
+
+{% block extrahead %}
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+{% endblock %}
\ No newline at end of file
diff --git a/docs/overview.md b/docs/overview.md
new file mode 100644
index 0000000000000000000000000000000000000000..aa91967ed56985fb41e0e1d14a472d28270fd50e
--- /dev/null
+++ b/docs/overview.md
@@ -0,0 +1,345 @@
+## Pick Your Path
+
+Not sure where to start? Pick the path that matches what you're trying to do:
+
+| I want to... | Start here |
+|:---|:---|
+| **Parse HTML** I already have | [Querying elements](parsing/selection.md) — CSS, XPath, and text-based selection |
+| **Quickly scrape a page** and prototype | Pick a [fetcher](fetching/choosing.md) and test right away, or launch the [interactive shell](cli/interactive-shell.md) |
+| **Build a crawler** that scales | [Spiders](spiders/getting-started.md) — concurrent, multi-session crawls with pause/resume |
+| **Scrape without writing code** | [CLI extract commands](cli/extract-commands.md) or hook up the [MCP server](ai/mcp-server.md) to your favourite AI tool |
+| **Migrate** from another library | [From BeautifulSoup](tutorials/migrating_from_beautifulsoup.md) or [Scrapy comparison](spiders/architecture.md#comparison-with-scrapy) |
+
+---
+
+We will start by quickly reviewing the parsing capabilities. Then we will fetch websites using custom browsers, make requests, and parse the responses.
+
+Here's an HTML document generated by ChatGPT that we will be using as an example throughout this page:
+```html
+<html><head><title>Complex Web Page</title>
+<style>
+    .hidden { display: none; }
+</style>
+</head>
+<body>
+    <header><nav><ul><li> <a href="#home">Home</a></li>
+        <li><a href="#about">About</a></li>
+        <li><a href="#contact">Contact</a></li>
+    </ul></nav></header>
+    <main><section id="products" schema='{"jsonable": "data"}'><h2>Products</h2>
+        <div class="product-list">
+            <article class="product" data-id="1"><h3>Product 1</h3>
+                <p class="description">This is product 1</p>
+                <span class="price">$10.99</span>
+                <div class="hidden stock">In stock: 5</div>
+            </article><article class="product" data-id="2"><h3>Product 2</h3>
+                <p class="description">This is product 2</p>
+                <span class="price">$20.99</span>
+                <div class="hidden stock">In stock: 3</div>
+            </article><article class="product" data-id="3"><h3>Product 3</h3>
+                <p class="description">This is product 3</p>
+                <span class="price">$15.99</span>
+                <div class="hidden stock">Out of stock</div>
+            </article></div>
+        </section>
+        <section id="reviews"><h2>Customer Reviews</h2>
+            <div class="review-list">
+                <div class="review">
+                    <p>Great product!</p>
+                    <span class="author">John Doe</span>
+                </div>
+                <div class="review">
+                    <p>Good value for money.</p>
+                    <span class="author">Jane Smith</span>
+                </div>
+            </div>
+        </section>
+    </main>
+    <script>
+        // ...
+    </script>
+</body>
+</html>
+```
+Starting with loading the raw HTML above like this
+```python
+from scrapling.parser import Selector
+page = Selector(html_doc)
+page # <data='<html><head><title>Complex Web Page</tit...'>
+```
+Get all text content on the page recursively
+```python
+page.get_all_text(ignore_tags=('script', 'style'))
+# 'Complex Web Page\nHome\nAbout\nContact\nProducts\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock\nCustomer Reviews\nGreat product!\nJohn Doe\nGood value for money.\nJane Smith'
+```
+
+## Finding elements
+If there's an element you want to find on the page, you will find it! Your creativity level is the only limitation!
+
+Finding the first HTML `section` element
+```python
+section_element = page.find('section')
+# <data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>
+```
+Find all `section` elements
+```python
+section_elements = page.find_all('section')
+# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>, <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
+```
+Find all `section` elements whose `id` attribute value is `products`.
+```python
+section_elements = page.find_all('section', {'id':"products"})
+# Same as
+section_elements = page.find_all('section', id="products")
+# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
+```
+Find all `section` elements whose `id` attribute value contains `product`.
+```python
+section_elements = page.find_all('section', {'id*':"product"})
+```
+Find all `h3` elements whose text content matches the regex `Product \d`
+```python
+import re
+
+page.find_all('h3', re.compile(r'Product \d'))
+# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
+```
+Find all `h3` and `h2` elements whose text content matches the regex `Product` only
+```python
+page.find_all(['h3', 'h2'], re.compile(r'Product'))
+# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
+```
+Find all elements whose text content matches `Products` exactly (whitespace is ignored)
+```python
+page.find_by_text('Products', first_match=False)
+# [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
+```
+Or find all elements whose text content matches the regex `Product \d`
+```python
+page.find_by_regex(r'Product \d', first_match=False)
+# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
+```
+Find all elements that are similar to an element you've found
+```python
+target_element = page.find_by_regex(r'Product \d', first_match=True)
+# <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>
+target_element.find_similar()
+# [<data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
+```
+Find the first element that matches a CSS selector
+```python
+page.css('.product-list [data-id="1"]')[0]
+# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
+```
+Find all elements that match a CSS selector
+```python
+page.css('.product-list article')
+# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
+```
+Find the first element that matches an XPath selector
+```python
+page.xpath("//*[@id='products']/div/article")[0]
+# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
+```
+Find all elements that match an XPath selector
+```python
+page.xpath("//*[@id='products']/div/article")
+# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
+```
+
+With this, we've just scratched the surface of these functions; more advanced options for these selection methods are shown later.
+## Accessing elements' data +It's as simple as +```python +>>> section_element.tag +'section' +>>> print(section_element.attrib) +{'id': 'products', 'schema': '{"jsonable": "data"}'} +>>> section_element.attrib['schema'].json() # If an attribute value can be converted to json, then use `.json()` to convert it +{'jsonable': 'data'} +>>> section_element.text # Direct text content +'' +>>> section_element.get_all_text() # All text content recursively +'Products\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock' +>>> section_element.html_content # The HTML content of the element +'<section id="products" schema=\'{"jsonable": "data"}\'><h2>Products</h2>\n <div class="product-list">\n <article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article><article class="product" data-id="2"><h3>Product 2</h3>\n <p class="description">This is product 2</p>\n <span class="price">$20.99</span>\n <div class="hidden stock">In stock: 3</div>\n </article><article class="product" data-id="3"><h3>Product 3</h3>\n <p class="description">This is product 3</p>\n <span class="price">$15.99</span>\n <div class="hidden stock">Out of stock</div>\n </article></div>\n </section>' +>>> print(section_element.prettify()) # The prettified version +''' +<section id="products" schema='{"jsonable": "data"}'><h2>Products</h2> + <div class="product-list"> + <article class="product" data-id="1"><h3>Product 1</h3> + <p class="description">This is product 1</p> + <span class="price">$10.99</span> + <div class="hidden stock">In stock: 5</div> + </article><article class="product" data-id="2"><h3>Product 2</h3> + <p class="description">This is product 2</p> + <span class="price">$20.99</span> + <div class="hidden stock">In stock: 3</div> + </article><article class="product" 
data-id="3"><h3>Product 3</h3> + <p class="description">This is product 3</p> + <span class="price">$15.99</span> + <div class="hidden stock">Out of stock</div> + </article> + </div> +</section> +''' +>>> section_element.path # All the ancestors in the DOM tree of this element +[<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>, + <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>, + <data='<html><head><title>Complex Web Page</tit...'>] +>>> section_element.generate_css_selector +'#products' +>>> section_element.generate_full_css_selector +'body > main > #products > #products' +>>> section_element.generate_xpath_selector +"//*[@id='products']" +>>> section_element.generate_full_xpath_selector +"//body/main/*[@id='products']" +``` + +## Navigation +Using the elements we found above + +```python +>>> section_element.parent +<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'> +>>> section_element.parent.tag +'main' +>>> section_element.parent.parent.tag +'body' +>>> section_element.children +[<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>, + <data='<div class="product-list"> <article clas...' parent='<section id="products" schema='{"jsonabl...'>] +>>> section_element.siblings +[<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>] +>>> section_element.next # gets the next element, the same logic applies to `quote.previous`. +<data='<section id="reviews"><h2>Customer Revie...' 
parent='<main><section id="products" schema='{"j...'>
+>>> section_element.children.css('h2::text').getall()
+['Products']
+>>> page.css('[data-id="1"]')[0].has_class('product')
+True
+```
+If you need more than the element's direct parent, you can iterate over the whole ancestor tree of any element, as below:
+```python
+for ancestor in section_element.iterancestors():
+    # do something with it...
+```
+You can also search for a specific ancestor of an element that satisfies a condition; just pass a function that takes a `Selector` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
+```python
+>>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
+<data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
+```
+
+## Fetching websites
+Instead of passing raw HTML to Scrapling, you can retrieve a website's response directly, via HTTP requests or by fetching it in a browser.
+
+There's a fetcher for every use case.
+
+### HTTP Requests
+For simple HTTP requests, there's the `Fetcher` class, which can be imported and used as below:
+```python
+from scrapling.fetchers import Fetcher
+page = Fetcher.get('https://scrapling.requestcatcher.com/get', impersonate="chrome")
+```
+Here's how to use all the HTTP methods:
+```python
+>>> from scrapling.fetchers import Fetcher
+>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
+>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
+>>> page = Fetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'})
+>>> page = Fetcher.delete('https://scrapling.requestcatcher.com/delete')
+```
+For async requests, just change the import:
+```python
+>>> from scrapling.fetchers import AsyncFetcher
+>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
+>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
+>>> page = await AsyncFetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'})
+>>> page = await AsyncFetcher.delete('https://scrapling.requestcatcher.com/delete')
+```
+
+!!! note "Notes:"
+
+    1. The `stealthy_headers` argument, which is enabled by default, generates real browser headers and uses them for the request, including a referer header that makes the request look as if it came from a Google search for this domain.
+    2. The `impersonate` argument lets you fake the TLS fingerprint of a specific browser version.
+    3. 
There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, making them look more authentic.
+
+This is just the tip of the iceberg with this fetcher; check out the rest from [here](fetching/static.md).
+
+### Dynamic loading
+If you deal with dynamic websites, like most websites today, we have you covered!
+
+The `DynamicFetcher` class (formerly `PlayWrightFetcher`) offers many options for fetching and loading web pages using Chromium-based browsers.
+```python
+>>> from scrapling.fetchers import DynamicFetcher
+>>> page = DynamicFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
+>>> page.css("#search a::attr(href)").get()
+'https://github.com/D4Vinci/Scrapling'
+>>> # The async version of fetch
+>>> page = await DynamicFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
+>>> page.css("#search a::attr(href)").get()
+'https://github.com/D4Vinci/Scrapling'
+```
+It's built on top of [Playwright](https://playwright.dev/python/) and currently provides two main run options that can be mixed as you want:
+
+- Vanilla Playwright without any modifications other than the ones you chose. It uses the Chromium browser.
+- A real browser, like your installed Chrome, by passing the `real_chrome` argument or your browser's CDP URL for the fetcher to control; most of the options can be enabled on it.
+
+
+Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.
+
+### Dynamic anti-protection loading
+We also have you covered if you deal with dynamic websites that use annoying anti-bot protections!
+
+The `StealthyFetcher` class uses a stealthy version of the `DynamicFetcher` explained above.
+
+Some of the things it does:
+
+1. It easily bypasses all types of Cloudflare's Turnstile/Interstitial automatically.
+2. 
It bypasses CDP runtime leaks and WebRTC leaks.
+3. It isolates JS execution, removes many Playwright fingerprints, and blocks detection through known bot-like behaviors.
+4. It generates canvas noise to prevent fingerprinting through canvas.
+5. It automatically patches known headless-detection methods and provides an option to defeat timezone-mismatch attacks.
+6. It makes requests look as if they came from a Google search for the requested website.
+7. And other anti-protection options...
+
+```python
+>>> from scrapling.fetchers import StealthyFetcher
+>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection') # Running headless by default
+>>> page.status == 200
+True
+>>> page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True) # Solve Cloudflare captcha automatically if presented
+>>> page.status == 200
+True
+>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True) # and the rest of arguments...
+>>> # The async version of fetch
+>>> page = await StealthyFetcher.async_fetch('https://www.browserscan.net/bot-detection')
+>>> page.status == 200
+True
+```
+
+Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/stealthy.md) for all details and the complete list of arguments.
+
+---
+
+That's Scrapling at a glance. If you want to learn more, continue to the next section.
\ No newline at end of file
diff --git a/docs/parsing/adaptive.md b/docs/parsing/adaptive.md
new file mode 100644
index 0000000000000000000000000000000000000000..23dcaf3c19436f7442380d44ae3a75cac742f78e
--- /dev/null
+++ b/docs/parsing/adaptive.md
@@ -0,0 +1,230 @@
+# Adaptive scraping
+
+!!! success "Prerequisites"
+
+    1. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector) object.
+ 2. You've completed or read the [Main classes](../parsing/main_classes.md) page to understand the [Selector](../parsing/main_classes.md#selector) class. + +Adaptive scraping (previously known as automatch) is one of Scrapling's most powerful features. It allows your scraper to survive website changes by intelligently tracking and relocating elements. + +Let's say you are scraping a page with a structure like this: +```html +<div class="container"> + <section class="products"> + <article class="product" id="p1"> + <h3>Product 1</h3> + <p class="description">Description 1</p> + </article> + <article class="product" id="p2"> + <h3>Product 2</h3> + <p class="description">Description 2</p> + </article> + </section> +</div> +``` +And you want to scrape the first product, the one with the `p1` ID. You will probably write a selector like this +```python +page.css('#p1') +``` +When website owners implement structural changes like +```html +<div class="new-container"> + <div class="product-wrapper"> + <section class="products"> + <article class="product new-class" data-id="p1"> + <div class="product-info"> + <h3>Product 1</h3> + <p class="new-description">Description 1</p> + </div> + </article> + <article class="product new-class" data-id="p2"> + <div class="product-info"> + <h3>Product 2</h3> + <p class="new-description">Description 2</p> + </div> + </article> + </section> + </div> +</div> +``` +The selector will no longer function, and your code needs maintenance. That's where Scrapling's `adaptive` feature comes into play. 
With Scrapling, you can enable the `adaptive` feature the first time you select an element. Scrapling remembers that element's unique properties, and the next time the selector fails to find it, Scrapling searches the website for the element with the highest similarity to the saved one, all without AI :)

```python
from scrapling import Selector, Fetcher
# Before the change
page = Selector(page_source, adaptive=True, url='example.com')
# or
Fetcher.adaptive = True
page = Fetcher.get('https://example.com')
# then
element = page.css('#p1', auto_save=True)
if not element:  # One day website changes?
    element = page.css('#p1', adaptive=True)  # Scrapling still finds it!
# the rest of your code...
```
Below, I will show you an example of how to use this feature; then we will dive deep into its details. Note that it works with all selection methods, not just CSS/XPath selection.

## Real-World Scenario
Let's use a real website as an example and use one of the fetchers to fetch its source. To achieve this, we would need to identify a website that is about to update its design/structure, copy its source, and then wait for the website to change. Of course, that's nearly impossible to predict unless I personally know the website's owner, and that would make it a staged test.

To solve this issue, I will use [The Web Archive](https://archive.org/)'s [Wayback Machine](https://web.archive.org/). Here is a copy of [StackOverflow's website in 2010](https://web.archive.org/web/20100102003420/http://stackoverflow.com/); pretty old, eh?<br/>Let's see if the adaptive feature can extract the same button from both the old 2010 design and the current design using the same selector :)

If I want to extract the Questions button from the old design, I can use a selector like this: `#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a`. This selector is too specific because it was auto-generated by Chrome's DevTools.
Now, let's test the same selector in both versions
```python
>>> from scrapling import Fetcher
>>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>>> new_url = "https://stackoverflow.com/"
>>> Fetcher.configure(adaptive=True, adaptive_domain='stackoverflow.com')
>>>
>>> page = Fetcher.get(old_url, timeout=30)
>>> element1 = page.css(selector, auto_save=True)[0]
>>>
>>> # Same selector but used on the updated website
>>> page = Fetcher.get(new_url)
>>> element2 = page.css(selector, adaptive=True)[0]
>>>
>>> if element1.text == element2.text:
...     print('Scrapling found the same element in the old and new designs!')
Scrapling found the same element in the old and new designs!
```
Note that I introduced a new argument called `adaptive_domain`. This is because, to Scrapling, these are two different domains (`archive.org` and `stackoverflow.com`), so it would isolate their `adaptive` data. To tell Scrapling that they are the same website, we pass the custom domain we want both requests to save their `adaptive` data under, so the data isn't isolated.

In a real-world scenario, the code would be the same, except it would use the same URL for both requests, so you wouldn't need the `adaptive_domain` argument. This is the closest example I can give to real-world cases, so I hope it didn't confuse you :)

Notice that in the two examples above, I used both the `Selector` and `Fetcher` classes to show that the adaptive logic is the same.

!!! info

    The main reason for creating the `adaptive_domain` argument was to handle cases where a website changes its URL along with its design/structure. In that case, you can use it to keep using the previously stored adaptive data with the new URL. Otherwise, Scrapling will consider it a new website and discard the old data.
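To make the domain-isolation idea concrete, here is a minimal sketch of a store keyed by `(domain, identifier)`. This is purely illustrative: the `save`/`retrieve` helpers and the table layout are hypothetical and not Scrapling's actual storage schema.

```python
import json
import sqlite3

# Toy (domain, identifier) -> properties store. This is NOT Scrapling's real
# schema; it only illustrates why the domain is part of the lookup key.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE elements ("
    "domain TEXT, identifier TEXT, props TEXT, "
    "PRIMARY KEY (domain, identifier))"
)

def save(domain: str, identifier: str, props: dict) -> None:
    # Saving overwrites any previous entry for the same (domain, identifier)
    con.execute(
        "INSERT OR REPLACE INTO elements VALUES (?, ?, ?)",
        (domain, identifier, json.dumps(props)),
    )

def retrieve(domain: str, identifier: str):
    row = con.execute(
        "SELECT props FROM elements WHERE domain = ? AND identifier = ?",
        (domain, identifier),
    ).fetchone()
    return json.loads(row[0]) if row else None

selector = "#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a"

# Data saved under one domain is invisible to lookups under another...
save("web.archive.org", selector, {"tag": "a", "text": "Questions"})
print(retrieve("stackoverflow.com", selector))  # None

# ...unless both requests save/query under one shared custom domain,
# which is what passing `adaptive_domain` achieves above.
save("stackoverflow.com", selector, {"tag": "a", "text": "Questions"})
print(retrieve("stackoverflow.com", selector))  # {'tag': 'a', 'text': 'Questions'}
```

In other words, a shared `adaptive_domain` simply makes both requests read and write the same key space.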
## How the adaptive scraping feature works
Adaptive scraping works in two phases:

1. **Save Phase**: Store unique properties of elements
2. **Match Phase**: Find elements with similar properties later

Let's say you've selected an element through any method and want the library to find it the next time you scrape this website, even if it undergoes structural/design changes.

With as few technical details as possible, the general logic goes as follows:

 1. You tell Scrapling to save that element's unique properties in one of the ways we will show below.
 2. Scrapling uses its configured database (SQLite by default) and saves each element's unique properties.
 3. Now, because everything about the element can be changed or removed by the website's owner(s), nothing from the element itself can serve as a unique identifier in the database. To solve this issue, the storage system relies on two things:
     1. The domain of the current website. If you are using the `Selector` class, pass it when initializing; if you are using a fetcher, the domain will be automatically taken from the URL.
     2. An `identifier` to query that element's properties from the database. You don't always have to set the identifier yourself; we'll discuss this later.

    Together, they will later be used to retrieve the element's unique properties from the database.

 4. Later, when the website's structure changes, you tell Scrapling to find the element by enabling `adaptive`. Scrapling retrieves the element's saved unique properties and matches every element on the page against them, calculating a similarity score for each. Everything is taken into consideration in that comparison, as you will see later.
 5. The element(s) with the highest similarity score to the wanted element are returned.
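The matching phase above can be sketched with a toy similarity scorer. This is not Scrapling's actual algorithm; the property names and the equal weighting below are made up, but it shows the idea of steps 4 and 5: every saved property contributes a fuzzy score, and the candidate with the highest score wins.

```python
from difflib import SequenceMatcher

# Toy similarity scorer, NOT Scrapling's real matching algorithm. Because the
# comparison is fuzzy, even the order of values (e.g., class names) matters.
def similarity(saved: dict, candidate: dict) -> float:
    scores = [
        SequenceMatcher(None, saved.get(key, ""), candidate.get(key, "")).ratio()
        for key in ("tag", "text", "classes", "path")
    ]
    return sum(scores) / len(scores)

saved = {
    "tag": "article",
    "text": "Product 1",
    "classes": "product",
    "path": "html body div section article",
}
# After a redesign, nothing matches exactly anymore
candidates = [
    {"tag": "article", "text": "Product 1", "classes": "product new-class",
     "path": "html body div div section article"},
    {"tag": "article", "text": "Product 2", "classes": "product new-class",
     "path": "html body div div section article"},
]

# Step 5: the candidate with the highest score is returned
best = max(candidates, key=lambda c: similarity(saved, c))
print(best["text"])  # Product 1
```

Even though neither candidate is an exact match, the scorer still prefers the one closest to the saved element.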
### The unique properties
You might wonder what unique properties we are referring to, given that all of an element's properties can be removed or altered.

For Scrapling, the unique properties we rely on are:

- Element tag name, text, attributes (names and values), siblings (tag names only), and path (tag names only).
- Element's parent tag name, attributes (names and values), and text.

But you need to understand that the comparison between elements isn't exact; it's about how similar these values are. So everything is considered, even the order of values, like the order in which the element's class names were written before versus the order the same class names are written in now.

## How to use the adaptive feature
The adaptive feature can be applied to any found element, and it's added as arguments to the CSS/XPath selection methods, as you saw above, but we will get back to that later.

First, you must enable the `adaptive` feature by passing `adaptive=True` to the [Selector](main_classes.md#selector) class when you initialize it, or by enabling it on whichever fetcher you are using, as we will show.

Examples:
```python
>>> from scrapling import Selector, Fetcher
>>> page = Selector(html_doc, adaptive=True)
# OR
>>> Fetcher.adaptive = True
>>> page = Fetcher.get('https://example.com')
```
If you are using the [Selector](main_classes.md#selector) class, you need to pass the URL of the website you are scraping with the `url` argument so Scrapling can separate the properties saved for each element by domain.

If you don't pass a URL, the word `default` will be used in place of the URL field while saving the element's unique properties. So this will only be an issue if you later use the same identifier for a different website and again don't pass the URL parameter when initializing. The save process overwrites previous data, and the `adaptive` feature uses only the latest saved properties.
Besides those arguments, we have `storage` and `storage_args`. Both are used by the class to connect to the database; by default, it uses the SQLite class provided by the library. These arguments shouldn't matter unless you want to write your own storage system, which we will cover on a [separate page in the development section](../development/adaptive_storage_system.md).

Now that you've enabled the `adaptive` feature globally, you have two main ways to use it.

### The CSS/XPath Selection way
As you have seen in the example above, you first have to use the `auto_save` argument while selecting an element that exists on the page, like below
```python
element = page.css('#p1', auto_save=True)
```
And when the element no longer exists, you can use the same selector with the `adaptive` argument, and the library will find it for you
```python
element = page.css('#p1', adaptive=True)
```
Pretty simple, eh?

Well, a lot happened under the hood here. Remember the identifier we mentioned before, the one you need to set to retrieve the element you want? With the `css`/`xpath` methods, the identifier is automatically set to the selector you passed, to make things easier :)

Additionally, for all these methods, you can pass the `identifier` argument to set it yourself. This is useful in some instances, and you can also use it while saving properties with the `auto_save` argument.

### The manual way
Here you manually save, retrieve, and relocate an element, all through the `adaptive` feature, as shown below. This allows you to relocate any element found through any method or selection!

First, let's say you got an element like this by text:
```python
>>> element = page.find_by_text('Tipping the Velvet', first_match=True)
```
You can save its unique properties using the `save` method, as shown below, but you must set the identifier yourself.
For this example, I chose `my_special_element` as an identifier, but it's best to use a meaningful identifier in your code, for the same reason you use meaningful variable names :)
```python
>>> page.save(element, 'my_special_element')
```
Later, when you want to retrieve it and relocate it inside the page with `adaptive`, it would look like this
```python
>>> element_dict = page.retrieve('my_special_element')
>>> page.relocate(element_dict, selector_type=True)
[<data='<a href="catalogue/tipping-the-velvet_99...' parent='<h3><a href="catalogue/tipping-the-velve...'>]
>>> page.relocate(element_dict, selector_type=True).css('::text').getall()
['Tipping the Velvet']
```
That's how the `retrieve` and `relocate` methods are used together.

If you want to keep the result as an `lxml.etree` object, leave out the `selector_type` argument
```python
>>> page.relocate(element_dict)
[<Element a at 0x105a2a7b0>]
```

## Troubleshooting

### No Matches Found
```python
# 1. Check if data was saved
element_data = page.retrieve('identifier')
if not element_data:
    print("No data saved for this identifier")

# 2. Try with a different identifier
products = page.css('.product', adaptive=True, identifier='old_selector')

# 3. Save again with a new identifier
products = page.css('.new-product', auto_save=True, identifier='new_identifier')
```

### Wrong Elements Matched
```python
# Use more specific selectors
products = page.css('.product-list .product', auto_save=True)

# Or save with more context
product = page.find_by_text('Product Name').parent
page.save(product, 'specific_product')
```

## Known Issues
In the `adaptive` save process, only the unique properties of the first element in the selection results are saved. So if your selector matches different elements in other locations on the page, `adaptive` will return only the first element when you relocate it later.
This doesn't apply to combined CSS selectors (using commas to combine more than one selector, for example), as those are split apart and each selector is executed alone.

## Final thoughts
Explaining this feature in detail without complications turned out to be challenging. Still, if something is left unclear, you can head to the [discussions section](https://github.com/D4Vinci/Scrapling/discussions), and I will reply to you ASAP, or the Discord server, or reach out to me privately for a chat :)
\ No newline at end of file
diff --git a/docs/parsing/main_classes.md b/docs/parsing/main_classes.md
new file mode 100644
index 0000000000000000000000000000000000000000..f310037f2a85aab526341bb1ab6e05d634f58fae
--- /dev/null
+++ b/docs/parsing/main_classes.md
@@ -0,0 +1,607 @@
# Parsing main classes

!!! success "Prerequisites"

    - You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector) object.

After exploring the various ways to select elements with Scrapling and its related features, let's take a step back and examine the [Selector](#selector) class in general, as well as the other objects, to gain a better understanding of the parsing engine.

The [Selector](#selector) class is the core parsing engine in Scrapling, providing HTML parsing and element selection capabilities. You can always import it with either of the following imports
```python
from scrapling import Selector
from scrapling.parser import Selector
```
Then use it directly as you already learned on the [overview](../overview.md) page
```python
page = Selector(
    '<html>...</html>',
    url='https://example.com'
)

# Then select elements as you like
elements = page.css('.product')
```
In Scrapling, the main object you deal with after passing an HTML source or fetching a website is, of course, a [Selector](#selector) object.
Any operation you do, like selection, navigation, etc., will return either a [Selector](#selector) object or a [Selectors](#selectors) object, given that the result is an element or elements from the page, not text or similar.

In other words, the main page is a [Selector](#selector) object, and the elements within are [Selector](#selector) objects, and so on. Any text, such as the text content inside elements or inside element attributes, is a [TextHandler](#texthandler) object, and the attributes of each element are stored as an [AttributesHandler](#attributeshandler). We will return to both objects later, so let's focus on the [Selector](#selector) object.

## Selector
### Arguments explained
The most important one is `content`; it's used to pass the HTML code you want to parse, and it accepts the HTML content as `str` or `bytes`.

Otherwise, you have the arguments `url`, `adaptive`, `storage`, and `storage_args`. All of these are settings used with the `adaptive` feature, and they make no difference if you are not going to use that feature, so just ignore them for now; we will explain them on the [adaptive](adaptive.md) feature page.

Then you have the arguments for parsing adjustments, or for adjusting/manipulating the HTML content while the library is parsing it:

- **encoding**: The encoding that will be used while parsing the HTML. The default is `UTF-8`.
- **keep_comments**: Tells the library whether to keep HTML comments while parsing the page. It's disabled by default because comments can cause issues with your scraping in various ways.
- **keep_cdata**: Same logic as HTML comments. [cdata](https://stackoverflow.com/questions/7092236/what-is-cdata-in-html) is removed by default for cleaner HTML.

I have intentionally ignored the arguments `huge_tree` and `root` to avoid making this page more complicated than needed.
You may notice that I'm doing that a lot because these involve advanced features that you don't need to know to use the library. The development section will cover these missing parts if you are very invested.

After that, most properties on the main page and its elements are lazily loaded. This means they don't get initialized until you use them, like the text content of a page/element, and this is one of the reasons for Scrapling's speed :)

### Properties
You have already seen much of this on the [overview](../overview.md) page, but don't worry if you didn't. We will review it more thoroughly using more advanced methods/usages. For clarity, the properties for traversal are separated below in the [traversal](#traversal) section.

Let's say we are parsing this HTML page for simplicity:
```html
<html>
    <head>
        <title>Some page</title>
    </head>
    <body>
        <div class="content">
            <article class="product" data-id="1">
                <h3>Product 1</h3>
                <p class="description">This is product 1</p>
                <span class="price">$10.99</span>
                <span class="stock">In stock: 5</span>
            </article>
            <article class="product" data-id="2">
                <h3>Product 2</h3>
                <p class="description">This is product 2</p>
                <span class="price">$20.99</span>
                <span class="stock">In stock: 3</span>
            </article>
            <article class="product" data-id="3">
                <h3>Product 3</h3>
                <p class="description">This is product 3</p>
                <span class="price">$15.99</span>
                <span class="stock">Out of stock</span>
            </article>
        </div>
        <script type="application/json">{"lastUpdated": "2024-09-22T10:30:00Z", "totalProducts": 3}</script>
    </body>
</html>
```
Load the page directly as shown before:
```python
from scrapling import Selector
page = Selector(html_doc)
```
Get all text content on the page recursively
```python
>>> page.get_all_text()
'Some page\n\n \n\n \nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
```
Get the first article, as explained before; we will use it as an example
```python
article = page.find('article')
```
With the same logic, get all text content of the element recursively
```python
>>> article.get_all_text()
'Product 1\nThis is product 1\n$10.99\nIn stock: 5'
```
But if you try to get the direct text content, it will be empty because the element has no direct text in the HTML code above
```python
>>> article.text
''
```
The `get_all_text` method has the following optional arguments:

1. **separator**: All strings collected will be concatenated using this separator. The default is `'\n'`.
2. **strip**: If enabled, strings will be stripped before concatenation. Disabled by default.
3. **ignore_tags**: A tuple of all tag names you want to ignore in the final results, including any elements nested within them. The default is `('script', 'style',)`.
4. **valid_values**: If enabled, the method will only collect elements with real values, so all elements with empty text content or only whitespaces will be ignored.
It's enabled by default.

By the way, the text returned here is not a standard string but a [TextHandler](#texthandler); we will get to this in detail later. For now, know that if the text content can be serialized to JSON, you can use `.json()` on it
```python
>>> script = page.find('script')
>>> script.json()
{'lastUpdated': '2024-09-22T10:30:00Z', 'totalProducts': 3}
```
Let's continue to get the element tag
```python
>>> article.tag
'article'
```
If you use it on the page directly, you will find that you are operating on the root `html` element
```python
>>> page.tag
'html'
```
Now, I think I've hammered the (`page`/`element`) idea home, so I won't return to it.

Getting the attributes of the element
```python
>>> print(article.attrib)
{'class': 'product', 'data-id': '1'}
```
Access a specific attribute with any of the following
```python
>>> article.attrib['class']
>>> article.attrib.get('class')
>>> article['class']  # new in v0.3
```
Check if the attributes contain a specific attribute with any of the methods below
```python
>>> 'class' in article.attrib
>>> 'class' in article  # new in v0.3
```
Get the HTML content of the element
```python
>>> article.html_content
'<article class="product" data-id="1">\n                <h3>Product 1</h3>\n                <p class="description">This is product 1</p>\n                <span class="price">$10.99</span>\n                <span class="stock">In stock: 5</span>\n            </article>'
```
Get the prettified version of the element's HTML content
```python
print(article.prettify())
```
```html
<article class="product" data-id="1">
    <h3>Product 1</h3>
    <p class="description">This is product 1</p>
    <span class="price">$10.99</span>
    <span class="stock">In stock: 5</span>
</article>
```
Use the `.body` property to get the raw content of the page. Starting from v0.4, when used on a `Response` object from fetchers, `.body` always returns `bytes`.
```python
>>> page.body
'<html>\n    <head>\n        <title>Some page</title>\n    </head>\n    ...'
```
To get all the ancestors in the DOM tree of this element
```python
>>> article.path
[<data='<div class="content"><article class="pr...'>,
 <data='<body><div class="content"><article cla...'>,
 <data='<html><head><title>Some page</title></h...'>]
```
Generate a shortened CSS selector if possible, or generate the full selector
```python
>>> article.generate_css_selector
'body > div > article'
>>> article.generate_full_css_selector
'body > div > article'
```
Same case with XPath
```python
>>> article.generate_xpath_selector
"//body/div/article"
>>> article.generate_full_xpath_selector
"//body/div/article"
```

### Traversal
Using the elements we found above, we will go over the properties/methods for moving around the page in detail.

If you are unfamiliar with the DOM tree, or the tree data structure in general, the following traversal part can be confusing. I recommend you look up these concepts online to better understand them.

If you are too lazy to search about it, here's a quick explanation to give you a good idea.
+In simple words, the `html` element is the root of the website's tree, as every page starts with an `html` element.
+This element will be positioned directly above elements such as `head` and `body`. These are considered "children" of the `html` element, and the `html` element is considered their "parent". The element `body` is a "sibling" of the element `head` and vice versa. + +Accessing the parent of an element +```python +>>> article.parent +
<data='<div class="content"><article class="pr...' parent='<body><div class="content"><article cla...'>
>>> article.parent.tag
'div'
```
You can chain it as you want, which applies to all similar properties/methods we will review.
```python
>>> article.parent.parent.tag
'body'
```
Get the children of an element
```python
>>> article.children
[<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1">...'>,
 <data='<p class="description">This is product 1...' parent='<article class="product" data-id="1">...'>,
 <data='<span class="price">$10.99</span>' parent='<article class="product" data-id="1">...'>,