AUXteam committed on
Commit e840680 · verified · 1 Parent(s): a53c022

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +5 -0
  2. Dockerfile +4 -0
  3. PROJECT_SCAN.md +70 -0
  4. Scrapling/.bandit.yml +11 -0
  5. Scrapling/.dockerignore +110 -0
  6. Scrapling/.github/FUNDING.yml +3 -0
  7. Scrapling/.github/ISSUE_TEMPLATE/01-bug_report.yml +82 -0
  8. Scrapling/.github/ISSUE_TEMPLATE/02-feature_request.yml +19 -0
  9. Scrapling/.github/ISSUE_TEMPLATE/03-other.yml +19 -0
  10. Scrapling/.github/ISSUE_TEMPLATE/04-docs_issue.yml +40 -0
  11. Scrapling/.github/ISSUE_TEMPLATE/config.yml +10 -0
  12. Scrapling/.github/PULL_REQUEST_TEMPLATE.md +51 -0
  13. Scrapling/.github/workflows/code-quality.yml +184 -0
  14. Scrapling/.github/workflows/docker-build.yml +86 -0
  15. Scrapling/.github/workflows/release-and-publish.yml +74 -0
  16. Scrapling/.github/workflows/tests.yml +109 -0
  17. Scrapling/.gitignore +107 -0
  18. Scrapling/.pre-commit-config.yaml +20 -0
  19. Scrapling/.readthedocs.yaml +21 -0
  20. Scrapling/CODE_OF_CONDUCT.md +128 -0
  21. Scrapling/CONTRIBUTING.md +106 -0
  22. Scrapling/Dockerfile +39 -0
  23. Scrapling/LICENSE +28 -0
  24. Scrapling/MANIFEST.in +13 -0
  25. Scrapling/README.md +419 -0
  26. Scrapling/ROADMAP.md +14 -0
  27. Scrapling/benchmarks.py +146 -0
  28. Scrapling/cleanup.py +42 -0
  29. Scrapling/docs/README_AR.md +414 -0
  30. Scrapling/docs/README_CN.md +414 -0
  31. Scrapling/docs/README_DE.md +414 -0
  32. Scrapling/docs/README_ES.md +414 -0
  33. Scrapling/docs/README_JP.md +414 -0
  34. Scrapling/docs/README_RU.md +414 -0
  35. Scrapling/docs/ai/mcp-server.md +294 -0
  36. Scrapling/docs/api-reference/custom-types.md +26 -0
  37. Scrapling/docs/api-reference/fetchers.md +63 -0
  38. Scrapling/docs/api-reference/mcp-server.md +39 -0
  39. Scrapling/docs/api-reference/proxy-rotation.md +18 -0
  40. Scrapling/docs/api-reference/response.md +18 -0
  41. Scrapling/docs/api-reference/selector.md +25 -0
  42. Scrapling/docs/api-reference/spiders.md +42 -0
  43. Scrapling/docs/assets/cover_dark.png +3 -0
  44. Scrapling/docs/assets/cover_dark.svg +0 -0
  45. Scrapling/docs/assets/cover_light.png +0 -0
  46. Scrapling/docs/assets/cover_light.svg +1 -0
  47. Scrapling/docs/assets/favicon.ico +3 -0
  48. Scrapling/docs/assets/logo.png +0 -0
  49. Scrapling/docs/assets/main_cover.png +3 -0
  50. Scrapling/docs/assets/scrapling_shell_curl.png +3 -0
.gitattributes ADDED
@@ -0,0 +1,5 @@
+ Scrapling/docs/assets/cover_dark.png filter=lfs diff=lfs merge=lfs -text
+ Scrapling/docs/assets/favicon.ico filter=lfs diff=lfs merge=lfs -text
+ Scrapling/docs/assets/main_cover.png filter=lfs diff=lfs merge=lfs -text
+ Scrapling/docs/assets/scrapling_shell_curl.png filter=lfs diff=lfs merge=lfs -text
+ Scrapling/docs/assets/spider_architecture.png filter=lfs diff=lfs merge=lfs -text
Dockerfile CHANGED
@@ -84,6 +84,10 @@ RUN mkdir -p /home/user/.streamlit && chown -R user:user /home/user
  # Copy the rest of the application
  COPY --chown=user:user . .

+ # Install Scrapling in editable mode and its browser dependencies
+ RUN uv pip install --system -e ./Scrapling[fetchers]
+ RUN playwright install chromium
+
  # Set permissions for the start script
  RUN chmod +x start.sh
PROJECT_SCAN.md ADDED
@@ -0,0 +1,70 @@
+ # CyberScraper-2077 Project Scan & Integration Analysis
+
+ ## 1. Project Overview
+ CyberScraper-2077 is an advanced, AI-powered web scraping tool with a futuristic cyberpunk theme. It leverages Large Language Models (LLMs) like OpenAI, Gemini, and Ollama to extract structured data from websites. It features a Streamlit-based UI, supports Tor for .onion sites, and currently uses a custom `PlaywrightScraper` (based on `patchright`) for stealthy scraping.
+
+ ## 2. Architecture Analysis
+
+ ### Core Components
+ * **Frontend (`main.py`, `app/`):** Built with Streamlit. Handles user interaction, chat interface, and displays scraped data. Manages session state and API keys.
+ * **Orchestrator (`src/web_extractor.py`):** The central logic (`WebExtractor` class). It coordinates:
+   * **Fetching:** Calls `PlaywrightScraper` or `TorScraper` to get raw HTML.
+   * **Preprocessing:** Cleans HTML using BeautifulSoup.
+   * **Extraction:** Uses LangChain and LLMs to parse content and answer user queries.
+   * **Agentic Loop:** Has an experimental agentic loop for iterative investigation using `browser_tools`.
+ * **Scraping Engine (`src/scrapers/`):**
+   * **`PlaywrightScraper`:** A custom implementation using `patchright` (undetected Playwright). Features:
+     * **Stealth:** Uses persistent contexts and `patchright`'s built-in stealth.
+     * **Concurrency:** `scrape_multiple_pages` handles basic pagination.
+     * **CAPTCHA:** Manual `handle_captcha` method.
+   * **`TorScraper`:** Uses `requests` with SOCKS proxy for .onion sites.
+ * **Utilities:**
+   * `src/models.py` / `src/ollama_models.py`: Model management.
+   * `src/utils/`: Error handling, Google Sheets integration.
+
+ ### Data Flow
+ 1. User inputs URL/Query in Streamlit.
+ 2. `StreamlitWebScraperChat` initializes `WebExtractor`.
+ 3. `WebExtractor` determines if it's a new URL or chat.
+ 4. If new URL:
+    * `PlaywrightScraper.fetch_content` is called.
+    * Content is preprocessed and cached.
+    * LLM is called with context to answer query or extract data.
+
+ ## 3. Scrapling Integration Analysis
+ The `Scrapling` folder contains a new, sophisticated scraping library. Integrating it will enhance CyberScraper-2077 by providing:
+
+ * **Adaptive Parsing:** Robustness against layout changes.
+ * **Advanced Fetching:** `StealthyFetcher` and `DynamicFetcher` (wrapping Playwright) for better evasion.
+ * **Spider Framework:** A Scrapy-like `Spider` class for efficient, concurrent, and persistent crawling of multiple pages.
+
+ ### Key Integration Points
+ * **Replacement/Augmentation of `PlaywrightScraper`:**
+   * `Scrapling`'s `StealthyFetcher` can replace the manual `patchright` setup in `PlaywrightScraper`.
+   * `Scrapling`'s `Spider` can replace the manual `scrape_multiple_pages` loop, offering better concurrency and state management.
+ * **Dependencies:**
+   * `Scrapling` requires `playwright` (implied by imports in `Scrapling/scrapling/engines/_browsers/_controllers.py`).
+   * Current repo uses `patchright`. Integration must ensure `Scrapling` works with the installed environment. `Scrapling`'s `pyproject.toml` lists both `playwright` and `patchright` in optional dependencies, but the code imports `playwright`. I will need to ensure `playwright` is installed or alias it if `patchright` is to be used exclusively (though `patchright`'s module name is different).
+ * **Async Compatibility:**
+   * `Scrapling` supports async fetchers and spiders, compatible with `WebExtractor`'s async nature.
+
+ ## 4. Proposed Integration Plan
+
+ 1. **Dependency Management:**
+    * Install `Scrapling` in editable mode (`pip install -e Scrapling[fetchers]`).
+    * Ensure compatible versions of `playwright`/`patchright`.
+ 2. **Adapter Implementation (`src/scrapers/scrapling_adapter.py`):**
+    * Create a `ScraplingAdapter` class implementing the interface expected by `WebExtractor` (matching `PlaywrightScraper`'s `fetch_content` signature).
+    * Implement `fetch_content` to use:
+      * `StealthyFetcher` (or `DynamicFetcher`) for single URLs.
+      * A dynamic `Spider` subclass for multi-page/crawl tasks.
+ 3. **WebExtractor Update:**
+    * Modify `src/web_extractor.py` to instantiate `ScraplingAdapter` instead of `PlaywrightScraper`.
+    * Map existing config (headless, proxy) to `Scrapling` configuration.
+ 4. **Verification:**
+    * Test single page extraction.
+    * Test multi-page crawling.
+
+ ## 5. Potential Issues & Risks
+ * **Dependency Conflict:** `Scrapling` might pull a different version of `playwright` or `patchright`.
+ * **Browser Management:** `Scrapling` manages its own browser instances. I need to ensure it plays nicely with Streamlit's lifecycle (though `WebExtractor` creates a new scraper instance).
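Step 2 of the plan above could be sketched roughly as follows. This is a minimal, self-contained sketch, not the real adapter: the stub body stands in for a call to Scrapling's `StealthyFetcher`, whose exact async call signature is not confirmed here, and only `ScraplingAdapter` and `fetch_content` are names taken from the plan.

```python
import asyncio
from typing import Optional, Protocol


class Fetcher(Protocol):
    """Interface WebExtractor expects (mirrors PlaywrightScraper.fetch_content)."""

    async def fetch_content(self, url: str) -> str: ...


class ScraplingAdapter:
    """Hypothetical adapter sketch; a real implementation would delegate to
    Scrapling's StealthyFetcher/DynamicFetcher instead of the stub below."""

    def __init__(self, headless: bool = True, proxy: Optional[str] = None) -> None:
        self.headless = headless
        self.proxy = proxy

    async def fetch_content(self, url: str) -> str:
        # Stub standing in for something like (assumption, not the verified API):
        #   page = await StealthyFetcher.async_fetch(url, headless=self.headless)
        #   return page.html_content
        return f"<html><!-- fetched {url} --></html>"


async def main() -> str:
    scraper: Fetcher = ScraplingAdapter(headless=True)
    return await scraper.fetch_content("https://example.com")


print(asyncio.run(main()))
```

Because the adapter satisfies the same `fetch_content` contract, `WebExtractor` would not need to know which backend is in use.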
Scrapling/.bandit.yml ADDED
@@ -0,0 +1,11 @@
+ skips:
+   - B101
+   - B311
+   - B113  # `Requests call without timeout`: these requests are made in the benchmark and example scripts only
+   - B403  # We are using pickle for tests only
+   - B404  # Using subprocess library
+   - B602  # subprocess call with shell=True identified
+   - B110  # Try, Except, Pass detected.
+   - B104  # Possible binding to all interfaces.
+   - B301  # Pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue.
+   - B108  # Probable insecure usage of temp file/directory.
Scrapling/.dockerignore ADDED
@@ -0,0 +1,110 @@
+ # Github
+ .github/
+
+ # docs
+ docs/
+ images/
+ .cache/
+ .claude/
+
+ # cached files
+ __pycache__/
+ *.py[cod]
+ .cache
+ .DS_Store
+ *~
+ .*.sw[po]
+ .build
+ .ve
+ .env
+ .pytest
+ .benchmarks
+ .bootstrap
+ .appveyor.token
+ *.bak
+ *.db
+ *.db-*
+
+ # installation package
+ *.egg-info/
+ dist/
+ build/
+
+ # environments
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # C extensions
+ *.so
+
+ # pycharm
+ .idea/
+
+ # vscode
+ *.code-workspace
+
+ # Packages
+ *.egg
+ *.egg-info
+ dist
+ build
+ eggs
+ .eggs
+ parts
+ bin
+ var
+ sdist
+ wheelhouse
+ develop-eggs
+ .installed.cfg
+ lib
+ lib64
+ venv*/
+ .venv*/
+ pyvenv*/
+ pip-wheel-metadata/
+ poetry.lock
+
+ # Installer logs
+ pip-log.txt
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+ mypy.ini
+
+ # test caches
+ .tox/
+ .pytest_cache/
+ .coverage
+ htmlcov
+ report.xml
+ nosetests.xml
+ coverage.xml
+
+ # Translations
+ *.mo
+
+ # Buildout
+ .mr.developer.cfg
+
+ # IDE project files
+ .project
+ .pydevproject
+ .idea
+ *.iml
+ *.komodoproject
+
+ # Complexity
+ output/*.html
+ output/*/index.html
+
+ # Sphinx
+ docs/_build
+ public/
+ web/
Scrapling/.github/FUNDING.yml ADDED
@@ -0,0 +1,3 @@
+ github: D4Vinci
+ buy_me_a_coffee: d4vinci
+ ko_fi: d4vinci
Scrapling/.github/ISSUE_TEMPLATE/01-bug_report.yml ADDED
@@ -0,0 +1,82 @@
+ name: Bug report
+ description: Create a bug report to help us address errors in the repository
+ labels: [bug]
+ body:
+   - type: checkboxes
+     attributes:
+       label: Have you searched if there is an existing issue for this?
+       description: Please search [existing issues](https://github.com/D4Vinci/Scrapling/labels/bug).
+       options:
+         - label: I have searched the existing issues
+           required: true
+
+   - type: input
+     attributes:
+       label: "Python version (python --version)"
+       placeholder: "Python 3.8"
+     validations:
+       required: true
+
+   - type: input
+     attributes:
+       label: "Scrapling version (scrapling.__version__)"
+       placeholder: "0.1"
+     validations:
+       required: true
+
+   - type: textarea
+     attributes:
+       label: "Dependencies version (pip3 freeze)"
+       description: >
+         This is the output of the command `pip3 freeze --all`. Note that the
+         actual output might be different as compared to the placeholder text.
+       placeholder: |
+         cssselect==1.2.0
+         lxml==5.3.0
+         orjson==3.10.7
+         ...
+     validations:
+       required: true
+
+   - type: input
+     attributes:
+       label: "What's your operating system?"
+       placeholder: "Windows 10"
+     validations:
+       required: true
+
+   - type: dropdown
+     attributes:
+       label: 'Are you using a separate virtual environment?'
+       description: "Please pay attention to this question"
+       options:
+         - 'No'
+         - 'Yes'
+       default: 0
+     validations:
+       required: true
+
+   - type: textarea
+     attributes:
+       label: "Expected behavior"
+       description: "Describe the behavior you expect. May include images or videos."
+     validations:
+       required: true
+
+   - type: textarea
+     attributes:
+       label: "Actual behavior"
+     validations:
+       required: true
+
+   - type: textarea
+     attributes:
+       label: Steps To Reproduce
+       description: Steps to reproduce the behavior.
+       placeholder: |
+         1. In this environment...
+         2. With this config...
+         3. Run '...'
+         4. See error...
+     validations:
+       required: false
Scrapling/.github/ISSUE_TEMPLATE/02-feature_request.yml ADDED
@@ -0,0 +1,19 @@
+ name: Feature request
+ description: Suggest features, propose improvements, discuss new ideas.
+ labels: [enhancement]
+ body:
+   - type: checkboxes
+     attributes:
+       label: Have you searched if there is an existing feature request for this?
+       description: Please search [existing requests](https://github.com/D4Vinci/Scrapling/labels/enhancement).
+       options:
+         - label: I have searched the existing requests
+           required: true
+
+   - type: textarea
+     attributes:
+       label: "Feature description"
+       description: >
+         This could include new topics or improving any existing features/implementations.
+     validations:
+       required: true
Scrapling/.github/ISSUE_TEMPLATE/03-other.yml ADDED
@@ -0,0 +1,19 @@
+ name: Other
+ description: Use this for any other issues. PLEASE provide as much information as possible.
+ labels: ["awaiting triage"]
+ body:
+   - type: textarea
+     id: issuedescription
+     attributes:
+       label: What would you like to share?
+       description: Provide a clear and concise explanation of your issue.
+     validations:
+       required: true
+
+   - type: textarea
+     id: extrainfo
+     attributes:
+       label: Additional information
+       description: Is there anything else we should know about this issue?
+     validations:
+       required: false
Scrapling/.github/ISSUE_TEMPLATE/04-docs_issue.yml ADDED
@@ -0,0 +1,40 @@
+ name: Documentation issue
+ description: Report incorrect, unclear, or missing documentation.
+ labels: [documentation]
+ body:
+   - type: checkboxes
+     attributes:
+       label: Have you searched if there is an existing issue for this?
+       description: Please search [existing issues](https://github.com/D4Vinci/Scrapling/labels/documentation).
+       options:
+         - label: I have searched the existing issues
+           required: true
+
+   - type: input
+     attributes:
+       label: "Page URL"
+       description: "Link to the documentation page with the issue."
+       placeholder: "https://scrapling.readthedocs.io/en/latest/..."
+     validations:
+       required: true
+
+   - type: dropdown
+     attributes:
+       label: "Type of issue"
+       options:
+         - Incorrect information
+         - Unclear or confusing
+         - Missing information
+         - Typo or formatting
+         - Broken link
+         - Other
+       default: 0
+     validations:
+       required: true
+
+   - type: textarea
+     attributes:
+       label: "Description"
+       description: "Describe what's wrong and what you expected to find."
+     validations:
+       required: true
Scrapling/.github/ISSUE_TEMPLATE/config.yml ADDED
@@ -0,0 +1,10 @@
+ blank_issues_enabled: false
+ contact_links:
+   - name: Discussions
+     url: https://github.com/D4Vinci/Scrapling/discussions
+     about: >
+       The "Discussions" forum is where you want to start. 💖
+   - name: Ask on our discord server
+     url: https://discord.gg/EMgGbDceNQ
+     about: >
+       Our community chat forum.
Scrapling/.github/PULL_REQUEST_TEMPLATE.md ADDED
@@ -0,0 +1,51 @@
+ <!--
+ You are amazing! Thanks for contributing to Scrapling!
+ Please, DO NOT DELETE ANY TEXT from this template! (unless instructed).
+ -->
+
+ ## Proposed change
+ <!--
+ Describe the big picture of your changes here to communicate to the maintainers why we should accept this pull request.
+ If it fixes a bug or resolves a feature request, be sure to link to that issue in the additional information section.
+ -->
+
+
+ ### Type of change:
+ <!--
+ What type of change does your PR introduce to Scrapling?
+ NOTE: Please, check at least 1 box!
+ If your PR requires multiple boxes to be checked, you'll most likely need to
+ split it into multiple PRs. This makes things easier and faster to code review.
+ -->
+
+
+ - [ ] Dependency upgrade
+ - [ ] Bugfix (non-breaking change which fixes an issue)
+ - [ ] New integration (thank you!)
+ - [ ] New feature (which adds functionality to an existing integration)
+ - [ ] Deprecation (breaking change to happen in the future)
+ - [ ] Breaking change (fix/feature causing existing functionality to break)
+ - [ ] Code quality improvements to existing code or addition of tests
+ - [ ] Add or change doctests? -- Note: Please avoid changing both code and tests in a single pull request.
+ - [ ] Documentation change?
+
+ ### Additional information
+ <!--
+ Details are important and help maintainers process your PR.
+ Please be sure to fill out additional details, if applicable.
+ -->
+
+ - This PR fixes or closes an issue: fixes #
+ - This PR is related to an issue: #
+ - Link to documentation pull request: **
+
+ ### Checklist:
+ * [ ] I have read [CONTRIBUTING.md](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md).
+ * [ ] This pull request is all my own work -- I have not plagiarized.
+ * [ ] I know that pull requests will not be merged if they fail the automated tests.
+ * [ ] All new Python files are placed inside an existing directory.
+ * [ ] All filenames are in all lowercase characters with no spaces or dashes.
+ * [ ] All functions and variable names follow Python naming conventions.
+ * [ ] All function parameters and return values are annotated with Python [type hints](https://docs.python.org/3/library/typing.html).
+ * [ ] All functions have doc-strings.
Scrapling/.github/workflows/code-quality.yml ADDED
@@ -0,0 +1,184 @@
+ name: Code Quality
+
+ on:
+   push:
+     branches:
+       - main
+       - dev
+     paths-ignore:
+       - '*.md'
+       - '**/*.md'
+       - 'docs/**'
+       - 'images/**'
+       - '.github/**'
+       - '!.github/workflows/code-quality.yml'  # Always run when this workflow changes
+   pull_request:
+     branches:
+       - main
+       - dev
+     paths-ignore:
+       - '*.md'
+       - '**/*.md'
+       - 'docs/**'
+       - 'images/**'
+   workflow_dispatch:  # Allow manual triggering
+
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: true
+
+ jobs:
+   code-quality:
+     name: Code Quality Checks
+     runs-on: ubuntu-latest
+     permissions:
+       contents: read
+       pull-requests: write  # For PR annotations
+
+     steps:
+       - name: Checkout code
+         uses: actions/checkout@v4
+         with:
+           fetch-depth: 0  # Full history for better analysis
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: '3.10'
+           cache: 'pip'
+
+       - name: Install dependencies
+         run: |
+           python -m pip install --upgrade pip
+           pip install bandit[toml] ruff vermin mypy pyright
+           pip install -e ".[all]"
+           pip install lxml-stubs
+
+       - name: Run Bandit (Security Linter)
+         id: bandit
+         continue-on-error: true
+         run: |
+           echo "::group::Bandit - Security Linter"
+           bandit -r -c .bandit.yml scrapling/ -f json -o bandit-report.json
+           bandit -r -c .bandit.yml scrapling/
+           echo "::endgroup::"
+
+       - name: Run Ruff Linter
+         id: ruff-lint
+         continue-on-error: true
+         run: |
+           echo "::group::Ruff - Linter"
+           ruff check scrapling/ --output-format=github
+           echo "::endgroup::"
+
+       - name: Run Ruff Formatter Check
+         id: ruff-format
+         continue-on-error: true
+         run: |
+           echo "::group::Ruff - Formatter Check"
+           ruff format --check scrapling/ --diff
+           echo "::endgroup::"
+
+       - name: Run Vermin (Python Version Compatibility)
+         id: vermin
+         continue-on-error: true
+         run: |
+           echo "::group::Vermin - Python 3.10+ Compatibility Check"
+           vermin -t=3.10- --violations --eval-annotations --no-tips scrapling/
+           echo "::endgroup::"
+
+       - name: Run Mypy (Static Type Checker)
+         id: mypy
+         continue-on-error: true
+         run: |
+           echo "::group::Mypy - Static Type Checker"
+           mypy scrapling/
+           echo "::endgroup::"
+
+       - name: Run Pyright (Static Type Checker)
+         id: pyright
+         continue-on-error: true
+         run: |
+           echo "::group::Pyright - Static Type Checker"
+           pyright scrapling/
+           echo "::endgroup::"
+
+       - name: Check results and create summary
+         if: always()
+         run: |
+           echo "# Code Quality Check Results" >> $GITHUB_STEP_SUMMARY
+           echo "" >> $GITHUB_STEP_SUMMARY
+
+           # Initialize status
+           all_passed=true
+
+           # Check Bandit
+           if [ "${{ steps.bandit.outcome }}" == "success" ]; then
+             echo "✅ **Bandit (Security)**: Passed" >> $GITHUB_STEP_SUMMARY
+           else
+             echo "❌ **Bandit (Security)**: Failed" >> $GITHUB_STEP_SUMMARY
+             all_passed=false
+           fi
+
+           # Check Ruff Linter
+           if [ "${{ steps.ruff-lint.outcome }}" == "success" ]; then
+             echo "✅ **Ruff Linter**: Passed" >> $GITHUB_STEP_SUMMARY
+           else
+             echo "❌ **Ruff Linter**: Failed" >> $GITHUB_STEP_SUMMARY
+             all_passed=false
+           fi
+
+           # Check Ruff Formatter
+           if [ "${{ steps.ruff-format.outcome }}" == "success" ]; then
+             echo "✅ **Ruff Formatter**: Passed" >> $GITHUB_STEP_SUMMARY
+           else
+             echo "❌ **Ruff Formatter**: Failed" >> $GITHUB_STEP_SUMMARY
+             all_passed=false
+           fi
+
+           # Check Vermin
+           if [ "${{ steps.vermin.outcome }}" == "success" ]; then
+             echo "✅ **Vermin (Python 3.10+)**: Passed" >> $GITHUB_STEP_SUMMARY
+           else
+             echo "❌ **Vermin (Python 3.10+)**: Failed" >> $GITHUB_STEP_SUMMARY
+             all_passed=false
+           fi
+
+           # Check Mypy
+           if [ "${{ steps.mypy.outcome }}" == "success" ]; then
+             echo "✅ **Mypy (Type Checker)**: Passed" >> $GITHUB_STEP_SUMMARY
+           else
+             echo "❌ **Mypy (Type Checker)**: Failed" >> $GITHUB_STEP_SUMMARY
+             all_passed=false
+           fi
+
+           # Check Pyright
+           if [ "${{ steps.pyright.outcome }}" == "success" ]; then
+             echo "✅ **Pyright (Type Checker)**: Passed" >> $GITHUB_STEP_SUMMARY
+           else
+             echo "❌ **Pyright (Type Checker)**: Failed" >> $GITHUB_STEP_SUMMARY
+             all_passed=false
+           fi
+
+           echo "" >> $GITHUB_STEP_SUMMARY
+
+           if [ "$all_passed" == "true" ]; then
+             echo "### 🎉 All checks passed!" >> $GITHUB_STEP_SUMMARY
+             echo "" >> $GITHUB_STEP_SUMMARY
+             echo "Your code meets all quality standards." >> $GITHUB_STEP_SUMMARY
+           else
+             echo "### ⚠️ Some checks failed" >> $GITHUB_STEP_SUMMARY
+             echo "" >> $GITHUB_STEP_SUMMARY
+             echo "Please review the errors above and fix them." >> $GITHUB_STEP_SUMMARY
+             echo "" >> $GITHUB_STEP_SUMMARY
+             echo "**Tip**: Run \`pre-commit run --all-files\` locally to catch these issues before pushing." >> $GITHUB_STEP_SUMMARY
+             exit 1
+           fi
+
+       - name: Upload Bandit report
+         if: always() && steps.bandit.outcome != 'skipped'
+         uses: actions/upload-artifact@v4
+         with:
+           name: bandit-security-report
+           path: bandit-report.json
+           retention-days: 30
Scrapling/.github/workflows/docker-build.yml ADDED
@@ -0,0 +1,86 @@
+ name: Build and Push Docker Image
+
+ on:
+   pull_request:
+     types: [closed]
+     branches:
+       - main
+   workflow_dispatch:
+     inputs:
+       tag:
+         description: 'Docker image tag'
+         required: true
+         default: 'latest'
+
+ env:
+   DOCKERHUB_IMAGE: pyd4vinci/scrapling
+   GHCR_IMAGE: ghcr.io/${{ github.repository_owner }}/scrapling
+
+ jobs:
+   build-and-push:
+     runs-on: ubuntu-latest
+     permissions:
+       contents: read
+       packages: write
+
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+
+       - name: Set up Docker Buildx
+         uses: docker/setup-buildx-action@v3
+         with:
+           platforms: linux/amd64,linux/arm64
+
+       - name: Log in to Docker Hub
+         uses: docker/login-action@v3
+         with:
+           registry: docker.io
+           username: ${{ secrets.DOCKER_USERNAME }}
+           password: ${{ secrets.DOCKER_PASSWORD }}
+
+       - name: Log in to GitHub Container Registry
+         uses: docker/login-action@v3
+         with:
+           registry: ghcr.io
+           username: ${{ github.actor }}
+           password: ${{ secrets.CONTAINER_TOKEN }}
+
+       - name: Extract metadata
+         id: meta
+         uses: docker/metadata-action@v5
+         with:
+           images: |
+             ${{ env.DOCKERHUB_IMAGE }}
+             ${{ env.GHCR_IMAGE }}
+           tags: |
+             type=ref,event=branch
+             type=ref,event=pr
+             type=semver,pattern={{version}}
+             type=semver,pattern={{major}}.{{minor}}
+             type=semver,pattern={{major}}
+             type=raw,value=latest,enable={{is_default_branch}}
+           labels: |
+             org.opencontainers.image.title=Scrapling
+             org.opencontainers.image.description=An undetectable, powerful, flexible, high-performance Python library that makes Web Scraping easy and effortless as it should be!
+             org.opencontainers.image.vendor=D4Vinci
+             org.opencontainers.image.licenses=BSD
+             org.opencontainers.image.url=https://scrapling.readthedocs.io/en/latest/
+             org.opencontainers.image.source=${{ github.server_url }}/${{ github.repository }}
+             org.opencontainers.image.documentation=https://scrapling.readthedocs.io/en/latest/
+
+       - name: Build and push Docker image
+         uses: docker/build-push-action@v5
+         with:
+           context: .
+           platforms: linux/amd64,linux/arm64
+           push: true
+           tags: ${{ steps.meta.outputs.tags }}
+           labels: ${{ steps.meta.outputs.labels }}
+           cache-from: type=gha
+           cache-to: type=gha,mode=max
+           build-args: |
+             BUILDKIT_INLINE_CACHE=1
+
+       - name: Image digest
+         run: echo ${{ steps.build.outputs.digest }}
Scrapling/.github/workflows/release-and-publish.yml ADDED
@@ -0,0 +1,74 @@
+ name: Create Release and Publish to PyPI
+ # Creates a GitHub release when a PR is merged to main (using PR title as version and body as release notes), then publishes to PyPI.
+
+ on:
+   pull_request:
+     types: [closed]
+     branches:
+       - main
+
+ jobs:
+   create-release-and-publish:
+     if: github.event.pull_request.merged == true
+     runs-on: ubuntu-latest
+     environment:
+       name: PyPI
+       url: https://pypi.org/p/scrapling
+     permissions:
+       contents: write
+       id-token: write
+     steps:
+       - uses: actions/checkout@v4
+         with:
+           fetch-depth: 0
+
+       - name: Get PR title
+         id: pr_title
+         run: echo "title=${{ github.event.pull_request.title }}" >> $GITHUB_OUTPUT
+
+       - name: Save PR body to file
+         uses: actions/github-script@v6
+         with:
+           script: |
+             const fs = require('fs');
+             fs.writeFileSync('pr_body.md', context.payload.pull_request.body || '');
+
+       - name: Extract version
+         id: extract_version
+         run: |
+           PR_TITLE="${{ steps.pr_title.outputs.title }}"
+           if [[ $PR_TITLE =~ ^v ]]; then
+             echo "version=$PR_TITLE" >> $GITHUB_OUTPUT
+             echo "Valid version format found in PR title: $PR_TITLE"
+           else
+             echo "Error: PR title '$PR_TITLE' must start with 'v' (e.g., 'v1.0.0') to create a release."
+             exit 1
+           fi
+
+       - name: Create Release
+         uses: softprops/action-gh-release@v1
+         with:
+           tag_name: ${{ steps.extract_version.outputs.version }}
+           name: Release ${{ steps.extract_version.outputs.version }}
+           body_path: pr_body.md
+           draft: false
+           prerelease: false
+         env:
+           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+       - name: Set up Python
+         uses: actions/setup-python@v5
+         with:
+           python-version: 3.12
+
+       - name: Upgrade pip
+         run: python3 -m pip install --upgrade pip
+
+       - name: Install build
+         run: python3 -m pip install --upgrade build twine setuptools
+
+       - name: Build a binary wheel and a source tarball
+         run: python3 -m build --sdist --wheel --outdir dist/
+
+       - name: Publish distribution 📦 to PyPI
+         uses: pypa/gh-action-pypi-publish@release/v1
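The `Extract version` step above gates releases on the PR title starting with `v`; the same guard can be sketched in Python (the helper name and sample titles are illustrative, not part of the workflow):

```python
import re
from typing import Optional


def release_version(pr_title: str) -> Optional[str]:
    """Mirror of the workflow's guard: accept the title only if it starts with 'v'."""
    return pr_title if re.match(r"^v", pr_title) else None


print(release_version("v1.0.0"))           # accepted, becomes the release tag
print(release_version("Fix typo in docs")) # rejected: the workflow would exit 1
```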
Scrapling/.github/workflows/tests.yml ADDED
@@ -0,0 +1,109 @@
+ name: Tests
+ on:
+   push:
+     branches:
+       - main
+       - dev
+     paths-ignore:
+       - '*.md'
+       - '**/*.md'
+       - 'docs/*'
+       - 'images/*'
+       - '.github/*'
+       - '*.yml'
+       - '*.yaml'
+       - 'ruff.toml'
+
+ concurrency:
+   group: ${{ github.workflow }}-${{ github.ref }}
+   cancel-in-progress: true
+
+ jobs:
+   tests:
+     timeout-minutes: 60
+     runs-on: ${{ matrix.os }}
+     strategy:
+       fail-fast: false
+       matrix:
+         include:
+           - python-version: "3.10"
+             os: macos-latest
+             env:
+               TOXENV: py310
+           - python-version: "3.11"
+             os: macos-latest
+             env:
+               TOXENV: py311
+           - python-version: "3.12"
+             os: macos-latest
+             env:
+               TOXENV: py312
+           - python-version: "3.13"
+             os: macos-latest
+             env:
+               TOXENV: py313
+
+     steps:
+       - uses: actions/checkout@v4
+
+       - name: Set up Python ${{ matrix.python-version }}
+         uses: actions/setup-python@v5
+         with:
+           python-version: ${{ matrix.python-version }}
+           cache: 'pip'
+           cache-dependency-path: |
+             pyproject.toml
+             tox.ini
+
+       - name: Install all browsers dependencies
+         run: |
+           python3 -m pip install --upgrade pip
+           python3 -m pip install playwright==1.56.0 patchright==1.56.0
+
+       - name: Get Playwright version
+         id: playwright-version
+         run: |
+           PLAYWRIGHT_VERSION=$(python3 -c "import importlib.metadata; print(importlib.metadata.version('playwright'))")
+           echo "version=$PLAYWRIGHT_VERSION" >> $GITHUB_OUTPUT
+           echo "Playwright version: $PLAYWRIGHT_VERSION"
+
+       - name: Retrieve Playwright browsers from cache if any
+         id: playwright-cache
+         uses: actions/cache@v4
+         with:
+           path: |
+             ~/.cache/ms-playwright
+             ~/Library/Caches/ms-playwright
+             ~/.ms-playwright
+           key: ${{ runner.os }}-playwright-${{ steps.playwright-version.outputs.version }}-v1
+           restore-keys: |
+             ${{ runner.os }}-playwright-${{ steps.playwright-version.outputs.version }}-
+             ${{ runner.os }}-playwright-
+
+       - name: Install Playwright browsers
+         run: |
+           echo "Cache hit: ${{ steps.playwright-cache.outputs.cache-hit }}"
+           if [ "${{ steps.playwright-cache.outputs.cache-hit }}" != "true" ]; then
+             python3 -m playwright install chromium
+           else
+             echo "Skipping install - using cached Playwright browsers"
+           fi
+           python3 -m playwright install-deps chromium
+
+       # Cache tox environments
+       - name: Cache tox environments
+         uses: actions/cache@v4
+         with:
+           path: .tox
+           # Include python version and os in the cache key
+           key: tox-v1-${{ runner.os }}-py${{ matrix.python-version }}-${{ hashFiles('/Users/runner/work/Scrapling/pyproject.toml') }}
+           restore-keys: |
+             tox-v1-${{ runner.os }}-py${{ matrix.python-version }}-
+             tox-v1-${{ runner.os }}-
+
+       - name: Install tox
+         run: pip install -U tox
+
+       - name: Run tests
+         env: ${{ matrix.env }}
+         run: tox
Scrapling/.gitignore ADDED
@@ -0,0 +1,107 @@
+ site/*
+
+ # AI related files
+ .claude/*
+ CLAUDE.md
+
+ # cached files
+ __pycache__/
+ *.py[cod]
+ .cache
+ .DS_Store
+ *~
+ .*.sw[po]
+ .build
+ .ve
+ .env
+ .pytest
+ .benchmarks
+ .bootstrap
+ .appveyor.token
+ *.bak
+ *.db
+ *.db-*
+
+ # installation package
+ *.egg-info/
+ dist/
+ build/
+
+ # environments
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # C extensions
+ *.so
+
+ # pycharm
+ .idea/
+
+ # vscode
+ *.code-workspace
+
+ # Packages
+ *.egg
+ *.egg-info
+ dist
+ build
+ eggs
+ .eggs
+ parts
+ bin
+ var
+ sdist
+ wheelhouse
+ develop-eggs
+ .installed.cfg
+ lib
+ lib64
+ venv*/
+ .venv*/
+ pyvenv*/
+ pip-wheel-metadata/
+ poetry.lock
+
+ # Installer logs
+ pip-log.txt
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+ mypy.ini
+
+ # test caches
+ .tox/
+ .pytest_cache/
+ .coverage
+ htmlcov
+ report.xml
+ nosetests.xml
+ coverage.xml
+
+ # Translations
+ *.mo
+
+ # Buildout
+ .mr.developer.cfg
+
+ # IDE project files
+ .project
+ .pydevproject
+ .idea
+ *.iml
+ *.komodoproject
+
+ # Complexity
+ output/*.html
+ output/*/index.html
+
+ # Sphinx
+ docs/_build
+ public/
+ web/
Scrapling/.pre-commit-config.yaml ADDED
@@ -0,0 +1,20 @@
+ repos:
+ - repo: https://github.com/PyCQA/bandit
+ rev: 1.9.0
+ hooks:
+ - id: bandit
+ args: [-r, -c, .bandit.yml]
+ - repo: https://github.com/astral-sh/ruff-pre-commit
+ # Ruff version.
+ rev: v0.14.5
+ hooks:
+ # Run the linter.
+ - id: ruff
+ args: [ --fix ]
+ # Run the formatter.
+ - id: ruff-format
+ - repo: https://github.com/netromdk/vermin
+ rev: v1.7.0
+ hooks:
+ - id: vermin
+ args: ['-t=3.10-', '--violations', '--eval-annotations', '--no-tips']
Scrapling/.readthedocs.yaml ADDED
@@ -0,0 +1,21 @@
+ # See https://docs.readthedocs.com/platform/stable/intro/zensical.html for details
+ # Example: https://github.com/readthedocs/test-builds/tree/zensical
+
+ version: 2
+
+ build:
+ os: ubuntu-24.04
+ apt_packages:
+ - pngquant
+ tools:
+ python: "3.13"
+ jobs:
+ install:
+ - pip install -r docs/requirements.txt
+ - pip install ".[all]"
+ build:
+ html:
+ - zensical build
+ post_build:
+ - mkdir -p $READTHEDOCS_OUTPUT/html/
+ - cp --recursive site/* $READTHEDOCS_OUTPUT/html/
Scrapling/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,128 @@
+ # Contributor Covenant Code of Conduct
+
+ ## Our Pledge
+
+ We as members, contributors, and leaders pledge to make participation in our
+ community a harassment-free experience for everyone, regardless of age, body
+ size, visible or invisible disability, ethnicity, sex characteristics, gender
+ identity and expression, level of experience, education, socio-economic status,
+ nationality, personal appearance, race, religion, or sexual identity
+ and orientation.
+
+ We pledge to act and interact in ways that contribute to an open, welcoming,
+ diverse, inclusive, and healthy community.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to a positive environment for our
+ community include:
+
+ * Demonstrating empathy and kindness toward other people
+ * Being respectful of differing opinions, viewpoints, and experiences
+ * Giving and gracefully accepting constructive feedback
+ * Accepting responsibility and apologizing to those affected by our mistakes,
+ and learning from the experience
+ * Focusing on what is best not just for us as individuals, but for the
+ overall community
+
+ Examples of unacceptable behavior include:
+
+ * The use of sexualized language or imagery, and sexual attention or
+ advances of any kind
+ * Trolling, insulting or derogatory comments, and personal or political attacks
+ * Public or private harassment
+ * Publishing others' private information, such as a physical or email
+ address, without their explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+
+ ## Enforcement Responsibilities
+
+ Community leaders are responsible for clarifying and enforcing our standards of
+ acceptable behavior and will take appropriate and fair corrective action in
+ response to any behavior that they deem inappropriate, threatening, offensive,
+ or harmful.
+
+ Community leaders have the right and responsibility to remove, edit, or reject
+ comments, commits, code, wiki edits, issues, and other contributions that are
+ not aligned to this Code of Conduct, and will communicate reasons for moderation
+ decisions when appropriate.
+
+ ## Scope
+
+ This Code of Conduct applies within all community spaces, and also applies when
+ an individual is officially representing the community in public spaces.
+ Examples of representing our community include using an official e-mail address,
+ posting via an official social media account, or acting as an appointed
+ representative at an online or offline event.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported to the community leaders responsible for enforcement at
+ karim.shoair@pm.me.
+ All complaints will be reviewed and investigated promptly and fairly.
+
+ All community leaders are obligated to respect the privacy and security of the
+ reporter of any incident.
+
+ ## Enforcement Guidelines
+
+ Community leaders will follow these Community Impact Guidelines in determining
+ the consequences for any action they deem in violation of this Code of Conduct:
+
+ ### 1. Correction
+
+ **Community Impact**: Use of inappropriate language or other behavior deemed
+ unprofessional or unwelcome in the community.
+
+ **Consequence**: A private, written warning from community leaders, providing
+ clarity around the nature of the violation and an explanation of why the
+ behavior was inappropriate. A public apology may be requested.
+
+ ### 2. Warning
+
+ **Community Impact**: A violation through a single incident or series
+ of actions.
+
+ **Consequence**: A warning with consequences for continued behavior. No
+ interaction with the people involved, including unsolicited interaction with
+ those enforcing the Code of Conduct, for a specified period of time. This
+ includes avoiding interactions in community spaces as well as external channels
+ like social media. Violating these terms may lead to a temporary or
+ permanent ban.
+
+ ### 3. Temporary Ban
+
+ **Community Impact**: A serious violation of community standards, including
+ sustained inappropriate behavior.
+
+ **Consequence**: A temporary ban from any sort of interaction or public
+ communication with the community for a specified period of time. No public or
+ private interaction with the people involved, including unsolicited interaction
+ with those enforcing the Code of Conduct, is allowed during this period.
+ Violating these terms may lead to a permanent ban.
+
+ ### 4. Permanent Ban
+
+ **Community Impact**: Demonstrating a pattern of violation of community
+ standards, including sustained inappropriate behavior, harassment of an
+ individual, or aggression toward or disparagement of classes of individuals.
+
+ **Consequence**: A permanent ban from any sort of public interaction within
+ the community.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][homepage],
+ version 2.0, available at
+ https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
+
+ Community Impact Guidelines were inspired by [Mozilla's code of conduct
+ enforcement ladder](https://github.com/mozilla/diversity).
+
+ [homepage]: https://www.contributor-covenant.org
+
+ For answers to common questions about this code of conduct, see the FAQ at
+ https://www.contributor-covenant.org/faq. Translations are available at
+ https://www.contributor-covenant.org/translations.
Scrapling/CONTRIBUTING.md ADDED
@@ -0,0 +1,106 @@
+ # Contributing to Scrapling
+
+ Thank you for your interest in contributing to Scrapling!
+
+ Everybody is invited and welcome to contribute to Scrapling.
+
+ Minor changes are more likely to be included promptly. Adding unit tests for new features or test cases for bugs you've fixed helps us ensure that the Pull Request (PR) is acceptable.
+
+ There are many ways to contribute to Scrapling. Here are some of them:
+
+ - Report bugs and request features using the [GitHub issues](https://github.com/D4Vinci/Scrapling/issues). Please follow the issue template to help us resolve your issue quickly.
+ - Blog about Scrapling. Tell the world how you’re using Scrapling. This will help newcomers with more examples and increase the Scrapling project's visibility.
+ - Join the [Discord community](https://discord.gg/EMgGbDceNQ) and share your ideas on how to improve Scrapling. We’re always open to suggestions.
+ - If you are not a developer, perhaps you would like to help with translating the [documentation](https://github.com/D4Vinci/Scrapling/tree/docs)?
+
+
+ ## Finding work
+
+ If you have decided to make a contribution to Scrapling, but you do not know what to contribute, here are some ways to find pending work:
+
+ - Check out the [contribution](https://github.com/D4Vinci/Scrapling/contribute) GitHub page, which lists open issues tagged as `good first issue`. These issues provide a good starting point.
+ - There are also the [help wanted](https://github.com/D4Vinci/Scrapling/issues?q=is%3Aissue%20label%3A%22help%20wanted%22%20state%3Aopen) issues, but know that some may require familiarity with the Scrapling code base first. You can also target any other issue, provided it is not tagged `invalid`, `wontfix`, or similar.
+ - If you enjoy writing automated tests, you can work on increasing our test coverage. Currently, the test coverage is around 90–92%.
+ - Join the [Discord community](https://discord.gg/EMgGbDceNQ) and ask questions in the `#help` channel.
+
+ ## Coding style
+ Please follow these coding conventions as we do when writing code for Scrapling:
+ - We use [pre-commit](https://pre-commit.com/) to automatically address simple code issues before every commit, so please install it and run `pre-commit install` to set it up. This will install hooks to run [ruff](https://docs.astral.sh/ruff/), [bandit](https://github.com/PyCQA/bandit), and [vermin](https://github.com/netromdk/vermin) on every commit. We currently use a workflow to run these tools automatically on every PR, so if your code doesn't pass these checks, the PR will be rejected.
+ - We use type hints for better code clarity and [pyright](https://github.com/microsoft/pyright) for static type checking, which depends on the type hints, of course.
+ - We use the conventional commit message format as described [here](https://gist.github.com/qoomon/5dfcdf8eec66a051ecd85625518cfd13#types), so, for example, we use the following prefixes for commit messages:
+
+ | Prefix | When to use it |
+ |-------------|--------------------------|
+ | `feat:` | New feature added |
+ | `fix:` | Bug fix |
+ | `docs:` | Documentation change/add |
+ | `test:` | Tests |
+ | `refactor:` | Code refactoring |
+ | `chore:` | Maintenance tasks |
+
+ Then include the details of the change in the commit message body/description.
+
+ Example:
+ ```
+ feat: add `adaptive` for similar elements
+
+ - Added find_similar() method
+ - Implemented pattern matching
+ - Added tests and documentation
+ ```
+
52
+ > Please don’t put your name in the code you contribute; git provides enough metadata to identify the author of the code.
53
+
54
+ ## Development
55
+ Setting the scrapling logging level to `debug` makes it easier to know what's happening in the background.
56
+ ```python
57
+ import logging
58
+ logging.getLogger("scrapling").setLevel(logging.DEBUG)
59
+ ```
60
+ Bonus: You can install the beta of the upcoming update from the dev branch as follows
61
+ ```commandline
62
+ pip3 install git+https://github.com/D4Vinci/Scrapling.git@dev
63
+ ```
64
+
65
+ ## Building Documentation
66
+ Documentation is built using [MkDocs](https://www.mkdocs.org/). You can build it locally using the following commands:
67
+ ```bash
68
+ pip install mkdocs-material
69
+ mkdocs serve # Local preview
70
+ mkdocs build # Build the static site
71
+ ```
72
+
73
+ ## Tests
74
+ Scrapling includes a comprehensive test suite that can be executed with pytest. However, first, you need to install all libraries and `pytest-plugins` listed in `tests/requirements.txt`. Then, running the tests will result in an output like this:
75
+ ```bash
76
+ $ pytest tests -n auto
77
+ =============================== test session starts ===============================
78
+ platform darwin -- Python 3.13.8, pytest-8.4.2, pluggy-1.6.0 -- /Users/<redacted>/.venv/bin/python3.13
79
+ cachedir: .pytest_cache
80
+ rootdir: /Users/<redacted>/scrapling
81
+ configfile: pytest.ini
82
+ plugins: asyncio-1.2.0, anyio-4.11.0, xdist-3.8.0, httpbin-2.1.0, cov-7.0.0
83
+ asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=function, asyncio_default_test_loop_scope=function
84
+ 10 workers [271 items]
85
+ scheduling tests via LoadScheduling
86
+
87
+ ...<shortened>...
88
+
89
+ =============================== 271 passed in 52.68s ==============================
90
+ ```
91
+ Hence, we used `-n auto` in the command above to run tests in threads to increase speed.
92
+
93
+ Bonus: You can also see the test coverage with the `pytest` plugin below
94
+ ```bash
95
+ pytest --cov=scrapling tests/
96
+ ```
97
+
98
+ ## Making a Pull Request
99
+ To ensure that your PR gets accepted, please make sure that your PR is based on the latest changes from the dev branch and that it satisfies the following requirements:
100
+
101
+ - The PR should be made against the [**dev**](https://github.com/D4Vinci/Scrapling/tree/dev) branch of Scrapling. Any PR made against the main branch will be rejected.
102
+ - The code should be passing all available tests. We use tox with GitHub's CI to run the current tests on all supported Python versions for every code-related commit.
103
+ - The code should be passing all code quality checks we mentioned above. We are using GitHub's CI to enforce the code style checks performed by pre-commit. If you were using the pre-commit hooks we discussed above, you should not see any issues when committing your changes.
104
+ - Make your changes, keep the code clean with an explanation of any part that might be vague, and remember to create a separate virtual environment for this project.
105
+ - If you are adding a new feature, please add tests for it.
106
+ - If you are fixing a bug, please add code with the PR that reproduces the bug.
Scrapling/Dockerfile ADDED
@@ -0,0 +1,39 @@
+ FROM python:3.12-slim-trixie
+ COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+ # Set environment variables
+ ENV DEBIAN_FRONTEND=noninteractive \
+ PYTHONUNBUFFERED=1 \
+ PYTHONDONTWRITEBYTECODE=1
+
+ WORKDIR /app
+
+ # Copy dependency file first for better layer caching
+ COPY pyproject.toml ./
+
+ # Install dependencies only
+ RUN --mount=type=cache,target=/root/.cache/uv \
+ uv sync --no-install-project --all-extras --compile-bytecode
+
+ # Copy source code
+ COPY . .
+
+ # Install browsers and project in one optimized layer
+ RUN --mount=type=cache,target=/root/.cache/uv \
+ --mount=type=cache,target=/var/cache/apt \
+ --mount=type=cache,target=/var/lib/apt \
+ apt-get update && \
+ uv run playwright install-deps chromium && \
+ uv run playwright install chromium && \
+ uv sync --all-extras --compile-bytecode && \
+ apt-get clean && \
+ rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
+
+ # Expose port for MCP server HTTP transport
+ EXPOSE 8000
+
+ # Set entrypoint to run scrapling
+ ENTRYPOINT ["uv", "run", "scrapling"]
+
+ # Default command (can be overridden)
+ CMD ["--help"]
Scrapling/LICENSE ADDED
@@ -0,0 +1,28 @@
+ BSD 3-Clause License
+
+ Copyright (c) 2024, Karim shoair
+
+ Redistribution and use in source and binary forms, with or without
+ modification, are permitted provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright notice, this
+ list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+
+ 3. Neither the name of the copyright holder nor the names of its
+ contributors may be used to endorse or promote products derived from
+ this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Scrapling/MANIFEST.in ADDED
@@ -0,0 +1,13 @@
+ include LICENSE
+ include *.db
+ include *.js
+ include scrapling/engines/toolbelt/bypasses/*.js
+ include scrapling/*.db
+ include scrapling/*.db*
+ include scrapling/*.db-*
+ include scrapling/py.typed
+ include scrapling/.scrapling_dependencies_installed
+ include .scrapling_dependencies_installed
+
+ recursive-exclude * __pycache__
+ recursive-exclude * *.py[co]
Scrapling/README.md ADDED
@@ -0,0 +1,419 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!-- mcp-name: io.github.D4Vinci/Scrapling -->
2
+
3
+ <h1 align="center">
4
+ <a href="https://scrapling.readthedocs.io">
5
+ <picture>
6
+ <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
7
+ <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
8
+ </picture>
9
+ </a>
10
+ <br>
11
+ <small>Effortless Web Scraping for the Modern Web</small>
12
+ </h1>
13
+
14
+ <p align="center">
15
+ <a href="https://trendshift.io/repositories/14244" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14244" alt="D4Vinci%2FScrapling | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
16
+ <br/>
17
+ <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_AR.md">العربيه</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_ES.md">Español</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_DE.md">Deutsch</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_CN.md">简体中文</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_JP.md">日本語</a> | <a href="https://github.com/D4Vinci/Scrapling/blob/main/docs/README_RU.md">Русский</a>
18
+ <br/>
19
+ <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
20
+ <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
21
+ <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
22
+ <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
23
+ <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
24
+ <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
25
+ <br/>
26
+ <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
27
+ <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
28
+ </a>
29
+ <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
30
+ <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
31
+ </a>
32
+ <br/>
33
+ <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
34
+ <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
35
+ </p>
36
+
37
+ <p align="center">
38
+ <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>Selection methods</strong></a>
39
+ &middot;
40
+ <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>Fetchers</strong></a>
41
+ &middot;
42
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a>
43
+ &middot;
44
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy Rotation</strong></a>
45
+ &middot;
46
+ <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
47
+ &middot;
48
+ <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>MCP</strong></a>
49
+ </p>
50
+
51
+ Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.
52
+
53
+ Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.
54
+
55
+ Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
56
+
57
+ ```python
58
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
59
+ StealthyFetcher.adaptive = True
60
+ p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar!
61
+ products = p.css('.product', auto_save=True) # Scrape data that survives website design changes!
62
+ products = p.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them!
63
+ ```
64
+ Or scale up to full crawls
65
+ ```python
66
+ from scrapling.spiders import Spider, Response
67
+
68
+ class MySpider(Spider):
69
+ name = "demo"
70
+ start_urls = ["https://example.com/"]
71
+
72
+ async def parse(self, response: Response):
73
+ for item in response.css('.product'):
74
+ yield {"title": item.css('h2::text').get()}
75
+
76
+ MySpider().start()
77
+ ```
78
+
79
+ # Platinum Sponsors
80
+
81
+ # Sponsors
82
+
83
+ <!-- sponsors -->
84
+
85
+ <a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
86
+ <a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
87
+ <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
88
+ <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
89
+ <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
90
+ <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
91
+ <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
92
+ <a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
93
+ <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>
94
+
95
+
96
+ <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
97
+ <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
98
+ <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
99
+
100
+ <!-- /sponsors -->
101
+
102
+ <i><sub>Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci) and choose the tier that suites you!</sub></i>
103
+
104
+ ---
105
+
106
+ ## Key Features
107
+
108
+ ### Spiders — A Full Crawling Framework
109
+ - 🕷️ **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
110
+ - ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
111
+ - 🔄 **Multi-Session Support**: Unified interface for HTTP requests, and stealthy headless browsers in a single spider — route requests to different sessions by ID.
112
+ - 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
113
+ - 📡 **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats — ideal for UI, pipelines, and long-running crawls.
114
+ - 🛡️ **Blocked Request Detection**: Automatic detection and retry of blocked requests with customizable logic.
115
+ - 📦 **Built-in Export**: Export results through hooks and your own pipeline or the built-in JSON/JSONL with `result.items.to_json()` / `result.items.to_jsonl()` respectively.
116
+
117
+ ### Advanced Websites Fetching with Session Support
118
+ - **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP/3.
119
+ - **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium and Google's Chrome.
120
+ - **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
121
+ - **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
122
+ - **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.
123
+ - **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
124
+ - **Async Support**: Complete async support across all fetchers and dedicated async session classes.
125
+
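The cyclic rotation strategy can be pictured as an endless cycle over the proxy pool. A minimal sketch of that idea using only the standard library — the real `ProxyRotator` also supports custom strategies and per-request overrides:

```python
from itertools import cycle

# A hypothetical proxy pool for illustration
proxies = [
    "http://proxy-a:8080",
    "http://proxy-b:8080",
    "http://proxy-c:8080",
]
rotation = cycle(proxies)  # endless cyclic iterator over the pool

# Each outgoing request takes the next proxy in the cycle,
# wrapping around once the pool is exhausted
assigned = [next(rotation) for _ in range(5)]
print(assigned)
```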
126
+ ### Adaptive Scraping & AI Integration
127
+ - 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
128
+ - 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
129
+ - 🔍 **Find Similar Elements**: Automatically locate elements similar to found elements.
130
+ - 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
131
+
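The "find similar elements" idea boils down to comparing structural fingerprints (tag name, attributes, position). A toy illustration of attribute-based scoring with `difflib` — not Scrapling's actual similarity algorithm, just the general shape of the comparison:

```python
from difflib import SequenceMatcher

def similarity(el_a: dict, el_b: dict) -> float:
    """Toy structural similarity: compare tag names and class strings."""
    fingerprint_a = f"{el_a['tag']} {el_a.get('class', '')}"
    fingerprint_b = f"{el_b['tag']} {el_b.get('class', '')}"
    return SequenceMatcher(None, fingerprint_a, fingerprint_b).ratio()

quote = {"tag": "div", "class": "quote"}
candidates = [
    {"tag": "div", "class": "quote"},           # identical structure
    {"tag": "div", "class": "quote featured"},  # close variant
    {"tag": "footer", "class": "site-footer"},  # unrelated element
]
scores = [round(similarity(quote, c), 2) for c in candidates]
print(scores)
```

Ranking candidates by such a score is what lets similar elements surface even when their markup is not byte-identical.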
132
+ ### High-Performance & Battle-Tested Architecture
133
+ - 🚀 **Lightning Fast**: Optimized performance outperforming most Python scraping libraries.
134
+ - 🔋 **Memory Efficient**: Optimized data structures and lazy loading for a minimal memory footprint.
135
+ - ⚡ **Fast JSON Serialization**: 10x faster than the standard library.
136
+ - 🏗️ **Battle tested**: Not only does Scrapling have 92% test coverage and full type hints coverage, but it has been used daily by hundreds of Web Scrapers over the past year.
137
+
138
+ ### Developer/Web Scraper Friendly Experience
139
+ - 🎯 **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up the development of Web Scraping scripts, like converting curl requests to Scrapling requests and viewing request results in your browser.
140
+ - 🚀 **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
141
+ - 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
142
+ - 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
143
+ - 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
144
+ - 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup with the same pseudo-elements used in Scrapy/Parsel.
145
+ - 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** with each change.
146
+ - 🔋 **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.
147
+
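Selector generation generally works by walking up from the target element and recording the tag plus its positional index at each level. A simplified, stdlib-only sketch of that walk — Scrapling's real generator also weighs ids and classes to produce more robust selectors:

```python
import xml.etree.ElementTree as ET

def css_path(root: ET.Element, target: ET.Element) -> str:
    """Build a positional CSS path (tag:nth-of-type) from root to target."""
    # ElementTree has no parent pointers, so build a child -> parent map first
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parent_of.get(node)
        if parent is None:
            steps.append(node.tag)  # reached the document root
        else:
            same_tag = [c for c in parent if c.tag == node.tag]
            steps.append(f"{node.tag}:nth-of-type({same_tag.index(node) + 1})")
        node = parent
    return " > ".join(reversed(steps))

root = ET.fromstring(
    '<html><body><div class="quote"><span>one</span></div>'
    '<div class="quote"><span>two</span></div></body></html>'
)
second_span = root.findall(".//div")[1][0]  # first child of the second quote div
selector = css_path(root, second_span)
print(selector)
```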
148
+ ## Getting Started
149
+
150
+ Let's give you a quick glimpse of what Scrapling can do without diving too deep.
151
+
152
+ ### Basic Usage
153
+ HTTP requests with session support
154
+ ```python
155
+ from scrapling.fetchers import Fetcher, FetcherSession
156
+
157
+ with FetcherSession(impersonate='chrome') as session: # Use latest version of Chrome's TLS fingerprint
158
+ page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
159
+ quotes = page.css('.quote .text::text').getall()
160
+
161
+ # Or use one-off requests
162
+ page = Fetcher.get('https://quotes.toscrape.com/')
163
+ quotes = page.css('.quote .text::text').getall()
164
+ ```
165
+ Advanced stealth mode
166
+ ```python
167
+ from scrapling.fetchers import StealthyFetcher, StealthySession
168
+
169
+ with StealthySession(headless=True, solve_cloudflare=True) as session: # Keep the browser open until you finish
170
+ page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
171
+ data = page.css('#padded_content a').getall()
172
+
173
+ # Or use the one-off request style; it opens the browser for this request, then closes it after finishing
174
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
175
+ data = page.css('#padded_content a').getall()
176
+ ```
177
+ Full browser automation
178
+ ```python
179
+ from scrapling.fetchers import DynamicFetcher, DynamicSession
180
+
181
+ with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Keep the browser open until you finish
182
+ page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
183
+ data = page.xpath('//span[@class="text"]/text()').getall() # XPath selector if you prefer it
184
+
185
+ # Or use the one-off request style; it opens the browser for this request, then closes it after finishing
186
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
187
+ data = page.css('.quote .text::text').getall()
188
+ ```
189
+
190
+ ### Spiders
191
+ Build full crawlers with concurrent requests, multiple session types, and pause/resume:
192
+ ```python
193
+ from scrapling.spiders import Spider, Request, Response
194
+
195
+ class QuotesSpider(Spider):
196
+ name = "quotes"
197
+ start_urls = ["https://quotes.toscrape.com/"]
198
+ concurrent_requests = 10
199
+
200
+ async def parse(self, response: Response):
201
+ for quote in response.css('.quote'):
202
+ yield {
203
+ "text": quote.css('.text::text').get(),
204
+ "author": quote.css('.author::text').get(),
205
+ }
206
+
207
+ next_page = response.css('.next a')
208
+ if next_page:
209
+ yield response.follow(next_page[0].attrib['href'])
210
+
211
+ result = QuotesSpider().start()
212
+ print(f"Scraped {len(result.items)} quotes")
213
+ result.items.to_json("quotes.json")
214
+ ```
215
+ Use multiple session types in a single spider:
216
+ ```python
217
+ from scrapling.spiders import Spider, Request, Response
218
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession
219
+
220
+ class MultiSessionSpider(Spider):
221
+ name = "multi"
222
+ start_urls = ["https://example.com/"]
223
+
224
+ def configure_sessions(self, manager):
225
+ manager.add("fast", FetcherSession(impersonate="chrome"))
226
+ manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
227
+
228
+ async def parse(self, response: Response):
229
+ for link in response.css('a::attr(href)').getall():
230
+ # Route protected pages through the stealth session
231
+ if "protected" in link:
232
+ yield Request(link, sid="stealth")
233
+ else:
234
+ yield Request(link, sid="fast", callback=self.parse) # explicit callback
235
+ ```
236
+ Pause and resume long crawls with checkpoints by running the spider like this:
237
+ ```python
238
+ QuotesSpider(crawldir="./crawl_data").start()
239
+ ```
240
+ Press Ctrl+C to pause gracefully — progress is saved automatically. Later, when you start the spider again, pass the same `crawldir`, and it will resume from where it stopped.
241
+
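Conceptually, pause/resume amounts to persisting the crawl frontier (pending and seen URLs) to the crawl directory and reloading it on startup. A simplified, hypothetical sketch of that idea — not Scrapling's actual checkpoint format:

```python
import json
import tempfile
from pathlib import Path

def save_checkpoint(crawldir: Path, pending: list[str], seen: set[str]) -> None:
    """Persist the crawl frontier so the crawl can be resumed later."""
    crawldir.mkdir(parents=True, exist_ok=True)
    state = {"pending": pending, "seen": sorted(seen)}
    (crawldir / "checkpoint.json").write_text(json.dumps(state))

def load_checkpoint(crawldir: Path) -> tuple[list[str], set[str]]:
    """Reload the frontier; an empty state means a fresh crawl."""
    path = crawldir / "checkpoint.json"
    if not path.exists():
        return [], set()
    state = json.loads(path.read_text())
    return state["pending"], set(state["seen"])

with tempfile.TemporaryDirectory() as tmp:
    crawldir = Path(tmp) / "crawl_data"
    # Simulate a graceful shutdown mid-crawl, then a restart
    save_checkpoint(crawldir, ["https://example.com/page2"], {"https://example.com/"})
    pending, seen = load_checkpoint(crawldir)
print(pending, seen)
```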
242
+ ### Advanced Parsing & Navigation
243
+ ```python
244
+ from scrapling.fetchers import Fetcher
245
+
246
+ # Rich element selection and navigation
247
+ page = Fetcher.get('https://quotes.toscrape.com/')
248
+
249
+ # Get quotes with multiple selection methods
250
+ quotes = page.css('.quote') # CSS selector
251
+ quotes = page.xpath('//div[@class="quote"]') # XPath
252
+ quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup-style
253
+ # Same as
254
+ quotes = page.find_all('div', class_='quote')
255
+ quotes = page.find_all(['div'], class_='quote')
256
+ quotes = page.find_all(class_='quote') # and so on...
257
+ # Find element by text content
258
+ quotes = page.find_by_text('quote', tag='div')
259
+
260
+ # Advanced navigation
261
+ quote_text = page.css('.quote')[0].css('.text::text').get()
262
+ quote_text = page.css('.quote').css('.text::text').getall() # Chained selectors
263
+ first_quote = page.css('.quote')[0]
264
+ author = first_quote.next_sibling.css('.author::text')
265
+ parent_container = first_quote.parent
266
+
267
+ # Element relationships and similarity
268
+ similar_elements = first_quote.find_similar()
269
+ below_elements = first_quote.below_elements()
270
+ ```
271
+ If you don't want to fetch websites, you can use the parser directly, as shown below:
272
+ ```python
273
+ from scrapling.parser import Selector
274
+
275
+ page = Selector("<html>...</html>")
276
+ ```
277
+ And it works precisely the same way!
278
+
279
+ ### Async Session Management Examples
280
+ ```python
281
+ import asyncio
282
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
283
+
284
+ async with FetcherSession(http3=True) as session: # `FetcherSession` is context-aware and can work in both sync/async patterns
285
+ page1 = session.get('https://quotes.toscrape.com/')
286
+ page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
287
+
288
+ # Async session usage
289
+ async with AsyncStealthySession(max_pages=2) as session:
290
+ tasks = []
291
+ urls = ['https://example.com/page1', 'https://example.com/page2']
292
+
293
+ for url in urls:
294
+ task = session.fetch(url)
295
+ tasks.append(task)
296
+
297
+ print(session.get_pool_stats()) # Optional - The status of the browser tabs pool (busy/free/error)
298
+ results = await asyncio.gather(*tasks)
299
+ print(session.get_pool_stats())
300
+ ```
301
+
302
+ ## CLI & Interactive Shell
303
+
304
+ Scrapling includes a powerful command-line interface:
305
+
306
+ [![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
307
+
308
+ Launch the interactive Web Scraping shell
309
+ ```bash
310
+ scrapling shell
311
+ ```
312
+ Extract pages to a file directly without programming (extracts the content inside the `body` tag by default). The output format follows the file extension: `.txt` extracts the text content of the target, `.md` produces a Markdown representation of the HTML content, and `.html` saves the HTML content itself.
313
+ ```bash
314
+ scrapling extract get 'https://example.com' content.md
315
+ scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # All elements matching the CSS selector '#fromSkipToProducts'
316
+ scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
317
+ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
318
+ ```
319
+
320
+ > [!NOTE]
321
+ > There are many additional features, but we want to keep this page concise, including the MCP server and the interactive Web Scraping Shell. Check out the full documentation [here](https://scrapling.readthedocs.io/en/latest/)
322
+
323
+ ## Performance Benchmarks
324
+
325
+ Scrapling isn't just powerful—it's also blazing fast. The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries.
326
+
327
+ ### Text Extraction Speed Test (5000 nested elements)
328
+
329
+ | # | Library | Time (ms) | vs Scrapling |
330
+ |---|:-----------------:|:---------:|:------------:|
331
+ | 1 | Scrapling | 2.02 | 1.0x |
332
+ | 2 | Parsel/Scrapy | 2.04 | 1.01x |
333
+ | 3 | Raw Lxml | 2.54 | 1.257x |
334
+ | 4 | PyQuery | 24.17 | ~12x |
335
+ | 5 | Selectolax | 82.63 | ~41x |
336
+ | 6 | MechanicalSoup | 1549.71 | ~767.1x |
337
+ | 7 | BS4 with Lxml | 1584.31 | ~784.3x |
338
+ | 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
339
+
340
+
341
+ ### Element Similarity & Text Search Performance
342
+
343
+ Scrapling's adaptive element finding capabilities significantly outperform alternatives:
344
+
345
+ | Library | Time (ms) | vs Scrapling |
346
+ |-------------|:---------:|:------------:|
347
+ | Scrapling | 2.39 | 1.0x |
348
+ | AutoScraper | 12.45 | 5.209x |
349
+
350
+
351
+ > All benchmarks represent averages of 100+ runs. See [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology.
352
+
353
+ ## Installation
354
+
355
+ Scrapling requires Python 3.10 or higher:
356
+
357
+ ```bash
358
+ pip install scrapling
359
+ ```
360
+
361
+ This installation only includes the parser engine and its dependencies, without any fetcher or command-line dependencies.
362
+
363
+ ### Optional Dependencies
364
+
365
+ 1. If you are going to use any of the fetchers, their classes, or the extra features below, you will need to install the fetchers' dependencies and their browser dependencies as follows:
366
+ ```bash
367
+ pip install "scrapling[fetchers]"
368
+
369
+ scrapling install
370
+ ```
371
+
372
+ This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.
373
+
374
+ 2. Extra features:
375
+ - Install the MCP server feature:
376
+ ```bash
377
+ pip install "scrapling[ai]"
378
+ ```
379
+ - Install shell features (Web Scraping shell and the `extract` command):
380
+ ```bash
381
+ pip install "scrapling[shell]"
382
+ ```
383
+ - Install everything:
384
+ ```bash
385
+ pip install "scrapling[all]"
386
+ ```
387
+ Remember that you need to install the browser dependencies with `scrapling install` after installing any of these extras (if you haven't already).
388
+
389
+ ### Docker
390
+ You can also pull a Docker image with all extras and browsers included, with the following command from DockerHub:
391
+ ```bash
392
+ docker pull pyd4vinci/scrapling
393
+ ```
394
+ Or download it from the GitHub registry:
395
+ ```bash
396
+ docker pull ghcr.io/d4vinci/scrapling:latest
397
+ ```
398
+ This image is automatically built and pushed via GitHub Actions from the repository's main branch.
399
+
400
+ ## Contributing
401
+
402
+ We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.
403
+
404
+ ## Disclaimer
405
+
406
+ > [!CAUTION]
407
+ > This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.
408
+
409
+ ## License
410
+
411
+ This work is licensed under the BSD-3-Clause License.
412
+
413
+ ## Acknowledgments
414
+
415
+ This project includes code adapted from:
416
+ - Parsel (BSD License), used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule
417
+
418
+ ---
419
+ <div align="center"><small>Designed & crafted with ❤️ by Karim Shoair.</small></div><br>
Scrapling/ROADMAP.md ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ## TODOs
2
+ - [x] Add more tests and increase the code coverage.
3
+ - [x] Structure the tests folder in a better way.
4
+ - [x] Add more documentation.
5
+ - [x] Add the browsing ability.
6
+ - [x] Create detailed documentation for the 'readthedocs' website, preferably add GitHub action for deploying it.
7
+ - [ ] Create a Scrapy plugin/decorator to make it replace parsel in the response argument when needed.
8
+ - [x] Need to add more functionality to `AttributesHandler` and more navigation functions to `Selector` object (ex: functions similar to map, filter, and reduce functions but here pass it to the element and the function is executed on children, siblings, next elements, etc...)
9
+ - [x] Add `.filter` method to `Selectors` object and other similar methods.
10
+ - [ ] Add functionality to automatically detect pagination URLs
11
+ - [ ] Add the ability to auto-detect schemas in pages and manipulate them.
12
+ - [ ] Add `analyzer` ability that tries to learn about the page through meta-elements and return what it learned
13
+ - [ ] Add the ability to generate a regex from a group of elements (Like for all href attributes)
14
Scrapling/benchmarks.py ADDED
@@ -0,0 +1,146 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import functools
2
+ import time
3
+ import timeit
4
+ from statistics import mean
5
+
6
+ import requests
7
+ from autoscraper import AutoScraper
8
+ from bs4 import BeautifulSoup
9
+ from lxml import etree, html
10
+ from mechanicalsoup import StatefulBrowser
11
+ from parsel import Selector
12
+ from pyquery import PyQuery as pq
13
+ from selectolax.parser import HTMLParser
14
+
15
+ from scrapling import Selector as ScraplingSelector
16
+
17
+ large_html = (
18
+ "<html><body>" + '<div class="item">' * 5000 + "</div>" * 5000 + "</body></html>"
19
+ )
20
+
21
+
22
+ def benchmark(func):
23
+ @functools.wraps(func)
24
+ def wrapper(*args, **kwargs):
25
+ benchmark_name = func.__name__.replace("test_", "").replace("_", " ")
26
+ print(f"-> {benchmark_name}", end=" ", flush=True)
27
+ # Warm-up phase
28
+ timeit.repeat(
29
+ lambda: func(*args, **kwargs), number=2, repeat=2, globals=globals()
30
+ )
31
+ # Measure time (1 run, repeat 100 times, take average)
32
+ times = timeit.repeat(
33
+ lambda: func(*args, **kwargs),
34
+ number=1,
35
+ repeat=100,
36
+ globals=globals(),
37
+ timer=time.process_time,
38
+ )
39
+ avg_time = round(mean(times) * 1000, 2)  # Convert to milliseconds
40
+ print(f"average execution time: {avg_time} ms")
41
+ return avg_time
42
+
43
+ return wrapper
44
+
45
+
46
+ @benchmark
47
+ def test_lxml():
48
+ return [
49
+ e.text
50
+ for e in etree.fromstring(
51
+ large_html,
52
+ # Scrapling and Parsel use the same parser inside, so this is just to make it fair
53
+ parser=html.HTMLParser(recover=True, huge_tree=True),
54
+ ).cssselect(".item")
55
+ ]
56
+
57
+
58
+ @benchmark
59
+ def test_bs4_lxml():
60
+ return [e.text for e in BeautifulSoup(large_html, "lxml").select(".item")]
61
+
62
+
63
+ @benchmark
64
+ def test_bs4_html5lib():
65
+ return [e.text for e in BeautifulSoup(large_html, "html5lib").select(".item")]
66
+
67
+
68
+ @benchmark
69
+ def test_pyquery():
70
+ return [e.text() for e in pq(large_html)(".item").items()]
71
+
72
+
73
+ @benchmark
74
+ def test_scrapling():
75
+ # No need to do `.extract()` like parsel to extract text
76
+ # Also, this is faster than `[t.text for t in Selector(large_html, adaptive=False).css('.item')]`
77
+ # for obvious reasons, of course.
78
+ return ScraplingSelector(large_html, adaptive=False).css(".item::text").getall()
79
+
80
+
81
+ @benchmark
82
+ def test_parsel():
83
+ return Selector(text=large_html).css(".item::text").extract()
84
+
85
+
86
+ @benchmark
87
+ def test_mechanicalsoup():
88
+ browser = StatefulBrowser()
89
+ browser.open_fake_page(large_html)
90
+ return [e.text for e in browser.page.select(".item")]
91
+
92
+
93
+ @benchmark
94
+ def test_selectolax():
95
+ return [node.text() for node in HTMLParser(large_html).css(".item")]
96
+
97
+
98
+ def display(results):
99
+ # Sort and display results
100
+ sorted_results = sorted(results.items(), key=lambda x: x[1]) # Sort by time
101
+ scrapling_time = results["Scrapling"]
102
+ print("\nRanked Results (fastest to slowest):")
103
+ print(f" i. {'Library tested':<18} | {'avg. time (ms)':<15} | vs Scrapling")
104
+ print("-" * 50)
105
+ for i, (test_name, test_time) in enumerate(sorted_results, 1):
106
+ compare = round(test_time / scrapling_time, 3)
107
+ print(f" {i}. {test_name:<18} | {str(test_time):<15} | {compare}")
108
+
109
+
110
+ @benchmark
111
+ def test_scrapling_text(request_html):
112
+ return ScraplingSelector(request_html, adaptive=False).find_by_text("Tipping the Velvet", first_match=True, clean_match=False).find_similar(ignore_attributes=["title"])
113
+
114
+
115
+ @benchmark
116
+ def test_autoscraper(request_html):
117
+ # autoscraper by default returns elements text
118
+ return AutoScraper().build(html=request_html, wanted_list=["Tipping the Velvet"])
119
+
120
+
121
+ if __name__ == "__main__":
122
+ print(
123
+ " Benchmark: Speed of parsing and retrieving the text content of 5000 nested elements \n"
124
+ )
125
+ results1 = {
126
+ "Raw Lxml": test_lxml(),
127
+ "Parsel/Scrapy": test_parsel(),
128
+ "Scrapling": test_scrapling(),
129
+ "Selectolax": test_selectolax(),
130
+ "PyQuery": test_pyquery(),
131
+ "BS4 with Lxml": test_bs4_lxml(),
132
+ "MechanicalSoup": test_mechanicalsoup(),
133
+ "BS4 with html5lib": test_bs4_html5lib(),
134
+ }
135
+
136
+ display(results1)
137
+ print("\n" + "=" * 25)
138
+ req = requests.get("https://books.toscrape.com/index.html")
139
+ print(
140
+ " Benchmark: Speed of searching for an element by text content, and retrieving the text of similar elements\n"
141
+ )
142
+ results2 = {
143
+ "Scrapling": test_scrapling_text(req.text),
144
+ "AutoScraper": test_autoscraper(req.text),
145
+ }
146
+ display(results2)
Scrapling/cleanup.py ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import shutil
2
+ from pathlib import Path
3
+
4
+
5
+ # Clean up after installing for local development
6
+ def clean():
7
+ # Get the current directory
8
+ base_dir = Path.cwd()
9
+
10
+ # Directories and patterns to clean
11
+ cleanup_patterns = [
12
+ "build",
13
+ "dist",
14
+ "*.egg-info",
15
+ "__pycache__",
16
+ ".eggs",
17
+ ".pytest_cache",
18
+ ]
19
+
20
+ # Clean directories
21
+ for pattern in cleanup_patterns:
22
+ for path in base_dir.glob(pattern):
23
+ try:
24
+ if path.is_dir():
25
+ shutil.rmtree(path)
26
+ else:
27
+ path.unlink()
28
+ print(f"Removed: {path}")
29
+ except Exception as e:
30
+ print(f"Could not remove {path}: {e}")
31
+
32
+ # Remove compiled Python files
33
+ for path in base_dir.rglob("*.py[co]"):
34
+ try:
35
+ path.unlink()
36
+ print(f"Removed compiled file: {path}")
37
+ except Exception as e:
38
+ print(f"Could not remove {path}: {e}")
39
+
40
+
41
+ if __name__ == "__main__":
42
+ clean()
Scrapling/docs/README_AR.md ADDED
@@ -0,0 +1,414 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <h1 align="center">
2
+ <a href="https://scrapling.readthedocs.io">
3
+ <picture>
4
+ <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
5
+ <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
6
+ </picture>
7
+ </a>
8
+ <br>
9
+ <small>Effortless Web Scraping for the Modern Web</small>
10
+ </h1>
11
+
12
+ <p align="center">
13
+ <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
14
+ <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
15
+ <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
16
+ <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
17
+ <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
18
+ <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
19
+ <br/>
20
+ <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
21
+ <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
22
+ </a>
23
+ <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
24
+ <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
25
+ </a>
26
+ <br/>
27
+ <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
28
+ <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
29
+ </p>
30
+
31
+ <p align="center">
32
+ <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>طرق الاختيار</strong></a>
33
+ &middot;
34
+ <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>اختيار Fetcher</strong></a>
35
+ &middot;
36
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>العناكب</strong></a>
37
+ &middot;
38
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>تدوير البروكسي</strong></a>
39
+ &middot;
40
+ <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>واجهة سطر الأوامر</strong></a>
41
+ &middot;
42
+ <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>وضع MCP</strong></a>
43
+ </p>
44
+
45
+ Scrapling هو إطار عمل تكيفي لـ Web Scraping يتعامل مع كل شيء من طلب واحد إلى زحف كامل النطاق.
46
+
47
+ محلله يتعلم من تغييرات المواقع ويعيد تحديد موقع عناصرك تلقائياً عند تحديث الصفحات. جوالبه تتجاوز أنظمة مكافحة الروبوتات مثل Cloudflare Turnstile مباشرةً. وإطار عمل Spider الخاص به يتيح لك التوسع إلى عمليات زحف متزامنة ومتعددة الجلسات مع إيقاف/استئناف وتدوير تلقائي لـ Proxy - كل ذلك في بضعة أسطر من Python. مكتبة واحدة، بدون تنازلات.
48
+
49
+ زحف سريع للغاية مع إحصائيات فورية و Streaming. مبني بواسطة مستخرجي الويب لمستخرجي الويب والمستخدمين العاديين، هناك شيء للجميع.
50
+
51
+ ```python
52
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
53
+ StealthyFetcher.adaptive = True
54
+ p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # احصل على الموقع بشكل خفي!
55
+ products = p.css('.product', auto_save=True) # استخرج بيانات تنجو من تغييرات تصميم الموقع!
56
+ products = p.css('.product', adaptive=True) # لاحقاً، إذا تغيرت بنية الموقع، مرر `adaptive=True` للعثور عليها!
57
+ ```
58
+ أو توسع إلى عمليات زحف كاملة
59
+ ```python
60
+ from scrapling.spiders import Spider, Response
61
+
62
+ class MySpider(Spider):
63
+ name = "demo"
64
+ start_urls = ["https://example.com/"]
65
+
66
+ async def parse(self, response: Response):
67
+ for item in response.css('.product'):
68
+ yield {"title": item.css('h2::text').get()}
69
+
70
+ MySpider().start()
71
+ ```
72
+
73
+
74
+ # الرعاة البلاتينيون
75
+
76
+ # الرعاة
77
+
78
+ <!-- sponsors -->
79
+
80
+ <a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
81
+ <a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
82
+ <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
83
+ <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
84
+ <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
85
+ <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
86
+ <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
87
+ <a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
88
+ <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>
89
+
90
+
91
+ <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
92
+ <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
93
+ <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
94
+
95
+ <!-- /sponsors -->
96
+
97
+ <i><sub>هل تريد عرض إعلانك هنا؟ انقر [هنا](https://github.com/sponsors/D4Vinci) واختر المستوى الذي يناسبك!</sub></i>
98
+
99
+ ---
100
+
101
+ ## الميزات الرئيسية
102
+
103
+ ### Spiders — إطار عمل زحف كامل
104
+ - 🕷️ **واجهة Spider شبيهة بـ Scrapy**: عرّف Spiders مع `start_urls`، و async `parse` callbacks، وكائنات `Request`/`Response`.
105
+ - ⚡ **زحف متزامن**: حدود تزامن قابلة للتكوين، وتحكم بالسرعة حسب النطاق، وتأخيرات التنزيل.
106
+ - 🔄 **دعم الجلسات المتعددة**: واجهة موحدة لطلبات HTTP، ومتصفحات خفية بدون واجهة في Spider واحد — وجّه الطلبات إلى جلسات مختلفة بالمعرّف.
107
+ - 💾 **إيقاف واستئناف**: استمرارية الزحف القائمة على Checkpoint. اضغط Ctrl+C للإيقاف بسلاسة؛ أعد التشغيل للاستئناف من حيث توقفت.
108
+ - 📡 **وضع Streaming**: بث العناصر المستخرجة فور وصولها عبر `async for item in spider.stream()` مع إحصائيات فورية — مثالي لواجهات المستخدم وخطوط الأنابيب وعمليات الزحف الطويلة.
109
+ - 🛡️ **كشف الطلبات المحظورة**: كشف تلقائي وإعادة محاولة للطلبات المحظورة مع منطق قابل للتخصيص.
110
+ - 📦 **تصدير مدمج**: صدّر النتائج عبر الخطافات وخط الأنابيب الخاص بك أو JSON/JSONL المدمج مع `result.items.to_json()` / `result.items.to_jsonl()` على التوالي.
111
+
112
+ ### جلب متقدم للمواقع مع دعم الجلسات
113
+ - **طلبات HTTP**: طلبات HTTP سريعة وخفية مع فئة `Fetcher`. يمكنها تقليد بصمة TLS للمتصفح والرؤوس واستخدام HTTP/3.
114
+ - **التحميل الديناميكي**: جلب المواقع الديناميكية مع أتمتة كاملة للمتصفح من خلال فئة `DynamicFetcher` التي تدعم Chromium من Playwright و Google Chrome.
115
+ - **تجاوز مكافحة الروبوتات**: قدرات تخفي متقدمة مع `StealthyFetcher` وانتحال fingerprint. يمكنه تجاوز جميع أنواع Turnstile/Interstitial من Cloudflare بسهولة بالأتمتة.
116
+ - **إدارة الجلسات**: دعم الجلسات المستمرة مع فئات `FetcherSession` و`StealthySession` و`DynamicSession` لإدارة ملفات تعريف الارتباط والحالة عبر الطلبات.
117
+ - **تدوير Proxy**: `ProxyRotator` مدمج مع استراتيجيات التدوير الدوري أو المخصصة عبر جميع أنواع الجلسات، بالإضافة إلى تجاوزات Proxy لكل طلب.
118
+ - **حظر النطاقات**: حظر الطلبات إلى نطاقات محددة (ونطاقاتها الفرعية) في الجوالب المعتمدة على المتصفح.
119
+ - **دعم Async**: دعم async كامل عبر جميع الجوالب وفئات الجلسات async المخصصة.
120
+
121
+ ### الاستخراج التكيفي والتكامل مع الذكاء الاصطناعي
122
+ - 🔄 **تتبع العناصر الذكي**: إعادة تحديد موقع العناصر بعد تغييرات الموقع باستخدام خوارزميات التشابه الذكية.
123
+ - 🎯 **الاختيار المرن الذكي**: محددات CSS، محددات XPath، البحث القائم على الفلاتر، البحث النصي، البحث بالتعبيرات العادية والمزيد.
124
+ - 🔍 **البحث عن عناصر مشابهة**: تحديد العناصر المشابهة للعناصر الموجودة تلقائياً.
125
+ - 🤖 **خادم MCP للاستخدام مع الذكاء الاصطناعي**: خادم MCP مدمج لـ Web Scraping بمساعدة الذكاء الاصطناعي واستخراج البيانات. يتميز خادم MCP بقدرات قوية مخصصة تستفيد من Scrapling لاستخراج المحتوى المستهدف قبل تمريره إلى الذكاء الاصطناعي (Claude/Cursor/إلخ)، وبالتالي تسريع العمليات وتقليل التكاليف عن طريق تقليل استخدام الرموز. ([فيديو توضيحي](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
126
+
127
+ ### بنية عالية الأداء ومختبرة ميدانياً
128
+ - 🚀 **سريع كالبرق**: أداء محسّن يتفوق على معظم مكتبات Web Scraping في Python.
129
+ - 🔋 **فعال في استخدام الذاكرة**: هياكل بيانات محسّنة وتحميل كسول لأقل استخدام للذاكرة.
130
+ - ⚡ **تسلسل JSON سريع**: أسرع 10 مرات من المكتبة القياسية.
131
+ - 🏗️ **مُختبر ميدانياً**: لا يمتلك Scrapling فقط تغطية اختبار بنسبة 92٪ وتغطية كاملة لتلميحات الأنواع، بل تم استخدامه يومياً من قبل مئات مستخرجي الويب خلال العام الماضي.
132
+
133
+ ### تجربة صديقة للمطورين/مستخرجي الويب
134
+ - 🎯 **Shell تفاعلي لـ Web Scraping**: Shell IPython مدمج اختياري مع تكامل Scrapling، واختصارات، وأدوات جديدة لتسريع تطوير سكريبتات Web Scraping، مثل تحويل طلبات curl إلى طلبات Scrapling وعرض نتائج الطلبات في متصفحك.
135
+ - 🚀 **استخدمه مباشرة من الطرفية**: اختيارياً، يمكنك استخدام Scrapling لاستخراج عنوان URL دون كتابة سطر واحد من الكود!
136
+ - 🛠️ **واجهة تنقل غنية**: اجتياز DOM متقدم مع طرق التنقل بين العناصر الوالدية والشقيقة والفرعية.
137
+ - 🧬 **معالجة نصوص محسّنة**: تعبيرات عادية مدمجة وطرق تنظيف وعمليات نصية محسّنة.
138
+ - 📝 **إنشاء محددات تلقائي**: إنشاء محددات CSS/XPath قوية لأي عنصر.
139
+ - 🔌 **واجهة مألوفة**: مشابه لـ Scrapy/BeautifulSoup مع نفس العناصر الزائفة المستخدمة في Scrapy/Parsel.
140
+ - 📘 **تغطية كاملة للأنواع**: تلميحات نوع كاملة لدعم IDE ممتاز وإكمال الكود. يتم فحص قاعدة الكود بالكامل تلقائياً بواسطة **PyRight** و**MyPy** مع كل تغيير.
141
+ - 🔋 **صورة Docker جاهزة**: مع كل إصدار، يتم بناء ودفع صورة Docker تحتوي على جميع المتصفحات تلقائياً.
142
+
143
+ ## البدء
144
+
145
+ لنلقِ نظرة سريعة على ما يمكن لـ Scrapling فعله دون التعمق.
146
+
147
+ ### الاستخدام الأساسي
148
+ طلبات HTTP مع دعم الجلسات
149
+ ```python
150
+ from scrapling.fetchers import Fetcher, FetcherSession
151
+
152
+ with FetcherSession(impersonate='chrome') as session: # استخدم أحدث إصدار من بصمة TLS لـ Chrome
153
+ page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
154
+ quotes = page.css('.quote .text::text').getall()
155
+
156
+ # أو استخدم طلبات لمرة واحدة
157
+ page = Fetcher.get('https://quotes.toscrape.com/')
158
+ quotes = page.css('.quote .text::text').getall()
159
+ ```
160
+ وضع التخفي المتقدم
161
+ ```python
162
+ from scrapling.fetchers import StealthyFetcher, StealthySession
163
+
164
+ with StealthySession(headless=True, solve_cloudflare=True) as session: # أبقِ المتصفح مفتوحاً حتى تنتهي
165
+ page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
166
+ data = page.css('#padded_content a').getall()
167
+
168
+ # أو استخدم نمط الطلب لمرة واحدة، يفتح المتصفح لهذا الطلب، ثم يغلقه بعد الانتهاء
169
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
170
+ data = page.css('#padded_content a').getall()
171
+ ```
172
+ أتمتة المتصفح الكاملة
173
+ ```python
174
+ from scrapling.fetchers import DynamicFetcher, DynamicSession
175
+
176
+ with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # أبقِ المتصفح مفتوحاً حتى تنتهي
177
+ page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
178
+ data = page.xpath('//span[@class="text"]/text()').getall() # محدد XPath إذا كنت تفضله
179
+
180
+ # أو استخدم نمط الطلب لمرة واحدة، يفتح المتصفح لهذا الطلب، ثم يغلقه بعد الانتهاء
181
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
182
+ data = page.css('.quote .text::text').getall()
183
+ ```
184
+
185
+ ### Spiders
186
+ ابنِ زواحف كاملة مع طلبات متزامنة وأنواع جلسات متعددة وإيقاف/استئناف:
187
+ ```python
188
+ from scrapling.spiders import Spider, Request, Response
189
+
190
+ class QuotesSpider(Spider):
191
+ name = "quotes"
192
+ start_urls = ["https://quotes.toscrape.com/"]
193
+ concurrent_requests = 10
194
+
195
+ async def parse(self, response: Response):
196
+ for quote in response.css('.quote'):
197
+ yield {
198
+ "text": quote.css('.text::text').get(),
199
+ "author": quote.css('.author::text').get(),
200
+ }
201
+
202
+ next_page = response.css('.next a')
203
+ if next_page:
204
+ yield response.follow(next_page[0].attrib['href'])
205
+
206
+ result = QuotesSpider().start()
207
+ print(f"Scraped {len(result.items)} quotes")
208
+ result.items.to_json("quotes.json")
209
+ ```
210
+ استخدم أنواع جلسات متعددة في Spider واحد:
211
+ ```python
212
+ from scrapling.spiders import Spider, Request, Response
213
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession
214
+
215
+ class MultiSessionSpider(Spider):
216
+ name = "multi"
217
+ start_urls = ["https://example.com/"]
218
+
219
+ def configure_sessions(self, manager):
220
+ manager.add("fast", FetcherSession(impersonate="chrome"))
221
+ manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
222
+
223
+ async def parse(self, response: Response):
224
+ for link in response.css('a::attr(href)').getall():
225
+ # وجّه الصفحات المحمية عبر جلسة التخفي
226
+ if "protected" in link:
227
+ yield Request(link, sid="stealth")
228
+ else:
229
+ yield Request(link, sid="fast", callback=self.parse) # callback صريح
230
+ ```
231
+ أوقف واستأنف عمليات الزحف الطويلة مع Checkpoints بتشغيل Spider هكذا:
232
+ ```python
233
+ QuotesSpider(crawldir="./crawl_data").start()
234
+ ```
235
+ اضغط Ctrl+C للإيقاف بسلاسة — يتم حفظ التقدم تلقائياً. لاحقاً، عند تشغيل Spider مرة أخرى، مرر نفس `crawldir`، وسيستأنف من حيث توقف.
236
+
237
+ ### التحليل المتقدم والتنقل
238
+ ```python
239
+ from scrapling.fetchers import Fetcher
240
+
241
+ # اختيار عناصر غني وتنقل
242
+ page = Fetcher.get('https://quotes.toscrape.com/')
243
+
244
+ # احصل على الاقتباسات بطرق اختيار متعددة
245
+ quotes = page.css('.quote') # محدد CSS
246
+ quotes = page.xpath('//div[@class="quote"]') # XPath
247
+ quotes = page.find_all('div', {'class': 'quote'}) # بأسلوب BeautifulSoup
248
+ # نفس الشيء مثل
249
+ quotes = page.find_all('div', class_='quote')
250
+ quotes = page.find_all(['div'], class_='quote')
251
+ quotes = page.find_all(class_='quote') # وهكذا...
252
+ # البحث عن عنصر بمحتوى النص
253
+ quotes = page.find_by_text('quote', tag='div')
254
+
255
+ # التنقل المتقدم
256
+ quote_text = page.css('.quote')[0].css('.text::text').get()
257
+ quote_text = page.css('.quote').css('.text::text').getall() # محددات متسلسلة
258
+ first_quote = page.css('.quote')[0]
259
+ author = first_quote.next_sibling.css('.author::text')
260
+ parent_container = first_quote.parent
261
+
262
+ # علاقات العناصر والتشابه
263
+ similar_elements = first_quote.find_similar()
264
+ below_elements = first_quote.below_elements()
265
+ ```
266
+ يمكنك استخدام المحلل مباشرة إذا كنت لا تريد جلب المواقع كما يلي:
267
+ ```python
268
+ from scrapling.parser import Selector
269
+
270
+ page = Selector("<html>...</html>")
271
+ ```
272
+ وهو يعمل بنفس الطريقة تماماً!
273
+
274
+ ### أمثلة إدارة الجلسات بشكل Async
275
+ ```python
276
+ import asyncio
277
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
278
+
279
+ async with FetcherSession(http3=True) as session: # `FetcherSession` واعٍ بالسياق ويعمل في كلا النمطين المتزامن/async
280
+ page1 = session.get('https://quotes.toscrape.com/')
281
+ page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
282
+
283
+ # استخدام جلسة async
284
+ async with AsyncStealthySession(max_pages=2) as session:
285
+ tasks = []
286
+ urls = ['https://example.com/page1', 'https://example.com/page2']
287
+
288
+ for url in urls:
289
+ task = session.fetch(url)
290
+ tasks.append(task)
291
+
292
+ print(session.get_pool_stats()) # اختياري - حالة مجموعة علامات تبويب المتصفح (مشغول/حر/خطأ)
293
+ results = await asyncio.gather(*tasks)
294
+ print(session.get_pool_stats())
295
+ ```
296
+
297
+ ## واجهة سطر الأوامر والـ Shell التفاعلي
298
+
299
+ يتضمن Scrapling واجهة سطر أوامر قوية:
300
+
301
+ [![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
302
+
303
+ تشغيل Shell الـ Web Scraping التفاعلي
304
+ ```bash
305
+ scrapling shell
306
+ ```
307
+ استخرج الصفحات إلى ملف مباشرة دون برمجة (يستخرج المحتوى داخل وسم `body` افتراضياً). إذا انتهى ملف الإخراج بـ `.txt`، فسيتم استخراج محتوى النص للهدف. إذا انتهى بـ `.md`، فسيكون تمثيل Markdown لمحتوى HTML؛ إذا انتهى بـ `.html`، فسيكون محتوى HTML نفسه.
308
+ ```bash
309
+ scrapling extract get 'https://example.com' content.md
310
+ scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # جميع العناصر المطابقة لمحدد CSS '#fromSkipToProducts'
311
+ scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
312
+ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
313
+ ```
314
+
315
+ > [!NOTE]
316
+ > هناك العديد من الميزات الإضافية، لكننا نريد إبقاء هذه الصفحة موجزة، بما في ذلك خادم MCP والـ Shell التفاعلي لـ Web Scraping. تحقق من الوثائق الكاملة [هنا](https://scrapling.readthedocs.io/en/latest/)
317
+
318
+ ## معايير الأداء
319
+
320
+ Scrapling ليس قوياً فحسب — بل هو أيضاً سريع بشكل مذهل. تقارن المعايير التالية محلل Scrapling مع أحدث إصدارات المكتبات الشائعة الأخرى.
321
+
322
+ ### اختبار سرعة استخراج النص (5000 عنصر متداخل)
323
+
324
+ | # | المكتبة | الوقت (ms) | vs Scrapling |
325
+ |---|:-----------------:|:----------:|:------------:|
326
+ | 1 | Scrapling | 2.02 | 1.0x |
327
+ | 2 | Parsel/Scrapy | 2.04 | 1.01x |
328
+ | 3 | Raw Lxml | 2.54 | 1.257x |
329
+ | 4 | PyQuery | 24.17 | ~12x |
330
+ | 5 | Selectolax | 82.63 | ~41x |
331
+ | 6 | MechanicalSoup | 1549.71 | ~767.1x |
332
+ | 7 | BS4 with Lxml | 1584.31 | ~784.3x |
333
+ | 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
334
+
335
+
336
+ ### أداء تشابه العناصر والبحث النصي
337
+
338
+ قدرات العثور على العناصر التكيفية لـ Scrapling تتفوق بشكل كبير على البدائل:
339
+
340
+ | المكتبة | الوقت (ms) | vs Scrapling |
341
+ |-------------|:----------:|:------------:|
342
+ | Scrapling | 2.39 | 1.0x |
343
+ | AutoScraper | 12.45 | 5.209x |
344
+
345
+
346
+ > تمثل جميع المعايير متوسطات أكثر من 100 تشغيل. انظر [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) للمنهجية.
347
+
348
+ ## التثبيت
349
+
350
+ يتطلب Scrapling إصدار Python 3.10 أو أعلى:
351
+
352
+ ```bash
353
+ pip install scrapling
354
+ ```
355
+
356
+ يتضمن هذا التثبيت فقط محرك المحلل وتبعياته، بدون أي جوالب أو تبعيات سطر الأوامر.
357
+
358
+ ### التبعيات الاختيارية
359
+
360
+ 1. إذا كنت ستستخدم أياً من الميزات الإضافية أدناه، أو الجوالب، أو فئاتها، فستحتاج إلى تثبيت تبعيات الجوالب وتبعيات المتصفح الخاصة بها على النحو التالي:
361
+ ```bash
362
+ pip install "scrapling[fetchers]"
363
+
364
+ scrapling install
365
+ ```
366
+
367
+ يقوم هذا بتنزيل جميع المتصفحات، إلى جانب تبعيات النظام وتبعيات معالجة fingerprint الخاصة بها.
368
+
369
+ 2. ميزات إضافية:
370
+ - تثبيت ميزة خادم MCP:
371
+ ```bash
372
+ pip install "scrapling[ai]"
373
+ ```
374
+ - تثبيت ميزات Shell (Shell الـ Web Scraping وأمر `extract`):
375
+ ```bash
376
+ pip install "scrapling[shell]"
377
+ ```
378
+ - تثبيت كل شيء:
379
+ ```bash
380
+ pip install "scrapling[all]"
381
+ ```
382
+ تذكر أنك تحتاج إلى تثبيت تبعيات المتصفح مع `scrapling install` بعد أي من هذه الإضافات (إذا لم تكن قد فعلت ذلك بالفعل)
383
+
384
+ ### Docker
385
+ يمكنك أيضاً تثبيت صورة Docker مع جميع الإضافات والمتصفحات باستخدام الأمر التالي من DockerHub:
386
+ ```bash
387
+ docker pull pyd4vinci/scrapling
388
+ ```
389
+ أو تنزيلها من سجل GitHub:
390
+ ```bash
391
+ docker pull ghcr.io/d4vinci/scrapling:latest
392
+ ```
393
+ يتم بناء هذه الصورة ودفعها تلقائياً باستخدام GitHub Actions والفرع الرئيسي للمستودع.
394
+
395
+ ## المساهمة
396
+
397
+ نرحب بالمساهمات! يرجى قراءة [إرشادات المساهمة](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) قبل البدء.
398
+
399
+ ## إخلاء المسؤولية
400
+
401
+ > [!CAUTION]
402
+ > يتم توفير هذه المكتبة للأغراض التعليمية والبحثية فقط. باستخدام هذه المكتبة، فإنك توافق على الامتثال لقوانين استخراج البيانات والخصوصية المحلية والدولية. المؤلفون والمساهمون غير مسؤولين عن أي إساءة استخدام لهذا البرنامج. احترم دائماً شروط خدمة المواقع وملفات robots.txt.
403
+
404
+ ## الترخيص
405
+
406
+ هذا العمل مرخص بموجب ترخيص BSD-3-Clause.
407
+
408
+ ## الشكر والتقدير
409
+
410
+ يتضمن هذا المشروع كوداً معدلاً من:
411
+ - Parsel (ترخيص BSD) — يُستخدم للوحدة الفرعية [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)
412
+
413
+ ---
414
+ <div align="center"><small>مصمم ومصنوع بـ ❤️ بواسطة كريم شعير.</small></div><br>
Scrapling/docs/README_CN.md ADDED
@@ -0,0 +1,414 @@
1
+ <h1 align="center">
2
+ <a href="https://scrapling.readthedocs.io">
3
+ <picture>
4
+ <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
5
+ <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
6
+ </picture>
7
+ </a>
8
+ <br>
9
+ <small>Effortless Web Scraping for the Modern Web</small>
10
+ </h1>
11
+
12
+ <p align="center">
13
+ <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
14
+ <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
15
+ <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
16
+ <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
17
+ <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
18
+ <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
19
+ <br/>
20
+ <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
21
+ <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
22
+ </a>
23
+ <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
24
+ <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
25
+ </a>
26
+ <br/>
27
+ <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
28
+ <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
29
+ </p>
30
+
31
+ <p align="center">
32
+ <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>选择方法</strong></a>
33
+ &middot;
34
+ <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>选择Fetcher</strong></a>
35
+ &middot;
36
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>爬虫</strong></a>
37
+ &middot;
38
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>代理轮换</strong></a>
39
+ &middot;
40
+ <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
41
+ &middot;
42
+ <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>MCP模式</strong></a>
43
+ </p>
44
+
45
+ Scrapling是一个自适应Web Scraping框架,能处理从单个请求到大规模爬取的一切需求。
46
+
47
+ 它的解析器能够从网站变化中学习,并在页面更新时自动重新定位您的元素。它的Fetcher能够开箱即用地绕过Cloudflare Turnstile等反机器人系统。它的Spider框架让您可以扩展到并发、多Session爬取,支持暂停/恢复和自动Proxy轮换——只需几行Python代码。一个库,零妥协。
48
+
49
+ 极速爬取,实时统计和Streaming。由Web Scraper为Web Scraper和普通用户而构建,每个人都能找到适合自己的功能。
50
+
51
+ ```python
52
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
53
+ StealthyFetcher.adaptive = True
54
+ p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # 隐秘地获取网站!
55
+ products = p.css('.product', auto_save=True) # 抓取在网站设计变更后仍能存活的数据!
56
+ products = p.css('.product', adaptive=True) # 之后,如果网站结构改变,传递 `adaptive=True` 来找到它们!
57
+ ```
58
+ 或扩展为完整爬取
59
+ ```python
60
+ from scrapling.spiders import Spider, Response
61
+
62
+ class MySpider(Spider):
63
+ name = "demo"
64
+ start_urls = ["https://example.com/"]
65
+
66
+ async def parse(self, response: Response):
67
+ for item in response.css('.product'):
68
+ yield {"title": item.css('h2::text').get()}
69
+
70
+ MySpider().start()
71
+ ```
72
+
73
+
74
+ # 铂金赞助商
75
+
76
+ # 赞助商
77
+
78
+ <!-- sponsors -->
79
+
80
+ <a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
81
+ <a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
82
+ <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
83
+ <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
84
+ <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
85
+ <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
86
+ <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
87
+ <a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
88
+ <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>
89
+
90
+
91
+ <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
92
+ <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
93
+ <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
94
+
95
+ <!-- /sponsors -->
96
+
97
+ <i><sub>想在这里展示您的广告吗?点击[这里](https://github.com/sponsors/D4Vinci)并选择适合您的级别!</sub></i>
98
+
99
+ ---
100
+
101
+ ## 主要特性
102
+
103
+ ### Spider — 完整的爬取框架
104
+ - 🕷️ **类Scrapy的Spider API**:使用`start_urls`、async `parse` callback和`Request`/`Response`对象定义Spider。
105
+ - ⚡ **并发爬取**:可配置的并发限制、按域名节流和下载延迟。
106
+ - 🔄 **多Session支持**:统一接口,支持HTTP请求和隐秘无头浏览器在同一个Spider中使用——通过ID将请求路由到不同的Session。
107
+ - 💾 **暂停与恢复**:基于Checkpoint的爬取持久化。按Ctrl+C优雅关闭;重启后从上次停止的地方继续。
108
+ - 📡 **Streaming模式**:通过`async for item in spider.stream()`以实时统计Streaming抓取的数据——非常适合UI、管道和长时间运行的爬取。
109
+ - 🛡️ **被阻止请求检测**:自动检测并重试被阻止的请求,支持自定义逻辑。
110
+ - 📦 **内置导出**:通过钩子和您自己的管道导出结果,或使用内置的JSON/JSONL,分别通过`result.items.to_json()`/`result.items.to_jsonl()`。
111
+
112
+ ### 支持Session的高级网站获取
113
+ - **HTTP请求**:使用`Fetcher`类进行快速和隐秘的HTTP请求。可以模拟浏览器的TLS fingerprint、标头并使用HTTP/3。
114
+ - **动态加载**:通过`DynamicFetcher`类使用完整的浏览器自动化获取动态网站,支持Playwright的Chromium和Google Chrome。
115
+ - **反机器人绕过**:使用`StealthyFetcher`的高级隐秘功能和fingerprint伪装。可以轻松自动绕过所有类型的Cloudflare Turnstile/Interstitial。
116
+ - **Session管理**:使用`FetcherSession`、`StealthySession`和`DynamicSession`类实现持久化Session支持,用于跨请求的cookie和状态管理。
117
+ - **Proxy轮换**:内置`ProxyRotator`,支持轮询或自定义策略,适用于所有Session类型,并支持按请求覆盖Proxy。
118
+ - **域名屏蔽**:在基于浏览器的Fetcher中屏蔽对特定域名(及其子域名)的请求。
119
+ - **Async支持**:所有Fetcher和专用async Session类的完整async支持。
120
+
121
+ ### 自适应抓取和AI集成
122
+ - 🔄 **智能元素跟踪**:使用智能相似性算法在网站更改后重新定位元素。
123
+ - 🎯 **智能灵活选择**:CSS选择器、XPath选择器、基于过滤器的搜索、文本搜索、正则表达式搜索等。
124
+ - 🔍 **查找相似元素**:自动定位与已找到元素相似的元素。
125
+ - 🤖 **与AI一起使用的MCP服务器**:内置MCP服务器用于AI辅助Web Scraping和数据提取。MCP服务器具有强大的自定义功能,利用Scrapling在将内容传递给AI(Claude/Cursor等)之前提取目标内容,从而加快操作并通过最小化token使用来降低成本。([演示视频](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
126
+
127
+ ### 高性能和经过实战测试的架构
128
+ - 🚀 **闪电般快速**:优化性能超越大多数Python抓取库。
129
+ - 🔋 **内存高效**:优化的数据结构和延迟加载,最小内存占用。
130
+ - ⚡ **快速JSON序列化**:比标准库快10倍。
131
+ - 🏗️ **经过实战测试**:Scrapling不仅拥有92%的测试覆盖率和完整的类型提示覆盖率,而且在过去一年中每天被数百名Web Scraper使用。
132
+
133
+ ### 对开发者/Web Scraper友好的体验
134
+ - 🎯 **交互式Web Scraping Shell**:可选的内置IPython Shell,具有Scrapling集成、快捷方式和新工具,可加快Web Scraping脚本开发,例如将curl请求转换为Scrapling请求并在浏览器中查看请求结果。
135
+ - 🚀 **直接从终端使用**:可选地,您可以使用Scrapling抓取URL而无需编写任何代码!
136
+ - 🛠️ **丰富的导航API**:使用父级、兄弟级和子级导航方法进行高级DOM遍历。
137
+ - 🧬 **增强的文本处理**:内置正则表达式、清理方法和优化的字符串操作。
138
+ - 📝 **自动选择器生成**:为任何元素生成强大的CSS/XPath选择器。
139
+ - 🔌 **熟悉的API**:类似于Scrapy/BeautifulSoup,使用与Scrapy/Parsel相同的伪元素。
140
+ - 📘 **完整的类型覆盖**:完整的类型提示,出色的IDE支持和代码补全。整个代码库在每次更改时都会自动使用**PyRight**和**MyPy**扫描。
141
+ - 🔋 **现成的Docker镜像**:每次发布时,包含所有浏览器的Docker镜像会自动构建和推送。
142
+
143
+ ## 入门
144
+
145
+ 让我们快速展示Scrapling的功能,无需深入了解。
146
+
147
+ ### 基本用法
148
+ 支持Session的HTTP请求
149
+ ```python
150
+ from scrapling.fetchers import Fetcher, FetcherSession
151
+
152
+ with FetcherSession(impersonate='chrome') as session: # 使用Chrome的最新版本TLS fingerprint
153
+ page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
154
+ quotes = page.css('.quote .text::text').getall()
155
+
156
+ # 或使用一次性请求
157
+ page = Fetcher.get('https://quotes.toscrape.com/')
158
+ quotes = page.css('.quote .text::text').getall()
159
+ ```
160
+ 高级隐秘模式
161
+ ```python
162
+ from scrapling.fetchers import StealthyFetcher, StealthySession
163
+
164
+ with StealthySession(headless=True, solve_cloudflare=True) as session: # 保持浏览器打开直到完成
165
+ page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
166
+ data = page.css('#padded_content a').getall()
167
+
168
+ # 或使用一次性请求样式,为此请求打开浏览器,完成后关闭
169
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
170
+ data = page.css('#padded_content a').getall()
171
+ ```
172
+ 完整的浏览器自动化
173
+ ```python
174
+ from scrapling.fetchers import DynamicFetcher, DynamicSession
175
+
176
+ with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # 保持浏览器打开直到完成
177
+ page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
178
+ data = page.xpath('//span[@class="text"]/text()').getall() # 如果您偏好XPath选择器
179
+
180
+ # 或使用一次性请求样式,为此请求打开浏览器,完成后关闭
181
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
182
+ data = page.css('.quote .text::text').getall()
183
+ ```
184
+
185
+ ### Spider
186
+ 构建具有并发请求、多种Session类型和暂停/恢复功能的完整爬虫:
187
+ ```python
188
+ from scrapling.spiders import Spider, Request, Response
189
+
190
+ class QuotesSpider(Spider):
191
+ name = "quotes"
192
+ start_urls = ["https://quotes.toscrape.com/"]
193
+ concurrent_requests = 10
194
+
195
+ async def parse(self, response: Response):
196
+ for quote in response.css('.quote'):
197
+ yield {
198
+ "text": quote.css('.text::text').get(),
199
+ "author": quote.css('.author::text').get(),
200
+ }
201
+
202
+ next_page = response.css('.next a')
203
+ if next_page:
204
+ yield response.follow(next_page[0].attrib['href'])
205
+
206
+ result = QuotesSpider().start()
207
+ print(f"抓取了 {len(result.items)} 条引用")
208
+ result.items.to_json("quotes.json")
209
+ ```
210
+ 在单个Spider中使用多种Session类型:
211
+ ```python
212
+ from scrapling.spiders import Spider, Request, Response
213
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession
214
+
215
+ class MultiSessionSpider(Spider):
216
+ name = "multi"
217
+ start_urls = ["https://example.com/"]
218
+
219
+ def configure_sessions(self, manager):
220
+ manager.add("fast", FetcherSession(impersonate="chrome"))
221
+ manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
222
+
223
+ async def parse(self, response: Response):
224
+ for link in response.css('a::attr(href)').getall():
225
+ # 将受保护的页面路由到隐秘Session
226
+ if "protected" in link:
227
+ yield Request(link, sid="stealth")
228
+ else:
229
+ yield Request(link, sid="fast", callback=self.parse) # 显式callback
230
+ ```
231
+ 通过如下方式运行Spider来暂停和恢复长时间爬取,使用Checkpoint:
232
+ ```python
233
+ QuotesSpider(crawldir="./crawl_data").start()
234
+ ```
235
+ 按Ctrl+C优雅暂停——进度会自动保存。之后,当您再次启动Spider时,传递相同的`crawldir`,它将从上次停止的地方继续。
236
+
237
+ ### 高级解析与导航
238
+ ```python
239
+ from scrapling.fetchers import Fetcher
240
+
241
+ # 丰富的元素选择和导航
242
+ page = Fetcher.get('https://quotes.toscrape.com/')
243
+
244
+ # 使用多种选择方法获取引用
245
+ quotes = page.css('.quote') # CSS选择器
246
+ quotes = page.xpath('//div[@class="quote"]') # XPath
247
+ quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup风格
248
+ # 等同于
249
+ quotes = page.find_all('div', class_='quote')
250
+ quotes = page.find_all(['div'], class_='quote')
251
+ quotes = page.find_all(class_='quote') # 等等...
252
+ # 按文本内容查找元素
253
+ quotes = page.find_by_text('quote', tag='div')
254
+
255
+ # 高级导航
256
+ quote_text = page.css('.quote')[0].css('.text::text').get()
257
+ quote_text = page.css('.quote').css('.text::text').getall() # 链式选择器
258
+ first_quote = page.css('.quote')[0]
259
+ author = first_quote.next_sibling.css('.author::text')
260
+ parent_container = first_quote.parent
261
+
262
+ # 元素关系和相似性
263
+ similar_elements = first_quote.find_similar()
264
+ below_elements = first_quote.below_elements()
265
+ ```
266
+ 如果您不想获取网站,可以直接使用解析器,如下所示:
267
+ ```python
268
+ from scrapling.parser import Selector
269
+
270
+ page = Selector("<html>...</html>")
271
+ ```
272
+ 用法完全相同!
273
+
274
+ ### Async Session管理示例
275
+ ```python
276
+ import asyncio
277
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
278
+
279
+ async with FetcherSession(http3=True) as session: # `FetcherSession`是上下文感知的,可以在sync/async模式下工作
280
+ page1 = session.get('https://quotes.toscrape.com/')
281
+ page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
282
+
283
+ # Async Session用法
284
+ async with AsyncStealthySession(max_pages=2) as session:
285
+ tasks = []
286
+ urls = ['https://example.com/page1', 'https://example.com/page2']
287
+
288
+ for url in urls:
289
+ task = session.fetch(url)
290
+ tasks.append(task)
291
+
292
+ print(session.get_pool_stats()) # 可选 - 浏览器标签池的状态(忙/空闲/错误)
293
+ results = await asyncio.gather(*tasks)
294
+ print(session.get_pool_stats())
295
+ ```
296
+
297
+ ## CLI和交互式Shell
298
+
299
+ Scrapling包含强大的命令行界面:
300
+
301
+ [![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
302
+
303
+ 启动交互式Web Scraping Shell
304
+ ```bash
305
+ scrapling shell
306
+ ```
307
+ 直接将页面提取到文件而无需编程(默认提取`body`标签内的内容)。如果输出文件以`.txt`结尾,则将提取目标的文本内容。如果以`.md`结尾,它将是HTML内容的Markdown表示;如果以`.html`结尾,它将是HTML内容本身。
308
+ ```bash
309
+ scrapling extract get 'https://example.com' content.md
310
+ scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # 所有匹配CSS选择器'#fromSkipToProducts'的元素
311
+ scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
312
+ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
313
+ ```
314
+
315
+ > [!NOTE]
316
+ > 还有许多其他功能,但我们希望保持此页面简洁,包括MCP服务器和交互式Web Scraping Shell。查看完整文档[这里](https://scrapling.readthedocs.io/en/latest/)
317
+
318
+ ## 性能基准
319
+
320
+ Scrapling不仅功能强大——它还速度极快。以下基准测试将Scrapling的解析器与其他流行库的最新版本进行了比较。
321
+
322
+ ### 文本提取速度测试(5000个嵌套元素)
323
+
324
+ | # | 库 | 时间(ms) | vs Scrapling |
325
+ |---|:-----------------:|:---------:|:------------:|
326
+ | 1 | Scrapling | 2.02 | 1.0x |
327
+ | 2 | Parsel/Scrapy | 2.04 | 1.01 |
328
+ | 3 | Raw Lxml | 2.54 | 1.257 |
329
+ | 4 | PyQuery | 24.17 | ~12x |
330
+ | 5 | Selectolax | 82.63 | ~41x |
331
+ | 6 | MechanicalSoup | 1549.71 | ~767.1x |
332
+ | 7 | BS4 with Lxml | 1584.31 | ~784.3x |
333
+ | 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
334
+
335
+
336
+ ### 元素相似性和文本搜索性能
337
+
338
+ Scrapling的自适应元素查找功能明显优于替代方案:
339
+
340
+ | 库 | 时间(ms) | vs Scrapling |
341
+ |-------------|:---------:|:------------:|
342
+ | Scrapling | 2.39 | 1.0x |
343
+ | AutoScraper | 12.45 | 5.209x |
344
+
345
+
346
+ > 所有基准测试代表100+次运行的平均值。请参阅[benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py)了解方法。
347
+
348
+ ## 安装
349
+
350
+ Scrapling需要Python 3.10或更高版本:
351
+
352
+ ```bash
353
+ pip install scrapling
354
+ ```
355
+
356
+ 此安装仅包括解析器引擎及其依赖项,没有任何Fetcher或命令行依赖项。
357
+
358
+ ### 可选依赖项
359
+
360
+ 1. 如果您要使用以下任何额外功能、Fetcher或它们的类,您将需要安装Fetcher的依赖项和它们的浏览器依赖项,如下所示:
361
+ ```bash
362
+ pip install "scrapling[fetchers]"
363
+
364
+ scrapling install
365
+ ```
366
+
367
+ 这会下载所有浏览器,以及它们的系统依赖项和fingerprint操作依赖项。
368
+
369
+ 2. 额外功能:
370
+ - 安装MCP服务器功能:
371
+ ```bash
372
+ pip install "scrapling[ai]"
373
+ ```
374
+ - 安装Shell功能(Web Scraping Shell和`extract`命令):
375
+ ```bash
376
+ pip install "scrapling[shell]"
377
+ ```
378
+ - 安装所有内容:
379
+ ```bash
380
+ pip install "scrapling[all]"
381
+ ```
382
+ 请记住,在安装任何这些额外功能后(如果您还没有安装),您需要使用`scrapling install`安装浏览器依赖项
383
+
384
+ ### Docker
385
+ 您还可以使用以下命令从DockerHub安装包含所有额外功能和浏览器的Docker镜像:
386
+ ```bash
387
+ docker pull pyd4vinci/scrapling
388
+ ```
389
+ 或从GitHub注册表下载:
390
+ ```bash
391
+ docker pull ghcr.io/d4vinci/scrapling:latest
392
+ ```
393
+ 此镜像使用GitHub Actions和仓库主分支自动构建和推送。
394
+
395
+ ## 贡献
396
+
397
+ 我们欢迎贡献!在开始之前,请阅读我们的[贡献指南](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md)。
398
+
399
+ ## 免责声明
400
+
401
+ > [!CAUTION]
402
+ > 此库仅用于教育和研究目的。使用此库即表示您同意遵守本地和国际数据抓取和隐私法律。作者和贡献者对本软件的任何滥用不承担责任。始终尊重网站的服务条款和robots.txt文件。
403
+
404
+ ## 许可证
405
+
406
+ 本作品根据BSD-3-Clause许可证授权。
407
+
408
+ ## 致谢
409
+
410
+ 此项目包含改编自以下内容的代码:
411
+ - Parsel(BSD许可证)——用于[translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)子模块
412
+
413
+ ---
414
+ <div align="center"><small>由Karim Shoair用❤️设计和制作。</small></div><br>
Scrapling/docs/README_DE.md ADDED
@@ -0,0 +1,414 @@
1
+ <h1 align="center">
2
+ <a href="https://scrapling.readthedocs.io">
3
+ <picture>
4
+ <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
5
+ <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
6
+ </picture>
7
+ </a>
8
+ <br>
9
+ <small>Effortless Web Scraping for the Modern Web</small>
10
+ </h1>
11
+
12
+ <p align="center">
13
+ <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
14
+ <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
15
+ <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
16
+ <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
17
+ <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
18
+ <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
19
+ <br/>
20
+ <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
21
+ <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
22
+ </a>
23
+ <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
24
+ <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
25
+ </a>
26
+ <br/>
27
+ <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
28
+ <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
29
+ </p>
30
+
31
+ <p align="center">
32
+ <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>Auswahlmethoden</strong></a>
33
+ &middot;
34
+ <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>Einen Fetcher wählen</strong></a>
35
+ &middot;
36
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a>
37
+ &middot;
38
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Proxy-Rotation</strong></a>
39
+ &middot;
40
+ <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
41
+ &middot;
42
+ <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>MCP-Modus</strong></a>
43
+ </p>
44
+
45
+ Scrapling ist ein adaptives Web-Scraping-Framework, das alles abdeckt -- von einer einzelnen Anfrage bis hin zu einem umfassenden Crawl.
46
+
47
+ Sein Parser lernt aus Website-Änderungen und lokalisiert Ihre Elemente automatisch neu, wenn sich Seiten aktualisieren. Seine Fetcher umgehen Anti-Bot-Systeme wie Cloudflare Turnstile direkt ab Werk. Und sein Spider-Framework ermöglicht es Ihnen, auf parallele Multi-Session-Crawls mit Pause & Resume und automatischer Proxy-Rotation hochzuskalieren -- alles in wenigen Zeilen Python. Eine Bibliothek, keine Kompromisse.
48
+
49
+ Blitzschnelle Crawls mit Echtzeit-Statistiken und Streaming. Von Web Scrapern für Web Scraper und normale Benutzer entwickelt -- hier ist für jeden etwas dabei.
50
+
51
+ ```python
52
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
53
+ StealthyFetcher.adaptive = True
54
+ p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Website unbemerkt abrufen!
55
+ products = p.css('.product', auto_save=True) # Daten scrapen, die Website-Designänderungen überleben!
56
+ products = p.css('.product', adaptive=True) # Später, wenn sich die Website-Struktur ändert, `adaptive=True` übergeben, um sie zu finden!
57
+ ```
58
+ Oder auf vollständige Crawls hochskalieren
59
+ ```python
60
+ from scrapling.spiders import Spider, Response
61
+
62
+ class MySpider(Spider):
63
+ name = "demo"
64
+ start_urls = ["https://example.com/"]
65
+
66
+ async def parse(self, response: Response):
67
+ for item in response.css('.product'):
68
+ yield {"title": item.css('h2::text').get()}
69
+
70
+ MySpider().start()
71
+ ```
72
+
73
+
74
+ # Platin-Sponsoren
75
+
76
+ # Sponsoren
77
+
78
+ <!-- sponsors -->
79
+
80
+ <a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
81
+ <a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
82
+ <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
83
+ <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
84
+ <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
85
+ <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
86
+ <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
87
+ <a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
88
+ <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>
89
+
90
+
91
+ <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
92
+ <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
93
+ <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
94
+
95
+ <!-- /sponsors -->
96
+
97
+ <i><sub>Möchten Sie Ihre Anzeige hier zeigen? Klicken Sie [hier](https://github.com/sponsors/D4Vinci) und wählen Sie die Stufe, die zu Ihnen passt!</sub></i>
98
+
99
+ ---
100
+
101
+ ## Hauptmerkmale
102
+
103
+ ### Spiders -- Ein vollständiges Crawling-Framework
104
+ - 🕷️ **Scrapy-ähnliche Spider-API**: Definieren Sie Spiders mit `start_urls`, async `parse` Callbacks und `Request`/`Response`-Objekten.
105
+ - ⚡ **Paralleles Crawling**: Konfigurierbare Parallelitätslimits, domainbezogenes Throttling und Download-Verzögerungen.
106
+ - 🔄 **Multi-Session-Unterstützung**: Einheitliche Schnittstelle für HTTP-Anfragen und heimliche Headless-Browser in einem einzigen Spider -- leiten Sie Anfragen per ID an verschiedene Sessions weiter.
107
+ - 💾 **Pause & Resume**: Checkpoint-basierte Crawl-Persistenz. Drücken Sie Strg+C für ein kontrolliertes Herunterfahren; starten Sie neu, um dort fortzufahren, wo Sie aufgehört haben.
108
+ - 📡 **Streaming-Modus**: Gescrapte Elemente in Echtzeit streamen über `async for item in spider.stream()` mit Echtzeit-Statistiken -- ideal für UI, Pipelines und lang laufende Crawls.
109
+ - 🛡️ **Erkennung blockierter Anfragen**: Automatische Erkennung und Wiederholung blockierter Anfragen mit anpassbarer Logik.
110
+ - 📦 **Integrierter Export**: Ergebnisse über Hooks und Ihre eigene Pipeline oder den integrierten JSON/JSONL-Export mit `result.items.to_json()` / `result.items.to_jsonl()` exportieren.
111
+
112
+ ### Erweitertes Website-Abrufen mit Session-Unterstützung
113
+ - **HTTP-Anfragen**: Schnelle und heimliche HTTP-Anfragen mit der `Fetcher`-Klasse. Kann Browser-TLS-Fingerprints und Header imitieren und HTTP/3 verwenden.
114
+ - **Dynamisches Laden**: Dynamische Websites mit vollständiger Browser-Automatisierung über die `DynamicFetcher`-Klasse abrufen, die Playwrights Chromium und Google Chrome unterstützt.
115
+ - **Anti-Bot-Umgehung**: Erweiterte Stealth-Fähigkeiten mit `StealthyFetcher` und Fingerprint-Spoofing. Kann alle Arten von Cloudflares Turnstile/Interstitial einfach mit Automatisierung umgehen.
116
+ - **Session-Verwaltung**: Persistente Session-Unterstützung mit den Klassen `FetcherSession`, `StealthySession` und `DynamicSession` für Cookie- und Zustandsverwaltung über Anfragen hinweg.
117
+ - **Proxy-Rotation**: Integrierter `ProxyRotator` mit zyklischen oder benutzerdefinierten Rotationsstrategien über alle Session-Typen hinweg, plus Proxy-Überschreibungen pro Anfrage.
118
+ - **Domain-Blockierung**: Anfragen an bestimmte Domains (und deren Subdomains) in browserbasierten Fetchern blockieren.
119
+ - **Async-Unterstützung**: Vollständige async-Unterstützung über alle Fetcher und dedizierte async Session-Klassen hinweg.
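Das Prinzip hinter der zyklischen Proxy-Rotation lässt sich mit Bordmitteln so skizzieren (eine vereinfachte, hypothetische Skizze -- nicht Scraplings tatsächliche `ProxyRotator`-Implementierung):

```python
from itertools import cycle

# Vereinfachte, hypothetische Skizze einer zyklischen Proxy-Rotation
# (nicht Scraplings tatsaechlicher ProxyRotator)
class SimpleRotator:
    def __init__(self, proxies):
        self._pool = cycle(proxies)  # endlos ueber die Proxy-Liste iterieren

    def next_proxy(self):
        return next(self._pool)

rotator = SimpleRotator(["http://proxy1:8080", "http://proxy2:8080"])
first, second, third = (rotator.next_proxy() for _ in range(3))
```

Nach dem letzten Proxy beginnt die Rotation wieder beim ersten Eintrag -- dasselbe Grundprinzip, nur dass die Session-Klassen es pro Anfrage anwenden.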
120
+
121
+ ### Adaptives Scraping & KI-Integration
122
+ - 🔄 **Intelligente Element-Verfolgung**: Elemente nach Website-Änderungen mit intelligenten Ähnlichkeitsalgorithmen neu lokalisieren.
123
+ - 🎯 **Intelligente flexible Auswahl**: CSS-Selektoren, XPath-Selektoren, filterbasierte Suche, Textsuche, Regex-Suche und mehr.
124
+ - 🔍 **Ähnliche Elemente finden**: Elemente, die gefundenen Elementen ähnlich sind, automatisch lokalisieren.
125
+ - 🤖 **MCP-Server für die Verwendung mit KI**: Integrierter MCP-Server für KI-unterstütztes Web Scraping und Datenextraktion. Der MCP-Server verfügt über leistungsstarke, benutzerdefinierte Funktionen, die Scrapling nutzen, um gezielten Inhalt zu extrahieren, bevor er an die KI (Claude/Cursor/etc.) übergeben wird, wodurch Vorgänge beschleunigt und Kosten durch Minimierung der Token-Nutzung gesenkt werden. ([Demo-Video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
126
+
127
+ ### Hochleistungs- und praxiserprobte Architektur
128
+ - 🚀 **Blitzschnell**: Optimierte Leistung, die die meisten Python-Scraping-Bibliotheken übertrifft.
129
+ - 🔋 **Speichereffizient**: Optimierte Datenstrukturen und Lazy Loading für einen minimalen Speicher-Footprint.
130
+ - ⚡ **Schnelle JSON-Serialisierung**: 10x schneller als die Standardbibliothek.
131
+ - 🏗️ **Praxiserprobt**: Scrapling hat nicht nur eine Testabdeckung von 92% und eine vollständige Type-Hints-Abdeckung, sondern wird seit dem letzten Jahr täglich von Hunderten von Web Scrapern verwendet.
132
+
133
+ ### Entwickler-/Web-Scraper-freundliche Erfahrung
134
+ - 🎯 **Interaktive Web-Scraping-Shell**: Optionale integrierte IPython-Shell mit Scrapling-Integration, Shortcuts und neuen Tools zur Beschleunigung der Web-Scraping-Skriptentwicklung, wie das Konvertieren von Curl-Anfragen in Scrapling-Anfragen und das Anzeigen von Anfrageergebnissen in Ihrem Browser.
135
+ - 🚀 **Direkt vom Terminal aus verwenden**: Optional können Sie Scrapling verwenden, um eine URL zu scrapen, ohne eine einzige Codezeile zu schreiben!
136
+ - 🛠️ **Umfangreiche Navigations-API**: Erweiterte DOM-Traversierung mit Eltern-, Geschwister- und Kind-Navigationsmethoden.
137
+ - 🧬 **Verbesserte Textverarbeitung**: Integrierte Regex, Bereinigungsmethoden und optimierte String-Operationen.
138
+ - 📝 **Automatische Selektorgenerierung**: Robuste CSS/XPath-Selektoren für jedes Element generieren.
139
+ - 🔌 **Vertraute API**: Ähnlich wie Scrapy/BeautifulSoup mit denselben Pseudo-Elementen, die in Scrapy/Parsel verwendet werden.
140
+ - 📘 **Vollständige Typabdeckung**: Vollständige Type Hints für hervorragende IDE-Unterstützung und Code-Vervollständigung. Die gesamte Codebasis wird bei jeder Änderung automatisch mit **PyRight** und **MyPy** gescannt.
141
+ - 🔋 **Fertiges Docker-Image**: Mit jeder Veröffentlichung wird automatisch ein Docker-Image erstellt und gepusht, das alle Browser enthält.
142
+
143
+ ## Erste Schritte
144
+
145
+ Hier ein kurzer Überblick über das, was Scrapling kann, ohne zu sehr ins Detail zu gehen.
146
+
147
+ ### Grundlegende Verwendung
148
+ HTTP-Anfragen mit Session-Unterstützung
149
+ ```python
150
+ from scrapling.fetchers import Fetcher, FetcherSession
151
+
152
+ with FetcherSession(impersonate='chrome') as session: # Neueste Version von Chromes TLS-Fingerprint verwenden
153
+ page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
154
+ quotes = page.css('.quote .text::text').getall()
155
+
156
+ # Oder einmalige Anfragen verwenden
157
+ page = Fetcher.get('https://quotes.toscrape.com/')
158
+ quotes = page.css('.quote .text::text').getall()
159
+ ```
160
+ Erweiterter Stealth-Modus
161
+ ```python
162
+ from scrapling.fetchers import StealthyFetcher, StealthySession
163
+
164
+ with StealthySession(headless=True, solve_cloudflare=True) as session: # Browser offen halten, bis Sie fertig sind
165
+ page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
166
+ data = page.css('#padded_content a').getall()
167
+
168
+ # Oder einmaligen Anfragenstil verwenden: öffnet den Browser für diese Anfrage und schließt ihn nach Abschluss
169
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
170
+ data = page.css('#padded_content a').getall()
171
+ ```
172
+ Vollständige Browser-Automatisierung
173
+ ```python
174
+ from scrapling.fetchers import DynamicFetcher, DynamicSession
175
+
176
+ with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Browser offen halten, bis Sie fertig sind
177
+ page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
178
+ data = page.xpath('//span[@class="text"]/text()').getall() # XPath-Selektor, falls bevorzugt
179
+
180
+ # Oder einmaligen Anfragenstil verwenden: öffnet den Browser für diese Anfrage und schließt ihn nach Abschluss
181
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
182
+ data = page.css('.quote .text::text').getall()
183
+ ```
184
+
185
+ ### Spiders
186
+ Vollständige Crawler mit parallelen Anfragen, mehreren Session-Typen und Pause & Resume erstellen:
187
+ ```python
188
+ from scrapling.spiders import Spider, Request, Response
189
+
190
+ class QuotesSpider(Spider):
191
+ name = "quotes"
192
+ start_urls = ["https://quotes.toscrape.com/"]
193
+ concurrent_requests = 10
194
+
195
+ async def parse(self, response: Response):
196
+ for quote in response.css('.quote'):
197
+ yield {
198
+ "text": quote.css('.text::text').get(),
199
+ "author": quote.css('.author::text').get(),
200
+ }
201
+
202
+ next_page = response.css('.next a')
203
+ if next_page:
204
+ yield response.follow(next_page[0].attrib['href'])
205
+
206
+ result = QuotesSpider().start()
207
+ print(f"{len(result.items)} Zitate gescrapt")
208
+ result.items.to_json("quotes.json")
209
+ ```
210
+ Mehrere Session-Typen in einem einzigen Spider verwenden:
211
+ ```python
212
+ from scrapling.spiders import Spider, Request, Response
213
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession
214
+
215
+ class MultiSessionSpider(Spider):
216
+ name = "multi"
217
+ start_urls = ["https://example.com/"]
218
+
219
+ def configure_sessions(self, manager):
220
+ manager.add("fast", FetcherSession(impersonate="chrome"))
221
+ manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
222
+
223
+ async def parse(self, response: Response):
224
+ for link in response.css('a::attr(href)').getall():
225
+ # Geschützte Seiten über die Stealth-Session leiten
226
+ if "protected" in link:
227
+ yield Request(link, sid="stealth")
228
+ else:
229
+ yield Request(link, sid="fast", callback=self.parse) # Expliziter Callback
230
+ ```
231
+ Lange Crawls mit Checkpoints pausieren und fortsetzen, indem Sie den Spider so starten:
232
+ ```python
233
+ QuotesSpider(crawldir="./crawl_data").start()
234
+ ```
235
+ Drücken Sie Strg+C, um kontrolliert zu pausieren -- der Fortschritt wird automatisch gespeichert. Wenn Sie den Spider später erneut starten, übergeben Sie dasselbe `crawldir`, und er setzt dort fort, wo er aufgehört hat.
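Die Idee hinter den Checkpoints lässt sich mit der Standardbibliothek so skizzieren (eine hypothetische Vereinfachung -- nicht Scraplings tatsächliches Checkpoint-Format):

```python
import json
import os
import tempfile

# Hypothetische Vereinfachung der Checkpoint-Persistenz:
# ausstehende und erledigte URLs als JSON speichern und wieder laden
# (nicht Scraplings tatsaechliches Format)
def save_checkpoint(path, pending, done):
    with open(path, "w") as f:
        json.dump({"pending": sorted(pending), "done": sorted(done)}, f)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {"pending": [], "done": []}  # frischer Crawl ohne Checkpoint
    with open(path) as f:
        return json.load(f)

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
save_checkpoint(ckpt, pending={"https://example.com/2"}, done={"https://example.com/1"})
state = load_checkpoint(ckpt)
```

Beim Neustart liefert `load_checkpoint` den gespeicherten Zustand zurück, sodass nur die ausstehenden URLs erneut angefragt werden müssen.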
236
+
237
+ ### Erweitertes Parsing & Navigation
238
+ ```python
239
+ from scrapling.fetchers import Fetcher
240
+
241
+ # Umfangreiche Elementauswahl und Navigation
242
+ page = Fetcher.get('https://quotes.toscrape.com/')
243
+
244
+ # Zitate mit verschiedenen Auswahlmethoden abrufen
245
+ quotes = page.css('.quote') # CSS-Selektor
246
+ quotes = page.xpath('//div[@class="quote"]') # XPath
247
+ quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup-Stil
248
+ # Gleich wie
249
+ quotes = page.find_all('div', class_='quote')
250
+ quotes = page.find_all(['div'], class_='quote')
251
+ quotes = page.find_all(class_='quote') # und so weiter...
252
+ # Element nach Textinhalt finden
253
+ quotes = page.find_by_text('quote', tag='div')
254
+
255
+ # Erweiterte Navigation
256
+ quote_text = page.css('.quote')[0].css('.text::text').get()
257
+ quote_text = page.css('.quote').css('.text::text').getall() # Verkettete Selektoren
258
+ first_quote = page.css('.quote')[0]
259
+ author = first_quote.next_sibling.css('.author::text')
260
+ parent_container = first_quote.parent
261
+
262
+ # Elementbeziehungen und Ähnlichkeit
263
+ similar_elements = first_quote.find_similar()
264
+ below_elements = first_quote.below_elements()
265
+ ```
266
+ Sie können den Parser direkt verwenden, wenn Sie keine Websites abrufen möchten, wie unten gezeigt:
267
+ ```python
268
+ from scrapling.parser import Selector
269
+
270
+ page = Selector("<html>...</html>")
271
+ ```
272
+ Und es funktioniert genau auf die gleiche Weise!
273
+
274
+ ### Beispiele für async Session-Verwaltung
275
+ ```python
276
+ import asyncio
277
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
278
+
279
+ async with FetcherSession(http3=True) as session: # `FetcherSession` ist kontextbewusst und kann sowohl in sync- als auch in async-Mustern arbeiten
280
+ page1 = session.get('https://quotes.toscrape.com/')
281
+ page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
282
+
283
+ # Async-Session-Verwendung
284
+ async with AsyncStealthySession(max_pages=2) as session:
285
+ tasks = []
286
+ urls = ['https://example.com/page1', 'https://example.com/page2']
287
+
288
+ for url in urls:
289
+ task = session.fetch(url)
290
+ tasks.append(task)
291
+
292
+ print(session.get_pool_stats()) # Optional - Der Status des Browser-Tab-Pools (beschäftigt/frei/Fehler)
293
+ results = await asyncio.gather(*tasks)
294
+ print(session.get_pool_stats())
295
+ ```
296
+
297
+ ## CLI & Interaktive Shell
298
+
299
+ Scrapling enthält eine leistungsstarke Befehlszeilenschnittstelle:
300
+
301
+ [![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
302
+
303
+ Interaktive Web-Scraping-Shell starten
304
+ ```bash
305
+ scrapling shell
306
+ ```
307
+ Seiten direkt ohne Programmierung in eine Datei extrahieren (extrahiert standardmäßig den Inhalt im `body`-Tag). Wenn die Ausgabedatei mit `.txt` endet, wird der Textinhalt des Ziels extrahiert. Wenn sie mit `.md` endet, ist es eine Markdown-Darstellung des HTML-Inhalts; wenn sie mit `.html` endet, ist es der HTML-Inhalt selbst.
308
+ ```bash
309
+ scrapling extract get 'https://example.com' content.md
310
+ scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # Alle Elemente, die dem CSS-Selektor '#fromSkipToProducts' entsprechen
311
+ scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
312
+ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
313
+ ```
314
+
315
+ > [!NOTE]
316
+ > Es gibt viele zusätzliche Funktionen, einschließlich des MCP-Servers und der interaktiven Web-Scraping-Shell, aber wir möchten diese Seite prägnant halten. Schauen Sie sich die vollständige Dokumentation [hier](https://scrapling.readthedocs.io/en/latest/) an
317
+
318
+ ## Leistungsbenchmarks
319
+
320
+ Scrapling ist nicht nur leistungsstark -- es ist auch blitzschnell. Die folgenden Benchmarks vergleichen Scraplings Parser mit den neuesten Versionen anderer beliebter Bibliotheken.
321
+
322
+ ### Textextraktions-Geschwindigkeitstest (5000 verschachtelte Elemente)
323
+
324
+ | # | Bibliothek | Zeit (ms) | vs Scrapling |
325
+ |---|:-----------------:|:---------:|:------------:|
326
+ | 1 | Scrapling | 2.02 | 1.0x |
327
+ | 2 | Parsel/Scrapy | 2.04 | 1.01x |
328
+ | 3 | Raw Lxml | 2.54 | 1.257x |
329
+ | 4 | PyQuery | 24.17 | ~12x |
330
+ | 5 | Selectolax | 82.63 | ~41x |
331
+ | 6 | MechanicalSoup | 1549.71 | ~767.1x |
332
+ | 7 | BS4 with Lxml | 1584.31 | ~784.3x |
333
+ | 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
334
+
335
+
336
+ ### Element-Ähnlichkeit & Textsuche-Leistung
337
+
338
+ Scraplings adaptive Element-Finding-Fähigkeiten übertreffen Alternativen deutlich:
339
+
340
+ | Bibliothek | Zeit (ms) | vs Scrapling |
341
+ |-------------|:---------:|:------------:|
342
+ | Scrapling | 2.39 | 1.0x |
343
+ | AutoScraper | 12.45 | 5.209x |
344
+
345
+
346
+ > Alle Benchmarks stellen Durchschnittswerte von über 100 Durchläufen dar. Siehe [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) für die Methodik.
347
+
348
+ ## Installation
349
+
350
+ Scrapling erfordert Python 3.10 oder höher:
351
+
352
+ ```bash
353
+ pip install scrapling
354
+ ```
355
+
356
+ Diese Installation enthält nur die Parser-Engine und ihre Abhängigkeiten, ohne Fetcher oder Kommandozeilenabhängigkeiten.
357
+
358
+ ### Optionale Abhängigkeiten
359
+
360
+ 1. Wenn Sie eine der folgenden zusätzlichen Funktionen, die Fetcher oder ihre Klassen verwenden möchten, müssen Sie die Abhängigkeiten der Fetcher und ihre Browser-Abhängigkeiten wie folgt installieren:
361
+ ```bash
362
+ pip install "scrapling[fetchers]"
363
+
364
+ scrapling install
365
+ ```
366
+
367
+ Dies lädt alle Browser zusammen mit ihren Systemabhängigkeiten und Fingerprint-Manipulationsabhängigkeiten herunter.
368
+
369
+ 2. Zusätzliche Funktionen:
370
+ - MCP-Server-Funktion installieren:
371
+ ```bash
372
+ pip install "scrapling[ai]"
373
+ ```
374
+ - Shell-Funktionen installieren (Web-Scraping-Shell und der `extract`-Befehl):
375
+ ```bash
376
+ pip install "scrapling[shell]"
377
+ ```
378
+ - Alles installieren:
379
+ ```bash
380
+ pip install "scrapling[all]"
381
+ ```
382
+ Denken Sie daran, dass Sie nach der Installation eines dieser Extras (falls noch nicht geschehen) die Browser-Abhängigkeiten mit `scrapling install` installieren müssen
383
+
384
+ ### Docker
385
+ Sie können auch ein Docker-Image mit allen Extras und Browsern mit dem folgenden Befehl von DockerHub installieren:
386
+ ```bash
387
+ docker pull pyd4vinci/scrapling
388
+ ```
389
+ Oder laden Sie es aus der GitHub-Registry herunter:
390
+ ```bash
391
+ docker pull ghcr.io/d4vinci/scrapling:latest
392
+ ```
393
+ Dieses Image wird automatisch mit GitHub Actions und dem Hauptzweig des Repositorys erstellt und gepusht.
394
+
395
+ ## Beitragen
396
+
397
+ Wir freuen uns über Beiträge! Bitte lesen Sie unsere [Beitragsrichtlinien](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md), bevor Sie beginnen.
398
+
399
+ ## Haftungsausschluss
400
+
401
+ > [!CAUTION]
402
+ > Diese Bibliothek wird nur zu Bildungs- und Forschungszwecken bereitgestellt. Durch die Nutzung dieser Bibliothek erklären Sie sich damit einverstanden, lokale und internationale Gesetze zum Daten-Scraping und Datenschutz einzuhalten. Die Autoren und Mitwirkenden sind nicht verantwortlich für Missbrauch dieser Software. Respektieren Sie immer die Nutzungsbedingungen von Websites und robots.txt-Dateien.
403
+
404
+ ## Lizenz
405
+
406
+ Diese Arbeit ist unter der BSD-3-Clause-Lizenz lizenziert.
407
+
408
+ ## Danksagungen
409
+
410
+ Dieses Projekt enthält angepassten Code von:
411
+ - Parsel (BSD-Lizenz) -- Verwendet für das [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)-Submodul
412
+
413
+ ---
414
+ <div align="center"><small>Entworfen und hergestellt mit ❤️ von Karim Shoair.</small></div><br>
Scrapling/docs/README_ES.md ADDED
@@ -0,0 +1,414 @@
1
+ <h1 align="center">
2
+ <a href="https://scrapling.readthedocs.io">
3
+ <picture>
4
+ <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
5
+ <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
6
+ </picture>
7
+ </a>
8
+ <br>
9
+ <small>Effortless Web Scraping for the Modern Web</small>
10
+ </h1>
11
+
12
+ <p align="center">
13
+ <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
14
+ <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
15
+ <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
16
+ <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
17
+ <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
18
+ <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
19
+ <br/>
20
+ <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
21
+ <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
22
+ </a>
23
+ <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
24
+ <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
25
+ </a>
26
+ <br/>
27
+ <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
28
+ <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
29
+ </p>
30
+
31
+ <p align="center">
32
+ <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>Métodos de selección</strong></a>
33
+ &middot;
34
+ <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>Elegir un fetcher</strong></a>
35
+ &middot;
36
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spiders</strong></a>
37
+ &middot;
38
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Rotación de proxy</strong></a>
39
+ &middot;
40
+ <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
41
+ &middot;
42
+ <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>Modo MCP</strong></a>
43
+ </p>
44
+
45
+ Scrapling es un framework de Web Scraping adaptativo que se encarga de todo, desde una sola solicitud hasta un rastreo a gran escala.
46
+
47
+ Su parser aprende de los cambios de los sitios web y relocaliza automáticamente tus elementos cuando las páginas se actualizan. Sus fetchers evaden sistemas anti-bot como Cloudflare Turnstile de forma nativa. Y su framework Spider te permite escalar a rastreos concurrentes con múltiples sesiones, con Pause & Resume y rotación automática de Proxy, todo en unas pocas líneas de Python. Una biblioteca, cero compromisos.
48
+
49
+ Rastreos ultrarrápidos con estadísticas en tiempo real y Streaming. Construido por Web Scrapers para Web Scrapers y usuarios regulares, hay algo para todos.
50
+
51
+ ```python
52
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
53
+ StealthyFetcher.adaptive = True
54
+ p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # ¡Obtén el sitio web bajo el radar!
55
+ products = p.css('.product', auto_save=True) # ¡Extrae datos que sobreviven a cambios de diseño del sitio web!
56
+ products = p.css('.product', adaptive=True) # Más tarde, si la estructura del sitio web cambia, ¡pasa `adaptive=True` para encontrarlos!
57
+ ```
58
+ O escala a rastreos completos
59
+ ```python
60
+ from scrapling.spiders import Spider, Response
61
+
62
+ class MySpider(Spider):
63
+ name = "demo"
64
+ start_urls = ["https://example.com/"]
65
+
66
+ async def parse(self, response: Response):
67
+ for item in response.css('.product'):
68
+ yield {"title": item.css('h2::text').get()}
69
+
70
+ MySpider().start()
71
+ ```
72
+
73
+
74
+ # Patrocinadores Platino
75
+
76
+ # Patrocinadores
77
+
78
+ <!-- sponsors -->
79
+
80
+ <a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
81
+ <a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
82
+ <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
83
+ <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
84
+ <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
85
+ <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
86
+ <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
87
+ <a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
88
+ <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>
89
+
90
+
91
+ <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
92
+ <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
93
+ <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
94
+
95
+ <!-- /sponsors -->
96
+
97
+ <i><sub>¿Quieres mostrar tu anuncio aquí? ¡Haz clic [aquí](https://github.com/sponsors/D4Vinci) y elige el nivel que te convenga!</sub></i>
98
+
99
+ ---
100
+
101
+ ## Características Principales
102
+
103
+ ### Spiders — Un Framework Completo de Rastreo
104
+ - 🕷️ **API de Spider al estilo Scrapy**: Define spiders con `start_urls`, callbacks async `parse`, y objetos `Request`/`Response`.
105
+ - ⚡ **Rastreo Concurrente**: Límites de concurrencia configurables, limitación por dominio y retrasos de descarga.
106
+ - 🔄 **Soporte Multi-Session**: Interfaz unificada para solicitudes HTTP y navegadores headless sigilosos en un solo Spider — enruta solicitudes a diferentes sesiones por ID.
107
+ - 💾 **Pause & Resume**: Persistencia de rastreo basada en Checkpoint. Presiona Ctrl+C para un cierre ordenado; reinicia para continuar desde donde lo dejaste.
108
+ - 📡 **Modo Streaming**: Transmite elementos extraídos a medida que llegan con `async for item in spider.stream()` con estadísticas en tiempo real — ideal para UI, pipelines y rastreos de larga duración.
109
+ - 🛡️ **Detección de Solicitudes Bloqueadas**: Detección automática y reintento de solicitudes bloqueadas con lógica personalizable.
110
+ - 📦 **Exportación Integrada**: Exporta resultados a través de hooks y tu propio pipeline o el JSON/JSONL integrado con `result.items.to_json()` / `result.items.to_jsonl()` respectivamente.
111
+
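La idea detrás de Pause & Resume puede ilustrarse con un boceto mínimo (hipotético; no es la implementación real de Scrapling): persistir en un checkpoint las URLs pendientes y las ya visitadas, y recargarlas al reiniciar.

```python
import json
from pathlib import Path


def save_checkpoint(crawldir: str, pending: list, visited: set) -> None:
    # Guarda la frontera del rastreo en disco para poder reanudarlo después
    path = Path(crawldir)
    path.mkdir(parents=True, exist_ok=True)
    (path / "checkpoint.json").write_text(
        json.dumps({"pending": pending, "visited": sorted(visited)})
    )


def load_checkpoint(crawldir: str):
    # Devuelve (pendientes, visitadas); valores vacíos si no hay checkpoint previo
    file = Path(crawldir) / "checkpoint.json"
    if not file.exists():
        return [], set()
    data = json.loads(file.read_text())
    return data["pending"], set(data["visited"])
```

Al pausar, el rastreador guardaría este estado; al arrancar de nuevo con el mismo directorio, continúa con las URLs pendientes.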
112
+ ### Obtención Avanzada de Sitios Web con Soporte de Session
113
+ - **Solicitudes HTTP**: Solicitudes HTTP rápidas y sigilosas con la clase `Fetcher`. Puede imitar el fingerprint TLS de los navegadores, encabezados y usar HTTP/3.
114
+ - **Carga Dinámica**: Obtén sitios web dinámicos con automatización completa del navegador a través de la clase `DynamicFetcher` compatible con Chromium de Playwright y Google Chrome.
115
+ - **Evasión Anti-bot**: Capacidades de sigilo avanzadas con `StealthyFetcher` y falsificación de fingerprint. Puede evadir fácilmente todos los tipos de Turnstile/Interstitial de Cloudflare con automatización.
116
+ - **Gestión de Session**: Soporte de sesión persistente con las clases `FetcherSession`, `StealthySession` y `DynamicSession` para la gestión de cookies y estado entre solicitudes.
117
+ - **Rotación de Proxy**: `ProxyRotator` integrado con estrategias de rotación cíclica o personalizadas en todos los tipos de sesión, además de sobrescrituras de Proxy por solicitud.
118
+ - **Bloqueo de Dominios**: Bloquea solicitudes a dominios específicos (y sus subdominios) en fetchers basados en navegador.
119
+ - **Soporte Async**: Soporte async completo en todos los fetchers y clases de sesión async dedicadas.
120
+
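Como referencia, la estrategia de rotación cíclica puede bosquejarse así (ejemplo ilustrativo; la clase real `ProxyRotator` de Scrapling ofrece más estrategias y su API puede diferir):

```python
from itertools import cycle


class RoundRobinProxies:
    """Boceto hipotético de rotación cíclica de proxies."""

    def __init__(self, proxies):
        self._pool = cycle(proxies)

    def next_proxy(self) -> str:
        # Devuelve el siguiente proxy; al agotar la lista vuelve al inicio
        return next(self._pool)
```

Cada solicitud tomaría `next_proxy()`; una sobrescritura de Proxy por solicitud simplemente ignoraría el rotador para esa llamada.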
121
+ ### Scraping Adaptativo e Integración con IA
122
+ - 🔄 **Seguimiento Inteligente de Elementos**: Relocaliza elementos después de cambios en el sitio web usando algoritmos inteligentes de similitud.
123
+ - 🎯 **Selección Flexible Inteligente**: Selectores CSS, selectores XPath, búsqueda basada en filtros, búsqueda de texto, búsqueda regex y más.
124
+ - 🔍 **Encontrar Elementos Similares**: Localiza automáticamente elementos similares a los elementos encontrados.
125
+ - 🤖 **Servidor MCP para usar con IA**: Servidor MCP integrado para Web Scraping asistido por IA y extracción de datos. El servidor MCP presenta capacidades potentes y personalizadas que aprovechan Scrapling para extraer contenido específico antes de pasarlo a la IA (Claude/Cursor/etc), acelerando así las operaciones y reduciendo costos al minimizar el uso de tokens. ([video demo](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
126
+
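Para dar una intuición de la relocalización por similitud (esquema simplificado e hipotético, no el algoritmo real de Scrapling): se compara el elemento guardado con los candidatos de la página nueva y gana la puntuación más alta, por ejemplo usando similitud de Jaccard sobre los atributos.

```python
def jaccard(a, b) -> float:
    # Similitud de Jaccard entre dos conjuntos (0.0 = disjuntos, 1.0 = idénticos)
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Atributos del elemento guardado y de los candidatos tras un rediseño
saved = {("class", "product"), ("data-id", "42")}
candidates = [
    {("class", "item"), ("data-id", "42")},
    {("class", "product"), ("data-id", "42"), ("role", "listitem")},
]
best = max(candidates, key=lambda cand: jaccard(saved, cand))
```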
127
+ ### Arquitectura de Alto Rendimiento y Probada en Batalla
128
+ - 🚀 **Ultrarrápido**: Rendimiento optimizado que supera a la mayoría de las bibliotecas de Web Scraping de Python.
129
+ - 🔋 **Eficiente en Memoria**: Estructuras de datos optimizadas y carga diferida para una huella de memoria mínima.
130
+ - ⚡ **Serialización JSON Rápida**: 10 veces más rápido que la biblioteca estándar.
131
+ - 🏗️ **Probado en batalla**: Scrapling no solo tiene una cobertura de pruebas del 92% y cobertura completa de type hints, sino que ha sido utilizado diariamente por cientos de Web Scrapers durante el último año.
132
+
133
+ ### Experiencia Amigable para Desarrolladores/Web Scrapers
134
+ - 🎯 **Shell Interactivo de Web Scraping**: Shell IPython integrado opcional con integración de Scrapling, atajos y nuevas herramientas para acelerar el desarrollo de scripts de Web Scraping, como convertir solicitudes curl a solicitudes Scrapling y ver resultados de solicitudes en tu navegador.
135
+ - 🚀 **Úsalo directamente desde la Terminal**: Opcionalmente, ¡puedes usar Scrapling para hacer scraping de una URL sin escribir ni una sola línea de código!
136
+ - 🛠️ **API de Navegación Rica**: Recorrido avanzado del DOM con métodos de navegación de padres, hermanos e hijos.
137
+ - 🧬 **Procesamiento de Texto Mejorado**: Métodos integrados de regex, limpieza y operaciones de cadena optimizadas.
138
+ - 📝 **Generación Automática de Selectores**: Genera selectores CSS/XPath robustos para cualquier elemento.
139
+ - 🔌 **API Familiar**: Similar a Scrapy/BeautifulSoup con los mismos pseudo-elementos usados en Scrapy/Parsel.
140
+ - 📘 **Cobertura Completa de Tipos**: Type hints completos para excelente soporte de IDE y autocompletado de código. Todo el código fuente se escanea automáticamente con **PyRight** y **MyPy** en cada cambio.
141
+ - 🔋 **Imagen Docker Lista**: Con cada lanzamiento, se construye y publica automáticamente una imagen Docker que contiene todos los navegadores.
142
+
143
+ ## Primeros Pasos
144
+
145
+ Aquí tienes un vistazo rápido de lo que Scrapling puede hacer sin entrar en profundidad.
146
+
147
+ ### Uso Básico
148
+ Solicitudes HTTP con soporte de sesión
149
+ ```python
150
+ from scrapling.fetchers import Fetcher, FetcherSession
151
+
152
+ with FetcherSession(impersonate='chrome') as session: # Usa la última versión del fingerprint TLS de Chrome
153
+ page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
154
+ quotes = page.css('.quote .text::text').getall()
155
+
156
+ # O usa solicitudes de una sola vez
157
+ page = Fetcher.get('https://quotes.toscrape.com/')
158
+ quotes = page.css('.quote .text::text').getall()
159
+ ```
160
+ Modo sigiloso avanzado
161
+ ```python
162
+ from scrapling.fetchers import StealthyFetcher, StealthySession
163
+
164
+ with StealthySession(headless=True, solve_cloudflare=True) as session: # Mantén el navegador abierto hasta que termines
165
+ page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
166
+ data = page.css('#padded_content a').getall()
167
+
168
+ # O usa el estilo de solicitud de una sola vez, abre el navegador para esta solicitud, luego lo cierra después de terminar
169
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
170
+ data = page.css('#padded_content a').getall()
171
+ ```
172
+ Automatización completa del navegador
173
+ ```python
174
+ from scrapling.fetchers import DynamicFetcher, DynamicSession
175
+
176
+ with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Mantén el navegador abierto hasta que termines
177
+ page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
178
+ data = page.xpath('//span[@class="text"]/text()').getall() # Selector XPath si lo prefieres
179
+
180
+ # O usa el estilo de solicitud de una sola vez, abre el navegador para esta solicitud, luego lo cierra después de terminar
181
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
182
+ data = page.css('.quote .text::text').getall()
183
+ ```
184
+
185
+ ### Spiders
186
+ Construye rastreadores completos con solicitudes concurrentes, múltiples tipos de sesión y Pause & Resume:
187
+ ```python
188
+ from scrapling.spiders import Spider, Request, Response
189
+
190
+ class QuotesSpider(Spider):
191
+ name = "quotes"
192
+ start_urls = ["https://quotes.toscrape.com/"]
193
+ concurrent_requests = 10
194
+
195
+ async def parse(self, response: Response):
196
+ for quote in response.css('.quote'):
197
+ yield {
198
+ "text": quote.css('.text::text').get(),
199
+ "author": quote.css('.author::text').get(),
200
+ }
201
+
202
+ next_page = response.css('.next a')
203
+ if next_page:
204
+ yield response.follow(next_page[0].attrib['href'])
205
+
206
+ result = QuotesSpider().start()
207
+ print(f"Se extrajeron {len(result.items)} citas")
208
+ result.items.to_json("quotes.json")
209
+ ```
210
+ Usa múltiples tipos de sesión en un solo Spider:
211
+ ```python
212
+ from scrapling.spiders import Spider, Request, Response
213
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession
214
+
215
+ class MultiSessionSpider(Spider):
216
+ name = "multi"
217
+ start_urls = ["https://example.com/"]
218
+
219
+ def configure_sessions(self, manager):
220
+ manager.add("fast", FetcherSession(impersonate="chrome"))
221
+ manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
222
+
223
+ async def parse(self, response: Response):
224
+ for link in response.css('a::attr(href)').getall():
225
+ # Enruta las páginas protegidas a través de la sesión sigilosa
226
+ if "protected" in link:
227
+ yield Request(link, sid="stealth")
228
+ else:
229
+ yield Request(link, sid="fast", callback=self.parse) # callback explícito
230
+ ```
231
+ Pausa y reanuda rastreos largos con checkpoints ejecutando el Spider así:
232
+ ```python
233
+ QuotesSpider(crawldir="./crawl_data").start()
234
+ ```
235
+ Presiona Ctrl+C para pausar de forma ordenada — el progreso se guarda automáticamente. Después, cuando inicies el Spider de nuevo, pasa el mismo `crawldir`, y continuará desde donde se detuvo.
236
+
237
+ ### Análisis Avanzado y Navegación
238
+ ```python
239
+ from scrapling.fetchers import Fetcher
240
+
241
+ # Selección rica de elementos y navegación
242
+ page = Fetcher.get('https://quotes.toscrape.com/')
243
+
244
+ # Obtén citas con múltiples métodos de selección
245
+ quotes = page.css('.quote') # Selector CSS
246
+ quotes = page.xpath('//div[@class="quote"]') # XPath
247
+ quotes = page.find_all('div', {'class': 'quote'}) # Estilo BeautifulSoup
248
+ # Igual que
249
+ quotes = page.find_all('div', class_='quote')
250
+ quotes = page.find_all(['div'], class_='quote')
251
+ quotes = page.find_all(class_='quote') # y así sucesivamente...
252
+ # Encuentra elementos por contenido de texto
253
+ quotes = page.find_by_text('quote', tag='div')
254
+
255
+ # Navegación avanzada
256
+ quote_text = page.css('.quote')[0].css('.text::text').get()
257
+ quote_text = page.css('.quote').css('.text::text').getall() # Selectores encadenados
258
+ first_quote = page.css('.quote')[0]
259
+ author = first_quote.next_sibling.css('.author::text')
260
+ parent_container = first_quote.parent
261
+
262
+ # Relaciones y similitud de elementos
263
+ similar_elements = first_quote.find_similar()
264
+ below_elements = first_quote.below_elements()
265
+ ```
266
+ Puedes usar el parser directamente si no necesitas obtener sitios web, como se muestra a continuación:
267
+ ```python
268
+ from scrapling.parser import Selector
269
+
270
+ page = Selector("<html>...</html>")
271
+ ```
272
+ ¡Y funciona exactamente de la misma manera!
273
+
274
+ ### Ejemplos de Gestión de Session Async
275
+ ```python
276
+ import asyncio
277
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
278
+
279
+ async with FetcherSession(http3=True) as session: # `FetcherSession` es consciente del contexto y puede funcionar tanto en patrones sync/async
280
+ page1 = session.get('https://quotes.toscrape.com/')
281
+ page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
282
+
283
+ # Uso de sesión async
284
+ async with AsyncStealthySession(max_pages=2) as session:
285
+ tasks = []
286
+ urls = ['https://example.com/page1', 'https://example.com/page2']
287
+
288
+ for url in urls:
289
+ task = session.fetch(url)
290
+ tasks.append(task)
291
+
292
+ print(session.get_pool_stats()) # Opcional - El estado del pool de pestañas del navegador (ocupado/libre/error)
293
+ results = await asyncio.gather(*tasks)
294
+ print(session.get_pool_stats())
295
+ ```
296
+
297
+ ## CLI y Shell Interactivo
298
+
299
+ Scrapling incluye una poderosa interfaz de línea de comandos:
300
+
301
+ [![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
302
+
303
+ Lanzar el Shell interactivo de Web Scraping
304
+ ```bash
305
+ scrapling shell
306
+ ```
307
+ Extraer páginas a un archivo directamente sin programar (Extrae el contenido dentro de la etiqueta `body` por defecto). Si el archivo de salida termina con `.txt`, entonces se extraerá el contenido de texto del objetivo. Si termina con `.md`, será una representación Markdown del contenido HTML; si termina con `.html`, será el contenido HTML en sí mismo.
308
+ ```bash
309
+ scrapling extract get 'https://example.com' content.md
310
+ scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # Todos los elementos que coinciden con el selector CSS '#fromSkipToProducts'
311
+ scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
312
+ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
313
+ ```
314
+
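La regla de la extensión de salida descrita arriba equivale, a grandes rasgos, a un mapeo como este (boceto ilustrativo, no el código real del CLI):

```python
from pathlib import Path


def output_format(filename: str) -> str:
    # Decide el formato según la extensión del archivo de salida
    formats = {".txt": "texto", ".md": "markdown", ".html": "html"}
    return formats.get(Path(filename).suffix.lower(), "html")
```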
315
+ > [!NOTE]
316
+ > Hay muchas características adicionales, pero queremos mantener esta página concisa, incluyendo el servidor MCP y el Shell Interactivo de Web Scraping. Consulta la documentación completa [aquí](https://scrapling.readthedocs.io/en/latest/)
317
+
318
+ ## Benchmarks de Rendimiento
319
+
320
+ Scrapling no solo es potente, también es ultrarrápido. Los siguientes benchmarks comparan el parser de Scrapling con las últimas versiones de otras bibliotecas populares.
321
+
322
+ ### Prueba de Velocidad de Extracción de Texto (5000 elementos anidados)
323
+
324
+ | # | Biblioteca | Tiempo (ms) | vs Scrapling |
325
+ |---|:-----------------:|:-----------:|:------------:|
326
+ | 1 | Scrapling | 2.02 | 1.0x |
327
+ | 2 | Parsel/Scrapy | 2.04 | 1.01x |
328
+ | 3 | Raw Lxml | 2.54 | 1.257x |
329
+ | 4 | PyQuery | 24.17 | ~12x |
330
+ | 5 | Selectolax | 82.63 | ~41x |
331
+ | 6 | MechanicalSoup | 1549.71 | ~767.1x |
332
+ | 7 | BS4 with Lxml | 1584.31 | ~784.3x |
333
+ | 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
334
+
335
+
336
+ ### Rendimiento de Similitud de Elementos y Búsqueda de Texto
337
+
338
+ Las capacidades de búsqueda adaptativa de elementos de Scrapling superan significativamente a las alternativas:
339
+
340
+ | Biblioteca | Tiempo (ms) | vs Scrapling |
341
+ |-------------|:-----------:|:------------:|
342
+ | Scrapling | 2.39 | 1.0x |
343
+ | AutoScraper | 12.45 | 5.209x |
344
+
345
+
346
+ > Todos los benchmarks representan promedios de más de 100 ejecuciones. Ver [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) para la metodología.
347
+
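La metodología de promediar múltiples ejecuciones puede reproducirse con la biblioteca estándar (boceto general; los números exactos de las tablas provienen de `benchmarks.py`):

```python
from timeit import repeat


def average_ms(fn, runs: int = 100) -> float:
    # Promedia el tiempo de `runs` ejecuciones y lo devuelve en milisegundos
    times = repeat(fn, number=1, repeat=runs)
    return sum(times) / runs * 1000
```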
348
+ ## Instalación
349
+
350
+ Scrapling requiere Python 3.10 o superior:
351
+
352
+ ```bash
353
+ pip install scrapling
354
+ ```
355
+
356
+ Esta instalación solo incluye el motor de análisis y sus dependencias, sin ningún fetcher ni dependencias de línea de comandos.
357
+
358
+ ### Dependencias Opcionales
359
+
360
+ 1. Si vas a usar alguna de las características adicionales a continuación, los fetchers, o sus clases, necesitarás instalar las dependencias de los fetchers y sus dependencias del navegador de la siguiente manera:
361
+ ```bash
362
+ pip install "scrapling[fetchers]"
363
+
364
+ scrapling install
365
+ ```
366
+
367
+ Esto descarga todos los navegadores, junto con sus dependencias del sistema y dependencias de manipulación de fingerprint.
368
+
369
+ 2. Características adicionales:
370
+ - Instalar la característica del servidor MCP:
371
+ ```bash
372
+ pip install "scrapling[ai]"
373
+ ```
374
+ - Instalar características del Shell (Shell de Web Scraping y el comando `extract`):
375
+ ```bash
376
+ pip install "scrapling[shell]"
377
+ ```
378
+ - Instalar todo:
379
+ ```bash
380
+ pip install "scrapling[all]"
381
+ ```
382
+ Recuerda que necesitas instalar las dependencias del navegador con `scrapling install` después de cualquiera de estos extras (si no lo hiciste ya).
383
+
384
+ ### Docker
385
+ También puedes instalar una imagen Docker con todos los extras y navegadores con el siguiente comando desde DockerHub:
386
+ ```bash
387
+ docker pull pyd4vinci/scrapling
388
+ ```
389
+ O descárgala desde el registro de GitHub:
390
+ ```bash
391
+ docker pull ghcr.io/d4vinci/scrapling:latest
392
+ ```
393
+ Esta imagen se construye y publica automáticamente usando GitHub Actions y la rama principal del repositorio.
394
+
395
+ ## Contribuir
396
+
397
+ ¡Damos la bienvenida a las contribuciones! Por favor lee nuestras [pautas de contribución](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) antes de comenzar.
398
+
399
+ ## Descargo de Responsabilidad
400
+
401
+ > [!CAUTION]
402
+ > Esta biblioteca se proporciona solo con fines educativos y de investigación. Al usar esta biblioteca, aceptas cumplir con las leyes locales e internacionales de scraping de datos y privacidad. Los autores y contribuyentes no son responsables de ningún mal uso de este software. Respeta siempre los términos de servicio de los sitios web y los archivos robots.txt.
403
+
404
+ ## Licencia
405
+
406
+ Este trabajo está licenciado bajo la Licencia BSD-3-Clause.
407
+
408
+ ## Agradecimientos
409
+
410
+ Este proyecto incluye código adaptado de:
411
+ - Parsel (Licencia BSD)—Usado para el submódulo [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)
412
+
413
+ ---
414
+ <div align="center"><small>Diseñado y elaborado con ❤️ por Karim Shoair.</small></div><br>
Scrapling/docs/README_JP.md ADDED
@@ -0,0 +1,414 @@
1
+ <h1 align="center">
2
+ <a href="https://scrapling.readthedocs.io">
3
+ <picture>
4
+ <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
5
+ <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
6
+ </picture>
7
+ </a>
8
+ <br>
9
+ <small>Effortless Web Scraping for the Modern Web</small>
10
+ </h1>
11
+
12
+ <p align="center">
13
+ <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
14
+ <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
15
+ <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
16
+ <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
17
+ <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
18
+ <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
19
+ <br/>
20
+ <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
21
+ <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
22
+ </a>
23
+ <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
24
+ <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
25
+ </a>
26
+ <br/>
27
+ <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
28
+ <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
29
+ </p>
30
+
31
+ <p align="center">
32
+ <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>選択メソッド</strong></a>
33
+ &middot;
34
+ <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>Fetcherの選び方</strong></a>
35
+ &middot;
36
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>スパイダー</strong></a>
37
+ &middot;
38
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>プロキシローテーション</strong></a>
39
+ &middot;
40
+ <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
41
+ &middot;
42
+ <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>MCPモード</strong></a>
43
+ </p>
44
+
45
+ Scraplingは、単一のリクエストから本格的なクロールまですべてを処理する適応型Web Scrapingフレームワークです。
46
+
47
+ そのパーサーはウェブサイトの変更から学習し、ページが更新されたときに要素を自動的に再配置します。Fetcherはすぐに使えるCloudflare Turnstileなどのアンチボットシステムを回避します。そしてSpiderフレームワークにより、Pause & Resumeや自動Proxy回転機能を備えた並行マルチSessionクロールへとスケールアップできます — すべてわずか数行のPythonで。1つのライブラリ、妥協なし。
48
+
49
+ リアルタイム統計とStreamingによる超高速クロール。Web Scraperによって、Web Scraperと一般ユーザーのために構築され、誰にでも何かがあります。
50
+
51
+ ```python
52
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
53
+ StealthyFetcher.adaptive = True
54
+ p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # レーダーの下でウェブサイトを取得!
55
+ products = p.css('.product', auto_save=True) # ウェブサイトのデザイン変更に耐えるデータをスクレイプ!
56
+ products = p.css('.product', adaptive=True) # 後でウェブサイトの構造が変わったら、`adaptive=True`を渡して見つける!
57
+ ```
58
+ または本格的なクロールへスケールアップ
59
+ ```python
60
+ from scrapling.spiders import Spider, Response
61
+
62
+ class MySpider(Spider):
63
+ name = "demo"
64
+ start_urls = ["https://example.com/"]
65
+
66
+ async def parse(self, response: Response):
67
+ for item in response.css('.product'):
68
+ yield {"title": item.css('h2::text').get()}
69
+
70
+ MySpider().start()
71
+ ```
72
+
73
+
74
+ # プラチナスポンサー
75
+
76
+ # スポンサー
77
+
78
+ <!-- sponsors -->
79
+
80
+ <a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
81
+ <a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
82
+ <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
83
+ <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
84
+ <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
85
+ <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
86
+ <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
87
+ <a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
88
+ <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>
89
+
90
+
91
+ <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
92
+ <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
93
+ <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
94
+
95
+ <!-- /sponsors -->
96
+
97
+ <i><sub>ここに広告を表示したいですか?[こちら](https://github.com/sponsors/D4Vinci)をクリックして、あなたに合ったティアを選択してください!</sub></i>
98
+
99
+ ---
100
+
101
+ ## 主な機能
102
+
103
+ ### Spider — 本格的なクロールフレームワーク
104
+ - 🕷️ **Scrapy風のSpider API**:`start_urls`、async `parse` callback、`Request`/`Response`オブジェクトでSpiderを定義。
105
+ - ⚡ **並行クロール**:設定可能な並行数制限、ドメインごとのスロットリング、ダウンロード遅延。
106
+ - 🔄 **マルチSessionサポート**:HTTPリクエストとステルスヘッドレスブラウザの統一インターフェース — IDによって異なるSessionにリクエストをルーティング。
107
+ - 💾 **Pause & Resume**:Checkpointベースのクロール永続化。Ctrl+Cで正常にシャットダウン;再起動すると中断したところから再開。
108
+ - 📡 **Streamingモード**:`async for item in spider.stream()`でリアルタイム統計とともにスクレイプされたアイテムをStreamingで受信 — UI、パイプライン、長時間実行クロールに最適。
109
+ - 🛡️ **ブロックされたリクエストの検出**:カスタマイズ可能なロジックによるブロックされたリクエストの自動検出とリトライ。
110
+ - 📦 **組み込みエクスポート**:フックや独自のパイプライン、または組み込みのJSON/JSONLで結果をエクスポート。それぞれ`result.items.to_json()` / `result.items.to_jsonl()`を使用。
111
+
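ブロック検出とリトライの考え方を示す最小のスケッチです(仮定に基づく説明用の例であり、Scrapling の実際の実装そのものではありません):

```python
def should_retry(status_code: int, attempt: int, max_retries: int = 3) -> bool:
    # ブロックを示しがちなステータスコードなら、上限回数までリトライする
    return status_code in (403, 429, 503) and attempt < max_retries


def backoff_delay(attempt: int, base: float = 1.0) -> float:
    # 指数バックオフ: 1秒、2秒、4秒、…
    return base * (2 ** attempt)
```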
112
+ ### Sessionサポート付き高度なウェブサイト取得
113
+ - **HTTPリクエスト**:`Fetcher`クラスで高速かつステルスなHTTPリクエスト。ブラウザのTLS fingerprint、ヘッダーを模倣し、HTTP/3を使用可能。
114
+ - **動的読み込み**:PlaywrightのChromiumとGoogle Chromeをサポートする`DynamicFetcher`クラスによる完全なブラウザ自動化で動的ウェブサイトを取得。
115
+ - **アンチボット回避**:`StealthyFetcher`とfingerprint偽装による高度なステルス機能。自動化でCloudflareのTurnstile/Interstitialのすべてのタイプを簡単に回避。
116
+ - **Session管理**:リクエスト間でCookieと状態を管理するための`FetcherSession`、`StealthySession`、`DynamicSession`クラスによる永続的なSessionサポート。
117
+ - **Proxy回転**:すべてのSessionタイプに対応したラウンドロビンまたはカスタム戦略の組み込み`ProxyRotator`、さらにリクエストごとのProxyオーバーライド。
118
+ - **ドメインブロック**:ブラウザベースのFetcherで特定のドメイン(およびそのサブドメイン)へのリクエストをブロック。
119
+ - **asyncサポート**:すべてのFetcherおよび専用asyncSessionクラス全体での完全なasyncサポート。
120
+
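ドメインブロックの判定(サブドメインを含む)は、概念的には次のようなチェックに相当します(説明用のスケッチであり、実際の内部実装ではありません):

```python
def is_blocked(host: str, blocked_domains: list) -> bool:
    # 指定ドメイン自身とそのサブドメインへのリクエストをブロック対象とみなす
    return any(host == d or host.endswith("." + d) for d in blocked_domains)
```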
121
+ ### 適応型スクレイピングとAI統合
122
+ - 🔄 **スマート要素追跡**:インテリジェントな類似性アルゴリズムを使用してウェブサイトの変更後に要素を再配置。
123
+ - 🎯 **スマート柔軟選択**:CSSセレクタ、XPathセレクタ、フィルタベース検索、テキスト検索、正規表現検索など。
124
+ - 🔍 **類似要素の検出**:見つかった要素に類似した要素を自動的に特定。
125
+ - 🤖 **AIと使用するMCPサーバー**:AI支援Web Scrapingとデータ抽出のための組み込みMCPサーバー。MCPサーバーは、AI(Claude/Cursorなど)に渡す前にScraplingを活用してターゲットコンテンツを抽出する強力でカスタムな機能を備えており、操作を高速化し、トークン使用量を最小限に抑えることでコストを削減します。([デモ動画](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
126
+
127
+ ### 高性能で実戦テスト済みのアーキテクチャ
128
+ - 🚀 **超高速**:ほとんどのPythonスクレイピングライブラリを上回る最適化されたパフォーマンス。
129
+ - 🔋 **メモリ効率**:最小のメモリフットプリントのための最適化されたデータ構造と遅延読み込み。
130
+ - ⚡ **高速JSONシリアル化**:標準ライブラリの10倍の速度。
131
+ - 🏗️ **実戦テスト済み**:Scraplingは92%のテストカバレッジと完全な型ヒントカバレッジを備えているだけでなく、過去1年間に数百人のWeb Scraperによって毎日使用されてきました。
132
+
133
+ ### 開発者/Web Scraperにやさしい体験
134
+ - 🎯 **インタラクティブWeb Scraping Shell**:Scrapling統合、ショートカット、curlリクエストをScraplingリクエストに変換したり、ブラウザでリクエスト結果を表示したりするなどの新しいツールを備えたオプションの組み込みIPython Shellで、Web Scrapingスクリプトの開発を加速。
135
+ - 🚀 **ターミナルから直接使用**:オプションで、コードを一行も書かずにScraplingを使用してURLをスクレイプできます!
136
+ - 🛠️ **豊富なナビゲーションAPI**:親、兄弟、子のナビゲーションメソッドによる高度なDOMトラバーサル。
137
+ - 🧬 **強化されたテキスト処理**:組み込みの正規表現、クリーニングメソッド、最適化された文字列操作。
138
+ - 📝 **自動セレクタ生成**:任意の要素に対して堅牢なCSS/XPathセレクタを生成。
139
+ - 🔌 **馴染みのあるAPI**:Scrapy/Parselで使用されている同じ疑似要素を持つScrapy/BeautifulSoupに似た設計。
140
+ - 📘 **完全な型カバレッジ**:優れたIDEサポートとコード補完のための完全な型ヒント。コードベース全体が変更のたびに**PyRight**と**MyPy**で自動的にスキャンされます。
141
+ - 🔋 **すぐに使えるDockerイメージ**:各リリースで、すべてのブラウザを含むDockerイメージが自動的にビルドおよびプッシュされます。
142
+
143
+ ## はじめに
144
+
145
+ 深く掘り下げずに、Scraplingにできることの簡単な概要をお見せしましょう。
146
+
147
+ ### 基本的な使い方
148
+ Sessionサポート付きHTTPリクエスト
149
+ ```python
150
+ from scrapling.fetchers import Fetcher, FetcherSession
151
+
152
+ with FetcherSession(impersonate='chrome') as session: # ChromeのTLS fingerprintの最新バージョンを使用
153
+ page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
154
+ quotes = page.css('.quote .text::text').getall()
155
+
156
+ # または一回限りのリクエストを使用
157
+ page = Fetcher.get('https://quotes.toscrape.com/')
158
+ quotes = page.css('.quote .text::text').getall()
159
+ ```
160
+ 高度なステルスモード
161
+ ```python
162
+ from scrapling.fetchers import StealthyFetcher, StealthySession
163
+
164
+ with StealthySession(headless=True, solve_cloudflare=True) as session: # 完了するまでブラウザを開いたままにする
165
+ page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
166
+ data = page.css('#padded_content a').getall()
167
+
168
+ # または一回限りのリクエストスタイル、このリクエストのためにブラウザを開き、完了後に閉じる
169
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
170
+ data = page.css('#padded_content a').getall()
171
+ ```
172
+ 完全なブラウザ自動化
173
+ ```python
174
+ from scrapling.fetchers import DynamicFetcher, DynamicSession
175
+
176
+ with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # 完了するまでブラウザを開いたままにする
177
+ page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
178
+ data = page.xpath('//span[@class="text"]/text()').getall() # お好みであればXPathセレクタを使用
179
+
180
+ # または一回限りのリクエストスタイル、このリクエストのためにブラウザを開き、完了後に閉じる
181
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
182
+ data = page.css('.quote .text::text').getall()
183
+ ```
184
+
185
+ ### Spider
186
+ 並行リクエスト、複数のSessionタイプ、Pause & Resumeを備えた本格的なクローラーを構築:
187
+ ```python
188
+ from scrapling.spiders import Spider, Request, Response
189
+
190
+ class QuotesSpider(Spider):
191
+ name = "quotes"
192
+ start_urls = ["https://quotes.toscrape.com/"]
193
+ concurrent_requests = 10
194
+
195
+ async def parse(self, response: Response):
196
+ for quote in response.css('.quote'):
197
+ yield {
198
+ "text": quote.css('.text::text').get(),
199
+ "author": quote.css('.author::text').get(),
200
+ }
201
+
202
+ next_page = response.css('.next a')
203
+ if next_page:
204
+ yield response.follow(next_page[0].attrib['href'])
205
+
206
+ result = QuotesSpider().start()
207
+ print(f"{len(result.items)}件の引用をスクレイプしました")
208
+ result.items.to_json("quotes.json")
209
+ ```
210
+ 単一のSpiderで複数のSessionタイプを使用:
211
+ ```python
212
+ from scrapling.spiders import Spider, Request, Response
213
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession
214
+
215
+ class MultiSessionSpider(Spider):
216
+ name = "multi"
217
+ start_urls = ["https://example.com/"]
218
+
219
+ def configure_sessions(self, manager):
220
+ manager.add("fast", FetcherSession(impersonate="chrome"))
221
+ manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
222
+
223
+ async def parse(self, response: Response):
224
+ for link in response.css('a::attr(href)').getall():
225
+ # 保護されたページはステルスSessionを通してルーティング
226
+ if "protected" in link:
227
+ yield Request(link, sid="stealth")
228
+ else:
229
+ yield Request(link, sid="fast", callback=self.parse) # 明示的なcallback
230
+ ```
231
+ Checkpointを使用して長時間のクロールをPause & Resume:
232
+ ```python
233
+ QuotesSpider(crawldir="./crawl_data").start()
234
+ ```
235
+ Ctrl+Cを押すと正常に一時停止し、進捗は自動的に保存されます。後でSpiderを再度起動する際に同じ`crawldir`を渡すと、中断したところから再開します。
236
+
237
+ ### 高度なパースとナビゲーション
238
+ ```python
239
+ from scrapling.fetchers import Fetcher
240
+
241
+ # 豊富な要素選択とナビゲーション
242
+ page = Fetcher.get('https://quotes.toscrape.com/')
243
+
244
+ # 複数の選択メソッドで引用を取得
245
+ quotes = page.css('.quote') # CSSセレクタ
246
+ quotes = page.xpath('//div[@class="quote"]') # XPath
247
+ quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoupスタイル
248
+ # 以下と同じ
249
+ quotes = page.find_all('div', class_='quote')
250
+ quotes = page.find_all(['div'], class_='quote')
251
+ quotes = page.find_all(class_='quote') # など...
252
+ # テキスト内容で要素を検索
253
+ quotes = page.find_by_text('quote', tag='div')
254
+
255
+ # 高度なナビゲーション
256
+ quote_text = page.css('.quote')[0].css('.text::text').get()
257
+ quote_text = page.css('.quote').css('.text::text').getall() # チェーンセレクタ
258
+ first_quote = page.css('.quote')[0]
259
+ author = first_quote.next_sibling.css('.author::text')
260
+ parent_container = first_quote.parent
261
+
262
+ # 要素の関連性と類似性
263
+ similar_elements = first_quote.find_similar()
264
+ below_elements = first_quote.below_elements()
265
+ ```
266
+ ウェブサイトを取得せずにパーサーをすぐに使用することもできます:
267
+ ```python
268
+ from scrapling.parser import Selector
269
+
270
+ page = Selector("<html>...</html>")
271
+ ```
272
+ まったく同じ方法で動作します!
273
+
274
+ ### 非同期Session管理の例
275
+ ```python
276
+ import asyncio
277
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
278
+
279
+ async with FetcherSession(http3=True) as session: # `FetcherSession`はコンテキストアウェアで、同期/非同期両方のパターンで動作可能
280
+ page1 = session.get('https://quotes.toscrape.com/')
281
+ page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
282
+
283
+ # 非同期Sessionの使用
284
+ async with AsyncStealthySession(max_pages=2) as session:
285
+ tasks = []
286
+ urls = ['https://example.com/page1', 'https://example.com/page2']
287
+
288
+ for url in urls:
289
+ task = session.fetch(url)
290
+ tasks.append(task)
291
+
292
+ print(session.get_pool_stats()) # オプション - ブラウザタブプールのステータス(ビジー/フリー/エラー)
293
+ results = await asyncio.gather(*tasks)
294
+ print(session.get_pool_stats())
295
+ ```
296
+
297
+ ## CLIとインタラクティブShell
298
+
299
+ Scraplingには強力なコマンドラインインターフェースが含まれています:
300
+
301
+ [![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
302
+
303
+ インタラクティブWeb Scraping Shellを起動
304
+ ```bash
305
+ scrapling shell
306
+ ```
307
+ プログラミングせずに直接ページをファイルに抽出(デフォルトで`body`タグ内のコンテンツを抽出)。出力ファイルが`.txt`で終わる場合、ターゲットのテキストコンテンツが抽出されます。`.md`で終わる場合、HTMLコンテンツのMarkdown表現になります。`.html`で終わる場合、HTMLコンテンツそのものになります。
308
+ ```bash
309
+ scrapling extract get 'https://example.com' content.md
310
+ scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # CSSセレクタ'#fromSkipToProducts'に一致するすべての要素
311
+ scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
312
+ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
313
+ ```
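出力形式が拡張子で決まる挙動は、概念的には次のような分岐です(Scrapling内部の実装ではなく、説明用のスケッチで、`output_format`という関数名は仮定です)。

```python
from pathlib import Path

def output_format(filename: str) -> str:
    """出力ファイルの拡張子から抽出形式を決める(説明用の仮定の関数)。"""
    suffix = Path(filename).suffix.lower()
    return {
        ".txt": "テキストコンテンツ",
        ".md": "Markdown表現",
        ".html": "HTMLそのもの",
    }.get(suffix, "不明な形式")

print(output_format("content.md"))  # Markdown表現
```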
314
+
315
+ > [!NOTE]
316
+ > MCPサーバーやインタラクティブWeb Scraping Shellなど、他にも多くの追加機能がありますが、このページは簡潔に保ちたいと思います。完全なドキュメントは[こちら](https://scrapling.readthedocs.io/en/latest/)をご覧ください
317
+
318
+ ## パフォーマンスベンチマーク
319
+
320
+ Scraplingは強力であるだけでなく、超高速です。以下のベンチマークは、Scraplingのパーサーを他の人気ライブラリの最新バージョンと比較しています。
321
+
322
+ ### テキスト抽出速度テスト(5000個のネストされた要素)
323
+
324
+ | # | ライブラリ | 時間(ms) | vs Scrapling |
325
+ |---|:-----------------:|:---------:|:------------:|
326
+ | 1 | Scrapling | 2.02 | 1.0x |
327
+ | 2 | Parsel/Scrapy | 2.04 | 1.01x |
328
+ | 3 | Raw Lxml | 2.54 | 1.257x |
329
+ | 4 | PyQuery | 24.17 | ~12x |
330
+ | 5 | Selectolax | 82.63 | ~41x |
331
+ | 6 | MechanicalSoup | 1549.71 | ~767.1x |
332
+ | 7 | BS4 with Lxml | 1584.31 | ~784.3x |
333
+ | 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
334
+
335
+
336
+ ### 要素類似性とテキスト検索のパフォーマンス
337
+
338
+ Scraplingの適応型要素検索機能は代替手段を大幅に上回ります:
339
+
340
+ | ライブラリ | 時間(ms) | vs Scrapling |
341
+ |-------------|:---------:|:------------:|
342
+ | Scrapling | 2.39 | 1.0x |
343
+ | AutoScraper | 12.45 | 5.209x |
344
+
345
+
346
+ > すべてのベンチマークは100回以上の実行の平均を表します。方法論については[benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py)を参照してください。
347
+
348
+ ## インストール
349
+
350
+ ScraplingにはPython 3.10以上が必要です:
351
+
352
+ ```bash
353
+ pip install scrapling
354
+ ```
355
+
356
+ このインストールにはパーサーエンジンとその依存関係のみが含まれており、Fetcherやコマンドライン依存関係は含まれていません。
357
+
358
+ ### オプションの依存関係
359
+
360
+ 1. 以下の追加機能、Fetcher、またはそれらのクラスのいずれかを使用する場合は、Fetcherの依存関係とブラウザの依存関係を次のようにインストールする必要があります:
361
+ ```bash
362
+ pip install "scrapling[fetchers]"
363
+
364
+ scrapling install
365
+ ```
366
+
367
+ これにより、すべてのブラウザ、およびそれらのシステム依存関係とfingerprint操作依存関係がダウンロードされます。
368
+
369
+ 2. 追加機能:
370
+ - MCPサーバー機能をインストール:
371
+ ```bash
372
+ pip install "scrapling[ai]"
373
+ ```
374
+ - Shell機能(Web Scraping Shellと`extract`コマンド)をインストール:
375
+ ```bash
376
+ pip install "scrapling[shell]"
377
+ ```
378
+ - すべてをインストール:
379
+ ```bash
380
+ pip install "scrapling[all]"
381
+ ```
382
+ これらの追加機能のいずれかの後(まだインストールしていない場合)、`scrapling install`でブラウザの依存関係をインストールする必要があることを忘れないでください
383
+
384
+ ### Docker
385
+ DockerHubから次のコマンドですべての追加機能とブラウザを含むDockerイメージをインストールすることもできます:
386
+ ```bash
387
+ docker pull pyd4vinci/scrapling
388
+ ```
389
+ またはGitHubレジストリからダウンロード:
390
+ ```bash
391
+ docker pull ghcr.io/d4vinci/scrapling:latest
392
+ ```
393
+ このイメージは、GitHub Actionsとリポジトリのメインブランチを使用して自動的にビルドおよびプッシュされます。
394
+
395
+ ## 貢献
396
+
397
+ 貢献を歓迎します!始める前に[貢献ガイドライン](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md)をお読みください。
398
+
399
+ ## 免責事項
400
+
401
+ > [!CAUTION]
402
+ > このライブラリは教育および研究目的のみで提供されています。このライブラリを使用することにより、地域および国際的なデータスクレイピングおよびプライバシー法に準拠することに同意したものとみなされます。著者および貢献者は、このソフトウェアの誤用について責任を負いません。常にウェブサイトの利用規約とrobots.txtファイルを尊重してください。
403
+
404
+ ## ライセンス
405
+
406
+ この作品はBSD-3-Clauseライセンスの下でライセンスされています。
407
+
408
+ ## 謝辞
409
+
410
+ このプロジェクトには次から適応されたコードが含まれています:
411
+ - Parsel(BSDライセンス)— [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py)サブモジュールに使用
412
+
413
+ ---
414
+ <div align="center"><small>Karim Shoairによって❤️でデザインおよび作成されました。</small></div><br>
Scrapling/docs/README_RU.md ADDED
@@ -0,0 +1,414 @@
1
+ <h1 align="center">
2
+ <a href="https://scrapling.readthedocs.io">
3
+ <picture>
4
+ <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_dark.svg?sanitize=true">
5
+ <img alt="Scrapling Poster" src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/docs/assets/cover_light.svg?sanitize=true">
6
+ </picture>
7
+ </a>
8
+ <br>
9
+ <small>Effortless Web Scraping for the Modern Web</small>
10
+ </h1>
11
+
12
+ <p align="center">
13
+ <a href="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml" alt="Tests">
14
+ <img alt="Tests" src="https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg"></a>
15
+ <a href="https://badge.fury.io/py/Scrapling" alt="PyPI version">
16
+ <img alt="PyPI version" src="https://badge.fury.io/py/Scrapling.svg"></a>
17
+ <a href="https://pepy.tech/project/scrapling" alt="PyPI Downloads">
18
+ <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/scrapling?period=total&units=INTERNATIONAL_SYSTEM&left_color=GREY&right_color=GREEN&left_text=Downloads"></a>
19
+ <br/>
20
+ <a href="https://discord.gg/EMgGbDceNQ" alt="Discord" target="_blank">
21
+ <img alt="Discord" src="https://img.shields.io/discord/1360786381042880532?style=social&logo=discord&link=https%3A%2F%2Fdiscord.gg%2FEMgGbDceNQ">
22
+ </a>
23
+ <a href="https://x.com/Scrapling_dev" alt="X (formerly Twitter)">
24
+ <img alt="X (formerly Twitter) Follow" src="https://img.shields.io/twitter/follow/Scrapling_dev?style=social&logo=x&link=https%3A%2F%2Fx.com%2FScrapling_dev">
25
+ </a>
26
+ <br/>
27
+ <a href="https://pypi.org/project/scrapling/" alt="Supported Python versions">
28
+ <img alt="Supported Python versions" src="https://img.shields.io/pypi/pyversions/scrapling.svg"></a>
29
+ </p>
30
+
31
+ <p align="center">
32
+ <a href="https://scrapling.readthedocs.io/en/latest/parsing/selection/"><strong>Методы выбора</strong></a>
33
+ &middot;
34
+ <a href="https://scrapling.readthedocs.io/en/latest/fetching/choosing/"><strong>Выбор Fetcher</strong></a>
35
+ &middot;
36
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/architecture.html"><strong>Spider'ы</strong></a>
37
+ &middot;
38
+ <a href="https://scrapling.readthedocs.io/en/latest/spiders/proxy-blocking.html"><strong>Ротация прокси</strong></a>
39
+ &middot;
40
+ <a href="https://scrapling.readthedocs.io/en/latest/cli/overview/"><strong>CLI</strong></a>
41
+ &middot;
42
+ <a href="https://scrapling.readthedocs.io/en/latest/ai/mcp-server/"><strong>Режим MCP</strong></a>
43
+ </p>
44
+
45
+ Scrapling — это адаптивный фреймворк для Web Scraping, который берёт на себя всё: от одного запроса до полномасштабного обхода сайтов.
46
+
47
+ Его парсер учится на изменениях сайтов и автоматически перемещает ваши элементы при обновлении страниц. Его Fetcher'ы обходят анти-бот системы вроде Cloudflare Turnstile прямо из коробки. А его Spider-фреймворк позволяет масштабироваться до параллельных, многосессионных обходов с Pause & Resume и автоматической ротацией Proxy — и всё это в нескольких строках Python. Одна библиотека, без компромиссов.
48
+
49
+ Молниеносно быстрые обходы с отслеживанием статистики в реальном времени и Streaming. Создано веб-скраперами для веб-скраперов и обычных пользователей — здесь есть что-то для каждого.
50
+
51
+ ```python
52
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
53
+ StealthyFetcher.adaptive = True
54
+ p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Загрузите сайт незаметно!
55
+ products = p.css('.product', auto_save=True) # Скрапьте данные, которые переживут изменения дизайна сайта!
56
+ products = p.css('.product', adaptive=True) # Позже, если структура сайта изменится, передайте `adaptive=True`, чтобы найти их!
57
+ ```
58
+ Или масштабируйте до полного обхода
59
+ ```python
60
+ from scrapling.spiders import Spider, Response
61
+
62
+ class MySpider(Spider):
63
+ name = "demo"
64
+ start_urls = ["https://example.com/"]
65
+
66
+ async def parse(self, response: Response):
67
+ for item in response.css('.product'):
68
+ yield {"title": item.css('h2::text').get()}
69
+
70
+ MySpider().start()
71
+ ```
72
+
73
+
74
+ # Платиновые спонсоры
75
+
76
+ # Спонсоры
77
+
78
+ <!-- sponsors -->
79
+
80
+ <a href="https://www.scrapeless.com/en?utm_source=official&utm_term=scrapling" target="_blank" title="Effortless Web Scraping Toolkit for Business and Developers"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/scrapeless.jpg"></a>
81
+ <a href="https://www.thordata.com/?ls=github&lk=github" target="_blank" title="Unblockable proxies and scraping infrastructure, delivering real-time, reliable web data to power AI models and workflows."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/thordata.jpg"></a>
82
+ <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png"></a>
83
+ <a href="https://serpapi.com/?utm_source=scrapling" target="_blank" title="Scrape Google and other search engines with SerpApi"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/SerpApi.png"></a>
84
+ <a href="https://visit.decodo.com/Dy6W0b" target="_blank" title="Try the Most Efficient Residential Proxies for Free"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/decodo.png"></a>
85
+ <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png"></a>
86
+ <a href="https://hasdata.com/?utm_source=github&utm_medium=banner&utm_campaign=D4Vinci" target="_blank" title="The web scraping service that actually beats anti-bot systems!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/hasdata.png"></a>
87
+ <a href="https://proxyempire.io/" target="_blank" title="Collect The Data Your Project Needs with the Best Residential Proxies"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxyEmpire.png"></a>
88
+ <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png"></a>
89
+
90
+
91
+ <a href="https://www.swiftproxy.net/" target="_blank" title="Unlock Reliable Proxy Services with Swiftproxy!"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/swiftproxy.png"></a>
92
+ <a href="https://www.rapidproxy.io/?ref=d4v" target="_blank" title="Affordable Access to the Proxy World – bypass CAPTCHAs blocks, and avoid additional costs."><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/rapidproxy.jpg"></a>
93
+ <a href="https://browser.cash/?utm_source=D4Vinci&utm_medium=referral" target="_blank" title="Browser Automation & AI Browser Agent Platform"><img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/browserCash.png"></a>
94
+
95
+ <!-- /sponsors -->
96
+
97
+ <i><sub>Хотите показать здесь свою рекламу? Нажмите [здесь](https://github.com/sponsors/D4Vinci) и выберите подходящий вам уровень!</sub></i>
98
+
99
+ ---
100
+
101
+ ## Ключевые особенности
102
+
103
+ ### Spider'ы — полноценный фреймворк для обхода сайтов
104
+ - 🕷️ **Scrapy-подобный Spider API**: Определяйте Spider'ов с `start_urls`, async `parse` callback'ами и объектами `Request`/`Response`.
105
+ - ⚡ **Параллельный обход**: Настраиваемые лимиты параллелизма, ограничение скорости по домену и задержки загрузки.
106
+ - 🔄 **Поддержка нескольких сессий**: Единый интерфейс для HTTP-запросов и скрытных headless-браузеров в одном Spider — маршрутизируйте запросы к разным сессиям по ID.
107
+ - 💾 **Pause & Resume**: Persistence обхода на основе Checkpoint'ов. Нажмите Ctrl+C для мягкой остановки; перезапустите, чтобы продолжить с того места, где вы остановились.
108
+ - 📡 **Режим Streaming**: Стримьте извлечённые элементы по мере их поступления через `async for item in spider.stream()` со статистикой в реальном времени — идеально для UI, конвейеров и длительных обходов.
109
+ - 🛡️ **Обнаружение заблокированных запросов**: Автоматическое обнаружение и повторная отправка заблокированных запросов с настраиваемой логикой.
110
+ - 📦 **Встроенный экспорт**: Экспортируйте результаты через хуки и собственный конвейер или встроенный JSON/JSONL с `result.items.to_json()` / `result.items.to_jsonl()` соответственно.
111
+
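Идиому потребления результатов в режиме Streaming (`async for item in spider.stream()`) можно проиллюстрировать следующим наброском. Это не API Scrapling, а минимальный пример на чистом asyncio, показывающий сам паттерн async-генератора (имена функций и данные условные).

```python
import asyncio

async def stream_items():
    """Имитация потока извлечённых элементов (условные данные)."""
    for i in range(3):
        await asyncio.sleep(0)  # точка переключения, как при реальном сетевом I/O
        yield {"id": i}

async def consume():
    collected = []
    async for item in stream_items():  # элементы обрабатываются по мере поступления
        collected.append(item)
    return collected

items = asyncio.run(consume())
```

Тот же паттерн подходит для передачи элементов в UI или конвейер обработки, не дожидаясь конца обхода.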
112
+ ### Продвинутая загрузка сайтов с поддержкой Session
113
+ - **HTTP-запросы**: Быстрые и скрытные HTTP-запросы с классом `Fetcher`. Может имитировать TLS fingerprint браузера, заголовки и использовать HTTP/3.
114
+ - **Динамическая загрузка**: Загрузка динамических сайтов с полной автоматизацией браузера через класс `DynamicFetcher`, поддерживающий Chromium от Playwright и Google Chrome.
115
+ - **Обход анти-ботов**: Расширенные возможности скрытности с `StealthyFetcher` и подмену fingerprint'ов. Может легко обойти все типы Cloudflare Turnstile/Interstitial с помощью автоматизации.
116
+ - **Управление сессиями**: Поддержка постоянных сессий с классами `FetcherSession`, `StealthySession` и `DynamicSession` для управления cookie и состоянием между запросами.
117
+ - **Ротация Proxy**: Встроенный `ProxyRotator` с циклической или пользовательскими стратегиями для всех типов сессий, а также переопределение Proxy для каждого запроса.
118
+ - **Блокировка доменов**: Блокируйте запросы к определённым доменам (и их поддоменам) в браузерных Fetcher'ах.
119
+ - **Поддержка async**: Полная async-поддержка во всех Fetcher'ах и выделенных async-классах сессий.
120
+
121
+ ### Адаптивный скрапинг и интеграция с ИИ
122
+ - 🔄 **Умное отслеживание элементов**: Перемещайте элементы после изменений сайта с помощью интеллектуальных алгоритмов подобия.
123
+ - 🎯 **Умный гибкий выбор**: CSS-селекторы, XPath-селекторы, поиск на основе фильтров, текстовый поиск, поиск по регулярным выражениям и многое другое.
124
+ - 🔍 **Поиск похожих элементов**: Автоматически находите элементы, похожие на найденные.
125
+ - 🤖 **MCP-сервер для использования с ИИ**: Встроенный MCP-сервер для Web Scraping с помощью ИИ и извлечения данных. MCP-сервер обладает мощными пользовательскими возможностями, которые используют Scrapling для извлечения целевого контента перед передачей его ИИ (Claude/Cursor/и т.д.), тем самым ускоряя операции и снижая затраты за счёт минимизации использования токенов. ([демо-видео](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
126
+
127
+ ### Высокопроизводительная и проверенная в боях архитектура
128
+ - 🚀 **Молниеносная скорость**: Оптимизированная производительность, превосходящая большинство Python-библиотек для скрапинга.
129
+ - 🔋 **Эффективное использование памяти**: Оптимизированные структуры данных и ленивая загрузка для минимального потребления памяти.
130
+ - ⚡ **Быстрая сериализация JSON**: В 10 раз быстрее стандартной библиотеки.
131
+ - 🏗️ **Проверено в боях**: Scrapling имеет не только 92% покрытия тестами и полное покрытие type hints, но и ежедневно использовался сотнями веб-скраперов в течение последнего года.
132
+
133
+ ### Удобный для разработчиков/веб-скраперов опыт
134
+ - 🎯 **Интерактивная Web Scraping Shell**: Опциональная встроенная IPython-оболочка с интеграцией Scrapling, ярлыками и новыми инструментами для ускорения разработки скриптов Web Scraping, такими как преобразование curl-запросов в запросы Scrapling и просмотр результатов запросов в браузере.
135
+ - 🚀 **Используйте прямо из терминала**: При желании вы можете использовать Scrapling для скрапинга URL без написания ни одной строки кода!
136
+ - 🛠️ **Богатый API навигации**: Расширенный обход DOM с методами навигации по родителям, братьям и детям.
137
+ - 🧬 **Улучшенная обработка текста**: Встроенные регулярные выражения, методы очистки и оптимизированные операции со строками.
138
+ - 📝 **Автоматическая генерация селекторов**: Генерация надёжных CSS/XPath-селекторов для любого элемента.
139
+ - 🔌 **Знакомый API**: Похож на Scrapy/BeautifulSoup с теми же псевдоэлементами, используемыми в Scrapy/Parsel.
140
+ - 📘 **Полное покрытие типами**: Полные type hints для отличной поддержки IDE и автодополнения кода. Вся кодовая база автоматически проверяется **PyRight** и **MyPy** при каждом изменении.
141
+ - 🔋 **Готовый Docker-образ**: С каждым релизом автоматически создаётся и публикуется Docker-образ, содержащий все браузеры.
142
+
143
+ ## Начало работы
144
+
145
+ Давайте кратко покажем, на что способен Scrapling, без глубокого погружения.
146
+
147
+ ### Базовое использование
148
+ HTTP-запросы с поддержкой Session
149
+ ```python
150
+ from scrapling.fetchers import Fetcher, FetcherSession
151
+
152
+ with FetcherSession(impersonate='chrome') as session: # Используйте последнюю версию TLS fingerprint Chrome
153
+ page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
154
+ quotes = page.css('.quote .text::text').getall()
155
+
156
+ # Или используйте одноразовые запросы
157
+ page = Fetcher.get('https://quotes.toscrape.com/')
158
+ quotes = page.css('.quote .text::text').getall()
159
+ ```
160
+ Расширенный режим скрытности
161
+ ```python
162
+ from scrapling.fetchers import StealthyFetcher, StealthySession
163
+
164
+ with StealthySession(headless=True, solve_cloudflare=True) as session: # Держите браузер открытым, пока не закончите
165
+ page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
166
+ data = page.css('#padded_content a').getall()
167
+
168
+ # Или используйте стиль одноразового запроса — открывает браузер для этого запроса, затем закрывает его после завершения
169
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
170
+ data = page.css('#padded_content a').getall()
171
+ ```
172
+ Полная автоматизация браузера
173
+ ```python
174
+ from scrapling.fetchers import DynamicFetcher, DynamicSession
175
+
176
+ with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Держите браузер открытым, пока не закончите
177
+ page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
178
+ data = page.xpath('//span[@class="text"]/text()').getall() # XPath-селектор, если вы предпочитаете его
179
+
180
+ # Или используйте стиль одноразового запроса — открывает браузер для этого запроса, затем закрывает его после завершения
181
+ page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
182
+ data = page.css('.quote .text::text').getall()
183
+ ```
184
+
185
+ ### Spider'ы
186
+ Создавайте полноценные обходчики с параллельными запросами, несколькими типами сессий и Pause & Resume:
187
+ ```python
188
+ from scrapling.spiders import Spider, Request, Response
189
+
190
+ class QuotesSpider(Spider):
191
+ name = "quotes"
192
+ start_urls = ["https://quotes.toscrape.com/"]
193
+ concurrent_requests = 10
194
+
195
+ async def parse(self, response: Response):
196
+ for quote in response.css('.quote'):
197
+ yield {
198
+ "text": quote.css('.text::text').get(),
199
+ "author": quote.css('.author::text').get(),
200
+ }
201
+
202
+ next_page = response.css('.next a')
203
+ if next_page:
204
+ yield response.follow(next_page[0].attrib['href'])
205
+
206
+ result = QuotesSpider().start()
207
+ print(f"Извлечено {len(result.items)} цитат")
208
+ result.items.to_json("quotes.json")
209
+ ```
210
+ Используйте несколько типов сессий в одном Spider:
211
+ ```python
212
+ from scrapling.spiders import Spider, Request, Response
213
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession
214
+
215
+ class MultiSessionSpider(Spider):
216
+ name = "multi"
217
+ start_urls = ["https://example.com/"]
218
+
219
+ def configure_sessions(self, manager):
220
+ manager.add("fast", FetcherSession(impersonate="chrome"))
221
+ manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
222
+
223
+ async def parse(self, response: Response):
224
+ for link in response.css('a::attr(href)').getall():
225
+ # Направляйте защищённые страницы через stealth-сессию
226
+ if "protected" in link:
227
+ yield Request(link, sid="stealth")
228
+ else:
229
+ yield Request(link, sid="fast", callback=self.parse) # явный callback
230
+ ```
231
+ Приостанавливайте и возобновляйте длительные обходы с помощью Checkpoint'ов, запуская Spider следующим образом:
232
+ ```python
233
+ QuotesSpider(crawldir="./crawl_data").start()
234
+ ```
235
+ Нажмите Ctrl+C для мягкой остановки — прогресс сохраняется автоматически. Позже, когда вы снова запустите Spider, передайте тот же `crawldir`, и он продолжит с того места, где остановился.
236
+
237
+ ### Продвинутый парсинг и навигация
238
+ ```python
239
+ from scrapling.fetchers import Fetcher
240
+
241
+ # Богатый выбор элементов и навигация
242
+ page = Fetcher.get('https://quotes.toscrape.com/')
243
+
244
+ # Получение цитат различными методами выбора
245
+ quotes = page.css('.quote') # CSS-селектор
246
+ quotes = page.xpath('//div[@class="quote"]') # XPath
247
+ quotes = page.find_all('div', {'class': 'quote'}) # В стиле BeautifulSoup
248
+ # То же самое, что
249
+ quotes = page.find_all('div', class_='quote')
250
+ quotes = page.find_all(['div'], class_='quote')
251
+ quotes = page.find_all(class_='quote') # и так далее...
252
+ # Найти элемент по текстовому содержимому
253
+ quotes = page.find_by_text('quote', tag='div')
254
+
255
+ # Продвинутая навигация
256
+ quote_text = page.css('.quote')[0].css('.text::text').get()
257
+ quote_text = page.css('.quote').css('.text::text').getall() # Цепочка селекторов
258
+ first_quote = page.css('.quote')[0]
259
+ author = first_quote.next_sibling.css('.author::text')
260
+ parent_container = first_quote.parent
261
+
262
+ # Связи элементов и подобие
263
+ similar_elements = first_quote.find_similar()
264
+ below_elements = first_quote.below_elements()
265
+ ```
266
+ Вы можете использовать парсер напрямую, если не хотите загружать сайты, как показано ниже:
267
+ ```python
268
+ from scrapling.parser import Selector
269
+
270
+ page = Selector("<html>...</html>")
271
+ ```
272
+ И он работает точно так же!
273
+
274
+ ### Примеры async Session
275
+ ```python
276
+ import asyncio
277
+ from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
278
+
279
+ async with FetcherSession(http3=True) as session: # `FetcherSession` контекстно-осведомлён и может работать как в sync, так и в async-режимах
280
+ page1 = session.get('https://quotes.toscrape.com/')
281
+ page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
282
+
283
+ # Использование async-сессии
284
+ async with AsyncStealthySession(max_pages=2) as session:
285
+ tasks = []
286
+ urls = ['https://example.com/page1', 'https://example.com/page2']
287
+
288
+ for url in urls:
289
+ task = session.fetch(url)
290
+ tasks.append(task)
291
+
292
+ print(session.get_pool_stats()) # Опционально — статус пула вкладок браузера (занят/свободен/ошибка)
293
+ results = await asyncio.gather(*tasks)
294
+ print(session.get_pool_stats())
295
+ ```
296
+
297
+ ## CLI и интерактивная Shell
298
+
299
+ Scrapling включает мощный интерфейс командной строки:
300
+
301
+ [![asciicast](https://asciinema.org/a/736339.svg)](https://asciinema.org/a/736339)
302
+
303
+ Запустить интерактивную Web Scraping Shell
304
+ ```bash
305
+ scrapling shell
306
+ ```
307
+ Извлечь страницы в файл напрямую без программирования (по умолчанию извлекает содержимое внутри тега `body`). Если выходной файл заканчивается на `.txt`, будет извлечено текстовое содержимое цели. Если заканчивается на `.md`, это будет Markdown-представление HTML-содержимого; если заканчивается на `.html`, это будет само HTML-содержимое.
308
+ ```bash
309
+ scrapling extract get 'https://example.com' content.md
310
+ scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # Все элементы, соответствующие CSS-селектору '#fromSkipToProducts'
311
+ scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
312
+ scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflare
313
+ ```
314
+
315
+ > [!NOTE]
316
+ > Есть множество дополнительных возможностей, включая MCP-сервер и интерактивную Web Scraping Shell, но мы хотим сохранить эту страницу краткой. Ознакомьтесь с полной документацией [здесь](https://scrapling.readthedocs.io/en/latest/)
317
+
318
+ ## Тесты производительности
319
+
320
+ Scrapling не только мощный — он ещё и невероятно быстрый. Следующие тесты производительности сравнивают парсер Scrapling с последними версиями других популярных библиотек.
321
+
322
+ ### Тест скорости извлечения текста (5000 вложенных элементов)
323
+
324
+ | # | Библиотека | Время (мс) | vs Scrapling |
325
+ |---|:-----------------:|:----------:|:------------:|
326
+ | 1 | Scrapling | 2.02 | 1.0x |
327
+ | 2 | Parsel/Scrapy | 2.04 | 1.01x |
328
+ | 3 | Raw Lxml | 2.54 | 1.257x |
329
+ | 4 | PyQuery | 24.17 | ~12x |
330
+ | 5 | Selectolax | 82.63 | ~41x |
331
+ | 6 | MechanicalSoup | 1549.71 | ~767.1x |
332
+ | 7 | BS4 with Lxml | 1584.31 | ~784.3x |
333
+ | 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
334
+
335
+
336
+ ### Производительность подобия элементов и текстового поиска
337
+
338
+ Возможности адаптивного поиска элементов Scrapling значительно превосходят альтернативы:
339
+
340
+ | Библиотека | Время (мс) | vs Scrapling |
341
+ |-------------|:----------:|:------------:|
342
+ | Scrapling | 2.39 | 1.0x |
343
+ | AutoScraper | 12.45 | 5.209x |
344
+
345
+
346
+ > Все тесты производительности представляют собой средние значения более 100 запусков. См. [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) для методологии.
347
+
348
+ ## Установка
349
+
350
+ Scrapling требует Python 3.10 или выше:
351
+
352
+ ```bash
353
+ pip install scrapling
354
+ ```
355
+
356
+ Эта установка включает только движок парсера и его зависимости, без каких-либо Fetcher'ов или зависимостей командной строки.
357
+
358
+ ### Опциональные зависимости
359
+
360
+ 1. Если вы собираетесь использовать какие-либо из дополнительных возможностей ниже, Fetcher'ы или их классы, вам необходимо установить зависимости Fetcher'ов и браузеров следующим образом:
361
+ ```bash
362
+ pip install "scrapling[fetchers]"
363
+
364
+ scrapling install
365
+ ```
366
+
367
+ Это загрузит все браузеры вместе с их системными зависимостями и зависимостями для манипуляции fingerprint'ами.
368
+
369
+ 2. Дополнительные возможности:
370
+ - Установить функцию MCP-сервера:
371
+ ```bash
372
+ pip install "scrapling[ai]"
373
+ ```
374
+ - Установить функции Shell (Web Scraping Shell и команда `extract`):
375
+ ```bash
376
+ pip install "scrapling[shell]"
377
+ ```
378
+ - Установить всё:
379
+ ```bash
380
+ pip install "scrapling[all]"
381
+ ```
382
+ Помните, что вам нужно установить зависимости браузеров с помощью `scrapling install` после любого из этих дополнений (если вы ещё этого не сделали)
383
+
### Docker
You can also install a Docker image with all the extras and browsers using the following command from DockerHub:
```bash
docker pull pyd4vinci/scrapling
```
Or download it from the GitHub registry:
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
This image is built and published automatically through GitHub Actions from the repository's main branch.

## Contributing

We welcome contributions! Please read our [contributing guidelines](https://github.com/D4Vinci/Scrapling/blob/main/CONTRIBUTING.md) before getting started.

## Disclaimer

> [!CAUTION]
> This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws on data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. Always respect websites' terms of service and robots.txt files.

## License

This work is licensed under the BSD-3-Clause license.

## Acknowledgments

This project includes code adapted from:
- Parsel (BSD license), used for the [translator](https://github.com/D4Vinci/Scrapling/blob/main/scrapling/core/translator.py) submodule

---
<div align="center"><small>Designed and crafted with ❤️ by Karim Shoair.</small></div><br>
Scrapling/docs/ai/mcp-server.md ADDED
@@ -0,0 +1,294 @@
# Scrapling MCP Server Guide

<iframe width="560" height="315" src="https://www.youtube.com/embed/qyFk3ZNwOxE?si=3FHzgcYCb66iJ6e3" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

The **Scrapling MCP Server** is a new feature that brings Scrapling's powerful Web Scraping capabilities directly to your favorite AI chatbot or AI agent. This integration allows you to scrape websites, extract data, and bypass anti-bot protections conversationally through Claude's AI interface or any interface that supports MCP.

## Features

The Scrapling MCP Server provides six powerful tools for web scraping:

### 🚀 Basic HTTP Scraping
- **`get`**: Fast HTTP requests with browser fingerprint impersonation, generating real browser headers matching the TLS version, HTTP/3, and more!
- **`bulk_get`**: An async version of the above tool that allows scraping of multiple URLs at the same time!

### 🌐 Dynamic Content Scraping
- **`fetch`**: Rapidly fetch dynamic content with a Chromium/Chrome browser, with complete control over the request/browser, and more!
- **`bulk_fetch`**: An async version of the above tool that allows scraping of multiple URLs in different browser tabs at the same time!

### 🔒 Stealth Scraping
- **`stealthy_fetch`**: Uses our stealthy browser to bypass Cloudflare Turnstile/Interstitial and other anti-bot systems, with complete control over the request/browser!
- **`bulk_stealthy_fetch`**: An async version of the above tool that allows stealth scraping of multiple URLs in different browser tabs at the same time!

### Key Capabilities
- **Smart Content Extraction**: Convert web pages/elements to Markdown, HTML, or extract a clean version of the text content
- **CSS Selector Support**: Use the Scrapling engine to target specific elements with precision before handing the content to the AI
- **Anti-Bot Bypass**: Handle Cloudflare Turnstile, Interstitial, and other protections
- **Proxy Support**: Use proxies for anonymity and geo-targeting
- **Browser Impersonation**: Mimic real browsers with TLS fingerprinting, real browser headers matching that version, and more
- **Parallel Processing**: Scrape multiple URLs concurrently for efficiency

#### But why use the Scrapling MCP Server instead of other available tools?

Aside from its stealth capabilities and ability to bypass Cloudflare Turnstile/Interstitial, Scrapling's server is the only one that lets you select specific elements to pass to the AI, saving a lot of time and tokens!

The way other servers work is that they extract the content, then pass it all to the AI to extract the fields you want. This causes the AI to consume far more tokens than needed (on irrelevant content). Scrapling solves this problem by allowing you to pass a CSS selector to narrow down the content before passing it to the AI, which makes the whole process much faster and more efficient.

If you don't know how to write/use CSS selectors, don't worry. You can tell the AI in the prompt to write selectors to match possible fields for you and watch it try different combinations until it finds the right one, as we will show in the examples section.

## Installation

Install Scrapling with MCP support, then double-check that the browser dependencies are installed.

```bash
# Install Scrapling with MCP server dependencies
pip install "scrapling[ai]"

# Install browser dependencies
scrapling install
```

Or use the Docker image directly from the Docker registry:
```bash
docker pull pyd4vinci/scrapling
```
Or download it from the GitHub registry:
```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```

## Setting up the MCP Server

Here we will explain how to add the Scrapling MCP Server to [Claude Desktop](https://claude.ai/download) and [Claude Code](https://www.anthropic.com/claude-code), but the same logic applies to any other chatbot that supports MCP:

### Claude Desktop

1. Open Claude Desktop
2. Click the hamburger menu (☰) at the top left → Settings → Developer → Edit Config
3. Add the Scrapling MCP server configuration:
```json
"ScraplingServer": {
  "command": "scrapling",
  "args": [
    "mcp"
  ]
}
```
If that's the first MCP server you're adding, set the content of the file to this:
```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "scrapling",
      "args": [
        "mcp"
      ]
    }
  }
}
```
As per the [official article](https://modelcontextprotocol.io/quickstart/user), this action either creates a new configuration file if none exists or opens your existing configuration. The file is located at

1. **MacOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
2. **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`

To make sure it works, use the full path to the `scrapling` executable. To find it, open the terminal and run the following command:

1. **MacOS**: `which scrapling`
2. **Windows**: `where scrapling`

For me, on my Mac, it returned `/Users/<MyUsername>/.venv/bin/scrapling`, so the config I used in the end is:
```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "/Users/<MyUsername>/.venv/bin/scrapling",
      "args": [
        "mcp"
      ]
    }
  }
}
```
#### Docker
If you are using the Docker image, the configuration would look something like this:
```json
{
  "mcpServers": {
    "ScraplingServer": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm", "pyd4vinci/scrapling", "mcp"
      ]
    }
  }
}
```

The same logic applies to [Cursor](https://docs.cursor.com/en/context/mcp), [WindSurf](https://windsurf.com/university/tutorials/configuring-first-mcp-server), and others.

### Claude Code
Here it's much simpler. If you have [Claude Code](https://www.anthropic.com/claude-code) installed, open the terminal and execute the following command:

```bash
claude mcp add ScraplingServer "/Users/<MyUsername>/.venv/bin/scrapling" mcp
```
As above, to get Scrapling's executable path, open the terminal and run:

1. **MacOS**: `which scrapling`
2. **Windows**: `where scrapling`

For further details, here's Anthropic's main article on [how to add MCP servers to Claude Code](https://docs.anthropic.com/en/docs/claude-code/mcp#option-1%3A-add-a-local-stdio-server).


Then, after you've added the server, completely quit and restart the app you configured above. In Claude Desktop, you should see an MCP server indicator (🔧) in the bottom-right corner of the chat input or see `ScraplingServer` in the `Search and tools` dropdown in the chat input box.

### Streamable HTTP
As of version 0.3.6, the MCP server can use the 'Streamable HTTP' transport mode instead of the traditional 'stdio' transport.

So instead of using the following command (the 'stdio' one):
```bash
scrapling mcp
```
Use the following to enable 'Streamable HTTP' transport mode:
```bash
scrapling mcp --http
```
By default, the server listens on host '0.0.0.0' and port 8000; both can be configured as below:
```bash
scrapling mcp --http --host '127.0.0.1' --port 8000
```

## Examples

Now we will show you some examples of prompts we used while testing the MCP server, but you are probably more creative and better at prompt engineering than we are :)

We will gradually go from simple prompts to more complex ones. We will use Claude Desktop for the examples, but the same logic applies to the rest, of course.

1. **Basic Web Scraping**

Extract the main content from a webpage as Markdown:

```
Scrape the main content from https://example.com and convert it to markdown format.
```

Claude will use the `get` tool to fetch the page and return clean, readable content. If it fails, it will continue retrying every second for 3 attempts, unless you instruct it otherwise. If it fails to retrieve content for any reason, such as protection or if it's a dynamic website, it will automatically try the other tools. If Claude doesn't do that automatically for some reason, you can add that to the prompt.

A more optimized version of the same prompt would be:
```
Use regular requests to scrape the main content from https://example.com and convert it to markdown format.
```
This tells Claude which tool to use here, so it doesn't have to guess. Sometimes it will start using normal requests on its own, and at other times, it will assume browsers are better suited for this website for no apparent reason. As a rule of thumb, you should always tell Claude which tool to use to save time and money and get consistent results.

2. **Targeted Data Extraction**

Extract specific elements using CSS selectors:

```
Get all product titles from https://shop.example.com using the CSS selector '.product-title'. If the request fails, retry up to 5 times every 10 seconds.
```

The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to try up to 5 times in case the website has connection issues, but the default setting should be fine for most cases.

3. **E-commerce Data Collection**

Another, slightly more complex example prompt:
```
Extract product information from these e-commerce URLs using bulk browser fetches:
- https://shop1.com/product-a
- https://shop2.com/product-b
- https://shop3.com/product-c

Get the product names, prices, and descriptions from each page.
```

Claude will use `bulk_fetch` to concurrently scrape all URLs, then analyze the extracted data.

4. **More advanced workflow**

Let's say I want to get all the action games available on the first page of the PlayStation store right now. I can use the following prompt to do that:
```
Extract the URLs of all games in this page, then do a bulk request to them and return a list of all action games: https://store.playstation.com/en-us/pages/browse
```
Note that I instructed it to use a bulk request for all the URLs collected. If I hadn't mentioned it, sometimes it works as intended, and other times it makes a separate request to each URL, which takes significantly longer. This prompt takes approximately one minute to complete.

However, because I wasn't specific enough, it actually used `stealthy_fetch` here and `bulk_stealthy_fetch` in the second step, which unnecessarily consumed a large number of tokens. A better prompt would be:
```
Use normal requests to extract the URLs of all games in this page, then do a bulk request to them and return a list of all action games: https://store.playstation.com/en-us/pages/browse
```
And if you know how to write CSS selectors, you can instruct Claude to apply the selectors to the elements you want, and it will complete the task almost immediately.
```
Use normal requests to extract the URLs of all games on the page below, then perform a bulk request to them and return a list of all action games.
The selector for games in the first page is `[href*="/concept/"]` and the selector for the genre in the second request is `[data-qa="gameInfo#releaseInformation#genre-value"]`.

URL: https://store.playstation.com/en-us/pages/browse
```

5. **Get data from a website with Cloudflare protection**

If you think the website you are targeting has Cloudflare protection, tell Claude instead of letting it discover that on its own.
```
What's the price of this product? Be cautious, as it utilizes Cloudflare's Turnstile protection. Make the browser visible while you work.

https://ao.com/product/oo101uk-ninja-woodfire-outdoor-pizza-oven-brown-99357-685.aspx
```

6. **Long workflow**

You can, for example, use a prompt like this:
```
Extract all product URLs for the following category, then return the prices and details for the first 3 products.

https://www.arnotts.ie/furniture/bedroom/bed-frames/
```
But a better prompt would be:
```
Go to the following category URL and extract all product URLs using the CSS selector "a". Then, fetch the first 3 product pages in parallel and extract each product’s price and details.

Keep the output in markdown format to reduce irrelevant content.

Category URL:
https://www.arnotts.ie/furniture/bedroom/bed-frames/
```

And so on; you get the idea. Your creativity is the key here.

## Best Practices

Here is some technical advice for you.

### 1. Choose the Right Tool
- **`get`**: Fast, simple websites
- **`fetch`**: Sites with JavaScript/dynamic content
- **`stealthy_fetch`**: Protected sites, Cloudflare, anti-bot systems

### 2. Optimize Performance
- Use bulk tools for multiple URLs
- Disable unnecessary resources
- Set appropriate timeouts
- Use CSS selectors for targeted extraction

### 3. Handle Dynamic Content
- Use `network_idle` for SPAs
- Set `wait_selector` for specific elements
- Increase the timeout for slow-loading sites

### 4. Data Quality
- Use `main_content_only=true` to avoid navigation/ads
- Choose an appropriate `extraction_type` for your use case

+ ## Legal and Ethical Considerations
282
+
283
+ ⚠️ **Important Guidelines:**
284
+
285
+ - **Check robots.txt**: Visit `https://website.com/robots.txt` to see scraping rules
286
+ - **Respect rate limits**: Don't overwhelm servers with requests
287
+ - **Terms of Service**: Read and comply with website terms
288
+ - **Copyright**: Respect intellectual property rights
289
+ - **Privacy**: Be mindful of personal data protection laws
290
+ - **Commercial use**: Ensure you have permission for business purposes
291
+
292
+ ---
293
+
294
+ *Built with ❤️ by the Scrapling team. Happy scraping!*
Scrapling/docs/api-reference/custom-types.md ADDED
@@ -0,0 +1,26 @@
---
search:
  exclude: true
---

# Custom Types API Reference

Here's the reference information for all the custom type classes Scrapling implements, with all their parameters, attributes, and methods.

You can import all of them directly like below:

```python
from scrapling.core.custom_types import TextHandler, TextHandlers, AttributesHandler
```

## ::: scrapling.core.custom_types.TextHandler
    handler: python
    :docstring:

## ::: scrapling.core.custom_types.TextHandlers
    handler: python
    :docstring:

## ::: scrapling.core.custom_types.AttributesHandler
    handler: python
    :docstring:
Scrapling/docs/api-reference/fetchers.md ADDED
@@ -0,0 +1,63 @@
---
search:
  exclude: true
---

# Fetchers Classes

Here's the reference information for all fetcher-type classes' parameters, attributes, and methods.

You can import all of them directly like below:

```python
from scrapling.fetchers import (
    Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher,
    FetcherSession, AsyncStealthySession, StealthySession, DynamicSession, AsyncDynamicSession
)
```

## ::: scrapling.fetchers.Fetcher
    handler: python
    :docstring:

## ::: scrapling.fetchers.AsyncFetcher
    handler: python
    :docstring:

## ::: scrapling.fetchers.DynamicFetcher
    handler: python
    :docstring:

## ::: scrapling.fetchers.StealthyFetcher
    handler: python
    :docstring:


## Session Classes

### HTTP Sessions

## ::: scrapling.fetchers.FetcherSession
    handler: python
    :docstring:

### Stealth Sessions

## ::: scrapling.fetchers.StealthySession
    handler: python
    :docstring:

## ::: scrapling.fetchers.AsyncStealthySession
    handler: python
    :docstring:

### Dynamic Sessions

## ::: scrapling.fetchers.DynamicSession
    handler: python
    :docstring:

## ::: scrapling.fetchers.AsyncDynamicSession
    handler: python
    :docstring:

Scrapling/docs/api-reference/mcp-server.md ADDED
@@ -0,0 +1,39 @@
---
search:
  exclude: true
---

# MCP Server API Reference

The **Scrapling MCP Server** provides six powerful tools for web scraping through the Model Context Protocol (MCP). This server integrates Scrapling's capabilities directly into AI chatbots and agents, allowing conversational web scraping with advanced anti-bot bypass features.

You can start the MCP server by running:

```bash
scrapling mcp
```

Or import the server class directly:

```python
from scrapling.core.ai import ScraplingMCPServer

server = ScraplingMCPServer()
server.serve(http=False, host="0.0.0.0", port=8000)
```

## Response Model

The standardized response structure that's returned by all MCP server tools:

## ::: scrapling.core.ai.ResponseModel
    handler: python
    :docstring:

## MCP Server Class

The main MCP server class that provides all web scraping tools:

## ::: scrapling.core.ai.ScraplingMCPServer
    handler: python
    :docstring:
Scrapling/docs/api-reference/proxy-rotation.md ADDED
@@ -0,0 +1,18 @@
---
search:
  exclude: true
---

# Proxy Rotation

The `ProxyRotator` class provides thread-safe proxy rotation for any fetcher or session.

You can import it directly like below:

```python
from scrapling.fetchers import ProxyRotator
```

## ::: scrapling.engines.toolbelt.proxy_rotation.ProxyRotator
    handler: python
    :docstring:
Scrapling/docs/api-reference/response.md ADDED
@@ -0,0 +1,18 @@
---
search:
  exclude: true
---

# Response Class

The `Response` class wraps HTTP responses returned by all fetchers, providing access to status, headers, body, cookies, and a `Selector` for parsing.

You can import the `Response` class like below:

```python
from scrapling.engines.toolbelt.custom import Response
```

## ::: scrapling.engines.toolbelt.custom.Response
    handler: python
    :docstring:
Scrapling/docs/api-reference/selector.md ADDED
@@ -0,0 +1,25 @@
---
search:
  exclude: true
---

# Selector Class

The `Selector` class is the core parsing engine in Scrapling that provides HTML parsing and element selection capabilities.

Here's the reference information for the `Selector` class, with all its parameters, attributes, and methods.

You can import the `Selector` class directly from `scrapling`:

```python
from scrapling.parser import Selector
```

## ::: scrapling.parser.Selector
    handler: python
    :docstring:

## ::: scrapling.parser.Selectors
    handler: python
    :docstring:

Scrapling/docs/api-reference/spiders.md ADDED
@@ -0,0 +1,42 @@
---
search:
  exclude: true
---

# Spider Classes

Here's the reference information for the spider framework classes' parameters, attributes, and methods.

You can import them directly like below:

```python
from scrapling.spiders import Spider, Request, CrawlResult, SessionManager, Response
```

## ::: scrapling.spiders.Spider
    handler: python
    :docstring:

## ::: scrapling.spiders.Request
    handler: python
    :docstring:

## Result Classes

## ::: scrapling.spiders.result.CrawlResult
    handler: python
    :docstring:

## ::: scrapling.spiders.result.CrawlStats
    handler: python
    :docstring:

## ::: scrapling.spiders.result.ItemList
    handler: python
    :docstring:

## Session Management

## ::: scrapling.spiders.session.SessionManager
    handler: python
    :docstring:
Scrapling/docs/assets/cover_dark.png ADDED

Git LFS Details

  • SHA256: 8eec59d31fa1c41f1a35ee8e08a412e975eeabf1347b1bb6ca609cd454edf044
  • Pointer size: 131 Bytes
  • Size of remote file: 114 kB
Scrapling/docs/assets/cover_dark.svg ADDED
Scrapling/docs/assets/cover_light.png ADDED
Scrapling/docs/assets/cover_light.svg ADDED
Scrapling/docs/assets/favicon.ico ADDED

Git LFS Details

  • SHA256: 9d2643963074a37762e2f2896b3146c7601a262838cecbcac30b69baa497d4f8
  • Pointer size: 131 Bytes
  • Size of remote file: 267 kB
Scrapling/docs/assets/logo.png ADDED
Scrapling/docs/assets/main_cover.png ADDED

Git LFS Details

  • SHA256: a80343a3e9f04e64c08c568ff2e452cccd2b24157d24b7263fc5d677d14ccc40
  • Pointer size: 131 Bytes
  • Size of remote file: 455 kB
Scrapling/docs/assets/scrapling_shell_curl.png ADDED

Git LFS Details

  • SHA256: 39c5c7aa963d31dc4f8584f34058600487c1941160dcfdcb8d11f1c699935c13
  • Pointer size: 131 Bytes
  • Size of remote file: 351 kB