Karim Shoair committed
Commit be5aa87 · Parent(s): c0cf681
docs: update the cli section

Files changed:
- docs/cli/extract-commands.md (+17 −20)
- docs/cli/interactive-shell.md (+4 −2)
- docs/cli/overview.md (+1 −1)

docs/cli/extract-commands.md (CHANGED)
@@ -2,7 +2,7 @@
 
 **Web Scraping through the terminal without requiring any programming!**
 
-The `scrapling extract` …
+The `scrapling extract` command lets you download and extract content from websites directly from your terminal, without writing any code. It's ideal for beginners, researchers, and anyone who needs rapid web data extraction.
 
 > 💡 **Prerequisites:**
 >

@@ -30,7 +30,7 @@ The extract command is a set of simple terminal tools that:
 ```bash
 scrapling extract get "https://example.com" page_content.txt
 ```
-This …
+This makes an HTTP GET request and saves the webpage's text content to `page_content.txt`.
 
 - **Save as Different Formats**
 

@@ -73,13 +73,13 @@ Commands:
   stealthy-fetch  Use StealthyFetcher to fetch content with advanced...
 ```
 
-We will go through each …
+We will go through each command in detail below.
 
 ### HTTP Requests
 
 1. **GET Request**
 
-The most common …
+The most common command for downloading website content:
 
 ```bash
 scrapling extract get [URL] [OUTPUT_FILE] [OPTIONS]

@@ -105,7 +105,7 @@ We will go through each Command in detail below.
 # Add multiple headers
 scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
 ```
-Get the available options for the …
+Get the available options for the command with `scrapling extract get --help`, as follows:
 ```bash
 Usage: scrapling extract get [OPTIONS] URL OUTPUT_FILE
 

@@ -143,7 +143,7 @@ We will go through each Command in detail below.
 # Send JSON data
 scrapling extract post "https://api.site.com" response.json --json '{"username": "test", "action": "search"}'
 ```
-Get the available options for the …
+Get the available options for the command with `scrapling extract post --help`, as follows:
 ```bash
 Usage: scrapling extract post [OPTIONS] URL OUTPUT_FILE
 

@@ -182,7 +182,7 @@ We will go through each Command in detail below.
 # Send JSON data
 scrapling extract put "https://scrapling.requestcatcher.com/put" response.json --json '{"username": "test", "action": "search"}'
 ```
-Get the available options for the …
+Get the available options for the command with `scrapling extract put --help`, as follows:
 ```bash
 Usage: scrapling extract put [OPTIONS] URL OUTPUT_FILE
 

@@ -220,7 +220,7 @@ We will go through each Command in detail below.
 # Send a DELETE request while impersonating Chrome
 scrapling extract delete "https://scrapling.requestcatcher.com/" response.txt --impersonate "chrome"
 ```
-Get the available options for the …
+Get the available options for the command with `scrapling extract delete --help`, as follows:
 ```bash
 Usage: scrapling extract delete [OPTIONS] URL OUTPUT_FILE
 

@@ -263,7 +263,7 @@ We will go through each Command in detail below.
 # Run in visible browser mode (helpful for debugging)
 scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources
 ```
-Get the available options for the …
+Get the available options for the command with `scrapling extract fetch --help`, as follows:
 ```bash
 Usage: scrapling extract fetch [OPTIONS] URL OUTPUT_FILE
 

@@ -279,10 +279,8 @@ We will go through each Command in detail below.
 --wait INTEGER                  Additional wait time in milliseconds after page load (default: 0)
 -s, --css-selector TEXT         CSS selector to extract specific content from the page. It returns all matches.
 --wait-selector TEXT            CSS selector to wait for before proceeding
---locale TEXT …
--- …
---hide-canvas / --show-canvas   Add noise to canvas operations (default: False)
---disable-webgl / --enable-webgl  Disable WebGL support (default: False)
+--locale TEXT                   Specify user locale. Defaults to the system default locale.
+--real-chrome / --no-real-chrome  If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
 --proxy TEXT                    Proxy URL in format "http://username:password@host:port"
 -H, --extra-headers TEXT        Extra headers in format "Key: Value" (can be used multiple times)
 --help                          Show this message and exit.

@@ -304,10 +302,10 @@ We will go through each Command in detail below.
 # Solve Cloudflare challenges
 scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"
 
-# Use proxy for anonymity
+# Use a proxy for anonymity
 scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
 ```
-Get the available options for the …
+Get the available options for the command with `scrapling extract stealthy-fetch --help`, as follows:
 ```bash
 Usage: scrapling extract stealthy-fetch [OPTIONS] URL OUTPUT_FILE
 

@@ -317,25 +315,23 @@ We will go through each Command in detail below.
 
 Options:
 --headless / --no-headless      Run browser in headless mode (default: True)
---block-images / --allow-images  Block image loading (default: False)
 --disable-resources / --enable-resources  Drop unnecessary resources for speed boost (default: False)
 --block-webrtc / --allow-webrtc  Block WebRTC entirely (default: False)
---humanize / --no-humanize      Humanize cursor movement (default: False)
 --solve-cloudflare / --no-solve-cloudflare  Solve Cloudflare challenges (default: False)
 --allow-webgl / --block-webgl   Allow WebGL (default: True)
 --network-idle / --no-network-idle  Wait for network idle (default: False)
--- …
+--real-chrome / --no-real-chrome  If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
+--hide-canvas / --show-canvas   Add noise to canvas operations (default: False)
 --timeout INTEGER               Timeout in milliseconds (default: 30000)
 --wait INTEGER                  Additional wait time in milliseconds after page load (default: 0)
 -s, --css-selector TEXT         CSS selector to extract specific content from the page. It returns all matches.
 --wait-selector TEXT            CSS selector to wait for before proceeding
--- …
 --proxy TEXT                    Proxy URL in format "http://username:password@host:port"
 -H, --extra-headers TEXT        Extra headers in format "Key: Value" (can be used multiple times)
 --help                          Show this message and exit.
 ```
 
-## When to use each
+## When to use each command
 
 If you are not a Web Scraping expert and can't decide what to choose, you can use the following formula to help you decide:
 
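Several commands above accept repeated `-H "Key: Value"` options for extra headers. As a sketch of how that documented format can be turned into a header mapping (`parse_extra_headers` is a hypothetical illustration, not Scrapling's actual code):

```python
def parse_extra_headers(raw_headers: list[str]) -> dict[str, str]:
    """Convert repeated -H "Key: Value" strings into a headers dict.

    Hypothetical helper illustrating the documented -H format;
    not Scrapling's implementation.
    """
    headers = {}
    for raw in raw_headers:
        # Split on the first colon only, so values may contain colons.
        key, sep, value = raw.partition(":")
        if not sep:
            raise ValueError(f"Invalid header (expected 'Key: Value'): {raw!r}")
        headers[key.strip()] = value.strip()
    return headers
```

Splitting on only the first colon keeps values like `Referer: https://site.com` intact.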
docs/cli/interactive-shell.md (CHANGED)

@@ -4,7 +4,7 @@
 
 **Powerful Web Scraping REPL for Developers and Data Scientists**
 
-The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools …
+The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools such as curl command conversion.
 
 > 💡 **Prerequisites:**
 >

@@ -129,7 +129,9 @@ View scraped pages in your browser:
 
 ### Curl Command Integration
 
-The shell provides a few functions to help you convert curl commands from the browser DevTools to `Fetcher` requests …
+The shell provides a few functions to help you convert curl commands from the browser DevTools to `Fetcher` requests: `uncurl` and `curl2fetcher`.
+
+First, you need to copy a request as a curl command, like the following:
 
 <img src="../../assets/scrapling_shell_curl.png" title="Copying a request as a curl command from Chrome" alt="Copying a request as a curl command from Chrome" style="width: 70%;"/>
 
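The `uncurl` and `curl2fetcher` helpers mentioned above turn a copied curl command into request parameters. A minimal standard-library sketch of that idea (`parse_curl` is a hypothetical illustration, not Scrapling's implementation, and handles only `-H`, `-X`, and `--data*` flags):

```python
import shlex


def parse_curl(curl_cmd: str) -> dict:
    """Parse a 'Copy as cURL' command into url, method, headers, and body.

    Simplified sketch of what curl-conversion helpers do; not Scrapling's code.
    """
    tokens = shlex.split(curl_cmd)
    if not tokens or tokens[0] != "curl":
        raise ValueError("expected a command starting with 'curl'")
    result = {"url": None, "method": "GET", "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-H":
            # Headers arrive as "Key: Value" strings.
            key, _, value = tokens[i + 1].partition(":")
            result["headers"][key.strip()] = value.strip()
            i += 2
        elif tok in ("-X", "--request"):
            result["method"] = tokens[i + 1]
            i += 2
        elif tok in ("-d", "--data", "--data-raw"):
            # A body implies POST unless a method was set explicitly.
            result["data"] = tokens[i + 1]
            if result["method"] == "GET":
                result["method"] = "POST"
            i += 2
        elif not tok.startswith("-"):
            result["url"] = tok
            i += 1
        else:
            i += 1  # skip flags this sketch does not model
    return result
```

`shlex.split` handles the shell quoting that DevTools emits, which is the tricky part of parsing copied curl commands.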
docs/cli/overview.md (CHANGED)

@@ -27,4 +27,4 @@ and the installation of the fetchers' dependencies with the following command
 ```bash
 scrapling install
 ```
-This downloads all browsers with their system dependencies and fingerprint manipulation dependencies.
+This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.
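The fetch commands documented above take proxy URLs in the form `http://username:password@host:port`. That form splits cleanly with the standard library, as this sketch shows (`parse_proxy` is a hypothetical helper, not part of Scrapling):

```python
from urllib.parse import urlsplit


def parse_proxy(proxy_url: str) -> dict:
    """Split a proxy URL of the documented form into its parts."""
    parts = urlsplit(proxy_url)
    return {
        "scheme": parts.scheme,      # e.g. "http"
        "username": parts.username,  # None when no credentials are given
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,          # int, or None when omitted
    }
```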