Karim shoair committed on
Commit be5aa87 · 1 Parent(s): c0cf681

docs: update the cli section

docs/cli/extract-commands.md CHANGED
@@ -2,7 +2,7 @@
 
 **Web Scraping through the terminal without requiring any programming!**
 
-The `scrapling extract` Command lets you download and extract content from websites directly from your terminal without writing any code. Ideal for beginners, researchers, and anyone requiring rapid web data extraction.
+The `scrapling extract` command lets you download and extract content from websites directly from your terminal without writing any code. Ideal for beginners, researchers, and anyone requiring rapid web data extraction.
 
 > 💡 **Prerequisites:**
 >
@@ -30,7 +30,7 @@ The extract command is a set of simple terminal tools that:
 ```bash
 scrapling extract get "https://example.com" page_content.txt
 ```
-This does an HTTP GET request and saves the text content of the webpage to `page_content.txt`.
+This makes an HTTP GET request and saves the webpage's text content to `page_content.txt`.
 
 - **Save as Different Formats**
 
@@ -73,13 +73,13 @@ Commands:
   stealthy-fetch  Use StealthyFetcher to fetch content with advanced...
 ```
 
-We will go through each Command in detail below.
+We will go through each command in detail below.
 
 ### HTTP Requests
 
 1. **GET Request**
 
-The most common Command for downloading website content:
+The most common command for downloading website content:
 
 ```bash
 scrapling extract get [URL] [OUTPUT_FILE] [OPTIONS]
@@ -105,7 +105,7 @@ We will go through each Command in detail below.
 # Add multiple headers
 scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
 ```
-Get the available options for the Command with `scrapling extract get --help` as follows:
+Get the available options for the command with `scrapling extract get --help` as follows:
 ```bash
 Usage: scrapling extract get [OPTIONS] URL OUTPUT_FILE
 
@@ -143,7 +143,7 @@ We will go through each Command in detail below.
 # Send JSON data
 scrapling extract post "https://api.site.com" response.json --json '{"username": "test", "action": "search"}'
 ```
-Get the available options for the Command with `scrapling extract post --help` as follows:
+Get the available options for the command with `scrapling extract post --help` as follows:
 ```bash
 Usage: scrapling extract post [OPTIONS] URL OUTPUT_FILE
 
@@ -182,7 +182,7 @@ We will go through each Command in detail below.
 # Send JSON data
 scrapling extract put "https://scrapling.requestcatcher.com/put" response.json --json '{"username": "test", "action": "search"}'
 ```
-Get the available options for the Command with `scrapling extract put --help` as follows:
+Get the available options for the command with `scrapling extract put --help` as follows:
 ```bash
 Usage: scrapling extract put [OPTIONS] URL OUTPUT_FILE
 
@@ -220,7 +220,7 @@ We will go through each Command in detail below.
 # Send JSON data
 scrapling extract delete "https://scrapling.requestcatcher.com/" response.txt --impersonate "chrome"
 ```
-Get the available options for the Command with `scrapling extract delete --help` as follows:
+Get the available options for the command with `scrapling extract delete --help` as follows:
 ```bash
 Usage: scrapling extract delete [OPTIONS] URL OUTPUT_FILE
 
@@ -263,7 +263,7 @@ We will go through each Command in detail below.
 # Run in visible browser mode (helpful for debugging)
 scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources
 ```
-Get the available options for the Command with `scrapling extract fetch --help` as follows:
+Get the available options for the command with `scrapling extract fetch --help` as follows:
 ```bash
 Usage: scrapling extract fetch [OPTIONS] URL OUTPUT_FILE
 
@@ -279,10 +279,8 @@ We will go through each Command in detail below.
   --wait INTEGER                    Additional wait time in milliseconds after page load (default: 0)
   -s, --css-selector TEXT           CSS selector to extract specific content from the page. It returns all matches.
   --wait-selector TEXT              CSS selector to wait for before proceeding
-  --locale TEXT                     Browser locale (default: en-US)
-  --stealth / --no-stealth          Enable stealth mode (default: False)
-  --hide-canvas / --show-canvas     Add noise to canvas operations (default: False)
-  --disable-webgl / --enable-webgl  Disable WebGL support (default: False)
+  --locale TEXT                     Specify user locale. Defaults to the system default locale.
+  --real-chrome / --no-real-chrome  If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
   --proxy TEXT                      Proxy URL in format "http://username:password@host:port"
   -H, --extra-headers TEXT          Extra headers in format "Key: Value" (can be used multiple times)
   --help                            Show this message and exit.
@@ -304,10 +302,10 @@ We will go through each Command in detail below.
 # Solve Cloudflare challenges
 scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"
 
-# Use proxy for anonymity
+# Use a proxy for anonymity
 scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
 ```
-Get the available options for the Command with `scrapling extract stealthy-fetch --help` as follows:
+Get the available options for the command with `scrapling extract stealthy-fetch --help` as follows:
 ```bash
 Usage: scrapling extract stealthy-fetch [OPTIONS] URL OUTPUT_FILE
 
@@ -317,25 +315,23 @@ We will go through each Command in detail below.
 
 Options:
   --headless / --no-headless                  Run browser in headless mode (default: True)
-  --block-images / --allow-images             Block image loading (default: False)
   --disable-resources / --enable-resources    Drop unnecessary resources for speed boost (default: False)
   --block-webrtc / --allow-webrtc             Block WebRTC entirely (default: False)
-  --humanize / --no-humanize                  Humanize cursor movement (default: False)
   --solve-cloudflare / --no-solve-cloudflare  Solve Cloudflare challenges (default: False)
   --allow-webgl / --block-webgl               Allow WebGL (default: True)
   --network-idle / --no-network-idle          Wait for network idle (default: False)
-  --disable-ads / --allow-ads                 Install uBlock Origin addon (default: False)
+  --real-chrome / --no-real-chrome            If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
+  --hide-canvas / --show-canvas               Add noise to canvas operations (default: False)
   --timeout INTEGER                           Timeout in milliseconds (default: 30000)
   --wait INTEGER                              Additional wait time in milliseconds after page load (default: 0)
   -s, --css-selector TEXT                     CSS selector to extract specific content from the page. It returns all matches.
   --wait-selector TEXT                        CSS selector to wait for before proceeding
-  --geoip / --no-geoip                        Use IP/Proxy geolocation for timezone/locale (default: False)
   --proxy TEXT                                Proxy URL in format "http://username:password@host:port"
   -H, --extra-headers TEXT                    Extra headers in format "Key: Value" (can be used multiple times)
   --help                                      Show this message and exit.
 ```
 
-## When to use each Command
+## When to use each command
 
 If you are not a Web Scraping expert and can't decide what to choose, you can use the following formula to help you decide:
 
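The help listings above all describe `--proxy` as taking a URL of the form `"http://username:password@host:port"`. As a quick sanity check before a run, that shape can be pulled apart with plain POSIX parameter expansion; the snippet below is our own illustration with made-up credentials, not part of Scrapling:

```shell
#!/bin/sh
# Decompose a proxy URL of the documented shape
# "http://username:password@host:port" into its parts.
# Example credentials only; substitute your own proxy.
proxy="http://scraper:s3cret@proxy-server:8080"

rest="${proxy#*://}"      # strip the scheme -> scraper:s3cret@proxy-server:8080
creds="${rest%%@*}"       # before the @    -> scraper:s3cret
hostport="${rest##*@}"    # after the @     -> proxy-server:8080
user="${creds%%:*}"       # -> scraper
pass="${creds#*:}"        # -> s3cret
host="${hostport%%:*}"    # -> proxy-server
port="${hostport##*:}"    # -> 8080

echo "user=$user host=$host port=$port"
```

A password containing `@` or `:` would confuse this naive split (and most proxy clients); percent-encode such characters in the URL instead.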
docs/cli/interactive-shell.md CHANGED
@@ -4,7 +4,7 @@
 
 **Powerful Web Scraping REPL for Developers and Data Scientists**
 
-The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools like curl command conversion.
+The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools such as curl command conversion.
 
 > 💡 **Prerequisites:**
 >
@@ -129,7 +129,9 @@ View scraped pages in your browser:
 
 ### Curl Command Integration
 
-The shell provides a few functions to help you convert curl commands from the browser DevTools to `Fetcher` requests, which are `uncurl` and `curl2fetcher`. First, you need to copy a request as a curl command like the following:
+The shell provides a few functions to help you convert curl commands from the browser DevTools to `Fetcher` requests: `uncurl` and `curl2fetcher`.
+
+First, you need to copy a request as a curl command like the following:
 
 <img src="../../assets/scrapling_shell_curl.png" title="Copying a request as a curl command from Chrome" alt="Copying a request as a curl command from Chrome" style="width: 70%;"/>
 
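To make the conversion concrete: a DevTools "Copy as cURL" string mostly encodes a URL plus repeated `-H` headers. The sketch below is our own rough illustration of that parsing, not Scrapling's actual `uncurl`/`curl2fetcher` implementation, and it only handles single-quoted arguments:

```shell
#!/bin/sh
# Pull the URL and the repeated -H headers out of a DevTools
# "Copy as cURL" string. Illustration only; real converters must
# also handle --data, cookies, escaping, etc.
curl_cmd="curl 'https://example.com/api' -H 'Accept: application/json' -H 'User-Agent: Mozilla/5.0'"

# The URL is the first single-quoted token after `curl`
url=$(printf '%s\n' "$curl_cmd" | sed -n "s/^curl '\([^']*\)'.*/\1/p")

# Each -H '...' occurrence becomes one "Key: Value" header line
headers=$(printf '%s\n' "$curl_cmd" | grep -o -e "-H '[^']*'" | sed -e "s/^-H '//" -e "s/'\$//")

echo "url: $url"
echo "$headers"
```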
docs/cli/overview.md CHANGED
@@ -27,4 +27,4 @@ and the installation of the fetchers' dependencies with the following command
 ```bash
 scrapling install
 ```
-This downloads all browsers with their system dependencies and fingerprint manipulation dependencies.
+This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.
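Since `scrapling install` only exists once the package itself is on `PATH`, a provisioning script can guard the call; the `require` helper below is a generic sketch of ours, not part of Scrapling:

```shell
#!/bin/sh
# Run `scrapling install` only when the CLI is actually available,
# so a setup script fails loudly up front instead of mid-download.
require() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing: $1" >&2; return 1; }
}

if require scrapling; then
  scrapling install   # fetches browsers plus fingerprint dependencies
else
  echo "install Scrapling first" >&2
fi
```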