# Scrapling Extract Command Guide

**Web scraping through the terminal without requiring any programming!**

The `scrapling extract` command lets you download and extract content from websites directly from your terminal without writing any code. It is ideal for beginners, researchers, and anyone who needs rapid web data extraction.

!!! success "Prerequisites"

    1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
    2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
    3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know which properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
    4. You've completed or read at least one page from the fetchers section to use here for requests: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md).

## What is the Extract Command group?

The extract command group is a set of simple terminal tools that:

- **Downloads web pages** and saves their content to files.
- **Converts HTML to readable formats** like Markdown, keeps it as HTML, or extracts only the text content of the page.
- **Supports custom CSS selectors** to extract specific parts of the page.
- **Handles HTTP requests and fetching through browsers.**
- **Is highly customizable** with custom headers, cookies, proxies, and more. Almost all the options available through the code are also accessible through the command line.
## Quick Start

- **Basic Website Download**

    Download a website's text content as clean, readable text:

    ```bash
    scrapling extract get "https://example.com" page_content.txt
    ```

    This makes an HTTP GET request and saves the webpage's text content to `page_content.txt`.

- **Save as Different Formats**

    Choose your output format by changing the file extension:

    ```bash
    # Convert the HTML content to Markdown, then save it to the file (great for documentation)
    scrapling extract get "https://blog.example.com" article.md

    # Save the HTML content as it is to the file
    scrapling extract get "https://example.com" page.html

    # Save a clean version of the text content of the webpage to the file
    scrapling extract get "https://example.com" content.txt

    # Or use the Docker image with something like this:
    docker run -v $(pwd)/output:/output scrapling extract get "https://blog.example.com" /output/article.md
    ```
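The extension-based format selection described above can be sketched in plain Python. This is an illustrative sketch of the idea, not Scrapling's actual implementation:

```python
from pathlib import Path


def choose_format(output_file: str) -> str:
    """Map the output file's extension to an export format (illustrative only)."""
    suffix = Path(output_file).suffix.lower()
    formats = {".html": "html", ".md": "markdown", ".txt": "text"}
    if suffix not in formats:
        raise ValueError(f"Unsupported extension: {suffix!r} (use .html, .md, or .txt)")
    return formats[suffix]


print(choose_format("article.md"))        # markdown
print(choose_format("page.html"))         # html
print(choose_format("page_content.txt"))  # text
```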
- **Extract Specific Content**

    All commands can use CSS selectors to extract specific parts of the page through `--css-selector` or `-s`, as you will see in the examples below.
## Available Commands

You can display the available commands through `scrapling extract --help` to get the following list:

```bash
Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...

  Fetch web pages using various fetchers and extract full/selected HTML
  content as HTML, Markdown, or extract text content.

Options:
  --help  Show this message and exit.

Commands:
  get             Perform a GET request and save the content to a file.
  post            Perform a POST request and save the content to a file.
  put             Perform a PUT request and save the content to a file.
  delete          Perform a DELETE request and save the content to a file.
  fetch           Use DynamicFetcher to fetch content with browser...
  stealthy-fetch  Use StealthyFetcher to fetch content with advanced...
```

We will go through each command in detail below.
### HTTP Requests

1. **GET Request**

    The most common command for downloading website content:

    ```bash
    scrapling extract get [URL] [OUTPUT_FILE] [OPTIONS]
    ```

    **Examples:**

    ```bash
    # Basic download
    scrapling extract get "https://news.site.com" news.md

    # Download with a custom timeout
    scrapling extract get "https://example.com" content.txt --timeout 60

    # Extract only specific content using CSS selectors
    scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

    # Send a request with cookies
    scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"

    # Add a user agent
    scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"

    # Add multiple headers
    scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
    ```
    Get the available options for the command with `scrapling extract get --help` as follows:

    ```bash
    Usage: scrapling extract get [OPTIONS] URL OUTPUT_FILE

      Perform a GET request and save the content to a file.

      The output file path can be an HTML file, a Markdown file of the HTML
      content, or the text content itself. Use file extensions
      (`.html`/`.md`/`.txt`) respectively.

    Options:
      -H, --headers TEXT        HTTP headers in format "Key: Value" (can be used multiple times)
      --cookies TEXT            Cookies string in format "name1=value1;name2=value2"
      --timeout INTEGER         Request timeout in seconds (default: 30)
      --proxy TEXT              Proxy URL in format "http://username:password@host:port"
      -s, --css-selector TEXT   CSS selector to extract specific content from the page. It returns all matches.
      -p, --params TEXT         Query parameters in format "key=value" (can be used multiple times)
      --follow-redirects / --no-follow-redirects
                                Whether to follow redirects (default: True)
      --verify / --no-verify    Whether to verify SSL certificates (default: True)
      --impersonate TEXT        Browser to impersonate (e.g., chrome, firefox).
      --stealthy-headers / --no-stealthy-headers
                                Use stealthy browser headers (default: True)
      --help                    Show this message and exit.
    ```

    Note that these options work the same way for all the other request commands, so they won't be repeated below.
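To clarify the string formats the `--headers` and `--cookies` options expect, here is a minimal Python sketch that turns a `"Key: Value"` header and a `"name1=value1;name2=value2"` cookie string into dictionaries. This is illustrative only, not Scrapling's actual parsing code:

```python
def parse_header(header: str) -> tuple[str, str]:
    """Split a "Key: Value" header string into a (key, value) pair."""
    key, _, value = header.partition(":")
    return key.strip(), value.strip()


def parse_cookies(cookie_string: str) -> dict[str, str]:
    """Split a "name1=value1;name2=value2" cookie string into a dict."""
    cookies = {}
    for pair in cookie_string.split(";"):
        if "=" in pair:
            name, _, value = pair.partition("=")
            cookies[name.strip()] = value.strip()
    return cookies


print(parse_header("User-Agent: MyBot 1.0"))       # ('User-Agent', 'MyBot 1.0')
print(parse_cookies("session=abc123; user=john"))  # {'session': 'abc123', 'user': 'john'}
```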
2. **POST Request**

    ```bash
    scrapling extract post [URL] [OUTPUT_FILE] [OPTIONS]
    ```

    **Examples:**

    ```bash
    # Submit form data
    scrapling extract post "https://api.site.com/search" results.html --data "query=python&type=tutorial"

    # Send JSON data
    scrapling extract post "https://api.site.com" response.json --json '{"username": "test", "action": "search"}'
    ```
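The difference between the two request bodies above can be seen with Python's standard library. This is a sketch of the two encodings themselves, not Scrapling code:

```python
import json
from urllib.parse import parse_qs

# --data sends a form-encoded body: key=value pairs joined with "&"
form_body = "query=python&type=tutorial"
print(parse_qs(form_body))  # {'query': ['python'], 'type': ['tutorial']}

# --json sends a JSON body, which the server parses as structured data
json_body = '{"username": "test", "action": "search"}'
print(json.loads(json_body))  # {'username': 'test', 'action': 'search'}
```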
    Get the available options for the command with `scrapling extract post --help` as follows:

    ```bash
    Usage: scrapling extract post [OPTIONS] URL OUTPUT_FILE

      Perform a POST request and save the content to a file.

      The output file path can be an HTML file, a Markdown file of the HTML
      content, or the text content itself. Use file extensions
      (`.html`/`.md`/`.txt`) respectively.

    Options:
      -d, --data TEXT           Form data to include in the request body (as string, ex: "param1=value1&param2=value2")
      -j, --json TEXT           JSON data to include in the request body (as string)
      -H, --headers TEXT        HTTP headers in format "Key: Value" (can be used multiple times)
      --cookies TEXT            Cookies string in format "name1=value1;name2=value2"
      --timeout INTEGER         Request timeout in seconds (default: 30)
      --proxy TEXT              Proxy URL in format "http://username:password@host:port"
      -s, --css-selector TEXT   CSS selector to extract specific content from the page. It returns all matches.
      -p, --params TEXT         Query parameters in format "key=value" (can be used multiple times)
      --follow-redirects / --no-follow-redirects
                                Whether to follow redirects (default: True)
      --verify / --no-verify    Whether to verify SSL certificates (default: True)
      --impersonate TEXT        Browser to impersonate (e.g., chrome, firefox).
      --stealthy-headers / --no-stealthy-headers
                                Use stealthy browser headers (default: True)
      --help                    Show this message and exit.
    ```
3. **PUT Request**

    ```bash
    scrapling extract put [URL] [OUTPUT_FILE] [OPTIONS]
    ```

    **Examples:**

    ```bash
    # Send form data
    scrapling extract put "https://scrapling.requestcatcher.com/put" results.html --data "update=info" --impersonate "firefox"

    # Send JSON data
    scrapling extract put "https://scrapling.requestcatcher.com/put" response.json --json '{"username": "test", "action": "search"}'
    ```

    Get the available options for the command with `scrapling extract put --help` as follows:

    ```bash
    Usage: scrapling extract put [OPTIONS] URL OUTPUT_FILE

      Perform a PUT request and save the content to a file.

      The output file path can be an HTML file, a Markdown file of the HTML
      content, or the text content itself. Use file extensions
      (`.html`/`.md`/`.txt`) respectively.

    Options:
      -d, --data TEXT           Form data to include in the request body
      -j, --json TEXT           JSON data to include in the request body (as string)
      -H, --headers TEXT        HTTP headers in format "Key: Value" (can be used multiple times)
      --cookies TEXT            Cookies string in format "name1=value1;name2=value2"
      --timeout INTEGER         Request timeout in seconds (default: 30)
      --proxy TEXT              Proxy URL in format "http://username:password@host:port"
      -s, --css-selector TEXT   CSS selector to extract specific content from the page. It returns all matches.
      -p, --params TEXT         Query parameters in format "key=value" (can be used multiple times)
      --follow-redirects / --no-follow-redirects
                                Whether to follow redirects (default: True)
      --verify / --no-verify    Whether to verify SSL certificates (default: True)
      --impersonate TEXT        Browser to impersonate (e.g., chrome, firefox).
      --stealthy-headers / --no-stealthy-headers
                                Use stealthy browser headers (default: True)
      --help                    Show this message and exit.
    ```
4. **DELETE Request**

    ```bash
    scrapling extract delete [URL] [OUTPUT_FILE] [OPTIONS]
    ```

    **Examples:**

    ```bash
    # Basic DELETE request
    scrapling extract delete "https://scrapling.requestcatcher.com/delete" results.html

    # DELETE request while impersonating Chrome
    scrapling extract delete "https://scrapling.requestcatcher.com/" response.txt --impersonate "chrome"
    ```

    Get the available options for the command with `scrapling extract delete --help` as follows:

    ```bash
    Usage: scrapling extract delete [OPTIONS] URL OUTPUT_FILE

      Perform a DELETE request and save the content to a file.

      The output file path can be an HTML file, a Markdown file of the HTML
      content, or the text content itself. Use file extensions
      (`.html`/`.md`/`.txt`) respectively.

    Options:
      -H, --headers TEXT        HTTP headers in format "Key: Value" (can be used multiple times)
      --cookies TEXT            Cookies string in format "name1=value1;name2=value2"
      --timeout INTEGER         Request timeout in seconds (default: 30)
      --proxy TEXT              Proxy URL in format "http://username:password@host:port"
      -s, --css-selector TEXT   CSS selector to extract specific content from the page. It returns all matches.
      -p, --params TEXT         Query parameters in format "key=value" (can be used multiple times)
      --follow-redirects / --no-follow-redirects
                                Whether to follow redirects (default: True)
      --verify / --no-verify    Whether to verify SSL certificates (default: True)
      --impersonate TEXT        Browser to impersonate (e.g., chrome, firefox).
      --stealthy-headers / --no-stealthy-headers
                                Use stealthy browser headers (default: True)
      --help                    Show this message and exit.
    ```
### Browser fetching

1. **fetch - Handle Dynamic Content**

    For websites that load content dynamically with JavaScript or have light protection:

    ```bash
    scrapling extract fetch [URL] [OUTPUT_FILE] [OPTIONS]
    ```

    **Examples:**

    ```bash
    # Wait for JavaScript to load content and network activity to finish
    scrapling extract fetch "https://scrapling.requestcatcher.com/" content.md --network-idle

    # Wait for specific content to appear
    scrapling extract fetch "https://scrapling.requestcatcher.com/" data.txt --wait-selector ".content-loaded"

    # Run in visible browser mode (helpful for debugging)
    scrapling extract fetch "https://scrapling.requestcatcher.com/" page.html --no-headless --disable-resources
    ```
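Conceptually, `--wait-selector` keeps polling the page until the selector matches or the timeout (in milliseconds) expires. A minimal sketch of that polling loop, where `selector_matches` is a hypothetical stand-in for a real page check:

```python
import time


def wait_for(condition, timeout_ms: int = 30000, poll_ms: int = 100) -> bool:
    """Poll `condition` until it returns True or `timeout_ms` elapses."""
    deadline = time.monotonic() + timeout_ms / 1000
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_ms / 1000)
    return False


# Simulated page check: the selector "matches" on the third poll
calls = {"n": 0}
def selector_matches() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_for(selector_matches, timeout_ms=2000, poll_ms=10))  # True
```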
    Get the available options for the command with `scrapling extract fetch --help` as follows:

    ```bash
    Usage: scrapling extract fetch [OPTIONS] URL OUTPUT_FILE

      Use DynamicFetcher to fetch content with browser automation.

      The output file path can be an HTML file, a Markdown file of the HTML
      content, or the text content itself. Use file extensions
      (`.html`/`.md`/`.txt`) respectively.

    Options:
      --headless / --no-headless   Run browser in headless mode (default: True)
      --disable-resources / --enable-resources
                                   Drop unnecessary resources for speed boost (default: False)
      --network-idle / --no-network-idle
                                   Wait for network idle (default: False)
      --timeout INTEGER            Timeout in milliseconds (default: 30000)
      --wait INTEGER               Additional wait time in milliseconds after page load (default: 0)
      -s, --css-selector TEXT      CSS selector to extract specific content from the page. It returns all matches.
      --wait-selector TEXT         CSS selector to wait for before proceeding
      --locale TEXT                Specify user locale. Defaults to the system default locale.
      --real-chrome / --no-real-chrome
                                   If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
      --proxy TEXT                 Proxy URL in format "http://username:password@host:port"
      -H, --extra-headers TEXT     Extra headers in format "Key: Value" (can be used multiple times)
      --help                       Show this message and exit.
    ```
2. **stealthy-fetch - Bypass Protection**

    For websites with anti-bot measures or Cloudflare protection:

    ```bash
    scrapling extract stealthy-fetch [URL] [OUTPUT_FILE] [OPTIONS]
    ```

    **Examples:**

    ```bash
    # Bypass basic protection
    scrapling extract stealthy-fetch "https://scrapling.requestcatcher.com" content.md

    # Solve Cloudflare challenges
    scrapling extract stealthy-fetch "https://nopecha.com/demo/cloudflare" data.txt --solve-cloudflare --css-selector "#padded_content a"

    # Use a proxy for anonymity
    scrapling extract stealthy-fetch "https://site.com" content.md --proxy "http://proxy-server:8080"
    ```
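The documented proxy format (`http://username:password@host:port`) is a standard URL, so Python's standard library shows what each piece means. An illustrative sketch with made-up credentials, not Scrapling code:

```python
from urllib.parse import urlsplit

# Hypothetical proxy URL in the documented format
proxy = urlsplit("http://myuser:secret@proxy-server:8080")

print(proxy.scheme)    # http
print(proxy.username)  # myuser
print(proxy.password)  # secret
print(proxy.hostname)  # proxy-server
print(proxy.port)      # 8080
```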
    Get the available options for the command with `scrapling extract stealthy-fetch --help` as follows:

    ```bash
    Usage: scrapling extract stealthy-fetch [OPTIONS] URL OUTPUT_FILE

      Use StealthyFetcher to fetch content with advanced stealth features.

      The output file path can be an HTML file, a Markdown file of the HTML
      content, or the text content itself. Use file extensions
      (`.html`/`.md`/`.txt`) respectively.

    Options:
      --headless / --no-headless   Run browser in headless mode (default: True)
      --disable-resources / --enable-resources
                                   Drop unnecessary resources for speed boost (default: False)
      --block-webrtc / --allow-webrtc
                                   Block WebRTC entirely (default: False)
      --solve-cloudflare / --no-solve-cloudflare
                                   Solve Cloudflare challenges (default: False)
      --allow-webgl / --block-webgl
                                   Allow WebGL (default: True)
      --network-idle / --no-network-idle
                                   Wait for network idle (default: False)
      --real-chrome / --no-real-chrome
                                   If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
      --timeout INTEGER            Timeout in milliseconds (default: 30000)
      --wait INTEGER               Additional wait time in milliseconds after page load (default: 0)
      -s, --css-selector TEXT      CSS selector to extract specific content from the page. It returns all matches.
      --wait-selector TEXT         CSS selector to wait for before proceeding
      --hide-canvas / --show-canvas
                                   Add noise to canvas operations (default: False)
      --proxy TEXT                 Proxy URL in format "http://username:password@host:port"
      -H, --extra-headers TEXT     Extra headers in format "Key: Value" (can be used multiple times)
      --help                       Show this message and exit.
    ```
## When to use each command

If you are not a web scraping expert and can't decide which command to choose, use the following rule of thumb:

- Use **`get`** for simple websites, blogs, or news articles.
- Use **`fetch`** for modern web apps or sites with dynamic content.
- Use **`stealthy-fetch`** for protected sites, Cloudflare, or anti-bot systems.
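The rule of thumb above can be expressed as a tiny decision function. A sketch, where the two boolean inputs are judgments you make about the target site:

```python
def pick_command(dynamic_content: bool, has_protection: bool) -> str:
    """Map rough site characteristics to the recommended extract command."""
    if has_protection:
        return "stealthy-fetch"   # anti-bot systems or Cloudflare
    if dynamic_content:
        return "fetch"            # JavaScript-rendered content
    return "get"                  # plain HTML pages


print(pick_command(dynamic_content=False, has_protection=False))  # get
print(pick_command(dynamic_content=True, has_protection=False))   # fetch
print(pick_command(dynamic_content=True, has_protection=True))    # stealthy-fetch
```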
## Legal and Ethical Considerations

⚠️ **Important Guidelines:**

- **Check robots.txt**: Visit `https://website.com/robots.txt` to see the site's scraping rules
- **Respect rate limits**: Don't overwhelm servers with requests
- **Terms of Service**: Read and comply with website terms
- **Copyright**: Respect intellectual property rights
- **Privacy**: Be mindful of personal data protection laws
- **Commercial use**: Ensure you have permission for business purposes
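Python's standard library can check robots.txt rules for you before scraping. A minimal offline sketch, where the rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration
rules = """User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("MyBot 1.0", "https://website.com/blog"))       # True
print(parser.can_fetch("MyBot 1.0", "https://website.com/private/x"))  # False
```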
---

*Happy scraping! Remember to always respect website policies and comply with all applicable laws and regulations.*