# HTTP requests

The `Fetcher` class provides fast, lightweight HTTP requests through the high-performance `curl_cffi` library, with many stealth capabilities.
!!! success "Prerequisites"

    1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
    2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
    3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
## Basic Usage

There is one primary way to import this fetcher, and it's the same for all fetchers:

```python
>>> from scrapling.fetchers import Fetcher
```

Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers).
### Shared arguments

All methods for making requests here share some arguments, so let's discuss them first.
- **url**: The targeted URL.
- **stealthy_headers**: If enabled (default), creates and adds real browser headers. It also sets the referer header as if the request came from a Google search of the URL's domain.
- **follow_redirects**: As the name implies, tells the fetcher to follow redirects. **Enabled by default.**
- **timeout**: The number of seconds to wait for each request to finish. **Defaults to 30 seconds.**
- **retries**: The number of times the fetcher will retry a failed request. **Defaults to three retries.**
- **retry_delay**: The number of seconds to wait between retry attempts. **Defaults to 1 second.**
- **impersonate**: Impersonate specific browsers' TLS fingerprints. Accepts a browser string or a list of them: use `"chrome110"`, `"firefox102"`, or `"safari15_5"` for specific versions, or `"chrome"`, `"firefox"`, `"safari"`, or `"edge"` to automatically use the latest version available. This makes your requests appear to come from real browsers at the TLS level. If you pass a list of strings, a random one is chosen for each request. **Defaults to the latest available Chrome version.**
- **http3**: Use the HTTP/3 protocol for requests. **Defaults to False.** It might be problematic when combined with `impersonate`.
- **cookies**: Cookies to use in the request. Can be a dictionary of `name→value` pairs or a list of dictionaries.
- **proxy**: As the name implies, the proxy used to route all of this request's traffic (HTTP and HTTPS). The accepted format is `http://username:password@localhost:8030`.
- **proxy_auth**: HTTP basic auth for the proxy, as a tuple of (username, password).
- **proxies**: A dictionary of proxies to use. Format: `{"http": proxy_url, "https": proxy_url}`.
- **proxy_rotator**: A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy` or `proxies`.
- **headers**: Headers to include in the request. Can override any header generated by the `stealthy_headers` argument.
- **max_redirects**: The maximum number of redirects. **Defaults to 30**; use -1 for unlimited.
- **verify**: Whether to verify HTTPS certificates. **Defaults to True.**
- **cert**: A tuple of (cert, key) filenames for the client certificate.
- **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` object.
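The interplay of `retries` and `retry_delay` can be pictured with a plain-Python sketch. This is only an illustration of the documented behavior — the function and variable names below are made up and are not Scrapling's internals:

```python
import time

def fetch_with_retries(make_request, retries=3, retry_delay=1):
    """Call `make_request` until it succeeds, pausing between failed attempts.

    Mirrors the documented defaults: three retries with a one-second delay.
    """
    last_error = None
    for attempt in range(retries + 1):  # the first try, plus `retries` retries
        try:
            return make_request()
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(retry_delay)
    raise last_error

# A fake request that fails twice, then succeeds:
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = fetch_with_retries(flaky, retries=3, retry_delay=0)
print(result, calls["count"])  # ok 3
```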
!!! note "Notes:"

    1. The browsers currently available to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`).<br/>
    2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.<br/>
    3. If either the `impersonate` or `stealthy_headers` argument is enabled, the fetchers will automatically generate real browser headers that match the browser version used.

Beyond these, for further customization, you can pass any argument that `curl_cffi` supports to any method, as long as that method doesn't already support it.
### HTTP Methods

Each method takes additional arguments depending on the HTTP verb, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
Examples are the best way to explain this:

> Note: the `OPTIONS` and `HEAD` methods are not supported.
#### GET

```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic GET
>>> page = Fetcher.get('https://example.com')
>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
>>> # With parameters
>>> page = Fetcher.get('https://example.com/search', params={'q': 'query'})
>>>
>>> # With headers
>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
>>> # Browser impersonation
>>> page = Fetcher.get('https://example.com', impersonate='chrome')
>>> # HTTP/3 support
>>> page = Fetcher.get('https://example.com', http3=True)
```

And for asynchronous requests, it's a small adjustment:

```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic GET
>>> page = await AsyncFetcher.get('https://example.com')
>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
>>> # With parameters
>>> page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})
>>>
>>> # With headers
>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
>>> # Browser impersonation
>>> page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
>>> # HTTP/3 support
>>> page = await AsyncFetcher.get('https://example.com', http3=True)
```
In all cases, the `page` object is a [Response](choosing.md#response-object) object, which, as we said, is also a [Selector](../parsing/main_classes.md#selector), so you can use it directly:

```python
>>> page.css('.something.something')
>>> page = Fetcher.get('https://api.github.com/events')
>>> page.json()
[{'id': '<redacted>',
  'type': 'PushEvent',
  'actor': {'id': '<redacted>',
   'login': '<redacted>',
   'display_login': '<redacted>',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/<redacted>',
   'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'},
  'repo': {'id': '<redacted>',
...
```
#### POST

```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic POST
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, params={'q': 'query'})
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
>>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
```

And for asynchronous requests, it's a small adjustment:

```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic POST
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
```

#### PUT

```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic PUT
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})
```

And for asynchronous requests, it's a small adjustment:

```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic PUT
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})
```

#### DELETE

```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.delete('https://example.com/resource/123')
>>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```

And for asynchronous requests, it's a small adjustment:

```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
## Session Management

To make multiple requests with the same configuration, use the `FetcherSession` class. It works in both synchronous and asynchronous code without issue; the class automatically detects and switches the session type, so no separate import is needed.

The `FetcherSession` class accepts nearly all the arguments that the methods can take, which lets you set a configuration for the entire session and effortlessly override it for individual requests, as you will see in the following examples.
```python
from scrapling.fetchers import FetcherSession

# Create a session with default configuration
with FetcherSession(
    impersonate='chrome',
    http3=True,
    stealthy_headers=True,
    timeout=30,
    retries=3
) as session:
    # Make multiple requests with the same settings and the same cookies
    page1 = session.get('https://scrapling.requestcatcher.com/get')
    page2 = session.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
    page3 = session.get('https://api.github.com/events')
    # All requests share the same session and connection pool
```

You can also use a `ProxyRotator` with `FetcherSession` for automatic proxy rotation across requests:

```python
from scrapling.fetchers import FetcherSession, ProxyRotator

rotator = ProxyRotator([
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
])

with FetcherSession(proxy_rotator=rotator, impersonate='chrome') as session:
    # Each request automatically uses the next proxy in rotation
    page1 = session.get('https://example.com/page1')
    page2 = session.get('https://example.com/page2')
    # You can check which proxy was used via the response metadata
    print(page1.meta['proxy'])
```
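If it helps to picture the rotation itself, here's a minimal round-robin sketch using only the standard library. This illustrates the rotation behavior, not `ProxyRotator`'s actual implementation, and the `next_proxy` method name is hypothetical:

```python
from itertools import cycle

class RoundRobinRotator:
    """A toy rotator: hands out proxies in a repeating cycle."""

    def __init__(self, proxies):
        self._cycle = cycle(proxies)

    def next_proxy(self):  # hypothetical name, for illustration only
        return next(self._cycle)

rotator = RoundRobinRotator([
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
])

# Four requests: the fourth wraps back around to the first proxy.
picked = [rotator.next_proxy() for _ in range(4)]
print(picked)
```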
You can also override the session proxy (or rotator) for a specific request by passing `proxy=` directly to the request method:

```python
with FetcherSession(proxy='http://default-proxy:8080') as session:
    # Uses the session proxy
    page1 = session.get('https://example.com/page1')
    # Override the proxy for this specific request
    page2 = session.get('https://example.com/page2', proxy='http://special-proxy:9090')
```

And here's an async example:

```python
async with FetcherSession(impersonate='firefox', http3=True) as session:
    # All standard HTTP methods available
    response = await session.get('https://example.com')
    response = await session.post('https://scrapling.requestcatcher.com/post', json={'data': 'value'})
    response = await session.put('https://scrapling.requestcatcher.com/put', data={'update': 'info'})
    response = await session.delete('https://scrapling.requestcatcher.com/delete')
```
Or, better, run the requests concurrently:
```python
import asyncio

from scrapling.fetchers import FetcherSession

# Async session usage
async with FetcherSession(impersonate="safari") as session:
    urls = ['https://example.com/page1', 'https://example.com/page2']
    tasks = [
        session.get(url) for url in urls
    ]
    pages = await asyncio.gather(*tasks)
```

The `Fetcher` class uses `FetcherSession` to create a temporary session with each request you make.
### Session Benefits

- **Much faster**: Up to 10 times faster than creating a new session for each request.
- **Cookie persistence**: Automatic cookie handling across requests.
- **Resource efficiency**: Better memory and CPU usage for multiple requests.
- **Centralized configuration**: A single place to manage request settings.
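Cookie persistence simply means the session remembers cookie values set by earlier responses and sends them with later requests. A bare-bones sketch of that idea (illustrative only — not Scrapling's implementation, and a real session parses `Set-Cookie` headers rather than taking a dict):

```python
class ToySession:
    """Remembers cookies between requests, like a real HTTP session."""

    def __init__(self):
        self.cookies = {}

    def handle_response(self, set_cookies):
        # Simplified: accept already-parsed cookies as a dict.
        self.cookies.update(set_cookies)

    def request_headers(self):
        # Build the Cookie header for the next request.
        if not self.cookies:
            return {}
        cookie_line = '; '.join(f'{k}={v}' for k, v in self.cookies.items())
        return {'Cookie': cookie_line}

session = ToySession()
session.handle_response({'session_id': 'abc123'})  # e.g. set after login
session.handle_response({'theme': 'dark'})
print(session.request_headers())
# {'Cookie': 'session_id=abc123; theme=dark'}
```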
## Examples

Some well-rounded examples to help newcomers to web scraping.
### Basic HTTP Request

```python
from scrapling.fetchers import Fetcher

# Make a request
page = Fetcher.get('https://example.com')

# Check the status
if page.status == 200:
    # Extract the title
    title = page.css('title::text').get()
    print(f"Page title: {title}")

    # Extract all links
    links = page.css('a::attr(href)').getall()
    print(f"Found {len(links)} links")
```

### Product Scraping

```python
from scrapling.fetchers import Fetcher

def scrape_products():
    page = Fetcher.get('https://example.com/products')

    # Find all product elements
    products = page.css('.product')

    results = []
    for product in products:
        results.append({
            'title': product.css('.title::text').get(),
            'price': product.css('.price::text').re_first(r'\d+\.\d{2}'),
            'description': product.css('.description::text').get(),
            'in_stock': product.has_class('in-stock')
        })

    return results
```

### Downloading Files

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/main_cover.png')
with open(file='main_cover.png', mode='wb') as f:
    f.write(page.body)
```
### Pagination Handling

```python
from scrapling.fetchers import Fetcher

def scrape_all_pages():
    base_url = 'https://example.com/products?page={}'
    page_num = 1
    all_products = []

    while True:
        # Get the current page
        page = Fetcher.get(base_url.format(page_num))

        # Find products
        products = page.css('.product')
        if not products:
            break

        # Process products
        for product in products:
            all_products.append({
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            })

        # Next page
        page_num += 1

    return all_products
```

### Form Submission

```python
from scrapling.fetchers import Fetcher

# Submit a login form
response = Fetcher.post(
    'https://example.com/login',
    data={
        'username': 'user@example.com',
        'password': 'password123'
    }
)

# Check login success
if response.status == 200:
    # Extract user info
    user_name = response.css('.user-name::text').get()
    print(f"Logged in as: {user_name}")
```
### Table Extraction

```python
from scrapling.fetchers import Fetcher

def extract_table():
    page = Fetcher.get('https://example.com/data')

    # Find the table
    table = page.css('table')[0]

    # Extract headers
    headers = [
        th.text for th in table.css('thead th')
    ]

    # Extract rows
    rows = []
    for row in table.css('tbody tr'):
        cells = [td.text for td in row.css('td')]
        rows.append(dict(zip(headers, cells)))

    return rows
```

### Navigation Menu

```python
from scrapling.fetchers import Fetcher

def extract_menu():
    page = Fetcher.get('https://example.com')

    # Find the navigation element
    nav = page.css('nav')[0]

    menu = {}
    for item in nav.css('li'):
        links = item.css('a')
        if links:
            link = links[0]
            menu[link.text] = {
                'url': link['href'],
                'has_submenu': bool(item.css('.submenu'))
            }

    return menu
```
## When to Use

Use `Fetcher` when you:

- Need rapid HTTP requests.
- Want minimal overhead.
- Don't need JavaScript execution (the website can be scraped through plain requests).
- Need some stealth features (e.g., the target website uses protection but not JavaScript challenges).
Use `FetcherSession` when you are:

- Making multiple requests to the same or different sites.
- Maintaining cookies/authentication between requests.
- Using connection pooling for better performance.
- Keeping a consistent configuration across requests.
- Working with APIs that require session state.
Use other fetchers when you:

- Need browser automation.
- Need advanced anti-bot/stealth capabilities.
- Need JavaScript support or to interact with dynamic content.