# Scrapling Interactive Shell Guide
<script src="https://asciinema.org/a/736339.js" id="asciicast-736339" async data-autoplay="1" data-loop="1" data-cols="225" data-rows="40" data-start-at="00:06" data-speed="1.5"></script>
**Powerful Web Scraping REPL for Developers and Data Scientists**
The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for web scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools like curl command conversion.
> 💡 **Prerequisites:**
>
> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.
> 4. You’ve completed or read at least one page from the fetchers section to use here for requests: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md).
## Why use the Interactive Shell?
The interactive shell transforms web scraping from a slow script-and-run cycle into a fast, exploratory experience. It's perfect for:
- **Rapid prototyping**: Test scraping strategies instantly
- **Data exploration**: Interactively navigate and extract from websites
- **Learning Scrapling**: Experiment with features in real-time
- **Debugging scrapers**: Step through requests and inspect results
- **Converting workflows**: Transform curl commands from browser DevTools into Fetcher requests in one line
## Getting Started
### Launch the Shell
```bash
# Start the interactive shell
scrapling shell
# Execute code and exit (useful for scripting)
scrapling shell -c "get('https://quotes.toscrape.com'); print(len(page.css('.quote')))"
# Set logging level
scrapling shell --loglevel info
```
Once launched, you'll see the Scrapling banner and can immediately start scraping as the video above shows:
```python
# No imports needed - everything is ready!
>>> get('https://news.ycombinator.com')
>>> # Explore the page structure
>>> page.css('a')[:5] # Look at first 5 links
>>> # Refine your selectors
>>> stories = page.css('.titleline>a')
>>> len(stories)
30
>>> # Extract specific data
>>> for story in stories[:3]:
... title = story.text
... url = story['href']
... print(f"{title}: {url}")
>>> # Try different approaches
>>> titles = page.css('.titleline>a::text') # Direct text extraction
>>> urls = page.css('.titleline>a::attr(href)') # Direct attribute extraction
```
## Built-in Shortcuts
The shell provides convenient shortcuts that eliminate boilerplate code:
- **`get(url, **kwargs)`** - HTTP GET request (instead of `Fetcher.get`)
- **`post(url, **kwargs)`** - HTTP POST request (instead of `Fetcher.post`)
- **`put(url, **kwargs)`** - HTTP PUT request (instead of `Fetcher.put`)
- **`delete(url, **kwargs)`** - HTTP DELETE request (instead of `Fetcher.delete`)
- **`fetch(url, **kwargs)`** - Browser-based fetch (instead of `DynamicFetcher.fetch`)
- **`stealthy_fetch(url, **kwargs)`** - Stealthy browser fetch (instead of `StealthyFetcher.fetch`)
The most commonly used classes are automatically available without any import, including `Fetcher`, `AsyncFetcher`, `DynamicFetcher`, `StealthyFetcher`, and `Selector`.
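Conceptually, these shortcuts are just pre-bound aliases for the corresponding fetcher methods, so every keyword argument the underlying method accepts passes straight through. A minimal sketch of the idea, using a hypothetical stand-in class rather than Scrapling's actual internals:

```python
class DemoFetcher:
    """Stand-in class; only illustrates the aliasing idea, not Scrapling's internals."""
    @staticmethod
    def get(url, **kwargs):
        return f"GET {url} {kwargs}"

# The shell simply exposes short names bound to the fetcher methods,
# so every keyword argument passes straight through unchanged:
get = DemoFetcher.get

print(get("https://example.com", timeout=30))
# GET https://example.com {'timeout': 30}
```

This is why `get('https://...', timeout=30)` in the shell behaves exactly like `Fetcher.get('https://...', timeout=30)` in a script.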
### Smart Page Management
The shell automatically tracks your requests and pages:
- **Current Page Access**
The `page` and `response` variables are automatically updated with the last fetched page:
```python
>>> get('https://quotes.toscrape.com')
>>> # 'page' and 'response' both refer to the last fetched page
>>> page.url
'https://quotes.toscrape.com'
>>> response.status # Same as page.status
200
```
- **Page History**
The `pages` variable keeps track of the last five pages (it's a `Selectors` object):
```python
>>> get('https://site1.com')
>>> get('https://site2.com')
>>> get('https://site3.com')
>>> # Access last 5 pages
>>> len(pages) # `Selectors` object with `page` history
3
>>> pages[0].url # First page in history
'https://site1.com'
>>> pages[-1].url # Most recent page
'https://site3.com'
>>> # Work with historical pages
>>> for i, old_page in enumerate(pages):
... print(f"Page {i}: {old_page.url} - {old_page.status}")
```
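The five-page cap behaves like a fixed-size queue: once a sixth page is fetched, the oldest entry is dropped. Conceptually (this is not Scrapling's actual implementation), that behavior can be modeled with `collections.deque`:

```python
from collections import deque

pages = deque(maxlen=5)  # bounded history: oldest entries are evicted

for n in range(1, 8):  # simulate fetching 7 pages
    pages.append(f"https://site{n}.com")

print(len(pages))  # 5 - capped at the five most recent
print(pages[0])    # 'https://site3.com' - sites 1 and 2 were evicted
print(pages[-1])   # 'https://site7.com' - most recent
```

So after a long shell session, `pages[0]` is not necessarily the first page you ever fetched, only the oldest of the last five.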
## Additional helpful commands
### Page Visualization
View scraped pages in your browser:
```python
>>> get('https://quotes.toscrape.com')
>>> view(page) # Opens the page HTML in your default browser
```
### Curl Command Integration
The shell provides two helper functions, `uncurl` and `curl2fetcher`, which convert curl commands copied from the browser DevTools into `Fetcher` requests. First, copy a request as a curl command like the following:
<img src="../../assets/scrapling_shell_curl.png" title="Copying a request as a curl command from Chrome" alt="Copying a request as a curl command from Chrome" style="width: 70%;"/>
- **Convert Curl command to Request Object**
```python
>>> curl_cmd = '''curl 'https://scrapling.requestcatcher.com/post' \
... -X POST \
... -H 'Content-Type: application/json' \
... -d '{"name": "test", "value": 123}' '''
>>> request = uncurl(curl_cmd)
>>> request.method
'post'
>>> request.url
'https://scrapling.requestcatcher.com/post'
>>> request.headers
{'Content-Type': 'application/json'}
```
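Under the hood, turning a curl command into a request object amounts to tokenizing the shell command and mapping flags like `-X`, `-H`, and `-d` onto request fields. A rough standard-library sketch of that idea (a hypothetical illustration, not Scrapling's `uncurl` implementation):

```python
import shlex

def parse_curl(cmd: str) -> dict:
    """Tiny curl parser: handles -X, -H, and -d/--data only."""
    # Join any literal backslash-newline continuations, then tokenize
    tokens = shlex.split(cmd.replace("\\\n", " "))
    req = {"method": "get", "url": None, "headers": {}, "data": None}
    it = iter(tokens[1:])  # skip the leading 'curl'
    for tok in it:
        if tok == "-X":
            req["method"] = next(it).lower()
        elif tok == "-H":
            name, _, value = next(it).partition(":")
            req["headers"][name.strip()] = value.strip()
        elif tok in ("-d", "--data"):
            req["data"] = next(it)
        elif not tok.startswith("-"):
            req["url"] = tok
    return req

cmd = """curl 'https://scrapling.requestcatcher.com/post' \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"name": "test", "value": 123}'"""
req = parse_curl(cmd)
print(req["method"])   # post
print(req["url"])      # https://scrapling.requestcatcher.com/post
print(req["headers"])  # {'Content-Type': 'application/json'}
```

Real curl commands from DevTools carry many more flags (`--compressed`, `-b` for cookies, etc.), which is exactly the boilerplate `uncurl` and `curl2fetcher` handle for you.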
- **Execute Curl Command Directly**
```python
>>> # Convert and execute in one step
>>> curl2fetcher(curl_cmd)
>>> page.status
200
>>> page.json()['json']
{'name': 'test', 'value': 123}
```
### IPython Features
The shell inherits all IPython capabilities:
```python
>>> # Magic commands
>>> %time page = get('https://example.com') # Time execution
>>> %history # Show command history
>>> %save filename.py 1-10 # Save commands 1-10 to file
>>> # Tab completion works everywhere
>>> page.c<TAB> # Shows: css, css_first, cookies, etc.
>>> Fetcher.<TAB> # Shows all Fetcher methods
>>> # Object inspection
>>> get? # Show get documentation
```
## Examples
Here are a few examples generated via AI:
#### E-commerce Data Collection
```python
>>> # Start with product listing page
>>> catalog = get('https://shop.example.com/products')
>>> # Find product links
>>> product_links = catalog.css('.product-link::attr(href)')
>>> print(f"Found {len(product_links)} products")
>>> # Sample a few products first
>>> for link in product_links[:3]:
... product = get(f"https://shop.example.com{link}")
... name = product.css('.product-name::text').get('')
... price = product.css('.price::text').get('')
... print(f"{name}: {price}")
>>> # Scale up with sessions for efficiency
>>> from scrapling.fetchers import FetcherSession
>>> with FetcherSession() as session:
... products = []
... for link in product_links:
... product = session.get(f"https://shop.example.com{link}")
... products.append({
... 'name': product.css('.product-name::text').get(''),
... 'price': product.css('.price::text').get(''),
... 'url': link
... })
```
#### API Integration and Testing
```python
>>> # Test API endpoints interactively
>>> response = get('https://jsonplaceholder.typicode.com/posts/1')
>>> response.json()
{'userId': 1, 'id': 1, 'title': 'sunt aut...', 'body': 'quia et...'}
>>> # Test POST requests
>>> new_post = post('https://jsonplaceholder.typicode.com/posts',
... json={'title': 'Test Post', 'body': 'Test content', 'userId': 1})
>>> new_post.json()['id']
101
>>> # Test with different data
>>> updated = put(f'https://jsonplaceholder.typicode.com/posts/{new_post.json()["id"]}',
... json={'title': 'Updated Title'})
```
## Getting Help
If you need help beyond what's available in the terminal, you can check:
- [Scrapling Documentation](https://scrapling.readthedocs.io/)
- [Discord Community](https://discord.gg/EMgGbDceNQ)
- [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues)
And that's it! Happy scraping! The shell makes web scraping as easy as a conversation.