Scrapling / docs /cli /interactive-shell.md

Karim shoair

docs: Style updates and a lot of clarifications

3dc1188 5 months ago

	# Scrapling Interactive Shell Guide

	<script src="https://asciinema.org/a/736339.js" id="asciicast-736339" async data-autoplay="1" data-loop="1" data-cols="225" data-rows="40" data-start-at="00:06" data-speed="1.5"></script>

	Powerful Web Scraping REPL for Developers and Data Scientists

	The Scrapling Interactive Shell is an enhanced IPython-based environment designed specifically for Web Scraping tasks. It provides instant access to all Scrapling features, clever shortcuts, automatic page management, and advanced tools like curl command conversion.

	> 💡 Prerequisites:
	>
	> 1. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
	> 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
	> 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.
	> 4. You’ve completed or read at least one page from the fetchers section to use here for requests: [HTTP requests](../fetching/static.md), [Dynamic websites](../fetching/dynamic.md), or [Dynamic websites with hard protections](../fetching/stealthy.md).


	## Why use the Interactive Shell?

	The interactive shell transforms web scraping from a slow script-and-run cycle into a fast, exploratory experience. It's perfect for:

	- Rapid prototyping: Test scraping strategies instantly
	- Data exploration: Interactively navigate and extract from websites
	- Learning Scrapling: Experiment with features in real-time
	- Debugging scrapers: Step through requests and inspect results
	- Converting workflows: Transform curl commands from browser DevTools to a Fetcher request in a one-liner

	## Getting Started

	### Launch the Shell

	```bash
	# Start the interactive shell
	scrapling shell

	# Execute code and exit (useful for scripting)
	scrapling shell -c "get('https://quotes.toscrape.com'); print(len(page.css('.quote')))"

	# Set logging level
	scrapling shell --loglevel info
	```

	Once launched, you'll see the Scrapling banner and can immediately start scraping as the video above shows:

	```python
	# No imports needed - everything is ready!
	>>> get('https://news.ycombinator.com')

	>>> # Explore the page structure
	>>> page.css('a')[:5] # Look at first 5 links

	>>> # Refine your selectors
	>>> stories = page.css('.titleline>a')
	>>> len(stories)
	30

	>>> # Extract specific data
	>>> for story in stories[:3]:
	...     title = story.text
	...     url = story['href']
	...     print(f"{title}: {url}")

	>>> # Try different approaches
	>>> titles = page.css('.titleline>a::text') # Direct text extraction
	>>> urls = page.css('.titleline>a::attr(href)') # Direct attribute extraction
	```

	## Built-in Shortcuts

	The shell provides convenient shortcuts that eliminate boilerplate code:

	- **`get(url, **kwargs)`** - HTTP GET request (instead of `Fetcher.get`)
	- **`post(url, **kwargs)`** - HTTP POST request (instead of `Fetcher.post`)
	- **`put(url, **kwargs)`** - HTTP PUT request (instead of `Fetcher.put`)
	- **`delete(url, **kwargs)`** - HTTP DELETE request (instead of `Fetcher.delete`)
	- **`fetch(url, **kwargs)`** - Browser-based fetch (instead of `DynamicFetcher.fetch`)
	- **`stealthy_fetch(url, **kwargs)`** - Stealthy browser fetch (instead of `StealthyFetcher.fetch`)

	The most commonly used classes are automatically available without any import, including `Fetcher`, `AsyncFetcher`, `DynamicFetcher`, `StealthyFetcher`, and `Selector`.

	### Smart Page Management

	The shell automatically tracks your requests and pages:

	- Current Page Access

	The `page` and `response` variables are automatically updated to point to the last fetched page:

	```python
	>>> get('https://quotes.toscrape.com')
	>>> # 'page' and 'response' both refer to the last fetched page
	>>> page.url
	'https://quotes.toscrape.com'
	>>> response.status # Same as page.status
	200
	```

	- Page History

	The `pages` variable keeps track of the last five pages (it's a `Selectors` object):

	```python
	>>> get('https://site1.com')
	>>> get('https://site2.com')
	>>> get('https://site3.com')

	>>> # Access last 5 pages
	>>> len(pages) # `Selectors` object with `page` history
	3
	>>> pages[0].url # First page in history
	'https://site1.com'
	>>> pages[-1].url # Most recent page
	'https://site3.com'

	>>> # Work with historical pages
	>>> for i, old_page in enumerate(pages):
	...     print(f"Page {i}: {old_page.url} - {old_page.status}")
	```
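The tracking behavior described above can be pictured with a small standard-library model. This is only a sketch of the idea, not Scrapling's implementation: `FakePage` and `ShellState` are hypothetical names, no request is actually sent, and the assumption is that the history simply drops its oldest entry once it holds five pages.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a fetched page; not Scrapling's Response class."""
    url: str
    status: int = 200


class ShellState:
    """Toy model of the shell's page tracking (an assumption, not Scrapling's code)."""

    def __init__(self, history_size: int = 5):
        # A deque with maxlen silently drops the oldest entry when it is full
        self.pages = deque(maxlen=history_size)
        self.page = None

    @property
    def response(self):
        # 'response' is just another name for the current page
        return self.page

    def get(self, url: str) -> FakePage:
        # Pretend to fetch, then record the result as the current page
        self.page = FakePage(url)
        self.pages.append(self.page)
        return self.page


state = ShellState()
for n in range(1, 8):
    state.get(f"https://site{n}.com")

print(len(state.pages))               # 5
print(state.pages[0].url)             # https://site3.com
print(state.pages[-1].url)            # https://site7.com
print(state.response is state.page)   # True
```

After seven simulated requests only the five most recent pages remain, so anything you want to keep from an older page should be stored in its own variable before it rotates out.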

	## Additional helpful commands

	### Page Visualization

	View scraped pages in your browser:

	```python
	>>> get('https://quotes.toscrape.com')
	>>> view(page) # Opens the page HTML in your default browser
	```

	### Curl Command Integration

	The shell provides two functions, `uncurl` and `curl2fetcher`, that convert curl commands copied from the browser DevTools into `Fetcher` requests. First, copy a request as a curl command, as in the following screenshot:

	<img src="../../assets/scrapling_shell_curl.png" title="Copying a request as a curl command from Chrome" alt="Copying a request as a curl command from Chrome" style="width: 70%;"/>

	- Convert Curl command to Request Object

	```python
	>>> curl_cmd = '''curl 'https://scrapling.requestcatcher.com/post' \
	... -X POST \
	... -H 'Content-Type: application/json' \
	... -d '{"name": "test", "value": 123}' '''

	>>> request = uncurl(curl_cmd)
	>>> request.method
	'post'
	>>> request.url
	'https://scrapling.requestcatcher.com/post'
	>>> request.headers
	{'Content-Type': 'application/json'}
	```

	- Execute Curl Command Directly

	```python
	>>> # Convert and execute in one step
	>>> curl2fetcher(curl_cmd)
	>>> page.status
	200
	>>> page.json()['json']
	{'name': 'test', 'value': 123}
	```
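To make the conversion step more concrete, here is a minimal standard-library sketch of how a curl command can be split into a method, URL, headers, and body. It is not how `uncurl` is implemented; `parse_curl` is a hypothetical helper that only understands `-X`, `-H`, and `-d`, whereas a real converter has to handle many more curl options.

```python
import argparse
import json
import shlex


def parse_curl(command: str) -> dict:
    """Extract method, URL, headers, and body from a simple curl command.

    Illustrative only: handles just -X, -H, and -d.
    """
    # Join backslash-newline continuations so shlex sees one command line
    tokens = shlex.split(command.replace("\\\n", " "))
    if tokens and tokens[0] == "curl":
        tokens = tokens[1:]

    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("url")
    parser.add_argument("-X", "--request", default=None)
    parser.add_argument("-H", "--header", action="append", default=[])
    parser.add_argument("-d", "--data", default=None)
    # Unknown options are ignored instead of raising an error
    args, _unknown = parser.parse_known_args(tokens)

    headers = {}
    for line in args.header:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    # curl defaults to POST when -d is given, otherwise GET
    method = (args.request or ("POST" if args.data else "GET")).lower()
    return {"method": method, "url": args.url, "headers": headers, "data": args.data}


curl_cmd = (
    "curl 'https://scrapling.requestcatcher.com/post' "
    "-X POST "
    "-H 'Content-Type: application/json' "
    "-d '{\"name\": \"test\", \"value\": 123}'"
)

request = parse_curl(curl_cmd)
print(request["method"])            # post
print(request["headers"])           # {'Content-Type': 'application/json'}
print(json.loads(request["data"]))  # {'name': 'test', 'value': 123}
```

Inside the shell you would use `uncurl` or `curl2fetcher` instead; the sketch only shows why a copied command carries enough information to rebuild the request.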

	### IPython Features

	The shell inherits all IPython capabilities:

	```python
	>>> # Magic commands
	>>> %time page = get('https://example.com') # Time execution
	>>> %history # Show command history
	>>> %save filename.py 1-10 # Save commands 1-10 to file

	>>> # Tab completion works everywhere
	>>> page.c<TAB> # Shows: css, css_first, cookies, etc.
	>>> Fetcher.<TAB> # Shows all Fetcher methods

	>>> # Object inspection
	>>> get? # Show get documentation
	```

	## Examples

	Here are a few examples generated via AI:

	#### E-commerce Data Collection

	```python
	>>> # Start with product listing page
	>>> catalog = get('https://shop.example.com/products')

	>>> # Find product links
	>>> product_links = catalog.css('.product-link::attr(href)')
	>>> print(f"Found {len(product_links)} products")

	>>> # Sample a few products first
	>>> for link in product_links[:3]:
	...     product = get(f"https://shop.example.com{link}")
	...     name = product.css('.product-name::text').get('')
	...     price = product.css('.price::text').get('')
	...     print(f"{name}: {price}")

	>>> # Scale up with sessions for efficiency
	>>> from scrapling.fetchers import FetcherSession
	>>> with FetcherSession() as session:
	...     products = []
	...     for link in product_links:
	...         product = session.get(f"https://shop.example.com{link}")
	...         products.append({
	...             'name': product.css('.product-name::text').get(''),
	...             'price': product.css('.price::text').get(''),
	...             'url': link
	...         })
	```
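The loops above build product URLs by string concatenation, which only works when every `href` is a root-relative path such as `/item/1`. If a site mixes relative, absolute, and query-only links, the standard library's `urljoin` is one way to resolve them against the listing page. The link values below are made up for illustration.

```python
from urllib.parse import urljoin

# URL of the listing page the links were extracted from
base_url = "https://shop.example.com/products"

# Hypothetical href values, as a '::attr(href)' query might return them
links = ["/item/1", "item/2", "https://cdn.example.com/item/3", "?page=2"]

# Resolve each link the same way a browser would
absolute = [urljoin(base_url, link) for link in links]
for url in absolute:
    print(url)
# https://shop.example.com/item/1
# https://shop.example.com/item/2
# https://cdn.example.com/item/3
# https://shop.example.com/products?page=2
```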

	#### API Integration and Testing

	```python
	>>> # Test API endpoints interactively
	>>> response = get('https://jsonplaceholder.typicode.com/posts/1')
	>>> response.json()
	{'userId': 1, 'id': 1, 'title': 'sunt aut...', 'body': 'quia et...'}

	>>> # Test POST requests
	>>> new_post = post('https://jsonplaceholder.typicode.com/posts',
	... json={'title': 'Test Post', 'body': 'Test content', 'userId': 1})
	>>> new_post.json()['id']
	101

	>>> # Test with different data
	>>> updated = put(f'https://jsonplaceholder.typicode.com/posts/{new_post.json()["id"]}',
	... json={'title': 'Updated Title'})
	```

	## Getting Help

	If you need help beyond what is available in the terminal, see:

	- [Scrapling Documentation](https://scrapling.readthedocs.io/)
	- [Discord Community](https://discord.gg/EMgGbDceNQ)
	- [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues)

	And that's it! Happy scraping! The shell makes web scraping as easy as a conversation.