# `scrapeURL`
New URL scraper for Firecrawl
## Signal flow
```mermaid
flowchart TD;
scrapeURL-.->buildFallbackList;
buildFallbackList-.->scrapeURLWithEngine;
scrapeURLWithEngine-.->parseMarkdown;
parseMarkdown-.->wasScrapeSuccessful{{Was scrape successful?}};
wasScrapeSuccessful-."No".->areEnginesLeft{{Are there engines left to try?}};
areEnginesLeft-."Yes, try next engine".->scrapeURLWithEngine;
areEnginesLeft-."No".->NoEnginesLeftError[/NoEnginesLeftError/]
wasScrapeSuccessful-."Yes".->asd;
```
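The loop in the diagram can be sketched roughly as follows. This is a minimal illustration, not the actual Firecrawl code: the `Engine` type, the `scrapeWithFallback` function, and the stand-in `parseMarkdown` body are assumptions made for the example. Only the names `parseMarkdown` and `NoEnginesLeftError` come from the flowchart.

```typescript
// Hypothetical engine shape; the real engine interface may differ.
type Engine = {
  name: string;
  scrape: (url: string) => Promise<string>; // resolves to raw HTML
};

// Thrown once every engine in the fallback list has been tried.
class NoEnginesLeftError extends Error {
  readonly tried: string[];
  constructor(tried: string[]) {
    super(`All scraping engines failed: ${tried.join(", ")}`);
    this.name = "NoEnginesLeftError";
    this.tried = tried;
  }
}

// Stand-in for the parseMarkdown step: strips tags and trims whitespace.
function parseMarkdown(html: string): string {
  return html.replace(/<[^>]+>/g, "").trim();
}

// Tries each engine in order until one yields non-empty markdown.
async function scrapeWithFallback(
  url: string,
  engines: Engine[],
): Promise<{ engine: string; markdown: string }> {
  const tried: string[] = [];
  for (const engine of engines) {
    tried.push(engine.name);
    try {
      const html = await engine.scrape(url);
      const markdown = parseMarkdown(html);
      // "Was scrape successful?" is assumed here to mean non-empty output.
      if (markdown.length > 0) {
        return { engine: engine.name, markdown };
      }
    } catch {
      // An engine error counts as a failed attempt; try the next engine.
    }
  }
  throw new NoEnginesLeftError(tried);
}
```

Here an engine that throws and an engine that returns empty content are treated the same way: both fall through to the next entry in the list. The real success check is likely stricter than "non-empty markdown".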
## Differences from `WebScraperDataProvider`
- The job of `WebScraperDataProvider.validateInitialUrl` has been delegated to the zod layer above `scrapeURL`.
- `WebScraperDataProvider.mode` has no equivalent; only `scrape_url` is supported.
- You may no longer specify multiple URLs.
- Built on `v1` definitions instead of `v0`.
- PDFs are now converted straight to markdown using LlamaParse, instead of only to plain text.
- DOCX files are now converted straight to HTML (and later to markdown) using mammoth, instead of only to plain text.
- Uses the new OpenAI JSON Schema API, so schema failures with LLM Extract should be practically nonexistent.
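
To illustrate the last point, the sketch below builds a Chat Completions request body that uses OpenAI's `json_schema` response format with `strict: true`, which constrains the model's output to the given schema. The `buildExtractRequest` helper, the example schema, the system prompt, and the model name are assumptions made for illustration; the request `scrapeURL` actually sends may look different.

```typescript
// Hypothetical example schema. Strict mode requires every property to be
// listed in `required` and `additionalProperties` to be false.
const productSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    price: { type: "number" },
  },
  required: ["title", "price"],
  additionalProperties: false,
};

// Builds the request body only; sending it to the API is left out.
function buildExtractRequest(markdown: string, schema: object) {
  return {
    model: "gpt-4o-mini", // assumption: any model with structured-output support
    messages: [
      { role: "system", content: "Extract the requested fields from the page." },
      { role: "user", content: markdown },
    ],
    response_format: {
      type: "json_schema",
      json_schema: { name: "extract", strict: true, schema },
    },
  };
}
```

Because the schema is enforced during generation rather than validated afterwards, a response that parses but does not match the schema should be rare.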