| https://huggingface.co/datasets/Wissotsky/HebrewRecipes | |
| --- | |
| task_categories: | |
| - text-generation | |
| - feature-extraction | |
| language: | |
| - he | |
| tags: | |
| - cooking | |
| - art | |
| - recipes | |
| - schema.org | |
| - json-ld | |
| pretty_name: Hebrew Recipes | |
| size_categories: | |
| - 1K<n<10K | |
| --- | |
| # Dataset Card for Hebrew Recipes | |
| A dataset of recipes scraped from Israeli recipe websites. | |
| ## Dataset Details | |
| ### Dataset Description | |
| The dataset contains recipes scraped from multiple Israeli recipe websites, including [sugat.com](https://www.sugat.com/) and [hashulchan.co.il](https://hashulchan.co.il/). For each recipe URL it includes structured JSON-LD data conforming to the schema.org Recipe specification, cleaned HTML from the printable recipe view, and scrape metadata (source sitemap, scrape timestamp, and HTTP status code). | |
| - **Curated by:** @Wissotsky | |
| - **Language:** Hebrew | |
| ### Dataset Sources | |
| - [Sugat Recipes](https://www.sugat.com/recipes/) | |
| - [Hashulchan Recipes](https://hashulchan.co.il/) | |
| ## Dataset Structure | |
| The data is stored in a single Parquet file (`recipes.parquet`) with the following columns: | |
| - **URL** (string): The original URL of the recipe page. | |
| - **Title** (string): The title of the webpage. | |
| - **JsonLd** (string): The "Recipe" schema.org JSON-LD data, if found on the page. Stored as a JSON string. | |
| - **Html** (string): The cleaned, printable HTML content of the recipe. | |
| - **Sitemap** (string): The on-disk path of the sitemap file in which the URL was found (e.g., `sitemaps/recipes-sitemap.xml`). | |
| - **ScrapeTimestamp** (int64): Unix timestamp indicating when the data for the specific URL was scraped. | |
| - **JsonLdPresent** (bool): A flag that is `true` if a "Recipe" JSON-LD block was successfully found and extracted. | |
| - **HtmlRecipePresent** (bool): A flag that is `true` if the printable HTML block (`#print_area`) was found. | |
| - **HttpStatusCode** (int): The HTTP status code returned when scraping the URL (e.g., 200, 404). | |
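
The flag columns make it possible to filter rows before decoding the `JsonLd` string. The following is a minimal sketch of that idea, not part of the dataset's tooling. It assumes rows are available as plain dictionaries keyed by the column names above (for example via `pandas.read_parquet("recipes.parquet")` followed by `DataFrame.to_dict(orient="records")`), and that each `JsonLd` value decodes to a single JSON object. The helper names and the sample rows are hypothetical.

```python
import json
from typing import Any, Optional


def parse_json_ld(row: dict) -> Optional[dict]:
    """Decode the JsonLd column of one row; return None if absent or invalid."""
    if not row.get("JsonLdPresent") or not row.get("JsonLd"):
        return None
    try:
        data: Any = json.loads(row["JsonLd"])
    except json.JSONDecodeError:
        return None
    # Assumption: the stored value is a single JSON object, not a list.
    return data if isinstance(data, dict) else None


def usable_rows(rows: list) -> list:
    """Keep rows fetched with HTTP 200 that carry JSON-LD or printable HTML."""
    return [
        r for r in rows
        if r.get("HttpStatusCode") == 200
        and (r.get("JsonLdPresent") or r.get("HtmlRecipePresent"))
    ]


# Hypothetical rows mirroring the column layout (not real dataset records).
sample_rows = [
    {
        "URL": "https://example.com/recipes/1",
        "JsonLd": '{"@type": "Recipe", "name": "חומוס"}',
        "JsonLdPresent": True,
        "HtmlRecipePresent": True,
        "HttpStatusCode": 200,
    },
    {
        "URL": "https://example.com/recipes/2",
        "JsonLd": "",
        "JsonLdPresent": False,
        "HtmlRecipePresent": True,
        "HttpStatusCode": 200,
    },
    {
        "URL": "https://example.com/recipes/3",
        "JsonLd": "",
        "JsonLdPresent": False,
        "HtmlRecipePresent": False,
        "HttpStatusCode": 404,
    },
]

if __name__ == "__main__":
    for row in usable_rows(sample_rows):
        recipe = parse_json_ld(row)
        print(row["URL"], recipe.get("name") if recipe else "(HTML only)")
```

Checking `JsonLdPresent` before calling `json.loads` avoids decoding empty strings, and the `HttpStatusCode` filter drops pages that failed to load.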
| ## Dataset Creation | |
| ### Data Collection and Processing | |
| The data was collected with a scraper written in Go using the [gocolly](https://github.com/gocolly/colly) library. It visits the recipe URLs listed in the XML sitemaps and extracts the page title, the "Recipe" JSON-LD block, and the printable recipe HTML from each page. The HTML is cleaned to match the site's print view. The data is then saved to a Parquet file (`recipes.parquet`) with Zstandard compression. |
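
The two extraction steps described above (reading URLs from a sitemap, then locating the "Recipe" JSON-LD block in a page) can be sketched with the Python standard library. This is an illustrative approximation only: the actual scraper is written in Go with gocolly, and its code is not shown here. The function and class names, the sample sitemap, and the sample HTML are all assumptions made for the example, and JSON-LD wrapped in an `@graph` object is not handled.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Standard namespace used by XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def find_recipe_json_ld(html: str):
    """Return the first JSON-LD object whose @type includes Recipe, else None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            if "Recipe" in (types if isinstance(types, list) else [types]):
                return item
    return None


# Hypothetical inputs for illustration (not taken from the scraped sites).
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/recipes/1</loc></url>
  <url><loc>https://example.com/recipes/2</loc></url>
</urlset>"""

SAMPLE_HTML = """<html><head><title>Example</title>
<script type="application/ld+json">{"@type": "Recipe", "name": "Example cake"}</script>
</head><body><div id="print_area">...</div></body></html>"""

if __name__ == "__main__":
    print(sitemap_urls(SAMPLE_SITEMAP))
    print(find_recipe_json_ld(SAMPLE_HTML))
```

A page without a matching block yields `None`, which corresponds to a row where `JsonLdPresent` is `false`.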