https://huggingface.co/datasets/Wissotsky/HebrewRecipes
---
task_categories:
- text-generation
- feature-extraction
language:
- he
tags:
- cooking
- art
- recipes
- schema.org
- json-ld
pretty_name: Hebrew Recipes
size_categories:
- 1K<n<10K
---
# Dataset Card for Hebrew Recipes
A dataset of recipes scraped from Israeli recipe websites.
## Dataset Details
### Dataset Description
The dataset contains recipes scraped from multiple Israeli recipe websites, including [sugat.com](https://www.sugat.com/) and [hashulchan.co.il](https://hashulchan.co.il/). It includes structured JSON-LD data conforming to schema.org Recipe specifications, cleaned HTML from the printable recipe views, and various metadata for each recipe URL.
- **Curated by:** @Wissotsky
- **Language:** Hebrew
### Dataset Sources
- [Sugat Recipes](https://www.sugat.com/recipes/)
- [Hashulchan Recipes](https://hashulchan.co.il/)
## Dataset Structure
The data is stored in a single Parquet file (`recipes.parquet`) with the following columns:
- **URL** (string): The original URL of the recipe page.
- **Title** (string): The title of the webpage.
- **JsonLd** (string): The "Recipe" schema.org JSON-LD data, if found on the page. Stored as a JSON string.
- **Html** (string): The cleaned, printable HTML content of the recipe.
- **Sitemap** (string): The source sitemap filepath on disk (e.g., `sitemaps/recipes-sitemap.xml`).
- **ScrapeTimestamp** (int64): Unix timestamp indicating when the data for the specific URL was scraped.
- **JsonLdPresent** (bool): A flag that is `true` if a "Recipe" JSON-LD block was successfully found and extracted.
- **HtmlRecipePresent** (bool): A flag that is `true` if the printable HTML block (`#print_area`) was found.
- **HttpStatusCode** (int): The HTTP status code returned when scraping the URL (e.g., 200, 404).
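
The columns above can be consumed roughly as follows. This is a minimal sketch, not an official loader: the helper names `parse_recipe_row` and `scraped_at` and the sample row are hypothetical, and it assumes `ScrapeTimestamp` is expressed in seconds, which the card does not state explicitly.

```python
import json
from datetime import datetime, timezone


def parse_recipe_row(row):
    """Return the decoded schema.org Recipe dict for one row, or None.

    Rows without a successful fetch or without a Recipe JSON-LD block are skipped.
    """
    if not row.get("JsonLdPresent") or row.get("HttpStatusCode") != 200:
        return None
    try:
        # JsonLd is stored as a JSON string, so it has to be decoded first.
        return json.loads(row["JsonLd"])
    except (TypeError, json.JSONDecodeError):
        return None


def scraped_at(row):
    """Convert ScrapeTimestamp to an aware UTC datetime (assumes seconds)."""
    return datetime.fromtimestamp(row["ScrapeTimestamp"], tz=timezone.utc)


# Hypothetical row shaped like the columns described above.
sample_row = {
    "URL": "https://example.com/recipe/1",
    "Title": "Example recipe",
    "JsonLd": json.dumps({"@type": "Recipe", "name": "Shakshuka",
                          "recipeIngredient": ["eggs", "tomatoes"]}),
    "Html": "<div id=\"print_area\">...</div>",
    "Sitemap": "sitemaps/recipes-sitemap.xml",
    "ScrapeTimestamp": 1700000000,
    "JsonLdPresent": True,
    "HtmlRecipePresent": True,
    "HttpStatusCode": 200,
}

recipe = parse_recipe_row(sample_row)
print(recipe["name"], scraped_at(sample_row).isoformat())
```

With pandas and a Parquet engine such as pyarrow installed, real rows can be obtained with `pd.read_parquet("recipes.parquet").to_dict("records")` and passed to the same helper.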
## Dataset Creation
### Data Collection and Processing
The data was collected with a scraper written in Go using the [gocolly](https://github.com/gocolly/colly) library. It iterates over the recipe URLs found in the XML sitemaps and extracts the page title, the "Recipe" JSON-LD block, and the printable recipe HTML from each page. The HTML is cleaned to match the site's print view. The data is then saved to a Parquet file (`recipes.parquet`) with Zstandard compression.
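
The JSON-LD extraction step can be illustrated with a short Python sketch. This is not the author's Go scraper: it uses only the Python standard library, the names `JsonLdCollector` and `find_recipe` are hypothetical, and the handling of list-valued blocks and `@graph` containers is an assumption about how Recipe nodes may be embedded in a page.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buf))


def _is_recipe(node):
    # @type may be a single string or a list of type names.
    node_type = node.get("@type")
    return node_type == "Recipe" or (
        isinstance(node_type, list) and "Recipe" in node_type
    )


def find_recipe(html):
    """Return the first schema.org Recipe node found in the page, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            graph = item.get("@graph") or []
            nodes = [item] + [n for n in graph if isinstance(n, dict)]
            for node in nodes:
                if _is_recipe(node):
                    return node
    return None


page = (
    '<html><head><title>Example</title>'
    '<script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Recipe", "name": "Shakshuka"}'
    '</script></head><body><div id="print_area">...</div></body></html>'
)
found = find_recipe(page)
json_ld_present = found is not None  # mirrors the JsonLdPresent column
print(json_ld_present, found["name"] if found else None)
```

Serializing the returned node with `json.dumps` yields a string comparable to the `JsonLd` column, and the boolean corresponds to `JsonLdPresent`.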