	https://huggingface.co/datasets/Wissotsky/HebrewRecipes

	---
	task_categories:
	- text-generation
	- feature-extraction
	language:
	- he
	tags:
	- cooking
	- art
	- recipes
	- schema.org
	- json-ld
	pretty_name: Hebrew Recipes
	size_categories:
	- 1K<n<10K
	---

	# Dataset Card for Hebrew Recipes

	A dataset of recipes scraped from Israeli recipe websites.

	## Dataset Details

	### Dataset Description

	The dataset contains recipes scraped from multiple Israeli recipe websites, including [sugat.com](https://www.sugat.com/) and [hashulchan.co.il](https://hashulchan.co.il/). It includes structured JSON-LD data conforming to schema.org Recipe specifications, cleaned HTML from the printable recipe views, and various metadata for each recipe URL.

	- Curated by: @Wissotsky
	- Language: Hebrew

	### Dataset Sources

	- [Sugat Recipes](https://www.sugat.com/recipes/)
	- [Hashulchan Recipes](https://hashulchan.co.il/)

	## Dataset Structure

	The data is stored in a single Parquet file (`recipes.parquet`) with the following columns:

	- URL (string): The original URL of the recipe page.
	- Title (string): The title of the webpage.
	- JsonLd (string): The "Recipe" schema.org JSON-LD data, if found on the page. Stored as a JSON string.
	- Html (string): The cleaned, printable HTML content of the recipe.
	- Sitemap (string): The source sitemap filepath on disk (e.g., `sitemaps/recipes-sitemap.xml`).
	- ScrapeTimestamp (int64): Unix timestamp indicating when the data for the specific URL was scraped.
	- JsonLdPresent (bool): A flag that is `true` if a "Recipe" JSON-LD block was successfully found and extracted.
	- HtmlRecipePresent (bool): A flag that is `true` if the printable HTML block (`#print_area`) was found.
	- HttpStatusCode (int): The HTTP status code returned when scraping the URL (e.g., 200, 404).
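
As a rough illustration of how the columns above fit together, the sketch below decodes the `JsonLd` string of one row into a dictionary. It uses only the Python standard library. The helper name `parse_recipe`, the sample row, and the choice to skip rows whose `HttpStatusCode` is not 200 are assumptions made for this example, not part of the dataset.

```python
import json
from typing import Any, Optional


def parse_recipe(row: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the decoded schema.org Recipe for one row, or None if unusable."""
    # Assumption: only rows fetched with HTTP 200 and flagged as containing
    # a Recipe JSON-LD block are worth decoding.
    if row.get("HttpStatusCode") != 200 or not row.get("JsonLdPresent"):
        return None
    try:
        data = json.loads(row["JsonLd"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return None
    # Assumption: the stored string decodes to a single JSON object.
    return data if isinstance(data, dict) else None


# Hypothetical row shaped like the columns listed above (values are made up).
sample_row = {
    "URL": "https://example.com/recipe/1",
    "Title": "Example recipe",
    "JsonLd": '{"@type": "Recipe", "name": "Example", '
              '"recipeIngredient": ["flour", "water"]}',
    "JsonLdPresent": True,
    "HtmlRecipePresent": True,
    "HttpStatusCode": 200,
}

recipe = parse_recipe(sample_row)
if recipe is not None:
    print(recipe["name"], len(recipe["recipeIngredient"]))
```

With pandas installed (plus a Parquet engine such as pyarrow), `pd.read_parquet("recipes.parquet").to_dict("records")` should yield rows of this shape that can be passed to the helper one at a time.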

	## Dataset Creation

	### Data Collection and Processing

	The data was collected by a scraper written in Go with the [gocolly](https://github.com/gocolly/colly) library. The scraper visits each recipe URL listed in the sites' XML sitemaps and extracts the page title, the "Recipe" JSON-LD block, and the printable recipe HTML. The HTML is cleaned to match the site's print view. The results are saved to a single Parquet file (`recipes.parquet`) with Zstandard compression.
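
The JSON-LD extraction step can be sketched in Python as follows. This is not the scraper's code (that is written in Go with gocolly); it is a minimal standard-library stand-in showing one way to locate a "Recipe" block in a page. The names `JsonLdCollector` and `find_recipe` and the sample HTML are hypothetical, and the handling of `@graph` wrappers is an assumption about how some sites embed their data.

```python
import json
from html.parser import HTMLParser
from typing import Any, Optional


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_json_ld = False
        self._buffer: list[str] = []
        self.blocks: list[str] = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._buffer))
            self._in_json_ld = False


def _is_recipe(node: dict[str, Any]) -> bool:
    # "@type" may be a single string or a list of type names.
    node_type = node.get("@type")
    return node_type == "Recipe" or (
        isinstance(node_type, list) and "Recipe" in node_type
    )


def find_recipe(html: str) -> Optional[dict[str, Any]]:
    """Return the first schema.org Recipe object found in the page, if any."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks rather than failing the page.
        top_level = data if isinstance(data, list) else [data]
        nodes = []
        for item in top_level:
            if isinstance(item, dict):
                nodes.append(item)
                graph = item.get("@graph")
                if isinstance(graph, list):
                    nodes.extend(g for g in graph if isinstance(g, dict))
        for node in nodes:
            if _is_recipe(node):
                return node
    return None


# Made-up page used only to exercise the sketch.
sample_html = (
    "<html><head><title>Example</title>"
    '<script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Recipe", "name": "Example"}'
    "</script></head><body></body></html>"
)
print(find_recipe(sample_html))
```

A page with no matching block yields `None`, which would correspond to `JsonLdPresent` being `false` in the dataset.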