	---
	task_categories:
	- text-generation
	language:
	- en
	tags:
	- code
	- mermaid
	- syntax
	- diagram
	- repair
	pretty_name: Mermaid AI Syntax
	size_categories:
	- 100K<n<1M
	authors:
	- Gabriel Lars Sabadin
	- Darshan Jain
	- Mermaid Chart AI Team
	---

	# Mermaid Syntax Dataset

	## Dataset Summary
	The Mermaid Syntax Dataset provides training and evaluation data for syntax understanding, validation, repair, and semantic titling of [Mermaid.js](https://mermaid.js.org/) diagrams.

	It supports three primary tasks:
	1. Repair – Generate minimal diffs or patched diagrams that compile successfully.
	2. Titling – Propose a short, human-friendly title, optionally with a one-sentence summary, based on content and context (instead of “Untitled Diagram”).
	3. Generation – Create a new valid Mermaid diagram from a user instruction and optional diagram type.

	> Note: Validation is performed by the Mermaid parser before any model call. Parser diagnostics are exposed in the dataset as `compiler_errors` (array of strings) so the model can understand what failed and propose targeted repairs.

	---

	## Supported Tasks and Benchmarks
	- Text Generation
	  - `REPAIR`: Given an invalid diagram and parser diagnostics (`compiler_errors`), generate a corrected diagram (or a minimal patch).
	  - `TITLE`: Given a valid diagram, generate a short, human-friendly title (optionally with a one-sentence summary).
	  - `GENERATE`: Given a natural language instruction and optional diagram type, generate a new valid diagram (`diagram_content`) plus optional title and summary.

	### Task Categories
	- `text-generation`

	---

	## Languages
	- English (`en`)

	All error messages, titles, and instructions are in English. Future multilingual expansions may include localized error messages.

	---

	## Dataset Structure

	### Input Schema
	```json
	{
	"task": "REPAIR|TITLE|GENERATE",
	"input": {
	"diagram": "string (for REPAIR|TITLE)",
	"instruction": "string (for GENERATE)",
	"context": "optional string",
	"diagram_type": "optional string",
	"compiler_errors": ["string (for REPAIR)"]
	}
	}
	```
	`compiler_errors` is an optional array of strings produced by the Mermaid parser (e.g., `"MISSING_ARROW at line 7"`, `"UNTERMINATED_BLOCK: 'gantt' missing 'end'"`). Include it for `REPAIR` samples; omit it for `TITLE` and `GENERATE` samples.
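
The field rules above can be checked mechanically before a record is used. The following is a minimal sketch, not part of the dataset tooling: the function name `validate_input` and the table `REQUIRED_INPUT_FIELDS` are hypothetical, and treating `compiler_errors` as required for `REPAIR` is one reading of "include it for `REPAIR` samples".

```python
from typing import Any

# Required input fields per task. This mapping is an interpretation of the
# schema in this card, not an official specification.
REQUIRED_INPUT_FIELDS = {
    "REPAIR": ("diagram", "compiler_errors"),
    "TITLE": ("diagram",),
    "GENERATE": ("instruction",),
}


def validate_input(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a record's task and input; empty if none."""
    task = record.get("task")
    if task not in REQUIRED_INPUT_FIELDS:
        return [f"unknown task: {task!r}"]
    payload = record.get("input")
    if not isinstance(payload, dict):
        return ["'input' must be an object"]

    problems: list[str] = []
    for field in REQUIRED_INPUT_FIELDS[task]:
        if field not in payload:
            problems.append(f"{task} input is missing '{field}'")

    errors = payload.get("compiler_errors")
    if errors is not None:
        if task != "REPAIR":
            # The card says to omit parser diagnostics for TITLE and GENERATE.
            problems.append(f"{task} input should omit 'compiler_errors'")
        elif not (isinstance(errors, list) and all(isinstance(e, str) for e in errors)):
            problems.append("'compiler_errors' must be an array of strings")
    return problems
```

An empty return value means the record passed every check; any other value lists what to fix.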

	### Output Schema
	```json
	{
	"result": {
	"compiler_errors": ["string"], // optional echo of parser diagnostics
	"patch": [ // optional for REPAIR tasks
	{
	"op": "replace|insert|delete",
	"range": {"startLine": 1, "startCol": 5, "endLine": 1, "endCol": 10},
	"text": "new content"
	}
	],
	"repaired_diagram": "string or null", // for REPAIR
	"diagram_content": "string or null", // for GENERATE
	"title": "string or null", // for TITLE and GENERATE
	"summary": "string or null" // optional one-sentence description
	}
	}
	```
	- `compiler_errors`: optional echo of parser diagnostics to provide context for the model.
	- `patch`: optional list of minimal edit operations for REPAIR tasks.
	- `repaired_diagram`: the corrected diagram (full text), used in REPAIR tasks.
	- `diagram_content`: the newly generated diagram, used in GENERATE tasks.
	- `title`: a short, human-friendly title, used in TITLE and GENERATE tasks.
	- `summary`: an optional one-sentence description or summary, used in TITLE and GENERATE tasks.
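
To make the `patch` field concrete, here is a rough sketch of how edit operations could be applied to a diagram. The card does not define the range semantics, so this assumes 1-indexed lines and columns, an exclusive `endCol`, and edits confined to a single line. Those assumptions were chosen to agree with the REPAIR line in the Sample Data section. The function name `apply_patch` is hypothetical.

```python
def apply_patch(diagram: str, patch: list[dict]) -> str:
    """Apply single-line edit operations to a diagram and return the new text.

    Assumptions (not stated in the dataset card): lines and columns are
    1-indexed, endCol is exclusive, and each edit stays on one line.
    """
    lines = diagram.split("\n")
    # Apply edits from the end of the text backwards so that earlier
    # positions are not shifted by later edits.
    ordered = sorted(
        patch,
        key=lambda e: (e["range"]["startLine"], e["range"]["startCol"]),
        reverse=True,
    )
    for edit in ordered:
        rng = edit["range"]
        if rng["startLine"] != rng["endLine"]:
            raise ValueError("multi-line ranges are out of scope for this sketch")
        row = rng["startLine"] - 1
        start, end = rng["startCol"] - 1, rng["endCol"] - 1
        line = lines[row]
        op = edit["op"]
        if op == "replace":
            lines[row] = line[:start] + edit["text"] + line[end:]
        elif op == "insert":
            lines[row] = line[:start] + edit["text"] + line[start:]
        elif op == "delete":
            lines[row] = line[:start] + line[end:]
        else:
            raise ValueError(f"unknown op: {op!r}")
    return "\n".join(lines)


broken = "flowchart TD\nA -> B"
patch = [{
    "op": "replace",
    "range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4},
    "text": "--",
}]
print(apply_patch(broken, patch))  # second line becomes "A --> B"
```

Comparing the output of such a function with `repaired_diagram` is one way to check that a record's patch and its full-text repair agree.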

	### Examples

	#### Example REPAIR
	```json
	{
	"task": "REPAIR",
	"input": {
	"diagram": "flowchart TD\nA -> B",
	"compiler_errors": ["MISSING_ARROW at line 2"]
	},
	"result": {
	"compiler_errors": ["MISSING_ARROW at line 2"],
	"patch": [
	{
	"op": "replace",
	"range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4},
	"text": "--"
	}
	],
	"repaired_diagram": "flowchart TD\nA --> B",
	"title": null,
	"summary": null
	}
	}
	```

	#### Example TITLE
	```json
	{
	"task": "TITLE",
	"input": {
	"diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!"
	},
	"result": {
	"compiler_errors": [],
	"patch": [],
	"repaired_diagram": null,
	"title": "Alice greets Bob",
	"summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."
	}
	}
	```

	#### Example GENERATE
	```json
	{
	"task": "GENERATE",
	"input": {
	"instruction": "Create a flowchart for the checkout process",
	"diagram_type": "flowchart"
	},
	"result": {
	"compiler_errors": [],
	"patch": [],
	"diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation",
	"title": "Checkout Flow",
	"summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."
	}
	}
	```
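
The three examples suggest a convention for which result field each task populates: `repaired_diagram` for `REPAIR`, `title` for `TITLE`, and `diagram_content` for `GENERATE`. A small consistency check along those lines might look like the sketch below. The mapping is inferred from the examples and is not a documented rule, and the names `PRIMARY_RESULT_FIELD` and `result_matches_task` are hypothetical.

```python
# Result field each task is expected to populate, inferred from the
# examples in this card (an assumption, not a documented rule).
PRIMARY_RESULT_FIELD = {
    "REPAIR": "repaired_diagram",
    "TITLE": "title",
    "GENERATE": "diagram_content",
}


def result_matches_task(record: dict) -> bool:
    """Return True if the task's primary result field holds a non-empty string."""
    field = PRIMARY_RESULT_FIELD.get(record.get("task"))
    if field is None:
        return False
    value = record.get("result", {}).get(field)
    return isinstance(value, str) and value.strip() != ""


title_record = {
    "task": "TITLE",
    "input": {"diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!"},
    "result": {"repaired_diagram": None, "title": "Alice greets Bob"},
}
print(result_matches_task(title_record))  # True
```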

	## Sample Data

	An example `sample.jsonl` line is shown below for each task type. Each line is a single JSON object that follows the schema above.

	### REPAIR Sample
	```jsonl
	{"task": "REPAIR", "input": {"diagram": "flowchart TD\nA -> B", "diagram_type": "flowchart", "compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"]}, "result": {"compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"], "patch": [{"op": "replace", "range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4}, "text": "--"}], "repaired_diagram": "flowchart TD\nA --> B", "diagram_content": null, "title": null, "summary": null}}
	```

	### TITLE Sample
	```jsonl
	{"task": "TITLE", "input": {"diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!", "diagram_type": "sequence"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": null, "title": "Alice greets Bob", "summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."}}
	```

	### GENERATE Sample
	```jsonl
	{"task": "GENERATE", "input": {"instruction": "Create a flowchart for the checkout process", "diagram_type": "flowchart"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation", "title": "Checkout Flow", "summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."}}
	```

	Additional syntax-focused training samples have been generated from the Mermaid documentation and are available as JSONL files:
	- `data/syntax_repair_samples.jsonl` – contains REPAIR task samples with broken diagrams and their fixes.
	- `data/syntax_title_samples.jsonl` – contains TITLE task samples with valid diagrams, titles, and summaries.
	- `data/syntax_generate_samples.jsonl` – contains GENERATE task samples with instructions and generated diagrams.
	- `data/syntax_all_samples.jsonl` – combined file with all tasks.

	These files can be used to train models specifically on Mermaid syntax understanding, repair, and generation.
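
As a starting point for working with these files, the sketch below parses JSONL text with the Python standard library and counts records per task. It assumes only what the card states, namely that each line is one JSON object with a `task` key. The helper names `iter_records` and `count_tasks` are hypothetical, and the inline sample stands in for a real file.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL text line by line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_tasks(lines: Iterable[str]) -> Counter:
    """Count records per task label."""
    return Counter(record["task"] for record in iter_records(lines))


# Tiny in-memory stand-in for a JSONL file; real records carry full
# "input" and "result" objects as described in the schema.
sample_lines = [
    '{"task": "TITLE", "input": {}, "result": {}}',
    "",
    '{"task": "REPAIR", "input": {}, "result": {}}',
    '{"task": "TITLE", "input": {}, "result": {}}',
]
print(count_tasks(sample_lines))
```

Because both helpers accept any iterable of strings, a file handle opened on `data/syntax_all_samples.jsonl` (with `encoding="utf-8"`) can be passed in place of the list.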