---
task_categories:
- text-generation
language:
- en
tags:
- code
- mermaid
- syntax
- diagram
- repair
pretty_name: Mermaid AI Syntax
size_categories:
- 100K<n<1M
authors:
- Gabriel Lars Sabadin
- Darshan Jain
- Mermaid Chart AI Team
---

# Mermaid Syntax Dataset

## Dataset Summary

The **Mermaid Syntax Dataset** provides training and evaluation data for **syntax understanding, validation, repair, and semantic titling** of [Mermaid.js](https://mermaid.js.org/) diagrams.

It supports three primary tasks:

1. **Repair** – Generate minimal diffs or patched diagrams that compile successfully.
2. **Titling** – Propose a short, human-friendly title, optionally with a one-sentence summary, based on content and context (instead of “Untitled Diagram”).
3. **Generation** – Create a new valid Mermaid diagram from a user instruction and optional diagram type.

> **Note:** Validation is performed by the Mermaid parser **before** any model call. Parser diagnostics are exposed in the dataset as `compiler_errors` (array of strings) so the model can understand what failed and propose targeted repairs.

---

## Supported Tasks and Benchmarks

- **Text Generation**
  - `REPAIR`: Given an invalid diagram and parser diagnostics (`compiler_errors`), generate a corrected diagram (or a minimal patch).
  - `TITLE`: Given a valid diagram, generate a short, human-friendly title (optionally with a one-sentence summary).
  - `GENERATE`: Given a natural language instruction and optional diagram type, generate a new valid diagram (`diagram_content`) plus optional title and summary.

### Task Categories

- `text-generation`

---

## Languages

- **English (`en`)**

All error messages, titles, and instructions are in English. Future multilingual expansions may include localized error messages.

---

## Dataset Structure

### Input Schema

```json
{
  "task": "REPAIR|TITLE|GENERATE",
  "input": {
    "diagram": "string (for REPAIR|TITLE)",
    "instruction": "string (for GENERATE)",
    "context": "optional string",
    "diagram_type": "optional string",
    "compiler_errors": ["string (for REPAIR)"]
  }
}
```

`compiler_errors` is an optional array of strings produced by the Mermaid parser (e.g., `"MISSING_ARROW at line 7"`, `"UNTERMINATED_BLOCK: 'subgraph' missing 'end'"`). Include it for `REPAIR` samples; omit it for `TITLE` and `GENERATE` samples.

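
The per-task field rules above can be checked mechanically before a record is used. The following is a minimal sketch, not part of the dataset tooling: `validate_input` and `REQUIRED_INPUT_FIELDS` are hypothetical names, and treating `compiler_errors` as required for `REPAIR` is an assumption based on the "include it for `REPAIR` samples" guidance.

```python
# Fields each task is expected to carry in "input". This mapping is an
# assumption read off the schema above, not an official specification.
REQUIRED_INPUT_FIELDS = {
    "REPAIR": ("diagram", "compiler_errors"),
    "TITLE": ("diagram",),
    "GENERATE": ("instruction",),
}


def validate_input(record: dict) -> list[str]:
    """Return a list of problems found in a record's task and input fields."""
    task = record.get("task")
    if task not in REQUIRED_INPUT_FIELDS:
        return [f"unknown task: {task!r}"]

    payload = record.get("input")
    if not isinstance(payload, dict):
        return ["'input' must be an object"]

    problems = []
    for field in REQUIRED_INPUT_FIELDS[task]:
        if field not in payload:
            problems.append(f"{task} input is missing '{field}'")

    errors = payload.get("compiler_errors")
    if errors is not None:
        is_string_list = isinstance(errors, list) and all(
            isinstance(item, str) for item in errors
        )
        if not is_string_list:
            problems.append("'compiler_errors' must be an array of strings")
        elif task != "REPAIR":
            problems.append(f"'compiler_errors' should be omitted for {task}")
    return problems
```

An empty list means the record passed these checks; anything else is a human-readable description of what looks wrong.
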
### Output Schema

```json
{
  "result": {
    "compiler_errors": ["string"],        // optional echo of parser diagnostics
    "patch": [                            // optional for REPAIR tasks
      {
        "op": "replace|insert|delete",
        "range": {"startLine": 1, "startCol": 5, "endLine": 1, "endCol": 10},
        "text": "new content"
      }
    ],
    "repaired_diagram": "string or null", // for REPAIR
    "diagram_content": "string or null",  // for GENERATE
    "title": "string or null",            // for TITLE and GENERATE
    "summary": "string or null"           // optional one-sentence description
  }
}
```

- `compiler_errors`: optional echo of parser diagnostics to provide context for the model.
- `patch`: optional list of minimal edit operations for REPAIR tasks.
- `repaired_diagram`: the corrected diagram (full text), used in REPAIR tasks.
- `diagram_content`: the newly generated diagram, used in GENERATE tasks.
- `title`: a short, human-friendly title, used in TITLE and GENERATE tasks.
- `summary`: an optional one-sentence description or summary, used in TITLE and GENERATE tasks.

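
The schema does not state how `range` coordinates are counted. The sketch below shows one way a `patch` could be applied, assuming 1-indexed lines and columns with an exclusive `endCol`; that is the reading under which the patch in the REPAIR sample under Sample Data turns `A -> B` into `A --> B`. It also assumes that all operations refer to positions in the original text and do not overlap. `apply_patch` and `_offset` are hypothetical helper names.

```python
def _offset(lines: list[str], line: int, col: int) -> int:
    """Convert a 1-indexed (line, col) position into a character offset."""
    # Every preceding line contributes its length plus one newline character.
    return sum(len(text) + 1 for text in lines[: line - 1]) + (col - 1)


def apply_patch(diagram: str, patch: list[dict]) -> str:
    """Apply replace/insert/delete operations and return the patched text."""
    lines = diagram.split("\n")
    edits = []
    for operation in patch:
        span = operation["range"]
        start = _offset(lines, span["startLine"], span["startCol"])
        end = _offset(lines, span["endLine"], span["endCol"])
        kind = operation["op"]
        if kind == "replace":
            text = operation["text"]
        elif kind == "insert":
            end = start  # assumption: insert adds text at the start position
            text = operation["text"]
        elif kind == "delete":
            text = ""
        else:
            raise ValueError(f"unsupported op: {kind!r}")
        edits.append((start, end, text))

    # Apply from the end of the text backwards so earlier offsets stay valid.
    for start, end, text in sorted(edits, key=lambda edit: edit[0], reverse=True):
        diagram = diagram[:start] + text + diagram[end:]
    return diagram
```

With the REPAIR sample's values, `apply_patch("flowchart TD\nA -> B", patch)` returns `"flowchart TD\nA --> B"`, which matches that sample's `repaired_diagram`. If the dataset turns out to use inclusive end columns, only `_offset` for the end position needs adjusting.
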
### Examples

#### Example REPAIR

```json
{
  "task": "REPAIR",
  "input": {
    "diagram": "flowchart TD\nA -> B",
    "compiler_errors": ["MISSING_ARROW at line 2"]
  },
  "result": {
    "compiler_errors": ["MISSING_ARROW at line 2"],
    "patch": [
      {
        "op": "replace",
        "range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4},
        "text": "--"
      }
    ],
    "repaired_diagram": "flowchart TD\nA --> B",
    "title": null,
    "summary": null
  }
}
```

#### Example TITLE

```json
{
  "task": "TITLE",
  "input": {
    "diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!"
  },
  "result": {
    "compiler_errors": [],
    "patch": [],
    "repaired_diagram": null,
    "title": "Alice greets Bob",
    "summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."
  }
}
```

#### Example GENERATE

```json
{
  "task": "GENERATE",
  "input": {
    "instruction": "Create a flowchart for the checkout process",
    "diagram_type": "flowchart"
  },
  "result": {
    "compiler_errors": [],
    "patch": [],
    "diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation",
    "title": "Checkout Flow",
    "summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."
  }
}
```

## Sample Data

A `sample.jsonl` example is included for each task type. Each line is a JSON object following the schema above.

### REPAIR Sample

```jsonl
{"task": "REPAIR", "input": {"diagram": "flowchart TD\nA -> B", "diagram_type": "flowchart", "compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"]}, "result": {"compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"], "patch": [{"op": "replace", "range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4}, "text": "--"}], "repaired_diagram": "flowchart TD\nA --> B", "diagram_content": null, "title": null, "summary": null}}
```

### TITLE Sample

```jsonl
{"task": "TITLE", "input": {"diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!", "diagram_type": "sequence"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": null, "title": "Alice greets Bob", "summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."}}
```

### GENERATE Sample

```jsonl
{"task": "GENERATE", "input": {"instruction": "Create a flowchart for the checkout process", "diagram_type": "flowchart"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation", "title": "Checkout Flow", "summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."}}
```

Additional syntax-focused training samples have been generated from the Mermaid documentation and are available as JSONL files:

- `data/syntax_repair_samples.jsonl` – contains REPAIR task samples with broken diagrams and their fixes.
- `data/syntax_title_samples.jsonl` – contains TITLE task samples with valid diagrams, titles, and summaries.
- `data/syntax_generate_samples.jsonl` – contains GENERATE task samples with instructions and generated diagrams.
- `data/syntax_all_samples.jsonl` – combined file with all tasks.

These files can be used to train models specifically on Mermaid syntax understanding, repair, and generation.
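
Because each line of these files is a standalone JSON object, they can be read with the Python standard library alone. The sketch below is illustrative: `load_samples` and `count_by_task` are hypothetical helper names, and it assumes the files are UTF-8 encoded and that every record carries a `task` field as in the schema.

```python
import json
from collections import Counter
from pathlib import Path


def load_samples(path):
    """Read a JSONL file and return one dict per non-empty line."""
    samples = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                samples.append(json.loads(line))
    return samples


def count_by_task(samples):
    """Count how many records belong to each task (REPAIR, TITLE, GENERATE)."""
    return Counter(sample["task"] for sample in samples)
```

For example, `count_by_task(load_samples("data/syntax_all_samples.jsonl"))` would give the task distribution of the combined file, which is a quick sanity check before splitting it for training.
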