Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,161 @@
# Mermaid Syntax Dataset

## Dataset Summary
The **Mermaid Syntax Dataset** provides training and evaluation data for **syntax understanding, validation, repair, and semantic titling** of [Mermaid.js](https://mermaid.js.org/) diagrams.

It supports three primary tasks:
1. **Repair** – Generate minimal diffs or patched diagrams that compile successfully.
2. **Titling** – Propose a short, human-friendly title, optionally with a one-sentence summary, based on content and context (instead of “Untitled Diagram”).
3. **Generation** – Create a new valid Mermaid diagram from a user instruction and optional diagram type.

> **Note:** Validation is performed by the Mermaid parser **before** any model call. Parser diagnostics are exposed in the dataset as `compiler_errors` (array of strings) so the model can understand what failed and propose targeted repairs.

---

## Supported Tasks and Benchmarks
- **Text Generation**
  - `REPAIR`: Given an invalid diagram and parser diagnostics (`compiler_errors`), generate a corrected diagram (or a minimal patch).
  - `TITLE`: Given a valid diagram, generate a short, human-friendly title (optionally with a one-sentence summary).
  - `GENERATE`: Given a natural language instruction and optional diagram type, generate a new valid diagram (`diagram_content`) plus optional title and summary.

### Task Categories
- `text-generation`

---

## Languages
- **English (`en`)**
  All error messages, titles, and instructions are in English. Future multilingual expansions may include localized error messages.

---

## Dataset Structure

### Input Schema
```json
{
  "task": "REPAIR|TITLE|GENERATE",
  "input": {
    "diagram": "string (for REPAIR|TITLE)",
    "instruction": "string (for GENERATE)",
    "context": "optional string",
    "diagram_type": "optional string",
    "compiler_errors": ["string (for REPAIR)"]
  }
}
```
`compiler_errors` is an optional array of strings produced by the Mermaid parser (e.g., `"MISSING_ARROW at line 7"`, `"UNTERMINATED_BLOCK: 'subgraph' missing 'end'"`). Include it for `REPAIR` samples; omit it for `TITLE` and `GENERATE` samples.
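
The field requirements above can be checked mechanically before a record is used. The following is a minimal sketch in Python, not part of the dataset's tooling: the function name `validate_input` and the exact rules (for example, flagging a non-empty `compiler_errors` on non-`REPAIR` records) are assumptions read off the schema annotations.

```python
from typing import Any

TASKS = {"REPAIR", "TITLE", "GENERATE"}


def validate_input(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record; an empty list means OK."""
    problems: list[str] = []
    task = record.get("task")
    if not isinstance(task, str) or task not in TASKS:
        return [f"unknown task: {task!r}"]

    data = record.get("input")
    if not isinstance(data, dict):
        return ["'input' must be an object"]

    # REPAIR and TITLE operate on an existing diagram;
    # GENERATE starts from a natural language instruction.
    required = "instruction" if task == "GENERATE" else "diagram"
    if not isinstance(data.get(required), str):
        problems.append(f"{task} requires a string '{required}'")

    errors = data.get("compiler_errors")
    if errors is not None:
        if not (isinstance(errors, list) and all(isinstance(e, str) for e in errors)):
            problems.append("'compiler_errors' must be an array of strings")
        elif task != "REPAIR" and errors:
            # Assumption: diagnostics only make sense for REPAIR samples.
            problems.append(f"'compiler_errors' should be omitted for {task}")

    return problems
```

Returning a list of messages rather than raising on the first problem makes it easier to report every issue in a malformed record at once.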

### Output Schema
```json
{
  "result": {
    "compiler_errors": ["string"],          // optional echo of parser diagnostics
    "patch": [                              // optional for REPAIR tasks
      {
        "op": "replace|insert|delete",
        "range": {"startLine": 1, "startCol": 5, "endLine": 1, "endCol": 10},
        "text": "new content"
      }
    ],
    "repaired_diagram": "string or null",   // for REPAIR
    "diagram_content": "string or null",    // for GENERATE
    "title": "string or null",              // for TITLE and GENERATE
    "summary": "string or null"             // optional one-sentence description
  }
}
```
- `compiler_errors`: optional echo of parser diagnostics to provide context for the model.
- `patch`: optional list of minimal edit operations for REPAIR tasks.
- `repaired_diagram`: the corrected diagram (full text), used in REPAIR tasks.
- `diagram_content`: the newly generated diagram, used in GENERATE tasks.
- `title`: a short, human-friendly title, used in TITLE and GENERATE tasks.
- `summary`: an optional one-sentence description or summary, used in TITLE and GENERATE tasks.
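
To make the `patch` field concrete, here is a hedged sketch of how its edit operations could be applied to a diagram. The schema does not state the coordinate convention, so this assumes 1-based lines and columns with an exclusive end column, and that all ranges refer to the original text and do not overlap. That is the reading under which the patch in the REPAIR record under Sample Data produces its `repaired_diagram`. The names `apply_patch` and `_offset` are illustrative only.

```python
def _offset(lines: list[str], line: int, col: int) -> int:
    """Convert a 1-based (line, column) position into a character offset."""
    # Each earlier line contributes its length plus one newline character.
    return sum(len(text) + 1 for text in lines[: line - 1]) + (col - 1)


def apply_patch(diagram: str, patch: list[dict]) -> str:
    """Apply edit operations to a diagram and return the patched text.

    Assumes 1-based, end-exclusive ranges that refer to the original text.
    """
    lines = diagram.split("\n")
    edits = []
    for op in patch:
        rng = op["range"]
        start = _offset(lines, rng["startLine"], rng["startCol"])
        end = _offset(lines, rng["endLine"], rng["endCol"])
        # "delete" removes the range; "insert" and "replace" put `text` there
        # (an insert is expected to use an empty range, start == end).
        text = "" if op["op"] == "delete" else op.get("text", "")
        edits.append((start, end, text))

    # Apply from the end of the document backwards so earlier offsets stay valid.
    for start, end, text in sorted(edits, reverse=True):
        diagram = diagram[:start] + text + diagram[end:]
    return diagram
```

Applied to `"flowchart TD\nA -> B"` with the patch from the REPAIR record under Sample Data, this returns `"flowchart TD\nA --> B"`, which gives a simple consistency check between `patch` and `repaired_diagram`.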

### Examples

#### Example REPAIR
```json
{
  "task": "REPAIR",
  "input": {
    "diagram": "flowchart TD\nA -> B",
    "compiler_errors": ["MISSING_ARROW at line 2"]
  },
  "result": {
    "compiler_errors": ["MISSING_ARROW at line 2"],
    "patch": [
      {
        "op": "replace",
        "range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4},
        "text": "--"
      }
    ],
    "repaired_diagram": "flowchart TD\nA --> B",
    "title": null,
    "summary": null
  }
}
```

#### Example TITLE
```json
{
  "task": "TITLE",
  "input": {
    "diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!"
  },
  "result": {
    "compiler_errors": [],
    "patch": [],
    "repaired_diagram": null,
    "title": "Alice greets Bob",
    "summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."
  }
}
```

#### Example GENERATE
```json
{
  "task": "GENERATE",
  "input": {
    "instruction": "Create a flowchart for the checkout process",
    "diagram_type": "flowchart"
  },
  "result": {
    "compiler_errors": [],
    "patch": [],
    "diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation",
    "title": "Checkout Flow",
    "summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."
  }
}
```
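
The examples above leave out result fields that do not apply to the task (the TITLE example has no `diagram_content`, the GENERATE example no `repaired_diagram`), whereas the JSONL records under Sample Data carry every field. A small helper can bring both shapes to the full form. This is only a sketch: `normalize_result` and the default values are assumptions modeled on those records, not documented behavior.

```python
from typing import Any

# Defaults mirror the JSONL samples: empty arrays for list fields, null elsewhere.
RESULT_DEFAULTS: dict[str, Any] = {
    "compiler_errors": [],
    "patch": [],
    "repaired_diagram": None,
    "diagram_content": None,
    "title": None,
    "summary": None,
}


def normalize_result(result: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `result` with every schema field present."""
    # Copy list defaults so callers never share (and mutate) the module-level lists.
    full = {
        key: list(value) if isinstance(value, list) else value
        for key, value in RESULT_DEFAULTS.items()
    }
    full.update(result)
    return full
```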

## Sample Data

A `sample.jsonl` example is included for each task type. Each line is a JSON object that follows the schemas above.

### REPAIR Sample
```jsonl
{"task": "REPAIR", "input": {"diagram": "flowchart TD\nA -> B", "diagram_type": "flowchart", "compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"]}, "result": {"compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"], "patch": [{"op": "replace", "range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4}, "text": "--"}], "repaired_diagram": "flowchart TD\nA --> B", "diagram_content": null, "title": null, "summary": null}}
```

### TITLE Sample
```jsonl
{"task": "TITLE", "input": {"diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!", "diagram_type": "sequence"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": null, "title": "Alice greets Bob", "summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."}}
```

### GENERATE Sample
```jsonl
{"task": "GENERATE", "input": {"instruction": "Create a flowchart for the checkout process", "diagram_type": "flowchart"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation", "title": "Checkout Flow", "summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."}}
```

Additional syntax-focused training samples have been generated from the Mermaid documentation and are available as JSONL files:
- `data/syntax_repair_samples.jsonl` – contains REPAIR task samples with broken diagrams and their fixes.
- `data/syntax_title_samples.jsonl` – contains TITLE task samples with valid diagrams, titles, and summaries.
- `data/syntax_generate_samples.jsonl` – contains GENERATE task samples with instructions and generated diagrams.
- `data/syntax_all_samples.jsonl` – combined file with all tasks.

These files can be used to train models specifically on Mermaid syntax understanding, repair, and generation.
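
Because `data/syntax_all_samples.jsonl` mixes all three tasks, a loader usually needs to split records by their `task` field. The sketch below uses only the Python standard library; the helper names `read_jsonl` and `group_by_task` are illustrative, and the file path in the usage comment assumes the layout listed above.

```python
import json
from collections import defaultdict
from typing import Any, Iterable


def read_jsonl(lines: Iterable[str]) -> list[dict[str, Any]]:
    """Parse JSON Lines input (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_task(records: Iterable[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Bucket records by their `task` field (REPAIR, TITLE, or GENERATE)."""
    groups: dict[str, list[dict[str, Any]]] = defaultdict(list)
    for record in records:
        groups[record["task"]].append(record)
    return dict(groups)


# Usage, assuming the repository has been downloaded locally:
#
#     with open("data/syntax_all_samples.jsonl", encoding="utf-8") as handle:
#         by_task = group_by_task(read_jsonl(handle))
#     for task, items in sorted(by_task.items()):
#         print(task, len(items))
```

`read_jsonl` accepts any iterable of strings, so it works equally on an open file handle or on a list of lines held in memory.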