gabrielLarsSabadin committed
Commit 3a39da8 · verified · 1 Parent(s): 18fc113

Upload README.md with huggingface_hub

  1. README.md +161 -0
README.md ADDED
@@ -0,0 +1,161 @@
# Mermaid Syntax Dataset

## Dataset Summary
The **Mermaid Syntax Dataset** provides training and evaluation data for **syntax understanding, validation, repair, semantic titling, and generation** of [Mermaid.js](https://mermaid.js.org/) diagrams.

It supports three primary tasks:
1. **Repair** – Generate minimal diffs or patched diagrams that compile successfully.
2. **Titling** – Propose a short, human-friendly title, optionally with a one-sentence summary, based on content and context (instead of “Untitled Diagram”).
3. **Generation** – Create a new valid Mermaid diagram from a user instruction and optional diagram type.

> **Note:** Validation is performed by the Mermaid parser **before** any model call. Parser diagnostics are exposed in the dataset as `compiler_errors` (array of strings) so the model can understand what failed and propose targeted repairs.
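
The validate-first flow described in the note can be sketched as follows. This is a minimal illustration, not the dataset's actual pipeline: `build_request`, the `ParseFn` hook, and `toy_parse` are hypothetical names, and routing valid diagrams to `TITLE` is an assumption made for the demo. The real Mermaid parser is a JavaScript library, so the sketch takes the parser as a callable instead of invoking it.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical parser hook: takes diagram text and returns diagnostic strings
# (an empty list when the diagram parses). How the real Mermaid parser is
# invoked is outside the scope of this sketch.
ParseFn = Callable[[str], List[str]]


def build_request(diagram: str, parse: ParseFn, context: Optional[str] = None) -> Dict:
    """Run the parser first, then choose the task to send to the model."""
    errors = parse(diagram)
    payload: Dict = {"diagram": diagram}
    if context is not None:
        payload["context"] = context
    if errors:
        # Invalid diagram: attach the diagnostics and ask for a repair.
        payload["compiler_errors"] = errors
        return {"task": "REPAIR", "input": payload}
    # Valid diagram: no diagnostics are attached (assumed routing to TITLE).
    return {"task": "TITLE", "input": payload}


def toy_parse(diagram: str) -> List[str]:
    """Stand-in parser for the demo only: flags ' -> ' arrows."""
    errors = []
    for number, line in enumerate(diagram.split("\n"), start=1):
        if " -> " in line:
            errors.append(f"MISSING_ARROW at line {number}")
    return errors
```

Calling `build_request("flowchart TD\nA -> B", toy_parse)` yields a `REPAIR` request carrying the diagnostics, while a diagram that parses cleanly yields a request without `compiler_errors`.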

---

## Supported Tasks and Benchmarks
- **Text Generation**
  - `REPAIR`: Given an invalid diagram and parser diagnostics (`compiler_errors`), generate a corrected diagram (or a minimal patch).
  - `TITLE`: Given a valid diagram, generate a short, human-friendly title (optionally with a one-sentence summary).
  - `GENERATE`: Given a natural language instruction and optional diagram type, generate a new valid diagram (`diagram_content`) plus optional title and summary.

### Task Categories
- `text-generation`

---

## Languages
- **English (`en`)**
  All error messages, titles, and instructions are in English. Future multilingual expansions may include localized error messages.

---

## Dataset Structure

### Input Schema
```json
{
  "task": "REPAIR|TITLE|GENERATE",
  "input": {
    "diagram": "string (for REPAIR|TITLE)",
    "instruction": "string (for GENERATE)",
    "context": "optional string",
    "diagram_type": "optional string",
    "compiler_errors": ["string (for REPAIR)"]
  }
}
```
`compiler_errors` is an optional array of strings produced by the Mermaid parser (e.g., `"MISSING_ARROW at line 7"`, `"UNTERMINATED_BLOCK: 'subgraph' missing 'end'"`). Include it for `REPAIR` samples; omit it for `TITLE` and `GENERATE` samples.
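
The per-task field rules above can be checked mechanically. The sketch below is one possible reading of the input schema, not a validator shipped with the dataset; `validate_input` and its messages are hypothetical, and it treats `compiler_errors` as required for `REPAIR` and disallowed otherwise.

```python
from typing import Any, Dict, List

TASKS = {"REPAIR", "TITLE", "GENERATE"}


def validate_input(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    task = record.get("task")
    if task not in TASKS:
        return [f"unknown task: {task!r}"]
    payload = record.get("input")
    if not isinstance(payload, dict):
        return ["'input' must be an object"]

    problems: List[str] = []
    # REPAIR and TITLE work on an existing diagram; GENERATE on an instruction.
    required = "instruction" if task == "GENERATE" else "diagram"
    if not isinstance(payload.get(required), str) or not payload[required]:
        problems.append(f"{task} requires a non-empty string '{required}'")

    errors = payload.get("compiler_errors")
    if task == "REPAIR":
        if not isinstance(errors, list) or not all(isinstance(e, str) for e in errors):
            problems.append("REPAIR requires 'compiler_errors' as a list of strings")
    elif errors is not None:
        problems.append(f"{task} should omit 'compiler_errors'")

    for optional in ("context", "diagram_type"):
        if optional in payload and not isinstance(payload[optional], str):
            problems.append(f"'{optional}' must be a string when present")
    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to report every problem in a record during a bulk check.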
48
+
49
+ ### Output Schema
50
+ ```json
51
+ {
52
+ "result": {
53
+ "compiler_errors": ["string"], // optional echo of parser diagnostics
54
+ "patch": [ // optional for REPAIR tasks
55
+ {
56
+ "op": "replace|insert|delete",
57
+ "range": {"startLine": 1, "startCol": 5, "endLine": 1, "endCol": 10},
58
+ "text": "new content"
59
+ }
60
+ ],
61
+ "repaired_diagram": "string or null", // for REPAIR
62
+ "diagram_content": "string or null", // for GENERATE
63
+ "title": "string or null", // for TITLE and GENERATE
64
+ "summary": "string or null" // optional one-sentence description
65
+ }
66
+ }
67
+ ```
68
+ - `compiler_errors`: optional echo of parser diagnostics to provide context for the model.
69
+ - `patch`: optional list of minimal edit operations for REPAIR tasks.
70
+ - `repaired_diagram`: the corrected diagram (full text), used in REPAIR tasks.
71
+ - `diagram_content`: the newly generated diagram, used in GENERATE tasks.
72
+ - `title`: a short, human-friendly title, used in TITLE and GENERATE tasks.
73
+ - `summary`: an optional one-sentence description or summary, used in TITLE and GENERATE tasks.
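
To make the `patch` format concrete, here is a minimal sketch of applying a list of edit operations to a diagram. The README does not spell out the range semantics, so the sketch assumes 1-based lines and columns with an exclusive end position, and assumes that all ranges refer to the original text; `apply_patch` is a hypothetical helper, not part of the dataset.

```python
from typing import Dict, List


def _offset(lines: List[str], line: int, col: int) -> int:
    """Convert a 1-based (line, col) position into a 0-based string offset."""
    # Every earlier line contributes its length plus one newline character.
    return sum(len(text) + 1 for text in lines[: line - 1]) + (col - 1)


def apply_patch(diagram: str, patch: List[Dict]) -> str:
    """Apply replace/insert/delete operations (assumed end-exclusive ranges)."""
    lines = diagram.split("\n")
    spans = []
    for op in patch:
        kind, rng = op["op"], op["range"]
        if kind not in ("replace", "insert", "delete"):
            raise ValueError(f"unsupported op: {kind!r}")
        start = _offset(lines, rng["startLine"], rng["startCol"])
        end = _offset(lines, rng["endLine"], rng["endCol"])
        if kind == "insert":
            end = start  # an insert keeps all existing text
        text = "" if kind == "delete" else op.get("text", "")
        spans.append((start, end, text))
    # Apply right-to-left so offsets computed on the original text stay valid.
    for start, end, text in sorted(spans, key=lambda span: span[0], reverse=True):
        diagram = diagram[:start] + text + diagram[end:]
    return diagram
```

Under these assumptions, replacing line 2, columns 3 to 4, of `"flowchart TD\nA -> B"` with `"--"` turns the single dash into two and produces `"flowchart TD\nA --> B"`.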
74
+
75
+ ### Examples
76
+
77
+ #### Example REPAIR
78
+ ```json
79
+ {
80
+ "task": "REPAIR",
81
+ "input": {
82
+ "diagram": "flowchart TD\nA --> B",
83
+ "compiler_errors": ["MISSING_ARROW at line 2"]
84
+ },
85
+ "result": {
86
+ "compiler_errors": ["MISSING_ARROW at line 2"],
87
+ "patch": [
88
+ {
89
+ "op": "replace",
90
+ "range": {"startLine": 2, "startCol": 5, "endLine": 2, "endCol": 7},
91
+ "text": "->"
92
+ }
93
+ ],
94
+ "repaired_diagram": "flowchart TD\nA -> B",
95
+ "title": null,
96
+ "summary": null
97
+ }
98
+ }
99
+ ```
100
+
101
+ #### Example TITLE
102
+ ```json
103
+ {
104
+ "task": "TITLE",
105
+ "input": {
106
+ "diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!"
107
+ },
108
+ "result": {
109
+ "compiler_errors": [],
110
+ "patch": [],
111
+ "repaired_diagram": null,
112
+ "title": "Alice greets Bob",
113
+ "summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."
114
+ }
115
+ }
116
+ ```
117
+
118
+ #### Example GENERATE
119
+ ```json
120
+ {
121
+ "task": "GENERATE",
122
+ "input": {
123
+ "instruction": "Create a flowchart for the checkout process",
124
+ "diagram_type": "flowchart"
125
+ },
126
+ "result": {
127
+ "compiler_errors": [],
128
+ "patch": [],
129
+ "diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation",
130
+ "title": "Checkout Flow",
131
+ "summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."
132
+ }
133
+ }
134
+ ```
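
Records like the examples above can be flattened into prompt/target pairs for text-generation training. The sketch below shows one way to do that; the prompt layout and the `to_pair` helper are purely illustrative assumptions, since the README does not prescribe a prompt template.

```python
import json
from typing import Dict, Tuple


def to_pair(record: Dict) -> Tuple[str, str]:
    """Flatten one record into (prompt, target) strings."""
    task, payload, result = record["task"], record["input"], record["result"]
    parts = [f"Task: {task}"]
    if payload.get("diagram_type"):
        parts.append(f"Diagram type: {payload['diagram_type']}")
    if payload.get("context"):
        parts.append(f"Context: {payload['context']}")
    if task == "GENERATE":
        parts.append(f"Instruction: {payload['instruction']}")
    else:
        parts.append("Diagram:\n" + payload["diagram"])
    if task == "REPAIR":
        parts.append("Compiler errors:\n" + "\n".join(payload.get("compiler_errors", [])))
    # The whole result object is serialized as the target (illustrative choice).
    target = json.dumps(result, ensure_ascii=False, sort_keys=True)
    return "\n".join(parts), target
```

Serializing the full `result` object keeps the target format identical across the three tasks, at the cost of emitting `null` fields the task does not use.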
135
+
136
+ ## Sample Data
137
+
138
+ An example of a `sample.jsonl` is included for each task type. Each line is a JSON object following the schema.
139
+
140
+ ### REPAIR Sample
141
+ ```jsonl
142
+ {"task": "REPAIR", "input": {"diagram": "flowchart TD\nA -> B", "diagram_type": "flowchart", "compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"]}, "result": {"compiler_errors": ["MISSING_ARROW at line 2: use '-->' instead of '->'"], "patch": [{"op": "replace", "range": {"startLine": 2, "startCol": 3, "endLine": 2, "endCol": 4}, "text": "--"}], "repaired_diagram": "flowchart TD\nA --> B", "diagram_content": null, "title": null, "summary": null}}
143
+ ```
144
+
145
+ ### TITLE Sample
146
+ ```jsonl
147
+ {"task": "TITLE", "input": {"diagram": "sequenceDiagram\nAlice->>Bob: Hello Bob!", "diagram_type": "sequence"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": null, "title": "Alice greets Bob", "summary": "A simple sequence diagram showing Alice sending a greeting message to Bob."}}
148
+ ```
149
+
150
+ ### GENERATE Sample
151
+ ```jsonl
152
+ {"task": "GENERATE", "input": {"instruction": "Create a flowchart for the checkout process", "diagram_type": "flowchart"}, "result": {"compiler_errors": [], "patch": [], "repaired_diagram": null, "diagram_content": "flowchart TD\nStart --> Cart\nCart --> Payment\nPayment --> Confirmation", "title": "Checkout Flow", "summary": "A flowchart showing the steps from start to order confirmation in an e-commerce checkout process."}}
153
+ ```
154
+
155
+ Additional syntax-focused training samples have been generated from the Mermaid documentation and are available as JSONL files:
156
+ - `data/syntax_repair_samples.jsonl` – contains REPAIR task samples with broken diagrams and their fixes.
157
+ - `data/syntax_title_samples.jsonl` – contains TITLE task samples with valid diagrams, titles, and summaries.
158
+ - `data/syntax_generate_samples.jsonl` – contains GENERATE task samples with instructions and generated diagrams.
159
+ - `data/syntax_all_samples.jsonl` – combined file with all tasks.
160
+
161
+ These files can be used to train models specifically on Mermaid syntax understanding, repair, and generation.
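
As a starting point for working with these files, the sketch below reads JSON Lines records and groups them by task using only the standard library. The helper names are hypothetical, and `load_file` assumes the file layout listed above relative to a local copy of the dataset.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def read_jsonl(lines: Iterable[str]) -> List[Dict]:
    """Parse JSON Lines input: one JSON object per non-blank line."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by_task(records: Iterable[Dict]) -> Dict[str, List[Dict]]:
    """Bucket records by their 'task' field (REPAIR, TITLE, GENERATE)."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for record in records:
        groups[record["task"]].append(record)
    return dict(groups)


def load_file(path: str) -> Dict[str, List[Dict]]:
    """Read one JSONL file from disk and group its records by task."""
    with open(path, encoding="utf-8") as handle:
        return group_by_task(read_jsonl(handle))
```

For example, `load_file("data/syntax_all_samples.jsonl")` would return a dictionary keyed by task name, which can then be split into per-task training sets.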