eclaude committed on
Commit
e8e9ae7
·
verified ·
1 Parent(s): f2a556b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +222 -0
README.md ADDED
@@ -0,0 +1,222 @@
+ ---
+ license: apache-2.0
+ task_categories:
+ - text-generation
+ language:
+ - en
+ tags:
+ - n8n
+ - workflow-automation
+ - code-generation
+ - sft
+ - json
+ - low-code
+ - automation
+ pretty_name: n8n Workflows SFT Dataset
+ size_categories:
+ - 1K<n<10K
+ ---
+
+ # n8n Workflows SFT Dataset
+
+ A curated dataset of [n8n](https://n8n.io/) workflow examples paired with natural language descriptions, designed for supervised fine-tuning (SFT) of code generation models.
+
+ ## Dataset Description
+
+ This dataset contains instruction-workflow pairs. Each example consists of:
+ - A natural language description of an automation task
+ - The corresponding valid n8n workflow JSON configuration
+
+ The pairs are formatted for training models to generate n8n workflows from user prompts.
+
+ | Property | Value |
+ |----------|-------|
+ | **Format** | JSON |
+ | **Size** | 1K-10K examples |
+ | **Language** | English |
+ | **License** | Apache 2.0 |
+
+ ## Dataset Structure
+
+ ### Data Fields
+
+ Both fields are strings. The `output` field holds the workflow as serialized JSON, so it must be decoded before its nodes can be inspected.
+
+ ```json
+ {
+   "instruction": "string - Natural language description of the desired workflow",
+   "output": "string - Valid n8n workflow JSON configuration"
+ }
+ ```
+
+ ### Example
+
+ ```json
+ {
+ "instruction": "Create a workflow that triggers on a webhook, filters incoming data based on a status field, and sends a notification to Slack",
+ "output": "{\"name\":\"Webhook to Slack\",\"nodes\":[{\"parameters\":{\"path\":\"status-webhook\"},\"name\":\"Webhook\",\"type\":\"n8n-nodes-base.webhook\",\"typeVersion\":1,\"position\":[250,300]},{\"parameters\":{\"conditions\":{\"string\":[{\"value1\":\"={{$json[\\\"status\\\"]}}\",\"value2\":\"active\"}]}},\"name\":\"Filter\",\"type\":\"n8n-nodes-base.filter\",\"typeVersion\":1,\"position\":[450,300]},{\"parameters\":{\"channel\":\"#notifications\",\"text\":\"New active status received\"},\"name\":\"Slack\",\"type\":\"n8n-nodes-base.slack\",\"typeVersion\":1,\"position\":[650,300]}],\"connections\":{\"Webhook\":{\"main\":[[{\"node\":\"Filter\",\"type\":\"main\",\"index\":0}]]},\"Filter\":{\"main\":[[{\"node\":\"Slack\",\"type\":\"main\",\"index\":0}]]}}}"
+ }
+ ```
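
Because `output` is stored as a string, a consumer usually decodes it before working with the workflow. The following is a minimal sketch of that step. The record is a hypothetical, abbreviated stand-in shaped like the two documented fields (it is not copied from the dataset), and `decode_workflow` is an illustrative helper name.

```python
import json

# Hypothetical record with the dataset's two fields; the workflow is
# abbreviated for illustration and is not taken from the dataset itself.
record = {
    "instruction": "Trigger on a webhook and notify Slack",
    "output": json.dumps({
        "name": "Webhook to Slack",
        "nodes": [
            {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
            {"name": "Slack", "type": "n8n-nodes-base.slack"},
        ],
        "connections": {
            "Webhook": {"main": [[{"node": "Slack", "type": "main", "index": 0}]]}
        },
    }),
}


def decode_workflow(example):
    """Parse the JSON string stored in `output` into a Python dict."""
    return json.loads(example["output"])


workflow = decode_workflow(record)
node_types = [node["type"] for node in workflow["nodes"]]
print(workflow["name"], node_types)
```

The same helper can be applied to rows returned by `load_dataset`, assuming every `output` value parses as JSON.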
+
+ ## Usage
+
+ ### Loading with 🤗 Datasets
+
+ ```python
+ from datasets import load_dataset
+
+ dataset = load_dataset("eclaude/n8n-workflows-sft")
+
+ # Access training data
+ print(dataset["train"][0])
+ ```
+
+ ### Loading with Pandas
+
+ ```python
+ import pandas as pd
+
+ # Reading hf:// paths requires the huggingface_hub package to be installed
+ df = pd.read_json("hf://datasets/eclaude/n8n-workflows-sft/data.json")
+ print(df.head())
+ ```
+
+ ### Preparing for SFT Training
+
+ ```python
+ from datasets import load_dataset
+
+ dataset = load_dataset("eclaude/n8n-workflows-sft")
+
+ def format_for_chat(example):
+     """Format examples for chat-style fine-tuning."""
+     return {
+         "messages": [
+             {
+                 "role": "system",
+                 "content": "You are an n8n workflow expert. Generate valid n8n workflow JSON configurations based on user requirements."
+             },
+             {
+                 "role": "user",
+                 "content": example["instruction"]
+             },
+             {
+                 "role": "assistant",
+                 "content": example["output"]
+             }
+         ]
+     }
+
+ # map() on a DatasetDict applies the function to every split
+ formatted_dataset = dataset.map(format_for_chat)
+ ```
+
+ ### Training with TRL
+
+ ```python
+ from datasets import load_dataset
+ from trl import SFTTrainer, SFTConfig
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
+ dataset = load_dataset("eclaude/n8n-workflows-sft")
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ def formatting_func(example):
+     # ChatML-style prompt, matching the Qwen chat format
+     return f"""<|im_start|>system
+ You are an n8n workflow expert. Generate valid n8n workflow JSON configurations.<|im_end|>
+ <|im_start|>user
+ {example['instruction']}<|im_end|>
+ <|im_start|>assistant
+ {example['output']}<|im_end|>"""
+
+ training_args = SFTConfig(
+     output_dir="./n8n-sft-model",
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     num_train_epochs=3,
+     learning_rate=2e-5,
+     bf16=True,  # requires hardware with bfloat16 support
+     logging_steps=10,
+     save_strategy="epoch",
+     # Recent TRL releases take the sequence limit here as max_length;
+     # older releases named it max_seq_length.
+     max_length=2048,
+ )
+
+ trainer = SFTTrainer(
+     model=model,
+     args=training_args,
+     train_dataset=dataset["train"],
+     formatting_func=formatting_func,
+     # Recent TRL releases use processing_class; older ones used tokenizer=
+     processing_class=tokenizer,
+ )
+
+ trainer.train()
+ ```
+
+ ## Covered n8n Nodes
+
+ The dataset includes workflows featuring common n8n integrations:
+
+ | Category | Nodes |
+ |----------|-------|
+ | **Triggers** | Webhook, Schedule, Manual |
+ | **Core** | HTTP Request, Code, Function, Set, Filter, Switch, Merge |
+ | **Communication** | Slack, Discord, Email, Telegram |
+ | **Data** | PostgreSQL, MySQL, MongoDB, Airtable, Google Sheets |
+ | **Dev Tools** | GitHub, GitLab, Jira |
+ | **Storage** | AWS S3, Google Drive, Dropbox |
+ | **CRM** | HubSpot, Salesforce |
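
To see how the integrations in the table above are distributed in practice, one option is to tally the `type` field of every node. This is a rough sketch under stated assumptions: `count_node_types` is a hypothetical helper, the sample records are invented for illustration, and real rows are assumed to follow the `nodes[].type` layout shown in the README example.

```python
import json
from collections import Counter


def count_node_types(records):
    """Tally node `type` strings across records, skipping unparseable rows."""
    counts = Counter()
    for record in records:
        try:
            workflow = json.loads(record["output"])
        except json.JSONDecodeError:
            continue  # ignore rows whose output is not valid JSON
        for node in workflow.get("nodes", []):
            counts[node.get("type", "unknown")] += 1
    return counts


# Invented sample rows, used only to exercise the helper.
sample_records = [
    {"output": json.dumps({"nodes": [
        {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
        {"name": "Slack", "type": "n8n-nodes-base.slack"},
    ]})},
    {"output": json.dumps({"nodes": [
        {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
    ]})},
    {"output": "not json"},
]

type_counts = count_node_types(sample_records)
print(type_counts.most_common())
```

Passing `dataset["train"]` instead of `sample_records` should work the same way, since each row is a mapping with an `output` key.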
+
+ ## Intended Uses
+
+ ### Primary Use
+
+ - Fine-tuning language models for n8n workflow generation
+ - Training code assistants specialized in automation
+
+ ### Out-of-Scope Use
+
+ - Deploying generated workflows directly to production without validation
+ - Training models for other automation platforms (Zapier, Make, etc.)
+
+ ## Limitations
+
+ - **Node Coverage**: Not all 400+ n8n nodes are represented equally
+ - **Complexity**: Most workflows are of simple to medium complexity (2-8 nodes)
+ - **Validation**: Workflows are structurally valid but may require credential configuration before they run
+ - **Version**: Based on the n8n workflow schema as of late 2024; may need updates for future n8n versions
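
Since generated workflows should be validated before use, a lightweight structural check can catch the most obvious problems. The sketch below only verifies that every connection refers to a node that exists. It assumes the `nodes` and `connections` layout seen in the README example, `find_structure_problems` is a hypothetical helper, and it is not a substitute for n8n's own import validation or for credential checks.

```python
def find_structure_problems(workflow):
    """Return a list of human-readable problems; an empty list means none found."""
    nodes = workflow.get("nodes")
    if not isinstance(nodes, list) or not nodes:
        return ["workflow has no nodes"]

    problems = []
    names = {node.get("name") for node in nodes}
    for source, outputs in workflow.get("connections", {}).items():
        if source not in names:
            problems.append(f"connection source {source!r} is not a node")
        # Each entry under "main" is one output branch holding a list of links.
        for branch in outputs.get("main", []):
            for link in branch or []:
                target = link.get("node")
                if target not in names:
                    problems.append(f"{source!r} links to unknown node {target!r}")
    return problems


# Invented workflows used only to exercise the check.
good = {
    "nodes": [{"name": "Webhook"}, {"name": "Slack"}],
    "connections": {"Webhook": {"main": [[{"node": "Slack", "type": "main", "index": 0}]]}},
}
bad = {
    "nodes": [{"name": "Webhook"}],
    "connections": {"Webhook": {"main": [[{"node": "Slack", "type": "main", "index": 0}]]}},
}

print(find_structure_problems(good))
print(find_structure_problems(bad))
```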
+
+ ## Dataset Creation
+
+ ### Source Data
+
+ Workflows were collected and curated from:
+ - Public n8n workflow templates
+ - Community-shared automations
+ - Synthetically generated examples with manual validation
+
+ ### Curation Process
+
+ 1. Collection of raw workflow JSON files
+ 2. Extraction and normalization of workflow structure
+ 3. Generation of natural language descriptions
+ 4. Manual review for quality and accuracy
+ 5. Deduplication and filtering
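
The README does not describe how normalization and deduplication were implemented. As one plausible illustration of those steps (not the author's actual pipeline), the sketch below canonicalizes each workflow by re-serializing it with sorted keys and compact separators, then drops rows whose SHA-256 fingerprint has already been seen. The helper names and sample rows are hypothetical.

```python
import hashlib
import json


def workflow_fingerprint(output):
    """Hash a canonical form of the workflow so key order and spacing do not matter."""
    canonical = json.dumps(json.loads(output), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each distinct workflow fingerprint."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = workflow_fingerprint(record["output"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# Invented rows: the first two encode the same workflow with different formatting.
rows = [
    {"instruction": "first", "output": '{"name": "A", "nodes": []}'},
    {"instruction": "second", "output": '{"nodes":[],"name":"A"}'},
    {"instruction": "third", "output": '{"name": "B", "nodes": []}'},
]

unique_rows = deduplicate(rows)
print([row["instruction"] for row in unique_rows])
```

Exact hashing only catches identical workflows; near-duplicates that differ in node positions or names would need a looser comparison.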
+
+ ## Models Trained on This Dataset
+
+ - [eclaude/qwen-coder-3b-n8n-sft](https://huggingface.co/eclaude/qwen-coder-3b-n8n-sft)
+
+ ## Citation
+
+ ```bibtex
+ @dataset{n8n_workflows_sft_2025,
+   author = {eclaude},
+   title = {n8n Workflows SFT Dataset},
+   year = {2025},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/datasets/eclaude/n8n-workflows-sft}
+ }
+ ```
+
+ ## Contact
+
+ For questions, suggestions, or contributions, open a discussion on this repository or contact the author via [Hugging Face](https://huggingface.co/eclaude).