---
license: apache-2.0
task_categories:
  - text-generation
language:
  - en
tags:
  - n8n
  - workflow-automation
  - code-generation
  - sft
  - json
  - low-code
  - automation
pretty_name: n8n Workflows SFT Dataset
size_categories:
  - 1K<n<10K
---

# n8n Workflows SFT Dataset

A curated dataset of [n8n](https://n8n.io/) workflow examples paired with natural language descriptions, designed for supervised fine-tuning (SFT) of code generation models.

## Dataset Description

This dataset contains instruction-workflow pairs where each example consists of:
- A natural language description of an automation task
- The corresponding valid n8n workflow JSON configuration

The dataset is specifically formatted for training models to generate n8n workflows from user prompts.

| Property | Value |
|----------|-------|
| **Format** | JSON |
| **Size** | 1K-10K examples |
| **Language** | English |
| **License** | Apache 2.0 |

## Dataset Structure

### Data Fields

```json
{
  "instruction": "string - Natural language description of the desired workflow",
  "output": "string - Valid n8n workflow configuration, serialized as a JSON string"
}
```

### Example

```json
{
  "instruction": "Create a workflow that triggers on a webhook, filters incoming data based on a status field, and sends a notification to Slack",
  "output": "{\"name\":\"Webhook to Slack\",\"nodes\":[{\"parameters\":{\"path\":\"status-webhook\"},\"name\":\"Webhook\",\"type\":\"n8n-nodes-base.webhook\",\"typeVersion\":1,\"position\":[250,300]},{\"parameters\":{\"conditions\":{\"string\":[{\"value1\":\"={{$json[\\\"status\\\"]}}\",\"value2\":\"active\"}]}},\"name\":\"Filter\",\"type\":\"n8n-nodes-base.filter\",\"typeVersion\":1,\"position\":[450,300]},{\"parameters\":{\"channel\":\"#notifications\",\"text\":\"New active status received\"},\"name\":\"Slack\",\"type\":\"n8n-nodes-base.slack\",\"typeVersion\":1,\"position\":[650,300]}],\"connections\":{\"Webhook\":{\"main\":[[{\"node\":\"Filter\",\"type\":\"main\",\"index\":0}]]},\"Filter\":{\"main\":[[{\"node\":\"Slack\",\"type\":\"main\",\"index\":0}]]}}}"
}
```
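
Because `output` is a JSON-encoded string rather than a nested object, it has to be decoded before the workflow can be inspected. The following is a minimal sketch that uses a shortened record in the same shape as the example above; the `summarize_workflow` helper is hypothetical and not part of any dataset tooling.

```python
import json


def summarize_workflow(record):
    """Decode the `output` string and list node names and connection edges."""
    workflow = json.loads(record["output"])
    node_names = [node["name"] for node in workflow.get("nodes", [])]
    edges = []
    for source, outputs in workflow.get("connections", {}).items():
        for branch in outputs.get("main", []):
            for target in branch or []:
                edges.append((source, target["node"]))
    return {"name": workflow.get("name"), "nodes": node_names, "edges": edges}


# Shortened illustrative record; real records carry full node parameters.
record = {
    "instruction": "Trigger on a webhook and notify Slack",
    "output": json.dumps({
        "name": "Webhook to Slack",
        "nodes": [
            {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
            {"name": "Slack", "type": "n8n-nodes-base.slack"},
        ],
        "connections": {
            "Webhook": {"main": [[{"node": "Slack", "type": "main", "index": 0}]]}
        },
    }),
}

summary = summarize_workflow(record)
print(summary["nodes"])
print(summary["edges"])
```

The same helper should work on a row loaded from the dataset, provided the row has the two fields described under Data Fields.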

## Usage

### Loading with 🤗 Datasets

```python
from datasets import load_dataset

dataset = load_dataset("eclaude/n8n-workflows-sft")

# Access training data
print(dataset["train"][0])
```

### Loading with Pandas

```python
import pandas as pd

df = pd.read_json("hf://datasets/eclaude/n8n-workflows-sft/data.json")
print(df.head())
```

### Preparing for SFT Training

```python
from datasets import load_dataset

dataset = load_dataset("eclaude/n8n-workflows-sft")

def format_for_chat(example):
    """Format examples for chat-style fine-tuning."""
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are an n8n workflow expert. Generate valid n8n workflow JSON configurations based on user requirements."
            },
            {
                "role": "user",
                "content": example["instruction"]
            },
            {
                "role": "assistant",
                "content": example["output"]
            }
        ]
    }

formatted_dataset = dataset.map(format_for_chat)
```

### Training with TRL

```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
dataset = load_dataset("eclaude/n8n-workflows-sft")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def formatting_func(example):
    return f"""<|im_start|>system
You are an n8n workflow expert. Generate valid n8n workflow JSON configurations.<|im_end|>
<|im_start|>user
{example['instruction']}<|im_end|>
<|im_start|>assistant
{example['output']}<|im_end|>"""

training_args = SFTConfig(
    output_dir="./n8n-sft-model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)

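# Argument names vary across TRL releases: newer versions take
# `processing_class` in place of `tokenizer` and read the sequence-length
# limit from SFTConfig rather than from the trainer.
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    formatting_func=formatting_func,
    tokenizer=tokenizer,
    max_seq_length=2048,
)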

trainer.train()
```

## Covered n8n Nodes

The dataset includes workflows featuring common n8n integrations:

| Category | Nodes |
|----------|-------|
| **Triggers** | Webhook, Schedule, Manual |
| **Core** | HTTP Request, Code, Function, Set, Filter, Switch, Merge |
| **Communication** | Slack, Discord, Email, Telegram |
| **Data** | PostgreSQL, MySQL, MongoDB, Airtable, Google Sheets |
| **Dev Tools** | GitHub, GitLab, Jira |
| **Storage** | AWS S3, Google Drive, Dropbox |
| **CRM** | HubSpot, Salesforce |
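
To see how often each node actually appears, the node `type` values can be tallied across records. This is a rough sketch under the assumption that every record follows the `instruction`/`output` layout shown earlier; `count_node_types` and the two sample records are illustrative only.

```python
import json
from collections import Counter


def count_node_types(records):
    """Tally the `type` of every node across the decoded workflows."""
    counts = Counter()
    for record in records:
        workflow = json.loads(record["output"])
        for node in workflow.get("nodes", []):
            counts[node.get("type", "unknown")] += 1
    return counts


# Two illustrative records standing in for rows of the dataset.
records = [
    {
        "instruction": "Notify Slack when a webhook fires",
        "output": json.dumps({"nodes": [
            {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
            {"name": "Slack", "type": "n8n-nodes-base.slack"},
        ]}),
    },
    {
        "instruction": "Receive a webhook",
        "output": json.dumps({"nodes": [
            {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
        ]}),
    },
]

type_counts = count_node_types(records)
for node_type, count in type_counts.most_common():
    print(node_type, count)
```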

## Intended Uses

### Primary Use

- Fine-tuning language models for n8n workflow generation
- Training code assistants specialized in automation

### Out-of-Scope Use

- Direct production deployment without validation
- Training models for other automation platforms (Zapier, Make, etc.)

## Limitations

- **Node Coverage**: Not all 400+ n8n nodes are represented equally
- **Complexity**: Most workflows are of simple to medium complexity (2-8 nodes)
- **Validation**: Workflows are structurally valid but may require credential configuration before they can run
- **Version**: Based on n8n workflow schema as of late 2024; may need updates for future n8n versions
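
Given these limitations, it can help to run a basic structural check on a workflow (from the dataset or generated by a model) before importing it into n8n. The sketch below is not n8n's own validation: the set of required keys and the `check_workflow` helper are assumptions chosen for illustration, and it only confirms that the JSON parses and that connections point at nodes that exist.

```python
import json

# Assumption: the minimal keys checked here; n8n itself expects more.
REQUIRED_NODE_KEYS = {"name", "type"}


def check_workflow(output_str):
    """Return a list of structural problems; an empty list means none were found."""
    try:
        workflow = json.loads(output_str)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(workflow, dict):
        return ["top-level value is not an object"]
    nodes = workflow.get("nodes")
    if not isinstance(nodes, list) or not nodes:
        return ["missing or empty 'nodes' list"]

    problems = []
    names = set()
    for index, node in enumerate(nodes):
        if not isinstance(node, dict):
            problems.append(f"node {index} is not an object")
            continue
        missing = REQUIRED_NODE_KEYS - node.keys()
        if missing:
            problems.append(f"node {index} lacks {sorted(missing)}")
        if "name" in node:
            names.add(node["name"])

    # Every connection endpoint should refer to a declared node name.
    for source, outputs in workflow.get("connections", {}).items():
        if source not in names:
            problems.append(f"connection source '{source}' is not a node")
        for branch in outputs.get("main", []):
            for target in branch or []:
                target_name = target.get("node")
                if target_name not in names:
                    problems.append(f"connection target '{target_name}' is not a node")
    return problems


good = json.dumps({
    "nodes": [{"name": "A", "type": "n8n-nodes-base.webhook"},
              {"name": "B", "type": "n8n-nodes-base.slack"}],
    "connections": {"A": {"main": [[{"node": "B", "type": "main", "index": 0}]]}},
})
print(check_workflow(good))
```

Credentials and node-specific parameters are outside the scope of this check, so a workflow that passes may still need configuration inside n8n.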

## Dataset Creation

### Source Data

Workflows were collected and curated from:
- Public n8n workflow templates
- Community-shared automations
- Synthetically generated examples with manual validation

### Curation Process

1. Collection of raw workflow JSON files
2. Extraction and normalization of workflow structure
3. Generation of natural language descriptions
4. Manual review for quality and accuracy
5. Deduplication and filtering
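
The exact tooling behind these steps is not documented here. As one possible way to approach the deduplication step, the sketch below fingerprints each workflow by hashing a canonical form of its JSON, so that records differing only in key order or whitespace collapse to one. The helper names and sample records are hypothetical.

```python
import hashlib
import json


def workflow_fingerprint(output_str):
    """Hash a canonical form of the workflow so formatting differences are ignored."""
    workflow = json.loads(output_str)
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each distinct workflow fingerprint."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = workflow_fingerprint(record["output"])
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


# The first two records hold the same workflow with different key order.
records = [
    {"instruction": "Notify Slack", "output": '{"name": "A", "nodes": []}'},
    {"instruction": "Send a Slack notification", "output": '{"nodes":[],"name":"A"}'},
    {"instruction": "Post to Discord", "output": '{"name": "B", "nodes": []}'},
]

unique = deduplicate(records)
print(len(unique))
```

Exact-match hashing like this would miss near-duplicates, such as workflows that differ only in node positions; handling those would need a looser comparison.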

## Models Trained on This Dataset

- [eclaude/qwen-coder-3b-n8n-sft](https://huggingface.co/eclaude/qwen-coder-3b-n8n-sft)

## Citation

```bibtex
@dataset{n8n_workflows_sft_2025,
  author = {eclaude},
  title = {n8n Workflows SFT Dataset},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/eclaude/n8n-workflows-sft}
}
```

## Contact

For questions, suggestions, or contributions, open a discussion on this repository or contact the author via [Hugging Face](https://huggingface.co/eclaude).