---
license: apache-2.0
task_categories:
- text-generation
language:
- en
tags:
- n8n
- workflow-automation
- code-generation
- sft
- json
- low-code
- automation
pretty_name: n8n Workflows SFT Dataset
size_categories:
- 1K<n<10K
---

# n8n Workflows SFT Dataset

A curated dataset of [n8n](https://n8n.io/) workflow examples paired with natural language descriptions, designed for supervised fine-tuning (SFT) of code generation models.

## Dataset Description

This dataset contains instruction-workflow pairs where each example consists of:
- A natural language description of an automation task
- The corresponding valid n8n workflow JSON configuration

The dataset is specifically formatted for training models to generate n8n workflows from user prompts.

| Property | Value |
|----------|-------|
| **Format** | JSON |
| **Size** | 1K-10K examples |
| **Language** | English |
| **License** | Apache 2.0 |

## Dataset Structure

### Data Fields

```json
{
  "instruction": "string - Natural language description of the desired workflow",
  "output": "string - Valid n8n workflow JSON configuration"
}
```
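
Because `output` is stored as a JSON string rather than a nested object, it has to be decoded before it can be inspected. The sketch below shows one way to do that with the standard library. The record is a hypothetical stand-in written for illustration (it is not a row from the dataset), and `parse_workflow` is a helper name introduced here.

```python
import json

# Hypothetical record in the dataset's two-field layout (not a real dataset row).
record = {
    "instruction": "Send a Slack message when a webhook fires",
    "output": json.dumps({
        "name": "Webhook to Slack",
        "nodes": [
            {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
            {"name": "Slack", "type": "n8n-nodes-base.slack"},
        ],
        "connections": {
            "Webhook": {"main": [[{"node": "Slack", "type": "main", "index": 0}]]}
        },
    }),
}


def parse_workflow(example):
    """Decode the `output` string of one example into a Python dict."""
    return json.loads(example["output"])


workflow = parse_workflow(record)
node_types = [node["type"] for node in workflow["nodes"]]
print(workflow["name"], node_types)
```

The same helper can be passed row by row over `dataset["train"]` once the dataset is loaded.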

### Example

```json
{
  "instruction": "Create a workflow that triggers on a webhook, filters incoming data based on a status field, and sends a notification to Slack",
  "output": "{\"name\":\"Webhook to Slack\",\"nodes\":[{\"parameters\":{\"path\":\"status-webhook\"},\"name\":\"Webhook\",\"type\":\"n8n-nodes-base.webhook\",\"typeVersion\":1,\"position\":[250,300]},{\"parameters\":{\"conditions\":{\"string\":[{\"value1\":\"={{$json[\\\"status\\\"]}}\",\"value2\":\"active\"}]}},\"name\":\"Filter\",\"type\":\"n8n-nodes-base.filter\",\"typeVersion\":1,\"position\":[450,300]},{\"parameters\":{\"channel\":\"#notifications\",\"text\":\"New active status received\"},\"name\":\"Slack\",\"type\":\"n8n-nodes-base.slack\",\"typeVersion\":1,\"position\":[650,300]}],\"connections\":{\"Webhook\":{\"main\":[[{\"node\":\"Filter\",\"type\":\"main\",\"index\":0}]]},\"Filter\":{\"main\":[[{\"node\":\"Slack\",\"type\":\"main\",\"index\":0}]]}}}"
}
```
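
In the example above, `connections` maps each source node name to its outputs, and each output holds a list of branches pointing at target nodes. A minimal sketch of flattening that structure into `(source, target)` pairs follows. It assumes every workflow uses the same nesting as the example, and `list_edges` is a helper name introduced here for illustration.

```python
def list_edges(workflow):
    """Flatten the nested `connections` mapping into (source, target) name pairs."""
    edges = []
    for source, outputs in workflow.get("connections", {}).items():
        for branches in outputs.values():      # keyed by output type, e.g. "main"
            for branch in branches:            # one list of targets per output index
                for target in branch or []:    # tolerate empty or null branches
                    edges.append((source, target["node"]))
    return edges


# Same connection layout as the example record, with node details omitted.
workflow = {
    "connections": {
        "Webhook": {"main": [[{"node": "Filter", "type": "main", "index": 0}]]},
        "Filter": {"main": [[{"node": "Slack", "type": "main", "index": 0}]]},
    }
}

edges = list_edges(workflow)
print(edges)
```

For this input the pairs follow the order in which sources appear in `connections`: Webhook to Filter, then Filter to Slack.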

## Usage

### Loading with 🤗 Datasets

```python
from datasets import load_dataset

dataset = load_dataset("eclaude/n8n-workflows-sft")

# Access training data
print(dataset["train"][0])
```

### Loading with Pandas

```python
import pandas as pd

df = pd.read_json("hf://datasets/eclaude/n8n-workflows-sft/data.json")
print(df.head())
```

### Preparing for SFT Training

```python
from datasets import load_dataset

dataset = load_dataset("eclaude/n8n-workflows-sft")

def format_for_chat(example):
    """Format examples for chat-style fine-tuning."""
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are an n8n workflow expert. Generate valid n8n workflow JSON configurations based on user requirements."
            },
            {
                "role": "user",
                "content": example["instruction"]
            },
            {
                "role": "assistant",
                "content": example["output"]
            }
        ]
    }

formatted_dataset = dataset.map(format_for_chat)
```

### Training with TRL

```python
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"
dataset = load_dataset("eclaude/n8n-workflows-sft")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def formatting_func(example):
    """Render one example in the ChatML layout used by Qwen2.5 models."""
    return f"""<|im_start|>system
You are an n8n workflow expert. Generate valid n8n workflow JSON configurations.<|im_end|>
<|im_start|>user
{example['instruction']}<|im_end|>
<|im_start|>assistant
{example['output']}<|im_end|>"""

training_args = SFTConfig(
    output_dir="./n8n-sft-model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    # The sequence length limit belongs to the config in recent TRL releases;
    # older releases name this option `max_seq_length`.
    max_length=2048,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    formatting_func=formatting_func,
    # Older TRL releases take `tokenizer=` instead of `processing_class=`.
    processing_class=tokenizer,
)

trainer.train()
```

## Covered n8n Nodes

The dataset includes workflows featuring common n8n integrations:

| Category | Nodes |
|----------|-------|
| **Triggers** | Webhook, Schedule, Manual |
| **Core** | HTTP Request, Code, Function, Set, Filter, Switch, Merge |
| **Communication** | Slack, Discord, Email, Telegram |
| **Data** | PostgreSQL, MySQL, MongoDB, Airtable, Google Sheets |
| **Dev Tools** | GitHub, GitLab, Jira |
| **Storage** | AWS S3, Google Drive, Dropbox |
| **CRM** | HubSpot, Salesforce |

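
To see which node types a given slice of the data actually contains, the `type` field of every node can be tallied. The sketch below does this with `collections.Counter`. The two records are hypothetical stand-ins rather than dataset rows, and `count_node_types` is a helper name introduced here.

```python
import json
from collections import Counter


def count_node_types(records):
    """Count how often each node `type` appears across the decoded workflows."""
    counts = Counter()
    for record in records:
        workflow = json.loads(record["output"])
        counts.update(node["type"] for node in workflow.get("nodes", []))
    return counts


# Hypothetical records in the dataset's layout, not real dataset rows.
records = [
    {
        "instruction": "Post to Slack when a webhook fires",
        "output": json.dumps({"nodes": [
            {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
            {"name": "Slack", "type": "n8n-nodes-base.slack"},
        ]}),
    },
    {
        "instruction": "Set a field on incoming webhook data",
        "output": json.dumps({"nodes": [
            {"name": "Webhook", "type": "n8n-nodes-base.webhook"},
            {"name": "Set", "type": "n8n-nodes-base.set"},
        ]}),
    },
]

node_counts = count_node_types(records)
for node_type, count in node_counts.most_common():
    print(f"{node_type}: {count}")
```

Run over `dataset["train"]`, the same function gives a rough picture of how evenly the categories in the table are represented.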

## Intended Uses

### Primary Use

- Fine-tuning language models for n8n workflow generation
- Training code assistants specialized in automation

### Out-of-Scope Use

- Direct production deployment without validation
- Training models for other automation platforms (Zapier, Make, etc.)

## Limitations

- **Node Coverage**: Not all 400+ n8n nodes are represented equally
- **Complexity**: Most workflows are simple to medium complexity (2-8 nodes)
- **Validation**: Workflows are structurally valid but may require credential configuration
- **Version**: Based on n8n workflow schema as of late 2024; may need updates for future n8n versions

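
The card does not spell out which checks "structurally valid" covers. As one assumed example of such a check, the sketch below reports node names that appear in `connections` but are not defined in `nodes`. The function name `find_dangling_connections` and the sample workflows are illustrative, and the check is not a substitute for importing a workflow into n8n.

```python
def find_dangling_connections(workflow):
    """Return sorted node names used in `connections` but missing from `nodes`."""
    known = {node["name"] for node in workflow.get("nodes", [])}
    referenced = set()
    for source, outputs in workflow.get("connections", {}).items():
        referenced.add(source)
        for branches in outputs.values():
            for branch in branches:
                referenced.update(target["node"] for target in branch or [])
    return sorted(referenced - known)


# Hypothetical workflows: the second one misspells its target node name.
valid = {
    "nodes": [{"name": "Webhook"}, {"name": "Slack"}],
    "connections": {"Webhook": {"main": [[{"node": "Slack", "type": "main", "index": 0}]]}},
}
broken = {
    "nodes": [{"name": "Webhook"}, {"name": "Slack"}],
    "connections": {"Webhook": {"main": [[{"node": "Slak", "type": "main", "index": 0}]]}},
}

valid_issues = find_dangling_connections(valid)
broken_issues = find_dangling_connections(broken)
print(valid_issues, broken_issues)
```

An empty list means every connection endpoint resolves to a defined node; credentials and node parameters still need to be checked separately.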

## Dataset Creation

### Source Data

Workflows were collected and curated from:
- Public n8n workflow templates
- Community-shared automations
- Synthetically generated examples with manual validation

### Curation Process

1. Collection of raw workflow JSON files
2. Extraction and normalization of workflow structure
3. Generation of natural language descriptions
4. Manual review for quality and accuracy
5. Deduplication and filtering

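
The card does not describe how normalization and deduplication were implemented. One plausible approach, shown here purely as an assumption, is to re-serialize each workflow with sorted keys and compact separators and then keep the first record for each resulting hash. The names `canonical_key` and `deduplicate` and the sample records are hypothetical.

```python
import hashlib
import json


def canonical_key(output_str):
    """Hash a workflow after re-serializing it with sorted keys and no extra whitespace."""
    workflow = json.loads(output_str)
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each canonical workflow hash."""
    seen = set()
    unique = []
    for record in records:
        key = canonical_key(record["output"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical records: the first two differ only in key order and whitespace.
records = [
    {"instruction": "Ping workflow", "output": '{"name": "Ping", "nodes": []}'},
    {"instruction": "Ping workflow again", "output": '{"nodes":[],"name":"Ping"}'},
    {"instruction": "Pong workflow", "output": '{"name":"Pong","nodes":[]}'},
]

unique = deduplicate(records)
print(len(records), "->", len(unique))
```

This only catches exact structural duplicates. Near-duplicates, such as workflows that differ only in node positions, would need additional normalization before hashing.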

## Models Trained on This Dataset

- [eclaude/qwen-coder-3b-n8n-sft](https://huggingface.co/eclaude/qwen-coder-3b-n8n-sft)

## Citation

```bibtex
@dataset{n8n_workflows_sft_2025,
  author = {eclaude},
  title = {n8n Workflows SFT Dataset},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/eclaude/n8n-workflows-sft}
}
```

## Contact

For questions, suggestions, or contributions, open a discussion on this repository or contact the author via [Hugging Face](https://huggingface.co/eclaude).