Madlab Synthetic Data Generator

🧠 Overview

The Madlab SDG 1.2B is part of the MadlabOSS Synthetic Data Generator family, a suite of small, efficient synthetic data generators designed for rule-consistent, semantically coherent variation.
This model was trained on a closed-source dataset created through a multi-stage synthetic data generation process using a modified Madlab training pipeline. It is the first model in the family built on the LFM2.5-instruct foundation.


πŸš€ Intended Use

This model is optimized for:

  • Madlab synthetic data generation

It is not intended as a general-purpose chatbot.
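Although it is not a chatbot, the model can be called like any other Transformers causal language model. The snippet below is a minimal usage sketch for MadlabOSS/LFM2.5-1.2B-Instruct-SDG; the prompt wording is an illustrative assumption, not a documented Madlab SDG prompt format, and it assumes the tokenizer ships a chat template.

```python
# Minimal usage sketch. The prompt wording below is illustrative only and is
# not the documented Madlab SDG prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MadlabOSS/LFM2.5-1.2B-Instruct-SDG"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

messages = [{
    "role": "user",
    "content": "Generate 5 variations of this pair:\n"
               "input: <seed input>\ntarget: <seed target>",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```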


🧩 Model Details

Base Model: LFM2.5-1.2B-instruct
Parameter Count: 1.2 Billion
Training Type: Supervised fine-tuning
Sequence Length: 1024
Precision: FP16
Framework: PyTorch / Transformers


πŸ“¦ Training Data

The model was trained on 1,444 compressed and encoded dataset pairs, all generated with Madlab. The data was constructed to provide:

  • High variation in outputs
  • Preservation of semantic meaning

πŸ‹οΈ Training Procedure

Hyperparameters

  • Epochs: 6
  • Batch size: 48
  • Learning rate: cosine schedule, peak ~4e-5
  • Optimizer: AdamW
  • Gradient clipping: 1.0
  • Gradient accumulation: 1
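The modified Madlab training pipeline itself is not published. As a rough illustration only, the hyperparameters above map onto Hugging Face TrainingArguments as sketched below; the output directory is a placeholder and warmup settings are not specified in this card.

```python
# Hypothetical mapping of the reported hyperparameters onto Hugging Face
# TrainingArguments; the actual (modified Madlab) pipeline is not published.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="lfm2.5-1.2b-sdg",      # placeholder, not from the card
    num_train_epochs=6,
    per_device_train_batch_size=48,
    gradient_accumulation_steps=1,
    learning_rate=4e-5,                 # peak learning rate
    lr_scheduler_type="cosine",
    optim="adamw_torch",                # AdamW
    max_grad_norm=1.0,                  # gradient clipping
    fp16=True,                          # FP16 precision
)
```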

Hardware

Training was performed on:

  • RTX 6000 Blackwell (96GB)

πŸ“Š Evaluation

(Figure: multi_model_dashboard)

Synthetic Data Expansion Benchmark

A curated set of 30 input/target pairs was programmatically expanded using a Python script; the task is to generate 5 variations of each incoming pair.
Metrics include the number of seed pairs covered, total variation count, and semantic quality.
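The benchmark script is not included here; the sketch below shows the general shape of the expansion loop under stated assumptions: each of the 30 seed pairs is sent to the generator with a request for 5 variations, and outputs are tallied into the coverage and variation metrics. The generate_variations helper is hypothetical, and semantic quality is scored separately.

```python
# Hypothetical sketch of the expansion benchmark; generate_variations is a
# placeholder for the actual model call, and the real script is not published.
def run_benchmark(seed_pairs, generate_variations, per_seed=5):
    seeds_covered = 0
    total_variations = 0
    for pair in seed_pairs:                              # 30 curated input/target pairs
        outputs = generate_variations(pair, n=per_seed)  # ask the SDG model for 5 variations
        variations = [v for v in outputs if v]           # keep non-empty results
        if variations:
            seeds_covered += 1
        total_variations += len(variations)
    return {"seeds_covered": seeds_covered, "variations": total_variations}
```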

Note: run numbers in the table below are not aligned with the multi_model_dashboard figure.

| Run | Model | Semantic Quality | Variations | Seeds Covered | Efficiency (Variations / B Params) | Dataset |
|-----|-------|------------------|------------|---------------|------------------------------------|---------|
| 1 | LFM2-350M-16 | 6.5 | 94 | 23 | 268.57 | Madlab sdg small |
| 2 | LFM2-350M-16 | 3.5 | 46 | 11 | 131.43 | base model |
| 3 | LFM2-350M-f16 | 6.5 | 97 | 22 | 277.14 | Madlab sdg small |
| 4 | Qwen3-coder-30B-instruct-q8 | 8.2 | 149 | 26 | 4.97 | base model |
| 5 | LFM2-350M-f16 | 7.5 | 136 | 21 | 388.57 | Madlab sdg medium |
| 6 | LFM2-2.6B-f16 | 9.0 | 137 | 25 | 52.69 | Madlab sdg medium |
| 7 | LFM2-2.6B-f16 | 9.9 | 180 | 25 | 69.23 | Madlab sdg large |
| 8 | LFM2-2.6B-f16 | 6.2 | 157 | 20 | 60.38 | Madlab sdg test |
| 9 | LFM2-2.6B-f16 | 10.0 | 248 | 27 | 95.38 | Madlab sdg large |
| 10 | Qwen3-235B-q3-k_m | 9.5 | 150 | 27 | 0.64 | base model |
| 11 | LFM2.5-1.2B-instruct-f16 | 9.1 | 244 | 30 | 203.33 | Madlab sdg large |
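
The efficiency column appears to be total variations divided by the model's parameter count in billions; for run 11, 244 variations / 1.2B parameters ≈ 203.33.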

Qualitative Behavior

  • Produces an unusually high variation count for its parameter size
  • Maintains strict semantic correctness

πŸ”’ Safety

This model is a synthetic data generator. It is not designed for conversational use and should only be used to generate synthetic datasets.

In particular, it is not intended for:

  • Political advice
  • Medical advice
  • Legal advice
  • General-purpose conversation

⚠️ Limitations

  • Not a general assistant
  • Not trained for coding, math, or open-domain reasoning
  • May refuse tasks outside the Madlab SDG scope
