---
license: other
license_name: lfm1.0
license_link: https://huggingface.co/LiquidAI/LFM2-2.6B/blob/main/LICENSE
metrics:
- Synthetic Data Expansion Benchmark
base_model:
- LiquidAI/LFM2-2.6B
tags:
- lmstudio
- madlabOSS
- synthetic data generator
---

# Madlab Synthetic Data Generator

## 🧠 Overview

The **Madlab SDG 2.6B** is part of the **MadlabOSS Synthetic Data Generator** family, a suite of small, efficient synthetic data generators designed for rule-consistent, semantically coherent variation.

This model was trained on a closed-source dataset created through a multi-stage synthetic data generation process using a modified Madlab training pipeline.

---

## 🚀 Intended Use

This model is optimized for:

- Madlab synthetic data generation

It is **not** intended as a general-purpose chatbot.

---

## 🧩 Model Details

**Base Model:** LFM2-2.6B

**Parameter Count:** 2.6 billion

**Training Type:** Supervised fine-tuning

**Sequence Length:** 1024 tokens

**Precision:** FP16

**Framework:** PyTorch / Transformers

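As a rough sketch of how a model in this family might be loaded for generation with Transformers: the repo id, prompt format, and sampling settings below are illustrative placeholders, not a documented interface.

```python
# Minimal sketch, not official usage: the repo id, prompt format, and
# sampling settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "madlabOSS/Madlab-SDG-2.6B"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the card lists FP16 precision
    device_map="auto",          # requires the accelerate package
)

# The benchmark task: 5 variations of one input/target pair.
prompt = (
    "Generate 5 variations of the following input/target pair, "
    "preserving its meaning:\n"
    "Input: The cat sat on the mat.\n"
    "Target: A cat was sitting on the mat."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
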
---

## 📦 Training Data

The model was trained on **1,444 compressed and encoded dataset pairs**, generated entirely with Madlab. The data emphasizes:

- High variation in output
- Preservation of semantic meaning

---

## 🏋️ Training Procedure

### **Hyperparameters**

- Epochs: 6
- Batch size: 48
- Learning rate: cosine schedule, peak ~4e-5
- Optimizer: AdamW
- Gradient clipping: 1.0
- Gradient accumulation: 1

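A minimal sketch of how these hyperparameters map onto the Transformers `TrainingArguments`: the listed values come from the list above, while the output path and any unlisted settings are assumptions (model, tokenizer, and dataset plumbing omitted).

```python
# Sketch only: hyperparameter values mirror the list above; everything
# else is an assumption, and dataset/model setup is omitted.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="madlab-sdg-2.6b-sft",  # hypothetical output path
    num_train_epochs=6,
    per_device_train_batch_size=48,
    gradient_accumulation_steps=1,
    learning_rate=4e-5,                # peak learning rate
    lr_scheduler_type="cosine",
    optim="adamw_torch",               # AdamW optimizer
    max_grad_norm=1.0,                 # gradient clipping
    fp16=True,                         # the card lists FP16 precision
)
```
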
### **Hardware**

Training was performed on:

- NVIDIA RTX 6000 Blackwell (96 GB)

---

## 📊 Evaluation

### **Synthetic Data Expansion Benchmark**

A curated set of 30 input/target seed pairs is programmatically expanded using a Python script; the task is to generate 5 variations of each incoming pair.

Reported metrics are semantic quality, total variation count, seed pairs covered (out of 30), and efficiency (variations generated per billion parameters).

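A minimal sketch of the run bookkeeping, assuming a run's accepted outputs are grouped per seed pair; the benchmark script itself is not published, and semantic quality is scored separately.

```python
# Hypothetical summary of one benchmark run: `generations` maps each of
# the 30 seed pair ids to the accepted variations produced for that seed.
def summarize_run(generations: dict[int, list[str]], params_billions: float) -> dict:
    total_variations = sum(len(v) for v in generations.values())
    seeds_covered = sum(1 for v in generations.values() if v)  # seeds with >= 1 variation
    return {
        "variations": total_variations,
        "seeds_covered": seeds_covered,
        # Efficiency column in the table below: variations per billion parameters.
        "efficiency": round(total_variations / params_billions, 2),
    }
```

For example, run 9 in the table below (LFM2-2.6B-f16, 248 variations) reports an efficiency of 248 / 2.6 ≈ 95.38 variations per billion parameters.
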
| Run | Model | Semantic Quality (0-10) | Variations | Seeds Covered (of 30) | Efficiency (Variations / B Params) | Fine-tune Dataset |
|-----|-------|-------------------------|------------|-----------------------|------------------------------------|-------------------|
| 1 | LFM2-350M-f16 | 6.5 | 94 | 23 | 268.57 | Madlab sdg small |
| 2 | LFM2-350M-f16 | 3.5 | 46 | 11 | 131.43 | base model |
| 3 | LFM2-350M-f16 | 6.5 | 97 | 22 | 277.14 | Madlab sdg small |
| 4 | Qwen3-coder-30B-instruct-q8 | 8.2 | 149 | 26 | 4.97 | base model |
| 5 | LFM2-350M-f16 | 7.5 | 136 | 21 | 388.57 | Madlab sdg medium |
| 6 | LFM2-2.6B-f16 | 9.0 | 137 | 25 | 52.69 | Madlab sdg medium |
| 7 | LFM2-2.6B-f16 | 9.9 | 180 | 25 | 69.23 | Madlab sdg large |
| 8 | LFM2-2.6B-f16 | 6.2 | 157 | 20 | 60.38 | Madlab sdg test |
| 9 | LFM2-2.6B-f16 | 10.0 | 248 | 27 | 95.38 | Madlab sdg large |
| 10 | Qwen3-235B-q3_k_m | 9.5 | 150 | 27 | 0.64 | base model |
| 11 | LFM2.5-1.2B-instruct-f16 | 9.1 | 244 | 30 | 203.33 | Madlab sdg large |

### **Qualitative Behavior**

- Overperforms in variation count relative to model size
- Maintains strict semantic correctness

---

## 🔒 Safety

This model is a synthetic data generator. It is not designed for conversational use and is not suitable for anything other than generating synthetic datasets.

In particular, it is **not** designed for:

- Political advice
- Medical advice
- Legal advice
- General-purpose conversation

---

## ⚠️ Limitations

- Not a general assistant
- Not trained for coding, math, or open-domain reasoning
- May refuse tasks outside the Madlab SDG scope

---