---
license: other
license_name: lfm1.0
license_link: https://huggingface.co/LiquidAI/LFM2-2.6B/blob/main/LICENSE
metrics:
- Synthetic Data Expansion Benchmark
base_model:
- LiquidAI/LFM2-2.6B
tags:
- lmstudio
- madlabOSS
- synthetic data generator
---

# Madlab Synthetic Data Generator

## 🧠 Overview

The **Madlab SDG 2.6B** is part of the **MadlabOSS Synthetic Data Generator** family — a suite of small, efficient synthetic data generators designed for rule‑consistent, semantically coherent variation.

This model was trained on a closed-source dataset created through a multi-stage synthetic data generation process using a modified Madlab training pipeline.

---

## 🚀 Intended Use

This model is optimized for:

- Madlab synthetic data generation

It is **not** intended as a general-purpose chatbot.

---

## 🧩 Model Details

- **Base Model:** LFM2-2.6B
- **Parameter Count:** 2.6 Billion
- **Training Type:** Supervised fine-tuning
- **Sequence Length:** 1024
- **Precision:** FP16
- **Framework:** PyTorch / Transformers

---

## 📦 Training Data

The model was trained on:

- **1444 compressed and encoded dataset pairs**
- High variation in output
- Preservation of semantic meaning
- Data entirely generated with Madlab

---

## 🏋️ Training Procedure

### **Hyperparameters**

- Epochs: 6
- Batch size: 48
- Learning rate: cosine schedule, peak ~4e-5
- Optimizer: AdamW
- Gradient clipping: 1.0
- Gradient accumulation: 1

### **Hardware**

Training was performed on:

- RTX 6000 Blackwell (96GB)

---

## 📊 Evaluation

### **Synthetic Data Expansion Benchmark**

A curated set of 30 input/target pairs was programmatically expanded using a Python script. The task is to generate 5 variations of each incoming pair. Metrics include seed pairs covered, total variation count, and semantic quality.
| Run | Model | Semantic Quality (0–10) | Variations | Seeds Covered | Efficiency (Variations / B Params) | Dataset |
|-----|-------|-------------------------|------------|---------------|------------------------------------|---------|
| 1 | LFM2-350M-f16 | 6.5 | 94 | 23 | 268.57 | Madlab sdg small |
| 2 | LFM2-350M-f16 | 3.5 | 46 | 11 | 131.43 | base model |
| 3 | LFM2-350M-f16 | 6.5 | 97 | 22 | 277.14 | Madlab sdg small |
| 4 | Qwen3-coder-30B-instruct-q8 | 8.2 | 149 | 26 | 4.97 | base model |
| 5 | LFM2-350M-f16 | 7.5 | 136 | 21 | 388.57 | Madlab sdg medium |
| 6 | LFM2-2.6B-f16 | 9.0 | 137 | 25 | 52.69 | Madlab sdg medium |
| 7 | LFM2-2.6B-f16 | 9.9 | 180 | 25 | 69.23 | Madlab sdg large |
| 8 | LFM2-2.6B-f16 | 6.2 | 157 | 20 | 60.38 | Madlab sdg test |
| 9 | LFM2-2.6B-f16 | 10.0 | 248 | 27 | 95.38 | Madlab sdg large |
| 10 | Qwen3-235B-q3-k_m | 9.5 | 150 | 27 | 0.64 | base model |
| 11 | LFM2.5-1.2B-instruct-f16 | 9.1 | 244 | 30 | 203.33 | Madlab sdg large |

### **Qualitative Behavior**

- Produces a high variation count relative to parameter count
- Maintains strict semantic correctness

---

## 🔒 Safety

This model is a synthetic data generator. It is not designed for conversational use and is not suitable for anything other than generating synthetic datasets.

It is **not** designed for:

- Political advice
- Medical advice
- Legal advice
- General-purpose conversation

---

## ⚠️ Limitations

- Not a general assistant
- Not trained for coding, math, or open-domain reasoning
- May refuse tasks outside the Madlab SDG scope

---
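## 🧪 Benchmark Metric Sketch

The summary metrics in the Evaluation table (variation count, seeds covered, efficiency) can be derived from a run log. The Python sketch below is illustrative only: the function name, the run-log shape, and the efficiency definition (variations per billion parameters, inferred from the table's numbers) are assumptions, not the published benchmark script.

```python
# Illustrative sketch: compute Synthetic Data Expansion Benchmark summary
# metrics from a run log. The run-log format (list of (seed_id, text) pairs)
# and the per-billion-params efficiency definition are assumptions.

def summarize_run(variations, n_params_billion):
    """variations: list of (seed_id, variation_text) pairs from one run."""
    total = len(variations)
    # A seed is "covered" if at least one variation was produced for it.
    seeds_covered = len({seed_id for seed_id, _ in variations})
    # Efficiency as reported in the table: variations per billion parameters.
    efficiency = total / n_params_billion
    return {
        "variations": total,
        "seeds_covered": seeds_covered,
        "efficiency": round(efficiency, 2),
    }

# Example mirroring run 6: 137 variations spread over 25 of the 30 seeds,
# generated by the 2.6B model.
run_log = [(s % 25, f"variation {s}") for s in range(137)]
print(summarize_run(run_log, 2.6))
# {'variations': 137, 'seeds_covered': 25, 'efficiency': 52.69}
```

Semantic quality is not reproducible from counts alone; per the card it is scored separately (e.g. by a judge model or rubric), so it is left out of this sketch.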