Archi-medes committed
Commit d2f4be4 · verified · 1 Parent(s): 5799111

Update README.md

Files changed (1):
  1. README.md +121 -5
README.md CHANGED
@@ -1,5 +1,121 @@
- ---
- license: other
- license_name: lfm1.0
- license_link: https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct/blob/main/LICENSE
- ---

---
license: other
license_name: lfm1.0
license_link: https://huggingface.co/LiquidAI/LFM2-1.2B-instruct/blob/main/LICENSE
metrics:
- Synthetic Data Expansion Benchmark
base_model:
- LiquidAI/LFM2-1.2B-instruct
tags:
- lmstudio
- madlabOSS
- synthetic data generator
---

# Madlab Synthetic Data Generator

## 🧠 Overview
The **Madlab SDG 1.2B** is part of the **MadlabOSS Synthetic Data Generator** family: a suite of small, efficient, and highly deterministic synthetic data generators.
This model was trained on a closed-source dataset created through a multi-stage synthetic data generation process using a modified Madlab training pipeline.
It is the first model in the family built on the **LFM2.5-instruct** foundation, a significant step up from previous iterations.

---

## 🚀 Intended Use
This model is optimized for:

- Madlab synthetic data generation

It is **not** intended as a general-purpose chatbot.

---

## 🧩 Model Details

**Base Model:** LFM2.5-1.2B-instruct
**Parameter Count:** 1.2 Billion
**Training Type:** Supervised fine-tuning
**Sequence Length:** 1024
**Precision:** FP16
**Framework:** PyTorch / Transformers
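
A minimal loading and generation sketch with 🤗 Transformers is shown below. The repo id (`madlabOSS/madlab-sdg-1.2b`) and the prompt wording are hypothetical placeholders, not confirmed values; check the actual repository name and any chat/prompt template before use.

```python
# Minimal sketch: load the model in FP16 and expand one seed pair.
# The repo id and prompt below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "madlabOSS/madlab-sdg-1.2b"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Ask for 5 variations of a single input/target pair (illustrative prompt).
prompt = (
    "Generate 5 variations of the following pair, preserving its meaning.\n"
    "Input: How do I reset my password?\n"
    "Target: Go to Settings > Account > Reset Password."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```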

---

## 📦 Training Data
The model was trained on:

- **1444 compressed and encoded dataset pairs**
- High variation in output
- Preservation of semantic meaning
- Data entirely generated with Madlab

---

## 🏋️ Training Procedure

### **Hyperparameters**
- Epochs: 6
- Batch size: 48
- Learning rate: cosine schedule, peak ~4e-5
- Optimizer: AdamW
- Gradient clipping: 1.0
- Gradient accumulation: 1
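
For reference, the settings above might be expressed as Hugging Face `TrainingArguments` roughly as follows; the output directory, optimizer string, and logging interval are illustrative assumptions, not values taken from the Madlab pipeline.

```python
# Illustrative only: maps the hyperparameters listed above onto
# Hugging Face TrainingArguments. Paths and logging are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="madlab-sdg-1.2b-sft",   # hypothetical output path
    num_train_epochs=6,
    per_device_train_batch_size=48,
    gradient_accumulation_steps=1,
    learning_rate=4e-5,                  # peak learning rate
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    max_grad_norm=1.0,                   # gradient clipping
    fp16=True,
    logging_steps=10,
)
```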

### **Hardware**
Training was performed on:

- RTX 6000 Blackwell (96GB)

---

## 📊 Evaluation

![multi_model_dashboard](https://cdn-uploads.huggingface.co/production/uploads/68ec78cca886edada26b90b0/6ERUc70a2I0_e9y8aK5A5.png)

### **Synthetic Data Expansion Benchmark**
A curated set of 30 input/target pairs is expanded programmatically by a Python script: the model is asked to generate 5 variations of each incoming pair.
Metrics include the number of seed pairs covered, the total variation count, and semantic quality.
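
The tallying might look roughly like the sketch below, where `generate_variations` and `judge_semantic_quality` are hypothetical callables standing in for the model inference and scoring steps; this is not the actual benchmark script, only an illustration of how the table columns relate (efficiency is variations per billion parameters).

```python
# Sketch of the benchmark tally. The expansion and scoring steps are passed
# in as callables; nothing here is the actual Madlab benchmark script.
def run_benchmark(seed_pairs, generate_variations, judge_semantic_quality, params_in_billions):
    """Expand each seed pair into 5 variations and tally the reported metrics."""
    total_variations = 0
    seeds_covered = 0
    quality_scores = []

    for pair in seed_pairs:                            # 30 curated input/target pairs
        variations = generate_variations(pair, n=5)    # model inference step
        valid = [v for v in variations if v]           # keep non-empty outputs
        if valid:
            seeds_covered += 1                         # this seed counts as covered
        total_variations += len(valid)
        quality_scores.extend(judge_semantic_quality(pair, v) for v in valid)

    return {
        "variations": total_variations,
        "seeds_covered": seeds_covered,
        "semantic_quality": sum(quality_scores) / max(len(quality_scores), 1),
        "efficiency": total_variations / params_in_billions,  # variations per B params
    }
```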

Note: run numbers are not aligned with the multi_model_dashboard figure above.

| Run | Model | Semantic Quality | Variations | Seeds Covered | Efficiency (Variations / B Params) | Dataset |
|-----|-------|------------------|------------|---------------|------------------------------------|---------|
| 1 | LFM2-350M-16 | 6.5 | 94 | 23 | 268.57 | Madlab sdg small |
| 2 | LFM2-350M-16 | 3.5 | 46 | 11 | 131.43 | base model |
| 3 | LFM2-350M-f16 | 6.5 | 97 | 22 | 277.14 | Madlab sdg small |
| 4 | Qwen3-coder-30B-instruct-q8 | 8.2 | 149 | 26 | 4.97 | base model |
| 5 | LFM2-350M-f16 | 7.5 | 136 | 21 | 388.57 | Madlab sdg medium |
| 6 | LFM2-2.6B-f16 | 9.0 | 137 | 25 | 52.69 | Madlab sdg medium |
| 7 | LFM2-2.6B-f16 | 9.9 | 180 | 25 | 69.23 | Madlab sdg large |
| 8 | LFM2-2.6B-f16 | 6.2 | 157 | 20 | 60.38 | Madlab sdg test |
| 9 | LFM2-2.6B-f16 | 10.0 | 248 | 27 | 95.38 | Madlab sdg large |
| 10 | Qwen3-235B-q3-k_m | 9.5 | 150 | 27 | 0.64 | base model |
| 11 | LFM2.5-1.2B-instruct-f16 | 9.1 | 244 | 30 | 203.33 | Madlab sdg large |

### **Qualitative Behavior**
- Overperforms in variation count
- Maintains strict semantic correctness

---

## 🔒 Safety
This model is a synthetic data generator. It is not designed for conversational use and is not suitable for anything other than generating synthetic datasets.

It is **not** designed for:

- Political advice
- Medical advice
- Legal advice
- General-purpose conversation

---

## ⚠️ Limitations
- Not a general assistant
- Not trained for coding, math, or open-domain reasoning
- May refuse tasks outside the Madlab SDG scope

---