Archi-medes committed
Commit d2f4be4 · verified · 1 Parent(s): 5799111

Update README.md

Files changed (1):
  1. README.md +121 -5
README.md CHANGED
@@ -1,5 +1,121 @@
- ---
- license: other
- license_name: lfm1.0
- license_link: https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct/blob/main/LICENSE
- ---

---
license: other
license_name: lfm1.0
license_link: https://huggingface.co/LiquidAI/LFM2-1.2B-instruct/blob/main/LICENSE
metrics:
- Synthetic Data Expansion Benchmark
base_model:
- LiquidAI/LFM2-1.2B-instruct
tags:
- lmstudio
- madlabOSS
- synthetic data generator
---

# Madlab Synthetic Data Generator

## 🧠 Overview
The **Madlab SDG 1.2B** is part of the **MadlabOSS Synthetic Data Generator** family: a suite of small, efficient, and highly deterministic synthetic data generators.
This model was trained on a closed-source dataset created through a multi-stage synthetic data generation process using a modified Madlab training pipeline.
It is the first model in the family built on the **LFM2.5-instruct** foundation, a significant step up from previous iterations.

---

## 🚀 Intended Use
This model is optimized for:

- Madlab synthetic data generation

It is **not** intended as a general-purpose chatbot.

---

## 🧩 Model Details

**Base Model:** LFM2.5-1.2B-instruct
**Parameter Count:** 1.2 Billion
**Training Type:** Supervised fine-tuning
**Sequence Length:** 1024
**Precision:** FP16
**Framework:** PyTorch / Transformers
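
A minimal loading and generation sketch with 🤗 Transformers is shown below. The repo id (`madlabOSS/madlab-sdg-1.2b`) and the prompt wording are hypothetical placeholders, not confirmed values; check the actual repository name and any chat/prompt template before use.

```python
# Minimal sketch: load the model in FP16 and expand one seed pair.
# The repo id and prompt below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "madlabOSS/madlab-sdg-1.2b"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Ask for 5 variations of a single input/target pair (illustrative prompt).
prompt = (
    "Generate 5 variations of the following pair, preserving its meaning.\n"
    "Input: How do I reset my password?\n"
    "Target: Go to Settings > Account > Reset Password."
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```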

---

## 📦 Training Data
The model was trained on:

- **1444 compressed and encoded dataset pairs**
- High variation in output
- Preservation of semantic meaning
- Data entirely generated with Madlab

---

## 🏋️ Training Procedure

### **Hyperparameters**
- Epochs: 6
- Batch size: 48
- Learning rate: cosine schedule, peak ~4e-5
- Optimizer: AdamW
- Gradient clipping: 1.0
- Gradient accumulation: 1
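
For reference, the settings above might be expressed as Hugging Face `TrainingArguments` roughly as follows; the output directory, optimizer string, and logging interval are illustrative assumptions, not values taken from the Madlab pipeline.

```python
# Illustrative only: maps the hyperparameters listed above onto
# Hugging Face TrainingArguments. Paths and logging are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="madlab-sdg-1.2b-sft",   # hypothetical output path
    num_train_epochs=6,
    per_device_train_batch_size=48,
    gradient_accumulation_steps=1,
    learning_rate=4e-5,                  # peak learning rate
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    max_grad_norm=1.0,                   # gradient clipping
    fp16=True,
    logging_steps=10,
)
```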

### **Hardware**
Training was performed on:

- RTX 6000 Blackwell (96GB)

---

## 📊 Evaluation

![multi_model_dashboard](https://cdn-uploads.huggingface.co/production/uploads/68ec78cca886edada26b90b0/6ERUc70a2I0_e9y8aK5A5.png)

### **Synthetic Data Expansion Benchmark**
A curated set of 30 input/target pairs is expanded programmatically by a Python script: the model is asked to generate 5 variations of each incoming pair.
Metrics include the number of seed pairs covered, the total variation count, and semantic quality.
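
The tallying might look roughly like the sketch below, where `generate_variations` and `judge_semantic_quality` are hypothetical callables standing in for the model inference and scoring steps; this is not the actual benchmark script, only an illustration of how the table columns relate (efficiency is variations per billion parameters).

```python
# Sketch of the benchmark tally. The expansion and scoring steps are passed
# in as callables; nothing here is the actual Madlab benchmark script.
def run_benchmark(seed_pairs, generate_variations, judge_semantic_quality, params_in_billions):
    """Expand each seed pair into 5 variations and tally the reported metrics."""
    total_variations = 0
    seeds_covered = 0
    quality_scores = []

    for pair in seed_pairs:                            # 30 curated input/target pairs
        variations = generate_variations(pair, n=5)    # model inference step
        valid = [v for v in variations if v]           # keep non-empty outputs
        if valid:
            seeds_covered += 1                         # this seed counts as covered
        total_variations += len(valid)
        quality_scores.extend(judge_semantic_quality(pair, v) for v in valid)

    return {
        "variations": total_variations,
        "seeds_covered": seeds_covered,
        "semantic_quality": sum(quality_scores) / max(len(quality_scores), 1),
        "efficiency": total_variations / params_in_billions,  # variations per B params
    }
```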

Note: run numbers are not aligned with the multi_model_dashboard figure above.

| Run | Model | Semantic Quality | Variations | Seeds Covered | Efficiency (Variations / B Params) | Dataset |
|-----|-------|------------------|------------|---------------|------------------------------------|---------|
| 1 | LFM2-350M-16 | 6.5 | 94 | 23 | 268.57 | Madlab sdg small |
| 2 | LFM2-350M-16 | 3.5 | 46 | 11 | 131.43 | base model |
| 3 | LFM2-350M-f16 | 6.5 | 97 | 22 | 277.14 | Madlab sdg small |
| 4 | Qwen3-coder-30B-instruct-q8 | 8.2 | 149 | 26 | 4.97 | base model |
| 5 | LFM2-350M-f16 | 7.5 | 136 | 21 | 388.57 | Madlab sdg medium |
| 6 | LFM2-2.6B-f16 | 9.0 | 137 | 25 | 52.69 | Madlab sdg medium |
| 7 | LFM2-2.6B-f16 | 9.9 | 180 | 25 | 69.23 | Madlab sdg large |
| 8 | LFM2-2.6B-f16 | 6.2 | 157 | 20 | 60.38 | Madlab sdg test |
| 9 | LFM2-2.6B-f16 | 10.0 | 248 | 27 | 95.38 | Madlab sdg large |
| 10 | Qwen3-235B-q3-k_m | 9.5 | 150 | 27 | 0.64 | base model |
| 11 | LFM2.5-1.2B-instruct-f16 | 9.1 | 244 | 30 | 203.33 | Madlab sdg large |

### **Qualitative Behavior**
- Overperforms in variation count
- Maintains strict semantic correctness

---

## 🔒 Safety
This model is a synthetic data generator. It is not designed for conversational use and is not suitable for anything other than generating synthetic datasets.

It is **not** designed for:

- Political advice
- Medical advice
- Legal advice
- General-purpose conversation

---

## ⚠️ Limitations
- Not a general assistant
- Not trained for coding, math, or open-domain reasoning
- May refuse tasks outside the Madlab SDG scope

---