InsafQ committed on
Commit adffada · verified · 1 Parent(s): 9561d41

Replace default template with TabGAN blog post

Files changed (1)
  1. index.html +400 -18
index.html CHANGED
@@ -1,19 +1,401 @@
1
- <!doctype html>
2
- <html>
3
- <head>
4
- <meta charset="utf-8" />
5
- <meta name="viewport" content="width=device-width" />
6
- <title>My static Space</title>
7
- <link rel="stylesheet" href="style.css" />
8
- </head>
9
- <body>
10
- <div class="card">
11
- <h1>Welcome to your static Space!</h1>
12
- <p>You can modify this app directly by editing <i>index.html</i> in the Files and versions tab.</p>
13
- <p>
14
- Also don't forget to check the
15
- <a href="https://huggingface.co/docs/hub/spaces" target="_blank">Spaces documentation</a>.
16
- </p>
17
- </div>
18
- </body>
19
  </html>
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="utf-8" />
5
+ <meta name="viewport" content="width=device-width, initial-scale=1" />
6
+ <title>TabGAN: Generate Synthetic Tabular Data with GANs, Diffusion Models &amp; LLMs</title>
7
+ <style>
8
+ :root {
9
+ --bg: #0d1117;
10
+ --card: #161b22;
11
+ --border: #30363d;
12
+ --text: #e6edf3;
13
+ --muted: #8b949e;
14
+ --accent: #58a6ff;
15
+ --accent2: #f78166;
16
+ --green: #3fb950;
17
+ --purple: #bc8cff;
18
+ --code-bg: #1c2128;
19
+ }
20
+ * { margin: 0; padding: 0; box-sizing: border-box; }
21
+ body {
22
+ font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Helvetica, Arial, sans-serif;
23
+ background: var(--bg);
24
+ color: var(--text);
25
+ line-height: 1.7;
26
+ }
27
+ .container {
28
+ max-width: 820px;
29
+ margin: 0 auto;
30
+ padding: 2rem 1.5rem 4rem;
31
+ }
32
+ .hero {
33
+ text-align: center;
34
+ padding: 3rem 0 2rem;
35
+ border-bottom: 1px solid var(--border);
36
+ margin-bottom: 2.5rem;
37
+ }
38
+ .hero h1 {
39
+ font-size: 2rem;
40
+ font-weight: 700;
41
+ line-height: 1.3;
42
+ margin-bottom: 1rem;
43
+ }
44
+ .hero h1 .highlight { color: var(--accent); }
45
+ .hero .subtitle {
46
+ color: var(--muted);
47
+ font-size: 1.05rem;
48
+ max-width: 600px;
49
+ margin: 0 auto 1.5rem;
50
+ }
51
+ .badges { display: flex; gap: .6rem; justify-content: center; flex-wrap: wrap; }
52
+ .badge {
53
+ display: inline-block;
54
+ padding: .3rem .7rem;
55
+ border-radius: 2rem;
56
+ font-size: .8rem;
57
+ font-weight: 600;
58
+ border: 1px solid var(--border);
59
+ color: var(--muted);
60
+ }
61
+ .badge.blue { border-color: var(--accent); color: var(--accent); }
62
+ .badge.orange { border-color: var(--accent2); color: var(--accent2); }
63
+ .badge.green { border-color: var(--green); color: var(--green); }
64
+ .badge.purple { border-color: var(--purple); color: var(--purple); }
65
+
66
+ h2 {
67
+ font-size: 1.5rem;
68
+ margin: 2.5rem 0 1rem;
69
+ padding-bottom: .5rem;
70
+ border-bottom: 1px solid var(--border);
71
+ }
72
+ h3 {
73
+ font-size: 1.2rem;
74
+ margin: 2rem 0 .8rem;
75
+ color: var(--accent);
76
+ }
77
+ p { margin-bottom: 1rem; }
78
+ ul, ol { margin: 0 0 1rem 1.5rem; }
79
+ li { margin-bottom: .4rem; }
80
+ strong { color: #fff; }
81
+ a { color: var(--accent); text-decoration: none; }
82
+ a:hover { text-decoration: underline; }
83
+
84
+ pre {
85
+ background: var(--code-bg);
86
+ border: 1px solid var(--border);
87
+ border-radius: 8px;
88
+ padding: 1rem 1.2rem;
89
+ overflow-x: auto;
90
+ margin-bottom: 1.2rem;
91
+ font-size: .88rem;
92
+ line-height: 1.5;
93
+ }
94
+ code {
95
+ font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, monospace;
96
+ font-size: .88em;
97
+ }
98
+ p code, li code {
99
+ background: var(--code-bg);
100
+ padding: .15rem .4rem;
101
+ border-radius: 4px;
102
+ border: 1px solid var(--border);
103
+ }
104
+ .kw { color: #ff7b72; }
105
+ .fn { color: #d2a8ff; }
106
+ .str { color: #a5d6ff; }
107
+ .cm { color: #8b949e; font-style: italic; }
108
+ .num { color: #79c0ff; }
109
+
110
+ table {
111
+ width: 100%;
112
+ border-collapse: collapse;
113
+ margin-bottom: 1.2rem;
114
+ font-size: .92rem;
115
+ }
116
+ th, td {
117
+ padding: .6rem .8rem;
118
+ border: 1px solid var(--border);
119
+ text-align: left;
120
+ }
121
+ th { background: var(--card); font-weight: 600; }
122
+ tr:nth-child(even) { background: rgba(22,27,34,.5); }
123
+
124
+ .card-grid {
125
+ display: grid;
126
+ grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
127
+ gap: 1rem;
128
+ margin-bottom: 1.5rem;
129
+ }
130
+ .card {
131
+ background: var(--card);
132
+ border: 1px solid var(--border);
133
+ border-radius: 8px;
134
+ padding: 1.2rem;
135
+ }
136
+ .card h4 { margin-bottom: .5rem; color: var(--accent); }
137
+
138
+ .cta {
139
+ display: flex;
140
+ gap: 1rem;
141
+ flex-wrap: wrap;
142
+ margin: 2rem 0;
143
+ justify-content: center;
144
+ }
145
+ .cta a {
146
+ display: inline-flex;
147
+ align-items: center;
148
+ gap: .5rem;
149
+ padding: .7rem 1.4rem;
150
+ border-radius: 6px;
151
+ font-weight: 600;
152
+ font-size: .95rem;
153
+ transition: opacity .2s;
154
+ }
155
+ .cta a:hover { text-decoration: none; opacity: .85; }
156
+ .cta .primary { background: var(--accent); color: #0d1117; }
157
+ .cta .secondary { background: var(--card); border: 1px solid var(--border); color: var(--text); }
158
+
159
+ .footer {
160
+ text-align: center;
161
+ padding-top: 2rem;
162
+ margin-top: 3rem;
163
+ border-top: 1px solid var(--border);
164
+ color: var(--muted);
165
+ font-size: .9rem;
166
+ }
167
+ .author {
168
+ display: flex;
169
+ align-items: center;
170
+ gap: .8rem;
171
+ margin: 1rem auto;
172
+ justify-content: center;
173
+ color: var(--muted);
174
+ font-size: .9rem;
175
+ }
176
+ @media (max-width: 600px) {
177
+ .hero h1 { font-size: 1.5rem; }
178
+ .container { padding: 1rem; }
179
+ }
180
+ </style>
181
+ </head>
182
+ <body>
183
+ <div class="container">
184
+
185
+ <div class="hero">
186
+ <h1>
187
+ <span class="highlight">TabGAN:</span> Generate Synthetic Tabular Data<br>
188
+ with GANs, Diffusion &amp; LLMs &mdash; in 3 Lines of Python
189
+ </h1>
190
+ <p class="subtitle">
191
+ High-quality synthetic tabular data using GANs, Forest Diffusion, or LLMs &mdash;
192
+ with built-in quality reports, privacy metrics, <strong>AutoSynth</strong>, and
193
+ <strong>one-click synthesis for any HuggingFace dataset</strong>.
194
+ </p>
195
+ <div class="badges">
196
+ <span class="badge blue">synthetic-data</span>
197
+ <span class="badge orange">GAN</span>
198
+ <span class="badge green">diffusion</span>
199
+ <span class="badge purple">privacy</span>
200
+ <span class="badge">open-source</span>
201
+ </div>
202
+ <div class="author">
203
+ <span>by <a href="https://huggingface.co/InsafQ">InsafQ</a></span>
204
+ <span>&middot;</span>
205
+ <span>March 29, 2026</span>
206
+ </div>
207
+ </div>
208
+
209
+ <!-- Problem -->
210
+ <h2>The Problem</h2>
211
+ <p>You have tabular data that's too sensitive to share, too small to train on, or too imbalanced to model well. You need synthetic data that:</p>
212
+ <ul>
213
+ <li><strong>Preserves statistical properties</strong> of the original</li>
214
+ <li><strong>Doesn't memorize</strong> individual records (privacy!)</li>
215
+ <li><strong>Works out of the box</strong> without ML PhD-level tuning</li>
216
+ </ul>
217
+
218
+ <!-- Solution -->
219
+ <h2>The Solution: TabGAN</h2>
220
+ <pre><code>pip install tabgan</code></pre>
221
+
222
+ <h3>3 Lines to Synthetic Data</h3>
223
+ <pre><code><span class="kw">from</span> tabgan <span class="kw">import</span> GANGenerator
224
+ <span class="kw">import</span> pandas <span class="kw">as</span> pd
225
+
226
+ df = pd.<span class="fn">read_csv</span>(<span class="str">"your_data.csv"</span>)
227
+ gen = <span class="fn">GANGenerator</span>(gen_x_times=<span class="num">1.1</span>, cat_cols=[<span class="str">"gender"</span>, <span class="str">"city"</span>])
228
+ synthetic, _ = gen.<span class="fn">generate_data_pipe</span>(df, <span class="kw">None</span>, df, only_generated_data=<span class="kw">True</span>)</code></pre>
229
+ <p>That's it. <code>synthetic</code> is a DataFrame of realistic rows sampled from the trained generator rather than copied from the original data.</p>
230
+
231
+ <!-- Generators table -->
232
+ <h2>One API, Multiple Generators</h2>
233
+ <p>Switch between state-of-the-art methods with a single parameter change:</p>
234
+ <table>
235
+ <thead><tr><th>Generator</th><th>Best For</th><th>Speed</th></tr></thead>
236
+ <tbody>
237
+ <tr><td><strong>CTGAN</strong> (GAN)</td><td>General purpose, mixed types</td><td>Fast</td></tr>
238
+ <tr><td><strong>Forest Diffusion</strong></td><td>Tree-friendly structured data</td><td>Medium</td></tr>
239
+ <tr><td><strong>LLM</strong> (GReaT)</td><td>Text-rich, semantic dependencies</td><td>Slow</td></tr>
240
+ <tr><td><strong>Random Baseline</strong></td><td>Quick benchmarking</td><td>Instant</td></tr>
241
+ </tbody>
242
+ </table>
243
+ <pre><code><span class="kw">from</span> tabgan <span class="kw">import</span> GANGenerator, ForestDiffusionGenerator, LLMGenerator
244
+
245
+ <span class="cm"># Just swap the class; the API is the same. df is the frame loaded above, target is a one-column DataFrame of labels (or None)</span>
246
+ gen = <span class="fn">ForestDiffusionGenerator</span>(gen_x_times=<span class="num">1.0</span>, cat_cols=[<span class="str">"category"</span>])
247
+ synthetic, _ = gen.<span class="fn">generate_data_pipe</span>(df, target, df, only_generated_data=<span class="kw">True</span>)</code></pre>
248
+
249
+ <!-- AutoSynth -->
250
+ <h3>NEW: AutoSynth &mdash; Let the Library Choose</h3>
251
+ <p>Don't know which generator works best for your data? <strong>AutoSynth</strong> runs all of them and picks the winner:</p>
252
+ <pre><code><span class="kw">from</span> tabgan <span class="kw">import</span> AutoSynth
253
+
254
+ result = <span class="fn">AutoSynth</span>(df, target_col=<span class="str">"label"</span>).<span class="fn">run</span>()
255
+
256
+ <span class="fn">print</span>(result.report)
257
+ <span class="cm"># Generator Status Score Quality Privacy Rows Time (s)</span>
258
+ <span class="cm"># 0 GAN (CTGAN) OK 0.847 0.891 0.743 165 12.3</span>
259
+ <span class="cm"># 1 Forest Diffusion OK 0.812 0.834 0.761 165 45.1</span>
260
+ <span class="cm"># 2 Random Baseline OK 0.654 0.621 0.732 165 0.1</span>
261
+
262
+ best_synthetic = result.best_data <span class="cm"># Best generator's output</span>
263
+ <span class="fn">print</span>(<span class="str">f"Winner: </span>{result.best_name}<span class="str">"</span>) <span class="cm"># "GAN (CTGAN)"</span></code></pre>
264
+ <p>AutoSynth scores each generator on a weighted combination of <strong>quality</strong> (distribution fidelity, ML utility) and <strong>privacy</strong> (distance to closest record, membership inference risk).</p>
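As a rough illustration of the weighted combination described above, the sketch below blends a quality score and a privacy score into a single number. The 0.7/0.3 weights and the `combined_score` helper are assumptions, not TabGAN's documented internals; they were picked because they happen to reproduce the Score column of the sample report.

```python
# Hypothetical helper: not part of the tabgan API.
QUALITY_WEIGHT = 0.7  # assumed weight
PRIVACY_WEIGHT = 0.3  # assumed weight


def combined_score(quality, privacy):
    """Weighted average of quality and privacy, both expected in [0, 1]."""
    return QUALITY_WEIGHT * quality + PRIVACY_WEIGHT * privacy


# (quality, privacy) pairs copied from the sample report
report_rows = {
    "GAN (CTGAN)": (0.891, 0.743),
    "Forest Diffusion": (0.834, 0.761),
    "Random Baseline": (0.621, 0.732),
}

scores = {
    name: round(combined_score(quality, privacy), 3)
    for name, (quality, privacy) in report_rows.items()
}
best_name = max(scores, key=scores.get)  # generator with the highest blended score
print(scores, best_name)
```

Raising the privacy weight would favor generators that stay further from the original records, at the cost of fidelity.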
265
+
266
+ <!-- HuggingFace integration -->
267
+ <h3>NEW: One-Click Synthesis for Any HuggingFace Dataset</h3>
268
+ <pre><code><span class="kw">from</span> tabgan <span class="kw">import</span> synthesize_hf_dataset
269
+
270
+ <span class="cm"># Load &rarr; Generate &rarr; Evaluate in one call</span>
271
+ result = <span class="fn">synthesize_hf_dataset</span>(
272
+ <span class="str">"scikit-learn/iris"</span>,
273
+ target_col=<span class="str">"target"</span>,
274
+ )
275
+
276
+ <span class="cm"># Push synthetic version to your HF account</span>
277
+ result = <span class="fn">synthesize_hf_dataset</span>(
278
+ <span class="str">"scikit-learn/iris"</span>,
279
+ target_col=<span class="str">"target"</span>,
280
+ push_to_hub=<span class="kw">True</span>,
281
+ hub_repo_id=<span class="str">"your-username/iris-synthetic"</span>,
282
+ )</code></pre>
283
+
284
+ <!-- Features -->
285
+ <h2>Key Features</h2>
286
+ <div class="card-grid">
287
+ <div class="card">
288
+ <h4>Quality Reports</h4>
289
+ <p>PSI distribution divergence, correlation comparison, ML utility (train-on-synthetic, test-on-real).</p>
290
+ </div>
291
+ <div class="card">
292
+ <h4>Privacy Metrics</h4>
293
+ <p>Distance to Closest Record, Nearest Neighbor Distance Ratio, Membership Inference Risk.</p>
294
+ </div>
295
+ <div class="card">
296
+ <h4>Business Constraints</h4>
297
+ <p>Enforce domain rules: <code>RangeConstraint</code>, <code>FormulaConstraint</code> on generated data.</p>
298
+ </div>
299
+ <div class="card">
300
+ <h4>sklearn Integration</h4>
301
+ <p>Drop <code>TabGANTransformer</code> into any sklearn pipeline for synthetic augmentation.</p>
302
+ </div>
303
+ </div>
304
+
305
+ <!-- Quality Report example -->
306
+ <h3>Quality &amp; Privacy Reports</h3>
307
+ <pre><code><span class="kw">from</span> tabgan <span class="kw">import</span> QualityReport
308
+
309
+ report = <span class="fn">QualityReport</span>(original_df, synthetic_df, cat_cols=[<span class="str">"gender"</span>], target_col=<span class="str">"label"</span>)
310
+ report.<span class="fn">compute</span>()
311
+ report.<span class="fn">to_html</span>(<span class="str">"quality_report.html"</span>) <span class="cm"># Self-contained HTML with plots</span></code></pre>
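The Quality Reports card lists PSI (Population Stability Index) as the distribution divergence measure. The sketch below shows the standard PSI formula for one numeric column in plain Python. The `psi` function, the equal-width binning, and the `eps` floor are assumptions made for illustration; how `QualityReport` bins and aggregates columns internally is not shown in the post.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of `expected`; values of `actual`
    that fall outside that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm is always defined
        return [max(c / len(values), eps) for c in counts]

    p, q = shares(expected), shares(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


original = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in original]

print(psi(original, original))  # identical samples
print(psi(original, shifted))   # shifted sample gives a larger value
```

Each term of the sum is non-negative, so PSI is zero only when the binned shares match and grows as the synthetic column drifts from the original.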
312
+
313
+ <pre><code><span class="kw">from</span> tabgan <span class="kw">import</span> PrivacyMetrics
314
+
315
+ pm = <span class="fn">PrivacyMetrics</span>(original_df, synthetic_df, cat_cols=[<span class="str">"gender"</span>])
316
+ summary = pm.<span class="fn">summary</span>()
317
+ <span class="fn">print</span>(<span class="str">f"Privacy score: </span>{summary[<span class="str">'overall_privacy_score'</span>]}<span class="str">"</span>) <span class="cm"># 0 = leaked, 1 = private</span></code></pre>
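Distance to Closest Record, one of the privacy metrics named in the feature cards, can be sketched in a few lines. This is a minimal version assuming purely numeric, already-scaled features and Euclidean distance; the `distance_to_closest_record` helper is hypothetical and is not how `PrivacyMetrics` is necessarily implemented.

```python
import math


def distance_to_closest_record(original, synthetic):
    """For each synthetic row, the Euclidean distance to its nearest original row."""
    return [min(math.dist(s, o) for o in original) for s in synthetic]


original = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
synthetic = [(0.0, 0.0), (1.0, 2.0), (4.0, 3.0)]

dcr = distance_to_closest_record(original, synthetic)
exact_copies = sum(1 for d in dcr if d == 0.0)  # rows that duplicate an original record
print(dcr, exact_copies)
```

A distance of zero means the synthetic row reproduces an original record exactly, which is the clearest sign of memorization. This brute-force version is quadratic in the number of rows, so a nearest-neighbor index would be preferable on larger tables.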
318
+
319
+ <!-- Constraints -->
320
+ <h3>Business Constraints</h3>
321
+ <pre><code><span class="kw">from</span> tabgan <span class="kw">import</span> GANGenerator, RangeConstraint, FormulaConstraint
322
+
323
+ gen = <span class="fn">GANGenerator</span>(
324
+ gen_x_times=<span class="num">1.5</span>,
325
+ cat_cols=[<span class="str">"department"</span>],
326
+ constraints=[
327
+ <span class="fn">RangeConstraint</span>(<span class="str">"age"</span>, min_val=<span class="num">18</span>, max_val=<span class="num">65</span>),
328
+ <span class="fn">RangeConstraint</span>(<span class="str">"salary"</span>, min_val=<span class="num">0</span>),
329
+ <span class="fn">FormulaConstraint</span>(<span class="str">"end_date > start_date"</span>),
330
+ ],
331
+ )</code></pre>
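To make the effect of these rules concrete, here is a small sketch that applies the same three constraints to rows represented as plain dictionaries. It assumes violating rows are dropped; whether TabGAN filters, clips, or resamples is not stated in the post. The `in_range` helper and the sample rows are hypothetical.

```python
from datetime import date


def in_range(column, min_val=None, max_val=None):
    """Build a row check that passes when row[column] lies within the bounds."""
    def check(row):
        value = row[column]
        if min_val is not None and value < min_val:
            return False
        if max_val is not None and value > max_val:
            return False
        return True
    return check


rules = [
    in_range("age", min_val=18, max_val=65),
    in_range("salary", min_val=0),
    lambda row: row["end_date"] > row["start_date"],
]

generated = [
    {"age": 30, "salary": 52000, "start_date": date(2020, 1, 1), "end_date": date(2022, 1, 1)},
    {"age": 17, "salary": 21000, "start_date": date(2020, 1, 1), "end_date": date(2021, 1, 1)},  # age too low
    {"age": 40, "salary": -10, "start_date": date(2019, 5, 1), "end_date": date(2020, 5, 1)},    # negative salary
    {"age": 50, "salary": 1000, "start_date": date(2021, 1, 1), "end_date": date(2020, 1, 1)},   # dates reversed
]

# Keep only rows that satisfy every rule
valid = [row for row in generated if all(rule(row) for rule in rules)]
print(len(valid))
```

Filtering reduces the number of returned rows, so a caller relying on this approach may need to generate more rows than required.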
332
+
333
+ <!-- sklearn pipeline -->
334
+ <h3>sklearn Pipeline Integration</h3>
335
+ <pre><code><span class="kw">from</span> sklearn.pipeline <span class="kw">import</span> Pipeline
336
+ <span class="kw">from</span> sklearn.ensemble <span class="kw">import</span> RandomForestClassifier
337
+ <span class="kw">from</span> tabgan <span class="kw">import</span> TabGANTransformer
338
+
339
+ pipe = <span class="fn">Pipeline</span>([
340
+ (<span class="str">"augment"</span>, <span class="fn">TabGANTransformer</span>(gen_x_times=<span class="num">2.0</span>, cat_cols=[<span class="str">"gender"</span>])),
341
+ (<span class="str">"model"</span>, <span class="fn">RandomForestClassifier</span>()),
342
+ ])
343
+ pipe.<span class="fn">fit</span>(X_train, y_train)</code></pre>
344
+
345
+ <!-- Benchmarks -->
346
+ <h2>Benchmarks</h2>
347
+ <h3>Quality (Normalized ROC AUC)</h3>
348
+ <table>
349
+ <thead><tr><th>Dataset</th><th>CTGAN</th><th>Forest Diffusion</th><th>Random</th></tr></thead>
350
+ <tbody>
351
+ <tr><td>Credit</td><td>0.752</td><td><strong>0.781</strong></td><td>0.501</td></tr>
352
+ <tr><td>Adult Census</td><td>0.689</td><td><strong>0.712</strong></td><td>0.523</td></tr>
353
+ <tr><td>Telecom</td><td><strong>0.814</strong></td><td>0.799</td><td>0.548</td></tr>
354
+ </tbody>
355
+ </table>
356
+ <p style="color:var(--muted); font-size:.9rem;">Higher is better.</p>
357
+
358
+ <h3>Speed (generation time, 1000 rows, 8 features)</h3>
359
+ <table>
360
+ <thead><tr><th>Generator</th><th>Time</th><th>Notes</th></tr></thead>
361
+ <tbody>
362
+ <tr><td><strong>Random Baseline</strong></td><td>~0.1s</td><td>Instant &mdash; just resampling</td></tr>
363
+ <tr><td><strong>CTGAN (GAN)</strong></td><td>~1&ndash;10s</td><td>Fast, depends on epochs</td></tr>
364
+ <tr><td><strong>Forest Diffusion</strong></td><td>~30&ndash;120s</td><td>High quality, but slower</td></tr>
365
+ <tr><td><strong>LLM (GReaT)</strong></td><td>~5&ndash;30min</td><td>Best for text columns, GPU recommended</td></tr>
366
+ </tbody>
367
+ </table>
368
+
369
+ <h3>Execution Timing</h3>
370
+ <pre><code>gen = <span class="fn">GANGenerator</span>(gen_x_times=<span class="num">1.1</span>)
371
+ synthetic, _ = gen.<span class="fn">generate_data_pipe</span>(train, target, test)
372
+ <span class="fn">print</span>(gen.last_timing_)
373
+ <span class="cm"># {'preprocess': 0.001, 'generation': 2.3, 'postprocess': 0.01,</span>
374
+ <span class="cm"># 'adversarial_filtering': 0.15, 'total': 2.46}</span></code></pre>
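A per-stage timing dictionary like the one printed above can be collected with `time.perf_counter`. The sketch below is one way to do it, not TabGAN's actual code; `run_timed` and the toy stage functions are hypothetical, and the stage names only mirror some of the keys in the sample output.

```python
import time


def run_timed(stages, data):
    """Run each (name, function) stage in order and record its wall-clock time."""
    timing = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timing[name] = time.perf_counter() - start
    timing["total"] = sum(timing.values())  # summed before "total" is added
    return data, timing


stages = [
    ("preprocess", lambda rows: [r for r in rows if r is not None]),
    ("generation", lambda rows: rows + [r * 2 for r in rows]),
    ("postprocess", lambda rows: sorted(rows)),
]

result, last_timing = run_timed(stages, [3, None, 1, 2])
print(result, last_timing)
```

In the sample output above, the `total` value also equals the sum of the stage times (0.001 + 2.3 + 0.01 + 0.15 ≈ 2.46).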
375
+
376
+ <!-- What's Next -->
377
+ <h2>What's Next</h2>
378
+ <ul>
379
+ <li><strong>Public Leaderboard</strong> for synthetic tabular data generators</li>
380
+ <li><strong>Differential Privacy</strong> guarantees (DP-SGD)</li>
381
+ <li><strong>Natural language generation</strong> &mdash; "Generate 1000 patients aged 20-40"</li>
382
+ </ul>
383
+
384
+ <!-- CTA -->
385
+ <div class="cta">
386
+ <a class="primary" href="https://pypi.org/project/tabgan/">pip install tabgan</a>
387
+ <a class="secondary" href="https://github.com/Diyago/Tabular-data-generation">GitHub</a>
388
+ <a class="secondary" href="https://huggingface.co/spaces/InsafQ/TabGAN">Interactive Demo</a>
389
+ </div>
390
+
391
+ <div class="footer">
392
+ <p>TabGAN is Apache 2.0 licensed. Contributions welcome!</p>
393
+ <p style="margin-top:.5rem;">
394
+ Star the repo if you find it useful:
395
+ <a href="https://github.com/Diyago/Tabular-data-generation">github.com/Diyago/Tabular-data-generation</a>
396
+ </p>
397
+ </div>
398
+
399
+ </div>
400
+ </body>
401
  </html>