lisekarimi committed
Commit db17eb5 · 1 Parent(s): 62587cb

Deploy version 0.1.0

Files changed (14):
  1. Dockerfile +29 -0
  2. README.md +77 -7
  3. assets/styles.css +237 -0
  4. main.py +14 -0
  5. pyproject.toml +35 -0
  6. src/__init__.py +0 -0
  7. src/constants.py +34 -0
  8. src/datagen.py +51 -0
  9. src/models.py +61 -0
  10. src/pipeline.py +88 -0
  11. src/prompts.py +78 -0
  12. src/ui.py +184 -0
  13. src/utils.py +78 -0
  14. uv.lock +0 -0
Dockerfile ADDED
@@ -0,0 +1,29 @@
+FROM python:3.11-slim
+
+# Install uv
+RUN pip install uv
+
+WORKDIR /app
+
+# Copy dependency files first (changes rarely)
+COPY pyproject.toml uv.lock ./
+
+# Put venv outside of /app so it won't be affected by volume mounts
+ENV UV_PROJECT_ENVIRONMENT=/opt/venv
+
+# Install dependencies (this will now create venv at /opt/venv)
+RUN uv sync --locked
+
+# Copy all source code
+COPY . .
+
+# Create output directory with proper permissions
+RUN mkdir -p /tmp/output && chmod 777 /tmp/output
+
+# Set output directory environment variable for production
+ENV OUTPUT_DIR=/tmp/output
+
+# Disable UV cache entirely for production
+ENV UV_NO_CACHE=1
+
+CMD ["uv", "run", "main.py"]
README.md CHANGED
@@ -1,11 +1,81 @@
 ---
-title: Datagen
-emoji: 📚
-colorFrom: blue
-colorTo: gray
+title: DataGen
+emoji: 🧬
+colorFrom: indigo
+colorTo: pink
 sdk: docker
-pinned: false
-short_description: ✨ AI-powered platform for generating synthetic datasets
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# 🧬 DataGen: AI-Powered Synthetic Data Generator
+
+Generate realistic synthetic datasets by simply describing what you need.
+
+[🚀 **Try the Live Demo**](https://huggingface.co/spaces/lisekarimi/snapr)
+
+<img src="https://gitlab.com/lisekarimi/datagen/-/raw/main/assets/screenshot.png" alt="DataGen interface" width="450">
+
+## ✨ What DataGen Does
+
+DataGen transforms simple descriptions into structured datasets using AI. Perfect for researchers, data scientists, and developers who need realistic test data fast.
+
+**Key Features:**
+- **Type what you want → get realistic data**
+- **Multiple formats:** CSV, JSON, Parquet, Markdown
+- **Dataset types:** Tables, time-series, text data
+- **AI-powered:** Uses GPT and Claude models
+- **Instant download** with clean, ready-to-use datasets
+
+## 🚀 Quick Start
+
+### Prerequisites
+- Python 3.11+
+- [uv package manager](https://docs.astral.sh/uv/getting-started/installation/)
+
+### Installation
+```bash
+git clone https://github.com/lisekarimi/datagen.git
+cd datagen
+uv sync
+source .venv/bin/activate  # Unix/macOS
+# or .\.venv\Scripts\activate on Windows
+```
+
+### Configuration
+1. Copy `.env.example` to `.env`
+2. Populate it with the required secrets
+
+### Run DataGen
+```bash
+# Local development
+make run
+
+# With hot reload
+make ui
+```
+
+*For complete setup instructions, commands, and development guidelines, see [docs/dev.md](https://gitlab.com/lisekarimi/datagen/-/blob/main/docs/dev.md)*
+
+## 🧑‍💻 How to Use
+
+1. **Describe your data:** "Customer purchase history with demographics"
+2. **Choose format:** CSV, JSON, Parquet, or Markdown
+3. **Select AI model:** GPT or Claude
+4. **Set sample size:** Number of records to generate
+5. **Generate & download** your dataset
+
+## 🛡️ Quality & Security
+
+DataGen maintains high standards with comprehensive test coverage, automated security scanning, and code quality enforcement.
+
+*For CI/CD setup and technical details, see [docs/cicd.md](https://gitlab.com/lisekarimi/datagen/-/blob/main/docs/cicd.md)*
+
+## 📝 Notes
+- Generated files are automatically cleaned up after 5 minutes
+- Supports 10-1000 samples per dataset
+- JSON output includes proper indentation for readability
+- Cross-platform compatibility (Windows, macOS, Linux)
+
+## 📄 License
+
+MIT
assets/styles.css ADDED
@@ -0,0 +1,237 @@
+
+html, body, #app, body > div, .gradio-container {
+    background-color: #0b0e18 !important; /* dark blue */
+    height: 100%;
+    margin: 0;
+    padding: 0;
+    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+    display: flex;
+    justify-content: center;
+    align-items: center;
+}
+
+#app-container {
+    background-color: #1d3451 !important;
+    padding: 40px;
+    border-radius: 12px;
+    box-shadow: 0 4px 25px rgba(0, 0, 0, 0.4);
+    max-width: 800px;
+    width: 100%;
+    color: white;
+}
+
+#app-container h4,
+#app-container p,
+#app-container ol,
+#app-container li,
+#app-container strong {
+    font-size: 16px;
+    line-height: 1.6;
+    color: white !important;
+}
+
+#app-title {
+    font-size: 42px;
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    -webkit-background-clip: text;
+    background-clip: text;
+    color: transparent;
+    font-weight: 800;
+    margin-bottom: 5px;
+    text-align: center;
+}
+
+#app-subtitle {
+    font-size: 24px;
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    -webkit-background-clip: text;
+    background-clip: text;
+    color: transparent;
+    font-weight: 600;
+    margin-top: 0;
+    text-align: center;
+}
+
+#intro-text {
+    font-size: 16px;
+    color: white !important;
+    margin-top: 20px;
+    line-height: 1.6;
+}
+
+#learn-more-button {
+    display: flex;
+    justify-content: center;
+    margin-top: 5px;
+}
+
+.button-link {
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    color: white !important;
+    padding: 10px 20px;
+    text-decoration: none;
+    font-weight: bold;
+    border-radius: 8px;
+    transition: opacity 0.3s;
+}
+
+.button-link:hover {
+    opacity: 0.85;
+}
+
+#input-container {
+    background-color: #1f2937;
+    padding: 20px;
+    border-radius: 10px;
+    box-shadow: 0 2px 10px rgba(0, 0, 0, 0.2);
+}
+
+.label-box label {
+    background-color: #1f2937;
+    padding: 4px 10px;
+    border-radius: 8px;
+    display: inline-block;
+    margin-bottom: 6px;
+}
+
+.label-box span {
+    color: white !important;
+}
+
+.label-box {
+    background-color: #1f2937;
+    color: white;
+    padding: 4px 10px;
+    border-radius: 8px;
+    display: inline-block;
+}
+
+#input-container > div {
+    background: #1f2937 !important;
+    background-color: #1f2937 !important;
+    border: none !important;
+    box-shadow: none !important;
+    padding: 10px !important;
+    margin: 0 !important;
+}
+.row-spacer {
+    margin-top: 24px !important;
+}
+
+.column-gap {
+    gap: 16px;
+}
+
+textarea, input[type="text"] {
+    background-color: #374151 !important;
+    color: white !important;
+}
+
+#custom-dropdown .wrap {
+    background-color: #374151 !important;
+    border-radius: 6px;
+}
+
+input[role="listbox"] {
+    color: white !important;
+    background-color: #374151 !important;
+}
+.dropdown-arrow {
+    color: white !important;
+}
+
+ul[role="listbox"] {
+    background-color: #111827 !important; /* dark navy */
+    color: white !important;
+    border-radius: 6px;
+    padding: 4px 0;
+}
+
+ul[role="listbox"] li {
+    color: white !important;
+    padding: 8px 12px;
+}
+
+ul[role="listbox"] li:hover {
+    background-color: #1f2937 !important; /* slightly lighter hover */
+}
+
+ul[role="listbox"] li[aria-selected="true"] {
+    background-color: #111827 !important; /* same dark as others */
+    color: white !important;
+}
+
+input[type="number"] {
+    background-color: #374151;
+    color: white !important;
+}
+
+#business-problem-box {
+    margin-left: 0 !important;
+    margin-right: 0 !important;
+    padding-left: 0 !important;
+    padding-right: 0 !important;
+    width: 100% !important;
+}
+
+#business-problem-box textarea::placeholder {
+    color: #9ca3af !important; /* Tailwind's "gray-400" */
+}
+
+#run-btn {
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    color: white !important;
+    font-weight: bold;
+    border: none;
+    padding: 10px 20px;
+    border-radius: 8px;
+    cursor: pointer;
+    transition: background 0.3s ease;
+}
+
+#run-btn:hover {
+    filter: brightness(1.1);
+}
+
+#download-box {
+    background-color: #1f2937;
+    border: 1px solid #3b3b3b;
+    border-radius: 8px;
+    padding: 10px;
+    margin-top: 16px;
+    font-weight: 500;
+}
+
+#download-box a {
+    color: #00c3ff !important;
+    text-decoration: underline;
+    font-weight: bold;
+}
+#download-box td.filename {
+    color: rgb(255, 255, 255) !important;
+}
+
+#download-box .file-preview-holder,
+#download-box .file-preview,
+#download-box table,
+#download-box tr,
+#download-box td {
+    background-color: #1f2937 !important;
+}
+
+#download-box > label {
+    display: none !important;
+}
+
+/* ==== Version ==== */
+.version-banner {
+    text-align: center;
+    font-size: 0.9em;
+}
main.py ADDED
@@ -0,0 +1,14 @@
+"""Entry point for the application."""
+
+import os
+from src.ui import build_ui
+
+demo = build_ui()
+
+# Main application entry point
+if __name__ == "__main__":
+    demo.launch(
+        allowed_paths=["output"],
+        server_name="0.0.0.0",
+        server_port=int(os.environ.get("PORT", 7860)),
+    )
pyproject.toml ADDED
@@ -0,0 +1,35 @@
+[project]
+name = "datagen"
+version = "0.1.0"
+description = "AI-powered platform for generating synthetic datasets"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "anthropic==0.49.0",
+    "gradio",
+    "numpy>=2.2.6",
+    "openai==1.65.5",
+    "pandas>=2.2.3",
+    "pyarrow>=20.0.0",
+    "python-dotenv==1.0.1",
+]
+
+[tool.pytest.ini_options]
+pythonpath = ["."]
+filterwarnings = [
+    "ignore::DeprecationWarning:websockets.legacy",
+]
+
+[tool.ruff.lint]
+select = [
+    "E",   # pycodestyle errors
+    "W",   # pycodestyle warnings
+    "F",   # Pyflakes
+    "D",   # pydocstyle (docstrings)
+    "UP",  # pyupgrade
+    "B",   # flake8-bugbear
+]
+ignore = ["D104"]  # Missing docstring in public package (__init__.py)
+
+[tool.ruff.lint.pydocstyle]
+convention = "google"
src/__init__.py ADDED
File without changes
src/constants.py ADDED
@@ -0,0 +1,34 @@
+# src/constants.py
+"""Constants for configuration across the project."""
+
+import os
+import tomllib
+from pathlib import Path
+import logging
+
+# ==================== PROJECT METADATA ====================
+root = Path(__file__).parent.parent
+with open(root / "pyproject.toml", "rb") as f:
+    pyproject = tomllib.load(f)
+
+PROJECT_NAME = pyproject["project"]["name"]
+VERSION = pyproject["project"]["version"]
+
+# ==================== AI MODEL CONFIG ====================
+OPENAI_MODEL = "gpt-4o-mini"
+CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
+
+# Other constants can go here
+OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "output")
+MAX_TOKENS = 2000
+
+# ==================== LOGGING CONFIG ====================
+
+# Configure logging once
+logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+
+# Create a shared logger
+logger = logging.getLogger(__name__)
+
+# ==================== FILE MANAGEMENT ====================
+FILE_CLEANUP_SECONDS = 300  # 5 minutes
src/datagen.py ADDED
@@ -0,0 +1,51 @@
+"""Main data generation class for creating synthetic datasets using AI models."""
+
+import os
+from datetime import datetime
+from .prompts import build_user_prompt, system_message
+from .models import get_gpt_completion, get_claude_completion
+from .utils import execute_code_in_virtualenv
+from .constants import OUTPUT_DIR, logger
+
+
+class DataGen:
+    """Handles synthetic data generation using AI models."""
+
+    def __init__(self, output_dir=None):
+        """Initialize the data generator with output directory."""
+        # Use provided output_dir, or fall back to OUTPUT_DIR constant
+        self.output_dir = output_dir or OUTPUT_DIR
+        os.makedirs(self.output_dir, exist_ok=True)
+
+    def get_timestamp(self):
+        """Return current timestamp for file naming."""
+        return datetime.now().strftime("%Y%m%d_%H%M%S")
+
+    def generate_dataset(self, **input_data):
+        """Generate synthetic dataset based on input parameters and model choice."""
+        try:
+            # Ensure output directory exists before generating
+            os.makedirs(self.output_dir, exist_ok=True)
+
+            # Add output directory path to input data for file generation
+            input_data["file_path"] = self.output_dir
+
+            # Build the prompt to send to the selected LLM
+            prompt = build_user_prompt(**input_data)
+
+            # Call the selected LLM based on the model parameter
+            if input_data["model"] == "GPT":
+                code = get_gpt_completion(prompt, system_message)
+            elif input_data["model"] == "Claude":
+                code = get_claude_completion(prompt, system_message)
+            else:
+                raise ValueError("Invalid model selected.")
+
+            # Execute the generated code and return the output file path
+            file_path = execute_code_in_virtualenv(code)
+            return file_path
+
+        except Exception as e:
+            # Log and re-raise any errors that occur during generation
+            logger.error(f"Error in generate_dataset: {e}")
+            raise
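`get_timestamp` above pins down the file-naming convention used throughout the project: `YYYYMMDD_HHMMSS`. A quick sketch of the format string on a fixed datetime (so the output is deterministic):

```python
from datetime import datetime

# Same format string as DataGen.get_timestamp, on a fixed moment in time
ts = datetime(2025, 3, 23, 12, 34, 56).strftime("%Y%m%d_%H%M%S")
print(ts)  # 20250323_123456
```

Because the components are ordered most- to least-significant, filenames built from this stamp sort chronologically with a plain lexicographic sort.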
src/models.py ADDED
@@ -0,0 +1,61 @@
+"""AI model clients and API configuration for OpenAI and Anthropic."""
+
+from openai import OpenAI
+import anthropic
+import os
+from dotenv import load_dotenv
+from .constants import OPENAI_MODEL, CLAUDE_MODEL, MAX_TOKENS, logger
+
+# Load environment variables from .env file
+load_dotenv(override=True)
+
+# Retrieve API keys from environment variables
+openai_api_key = os.getenv("OPENAI_API_KEY")
+anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
+
+# Log an error if any API key is missing
+if not openai_api_key:
+    logger.error("❌ OpenAI API Key is missing!")
+
+if not anthropic_api_key:
+    logger.error("❌ Anthropic API Key is missing!")
+
+# Initialize API clients with the retrieved keys
+openai = OpenAI(api_key=openai_api_key)
+claude = anthropic.Anthropic(api_key=anthropic_api_key)
+
+
+def get_gpt_completion(prompt, system_message):
+    """Call OpenAI's GPT model with prompt and system message."""
+    try:
+        # Create chat completion with system and user messages
+        response = openai.chat.completions.create(
+            model=OPENAI_MODEL,
+            messages=[
+                {"role": "system", "content": system_message},
+                {"role": "user", "content": prompt},
+            ],
+            stream=False,
+        )
+        # Extract and return the generated content
+        return response.choices[0].message.content
+    except Exception as e:
+        logger.error(f"GPT error: {e}")
+        raise
+
+
+def get_claude_completion(prompt, system_message):
+    """Call Anthropic's Claude model with prompt and system message."""
+    try:
+        # Create message with Claude API using system prompt and user message
+        result = claude.messages.create(
+            model=CLAUDE_MODEL,
+            max_tokens=MAX_TOKENS,
+            system=system_message,
+            messages=[{"role": "user", "content": prompt}],
+        )
+        # Extract and return the text content from response
+        return result.content[0].text
+    except Exception as e:
+        logger.error(f"Claude error: {e}")
+        raise
src/pipeline.py ADDED
@@ -0,0 +1,88 @@
+"""Pipeline orchestration for dataset generation."""
+
+import os
+import logging
+import threading
+import gradio as gr
+from src.datagen import DataGen
+from src.constants import FILE_CLEANUP_SECONDS
+
+logger = logging.getLogger(__name__)
+
+
+def safe_delete(file_path):
+    """Safely delete a file, ignoring errors if file doesn't exist."""
+    try:
+        if os.path.exists(file_path):
+            os.remove(file_path)
+    except Exception:
+        pass  # Ignore deletion errors
+
+
+class DatasetPipeline:
+    """Handles the dataset generation pipeline."""
+
+    def __init__(self):
+        """Initialize the pipeline with a DataGen instance."""
+        self.generator = DataGen()
+
+    def generate(
+        self, business_problem, dataset_type, output_format, num_samples, model
+    ):
+        """Generate synthetic dataset based on user inputs."""
+        # Check if business problem is empty
+        if not business_problem.strip():
+            error_msg = "❌ Please enter a business problem before generating."
+            yield [gr.update(visible=False), gr.update(visible=True), error_msg]
+            return
+
+        # Initial feedback while generating
+        yield [
+            gr.update(visible=False),
+            gr.update(visible=False),
+            "⏳ Generating dataset...",
+        ]
+
+        try:
+            # Pack inputs into a dictionary for the generator
+            input_data = {
+                "business_problem": business_problem,
+                "dataset_type": dataset_type,
+                "output_format": output_format,
+                "num_samples": num_samples,
+                "model": model,
+            }
+
+            # Generate dataset file
+            file_path = self.generator.generate_dataset(**input_data)
+
+            # Check if file exists and return success message + file path
+            if isinstance(file_path, str) and os.path.exists(file_path):
+                # Auto-delete after 5min with safe deletion
+                threading.Timer(
+                    FILE_CLEANUP_SECONDS, safe_delete, args=[file_path]
+                ).start()
+                success_update = [
+                    gr.update(value=file_path, visible=True),
+                    gr.update(visible=True),
+                    "✅ Dataset ready for download.",
+                ]
+                yield success_update
+            else:
+                # Handle invalid or missing file
+                error_update = [
+                    gr.update(visible=False),
+                    gr.update(visible=True),
+                    "❌ Error: File not created or path invalid.",
+                ]
+                yield error_update
+
+        except Exception as e:
+            # Catch and display any errors in the pipeline
+            logger.error("Pipeline error: %s", e)
+            error_update = [
+                gr.update(visible=False),
+                gr.update(visible=True),
+                f"❌ Pipeline error: {e}",
+            ]
+            yield error_update
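The `threading.Timer` call above schedules a single `safe_delete` run, `FILE_CLEANUP_SECONDS` after a successful generation. A stand-alone sketch of the same pattern, with a fraction-of-a-second delay in place of the real five minutes:

```python
import os
import tempfile
import threading
import time


def safe_delete(file_path):
    """Delete a file, ignoring errors if it is already gone."""
    try:
        if os.path.exists(file_path):
            os.remove(file_path)
    except OSError:
        pass


# Create a throwaway file and schedule its deletion 0.2s from now
fd, path = tempfile.mkstemp()
os.close(fd)
threading.Timer(0.2, safe_delete, args=[path]).start()

time.sleep(0.5)  # wait past the timer's deadline
print(os.path.exists(path))  # False
```

One caveat of this design: the timers live in the server process, so files queued for deletion survive only as long as the process does — acceptable here since `/tmp/output` is ephemeral anyway.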
src/prompts.py ADDED
@@ -0,0 +1,78 @@
+"""Prompt templates and management for AI model interactions."""
+
+from datetime import datetime
+from src.constants import logger
+
+# System message template that defines AI assistant behavior and rules
+system_message = """
+You are a helpful assistant whose main purpose is to generate synthetic datasets
+based on a given business problem.
+
+🔹 General Guidelines:
+- Be accurate and concise.
+- Use only standard Python libraries (pandas, numpy, os, datetime, etc.)
+- The dataset must contain the requested number of samples.
+- Always respect the requested output format exactly.
+- If multiple entities exist, save each to a separate file.
+- Do not use f-strings anywhere in the code — not in file paths or in content.
+  Use standard string concatenation instead.
+
+🔹 File Path Rules:
+- Define the full file path using os.path.join(...) — exactly as shown —
+  no shortcuts or direct strings.
+- Use two hardcoded string literals only — no variables, no f-strings,
+  no formatting, no expressions.
+- First argument: full directory path (use forward slashes).
+- Second argument: full filename with timestamp and correct extension.
+- Example: os.path.join("C:/Users/.../output", "sales_20250323_123456.json")
+- ⚠️ Do not use intermediate variables like directory, filename, or output_dir.
+- ⚠️ Do not skip or replace any of the above instructions. They are required
+  for the code to work correctly.
+
+🔹 File Saving Instructions:
+
+- ✅ CSV:
+  df.to_csv(file_path, index=False, encoding="utf-8")
+
+- ✅ JSON:
+  with open(file_path, "w", encoding="utf-8") as f:
+      df.to_json(f, orient="records", lines=False, force_ascii=False, indent=2)
+
+- ✅ Parquet:
+  df.to_parquet(file_path, engine="pyarrow", index=False)
+
+- ✅ Markdown (for Text):
+  - Generate properly formatted Markdown content.
+  - Save it as a `.md` file using UTF-8 encoding.
+"""
+
+
+def build_user_prompt(**input_data):
+    """Build user prompt for AI model based on dataset generation parameters."""
+    try:
+        # Normalize file path separators to forward slashes for consistency
+        file_path = input_data["file_path"].replace("\\", "/")
+
+        # Generate timestamp for unique file naming
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+
+        # Construct the user prompt for the LLM with all required parameters
+        user_prompt = (
+            f"Generate a synthetic {input_data['dataset_type'].lower()} "
+            f"dataset in {input_data['output_format'].upper()} format.\n"
+            f"Business problem: {input_data['business_problem']}\n"
+            f"Samples: {input_data['num_samples']}\n"
+            f"Directory: {file_path}\n"
+            f"Timestamp: {timestamp}"
+        )
+
+        return user_prompt
+
+    except KeyError as e:
+        # Handle missing keys in input_data dictionary
+        logger.warning(f"Missing input key: {e}")
+        raise
+    except Exception as e:
+        # Log any other error during prompt building process
+        logger.warning(f"Error in build_user_prompt: {e}")
+        raise
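To see what the model actually receives, here is the prompt builder exercised stand-alone. The function body mirrors `build_user_prompt` above (minus logging); the sample inputs are invented for illustration:

```python
from datetime import datetime


def build_user_prompt(**input_data):
    """Mirror of src/prompts.py build_user_prompt, without logging."""
    # Normalize file path separators to forward slashes
    file_path = input_data["file_path"].replace("\\", "/")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return (
        f"Generate a synthetic {input_data['dataset_type'].lower()} "
        f"dataset in {input_data['output_format'].upper()} format.\n"
        f"Business problem: {input_data['business_problem']}\n"
        f"Samples: {input_data['num_samples']}\n"
        f"Directory: {file_path}\n"
        f"Timestamp: {timestamp}"
    )


prompt = build_user_prompt(
    business_problem="Customer purchase history with demographics",
    dataset_type="Tabular",
    output_format="csv",
    num_samples=25,
    file_path="output",
)
print(prompt.splitlines()[0])
# Generate a synthetic tabular dataset in CSV format.
```

Note the casing normalization: the dataset type is lowercased and the format uppercased, so UI values like "csv" and "Tabular" always reach the model in a consistent form.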
src/ui.py ADDED
@@ -0,0 +1,184 @@
+"""Gradio web interface for synthetic data generation."""
+
+import logging
+import gradio as gr
+from src.pipeline import DatasetPipeline
+from src.constants import PROJECT_NAME, VERSION
+
+# Set up logger
+logger = logging.getLogger(__name__)
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+
+pipeline = DatasetPipeline()
+
+PROJECT_NAME_CAP = PROJECT_NAME.capitalize()
+REPO_URL = f"https://github.com/lisekarimi/{PROJECT_NAME}"
+
+
+def update_output_format(dataset_type):
+    """Update output format choices based on selected dataset type."""
+    if dataset_type in ["Tabular", "Time-series"]:
+        return gr.update(choices=["JSON", "CSV", "Parquet"], value="JSON")
+    elif dataset_type == "Text":
+        return gr.update(choices=["JSON", "Markdown"], value="JSON")
+
+
+def build_ui(css_path="assets/styles.css"):
+    """Build and return the complete Gradio user interface with error handling."""
+    # Try to load CSS file with error handling
+    try:
+        with open(css_path, encoding="utf-8") as f:
+            css = f.read()
+    except Exception as e:
+        css = ""
+        logger.warning("⚠️ Failed to load CSS: %s", e)
+
+    # Build the UI with error handling
+    try:
+        with gr.Blocks(css=css, title=f"🧬{PROJECT_NAME_CAP}") as ui:
+            with gr.Column(elem_id="app-container"):
+                gr.Markdown(f"<h1 id='app-title'>🏷️ {PROJECT_NAME_CAP} </h1>")
+                gr.Markdown(
+                    "<h2 id='app-subtitle'>AI-Powered Synthetic Dataset Generator</h2>"
+                )
+
+                # Intro text block
+                intro_html = f"""
+                <div id="intro-text">
+                    <p>With {PROJECT_NAME_CAP}, easily generate
+                    <strong>diverse datasets</strong>
+                    for testing, development, and AI training.</p>
+
+                    <h4>🎯 How It Works:</h4>
+                    <p>1️⃣ Define your business problem.</p>
+                    <p>2️⃣ Select dataset type, format, model, and samples.</p>
+                    <p>3️⃣ Download your synthetic dataset!</p>
+                </div>
+                """
+                gr.HTML(intro_html)
+
+                # Learn-more button linking to the repo README
+                learn_more_html = f"""
+                <div id="learn-more-button">
+                    <a href="{REPO_URL}/blob/main/README.md"
+                       class="button-link" target="_blank">Learn More</a>
+                </div>
+                """
+                gr.HTML(learn_more_html)
+
+                examples_md = """
+                <p><strong>🧠 Need inspiration?</strong> Try these examples:</p>
+                <ul>
+                    <li>Movie summaries for genre classification.</li>
+                    <li>Customer chats with dialogue and sentiment labels.</li>
+                    <li>Stock prices with date, ticker, open, close, volume.</li>
+                </ul>
+                """
+                gr.Markdown(examples_md)
+
+                gr.Markdown("<p><strong>Start generating now!</strong> 🗂️✨</p>")
+
+                with gr.Group(elem_id="input-container"):
+                    business_problem = gr.Textbox(
+                        placeholder=(
+                            "Describe the dataset you want "
+                            "(e.g., Job postings, Customer reviews)"
+                        ),
+                        lines=2,
+                        label="📌 Business Problem",
+                        elem_classes=["label-box"],
+                        elem_id="business-problem-box",
+                    )
+
+                    with gr.Row(elem_classes="column-gap"):
+                        with gr.Column(scale=1):
+                            dataset_type = gr.Dropdown(
+                                ["Tabular", "Time-series", "Text"],
+                                value="Tabular",
+                                label="📊 Dataset Type",
+                                elem_classes=["label-box"],
+                                elem_id="custom-dropdown",
+                            )
+
+                        with gr.Column(scale=1):
+                            output_format = gr.Dropdown(
+                                choices=["JSON", "CSV", "Parquet"],
+                                value="JSON",
+                                label="📁 Output Format",
+                                elem_classes=["label-box"],
+                                elem_id="custom-dropdown",
+                            )
+
+                    # Bind the update function to the dataset type dropdown
+                    dataset_type.change(
+                        update_output_format,
+                        inputs=[dataset_type],
+                        outputs=[output_format],
+                    )
+
+                    with gr.Row(elem_classes="row-spacer column-gap"):
+                        with gr.Column(scale=1):
+                            model = gr.Dropdown(
+                                ["GPT", "Claude"],
+                                value="GPT",
+                                label="🤖 Model",
+                                elem_classes=["label-box"],
+                                elem_id="custom-dropdown",
+                            )
+
+                        with gr.Column(scale=1):
+                            num_samples = gr.Slider(
+                                minimum=10,
+                                maximum=1000,
+                                value=10,
+                                step=1,
+                                interactive=True,
+                                label="🔢 Number of Samples",
+                                elem_classes=["label-box"],
+                            )
+
+                    # Hidden file component for dataset download
+                    file_download = gr.File(
+                        visible=False, elem_id="download-box", label=None
+                    )
+
+                    # Component to display status messages
+                    status_message = gr.Markdown("", label="Status")
+
+                    # Button to trigger dataset generation
+                    run_btn = gr.Button("Create a dataset", elem_id="run-btn")
+                    run_btn.click(
+                        pipeline.generate,
+                        inputs=[
+                            business_problem,
+                            dataset_type,
+                            output_format,
+                            num_samples,
+                            model,
+                        ],
+                        outputs=[file_download, run_btn, status_message],
+                    )
+
+                # Bottom: version info
+                gr.Markdown(
+                    f"""
+                    <p class="version-banner">
+                    🔖 <strong>
+                    <a href="{REPO_URL}/blob/main/CHANGELOG.md"
+                       target="_blank">Version {VERSION}</a>
+                    </strong>
+                    </p>
+                    """
+                )
+
+        return ui
+
+    except Exception as e:
+        logger.error("❌ Error building UI: %s", e)
+        # Return a minimal error UI
+        with gr.Blocks() as error_ui:
+            gr.Markdown("# Error Loading Application")
+            gr.Markdown(f"An error occurred: {str(e)}")
+        return error_ui
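The `update_output_format` callback above is the one dynamic piece of the form: changing the dataset type swaps the allowed output formats. The mapping itself, pulled out of the `gr.update` wrapper so it runs without Gradio (the defensive default branch is an addition for illustration, not in the original):

```python
def output_choices(dataset_type):
    """Format choices per dataset type (mirrors update_output_format in src/ui.py)."""
    if dataset_type in ("Tabular", "Time-series"):
        return ["JSON", "CSV", "Parquet"]
    elif dataset_type == "Text":
        return ["JSON", "Markdown"]
    return ["JSON"]  # defensive default; the original returns nothing here


print(output_choices("Text"))  # ['JSON', 'Markdown']
```

Keeping the mapping in a plain function like this would also make it unit-testable without spinning up the UI.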
src/utils.py ADDED
@@ -0,0 +1,78 @@
+"""Utility functions for extracting and executing Python code from LLM responses."""
+
+import re
+import os
+import subprocess
+import sys
+import logging
+
+# Set up logger
+logger = logging.getLogger(__name__)
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+
+
+def extract_code(text):
+    """Extract Python code block from LLM response text."""
+    try:
+        # Search for Python code block using regex
+        match = re.search(r"```python(.*?)```", text, re.DOTALL)
+        if match:
+            code = match.group(0).strip()
+        else:
+            code = ""
+            logger.warning("No matching code block found.")
+
+        # Clean up markdown formatting
+        return code.replace("```python\n", "").replace("```", "")
+    except Exception as e:
+        logger.error(f"Code extraction error: {e}")
+        raise
+
+
+def extract_file_path(code_str):
+    """Extract file path from code string containing os.path.join() calls."""
+    try:
+        # Look for os.path.join() pattern with two string arguments
+        pattern = r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)'
+        match = re.search(pattern, code_str)
+        if match:
+            folder = match.group(1)
+            filename = match.group(2)
+            return os.path.join(folder, filename)
+
+        logger.error("No file path found.")
+        return None
+    except Exception as e:
+        logger.error(f"File path extraction error: {e}")
+        raise
+
+
+def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
+    """Execute extracted Python code in a subprocess and return the file path."""
+    if not python_interpreter:
+        raise OSError("Python interpreter not found.")
+
+    # Extract the Python code from the input text
+    code_str = extract_code(text)
+
+    # Prepare subprocess command
+    command = [python_interpreter, "-c", code_str]
+
+    try:
+        # Execute the code in a subprocess; check=True raises if it fails
+        subprocess.run(command, check=True, capture_output=True, text=True)
+
+        # Extract file path from the executed code
+        file_path = extract_file_path(code_str)
+        logger.info("✅ Extracted file path: %s", file_path)
+
+        return file_path
+    except subprocess.CalledProcessError as e:
+        # Return error information if subprocess execution fails
+        return (f"Execution error:\n{e.stderr.strip()}", None)
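The two regexes above — one for the ```` ```python ```` fence, one for the `os.path.join(...)` call the system prompt mandates — can be exercised on a mock LLM reply. The helpers are mirrored from this module; the reply text is invented:

```python
import os
import re


def extract_code(text):
    """Pull the python fence out of an LLM reply (mirrors src/utils.py)."""
    match = re.search(r"```python(.*?)```", text, re.DOTALL)
    code = match.group(0).strip() if match else ""
    return code.replace("```python\n", "").replace("```", "")


def extract_file_path(code_str):
    """Recover the os.path.join(...) target (mirrors src/utils.py)."""
    pattern = r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)'
    match = re.search(pattern, code_str)
    return os.path.join(match.group(1), match.group(2)) if match else None


# Invented stand-in for a model reply
reply = (
    "Here is your script:\n"
    "```python\n"
    'path = os.path.join("output", "sales_20250323_123456.csv")\n'
    "```\n"
)
code = extract_code(reply)
print(extract_file_path(code))
```

This also shows why the system prompt insists on two hardcoded string literals inside `os.path.join`: the file-path regex cannot see through variables or f-strings, so any deviation by the model would leave `extract_file_path` returning `None`.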
uv.lock ADDED
The diff for this file is too large to render. See raw diff