Commit db17eb5 · Parent(s): 62587cb
Deploy version 0.1.0
Files changed:
- Dockerfile +29 -0
- README.md +77 -7
- assets/styles.css +237 -0
- main.py +14 -0
- pyproject.toml +35 -0
- src/__init__.py +0 -0
- src/constants.py +34 -0
- src/datagen.py +51 -0
- src/models.py +61 -0
- src/pipeline.py +88 -0
- src/prompts.py +78 -0
- src/ui.py +184 -0
- uv.lock +0 -0
Dockerfile
ADDED

```dockerfile
FROM python:3.11-slim

# Install uv
RUN pip install uv

WORKDIR /app

# Copy dependency files first (changes rarely)
COPY pyproject.toml uv.lock ./

# Put venv outside of /app so it won't be affected by volume mounts
ENV UV_PROJECT_ENVIRONMENT=/opt/venv

# Install dependencies (this will now create venv at /opt/venv)
RUN uv sync --locked

# Copy all source code
COPY . .

# Create output directory with proper permissions
RUN mkdir -p /tmp/output && chmod 777 /tmp/output

# Set output directory environment variable for production
ENV OUTPUT_DIR=/tmp/output

# Disable UV cache entirely for production
ENV UV_NO_CACHE=1

CMD ["uv", "run", "main.py"]
```
README.md
CHANGED

````markdown
---
title: DataGen
emoji: 🧬
colorFrom: indigo
colorTo: pink
sdk: docker
---

# 🧬 DataGen: AI-Powered Synthetic Data Generator

Generate realistic synthetic datasets by simply describing what you need.

[🚀 **Try the Live Demo**](https://huggingface.co/spaces/lisekarimi/snapr)

<img src="https://gitlab.com/lisekarimi/datagen/-/raw/main/assets/screenshot.png" alt="DataGen interface" width="450">

## ✨ What DataGen Does

DataGen transforms simple descriptions into structured datasets using AI. Perfect for researchers, data scientists, and developers who need realistic test data fast.

**Key Features:**
- **Type what you want → Get real data**
- **Multiple formats:** CSV, JSON, Parquet, Markdown
- **Dataset types:** Tables, time-series, text data
- **AI-powered:** Uses GPT and Claude models
- **Instant download** with clean, ready-to-use datasets

## 🚀 Quick Start

### Prerequisites
- Python 3.11+
- [uv package manager](https://docs.astral.sh/uv/getting-started/installation/)

### Installation
```bash
git clone https://github.com/lisekarimi/datagen.git
cd datagen
uv sync
source .venv/bin/activate  # Unix/macOS
# or .\.venv\Scripts\activate on Windows
```

### Configuration
1. Copy `.env.example` to `.env`
2. Populate it with the required secrets

### Run DataGen
```bash
# Local development
make run

# With hot reload
make ui
```

*For complete setup instructions, commands, and development guidelines, see [docs/dev.md](https://gitlab.com/lisekarimi/datagen/-/blob/main/docs/dev.md)*

## 🧑‍💻 How to Use

1. **Describe your data:** "Customer purchase history with demographics"
2. **Choose format:** CSV, JSON, Parquet, or Markdown
3. **Select AI model:** GPT or Claude
4. **Set sample size:** Number of records to generate
5. **Generate & download** your dataset

## 🛡️ Quality & Security

DataGen maintains high standards with comprehensive test coverage, automated security scanning, and code quality enforcement.

*For CI/CD setup and technical details, see [docs/cicd.md](https://gitlab.com/lisekarimi/datagen/-/blob/main/docs/cicd.md)*

## 📝 Notes

- Generated files are automatically cleaned up after 5 minutes
- Supports 10-1000 samples per dataset
- JSON output includes proper indentation for readability
- Cross-platform compatibility (Windows, macOS, Linux)

## 📄 License

MIT
````
assets/styles.css
ADDED

```css
html, body, #app, body > div, .gradio-container {
  background-color: #0b0e18 !important; /* dark blue */
  height: 100%;
  margin: 0;
  padding: 0;
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  display: flex;
  justify-content: center;
  align-items: center;
}

#app-container {
  background-color: #1d3451 !important;
  padding: 40px;
  border-radius: 12px;
  box-shadow: 0 4px 25px rgba(0, 0, 0, 0.4);
  max-width: 800px;
  width: 100%;
  color: white;
}

#app-container h4,
#app-container p,
#app-container ol,
#app-container li,
#app-container strong {
  font-size: 16px;
  line-height: 1.6;
  color: white !important;
}

#app-title {
  font-size: 42px;
  background: linear-gradient(to left, #ff416c, #ff4b2b);
  -webkit-background-clip: text;
  background-clip: text;
  color: transparent;
  font-weight: 800;
  margin-bottom: 5px;
  text-align: center;
}

#app-subtitle {
  font-size: 24px;
  background: linear-gradient(to left, #ff416c, #ff4b2b);
  -webkit-background-clip: text;
  background-clip: text;
  color: transparent;
  font-weight: 600;
  margin-top: 0;
  text-align: center;
}

#intro-text {
  font-size: 16px;
  color: white !important;
  margin-top: 20px;
  line-height: 1.6;
}

#learn-more-button {
  display: flex;
  justify-content: center;
  margin-top: 5px;
}

.button-link {
  background: linear-gradient(to left, #ff416c, #ff4b2b);
  color: white !important;
  padding: 10px 20px;
  text-decoration: none;
  font-weight: bold;
  border-radius: 8px;
  transition: opacity 0.3s;
}

.button-link:hover {
  opacity: 0.85;
}

#input-container {
  background-color: #1f2937;
  padding: 20px;
  border-radius: 10px;
  box-shadow: 0 2px 10px rgba(0, 0, 0, 0.2);
}

.label-box label {
  background-color: #1f2937;
  padding: 4px 10px;
  border-radius: 8px;
  display: inline-block;
  margin-bottom: 6px;
}

.label-box span {
  color: white !important;
}

.label-box {
  background-color: #1f2937;
  color: white;
  padding: 4px 10px;
  border-radius: 8px;
  display: inline-block;
}

#input-container > div {
  background: #1f2937 !important;
  background-color: #1f2937 !important;
  border: none !important;
  box-shadow: none !important;
  padding: 10px !important;
  margin: 0 !important;
}

.row-spacer {
  margin-top: 24px !important;
}

.column-gap {
  gap: 16px;
}

textarea, input[type="text"] {
  background-color: #374151 !important;
  color: white !important;
}

#custom-dropdown .wrap {
  background-color: #374151 !important;
  border-radius: 6px;
}

input[role="listbox"] {
  color: white !important;
  background-color: #374151 !important;
}

.dropdown-arrow {
  color: white !important;
}

ul[role="listbox"] {
  background-color: #111827 !important; /* dark navy */
  color: white !important;
  border-radius: 6px;
  padding: 4px 0;
}

ul[role="listbox"] li {
  color: white !important;
  padding: 8px 12px;
}

ul[role="listbox"] li:hover {
  background-color: #1f2937 !important; /* slightly lighter hover */
}

ul[role="listbox"] li[aria-selected="true"] {
  background-color: #111827 !important; /* same dark as others */
  color: white !important;
}

input[type="number"] {
  background-color: #374151;
  color: white !important;
}

#business-problem-box {
  margin-left: 0 !important;
  margin-right: 0 !important;
  padding-left: 0 !important;
  padding-right: 0 !important;
  width: 100% !important;
}

#business-problem-box textarea::placeholder {
  color: #9ca3af !important; /* Tailwind's "gray-400" */
}

#run-btn {
  background: linear-gradient(to left, #ff416c, #ff4b2b);
  color: white !important;
  font-weight: bold;
  border: none;
  padding: 10px 20px;
  border-radius: 8px;
  cursor: pointer;
  transition: background 0.3s ease;
}

#run-btn:hover {
  filter: brightness(1.1);
}

#download-box {
  background-color: #1f2937;
  border: 1px solid #3b3b3b;
  border-radius: 8px;
  padding: 10px;
  margin-top: 16px;
  font-weight: 500;
}

#download-box a {
  color: #00c3ff !important;
  text-decoration: underline;
  font-weight: bold;
}

#download-box td.filename {
  color: rgb(255, 255, 255) !important;
}

#download-box .file-preview-holder,
#download-box .file-preview,
#download-box table,
#download-box tr,
#download-box td {
  background-color: #1f2937 !important;
}

/* #download-box label {
  background-color: #1f2937 !important;
  color: white !important;
  font-weight: bold;
} */
#download-box > label {
  display: none !important;
}

/* ==== Version ==== */
.version-banner {
  text-align: center;
  font-size: 0.9em;
}
```
main.py
ADDED

```python
"""Entry point for the application."""

import os

from src.ui import build_ui

demo = build_ui()

# Main application entry point
if __name__ == "__main__":
    demo.launch(
        allowed_paths=["output"],
        server_name="0.0.0.0",
        server_port=int(os.environ.get("PORT", 7860)),
    )
```
pyproject.toml
ADDED

```toml
[project]
name = "datagen"
version = "0.1.0"
description = "AI-powered platform for generating synthetic datasets"
readme = "README.md"
requires-python = ">=3.11"
dependencies = [
    "anthropic==0.49.0",
    "gradio",
    "numpy>=2.2.6",
    "openai==1.65.5",
    "pandas>=2.2.3",
    "pyarrow>=20.0.0",
    "python-dotenv==1.0.1",
]

[tool.pytest.ini_options]
pythonpath = ["."]
filterwarnings = [
    "ignore::DeprecationWarning:websockets.legacy",
]

[tool.ruff.lint]
select = [
    "E",   # pycodestyle errors
    "W",   # pycodestyle warnings
    "F",   # Pyflakes
    "D",   # pydocstyle (docstrings)
    "UP",  # pyupgrade
    "B",   # flake8-bugbear
]
ignore = ["D104"]  # Missing docstring in public package (__init__.py)

[tool.ruff.lint.pydocstyle]
convention = "google"
```
src/__init__.py
ADDED

Empty file, no content.
src/constants.py
ADDED

```python
# src/constants.py
"""Constants for configuration across the project."""

import logging
import os
import tomllib
from pathlib import Path

# ==================== PROJECT METADATA ====================
root = Path(__file__).parent.parent
with open(root / "pyproject.toml", "rb") as f:
    pyproject = tomllib.load(f)

PROJECT_NAME = pyproject["project"]["name"]
VERSION = pyproject["project"]["version"]

# ==================== AI MODEL CONFIG ====================
OPENAI_MODEL = "gpt-4o-mini"
CLAUDE_MODEL = "claude-3-5-sonnet-20240620"

# Other constants can go here
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "output")
MAX_TOKENS = 2000

# ==================== LOGGING CONFIG ====================

# Configure logging once
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Create a shared logger
logger = logging.getLogger(__name__)

# ==================== FILE MANAGEMENT ====================
FILE_CLEANUP_SECONDS = 300  # 5 minutes
```
src/datagen.py
ADDED

```python
"""Main data generation class for creating synthetic datasets using AI models."""

import os
from datetime import datetime

from .constants import OUTPUT_DIR, logger
from .models import get_claude_completion, get_gpt_completion
from .prompts import build_user_prompt, system_message
from .utils import execute_code_in_virtualenv


class DataGen:
    """Handles synthetic data generation using AI models."""

    def __init__(self, output_dir=None):
        """Initialize the data generator with output directory."""
        # Use provided output_dir, or fall back to the OUTPUT_DIR constant
        self.output_dir = output_dir or OUTPUT_DIR
        os.makedirs(self.output_dir, exist_ok=True)

    def get_timestamp(self):
        """Return current timestamp for file naming."""
        return datetime.now().strftime("%Y%m%d_%H%M%S")

    def generate_dataset(self, **input_data):
        """Generate synthetic dataset based on input parameters and model choice."""
        try:
            # Ensure output directory exists before generating
            os.makedirs(self.output_dir, exist_ok=True)

            # Add output directory path to input data for file generation
            input_data["file_path"] = self.output_dir

            # Build the prompt to send to the selected LLM
            prompt = build_user_prompt(**input_data)

            # Call the selected LLM based on the model parameter
            if input_data["model"] == "GPT":
                code = get_gpt_completion(prompt, system_message)
            elif input_data["model"] == "Claude":
                code = get_claude_completion(prompt, system_message)
            else:
                raise ValueError("Invalid model selected.")

            # Execute the generated code and return the output file path
            file_path = execute_code_in_virtualenv(code)
            return file_path

        except Exception as e:
            # Log and re-raise any errors that occur during generation
            logger.error(f"Error in generate_dataset: {e}")
            raise
```
src/models.py
ADDED

```python
"""AI model clients and API configuration for OpenAI and Anthropic."""

import os

import anthropic
from dotenv import load_dotenv
from openai import OpenAI

from .constants import CLAUDE_MODEL, MAX_TOKENS, OPENAI_MODEL, logger

# Load environment variables from .env file
load_dotenv(override=True)

# Retrieve API keys from environment variables
openai_api_key = os.getenv("OPENAI_API_KEY")
anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")

# Warn if any API key is missing for proper error handling
if not openai_api_key:
    logger.error("❌ OpenAI API Key is missing!")

if not anthropic_api_key:
    logger.error("❌ Anthropic API Key is missing!")

# Initialize API clients with the retrieved keys
openai = OpenAI(api_key=openai_api_key)
claude = anthropic.Anthropic()


def get_gpt_completion(prompt, system_message):
    """Call OpenAI's GPT model with prompt and system message."""
    try:
        # Create chat completion with system and user messages
        response = openai.chat.completions.create(
            model=OPENAI_MODEL,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": prompt},
            ],
            stream=False,
        )
        # Extract and return the generated content
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"GPT error: {e}")
        raise


def get_claude_completion(prompt, system_message):
    """Call Anthropic's Claude model with prompt and system message."""
    try:
        # Create message with Claude API using system prompt and user message
        result = claude.messages.create(
            model=CLAUDE_MODEL,
            max_tokens=MAX_TOKENS,
            system=system_message,
            messages=[{"role": "user", "content": prompt}],
        )
        # Extract and return the text content from response
        return result.content[0].text
    except Exception as e:
        logger.error(f"Claude error: {e}")
        raise
```
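`DataGen.generate_dataset` in `src/datagen.py` routes each prompt to one of these two completion functions based on the `model` string. A sketch of that dispatch pattern, with stub completion functions standing in for the real API clients (no network calls; the stub return values are illustrative only):

```python
def get_gpt_completion(prompt, system_message):
    """Stub for the OpenAI-backed completion function."""
    return "gpt:" + prompt


def get_claude_completion(prompt, system_message):
    """Stub for the Anthropic-backed completion function."""
    return "claude:" + prompt


def dispatch(model, prompt, system_message="You generate synthetic datasets."):
    """Route a prompt to the completion function matching the model name."""
    if model == "GPT":
        return get_gpt_completion(prompt, system_message)
    elif model == "Claude":
        return get_claude_completion(prompt, system_message)
    # Any other value is rejected, mirroring generate_dataset
    raise ValueError("Invalid model selected.")
```

Keeping the dispatch in one place means adding a third provider is a single extra branch plus one new completion function.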
src/pipeline.py
ADDED

```python
"""Pipeline orchestration for dataset generation."""

import logging
import os
import threading

import gradio as gr

from src.constants import FILE_CLEANUP_SECONDS
from src.datagen import DataGen

logger = logging.getLogger(__name__)


def safe_delete(file_path):
    """Safely delete a file, ignoring errors if file doesn't exist."""
    try:
        if os.path.exists(file_path):
            os.remove(file_path)
    except Exception:
        pass  # Ignore deletion errors


class DatasetPipeline:
    """Handles the dataset generation pipeline."""

    def __init__(self):
        """Initialize the pipeline with a DataGen instance."""
        self.generator = DataGen()

    def generate(
        self, business_problem, dataset_type, output_format, num_samples, model
    ):
        """Generate synthetic dataset based on user inputs."""
        # Check if business problem is empty
        if not business_problem.strip():
            error_msg = "❌ Please enter a business problem before generating."
            yield [gr.update(visible=False), gr.update(visible=True), error_msg]
            return

        # Initial feedback while generating
        yield [
            gr.update(visible=False),
            gr.update(visible=False),
            "⏳ Generating dataset...",
        ]

        try:
            # Pack inputs into a dictionary for the generator
            input_data = {
                "business_problem": business_problem,
                "dataset_type": dataset_type,
                "output_format": output_format,
                "num_samples": num_samples,
                "model": model,
            }

            # Generate dataset file
            file_path = self.generator.generate_dataset(**input_data)

            # Check if file exists and return success message + file path
            if isinstance(file_path, str) and os.path.exists(file_path):
                # Auto-delete after 5 minutes with safe deletion
                threading.Timer(
                    FILE_CLEANUP_SECONDS, safe_delete, args=[file_path]
                ).start()
                success_update = [
                    gr.update(value=file_path, visible=True),
                    gr.update(visible=True),
                    "✅ Dataset ready for download.",
                ]
                yield success_update
            else:
                # Handle invalid or missing file
                error_update = [
                    gr.update(visible=False),
                    gr.update(visible=True),
                    "❌ Error: File not created or path invalid.",
                ]
                yield error_update

        except Exception as e:
            # Catch and display any errors in the pipeline
            logger.error("Pipeline error: %s", e)
            error_update = [
                gr.update(visible=False),
                gr.update(visible=True),
                f"❌ Pipeline error: {e}",
            ]
            yield error_update
```
ADDED
|
@@ -0,0 +1,78 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Prompt templates and management for AI model interactions."""
|
| 2 |
+
|
| 3 |
+
from datetime import datetime
|
| 4 |
+
from src.constants import logger
|
| 5 |
+
|
| 6 |
+
# System message template that defines AI assistant behavior and rules
|
| 7 |
+
system_message = """
|
| 8 |
+
You are a helpful assistant whose main purpose is to generate synthetic datasets
|
| 9 |
+
based on a given business problem.
|
| 10 |
+
|
| 11 |
+
🔹 General Guidelines:
|
| 12 |
+
- Be accurate and concise.
|
| 13 |
+
- Use only standard Python libraries (pandas, numpy, os, datetime, etc.)
|
| 14 |
+
- The dataset must contain the requested number of samples.
|
| 15 |
+
- Always respect the requested output format exactly.
|
| 16 |
+
- If multiple entities exist, save each to a separate file.
|
| 17 |
+
- Do not use f-strings anywhere in the code — not in file paths or in content.
|
| 18 |
+
Use standard string concatenation instead.
|
| 19 |
+
|
| 20 |
+
🔹 File Path Rules:
|
| 21 |
+
- Define the full file path using os.path.join(...) — exactly as shown —
|
| 22 |
+
no shortcuts or direct strings.
|
| 23 |
+
- Use two hardcoded string literals only — no variables, no f-strings,
|
| 24 |
+
no formatting, no expressions.
|
| 25 |
+
- First argument: full directory path (use forward slashes).
|
| 26 |
+
- Second argument: full filename with timestamp and correct extension.
|
| 27 |
+
- Example: os.path.join("C:/Users/.../output", "sales_20250323_123456.json")
|
| 28 |
+
- ⚠️ Do not use intermediate variables like directory, filename, or output_dir.
|
| 29 |
+
- ⚠️ Do not skip or replace any of the above instructions. They are required
|
| 30 |
+
for the code to work correctly.
|
| 31 |
+
|
| 32 |
+
🔹 File Saving Instructions:
|
| 33 |
+
|
| 34 |
+
- ✅ CSV:
|
| 35 |
+
df.to_csv(file_path, index=False, encoding="utf-8")
|
| 36 |
+
|
| 37 |
+
- ✅ JSON:
|
| 38 |
+
with open(file_path, "w", encoding="utf-8") as f:
|
| 39 |
+
df.to_json(f, orient="records", lines=False, force_ascii=False, indent=2)
|
| 40 |
+
|
| 41 |
+
- ✅ Parquet:
|
| 42 |
+
df.to_parquet(file_path, engine="pyarrow", index=False)
|
| 43 |
+
|
| 44 |
+
- ✅ Markdown (for Text):
|
| 45 |
+
- Generate properly formatted Markdown content.
|
| 46 |
+
- Save it as a `.md` file using UTF-8 encoding.
|
| 47 |
+
"""
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def build_user_prompt(**input_data):
|
| 51 |
+
"""Build user prompt for AI model based on dataset generation parameters."""
|
| 52 |
+
try:
|
| 53 |
+
# Normalize file path separators to forward slashes for consistency
|
| 54 |
+
file_path = input_data["file_path"].replace("\\", "/")
|
| 55 |
+
|
| 56 |
+
# Generate timestamp for unique file naming
|
| 57 |
+
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
| 58 |
+
|
| 59 |
+
# Construct the user prompt for the LLM with all required parameters
|
| 60 |
+
user_prompt = (
|
| 61 |
+
f"Generate a synthetic {input_data['dataset_type'].lower()} "
|
| 62 |
+
f"dataset in {input_data['output_format'].upper()} format.\n"
|
| 63 |
+
f"Business problem: {input_data['business_problem']}\n"
|
| 64 |
+
f"Samples: {input_data['num_samples']}\n"
|
| 65 |
+
f"Directory: {file_path}\n"
|
| 66 |
+
f"Timestamp: {timestamp}"
|
| 67 |
+
)
|
| 68 |
+
|
| 69 |
+
return user_prompt
|
| 70 |
+
|
| 71 |
+
except KeyError as e:
|
| 72 |
+
# Handle missing keys in input_data dictionary
|
| 73 |
+
logger.warning(f"Missing input key: {e}")
|
| 74 |
+
raise
|
| 75 |
+
except Exception as e:
|
| 76 |
+
# Log any other error during prompt building process
|
| 77 |
+
logger.warning(f"Error in build_user_prompt: {e}")
|
| 78 |
+
raise
|
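The strict "File Path Rules" in the system prompt exist because the model's output is executed verbatim: a single `os.path.join` call with two hardcoded literals is trivial for the executor to locate and validate, whereas f-strings or intermediate variables would make the emitted path unpredictable. What the model is asked to emit looks roughly like this (the directory and filename are illustrative; the timestamp is inlined by the model from the user prompt):

```python
import os
from datetime import datetime

# Two hardcoded string literals only, as the system prompt requires
file_path = os.path.join("output", "sales_20250323_123456.json")

# Sanity check: the inlined timestamp parses with the same format
# string that DataGen.get_timestamp uses ("%Y%m%d_%H%M%S")
datetime.strptime("20250323_123456", "%Y%m%d_%H%M%S")
```

The forward-slash directory requirement pairs with `build_user_prompt`'s backslash normalization, so Windows paths survive the round trip through the model unchanged.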
src/ui.py
ADDED
|
@@ -0,0 +1,184 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""Gradio web interface for synthetic data generation."""

import logging

import gradio as gr

from src.pipeline import DatasetPipeline
from src.constants import PROJECT_NAME, VERSION

# Set up logger
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

pipeline = DatasetPipeline()

PROJECT_NAME_CAP = PROJECT_NAME.capitalize()
REPO_URL = f"https://github.com/lisekarimi/{PROJECT_NAME}"


def update_output_format(dataset_type):
    """Update output format choices based on the selected dataset type."""
    if dataset_type in ["Tabular", "Time-series"]:
        return gr.update(choices=["JSON", "CSV", "Parquet"], value="JSON")
    elif dataset_type == "Text":
        return gr.update(choices=["JSON", "Markdown"], value="JSON")
    # Fallback: leave the dropdown unchanged for unexpected values
    return gr.update()


def build_ui(css_path="assets/styles.css"):
    """Build and return the complete Gradio user interface with error handling."""
    # Try to load the CSS file; fall back to unstyled UI on failure
    try:
        with open(css_path, encoding="utf-8") as f:
            css = f.read()
    except Exception as e:
        css = ""
        logger.warning("⚠️ Failed to load CSS: %s", e)

    # Build the UI with error handling
    try:
        with gr.Blocks(css=css, title=f"🧬{PROJECT_NAME_CAP}") as ui:
            with gr.Column(elem_id="app-container"):
                gr.Markdown(f"<h1 id='app-title'>🏷️ {PROJECT_NAME_CAP} </h1>")
                gr.Markdown(
                    "<h2 id='app-subtitle'>AI-Powered Synthetic Dataset Generator</h2>"
                )

                # Introductory HTML block
                intro_html = f"""
                <div id="intro-text">
                    <p>With {PROJECT_NAME_CAP}, easily generate
                    <strong>diverse datasets</strong>
                    for testing, development, and AI training.</p>

                    <h4>🎯 How It Works:</h4>
                    <p>1️⃣ Define your business problem.</p>
                    <p>2️⃣ Select dataset type, format, model, and samples.</p>
                    <p>3️⃣ Download your synthetic dataset!</p>
                </div>
                """
                gr.HTML(intro_html)

                # "Learn More" button linking to the project README
                learn_more_html = f"""
                <div id="learn-more-button">
                    <a href="{REPO_URL}/blob/main/README.md"
                       class="button-link" target="_blank">Learn More</a>
                </div>
                """
                gr.HTML(learn_more_html)

                examples_md = """
                <p><strong>🧠 Need inspiration?</strong> Try these examples:</p>
                <ul>
                    <li>Movie summaries for genre classification.</li>
                    <li>Customer chats with dialogue and sentiment labels.</li>
                    <li>Stock prices with date, ticker, open, close, volume.</li>
                </ul>
                """
                gr.Markdown(examples_md)

                gr.Markdown("<p><strong>Start generating now!</strong> 🗂️✨</p>")

                with gr.Group(elem_id="input-container"):
                    business_problem = gr.Textbox(
                        placeholder=(
                            "Describe the dataset you want "
                            "(e.g., Job postings, Customer reviews)"
                        ),
                        lines=2,
                        label="📌 Business Problem",
                        elem_classes=["label-box"],
                        elem_id="business-problem-box",
                    )

                    with gr.Row(elem_classes="column-gap"):
                        with gr.Column(scale=1):
                            dataset_type = gr.Dropdown(
                                ["Tabular", "Time-series", "Text"],
                                value="Tabular",
                                label="📊 Dataset Type",
                                elem_classes=["label-box"],
                                elem_id="custom-dropdown",
                            )

                        with gr.Column(scale=1):
                            output_format = gr.Dropdown(
                                choices=["JSON", "CSV", "Parquet"],
                                value="JSON",
                                label="📁 Output Format",
                                elem_classes=["label-box"],
                                elem_id="custom-dropdown",
                            )

                    # Bind the update function to the dataset type dropdown
                    dataset_type.change(
                        update_output_format,
                        inputs=[dataset_type],
                        outputs=[output_format],
                    )

                    with gr.Row(elem_classes="row-spacer column-gap"):
                        with gr.Column(scale=1):
                            model = gr.Dropdown(
                                ["GPT", "Claude"],
                                value="GPT",
                                label="🤖 Model",
                                elem_classes=["label-box"],
                                elem_id="custom-dropdown",
                            )

                        with gr.Column(scale=1):
                            num_samples = gr.Slider(
                                minimum=10,
                                maximum=1000,
                                value=10,
                                step=1,
                                interactive=True,
                                label="🔢 Number of Samples",
                                elem_classes=["label-box"],
                            )

                    # Hidden file component for dataset download
                    file_download = gr.File(
                        visible=False, elem_id="download-box", label=None
                    )

                    # Component to display status messages
                    status_message = gr.Markdown("", label="Status")

                    # Button to trigger dataset generation
                    run_btn = gr.Button("Create a dataset", elem_id="run-btn")
                    run_btn.click(
                        pipeline.generate,
                        inputs=[
                            business_problem,
                            dataset_type,
                            output_format,
                            num_samples,
                            model,
                        ],
                        outputs=[file_download, run_btn, status_message],
                    )

                # Bottom: version info
                gr.Markdown(
                    f"""
                    <p class="version-banner">
                        🔖 <strong>
                            <a href="{REPO_URL}/blob/main/CHANGELOG.md"
                               target="_blank">Version {VERSION}</a>
                        </strong>
                    </p>
                    """
                )

        return ui

    except Exception as e:
        logger.error("❌ Error building UI: %s", e)
        # Return a minimal error UI
        with gr.Blocks() as error_ui:
            gr.Markdown("# Error Loading Application")
            gr.Markdown(f"An error occurred: {str(e)}")
        return error_ui
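The `update_output_format` handler in src/ui.py is just a dataset-type-to-formats mapping wrapped in `gr.update()`. The mapping itself can be sketched and tested without Gradio (the names `FORMAT_CHOICES` and `format_choices` are illustrative, not part of the module):

```python
# Dependency-free sketch of the dataset-type -> output-format mapping that
# update_output_format applies to the dropdown via gr.update().
FORMAT_CHOICES = {
    "Tabular": ["JSON", "CSV", "Parquet"],
    "Time-series": ["JSON", "CSV", "Parquet"],
    "Text": ["JSON", "Markdown"],
}


def format_choices(dataset_type):
    """Return the output formats offered for a given dataset type."""
    # Default to JSON-only if an unexpected type slips through
    return FORMAT_CHOICES.get(dataset_type, ["JSON"])


print(format_choices("Text"))  # → ['JSON', 'Markdown']
```

Keeping the mapping in a plain dict like this makes it easy to unit-test the choice logic separately from the UI wiring.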
src/utils.py
ADDED
@@ -0,0 +1,78 @@
"""Utility functions for extracting and executing Python code from LLM responses."""

import logging
import os
import re
import subprocess
import sys

# Set up logger
logger = logging.getLogger(__name__)
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)


def extract_code(text):
    """Extract the first Python code block from LLM response text."""
    try:
        # Search for a fenced Python code block using regex
        match = re.search(r"```python(.*?)```", text, re.DOTALL)
        if match:
            # group(1) is the block body, already free of the markdown fences
            return match.group(1).strip()
        logger.warning("No matching code block found.")
        return ""
    except Exception as e:
        logger.error(f"Code extraction error: {e}")
        raise


def extract_file_path(code_str):
    """Extract a file path from a code string containing an os.path.join() call."""
    try:
        # Look for an os.path.join() call with two string-literal arguments
        pattern = r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)'
        match = re.search(pattern, code_str)
        if match:
            folder = match.group(1)
            filename = match.group(2)
            return os.path.join(folder, filename)

        logger.error("No file path found.")
        return None
    except Exception as e:
        logger.error(f"File path extraction error: {e}")
        raise


def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
    """Execute extracted Python code in a subprocess and return the file path."""
    if not python_interpreter:
        raise OSError("Python interpreter not found.")

    # Extract the Python code from the input text
    code_str = extract_code(text)

    # Prepare the subprocess command
    command = [python_interpreter, "-c", code_str]

    try:
        # Execute the code in a subprocess; check=True raises on failure,
        # so the result object itself is not needed here
        subprocess.run(command, check=True, capture_output=True, text=True)

        # Extract the output file path from the executed code
        file_path = extract_file_path(code_str)
        logger.info("✅ Extracted file path: %s", file_path)

        return file_path
    except subprocess.CalledProcessError as e:
        # Log and return error information if subprocess execution fails
        logger.error("Execution error:\n%s", e.stderr.strip())
        return (f"Execution error:\n{e.stderr.strip()}", None)
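The fence-matching regex in `extract_code` can be exercised in isolation. A small sketch (the triple-backtick fence is built programmatically only so this snippet's own code block survives rendering; `extract_python_block` is a stand-in name):

```python
import re

FENCE = "`" * 3  # triple backtick, assembled to avoid nesting literal fences


def extract_python_block(text):
    # Same idea as extract_code: grab the body of the first fenced python block
    match = re.search(FENCE + r"python(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1).strip() if match else ""


response = f"Here is your script:\n{FENCE}python\nprint('hello')\n{FENCE}\nDone."
print(extract_python_block(response))  # → print('hello')
```

Capturing the body with `group(1)` avoids stripping fence characters out of the code itself, which a blanket string replace on `group(0)` would do if the generated code happened to contain backticks.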
uv.lock
ADDED
The diff for this file is too large to render. See raw diff.
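The core of `execute_code_in_virtualenv` in src/utils.py above is running the extracted code string under the current interpreter with `subprocess.run`; a minimal sketch of that call pattern:

```python
import subprocess
import sys

# Run a generated code string under the current interpreter, as
# execute_code_in_virtualenv does; check=True raises CalledProcessError
# on a non-zero exit, and capture_output collects stdout/stderr.
code_str = "print('hello from generated code')"
result = subprocess.run(
    [sys.executable, "-c", code_str],
    check=True,
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # → hello from generated code
```

Executing via `sys.executable -c` keeps the generated script in the same environment (and installed packages) as the host application, which is why the Dockerfile's `uv sync` environment is enough for the generated data-writing scripts too.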