lisekarimi committed
Commit db17eb5 · 1 Parent(s): 62587cb

Deploy version 0.1.0

Files changed (14):
  1. Dockerfile +29 -0
  2. README.md +77 -7
  3. assets/styles.css +237 -0
  4. main.py +14 -0
  5. pyproject.toml +35 -0
  6. src/__init__.py +0 -0
  7. src/constants.py +34 -0
  8. src/datagen.py +51 -0
  9. src/models.py +61 -0
  10. src/pipeline.py +88 -0
  11. src/prompts.py +78 -0
  12. src/ui.py +184 -0
  13. src/utils.py +78 -0
  14. uv.lock +0 -0
Dockerfile ADDED
@@ -0,0 +1,29 @@
+FROM python:3.11-slim
+
+# Install uv
+RUN pip install uv
+
+WORKDIR /app
+
+# Copy dependency files first (changes rarely)
+COPY pyproject.toml uv.lock ./
+
+# Put venv outside of /app so it won't be affected by volume mounts
+ENV UV_PROJECT_ENVIRONMENT=/opt/venv
+
+# Install dependencies (this will now create venv at /opt/venv)
+RUN uv sync --locked
+
+# Copy all source code
+COPY . .
+
+# Create output directory with proper permissions
+RUN mkdir -p /tmp/output && chmod 777 /tmp/output
+
+# Set output directory environment variable for production
+ENV OUTPUT_DIR=/tmp/output
+
+# Disable UV cache entirely for production
+ENV UV_NO_CACHE=1
+
+CMD ["uv", "run", "main.py"]
README.md CHANGED
@@ -1,11 +1,81 @@
 ---
-title: Datagen
-emoji: 📚
-colorFrom: blue
-colorTo: gray
+title: DataGen
+emoji: 🧬
+colorFrom: indigo
+colorTo: pink
 sdk: docker
-pinned: false
-short_description: ✨ AI-powered platform for generating synthetic datasets
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# 🧬 DataGen: AI-Powered Synthetic Data Generator
+
+Generate realistic synthetic datasets by simply describing what you need.
+
+[🚀 **Try the Live Demo**](https://huggingface.co/spaces/lisekarimi/snapr)
+
+<img src="https://gitlab.com/lisekarimi/datagen/-/raw/main/assets/screenshot.png" alt="DataGen interface" width="450">
+
+## ✨ What DataGen Does
+
+DataGen transforms simple descriptions into structured datasets using AI. Perfect for researchers, data scientists, and developers who need realistic test data fast.
+
+**Key Features:**
+- **Type what you want → get realistic data**
+- **Multiple formats:** CSV, JSON, Parquet, Markdown
+- **Dataset types:** Tables, time-series, text data
+- **AI-powered:** Uses GPT and Claude models
+- **Instant download** with clean, ready-to-use datasets
+
+## 🚀 Quick Start
+
+### Prerequisites
+- Python 3.11+
+- [uv package manager](https://docs.astral.sh/uv/getting-started/installation/)
+
+### Installation
+```bash
+git clone https://github.com/lisekarimi/datagen.git
+cd datagen
+uv sync
+source .venv/bin/activate  # Unix/macOS
+# or .\.venv\Scripts\activate on Windows
+```
+
+### Configuration
+1. Copy `.env.example` to `.env`
+2. Populate it with the required secrets
+
+### Run DataGen
+```bash
+# Local development
+make run
+
+# With hot reload
+make ui
+```
+
+*For complete setup instructions, commands, and development guidelines, see [docs/dev.md](https://gitlab.com/lisekarimi/datagen/-/blob/main/docs/dev.md)*
+
+## 🧑‍💻 How to Use
+
+1. **Describe your data:** "Customer purchase history with demographics"
+2. **Choose format:** CSV, JSON, Parquet, or Markdown
+3. **Select AI model:** GPT or Claude
+4. **Set sample size:** Number of records to generate
+5. **Generate & download** your dataset
+
+## 🛡️ Quality & Security
+
+DataGen maintains high standards with comprehensive test coverage, automated security scanning, and code quality enforcement.
+
+*For CI/CD setup and technical details, see [docs/cicd.md](https://gitlab.com/lisekarimi/datagen/-/blob/main/docs/cicd.md)*
+
+## 📝 Notes
+- Generated files are automatically cleaned up after 5 minutes
+- Supports 10-1000 samples per dataset
+- JSON output includes proper indentation for readability
+- Cross-platform compatibility (Windows, macOS, Linux)
+
+## 📄 License
+
+MIT
assets/styles.css ADDED
@@ -0,0 +1,237 @@
+
+html, body, #app, body > div, .gradio-container {
+    background-color: #0b0e18 !important; /* dark blue */
+    height: 100%;
+    margin: 0;
+    padding: 0;
+    font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
+    display: flex;
+    justify-content: center;
+    align-items: center;
+}
+
+#app-container {
+    background-color: #1d3451 !important;
+    padding: 40px;
+    border-radius: 12px;
+    box-shadow: 0 4px 25px rgba(0, 0, 0, 0.4);
+    max-width: 800px;
+    width: 100%;
+    color: white;
+}
+
+#app-container h4,
+#app-container p,
+#app-container ol,
+#app-container li,
+#app-container strong {
+    font-size: 16px;
+    line-height: 1.6;
+    color: white !important;
+}
+
+#app-title {
+    font-size: 42px;
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    -webkit-background-clip: text;
+    background-clip: text;
+    color: transparent;
+    font-weight: 800;
+    margin-bottom: 5px;
+    text-align: center;
+}
+
+#app-subtitle {
+    font-size: 24px;
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    -webkit-background-clip: text;
+    background-clip: text;
+    color: transparent;
+    font-weight: 600;
+    margin-top: 0;
+    text-align: center;
+}
+
+#intro-text {
+    font-size: 16px;
+    color: white !important;
+    margin-top: 20px;
+    line-height: 1.6;
+}
+
+#learn-more-button {
+    display: flex;
+    justify-content: center;
+    margin-top: 5px;
+}
+
+.button-link {
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    color: white !important;
+    padding: 10px 20px;
+    text-decoration: none;
+    font-weight: bold;
+    border-radius: 8px;
+    transition: opacity 0.3s;
+}
+
+.button-link:hover {
+    opacity: 0.85;
+}
+
+#input-container {
+    background-color: #1f2937;
+    padding: 20px;
+    border-radius: 10px;
+    box-shadow: 0 2px 10px rgba(0, 0, 0, 0.2);
+}
+
+.label-box label {
+    background-color: #1f2937;
+    padding: 4px 10px;
+    border-radius: 8px;
+    display: inline-block;
+    margin-bottom: 6px;
+}
+
+.label-box span {
+    color: white !important;
+}
+
+.label-box {
+    background-color: #1f2937;
+    color: white;
+    padding: 4px 10px;
+    border-radius: 8px;
+    display: inline-block;
+}
+
+#input-container > div {
+    background: #1f2937 !important;
+    background-color: #1f2937 !important;
+    border: none !important;
+    box-shadow: none !important;
+    padding: 10px !important;
+    margin: 0 !important;
+}
+.row-spacer {
+    margin-top: 24px !important;
+}
+
+.column-gap {
+    gap: 16px;
+}
+
+textarea, input[type="text"] {
+    background-color: #374151 !important;
+    color: white !important;
+}
+
+#custom-dropdown .wrap {
+    background-color: #374151 !important;
+    border-radius: 6px;
+}
+
+input[role="listbox"] {
+    color: white !important;
+    background-color: #374151 !important;
+}
+.dropdown-arrow {
+    color: white !important;
+}
+
+ul[role="listbox"] {
+    background-color: #111827 !important; /* dark navy */
+    color: white !important;
+    border-radius: 6px;
+    padding: 4px 0;
+}
+
+ul[role="listbox"] li {
+    color: white !important;
+    padding: 8px 12px;
+}
+
+ul[role="listbox"] li:hover {
+    background-color: #1f2937 !important; /* slightly lighter hover */
+}
+
+ul[role="listbox"] li[aria-selected="true"] {
+    background-color: #111827 !important; /* same dark as others */
+    color: white !important;
+}
+
+input[type="number"] {
+    background-color: #374151;
+    color: white !important;
+}
+
+#business-problem-box {
+    margin-left: 0 !important;
+    margin-right: 0 !important;
+    padding-left: 0 !important;
+    padding-right: 0 !important;
+    width: 100% !important;
+}
+
+#business-problem-box textarea::placeholder {
+    color: #9ca3af !important; /* Tailwind's "gray-400" */
+}
+
+#run-btn {
+    background: linear-gradient(to left, #ff416c, #ff4b2b);
+    color: white !important;
+    font-weight: bold;
+    border: none;
+    padding: 10px 20px;
+    border-radius: 8px;
+    cursor: pointer;
+    transition: background 0.3s ease;
+}
+
+#run-btn:hover {
+    filter: brightness(1.1);
+}
+
+#download-box {
+    background-color: #1f2937;
+    border: 1px solid #3b3b3b;
+    border-radius: 8px;
+    padding: 10px;
+    margin-top: 16px;
+    font-weight: 500;
+}
+
+#download-box a {
+    color: #00c3ff !important;
+    text-decoration: underline;
+    font-weight: bold;
+}
+#download-box td.filename {
+    color: rgb(255, 255, 255) !important;
+}
+
+#download-box .file-preview-holder,
+#download-box .file-preview,
+#download-box table,
+#download-box tr,
+#download-box td {
+    background-color: #1f2937 !important;
+}
+
+#download-box > label {
+    display: none !important;
+}
+
+/* ==== Version ==== */
+.version-banner {
+    text-align: center;
+    font-size: 0.9em;
+}
main.py ADDED
@@ -0,0 +1,14 @@
+"""Entry point for the application."""
+
+import os
+from src.ui import build_ui
+
+demo = build_ui()
+
+# Main application entry point
+if __name__ == "__main__":
+    demo.launch(
+        allowed_paths=["output"],
+        server_name="0.0.0.0",
+        server_port=int(os.environ.get("PORT", 7860)),
+    )
pyproject.toml ADDED
@@ -0,0 +1,35 @@
+[project]
+name = "datagen"
+version = "0.1.0"
+description = "AI-powered platform for generating synthetic datasets"
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+    "anthropic==0.49.0",
+    "gradio",
+    "numpy>=2.2.6",
+    "openai==1.65.5",
+    "pandas>=2.2.3",
+    "pyarrow>=20.0.0",
+    "python-dotenv==1.0.1",
+]
+
+[tool.pytest.ini_options]
+pythonpath = ["."]
+filterwarnings = [
+    "ignore::DeprecationWarning:websockets.legacy",
+]
+
+[tool.ruff.lint]
+select = [
+    "E",   # pycodestyle errors
+    "W",   # pycodestyle warnings
+    "F",   # Pyflakes
+    "D",   # pydocstyle (docstrings)
+    "UP",  # pyupgrade
+    "B",   # flake8-bugbear
+]
+ignore = ["D104"]  # Missing docstring in public package (__init__.py)
+
+[tool.ruff.lint.pydocstyle]
+convention = "google"
src/__init__.py ADDED
File without changes
src/constants.py ADDED
@@ -0,0 +1,34 @@
+# src/constants.py
+"""Constants for configuration across the project."""
+
+import os
+import tomllib
+from pathlib import Path
+import logging
+
+# ==================== PROJECT METADATA ====================
+root = Path(__file__).parent.parent
+with open(root / "pyproject.toml", "rb") as f:
+    pyproject = tomllib.load(f)
+
+PROJECT_NAME = pyproject["project"]["name"]
+VERSION = pyproject["project"]["version"]
+
+# ==================== AI MODEL CONFIG ====================
+OPENAI_MODEL = "gpt-4o-mini"
+CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
+
+# Other constants can go here
+OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "output")
+MAX_TOKENS = 2000
+
+# ==================== LOGGING CONFIG ====================
+
+# Configure logging once
+logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
+
+# Create a shared logger
+logger = logging.getLogger(__name__)
+
+# ==================== FILE MANAGEMENT ====================
+FILE_CLEANUP_SECONDS = 300  # 5 minutes
src/datagen.py ADDED
@@ -0,0 +1,51 @@
+"""Main data generation class for creating synthetic datasets using AI models."""
+
+import os
+from datetime import datetime
+from .prompts import build_user_prompt, system_message
+from .models import get_gpt_completion, get_claude_completion
+from .utils import execute_code_in_virtualenv
+from .constants import OUTPUT_DIR, logger
+
+
+class DataGen:
+    """Handles synthetic data generation using AI models."""
+
+    def __init__(self, output_dir=None):
+        """Initialize the data generator with output directory."""
+        # Use provided output_dir, or fall back to OUTPUT_DIR constant
+        self.output_dir = output_dir or OUTPUT_DIR
+        os.makedirs(self.output_dir, exist_ok=True)
+
+    def get_timestamp(self):
+        """Return current timestamp for file naming."""
+        return datetime.now().strftime("%Y%m%d_%H%M%S")
+
+    def generate_dataset(self, **input_data):
+        """Generate synthetic dataset based on input parameters and model choice."""
+        try:
+            # Ensure output directory exists before generating
+            os.makedirs(self.output_dir, exist_ok=True)
+
+            # Add output directory path to input data for file generation
+            input_data["file_path"] = self.output_dir
+
+            # Build the prompt to send to the selected LLM
+            prompt = build_user_prompt(**input_data)
+
+            # Call the selected LLM based on the model parameter
+            if input_data["model"] == "GPT":
+                code = get_gpt_completion(prompt, system_message)
+            elif input_data["model"] == "Claude":
+                code = get_claude_completion(prompt, system_message)
+            else:
+                raise ValueError("Invalid model selected.")
+
+            # Execute the generated code and return the output file path
+            file_path = execute_code_in_virtualenv(code)
+            return file_path
+
+        except Exception as e:
+            # Log and re-raise any errors that occur during generation
+            logger.error(f"Error in generate_dataset: {e}")
+            raise
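`get_timestamp` above pins down the file-naming convention used throughout the project: `YYYYMMDD_HHMMSS`. A quick sketch of the format string on a fixed datetime (so the output is deterministic):

```python
from datetime import datetime

# Same format string as DataGen.get_timestamp, on a fixed moment in time
ts = datetime(2025, 3, 23, 12, 34, 56).strftime("%Y%m%d_%H%M%S")
print(ts)  # 20250323_123456
```

Because the components are ordered most- to least-significant, filenames built from this stamp sort chronologically with a plain lexicographic sort.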
src/models.py ADDED
@@ -0,0 +1,61 @@
+"""AI model clients and API configuration for OpenAI and Anthropic."""
+
+from openai import OpenAI
+import anthropic
+import os
+from dotenv import load_dotenv
+from .constants import OPENAI_MODEL, CLAUDE_MODEL, MAX_TOKENS, logger
+
+# Load environment variables from .env file
+load_dotenv(override=True)
+
+# Retrieve API keys from environment variables
+openai_api_key = os.getenv("OPENAI_API_KEY")
+anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
+
+# Log an error if any API key is missing
+if not openai_api_key:
+    logger.error("❌ OpenAI API Key is missing!")
+
+if not anthropic_api_key:
+    logger.error("❌ Anthropic API Key is missing!")
+
+# Initialize API clients with the retrieved keys
+openai = OpenAI(api_key=openai_api_key)
+claude = anthropic.Anthropic(api_key=anthropic_api_key)
+
+
+def get_gpt_completion(prompt, system_message):
+    """Call OpenAI's GPT model with prompt and system message."""
+    try:
+        # Create chat completion with system and user messages
+        response = openai.chat.completions.create(
+            model=OPENAI_MODEL,
+            messages=[
+                {"role": "system", "content": system_message},
+                {"role": "user", "content": prompt},
+            ],
+            stream=False,
+        )
+        # Extract and return the generated content
+        return response.choices[0].message.content
+    except Exception as e:
+        logger.error(f"GPT error: {e}")
+        raise
+
+
+def get_claude_completion(prompt, system_message):
+    """Call Anthropic's Claude model with prompt and system message."""
+    try:
+        # Create message with Claude API using system prompt and user message
+        result = claude.messages.create(
+            model=CLAUDE_MODEL,
+            max_tokens=MAX_TOKENS,
+            system=system_message,
+            messages=[{"role": "user", "content": prompt}],
+        )
+        # Extract and return the text content from response
+        return result.content[0].text
+    except Exception as e:
+        logger.error(f"Claude error: {e}")
+        raise
src/pipeline.py ADDED
@@ -0,0 +1,88 @@
+"""Pipeline orchestration for dataset generation."""
+
+import os
+import logging
+import threading
+import gradio as gr
+from src.datagen import DataGen
+from src.constants import FILE_CLEANUP_SECONDS
+
+logger = logging.getLogger(__name__)
+
+
+def safe_delete(file_path):
+    """Safely delete a file, ignoring errors if file doesn't exist."""
+    try:
+        if os.path.exists(file_path):
+            os.remove(file_path)
+    except Exception:
+        pass  # Ignore deletion errors
+
+
+class DatasetPipeline:
+    """Handles the dataset generation pipeline."""
+
+    def __init__(self):
+        """Initialize the pipeline with a DataGen instance."""
+        self.generator = DataGen()
+
+    def generate(
+        self, business_problem, dataset_type, output_format, num_samples, model
+    ):
+        """Generate synthetic dataset based on user inputs."""
+        # Check if business problem is empty
+        if not business_problem.strip():
+            error_msg = "❌ Please enter a business problem before generating."
+            yield [gr.update(visible=False), gr.update(visible=True), error_msg]
+            return
+
+        # Initial feedback while generating
+        yield [
+            gr.update(visible=False),
+            gr.update(visible=False),
+            "⏳ Generating dataset...",
+        ]
+
+        try:
+            # Pack inputs into a dictionary for the generator
+            input_data = {
+                "business_problem": business_problem,
+                "dataset_type": dataset_type,
+                "output_format": output_format,
+                "num_samples": num_samples,
+                "model": model,
+            }
+
+            # Generate dataset file
+            file_path = self.generator.generate_dataset(**input_data)
+
+            # Check if file exists and return success message + file path
+            if isinstance(file_path, str) and os.path.exists(file_path):
+                # Auto-delete after 5min with safe deletion
+                threading.Timer(
+                    FILE_CLEANUP_SECONDS, safe_delete, args=[file_path]
+                ).start()
+                success_update = [
+                    gr.update(value=file_path, visible=True),
+                    gr.update(visible=True),
+                    "✅ Dataset ready for download.",
+                ]
+                yield success_update
+            else:
+                # Handle invalid or missing file
+                error_update = [
+                    gr.update(visible=False),
+                    gr.update(visible=True),
+                    "❌ Error: File not created or path invalid.",
+                ]
+                yield error_update
+
+        except Exception as e:
+            # Catch and display any errors in the pipeline
+            logger.error("Pipeline error: %s", e)
+            error_update = [
+                gr.update(visible=False),
+                gr.update(visible=True),
+                f"❌ Pipeline error: {e}",
+            ]
+            yield error_update
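The `threading.Timer` call above schedules a single `safe_delete` run, `FILE_CLEANUP_SECONDS` after a successful generation. A stand-alone sketch of the same pattern, with a fraction-of-a-second delay in place of the real five minutes:

```python
import os
import tempfile
import threading
import time


def safe_delete(file_path):
    """Delete a file, ignoring errors if it is already gone."""
    try:
        if os.path.exists(file_path):
            os.remove(file_path)
    except OSError:
        pass


# Create a throwaway file and schedule its deletion 0.2s from now
fd, path = tempfile.mkstemp()
os.close(fd)
threading.Timer(0.2, safe_delete, args=[path]).start()

time.sleep(0.5)  # wait past the timer's deadline
print(os.path.exists(path))  # False
```

One caveat of this design: the timers live in the server process, so files queued for deletion survive only as long as the process does — acceptable here since `/tmp/output` is ephemeral anyway.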
src/prompts.py ADDED
@@ -0,0 +1,78 @@
+"""Prompt templates and management for AI model interactions."""
+
+from datetime import datetime
+from src.constants import logger
+
+# System message template that defines AI assistant behavior and rules
+system_message = """
+You are a helpful assistant whose main purpose is to generate synthetic datasets
+based on a given business problem.
+
+🔹 General Guidelines:
+- Be accurate and concise.
+- Use only standard Python libraries (pandas, numpy, os, datetime, etc.)
+- The dataset must contain the requested number of samples.
+- Always respect the requested output format exactly.
+- If multiple entities exist, save each to a separate file.
+- Do not use f-strings anywhere in the code — not in file paths or in content.
+  Use standard string concatenation instead.
+
+🔹 File Path Rules:
+- Define the full file path using os.path.join(...) — exactly as shown —
+  no shortcuts or direct strings.
+- Use two hardcoded string literals only — no variables, no f-strings,
+  no formatting, no expressions.
+- First argument: full directory path (use forward slashes).
+- Second argument: full filename with timestamp and correct extension.
+- Example: os.path.join("C:/Users/.../output", "sales_20250323_123456.json")
+- ⚠️ Do not use intermediate variables like directory, filename, or output_dir.
+- ⚠️ Do not skip or replace any of the above instructions. They are required
+  for the code to work correctly.
+
+🔹 File Saving Instructions:
+
+- ✅ CSV:
+  df.to_csv(file_path, index=False, encoding="utf-8")
+
+- ✅ JSON:
+  with open(file_path, "w", encoding="utf-8") as f:
+      df.to_json(f, orient="records", lines=False, force_ascii=False, indent=2)
+
+- ✅ Parquet:
+  df.to_parquet(file_path, engine="pyarrow", index=False)
+
+- ✅ Markdown (for Text):
+  - Generate properly formatted Markdown content.
+  - Save it as a `.md` file using UTF-8 encoding.
+"""
+
+
+def build_user_prompt(**input_data):
+    """Build user prompt for AI model based on dataset generation parameters."""
+    try:
+        # Normalize file path separators to forward slashes for consistency
+        file_path = input_data["file_path"].replace("\\", "/")
+
+        # Generate timestamp for unique file naming
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+
+        # Construct the user prompt for the LLM with all required parameters
+        user_prompt = (
+            f"Generate a synthetic {input_data['dataset_type'].lower()} "
+            f"dataset in {input_data['output_format'].upper()} format.\n"
+            f"Business problem: {input_data['business_problem']}\n"
+            f"Samples: {input_data['num_samples']}\n"
+            f"Directory: {file_path}\n"
+            f"Timestamp: {timestamp}"
+        )
+
+        return user_prompt
+
+    except KeyError as e:
+        # Handle missing keys in input_data dictionary
+        logger.warning(f"Missing input key: {e}")
+        raise
+    except Exception as e:
+        # Log any other error during prompt building process
+        logger.warning(f"Error in build_user_prompt: {e}")
+        raise
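To see what the model actually receives, here is the prompt builder exercised stand-alone. The function body mirrors `build_user_prompt` above (minus logging); the sample inputs are invented for illustration:

```python
from datetime import datetime


def build_user_prompt(**input_data):
    """Mirror of src/prompts.py build_user_prompt, without logging."""
    # Normalize file path separators to forward slashes
    file_path = input_data["file_path"].replace("\\", "/")
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return (
        f"Generate a synthetic {input_data['dataset_type'].lower()} "
        f"dataset in {input_data['output_format'].upper()} format.\n"
        f"Business problem: {input_data['business_problem']}\n"
        f"Samples: {input_data['num_samples']}\n"
        f"Directory: {file_path}\n"
        f"Timestamp: {timestamp}"
    )


prompt = build_user_prompt(
    business_problem="Customer purchase history with demographics",
    dataset_type="Tabular",
    output_format="csv",
    num_samples=25,
    file_path="output",
)
print(prompt.splitlines()[0])
# Generate a synthetic tabular dataset in CSV format.
```

Note the casing normalization: the dataset type is lowercased and the format uppercased, so UI values like "csv" and "Tabular" always reach the model in a consistent form.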
src/ui.py ADDED
@@ -0,0 +1,184 @@
+"""Gradio web interface for synthetic data generation."""
+
+import logging
+import gradio as gr
+from src.pipeline import DatasetPipeline
+from src.constants import PROJECT_NAME, VERSION
+
+# Set up logger
+logger = logging.getLogger(__name__)
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+
+pipeline = DatasetPipeline()
+
+PROJECT_NAME_CAP = PROJECT_NAME.capitalize()
+REPO_URL = f"https://github.com/lisekarimi/{PROJECT_NAME}"
+
+
+def update_output_format(dataset_type):
+    """Update output format choices based on selected dataset type."""
+    if dataset_type in ["Tabular", "Time-series"]:
+        return gr.update(choices=["JSON", "CSV", "Parquet"], value="JSON")
+    elif dataset_type == "Text":
+        return gr.update(choices=["JSON", "Markdown"], value="JSON")
+
+
+def build_ui(css_path="assets/styles.css"):
+    """Build and return the complete Gradio user interface with error handling."""
+    # Try to load CSS file with error handling
+    try:
+        with open(css_path, encoding="utf-8") as f:
+            css = f.read()
+    except Exception as e:
+        css = ""
+        logger.warning("⚠️ Failed to load CSS: %s", e)
+
+    # Build the UI with error handling
+    try:
+        with gr.Blocks(css=css, title=f"🧬{PROJECT_NAME_CAP}") as ui:
+            with gr.Column(elem_id="app-container"):
+                gr.Markdown(f"<h1 id='app-title'>🏷️ {PROJECT_NAME_CAP} </h1>")
+                gr.Markdown(
+                    "<h2 id='app-subtitle'>AI-Powered Synthetic Dataset Generator</h2>"
+                )
+
+                # Intro text block
+                intro_html = f"""
+                <div id="intro-text">
+                    <p>With {PROJECT_NAME_CAP}, easily generate
+                    <strong>diverse datasets</strong>
+                    for testing, development, and AI training.</p>
+
+                    <h4>🎯 How It Works:</h4>
+                    <p>1️⃣ Define your business problem.</p>
+                    <p>2️⃣ Select dataset type, format, model, and samples.</p>
+                    <p>3️⃣ Download your synthetic dataset!</p>
+                </div>
+                """
+                gr.HTML(intro_html)
+
+                # Learn-more button linking to the repo README
+                learn_more_html = f"""
+                <div id="learn-more-button">
+                    <a href="{REPO_URL}/blob/main/README.md"
+                       class="button-link" target="_blank">Learn More</a>
+                </div>
+                """
+                gr.HTML(learn_more_html)
+
+                examples_md = """
+                <p><strong>🧠 Need inspiration?</strong> Try these examples:</p>
+                <ul>
+                    <li>Movie summaries for genre classification.</li>
+                    <li>Customer chats with dialogue and sentiment labels.</li>
+                    <li>Stock prices with date, ticker, open, close, volume.</li>
+                </ul>
+                """
+                gr.Markdown(examples_md)
+
+                gr.Markdown("<p><strong>Start generating now!</strong> 🗂️✨</p>")
+
+                with gr.Group(elem_id="input-container"):
+                    business_problem = gr.Textbox(
+                        placeholder=(
+                            "Describe the dataset you want "
+                            "(e.g., Job postings, Customer reviews)"
+                        ),
+                        lines=2,
+                        label="📌 Business Problem",
+                        elem_classes=["label-box"],
+                        elem_id="business-problem-box",
+                    )
+
+                    with gr.Row(elem_classes="column-gap"):
+                        with gr.Column(scale=1):
+                            dataset_type = gr.Dropdown(
+                                ["Tabular", "Time-series", "Text"],
+                                value="Tabular",
+                                label="📊 Dataset Type",
+                                elem_classes=["label-box"],
+                                elem_id="custom-dropdown",
+                            )
+
+                        with gr.Column(scale=1):
+                            output_format = gr.Dropdown(
+                                choices=["JSON", "CSV", "Parquet"],
+                                value="JSON",
+                                label="📁 Output Format",
+                                elem_classes=["label-box"],
+                                elem_id="custom-dropdown",
+                            )
+
+                    # Bind the update function to the dataset type dropdown
+                    dataset_type.change(
+                        update_output_format,
+                        inputs=[dataset_type],
+                        outputs=[output_format],
+                    )
+
+                    with gr.Row(elem_classes="row-spacer column-gap"):
+                        with gr.Column(scale=1):
+                            model = gr.Dropdown(
+                                ["GPT", "Claude"],
+                                value="GPT",
+                                label="🤖 Model",
+                                elem_classes=["label-box"],
+                                elem_id="custom-dropdown",
+                            )
+
+                        with gr.Column(scale=1):
+                            num_samples = gr.Slider(
+                                minimum=10,
+                                maximum=1000,
+                                value=10,
+                                step=1,
+                                interactive=True,
+                                label="🔢 Number of Samples",
+                                elem_classes=["label-box"],
+                            )
+
+                    # Hidden file component for dataset download
+                    file_download = gr.File(
+                        visible=False, elem_id="download-box", label=None
+                    )
+
+                    # Component to display status messages
+                    status_message = gr.Markdown("", label="Status")
+
+                    # Button to trigger dataset generation
+                    run_btn = gr.Button("Create a dataset", elem_id="run-btn")
+                    run_btn.click(
+                        pipeline.generate,
+                        inputs=[
+                            business_problem,
+                            dataset_type,
+                            output_format,
+                            num_samples,
+                            model,
+                        ],
+                        outputs=[file_download, run_btn, status_message],
+                    )
+
+                # Bottom: version info
+                gr.Markdown(
+                    f"""
+                    <p class="version-banner">
+                    🔖 <strong>
+                    <a href="{REPO_URL}/blob/main/CHANGELOG.md"
+                       target="_blank">Version {VERSION}</a>
+                    </strong>
+                    </p>
+                    """
+                )
+
+        return ui
+
+    except Exception as e:
+        logger.error("❌ Error building UI: %s", e)
+        # Return a minimal error UI
+        with gr.Blocks() as error_ui:
+            gr.Markdown("# Error Loading Application")
+            gr.Markdown(f"An error occurred: {str(e)}")
+        return error_ui
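The `update_output_format` callback above is the one dynamic piece of the form: changing the dataset type swaps the allowed output formats. The mapping itself, pulled out of the `gr.update` wrapper so it runs without Gradio (the defensive default branch is an addition for illustration, not in the original):

```python
def output_choices(dataset_type):
    """Format choices per dataset type (mirrors update_output_format in src/ui.py)."""
    if dataset_type in ("Tabular", "Time-series"):
        return ["JSON", "CSV", "Parquet"]
    elif dataset_type == "Text":
        return ["JSON", "Markdown"]
    return ["JSON"]  # defensive default; the original returns nothing here


print(output_choices("Text"))  # ['JSON', 'Markdown']
```

Keeping the mapping in a plain function like this would also make it unit-testable without spinning up the UI.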
src/utils.py ADDED
@@ -0,0 +1,78 @@
+"""Utility functions for extracting and executing Python code from LLM responses."""
+
+import re
+import os
+import subprocess
+import sys
+import logging
+
+# Set up logger
+logger = logging.getLogger(__name__)
+logging.basicConfig(
+    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
+)
+
+
+def extract_code(text):
+    """Extract Python code block from LLM response text."""
+    try:
+        # Search for Python code block using regex
+        match = re.search(r"```python(.*?)```", text, re.DOTALL)
+        if match:
+            code = match.group(0).strip()
+        else:
+            code = ""
+            logger.warning("No matching code block found.")
+
+        # Clean up markdown formatting
+        return code.replace("```python\n", "").replace("```", "")
+    except Exception as e:
+        logger.error(f"Code extraction error: {e}")
+        raise
+
+
+def extract_file_path(code_str):
+    """Extract file path from code string containing os.path.join() calls."""
+    try:
+        # Look for os.path.join() pattern with two string arguments
+        pattern = r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)'
+        match = re.search(pattern, code_str)
+        if match:
+            folder = match.group(1)
+            filename = match.group(2)
+            return os.path.join(folder, filename)
+
+        logger.error("No file path found.")
+        return None
+    except Exception as e:
+        logger.error(f"File path extraction error: {e}")
+        raise
+
+
+def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
+    """Execute extracted Python code in a subprocess and return the file path."""
+    if not python_interpreter:
+        raise OSError("Python interpreter not found.")
+
+    # Extract the Python code from the input text
+    code_str = extract_code(text)
+
+    # Prepare subprocess command
+    command = [python_interpreter, "-c", code_str]
+
+    try:
+        # Execute the code in a subprocess; check=True raises if it fails
+        subprocess.run(command, check=True, capture_output=True, text=True)
+
+        # Extract file path from the executed code
+        file_path = extract_file_path(code_str)
+        logger.info("✅ Extracted file path: %s", file_path)
+
+        return file_path
+    except subprocess.CalledProcessError as e:
+        # Return error information if subprocess execution fails
+        return (f"Execution error:\n{e.stderr.strip()}", None)
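The two regexes above — one for the ```` ```python ```` fence, one for the `os.path.join(...)` call the system prompt mandates — can be exercised on a mock LLM reply. The helpers are mirrored from this module; the reply text is invented:

```python
import os
import re


def extract_code(text):
    """Pull the python fence out of an LLM reply (mirrors src/utils.py)."""
    match = re.search(r"```python(.*?)```", text, re.DOTALL)
    code = match.group(0).strip() if match else ""
    return code.replace("```python\n", "").replace("```", "")


def extract_file_path(code_str):
    """Recover the os.path.join(...) target (mirrors src/utils.py)."""
    pattern = r'os\.path\.join\(\s*["\'](.+?)["\']\s*,\s*["\'](.+?)["\']\s*\)'
    match = re.search(pattern, code_str)
    return os.path.join(match.group(1), match.group(2)) if match else None


# Invented stand-in for a model reply
reply = (
    "Here is your script:\n"
    "```python\n"
    'path = os.path.join("output", "sales_20250323_123456.csv")\n'
    "```\n"
)
code = extract_code(reply)
print(extract_file_path(code))
```

This also shows why the system prompt insists on two hardcoded string literals inside `os.path.join`: the file-path regex cannot see through variables or f-strings, so any deviation by the model would leave `extract_file_path` returning `None`.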
uv.lock ADDED
The diff for this file is too large to render. See raw diff