---
title: datum
app_file: app.py
sdk: gradio
sdk_version: 5.44.1
---

# Datum - AI-Powered Data Analysis Agent

A simple yet powerful data analysis agent that uses AI to generate SQL queries, execute them against your data, and provide visualizations and insights through a web interface.

## Features

- **Natural Language Queries**: Ask questions about your data in plain English
- **Auto Routing (Chat vs SQL)**: The agent decides between a quick chat reply and a full SQL/database analysis
- **AI-Generated SQL**: Automatically converts questions into SQL queries
- **Data Visualization**: Generates charts and graphs from query results
- **Intelligent Insights**: Provides narrative analysis and recommendations
- **Web Interface**: Clean, user-friendly Gradio interface
- **DuckDB Integration**: Fast, in-memory SQL database for data analysis
- **LangSmith Tracing**: Built-in observability and debugging with LangSmith integration

## Project Structure

```
datum/
├── app.py                  # Main application with LangGraph workflow
├── builder/
│   ├── graph_builder.py    # Graph with router + conditional edges
│   ├── nodes.py            # Agent nodes (decider, chat, SQL, charting, narration)
│   ├── state.py            # Typed state definition for the agent
│   └── ui.py               # Gradio UI wiring
├── clients/
│   └── llm.py              # LLM configuration (Google Gemini)
├── datastore/
│   └── db.py               # DuckDB setup and data loading
├── utils/
│   ├── charts.py           # Chart generation utilities
│   ├── insight_utils.py    # Insight helpers
│   └── tracer_utils.py     # LangSmith tracing helpers
├── sample_data/            # Sample datasets
│   ├── sales.csv
│   ├── marketing_spend.csv
│   └── customers.csv
└── requirements.txt        # Python dependencies
```

## Setup Instructions

### Prerequisites

- Python 3.8 or higher
- A Google API key for Gemini

### Installation

1. **Clone the repository**

   ```bash
   git clone <repository-url>
   cd datum
   ```

2. **Create a virtual environment**

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

4. **Set up environment variables.** Create a `.env` file in the project root:

   ```
   GOOGLE_API_KEY=your_google_api_key_here
   LANGCHAIN_PROJECT=datum-analysis  # Optional: for LangSmith tracing
   LANGCHAIN_API_KEY=your_langsmith_api_key  # Optional: for LangSmith tracing
   LANGCHAIN_TRACING_V2=true  # Optional: enable LangSmith tracing
   ```

5. **Run the application**

   ```bash
   python app.py
   ```

6. **Access the web interface.** Open your browser and navigate to the URL shown in the terminal (typically http://127.0.0.1:7860).
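The LangSmith variables in step 4 are optional; a minimal stdlib sketch of the kind of check an app could use to decide whether tracing is configured (the helper name `tracing_enabled` is hypothetical, not from `tracer_utils.py`):

```python
import os

def tracing_enabled() -> bool:
    """Return True when LangSmith tracing is configured via environment
    variables (hypothetical helper; the real app may read these differently)."""
    return (
        os.getenv("LANGCHAIN_TRACING_V2", "").lower() == "true"
        and bool(os.getenv("LANGCHAIN_API_KEY"))
    )

# Tracing stays off unless both variables are present.
os.environ.pop("LANGCHAIN_TRACING_V2", None)
print(tracing_enabled())  # False without the variables set

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "dummy-key"
print(tracing_enabled())  # True once both are set
```

This matches the behavior described under Observability & Debugging: without the variables the app simply runs untraced.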

## Usage

1. **Ask a question**: Type your data analysis question in natural language.
   - Example: "What are the top 3 regions by revenue?"
   - Example: "Show me marketing spend by channel"
   - Example: "Which products have the highest unit sales?"
2. **The agent chooses the path automatically**:
   - **Chat route**: a direct conversational answer when no database analysis is needed
   - **SQL route**: the agent generates SQL and provides:
     - Query Result (table)
     - Chart (visualization)
     - Insights (narrative + recommendation)
     - SQL (for transparency)

## Routing at a Glance

The `decider` node analyzes your question and sets a route of `chat` or `sql`. The graph then either calls `general_chat` or runs the SQL flow (`sql_generator` → `sql_executor` → `chart_generator` + `narrator`).
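The conditional routing can be sketched without LangGraph: a plain-function version in which the decider heuristic and the node bodies are placeholders (the real nodes live in `builder/nodes.py` and call the LLM):

```python
# Framework-free sketch of the decider -> conditional-edge flow.
# The real project wires these as LangGraph nodes in builder/graph_builder.py;
# here each node is a plain function over a shared state dict.

def decider(state: dict) -> dict:
    # Placeholder heuristic; the real decider asks the LLM to classify.
    data_words = ("revenue", "sales", "top", "average", "count", "spend")
    route = "sql" if any(w in state["question"].lower() for w in data_words) else "chat"
    return {**state, "route": route}

def general_chat(state: dict) -> dict:
    return {**state, "answer": "chat reply"}

def sql_flow(state: dict) -> dict:
    # Stands in for sql_generator -> sql_executor -> chart_generator + narrator.
    return {**state, "answer": "sql analysis"}

def run(question: str) -> dict:
    state = decider({"question": question})
    # Conditional edge: dispatch on the route the decider chose.
    return sql_flow(state) if state["route"] == "sql" else general_chat(state)

print(run("What are the top 3 regions by revenue?")["route"])  # sql
print(run("Hello, what can you do?")["route"])                 # chat
```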

## Sample Data

The project includes sample datasets:

- **Sales**: Date, region, product, revenue, units sold
- **Marketing Spend**: Date, region, channel, spend amount
- **Customers**: Customer ID, region, age, income
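To generate valid SQL, the LLM needs a description of these tables. As an illustrative stdlib sketch (the helper and the inline rows are assumptions, not the actual `db.py`/`nodes.py` code), a schema string can be derived from each CSV header:

```python
import csv
import io

def schema_from_csv(table_name, csv_text):
    """Describe a table by its CSV header row (illustrative helper;
    the real project loads these files into DuckDB in datastore/db.py)."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return f"{table_name}({', '.join(header)})"

# Stand-ins for the first lines of the sample_data/ files.
sales_csv = "date,region,product,revenue,units_sold\n2024-01-01,North,Widget,1200,30\n"
customers_csv = "customer_id,region,age,income\nC001,North,34,72000\n"

print(schema_from_csv("sales", sales_csv))        # sales(date, region, product, revenue, units_sold)
print(schema_from_csv("customers", customers_csv))
```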

## Technology Stack

- **LangGraph**: Workflow orchestration
- **Google Gemini**: AI language model
- **DuckDB**: In-memory SQL database
- **Gradio**: Web interface
- **Matplotlib**: Chart generation
- **Pandas**: Data manipulation
- **LangSmith**: Observability and tracing platform

## Customization

- **Add your own data**: Replace the CSV files in the `sample_data/` directory and update the schema in `nodes.py`
- **Modify the LLM**: Change the model or provider in `llm.py`
- **Customize charts**: Modify the chart generation logic in `charts.py`
- **Extend the workflow**: Add new nodes to the LangGraph workflow in `app.py`
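When customizing chart generation, a simple shape-based heuristic is one way to pick a chart type for a query result. This sketch's function name and rules are assumptions, not the project's actual `charts.py` logic:

```python
def pick_chart_type(columns, rows):
    """Guess a chart type from a query result's shape (illustrative
    heuristic; the real charts.py may use different rules)."""
    if not rows or len(columns) < 2:
        return "table"          # nothing to plot, show the raw result
    first_col = columns[0].lower()
    if "date" in first_col or "month" in first_col:
        return "line"           # time on the x-axis reads best as a line
    if len(rows) <= 10:
        return "bar"            # few categories: bar chart
    return "scatter"            # many rows: fall back to a scatter plot

print(pick_chart_type(["region", "revenue"], [("North", 1200), ("South", 900)]))  # bar
print(pick_chart_type(["date", "revenue"], [("2024-01", 100)] * 12))              # line
```

The selected type would then map to the corresponding Matplotlib call when rendering.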

## Observability & Debugging

The application includes built-in LangSmith tracing for monitoring and debugging:

- **Trace Execution**: All agent steps are automatically traced and logged
- **Performance Monitoring**: Track execution times and token usage
- **Debug Information**: View detailed logs of SQL generation, execution, and LLM calls
- **Project Organization**: Traces are organized by project name for easy filtering

To enable tracing, set the LangSmith environment variables in your `.env` file. Without them, the application runs normally, just without tracing.

## Troubleshooting

- **API Key Error**: Ensure `GOOGLE_API_KEY` is set correctly in the `.env` file
- **Import Errors**: Make sure all dependencies are installed with `pip install -r requirements.txt`
- **Data Issues**: Verify your CSV files are in the correct format and location
- **Tracing Issues**: Check your LangSmith credentials if you want to use the observability features

## License

This project is open source and available under the MIT License.