---
title: datum
app_file: app.py
sdk: gradio
sdk_version: 5.44.1
---
# Datum - AI-Powered Data Analysis Agent

A simple yet powerful data analysis agent that uses AI to generate SQL queries, execute them against your data, and provide visualizations and insights through a web interface.

## Features

- **Natural Language Queries**: Ask questions about your data in plain English
- **Auto Routing (Chat vs SQL)**: Agent decides between a quick chat reply or full SQL/database analysis
- **AI-Generated SQL**: Automatically converts questions into SQL queries
- **Data Visualization**: Generates charts and graphs from query results
- **Intelligent Insights**: Provides narrative analysis and recommendations
- **Web Interface**: Clean, user-friendly Gradio interface
- **DuckDB Integration**: Fast, in-memory SQL database for data analysis
- **LangSmith Tracing**: Built-in observability and debugging with LangSmith integration

## Project Structure

```
datum/
├── app.py                  # Main application with LangGraph workflow
├── builder/
│   ├── graph_builder.py    # Graph with router + conditional edges
│   ├── nodes.py            # Agent nodes (decider, chat, SQL, charting, narration)
│   ├── state.py            # Typed state definition for the agent
│   └── ui.py               # Gradio UI wiring
├── clients/
│   └── llm.py              # LLM configuration (Google Gemini)
├── datastore/
│   └── db.py               # DuckDB setup and data loading
├── utils/
│   ├── charts.py           # Chart generation utilities
│   ├── insight_utils.py    # Insight helpers
│   └── tracer_utils.py     # LangSmith tracing helpers
├── sample_data/            # Sample datasets
│   ├── sales.csv
│   ├── marketing_spend.csv
│   └── customers.csv
└── requirements.txt        # Python dependencies
```

## Setup Instructions

### Prerequisites

- Python 3.8 or higher
- Google API key for Gemini AI

### Installation

1. **Clone the repository**
   ```bash
   git clone <repository-url>
   cd datum
   ```

2. **Create a virtual environment**
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```

4. **Set up environment variables**
   Create a `.env` file in the project root:
   ```bash
   GOOGLE_API_KEY=your_google_api_key_here
   LANGCHAIN_PROJECT=datum-analysis  # Optional: for LangSmith tracing
   LANGCHAIN_API_KEY=your_langsmith_api_key  # Optional: for LangSmith tracing
   LANGCHAIN_TRACING_V2=true  # Optional: enable LangSmith tracing
   ```

5. **Run the application**
   ```bash
   python app.py
   ```

6. **Access the web interface**
   Open your browser and navigate to the URL shown in the terminal (typically `http://127.0.0.1:7860`)

## Usage

1. **Ask a question**: Type your data analysis question in natural language
   - Example: "What are the top 3 regions by revenue?"
   - Example: "Show me marketing spend by channel"
   - Example: "Which products have the highest unit sales?"

2. **Agent chooses the path automatically**
   - **Chat route**: Direct conversational answer when no database analysis is needed
   - **SQL route**: The agent generates SQL and provides:
     - **Query Result** (table)
     - **Chart** (visualization)
     - **Insights** (narrative + recommendation)
     - **SQL** (for transparency)

### Routing at a Glance
The `decider` node analyzes your question and sets a `route` of `chat` or `sql`. The graph then either calls `general_chat` or runs the SQL flow (`sql_generator` → `sql_executor` → `chart_generator` + `narrator`).

## Sample Data

The project includes sample datasets:
- **Sales**: Date, region, product, revenue, units sold
- **Marketing Spend**: Date, region, channel, spend amount
- **Customers**: Customer ID, region, age, income

## Technology Stack

- **LangGraph**: Workflow orchestration
- **Google Gemini**: AI language model
- **DuckDB**: In-memory SQL database
- **Gradio**: Web interface
- **Matplotlib**: Chart generation
- **Pandas**: Data manipulation
- **LangSmith**: Observability and tracing platform

## Customization

- **Add your own data**: Replace CSV files in the `sample_data/` directory and update the schema in `nodes.py`
- **Modify the LLM**: Change the model or provider in `llm.py`
- **Customize charts**: Modify chart generation logic in `charts.py`
- **Extend the workflow**: Add new nodes to the LangGraph workflow in `app.py`

## Observability & Debugging

The application includes built-in LangSmith tracing for monitoring and debugging:

- **Trace Execution**: All agent steps are automatically traced and logged
- **Performance Monitoring**: Track execution times and token usage
- **Debug Information**: View detailed logs of SQL generation, execution, and LLM calls
- **Project Organization**: Traces are organized by project name for easy filtering

To enable tracing, set the LangSmith environment variables in your `.env` file. Without these variables, the application will run normally but without tracing capabilities.

## Troubleshooting

- **API Key Error**: Ensure your `GOOGLE_API_KEY` is set correctly in the `.env` file
- **Import Errors**: Make sure all dependencies are installed with `pip install -r requirements.txt`
- **Data Issues**: Verify your CSV files are in the correct format and location
- **Tracing Issues**: Check LangSmith credentials if you want to use the observability features

## License

This project is open source and available under the MIT License.