File size: 5,827 Bytes
9cccf74
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
886eb32
9cccf74
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
886eb32
9cccf74
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
---

title: datum
app_file: app.py
sdk: gradio
sdk_version: 5.44.1
---

# Datum - AI-Powered Data Analysis Agent

A simple yet powerful data analysis agent that uses AI to generate SQL queries, execute them against your data, and provide visualizations and insights through a web interface.

## Features

- **Natural Language Queries**: Ask questions about your data in plain English
- **Auto Routing (Chat vs SQL)**: Agent decides between a quick chat reply or full SQL/database analysis
- **AI-Generated SQL**: Automatically converts questions into SQL queries
- **Data Visualization**: Generates charts and graphs from query results
- **Intelligent Insights**: Provides narrative analysis and recommendations
- **Web Interface**: Clean, user-friendly Gradio interface
- **DuckDB Integration**: Fast, in-memory SQL database for data analysis
- **LangSmith Tracing**: Built-in observability and debugging with LangSmith integration

## Project Structure

```

datum/

β”œβ”€β”€ app.py                  # Main application with LangGraph workflow

β”œβ”€β”€ builder/

β”‚   β”œβ”€β”€ graph_builder.py    # Graph with router + conditional edges

β”‚   β”œβ”€β”€ nodes.py            # Agent nodes (decider, chat, SQL, charting, narration)

β”‚   β”œβ”€β”€ state.py            # Typed state definition for the agent

β”‚   └── ui.py               # Gradio UI wiring

β”œβ”€β”€ clients/

β”‚   └── llm.py              # LLM configuration (Google Gemini)

β”œβ”€β”€ datastore/

β”‚   └── db.py               # DuckDB setup and data loading

β”œβ”€β”€ utils/

β”‚   β”œβ”€β”€ charts.py           # Chart generation utilities

β”‚   β”œβ”€β”€ insight_utils.py    # Insight helpers

β”‚   └── tracer_utils.py     # LangSmith tracing helpers

β”œβ”€β”€ sample_data/            # Sample datasets

β”‚   β”œβ”€β”€ sales.csv

β”‚   β”œβ”€β”€ marketing_spend.csv

β”‚   └── customers.csv

└── requirements.txt        # Python dependencies

```

## Setup Instructions

### Prerequisites

- Python 3.8 or higher
- Google API key for Gemini AI

### Installation

1. **Clone the repository**
   ```bash

   git clone <repository-url>

   cd datum

   ```

2. **Create a virtual environment**
   ```bash

   python -m venv venv

   source venv/bin/activate  # On Windows: venv\Scripts\activate

   ```

3. **Install dependencies**
   ```bash

   pip install -r requirements.txt

   ```

4. **Set up environment variables**
   Create a `.env` file in the project root:
   ```bash

   GOOGLE_API_KEY=your_google_api_key_here

   LANGCHAIN_PROJECT=datum-analysis  # Optional: for LangSmith tracing

   LANGCHAIN_API_KEY=your_langsmith_api_key  # Optional: for LangSmith tracing

   LANGCHAIN_TRACING_V2=true  # Optional: enable LangSmith tracing

   ```

5. **Run the application**
   ```bash

   python app.py

   ```

6. **Access the web interface**
   Open your browser and navigate to the URL shown in the terminal (typically `http://127.0.0.1:7860`)

## Usage

1. **Ask a question**: Type your data analysis question in natural language
   - Example: "What are the top 3 regions by revenue?"
   - Example: "Show me marketing spend by channel"
   - Example: "Which products have the highest unit sales?"

2. **Agent chooses the path automatically**
   - **Chat route**: Direct conversational answer when no database analysis is needed
   - **SQL route**: The agent generates SQL and provides:
     - **Query Result** (table)
     - **Chart** (visualization)
     - **Insights** (narrative + recommendation)
     - **SQL** (for transparency)

### Routing at a Glance
The `decider` node analyzes your question and sets a `route` of `chat` or `sql`. The graph then either calls `general_chat` or runs the SQL flow (`sql_generator` β†’ `sql_executor` β†’ `chart_generator` + `narrator`).

## Sample Data

The project includes sample datasets:
- **Sales**: Date, region, product, revenue, units sold
- **Marketing Spend**: Date, region, channel, spend amount
- **Customers**: Customer ID, region, age, income

## Technology Stack

- **LangGraph**: Workflow orchestration
- **Google Gemini**: AI language model
- **DuckDB**: In-memory SQL database
- **Gradio**: Web interface
- **Matplotlib**: Chart generation
- **Pandas**: Data manipulation
- **LangSmith**: Observability and tracing platform

## Customization

- **Add your own data**: Replace CSV files in the `sample_data/` directory and update the schema in `nodes.py`
- **Modify the LLM**: Change the model or provider in `llm.py`
- **Customize charts**: Modify chart generation logic in `charts.py`
- **Extend the workflow**: Add new nodes to the LangGraph workflow in `app.py`

## Observability & Debugging

The application includes built-in LangSmith tracing for monitoring and debugging:

- **Trace Execution**: All agent steps are automatically traced and logged
- **Performance Monitoring**: Track execution times and token usage
- **Debug Information**: View detailed logs of SQL generation, execution, and LLM calls
- **Project Organization**: Traces are organized by project name for easy filtering

To enable tracing, set the LangSmith environment variables in your `.env` file. Without these variables, the application will run normally but without tracing capabilities.

## Troubleshooting

- **API Key Error**: Ensure your `GOOGLE_API_KEY` is set correctly in the `.env` file
- **Import Errors**: Make sure all dependencies are installed with `pip install -r requirements.txt`
- **Data Issues**: Verify your CSV files are in the correct format and location
- **Tracing Issues**: Check LangSmith credentials if you want to use the observability features

## License

This project is open source and available under the MIT License.