---
title: datum
app_file: app.py
sdk: gradio
sdk_version: 5.44.1
---
# Datum - AI-Powered Data Analysis Agent
A simple yet powerful data analysis agent that uses AI to generate SQL queries, execute them against your data, and provide visualizations and insights through a web interface.
## Features
- **Natural Language Queries**: Ask questions about your data in plain English
- **Auto Routing (Chat vs SQL)**: Agent decides between a quick chat reply or full SQL/database analysis
- **AI-Generated SQL**: Automatically converts questions into SQL queries
- **Data Visualization**: Generates charts and graphs from query results
- **Intelligent Insights**: Provides narrative analysis and recommendations
- **Web Interface**: Clean, user-friendly Gradio interface
- **DuckDB Integration**: Fast, in-memory SQL database for data analysis
- **LangSmith Tracing**: Built-in observability and debugging with LangSmith integration
## Project Structure
```
datum/
├── app.py                   # Main application with LangGraph workflow
├── builder/
│   ├── graph_builder.py     # Graph with router + conditional edges
│   ├── nodes.py             # Agent nodes (decider, chat, SQL, charting, narration)
│   ├── state.py             # Typed state definition for the agent
│   └── ui.py                # Gradio UI wiring
├── clients/
│   └── llm.py               # LLM configuration (Google Gemini)
├── datastore/
│   └── db.py                # DuckDB setup and data loading
├── utils/
│   ├── charts.py            # Chart generation utilities
│   ├── insight_utils.py     # Insight helpers
│   └── tracer_utils.py      # LangSmith tracing helpers
├── sample_data/             # Sample datasets
│   ├── sales.csv
│   ├── marketing_spend.csv
│   └── customers.csv
└── requirements.txt         # Python dependencies
```
## Setup Instructions
### Prerequisites
- Python 3.10 or higher (required by Gradio 5)
- Google API key for Gemini AI
### Installation
1. **Clone the repository**
```bash
git clone <repository-url>
cd datum
```
2. **Create a virtual environment**
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install dependencies**
```bash
pip install -r requirements.txt
```
4. **Set up environment variables**
Create a `.env` file in the project root:
```bash
GOOGLE_API_KEY=your_google_api_key_here
LANGCHAIN_PROJECT=datum-analysis # Optional: for LangSmith tracing
LANGCHAIN_API_KEY=your_langsmith_api_key # Optional: for LangSmith tracing
LANGCHAIN_TRACING_V2=true # Optional: enable LangSmith tracing
```
5. **Run the application**
```bash
python app.py
```
6. **Access the web interface**
Open your browser and navigate to the URL shown in the terminal (typically `http://127.0.0.1:7860`)
## Usage
1. **Ask a question**: Type your data analysis question in natural language
   - Example: "What are the top 3 regions by revenue?"
   - Example: "Show me marketing spend by channel"
   - Example: "Which products have the highest unit sales?"
2. **Agent chooses the path automatically**
   - **Chat route**: Direct conversational answer when no database analysis is needed
   - **SQL route**: The agent generates SQL and provides:
     - **Query Result** (table)
     - **Chart** (visualization)
     - **Insights** (narrative + recommendation)
     - **SQL** (for transparency)
### Routing at a Glance
The `decider` node analyzes your question and sets a `route` of `chat` or `sql`. The graph then either calls `general_chat` or runs the SQL flow (`sql_generator` → `sql_executor` → `chart_generator` + `narrator`).
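The routing described above can be sketched in plain Python. This is an illustration only, not the project's code: it does not use LangGraph, the keyword check in `decider` stands in for the LLM call, and `AgentState`, `FLOWS`, `make_node`, and `run` are hypothetical names. Only the node names and the `route` values come from this README. The sketch also runs `chart_generator` and `narrator` one after the other, which may differ from how the real graph schedules them.

```python
from typing import Callable, Dict, List, TypedDict


class AgentState(TypedDict, total=False):
    question: str
    route: str          # "chat" or "sql", set by the decider
    visited: List[str]  # node names in execution order (illustration only)


def _visit(state: AgentState, name: str) -> AgentState:
    # Record that a node ran; a real node would call the LLM or DuckDB here.
    return {**state, "visited": state.get("visited", []) + [name]}


def decider(state: AgentState) -> AgentState:
    # Stand-in heuristic (assumption): the real decider asks the LLM for a route.
    data_words = ("revenue", "spend", "sales", "top", "units")
    question = state["question"].lower()
    route = "sql" if any(word in question for word in data_words) else "chat"
    return _visit({**state, "route": route}, "decider")


def make_node(name: str) -> Callable[[AgentState], AgentState]:
    # Placeholder node that only records its own name.
    return lambda state: _visit(state, name)


# Which nodes run for each route, in order.
FLOWS: Dict[str, List[str]] = {
    "chat": ["general_chat"],
    "sql": ["sql_generator", "sql_executor", "chart_generator", "narrator"],
}


def run(question: str) -> AgentState:
    state = decider({"question": question})
    for name in FLOWS[state["route"]]:
        state = make_node(name)(state)
    return state
```

Calling `run("What are the top 3 regions by revenue?")` takes the `sql` route, while a greeting such as `run("Hello there")` takes the `chat` route.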
## Sample Data
The project includes sample datasets:
- **Sales**: Date, region, product, revenue, units sold
- **Marketing Spend**: Date, region, channel, spend amount
- **Customers**: Customer ID, region, age, income
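To make a question like "What are the top 3 regions by revenue?" concrete, here is a rough sketch of the kind of SQL the agent might produce for the sales table. It uses the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without extra dependencies. The column names (`date`, `region`, `product`, `revenue`, `units_sold`) are guesses based on the field list above, and the rows are invented, not taken from `sales.csv`.

```python
import sqlite3

# Invented rows; column names are an assumption based on the field list above.
rows = [
    ("2024-01-01", "North", "Widget", 1200.0, 30),
    ("2024-01-01", "South", "Widget", 800.0, 20),
    ("2024-01-02", "North", "Gadget", 500.0, 5),
    ("2024-01-02", "East", "Gadget", 900.0, 9),
    ("2024-01-03", "West", "Widget", 400.0, 10),
]

con = sqlite3.connect(":memory:")  # stand-in for the in-memory DuckDB database
con.execute(
    "CREATE TABLE sales "
    "(date TEXT, region TEXT, product TEXT, revenue REAL, units_sold INTEGER)"
)
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# The kind of query the SQL route might generate for "top 3 regions by revenue".
top_regions = con.execute(
    """
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region
    ORDER BY total_revenue DESC
    LIMIT 3
    """
).fetchall()

print(top_regions)  # [('North', 1700.0), ('East', 900.0), ('South', 800.0)]
```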
## Technology Stack
- **LangGraph**: Workflow orchestration
- **Google Gemini**: AI language model
- **DuckDB**: In-memory SQL database
- **Gradio**: Web interface
- **Matplotlib**: Chart generation
- **Pandas**: Data manipulation
- **LangSmith**: Observability and tracing platform
## Customization
- **Add your own data**: Replace CSV files in the `sample_data/` directory and update the schema in `builder/nodes.py`
- **Modify the LLM**: Change the model or provider in `clients/llm.py`
- **Customize charts**: Modify chart generation logic in `utils/charts.py`
- **Extend the workflow**: Add new nodes in `builder/nodes.py` and wire them into the LangGraph workflow (see `builder/graph_builder.py` and `app.py`)
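When you swap in your own CSV files, the schema description the agent works from has to match the new columns. The sketch below shows one way to derive a one-line summary from a CSV header. `describe_columns` and `describe_csv` are hypothetical helpers, and the `table(col1, col2, ...)` format is an assumption, since this README does not show how the schema is written in `nodes.py`.

```python
import csv
import io
from pathlib import Path
from typing import TextIO


def describe_columns(table_name: str, handle: TextIO) -> str:
    """Return 'table(col1, col2, ...)' from the header row of an open CSV."""
    header = next(csv.reader(handle))
    columns = ", ".join(name.strip() for name in header)
    return f"{table_name}({columns})"


def describe_csv(path: Path) -> str:
    """Describe a CSV file, using the file stem as the table name."""
    with path.open(newline="", encoding="utf-8") as handle:
        return describe_columns(path.stem, handle)


# Example with an in-memory file; the column names are made up.
sample = io.StringIO("order_id,region,amount\n1,North,10.5\n")
print(describe_columns("orders", sample))  # orders(order_id, region, amount)
```

For a file on disk, `describe_csv(Path("sample_data/sales.csv"))` would produce the same kind of summary with `sales` as the table name.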
## Observability & Debugging
The application includes built-in LangSmith tracing for monitoring and debugging:
- **Trace Execution**: All agent steps are automatically traced and logged
- **Performance Monitoring**: Track execution times and token usage
- **Debug Information**: View detailed logs of SQL generation, execution, and LLM calls
- **Project Organization**: Traces are organized by project name for easy filtering
To enable tracing, set the LangSmith environment variables in your `.env` file. Without these variables, the application will run normally but without tracing capabilities.
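As a quick way to confirm whether the tracing variables are in place, a small check along these lines may help. `tracing_enabled` is a hypothetical helper, not part of the project or of LangSmith. It only reports whether `LANGCHAIN_TRACING_V2` is `true` and `LANGCHAIN_API_KEY` is non-empty in the given environment.

```python
import os
from typing import Mapping


def tracing_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True only if tracing is switched on and an API key is present."""
    flag = env.get("LANGCHAIN_TRACING_V2", "").strip().lower() == "true"
    has_key = bool(env.get("LANGCHAIN_API_KEY", "").strip())
    return flag and has_key
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching your real environment.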
## Troubleshooting
- **API Key Error**: Ensure your `GOOGLE_API_KEY` is set correctly in the `.env` file
- **Import Errors**: Make sure all dependencies are installed with `pip install -r requirements.txt`
- **Data Issues**: Verify your CSV files are in the correct format and location
- **Tracing Issues**: Check LangSmith credentials if you want to use the observability features
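The first and third checks above can be automated with a short preflight script. This is a sketch under assumptions: `preflight` and `EXPECTED_CSVS` are hypothetical names, and the script reads the process environment, so values from `.env` must already be exported or loaded before it runs.

```python
import os
from pathlib import Path
from typing import List, Mapping

# File names taken from the project structure in this README.
EXPECTED_CSVS = ("sales.csv", "marketing_spend.csv", "customers.csv")


def preflight(env: Mapping[str, str], data_dir: Path) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not env.get("GOOGLE_API_KEY", "").strip():
        problems.append("GOOGLE_API_KEY is not set")
    for name in EXPECTED_CSVS:
        if not (data_dir / name).is_file():
            problems.append(f"missing data file: {data_dir / name}")
    return problems


if __name__ == "__main__":
    # Run from the project root so that sample_data/ resolves correctly.
    for problem in preflight(os.environ, Path("sample_data")):
        print(problem)
```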
## License
This project is open source and available under the MIT License.