---
title: datum
app_file: app.py
sdk: gradio
sdk_version: 5.44.1
---
# Datum - AI-Powered Data Analysis Agent
Datum is a data analysis agent that uses an LLM to turn natural-language questions into SQL, runs the queries against your data in DuckDB, and returns tables, charts, and written insights through a Gradio web interface.
## Features
- **Natural Language Queries**: Ask questions about your data in plain English
- **Auto Routing (Chat vs SQL)**: Agent decides between a quick chat reply or full SQL/database analysis
- **AI-Generated SQL**: Automatically converts questions into SQL queries
- **Data Visualization**: Generates charts and graphs from query results
- **Intelligent Insights**: Provides narrative analysis and recommendations
- **Web Interface**: Clean, user-friendly Gradio interface
- **DuckDB Integration**: Fast, in-memory SQL database for data analysis
- **LangSmith Tracing**: Built-in observability and debugging with LangSmith integration
## Project Structure
```
datum/
├── app.py                  # Main application with LangGraph workflow
├── builder/
│   ├── graph_builder.py    # Graph with router + conditional edges
│   ├── nodes.py            # Agent nodes (decider, chat, SQL, charting, narration)
│   ├── state.py            # Typed state definition for the agent
│   └── ui.py               # Gradio UI wiring
├── clients/
│   └── llm.py              # LLM configuration (Google Gemini)
├── datastore/
│   └── db.py               # DuckDB setup and data loading
├── utils/
│   ├── charts.py           # Chart generation utilities
│   ├── insight_utils.py    # Insight helpers
│   └── tracer_utils.py     # LangSmith tracing helpers
├── sample_data/            # Sample datasets
│   ├── sales.csv
│   ├── marketing_spend.csv
│   └── customers.csv
└── requirements.txt        # Python dependencies
```
## Setup Instructions
### Prerequisites
- Python 3.10 or higher (required by Gradio 5.x)
- Google API key for Gemini AI
### Installation
1. **Clone the repository**
```bash
git clone <repository-url>
cd datum
```
2. **Create a virtual environment**
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
3. **Install dependencies**
```bash
pip install -r requirements.txt
```
4. **Set up environment variables**
Create a `.env` file in the project root:
```bash
GOOGLE_API_KEY=your_google_api_key_here
LANGCHAIN_PROJECT=datum-analysis # Optional: for LangSmith tracing
LANGCHAIN_API_KEY=your_langsmith_api_key # Optional: for LangSmith tracing
LANGCHAIN_TRACING_V2=true # Optional: enable LangSmith tracing
```
5. **Run the application**
```bash
python app.py
```
6. **Access the web interface**
Open your browser and navigate to the URL shown in the terminal (typically `http://127.0.0.1:7860`)
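
If the app fails at startup, the `.env` file from step 4 is a common culprit. The following is a minimal, standard-library-only sketch for checking it before launch. The helper names `parse_env` and `missing_keys` are hypothetical and are not part of this repository, and the parser handles only simple `KEY=VALUE` lines with optional trailing ` # ...` comments, which may differ from how the app itself loads the file.

```python
from __future__ import annotations

from pathlib import Path

REQUIRED = ("GOOGLE_API_KEY",)
PLACEHOLDER = "your_google_api_key_here"  # placeholder value shown in step 4


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; skip blanks, comment lines, and trailing ' # ...' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment, then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        values[key.strip()] = value
    return values


def missing_keys(values: dict[str, str]) -> list[str]:
    """Return required keys that are absent, empty, or still set to the placeholder."""
    return [key for key in REQUIRED if values.get(key, "") in ("", PLACEHOLDER)]


if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text() if env_file.exists() else ""
    problems = missing_keys(parse_env(text))
    print("OK" if not problems else f"Missing or placeholder: {', '.join(problems)}")
```

Run it from the project root; it only reads the file and never prints the key itself.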
## Usage
1. **Ask a question**: Type your data analysis question in natural language
   - Example: "What are the top 3 regions by revenue?"
   - Example: "Show me marketing spend by channel"
   - Example: "Which products have the highest unit sales?"
2. **Agent chooses the path automatically**
   - **Chat route**: Direct conversational answer when no database analysis is needed
   - **SQL route**: The agent generates SQL and provides:
     - **Query Result** (table)
     - **Chart** (visualization)
     - **Insights** (narrative + recommendation)
     - **SQL** (for transparency)
### Routing at a Glance
The `decider` node analyzes your question and sets a `route` of `chat` or `sql`. The graph then either calls `general_chat` or runs the SQL flow (`sql_generator` → `sql_executor` → `chart_generator` + `narrator`).
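
The routing idea can be illustrated without any dependencies. The sketch below is not the project's LangGraph code: it uses plain functions and a dictionary in place of the graph's conditional edges, and a hypothetical keyword heuristic (`DATA_HINTS`) in place of the LLM call that the real `decider` presumably makes. Only the node names and the `route` values come from this README.

```python
from __future__ import annotations

from typing import TypedDict


class AgentState(TypedDict, total=False):
    question: str
    route: str   # "chat" or "sql"
    steps: list  # names of the nodes that ran, in order


# Assumption: a keyword heuristic standing in for the LLM-based decision.
DATA_HINTS = ("revenue", "sales", "spend", "units", "customers", "region")


def decider(state: AgentState) -> AgentState:
    """Set state['route'] to 'sql' for data questions, otherwise 'chat'."""
    question = state["question"].lower()
    route = "sql" if any(hint in question for hint in DATA_HINTS) else "chat"
    return {**state, "route": route}


def general_chat(state: AgentState) -> AgentState:
    return {**state, "steps": ["general_chat"]}


def sql_flow(state: AgentState) -> AgentState:
    # Mirrors the SQL flow: sql_generator -> sql_executor -> chart_generator + narrator.
    return {**state, "steps": ["sql_generator", "sql_executor", "chart_generator", "narrator"]}


# Plays the role of the conditional edge: route value -> next step.
ROUTES = {"chat": general_chat, "sql": sql_flow}


def run(question: str) -> AgentState:
    state = decider({"question": question})
    return ROUTES[state["route"]](state)


if __name__ == "__main__":
    print(run("What are the top 3 regions by revenue?")["route"])
    print(run("Hello, what can you do?")["route"])
```

Keeping the decision in one node and the dispatch in a lookup table means a new route only needs a new entry in `ROUTES` and a matching value from `decider`.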
## Sample Data
The project includes sample datasets:
- **Sales**: Date, region, product, revenue, units sold
- **Marketing Spend**: Date, region, channel, spend amount
- **Customers**: Customer ID, region, age, income
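
To give a sense of the SQL the agent might produce for "What are the top 3 regions by revenue?", here is a small sketch. It makes several assumptions: the column names (`date`, `region`, `product`, `revenue`, `units_sold`) are guesses based on the list above, the rows are invented for illustration and are not taken from `sales.csv`, and Python's built-in `sqlite3` stands in for DuckDB so the example runs without extra dependencies. The query itself is plain SQL of a kind both engines accept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (date TEXT, region TEXT, product TEXT, revenue REAL, units_sold INTEGER)"
)

# Invented rows for illustration only; not the contents of sample_data/sales.csv.
rows = [
    ("2024-01-01", "North", "Widget", 1200.0, 12),
    ("2024-01-01", "South", "Widget", 800.0, 8),
    ("2024-01-02", "North", "Gadget", 500.0, 5),
    ("2024-01-02", "East", "Gadget", 900.0, 9),
    ("2024-01-03", "West", "Widget", 300.0, 3),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?)", rows)

# Aggregate revenue per region and keep the three largest totals.
TOP_REGIONS_SQL = """
SELECT region, SUM(revenue) AS total_revenue
FROM sales
GROUP BY region
ORDER BY total_revenue DESC
LIMIT 3
"""

top_regions = conn.execute(TOP_REGIONS_SQL).fetchall()
print(top_regions)  # [('North', 1700.0), ('East', 900.0), ('South', 800.0)]
```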
## Technology Stack
- **LangGraph**: Workflow orchestration
- **Google Gemini**: AI language model
- **DuckDB**: In-memory SQL database
- **Gradio**: Web interface
- **Matplotlib**: Chart generation
- **Pandas**: Data manipulation
- **LangSmith**: Observability and tracing platform
## Customization
- **Add your own data**: Replace the CSV files in the `sample_data/` directory and update the schema in `builder/nodes.py`
- **Modify the LLM**: Change the model or provider in `clients/llm.py`
- **Customize charts**: Modify the chart generation logic in `utils/charts.py`
- **Extend the workflow**: Add new nodes in `builder/nodes.py` and wire them into the graph in `builder/graph_builder.py`
## Observability & Debugging
The application includes built-in LangSmith tracing for monitoring and debugging:
- **Trace Execution**: All agent steps are automatically traced and logged
- **Performance Monitoring**: Track execution times and token usage
- **Debug Information**: View detailed logs of SQL generation, execution, and LLM calls
- **Project Organization**: Traces are organized by project name for easy filtering
To enable tracing, set the LangSmith environment variables in your `.env` file. Without these variables, the application will run normally but without tracing capabilities.
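
A quick way to confirm whether the tracing variables are in place is sketched below. The function `tracing_configured` is hypothetical and not part of this repository. It assumes tracing counts as configured when `LANGCHAIN_TRACING_V2` is set to `true` and `LANGCHAIN_API_KEY` is non-empty; the exact rules LangSmith applies may differ.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def tracing_configured(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the tracing flag is 'true' and an API key is present."""
    env = os.environ if env is None else env
    # Assumption: the flag is compared case-insensitively against "true".
    flag_on = env.get("LANGCHAIN_TRACING_V2", "").strip().lower() == "true"
    has_key = bool(env.get("LANGCHAIN_API_KEY", "").strip())
    return flag_on and has_key


if __name__ == "__main__":
    print("Tracing configured" if tracing_configured() else "Tracing not configured")
```

When called without arguments it reads the process environment, so it reflects the `.env` values only after they have been loaded into the environment.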
## Troubleshooting
- **API Key Error**: Ensure your `GOOGLE_API_KEY` is set correctly in the `.env` file
- **Import Errors**: Make sure all dependencies are installed with `pip install -r requirements.txt`
- **Data Issues**: Verify your CSV files are in the correct format and location
- **Tracing Issues**: Check LangSmith credentials if you want to use the observability features
## License
This project is open source and available under the MIT License.