| # Course Project: DataCrew | |
| **A CLI tool that uses local LLMs and multi-agent systems to transform spreadsheets into intelligent PDF reports.** | |
| --- | |
| ## Overview | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β DataCrew CLI β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β $ datacrew ingest sales_2024.xlsx β | |
| β $ datacrew ask "What were the top 5 products by revenue?" β | |
| β $ datacrew report "Q4 Executive Summary" --output report.pdf β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| --- | |
| ## Architecture | |
| ``` | |
| CSV/XLSX βββΊ SQLite βββΊ MCP Server βββΊ Multi-Agent Crew βββΊ PDF Report | |
| β | |
| βΌ | |
| Docker Model Runner | |
| (Local LLM) | |
| ``` | |
| ### Data Flow | |
| ``` | |
| ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ | |
| β CSV/XLSX ββββββΊβ SQLite ββββββΊβ MCP Server β | |
| β Files β β Database β β (Tools) β | |
| ββββββββββββββββ ββββββββββββββββ ββββββββ¬ββββββββ | |
| β | |
| βΌ | |
| ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ | |
| β PDF Report βββββββ Agent Crew βββββββ Local LLM β | |
| β Output β β (CrewAI) β β (Docker) β | |
| ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ | |
| ``` | |
| --- | |
| ## Agent System | |
| | Agent | Role | Tools | Output | | |
| |-------|------|-------|--------| | |
| | **Data Analyst** | Understands schema, writes SQL queries | MCP Database Tools | Query results, data summaries | | |
| | **Insights Agent** | Interprets results, finds patterns | Python Analysis, Statistics | Key findings, trends, anomalies | | |
| | **Report Writer** | Creates narrative sections | LLM Generation | Executive summary, section text | | |
| | **PDF Composer** | Formats and assembles final report | ReportLab/WeasyPrint | Formatted PDF document | | |
| ### Agent Workflow | |
| ``` | |
| User Request: "Generate Q4 Executive Summary" | |
| β | |
| βΌ | |
| βββββββββββββββββββββββ | |
| β Data Analyst β | |
| β "What data do we β | |
| β need for Q4?" β | |
| ββββββββββββ¬βββββββββββ | |
| β SQL Queries | |
| βΌ | |
| βββββββββββββββββββββββ | |
| β Insights Agent β | |
| β "What patterns β | |
| β emerge from β | |
| β this data?" β | |
| ββββββββββββ¬βββββββββββ | |
| β Key Findings | |
| βΌ | |
| βββββββββββββββββββββββ | |
| β Report Writer β | |
| β "Write narrative β | |
| β sections for β | |
| β each finding" β | |
| ββββββββββββ¬βββββββββββ | |
| β Text Sections | |
| βΌ | |
| βββββββββββββββββββββββ | |
| β PDF Composer β | |
| β "Assemble into β | |
| β formatted PDF" β | |
| ββββββββββββ¬βββββββββββ | |
| β | |
| βΌ | |
| report.pdf | |
| ``` | |
| --- | |
| ## CLI Commands | |
| ### `datacrew ingest` | |
| Ingest CSV or XLSX files into the local SQLite database. | |
| ```bash | |
| # Ingest a single file | |
| datacrew ingest sales_2024.xlsx | |
| # Ingest with custom table name | |
| datacrew ingest sales_2024.xlsx --table quarterly_sales | |
| # Ingest multiple files | |
| datacrew ingest data/*.csv | |
| # Ingest with schema inference options | |
| datacrew ingest sales.csv --infer-types --date-columns "order_date,ship_date" | |
| ``` | |
| **Options:** | |
| | Flag | Description | Default | | |
| |------|-------------|---------| | |
| | `--table` | Custom table name | Filename (sanitized) | | |
| | `--if-exists` | Behavior if table exists: `fail`, `replace`, `append` | `fail` | | |
| | `--infer-types` | Automatically infer column types | `true` | | |
| | `--date-columns` | Comma-separated list of date columns | Auto-detect | | |
| | `--db` | Database file path | `./data/datacrew.db` | | |
| ### `datacrew ask` | |
| Query the database using natural language. | |
| ```bash | |
| # Simple query | |
| datacrew ask "What were the top 5 products by revenue?" | |
| # Query with output format | |
| datacrew ask "Show monthly sales trends" --format table | |
| # Query with export | |
| datacrew ask "List all customers from California" --export customers_ca.csv | |
| # Interactive mode | |
| datacrew ask --interactive | |
| ``` | |
| **Options:** | |
| | Flag | Description | Default | | |
| |------|-------------|---------| | |
| | `--format` | Output format: `table`, `json`, `csv`, `markdown` | `table` | | |
| | `--export` | Export results to file | None | | |
| | `--explain` | Show generated SQL query | `false` | | |
| | `--interactive` | Enter interactive query mode | `false` | | |
| | `--limit` | Maximum rows to return | `100` | | |
| ### `datacrew report` | |
| Generate PDF reports using the multi-agent system. | |
| ```bash | |
| # Generate a report | |
| datacrew report "Q4 Executive Summary" | |
| # Specify output file | |
| datacrew report "Q4 Executive Summary" --output reports/q4_summary.pdf | |
| # Use a template | |
| datacrew report "Monthly Sales" --template executive | |
| # Include specific analyses | |
| datacrew report "Product Analysis" --include trends,comparisons,recommendations | |
| ``` | |
| **Options:** | |
| | Flag | Description | Default | | |
| |------|-------------|---------| | |
| | `--output`, `-o` | Output PDF file path | `./report.pdf` | | |
| | `--template` | Report template: `executive`, `detailed`, `minimal` | `executive` | | |
| | `--include` | Analyses to include | All | | |
| | `--date-range` | Date range for analysis | All data | | |
| | `--verbose`, `-v` | Show agent reasoning | `false` | | |
| ### `datacrew config` | |
| Manage configuration settings. | |
| ```bash | |
| # Show current config | |
| datacrew config show | |
| # Set LLM model | |
| datacrew config set llm.model "llama3.2:3b" | |
| # Set database path | |
| datacrew config set database.path "./data/mydata.db" | |
| # Reset to defaults | |
| datacrew config reset | |
| ``` | |
| ### `datacrew schema` | |
| Inspect database schema. | |
| ```bash | |
| # List all tables | |
| datacrew schema list | |
| # Show table details | |
| datacrew schema describe sales | |
| # Show sample data | |
| datacrew schema sample sales --rows 5 | |
| ``` | |
| --- | |
| ## Configuration | |
| Configuration is stored in `~/.config/datacrew/config.toml` or `./datacrew.toml` in the project directory. | |
| ```toml | |
| [datacrew] | |
| version = "1.0.0" | |
| [database] | |
| path = "./data/datacrew.db" | |
| echo = false | |
| [llm] | |
| provider = "docker" # docker, ollama, openai | |
| model = "llama3.2:3b" | |
| temperature = 0.7 | |
| max_tokens = 4096 | |
| base_url = "http://localhost:11434" | |
| [llm.docker] | |
| runtime = "nvidia" # nvidia, cpu | |
| memory_limit = "8g" | |
| [agents] | |
| verbose = false | |
| max_iterations = 10 | |
| [agents.analyst] | |
| role = "Data Analyst" | |
| goal = "Analyze data and write accurate SQL queries" | |
| [agents.insights] | |
| role = "Insights Specialist" | |
| goal = "Find meaningful patterns and trends in data" | |
| [agents.writer] | |
| role = "Report Writer" | |
| goal = "Create clear, compelling narrative content" | |
| [agents.composer] | |
| role = "PDF Composer" | |
| goal = "Assemble professional PDF reports" | |
| [reports] | |
| output_dir = "./reports" | |
| default_template = "executive" | |
| [reports.templates.executive] | |
| include_charts = true | |
| include_recommendations = true | |
| max_pages = 10 | |
| [reports.templates.detailed] | |
| include_charts = true | |
| include_recommendations = true | |
| include_raw_data = true | |
| max_pages = 50 | |
| [observability] | |
| enabled = true | |
| provider = "langfuse" # langfuse, langsmith, console | |
| trace_agents = true | |
| log_tokens = true | |
| ``` | |
| --- | |
| ## Docker Stack | |
| ### docker-compose.yml | |
| ```yaml | |
| version: "3.9" | |
| services: | |
| # Local LLM via Docker Model Runner | |
| llm: | |
| image: ollama/ollama:latest | |
| runtime: nvidia | |
| environment: | |
| - OLLAMA_HOST=0.0.0.0 | |
| volumes: | |
| - ollama_data:/root/.ollama | |
| ports: | |
| - "11434:11434" | |
| deploy: | |
| resources: | |
| reservations: | |
| devices: | |
| - driver: nvidia | |
| count: 1 | |
| capabilities: [gpu] | |
| healthcheck: | |
| test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"] | |
| interval: 30s | |
| timeout: 10s | |
| retries: 3 | |
| # MCP Server for database access | |
| mcp-server: | |
| build: | |
| context: ./src/datacrew/mcp | |
| dockerfile: Dockerfile | |
| environment: | |
| - DATABASE_PATH=/data/datacrew.db | |
| - MCP_PORT=3000 | |
| volumes: | |
| - ./data:/data | |
| ports: | |
| - "3000:3000" | |
| depends_on: | |
| - llm | |
| # Main application (for containerized usage) | |
| datacrew: | |
| build: | |
| context: . | |
| dockerfile: Dockerfile | |
| environment: | |
| - LLM_BASE_URL=http://llm:11434 | |
| - MCP_SERVER_URL=http://mcp-server:3000 | |
| - DATABASE_PATH=/data/datacrew.db | |
| volumes: | |
| - ./data:/data | |
| - ./reports:/reports | |
| - ./input:/input:ro | |
| depends_on: | |
| llm: | |
| condition: service_healthy | |
| mcp-server: | |
| condition: service_started | |
| profiles: | |
| - cli | |
| volumes: | |
| ollama_data: | |
| ``` | |
| ### Running the Stack | |
| ```bash | |
| # Start LLM and MCP server | |
| docker compose up -d llm mcp-server | |
| # Pull the model (first time only) | |
| docker compose exec llm ollama pull llama3.2:3b | |
| # Run DataCrew commands | |
| docker compose run --rm datacrew ingest /input/sales.xlsx | |
| docker compose run --rm datacrew ask "What is total revenue?" | |
| docker compose run --rm datacrew report "Sales Summary" -o /reports/summary.pdf | |
| # Or run locally with Docker backend | |
| datacrew ingest sales.xlsx | |
| datacrew ask "What is total revenue?" | |
| datacrew report "Sales Summary" | |
| ``` | |
| --- | |
| ## Project Structure | |
| ``` | |
| datacrew/ | |
| βββ pyproject.toml # pixi/uv project config | |
| βββ pixi.lock | |
| βββ docker-compose.yml # Full stack orchestration | |
| βββ Dockerfile | |
| βββ datacrew.toml # Default configuration | |
| βββ README.md | |
| β | |
| βββ src/ | |
| β βββ datacrew/ | |
| β βββ __init__.py | |
| β βββ __main__.py # Entry point | |
| β βββ cli.py # Typer CLI commands | |
| β βββ config.py # TOML configuration loader | |
| β β | |
| β βββ ingestion/ # CSV/XLSX β SQLite | |
| β β βββ __init__.py | |
| β β βββ readers.py # File readers (pandas, openpyxl) | |
| β β βββ schema.py # Schema inference | |
| β β βββ database.py # SQLite operations | |
| β β | |
| β βββ query/ # Natural language queries | |
| β β βββ __init__.py | |
| β β βββ nl2sql.py # NL to SQL conversion | |
| β β βββ executor.py # Query execution | |
| β β βββ formatter.py # Result formatting | |
| β β | |
| β βββ agents/ # CrewAI agents | |
| β β βββ __init__.py | |
| β β βββ crew.py # Crew orchestration | |
| β β βββ analyst.py # Data Analyst agent | |
| β β βββ insights.py # Insights Specialist agent | |
| β β βββ writer.py # Report Writer agent | |
| β β βββ composer.py # PDF Composer agent | |
| β β | |
| β βββ tools/ # Agent tools | |
| β β βββ __init__.py | |
| β β βββ sql_tools.py # SQL execution tools | |
| β β βββ analysis.py # Statistical analysis tools | |
| β β βββ charts.py # Chart generation tools | |
| β β | |
| β βββ mcp/ # MCP server | |
| β β βββ __init__.py | |
| β β βββ server.py # MCP server implementation | |
| β β βββ tools.py # MCP tool definitions | |
| β β βββ Dockerfile # MCP server container | |
| β β | |
| β βββ reports/ # PDF generation | |
| β β βββ __init__.py | |
| β β βββ generator.py # Report generation orchestrator | |
| β β βββ pdf.py # PDF creation (WeasyPrint) | |
| β β βββ charts.py # Chart rendering | |
| β β βββ templates/ # HTML/CSS templates | |
| β β βββ executive.html | |
| β β βββ detailed.html | |
| β β βββ minimal.html | |
| β β βββ styles.css | |
| β β | |
| β βββ llm/ # LLM integration | |
| β β βββ __init__.py | |
| β β βββ client.py # LLM client (Docker/Ollama/OpenAI) | |
| β β βββ prompts.py # Prompt templates | |
| β β | |
| β βββ observability/ # Logging & tracing | |
| β βββ __init__.py | |
| β βββ tracing.py # Distributed tracing | |
| β βββ metrics.py # Token/cost tracking | |
| β | |
| βββ tests/ | |
| β βββ __init__.py | |
| β βββ conftest.py # Pytest fixtures | |
| β βββ test_cli.py | |
| β βββ test_ingestion.py | |
| β βββ test_query.py | |
| β βββ test_agents.py | |
| β βββ test_reports.py | |
| β βββ fixtures/ | |
| β βββ sample_sales.csv | |
| β βββ sample_products.xlsx | |
| β βββ expected_outputs/ | |
| β | |
| βββ data/ # Local data directory | |
| β βββ .gitkeep | |
| β | |
| βββ reports/ # Generated reports | |
| β βββ .gitkeep | |
| β | |
| βββ docs/ # Documentation (Quarto) | |
| βββ _quarto.yml | |
| βββ index.qmd | |
| βββ chapters/ | |
| ``` | |
| --- | |
| ## Technology Stack | |
| | Category | Tools | | |
| |----------|-------| | |
| | **Package Management** | pixi, uv | | |
| | **CLI Framework** | Typer, Rich | | |
| | **Local LLM** | Docker Model Runner, Ollama | | |
| | **LLM Framework** | LangChain | | |
| | **Multi-Agent** | CrewAI | | |
| | **MCP** | Docker MCP Toolkit | | |
| | **Database** | SQLite | | |
| | **Data Processing** | pandas, openpyxl | | |
| | **PDF Generation** | WeasyPrint | | |
| | **Charts** | matplotlib, plotly | | |
| | **Observability** | Langfuse, OpenTelemetry | | |
| | **Testing** | pytest, DeepEval | | |
| | **Containerization** | Docker, Docker Compose | | |
| --- | |
| ## Example Usage | |
| ### End-to-End Workflow | |
| ```bash | |
| # 1. Start the Docker stack | |
| docker compose up -d | |
| # 2. Ingest your data | |
| datacrew ingest quarterly_sales_2024.xlsx | |
| datacrew ingest product_catalog.csv | |
| datacrew ingest customer_data.csv | |
| # 3. Explore with natural language queries | |
| datacrew ask "How many records are in each table?" | |
| datacrew ask "What are the top 10 products by revenue in Q4?" | |
| datacrew ask "Show me the monthly sales trend for 2024" | |
| # 4. Generate a comprehensive report | |
| datacrew report "2024 Annual Sales Analysis" \ | |
| --template detailed \ | |
| --output reports/annual_2024.pdf \ | |
| --include trends,top_products,regional_breakdown,recommendations \ | |
| --verbose | |
| # 5. View agent reasoning (verbose mode) | |
| # [Data Analyst] Analyzing schema... found 3 tables | |
| # [Data Analyst] Executing: SELECT strftime('%Y-%m', order_date) as month, SUM(revenue) ... | |
| # [Insights Agent] Identified trend: 23% YoY growth in Q4 | |
| # [Insights Agent] Anomaly detected: December spike in electronics category | |
| # [Report Writer] Generating executive summary... | |
| # [PDF Composer] Assembling 12-page report... | |
| # β Report saved to reports/annual_2024.pdf | |
| ``` | |
| ### Sample Report Output | |
| ``` | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β 2024 Annual Sales Analysis β | |
| β Executive Summary β | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β β | |
| β Key Findings: β | |
| β β’ Total revenue: $4.2M (+23% YoY) β | |
| β β’ Top product category: Electronics (38% of revenue) β | |
| β β’ Strongest region: West Coast (42% of sales) β | |
| β β’ Customer retention rate: 78% β | |
| β β | |
| β [Monthly Revenue Trend Chart] β | |
| β β | |
| β Recommendations: β | |
| β 1. Expand electronics inventory for Q1 2025 β | |
| β 2. Increase marketing spend in Midwest region β | |
| β 3. Launch loyalty program to improve retention β | |
| β β | |
| ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| --- | |
| ## Learning Outcomes | |
| By building DataCrew, learners will be able to: | |
| 1. β Set up modern Python projects with pixi and reproducible environments | |
| 2. β Build professional CLI tools with Typer and Rich | |
| 3. β Run local LLMs using Docker Model Runner | |
| 4. β Ingest and query data from spreadsheets using natural language | |
| 5. β Build MCP servers to connect AI agents to data sources | |
| 6. β Design multi-agent systems with CrewAI | |
| 7. β Generate PDF reports programmatically | |
| 8. β Implement observability for AI applications | |
| 9. β Test non-deterministic systems effectively | |
| 10. β Deploy self-hosted AI applications with Docker Compose | |