---
title: Analytics Validation Demo
emoji: 📊
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.35.0
app_file: app.py
pinned: false
license: mit
---

# Analytics Validation Demo

A local, demo-ready automation tool that helps data analysts validate business metrics before presenting them to leadership. Built with Python and LangChain.


## What Problem This Solves

Data analysts frequently receive raw CSV exports from finance or operations systems and must manually scan them for quality issues (missing values, date gaps, sudden metric drops) before any numbers are presented to executives.

This tool automates that first-pass validation. It runs a deterministic rule engine against daily KPI data, surfaces detected issues with precise context (which column, which date, how large the drop), and optionally uses a local LLM to narrate the findings in plain business English.

The goal is to give an analyst a clean, trustworthy summary in seconds, not to replace the analyst's judgment.


## What Is Automated vs. What Is Not

| Automated | Not Automated |
| --- | --- |
| Missing value detection (per column, per date) | Root cause investigation |
| Row count / date gap detection | Database reconciliation |
| Day-over-day drop flagging (>20% threshold) | Trend analysis or forecasting |
| Descriptive statistics (min/max/mean/std/total) | Seasonality modeling |
| Duplicate date detection | Dashboard or report publishing |
| Plain-language issue narration (via LLM, optional) | Alerting or notifications |
| Fallback text summary when LLM unavailable | Automated decision-making |

The system will never invent data, speculate on causes, or auto-correct issues. It only reports what it finds.


## How to Run Locally

### Prerequisites

- Python 3.10 or later
- pip

### Install dependencies

```bash
pip install pandas langchain-core langchain-community langchain-ollama
```

### Run the demo

```bash
cd analytics_langchain_demo
python demo.py
```

The script reads `data.csv` from the same directory and prints a structured validation report to the console.
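The day-over-day drop check described in the table above can be sketched as follows. `run_checks()` and the 20% threshold are named in this repo; the helper `flag_drops()` and its internals are assumptions for illustration.

```python
import pandas as pd

ANOMALY_THRESHOLD = 0.20  # flag day-over-day drops larger than 20%

def flag_drops(df: pd.DataFrame, column: str) -> list[dict]:
    """Return one issue dict per day-over-day drop exceeding the threshold."""
    issues = []
    pct_change = df[column].pct_change()  # NaN for the first row
    for date, change in zip(df["date"], pct_change):
        if pd.notna(change) and change < -ANOMALY_THRESHOLD:
            issues.append({
                "check": "dod_drop",
                "column": column,
                "date": date,
                "drop_pct": round(abs(change) * 100, 1),
            })
    return issues

# Tiny inline sample: the third day drops ~39% versus the second
demo = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "revenue": [50000.0, 49000.0, 30000.0],
})
issues = flag_drops(demo, "revenue")
```

Each issue carries the precise context the report needs: which column, which date, how large the drop.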

### Optional: Enable LLM summaries via Ollama

If you have Ollama installed:

```bash
# Pull the model (one-time, ~2 GB)
ollama pull llama3.2

# Start the Ollama server (in a separate terminal)
ollama serve
```

Then run `python demo.py` as normal. The report will include an LLM-generated executive summary instead of the rule-based fallback text.

If Ollama is not running, the demo still produces a complete report; it logs a message to stderr and uses the deterministic fallback automatically.
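That try-then-fall-back behavior can be sketched like this. `_try_ollama_summary()` is named in this README; both function bodies below are illustrative stand-ins (the real one would call the Ollama model).

```python
import sys

def summarize(findings: list[str]) -> str:
    """Try the LLM first; on any failure, fall back to a deterministic template."""
    try:
        return _try_ollama_summary(findings)
    except Exception as exc:
        print(f"Ollama unavailable ({exc}); using fallback summary.", file=sys.stderr)
        return _fallback_summary(findings)

def _try_ollama_summary(findings: list[str]) -> str:
    # Stand-in for the real LangChain/Ollama call
    raise ConnectionError("no Ollama server on localhost:11434")

def _fallback_summary(findings: list[str]) -> str:
    return f"{len(findings)} issue(s) detected: " + "; ".join(findings)

text = summarize(["missing revenue value on 2024-02-05"])
```

Because the fallback is template-driven, two runs over the same data always produce the same summary.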


## Project Structure

```text
analytics_langchain_demo/
├── demo.py        Main runner. Contains load_data(), run_checks(),
│                  format_console_output(), and main(). All rule-engine
│                  logic lives here. No LLM calls.
│
├── llm_utils.py   LangChain integration layer. Tries Ollama; falls back
│                  to a deterministic text summary if unavailable.
│                  The only file that imports LangChain.
│
├── data.csv       90-day synthetic daily metrics (revenue, orders) with
│                  injected missing values, a date gap, and anomaly drops.
│
└── README.md      This file.
```

## Why LLMs Are Used Cautiously

The rule engine is the source of truth. It produces deterministic, auditable findings: specific columns, specific dates, specific percentage changes. These facts never change between runs.

The LLM only narrates those pre-computed facts in plain language. The prompt explicitly instructs it to:

- Not speculate on causes
- Not introduce information not present in the data
- Use a neutral, executive-friendly tone

This means the LLM cannot hallucinate a root cause or suggest a business conclusion. It is a translator, not an analyst.
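A minimal sketch of such a constrained prompt, assuming a plain string template (the actual prompt wording in `llm_utils.py` is not shown in this README):

```python
SUMMARY_PROMPT = """You are summarizing pre-computed data-quality findings for executives.

Rules:
- Do not speculate on causes.
- Do not introduce information that is not present in the findings.
- Use a neutral, executive-friendly tone.

Findings:
{findings}

Write a short plain-English summary of the findings above."""

def build_prompt(findings: list[str]) -> str:
    """Render the rule engine's pre-computed findings into the constrained prompt."""
    return SUMMARY_PROMPT.format(findings="\n".join(f"- {f}" for f in findings))

prompt = build_prompt(["1 missing value(s) in 'revenue' column on 2024-02-05"])
```

The model only ever sees facts the rule engine already produced, so its output is bounded by them.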

If the LLM is unavailable, the output is functionally identical; it is template-generated rather than model-generated. This makes the tool safe to use in regulated or high-stakes reporting environments where outputs must be reproducible.


## How This Scales in Production

This demo is intentionally small. Here is how each layer would evolve:

**Data input:** Replace `pd.read_csv()` with a SQLAlchemy connector to query directly from a data warehouse (Snowflake, Redshift, BigQuery). The `run_checks()` function signature does not change: it still receives a DataFrame.
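A minimal sketch of that swap, assuming a `daily_kpis` table with `date`, `revenue`, and `orders` columns (table and column names are hypothetical). An in-memory SQLite database stands in for the warehouse here:

```python
import pandas as pd
from sqlalchemy import create_engine, text

def load_data_from_warehouse(engine, start: str, end: str) -> pd.DataFrame:
    """Drop-in replacement for load_data(): warehouse in, same DataFrame shape out."""
    query = text(
        "SELECT date, revenue, orders FROM daily_kpis "
        "WHERE date BETWEEN :start AND :end ORDER BY date"
    )
    with engine.connect() as conn:
        return pd.read_sql(query, conn, params={"start": start, "end": end})

# In-memory SQLite stands in for Snowflake/Redshift/BigQuery in this sketch
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE daily_kpis (date TEXT, revenue REAL, orders INTEGER)"))
    conn.execute(text(
        "INSERT INTO daily_kpis VALUES "
        "('2024-01-02', 49000.0, 118), ('2024-01-01', 50000.0, 120)"
    ))

df = load_data_from_warehouse(engine, "2024-01-01", "2024-01-02")
```

Everything downstream of `load_data()` is untouched because the function still returns a DataFrame.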

**Scheduling:** Wrap `main()` in a Prefect or Airflow task to run nightly. Add a `--date` argument so the tool checks a specific reporting period rather than the full CSV.
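The `--date` flag could be parsed with `argparse`; this is a sketch, not the demo's actual CLI:

```python
import argparse
from datetime import datetime

def parse_args(argv=None) -> argparse.Namespace:
    """Parse a --date flag so a scheduler can validate one reporting day."""
    parser = argparse.ArgumentParser(description="Validate daily KPI data")
    parser.add_argument(
        "--date",
        type=lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
        default=None,
        help="Reporting date to validate (YYYY-MM-DD); default: the full file",
    )
    return parser.parse_args(argv)

args = parse_args(["--date", "2024-03-30"])
```

A nightly Prefect or Airflow task would then pass the run's logical date to this flag.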

**LLM:** Replace Ollama with a hosted model via `langchain-anthropic` or `langchain-openai`. Inject the API key via an environment variable. The `_try_ollama_summary()` function is the only change point.
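One way to sketch that change point: select a backend by environment variable, keeping Ollama as the default. `ChatAnthropic`, `ChatOpenAI`, and `ChatOllama` are the real LangChain class names; the selection logic itself is an assumption.

```python
def choose_backend(env: dict) -> str:
    """Return which LLM class _try_ollama_summary() should construct.

    Pass dict(os.environ) in real use; hosted backends win when an API
    key is present, and Ollama stays the default otherwise.
    """
    if env.get("ANTHROPIC_API_KEY"):
        return "langchain_anthropic.ChatAnthropic"
    if env.get("OPENAI_API_KEY"):
        return "langchain_openai.ChatOpenAI"
    return "langchain_ollama.ChatOllama"

backend = choose_backend({"ANTHROPIC_API_KEY": "sk-placeholder"})
```

Because all three classes share the LangChain chat-model interface, the narration code does not change.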

**Output:** Export results as JSON to a shared location for downstream dashboard ingestion. Add a Slack or email alert when any ERROR-severity issues are found.
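A sketch of the JSON export plus the alert decision, assuming issue dicts carry a `severity` field as in the sample output at the end of this README:

```python
import json
from pathlib import Path

def export_report(issues: list[dict], out_path: Path) -> bool:
    """Write findings as JSON; return True when any ERROR-severity issue should alert."""
    out_path.write_text(json.dumps({"issues": issues}, indent=2))
    return any(issue.get("severity") == "ERROR" for issue in issues)

issues = [
    {"severity": "WARNING", "check": "missing_values", "column": "revenue"},
    {"severity": "ERROR", "check": "dod_drop", "column": "orders"},
]
needs_alert = export_report(issues, Path("report.json"))
```

The returned flag is where a Slack webhook or email call would hook in; the JSON file is what a dashboard job would ingest.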

**Thresholds:** Move `ANOMALY_THRESHOLD` from a hardcoded constant to a YAML config file so different teams can tune sensitivity without changing code.
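A sketch of that config, assuming PyYAML and a hypothetical `checks.anomaly_threshold` key, with the current hardcoded value kept as the default:

```python
import yaml

# Hypothetical per-team config; a finance team might tolerate larger swings
CONFIG_YAML = """
checks:
  anomaly_threshold: 0.25
"""

def load_threshold(raw: str, default: float = 0.20) -> float:
    """Read the anomaly threshold from YAML, falling back to the old constant."""
    config = yaml.safe_load(raw) or {}
    return float(config.get("checks", {}).get("anomaly_threshold", default))

threshold = load_threshold(CONFIG_YAML)
```

A missing or empty config file leaves behavior identical to the current hardcoded constant.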


## Sample Output (no Ollama)

```text
============================================================
  ANALYTICS VALIDATION REPORT
  Generated : 2024-04-01 09:15:32
  Data file : data.csv
============================================================

---- [ SECTION 1: DATA OVERVIEW ] --------------------------
  Date Range    : 2024-01-01  to  2024-03-30
  Actual Rows   : 89
  Expected Rows : 90
  Row Status    : 1 row gap detected

---- [ SECTION 2: KPI STATISTICS ] -------------------------

  REVENUE (USD)
    Min      : $28,341.00
    Max      : $51,203.40
    Mean     : $44,782.15
    ...

---- [ SECTION 3: DETECTED ISSUES ] ------------------------

  Total issues found: 6

  [WARNING] missing_values | Column: revenue
    Detail : 1 missing value(s) in 'revenue' column
    Dates  : 2024-02-05
  ...

---- [ SECTION 4: EXECUTIVE SUMMARY ] ----------------------

  Source: Rule-Based Fallback (Ollama unavailable)

  The dataset spans 89 rows from 2024-01-01 to 2024-03-30
  (expected 90 calendar days). Missing values were identified
  in column(s): orders, revenue. A row count gap of 1 was
  detected in the date sequence. 3 day-over-day drop(s)
  exceeding the 20% anomaly threshold were flagged...
```