---
title: Analytics Validation Demo
emoji: π
colorFrom: blue
colorTo: indigo
sdk: streamlit
sdk_version: 1.35.0
app_file: app.py
pinned: false
license: mit
---
# Analytics Validation Demo
A local, demo-ready automation tool that helps data analysts validate business metrics before presenting them to leadership. Built with Python and LangChain.
## What Problem This Solves
Data analysts frequently receive raw CSV exports from finance or operations systems and must manually scan them for quality issues (missing values, date gaps, sudden metric drops) before any numbers are presented to executives.

This tool automates that first-pass validation. It runs a deterministic rule engine against daily KPI data, surfaces detected issues with precise context (which column, which date, how large the drop), and optionally uses a local LLM to narrate the findings in plain business English.

The goal is to give an analyst a clean, trustworthy summary in seconds, not to replace the analyst's judgment.
## What Is Automated vs. What Is Not
| Automated | Not Automated |
|---|---|
| Missing value detection (per column, per date) | Root cause investigation |
| Row count / date gap detection | Database reconciliation |
| Day-over-day drop flagging (>20% threshold) | Trend analysis or forecasting |
| Descriptive statistics (min/max/mean/std/total) | Seasonality modeling |
| Duplicate date detection | Dashboard or report publishing |
| Plain-language issue narration (via LLM, optional) | Alerting or notifications |
| Fallback text summary when LLM unavailable | Automated decision-making |
The system will never invent data, speculate on causes, or auto-correct issues. It only reports what it finds.
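The day-over-day drop check, for instance, reduces to a few lines of pandas. The sketch below illustrates the idea; the function name and exact structure are ours, not necessarily what `run_checks()` in `demo.py` uses:

```python
import pandas as pd

ANOMALY_THRESHOLD = 0.20  # flag day-over-day drops larger than 20%

def flag_drops(df: pd.DataFrame, column: str, threshold: float = ANOMALY_THRESHOLD):
    """Return (date, pct_change) pairs where the metric fell by more
    than `threshold` versus the previous day."""
    pct = df[column].pct_change()  # first row is NaN, so it is never flagged
    drops = pct[pct < -threshold]
    return [(df.loc[i, "date"], round(p * 100, 1)) for i, p in drops.items()]
```

Because the output names the exact date and percentage, the finding is auditable: the same input always yields the same flagged rows.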
## How to Run Locally
### Prerequisites
- Python 3.10 or later
- pip
### Install dependencies

```bash
pip install pandas langchain-core langchain-community langchain-ollama
```
### Run the demo

```bash
cd analytics_langchain_demo
python demo.py
```
The script reads `data.csv` from the same directory and prints a structured validation report to the console.
### Optional: Enable LLM summaries via Ollama

If you have Ollama installed:

```bash
# Pull the model (one-time, ~2 GB)
ollama pull llama3.2

# Start the Ollama server (in a separate terminal)
ollama serve
```
Then run `python demo.py` as normal. The report will include an LLM-generated executive summary instead of the rule-based fallback text.

If Ollama is not running, the demo still produces a complete report: it logs a message to stderr and uses the deterministic fallback automatically.
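The fallback behavior follows a simple try-then-template shape. This sketch shows the pattern only; the function name and the `llm_call` parameter are illustrative, not the exact API of `llm_utils.py`:

```python
import sys

def summarize(findings: list, llm_call=None) -> str:
    """Narrate findings via an LLM when one is reachable, else fall
    back to a deterministic template. `llm_call` is any callable that
    takes a prompt string and returns text (e.g. a LangChain chain)."""
    if llm_call is not None:
        try:
            prompt = "Summarize these findings plainly:\n" + "\n".join(findings)
            return llm_call(prompt)
        except Exception as exc:  # server not running, model missing, etc.
            print(f"LLM unavailable ({exc}); using fallback.", file=sys.stderr)
    # Deterministic fallback: same facts, template-generated wording.
    return f"{len(findings)} issue(s) detected: " + "; ".join(findings)
```

Either branch consumes the same pre-computed findings, which is what keeps the two output modes functionally identical.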
## Project Structure
```
analytics_langchain_demo/
├── demo.py        Main runner. Contains load_data(), run_checks(),
│                  format_console_output(), and main(). All rule-engine
│                  logic lives here. No LLM calls.
│
├── llm_utils.py   LangChain integration layer. Tries Ollama; falls back
│                  to a deterministic text summary if unavailable.
│                  The only file that imports LangChain.
│
├── data.csv       90-day synthetic daily metrics (revenue, orders) with
│                  injected missing values, a date gap, and anomaly drops.
│
└── README.md      This file.
```
## Why LLMs Are Used Cautiously
The rule engine is the source of truth. It produces deterministic, auditable findings: specific columns, specific dates, specific percentage changes. These facts never change between runs.
The LLM only narrates those pre-computed facts in plain language. The prompt explicitly instructs it to:
- Not speculate on causes
- Not introduce information not present in the data
- Use a neutral, executive-friendly tone
This means the LLM cannot hallucinate a root cause or suggest a business conclusion. It is a translator, not an analyst.
If the LLM is unavailable, the output is functionally identical, just template-generated rather than model-generated. This makes the tool safe to use in regulated or high-stakes reporting environments where outputs must be reproducible.
## How This Scales in Production
This demo is intentionally small. Here is how each layer would evolve:
### Data input

Replace `pd.read_csv()` with a SQLAlchemy connector to query directly from a data warehouse (Snowflake, Redshift, BigQuery). The `run_checks()` function signature does not change; it still receives a DataFrame.
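A sketch of that swap, using an in-memory SQLite database to stand in for the warehouse (the table and column names here are hypothetical):

```python
import sqlite3
import pandas as pd

def load_data_from_db(conn) -> pd.DataFrame:
    # Same DataFrame contract as the CSV loader, so run_checks()
    # downstream is untouched.
    return pd.read_sql_query(
        "SELECT date, revenue, orders FROM daily_kpis ORDER BY date", conn
    )

# Demo with SQLite; in production `conn` would instead be a SQLAlchemy
# engine pointed at Snowflake, Redshift, or BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_kpis (date TEXT, revenue REAL, orders INTEGER)")
conn.execute("INSERT INTO daily_kpis VALUES ('2024-01-01', 44000.0, 310)")
df = load_data_from_db(conn)
```

Keeping the DataFrame as the boundary between loading and checking is what makes the data source pluggable.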
### Scheduling

Wrap `main()` in a Prefect or Airflow task to run nightly. Add a `--date` argument so the tool checks a specific reporting period rather than the full CSV.
### LLM

Replace Ollama with a hosted model via `langchain-anthropic` or `langchain-openai`. Inject the API key via environment variable. The `_try_ollama_summary()` function is the only change point.
### Output

Export results as JSON to a shared location for downstream dashboard ingestion. Add a Slack or email alert when any ERROR-severity issues are found.
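A minimal sketch of the JSON export (the report schema and `severity`/`check`/`detail` keys are illustrative, not the demo's exact field names):

```python
import json
from datetime import datetime, timezone

def export_report(issues: list, path: str) -> None:
    """Write findings as JSON for downstream dashboard ingestion."""
    report = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "issue_count": len(issues),
        "issues": issues,
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    # An alerting hook would go here, e.g. post to Slack whenever
    # any issue has severity == "ERROR".
```

Writing machine-readable output alongside the console report lets dashboards consume findings without parsing text.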
### Thresholds

Move `ANOMALY_THRESHOLD` from a hardcoded constant to a YAML config file so different teams can tune sensitivity without changing code.
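One possible shape for that config file (the file name and keys below are hypothetical, not part of the demo):

```yaml
# validation_config.yaml (hypothetical)
anomaly_threshold: 0.20   # flag day-over-day drops larger than 20%
expected_row_interval: daily
metrics:
  - revenue
  - orders
```

Each team could then check in its own copy with a different threshold, leaving the rule-engine code untouched.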
## Sample Output (no Ollama)

```
============================================================
ANALYTICS VALIDATION REPORT
Generated : 2024-04-01 09:15:32
Data file : data.csv
============================================================

---- [ SECTION 1: DATA OVERVIEW ] --------------------------
Date Range    : 2024-01-01 to 2024-03-30
Actual Rows   : 89
Expected Rows : 90
Row Status    : 1 row gap detected

---- [ SECTION 2: KPI STATISTICS ] -------------------------
REVENUE (USD)
  Min  : $28,341.00
  Max  : $51,203.40
  Mean : $44,782.15
  ...

---- [ SECTION 3: DETECTED ISSUES ] ------------------------
Total issues found: 6

[WARNING] missing_values | Column: revenue
  Detail : 1 missing value(s) in 'revenue' column
  Dates  : 2024-02-05
  ...

---- [ SECTION 4: EXECUTIVE SUMMARY ] ----------------------
Source: Rule-Based Fallback (Ollama unavailable)

The dataset spans 89 rows from 2024-01-01 to 2024-03-30
(expected 90 calendar days). Missing values were identified
in column(s): orders, revenue. A row count gap of 1 was
detected in the date sequence. 3 day-over-day drop(s)
exceeding the 20% anomaly threshold were flagged...
```