Arunvithyasegar committed on
Commit 7878c12 · verified · 1 Parent(s): 362dd69

Upload 6 files

Files changed (6):
  1. README.md +202 -13
  2. app.py +127 -0
  3. data.csv +90 -0
  4. demo.py +396 -0
  5. llm_utils.py +240 -0
  6. requirements.txt +4 -3
README.md CHANGED
@@ -1,20 +1,209 @@
  ---
- title: Analytics Bot
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
  pinned: false
- short_description: Streamlit template space
  license: mit
  ---

- # Welcome to Streamlit!

- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:

- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
  ---
+ title: Analytics Validation Demo
+ emoji: 📊
+ colorFrom: blue
+ colorTo: indigo
+ sdk: streamlit
+ sdk_version: 1.35.0
+ app_file: app.py
  pinned: false
  license: mit
  ---

+ # Analytics Validation Demo
+
+ A local, demo-ready automation tool that helps data analysts validate business
+ metrics before presenting them to leadership. Built with Python and LangChain.
+
+ ---
+
+ ## What Problem This Solves
+
+ Data analysts frequently receive raw CSV exports from finance or operations
+ systems and must manually scan them for quality issues — missing values, date
+ gaps, sudden metric drops — before any numbers are presented to executives.
+
+ This tool automates that first-pass validation. It runs a deterministic rule
+ engine against daily KPI data, surfaces detected issues with precise context
+ (which column, which date, how large the drop), and optionally uses a local
+ LLM to narrate the findings in plain business English.
+
+ **The goal is to give an analyst a clean, trustworthy summary in seconds —
+ not to replace the analyst's judgment.**
+
+ ---
+
+ ## What Is Automated vs. What Is Not
+
+ | Automated | Not Automated |
+ |---|---|
+ | Missing value detection (per column, per date) | Root cause investigation |
+ | Row count / date gap detection | Database reconciliation |
+ | Day-over-day drop flagging (>20% threshold) | Trend analysis or forecasting |
+ | Descriptive statistics (min/max/mean/std/total) | Seasonality modeling |
+ | Duplicate date detection | Dashboard or report publishing |
+ | Plain-language issue narration (via LLM, optional) | Alerting or notifications |
+ | Fallback text summary when LLM unavailable | Automated decision-making |
+
+ The system will **never** invent data, speculate on causes, or auto-correct
+ issues. It only reports what it finds.
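As a minimal illustration (independent of the project code, which uses pandas), the day-over-day drop rule from the table above reduces to one comparison per consecutive pair of values:

```python
ANOMALY_THRESHOLD = -0.20  # flag day-over-day drops larger than 20%

def flag_drops(values):
    """Return the indices where the day-over-day change falls below the threshold."""
    flagged = []
    for i in range(1, len(values)):
        prev, curr = values[i - 1], values[i]
        if prev:  # skip a zero/missing baseline to avoid division by zero
            change = (curr - prev) / prev
            if change < ANOMALY_THRESHOLD:
                flagged.append(i)
    return flagged

print(flag_drops([100, 95, 70, 72]))  # [2]: 95 -> 70 is a -26.3% drop
```

The function name here is illustrative; the real implementation lives in `run_checks()` in `demo.py`.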
+
+ ---
+
+ ## How to Run Locally
+
+ ### Prerequisites
+
+ - Python 3.10 or later
+ - pip
+
+ ### Install dependencies
+
+ ```bash
+ pip install pandas langchain-core langchain-community langchain-ollama
+ ```
+
+ ### Run the demo
+
+ ```bash
+ cd analytics_langchain_demo
+ python demo.py
+ ```
+
+ The script reads `data.csv` from the same directory and prints a structured
+ validation report to the console.
+
+ ### Optional: Enable LLM summaries via Ollama
+
+ If you have [Ollama](https://ollama.com) installed:
+
+ ```bash
+ # Pull the model (one-time, ~2 GB)
+ ollama pull llama3.2
+
+ # Start the Ollama server (in a separate terminal)
+ ollama serve
+ ```
+
+ Then run `python demo.py` as normal. The report will include an LLM-generated
+ executive summary instead of the rule-based fallback text.
+
+ If Ollama is not running, the demo still produces a complete report — it logs
+ a message to stderr and uses the deterministic fallback automatically.
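The graceful-degradation behaviour described above follows a plain try/except pattern. A sketch, with a stub standing in for the real Ollama call (the actual logic lives in `llm_utils.py`):

```python
import sys

def _llm_narrate(text: str) -> str:
    # Stand-in for the Ollama call; here we assume the server is down.
    raise ConnectionError("Ollama server not reachable")

def summarize(findings_text: str) -> str:
    """Prefer LLM narration; fall back to a deterministic template."""
    try:
        return _llm_narrate(findings_text)
    except ConnectionError as exc:
        # Log to stderr so the report on stdout stays clean.
        print(f"LLM unavailable ({exc}); using fallback", file=sys.stderr)
        return f"Rule-based summary: {findings_text}"

print(summarize("1 missing value in revenue"))
```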
+
+ ---
+
+ ## Project Structure
+
+ ```
+ analytics_langchain_demo/
+ ├── app.py           Streamlit front end. Reuses load_data() and run_checks()
+ │                    from demo.py and generate_summary() from llm_utils.py.
+ │
+ ├── demo.py          Main runner. Contains load_data(), run_checks(),
+ │                    format_console_output(), and main(). All rule-engine
+ │                    logic lives here. No LLM calls.
+ │
+ ├── llm_utils.py     LangChain integration layer. Tries Ollama; falls back
+ │                    to a deterministic text summary if unavailable.
+ │                    The only file that imports LangChain.
+ │
+ ├── data.csv         90-day synthetic daily metrics (revenue, orders) with
+ │                    injected missing values, a date gap, and anomaly drops.
+ │
+ └── README.md        This file.
+ ```
+
+ ---
+
+ ## Why LLMs Are Used Cautiously
+
+ The rule engine is the source of truth. It produces deterministic, auditable
+ findings: specific columns, specific dates, specific percentage changes. These
+ facts never change between runs.
+
+ The LLM only narrates those pre-computed facts in plain language. The prompt
+ explicitly instructs it to:
+
+ - Not speculate on causes
+ - Not introduce information not present in the data
+ - Use a neutral, executive-friendly tone
+
+ These constraints make it far less likely that the LLM invents a root cause
+ or suggests a business conclusion. It is a translator, not an analyst.
+
+ If the LLM is unavailable, the output is functionally equivalent — just
+ template-generated rather than model-generated. This makes the tool safe to
+ use in regulated or high-stakes reporting environments where outputs must
+ be reproducible.
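The reproducibility argument is easy to demonstrate: a template-generated summary is a pure function of the findings, so identical inputs always yield identical text. A sketch (function name and fields are illustrative, not the actual `llm_utils.py` implementation):

```python
def fallback_summary(rows: int, start: str, end: str, issue_count: int) -> str:
    """Deterministic, template-generated summary: identical inputs always
    yield identical output, which keeps reports reproducible and auditable."""
    return (
        f"The dataset spans {rows} rows from {start} to {end}. "
        f"The rule engine flagged {issue_count} issue(s)."
    )

a = fallback_summary(89, "2024-01-01", "2024-03-30", 6)
b = fallback_summary(89, "2024-01-01", "2024-03-30", 6)
print(a == b)  # True: byte-for-byte identical across runs
```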
+
+ ---
+
+ ## How This Scales in Production
+
+ This demo is intentionally small. Here is how each layer would evolve:
+
+ **Data input**
+ Replace `pd.read_csv()` with a SQLAlchemy connector to query directly from a
+ data warehouse (Snowflake, Redshift, BigQuery). The `run_checks()` function
+ signature does not change — it still receives a DataFrame.
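A sketch of that swap. In production the connection would be a SQLAlchemy engine pointed at the warehouse; stdlib `sqlite3` stands in here so the example is self-contained, and the table name `daily_kpis` is an assumption:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a real warehouse connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_kpis (date TEXT, revenue REAL, orders INTEGER)")
conn.execute("INSERT INTO daily_kpis VALUES ('2024-01-01', 44596.55, 305)")
conn.commit()

# Same DataFrame shape as load_data() produces, so run_checks(df)
# would accept it unchanged.
df = pd.read_sql(
    "SELECT date, revenue, orders FROM daily_kpis",
    conn,
    parse_dates=["date"],
)
print(len(df))  # 1
```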
+
+ **Scheduling**
+ Wrap `main()` in a Prefect or Airflow task to run nightly. Add a `--date`
+ argument so the tool checks a specific reporting period rather than the
+ full CSV.
+
+ **LLM**
+ Replace Ollama with a hosted model via `langchain-anthropic` or
+ `langchain-openai`. Inject the API key via environment variable. The
+ `_try_ollama_summary()` function is the only change point.
+
+ **Output**
+ Export `results` as JSON to a shared location for downstream dashboard
+ ingestion. Add a Slack or email alert when any ERROR-severity issues are found.
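The export step might look like the following; `results` below is a hand-written stand-in mirroring the dict returned by `run_checks()`, not real output:

```python
import json

# Stand-in for the rule-engine output (same shape as run_checks()).
results = {
    "issues": [
        {"type": "missing_values", "severity": "WARNING", "column": "revenue"}
    ],
    "stats": {"date_range": {"start": "2024-01-01", "end": "2024-03-30"}},
}

# Serialize for downstream dashboard ingestion; default=str guards against
# non-JSON-native values such as timestamps.
payload = json.dumps(results, indent=2, default=str)

# Gate the hypothetical Slack/email alert on ERROR-severity findings only.
has_errors = any(i["severity"] == "ERROR" for i in results["issues"])
print(has_errors)  # False
```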
+
+ **Thresholds**
+ Move `ANOMALY_THRESHOLD` from a hardcoded constant to a YAML config file
+ so different teams can tune sensitivity without changing code.
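A sketch of that config-driven approach, assuming PyYAML as the parser (an extra dependency; any config format would work the same way, and the key names are illustrative):

```python
import yaml  # PyYAML; an assumed dependency for this sketch

CONFIG_TEXT = """
checks:
  anomaly_threshold: -0.20   # day-over-day drop sensitivity
  required_columns: [date, revenue, orders]
"""

def load_thresholds(text: str) -> dict:
    """Parse check settings from a YAML document."""
    config = yaml.safe_load(text)
    return config["checks"]

checks = load_thresholds(CONFIG_TEXT)
print(checks["anomaly_threshold"])  # -0.2
```

In production the string would come from a file each team owns, so tuning sensitivity becomes a config change rather than a code change.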
+
+ ---
+
+ ## Sample Output (no Ollama)
+
+ ```
+ ============================================================
+  ANALYTICS VALIDATION REPORT
+  Generated : 2024-04-01 09:15:32
+  Data file : data.csv
+ ============================================================
+
+ ---- [ SECTION 1: DATA OVERVIEW ] --------------------------
+  Date Range    : 2024-01-01 to 2024-03-30
+  Actual Rows   : 89
+  Expected Rows : 90
+  Row Status    : 1 row gap detected
+
+ ---- [ SECTION 2: KPI STATISTICS ] -------------------------
+
+  REVENUE (USD)
+    Min     : $28,341.00
+    Max     : $51,203.40
+    Mean    : $44,782.15
+    ...
+
+ ---- [ SECTION 3: DETECTED ISSUES ] ------------------------
+
+  Total issues found: 6
+
+  [WARNING] missing_values | Column: revenue
+    Detail : 1 missing value(s) in 'revenue' column
+    Dates  : 2024-02-05
+  ...
+
+ ---- [ SECTION 4: EXECUTIVE SUMMARY ] ----------------------
+
+ Source: Rule-Based Fallback (Ollama unavailable)
+
+ The dataset spans 89 rows from 2024-01-01 to 2024-03-30
+ (expected 90 calendar days). Missing values were identified
+ in column(s): orders, revenue. A row count gap of 1 was
+ detected in the date sequence. 3 day-over-day drop(s)
+ exceeding the 20% anomaly threshold were flagged...
+ ```
app.py ADDED
@@ -0,0 +1,127 @@
+ import streamlit as st
+ import pandas as pd
+ from pathlib import Path
+ from demo import load_data, run_checks
+ from llm_utils import generate_summary
+
+ # ---------------------------------------------------------------------------
+ # Configuration & Setup
+ # ---------------------------------------------------------------------------
+ st.set_page_config(
+     page_title="Analytics Validation Demo",
+     page_icon="📊",
+     layout="wide",
+ )
+
+ # ---------------------------------------------------------------------------
+ # Helper Functions
+ # ---------------------------------------------------------------------------
+ @st.cache_data
+ def load_and_verify_data(filepath):
+     """Load data using the existing validation logic from demo.py."""
+     return load_data(filepath)
+
+
+ def display_issues(issues):
+     """Render the list of issues as Streamlit alerts."""
+     if not issues:
+         st.success("✅ No issues detected. Data appears clean.")
+         return
+
+     st.warning(f"⚠️ Found {len(issues)} issue(s)")
+
+     for issue in issues:
+         severity_icon = "🔴" if issue["severity"] == "ERROR" else "⚠️"
+         # "column" is always present but may be None, so use `or` rather
+         # than .get's default to supply the "General" label.
+         col_label = issue.get("column") or "General"
+         with st.expander(f"{severity_icon} [{issue['severity']}] {issue['type']} in {col_label}"):
+             st.write(f"**Detail:** {issue['detail']}")
+             if issue.get("dates"):
+                 st.write(f"**Affected Dates:** {', '.join(issue['dates'])}")
+
+
+ def display_metrics(stats):
+     """Render key metrics in columns."""
+     st.subheader("Key Metrics")
+
+     col1, col2 = st.columns(2)
+
+     with col1:
+         rev = stats["revenue"]
+         st.metric("Total Revenue", f"${rev['total']:,.2f}")
+         st.caption(f"Mean: ${rev['mean']:,.2f} | Missing: {rev['missing_count']}")
+
+     with col2:
+         ord_ = stats["orders"]
+         st.metric("Total Orders", f"{int(ord_['total']):,}")
+         st.caption(f"Mean: {ord_['mean']:,.0f} | Missing: {ord_['missing_count']}")
+
+     st.divider()
+
+ # ---------------------------------------------------------------------------
+ # Main App Layout
+ # ---------------------------------------------------------------------------
+ def main():
+     st.title("📊 Analytics Validation Engine")
+     st.markdown(
+         """
+         This demo validates daily business metrics for anomalies, missing data,
+         and consistency errors. It uses a deterministic rule engine and an
+         optional local LLM for narration.
+         """
+     )
+
+     # Sidebar parameters
+     st.sidebar.header("Configuration")
+     uploaded_file = st.sidebar.file_uploader("Upload CSV", type=["csv"])
+
+     # Fall back to the bundled data.csv when nothing is uploaded
+     data_path = "data.csv"
+     if uploaded_file is None and not Path(data_path).exists():
+         st.error("❌ data.csv not found and no file uploaded.")
+         st.stop()
+
+     # Load data
+     try:
+         if uploaded_file is not None:
+             # Uploaded files arrive as in-memory buffers, so validate the
+             # schema inline instead of calling load_data(), which expects
+             # a filepath string.
+             df = pd.read_csv(uploaded_file, parse_dates=["date"])
+             if not pd.api.types.is_datetime64_any_dtype(df["date"]):
+                 df["date"] = pd.to_datetime(df["date"])
+             df = df.sort_values("date").reset_index(drop=True)
+         else:
+             df = load_and_verify_data(data_path)
+
+         st.sidebar.success(f"Loaded {len(df)} rows")
+
+     except Exception as e:
+         st.error(f"Error loading data: {e}")
+         st.stop()
+
+     # Run analysis
+     with st.spinner("Running validation rules..."):
+         results = run_checks(df)
+
+     # 1. Executive summary (synchronous for now; streaming the LLM output
+     # into a placeholder is a possible future optimization)
+     st.header("Executive Summary")
+     with st.spinner("Generating summary (LLM or rule-based)..."):
+         summary = generate_summary(results)
+     st.info(summary)
+
+     # 2. Detailed issues
+     st.header("Detected Issues")
+     display_issues(results["issues"])
+
+     # 3. Data overview
+     st.header("Data Overview")
+     display_metrics(results["stats"])
+
+     with st.expander("View Raw Data"):
+         st.dataframe(df)
+
+
+ if __name__ == "__main__":
+     main()
data.csv ADDED
@@ -0,0 +1,90 @@
+ date,revenue,orders
+ 2024-01-01,44596.55,305
+ 2024-01-02,44688.32,317
+ 2024-01-03,44642.75,290
+ 2024-01-04,45930.49,314
+ 2024-01-05,44392.52,308
+ 2024-01-06,49250.43,354
+ 2024-01-07,50438.58,349
+ 2024-01-08,42932.7,284
+ 2024-01-09,45689.76,331
+ 2024-01-10,45116.64,310
+ 2024-01-11,46488.97,303
+ 2024-01-12,44125.62,310
+ 2024-01-13,51045.53,349
+ 2024-01-14,49654.48,345
+ 2024-01-15,47190.52,312
+ 2024-01-16,46591.1,303
+ 2024-01-17,37664.15,252
+ 2024-01-18,42435.73,303
+ 2024-01-19,46859.94,309
+ 2024-01-20,50972.61,340
+ 2024-01-21,48358.52,330
+ 2024-01-22,45320.38,322
+ 2024-01-23,46787.56,327
+ 2024-01-24,46819.85,329
+ 2024-01-25,43244.44,290
+ 2024-01-26,43684.09,307
+ 2024-01-27,47899.68,358
+ 2024-01-28,46305.98,306
+ 2024-01-29,47151.73,342
+ 2024-01-30,46415.94,330
+ 2024-01-31,48993.77,337
+ 2024-02-01,41015.71,276
+ 2024-02-02,47668.11,311
+ 2024-02-03,48693.89,339
+ 2024-02-04,47716.34,338
+ 2024-02-05,,349
+ 2024-02-06,46735.91,
+ 2024-02-07,43426.97,290
+ 2024-02-08,47666.37,322
+ 2024-02-09,44803.27,318
+ 2024-02-10,46574.3,318
+ 2024-02-11,43444.4,287
+ 2024-02-12,43410.34,304
+ 2024-02-13,48341.78,333
+ 2024-02-14,30842.06,212
+ 2024-02-15,48037.28,342
+ 2024-02-16,45766.34,303
+ 2024-02-17,51129.48,357
+ 2024-02-18,52035.44,359
+ 2024-02-19,50468.68,344
+ 2024-02-21,49460.57,342
+ 2024-02-22,43554.45,287
+ 2024-02-23,44577.11,325
+ 2024-02-24,50885.85,359
+ 2024-02-25,41947.56,298
+ 2024-02-26,46556.39,314
+ 2024-02-27,43243.32,298
+ 2024-02-28,49829.65,331
+ 2024-02-29,43802.15,318
+ 2024-03-01,43750.78,297
+ 2024-03-02,48873.77,322
+ 2024-03-03,49215.85,325
+ 2024-03-04,47478.56,327
+ 2024-03-05,51393.61,358
+ 2024-03-06,48823.93,321
+ 2024-03-07,44658.01,312
+ 2024-03-08,49888.02,324
+ 2024-03-09,51373.74,361
+ 2024-03-10,33649.8,373
+ 2024-03-11,45145.8,305
+ 2024-03-12,41505.1,289
+ 2024-03-13,44463.25,331
+ 2024-03-14,43289.13,302
+ 2024-03-15,40606.8,275
+ 2024-03-16,49331.21,350
+ 2024-03-17,52654.68,363
+ 2024-03-18,41871.71,294
+ 2024-03-19,46447.56,326
+ 2024-03-20,43039.19,310
+ 2024-03-21,45246.11,320
+ 2024-03-22,48567.41,342
+ 2024-03-23,49402.24,367
+ 2024-03-24,49282.37,336
+ 2024-03-25,45313.51,330
+ 2024-03-26,45332.39,319
+ 2024-03-27,48348.05,327
+ 2024-03-28,40165.6,281
+ 2024-03-29,45643.26,307
+ 2024-03-30,51039.04,359
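One of the injected defects above is a calendar gap: 2024-02-20 is absent between 2024-02-19 and 2024-02-21. The row-count check in `run_checks()` catches it by comparing actual rows to expected calendar days; a sketch on a three-row excerpt:

```python
import io
import pandas as pd

# Tiny excerpt mirroring the injected date gap (2024-02-20 is missing).
CSV_TEXT = """date,revenue,orders
2024-02-18,52035.44,359
2024-02-19,50468.68,344
2024-02-21,49460.57,342
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["date"])

# Expected rows = inclusive span of calendar days; any shortfall is a gap.
expected = (df["date"].max() - df["date"].min()).days + 1
gap = expected - len(df)
print(gap)  # 1
```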
demo.py ADDED
@@ -0,0 +1,396 @@
+ """
+ demo.py — Analytics Validation Demo
+
+ Entry point for the analytics automation demo. Loads a CSV of daily business
+ metrics, runs a deterministic rule-based validation engine, optionally enriches
+ the findings with an LLM narrative, and prints a structured console report.
+
+ Usage:
+     python demo.py
+
+ The script expects data.csv to be in the same directory as this file.
+ """
+
+ from __future__ import annotations
+
+ import sys
+ from datetime import datetime
+ from pathlib import Path
+
+ import pandas as pd  # type: ignore
+
+ from llm_utils import generate_summary  # type: ignore
+
+ # ---------------------------------------------------------------------------
+ # Constants
+ # ---------------------------------------------------------------------------
+
+ DIVIDER = "=" * 60
+ SECTION_DIV = "-" * 60
+
+ # A day-over-day drop beyond this threshold is flagged as an anomaly.
+ # Why 20%: this is a common business heuristic for "something unusual happened"
+ # in daily revenue or order metrics. Adjust for your domain as needed.
+ ANOMALY_THRESHOLD = -0.20
+
+ REQUIRED_COLUMNS = {"date", "revenue", "orders"}
+
+
+ # ---------------------------------------------------------------------------
+ # Data loading
+ # ---------------------------------------------------------------------------
+
+
+ def load_data(filepath: str) -> pd.DataFrame:
+     """
+     Read and validate the input CSV file.
+
+     Args:
+         filepath: Absolute or relative path to the CSV.
+
+     Returns:
+         A DataFrame with columns [date (datetime64), revenue (float),
+         orders (numeric)], sorted ascending by date.
+
+     Raises:
+         FileNotFoundError: if the file does not exist.
+         ValueError: if required columns are missing or dates cannot be parsed.
+         pd.errors.EmptyDataError: if the file is empty.
+     """
+     df = pd.read_csv(filepath, parse_dates=["date"])
+
+     # Validate required columns before any processing
+     missing_cols = REQUIRED_COLUMNS - set(df.columns)
+     if missing_cols:
+         found = sorted(df.columns.tolist())
+         raise ValueError(
+             f"CSV is missing required column(s): {sorted(missing_cols)}. "
+             f"Found: {found}"
+         )
+
+     # Confirm the date column parsed correctly — bad date formats produce object dtype
+     if not pd.api.types.is_datetime64_any_dtype(df["date"]):
+         raise ValueError(
+             "The 'date' column could not be parsed as dates. "
+             "Ensure dates are in YYYY-MM-DD format."
+         )
+
+     # Sort ascending so day-over-day calculations are always forward-looking
+     df = df.sort_values("date").reset_index(drop=True)
+     return df
+
+
+ # ---------------------------------------------------------------------------
+ # Rule engine
+ # ---------------------------------------------------------------------------
+
+
+ def run_checks(df: pd.DataFrame) -> dict:
+     """
+     Run all deterministic validation checks against the DataFrame.
+
+     This function never calls any LLM or external service. Every result is
+     derived purely from the data using pandas arithmetic and comparisons.
+
+     Args:
+         df: Clean DataFrame from load_data().
+
+     Returns:
+         A dict with keys:
+             "issues": list of issue dicts (see _make_issue for structure)
+             "stats": dict with "revenue", "orders", and "date_range" sub-dicts
+     """
+     issues: list[dict] = []
+
+     # -- Check A: Missing values -----------------------------------------
+     # NaN values in numeric columns make any aggregation over that period
+     # potentially misleading. Surface them so analysts know before reporting.
+     for col in ["revenue", "orders"]:
+         missing_mask = df[col].isna()
+         missing_count = int(missing_mask.sum())
+         if missing_count > 0:
+             affected_dates = df.loc[missing_mask, "date"].dt.strftime("%Y-%m-%d").tolist()
+             issues.append(
+                 _make_issue(
+                     type_="missing_values",
+                     severity="WARNING",
+                     column=col,
+                     detail=f"{missing_count} missing value(s) in '{col}' column",
+                     dates=affected_dates,
+                     value=float(missing_count),
+                 )
+             )
+
+     # -- Check B: Row count consistency ----------------------------------
+     # A missing date in a time series is invisible to most BI tools and
+     # creates silent gaps in trend lines. Flagging it early prevents charts
+     # that appear continuous but are actually dropping a day of data.
+     date_start = df["date"].min()
+     date_end = df["date"].max()
+     expected_rows = (date_end - date_start).days + 1
+     actual_rows = len(df)
+     row_gap = expected_rows - actual_rows
+
+     if row_gap != 0:
+         issues.append(
+             _make_issue(
+                 type_="row_count",
+                 severity="WARNING",
+                 column=None,
+                 detail=(
+                     f"Expected {expected_rows} rows for date range "
+                     f"{date_start.date()} to {date_end.date()}, "
+                     f"found {actual_rows} (gap: {row_gap})"
+                 ),
+                 dates=[],
+                 value=float(row_gap),
+             )
+         )
+
+     # -- Check C: Day-over-day anomaly drops -----------------------------
+     # A >20% single-day drop almost always signals either a data quality
+     # problem or a significant business event requiring executive attention.
+     # We drop NaN rows before calling pct_change() to prevent a missing value
+     # from propagating into the percentage calculation for adjacent rows.
+     for col in ["revenue", "orders"]:
+         series_clean = df[["date", col]].dropna(subset=[col]).copy()
+         series_clean["pct_change"] = series_clean[col].pct_change()
+
+         anomalies = series_clean[series_clean["pct_change"] < ANOMALY_THRESHOLD]
+         for _, row in anomalies.iterrows():
+             pct = float(row["pct_change"])
+             issues.append(
+                 _make_issue(
+                     type_="anomaly_drop",
+                     severity="WARNING",
+                     column=col,
+                     detail=(
+                         f"'{col}' dropped {pct:.1%} on "
+                         f"{row['date'].strftime('%Y-%m-%d')}"
+                     ),
+                     dates=[str(row["date"].strftime("%Y-%m-%d"))],
+                     value=round(float(pct * 100), 2),  # type: ignore
+                 )
+             )
+
+     # -- Check D: Duplicate dates ----------------------------------------
+     # Duplicate dates cause silent double-counting in GROUP BY aggregations.
+     # This is classified as ERROR (not WARNING) because it corrupts totals.
+     dup_mask = df["date"].duplicated(keep=False)
+     dup_count = int(dup_mask.sum())
+     if dup_count > 0:
+         raw_dates = df.loc[dup_mask, "date"].dt.strftime("%Y-%m-%d").unique().tolist()
+         dup_dates: list[str] = sorted(str(d) for d in raw_dates)
+         issues.append(
+             _make_issue(
+                 type_="duplicate_dates",
+                 severity="ERROR",
+                 column="date",
+                 detail=f"{dup_count} rows share duplicate dates: {', '.join(dup_dates)}",
+                 dates=dup_dates,
+                 value=float(dup_count),
+             )
+         )
+
+     # -- Statistics (always computed; skipna=True so one NaN doesn't block all stats)
+     stats = _compute_stats(df, date_start, date_end, actual_rows, expected_rows, row_gap)
+
+     return {"issues": issues, "stats": stats}
+
+
+ def _make_issue(
+     *,
+     type_: str,
+     severity: str,
+     column: str | None,
+     detail: str,
+     dates: list[str],
+     value: float | None,
+ ) -> dict:
+     """Return a consistently structured issue dict."""
+     return {
+         "type": type_,
+         "severity": severity,
+         "column": column,
+         "detail": detail,
+         "dates": dates,
+         "value": value,
+     }
+
+
+ def _compute_stats(
+     df: pd.DataFrame,
+     date_start: pd.Timestamp,
+     date_end: pd.Timestamp,
+     actual_rows: int,
+     expected_rows: int,
+     row_gap: int,
+ ) -> dict:
+     """Compute descriptive statistics for revenue and orders."""
+     stats: dict = {}
+
+     for col in ["revenue", "orders"]:
+         col_data = df[col]
+         stats[col] = {
+             "min": _safe_round(col_data.min(skipna=True), 2),
+             "max": _safe_round(col_data.max(skipna=True), 2),
+             "mean": _safe_round(col_data.mean(skipna=True), 2),
+             "std": _safe_round(col_data.std(skipna=True), 2),
+             "total": _safe_round(col_data.sum(skipna=True), 2),
+             "missing_count": int(col_data.isna().sum()),
+         }
+
+     stats["date_range"] = {
+         "start": date_start.strftime("%Y-%m-%d"),
+         "end": date_end.strftime("%Y-%m-%d"),
+         "actual_rows": actual_rows,
+         "expected_rows": expected_rows,
+         "row_gap": row_gap,
+     }
+
+     return stats
+
+
+ def _safe_round(value: float | None, ndigits: int) -> float | None:
+     """Round a value, returning None if it is NaN (e.g., all-NaN column)."""
+     try:
+         if value is None or (isinstance(value, float) and pd.isna(value)):
+             return None
+         return round(float(value), int(ndigits))  # type: ignore
+     except (TypeError, ValueError):
+         return None
+
+
+ # ---------------------------------------------------------------------------
+ # Console rendering
+ # ---------------------------------------------------------------------------
+
+
+ def format_console_output(results: dict, llm_summary: str) -> None:
+     """
+     Print a structured, ASCII-safe validation report to stdout.
+
+     This is a pure I/O function. It reads from results and llm_summary only;
+     it does not compute, modify, or validate anything.
+
+     Args:
+         results: The dict returned by run_checks().
+         llm_summary: The string returned by generate_summary().
+     """
+     timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+     dr = results["stats"]["date_range"]
+     issues = results["issues"]
+
+     # Header
+     print(DIVIDER)
+     print(" ANALYTICS VALIDATION REPORT")
+     print(f" Generated : {timestamp}")
+     print(" Data file : data.csv")
+     print(DIVIDER)
+
+     # Section 1: Data Overview
+     _section("1: DATA OVERVIEW")
+     print(f" Date Range    : {dr['start']} to {dr['end']}")
+     print(f" Actual Rows   : {dr['actual_rows']}")
+     print(f" Expected Rows : {dr['expected_rows']}")
+     gap_label = f"{dr['row_gap']} row gap detected" if dr["row_gap"] != 0 else "Consistent"
+     print(f" Row Status    : {gap_label}")
+
+     # Section 2: KPI Statistics
+     _section("2: KPI STATISTICS")
+
+     rev = results["stats"]["revenue"]
+     print("\n REVENUE (USD)")
+     print(f"   Min     : {_fmt_usd(rev['min'])}")
+     print(f"   Max     : {_fmt_usd(rev['max'])}")
+     print(f"   Mean    : {_fmt_usd(rev['mean'])}")
+     print(f"   Std Dev : {_fmt_usd(rev['std'])}")
+     print(f"   Total   : {_fmt_usd(rev['total'])}")
+     print(f"   Missing : {rev['missing_count']} value(s)")
+
+     ord_ = results["stats"]["orders"]
+     print("\n ORDERS")
+     print(f"   Min     : {_fmt_int(ord_['min'])}")
+     print(f"   Max     : {_fmt_int(ord_['max'])}")
+     print(f"   Mean    : {_fmt_int(ord_['mean'])}")
+     print(f"   Std Dev : {_fmt_int(ord_['std'])}")
+     print(f"   Total   : {_fmt_int(ord_['total'])}")
+     print(f"   Missing : {ord_['missing_count']} value(s)")
+
+     # Section 3: Detected Issues
+     _section("3: DETECTED ISSUES")
+     print(f"\n Total issues found: {len(issues)}")
+
+     if not issues:
+         print("\n No issues detected. Data appears clean.")
+     else:
+         for issue in issues:
+             col_label = issue["column"] if issue["column"] else "N/A"
+             print(f"\n [{issue['severity']}] {issue['type']} | Column: {col_label}")
+             print(f"   Detail : {issue['detail']}")
+             if issue["dates"]:
+                 print(f"   Dates  : {', '.join(issue['dates'])}")
+
+     # Section 4: Executive Summary
+     _section("4: EXECUTIVE SUMMARY")
+     print()
+     print(llm_summary)
+
+     # Footer
+     print(f"\n{DIVIDER}")
+     print(" END OF REPORT")
+     print(DIVIDER)
+
+
+ def _section(title: str) -> None:
+     """Print a section divider line."""
+     header = f"---- [ SECTION {title} ] "
+     print(f"\n{header}{'-' * max(0, 60 - len(header))}")
+
+
+ def _fmt_usd(value: float | None) -> str:
+     if value is None:
+         return "N/A"
+     return f"${value:,.2f}"
+
+
+ def _fmt_int(value: float | None) -> str:
+     if value is None:
+         return "N/A"
+     return f"{int(value):,}"
+
+
+ # ---------------------------------------------------------------------------
+ # Entry point
+ # ---------------------------------------------------------------------------
+
+
+ def main() -> None:
+     # Resolve data path relative to this script's directory so the demo works
+     # regardless of where the user invokes it from (e.g., from the repo root).
+     script_dir = Path(__file__).parent
+     data_path = script_dir / "data.csv"
+
+     # Load — surface file and format errors before doing any work
+     try:
+         df = load_data(str(data_path))
+     except FileNotFoundError:
+         print(f"ERROR: data.csv not found at '{data_path}'")
+         print("Ensure data.csv is in the same directory as demo.py.")
+         sys.exit(1)
+     except (ValueError, pd.errors.EmptyDataError) as exc:
+         print(f"ERROR: Could not load data — {exc}")
+         sys.exit(1)
+
+     # Rule engine — always deterministic, never calls external services
+     results = run_checks(df)
+
+     # LLM narration — may fall back to rule-based text (handled inside generate_summary)
+     llm_summary = generate_summary(results)
+
+     # Render report
+     format_console_output(results, llm_summary)
+
+
+ if __name__ == "__main__":
+     main()
llm_utils.py ADDED
@@ -0,0 +1,240 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+"""
+llm_utils.py — LangChain integration layer for the analytics demo.
+
+Design principle: the LLM is an optional narrator, not a decision-maker.
+Every public function must return a valid result even when Ollama is absent.
+"""
+
+from __future__ import annotations
+
+import sys
+
+# Guard against environments where LangChain is not installed.
+# The rule engine in demo.py never imports this at module level for logic,
+# so an ImportError here degrades gracefully to the fallback path.
+#
+# We prefer langchain-ollama (the current, non-deprecated package) and fall
+# back to langchain_community.llms.Ollama for environments that only have
+# the older package installed.
+try:
+    from langchain_core.prompts import PromptTemplate
+
+    try:
+        from langchain_ollama import OllamaLLM as Ollama  # preferred (langchain-ollama)
+    except ImportError:
+        from langchain_community.llms import Ollama  # fallback (older install)
+
+    LANGCHAIN_AVAILABLE = True
+except ImportError:
+    LANGCHAIN_AVAILABLE = False
+
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+
+OLLAMA_BASE_URL = "http://localhost:11434"
+OLLAMA_MODEL = "llama3.2"
+
+# The system instruction is kept as a constant so it can be audited,
+# versioned, and referenced in documentation independently of the template.
+SYSTEM_PROMPT = (
+    "You are assisting a business analyst. "
+    "Explain the following metric validation findings clearly and factually. "
+    "Do not speculate on causes. "
+    "Do not introduce information not present in the data. "
+    "Use a neutral, executive-friendly tone."
+)
+
+
+# ---------------------------------------------------------------------------
+# Internal helpers
+# ---------------------------------------------------------------------------
+
+
+def _build_prompt_template() -> "PromptTemplate":
+    """
+    Construct the LangChain PromptTemplate.
+
+    Why PromptTemplate rather than an f-string: the variable injection point
+    is explicit and the template can be tested and versioned independently
+    of the findings serialization logic.
+    """
+    template = (
+        f"{SYSTEM_PROMPT}\n\n"
+        "Metric Validation Findings:\n"
+        "{findings_text}\n\n"
+        "Provide a concise executive summary (3-5 sentences) of the above findings. "
+        "Stick strictly to the facts presented."
+    )
+    return PromptTemplate(
+        input_variables=["findings_text"],
+        template=template,
+    )
+
+
+def _serialize_findings(findings: dict) -> str:
+    """
+    Convert the structured findings dict into a plain-text paragraph suitable
+    for injection into the LLM prompt.
+
+    Why plain text rather than raw JSON: LLMs produce better, more natural
+    summaries when given prose-style context rather than nested JSON objects.
+    """
+    lines: list[str] = []
+
+    dr = findings["stats"]["date_range"]
+    lines.append(
+        f"Dataset covers {dr['actual_rows']} rows from {dr['start']} to {dr['end']}. "
+        f"Expected {dr['expected_rows']} rows based on the date range "
+        f"(gap: {dr['row_gap']} row(s))."
+    )
+
+    rev = findings["stats"]["revenue"]
+    lines.append(
+        f"Revenue (USD): mean=${rev['mean']:,.2f}, std=${rev['std']:,.2f}, "
+        f"total=${rev['total']:,.2f}, missing={rev['missing_count']} value(s)."
+    )
+
+    ord_ = findings["stats"]["orders"]
+    lines.append(
+        f"Orders: mean={ord_['mean']:,.0f}, std={ord_['std']:,.1f}, "
+        f"total={int(ord_['total']):,}, missing={ord_['missing_count']} value(s)."
+    )
+
+    issues = findings["issues"]
+    if issues:
+        lines.append(f"\nDetected {len(issues)} issue(s):")
+        for issue in issues:
+            date_str = f" on {', '.join(issue['dates'])}" if issue["dates"] else ""
+            lines.append(f"  - [{issue['severity']}] {issue['detail']}{date_str}")
+    else:
+        lines.append("\nNo data quality issues detected. Dataset appears clean.")
+
+    return "\n".join(lines)
+
+
+def _try_ollama_summary(findings_text: str) -> str | None:
+    """
+    Attempt a local Ollama call via LangChain. Returns the summary string on
+    success, or None on any failure (connection refused, model not found, etc.).
+
+    Why return None instead of raising: the caller uses None as a signal to
+    activate the deterministic fallback, keeping all error-handling in one place.
+    Errors are printed to stderr so they don't pollute the report on stdout.
+    """
+    if not LANGCHAIN_AVAILABLE:
+        return None
+
+    try:
+        prompt_template = _build_prompt_template()
+        llm = Ollama(base_url=OLLAMA_BASE_URL, model=OLLAMA_MODEL, timeout=30)
+
+        # LCEL pipe syntax: preferred over deprecated LLMChain
+        chain = prompt_template | llm
+        result = chain.invoke({"findings_text": findings_text})
+
+        # langchain_community.llms.Ollama returns a plain str;
+        # ChatOllama returns an AIMessage — handle both defensively.
+        if hasattr(result, "content"):
+            return result.content.strip() or None
+        return str(result).strip() or None
+
+    except Exception as exc:
+        print(
+            f"[llm_utils] Ollama unavailable ({type(exc).__name__}): {exc}",
+            file=sys.stderr,
+        )
+        return None
+
+
+def _rule_based_summary(findings: dict) -> str:
+    """
+    Generate a deterministic plain-text summary from the findings dict.
+
+    This is the guaranteed fallback when no LLM is available. Template-driven
+    text is auditable, predictable, and consistent — qualities analysts require
+    from a validation tool used in reporting contexts.
+    """
+    dr = findings["stats"]["date_range"]
+    issues = findings["issues"]
+
+    line1 = (
+        f"The dataset spans {dr['actual_rows']} rows from {dr['start']} to "
+        f"{dr['end']} (expected {dr['expected_rows']} calendar days)."
+    )
+
+    if not issues:
+        return (
+            f"  {line1}\n"
+            "  No data quality issues were detected. "
+            "The dataset appears suitable for reporting."
+        )
+
+    parts: list[str] = []
+
+    missing_issues = [i for i in issues if i["type"] == "missing_values"]
+    row_issues = [i for i in issues if i["type"] == "row_count"]
+    anomaly_issues = [i for i in issues if i["type"] == "anomaly_drop"]
+    duplicate_issues = [i for i in issues if i["type"] == "duplicate_dates"]
+
+    if missing_issues:
+        cols = sorted({i["column"] for i in missing_issues})
+        parts.append(f"Missing values were identified in column(s): {', '.join(cols)}.")
+
+    if row_issues:
+        gap = row_issues[0]["value"]
+        parts.append(f"A row count gap of {int(gap)} was detected in the date sequence.")
+
+    if duplicate_issues:
+        parts.append(
+            f"{len(duplicate_issues)} duplicate date(s) were found, "
+            "which may cause double-counting in aggregations."
+        )
+
+    if anomaly_issues:
+        # Report worst single-day drop
+        worst = min(anomaly_issues, key=lambda x: x["value"])
+        parts.append(
+            f"{len(anomaly_issues)} day-over-day drop(s) exceeding the 20% anomaly "
+            f"threshold were flagged; the largest was {worst['value']:.1f}% "
+            f"on {worst['dates'][0]}."
+        )
+
+    parts.append(
+        "These findings should be reviewed and resolved before this dataset "
+        "is used in executive or board-level reporting."
+    )
+
+    body = " ".join(parts)
+    return f"  {line1}\n  {body}"
+
+
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+
+def generate_summary(findings: dict) -> str:
+    """
+    Generate a human-readable summary of the findings dict.
+
+    Attempts an Ollama (local LLM) call first; falls back to a deterministic
+    rule-based summary if Ollama is unavailable. Always returns a non-empty string.
+
+    Args:
+        findings: The dict returned by demo.run_checks()
+
+    Returns:
+        A formatted summary string including a source label.
+    """
+    findings_text = _serialize_findings(findings)
+    llm_result = _try_ollama_summary(findings_text)
+
+    if llm_result:
+        source_label = f"  Source: Ollama ({OLLAMA_MODEL})\n"
+        # Indent each line of LLM output to match the report's 2-space style
+        indented = "\n".join(f"  {line}" for line in llm_result.splitlines())
+        return source_label + "\n" + indented
+
+    source_label = "  Source: Rule-Based Fallback (Ollama unavailable)\n"
+    return source_label + "\n" + _rule_based_summary(findings)
requirements.txt CHANGED
@@ -1,3 +1,4 @@
-altair
-pandas
-streamlit
+pandas
+langchain-ollama
+streamlit
+watchdog
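
The central design decision in this commit is that the LLM only narrates, while a deterministic path guarantees output. The try-then-fall-back pattern from `generate_summary` can be sketched standalone as below; this is an illustrative stub, not the module itself, and the Ollama call is replaced by an always-failing function so the fallback path is exercised:

```python
from __future__ import annotations


def try_llm_summary(text: str) -> str | None:
    """Stand-in for _try_ollama_summary: any failure maps to None."""
    try:
        # Simulates Ollama being unreachable; the real code catches
        # exceptions raised by the LangChain chain at this point.
        raise ConnectionError("Ollama not running")
    except Exception:
        return None


def rule_based_summary(findings: dict) -> str:
    """Deterministic fallback, always available."""
    n = len(findings["issues"])
    if n == 0:
        return "No data quality issues were detected."
    return f"Detected {n} issue(s); review before reporting."


def generate_summary(findings: dict) -> str:
    """LLM first, deterministic fallback second; never raises."""
    llm_text = try_llm_summary(str(findings))
    if llm_text:
        return "Source: LLM\n" + llm_text
    return "Source: Rule-Based Fallback\n" + rule_based_summary(findings)


print(generate_summary({"issues": [{"type": "missing_values"}]}))
# prints:
# Source: Rule-Based Fallback
# Detected 1 issue(s); review before reporting.
```

Because the fallback never touches the network, the report renders identically whether or not a local model is running; only the source label and the summary wording change.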