CODIN Feature Guide

What This Project Is

CODIN is an end-to-end AutoML studio built with:

  • FastAPI for APIs and backend orchestration
  • Streamlit for the multi-page workspace UI
  • SQLAlchemy + SQLite for datasets, jobs, experiment history, notes, registry labels, and drift records
  • scikit-learn, LightGBM, Optuna, and custom pipeline logic for training and evaluation

The product is designed as a guided ML workspace rather than a single training script. A user can ingest data, profile it, train models, inspect results, simulate predictions, detect drift, generate synthetic data, compare runs, and export reports from the same app.

Core Product Flow

  1. A dataset is uploaded or imported.
  2. The backend profiles the dataset and stores metadata in datasets.
  3. A training job is created and the pipeline runs validation, cleaning, feature engineering, model selection, training, and evaluation.
  4. Results are saved to run artifacts and experiment history.
  5. The user explores explainability, contracts, what-if tools, drift, augmentation, and reports.

Main User-Facing Workspaces

1. Home / Upload And Configure

Primary purpose:

  • ingest new data
  • restore exported bundles
  • import from external connectors
  • merge datasets
  • configure AutoML training

Key features:

  • File upload for tabular, document, image, PDF, SQLite, and export bundle formats
  • Connector-based import for SQL sources
  • Merge Studio for previewing joins before materializing merged datasets
  • Auto problem detection to infer likely target/task hints
  • Training configuration for mode, metric, imbalance handling, cleaning, CV, and feature selection

Logic:

  • Uploads are normalized through core/file_loader.py
  • Dataset profile information is generated and saved in profile_json
  • Dataset lineage is tracked with parent_dataset_id
  • Training configuration is stored on the job and reused in later pages

2. Dataset DNA

Primary purpose:

  • inspect dataset structure and quality before training

Key features:

  • dataset profile overview
  • numeric/categorical breakdown
  • per-column stats table
  • auto-detect details
  • leakage and quality analysis
  • repair preview and repair apply
  • dataset lineage timeline and graph

Logic:

  • Profiling computes row counts, column types, missingness, target hints, and column stats
  • Leakage checks look for suspicious target correlation, ID-like fields, constants, duplicates, and temporal leakage hints
  • Repair flows preview transformations first, then create a cleaned derived dataset
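
Two of the checks listed above (constant columns and ID-like fields) can be sketched in a few lines. This is only an illustration: the function name `quality_flags`, the 0.98 uniqueness ratio, and the sample columns are assumptions, not the project's actual code or thresholds.

```python
from typing import Dict, Sequence


def quality_flags(columns: Dict[str, Sequence], target: str,
                  id_ratio: float = 0.98) -> Dict[str, str]:
    """Flag constant and ID-like columns (id_ratio is an assumed default)."""
    flags: Dict[str, str] = {}
    for name, values in columns.items():
        if name == target:
            continue
        present = [v for v in values if v is not None]
        if not present:
            continue
        distinct = len(set(present))
        # Only integer or string columns are treated as ID candidates,
        # so continuous float features are not flagged by mistake.
        discrete = all(
            isinstance(v, (int, str)) and not isinstance(v, bool) for v in present
        )
        if distinct == 1:
            flags[name] = "constant"
        elif discrete and distinct / len(present) >= id_ratio:
            flags[name] = "id_like"
    return flags


# Made-up sample data for illustration.
sample = {
    "customer_id": [101, 102, 103, 104],
    "country": ["IN", "IN", "IN", "IN"],
    "age": [31, 45, 31, 52],
    "churn": [0, 1, 0, 1],
}
flags = quality_flags(sample, target="churn")
print(flags)
```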

3. Training Lab

Primary purpose:

  • run AutoML training with pipeline-backed execution

Key features:

  • job launch
  • progress/status monitoring
  • reasoning stream from the pipeline
  • deep analysis handoff to Results Console

Logic:

  • Training uses component-based pipeline execution from core/pipeline_engine.py
  • Each component updates status and appends reasoning so the UI can show not just what happened, but why
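
The component pattern described above might look roughly like the following. All names here (`PipelineContext`, `Component`, `run_pipeline`) are hypothetical; the real interfaces in core/pipeline_engine.py may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PipelineContext:
    """Hypothetical shared state passed from one component to the next."""
    status: str = "pending"
    reasoning: List[str] = field(default_factory=list)
    data: dict = field(default_factory=dict)


@dataclass
class Component:
    """Hypothetical component: a name plus a step that returns its rationale."""
    name: str
    step: Callable[[PipelineContext], str]


def run_pipeline(components: List[Component], ctx: PipelineContext) -> PipelineContext:
    for component in components:
        ctx.status = f"running:{component.name}"
        try:
            why = component.step(ctx)
        except Exception as exc:
            # Stop at the first failure and record why it failed.
            ctx.status = f"failed:{component.name}"
            ctx.reasoning.append(f"[{component.name}] failed: {exc}")
            return ctx
        ctx.reasoning.append(f"[{component.name}] {why}")
    ctx.status = "completed"
    return ctx


def validate(ctx: PipelineContext) -> str:
    ctx.data["rows"] = 100
    return "target column found; 100 rows kept"


def select_model(ctx: PipelineContext) -> str:
    ctx.data["model"] = "lightgbm"
    return "small tabular dataset, so gradient boosting was preferred"


ctx = run_pipeline(
    [Component("data_validation", validate), Component("model_selection", select_model)],
    PipelineContext(),
)
print(ctx.status)
print("\n".join(ctx.reasoning))
```

Because every step returns a sentence rather than only a status code, a UI can stream `ctx.reasoning` as the job progresses.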

4. Results Console

Primary purpose:

  • inspect the trained model and all downstream analysis

Key features:

  • leaderboard and winner summary
  • execution profile and pipeline metrics
  • tested model breakdown
  • SHAP global importance
  • calibration report
  • threshold tuner
  • feature lineage
  • recommendations
  • trust heatmap
  • drift timeline
  • prediction sandbox
  • feature contract checker
  • counterfactual generation
  • scenario sweep
  • synthetic augmentation controls
  • lightweight model chat

Logic:

  • Results are normalized through infra/result_contract.py
  • UI panels call dedicated APIs for each analysis instead of overloading one giant endpoint
  • Display tables with mixed-type columns are sanitized before rendering so that Streamlit's Arrow serialization stays stable

5. Experiment Tracker

Primary purpose:

  • compare historical runs and manage model promotion workflow

Key features:

  • experiment archive
  • global leaderboard
  • run history viewer
  • reasoning stream viewer
  • registry labels: champion, challenger, candidate, archived
  • team notes
  • side-by-side run comparison
  • run diff engine
  • battle arena charts

Logic:

  • Each completed job can be persisted as an ExperimentRun
  • Registry and note models allow lightweight model governance without needing an external MLOps platform

6. Drift Monitor

Primary purpose:

  • detect dataset drift against the saved baseline and operationalize retraining

Key features:

  • upload new batch for drift check
  • per-feature drift metrics
  • drift severity summary
  • cadence scheduling
  • drift history
  • one-click retrain on drifted data

Logic:

  • Baseline distributions are fit during training
  • Later uploads are compared against the baseline with Population Stability Index (PSI) and Kolmogorov-Smirnov (KS) style checks
  • Drift checks are stored so timeline views and cadence workflows can be built on top
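
As an illustration of a PSI-style check, here is a minimal pure-Python sketch. The helper names, the quantile binning, the 10-bin default, and the epsilon floor are assumptions and not necessarily what CODIN does internally.

```python
import bisect
import math
from typing import List, Sequence


def fit_bin_edges(baseline: Sequence[float], bins: int = 10) -> List[float]:
    """Interior quantile edges learned once from the training (baseline) data."""
    ordered = sorted(baseline)
    edges = []
    for i in range(1, bins):
        idx = min(len(ordered) - 1, int(len(ordered) * i / bins))
        edges.append(ordered[idx])
    return sorted(set(edges))


def bin_shares(values: Sequence[float], edges: List[float], eps: float = 1e-6) -> List[float]:
    """Share of values per bin, floored at eps so the log term stays finite."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index: sum of (actual - expected) * ln(actual / expected)."""
    edges = fit_bin_edges(baseline, bins)
    expected = bin_shares(baseline, edges)
    actual = bin_shares(current, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


baseline = [float(v) for v in range(100)]   # made-up training feature
shifted = [v + 50.0 for v in baseline]      # made-up later batch
print(psi(baseline, baseline))
print(psi(baseline, shifted))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift; whether CODIN uses these cut-offs for its severity summary is not stated here.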

7. Smart AI Hub

Primary purpose:

  • provide higher-order utilities around the trained models and datasets

Key features:

  • ensemble builder
  • what-if simulation
  • synthetic data generation
  • natural language ML helpers

Logic:

  • This page reuses completed jobs and current workspace datasets rather than creating isolated tooling
  • It is meant to sit on top of the main AutoML lifecycle

Training Pipeline Components

Data Validation

Implemented in backend/services/training/components.py.

What it does:

  • loads the dataset
  • checks target existence
  • trims to selected features if requested
  • drops columns with extreme missingness
  • attempts numeric coercion on object columns
  • runs leakage detection
  • saves the data contract
  • fits drift baseline

Why:

  • it prevents garbage-in training runs
  • it creates the baseline metadata needed by later features such as drift monitoring and contract checks
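
A rough sketch of three of the validation steps above (target check, dropping columns with extreme missingness, numeric coercion). The function names and the 0.9 missingness cut-off are assumptions for illustration; the actual logic lives in backend/services/training/components.py and may differ.

```python
from typing import Dict, List, Optional


def missing_ratio(values: List[Optional[str]]) -> float:
    return sum(v is None for v in values) / len(values) if values else 1.0


def coerce_numeric(values: List[Optional[str]]) -> list:
    """Return floats if every non-missing value parses as a number, else the input."""
    try:
        return [None if v is None else float(v) for v in values]
    except (TypeError, ValueError):
        return values


def validate_columns(columns: Dict[str, list], target: str,
                     max_missing: float = 0.9) -> Dict[str, list]:
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    cleaned = {}
    for name, values in columns.items():
        if name != target and missing_ratio(values) > max_missing:
            continue  # drop columns that are almost entirely missing
        cleaned[name] = coerce_numeric(values)
    return cleaned


# Made-up raw columns, all read as strings.
raw = {
    "age": ["31", "45", None, "52"],
    "notes": [None, None, None, None],
    "city": ["Pune", "Delhi", None, "Pune"],
    "churn": ["0", "1", "0", "1"],
}
cleaned = validate_columns(raw, target="churn")
print(sorted(cleaned))
```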

Feature Engineering

What it does:

  • optional auto-cleaning
  • task-type inference
  • managed feature generation for smaller datasets
  • invalid target cleanup
  • label encoding for classification
  • train/test split
  • summary statistics for later reporting

Why:

  • it balances automation with safety
  • feature synthesis is only used when dataset size and width are still manageable

Model Selection

What it does:

  • builds a profile of the problem
  • asks the selector/meta-learner for a candidate pool
  • chooses light or full preprocessing based on mode
  • enables dimensionality guidance for wide datasets

Why:

  • it makes the model search adaptive instead of static

Training

What it does:

  • handles imbalance when configured
  • executes sweep/tuning behavior
  • trains the final pipeline
  • evaluates on holdout data
  • computes leaderboard, metrics, and explainability artifacts
  • stores model metadata and run artifacts

Why:

  • it separates fast exploration from deeper optimization
  • it preserves enough metadata for later explainability and operational tools

Explainability Features

SHAP Global Importance

  • ranks feature impact across the trained model
  • used in the Results Console and reports

Local Explanation

  • explains a single prediction using per-feature contributions
  • used to rank candidate features for counterfactual search

Counterfactual Generator

Endpoint:

  • POST /api/counterfactual/{job_id}

Logic:

  • only enabled for classification jobs
  • loads the saved model and training dataset context
  • validates that all expected feature inputs are present
  • scores the original row
  • gets local feature contributions
  • ranks the most influential features
  • tries one-feature changes using numeric quantiles or common categorical alternatives
  • returns the smallest single-field changes that flip the prediction
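
The one-feature search in the Logic list can be sketched as follows. The toy model, the candidate values, and the helper name `single_feature_flips` are made up for illustration; the real endpoint loads the saved model and derives candidates from quantiles and common categories.

```python
from typing import Callable, Dict, List, Sequence


def single_feature_flips(
    predict: Callable[[Dict[str, float]], int],
    row: Dict[str, float],
    candidates: Dict[str, Sequence[float]],
    ranked_features: Sequence[str],
) -> List[dict]:
    """For each ranked feature, keep the smallest change that flips the prediction."""
    original = predict(row)
    flips = []
    for feature in ranked_features:
        best = None
        for value in candidates.get(feature, []):
            trial = dict(row)
            trial[feature] = value  # change exactly one field
            if predict(trial) == original:
                continue
            delta = abs(value - row[feature])
            if best is None or delta < best["delta"]:
                best = {"feature": feature, "from": row[feature],
                        "to": value, "delta": delta}
        if best is not None:
            flips.append(best)
    return flips


def toy_model(row: Dict[str, float]) -> int:
    """Stand-in classifier, not the CODIN model."""
    return int(0.05 * row["Age"] + 0.00002 * row["Salary"] > 3.0)


row = {"Age": 30, "Salary": 50000}
candidates = {"Age": [25, 35, 45, 55], "Salary": [40000, 60000, 80000]}
flips = single_feature_flips(toy_model, row, candidates, ["Age", "Salary"])
print(flips)
```

Deltas are compared only within a feature, since raw units (years versus currency) are not comparable across features.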

Current behavior:

  • verified against job 3c29d593-31b3-4116-bf5e-a1b3d48d130b
  • returned a valid one-feature flip suggestion on Age

Feature Lineage

  • inspects the saved preprocessor and maps transformed feature names back to raw groups

Calibration And Threshold Tuning

  • computes classification calibration bins
  • sweeps thresholds and compares precision/recall/F1
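
A threshold sweep of this kind can be illustrated with plain Python. The labels, probabilities, and threshold grid below are made-up sample values, and `threshold_sweep` is a hypothetical helper name.

```python
from typing import List, Sequence


def threshold_sweep(y_true: Sequence[int], y_prob: Sequence[float],
                    thresholds: Sequence[float]) -> List[dict]:
    """Compute precision, recall, and F1 for the positive class at each threshold."""
    rows = []
    for t in thresholds:
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        rows.append({"threshold": t, "precision": precision,
                     "recall": recall, "f1": f1})
    return rows


y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.6]
rows = threshold_sweep(y_true, y_prob, [0.3, 0.5, 0.7])
best = max(rows, key=lambda r: r["f1"])
print(best)
```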

Trust Heatmap

  • checks how often features appear across recent runs for the same dataset
  • compares historical importance stability
  • marks features as stable, noisy, drift-prone, or leakage-risky

Prediction And Simulation Features

Live Prediction

Endpoint:

  • POST /api/predict/{job_id}

Logic:

  • validates exact feature contract
  • builds a one-row inference frame in expected column order
  • returns prediction and confidence when probabilities exist
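
A minimal sketch of these three steps, assuming a scikit-learn style model object with `predict` and an optional `predict_proba`. The helper names and the toy classifier are hypothetical; the request and response shapes of the real endpoint are not shown here.

```python
from typing import Any, Dict, List, Sequence


def build_inference_row(payload: Dict[str, Any], expected: Sequence[str]) -> List[Any]:
    """Require exactly the expected features, then emit them in training order."""
    missing = [f for f in expected if f not in payload]
    extra = [f for f in payload if f not in expected]
    if missing or extra:
        raise ValueError(f"feature contract mismatch: missing={missing}, extra={extra}")
    return [payload[f] for f in expected]


def predict_one(model: Any, payload: Dict[str, Any], expected: Sequence[str]) -> dict:
    row = build_inference_row(payload, expected)
    prediction = model.predict([row])[0]
    confidence = None
    if hasattr(model, "predict_proba"):
        # Confidence is the highest class probability, when the model exposes one.
        confidence = max(model.predict_proba([row])[0])
    return {"prediction": prediction, "confidence": confidence}


class ToyClassifier:
    """Stand-in model used only for this example."""

    def predict_proba(self, rows):
        return [[0.2, 0.8] if r[0] >= 40 else [0.9, 0.1] for r in rows]

    def predict(self, rows):
        return [int(p[1] >= 0.5) for p in self.predict_proba(rows)]


# Payload keys arrive in arbitrary order; the row is rebuilt as [Age, Salary].
result = predict_one(ToyClassifier(), {"Salary": 50000, "Age": 45}, ["Age", "Salary"])
print(result)
```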

Scenario Sweep / What-If Simulation

Endpoint:

  • POST /api/future

Logic:

  • takes a base feature vector
  • varies one selected feature over multiple values
  • scores each generated row independently
  • returns prediction and optional confidence for each point
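
The sweep logic above can be sketched as a small loop. The scoring function, the sweep values, and the per-point `error` field are assumptions for illustration, not the actual request or response format of POST /api/future.

```python
import math
from typing import Any, Callable, Dict, List, Sequence


def scenario_sweep(
    score: Callable[[Dict[str, Any]], dict],
    base: Dict[str, Any],
    feature: str,
    values: Sequence[Any],
) -> List[dict]:
    """Vary one feature over `values`, scoring each generated row on its own."""
    points = []
    for value in values:
        row = dict(base)          # copy so the base vector is never mutated
        row[feature] = value
        try:
            points.append({"value": value, **score(row), "error": None})
        except Exception as exc:  # one bad point should not abort the sweep
            points.append({"value": value, "prediction": None,
                           "confidence": None, "error": str(exc)})
    return points


def toy_score(row: Dict[str, Any]) -> dict:
    """Stand-in scorer: a logistic curve over Salary, not the CODIN model."""
    prob = 1.0 / (1.0 + math.exp(-(row["Salary"] - 60000) / 10000))
    return {"prediction": int(prob >= 0.5), "confidence": max(prob, 1.0 - prob)}


base = {"Age": 30, "Salary": 50000}
points = scenario_sweep(toy_score, base, "Salary", [40000, 55000, 70000, 85000])
print([(p["value"], p["prediction"]) for p in points])
```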

Current behavior:

  • verified against job 3c29d593-31b3-4116-bf5e-a1b3d48d130b
  • returned valid prediction points for a Salary sweep with no point-level errors

Feature Contract Checker

  • compares uploaded inference data against the training feature schema
  • reports missing features, extra columns, dtype mismatches, and alignment status
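
A contract report with these four fields could be assembled along the following lines. The schema representation (column name mapped to a dtype string) and the field names are assumptions; the stored data contract may be structured differently.

```python
from typing import Dict


def check_contract(expected: Dict[str, str], observed: Dict[str, str]) -> dict:
    """Compare an uploaded schema against the training schema (name -> dtype)."""
    missing = [c for c in expected if c not in observed]
    extra = [c for c in observed if c not in expected]
    mismatched = {
        c: {"expected": expected[c], "observed": observed[c]}
        for c in expected
        if c in observed and expected[c] != observed[c]
    }
    return {
        "missing_features": missing,
        "extra_columns": extra,
        "dtype_mismatches": mismatched,
        "aligned": not (missing or extra or mismatched),
    }


# Made-up schemas for illustration.
training_schema = {"Age": "int64", "Salary": "float64", "City": "object"}
upload_schema = {"Age": "object", "Salary": "float64", "Notes": "object"}
report = check_contract(training_schema, upload_schema)
print(report)
```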

Data Management Features

Dataset Catalog

  • lists known datasets with profile summaries
  • supports derived datasets and lineage-aware workflows

Merge Studio

  • previews join quality before creating merged datasets
  • reports overlap, duplicate keys, row multiplier, and sample merged records
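
The join-quality numbers can be estimated from the key columns alone, before any merge is materialized. This sketch assumes an inner join on a single key; `preview_join` and the sample keys are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def preview_join(left_keys: Sequence[Hashable], right_keys: Sequence[Hashable]) -> dict:
    """Estimate inner-join quality from the key columns only."""
    left, right = Counter(left_keys), Counter(right_keys)
    shared = set(left) & set(right)
    # Each shared key yields (left occurrences x right occurrences) merged rows.
    merged_rows = sum(left[k] * right[k] for k in shared)
    return {
        "key_overlap": len(shared) / len(left) if left else 0.0,
        "duplicate_keys_left": sorted(k for k, n in left.items() if n > 1),
        "duplicate_keys_right": sorted(k for k, n in right.items() if n > 1),
        "merged_rows": merged_rows,
        "row_multiplier": merged_rows / len(left_keys) if left_keys else 0.0,
    }


preview = preview_join([1, 2, 3, 4], [2, 2, 3, 5])
print(preview)
```

A row multiplier well above 1 usually signals duplicate keys on one side, which is worth resolving before creating the merged dataset.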

Repair Preview / Apply

  • shows proposed cleaning effect before materializing a new dataset version

Synthetic Data Generator

  • creates synthetic rows from the current dataset
  • stores derived dataset with lineage back to parent

Synthetic Data Judge

  • compares synthetic output against the parent dataset
  • checks numeric distribution drift and categorical mix overlap
  • returns realism score, verdict, and notes

Experiment And Governance Features

Experiment Tracking

  • stores run-level summary information in experiment_runs
  • powers run archive and comparison UI

Registry Labels

  • lightweight promotion workflow for production candidates

Team Notes

  • lets users annotate runs without leaving the product

Run Diff Engine

  • compares config and output changes between two runs

Reporting And Export

Report Generator

  • creates PDF summaries with dataset overview, metrics, and SHAP importance

Export Bundle

  • packages model, metadata, and training context for later restore

Natural Language And AI Helpers

Narrative Generator

  • builds a human-readable summary of the experiment
  • uses stored job story when available, otherwise composes a fallback narrative

Natural Language ML Helpers

  • parse or support ML-oriented natural language workflows from the Smart AI Hub

Storage And State Model

Database Models

  • DatasetModel: uploaded/imported/derived datasets
  • JobModel: training jobs and result payloads
  • ExperimentRun: completed-run archive
  • DriftCheck: drift history
  • DriftSchedule: cadence configuration
  • ModelRegistryEntry and TeamNote: governance helpers

Run Artifacts

Per-run directories store:

  • model pickle
  • metrics JSON
  • schema/data contract
  • drift baseline
  • model metadata
  • exports and reports

Important Implementation Notes

Result Contract Normalization

  • all result payloads are sanitized to be JSON-safe
  • prevents NaN/Inf and shape drift from breaking the UI
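
The NaN/Inf part of this normalization can be sketched with the standard library. `json_safe` is a hypothetical name; the real logic in infra/result_contract.py may handle more types (for example NumPy scalars) and also enforce payload shape.

```python
import json
import math
from typing import Any


def json_safe(value: Any) -> Any:
    """Recursively replace non-finite floats with None and coerce containers."""
    if isinstance(value, float):
        return value if math.isfinite(value) else None
    if isinstance(value, dict):
        return {str(k): json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [json_safe(v) for v in value]
    if value is None or isinstance(value, (str, int, bool)):
        return value
    return str(value)  # fall back to a string for unknown objects


payload = {"auc": float("nan"), "scores": (1.0, float("inf")), 1: "x"}
# allow_nan=False makes json.dumps raise if any NaN/Inf slipped through.
encoded = json.dumps(json_safe(payload), allow_nan=False)
print(encoded)
```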

Session Safety

  • the DB session now uses expire_on_commit=False
  • this keeps already-loaded attributes readable after commit, which avoids detached-instance crashes when objects are used once the session has committed or closed

Known Residual Noise

  • LightGBM/scikit-learn still emit "X does not have valid feature names" warnings during some prediction paths
  • these warnings did not block counterfactual generation or scenario sweep during runtime verification, but they are worth cleaning up later for quieter logs

Verified Status From This Pass

  • Counterfactual: working on the verified completed job
  • Scenario sweep: working on the verified completed job
  • Trust heatmap: the detached-session regression no longer occurs in the direct check
  • Dataset list helper: the detached-session regression no longer occurs in the direct check