Auto_ML / FEATURE.md

abhiraj12

Streamline export bundle by removing auxiliary files

807485b about 1 month ago

CODIN Feature Guide

What This Project Is

CODIN is an end-to-end AutoML studio built with:

FastAPI for APIs and backend orchestration
Streamlit for the multi-page workspace UI
SQLAlchemy + SQLite for datasets, jobs, experiment history, notes, registry labels, and drift records
scikit-learn, LightGBM, Optuna, and custom pipeline logic for training and evaluation

The product is designed as a guided ML workspace rather than a single training script. A user can ingest data, profile it, train models, inspect results, simulate predictions, detect drift, generate synthetic data, compare runs, and export reports from the same app.

Core Product Flow

A dataset is uploaded or imported.
The backend profiles the dataset and stores metadata in datasets.
A training job is created and the pipeline runs validation, cleaning, feature engineering, model selection, training, and evaluation.
Results are saved to run artifacts and experiment history.
The user explores explainability, contracts, what-if tools, drift, augmentation, and reports.

Main User-Facing Workspaces

1. Home / Upload And Configure

Primary purpose:

ingest new data
restore exported bundles
import from external connectors
merge datasets
configure AutoML training

Key features:

File upload for tabular, document, image, PDF, SQLite, and export bundle formats
Connector-based import for SQL sources
Merge Studio for previewing joins before materializing merged datasets
Auto problem detection to infer likely target/task hints
Training configuration for mode, metric, imbalance handling, cleaning, CV, and feature selection

Logic:

Uploads are normalized through core/file_loader.py
Dataset profile information is generated and saved in profile_json
Dataset lineage is tracked with parent_dataset_id
Training configuration is stored on the job and reused in later pages

2. Dataset DNA

Primary purpose:

inspect dataset structure and quality before training

Key features:

dataset profile overview
numeric/categorical breakdown
per-column stats table
auto-detect details
leakage and quality analysis
repair preview and repair apply
dataset lineage timeline and graph

Logic:

Profiling computes row counts, column types, missingness, target hints, and column stats
Leakage checks look for suspicious target correlation, ID-like fields, constants, duplicates, and temporal leakage hints
Repair flows preview transformations first, then create a cleaned derived dataset
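
A few of the leakage checks listed above can be sketched in plain Python. This is a minimal illustration, not the project's detector: the function names, the flag labels, and the 0.98 correlation threshold are all assumptions, and temporal leakage hints are left out.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def is_numeric(values):
    return all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values)

def is_key_like(values):
    # Only integers and strings are treated as potential identifiers.
    return all(isinstance(v, (int, str)) and not isinstance(v, bool) for v in values)

def leakage_flags(columns, target, corr_threshold=0.98):
    """columns maps column name -> list of values. Returns {column: [flags]}."""
    y = columns[target]
    flags = {}
    for name, values in columns.items():
        if name == target:
            continue
        found = []
        distinct = len(set(values))
        if distinct <= 1:
            found.append("constant")
        elif distinct == len(values) and is_key_like(values):
            found.append("id_like")
        if is_numeric(values) and is_numeric(y):
            if abs(pearson(values, y)) >= corr_threshold:
                found.append("suspicious_target_correlation")
        if found:
            flags[name] = found
    return flags

def duplicate_row_count(columns):
    """Number of rows that repeat an earlier row exactly."""
    rows = list(zip(*columns.values()))
    return len(rows) - len(set(rows))
```

Flagged columns would normally be surfaced for review rather than dropped silently, since a near-perfect correlation can also be a legitimate strong predictor.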

3. Training Lab

Primary purpose:

run AutoML training with pipeline-backed execution

Key features:

job launch
progress/status monitoring
reasoning stream from the pipeline
deep analysis handoff to Results Console

Logic:

Training uses component-based pipeline execution from core/pipeline_engine.py
Each component updates status and appends reasoning so the UI can show not just what happened, but why
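
The status-plus-reasoning pattern described above might look roughly like the following. The names RunContext, Component, and execute are hypothetical and are not taken from core/pipeline_engine.py; the sketch only shows how each step can record both what ran and why.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class RunContext:
    status: str = "pending"
    reasoning: list = field(default_factory=list)  # human-readable "why" lines
    data: dict = field(default_factory=dict)       # shared state between components

@dataclass
class Component:
    name: str
    run: Callable  # takes a RunContext, returns a short explanation string

def execute(components, ctx):
    """Run components in order, updating status and appending reasoning."""
    for comp in components:
        ctx.status = f"running:{comp.name}"
        try:
            why = comp.run(ctx)
        except Exception as exc:
            # Stop at the first failure and keep the reason for the UI.
            ctx.status = f"failed:{comp.name}"
            ctx.reasoning.append(f"[{comp.name}] failed: {exc}")
            return ctx
        ctx.reasoning.append(f"[{comp.name}] {why}")
    ctx.status = "completed"
    return ctx
```

A UI that polls the context can then render the status as a progress indicator and the reasoning list as a stream.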

4. Results Console

Primary purpose:

inspect the trained model and all downstream analysis

Key features:

leaderboard and winner summary
execution profile and pipeline metrics
tested model breakdown
SHAP global importance
calibration report
threshold tuner
feature lineage
recommendations
trust heatmap
drift timeline
prediction sandbox
feature contract checker
counterfactual generation
scenario sweep
synthetic augmentation controls
lightweight model chat

Logic:

Results are normalized through infra/result_contract.py
UI panels call dedicated APIs for each analysis instead of overloading one giant endpoint
Mixed display tables are sanitized before rendering to keep Streamlit Arrow serialization stable

5. Experiment Tracker

Primary purpose:

compare historical runs and manage model promotion workflow

Key features:

experiment archive
global leaderboard
run history viewer
reasoning stream viewer
registry labels: champion, challenger, candidate, archived
team notes
side-by-side run comparison
run diff engine
battle arena charts

Logic:

Each completed job can be persisted as an ExperimentRun
Registry and note models allow lightweight model governance without needing an external MLOps platform

6. Drift Monitor

Primary purpose:

detect dataset drift against the saved baseline and operationalize retraining

Key features:

upload new batch for drift check
per-feature drift metrics
drift severity summary
cadence scheduling
drift history
one-click retrain on drifted data

Logic:

Baseline distributions are fit during training
Later uploads are compared against the baseline with PSI/KS-style checks (Population Stability Index and Kolmogorov-Smirnov)
Drift checks are stored so timeline views and cadence workflows can be built on top
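
As an illustration of a PSI-style check, the sketch below bins a baseline feature into equal-width buckets and compares a later batch against it. The function name, the bin count, and the epsilon floor are assumptions; the project's actual binning strategy is not documented here.

```python
import math

def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index of `current` against `baseline`.

    Bin edges come from the baseline range; values outside that range are
    clipped into the first or last bin. Empty bins are floored at `eps`
    so the logarithm stays defined.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift. Those cutoffs are a convention, not something this guide specifies.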

7. Smart AI Hub

Primary purpose:

provide higher-order utilities around the trained models and datasets

Key features:

ensemble builder
what-if simulation
synthetic data generation
natural language ML helpers

Logic:

This page reuses completed jobs and current workspace datasets rather than creating isolated tooling
It is meant to sit on top of the main AutoML lifecycle

Training Pipeline Components

Data Validation

Implemented in backend/services/training/components.py.

What it does:

loads the dataset
checks target existence
trims to selected features if requested
drops columns with extreme missingness
attempts numeric coercion on object columns
runs leakage detection
saves the data contract
fits drift baseline

Why:

it prevents garbage-in training runs
it creates the baseline metadata needed by later features such as drift monitoring and contract checks

Feature Engineering

What it does:

optional auto-cleaning
task-type inference
managed feature generation for smaller datasets
invalid target cleanup
label encoding for classification
train/test split
summary statistics for later reporting

Why:

it balances automation with safety
feature synthesis is only used when dataset size and width are still manageable

Model Selection

What it does:

builds a profile of the problem
asks the selector/meta-learner for a candidate pool
chooses light or full preprocessing based on mode
enables dimensionality guidance for wide datasets

Why:

it makes the model search adaptive instead of static

Training

What it does:

handles imbalance when configured
executes sweep/tuning behavior
trains the final pipeline
evaluates on holdout data
computes leaderboard, metrics, and explainability artifacts
stores model metadata and run artifacts

Why:

it separates fast exploration from deeper optimization
it preserves enough metadata for later explainability and operational tools

Explainability Features

SHAP Global Importance

ranks feature impact across the trained model
used in the Results Console and reports

Local Explanation

explains a single prediction using per-feature contributions
used to rank candidate features for counterfactual search

Counterfactual Generator

Endpoint:

POST /api/counterfactual/{job_id}

Logic:

only enabled for classification jobs
loads the saved model and training dataset context
validates that all expected feature inputs are present
scores the original row
gets local feature contributions
ranks the most influential features
tries one-feature changes using numeric quantiles or common categorical alternatives
returns the smallest single-field changes that flip the prediction
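
The one-feature search in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions: predict is any callable that maps a feature dict to a label, ranked_features is assumed to come from the local explanation step, and candidates holds the alternative values (quantiles or common categories) per feature. Sorting by absolute difference is a stand-in for whatever notion of "smallest" the endpoint actually uses.

```python
def single_feature_flips(predict, row, candidates, ranked_features):
    """Try one-feature changes and return those that flip the prediction.

    predict: callable(dict) -> label (hypothetical model wrapper)
    row: the original feature dict
    candidates: {feature: [alternative values]}
    ranked_features: features ordered by local influence, most influential first
    """
    original = predict(row)
    flips = []
    for feature in ranked_features:
        for value in candidates.get(feature, []):
            if value == row[feature]:
                continue
            trial = {**row, feature: value}  # change exactly one field
            new_label = predict(trial)
            if new_label != original:
                flips.append({
                    "feature": feature,
                    "from": row[feature],
                    "to": value,
                    "new_prediction": new_label,
                })

    def change_size(flip):
        a, b = flip["from"], flip["to"]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            return (0, abs(b - a))  # numeric changes first, smallest first
        return (1, 0)               # categorical swaps afterwards

    return sorted(flips, key=change_size)
```

Because absolute differences are not comparable across features with different scales, a real implementation would likely normalize each change, for example by the feature's observed range.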

Current behavior:

verified against job 3c29d593-31b3-4116-bf5e-a1b3d48d130b
returned a valid one-feature flip suggestion on Age

Feature Lineage

inspects the saved preprocessor and maps transformed feature names back to raw groups

Calibration And Threshold Tuning

computes classification calibration bins
sweeps thresholds and compares precision/recall/F1
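
A threshold sweep for a binary classifier can be sketched in a few lines. The function name and the output shape are assumptions; the idea is simply to recompute precision, recall, and F1 at each candidate cutoff.

```python
def sweep_thresholds(y_true, y_prob, thresholds):
    """Return precision/recall/F1 for each threshold (binary labels 0/1)."""
    rows = []
    for t in thresholds:
        tp = fp = fn = 0
        for y, p in zip(y_true, y_prob):
            pred = 1 if p >= t else 0
            if pred == 1 and y == 1:
                tp += 1
            elif pred == 1 and y == 0:
                fp += 1
            elif pred == 0 and y == 1:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"threshold": t, "precision": precision, "recall": recall, "f1": f1})
    return rows
```

Picking the row with the highest F1 gives one candidate operating point; a tuner could equally optimize for a minimum recall or precision constraint.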

Trust Heatmap

checks how often features appear across recent runs for the same dataset
compares historical importance stability
marks features as stable, noisy, drift-prone, or leakage-risky

Prediction And Simulation Features

Live Prediction

Endpoint:

POST /api/predict/{job_id}

Logic:

validates exact feature contract
builds a one-row inference frame in expected column order
returns the prediction, plus a confidence value when the model exposes probabilities
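
The three steps above can be sketched as below. The helper names are hypothetical, and the model is assumed to follow the scikit-learn convention of predict and an optional predict_proba; the sketch uses a plain list as the one-row frame to stay dependency-free.

```python
def build_inference_row(payload, expected_features):
    """Validate the exact feature contract and order values as the model expects."""
    missing = [f for f in expected_features if f not in payload]
    extra = [k for k in payload if k not in expected_features]
    if missing or extra:
        raise ValueError(f"contract violation: missing={missing}, extra={extra}")
    return [payload[f] for f in expected_features]

def predict_one(model, payload, expected_features):
    """Score a single payload; include confidence only when probabilities exist."""
    row = build_inference_row(payload, expected_features)
    result = {"prediction": model.predict([row])[0]}
    if hasattr(model, "predict_proba"):
        result["confidence"] = max(model.predict_proba([row])[0])
    return result
```

Rejecting extra keys as well as missing ones is what makes the contract "exact": a misspelled feature name fails loudly instead of being ignored.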

Scenario Sweep / What-If Simulation

Endpoint:

POST /api/future

Logic:

takes a base feature vector
varies one selected feature over multiple values
scores each generated row independently
returns prediction and optional confidence for each point
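
A sweep over one feature might be implemented along these lines. The function signature is an assumption and is not the request schema of POST /api/future; predict and predict_proba are hypothetical callables that each take a feature dict.

```python
def scenario_sweep(predict, base, feature, values, predict_proba=None):
    """Vary one feature over `values`, scoring each generated row independently.

    Errors are captured per point so one bad value does not abort the sweep.
    """
    points = []
    for value in values:
        row = {**base, feature: value}  # copy the base vector, change one field
        point = {"value": value}
        try:
            point["prediction"] = predict(row)
            if predict_proba is not None:
                point["confidence"] = max(predict_proba(row))
        except Exception as exc:
            point["error"] = str(exc)
        points.append(point)
    return points
```

Keeping errors at the point level makes it easy to report a sweep as clean when no point carries an error key.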

Current behavior:

verified against job 3c29d593-31b3-4116-bf5e-a1b3d48d130b
returned valid prediction points for a Salary sweep with no point-level errors

Feature Contract Checker

compares uploaded inference data against the training feature schema
reports missing features, extra columns, dtype mismatches, and alignment status
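
A contract check of this kind reduces to comparing two column-to-dtype mappings. In the sketch below, the function name, the dtype strings, and the report keys are illustrative assumptions rather than the checker's real output format.

```python
def check_contract(schema, batch_columns):
    """Compare an uploaded batch against the training feature schema.

    schema / batch_columns: {column name: dtype string}, e.g. {"Age": "int64"}.
    """
    missing = sorted(set(schema) - set(batch_columns))
    extra = sorted(set(batch_columns) - set(schema))
    mismatches = {
        col: {"expected": schema[col], "actual": batch_columns[col]}
        for col in schema
        if col in batch_columns and schema[col] != batch_columns[col]
    }
    return {
        "missing_features": missing,
        "extra_columns": extra,
        "dtype_mismatches": mismatches,
        "aligned": not (missing or extra or mismatches),
    }
```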

Data Management Features

Dataset Catalog

lists known datasets with profile summaries
supports derived datasets and lineage-aware workflows

Merge Studio

previews join quality before creating merged datasets
reports overlap, duplicate keys, row multiplier, and sample merged records
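
The join-quality numbers can be estimated from the key columns alone, before any merge is materialized. The sketch below assumes an inner join on a single key; the function name and the exact definitions of overlap and row multiplier are assumptions.

```python
from collections import Counter

def merge_preview(left_keys, right_keys):
    """Estimate inner-join quality from the two key columns."""
    lc, rc = Counter(left_keys), Counter(right_keys)
    shared = set(lc) & set(rc)
    all_keys = set(lc) | set(rc)
    # Each shared key contributes (left count x right count) merged rows.
    merged_rows = sum(lc[k] * rc[k] for k in shared)
    return {
        "key_overlap": len(shared) / len(all_keys) if all_keys else 0.0,
        "left_duplicate_keys": sum(1 for n in lc.values() if n > 1),
        "right_duplicate_keys": sum(1 for n in rc.values() if n > 1),
        "merged_rows": merged_rows,
        "row_multiplier": merged_rows / len(left_keys) if left_keys else 0.0,
    }
```

A row multiplier well above 1.0 usually signals duplicate keys on both sides, which is the kind of many-to-many fan-out a preview is meant to catch.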

Repair Preview / Apply

shows proposed cleaning effect before materializing a new dataset version

Synthetic Data Generator

creates synthetic rows from the current dataset
stores derived dataset with lineage back to parent

Synthetic Data Judge

compares synthetic output against the parent dataset
checks numeric distribution drift and categorical mix overlap
returns realism score, verdict, and notes

Experiment And Governance Features

Experiment Tracking

stores run-level summary information in experiment_runs
powers run archive and comparison UI

Registry Labels

lightweight promotion workflow for production candidates

Team Notes

lets users annotate runs without leaving the product

Run Diff Engine

compares config and output changes between two runs

Reporting And Export

Report Generator

creates PDF summaries with dataset overview, metrics, and SHAP importance

Export Bundle

packages model, metadata, and training context for later restore

Natural Language And AI Helpers

Narrative Generator

builds a human-readable summary of the experiment
uses stored job story when available, otherwise composes a fallback narrative

Natural Language ML Helpers

parse or support ML-oriented natural language workflows from the Smart AI Hub

Storage And State Model

Database Models

DatasetModel: uploaded/imported/derived datasets
JobModel: training jobs and result payloads
ExperimentRun: completed-run archive
DriftCheck: drift history
DriftSchedule: cadence configuration
ModelRegistryEntry and TeamNote: governance helpers

Run Artifacts

Per-run directories store:

model pickle
metrics JSON
schema/data contract
drift baseline
model metadata
exports and reports

Important Implementation Notes

Result Contract Normalization

all result payloads are sanitized to be JSON-safe
prevents NaN/Inf and shape drift from breaking the UI
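
A JSON-safety pass of this kind can be sketched as a small recursive function. This is not the code in infra/result_contract.py; the name json_safe and the choice to map non-finite floats to null are assumptions.

```python
import json
import math

def json_safe(value):
    """Recursively convert a payload into something json.dumps accepts strictly."""
    if isinstance(value, float):
        return value if math.isfinite(value) else None  # NaN/Inf -> null
    if isinstance(value, dict):
        return {str(k): json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [json_safe(v) for v in value]
    if isinstance(value, (str, int, bool)) or value is None:
        return value
    return str(value)  # fall back to a string for unknown types

payload = {"auc": float("nan"), "loss": float("inf"), "scores": (0.5, 1)}
encoded = json.dumps(json_safe(payload), allow_nan=False)
```

Passing allow_nan=False makes the standard library raise on any NaN or Infinity that slipped through, which is a cheap way to verify the sanitizer.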

Session Safety

the DB session now uses expire_on_commit=False
this avoids detached-instance crashes when recently loaded attributes are used after commit
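
The effect of expire_on_commit=False can be shown with a small SQLAlchemy example. The Dataset class here is a hypothetical stand-in for DatasetModel, and the in-memory SQLite URL is used only for illustration.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Dataset(Base):  # hypothetical stand-in for DatasetModel
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

# With the default (expire_on_commit=True), attributes are expired at commit
# and reloading them after the session is closed raises DetachedInstanceError.
SessionLocal = sessionmaker(bind=engine, expire_on_commit=False)

def create_dataset(name):
    session = SessionLocal()
    try:
        dataset = Dataset(name=name)
        session.add(dataset)
        session.commit()
        return dataset  # attributes stay loaded after commit and close
    finally:
        session.close()

created = create_dataset("iris")
```

The trade-off is that objects returned this way can hold stale values if another transaction updates the row, so code that needs fresh data should re-query.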

Known Residual Noise

LightGBM/scikit-learn still emit "X does not have valid feature names" warnings during some prediction paths
these warnings did not block counterfactual generation or scenario sweeps during runtime verification, but they are worth cleaning up later for quieter logs

Verified Status From This Pass

Counterfactual: working on the verified completed job
Scenario sweep: working on the verified completed job
Trust heatmap: no longer fails with the detached-session regression in the direct check
Dataset list helper: no longer fails with the detached-session regression in the direct check