# CODIN Feature Guide

## What This Project Is

CODIN is an end-to-end AutoML studio built with:

- `FastAPI` for APIs and backend orchestration
- `Streamlit` for the multi-page workspace UI
- `SQLAlchemy + SQLite` for datasets, jobs, experiment history, notes, registry labels, and drift records
- `scikit-learn`, `LightGBM`, `Optuna`, and custom pipeline logic for training and evaluation

The product is designed as a guided ML workspace rather than a single training script. A user can ingest data, profile it, train models, inspect results, simulate predictions, detect drift, generate synthetic data, compare runs, and export reports from the same app.

## Core Product Flow

1. A dataset is uploaded or imported.
2. The backend profiles the dataset and stores metadata in `datasets`.
3. A training job is created and the pipeline runs validation, cleaning, feature engineering, model selection, training, and evaluation.
4. Results are saved to run artifacts and experiment history.
5. The user explores explainability, contracts, what-if tools, drift, augmentation, and reports.

## Main User-Facing Workspaces

### 1. Home / Upload And Configure

Primary purpose:

- ingest new data
- restore exported bundles
- import from external connectors
- merge datasets
- configure AutoML training

Key features:

- File upload for tabular, document, image, PDF, SQLite, and export bundle formats
- Connector-based import for SQL sources
- Merge Studio for previewing joins before materializing merged datasets
- Auto problem detection to infer likely target/task hints
- Training configuration for mode, metric, imbalance handling, cleaning, CV, and feature selection

Logic:

- Uploads are normalized through `core/file_loader.py`
- Dataset profile information is generated and saved in `profile_json`
- Dataset lineage is tracked with `parent_dataset_id`
- Training configuration is stored on the job and reused in later pages

### 2. Dataset DNA

Primary purpose:

- inspect dataset structure and quality before training

Key features:

- dataset profile overview
- numeric/categorical breakdown
- per-column stats table
- auto-detect details
- leakage and quality analysis
- repair preview and repair apply
- dataset lineage timeline and graph

Logic:

- Profiling computes row counts, column types, missingness, target hints, and column stats
- Leakage checks look for suspicious target correlation, ID-like fields, constants, duplicates, and temporal leakage hints
- Repair flows preview transformations first, then create a cleaned derived dataset

### 3. Training Lab

Primary purpose:

- run AutoML training with pipeline-backed execution

Key features:

- job launch
- progress/status monitoring
- reasoning stream from the pipeline
- deep analysis handoff to Results Console

Logic:

- Training uses component-based pipeline execution from `core/pipeline_engine.py`
- Each component updates status and appends reasoning so the UI can show not just what happened, but why

### 4. Results Console

Primary purpose:

- inspect the trained model and all downstream analysis

Key features:

- leaderboard and winner summary
- execution profile and pipeline metrics
- tested model breakdown
- SHAP global importance
- calibration report
- threshold tuner
- feature lineage
- recommendations
- trust heatmap
- drift timeline
- prediction sandbox
- feature contract checker
- counterfactual generation
- scenario sweep
- synthetic augmentation controls
- lightweight model chat

Logic:

- Results are normalized through `infra/result_contract.py`
- UI panels call dedicated APIs for each analysis instead of overloading one giant endpoint
- Mixed display tables are sanitized before rendering to keep Streamlit Arrow serialization stable

### 5. Experiment Tracker

Primary purpose:

- compare historical runs and manage model promotion workflow

Key features:

- experiment archive
- global leaderboard
- run history viewer
- reasoning stream viewer
- registry labels: champion, challenger, candidate, archived
- team notes
- side-by-side run comparison
- run diff engine
- battle arena charts

Logic:

- Each completed job can be persisted as an `ExperimentRun`
- Registry and note models allow lightweight model governance without needing an external MLOps platform

### 6. Drift Monitor

Primary purpose:

- detect dataset drift against the saved baseline and operationalize retraining

Key features:

- upload new batch for drift check
- per-feature drift metrics
- drift severity summary
- cadence scheduling
- drift history
- one-click retrain on drifted data

Logic:

- Baseline distributions are fit during training
- Later uploads are compared with PSI/KS-style checks
- Drift checks are stored so timeline views and cadence workflows can be built on top

### 7. Smart AI Hub

Primary purpose:

- provide higher-order utilities around the trained models and datasets

Key features:

- ensemble builder
- what-if simulation
- synthetic data generation
- natural language ML helpers

Logic:

- This page reuses completed jobs and current workspace datasets rather than creating isolated tooling
- It is meant to sit on top of the main AutoML lifecycle

## Training Pipeline Components

### Data Validation

Implemented in `backend/services/training/components.py`.

What it does:

- loads the dataset
- checks target existence
- trims to selected features if requested
- drops columns with extreme missingness
- attempts numeric coercion on object columns
- runs leakage detection
- saves the data contract
- fits drift baseline

Why:

- it prevents garbage-in training runs
- it creates the baseline metadata needed by later features such as drift monitoring and contract checks

### Feature Engineering

What it does:

- optional auto-cleaning
- task-type inference
- managed feature generation for smaller datasets
- invalid target cleanup
- label encoding for classification
- train/test split
- summary statistics for later reporting

Why:

- it balances automation with safety
- feature synthesis is only used when dataset size and width are still manageable

### Model Selection

What it does:

- builds a profile of the problem
- asks the selector/meta-learner for a candidate pool
- chooses light or full preprocessing based on mode
- enables dimensionality guidance for wide datasets

Why:

- it makes the model search adaptive instead of static

### Training

What it does:

- handles imbalance when configured
- executes sweep/tuning behavior
- trains the final pipeline
- evaluates on holdout data
- computes leaderboard, metrics, and explainability artifacts
- stores model metadata and run artifacts

Why:

- it separates fast exploration from deeper optimization
- it preserves enough metadata for later explainability and operational tools

## Explainability Features

### SHAP Global Importance

- ranks feature impact across the trained model
- used in the Results Console and reports

### Local Explanation

- explains a single prediction using per-feature contributions
- used to rank candidate features for counterfactual search

### Counterfactual Generator

Endpoint:

- `POST /api/counterfactual/{job_id}`

Logic:

- only enabled for classification jobs
- loads the saved model and training dataset context
- validates that all expected feature inputs are present
- scores the original row
- gets local feature contributions
- ranks the most influential features
- tries one-feature changes using numeric quantiles or common categorical alternatives
- returns the smallest single-field changes that flip the prediction

Current behavior:

- verified against job `3c29d593-31b3-4116-bf5e-a1b3d48d130b`
- returned a valid one-feature flip suggestion on `Age`

### Feature Lineage

- inspects the saved preprocessor and maps transformed feature names back to raw groups

### Calibration And Threshold Tuning

- computes classification calibration bins
- sweeps thresholds and compares precision/recall/F1

### Trust Heatmap

- checks how often features appear across recent runs for the same dataset
- compares historical importance stability
- marks features as stable, noisy, drift-prone, or leakage-risky

## Prediction And Simulation Features

### Live Prediction

Endpoint:

- `POST /api/predict/{job_id}`

Logic:

- validates exact feature contract
- builds a one-row inference frame in expected column order
- returns prediction and confidence when probabilities exist

### Scenario Sweep / What-If Simulation

Endpoint:

- `POST /api/future`

Logic:

- takes a base feature vector
- varies one selected feature over multiple values
- scores each generated row independently
- returns prediction and optional confidence for each point

Current behavior:

- verified against job `3c29d593-31b3-4116-bf5e-a1b3d48d130b`
- returned valid prediction points for a `Salary` sweep with no point-level errors

### Feature Contract Checker

- compares uploaded inference data against the training feature schema
- reports missing features, extra columns, dtype mismatches, and alignment status

## Data Management Features

### Dataset Catalog

- lists known datasets with profile summaries
- supports derived datasets and lineage-aware workflows

### Merge Studio

- previews join quality before creating merged datasets
- reports overlap, duplicate keys, row multiplier, and sample merged records

### Repair Preview / Apply

- shows proposed cleaning effect before materializing a new dataset version

### Synthetic Data Generator

- creates synthetic rows from the current dataset
- stores derived dataset with lineage back to parent

### Synthetic Data Judge

- compares synthetic output against the parent dataset
- checks numeric distribution drift and categorical mix overlap
- returns realism score, verdict, and notes

## Experiment And Governance Features

### Experiment Tracking

- stores run-level summary information in `experiment_runs`
- powers run archive and comparison UI

### Registry Labels

- lightweight promotion workflow for production candidates

### Team Notes

- lets users annotate runs without leaving the product

### Run Diff Engine

- compares config and output changes between two runs

## Reporting And Export

### Report Generator

- creates PDF summaries with dataset overview, metrics, and SHAP importance

### Export Bundle

- packages model, metadata, and training context for later restore

## Natural Language And AI Helpers

### Narrative Generator

- builds a human-readable summary of the experiment
- uses stored job story when available, otherwise composes a fallback narrative

### Natural Language ML Helpers

- parse or support ML-oriented natural language workflows from the Smart AI Hub

## Storage And State Model

### Database Models

- `DatasetModel`: uploaded/imported/derived datasets
- `JobModel`: training jobs and result payloads
- `ExperimentRun`: completed-run archive
- `DriftCheck`: drift history
- `DriftSchedule`: cadence configuration
- `ModelRegistryEntry` and `TeamNote`: governance helpers

### Run Artifacts

Per-run directories store:

- model pickle
- metrics JSON
- schema/data contract
- drift baseline
- model metadata
- exports and reports

## Important Implementation Notes

### Result Contract Normalization

- all result payloads are sanitized to be JSON-safe
- prevents NaN/Inf and shape drift from breaking the UI

### Session Safety

- the DB session now uses `expire_on_commit=False`
- this avoids detached-instance crashes when recently loaded attributes are used after commit

### Known Residual Noise

- LightGBM/scikit-learn still emits `X does not have valid feature names` warnings during some prediction paths
- these warnings did not block counterfactual or scenario sweep in runtime verification, but they are worth cleaning up later for quieter logs

## Verified Status From This Pass

- Counterfactual: working on the verified completed job
- Scenario sweep: working on the verified completed job
- Trust heatmap: no longer failing from the detached-session regression in the verified direct check
- Dataset list helper: no longer failing from the detached-session regression in the verified direct check