# TaoTrain: Production-Grade LLM Training Framework

**TaoTrain** is a PyTorch framework for training large language models at every stage, from experimental pretraining through supervised fine-tuning to reinforcement learning. Unlike fragmented training scripts or heavyweight frameworks, TaoTrain unifies the **entire training pipeline** in a clean, modular codebase aimed at both ML engineers and software engineers.

## Current Taotern Work

TaoTrain now includes the Taotern comparison architectures used by the current SSM LLM work:

- `taonet`: the attention/MLA baseline.
- `taonet_ssm`: the TaoNet shell with the attention mixer replaced by the Gamma Space Model DPLR SSM.
- `taonet_hybrid`: an alternating attention/SSM TaoNet used for the current best 200M-class candidate.

The currently selected deployment-oriented run is `hybrid_ssm_first_199m`, a `199,480,928`-parameter model with 16 layers: SSM layers at `0,2,4,6,8,10,12,14` and attention layers at `1,3,5,7,9,11,13,15`. It uses the DPLR SSM core with split two-lane mixing, channel gates, per-channel local shift, and the faster convolution path for long-sequence training.
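
The SSM-first interleaving can be written down as a small helper. This is only an illustrative sketch: `hybrid_layer_types` is a hypothetical name and not part of TaoTrain's API.

```python
def hybrid_layer_types(num_layers: int, ssm_first: bool = True) -> list[str]:
    """Return the mixer type for each layer index in an alternating stack."""
    first, second = ("ssm", "attention") if ssm_first else ("attention", "ssm")
    return [first if i % 2 == 0 else second for i in range(num_layers)]


layout = hybrid_layer_types(16)
ssm_layers = [i for i, kind in enumerate(layout) if kind == "ssm"]
attention_layers = [i for i, kind in enumerate(layout) if kind == "attention"]
print(ssm_layers)        # [0, 2, 4, 6, 8, 10, 12, 14]
print(attention_layers)  # [1, 3, 5, 7, 9, 11, 13, 15]
```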

Remote run `taotern-200m-hybrid-chat-20260512` trains this model on TaoData for a 4B-token base stage and then runs SFT so the final artifact can be loaded as a chat model. The data-loading fixes added for this run are:

- Async JSONL iteration keeps polling while tokenization workers are alive instead of ending early after a temporary empty queue.
- Cached JSONL scan metadata is reused safely while recomputing chunk ranges for the active `samples_per_chunk` and `max_samples` settings.

## Why TaoTrain?

- **Complete Unified Pipeline**: Pretraining → SFT → RL in a single, consistent framework. No context switching between different codebases or architectures.
- **Production-Grade Engineering**: Type-safe Pydantic configs, comprehensive checkpointing, AimStack integration, and proper gradient handling—not research code, but a framework you can deploy.
- **Extensibility Without Modification**: Register custom models, optimizers, schedulers, and datasets via decorators. Experiment freely without forking the framework.
- **Developer Experience First**: Interactive TUI for inference, intuitive YAML configurations, async data loading that eliminates I/O bottlenecks, and clear abstractions that make the codebase a pleasure to work with.

## Key Capabilities

| Capability | Details |
|---|---|
| **Multi-Stage Training** | Unified infrastructure for pretraining, SFT, and RL. Share model checkpoints, logging, and evaluation across stages. |
| **Advanced Optimization** | Hybrid Muon + AdamW optimizer: efficient 2D weight updates via SVD-based methods + adaptive learning for 1D parameters. |
| **Modern Architectures** | DeepSeek MLA with grouped query attention (GQA), YaRN context extension, and factorized embeddings—all configurable via YAML. |
| **Production Features** | BF16 mixed precision training, gradient accumulation, proper gradient clipping, checkpoint resumption, and validation loops. |
| **Async Data Pipeline** | Background tokenization with multi-threaded workers. Stream billion-token datasets from JSONL without loading into memory. |
| **Interactive Inference** | TUI chat interface with real-time generation speed metrics and multi-model comparison. |
| **Logging & Monitoring** | AimStack integration tracks loss, metrics, hyperparameters, and git hashes for reproducibility. Visualize training runs in your browser. |

## Getting Started

### Installation

```bash
git clone https://github.com/lobakkang/taoTrain.git
cd taoTrain
pip install -e .
```

### Training Examples

**Pretraining on a custom dataset:**
```bash
train pretrain --config configs/pretrain.yaml
```
Starts from scratch, learns representations from raw text via next-token prediction.

**Supervised Fine-tuning:**
```bash
train sft --config configs/sft.yaml
```
Fine-tune a pretrained model on instruction-response pairs for improved task performance.

**Reinforcement Learning (DPO):**
```bash
train rl --config configs/rl_dpo.yaml
```
Align models with human preferences using Direct Preference Optimization.

**Interactive Chat:**
```bash
tui-chat --model checkpoints/model.pt
```
Launch an interactive TUI to chat with your model and monitor generation metrics in real-time.

### Configuration

All training is configured via YAML with Pydantic validation. Configs are type-safe and automatically validated:

```yaml
# configs/sft.yaml
model:
  architecture_type: "mla"  # DeepSeek MLA with GQA
  hidden_dim: 2048
  num_layers: 24
  num_heads: 32
  d_latent_kv: 1536  # KV latent dimension (size of the compressed KV representation)

training:
  num_epochs: 3
  batch_size: 32
  learning_rate: 1e-4
  warmup_ratio: 0.1
  max_grad_norm: 1.0

optimizer:
  optimizer_type: "muon_adamw"  # Hybrid Muon + AdamW
  muon_momentum: 0.95

data:
  dataset_type: "sft_jsonl"  # or "sft_hf" for HuggingFace
  path: "data/sft_training.jsonl"
  
logging:
  log_to_aim: true
  aim_repo: "/tmp/aim_logs"
```

See `configs/` for complete examples.

## Project Architecture

```
src/taoTrain/
├── cli.py                      # Main CLI entry point
├── config.py                   # Pydantic configuration schemas
│
├── core/                       # Base abstractions
│   └── base.py                 # BaseModel, BaseDataset, BaseTrainer
│
├── models/                     # Pluggable architecture system
│   ├── registry.py             # Architecture factory with @register_architecture
│   ├── taonet.py               # SimpleLLM with DeepSeek MLA
│   ├── mla_components.py       # KV compression, GQA, YaRN
│   ├── embeddings.py           # Factorized embeddings
│   └── transformer.py          # Standard Transformer reference
│
├── data/                       # Advanced data pipeline
│   ├── factory.py              # Dataset factory (HF + JSONL backends)
│   ├── async_loader.py         # Async batch iteration (no I/O bottleneck)
│   ├── tokenization_queue.py   # Background multi-threaded tokenization
│   ├── chunk_manager.py        # Stream billion-token JSONL files
│   ├── hf_pretrain.py          # HuggingFace pretraining datasets
│   ├── hf_sft.py               # HuggingFace SFT datasets
│   ├── hf_rl.py                # HuggingFace RL datasets
│   ├── pretrain_jsonl.py       # JSONL pretraining
│   ├── sft_jsonl.py            # JSONL SFT with instructions
│   └── rl_jsonl.py             # JSONL RL with preferences
│
├── training/                   # Unified training infrastructure
│   └── trainer.py              # Trainer + PretrainTrainer, SFTTrainer, RLTrainer
│
├── optimizers/                 # Pluggable optimizer system
│   ├── registry.py             # Optimizer factory with @register_optimizer
│   ├── hybrid_muon_adamw.py    # Composite: Muon (2D) + AdamW (1D)
│   ├── adamw.py                # AdamW with weight decay
│   ├── adam.py                 # Standard Adam
│   └── sgd.py                  # SGD variants
│
├── schedulers/                 # Learning rate schedules
│   ├── registry.py             # LR scheduler factory
│   ├── cosine_warmup.py        # 3-phase: linear warmup → plateau → cosine decay
│   ├── linear_warmup.py        # Linear warmup + constant
│   └── constant.py             # Constant learning rate
│
├── inference/                  # Inference & interaction
│   ├── inferencer.py           # Load & run inference from checkpoints
│   └── tui.py                  # Interactive chat with metrics display
│
├── checkpointing/              # State management
│   └── checkpoint.py           # Save/load model + optimizer + config + metrics
│
├── logging/                    # Experiment tracking
│   └── aim_logger.py           # AimStack integration (loss, metrics, hyperparams)
│
├── benchmarks/                 # Evaluation tools
│   └── runner.py               # Perplexity, speed, and task-specific benchmarks
│
└── utils/
    └── helpers.py              # Utility functions

configs/                        # Example YAML configurations
├── pretrain.yaml               # Pretraining config
├── sft.yaml                    # SFT config
├── rl_dpo.yaml                 # RL/DPO config
└── tokenizer.yaml              # Tokenizer config

tests/                          # Unit & integration tests
└── test_dataset.py
```

## Extensible Architecture: The Registry Pattern

TaoTrain's power lies in its **pluggable design**. Add custom models, optimizers, schedulers, and datasets without modifying the framework.

### Custom Model Architecture

```python
from taoTrain.models import register_architecture, BaseModel
import torch.nn as nn

@register_architecture("custom_moe")
class MixtureOfExperts(BaseModel):
    """Your custom MoE architecture"""
    def __init__(self, config):
        super().__init__(config)
        self.experts = nn.ModuleList([
            nn.Linear(config.hidden_dim, config.hidden_dim)
            for _ in range(config.num_experts)
        ])
        self.router = nn.Linear(config.hidden_dim, config.num_experts)
    
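    def forward(self, input_ids, attention_mask=None, labels=None):
        # Your implementation: compute_logits and compute_loss are placeholders
        # for methods you define on this class.
        logits = self.compute_logits(input_ids)
        loss = self.compute_loss(logits, labels) if labels is not None else None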
        return {"logits": logits, "loss": loss}
```

Then use it in your config:

```yaml
model:
  architecture_type: "custom_moe"
  hidden_dim: 2048
  num_experts: 8
```

### Custom Optimizers & Schedulers

The same pattern works for optimizers and learning rate schedules:

```python
from taoTrain.optimizers import register_optimizer
from torch.optim import Optimizer

@register_optimizer("my_adaptive_optimizer")
class MyAdaptiveOptimizer(Optimizer):
    def step(self, closure=None):
        # Your optimization logic
        pass
```

```python
from taoTrain.schedulers import register_scheduler

@register_scheduler("my_schedule")
def my_schedule(initial_lr, step, total_steps, **kwargs):
    return initial_lr * (1.0 - step / total_steps)  # Linear decay
```

**The key principle**: No framework code needs to change. You register once, it's available everywhere.
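
For readers unfamiliar with the pattern, here is a minimal sketch of how a decorator-based registry generally works. It is a standalone stand-in, not the contents of TaoTrain's `registry.py`: the `_SCHEDULERS` dictionary and the `build_scheduler` lookup are assumed names used only for illustration.

```python
from typing import Callable, Dict

_SCHEDULERS: Dict[str, Callable] = {}  # hypothetical module-level registry


def register_scheduler(name: str) -> Callable:
    """Decorator that stores the decorated object under `name`."""
    def decorator(obj: Callable) -> Callable:
        if name in _SCHEDULERS:
            raise ValueError(f"scheduler {name!r} is already registered")
        _SCHEDULERS[name] = obj
        return obj  # returned unchanged so it stays usable directly
    return decorator


def build_scheduler(name: str) -> Callable:
    """Look up a registered scheduler by the name used in the config."""
    try:
        return _SCHEDULERS[name]
    except KeyError:
        known = sorted(_SCHEDULERS)
        raise KeyError(f"unknown scheduler {name!r}; known: {known}") from None


@register_scheduler("linear_decay")
def linear_decay(initial_lr, step, total_steps, **kwargs):
    return initial_lr * (1.0 - step / total_steps)


schedule = build_scheduler("linear_decay")
print(schedule(1e-4, step=500, total_steps=1000))  # 5e-05
```

Because registration happens at import time, the only requirement is that the module defining the custom component is imported before the config is resolved.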

### Dataset Backend Flexibility

Define custom datasets (JSONL, HF, streaming, etc.) and let the factory route to them:

```python
from taoTrain.core.base import BaseDataset  # base class defined in core/base.py
from taoTrain.data import register_dataset

@register_dataset("pretrain", "my_backend")
class MyPretrainDataset(BaseDataset):
    def __init__(self, config):
        # Load from your custom backend
        pass
    
    def __getitem__(self, idx):
        return {"input_ids": ..., "attention_mask": ...}
```

Use in config:

```yaml
data:
  dataset_type: "pretrain"
  backend_type: "my_backend"  # Routes to MyPretrainDataset
```

## Design Highlights

### Async Data Loading: No I/O Bottleneck

Most training frameworks load and tokenize data on the main training thread, blocking compute. TaoTrain's **multi-threaded tokenization pipeline**:

- Tokenizes data in background workers while your GPU trains
- Supports streaming billion-token JSONL files without loading into memory
- Intelligent chunking (by file size or sample count)
- Metadata caching to avoid rescanning

**Result**: 10-100x faster data iteration on large datasets.
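
To make the chunking and metadata-caching bullets concrete, the sketch below recomputes chunk boundaries from a cached sample count instead of rescanning the file. The function name, the shape of `cached_scan`, and the half-open range convention are assumptions for illustration; only the `samples_per_chunk` and `max_samples` setting names come from TaoTrain.

```python
def chunk_ranges(total_samples, samples_per_chunk, max_samples=None):
    """Split [0, n) into half-open (start, end) ranges of at most samples_per_chunk."""
    if samples_per_chunk <= 0:
        raise ValueError("samples_per_chunk must be positive")
    n = total_samples if max_samples is None else min(total_samples, max_samples)
    return [(start, min(start + samples_per_chunk, n))
            for start in range(0, n, samples_per_chunk)]


cached_scan = {"total_samples": 10}  # pretend this came from a cached file scan
print(chunk_ranges(cached_scan["total_samples"], samples_per_chunk=4))
# [(0, 4), (4, 8), (8, 10)]
print(chunk_ranges(cached_scan["total_samples"], samples_per_chunk=4, max_samples=6))
# [(0, 4), (4, 6)]
```

Keeping the expensive scan result separate from the cheap range computation is what lets the same cache serve runs with different chunk settings.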

### Type-Safe Configuration

Forget YAML parsing errors or mysterious config bugs. TaoTrain uses **Pydantic dataclasses** for configuration:

- Automatic type validation: a mistyped value such as `learning_rate: "fast"` raises a validation error instead of failing silently
- Serialization: configs are part of checkpoints, ensuring reproducibility
- IDE support: autocomplete and type hints for all config fields
- Defaults: sensible defaults for all parameters

### Benchmarking & Metrics

Track what matters:

- **Perplexity**: Language modeling quality on held-out data
- **Generation Speed**: Tokens-per-second (useful for TUI or deployment)
- **Task-Specific Accuracy**: Evaluate on downstream tasks
- **Training Metrics**: Loss curves, gradient norms, effective batch size

All logged to AimStack with git hashes for reproducibility.
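
For reference, the first two metrics reduce to short formulas. The helpers below are a sketch with hypothetical names, not the code in `benchmarks/runner.py`; they assume per-token negative log-likelihoods measured in nats.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def tokens_per_second(num_tokens, elapsed_seconds):
    """Generation speed as generated tokens divided by wall-clock seconds."""
    return num_tokens / elapsed_seconds


# A model that assigns probability 1/4 to every target token has perplexity 4.
nlls = [math.log(4.0)] * 8
print(round(perplexity(nlls), 6))   # 4.0
print(tokens_per_second(512, 2.0))  # 256.0
```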

## Logging with AimStack

Automatically track and visualize experiments:

```bash
aim up --host 0.0.0.0
```

Then open `http://localhost:43800` to see:

- **Loss curves** per training step
- **Hyperparameters** (learning rate, batch size, model architecture)
- **Git hashes** for reproducibility
- **Custom metrics** (perplexity, validation accuracy, generation speed)
- **Compare runs**: Side-by-side experiment comparison

## Advanced Features

### Checkpointing with Resumption

TaoTrain saves complete training state:

```python
checkpoint = {
    "step": 12500,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "config": config,  # Full config as Pydantic object
    "metrics": metrics_tracker.to_dict(),
}
```

Resume training from any checkpoint without loss of state. Keep last N checkpoints automatically.
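
The "keep last N" retention policy can be sketched with the standard library alone. The `step_<N>.pt` file naming and the `prune_checkpoints` helper are assumptions made for this example, not TaoTrain's actual layout or API.

```python
import re
import tempfile
from pathlib import Path


def prune_checkpoints(directory, keep_last):
    """Delete all but the `keep_last` highest-step checkpoint files."""
    pattern = re.compile(r"step_(\d+)\.pt$")
    found = []
    for path in Path(directory).iterdir():
        match = pattern.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort()  # oldest step first
    stale = found[:-keep_last] if keep_last > 0 else found
    for _, path in stale:
        path.unlink()
    return [path.name for _, path in found[len(stale):]]


with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500, 2000):
        (Path(tmp) / f"step_{step}.pt").touch()
    kept = prune_checkpoints(tmp, keep_last=2)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
    print(kept)  # ['step_1500.pt', 'step_2000.pt']
```

Sorting by the parsed step number rather than by file name avoids the lexicographic trap where `step_500.pt` would sort after `step_2000.pt`.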

### Mixed Precision Training (BF16)

```yaml
training:
  use_bfloat16: true
  gradient_accumulation_steps: 4
```

- BF16 via `torch.autocast` for ~2x speedup with minimal accuracy loss
- Proper gradient scaling and clipping
- Compatible with all optimizers and architectures

### 3-Phase Learning Rate Schedule

```yaml
scheduler:
  scheduler_type: "cosine_warmup"
  warmup_ratio: 0.1          # 10% of training steps
  steady_ratio: 0.5          # 50% at steady rate
  min_lr_ratio: 0.1          # Final LR = 0.1 × initial_lr
  num_cycles: 1
```

This schedule:
1. **Linear warmup** (0 → 1) over 10% of steps
2. **Steady plateau** at full LR over 50% of steps
3. **Cosine decay** (1 → 0.1) over remaining 40% of steps

The plateau holds the learning rate at its peak for longer than a plain cosine or linear decay would, which is intended to improve convergence.
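
The three phases can be expressed as one function of the step index. This is a sketch under stated assumptions rather than the code in `cosine_warmup.py`: it follows the scheduler signature used in the registry example, assumes a single cosine cycle (`num_cycles: 1`), and rounds phase boundaries down to whole steps.

```python
import math


def three_phase_lr(initial_lr, step, total_steps,
                   warmup_ratio=0.1, steady_ratio=0.5, min_lr_ratio=0.1):
    """Linear warmup, then plateau, then one cosine decay to min_lr_ratio * initial_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    steady_end = warmup_steps + int(total_steps * steady_ratio)
    if step < warmup_steps:
        return initial_lr * step / warmup_steps
    if step < steady_end:
        return initial_lr
    decay_steps = max(1, total_steps - steady_end)
    progress = min(1.0, (step - steady_end) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return initial_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# Sample the multiplier at the start, mid-warmup, plateau, decay start, and end.
for step in (0, 50, 100, 600, 1000):
    print(step, three_phase_lr(1.0, step, total_steps=1000))
```

With 1,000 total steps the warmup covers steps 0 to 99, the plateau covers steps 100 to 599, and the cosine decay runs from step 600 to the end.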

### Gradient Accumulation & Clipping

Simulate larger batch sizes with gradient accumulation:

```yaml
training:
  batch_size: 32
  gradient_accumulation_steps: 4  # Effective batch = 128
  max_grad_norm: 1.0               # Gradient clipping
```
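
The arithmetic behind these two settings can be shown without a GPU. The sketch below uses plain Python lists in place of tensors and is not TaoTrain's trainer code; `clip_by_global_norm` is a hypothetical helper that mirrors the general idea behind `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale grads so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


micro_batch_grads = [[3.0, 0.0], [0.0, 4.0], [3.0, 4.0], [6.0, 8.0]]
accum_steps = len(micro_batch_grads)  # plays the role of gradient_accumulation_steps

accumulated = [0.0, 0.0]
for grads in micro_batch_grads:
    # Dividing each micro-batch gradient by accum_steps makes the sum an average.
    accumulated = [a + g / accum_steps for a, g in zip(accumulated, grads)]

# Clip once per optimizer step, after accumulation is complete.
clipped, norm_before = clip_by_global_norm(accumulated, max_norm=1.0)
print(accumulated)  # [3.0, 4.0]
print(norm_before)  # 5.0
```

Clipping the accumulated gradient once, rather than each micro-batch separately, keeps the result equivalent to clipping a single large batch.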

## Contributing

Contributions are welcome! TaoTrain is designed to make contributions easy:

1. **Add a model**: Implement `BaseModel` and `@register_architecture("name")`
2. **Add an optimizer**: Implement `torch.optim.Optimizer` and `@register_optimizer("name")`
3. **Add a dataset**: Implement `BaseDataset` and `@register_dataset(mode, backend_type)`
4. **Improve the core**: Submit PRs to `training/`, `data/`, `logging/`, etc.

Ensure new code includes:
- Type hints throughout
- Pydantic configs for new parameters
- Unit tests in `tests/`
- Documentation in docstrings and README

## Current Scope & Roadmap

### ✅ Currently Supported

- **Single GPU / single node** training
- **Pretraining, SFT, and RL training** stages
- **HuggingFace and JSONL** data backends
- **BF16 mixed precision** training
- **Checkpoint saving/loading** with resumption
- **Interactive inference** via TUI
- **Benchmarking** (perplexity, speed)
- **Pluggable architectures, optimizers, schedulers, datasets**

### 🚀 Roadmap (Future)

- **Distributed training** (DDP, FSDP) for multi-GPU/multi-node scaling
- **Quantization** support (INT8, QLoRA)
- **Advanced evaluation** (BLEU, ROUGE, custom tasks)
- **Streaming inference** with KV cache
- **Speculative decoding** for faster generation
- **Integration with popular model hubs** (Hugging Face Hub upload/download)

---

## Getting Help

- **Questions?** Open an issue on GitHub
- **Want to contribute?** See `CONTRIBUTING.md` (coming soon)
- **Found a bug?** Report it with a minimal reproduction script

## License

MIT