# TaoTrain: Production-Grade LLM Training Framework

**TaoTrain** is a PyTorch framework for training large language models across the full pipeline: pretraining, supervised fine-tuning, and reinforcement learning. Unlike fragmented training scripts or heavyweight frameworks, TaoTrain unifies the **entire training pipeline** in a clean, modular codebase aimed at both ML engineers and software engineers.

## Current Taotern Work

TaoTrain now includes the Taotern comparison architectures used by the current SSM LLM work:

- `taonet`: the attention/MLA baseline.
- `taonet_ssm`: the TaoNet shell with the attention mixer replaced by the Gamma Space Model DPLR SSM.
- `taonet_hybrid`: an alternating attention/SSM TaoNet used for the current best 200M-class candidate.

The current selected deployment-oriented run is `hybrid_ssm_first_199m`, a `199,480,928` parameter model with 16 layers: SSM layers at `0,2,4,6,8,10,12,14` and attention layers at `1,3,5,7,9,11,13,15`. It uses the DPLR SSM core with split two-lane mixing, channel gates, per-channel local shift, and the faster convolution path for long-sequence training.
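The alternating layout described above can be written down as a small helper. This is a sketch for illustration only: `hybrid_layer_types` and the `"ssm"`/`"attention"` labels are hypothetical names, not part of TaoTrain's API.

```python
def hybrid_layer_types(num_layers: int = 16, ssm_first: bool = True) -> list[str]:
    """Return the mixer type for each layer index in an alternating stack."""
    first, second = ("ssm", "attention") if ssm_first else ("attention", "ssm")
    return [first if i % 2 == 0 else second for i in range(num_layers)]


layout = hybrid_layer_types()
ssm_layers = [i for i, kind in enumerate(layout) if kind == "ssm"]
attention_layers = [i for i, kind in enumerate(layout) if kind == "attention"]
print(ssm_layers)        # [0, 2, 4, 6, 8, 10, 12, 14]
print(attention_layers)  # [1, 3, 5, 7, 9, 11, 13, 15]
```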

Remote run `taotern-200m-hybrid-chat-20260512` trains this model on TaoData for a 4B-token base stage and then runs SFT so the final artifact can be loaded as a chat model. The training fixes added for this run are:

- Async JSONL iteration keeps polling while tokenization workers are alive instead of ending early after a temporary empty queue.
- Cached JSONL scan metadata is reused safely while recomputing chunk ranges for the active `samples_per_chunk` and `max_samples` settings.
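The first fix can be illustrated with the standard library alone. The sketch below is not TaoTrain's loader; `iter_until_workers_done` and `slow_tokenizer` are hypothetical names. It shows the general idea: an empty queue ends iteration only once every worker thread has exited, and anything enqueued just before exit is drained first.

```python
import queue
import threading
import time


def iter_until_workers_done(q, workers, poll_interval=0.05):
    """Yield items from q until it is empty AND every worker has exited."""
    while True:
        try:
            yield q.get(timeout=poll_interval)
        except queue.Empty:
            if any(w.is_alive() for w in workers):
                continue  # temporary gap: workers are still producing
            # Workers are done; drain anything enqueued after the timeout.
            while True:
                try:
                    yield q.get_nowait()
                except queue.Empty:
                    return


def slow_tokenizer(q, lines):
    """Stand-in worker: whitespace 'tokenization', slower than the consumer polls."""
    for line in lines:
        time.sleep(0.15)  # longer than poll_interval, so the queue runs dry
        q.put(line.split())


q = queue.Queue()
worker = threading.Thread(target=slow_tokenizer, args=(q, ["a b", "c d e", "f"]))
worker.start()
batches = list(iter_until_workers_done(q, [worker]))
worker.join()
print(batches)  # [['a', 'b'], ['c', 'd', 'e'], ['f']]
```

A loop that stopped on the first `queue.Empty` would return no batches here, because the first item arrives after the first poll times out.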

## Why TaoTrain?

- **Complete Unified Pipeline**: Pretraining → SFT → RL in a single, consistent framework. No context switching between different codebases or architectures.
- **Production-Grade Engineering**: Type-safe Pydantic configs, comprehensive checkpointing, AimStack integration, and proper gradient handling—not research code, but a framework you can deploy.
- **Extensibility Without Modification**: Register custom models, optimizers, schedulers, and datasets via decorators. Experiment freely without forking the framework.
- **Developer Experience First**: Interactive TUI for inference, intuitive YAML configurations, async data loading that eliminates I/O bottlenecks, and clear abstractions that make the codebase a pleasure to work with.

## Key Capabilities

| Capability | Details |
|---|---|
| **Multi-Stage Training** | Unified infrastructure for pretraining, SFT, and RL. Share model checkpoints, logging, and evaluation across stages. |
| **Advanced Optimization** | Hybrid Muon + AdamW optimizer: efficient 2D weight updates via SVD-based methods + adaptive learning for 1D parameters. |
| **Modern Architectures** | DeepSeek MLA with grouped query attention (GQA), YaRN context extension, and factorized embeddings—all configurable via YAML. |
| **Production Features** | BF16 mixed precision training, gradient accumulation, proper gradient clipping, checkpoint resumption, and validation loops. |
| **Async Data Pipeline** | Background tokenization with multi-threaded workers. Stream billion-token datasets from JSONL without loading into memory. |
| **Interactive Inference** | TUI chat interface with real-time generation speed metrics and multi-model comparison. |
| **Logging & Monitoring** | AimStack integration tracks loss, metrics, hyperparameters, and git hashes for reproducibility. Visualize training runs in your browser. |

## Getting Started

### Installation

```bash
git clone https://github.com/lobakkang/taoTrain.git
cd taoTrain
pip install -e .
```

### Training Examples

**Pretraining on a custom dataset:**
```bash
train pretrain --config configs/pretrain.yaml
```
Trains from scratch, learning representations from raw text via next-token prediction.

**Supervised Fine-tuning:**
```bash
train sft --config configs/sft.yaml
```
Fine-tunes a pretrained model on instruction-response pairs for improved task performance.

**Reinforcement Learning (DPO):**
```bash
train rl --config configs/rl_dpo.yaml
```
Aligns models with human preferences using Direct Preference Optimization.

**Interactive Chat:**
```bash
tui-chat --model checkpoints/model.pt
```
Launches an interactive TUI to chat with your model and monitor generation metrics in real time.

### Configuration

All training is configured via YAML with Pydantic validation. Configs are type-safe and automatically validated:

```yaml
# configs/sft.yaml
model:
  architecture_type: "mla"  # DeepSeek MLA with GQA
  hidden_dim: 2048
  num_layers: 24
  num_heads: 32
  d_latent_kv: 1536  # KV compression factor

training:
  num_epochs: 3
  batch_size: 32
  learning_rate: 1e-4
  warmup_ratio: 0.1
  max_grad_norm: 1.0

optimizer:
  optimizer_type: "muon_adamw"  # Hybrid Muon + AdamW
  muon_momentum: 0.95

data:
  dataset_type: "sft_jsonl"  # or "sft_hf" for HuggingFace
  path: "data/sft_training.jsonl"

logging:
  log_to_aim: true
  aim_repo: "/tmp/aim_logs"
```

See `configs/` for complete examples.

## Project Architecture

```
src/taoTrain/
├── cli.py                      # Main CLI entry point
├── config.py                   # Pydantic configuration schemas
│
├── core/                       # Base abstractions
│   └── base.py                 # BaseModel, BaseDataset, BaseTrainer
│
├── models/                     # Pluggable architecture system
│   ├── registry.py             # Architecture factory with @register_architecture
│   ├── taonet.py               # SimpleLLM with DeepSeek MLA
│   ├── mla_components.py       # KV compression, GQA, YaRN
│   ├── embeddings.py           # Factorized embeddings
│   └── transformer.py          # Standard Transformer reference
│
├── data/                       # Advanced data pipeline
│   ├── factory.py              # Dataset factory (HF + JSONL backends)
│   ├── async_loader.py         # Async batch iteration (no I/O bottleneck)
│   ├── tokenization_queue.py   # Background multi-threaded tokenization
│   ├── chunk_manager.py        # Stream billion-token JSONL files
│   ├── hf_pretrain.py          # HuggingFace pretraining datasets
│   ├── hf_sft.py               # HuggingFace SFT datasets
│   ├── hf_rl.py                # HuggingFace RL datasets
│   ├── pretrain_jsonl.py       # JSONL pretraining
│   ├── sft_jsonl.py            # JSONL SFT with instructions
│   └── rl_jsonl.py             # JSONL RL with preferences
│
├── training/                   # Unified training infrastructure
│   └── trainer.py              # Trainer + PretrainTrainer, SFTTrainer, RLTrainer
│
├── optimizers/                 # Pluggable optimizer system
│   ├── registry.py             # Optimizer factory with @register_optimizer
│   ├── hybrid_muon_adamw.py    # Composite: Muon (2D) + AdamW (1D)
│   ├── adamw.py                # AdamW with weight decay
│   ├── adam.py                 # Standard Adam
│   └── sgd.py                  # SGD variants
│
├── schedulers/                 # Learning rate schedules
│   ├── registry.py             # LR scheduler factory
│   ├── cosine_warmup.py        # 3-phase: linear warmup → plateau → cosine decay
│   ├── linear_warmup.py        # Linear warmup + constant
│   └── constant.py             # Constant learning rate
│
├── inference/                  # Inference & interaction
│   ├── inferencer.py           # Load & run inference from checkpoints
│   └── tui.py                  # Interactive chat with metrics display
│
├── checkpointing/              # State management
│   └── checkpoint.py           # Save/load model + optimizer + config + metrics
│
├── logging/                    # Experiment tracking
│   └── aim_logger.py           # AimStack integration (loss, metrics, hyperparams)
│
├── benchmarks/                 # Evaluation tools
│   └── runner.py               # Perplexity, speed, and task-specific benchmarks
│
└── utils/
    └── helpers.py              # Utility functions

configs/                        # Example YAML configurations
├── pretrain.yaml               # Pretraining config
├── sft.yaml                    # SFT config
├── rl_dpo.yaml                 # RL/DPO config
└── tokenizer.yaml              # Tokenizer config

tests/                          # Unit & integration tests
└── test_dataset.py
```

## Extensible Architecture: The Registry Pattern

TaoTrain's power lies in its **pluggable design**. Add custom models, optimizers, schedulers, and datasets without modifying the framework.

### Custom Model Architecture

```python
from taoTrain.models import register_architecture, BaseModel
import torch.nn as nn


@register_architecture("custom_moe")
class MixtureOfExperts(BaseModel):
    """Your custom MoE architecture"""

    def __init__(self, config):
        super().__init__(config)
        # One feed-forward expert per slot, plus a router that scores them
        self.experts = nn.ModuleList([
            nn.Linear(config.hidden_dim, config.hidden_dim)
            for _ in range(config.num_experts)
        ])
        self.router = nn.Linear(config.hidden_dim, config.num_experts)

    def forward(self, input_ids, attention_mask=None, labels=None):
        # Your implementation; compute_logits and compute_loss are placeholders.
        # labels is optional so the same forward pass serves training and inference.
        logits = self.compute_logits(input_ids)
        loss = self.compute_loss(logits, labels) if labels is not None else None
        return {"logits": logits, "loss": loss}
```

Then use it in your config:

```yaml
model:
  architecture_type: "custom_moe"
  hidden_dim: 2048
  num_experts: 8
```

### Custom Optimizers & Schedulers

The same pattern works for optimizers and learning rate schedules:

```python
from taoTrain.optimizers import register_optimizer
from torch.optim import Optimizer


@register_optimizer("my_adaptive_optimizer")
class MyAdaptiveOptimizer(Optimizer):
    def step(self, closure=None):
        # Your optimization logic
        pass
```

```python
from taoTrain.schedulers import register_scheduler


@register_scheduler("my_schedule")
def my_schedule(initial_lr, step, total_steps, **kwargs):
    return initial_lr * (1.0 - step / total_steps)  # Linear decay
```

**The key principle**: No framework code needs to change. You register once, it's available everywhere.

### Dataset Backend Flexibility

Define custom datasets (JSONL, HF, streaming, etc.) and let the factory route to them:

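```python
from taoTrain.core.base import BaseDataset  # base.py defines BaseDataset (see project tree)
from taoTrain.data import register_dataset


@register_dataset("pretrain", "my_backend")
class MyPretrainDataset(BaseDataset):
    def __init__(self, config):
        # Load from your custom backend
        pass

    def __getitem__(self, idx):
        # Return one tokenized sample as a dict of tensors
        return {"input_ids": ..., "attention_mask": ...}
```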

Use in config:

```yaml
data:
  dataset_type: "pretrain"
  backend_type: "my_backend"  # Routes to MyPretrainDataset
```

## Framework Highlights

### Async Data Loading: No I/O Bottleneck

Most training frameworks load and tokenize data on the main training thread, blocking compute. TaoTrain's **multi-threaded tokenization pipeline**:

- Tokenizes data in background workers while your GPU trains
- Supports streaming billion-token JSONL files without loading into memory
- Intelligent chunking (by file size or sample count)
- Metadata caching to avoid rescanning

**Result**: 10-100x faster data iteration on large datasets.
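Chunking by sample count reduces to simple range arithmetic. The sketch below assumes the cached scan supplies a total sample count and that chunk boundaries are recomputed from the active settings; `compute_chunk_ranges` is a hypothetical helper, and only `samples_per_chunk` and `max_samples` are names taken from this README.

```python
from typing import List, Optional, Tuple


def compute_chunk_ranges(
    total_samples: int,
    samples_per_chunk: int,
    max_samples: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Split [0, total) into half-open (start, end) ranges of at most samples_per_chunk."""
    if samples_per_chunk <= 0:
        raise ValueError("samples_per_chunk must be positive")
    # max_samples caps how much of the file is used at all
    limit = total_samples if max_samples is None else min(total_samples, max_samples)
    return [
        (start, min(start + samples_per_chunk, limit))
        for start in range(0, limit, samples_per_chunk)
    ]


print(compute_chunk_ranges(10, 4))                 # [(0, 4), (4, 8), (8, 10)]
print(compute_chunk_ranges(10, 4, max_samples=6))  # [(0, 4), (4, 6)]
```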

### Type-Safe Configuration

Forget YAML parsing errors or mysterious config bugs. TaoTrain uses **Pydantic dataclasses** for configuration:

- Automatic type validation: a mistyped value such as `learning_rate: "fast"` raises a validation error instead of failing silently
- Serialization: configs are part of checkpoints, ensuring reproducibility
- IDE support: autocomplete and type hints for all config fields
- Defaults: sensible defaults for all parameters
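A minimal sketch of what such validation looks like with Pydantic. The `TrainingConfig` class below is illustrative and assumes field names from the YAML example earlier in this README; it is not TaoTrain's actual schema in `config.py`, and `load_training_config` is a hypothetical helper.

```python
from pydantic import BaseModel, ValidationError


class TrainingConfig(BaseModel):
    """Illustrative schema; field names follow the YAML example in this README."""
    num_epochs: int = 3
    batch_size: int = 32
    learning_rate: float = 1e-4
    warmup_ratio: float = 0.1
    max_grad_norm: float = 1.0


def load_training_config(raw: dict):
    """Return (config, None) on success or (None, error_message) on failure."""
    try:
        return TrainingConfig(**raw), None
    except ValidationError as exc:
        return None, str(exc)


ok, _ = load_training_config({"batch_size": 64})           # defaults fill the rest
bad, error = load_training_config({"learning_rate": "fast"})  # not a number
print(ok.batch_size, ok.learning_rate)  # 64 0.0001
print(bad is None)                      # True
```

Note that Pydantic's default (non-strict) mode coerces numeric strings, so a quoted `"1e-4"` would be accepted as a float; only values that cannot be parsed as the declared type are rejected.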

### Benchmarking & Metrics

Track what matters:

- **Perplexity**: Language modeling quality on held-out data
- **Generation Speed**: Tokens-per-second (useful for TUI or deployment)
- **Task-Specific Accuracy**: Evaluate on downstream tasks
- **Training Metrics**: Loss curves, gradient norms, effective batch size

All logged to AimStack with git hashes for reproducibility.
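The first two metrics have standard definitions that fit in a few lines. This is a sketch of the formulas only, not the code in `benchmarks/runner.py`; the function names are hypothetical, and per-token negative log-likelihoods are assumed to be in nats.

```python
import math
from typing import Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token), NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Generation throughput over a timed window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


# A model that assigns probability 1/4 to every token has perplexity ~4.
ppl = perplexity([math.log(4)] * 3)
speed = tokens_per_second(256, 2.0)
print(round(ppl, 6), speed)  # 4.0 128.0
```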

## Logging with AimStack

Automatically track and visualize experiments:

```bash
aim up --host 0.0.0.0
```

Then open `http://localhost:43800` to see:

- **Loss curves** per training step
- **Hyperparameters** (learning rate, batch size, model architecture)
- **Git hashes** for reproducibility
- **Custom metrics** (perplexity, validation accuracy, generation speed)
- **Compare runs**: Side-by-side experiment comparison

## Advanced Features

### Checkpointing with Resumption

TaoTrain saves complete training state:

```python
checkpoint = {
    "step": 12500,
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "config": config,  # Full config as Pydantic object
    "metrics": metrics_tracker.to_dict(),
}
```

Resume training from any checkpoint without loss of state. Keep last N checkpoints automatically.
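Keeping only the last N checkpoints can be sketched with the standard library. The `step_<number>.pt` file naming and the `prune_checkpoints` helper are assumptions made for this example; TaoTrain's own naming scheme and retention logic may differ.

```python
import re
import tempfile
from pathlib import Path
from typing import List

STEP_PATTERN = re.compile(r"step_(\d+)\.pt$")  # hypothetical naming scheme


def prune_checkpoints(directory: Path, keep_last_n: int) -> List[Path]:
    """Delete all but the keep_last_n highest-step checkpoints; return survivors."""
    if keep_last_n < 1:
        raise ValueError("keep_last_n must be at least 1")
    found = []
    for path in directory.iterdir():
        match = STEP_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort()  # numeric order, so step_1000 sorts after step_200
    for _, path in found[:-keep_last_n]:
        path.unlink()
    return [path for _, path in found[-keep_last_n:]]


# Demo in a throwaway directory with three empty checkpoint files
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)
    for step in (100, 200, 1000):
        (ckpt_dir / f"step_{step}.pt").touch()
    kept = [p.name for p in prune_checkpoints(ckpt_dir, keep_last_n=2)]
print(kept)  # ['step_200.pt', 'step_1000.pt']
```

Sorting on the parsed step number rather than the file name avoids the lexicographic trap where `step_1000.pt` would sort before `step_200.pt`.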

### Mixed Precision Training (BF16)

```yaml
training:
  use_bfloat16: true
  gradient_accumulation_steps: 4
```

- BF16 via `torch.autocast` for ~2x speedup with minimal accuracy loss
- Proper gradient clipping; BF16 generally does not need the loss scaling that FP16 requires
- Compatible with all optimizers and architectures

### 3-Phase Learning Rate Schedule

```yaml
scheduler:
  scheduler_type: "cosine_warmup"
  warmup_ratio: 0.1          # 10% of training steps
  steady_ratio: 0.5          # 50% at steady rate
  min_lr_ratio: 0.1          # Final LR = 0.1 × initial_lr
  num_cycles: 1
```

This schedule:
1. **Linear warmup** (0 → 1) over 10% of steps
2. **Steady plateau** at full LR over 50% of steps
3. **Cosine decay** (1 → 0.1) over remaining 40% of steps

Better convergence than simple cosine or linear decay.
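The three phases can be expressed as a single multiplier on the initial learning rate. The function below is a sketch derived from the description above, not the code in `cosine_warmup.py`: `three_phase_lr_multiplier` is a hypothetical name, a single cosine cycle is assumed (`num_cycles: 1`), and phase boundaries are rounded to whole steps.

```python
import math


def three_phase_lr_multiplier(
    step: int,
    total_steps: int,
    warmup_ratio: float = 0.1,
    steady_ratio: float = 0.5,
    min_lr_ratio: float = 0.1,
) -> float:
    """Multiplier applied to the initial LR: warmup, plateau, then cosine decay."""
    warmup_steps = round(total_steps * warmup_ratio)
    steady_end = warmup_steps + round(total_steps * steady_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)          # linear 0 -> 1
    if step < steady_end:
        return 1.0                                  # plateau at full LR
    decay_steps = max(1, total_steps - steady_end)
    progress = min(1.0, (step - steady_end) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine  # 1 -> min_lr_ratio


# Sample the schedule at phase boundaries for a 1000-step run
for s in (0, 50, 100, 600, 800, 1000):
    print(s, round(three_phase_lr_multiplier(s, 1000), 4))
```

With the defaults and 1000 total steps, the multiplier rises to 1.0 at step 100, holds until step 600, and reaches `min_lr_ratio` at step 1000.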

### Gradient Accumulation & Clipping

Simulate larger batch sizes with gradient accumulation:

```yaml
training:
  batch_size: 32
  gradient_accumulation_steps: 4  # Effective batch = 128
  max_grad_norm: 1.0               # Gradient clipping
```
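The arithmetic behind these two settings can be shown without a deep learning library. The sketch below uses plain Python lists in place of tensors and is not TaoTrain's trainer: `accumulate_and_clip` is a hypothetical helper. Micro-batch gradients are averaged over the accumulation steps, then rescaled so their global L2 norm does not exceed `max_grad_norm`, which is the same rule PyTorch's `torch.nn.utils.clip_grad_norm_` applies.

```python
import math
from typing import List, Sequence, Tuple


def accumulate_and_clip(
    micro_batch_grads: Sequence[Sequence[float]],
    max_grad_norm: float,
    eps: float = 1e-6,
) -> Tuple[List[float], float]:
    """Average per-micro-batch gradients, then clip by global L2 norm."""
    steps = len(micro_batch_grads)
    accumulated = [0.0] * len(micro_batch_grads[0])
    for grads in micro_batch_grads:
        for i, g in enumerate(grads):
            accumulated[i] += g / steps  # divide so the sum is a mean
    total_norm = math.sqrt(sum(g * g for g in accumulated))
    scale = min(1.0, max_grad_norm / (total_norm + eps))  # never scale up
    return [g * scale for g in accumulated], total_norm


batch_size, accumulation_steps = 32, 4
effective_batch = batch_size * accumulation_steps
clipped, norm_before = accumulate_and_clip(
    [[3.0, 4.0]] * accumulation_steps, max_grad_norm=1.0
)
print(effective_batch, norm_before)  # 128 5.0
```

Here the averaged gradient `[3.0, 4.0]` has norm 5.0, so it is scaled down to roughly `[0.6, 0.8]` before the optimizer step.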

## Contributing

Contributions are welcome! TaoTrain is designed to make contributions easy:

1. **Add a model**: Subclass `BaseModel` and decorate it with `@register_architecture("name")`
2. **Add an optimizer**: Subclass `torch.optim.Optimizer` and decorate it with `@register_optimizer("name")`
3. **Add a dataset**: Subclass `BaseDataset` and decorate it with `@register_dataset(mode, backend_type)`
4. **Improve the core**: Submit PRs to `training/`, `data/`, `logging/`, etc.

Ensure new code includes:
- Type hints throughout
- Pydantic configs for new parameters
- Unit tests in `tests/`
- Documentation in docstrings and README

## Current Scope & Roadmap

### ✅ Currently Supported

- **Single GPU / single node** training
- **Pretraining, SFT, and RL training** stages
- **HuggingFace and JSONL** data backends
- **BF16 mixed precision** training
- **Checkpoint saving/loading** with resumption
- **Interactive inference** via TUI
- **Benchmarking** (perplexity, speed)
- **Pluggable architectures, optimizers, schedulers, datasets**

### 🚀 Roadmap (Future)

- **Distributed training** (DDP, FSDP) for multi-GPU/multi-node scaling
- **Quantization** support (INT8, QLoRA)
- **Advanced evaluation** (BLEU, ROUGE, custom tasks)
- **Streaming inference** with KV cache
- **Speculative decoding** for faster generation
- **Integration with popular model hubs** (Hugging Face Hub upload/download)

---

## Getting Help

- **Questions?** Open an issue on GitHub
- **Want to contribute?** See `CONTRIBUTING.md` (coming soon)
- **Found a bug?** Report it with a minimal reproduction script

## License

MIT