|
|
.. _training_api:

Training API (experimental)
===========================

Kornia provides a Training API with the specific purpose of training and fine-tuning the
supported deep learning algorithms within the library.
|
|
|
|
|
.. sidebar:: **Deep Alchemy**

   .. image:: https://github.com/kornia/data/raw/main/pixie_alchemist.png
      :width: 100%
      :align: center

   A seemingly magical process of transforming, creating, and combining data into usable deep learning models.
|
|
|
|
|
|
|
|
.. important::

   To use our Training API you must install the extra dependencies: ``pip install kornia[x]``
|
|
|
|
|
Why a Training API?
-------------------

Kornia includes deep learning models that eventually need to be updated through fine-tuning.
Our aim is to have an API flexible enough to be used across our vision models, enabling us to
override methods or dynamically pass callbacks to ease debugging and experimentation.
|
|
|
|
|
.. admonition:: **Disclaimer**
   :class: seealso

   We do not aim to be a general-purpose training library; instead, we allow Kornia users to
   experiment with the training of our models.
|
|
|
|
|
Design Principles
-----------------

- The `kornia` golden rule is to avoid heavy dependencies.
- Our models are simple enough that a lightweight training API can fulfill our needs.
- Give flexible and full control over the training/validation loops to customize the pipeline.
- Decouple the model definition from the training pipeline.
- Use plain PyTorch abstractions and recipes to write your own routines.
- Leverage the `accelerate <https://github.com/huggingface/accelerate/>`_ library to scale training.
|
|
|
|
|
Trainer Usage
-------------

The entry point to start training with Kornia is the :py:class:`~kornia.x.Trainer` class.

The main API is a self-contained module that relies heavily on `accelerate <https://github.com/huggingface/accelerate/>`_
to easily scale training over multiple GPUs, TPUs and fp16 `(see more) <https://github.com/huggingface/accelerate#supported-integrations/>`_,
following standard PyTorch recipes. Our API expects to consume standard PyTorch components, and you decide whether
`kornia` makes the magic for you.
|
|
|
|
|
1. Define your model

.. code:: python

   import torch.nn as nn
   import kornia

   model = nn.Sequential(
       kornia.contrib.VisionTransformer(image_size=32, patch_size=16),
       kornia.contrib.ClassificationHead(num_classes=10),
   )
|
|
|
|
|
2. Create the datasets and dataloaders for training and validation

.. code:: python

   import torch
   import torchvision
   import torchvision.transforms as T

   # datasets
   train_dataset = torchvision.datasets.CIFAR10(
       root=config.data_path, train=True, download=True, transform=T.ToTensor())

   valid_dataset = torchvision.datasets.CIFAR10(
       root=config.data_path, train=False, download=True, transform=T.ToTensor())

   # dataloaders
   train_dataloader = torch.utils.data.DataLoader(
       train_dataset, batch_size=config.batch_size, shuffle=True)

   valid_dataloader = torch.utils.data.DataLoader(
       valid_dataset, batch_size=config.batch_size, shuffle=True)
|
|
|
|
|
3. Create your loss function, optimizer and scheduler

.. code:: python

   # loss function
   criterion = nn.CrossEntropyLoss()

   # optimizer and scheduler
   optimizer = torch.optim.AdamW(model.parameters(), lr=config.lr)
   scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
       optimizer, config.num_epochs * len(train_dataloader))
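Note that the scheduler receives ``config.num_epochs * len(train_dataloader)`` as its period because it is stepped once per batch rather than once per epoch. For intuition, cosine annealing follows this closed form (a framework-free sketch for illustration, not the library code):

```python
import math

def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    # anneal the learning rate from lr_max down to lr_min over total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

# the rate starts at lr_max, halves mid-schedule, and ends near lr_min
lrs = [cosine_annealing_lr(s, total_steps=100, lr_max=1e-3) for s in (0, 50, 100)]
```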
|
|
|
|
|
4. Create the Trainer and execute the training pipeline

.. code:: python

   trainer = kornia.x.Trainer(
       model, train_dataloader, valid_dataloader, criterion, optimizer, scheduler, config,
   )
   trainer.fit()  # execute your training!
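The snippets above read their hyper-parameters from a ``config`` object whose definition is not shown. A minimal sketch of what it could look like, using a dataclass (the field names are inferred from the attributes used above; the real configuration may differ):

```python
from dataclasses import dataclass

@dataclass
class Configuration:
    # hypothetical fields, matching the attributes accessed in the snippets above
    data_path: str = "./data"
    batch_size: int = 32
    lr: float = 1e-4
    num_epochs: int = 10

config = Configuration(batch_size=64)  # override any field at construction time
```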
|
|
|
|
|
|
|
|
Customize with callbacks
------------------------

At this point you might think - *Is this API generic enough?*

Of course not! What is next? Let's have fun and **customize**.

The :py:class:`~kornia.x.Trainer` internals are clearly defined in such a way that, for example, you can
subclass it and override just the :py:func:`~kornia.x.Trainer.evaluate` method, adjusting it
to your needs. We provide predefined classes for generic problems such as
:py:class:`~kornia.x.ImageClassifierTrainer` and :py:class:`~kornia.x.SemanticSegmentationTrainer`.
|
|
|
|
|
.. note::

   More trainers will come as soon as we include more models.

You can easily customize by creating your own class, or even through ``callbacks`` as follows:
|
|
|
|
|
.. code:: python

   @torch.no_grad()
   def my_evaluate(self) -> dict:
       self.model.eval()
       stats = {}
       for sample_id, sample in enumerate(self.valid_dataloader):
           source, target = sample  # this might change with the new pytorch dataset structure

           # perform the preprocess and augmentations in batch
           img = self.preprocess(source)
           # Forward
           out = self.model(img)
           # Loss computation
           val_loss = self.criterion(out, target)

           # measure accuracy and record loss
           acc1, acc5 = accuracy(out.detach(), target, topk=(1, 5))
           stats[sample_id] = {"losses": val_loss, "top1": acc1, "top5": acc5}
       return stats

   # create the trainer and pass the evaluate method as follows
   trainer = K.x.Trainer(..., callbacks={"evaluate": my_evaluate})
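The ``accuracy`` helper above computes top-k accuracy from the model scores. A framework-free sketch of the underlying computation (illustrative only; the name and return convention here are assumptions, not Kornia's implementation):

```python
def topk_accuracy(scores, target, topk=(1,)):
    # scores: one row of per-class scores per sample; target: true class indices
    results = []
    for k in topk:
        correct = 0
        for row, true_class in zip(scores, target):
            # indices of the k highest-scoring classes, best first
            ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
            correct += true_class in ranked
        results.append(100.0 * correct / len(target))
    return results

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
target = [2, 1]  # neither true class is ranked first, but both are in the top-2
acc1, acc2 = topk_accuracy(scores, target, topk=(1, 2))
```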
|
|
|
|
|
**Still not convinced?**

You can even override the whole :py:func:`~kornia.x.ImageClassifierTrainer.fit`
method and implement your own custom loops; the trainer will use the Accelerator to set up
all the data on the device for you, and the rest of the story is just PyTorch :)
|
|
|
|
|
.. code:: python

   def my_fit(self):  # this is a custom pytorch training loop
       self.model.train()
       for epoch in range(self.num_epochs):
           for source, targets in self.train_dataloader:
               self.optimizer.zero_grad()

               output = self.model(source)
               loss = self.criterion(output, targets)

               self.backward(loss)
               self.optimizer.step()

           stats = self.evaluate()  # do whatever you want with validation

   # create the trainer and pass the fit method as follows
   trainer = K.x.Trainer(..., callbacks={"fit": my_fit})
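One simple way such a ``callbacks`` dict can work is by rebinding the supplied functions as methods on the trainer instance, shadowing the defaults. A toy illustration of that mechanism (a sketch of the idea, not Kornia's actual implementation):

```python
import types

class MiniTrainer:
    """Toy stand-in for a trainer that accepts a ``callbacks`` dict."""

    def __init__(self, callbacks=None):
        for name, fn in (callbacks or {}).items():
            # bind the plain function as a method on this instance,
            # shadowing the default method with the same name
            setattr(self, name, types.MethodType(fn, self))

    def evaluate(self):
        return "default evaluation"

def my_evaluate(self):
    return "custom evaluation"

trainer = MiniTrainer(callbacks={"evaluate": my_evaluate})
```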
|
|
|
|
|
.. note::

   The following hooks are available to override: ``preprocess``, ``augmentations``, ``evaluate``, ``fit``,
   ``on_checkpoint``, ``on_epoch_end``, ``on_before_model``.
|
|
|
|
|
|
|
|
Preprocess and augmentations
----------------------------

Taking a pre-trained model from an external source and assuming that you can fine-tune it on your
data by changing just a few things in your model is usually a bad assumption in practice.

Fine-tuning a model needs a lot of tricks, which usually means designing a good augmentation
or preprocessing strategy before you execute the training pipeline. For this reason, we make it
possible to pass pointers to the ``preprocess`` and ``augmentations`` functions through callbacks,
to ease debugging and experimentation.
|
|
|
|
|
.. code:: python

   def preprocess(x):
       return x.float() / 255.

   augmentations = nn.Sequential(
       K.augmentation.RandomHorizontalFlip(p=0.75),
       K.augmentation.RandomVerticalFlip(p=0.75),
       K.augmentation.RandomAffine(degrees=10.),
       K.augmentation.PatchSequential(
           K.augmentation.ColorJitter(0.1, 0.1, 0.1, 0.1, p=0.8),
           grid_size=(2, 2),  # cifar-10 is 32x32 and vit is patch 16
           patchwise_apply=False,
       ),
   )

   # create the trainer and pass the augmentations or preprocess
   trainer = K.x.ImageClassifierTrainer(...,
       callbacks={"preprocess": preprocess, "augmentations": augmentations})
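The ``preprocess`` hook above maps raw 8-bit image values into the ``[0, 1]`` range before the augmentations run. Its effect, sketched without torch on plain numbers:

```python
def preprocess(pixels):
    # framework-free analogue of ``x.float() / 255.``
    return [p / 255.0 for p in pixels]

normalized = preprocess([0, 51, 255])  # byte values become floats in [0, 1]
```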
|
|
|
|
|
Callbacks utilities
-------------------

We also provide utilities to save checkpoints of the model or early stop the training. You can use
them by passing instances of :py:class:`~kornia.x.ModelCheckpoint` and
:py:class:`~kornia.x.EarlyStopping` as ``callbacks``.
|
|
|
|
|
.. code:: python

   model_checkpoint = ModelCheckpoint(
       filepath="./outputs", monitor="top5",
   )

   early_stop = EarlyStopping(monitor="top5")

   trainer = K.x.ImageClassifierTrainer(...,
       callbacks={"on_checkpoint": model_checkpoint, "on_epoch_end": early_stop})
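For intuition, early stopping monitors a metric and halts training once it stops improving for a number of epochs. A minimal framework-free sketch of the idea (illustrative only; see :py:class:`~kornia.x.EarlyStopping` for the real class and its parameters):

```python
class SimpleEarlyStopping:
    """Stop when the monitored value has not improved for `patience` checks."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.counter = 0

    def __call__(self, value):
        if value > self.best:
            self.best = value
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience  # True means: stop training

stopper = SimpleEarlyStopping(patience=2)
decisions = [stopper(v) for v in (0.5, 0.6, 0.6, 0.6)]  # metric stalls after epoch 2
```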
|
|
|
|
|
Hyperparameter sweeps
---------------------

Use `hydra <https://hydra.cc>`_ to implement an easy search strategy for your hyper-parameters as follows:

.. note::

   Check out the toy example `here <https://github.com/kornia/kornia/tree/master/examples/train/image_classifier>`__.

.. code:: bash

   python ./train/image_classifier/main.py num_epochs=50 batch_size=32

.. code:: bash

   python ./train/image_classifier/main.py --multirun lr=1e-3,1e-4
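Overrides like the ones above assume the entry point declares its defaults in a hydra config file. A hypothetical sketch of such a config (field names mirror the CLI overrides above; the actual example's file may differ):

```yaml
# config.yaml (hypothetical)
data_path: ./data
num_epochs: 50
batch_size: 32
lr: 1e-3
```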
|
|
|
|
|
Distributed Training |
|
|
-------------------- |
|
|
|
|
|
Kornia :py:class:`~kornia.x.Trainer` heavily relies on `accelerate <https://github.com/huggingface/accelerate/>`_ to |
|
|
decouple the process of running your training scripts in a distributed environment. |
|
|
|
|
|
.. note:: |
|
|
|
|
|
We haven't tested yet all the possibilities for distributed training. |
|
|
Expect some adventures or `join us <https://join.slack.com/t/kornia/shared_invite/zt-csobk21g-CnydWe5fmvkcktIeRFGCEQ>`_ and help to iterate :) |
|
|
|
|
|
The below recipes are taken from the `accelerate` library in `here <https://github.com/huggingface/accelerate/tree/main/examples#simple-vision-example>`__: |
|
|
|
|
|
- single CPU:

  * from a server without GPU

    .. code:: bash

       python ./train/image_classifier/main.py

  * from any server by passing `cpu=True` to the `Accelerator`

    .. code:: bash

       python ./train/image_classifier/main.py --data_path path_to_data --cpu

  * from any server with the Accelerate launcher

    .. code:: bash

       accelerate launch --cpu ./train/image_classifier/main.py --data_path path_to_data
|
|
|
|
|
- single GPU:

  .. code:: bash

     python ./train/image_classifier/main.py  # from a server with a GPU
|
|
|
|
|
- with fp16 (mixed-precision):

  * from any server by passing `fp16=True` to the `Accelerator`

    .. code:: bash

       python ./train/image_classifier/main.py --data_path path_to_data --fp16

  * from any server with the Accelerate launcher

    .. code:: bash

       accelerate launch --fp16 ./train/image_classifier/main.py --data_path path_to_data
|
|
|
|
|
- multi GPUs (using PyTorch distributed mode):

  * with the Accelerate config and launcher

    .. code:: bash

       accelerate config  # This will create a config file on your server
       accelerate launch ./train/image_classifier/main.py --data_path path_to_data  # This will run the script on your server

  * with the traditional PyTorch launcher

    .. code:: bash

       python -m torch.distributed.launch --nproc_per_node 2 --use_env ./train/image_classifier/main.py --data_path path_to_data
|
|
|
|
|
- multi GPUs, multi node (several machines, using PyTorch distributed mode):

  * with the Accelerate config and launcher, on each machine:

    .. code:: bash

       accelerate config  # This will create a config file on each server
       accelerate launch ./train/image_classifier/main.py --data_path path_to_data  # This will run the script on each server

  * with the PyTorch launcher only:

    .. code:: bash

       python -m torch.distributed.launch --nproc_per_node 2 \
           --use_env \
           --node_rank 0 \
           --master_addr master_node_ip_address \
           ./train/image_classifier/main.py --data_path path_to_data  # On the first server

       python -m torch.distributed.launch --nproc_per_node 2 \
           --use_env \
           --node_rank 1 \
           --master_addr master_node_ip_address \
           ./train/image_classifier/main.py --data_path path_to_data  # On the second server
|
|
|
|
|
- (multi) TPUs:

  * with the Accelerate config and launcher

    .. code:: bash

       accelerate config  # This will create a config file on your TPU server
       accelerate launch ./train/image_classifier/main.py --data_path path_to_data  # This will run the script on each server

  * in PyTorch: add an `xmp.spawn` line in your script as you usually do.
|
|
|