| | --- |
| | license: apache-2.0 |
| | pipeline_tag: image-feature-extraction |
| | --- |
| | |
| | # MaxSup: Overcoming Representation Collapse in Label Smoothing |
| |
|
| | **Max Suppression (MaxSup)** is a novel regularization technique that overcomes the shortcomings of traditional **Label Smoothing (LS)**. While LS prevents overconfidence by softening one-hot labels, it inadvertently collapses intra-class feature diversity and can boost overconfident errors. In contrast, **MaxSup** applies a uniform smoothing penalty to the model’s top prediction—regardless of correctness—preserving richer per-sample information and improving both classification performance and downstream transfer. |
| |
|
| | --- |
| |
|
| | ## Table of Contents |
| |
|
| | 1. [Overview](#overview) |
| | 2. [Methodology: MaxSup vs. Label Smoothing](#methodology-maxsup-vs-label-smoothing) |
| | 3. [Enhanced Feature Representation](#enhanced-feature-representation) |
| | - [Qualitative Evaluation](#qualitative-evaluation) |
| | - [Quantitative Evaluation](#quantitative-evaluation) |
| | 4. [Training Vision Transformers with MaxSup](#training-vision-transformers-with-maxsup) |
| | - [Accelerated Data Loading via Caching (Optional)](#accelerated-data-loading-via-caching-optional) |
| | - [Preparing Data and Annotations for Caching](#preparing-data-and-annotations-for-caching) |
| | 5. [Pretrained Weights](#pretrained-weights) |
| | 6. [Training ConvNets with MaxSup](#training-convnets-with-maxsup) |
| | 7. [Logit Characteristic Visualization](#logit-characteristic-visualization) |
| | 8. [Citation](#citation) |
| | 9. [References](#references) |
| |
|
| | --- |
| |
|
| | ## Overview |
| |
|
| | Traditional Label Smoothing (LS) replaces one-hot labels with a smoothed version to reduce overconfidence. However, LS can over-tighten feature clusters within each class and may reinforce errors by making mispredictions overconfident. **MaxSup** tackles these issues by applying a smoothing penalty to the model's **top-1 logit** output regardless of whether the prediction is correct, thus preserving intra-class diversity and enhancing inter-class separation. The result is improved performance on both classification tasks and downstream applications such as linear transfer and image segmentation. |
| |
|
| | --- |
| |
|
| | ## Methodology: MaxSup vs. Label Smoothing |
| |
|
| | Label Smoothing softens the target distribution by blending the one-hot vector with a uniform distribution. Although effective at reducing overconfidence, LS inadvertently introduces two effects: |
| | - A **regularization term** that limits the sharpness of predictions. |
| | - An **error-enhancement term** that can cause overconfident wrong predictions. |
| |
|
| | **MaxSup** addresses this by uniformly penalizing the highest logit output, whether it corresponds to the true class or not. This approach enforces a consistent regularization effect across all samples. In formula form: |
| |
|
| | ```math |
| | L_{\text{MaxSup}} = \alpha \left( z_{\max} - \frac{1}{K}\sum_{k=1}^{K} z_k \right), |
| | ``` |
| |
|
| | where \( z_{\max} \) is the highest logit among the \( K \) classes. This mechanism prevents the prediction distribution from becoming too peaky while preserving informative signals from non-target classes. |
| | |
| | --- |
| | |
| | ## Enhanced Feature Representation |
| | |
| | ### Qualitative Evaluation |
| | |
| | MaxSup-trained models display richer intra-class feature diversity compared to models trained with traditional LS. Feature embedding visualizations show that while LS forces features into tight clusters, MaxSup preserves finer-grained differences among samples. Grad-CAM analyses also demonstrate that MaxSup-trained models focus more precisely on relevant class-discriminative regions. |
| | |
| |  |
| | **Figure 1:** Feature representations. MaxSup maintains greater intra-class diversity and clear inter-class boundaries. |
| | |
| |  |
| | **Figure 2:** Grad-CAM visualizations. The MaxSup model (row 2) accurately highlights target objects, whereas the LS model (row 3) and Baseline (row 4) show more diffuse activations. |
| | |
| | ### Quantitative Evaluation |
| | |
| | We evaluated feature representations on ResNet-50 trained on ImageNet-1K. Intra-class variation (reflecting the diversity within classes) and inter-class separability (indicating class distinctiveness) were measured. Additionally, a linear transfer learning task on CIFAR-10 was performed. |
| | |
| | **Table 1: Feature Representation Metrics (ResNet-50 on ImageNet-1K)** |
| | |
| | | Method | Intra-class Var. (Train) | Intra-class Var. (Val) | Inter-class Sep. (Train) | Inter-class Sep. (Val) | |
| | |---------------------------|--------------------------|------------------------|--------------------------|------------------------| |
| | | **Baseline** | 0.3114 | 0.3313 | 0.4025 | 0.4451 | |
| | | **Label Smoothing** | 0.2632 | 0.2543 | 0.4690 | 0.4611 | |
| | | **Online LS** | 0.2707 | 0.2820 | 0.5943 | 0.5708 | |
| | | **Zipf’s LS** | 0.2611 | 0.2932 | 0.5522 | 0.4790 | |
| | | **MaxSup (ours)** | **0.2926** | **0.2998** | 0.5188 | 0.4972 | |
| | |
| | *Higher intra-class variation indicates more preserved sample-specific details, while higher inter-class separability suggests better class discrimination.* |
| | |
| | **Table 2: Linear Transfer Accuracy on CIFAR-10** |
| | |
| | | Pretraining Method | Accuracy (%) | |
| | |----------------------|--------------| |
| | | **Baseline** | 81.43 | |
| | | **Label Smoothing** | 74.58 | |
| | | **MaxSup** | **81.02** | |
| | |
| | Label Smoothing degrades transfer accuracy due to its over-smoothing effect, whereas MaxSup nearly matches the baseline performance while still offering improved calibration. |
| | |
| | --- |
| | |
| | ## Training Vision Transformers with MaxSup |
| | |
| | We integrated MaxSup into the training pipeline for Vision Transformers using the [DeiT](https://github.com/facebookresearch/deit) framework. |
| | |
| | ### To Train a ViT with MaxSup: |
| | |
| | ```bash |
| | cd Deit |
| | python train_with_MaxSup.sh |
| | ``` |
| | |
| | This script trains a DeiT-Small model on ImageNet-1K with MaxSup regularization. |
| | |
| | ### Accelerated Data Loading via Caching (Optional) |
| | |
| | For improved data loading efficiency on systems with slow I/O, a caching mechanism is provided. This feature compresses the ImageNet dataset into ZIP files and loads them into memory. Enable caching by adding the `--cache` flag to the training script. |
| | |
| | ### Preparing Data and Annotations for Caching |
| | |
| | 1. **Create ZIP Archives:** |
| | In your ImageNet data directory, run: |
| | ```bash |
| | cd data/ImageNet |
| | zip -r train.zip train |
| | zip -r val.zip val |
| | ``` |
| | |
| | 2. **Mapping Files:** |
| | Download `train_map.txt` and `val_map.txt` from our release assets and place them in the `data/ImageNet` directory. The directory should appear as follows: |
| | ``` |
| | data/ImageNet/ |
| | ├── train_map.txt # Relative paths and labels for training images |
| | ├── val_map.txt # Relative paths and labels for validation images |
| | ├── train.zip # Compressed training images |
| | └── val.zip # Compressed validation images |
| | ``` |
| | - **train_map.txt:** Each line should be in the format `<class_folder>/<image_filename>\t<label>`. |
| | - **val_map.txt:** Each line should be in the format `<image_filename>\t<label>`. |
| | |
| | --- |
| | |
| | ## Pretrained Weights |
| | |
| | - **ConvNet (ResNet-50):** Pretrained weights can be downloaded from this page. |
| | |
| | These checkpoints can be used for direct evaluation or fine-tuning on downstream tasks. |
| | |
| | --- |
| | |
| | ## Training ConvNets with MaxSup |
| | |
| | The `Conv/` directory provides scripts for training convolutional networks with MaxSup: |
| | |
| | - **Conv/ffcv:** Contains scripts to reproduce ImageNet results using FFCV for efficient data loading. See `Conv/ffcv/README.md` for details. |
| | - **Conv/common_resnet:** Contains additional experiments with ResNet architectures. Refer to `Conv/common_resnet/README.md` for further instructions. |
| |
|
| | --- |
| |
|
| | ## Logit Characteristic Visualization |
| |
|
| | The `viz/` directory contains a toolkit to analyze the distribution of logits produced by models trained with LS versus MaxSup. |
| |
|
| | ### Step 1: Extract Logits |
| |
|
| | Run the following command to extract logits from your trained model: |
| |
|
| | ```bash |
| | python viz/logits.py \ |
| | --checkpoint /path/to/model_checkpoint.pth \ |
| | --output /path/to/save/logits_labels.pt |
| | ``` |
| |
|
| | - `--checkpoint`: Path to your model checkpoint. |
| | - `--output`: Destination file for the extracted logits and labels. |
| |
|
| | ### Step 2: Analyze Logits |
| |
|
| | After extraction, run: |
| |
|
| | ```bash |
| | python viz/analysis.py --input /path/to/save/logits_labels.pt --output /path/to/analysis_results/ |
| | ``` |
| |
|
| | This script generates: |
| | - A histogram of near-zero logit proportions. |
| | - A scatter plot comparing top-1 probabilities with near-zero proportions. |
| | - Saved visualizations for side-by-side comparisons. |
| |
|
| |  |
| | **Figure 3:** Logit distribution comparing LS and MaxSup. |
| |
|
| | For more details, see our paper on [arXiv:2502.15798](https://arxiv.org/abs/2502.15798). |
| |
|
| | ## References |
| | - **DeiT (Vision Transformer):** |
| | Touvron et al., *Training Data-Efficient Image Transformers & Distillation through Attention*, ICML 2021. [GitHub](https://github.com/facebookresearch/deit). |
| | - **Grad-CAM:** |
| | Selvaraju et al., *Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization*, ICCV 2017. |
| | - **Online Label Smoothing:** See paper for details. |
| | - **Zipf’s Label Smoothing:** See paper for details. |
| |
|
| | --- |
| |
|
| | This repository provides the official implementation of MaxSup. Contributions and discussions are welcome. For any questions or issues, please open an issue on GitHub or contact the authors directly. |
| |
|
| | --- |
| |
|