---
license: apache-2.0
---

# MaxSup: Overcoming Representation Collapse in Label Smoothing

**Max Suppression (MaxSup)** is a novel regularization technique that overcomes the shortcomings of traditional **Label Smoothing (LS)**. While LS prevents overconfidence by softening one-hot labels, it inadvertently collapses intra-class feature diversity and can boost overconfident errors. In contrast, **MaxSup** applies a uniform smoothing penalty to the model's top prediction, regardless of correctness, preserving richer per-sample information and improving both classification performance and downstream transfer.

---
## Table of Contents

1. [Overview](#overview)
2. [Methodology: MaxSup vs. Label Smoothing](#methodology-maxsup-vs-label-smoothing)
3. [Enhanced Feature Representation](#enhanced-feature-representation)
   - [Qualitative Evaluation](#qualitative-evaluation)
   - [Quantitative Evaluation](#quantitative-evaluation)
4. [Training Vision Transformers with MaxSup](#training-vision-transformers-with-maxsup)
   - [Accelerated Data Loading via Caching (Optional)](#accelerated-data-loading-via-caching-optional)
   - [Preparing Data and Annotations for Caching](#preparing-data-and-annotations-for-caching)
5. [Pretrained Weights](#pretrained-weights)
6. [Training ConvNets with MaxSup](#training-convnets-with-maxsup)
7. [Logit Characteristic Visualization](#logit-characteristic-visualization)
8. [References](#references)

---
## Overview

Traditional Label Smoothing (LS) replaces one-hot labels with a smoothed version to reduce overconfidence. However, LS can over-tighten feature clusters within each class and may reinforce errors by making mispredictions overconfident. **MaxSup** tackles these issues by applying a smoothing penalty to the model's **top-1 logit** regardless of whether the prediction is correct, thus preserving intra-class diversity and enhancing inter-class separation. The result is improved performance on both classification tasks and downstream applications such as linear transfer and image segmentation.

---
## Methodology: MaxSup vs. Label Smoothing

Label Smoothing softens the target distribution by blending the one-hot vector with a uniform distribution. Although effective at reducing overconfidence, LS inadvertently introduces two effects:

- A **regularization term** that limits the sharpness of predictions.
- An **error-enhancement term** that can make incorrect predictions overconfident.

**MaxSup** addresses this by uniformly penalizing the highest logit, whether or not it corresponds to the true class. This enforces a consistent regularization effect across all samples. In formula form:

```math
L_{\text{MaxSup}} = \alpha \left( z_{\max} - \frac{1}{K}\sum_{k=1}^{K} z_k \right),
```

where $z_{\max}$ is the highest logit among the $K$ classes and $\alpha$ controls the regularization strength. This mechanism prevents the prediction distribution from becoming too peaked while preserving informative signals from non-target classes.

---
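In a training loop, this penalty is simply added to the cross-entropy loss. Below is a minimal PyTorch sketch; the function name `maxsup_loss` and the value `alpha=0.1` are illustrative, not the repository's tuned settings:

```python
import torch
import torch.nn.functional as F

def maxsup_loss(logits, targets, alpha=0.1):
    """Cross-entropy plus the MaxSup penalty alpha * (z_max - mean_k z_k).

    Illustrative sketch: `alpha` is a hyperparameter, not the paper's value.
    """
    ce = F.cross_entropy(logits, targets)
    z_max = logits.max(dim=1).values   # top-1 logit per sample, correct or not
    z_mean = logits.mean(dim=1)        # mean over the K class logits
    penalty = (z_max - z_mean).mean()  # averaged over the batch
    return ce + alpha * penalty
```

Because the penalty targets the top-1 logit rather than the ground-truth logit, it applies even when the prediction is wrong, which is the key difference from Label Smoothing.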
## Enhanced Feature Representation

### Qualitative Evaluation

MaxSup-trained models display richer intra-class feature diversity than models trained with traditional LS. Feature embedding visualizations show that while LS forces features into tight clusters, MaxSup preserves finer-grained differences among samples. Grad-CAM analyses also demonstrate that MaxSup-trained models focus more precisely on class-discriminative regions.

![Feature Visualization](Figure1.png)

**Figure 1:** Feature representations. MaxSup maintains greater intra-class diversity and clearer inter-class boundaries.

![Grad-CAM Visualization](Figure2.png)

**Figure 2:** Grad-CAM visualizations. The MaxSup model (row 2) accurately highlights target objects, whereas the LS model (row 3) and the baseline (row 4) show more diffuse activations.

### Quantitative Evaluation

We evaluated the feature representations of a ResNet-50 trained on ImageNet-1K, measuring intra-class variation (the diversity within each class) and inter-class separability (the distinctiveness between classes). We also ran a linear transfer learning task on CIFAR-10.

**Table 1: Feature Representation Metrics (ResNet-50 on ImageNet-1K)**

| Method | Intra-class Var. (Train) | Intra-class Var. (Val) | Inter-class Sep. (Train) | Inter-class Sep. (Val) |
|---------------------------|--------------------------|------------------------|--------------------------|------------------------|
| **Baseline** | 0.3114 | 0.3313 | 0.4025 | 0.4451 |
| **Label Smoothing** | 0.2632 | 0.2543 | 0.4690 | 0.4611 |
| **Online LS** | 0.2707 | 0.2820 | 0.5943 | 0.5708 |
| **Zipf’s LS** | 0.2611 | 0.2932 | 0.5522 | 0.4790 |
| **MaxSup (ours)** | **0.2926** | **0.2998** | 0.5188 | 0.4972 |

*Higher intra-class variation indicates more preserved sample-specific detail; higher inter-class separability indicates better class discrimination.*
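For readers who want to compute comparable statistics on their own features, here is one plausible sketch. The definitions below (mean distance to the class centroid for intra-class variation, mean pairwise centroid distance for separability, both on L2-normalized features) are assumptions for illustration; the paper's exact metrics may differ.

```python
import numpy as np

def intra_class_variation(feats, labels):
    """Mean distance of L2-normalized features to their class centroid.

    One plausible definition, not necessarily the paper's exact metric.
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    dists = []
    for c in np.unique(labels):
        cls = feats[labels == c]
        centroid = cls.mean(axis=0)
        dists.append(np.linalg.norm(cls - centroid, axis=1).mean())
    return float(np.mean(dists))

def inter_class_separability(feats, labels):
    """Mean pairwise distance between class centroids (same caveat)."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    cents = np.stack([feats[labels == c].mean(axis=0) for c in np.unique(labels)])
    n = len(cents)
    pairs = [np.linalg.norm(cents[i] - cents[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(pairs))
```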
**Table 2: Linear Transfer Accuracy on CIFAR-10**

| Pretraining Method | Accuracy (%) |
|----------------------|--------------|
| **Baseline** | 81.43 |
| **Label Smoothing** | 74.58 |
| **MaxSup** | **81.02** |

Label Smoothing degrades transfer accuracy due to its over-smoothing effect, whereas MaxSup nearly matches the baseline while still offering improved calibration.

---
## Training Vision Transformers with MaxSup

We integrated MaxSup into the training pipeline for Vision Transformers using the [DeiT](https://github.com/facebookresearch/deit) framework.

### To Train a ViT with MaxSup

```bash
cd Deit
bash train_with_MaxSup.sh
```

This script trains a DeiT-Small model on ImageNet-1K with MaxSup regularization.

### Accelerated Data Loading via Caching (Optional)

For improved data-loading efficiency on systems with slow I/O, a caching mechanism is provided. It compresses the ImageNet dataset into ZIP files and loads them into memory. Enable caching by adding the `--cache` flag to the training script.

### Preparing Data and Annotations for Caching

1. **Create ZIP archives.**
   In your ImageNet data directory, run:
   ```bash
   cd data/ImageNet
   zip -r train.zip train
   zip -r val.zip val
   ```

2. **Mapping files.**
   Download `train_map.txt` and `val_map.txt` from our release assets and place them in the `data/ImageNet` directory, which should then look like:
   ```
   data/ImageNet/
   ├── train_map.txt   # Relative paths and labels for training images
   ├── val_map.txt     # Relative paths and labels for validation images
   ├── train.zip       # Compressed training images
   └── val.zip         # Compressed validation images
   ```
   - **train_map.txt:** each line has the format `<class_folder>/<image_filename>\t<label>`.
   - **val_map.txt:** each line has the format `<image_filename>\t<label>`.

---
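The repository's cache loader is the authoritative implementation; as a rough illustration of how a ZIP-backed dataset driven by these map files can work, here is a hypothetical sketch (the class name `ZipImageNet` and the `train/` prefix convention are assumptions based on the archive layout above):

```python
import io
import zipfile
from PIL import Image
from torch.utils.data import Dataset

class ZipImageNet(Dataset):
    """Hypothetical ZIP-backed dataset; the repository's actual loader may differ."""

    def __init__(self, zip_path, map_path, transform=None):
        self.zf = zipfile.ZipFile(zip_path)  # kept open for repeated reads
        with open(map_path) as f:
            # Each line: "<class_folder>/<image_filename>\t<label>"
            self.samples = [line.strip().split("\t") for line in f if line.strip()]
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        rel_path, label = self.samples[idx]
        # `zip -r train.zip train` stores members under a top-level "train/" prefix.
        data = self.zf.read(f"train/{rel_path}")
        img = Image.open(io.BytesIO(data)).convert("RGB")
        if self.transform:
            img = self.transform(img)
        return img, int(label)
```

With multi-worker `DataLoader`s, each worker should open its own `ZipFile` handle (for example lazily on first access), since handles do not share safely across processes.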
## Pretrained Weights

- **ConvNet (ResNet-50):** pretrained weights can be downloaded from this page.

These checkpoints can be used for direct evaluation or for fine-tuning on downstream tasks.

---
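A typical way to restore such a checkpoint for evaluation is sketched below. The nesting of weights under a `"state_dict"` key is an assumption about the checkpoint layout, so adjust to match the released files:

```python
import torch

def load_checkpoint(model, ckpt_path):
    """Load a checkpoint into `model` for evaluation.

    Assumes the file is either a bare state dict or one nested under
    "state_dict"; the released checkpoints' exact layout may differ.
    """
    state = torch.load(ckpt_path, map_location="cpu")
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    model.load_state_dict(state)
    model.eval()
    return model
```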
## Training ConvNets with MaxSup

The `Conv/` directory provides scripts for training convolutional networks with MaxSup:

- **Conv/ffcv:** scripts to reproduce the ImageNet results using FFCV for efficient data loading. See `Conv/ffcv/README.md` for details.
- **Conv/common_resnet:** additional experiments with ResNet architectures. See `Conv/common_resnet/README.md` for further instructions.

---
## Logit Characteristic Visualization

The `viz/` directory contains a toolkit for analyzing the distribution of logits produced by models trained with LS versus MaxSup.

### Step 1: Extract Logits

Run the following command to extract logits from your trained model:

```bash
python viz/logits.py \
    --checkpoint /path/to/model_checkpoint.pth \
    --output /path/to/save/logits_labels.pt
```

- `--checkpoint`: path to your model checkpoint.
- `--output`: destination file for the extracted logits and labels.

### Step 2: Analyze Logits

After extraction, run:

```bash
python viz/analysis.py --input /path/to/save/logits_labels.pt --output /path/to/analysis_results/
```

This script generates:

- A histogram of near-zero logit proportions.
- A scatter plot comparing top-1 probabilities with near-zero proportions.
- Saved visualizations for side-by-side comparison.
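The two statistics plotted above can be sketched in a few lines of PyTorch. The near-zero threshold `eps=0.1` below is an arbitrary illustrative choice, not necessarily the one used by `viz/analysis.py`:

```python
import torch

def logit_stats(logits, eps=0.1):
    """Per-sample near-zero logit proportion and top-1 softmax probability.

    `eps` is an illustrative threshold; the toolkit may use a different one.
    """
    near_zero = (logits.abs() < eps).float().mean(dim=1)  # fraction of |z_k| < eps
    top1_prob = torch.softmax(logits, dim=1).max(dim=1).values
    return near_zero, top1_prob
```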
![Logit Distribution](logits_output/Fig4.png)

**Figure 3:** Logit distributions under LS and MaxSup.

---

## References

- **DeiT (Vision Transformer):** Touvron et al., *Training Data-Efficient Image Transformers & Distillation through Attention*, ICML 2021. [GitHub](https://github.com/facebookresearch/deit)
- **Grad-CAM:** Selvaraju et al., *Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization*, ICCV 2017.
- **Online Label Smoothing:** Zhang et al., *Delving Deep into Label Smoothing*, IEEE TIP 2021.
- **Zipf’s Label Smoothing:** Liang et al., *Efficient One Pass Self-Distillation with Zipf’s Label Smoothing*, ECCV 2022.

---

This repository provides the official implementation of MaxSup. Contributions and discussions are welcome. For questions or issues, please open an issue on GitHub or contact the authors directly.

---