Anime.MILI Ordinal
Ordinal Regression of Subjective Visual Characteristics via Feature Adaptation of Self-Supervised Vision Transformers
Release weights: anime_mili_ordinal_corn_vitl16.pth — full state_dict (DINOv3 ViT-L/16 + CORN head). Outputs a continuous subjective score in [0, 1]. A local DINOv3 backbone may be included under backbone/dinov3_vitl16/ (Meta DINOv3 License); the same architecture can be loaded from Hugging Face by model id.
Abstract
Estimating subjective visual metrics, such as the mili characteristic, requires models capable of mapping input images to a strictly ordered scale. Traditional multiclass classification ignores the internal hierarchy of quality levels. Standard continuous regression approaches suffer from vanishing gradients at scale boundaries and instability when processing unevenly distributed ordinal data. This technical release presents an architecture based on the DINOv3 vision transformer, adapted specifically for ordinal regression. The implementation utilizes a two-phase training method: initial freezing of the backbone to warm up a specialized multi-layer perceptron, followed by partial fine-tuning of the final transformer blocks. Employing a Conditional Ordinal Regression Network (CORN) loss function allows the model to explicitly enforce rank monotonicity. Evaluation on a dataset of 4,715 images demonstrates a Quadratic Weighted Kappa (QWK) of 0.74. The results confirm the efficacy of combining self-supervised features with ordinal-aware architectures for visual quality assessment.
Teaser: ordinal scale with validation examples
Two validation images per ordinal bin (columns 0.00 … 1.00).
Introduction
Human assessment of visual characteristics frequently relies on discrete quality levels that form an ordinal structure. The mili attribute represents a human-centric score measured on a scale from 0 to 1 with fixed 0.25 increments. Standard neural network architectures typically process such values either as independent classes, discarding the distance information between them, or as continuous regression targets. When targets lie on a bounded scale, Mean Squared Error (MSE) regression yields vanishing gradients once predictions saturate at the scale extremes, restricting the model's ability to correct severe errors.
The development of self-supervised methods, particularly the DINOv3 architecture, provides access to universal visual features that encode both high-level semantics and low-level structural patterns. Applying frozen DINOv3 features to highly specialized downstream tasks necessitates specific adaptation mechanisms. The current architecture transforms the base DINOv3 representations into a probabilistic representation of an ordinal scale. Replacing standard classification layers with a head that predicts the conditional probability of exceeding each scale threshold yields more accurate and more stable predictions.
Related Work
Vision transformers consistently demonstrate high efficiency in feature extraction tasks compared to convolutional neural networks. Self-supervised methods utilize feature distillation and regularization to construct robust global and local descriptors. Adapting these models for regression tasks traditionally relies on L1 and L2 loss functions, which assume symmetric error penalties.
Ordinal regression accounts for label ranking, a critical requirement for subjective evaluations. The CORN framework addresses this by decomposing the ordinal problem into a series of binary classifications for each rank threshold. The implementation detailed here extends the application of CORN logic by integrating it on top of the high-capacity DINOv3 transformer architecture. It employs partial weight unfreezing strategies to preserve the generalization capabilities of the self-supervised features while adapting to the target domain.
Method
The architecture utilizes a pre-trained DINOv3 ViT-L/16 backbone as the primary feature extractor. Input images are processed at a resolution of 224×224 pixels. To maintain the integrity of the self-supervised representations, the base network remains frozen during the initial training phase. The output of the base network is the [CLS] token: a vector whose dimension matches the backbone hidden size (1024 in the released checkpoint), aggregating global image information.
Pipeline diagram (CORN on frozen / partially unfrozen DINOv3)
The specialized head is a multi-layer perceptron optimized for ordinal estimation. The sequence consists of a LayerNorm operation, a Dropout layer with a 0.2 probability, a linear projection into a 256-dimensional hidden space, a GELU activation function, an additional Dropout layer at 0.1, and a final linear projection. Unlike standard DINOv3 classification usage, the final layer returns K−1 logits for K quality levels, specifically yielding 4 logits for the 5 target classes.
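The head described above can be sketched directly in PyTorch. This is a reconstruction from the layer list in the text, not an extract from the released package, so naming and minor details in the checkpoint may differ.

```python
# Sketch of the ordinal head: LayerNorm -> Dropout(0.2) -> Linear(1024, 256)
# -> GELU -> Dropout(0.1) -> Linear(256, K-1). The 1024-dim input matches the
# ViT-L/16 [CLS] embedding described in the text.
import torch
import torch.nn as nn

NUM_CLASSES = 5  # ordinal bins 0.00, 0.25, 0.50, 0.75, 1.00

head = nn.Sequential(
    nn.LayerNorm(1024),
    nn.Dropout(0.2),
    nn.Linear(1024, 256),
    nn.GELU(),
    nn.Dropout(0.1),
    nn.Linear(256, NUM_CLASSES - 1),  # K-1 = 4 threshold logits
)

cls_token = torch.randn(8, 1024)  # a batch of [CLS] embeddings
logits = head(cls_token)
print(logits.shape)  # torch.Size([8, 4])
```

Note that the head emits 4 logits rather than 5 class scores: each logit corresponds to one rank threshold of the CORN decomposition.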
Optimization is driven by the CORN loss function, calculated as the sum of binary cross-entropy terms for each ordinal threshold. For a target label belonging to the set of ranks 0 through 4, the model predicts the probability of the rank exceeding a threshold k. The expected class is computed as the sum of probabilities across all thresholds. This loss function forces the model to learn the cumulative distribution of quality levels, guaranteeing the mathematical consistency of the ordinal scale.
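The rank aggregation above can be written out in a few lines. This is a minimal sketch assuming the conditional CORN formulation (logit k models P(rank > k | rank > k−1)) and linear normalization of the expected rank to [0, 1]; the function names are illustrative, not the release's API.

```python
# Map K-1 CORN threshold logits to a continuous score in [0, 1].
# Unconditional exceedance probabilities are cumulative products of the
# per-threshold conditional probabilities; their sum is the expected rank.
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def expected_score(threshold_logits: list, num_classes: int = 5) -> float:
    prob_exceed = []
    running = 1.0
    for logit in threshold_logits:
        running *= sigmoid(logit)   # chain over conditional thresholds
        prob_exceed.append(running)
    expected_rank = sum(prob_exceed)          # E[rank] in [0, K-1]
    return expected_rank / (num_classes - 1)  # normalize to [0, 1]

# Confidently low-quality input: every threshold logit strongly negative.
print(round(expected_score([-6.0, -6.0, -6.0, -6.0]), 3))  # 0.001
```

Because the score is a sum of probabilities rather than an argmax, the output varies continuously between the five anchor levels, which is what allows the model to emit values like 0.6 for borderline images.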
Experiments
The dataset comprises 4,715 images, partitioned into a 3,771-image training split and a 944-image validation split. Labels are distributed unevenly across five classes: 1,204 samples for class 0.00, 513 for class 0.25, 949 for class 0.50, 1,394 for class 0.75, and 655 for class 1.00.
Train / validation distribution (real counts)
Bar heights are file counts per ordinal bin in the training and validation splits used for this release.
The training procedure spans two phases. In the first phase, spanning 5 epochs, only the head weights are updated with a learning rate of 3e-4. This calibrates the classifier to the fixed DINOv3 features. In the second phase, lasting 30 epochs, the final 6 transformer blocks are unfrozen. The learning rate is reduced to 2e-5 and regulated by a Cosine Annealing schedule. Optimization utilizes the AdamW algorithm with a weight decay coefficient of 1e-3.
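The phase-2 schedule can be illustrated with the standard cosine annealing formula. The floor eta_min = 0.0 and 0-indexed epochs are assumptions; the release specifies only the peak rate of 2e-5 and the 30-epoch horizon.

```python
# Cosine-annealed learning rate for phase 2 (assumed eta_min = 0.0).
import math

def cosine_annealed_lr(epoch: int, total_epochs: int = 30,
                       eta_max: float = 2e-5, eta_min: float = 0.0) -> float:
    """Learning rate at a given phase-2 epoch (0-indexed)."""
    progress = epoch / total_epochs
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))

print(cosine_annealed_lr(0))   # 2e-05 (start of phase 2)
print(cosine_annealed_lr(15))  # 1e-05 (halfway)
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=30` on an AdamW optimizer over the head plus the last six unfrozen blocks.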
To increase inference reliability, Test-Time Augmentation (TTA) is applied. Each validation image is processed in three variations: original, horizontally flipped, and center-cropped. The logits from the three passes are averaged before calculating the expected class. Model evaluation relies on the Quadratic Weighted Kappa metric, which penalizes predictions proportionally to the squared distance between the true and predicted ranks.
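The averaging step can be sketched as follows; the view names and logit values are illustrative, and the release's actual preprocessing of each view is as described in the paragraph above.

```python
# Average K-1 threshold logits across the three TTA views before rank
# aggregation (logits, not probabilities, are averaged).
def average_tta_logits(view_logits: dict) -> list:
    views = list(view_logits.values())
    n = len(views)
    return [sum(v[k] for v in views) / n for k in range(len(views[0]))]

logits = average_tta_logits({
    "original":    [2.0, 1.0, -0.5, -2.0],
    "hflip":       [1.8, 0.8, -0.7, -2.2],
    "center_crop": [2.2, 1.2, -0.3, -1.8],
})
print(logits)  # approximately [2.0, 1.0, -0.5, -2.0]
```

Averaging in logit space keeps a single sigmoid/aggregation pass at the end, so the expected-class computation is unchanged whether TTA is enabled or not.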
Results
The two-phase training strategy produces a consistent performance increase. During the first phase, utilizing a completely frozen backbone, the model reaches a QWK of 0.54. This establishes a strong baseline derived entirely from the raw DINOv3 features. Transitioning to the second phase with partial unfreezing of the transformer blocks facilitates further metric growth.
The validation loss steadily decreases, reaching a minimum of 0.38, while the QWK stabilizes at 0.74. The discrepancy between training and validation errors remains minimal throughout the training process. The combination of the CORN loss function and the TTA algorithm prevents the severe overfitting observed in early experimental iterations that utilized standard MSE and a fully unfrozen backbone.
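For reference, the QWK figures above follow the standard definition: agreement between true and predicted ranks, weighted by squared rank distance and normalized by chance agreement. The implementation below is an illustration of that definition, not code from the release.

```python
# Quadratic Weighted Kappa over k ordinal bins from paired rank labels.
def quadratic_weighted_kappa(y_true: list, y_pred: list, k: int = 5) -> float:
    n = len(y_true)
    observed = [[0.0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    # Chance-agreement matrix: outer product of the marginal histograms.
    hist_t = [sum(observed[i]) for i in range(k)]
    hist_p = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic distance weight
            num += w * observed[i][j]
            den += w * hist_t[i] * hist_p[j] / n
    return 1.0 - num / den

print(quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))  # 1.0
```

This matches `sklearn.metrics.cohen_kappa_score(..., weights="quadratic")`, which can be used for an independent check.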
Learning curves
Train/val CORN loss and validation QWK by epoch (dashed line: start of phase 2 — last six transformer blocks unfrozen). The GIF animates the QWK curve over epochs.
Discussion
The efficiency of the described architecture stems from the mathematical alignment between the CORN loss function and the QWK metric. Decomposing the task into predicting conditional probabilities for each threshold eliminates the vanishing gradient problem. The model receives a stable error signal even when predictions deviate significantly from the true value.
The success of incorporating DINOv3 lies in the high resilience of its self-supervised features to visual noise. The two-phase training strategy is critical for preserving these features. The initial freezing prevents the destruction of the complex transformer attention system by random gradients from the uninitialized head. The subsequent unfreezing of the final layers allows the model to adjust its perception of specific textural and geometric patterns relevant exclusively to the mili characteristic. The stabilization of the metric by the 30th epoch indicates complete assimilation of patterns from the available training volume.
Ordinal confusion matrix (validation, single-view inference)
Cell counts: validation labels vs. predicted bins (rounded). Weights file: anime_mili_ordinal_corn_vitl16.pth.
Self-supervised features vs. gradient saliency
Each row: input · PCA of DINOv3 patch tokens (after register tokens) · saliency from the CORN head path.
Limitations
- The computational complexity of the ViT-L/16 architecture requires significant hardware resources for inference, complicating deployment on memory-constrained devices.
- The mathematical formulation for calculating the expected class implicitly assumes equidistant intervals between classes, potentially failing to fully reflect the non-linear nature of human quality perception.
- The dataset imbalance, specifically the prevalence of the 0.00 and 0.75 classes over the 0.25 class, introduces a risk of prediction bias toward majority classes despite the use of robust loss functions.
- The performance improvement achieved through Test-Time Augmentation triples the inference latency, severely limiting application in real-time systems.
Conclusion
The described architecture provides a framework for ordinal regression of subjective visual evaluations using DINOv3 features. The methodology connects self-supervised visual representations with domain-specific ordinal classification via a CORN head and phased fine-tuning. The reported metrics support ordinal consistency as a training objective for human-centric quality scores.
Repository layout
| Path | Description |
|---|---|
| `mili_score_inference/` | Package: architecture, preprocessing, `load_model`, `predict_pil` |
| `weights/anime_mili_ordinal_corn_vitl16.pth` | Fine-tuned full checkpoint (backbone + CORN head) |
| `backbone/dinov3_vitl16/` | Local DINOv3 ViT-L/16 HF-format checkpoint (`config.json`, `model.safetensors`, …) for offline inference |
| `requirements.txt` | Inference dependencies |
| `example_predict.py` | CLI: image → score |
| `LICENSE` | Apache 2.0: code and Anime.MILI Ordinal weights in this repository |
| `LICENSE-DINOv3-Meta.txt` | Meta DINOv3 License (bundled backbone and Hub DINOv3 use) |
| `NOTICE` | Licensing summary |
| `assets/` | Figures for this README |
Installation
```shell
pip install -r requirements.txt
export PYTHONPATH="$PWD"   # Linux/macOS
# set PYTHONPATH=%CD%      # Windows cmd
```
Quick start (local backbone)
```python
from pathlib import Path

from PIL import Image
import torch

from mili_score_inference.predict import load_model, predict_pil

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_model(
    Path("weights/anime_mili_ordinal_corn_vitl16.pth"),
    Path("backbone/dinov3_vitl16"),
    device=device,
    local_backbone=True,
)
score = predict_pil(model, Image.open("photo.jpg"), device=device)
print(score)
```
Backbone from Hugging Face Hub
load_model accepts either a path to backbone/dinov3_vitl16 or the Hub model id facebook/dinov3-vitl16-pretrain-lvd1689m (gated; Meta terms on the model card apply).
```python
model = load_model(
    Path("weights/anime_mili_ordinal_corn_vitl16.pth"),
    "facebook/dinov3-vitl16-pretrain-lvd1689m",
    device=device,
    local_backbone=False,
)
```
Use the same model id as was used in training; it is loaded via `from_pretrained`.
CLI
```shell
python example_predict.py \
    --weights weights/anime_mili_ordinal_corn_vitl16.pth \
    --backbone backbone/dinov3_vitl16 \
    --image photo.jpg
```
For Hub-only backbone, pass --backbone facebook/dinov3-vitl16-pretrain-lvd1689m instead of a local directory.
Licensing
- Code, documentation, and fine-tuned checkpoint in this repository: Apache License 2.0 (LICENSE).
- DINOv3 architecture and pretrained weights are subject to Meta's DINOv3 License (LICENSE-DINOv3-Meta.txt). When loading a backbone from the Hub, follow that model card's terms.
- Details: NOTICE.
Domain-specific scores reflect the training distribution; use only in appropriate scenarios.
Citation
Cite this repository when using the release. For research publications, acknowledge use of DINO Materials per Meta’s requirements (see LICENSE-DINOv3-Meta.txt).