arxiv:2601.05741

ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers

Published on Jan 9

Submitted by Guray Ozgur on Jan 12


Authors:

Guray Ozgur ,

Abstract

ViTNT-FIQA measures face image quality by analyzing patch embedding stability across Vision Transformer blocks with a single forward pass.

AI-generated summary

Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
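The scoring step described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-block patch embeddings have already been extracted from a ViT, and the function names (`l2_normalize`, `stability_score`), the mean aggregation, and the sign convention (negated distance, so that higher means more stable) are all assumptions, since the abstract does not specify how distances are aggregated.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def stability_score(block_embeddings):
    """Hypothetical quality score from patch-embedding stability.

    block_embeddings[b][p] is the embedding (a list of floats) of patch p
    after transformer block b. The layout is an assumption of this sketch.
    """
    if len(block_embeddings) < 2:
        raise ValueError("need embeddings from at least two blocks")

    # L2-normalize every patch embedding in every block.
    normalized = [[l2_normalize(p) for p in block] for block in block_embeddings]

    # Euclidean distance between the same patch in consecutive blocks.
    distances = []
    for prev_block, next_block in zip(normalized, normalized[1:]):
        for a, b in zip(prev_block, next_block):
            distances.append(math.dist(a, b))

    # Assumed aggregation: mean over all patches and block transitions,
    # negated so that smaller changes give a higher score.
    return -sum(distances) / len(distances)


# Toy data: two blocks, two patches, 2-D embeddings.
stable = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]]]   # same directions
erratic = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]  # directions swap
print(stability_score(stable) > stability_score(erratic))
```

In the toy data, the "stable" input only changes embedding magnitudes, which L2 normalization removes, so its consecutive-block distance is zero; the "erratic" input rotates each patch embedding and scores lower. In practice the embeddings would come from the intermediate blocks of a pre-trained ViT face recognition model in a single forward pass.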


Community

gurayozgur

Paper author Paper submitter about 17 hours ago

https://github.com/gurayozgur/ViTNT-FIQA

gurayozgur

Paper author Paper submitter about 17 hours ago

ViTNT-FIQA is a training-free Face Image Quality Assessment (FIQA) method that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. Unlike existing approaches that require multiple forward passes, backpropagation, or additional training, our method achieves competitive performance with just a single forward pass through pre-trained ViT-based face recognition models. https://github.com/gurayozgur/ViTNT-FIQA


