What is this?

MiniFASNet V2 SE is a small neural network built for face anti-spoofing, not a general-purpose image classifier. The architecture was specifically designed to tell real faces apart from fakes (printed photos, phone screens, masks, that kind of thing).

Input: RGB face crop. Output: Real or Spoof prediction.

The problem with spoofs

Detecting fake faces isn't the same as recognizing faces. The task is about finding artifacts, the weird stuff that gives away a presentation attack:

Moiré patterns when someone holds up a phone screen
Slight color shifts from printed photos (printers can't perfectly reproduce skin tones)
Texture differences (paper and plastic don't look like skin up close)
Reflections and glare from screens

The tricky part is that these artifacts are subtle. A good print or high-res display can fool a naive classifier pretty easily. That's why MiniFASNet uses a few specific tricks:

Depthwise separable convolutions (keeps the model tiny)
SE blocks that let the network focus on whatever channels matter most
A Fourier Transform branch during training that forces the network to pay attention to frequency patterns
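
To put a number on "keeps the model tiny", here is a minimal sketch comparing weight counts for a standard convolution and a depthwise separable one. The 64→128 channel pair and the 3×3 kernel are illustrative values, not read from the model code, and biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a dense k×k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k×k depthwise conv followed by a 1×1 pointwise conv."""
    depthwise = k * k * c_in   # one k×k filter per input channel
    pointwise = c_in * c_out   # 1×1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer: 64 -> 128 channels with a 3×3 kernel (assumed values).
dense = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(dense, separable, round(dense / separable, 1))  # 73728 8768 8.4
```

For this example layer the separable version needs roughly an eighth of the weights, and the gap widens as the output channel count grows.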

How it's structured

Early layers

Two conv layers right at the start cut the resolution in half and bump up the channels:

Conv1: 3×3, stride 2 → 64×64×32
Conv2 DW: depthwise 3×3 → stays at 64×64×32

Downsampling early keeps computation low for the rest of the network.

Stage 1: 32×32×64

One depthwise block drops resolution to 32×32, then 4 residual blocks process features at that scale. Each residual block follows the usual 1×1 → 3×3 depthwise (DW) → 1×1 pattern, and the last one has an SE module attached.

Residual connections are here for the usual reasons: helping training and letting the network skip layers that aren't needed.

Stage 2: 16×16×128

Same idea, but 6 residual blocks instead of 4. More capacity here since 16×16 is where a lot of the texture information lives. SE block on the final residual again.

This is also where the FT Generator branch taps in during training (more on that below).

Stage 3: 8×8×128

Down to 8×8 now. Only 2 residual blocks here since spatial resolution is small enough that more aren't needed. SE on the last one.

By this point the receptive field covers most of the input image.
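
The shapes quoted for the early layers and the three stages can be reproduced with the standard convolution output-size formula. This sketch assumes a 128×128 input (implied by Conv1 halving the resolution to 64×64), 3×3 kernels with padding 1, and a single stride-2 step at the start of each stage. None of those settings are confirmed by the text beyond the resulting shapes.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, stride, output channels). Strides and padding are assumptions
# chosen to reproduce the shapes quoted in the text.
LAYERS = [
    ("Conv1", 2, 32),
    ("Conv2 DW", 1, 32),
    ("Stage 1", 2, 64),
    ("Stage 2", 2, 128),
    ("Stage 3", 2, 128),
]


def trace_shapes(input_size: int = 128) -> list[tuple[str, int, int]]:
    """Return (name, spatial size, channels) after each layer or stage."""
    size = input_size
    shapes = []
    for name, stride, channels in LAYERS:
        size = conv_out_size(size, kernel=3, stride=stride, padding=1)
        shapes.append((name, size, channels))
    return shapes


for name, size, channels in trace_shapes():
    print(f"{name}: {size}x{size}x{channels}")
```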

Reduction

Before the classifier head, spatial dimensions need to collapse:

SepConv 1×1 bumps channels to 512
DW Conv 5×5 pools everything down to 1×1×512

This ends up being cheaper than global average pooling + a big FC layer.

Classifier head

Pretty standard stuff:

Flatten to 512-d vector
FC 512→128
BatchNorm + Dropout (0.75 dropout rate, which is aggressive)
FC 128→2 for the final Real/Spoof logits

The 0.75 dropout seems high, but anti-spoofing models overfit hard. Training data just doesn't cover all the ways people try to spoof, so cranking up dropout helps.
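
As a sanity check on the head's size, the layer widths listed above can be turned into a parameter count. The sketch assumes both FC layers carry a bias and that BatchNorm has a learnable scale and shift per feature. The text states neither, so treat the exact figure as approximate.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights (plus optional bias) in a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def batchnorm_params(n_features: int) -> int:
    """Learnable scale and shift, one pair per feature."""
    return 2 * n_features


head_params = (
    linear_params(512, 128)    # FC 512 -> 128
    + batchnorm_params(128)    # BatchNorm on the 128-d vector
    + linear_params(128, 2)    # FC 128 -> 2 (Real/Spoof logits)
)
# Dropout has no learnable parameters, so it adds nothing here.
print(head_params)  # 66178
```

That lands close to the "~70K" listed for the head in the parameter breakdown.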

SE blocks

Squeeze-and-Excitation blocks show up at the end of each stage. The idea is simple:

Global average pool to get one number per channel
Two small FC layers produce a weight for each channel
Multiply the original features by those weights

The network can learn to emphasize channels that matter for the current input. Maybe certain channels fire more on screen moiré, others on paper texture. SE lets the network figure that out.

SE blocks only appear on the last residual block of each stage. Adding SE to every block would slow things down for not much gain.
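
The three SE steps can be sketched in plain Python on a toy tensor. This is a dependency-free illustration, not the model's code: the ReLU and sigmoid activations follow the original SE paper, while the hidden size and the hand-picked weights are assumptions made for the example, and biases are omitted.

```python
import math


def se_block(features, w1, w2):
    """Apply Squeeze-and-Excitation to features shaped [C][H][W].

    w1 is [C_reduced][C] and w2 is [C][C_reduced]; biases are omitted.
    """
    # Squeeze: global average pool, one number per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in features
    ]
    # Excite: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's weight.
    scaled = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(features, gates)
    ]
    return scaled, gates


# Toy input: 2 channels of 2×2, with hand-picked (not learned) weights.
features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 0.0]]      # reduce 2 channels to 1 hidden unit
w2 = [[4.0], [-4.0]]   # boost channel 0, suppress channel 1
scaled, gates = se_block(features, w1, w2)
print([round(g, 3) for g in gates])
```

With these weights the first channel passes through almost unchanged while the second is scaled close to zero, which is the "emphasize channels that matter" behaviour in miniature.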

Fourier Transform branch

This is the interesting part. During training only, there's an auxiliary branch that takes Stage 2 features and tries to predict a Fourier spectrum.

Why bother? Spoofing artifacts often show up cleaner in frequency space:

Screen pixels produce regular grid patterns, which show up as peaks in the spectrum
Print halftoning has characteristic frequency signatures
Real skin texture is more random/organic
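
To see why a regular grid becomes a sharp peak, the sketch below computes the 2D Fourier magnitude of a synthetic striped patch. It uses a brute-force DFT from the standard library so it runs without dependencies. A real pipeline would use an FFT routine such as `numpy.fft.fft2`, and the 8×8 striped patch is a made-up stand-in for a screen's pixel grid.

```python
import cmath
import math


def dft2_magnitude(image):
    """Magnitude of the 2D discrete Fourier transform of a square image."""
    n = len(image)
    spectrum = []
    for v in range(n):          # vertical frequency index
        row = []
        for u in range(n):      # horizontal frequency index
            total = 0j
            for y in range(n):
                for x in range(n):
                    angle = -2 * math.pi * (u * x + v * y) / n
                    total += image[y][x] * cmath.exp(1j * angle)
            row.append(abs(total))
        spectrum.append(row)
    return spectrum


# Synthetic 8×8 "screen grid": vertical stripes repeating every 4 pixels.
N = 8
grid = [[math.cos(2 * math.pi * 2 * x / N) for x in range(N)] for _ in range(N)]
magnitude = dft2_magnitude(grid)

# All of the energy lands in two mirrored frequency bins.
peaks = [(v, u) for v in range(N) for u in range(N) if magnitude[v][u] > 1.0]
print(peaks)  # [(0, 2), (0, 6)]
```

A random, skin-like texture would spread its energy across many bins instead of concentrating it in a couple of peaks.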

The branch is just three conv layers (128→64→1) outputting a single-channel prediction, supervised against actual FFTs of the input. Gradients flow back into the backbone, so the whole network learns to pick up on frequency patterns.

At inference time, this branch gets removed entirely since the job was done during training.
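
One way to picture the training setup is a classification loss plus a weighted regression loss on the predicted spectrum. The sketch below is an assumption-laden illustration: the text does not say which losses or weighting are used, so softmax cross-entropy, mean squared error, and the `ft_weight` value are all hypothetical choices.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; label is the true class index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def mse(predicted_map, target_map):
    """Mean squared error between two equally sized 2D maps."""
    diffs = [
        (p - t) ** 2
        for p_row, t_row in zip(predicted_map, target_map)
        for p, t in zip(p_row, t_row)
    ]
    return sum(diffs) / len(diffs)


def training_loss(logits, label, ft_pred, ft_target, ft_weight=0.5):
    """Classification loss plus a weighted Fourier-map loss.

    ft_weight is a hypothetical knob; the source does not state the weighting.
    """
    return cross_entropy(logits, label) + ft_weight * mse(ft_pred, ft_target)


# Toy values: two logits and a 2×2 stand-in for the predicted spectrum.
logits = [2.0, 0.0]
ft_pred = [[0.0, 0.0], [0.0, 0.0]]
ft_target = [[1.0, 1.0], [1.0, 1.0]]
print(round(training_loss(logits, 0, ft_pred, ft_target), 4))
```

At inference only the logits are computed, so the second term and the branch that feeds it cost nothing in the deployed model.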

Numbers

Rough parameter breakdown:

| Part | Params |
|---|---|
| Conv1 + Conv2 | ~3K |
| Stage 1 | ~150K |
| Stage 2 | ~500K |
| Stage 3 | ~250K |
| Reduction | ~300K |
| Head | ~70K |
| FT branch | ~500K (training only) |

Total is around 1.8M params with the training-only FT branch included, or roughly 1.3M without it. After quantizing to INT8, the deployed ONNX model comes out to about 600 KB.
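
Summing the rough per-part figures shows how the total splits between training and inference. The values are the approximate ones from the breakdown above, in thousands of parameters.

```python
# Approximate parameter counts from the breakdown, in thousands.
PARAMS_K = {
    "Conv1 + Conv2": 3,
    "Stage 1": 150,
    "Stage 2": 500,
    "Stage 3": 250,
    "Reduction": 300,
    "Head": 70,
    "FT branch": 500,  # training only
}

training_total_k = sum(PARAMS_K.values())
inference_total_k = training_total_k - PARAMS_K["FT branch"]
print(training_total_k, inference_total_k)  # 1773 1273
```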

Why not MobileNetV4?

MobileNetV4 was actually tried first. It worked fine, but:

| | MobileNetV4 | MiniFASNet |
|---|---|---|
| Size | ~3.5 MB | ~1.8 MB |
| Purpose | General classification | Anti-spoofing |
| FT supervision | Bolted on | Native |
| Training | Fiddly (pretrained weights, LR schedules) | Straightforward |

Accuracy was about the same, so the smaller, simpler option won. MobileNetV4 code is still in src/mobilenetv4/ for anyone wanting to experiment.

References

Silent-Face-Anti-Spoofing - where MiniFASNet comes from
Squeeze-and-Excitation Networks
Learning Deep Models for Face Anti-Spoofing: Binary or Auxiliary Supervision - the auxiliary supervision idea that the Fourier branch builds on