MobilePelikan-80M - Foreground Mask Generation Model

MobilePelikan-80M is an in-house foreground mask-generation model developed by ZundTeam, a German AI lab. It is designed for image segmentation workflows where a system receives a single RGB image and predicts a foreground region mask directly in image space. Rather than asking the model to classify every pixel into a large label set, MobilePelikan-80M is optimized for the simpler and often more practical task of separating a dominant foreground subject from the surrounding background.

MobilePelikan-80M is built as a compact custom architecture with a transformer backbone and a learned mask decoder. The model encodes the image into visual tokens, processes them through a lightweight segmentation stack, and produces dense mask logits together with a confidence-style mask score. It is not a wrapper around SAM, SAM2, SAM3, or any released promptable segmentation model. MobilePelikan-80M is a standalone PyTorch and safetensors model intended for lightweight segmentation pipelines, object-isolation tools, annotation bootstrapping, and computer-vision research.

Unlike promptable systems such as Pelikan-2, MobilePelikan-80M does not require a box, point, or text instruction at inference time. It is a prompt-free model. The surrounding application simply supplies an image, and the model predicts a single foreground mask. This makes MobilePelikan-80M useful when a project wants a direct image-to-mask component instead of an interactive prompt pathway.

MobilePelikan-80M belongs to the MobilePelikan model family, a smaller and faster line positioned below larger flagship segmentation systems. The important idea behind the family is not to reproduce a larger promptable model in miniature, but to provide a lighter segmentation model line that is easier to run, easier to integrate, and practical for moderate workloads where prompt-free foreground extraction is enough.

The model is especially useful when segmentation should remain simple. Instead of deciding among many semantic categories or relying on a user to provide a prompt, the surrounding application can treat MobilePelikan-80M as a direct foreground extractor. A cutout tool can use it to isolate a subject. An annotation workflow can use it to create an initial mask that a human corrects. A preprocessing pipeline can use it to remove background clutter before downstream vision tasks. A research system can use it as a compact baseline for studying transformer-based mask generation.

This also gives the model a different operational feel from heavier segmentation systems. In practice, MobilePelikan-80M is meant to behave like a dependable visual utility: pass in an image, receive a spatial foreground estimate, and decide in the surrounding application how strict, soft, or refined the final mask should be. That makes it well suited to products and experiments where the segmentation model should stay focused on a narrow job instead of becoming a larger interactive assistant.

The release format follows that same philosophy. The repository exposes raw model artifacts and a direct configuration rather than hiding the model behind a complex framework-specific abstraction. Developers can load the weights, inspect the configuration, adapt preprocessing to their own environment, and build small wrappers around the core forward pass. In that sense, the model card is intended not only as a description of the artifact, but also as a guide for how to think about the model as a reusable segmentation component.

At a Glance

| Property | Value |
| --- | --- |
| Model name | `MobilePelikan-80M` |
| Developer | ZundTeam |
| Model type | `MobilePelikanForMaskGeneration` |
| Task | Binary foreground mask generation |
| Parameter count | `75,643,009` |
| Input image size | `256 x 256` |
| Patch size | `16` |
| Encoder width | `512` |
| Encoder depth | `14` layers |
| Decoder depth | `6` layers |
| Attention heads | `8` |
| Mask embedding dim | `256` |
| Aux neck dim | `256` |
| Default mask threshold | `0.5` |
| Output | Single foreground mask + mask score |
| Format | PyTorch / safetensors |
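
A few derived quantities follow from these values. The short sketch below computes them under two assumptions that are standard for vision transformers but not spelled out by the release: the token grid is the input size divided by the patch size, and the encoder width is split evenly across attention heads. The resulting grid matches the `16 x 16` grid size listed in the released configuration.

```python
# Derived quantities from the "At a Glance" values.
# Assumption: the token grid is input_size // patch_size per side and
# the encoder width is split evenly across the attention heads.
INPUT_SIZE = 256
PATCH_SIZE = 16
ENCODER_WIDTH = 512
NUM_HEADS = 8

grid_side = INPUT_SIZE // PATCH_SIZE      # patches per side
num_tokens = grid_side * grid_side        # visual tokens per image
head_dim = ENCODER_WIDTH // NUM_HEADS     # channels per attention head

print(f"token grid: {grid_side} x {grid_side} ({num_tokens} tokens)")
print(f"per-head width: {head_dim}")
```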

Foreground Segmentation Flow

MobilePelikan-80M follows a simple image-to-mask flow:

An RGB image is resized to the configured input size.
A convolutional stem converts the image into early visual features.
Patch embedding turns those features into a compact visual token grid.
A transformer encoder models the global image context.
A learned mask query attends over the encoded image representation.
A dense reconstruction path converts the decoded representation into mask logits.
A downstream system thresholds or post-processes the final foreground mask.

This design gives MobilePelikan-80M the structure expected from a modern compact segmentation model while remaining small enough for iteration and deployment in lighter workflows. The workflow image above is generated from a real photo run through the released model. It shows the raw input, the probability map, the thresholded binary mask, and the final overlay.

The interaction pattern is deliberately simple. There is no prompt encoder, no class vocabulary, and no multi-candidate prompt selection loop. A system gives the model an image, and the model returns a foreground estimate. That simplicity is one of the main design goals of the MobilePelikan line.

Another way to understand the flow is to think of the model as compressing the image into a compact scene representation and then expanding that representation back into a foreground decision map. The encoder is responsible for understanding the overall structure of the scene, while the decoder and reconstruction path are responsible for translating that understanding into a dense spatial output. This separation of responsibilities keeps the architecture conceptually clean and makes the model easier to reason about during integration.

Model Overview

The released MobilePelikan-80M model has approximately 75.6M parameters. The architecture is custom and identified in the configuration as MobilePelikanForMaskGeneration.

MobilePelikan-80M is designed for image-space foreground segmentation. Images are reduced into visual patches, encoded by a transformer, and decoded into a dense binary mask. The decoder is guided by a learned mask query rather than by a user prompt. This keeps the model focused on one task: produce a useful foreground mask from a single image.

The model includes a score head that predicts a confidence-style value for the decoded mask. This can help downstream systems rank outputs, decide whether the result should be accepted automatically, or determine when additional post-processing or human review may be useful.
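
One way a downstream system might act on the mask score is sketched below. The function name, the decision labels, and both cutoffs are hypothetical, and the sketch assumes the score lies in the range 0 to 1, which the release does not state explicitly. Suitable cutoffs would need to be calibrated on the target data.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"   # use the mask automatically
    REVIEW = "review"   # send to post-processing or a human
    REJECT = "reject"   # discard the mask


def route_by_score(mask_score: float,
                   accept_at: float = 0.85,
                   review_at: float = 0.5) -> Decision:
    """Map a confidence-style mask score to a pipeline decision.

    The cutoffs are illustrative placeholders, not released values.
    Assumes the score is in [0, 1].
    """
    if not 0.0 <= mask_score <= 1.0:
        raise ValueError("expected a score in [0, 1]")
    if mask_score >= accept_at:
        return Decision.ACCEPT
    if mask_score >= review_at:
        return Decision.REVIEW
    return Decision.REJECT
```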

The model is built around 256 x 256 inputs. For larger images, a practical pipeline can resize the full image, crop around the likely foreground region, or use MobilePelikan-80M as a first-pass mask proposal stage before higher-resolution refinement. The predicted mask can then be resized back to the original image resolution and optionally cleaned with post-processing.
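
As a minimal sketch of the resize-back step, the NumPy helper below maps a `256 x 256` probability map onto the original resolution with nearest-neighbour sampling. It assumes the source image was stretched to the model input without padding or cropping. The helper name `resize_nearest` is illustrative, and a real pipeline may prefer bilinear interpolation from an image library.

```python
import numpy as np


def resize_nearest(mask: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of a 2-D array to (out_h, out_w)."""
    in_h, in_w = mask.shape
    # Map each output pixel centre back to a source row / column index.
    rows = ((np.arange(out_h) + 0.5) * in_h / out_h).astype(int)
    cols = ((np.arange(out_w) + 0.5) * in_w / out_w).astype(int)
    rows = np.minimum(rows, in_h - 1)
    cols = np.minimum(cols, in_w - 1)
    return mask[rows[:, None], cols[None, :]]


# Toy probability map: a centred square of foreground.
probs_256 = np.zeros((256, 256), dtype=np.float32)
probs_256[64:192, 64:192] = 1.0

# Map it back onto a 480 x 640 source image.
probs_full = resize_nearest(probs_256, 480, 640)
print(probs_full.shape)
```

Resizing the probabilities and thresholding afterwards, rather than resizing an already binarized mask, leaves the threshold decision to the final resolution.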

Like larger segmentation systems, MobilePelikan-80M emits raw mask logits rather than only hard binary masks. This gives applications direct control over thresholding. A stricter (higher) threshold keeps only the pixels the model is most confident about and tends to produce cleaner regions, while a softer (lower) threshold keeps more uncertain pixels and can preserve more boundary detail around difficult shapes.
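
The effect of the threshold can be illustrated with a few toy logits. The sketch below applies an element-wise sigmoid and compares three cutoffs. The helper name and the example values are illustrative only. Note that a probability threshold of 0.5 is equivalent to testing whether the logit is positive.

```python
import numpy as np


def logits_to_mask(mask_logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Turn raw logits into a boolean mask at a probability threshold."""
    probabilities = 1.0 / (1.0 + np.exp(-mask_logits))  # element-wise sigmoid
    return probabilities > threshold


logits = np.array([[-3.0, -0.5, 0.2],
                   [0.4, 1.5, 4.0]])

default_mask = logits_to_mask(logits)        # 0.5 cutoff, same as logit > 0
strict_mask = logits_to_mask(logits, 0.8)    # fewer, more confident pixels
soft_mask = logits_to_mask(logits, 0.3)      # more pixels, including uncertain ones

print(default_mask.sum(), strict_mask.sum(), soft_mask.sum())
```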

Because the model is prompt-free, the burden of deciding what the foreground should be sits partly in the image distribution and partly in the surrounding pipeline design. In many practical cases this is a strength rather than a limitation. A controlled product experience can decide that the most visually dominant subject should be extracted. A batch-processing system can decide that the largest coherent region is the desired result. A review tool can surface the model output as a starting point and let a human quickly approve or correct it. The model therefore works best when it is given a clear role inside a broader application decision flow.

Architecture

MobilePelikan-80M uses a custom transformer-based foreground segmentation architecture with four main parts:

Convolutional stem: prepares the RGB image before patch tokenization.
Image encoder: converts the feature map into a contextualized grid of visual tokens.
Mask decoder: lets a learned mask query attend over the image representation.
Mask reconstruction path: projects decoded features back into dense image-space mask logits.

The released configuration for mobilepelikan-80m is:

| Component | Setting |
| --- | --- |
| Stem input channels | `3` |
| Encoder dimension | `512` |
| Encoder blocks | `14` |
| Decoder blocks | `6` |
| Attention heads | `8` |
| MLP ratio | `4.0` |
| Dropout | `0.0` |
| Attention dropout | `0.0` |
| Grid size | `16 x 16` patches |
| Mask query count | `1` |
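
These settings allow a rough back-of-the-envelope check against the reported parameter count. The estimate below is only a sketch under stated assumptions: each encoder block is taken to hold a self-attention layer and an MLP with ratio 4, each decoder block is taken to add one cross-attention layer and to run at the same width of 512, and biases and normalization parameters are ignored. None of this block layout is confirmed by the release, so the split should be read as an order-of-magnitude guide only.

```python
# Rough weight count for the transformer blocks only.
# Assumptions (not confirmed by the release): standard block layout,
# decoder width equal to the encoder width, biases and norms ignored.
D = 512
MLP_RATIO = 4
ENCODER_BLOCKS = 14
DECODER_BLOCKS = 6
REPORTED_TOTAL = 75_643_009   # parameter count stated in this model card

attention = 4 * D * D             # Q, K, V and output projections
mlp = 2 * MLP_RATIO * D * D       # expand and contract projections

encoder_params = ENCODER_BLOCKS * (attention + mlp)
decoder_params = DECODER_BLOCKS * (2 * attention + mlp)  # self + cross attention
remainder = REPORTED_TOTAL - encoder_params - decoder_params

print(f"encoder blocks: {encoder_params:,}")
print(f"decoder blocks: {decoder_params:,}")
print(f"left for stem, embeddings, neck, heads: {remainder:,}")
```

Under these assumptions the transformer blocks would account for most of the weights, with a few million parameters left for the stem, embeddings, neck, and output heads.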

The image encoder uses patch embedding followed by stacked self-attention and feed-forward layers. This gives the model global scene context while preserving a spatial token grid for downstream mask prediction.

The decoder uses a learned mask query rather than a prompt-conditioned token set. That query attends to the encoded image memory, extracts object-level structure from the scene, and feeds the reconstruction path that produces dense mask logits.
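
The idea of a single query attending over the encoded image can be sketched with plain scaled dot-product attention. The NumPy example below is a simplified illustration, not the released decoder: it uses one head, random values in place of learned weights, and no query, key, or value projections.

```python
import numpy as np


def single_query_attention(query: np.ndarray, memory: np.ndarray):
    """Scaled dot-product attention of one query over a token memory.

    query:  (d,)   stand-in for the learned mask query
    memory: (n, d) stand-in for the encoded image tokens
    Returns the attended vector (d,) and the attention weights (n,).
    """
    d = query.shape[0]
    scores = memory @ query / np.sqrt(d)   # similarity to each token
    scores = scores - scores.max()         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()               # softmax over the n tokens
    return weights @ memory, weights


rng = np.random.default_rng(0)
memory = rng.standard_normal((256, 512))   # 16 x 16 tokens, width 512
query = rng.standard_normal(512)

attended, weights = single_query_attention(query, memory)
attention_map = weights.reshape(16, 16)    # where the query "looks"
print(attended.shape, attention_map.shape)
```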

The dense mask reconstruction path combines decoded mask information with upsampled visual features from the convolutional neck. This restores enough spatial detail to produce a useful foreground mask in image space.
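
One common way to combine a decoded mask embedding with per-pixel features is a dot product at every spatial location. The sketch below shows that pattern as an assumption about how such a reconstruction path could work. The release does not document the exact operation, and the `64 x 64` feature resolution here is hypothetical. The matching 256-dimensional mask embedding and neck sizes in the configuration are at least compatible with this reading.

```python
import numpy as np


def project_mask(mask_embedding: np.ndarray, pixel_features: np.ndarray) -> np.ndarray:
    """Dot each per-pixel feature vector with the mask embedding.

    mask_embedding: (c,)       decoded mask representation
    pixel_features: (c, h, w)  upsampled visual features
    Returns mask logits of shape (h, w).
    """
    return np.einsum("c,chw->hw", mask_embedding, pixel_features)


rng = np.random.default_rng(0)
mask_embedding = rng.standard_normal(256)
pixel_features = rng.standard_normal((256, 64, 64))   # hypothetical resolution

mask_logits = project_mask(mask_embedding, pixel_features)
print(mask_logits.shape)
```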

This design keeps MobilePelikan-80M focused on direct foreground extraction. It is not an image classifier, object detector, captioning system, text-to-image model, or multimodal assistant. It is a dedicated foreground mask-generation model.

That focus is important for understanding the tradeoff the model makes. MobilePelikan-80M does not attempt to solve every vision task at once. Instead, it narrows the problem to one concrete output type: a foreground mask. By narrowing the task, the implementation can remain comparatively direct, and the integration story becomes easier for applications that only need segmentation rather than a larger multimodal stack.

Design Goals

MobilePelikan-80M was designed around four practical goals.

First, it should be simple. The model should accept an image and return a mask without requiring prompt engineering or interactive inputs. This makes it easy to place inside batch pipelines and lightweight applications.

Second, it should preserve an image-space workflow. The model predicts dense segmentation masks, not captions, labels, or class IDs. Its role is to turn image content into a spatial foreground estimate that can be inspected, edited, or reused downstream.

Third, it should be lightweight relative to flagship segmentation systems. The MobilePelikan line is intended to be easier to train, iterate on, and serve while still benefiting from transformer-based image encoding.

Fourth, it should remain research-friendly. The release uses PyTorch and safetensors artifacts, a straightforward configuration file, and a direct model card so developers can load the model, test variations, and build their own wrappers around it.

Taken together, these goals define the MobilePelikan line as a practical engineering model family. The release is meant to be used, adapted, and integrated. It is not presented as a universal solution for all segmentation problems, and it does not depend on a large prompt interface to be useful. Its main value is that it provides a compact, direct path from image pixels to a usable foreground mask.

Result Comparisons

The examples below use real photos and real model outputs. They are meant to show the actual behavior of the model rather than a drawn diagram. Each preview shows the relationship between the source image, the predicted foreground probability map, and the final mask overlay.

These comparisons show the same basic pattern across different images: start from a raw photo, estimate the foreground region, threshold the result into a mask, and use that mask as the basis for isolation, cleanup, or downstream processing.

Intended Use

MobilePelikan-80M is intended for:

foreground object extraction
lightweight binary image segmentation
annotation bootstrapping
mask proposal generation
cutout and background-removal style workflows
segmentation research with smaller transformer models
educational or experimental vision pipelines

MobilePelikan-80M can be used wherever a project needs a learned model that receives an image and returns a direct foreground mask. It is especially relevant for pipelines where the goal is subject isolation rather than promptable region selection.

Example application areas include object cutout tools, simple background-removal workflows, annotation bootstrapping, image preprocessing for downstream models, visual editing experiments, and compact transformer-based segmentation research.

The model can also be useful as a mask proposal model. A downstream system can run MobilePelikan-80M as a first-pass segmenter, then keep the predicted mask as-is, refine it with post-processing, or pass it into a second-stage model that operates at higher resolution.
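
For the second-stage case, a first-pass mask can be turned into a padded crop box in original-image coordinates. The helper below is a hypothetical sketch. It assumes the image was stretched to the model input without padding, and the 10 percent margin is an arbitrary illustrative default.

```python
import math

import numpy as np


def mask_to_crop_box(binary_mask: np.ndarray, orig_h: int, orig_w: int,
                     pad: float = 0.1):
    """Return (top, left, bottom, right) in original pixels, or None if empty."""
    ys, xs = np.nonzero(binary_mask)
    if ys.size == 0:
        return None  # no foreground predicted
    mask_h, mask_w = binary_mask.shape
    scale_y, scale_x = orig_h / mask_h, orig_w / mask_w
    # Tight box around the foreground, scaled to the original resolution.
    top, bottom = ys.min() * scale_y, (ys.max() + 1) * scale_y
    left, right = xs.min() * scale_x, (xs.max() + 1) * scale_x
    # Add a relative margin, then clamp to the image bounds.
    pad_y, pad_x = (bottom - top) * pad, (right - left) * pad
    return (max(0, int(top - pad_y)),
            max(0, int(left - pad_x)),
            min(orig_h, math.ceil(bottom + pad_y)),
            min(orig_w, math.ceil(right + pad_x)))


mask = np.zeros((256, 256), dtype=bool)
mask[64:192, 96:160] = True
box = mask_to_crop_box(mask, orig_h=1024, orig_w=2048)
print(box)
```

The crop can then be passed to a higher-resolution model, with the box retained so the refined mask can be pasted back into place.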

It is especially useful in environments where a full promptable workflow would be unnecessary overhead. If the application repeatedly wants a likely foreground subject and does not need a box-and-point interaction pattern, a prompt-free model can be operationally simpler. That simplicity can matter for desktop tools, small services, embedded workflows, or internal research tooling where the best experience is often the one with the fewest moving parts.

Usage

This repository contains the exported config.json, model.pt, and model.safetensors files for MobilePelikan-80M.

A compatible implementation should:

Read config.json.
Construct MobilePelikanForMaskGeneration.
Load the weights from model.safetensors or model.pt.
Resize the input image to 256 x 256.
Run inference to obtain mask logits, probabilities, and a mask score.
Threshold and resize the predicted mask back to the original resolution if needed.

Example loading the safetensors weights:

```python
import json

from safetensors.torch import load_file

from mobilepelikan.modeling_mobilepelikan import (
    MobilePelikanConfig,
    MobilePelikanForMaskGeneration,
)

# Build the model configuration from the exported config.json.
with open("config.json", "r", encoding="utf-8") as handle:
    config = MobilePelikanConfig(**json.load(handle))

# Construct the architecture, then load the released weights into it.
model = MobilePelikanForMaskGeneration(config)
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
model.eval()  # switch to inference mode
```
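
After loading, a quick sanity check is to compare the number of stored values with the parameter count stated in this card. The helper below is a small sketch that works on tensor shapes, so it can be tried without the weights. The name `count_parameters` is illustrative. With real weights, the shapes could be gathered with `[tuple(t.shape) for t in state_dict.values()]`. A state dict can also contain non-trainable buffers, so a small mismatch would not necessarily indicate a problem.

```python
import math
from typing import Iterable, Tuple

EXPECTED_PARAMETERS = 75_643_009  # parameter count stated in this model card


def count_parameters(shapes: Iterable[Tuple[int, ...]]) -> int:
    """Sum the element counts of a collection of tensor shapes."""
    return sum(math.prod(shape) for shape in shapes)


# Toy example: one 512 x 512 weight matrix plus its bias vector.
toy_shapes = [(512, 512), (512,)]
toy_total = count_parameters(toy_shapes)
print(f"{toy_total:,} values in the toy example")
```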

At a high level, a MobilePelikan-80M inference pipeline should:

Convert the image to RGB.
Resize it to the configured input size.
Convert it to a batched tensor.
Run model.generate_masks(...).
Apply thresholding to mask_probabilities.
Resize the output mask back to the source image size if required.
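
The first three steps can be sketched in NumPy as follows. This is a minimal illustration, not the released preprocessing: it uses nearest-neighbour resizing and simply scales pixel values to the range 0 to 1, whereas the normalization the model actually expects is not documented here and should be taken from the training setup. The function name `preprocess` is hypothetical.

```python
import numpy as np


def preprocess(image_hwc: np.ndarray, size: int = 256) -> np.ndarray:
    """uint8 RGB image (H, W, 3) -> float32 batch (1, 3, size, size) in [0, 1]."""
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError("expected an RGB image shaped (H, W, 3)")
    h, w, _ = image_hwc.shape
    # Nearest-neighbour source indices for each output row / column.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image_hwc[rows[:, None], cols[None, :]]        # (size, size, 3)
    # Channels first, float32, scaled to [0, 1] (assumed normalization).
    chw = resized.transpose(2, 0, 1).astype(np.float32) / 255.0
    return chw[None]                                          # add batch dim


image = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
batch = preprocess(image)
print(batch.shape, batch.dtype)
```

The resulting array can be converted for the model with `torch.from_numpy(batch)`. Keeping the original height and width alongside the batch makes the final resize-back step straightforward.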

The exact preprocessing, threshold, resize strategy, and post-processing depend on the surrounding system. An implementation should keep the image tensor and output mask aligned so that the final mask can be mapped cleanly back to the original image space.

In practice, this means the model is straightforward to wrap in a script, a small API, or a GUI application. A lightweight service can accept an uploaded image, run the model, and return a PNG mask or overlay. A desktop tool can run the same logic locally and let a user inspect the result before exporting. A batch pipeline can treat the model as one stage in a longer visual-processing chain. The model does not require a complicated runtime contract, which makes it easier to integrate than systems with multiple prompt tensors and interaction states.

Input and Output Behavior

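MobilePelikan-80M expects an RGB image. A typical application resizes the image to 256 x 256, normalizes the pixel values, converts the result into a batched tensor, and passes it into the model. The mask is predicted in this resized coordinate space, so the application should keep track of the original image size if the mask needs to be mapped back.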

The main input is a batch of RGB tensors. The output contains:

mask_logits: raw binary segmentation logits
mask_probabilities: sigmoid probabilities for the predicted foreground region
mask_scores: a scalar confidence-style score for the predicted mask
binary_masks: thresholded masks returned by generate_masks()

The output mask is best treated as a probability-like spatial field before thresholding. Downstream code can apply a chosen threshold, remove tiny disconnected regions, fill holes, smooth contours, or resize the final result back into the original image coordinate system.

This output behavior is useful because it leaves control with the application rather than forcing one final interpretation of the mask. A visual editing tool may prefer a softer output that preserves uncertain edges for later correction. A background-removal service may prefer a stricter cutoff that favors cleaner silhouettes. A research workflow may keep the logits or probabilities for further analysis instead of immediately collapsing them into a hard binary decision.
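
The difference between a soft and a hard interpretation can be seen when building an RGBA cutout. In the sketch below, which uses the hypothetical helper `make_cutout`, the alpha channel is either the probability map itself or a thresholded version of it. The image and probabilities are toy values and must already share the same height and width.

```python
import numpy as np


def make_cutout(image_rgb: np.ndarray, mask_probabilities: np.ndarray,
                threshold=None) -> np.ndarray:
    """Return an RGBA image whose alpha channel comes from the mask.

    threshold=None keeps a soft alpha; otherwise alpha is hard 0 or 255.
    """
    if image_rgb.shape[:2] != mask_probabilities.shape:
        raise ValueError("image and mask must share height and width")
    if threshold is None:
        alpha = np.clip(mask_probabilities, 0.0, 1.0)
    else:
        alpha = (mask_probabilities > threshold).astype(np.float32)
    alpha_u8 = np.round(alpha * 255).astype(np.uint8)
    return np.dstack([image_rgb, alpha_u8])


image = np.full((2, 2, 3), 200, dtype=np.uint8)
probabilities = np.array([[0.0, 0.2], [0.6, 1.0]], dtype=np.float32)

soft_cutout = make_cutout(image, probabilities)                 # editable edges
hard_cutout = make_cutout(image, probabilities, threshold=0.5)  # clean silhouette
print(soft_cutout[..., 3].tolist(), hard_cutout[..., 3].tolist())
```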

Integration Notes

MobilePelikan-80M is best treated as a compact segmentation component. A complete product may pair it with:

a preprocessing stage that normalizes or crops images
mask post-processing such as thresholding, hole filling, or contour smoothing
a UI that lets a user accept or refine the predicted mask
a downstream model that consumes isolated foreground subjects
human review for annotation workflows

The model outputs raw mask logits and probabilities. Applications should choose a threshold, resize strategy, and optional cleanup step depending on their visual quality requirements.

The model can sit behind a small service that accepts an image and returns a predicted mask. It can also be embedded directly inside a desktop tool, data engine, or research pipeline. For repeated use, a serving implementation can keep the model loaded in memory and wrap the single forward pass with lightweight preprocessing and post-processing.
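
A minimal sketch of the keep-it-loaded pattern is shown below. The class name `MaskService` and its two callables are hypothetical. In a real deployment, `load_model` would wrap the weight-loading code and `predict` would wrap preprocessing, the model call, and post-processing. The lock only guards the one-time load and says nothing about whether the forward pass itself is safe to run concurrently.

```python
import threading
from typing import Any, Callable, Optional


class MaskService:
    """Loads the model on first use and reuses it for later requests."""

    def __init__(self,
                 load_model: Callable[[], Any],
                 predict: Callable[[Any, Any], Any]) -> None:
        self._load_model = load_model
        self._predict = predict
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def _get_model(self) -> Any:
        # Guard the one-time load so concurrent requests do not load twice.
        with self._lock:
            if self._model is None:
                self._model = self._load_model()
            return self._model

    def __call__(self, image: Any) -> Any:
        return self._predict(self._get_model(), image)
```

A web handler or batch loop would create one `MaskService` at startup and call it once per image.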

One practical advantage of this setup is predictability. Because the interface is just image in, mask out, the model is easy to compose with other tools. It can run before an editor, before a classifier, before a retrieval system, or before a human review loop. That modularity is one of the main reasons compact foreground models remain useful even when larger vision systems exist.

Post-Processing Suggestions

The model emits mask probabilities, so downstream systems can tune the final binary output. Common post-processing steps include:

resizing the mask to the source image resolution
applying a custom threshold
removing tiny disconnected components
filling small holes
smoothing edges or contours
keeping only the largest connected foreground region
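
Two of these steps, removing tiny components and keeping only the largest region, can be sketched with a simple breadth-first labelling pass. The function names are illustrative and the code assumes 4-connectivity. For large masks, an optimized routine such as `scipy.ndimage.label` would usually be a better choice.

```python
from collections import deque

import numpy as np


def label_components(binary_mask: np.ndarray):
    """4-connected labelling via BFS. Returns (labels, {label: size})."""
    h, w = binary_mask.shape
    labels = np.zeros((h, w), dtype=np.int32)   # 0 means background
    sizes = {}
    for y in range(h):
        for x in range(w):
            if not binary_mask[y, x] or labels[y, x]:
                continue
            label = len(sizes) + 1
            labels[y, x] = label
            queue, count = deque([(y, x)]), 0
            while queue:
                cy, cx = queue.popleft()
                count += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and binary_mask[ny, nx] and not labels[ny, nx]):
                        labels[ny, nx] = label
                        queue.append((ny, nx))
            sizes[label] = count
    return labels, sizes


def keep_largest_component(binary_mask: np.ndarray) -> np.ndarray:
    """Keep only the largest connected foreground region."""
    labels, sizes = label_components(binary_mask)
    if not sizes:
        return np.zeros(binary_mask.shape, dtype=bool)
    return labels == max(sizes, key=sizes.get)


def remove_small_components(binary_mask: np.ndarray, min_area: int) -> np.ndarray:
    """Drop connected regions smaller than min_area pixels."""
    labels, sizes = label_components(binary_mask)
    keep = [label for label, size in sizes.items() if size >= min_area]
    return np.isin(labels, keep)


mask = np.zeros((6, 6), dtype=bool)
mask[1:4, 1:4] = True    # 9-pixel subject
mask[5, 5] = True        # isolated speck
cleaned = keep_largest_component(mask)
print(int(mask.sum()), int(cleaned.sum()))
```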

For user-facing cutout workflows, these simple steps often improve the visual result noticeably. A softer (lower) threshold can preserve more of the target subject around uncertain boundaries, while a stricter (higher) threshold can produce a cleaner silhouette for automated extraction.

For annotation workflows, it may be better to preserve some uncertainty rather than over-clean the result. For automated background removal, a stricter threshold and stronger cleanup rules may be preferred.

There is no single universally correct threshold or cleanup recipe. The right choice depends on whether the application values clean silhouettes, boundary recall, visual smoothness, or conservative subject preservation. In many products, exposing one or two preset quality modes is more useful than hard-coding a single post-processing policy.
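
Preset quality modes can be as simple as a small table of named settings. The sketch below is hypothetical throughout: the preset names, the fields, and every numeric value are placeholders that would need tuning on real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaskPreset:
    threshold: float            # probability cutoff for foreground
    min_component_area: int     # drop regions smaller than this many pixels
    keep_largest_only: bool     # keep a single connected subject


# Illustrative values only; tune them on the target domain.
PRESETS = {
    "preserve": MaskPreset(threshold=0.35, min_component_area=16,
                           keep_largest_only=False),
    "clean": MaskPreset(threshold=0.65, min_component_area=256,
                        keep_largest_only=True),
}


def get_preset(name: str) -> MaskPreset:
    """Look up a preset by name, with a readable error for unknown names."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(
            f"unknown preset {name!r}; choose from {sorted(PRESETS)}"
        ) from None
```

An annotation tool might default to `preserve`, while an automated background-removal job might default to `clean`.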

Model Line

MobilePelikan-80M is the smaller release of the MobilePelikan foreground segmentation line. The model establishes the compact architecture pattern, direct image-to-mask objective, and PyTorch and safetensors release structure used by the family.

Future MobilePelikan releases can build on the same design with larger variants, longer training, stronger data mixtures, higher-resolution adaptation, or improved reconstruction heads. The line is intended to leave room for small practical models beneath the larger Pelikan promptable segmentation family.

Scope

MobilePelikan-80M is a segmentation model artifact, not a complete product. The release does not include:

a graphical annotation interface
a web application
a production inference server
prompt tools or multimodal inputs
standalone labeling software

The model is best suited for developers and researchers who are comfortable integrating raw PyTorch or safetensors weights into their own computer-vision systems. It can also serve as a reference point for experiments around compact foreground segmentation and smaller transformer-based mask decoders.

It should be understood as one useful part of a larger segmentation workflow rather than as a complete turnkey product. The release gives you the segmentation core. The surrounding product decisions, visual UX, and domain-specific quality controls remain the responsibility of the integrating system.

Limitations

MobilePelikan-80M is trained for binary foreground segmentation, not general-purpose semantic segmentation.
It assumes a relatively simple foreground/background separation task.
Performance depends on image distribution, preprocessing, thresholding, and post-processing.
Small objects, thin structures, occlusions, and unusual backgrounds can reduce output quality.
A dominant single-object assumption may not hold for crowded or highly complex scenes.
The model should be evaluated on the target domain before deployment.
It is not intended for safety-critical medical, legal, biometric, or identity-sensitive segmentation without independent expert review.

Citation

```bibtex
@misc{zundteam_2026_mobilepelikan80m,
  title        = {MobilePelikan-80M: Foreground Mask Generation Model},
  author       = {ZundTeam},
  year         = {2026},
  note         = {Model artifact and weights release}
}
```
