---
license: apache-2.0
library_name: pytorch
pipeline_tag: image-segmentation
tags:
- image-segmentation
- mask-generation
- promptable-segmentation
- segmentation
- transformer
- vision-transformer
- visual-prompting
---
# Pelikan-2 - Promptable Mask Generation Model
<p align="center">
<img src="asets/pelikan2_result_gallery.png" alt="Pelikan-2 promptable mask generation gallery" width="100%">
</p>
Pelikan-2 is an in-house promptable mask-generation model developed by **ZundTeam**, a German AI lab. It is designed for image segmentation workflows where a user or system provides a visual prompt, such as a bounding box or foreground point, and the model predicts an object or region mask directly in image space.
Pelikan-2 is built around a SAM-style interaction pattern while remaining a custom architecture. The model encodes the image, embeds visual prompts, and uses a learned mask decoder to produce candidate masks and quality estimates. It is not a wrapper around SAM, SAM2, SAM3, or any other released segmentation model. Pelikan-2 is a standalone PyTorch/safetensors model intended for custom segmentation pipelines, annotation tools, and computer-vision research.
Unlike fixed-category semantic segmentation systems, Pelikan-2 is centered on visual intent rather than a closed label set. A prompt tells the model what region matters, and the model returns a mask for that region. This makes Pelikan-2 useful for interactive segmentation, visual editing tools, object isolation, dataset annotation, and experiments around promptable vision models.
Pelikan-2 is intended to feel familiar to users who know promptable segmentation systems: select a region, pass the prompt into the model, receive a mask, then refine or use that mask downstream. The important difference is that Pelikan-2 is built as its own model line. The encoder, prompt pathway, decoder, mask heads, and release format are designed for ZundTeam's own segmentation research rather than for re-exporting another segmentation backbone.
The model is especially useful when a segmentation system should remain flexible. Instead of asking the model to decide from a fixed class list, the surrounding application can decide what to ask for. A user can click on an object. A detector can provide a box. A video system can track a region and re-prompt the model frame by frame. A data engine can use Pelikan-2 to create candidate masks that humans review and correct.
## Visual Prompting
Pelikan-2 follows a simple promptable segmentation flow:
1. An RGB image is converted into visual tokens.
2. A box prompt and foreground point are embedded as target hints.
3. The prompt information is merged with the image representation.
4. Learned mask tokens query the image features.
5. The decoder produces candidate masks and a mask-quality estimate.
6. A downstream system selects or post-processes the final mask.
<p align="center">
<img src="asets/preview_00.png" alt="Pelikan-2 image prompt and mask output preview" width="100%">
</p>
This design gives Pelikan-2 the structure expected from modern promptable segmentation systems while keeping the implementation compact enough for research and iteration. The model can be used as a segmentation component inside a larger tool where prompts are supplied by a user interface, a detector, a tracking system, or another vision model.
The visual prompt is deliberately simple. A box gives the model a coarse region of interest, while a foreground point gives it a direct hint about which part of that region should be treated as the target. This combination is useful for annotation systems because it maps naturally to how people select objects: drag a rough box, click inside the object, and let the model propose the precise boundary.
Pelikan-2 can also be used in automatic pipelines. A detector can generate candidate boxes, a saliency or tracking component can choose foreground points, and Pelikan-2 can convert those cues into masks. The model does not require text prompts, class names, or category labels at inference time.
## Model Overview
The released Pelikan-2 model has approximately **764M parameters**. The architecture is custom and identified in the configuration as `Pelikan2ForPromptableMaskGeneration`.
| Property | Value |
| --- | --- |
| Model name | `Pelikan-2` |
| Developer | ZundTeam |
| Model type | Promptable mask generation |
| Architecture family | Pelikan Vision Transformer |
| Parameter count | ~764M |
| Image size | 512px |
| Patch size | 16 |
| Hidden size | 1280 |
| Encoder depth | 33 layers |
| Decoder type | Promptable transformer mask decoder |
| Output | Candidate masks with IoU-quality head |
| Format | PyTorch / safetensors |
Pelikan-2 is designed for image-space segmentation. Images are split into visual patches, encoded by a transformer, and decoded into dense masks. Prompt embeddings steer the decoder toward the selected object or region.
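As a quick sanity check on the table above, the size of the encoder's token grid follows from the image size and patch size. The sketch below assumes a square input and non-overlapping patches, neither of which is stated explicitly in the configuration summary; `token_grid` is an illustrative helper, not part of the release.

```python
def token_grid(image_size: int = 512, patch_size: int = 16) -> tuple[int, int]:
    """Return (tokens per side, total tokens) for a square input.

    Assumes non-overlapping patches, so each side holds image_size / patch_size tokens.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


side, total = token_grid()
print(side, total)  # 32 1024
```

Under these assumptions a 512px input yields a 32 by 32 grid, so each decoded mask has to be upsampled by a factor of 16 to reach input resolution.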
The model includes a quality head that predicts an IoU-style confidence value for each mask token. This allows downstream systems to rank candidate masks, pick the best result, or expose multiple possible masks to a user.
The model is built around 512px inputs. For larger images, a practical pipeline can resize the full image, crop around the prompted region, or run Pelikan-2 over tiles depending on the target application. The mask output can then be resized back to the original image resolution and optionally refined with post-processing.
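For the tiling option, one possible way to lay out tiles over a large image is sketched below. The 64px overlap and the helper names `tile_origins` and `tile_boxes` are assumptions made for illustration; the repository does not prescribe a tiling scheme.

```python
def tile_origins(length: int, tile: int = 512, overlap: int = 64) -> list[int]:
    """Start offsets of tiles that together cover [0, length) along one axis."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    if length <= tile:
        return [0]  # a single tile; the caller pads if the image is smaller
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int = 512, overlap: int = 64):
    """Return (x0, y0, x1, y1) tile windows in row-major order."""
    return [
        (x, y, x + tile, y + tile)
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


print(tile_origins(1024))  # [0, 448, 512]
```

Prompts then need to be shifted into each tile's coordinate frame, and overlapping mask regions merged by whatever rule suits the application.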
Pelikan-2 emits raw mask logits instead of only hard binary masks. This gives applications control over thresholding. A stricter threshold can produce cleaner regions with fewer uncertain pixels, while a softer threshold can preserve more of the target shape around difficult boundaries.
## Architecture
Pelikan-2 uses a custom transformer-based promptable segmentation architecture with four main parts:
- **Image encoder:** converts an RGB image into a dense grid of visual tokens.
- **Prompt encoder:** turns a bounding box and foreground point into dense prompt embeddings.
- **Mask decoder:** lets learned mask tokens attend over image features and prompt-conditioned representations.
- **Mask heads:** project decoder outputs into image-space segmentation masks.
The image encoder uses patch embedding followed by stacked self-attention and feed-forward layers. This gives the model global scene context while preserving a spatial token grid for downstream mask prediction.
The prompt encoder gives the model a direct representation of user intent. Box coordinates and foreground point coordinates are embedded into the same hidden space as the visual features. These prompt signals guide the decoder toward the requested region.
The mask decoder uses learned mask tokens and an IoU token. The mask tokens attend to the encoded visual features, then small per-mask heads convert the decoded token features into segmentation masks. A lightweight upsampling path restores spatial detail from the transformer grid to the mask output.
This design keeps Pelikan-2 focused on direct visual segmentation. It is not an image classifier, object detector, captioning system, text-to-image model, or general multimodal assistant. It is a dedicated promptable mask-generation model.
## Design Goals
Pelikan-2 was designed around four practical goals.
First, it should be promptable. The model should respond to visual hints rather than only producing a fixed semantic map. This makes it useful for tools where humans or other models decide what object should be segmented.
Second, it should preserve an image-space workflow. The model predicts segmentation masks, not captions, object labels, or textual descriptions. Its role is to turn visual intent into a spatial mask that can be inspected, edited, or reused.
Third, it should be modular. The encoder, prompt encoder, decoder, and mask heads are separate conceptual parts. This makes the architecture easier to study, extend, and integrate into larger systems.
Fourth, it should remain research-friendly. The repository uses PyTorch/safetensors artifacts, a clear configuration file, and a direct model card so developers can build loaders, experiment with prompt formats, or add their own inference wrappers.
## Input and Output Behavior
Pelikan-2 expects an RGB image and prompt coordinates in the image coordinate space used by the loader. A typical application normalizes the image, resizes it to the configured input size, converts box and point prompts into the same coordinate frame, and passes all tensors into the model.
The main image input is a batch of RGB tensors. The prompt input contains a rectangular box and at least one foreground point. The output contains candidate mask logits and a mask-quality prediction. Applications can select the highest-quality candidate, expose all candidates to a user, or combine candidate masks with their own heuristics.
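Selecting the highest-quality candidate can be as simple as sorting by the predicted quality score. The sketch below uses NumPy arrays and assumes the decoder returns `K` mask logits of shape `(K, H, W)` together with `K` quality scores; the actual output layout depends on the loader implementation.

```python
import numpy as np


def select_best_mask(mask_logits: np.ndarray, quality: np.ndarray):
    """Rank candidate masks by predicted quality.

    mask_logits: (K, H, W) raw logits, quality: (K,) IoU-style scores.
    Returns (best index, best mask logits, indices sorted best-first).
    """
    if mask_logits.shape[0] != quality.shape[0]:
        raise ValueError("one quality score is required per candidate mask")
    order = np.argsort(-quality)  # descending by score
    best = int(order[0])
    return best, mask_logits[best], order


logits = np.zeros((3, 4, 4), dtype=np.float32)
scores = np.array([0.2, 0.9, 0.5], dtype=np.float32)
best, best_logits, order = select_best_mask(logits, scores)
print(best, order.tolist())  # 1 [1, 2, 0]
```

An interactive tool could use `order` to present all candidates best-first instead of keeping only the top one.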
The output mask is best treated as a probability-like spatial field before thresholding. Downstream code can apply a sigmoid, choose a threshold, remove tiny disconnected regions, fill small holes, smooth contours, or crop the final result back into the original image coordinate system.
## Result Examples
The examples below show Pelikan-2 operating on image-and-prompt inputs. Each preview shows the relationship between the visual prompt, the target region, and the predicted mask overlay.
<p align="center">
<img src="asets/preview_01.png" alt="Pelikan-2 result preview 1" width="100%">
</p>
<p align="center">
<img src="asets/preview_02.png" alt="Pelikan-2 result preview 2" width="100%">
</p>
<p align="center">
<img src="asets/preview_03.png" alt="Pelikan-2 result preview 3" width="100%">
</p>
<p align="center">
<img src="asets/preview_04.png" alt="Pelikan-2 result preview 4" width="100%">
</p>
<p align="center">
<img src="asets/preview_05.png" alt="Pelikan-2 result preview 5" width="100%">
</p>
## Data
Pelikan-2 was trained on a mixed segmentation data pool containing scene parsing, foreground object masks, and object-level segmentation examples. The training format converts segmentation annotations into promptable mask-generation samples.
Each training example contains:
- an RGB image
- a target mask
- a box prompt derived from the target region
- a foreground point sampled from the target mask
- a dense mask target for image-space supervision
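To make the sample format concrete, the sketch below derives a box prompt and a foreground point from a binary target mask. It assumes a tight bounding box and uniform sampling over foreground pixels; the actual training code may jitter boxes or sample points differently, and `prompts_from_mask` is a hypothetical helper.

```python
import numpy as np


def prompts_from_mask(mask: np.ndarray, rng: np.random.Generator):
    """Derive (box, point) prompts from a 2D binary mask.

    box is (x0, y0, x1, y1) with exclusive maxima; point is (x, y).
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask has no foreground pixels")
    box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)
    i = int(rng.integers(ys.size))  # uniform choice among foreground pixels
    point = (int(xs[i]), int(ys[i]))
    return box, point


mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True
box, point = prompts_from_mask(mask, np.random.default_rng(0))
print(box)  # (3, 2, 7, 5)
```

Because the point is drawn from the mask itself, it always lands on the target, which matches the "foreground point" role described above.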
The data mixture is intended to expose the model to varied object shapes, indoor and outdoor scenes, semantic regions, foreground objects, texture changes, and real-world boundaries. The goal is to teach prompt-conditioned mask prediction rather than fixed-label classification.
The promptable format is important. Instead of only learning that a pixel belongs to a named semantic category, the model learns a conditional task: given this image and this prompt, recover the selected region. This better matches annotation, editing, and object-selection workflows where the user intent changes from one prompt to the next.
The training setup also encourages the model to learn both coarse localization and boundary recovery. The box prompt gives global context for where to look, while the mask target teaches the decoder to recover the selected region at the pixel level.
## Intended Use
Pelikan-2 is intended for:
- promptable image segmentation
- mask-generation research
- visual annotation tools
- object and region selection workflows
- interactive segmentation systems
- segmentation-assisted image editing
- dataset labeling and mask bootstrapping
- transformer-based vision experiments
- SAM-style visual prompting research
Pelikan-2 can be used wherever a project needs a learned model that receives an image-space prompt and returns a segmentation mask. It is especially relevant for pipelines where another component proposes a box or point and Pelikan-2 refines that signal into a usable mask.
Example application areas include interactive image editors, semi-automatic dataset labeling, object extraction tools, visual search interfaces, robotics perception research, scene understanding experiments, and segmentation-assisted media workflows.
Pelikan-2 can also be useful as a mask proposal model. A downstream application can generate several possible prompts, run Pelikan-2 for each prompt, and keep the masks that match the desired shape, size, confidence, or region constraints.
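A proposal filter along these lines might look like the following sketch. The thresholds (`min_quality`, `min_area`, `max_area_fraction`) are illustrative defaults chosen for the example, not values recommended by the model authors.

```python
import numpy as np


def keep_proposal(
    mask: np.ndarray,
    quality: float,
    min_quality: float = 0.7,
    min_area: int = 64,
    max_area_fraction: float = 0.5,
) -> bool:
    """Decide whether a binary candidate mask passes simple constraints."""
    area = int(np.count_nonzero(mask))
    max_area = max_area_fraction * mask.size
    return bool(quality >= min_quality and min_area <= area <= max_area)


candidate = np.zeros((32, 32), dtype=bool)
candidate[4:14, 4:14] = True  # 100 foreground pixels
print(keep_proposal(candidate, quality=0.9))  # True
print(keep_proposal(candidate, quality=0.3))  # False
```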
## Usage
This repository contains the model weights and configuration for Pelikan-2. A compatible implementation should construct `Pelikan2ForPromptableMaskGeneration` using `config.json`, then load `model.safetensors`.
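```python
from safetensors.torch import load_file

# Load the raw weights as a dict mapping parameter names to tensors.
# The model class itself must come from a compatible custom implementation.
state_dict = load_file("model.safetensors")
```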
At a high level, a Pelikan-2 inference pipeline should:
1. Prepare an RGB image at the expected input size.
2. Normalize and batch the image tensor.
3. Provide a visual prompt such as a bounding box and foreground point.
4. Run the Pelikan image encoder.
5. Run the promptable mask decoder.
6. Select the best candidate mask using the quality head.
7. Resize or post-process the selected mask for the application.
The exact preprocessing, threshold, prompt coordinate format, and post-processing depend on the surrounding system.
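As one possible version of steps 1 and 2, the sketch below turns an already resized RGB array into a normalized, batched tensor layout. The mean and standard deviation of 0.5 are placeholders: this model card does not state the normalization statistics, so the real values must come from the loader or training code.

```python
import numpy as np


def preprocess(
    image_hwc: np.ndarray,
    mean=(0.5, 0.5, 0.5),  # placeholder statistics, not taken from the release
    std=(0.5, 0.5, 0.5),
) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB image to a (1, 3, H, W) float32 array."""
    if image_hwc.ndim != 3 or image_hwc.shape[2] != 3:
        raise ValueError("expected an (H, W, 3) RGB array")
    x = image_hwc.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
    return np.transpose(x, (2, 0, 1))[None, ...]  # HWC -> CHW, then add batch dim


batch = preprocess(np.zeros((512, 512, 3), dtype=np.uint8))
print(batch.shape, batch.dtype)  # (1, 3, 512, 512) float32
```

The resulting array can be wrapped with `torch.from_numpy` before being passed to a PyTorch model.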
An implementation should keep the image tensor and prompt coordinates aligned. If the image is resized from an original resolution into 512px model space, the box and point should be transformed by the same scale. After prediction, the mask can be resized back to the original image size.
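The coordinate alignment can be sketched as follows. This version assumes the image is resized by a single scale factor so that its longest side becomes 512px, with any remaining area padded on the right and bottom; a loader that stretches each axis independently would need separate x and y factors. All function names are illustrative.

```python
def prompt_scale(orig_size, model_size: int = 512) -> float:
    """Scale factor mapping the longest original side to model_size."""
    width, height = orig_size
    return model_size / max(width, height)


def to_model_space(box, point, scale: float):
    """Map an (x0, y0, x1, y1) box and an (x, y) point into model space."""
    return tuple(v * scale for v in box), tuple(v * scale for v in point)


def to_original_space(box, point, scale: float):
    """Inverse of to_model_space."""
    return tuple(v / scale for v in box), tuple(v / scale for v in point)


scale = prompt_scale((1024, 768))
box, point = to_model_space((100, 200, 300, 400), (200, 300), scale)
print(scale, box, point)  # 0.5 (50.0, 100.0, 150.0, 200.0) (100.0, 150.0)
```

The same `scale` is what a pipeline would invert when resizing the predicted mask back to the original resolution.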
For user-facing tools, a good default interaction is to let the user draw a box around the target and click once inside it. For automatic systems, boxes may come from an object detector and points may be sampled from the center of the detected region or from a confidence map.
## Post-Processing Suggestions
Pelikan-2 outputs mask logits so applications can decide how strict the final mask should be. A simple pipeline can apply `sigmoid`, threshold the mask, and resize it to the original image resolution.
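A minimal sketch of the sigmoid-and-threshold step is shown below, using NumPy for brevity. The function name and the clipping range are choices made for this example; resizing is left to the surrounding application.

```python
import numpy as np


def logits_to_mask(logits: np.ndarray, threshold: float = 0.5):
    """Return (probabilities, binary mask) for raw mask logits."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    # Clip before exponentiating to avoid overflow warnings on extreme logits.
    z = np.clip(np.asarray(logits, dtype=np.float64), -60.0, 60.0)
    probs = 1.0 / (1.0 + np.exp(-z))
    return probs, probs >= threshold


logits = np.array([[-2.0, 0.1], [1.5, 4.0]])
_, soft = logits_to_mask(logits, threshold=0.5)
_, strict = logits_to_mask(logits, threshold=0.9)
print(soft.tolist())    # [[False, True], [True, True]]
print(strict.tolist())  # [[False, False], [False, True]]
```

The two thresholds illustrate the trade-off: the stricter setting drops the uncertain pixels that the softer one keeps.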
For cleaner visual results, applications may add lightweight post-processing. Common steps include removing tiny disconnected components, filling small holes, smoothing boundaries, keeping only the largest connected component inside the prompted box, or snapping the mask to a crop around the target region.
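One of these steps, keeping only the largest connected component, is sketched below with a plain breadth-first search so that it needs nothing beyond NumPy. It uses 4-connectivity, which is an assumption; for large masks a library routine such as `scipy.ndimage.label` is usually a faster choice.

```python
from collections import deque

import numpy as np


def largest_component(mask: np.ndarray) -> np.ndarray:
    """Keep only the largest 4-connected foreground region of a 2D mask."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    best = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue  # pixel already belongs to a visited region
        seen[sy, sx] = True
        queue, region = deque([(sy, sx)]), []
        while queue:
            y, x = queue.popleft()
            region.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(region) > len(best):
            best = region
    out = np.zeros((h, w), dtype=bool)
    if best:
        ys, xs = zip(*best)
        out[list(ys), list(xs)] = True
    return out


noisy = np.zeros((6, 6), dtype=bool)
noisy[0:2, 0:2] = True  # main region, 4 pixels
noisy[4, 4] = True      # isolated speck
print(int(largest_component(noisy).sum()))  # 4
```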
For annotation workflows, it is often better to preserve uncertainty rather than aggressively clean the mask. A softer threshold can give a human annotator more boundary information to correct. For automated object extraction, a stricter threshold may be preferred.
## Integration Notes
Pelikan-2 is best treated as a segmentation component. A complete product may pair it with:
- a UI for point and box prompts
- a detector that proposes boxes automatically
- a tracker that propagates prompts across frames
- mask post-processing such as thresholding, hole filling, or contour smoothing
- human review for annotation workflows
The model outputs raw mask logits. Applications should choose a mask threshold, resize strategy, and optional cleanup step depending on their visual quality requirements.
The model can sit behind a small service that accepts an image, a box, and a point, then returns mask candidates. It can also be embedded directly inside a desktop annotation tool or batch-processing pipeline.
For repeated prompts on the same image, a serving implementation can cache the image encoder output and run only the prompt encoder and mask decoder for each new prompt. This follows the same basic interaction pattern used by promptable segmentation systems and can make interactive usage more responsive.
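The caching idea can be sketched independently of the model code. In the example below, `encode_image` and `decode_masks` are hypothetical callables standing in for the image encoder and for the prompt encoder plus mask decoder; the stand-in functions exist only to show that the encoder runs once per image.

```python
class CachedSegmenter:
    """Reuse encoder output across prompts on the same image."""

    def __init__(self, encode_image, decode_masks):
        self._encode = encode_image  # hypothetical: image -> features
        self._decode = decode_masks  # hypothetical: (features, box, point) -> output
        self._key = None
        self._features = None

    def set_image(self, key, image):
        """Encode the image only if the cache holds a different key."""
        if self._features is None or key != self._key:
            self._features = self._encode(image)
            self._key = key

    def predict(self, box, point):
        if self._features is None:
            raise RuntimeError("call set_image() before predict()")
        return self._decode(self._features, box, point)


# Stand-ins that count encoder calls instead of running a real model.
calls = {"encode": 0}


def fake_encode(image):
    calls["encode"] += 1
    return {"features_for": image}


def fake_decode(features, box, point):
    return {"box": box, "point": point}


seg = CachedSegmenter(fake_encode, fake_decode)
seg.set_image("img-1", "pixels")
seg.predict((10, 10, 100, 100), (50, 50))
seg.set_image("img-1", "pixels")  # same key: encoder is not run again
seg.predict((20, 20, 80, 80), (40, 40))
print(calls["encode"])  # 1
```

A serving layer would typically key the cache on an image hash or upload ID and bound its size.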
## Practical Prompting Tips
Pelikan-2 works best when the prompt clearly identifies the target region. A box should cover the object or region without including too much unrelated context. A point should be placed inside the desired foreground region rather than near the boundary.
If the first mask is too broad, use a tighter box. If the mask selects the wrong region inside the box, move the foreground point closer to the center of the intended object. If the target object is small, crop around the region before resizing to the model input size.
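Cropping around a small target before resizing could be done roughly as follows. The 25% margin is an arbitrary illustrative value, and `crop_window` and `shift_prompts` are hypothetical helpers; the key point is that the prompts must be shifted into the crop's coordinate frame.

```python
import math


def crop_window(box, image_size, margin: float = 0.25):
    """Expand an (x0, y0, x1, y1) box by a fractional margin, clamped to the image."""
    x0, y0, x1, y1 = box
    width, height = image_size
    mx = (x1 - x0) * margin
    my = (y1 - y0) * margin
    return (
        max(0, math.floor(x0 - mx)),
        max(0, math.floor(y0 - my)),
        min(width, math.ceil(x1 + mx)),
        min(height, math.ceil(y1 + my)),
    )


def shift_prompts(box, point, window):
    """Express box and point relative to the crop window's top-left corner."""
    ox, oy = window[0], window[1]
    x0, y0, x1, y1 = box
    return (x0 - ox, y0 - oy, x1 - ox, y1 - oy), (point[0] - ox, point[1] - oy)


window = crop_window((100, 100, 140, 120), image_size=(1000, 800))
box, point = shift_prompts((100, 100, 140, 120), (120, 110), window)
print(window, box, point)  # (90, 95, 150, 125) (10, 5, 50, 25) (30, 15)
```

After prediction, the mask is pasted back at the window's offset in the original image.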
For objects with thin parts, reflective surfaces, or complex boundaries, post-processing and human review may be useful. The model can provide a strong starting mask even when the final production mask needs cleanup.
## Model Line
Pelikan-2 is the base release of the Pelikan promptable segmentation line. The model establishes the custom architecture, prompt format, image-space mask objective, and PyTorch/safetensors release structure.
Future Pelikan releases can build on the same design with longer training, larger data mixtures, higher-resolution adaptation, stronger decoders, improved prompt handling, or more mask candidates. The naming is intended to leave room for incremental releases such as Pelikan-2.1 and larger upgrades such as Pelikan-2.5.
## Scope
Pelikan-2 is one part of a larger segmentation system. It does not include a graphical annotation interface, web server, prompt editor, dataset browser, model-serving wrapper, or production moderation layer.
The model is best suited for researchers and developers who are comfortable integrating raw PyTorch/safetensors weights into custom computer-vision systems. It can also serve as a reference point for experiments around promptable segmentation, custom mask decoders, and transformer-based visual prompting.
## Limitations
- Pelikan-2 requires a compatible custom loader implementation.
- Output quality depends on prompt quality, image distribution, thresholding, and post-processing.
- Ambiguous prompts can produce ambiguous masks.
- Thin structures, transparent objects, tiny objects, and unusual boundaries may require additional refinement.
- The model should be tested on target data before deployment.
- It is not intended for safety-critical medical, legal, biometric, or identity-sensitive segmentation without independent expert review.
## Citation
```bibtex
@misc{zundteam_2026_pelikan2,
title = {Pelikan-2: Promptable Mask Generation Model},
author = {Ill-Ness, Martin Frick},
year = {2026},
url = {https://huggingface.co/ZundTeam/Pelikan-2}
}
```