PLaMo 2.1-2B-VL

Model Description

PLaMo 2.1-2B-VL is a vision-language model developed by Preferred Networks, Inc., based on the instruction-tuned PLaMo 2.1-2B and trained on English and Japanese datasets.

While the 8B model (pfnet/plamo-2.1-8b-vl) is the main offering, as supported by the performance comparisons below, this smaller, lighter 2B model is released as an easy option to try, although its use cases are more limited.

PLaMo 2.1-2B-VL was developed with a focus on autonomous devices such as drones, robots, vehicles, and surveillance cameras, with the aim of creating a high-accuracy, lightweight VLM that runs efficiently on edge hardware.

The model emphasizes the following two fundamental capabilities:

  • VQA (Visual Question Answering): the ability to take an image and text as input and return a natural-language response
  • Visual Grounding: the ability to identify in an image the person/object indicated by a text instruction (including REC: Referring Expression Comprehension)

Building on these, the model targets advancing the following real-world deployments:

  • Task analysis: recognizing tools in order to analyze workers’ tasks in factories.
  • Anomaly detection: detecting anomalies in images from cameras mounted on security patrol drones and surveillance cameras.

We adopt a proven, standard architecture similar to LLaVA, with the Samba-based PLaMo 2 as the LLM, and SigLIP2 as the image encoder. Both are connected via a simple MLP image adapter. Dynamic image resolutions are handled via dynamic tiling similar to NVIDIA's Eagle 2.
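The dynamic-tiling step can be sketched in a simplified form. This is an illustration only, not the model's actual preprocessing (which is implemented in processing_plamo2_vl.py); the tile budget and grid-selection rule below are assumptions:

```python
def choose_grid(width: int, height: int, max_tiles: int = 12) -> tuple[int, int]:
    """Pick a (cols, rows) tile grid whose aspect ratio best matches the image.

    Simplified sketch of Eagle 2 / InternVL-style dynamic tiling: among all
    grids within the tile budget, prefer the aspect ratio closest to the
    image's, breaking ties toward more tiles (more detail).
    """
    target = width / height
    best, best_err = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(cols / rows - target)
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best  # the image is then resized to (cols*tile_px, rows*tile_px) and cut into tiles

print(choose_grid(1920, 1080))  # a 16:9 image -> (4, 2)
```

In practice a downscaled "thumbnail" view of the whole image is typically encoded alongside the tiles so the model keeps global context.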

PLaMo 2.1-2B-VL is released under the PLaMo community license. Please review the license below and agree to it before downloading.

Scope and Limitations

PLaMo 2.1-2B-VL is trained primarily for VQA and Visual Grounding on natural images. Accordingly, it has not been optimized for OCR tasks, and its ability to accurately read text in images or to understand document images, charts, tables, and mathematical expressions remains limited at this stage. The model is also restricted to single-image input and does not support tasks that assume the simultaneous processing of multiple images. Furthermore, although the model has a certain degree of general everyday knowledge, it is not intended to serve as a source of deep domain-specific expertise. For applications that require specialized judgment, it should therefore be used in conjunction with additional fine-tuning, external knowledge sources, or other supporting systems.

PLaMo 2.1-2B-VL is trained under the assumption that inputs are provided as image-text pairs. Therefore, it may not function properly when given only images or only text. It is also not designed for use cases where both an image and text are provided but the model is instructed to ignore the image.

For commercial users

Please review the PLaMo community license and contact us via the form below before using the model for commercial purposes.

Benchmark Results

As a single generalist model, PLaMo 2.1-2B-VL excels in visual reasoning, grounding, and specialized real-world applications across the following benchmarks.

| Model | JA-VG-VQA-500 (ROUGE-L) | JA-VG-VQA-500 (LLM-as-a-judge) | JA-VG-VQA-500 (English Likert, LLM judge) | Ref-L4 (Acc @ IoU > 0.5) | Ja-Ref-L4 (Acc @ IoU > 0.5) | Task analysis (Accuracy) | Anomaly detection (F1) |
|---|---|---|---|---|---|---|---|
| PLaMo 2.1-2B-VL | 60.7 | 71.6 | 4.41 | 83.51 | 82.39 | 40.9 | 38.0 |
| PLaMo 2.1-8B-VL | 61.5 | 72.4 | 4.37 | 86.8 | 85.2 | 53.9 | 39.3 |
| Qwen2.5-VL-7B | 9.9 | 44.2 | 3.094 | 83.1 | 76.9 | 27.6 | 2.5 |
| Qwen3-VL-8B | 41.6 | 60.4 | 4.06 | 84.1 | 80.6 | 38.3 | 6.1 |
| Qwen3-VL-235B | * | * | * | 86 | 81.6 | 45.8 | 25.1 |
| Asagi-14B | 56.8 | 70.6 | 4.05 | ** | ** | ** | ** |

(*) Qwen3-VL-235B-A22B-Instruct tends to generate more specific responses on JA-VG-VQA-500 than the benchmark expects, resulting in unfairly low scores; it is therefore excluded from that benchmark.
(**) Asagi-14B was not designed for these benchmark tasks and is therefore excluded from them.

These results highlight PLaMo 2.1-2B-VL's readiness for real-world deployment. Key takeaways include:

  • Multilingual grounding: High accuracy in both English and Japanese visual grounding (Ref-L4).
  • Generalist capability: Exceptional performance on specialized downstream tasks like anomaly detection and task analysis without needing task-specific fine-tuning.

Requirements

torch==2.8.0
transformers==4.57.1
pillow==12.1.1
mamba_ssm==2.3.0
causal_conv1d==1.6.0
numba==0.64.0
numpy==1.26.4
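The pins above can be saved to a requirements file and installed in one step. Note that mamba_ssm and causal_conv1d build CUDA extensions, so a matching CUDA toolchain is typically required:

```shell
# Write the pinned dependencies (copied from the list above) to requirements.txt
cat > requirements.txt <<'EOF'
torch==2.8.0
transformers==4.57.1
pillow==12.1.1
mamba_ssm==2.3.0
causal_conv1d==1.6.0
numba==0.64.0
numpy==1.26.4
EOF

# Then install (torch is listed first; mamba_ssm/causal_conv1d need it at build time):
# pip install -r requirements.txt
```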

Files

| File | Description |
|---|---|
| modeling_plamo2.py | Base PLaMo2 LLM architecture |
| modeling_plamo2_vl.py | Vision-language model wrapper |
| processing_plamo2_vl.py | Processor: image tiling, preprocessing, and tokenization |
| tokenization_plamo.py | PLaMo2 tokenizer |

Usage

import torch
import PIL.Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the model in bfloat16 on the first GPU; trust_remote_code is required
# because the model uses custom PLaMo2-VL code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    "pfnet/plamo-2.1-2b-vl", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="cuda:0"
)
processor = AutoProcessor.from_pretrained("pfnet/plamo-2.1-2b-vl", trust_remote_code=True)

# Preprocess an image-text pair and generate a response.
image = PIL.Image.open("/path/to/image.jpg")
inputs = processor(text="Describe this image.", images=[image]).to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens (everything after the prompt).
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])

Example outputs

VQA

VQA example

Photo: Pixabay, "Fruit Mart"; source: Stockvault; license: Creative Commons – CC0; URL: https://www.stockvault.net/photo/200223/adler32

============================================================
Prompt: What kind of item can be seen stacked in the front of the image?
============================================================
Output: A stack of green watermelons can be seen stacked in the front of the image.
============================================================

Visual grounding

VG example

Photo: Pixabay (via PICRYL); license: CC0 1.0 Universal (Public Domain Dedication); URL: https://picryl.com/media/highway-construction-site-valley-bridge-crash-dc08bd; note: bounding boxes have been added for this README.

============================================================
Prompt: Detect the crane near the excavator.
============================================================
Output: [0.739,0.193,0.906,0.717]
============================================================

The output bounding box coordinates are in (xmin, ymin, xmax, ymax) format in range [0, 1] relative to the input image size.
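Since the box is normalized, mapping it back to pixels is a simple scaling. A small helper (the 1280×720 image size here is just an example):

```python
def to_pixel_box(norm_box, width, height):
    """Scale a normalized (xmin, ymin, xmax, ymax) box to pixel coordinates."""
    xmin, ymin, xmax, ymax = norm_box
    return (round(xmin * width), round(ymin * height),
            round(xmax * width), round(ymax * height))

print(to_pixel_box((0.739, 0.193, 0.906, 0.717), 1280, 720))  # -> (946, 139, 1160, 516)
```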

Object detection

OVD example

Photo: Kamyar Adl, "Family Ride bicycle cycle trailer"; source: Wikimedia Commons (original image published on Flickr); license: Creative Commons Attribution 2.0 Generic (CC BY 2.0, https://creativecommons.org/licenses/by/2.0/); URL: https://commons.wikimedia.org/wiki/File:Family_Ride_bicycle_cycle_trailer.jpg; note: bounding boxes have been added for this README.

============================================================
Prompt: Detect any bicycle helmet.
============================================================
Output: bicycle helmet[0.217,0.499,0.278,0.555]
bicycle helmet[0.320,0.458,0.378,0.515]
bicycle helmet[0.552,0.355,0.608,0.415]

The output bounding box coordinates are in (xmin, ymin, xmax, ymax) format in range [0, 1] relative to the input image size.
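For downstream use, the label[xmin,ymin,xmax,ymax] lines can be parsed into structured detections. A minimal sketch (the exact output format may vary, so treat the regex as an assumption):

```python
import re

_DET = re.compile(r"(.+?)\[([0-9.]+),([0-9.]+),([0-9.]+),([0-9.]+)\]")

def parse_detections(output: str) -> list[tuple[str, tuple[float, float, float, float]]]:
    """Parse 'label[xmin,ymin,xmax,ymax]' lines into (label, box) pairs."""
    detections = []
    for line in output.strip().splitlines():
        m = _DET.fullmatch(line.strip())
        if m:
            label = m.group(1).strip()
            box = tuple(float(v) for v in m.groups()[1:])
            detections.append((label, box))
    return detections

sample = """bicycle helmet[0.217,0.499,0.278,0.555]
bicycle helmet[0.320,0.458,0.378,0.515]
bicycle helmet[0.552,0.355,0.608,0.415]"""
print(parse_detections(sample)[0])  # ('bicycle helmet', (0.217, 0.499, 0.278, 0.555))
```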

Model Details

  • Base LLM model size: 2B
  • Trained tokens: 2.99B
  • Developed by: Preferred Networks, Inc.
  • Model type: Causal decoder-only
  • Language(s): English, Japanese
  • License: PLaMo community license

Training Method

We trained PLaMo 2.1-2B-VL in two stages: stage 1.0 with 0.58B tokens and stage 1.5 with 2.41B tokens.

Tech Blog

Bias, Risks, and Limitations

PLaMo 2.1-2B-VL is a new technology that carries risks with use. Testing conducted to date has been in English and Japanese, and has not covered, nor could it cover all scenarios. For these reasons, as with all LLMs, PLaMo 2.1-2B-VL’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of PLaMo 2.1-2B-VL, developers should perform safety testing and tuning tailored to their specific applications of the model.

Acknowledgement

This model was developed under the "GENIAC (Generative AI Accelerator Challenge)" project, implemented by the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO), with the aim of strengthening Japan's development capabilities in generative AI.

AI policies for Preferred Networks, Inc. group

Safetensors model size: 3B params (tensor types: F32, BF16)