	# Add vision processing components

	Adding a vision model requires image or video processing components on top of the standard [modular](./modular_transformers) approach. Image-only models need image processors and video models need a video processor, both of which are accessible behind the [AutoImageProcessor](/docs/transformers/main/en/model_doc/auto#transformers.AutoImageProcessor) and [AutoVideoProcessor](/docs/transformers/main/en/model_doc/auto#transformers.AutoVideoProcessor) entry points.

	> [!NOTE]
	> For the modeling and config steps, follow the [modular](./modular_transformers) guide first.

	## Image processors

	Create image processors when the model consumes images. The [torchvision](https://docs.pytorch.org/vision/stable/index.html) backend is the default and supports GPU acceleration. [PIL](https://pillow.readthedocs.io/en/stable/index.html) is the fallback when torchvision isn't available.

	Both image processor classes share the same preprocessing logic but have different backends. Their constructor signatures and default values must be identical. [AutoImageProcessor.from_pretrained()](/docs/transformers/main/en/model_doc/auto#transformers.AutoImageProcessor.from_pretrained) selects the backend at load time and falls back to PIL when torchvision isn't available. Mismatched signatures cause the same saved config to behave differently across environments.
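
One way to catch mismatched defaults early is to compare the class-level attributes of the two backend classes. The following is a minimal sketch that uses plain stand-in classes rather than the real processors; `public_defaults`, `diff_defaults`, and both `StandIn...` classes are hypothetical names introduced only for illustration.

```python
def public_defaults(cls):
    """Collect public, non-callable class attributes from cls and its bases."""
    defaults = {}
    # Walk the MRO from base to subclass so subclasses override their parents.
    for klass in reversed(cls.__mro__):
        for name, value in vars(klass).items():
            if name.startswith("_") or callable(value):
                continue
            if isinstance(value, (property, classmethod, staticmethod)):
                continue
            defaults[name] = value
    return defaults


def diff_defaults(cls_a, cls_b):
    """Return {name: (value_a, value_b)} for every default that differs."""
    a, b = public_defaults(cls_a), public_defaults(cls_b)
    return {
        name: (a.get(name, "<missing>"), b.get(name, "<missing>"))
        for name in sorted(a.keys() | b.keys())
        if a.get(name, "<missing>") != b.get(name, "<missing>")
    }


# Hypothetical stand-ins for a torchvision-backed and a PIL-backed processor.
class StandInTorchvisionProcessor:
    size = {"shortest_edge": 224}
    do_resize = True
    do_normalize = True


class StandInPilProcessor:
    size = {"shortest_edge": 256}  # deliberately out of sync
    do_resize = True
    do_normalize = True


mismatches = diff_defaults(StandInTorchvisionProcessor, StandInPilProcessor)
print(mismatches)
```

An empty result means the two classes agree on every public default; any entry points at an attribute that would make the same saved config behave differently depending on the installed backend.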

	### torchvision

	Create `image_processing_<model_name>.py` with a class that inherits from [TorchvisionBackend](/docs/transformers/main/en/main_classes/image_processor#transformers.TorchvisionBackend). If your processor needs custom parameters beyond the standard [ImagesKwargs](/docs/transformers/main/en/main_classes/processors#transformers.ImagesKwargs), define a kwargs class.

	```py
	from ...image_processing_backends import TorchvisionBackend
	from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
	from ...processing_utils import ImagesKwargs, Unpack
	from ...utils import auto_docstring

	class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
	    tile_size: int  # any model-specific kwargs

	@auto_docstring
	class MyModelImageProcessor(TorchvisionBackend):
	    resample = PILImageResampling.BICUBIC
	    image_mean = OPENAI_CLIP_MEAN
	    image_std = OPENAI_CLIP_STD
	    size = {"shortest_edge": 224}
	    do_resize = True
	    do_rescale = True
	    do_normalize = True
	    do_convert_rgb = True

	    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
	        super().__init__(**kwargs)
	```

	> [!TIP]
	> See [LlavaOnevisionImageProcessor](/docs/transformers/main/en/model_doc/llava_onevision#transformers.LlavaOnevisionImageProcessor) for reference.

	### PIL

	Create `image_processing_pil_<model_name>.py` with a class that inherits from [PilBackend](/docs/transformers/main/en/main_classes/image_processor#transformers.PilBackend). Duplicate the kwargs class here instead of importing it from the torchvision file, because that import can fail when torchvision isn't installed. Add an `# Adapted from` comment so the two copies stay in sync. For processors with no custom parameters, use [ImagesKwargs](/docs/transformers/main/en/main_classes/processors#transformers.ImagesKwargs) directly.

	```py
	from ...image_processing_backends import PilBackend
	from ...image_utils import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD, PILImageResampling
	from ...processing_utils import ImagesKwargs, Unpack
	from ...utils import auto_docstring

	# Adapted from transformers.models.my_model.image_processing_my_model.MyModelImageProcessorKwargs
	class MyModelImageProcessorKwargs(ImagesKwargs, total=False):
	    tile_size: int  # any model-specific kwargs

	@auto_docstring
	class MyModelImageProcessorPil(PilBackend):
	    resample = PILImageResampling.BICUBIC
	    image_mean = OPENAI_CLIP_MEAN
	    image_std = OPENAI_CLIP_STD
	    size = {"shortest_edge": 224}
	    do_resize = True
	    do_rescale = True
	    do_normalize = True
	    do_convert_rgb = True

	    def __init__(self, **kwargs: Unpack[MyModelImageProcessorKwargs]):
	        super().__init__(**kwargs)
	```

	> [!TIP]
	> See [LlavaOnevisionImageProcessorPil](/docs/transformers/main/en/model_doc/llava_onevision#transformers.LlavaOnevisionImageProcessorPil) for reference.

	## Video processor

	Add a video processor when the model consumes videos or sampled video frames.

	Create `video_processing_<model_name>.py` in the model directory with a class that inherits from [BaseVideoProcessor](/docs/transformers/main/en/main_classes/video_processor#transformers.BaseVideoProcessor). [BaseVideoProcessor](/docs/transformers/main/en/main_classes/video_processor#transformers.BaseVideoProcessor) itself inherits from [TorchvisionBackend](/docs/transformers/main/en/main_classes/image_processor#transformers.TorchvisionBackend) and provides shared decoding, frame sampling, resizing, rescaling, normalization, saving, and loading behavior.

	The class attributes are the default preprocessing values. Users can override them at initialization or call time. Use the same names as [VideosKwargs](/docs/transformers/main/en/main_classes/processors#transformers.VideosKwargs) when possible, such as `size`, `crop_size`, `do_resize`, `do_sample_frames`, `num_frames`, and `fps`.
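
The layering of defaults, initialization overrides, and call-time overrides can be pictured with a small stand-in class. This is only a sketch of the precedence described above, not how `BaseVideoProcessor` is implemented; `StandInProcessor`, `_option_names`, and `resolve` are hypothetical names.

```python
class StandInProcessor:
    """Hypothetical stand-in; not the real BaseVideoProcessor."""

    # Class attributes act as the default preprocessing values.
    num_frames = 16
    do_resize = True
    size = {"shortest_edge": 224}

    _option_names = ("num_frames", "do_resize", "size")

    def __init__(self, **kwargs):
        # Init-time overrides live on the instance and shadow the class defaults.
        self._check(kwargs)
        for name, value in kwargs.items():
            setattr(self, name, value)

    def _check(self, kwargs):
        unknown = sorted(set(kwargs) - set(self._option_names))
        if unknown:
            raise TypeError(f"unknown options: {unknown}")

    def resolve(self, **call_kwargs):
        """Return the effective options: call-time values win over instance values."""
        self._check(call_kwargs)
        resolved = {name: getattr(self, name) for name in self._option_names}
        resolved.update(call_kwargs)
        return resolved


processor = StandInProcessor(num_frames=8)  # override at initialization
at_init = processor.resolve()
at_call = processor.resolve(num_frames=4)   # override for a single call
print(at_init["num_frames"], at_call["num_frames"], StandInProcessor.num_frames)
```

The call-time override applies to that call only; the instance keeps its initialization value and the class keeps its default.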

	Define a kwargs class if your video processor needs custom parameters beyond the standard [VideosKwargs](/docs/transformers/main/en/main_classes/processors#transformers.VideosKwargs). Set it as `valid_kwargs` and use it to annotate `__init__` for both runtime validation and the auto-generated docstring.

	```py
	from ...processing_utils import Unpack, VideosKwargs
	from ...utils import auto_docstring
	from ...video_processing_utils import BaseVideoProcessor

	class MyModelVideoProcessorKwargs(VideosKwargs, total=False):
	    min_frames: int
	    max_frames: int

	@auto_docstring
	class MyModelVideoProcessor(BaseVideoProcessor):
	    size = {"shortest_edge": 224}
	    crop_size = {"height": 224, "width": 224}
	    do_resize = True
	    do_center_crop = True
	    do_normalize = True
	    do_sample_frames = True
	    num_frames = 16
	    model_input_names = ["pixel_values_videos"]
	    valid_kwargs = MyModelVideoProcessorKwargs

	    def __init__(self, **kwargs: Unpack[MyModelVideoProcessorKwargs]):
	        super().__init__(**kwargs)
	```
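
To see why a `total=False` kwargs class is useful for runtime validation, the sketch below checks a set of keyword arguments against the keys declared on a `TypedDict`, including inherited ones. It uses only the standard library and is an illustration of the idea, not the validation code Transformers runs; `StandInVideosKwargs`, `StandInModelKwargs`, and `unknown_keys` are hypothetical names.

```python
from typing import TypedDict, get_type_hints


class StandInVideosKwargs(TypedDict, total=False):
    # Stand-in for a shared kwargs class; total=False makes every key optional.
    size: dict
    num_frames: int


class StandInModelKwargs(StandInVideosKwargs, total=False):
    # Model-specific additions on top of the shared keys.
    min_frames: int
    max_frames: int


def unknown_keys(kwargs, schema):
    """Return the sorted keys in kwargs that the TypedDict schema does not declare."""
    allowed = set(get_type_hints(schema))  # includes keys inherited from base classes
    return sorted(set(kwargs) - allowed)


print(unknown_keys({"min_frames": 4, "fps": 2}, StandInModelKwargs))
print(unknown_keys({"num_frames": 8, "max_frames": 32}, StandInModelKwargs))
```

Because the subclass inherits from the shared kwargs class, the standard keys stay valid alongside the model-specific ones, and anything else can be reported as unexpected.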

	Override [sample_frames()](/docs/transformers/main/en/main_classes/video_processor#transformers.BaseVideoProcessor.sample_frames) only when the model requires a sampling rule that the base uniform sampler can't express. For example, some models enforce a minimum or maximum number of frames, or sample based on model-specific constraints.
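
As an illustration of such a rule, the sketch below picks evenly spaced frame indices and clamps the requested count to a minimum and maximum. It is a standalone function with a made-up signature, not the signature of `sample_frames()`; `sample_frame_indices` and its parameters are assumptions for illustration.

```python
def sample_frame_indices(total_frames, num_frames, min_frames=None, max_frames=None):
    """Pick evenly spaced frame indices, clamping the count to [min_frames, max_frames]."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")

    target = num_frames
    if min_frames is not None:
        target = max(target, min_frames)
    if max_frames is not None:
        target = min(target, max_frames)
    # Never ask for more frames than the video contains.
    target = min(target, total_frames)
    if target < 1:
        raise ValueError("at least one frame must be sampled")
    if target == 1:
        return [0]

    # Spread the indices from the first frame to the last frame.
    step = (total_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


print(sample_frame_indices(10, 4))
print(sample_frame_indices(10, 2, min_frames=4))
print(sample_frame_indices(101, 16, max_frames=5))
```

An override in a real video processor would apply the same kind of clamping before delegating to, or replacing, the base sampling logic.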

	If the model's forward method expects a legacy input name, override `preprocess` and rename the key after calling the base implementation.

	```py
	class MyModelVideoProcessor(BaseVideoProcessor):
	    model_input_names = ["pixel_values"]

	    def preprocess(self, videos, **kwargs):
	        batch = super().preprocess(videos, **kwargs)
	        batch["pixel_values"] = batch.pop("pixel_values_videos")
	        return batch
	```

	Save the video processor with the checkpoint by instantiating it in the conversion script and calling [save_pretrained()](/docs/transformers/main/en/main_classes/video_processor#transformers.BaseVideoProcessor.save_pretrained). If a [ProcessorMixin](/docs/transformers/main/en/main_classes/processors#transformers.ProcessorMixin) wraps the video processor, call the wrapping processor's [save_pretrained()](/docs/transformers/main/en/model_doc/wav2vec2#transformers.Wav2Vec2Processor.save_pretrained) instead. Do not manually create or edit preprocessing config files.

	> [!TIP]
	> See [Qwen3VLVideoProcessor](/docs/transformers/main/en/model_doc/qwen3_vl#transformers.Qwen3VLVideoProcessor) for reference.

	## Register the classes

	Expose the processing classes from the model package `__init__.py`. Follow the lazy import pattern used by nearby models and guard imports with the same optional dependencies required by each backend.
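
The idea behind guarding on optional dependencies can be shown with the standard library alone. The sketch below checks whether a package is importable without importing it and picks a backend name accordingly. It is not the lazy import machinery Transformers uses; `is_available` and `pick_backend` are hypothetical helpers.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_backend(available=is_available):
    """Prefer torchvision, fall back to PIL, and fail clearly if neither is installed."""
    if available("torchvision"):
        return "torchvision"
    if available("PIL"):
        return "pil"
    raise ImportError("neither torchvision nor PIL is installed")


# Inject a fake availability check to exercise the fallback path deterministically.
print(pick_backend(lambda name: name == "PIL"))
```

Passing the availability check as a parameter keeps the fallback logic testable on machines where both packages are installed.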

	Map the new classes to the model config so the `Auto` classes can load them. The generated auto mapping file has a warning at the top. Do not edit it by hand. Add or update the model config, then run:

	```bash
	python utils/check_auto.py --fix_and_overwrite
	```

	After the mapping is generated, verify the model type appears in the relevant mappings in `src/transformers/models/auto/auto_mappings.py`.

	- `IMAGE_PROCESSOR_MAPPING_NAMES` for [AutoImageProcessor](/docs/transformers/main/en/model_doc/auto#transformers.AutoImageProcessor)
	- `VIDEO_PROCESSOR_MAPPING_NAMES` for [AutoVideoProcessor](/docs/transformers/main/en/model_doc/auto#transformers.AutoVideoProcessor)

	## Testing

	Add tests for each vision processing component in the model test directory. Image and video processor tests follow the same pattern. Inherit from the shared mixin, indicate the fast and slow processing classes when automatic discovery isn't enough, provide model-specific init kwargs, and override the input name when the model uses a non-default output key.

	### Image processor tests

	Image processor tests usually live in `tests/models/<model_name>/test_image_processing_<model_name>.py` and inherit from `ImageProcessingTestMixin`.

	The image processing mixin finds the image processor classes from `IMAGE_PROCESSOR_MAPPING_NAMES`. Expose model-specific defaults through `image_processor_dict`. Add a tester object only when you need reusable dummy inputs or helper methods for focused tests.

	```py
	import unittest

	from transformers.testing_utils import require_torch, require_vision
	from ...test_image_processing_common import ImageProcessingTestMixin

	@require_torch
	@require_vision
	class MyModelImageProcessingTest(ImageProcessingTestMixin, unittest.TestCase):
	    @property
	    def image_processor_dict(self):
	        return {"size": {"shortest_edge": 224}, "do_resize": True}
	```

	Add focused tests for behavior the mixin can't infer, such as custom resizing rules or model-specific kwargs.

	### Video processor tests

	Video processor tests usually live in `tests/models/<model_name>/test_video_processing_<model_name>.py` and inherit from `VideoProcessingTestMixin`. Set `fast_video_processing_class`, define `video_processor_dict`, and override `input_name` if the model uses a key other than `pixel_values_videos`.

	```py
	import unittest

	from transformers.testing_utils import require_torch, require_vision
	from transformers.utils import is_torchvision_available
	from ...test_video_processing_common import VideoProcessingTestMixin

	if is_torchvision_available():
	    # MyModelVideoProcessor is the placeholder class defined earlier in this guide.
	    from transformers import MyModelVideoProcessor

	@require_torch
	@require_vision
	class MyModelVideoProcessingTest(VideoProcessingTestMixin, unittest.TestCase):
	    fast_video_processing_class = MyModelVideoProcessor if is_torchvision_available() else None
	    input_name = "pixel_values_videos"

	    @property
	    def video_processor_dict(self):
	        return {"size": {"shortest_edge": 224}, "num_frames": 16}
	```

	Add focused video tests for frame sampling, metadata handling, decoded video inputs, list-of-frame inputs, and output shapes. If your processor renames `pixel_values_videos`, assert the renamed key is returned.
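
A focused test for a renamed output key can be as small as the sketch below. It runs against stand-in classes so that it is self-contained; in a real test the processor under test would replace `StandInRenamingProcessor`. Both stand-in classes and the test class name are hypothetical.

```python
import unittest


class StandInVideoProcessor:
    """Hypothetical stand-in that returns the default video output key."""

    def preprocess(self, videos):
        return {"pixel_values_videos": [list(frames) for frames in videos]}


class StandInRenamingProcessor(StandInVideoProcessor):
    """Renames the default key after calling the base implementation."""

    def preprocess(self, videos):
        batch = super().preprocess(videos)
        batch["pixel_values"] = batch.pop("pixel_values_videos")
        return batch


class RenamedKeyTest(unittest.TestCase):
    def test_renamed_key_is_returned(self):
        batch = StandInRenamingProcessor().preprocess([[0, 1, 2]])
        # The renamed key is present and the default key is gone.
        self.assertIn("pixel_values", batch)
        self.assertNotIn("pixel_values_videos", batch)
        # One video with three frames went in, so three frames come out.
        self.assertEqual(len(batch["pixel_values"][0]), 3)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RenamedKeyTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```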

	If the model also has a [ProcessorMixin](/docs/transformers/main/en/main_classes/processors#transformers.ProcessorMixin) that wraps the image or video processor, add `tests/models/<model_name>/test_processing_<model_name>.py` and inherit from `ProcessorTesterMixin`. Set `processor_class` and override `_setup_<component>()` class methods for components that can't be constructed without arguments. Use `_setup_test_attributes()` to expose placeholder tokens used by the common processor tests.

	```py
	import unittest

	from ...test_processing_common import ProcessorTesterMixin

	# MyModelProcessor is the placeholder ProcessorMixin subclass for your model.
	class MyModelProcessorTest(ProcessorTesterMixin, unittest.TestCase):
	    processor_class = MyModelProcessor

	    @classmethod
	    def _setup_image_processor(cls):
	        return cls._get_component_class_from_processor("image_processor")(size={"shortest_edge": 224})

	    @classmethod
	    def _setup_video_processor(cls):
	        return cls._get_component_class_from_processor("video_processor")(num_frames=2)

	    @classmethod
	    def _setup_test_attributes(cls, processor):
	        cls.image_token = getattr(processor, "image_token", "")
	        cls.video_token = getattr(processor, "video_token", "")
	```

	## Next steps

	- Read the [Auto-generating docstrings](./auto_docstring) guide to auto-generate consistent docstrings with `@auto_docstring`.
	- Read the [Image processors](./image_processors) and [Video processors](./video_processors) guides for user-facing preprocessing behavior.
