| Field | Response |
| :------------------------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------- |
| Intended Task/Domain: | Text-Image Matching |
| Model Type: | Transformer |
| Intended Users: | The model is intended for researchers and developers requiring Hebrew vision-language capabilities, such as matching Hebrew text with images in a shared embedding space. |
| Output: | 768-dimensional embedding vector aligned with CLIP ViT-Large/14 |
| Describe how the model works: | Hebrew text input is encoded into an embedding vector and compared with an embedding vector from an image to output a matching score. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Technical Limitations & Mitigation: | Currently, this model only supports the Hebrew language. It may struggle with transliterated or code-mixed text and out-of-domain terminology. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Retrieval Recall (R@1/5/10), Zero-Shot Classification Accuracy |
| Potential Known Risks: | The model may generate and/or reproduce biases present in the training data and/or the underlying base model. Since the model supports image retrieval based on text captions, it may return mismatched or irrelevant images for captions that fall outside the distribution of the training data. |
| Licensing: | [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) |
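The matching step described above (text embedding compared with an image embedding to produce a score) can be sketched as follows. This is a minimal illustration, not the model's actual inference code: the random vectors stand in for the encoder outputs, and the only assumption taken from the card is that both modalities live in the same 768-dimensional CLIP ViT-Large/14-aligned space and are compared by cosine similarity, the standard CLIP-style score.

```python
import numpy as np

EMBED_DIM = 768  # output dimension stated in the model card


def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize so that dot products become cosine similarities."""
    return v / np.linalg.norm(v)


def matching_score(text_emb: np.ndarray, image_emb: np.ndarray) -> float:
    """Cosine similarity between a text and an image embedding."""
    return float(normalize(text_emb) @ normalize(image_emb))


rng = np.random.default_rng(0)
# Placeholder embeddings standing in for the Hebrew text encoder and
# the image encoder outputs (hypothetical values for illustration).
text_emb = rng.standard_normal(EMBED_DIM)
image_embs = rng.standard_normal((3, EMBED_DIM))

# Rank candidate images against the caption: the highest cosine
# similarity is the best match in the shared embedding space.
scores = [matching_score(text_emb, img) for img in image_embs]
best = int(np.argmax(scores))
print(f"best image index: {best}, scores: {[round(s, 3) for s in scores]}")
```

In practice, retrieval metrics such as R@1/5/10 from the card are computed by checking whether the ground-truth image appears among the top-k images ranked by exactly this score.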