[English part of the README is under the Polish part]

Multimodalny Bielik-1.5B-v3.0-Instruct rozumiejący zarówno obrazy i tekst.

Nadaje się on do generowania prostych opisów ogólnych obrazów, zadawania pytań dotyczących obrazów (np. czy coś jest widoczne na obrazie), oraz podstawowych zadań OCR (czytanie tekstu z obrazów).

Wykorzystuje on google/siglip-so400m-patch14-384 jako enkoder obrazów.

Ze względu na mały rozmiar, może on źle opisywać obrazy i halucynować ze zwiększonym prawdopodobieństwem. Model często udziela minimalistycznych odpowiedzi na pytania dotyczących obrazów. Nie nadaje się on do analizy logicznej czy długiego kreatywnego pisania o obrazach.

Parametry:

Architektura: LLaVA-style (Linear Projector).

Model bazowy: speakleash/Bielik-1.5B-v3.0-Instruct

Zalecana temperatura 0.2-0.5.

W repozytorium znajdują się pliki Python umożliwiające inferencję (uruchomienie modelu).

Model wytrenowano w dwóch etapach (Feature Alignment + Visual Instruction Tuning).

Model wytrenowano z użyciem datasetów liuhaotian/LLaVA-Pretrain oraz kaiyuyue/llava-1.5-665k-instructions. Ten drugi przetłumaczono na język Polski z użyciem facebook/nllb-200-distilled-600M.

Przetłumaczony dataset: Wojtekb30/llava-1.5-665k-instructions-Polish

Do zadań OCR zaleca się dodanie na końcu prompta \nReference OCR token: i następnie tekst odczytany z obrazu przez inny, klasyczny OCR (nawet fragmenty w losowej kolejności).

Przykłady promptów i odpowiedzi:

[zostaną dodane wkrótce]

Linki:

https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain

https://huggingface.co/datasets/kaiyuyue/llava-1.5-665k-instructions

https://huggingface.co/datasets/Wojtekb30/llava-1.5-665k-instructions-Polish

https://huggingface.co/facebook/nllb-200-distilled-600M

https://huggingface.co/speakleash/Bielik-1.5B-v3.0-Instruct

https://huggingface.co/google/siglip-so400m-patch14-384

Licencje (należy ich przestrzegać korzystając z tego modelu):

liuhaotian/LLaVA-Pretrain (cytat dot. licencji z repozytorium):

License: Must comply with license of CC-3M, BLIP (if you use their synthetic caption).

CC-3M The dataset may be freely used for any purpose, although acknowledgement of Google LLC ("Google") as the data source would be appreciated.
The dataset is provided "AS IS" without any warranty, express or implied.
Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

kaiyuyue/llava-1.5-665k-instructions - Creative Commons Attribution 4.0

facebook/nllb-200-distilled-600M - Creative Commons Attribution Non Commercial 4.0

speakleash/Bielik-1.5B-v3.0-Instruct - Apache 2.0

siglip-so400m-patch14-384 - Apache 2.0

Nota prawna: Model bazowy (Bielik) i Vision Tower są na licencji Apache 2.0. Jednakże ze względu na użycie NLLB (Non-Commercial) do tłumaczenia części danych treningowych, zaleca się ostrożność przy komercyjnym wykorzystaniu tego konkretnego modelu fine-tuned.

Multimodal Bielik-1.5B-v3.0-Instruct understanding both images and text.

It is suitable for generating simple general descriptions of images, asking questions about images (e.g., whether something is visible in the image), and basic OCR tasks (reading text from images).

It uses google/siglip-so400m-patch14-384 as the image encoder.

Due to its small size, it may describe images incorrectly and hallucinate with increased probability. The model often provides minimalist answers to questions about images. It is not suitable for logical analysis or long creative writing about images.

Parameters:

Architecture: LLaVA-style (Linear Projector).

Base model: speakleash/Bielik-1.5B-v3.0-Instruct

Recommended temperature 0.2–0.5.

The repository contains Python files enabling inference (running the model).

The model was trained in two stages (Feature Alignment + Visual Instruction Tuning).

The model was trained using the datasets liuhaotian/LLaVA-Pretrain and kaiyuyue/llava-1.5-665k-instructions. The latter was translated into Polish using facebook/nllb-200-distilled-600M.

The translated dataset: Wojtekb30/llava-1.5-665k-instructions-Polish

For OCR tasks, it is recommended to add at the end of the prompt \nReference OCR token: and then the text read from the image by another, classic OCR (even fragments in random order).

Prompt and response examples:

[coming soon]

Links: