---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
language:
- en
- ko
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---
# HiLLaVA: Hancom InSpace Large Language and Vision Assistant
- **Developer:** Hancom InSpace
- **Supported languages:** Korean, English
- **Model release date:** August 30, 2024
## Model
This model is built on the LLaVA architecture and was fine-tuned using the `openai/clip-vit-large-patch14-336` vision encoder and the `meta-llama/Meta-Llama-3.1-8B-Instruct` language model. Pretraining ran for about 4 hours on 4 H100 GPUs, followed by an 11-hour fine-tuning phase. The model is designed to deliver strong performance on multimodal tasks, particularly those combining vision and language.
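Since the card lists `transformers` as the library and `image-text-to-text` as the pipeline tag, inference should follow the standard LLaVA flow in `transformers`. The sketch below is illustrative only: the repository id and image URL are placeholders, and it assumes the released processor ships a chat template (otherwise the LLaVA prompt has to be built manually).

```python
# Minimal inference sketch for a LLaVA-style checkpoint with transformers.
# The repo id and image URL are placeholders, not confirmed by this card.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "hancominspace/HiLLaVA"  # placeholder repository id

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cats.jpg", stream=True).raw)
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "해당 이미지를 설명해주세요."},  # "Please describe this image."
    ]},
]
# Assumes the processor provides a chat template; otherwise build the prompt manually.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```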
## Dataset

The datasets used in the training process are the LLaVA Visual Instruct Pretrain LCS-558K and LLaVA Visual Instruct 150K datasets. These datasets contribute to the model's ability to align visual features with instruction-following tasks, pushing toward GPT-4-level multimodal capabilities.
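For reference, a minimal sketch for fetching the annotation files of the two Hub datasets listed in this card's metadata; the image files themselves are distributed separately and must be obtained as described in each dataset card.

```python
# Sketch: download the annotation files for the pretraining and
# instruction-tuning datasets listed in this card's metadata.
# Images (CC3M / COCO) are not included and must be fetched separately.
from huggingface_hub import snapshot_download

pretrain_dir = snapshot_download(
    repo_id="liuhaotian/LLaVA-CC3M-Pretrain-595K", repo_type="dataset"
)
instruct_dir = snapshot_download(
    repo_id="liuhaotian/LLaVA-Instruct-150K", repo_type="dataset"
)
print(pretrain_dir, instruct_dir)
```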
## Licensing Information
When using or distributing this model, it is important to adhere to the following licenses:
- **LLaVA Visual Instruct Pretrain LCS-558K:** The dataset includes images and captions from the LAION/CC/SBU dataset and BLIP-generated captions. Users must comply with the licenses associated with CC-3M and BLIP. The CC-3M dataset may be used freely with proper attribution to Google LLC; however, the dataset is provided "AS IS" without any warranties.
- **LLaVA Visual Instruct 150K:** This dataset is released under the Creative Commons Attribution 4.0 International License. Users must also comply with OpenAI's Terms of Use, as the dataset was generated using the GPT-4 API.
It is crucial to include these licensing details when distributing or using the model to ensure compliance with the original data sources' terms.
## Example Output
```markdown
# Image1
Q: Please describe this image.
HiLLaVA: This image shows two black-and-white cats walking across a grassy field. One is on the left and the other is on the right. The two cats are close to each other, and one appears to be walking slightly ahead of the other. One of the cats looks slightly smaller than the other, which may indicate a difference in age. The cats appear to be interacting playfully as they move across the grass together.

# Image2
Q: What animal do you see?
HiLLaVA: The animal in the image is a tiger.

# Image3
Q: What is this structure?
HiLLaVA: This structure is the Eiffel Tower, a famous landmark located in Paris, France.

# Image4
Q: Please give the bounding box coordinates of the region containing the large commercial airplane.
HiLLaVA: This image shows a large commercial airplane preparing for takeoff on an airport runway. The airplane is positioned near the center of the runway, getting ready to depart. Several vehicles are scattered around the runway, some close to the airplane and others farther away. These are most likely ground service vehicles supporting the airplane as it prepares for takeoff. The scene captures the busy atmosphere of an airport as the plane prepares to take off.
HiLLaVA: [0.44, 0.47, 0.55, 0.56]
```
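The bounding box in the Image4 example appears to use normalized `[x1, y1, x2, y2]` coordinates expressed as fractions of the image width and height, which is the usual LLaVA convention; treat this as an assumption. A small sketch for converting such a box to pixel coordinates and drawing it:

```python
# Sketch: draw a bounding box given in normalized [x1, y1, x2, y2] form
# (fractions of image width/height -- an assumed convention) onto an image.
from PIL import Image, ImageDraw

def draw_normalized_box(image_path: str, box, out_path: str) -> None:
    image = Image.open(image_path).convert("RGB")
    w, h = image.size
    x1, y1, x2, y2 = box
    # Scale fractional coordinates to integer pixel positions.
    pixel_box = (int(x1 * w), int(y1 * h), int(x2 * w), int(y2 * h))
    ImageDraw.Draw(image).rectangle(pixel_box, outline="red", width=3)
    image.save(out_path)

# Example with the coordinates returned for Image4 (file names are placeholders).
draw_normalized_box("airplane.jpg", [0.44, 0.47, 0.55, 0.56], "airplane_boxed.jpg")
```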