- visual-question-answering
- Bilingual
---

# ViLaH

ViLaH (Vision Language Hindi) is a 3-billion-parameter model, fine-tuned from the base model [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224), that takes an input image together with bilingual (Hindi and English) text and produces bilingual text output.

# Training Details

* Model Configuration: Fine-tuned for a single epoch on 2 T4 GPUs with a Distributed Data Parallel (DDP) setup.
* Training Duration: Approximately one day.
* Evaluation Loss: Reached an eval loss of 1.6384 at the end of the epoch.

# Dataset

The model was fine-tuned on a single dataset:

* [damerajee/clean_hin_vqa](https://huggingface.co/datasets/damerajee/clean_hin_vqa): Derived from [Lin-Chen/ShareGPT4V](https://huggingface.co/datasets/Lin-Chen/ShareGPT4V) and filtered to include only images from the COCO dataset. The original dataset was translated and cleaned to ensure high-quality Hindi visual question answering content.

# How to Use

**Note:** The following snippets use the model BhashaAI/ViLaH for reference purposes. The model in the repo you are now browsing may have been trained for other tasks, so make sure you use inputs appropriate for the task at hand.

```python
!pip install peft trl datasets accelerate bitsandbytes
!pip install transformers --upgrade
```

### Run the model on a single T4 GPU in float16

```python
import torch
from datasets import load_dataset
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor

# Grab a sample image and Hindi question from the fine-tuning dataset
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

# Prepare the image and question for the model
inputs = processor(text=text, images=test_image, return_tensors="pt").to(base_model.device)

MAX_LENGTH = 500
# Autoregressively generate with greedy decoding;
# for more fancy methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH)

# Next we turn the predicted token IDs back into a string, chopping off
# the prompt, which consists of the image tokens and our text prompt
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```
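The prompt-chopping arithmetic above (image tokens, plus question tokens, plus 2 special tokens) can be illustrated with plain Python lists. All token IDs below are made up for the sketch; in the real snippet the counts come from `base_model.config.image_token_index` and `processor.tokenizer.encode(text)`.

```python
# Toy sketch of the prompt-stripping logic, with made-up token IDs.
IMAGE_TOKEN = 999  # stand-in for base_model.config.image_token_index

# Pretend the output echoes its prompt: 3 image tokens, 4 question
# tokens, 2 special tokens (e.g. BOS and a separator), then the answer.
generated = [IMAGE_TOKEN] * 3 + [10, 11, 12, 13] + [1, 2] + [500, 501, 502]

num_image_tokens = sum(1 for t in generated if t == IMAGE_TOKEN)
num_text_tokens = 4  # in the real code: len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2

answer_ids = generated[num_prompt_tokens:]
print(answer_ids)  # [500, 501, 502], only the generated answer remains
```

Slicing `generated_ids[:, num_prompt_tokens:]` in the real snippet does the same thing on the batch dimension of the returned tensor.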

### Run the model on a single T4 GPU in 4-bit

```python
import torch
from datasets import load_dataset
from transformers import (
    PaliGemmaForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)

# Grab a sample image and Hindi question from the fine-tuning dataset
dataset = load_dataset("damerajee/clean_hin_vqa", split="train")
test_example = dataset[10000]
test_image = test_example["image"]
text = test_example["question"]

device_index = torch.cuda.current_device()
print("device_index:", device_index)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
base_model = PaliGemmaForConditionalGeneration.from_pretrained(
    "BhashaAI/ViLaH",
    device_map={"": device_index},
    quantization_config=quantization_config,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
processor = AutoProcessor.from_pretrained("BhashaAI/ViLaH")

# Prepare the image and question for the model
inputs = processor(text=text, images=test_image, return_tensors="pt").to(base_model.device)

MAX_LENGTH = 500
# Autoregressively generate with greedy decoding;
# for more fancy methods see https://huggingface.co/blog/how-to-generate
generated_ids = base_model.generate(**inputs, max_new_tokens=MAX_LENGTH)

# Next we turn the predicted token IDs back into a string, chopping off
# the prompt, which consists of the image tokens and our text prompt
image_token_index = base_model.config.image_token_index
num_image_tokens = len(generated_ids[generated_ids == image_token_index])
num_text_tokens = len(processor.tokenizer.encode(text))
num_prompt_tokens = num_image_tokens + num_text_tokens + 2
generated_text = processor.batch_decode(
    generated_ids[:, num_prompt_tokens:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(generated_text)
```

## Usage and limitations

### Intended usage

Open Vision Language Models (VLMs) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
### Ethical considerations and risks

The development of vision-language models (VLMs) raises several ethical concerns. In creating an open model, we have carefully considered the following:

* Bias and Fairness
  * VLMs trained on large-scale, real-world image-text data can reflect socio-cultural biases embedded in the training material. These models underwent careful scrutiny, with input data pre-processing described and posterior evaluations reported in this card.
* Misinformation and Misuse
  * VLMs can be misused to generate text that is false, misleading, or harmful.
  * Guidelines are provided for responsible use with the model; see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
* Transparency and Accountability
  * This model card summarizes details on the model's architecture, capabilities, limitations, and evaluation processes.
  * A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

* **Perpetuation of biases:** Continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases are encouraged.
* **Generation of harmful content:** Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
* **Misuse for malicious purposes:** Technical limitations and developer and end-user education can help mitigate malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided. Prohibited uses of Gemma models are outlined in the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
* **Privacy violations:** Models were trained on data filtered to remove certain personal information and sensitive data. Developers are encouraged to adhere to privacy regulations using privacy-preserving techniques.

### Limitations

* Most limitations inherited from the underlying Gemma model still apply:
  * VLMs are better at tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  * Natural language is inherently complex. VLMs might struggle to grasp subtle nuances, sarcasm, or figurative language.
  * VLMs generate responses based on information they learned from their training datasets, but they are not knowledge bases. They may generate incorrect or outdated factual statements.
  * VLMs rely on statistical patterns in language and images. They might lack the ability to apply common sense reasoning in certain situations.