aashay96/indic-BloomLM


aashay96 committed on May 3, 2023

Commit 55d0815 · 1 Parent(s): fc89b90

Added readme

Files changed (1)

README.md +49 -22

README.md CHANGED

@@ -1,22 +1,49 @@
----
-license: bigscience-openrail-m
-datasets:
-- aashay96/indic_language_corpus
-language:
-- hi
-- ta
-- te
-- gu
-- pa
-- or
-- as
-- kn
-- mr
-library_name: transformers
-pipeline_tag: text-generation
-tags:
-- indic
-- text-generation-inference
-- peft
-- Bloom
----

+# Indic Language Bloom Model Training
+This repository contains the code and resources for fine-tuning the Hugging Face BLOOM model on an Indic language corpus using Low-Rank Adaptation (LoRA). The goal is a language model tailored specifically to Indic languages.
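Annotation (not part of the commit): the README names LoRA without describing it. As a rough sketch, LoRA keeps the pretrained weight matrix frozen and trains only two small matrices whose product is added to it, scaled by `alpha / r`. The example below uses plain Python lists so it runs without any ML library; the function names, shapes, and values are illustrative assumptions and are not taken from this repository's training code.

```python
# Minimal LoRA sketch in pure Python. All names and numbers are illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w0, lora_a, lora_b, alpha):
    """Return W0 + (alpha / r) * (B @ A), where r is the adapter rank."""
    r = len(lora_a)                 # A has shape (r, in_features)
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

def trainable_params(out_features, in_features, r):
    """Return (LoRA parameter count, full fine-tuning parameter count) for one matrix."""
    return r * (in_features + out_features), out_features * in_features

# Frozen 2x3 base weight with a rank-1 adapter.
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]  # shape (r=1, in_features=3)
lora_b = [[0.0], [0.0]]     # shape (out_features=2, r=1); zero init leaves W0 unchanged
merged = lora_effective_weight(w0, lora_a, lora_b, alpha=2.0)
```

Because `lora_b` starts at zero, `merged` equals `w0` before any training step. For a hypothetical 4096 x 4096 projection with `r=8`, `trainable_params` reports 65,536 adapter parameters against 16,777,216 for the full matrix, which is why LoRA fits on much smaller hardware.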
+## Dataset
+The training data is provided by AI4Bharat. I have uploaded a processed copy to the Hugging Face Hub:
+- [Processed Indic Language Corpus](https://huggingface.co/datasets/aashay96/indic_language_corpus/tree/main)
+## Progress
+### Completed
+- [x] Low-Rank Adaptation fine-tuning of the Bloom model on streaming data
+- [x] Single checkpoint available (training logs at [Weights & Biases](https://wandb.ai/indic-lm/huggingface/runs/7kq2m62v/))
+### To Do
+- [ ] Benchmark current multilingual LLMs on IndicGLUE using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
+- [ ] Integrate DeepSpeed for better resource utilization
+- [ ] Convert existing instruction datasets (Dolly v2, GPT-distilled datasets, etc.) to Indic languages and train on them
+- [ ] Open issue: the model doesn't stop producing text; a fix is still needed
+- [ ] Deploy RLHF community app using [Cheese](https://github.com/CarperAI/cheese)
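Annotation (not part of the commit): the to-do list leaves the non-stopping generation problem open. The source does not diagnose it, so the following is only one plausible line of attack, assuming the cause is that training sequences never ended with an end-of-sequence (EOS) token: append EOS to each training example, and cut generated output at the first EOS that appears after the prompt. The helper names and the `EOS_ID` value are hypothetical placeholders; in practice the id would come from `tokenizer.eos_token_id`.

```python
# Hypothetical helpers for EOS handling; EOS_ID is a placeholder value.
EOS_ID = 2

def append_eos(token_ids, eos_token_id):
    """Training side: make sure every example ends with exactly one trailing EOS."""
    if token_ids and token_ids[-1] == eos_token_id:
        return list(token_ids)
    return list(token_ids) + [eos_token_id]

def truncate_at_eos(generated_ids, eos_token_id, prompt_len=0):
    """Inference side: drop the first EOS found after the prompt and everything after it."""
    for i in range(prompt_len, len(generated_ids)):
        if generated_ids[i] == eos_token_id:
            return generated_ids[:i]
    return list(generated_ids)

example = append_eos([5, 6], EOS_ID)
trimmed = truncate_at_eos([5, 6, 7, EOS_ID, 9, 9], EOS_ID, prompt_len=2)
```

At generation time, `model.generate` also accepts an `eos_token_id` argument, so once the model has learned to emit EOS, decoding can stop there without post-hoc truncation.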
+## Using the Model
+```python
+import torch
+from peft import PeftModel, PeftConfig
+from transformers import AutoModelForCausalLM, AutoTokenizer
+peft_model_id = "aashay96/indic-BloomLM"
+# The adapter config records which base model the LoRA weights were trained on
+config = PeftConfig.from_pretrained(peft_model_id)
+# Load the base model in 8-bit (requires bitsandbytes and a CUDA GPU)
+model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
+tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+# Load the LoRA adapter weights on top of the base model
+model = PeftModel.from_pretrained(model, peft_model_id)
+# Tokenize a Hindi prompt ("How are you") and move it to the model's device
+batch = tokenizer("आप कैसे हैं", return_tensors='pt').to(model.device)
+with torch.cuda.amp.autocast():
+  output_tokens = model.generate(**batch, max_new_tokens=10)
+print('\n\n', tokenizer.decode(output_tokens[0], skip_special_tokens=True))