| | --- |
| | license: mit |
| |
|
| | --- |
| | |
| |
|
| | This model is a part of two model series, AryaBhatta-1 and AryaBhatta-2 and is finetuned from HuggingFaceH4/zephyr-7b-gemma-v0.1 or Google/gemma and is finetuned on 9 Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) plus English. |
| |
|
| | There are two models. One finetuned on Google's Gemma and one fine-tuned on Zephyr's Gemma base. Repo for other one (Google one): GenVRadmin/AryaBhatta-GemmaOrca-Merged |
| |
|
| | To improve the resoning and maths skills, we first SFT tune the gemma on Microsoft's Orca datasets. |
| |
|
| | We utilize Orca maths Hindi dataset: GenVRadmin/Aryabhatta-Orca-Maths-Hindi \ |
| | And original Orca maths dataset: microsoft/orca-math-word-problems-200k |
| |
|
| | This pushes the MATHS score from 24.3 in Gemma-7B to 25.5 in Zephyr-Gemma and 31.6 in GemmaOrca. |
| |
|
| | The model is then finetuned on GenVR's Samvaad datasets (GenVRadmin/Samvaad-Indic-Positive and GenVRadmin/Samvaad-Tamil-Mixtral and a subset of GenVRadmin/Samvaad-Mixed-Language-3). |
| |
|
| | This is then finetuned on various open sourced datasets like: |
| |
|
| | Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized \ |
| | Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized \ |
| | abhinand/tamil-alpaca \ |
| | Tensoic/airoboros-3.2_kn \ |
| | Tensoic/gpt-teacher_kn \ |
| | Tensoic/Alpaca-Gujarati \ |
| | HydraIndicLM/bengali_alpaca_dolly_67k \ |
| | Open-Orca/OpenOrca \ |
| | pankajmathur/alpaca_orca \ |
| | OdiaGenAI/Odia_Alpaca_instructions_52k \ |
| | OdiaGenAI/gpt-teacher-roleplay-odia-3k \ |
| | GenVRadmin/Samvaad-Punjabi-Mini \ |
| | pankajmathur/WizardLM_Orca |
| | |
| | The model achieves following scores on benchmarks: |
| | |
| | Model AGIEval GPT4All TruthfulQA BigBench Average ⬇️ \ |
| | AryaBhatta-GemmaOrca 35.9 72.26 53.85 40.35 50.59 \ |
| | zephyr-7b-beta 37.52 71.77 55.26 39.77 51.08 \ |
| | zephyr-7b-gemma-v0.1 34.22 66.37 52.19 37.10 47.47 \ |
| | mlabonne/Gemmalpaca-7B 21.6 40.87 44.85 30.49 34.45 \ |
| | google/gemma-7b-it 21.33 40.84 41.70 30.25 33.53 |
| | |
| | |
| | |
| | |
| | |
| | How to use:- |
| | ``` |
| | from peft import AutoPeftModelForCausalLM |
| | from transformers import AutoTokenizer |
| | |
| | model = AutoPeftModelForCausalLM.from_pretrained( |
| | "GenVRadmin/AryaBhatta-GemmaOrca", |
| | load_in_4bit = False, |
| | token = hf_token |
| | ) |
| | tokenizer = AutoTokenizer.from_pretrained("GenVRadmin/AryaBhatta-GemmaOrca") |
| | |
| | input_prompt = """ |
| | ### Instruction: |
| | {} |
| | |
| | ### Input: |
| | {} |
| | |
| | ### Response: |
| | {}""" |
| | |
| | input_text = input_prompt.format( |
| | "Answer this question about India.", # instruction |
| | "Who is the Prime Minister of India", # input |
| | "", # output - leave this blank for generation! |
| | ) |
| | |
| | inputs = tokenizer([input_text], return_tensors = "pt").to("cuda") |
| | |
| | outputs = model.generate(**inputs, max_new_tokens = 300, use_cache = True) |
| | response = tokenizer.batch_decode(outputs)[0] |
| | ``` |