| | --- |
| | license: mit |
| | --- |
| | |
| | This model is finetuned from HuggingFaceH4/zephyr-7b-gemma-v0.1 on 9 Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) plus English. |
| | To improve reasoning and maths skills, we first apply supervised finetuning (SFT) to the Gemma-based model on Microsoft's Orca datasets. |
| |
|
| | We use the Orca maths Hindi dataset, GenVRadmin/Aryabhatta-Orca-Maths-Hindi, \ |
| | and the original Orca maths dataset, microsoft/orca-math-word-problems-200k. |
| |
|
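
As a rough sketch, the two Orca maths datasets named above could be pulled from the Hugging Face Hub with the `datasets` library. The helper name `load_orca_maths` and the choice of the `train` split are assumptions for illustration, not details taken from the training recipe.

```python
# Dataset IDs exactly as named in this model card.
ORCA_MATHS_DATASETS = [
    "GenVRadmin/Aryabhatta-Orca-Maths-Hindi",
    "microsoft/orca-math-word-problems-200k",
]


def load_orca_maths(split="train"):
    """Download each dataset and return a {dataset_id: Dataset} mapping.

    Hypothetical helper: needs `pip install datasets` and network access,
    and assumes each repository exposes the requested split.
    """
    # Imported lazily so this module can be loaded without the library installed.
    from datasets import load_dataset

    return {name: load_dataset(name, split=split) for name in ORCA_MATHS_DATASETS}
```

Calling `load_orca_maths()` and inspecting `column_names` on each result is a quick way to compare the two schemas before mixing them.
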
| | This raises the MATHS score from 24.3 for Gemma-7B and 25.5 for Zephyr-Gemma to 31.6 for GemmaOrca. |
| |
|
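
To put the MATHS scores above in perspective, the short calculation below works out the absolute and relative gains. It uses only the three reported numbers; the helper name `gain` is illustrative.

```python
# MATHS scores as reported in this model card.
scores = {"Gemma-7B": 24.3, "Zephyr-Gemma": 25.5, "GemmaOrca": 31.6}


def gain(baseline, tuned):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = scores[tuned] - scores[baseline]
    relative = 100.0 * absolute / scores[baseline]
    return round(absolute, 1), round(relative, 1)


print(gain("Gemma-7B", "GemmaOrca"))      # (7.3, 30.0)
print(gain("Zephyr-Gemma", "GemmaOrca"))  # (6.1, 23.9)
```
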
| | The model is then finetuned on GenVR's Samvaad datasets: GenVRadmin/Samvaad-Indic-Positive, GenVRadmin/Samvaad-Tamil-Mixtral, and a subset of GenVRadmin/Samvaad-Mixed-Language-3. |
| |
|
| | It is then further finetuned on several open-source datasets, including: |
| |
|
| | Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized \ |
| | Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized \ |
| | abhinand/tamil-alpaca \ |
| | Tensoic/airoboros-3.2_kn \ |
| | Tensoic/gpt-teacher_kn \ |
| | Tensoic/Alpaca-Gujarati \ |
| | HydraIndicLM/bengali_alpaca_dolly_67k \ |
| | Open-Orca/OpenOrca \ |
| | pankajmathur/alpaca_orca \ |
| | OdiaGenAI/Odia_Alpaca_instructions_52k \ |
| | OdiaGenAI/gpt-teacher-roleplay-odia-3k \ |
| | GenVRadmin/Samvaad-Punjabi-Mini \ |
| | pankajmathur/WizardLM_Orca |
| | |
| | The model achieves the following scores on benchmarks: |

| Model | AGIEval | GPT4All | TruthfulQA | BigBench | Average |
| --- | ---: | ---: | ---: | ---: | ---: |
| AryaBhatta-GemmaOrca | 35.9 | 72.26 | 53.85 | 40.35 | 50.59 |
| zephyr-7b-beta | 37.52 | 71.77 | 55.26 | 39.77 | 51.08 |
| zephyr-7b-gemma-v0.1 | 34.22 | 66.37 | 52.19 | 37.10 | 47.47 |
| mlabonne/Gemmalpaca-7B | 21.6 | 40.87 | 44.85 | 30.49 | 34.45 |
| google/gemma-7b-it | 21.33 | 40.84 | 41.70 | 30.25 | 33.53 |
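
The Average column appears to be the unweighted mean of the four benchmark scores. The sketch below recomputes it from the values in the table; the variable and function names are illustrative only.

```python
# Scores copied from the table, in the order: AGIEval, GPT4All, TruthfulQA, BigBench.
results = {
    "AryaBhatta-GemmaOrca": [35.9, 72.26, 53.85, 40.35],
    "zephyr-7b-beta": [37.52, 71.77, 55.26, 39.77],
    "zephyr-7b-gemma-v0.1": [34.22, 66.37, 52.19, 37.10],
    "mlabonne/Gemmalpaca-7B": [21.6, 40.87, 44.85, 30.49],
    "google/gemma-7b-it": [21.33, 40.84, 41.70, 30.25],
}


def average(values):
    """Unweighted mean, rounded to two decimals like the table."""
    return round(sum(values) / len(values), 2)


for model, values in results.items():
    print(f"{model}: {average(values)}")
```
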