---
license: mit
---

Llama 3 variant for 22 Indian languages:

1. Tamil
2. Telugu
3. Assamese
4. Kashmiri
5. Punjabi
6. Bengali
7. Sanskrit
8. Malayalam
9. Sindhi
10. Marathi
11. Gujarati
12. Kannada
13. Odia
14. Maithili
15. Urdu
16. Nepali
17. Manipuri
18. Dogri
19. English
20. Arabic
21. Santali
22. Bodo

We first pre-trained the model on more than 100 million Indic-language tokens.
It was then fine-tuned on the closed-source GenZ_Vikas datasets, which consist of 7.5 million SFT pairs, including 5.5 million Hindi SFT pairs.
Finally, it underwent DPO training to align it with human preferences.
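
For readers unfamiliar with DPO (Direct Preference Optimization), the snippet below is a minimal sketch of the standard DPO loss for a single preference pair. It is not the training code used for this model. The function name `dpo_loss`, the default `beta=0.1`, and the log-probability values in the usage lines are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed value, not the one used for this model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)   # policy identical to reference
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)   # policy favours the chosen response more
print(baseline, improved)
```

When the policy matches the reference, the loss equals log 2. It falls below that as the policy assigns relatively more probability to the preferred response than the reference does.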

The model has been benchmarked on the Indic LLM leaderboard, where it outperforms our AryaBhatta series on Hindi evals and the Llama 3 base model on all Indian languages.

Training ran on two A100 GPUs for 24 days.

Leaderboard link: https://huggingface.co/spaces/Cognitive-Lab/indic_llm_leaderboard

Release link: https://www.linkedin.com/feed/update/urn:li:activity:7199506579828662272