---
library_name: transformers
language:
- mr
tags:
- SLM
- marathi-slm
- sangraha
- SmolLM2
datasets:
- ai4bharat/sangraha
---

# Model Card for Marathi-SmolLM2-145M

## Model Details

An experimental 145M-parameter pre-trained base model for Marathi, inspired by SmolLM2 and its architecture. Pre-trained on the verified Marathi split of the [`ai4bharat/sangraha`](https://huggingface.co/datasets/ai4bharat/sangraha) dataset, roughly 2.8 billion tokens.

Note: This is an experimental model and will be followed by further pre-training and then task-specific instruction fine-tuning.

## How to use

```python
# Load the tokenizer and model directly from the Hub
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("sky-2002/Marathi-SmolLM2-145M")
model = AutoModelForCausalLM.from_pretrained("sky-2002/Marathi-SmolLM2-145M")

sentence = "पुणे विद्यापीठाने म्हटले आहे"
inputs = tokenizer(sentence, return_tensors="pt")
output = model.generate(**inputs, max_length=50)
print(tokenizer.batch_decode(output, skip_special_tokens=True))
```

### Model Description, data and training details

**Architecture**: SmolLM2-based.

**Tokenizer**: Uses the `sarvamai/sarvam-1` tokenizer, since it has been trained on Indic languages and has a lower fertility rate than existing multilingual tokenizers.

**Training dataset**: The training dataset covers the following domains.

![Training dataset domain distribution](image.png)

**Training**:
- Trained on an A100 using the Modal platform.
- Trained for 1 epoch on the verified Marathi split of the Sangraha dataset, covering ~5.8M samples.

This model can generate coherent text, especially in domains similar to those in the training dataset.

## Bias, Risks, and Limitations

This model was trained on ~2.8 billion tokens with a context length of 512, due to computational constraints. It often produces gibberish if the prompt is unrelated to the domains shown above or is phrased in a conversational style; keeping prompts and generation within the training context helps, as in the sketch below.
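A minimal sketch of generation that respects the 512-token training context; the sampling settings here are illustrative assumptions, not tuned values.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("sky-2002/Marathi-SmolLM2-145M")
model = AutoModelForCausalLM.from_pretrained("sky-2002/Marathi-SmolLM2-145M")

# Truncate long prompts to the 512-token context window the model was trained with.
prompt = "पुणे विद्यापीठाने म्हटले आहे"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)

# Cap new tokens so prompt + generation stays near the training context length.
# do_sample/top_p/temperature are example values, not recommended settings.
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```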