---
license: apache-2.0
language:
  - dv
base_model: google/functiongemma-270m-it
tags:
  - dhivehi
  - maldives
  - conversational
  - naturecode
library_name: transformers
pipeline_tag: text-generation
---

# Naturecode Dhivehi 270m

The first open-source Dhivehi language model optimized for natural conversations.

## Model Description

Naturecode Dhivehi 270m is a fine-tuned language model designed specifically for Dhivehi, the official language of the Maldives. Built on Google's FunctionGemma-270m architecture, it was trained on authentic Dhivehi text through a six-phase curriculum learning approach.

## Key Features

- **Native Dhivehi Support:** Trained on authentic Dhivehi text covering both formal and informal registers
- **Conversational:** Optimized for natural dialogue in Dhivehi
- **Lightweight:** 270M parameters, efficient for deployment
- **Curriculum Trained:** Progressive training through six phases for robust language understanding

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then apply the Naturecode Dhivehi LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained("google/functiongemma-270m-it")
model = PeftModel.from_pretrained(base_model, "hilarl/naturecode-dhivehi-270m")
tokenizer = AutoTokenizer.from_pretrained("google/functiongemma-270m-it")

# Gemma-style chat turns: the user message sits between turn markers,
# and generation continues from the opening model turn
prompt = "<start_of_turn>user\nކިހިނެއް؟<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
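If you build the prompt string by hand, the turn markers must match Gemma's chat format exactly. A minimal helper (hypothetical, not part of this repository) that produces the layout used above:

```python
def format_gemma_prompt(user_text: str) -> str:
    """Wrap a user message in Gemma-style chat-turn markers,
    leaving an open model turn for generation to continue from."""
    return (
        f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("ކިހިނެއް؟")
```

In practice, `tokenizer.apply_chat_template` builds the same layout from a list of messages and is less error-prone than string concatenation.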

## Training Details

- **Base Model:** google/functiongemma-270m-it
- **Training Method:** LoRA fine-tuning with curriculum learning (six phases)
- **Training Data:** Curated Dhivehi corpus including dictionary entries, formal text, informal conversations, and SFT data
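Curriculum learning here means the training data is presented in a fixed order of increasing difficulty rather than all at once. The sketch below illustrates only the scheduling mechanics; the phase names are placeholders drawn from the data types listed above, since the actual six-phase recipe is not published in this card.

```python
# Placeholder curriculum: stage numbers and names are illustrative,
# not the model's actual six-phase training recipe.
CURRICULUM = [
    (2, "formal_text"),
    (1, "dictionary_entries"),
    (4, "sft_data"),
    (3, "informal_conversations"),
]

def phase_order(curriculum):
    """Return dataset names in the order the curriculum presents them."""
    return [name for stage, name in sorted(curriculum)]
```

Each phase would fine-tune on its dataset before the next begins, so earlier, simpler material (e.g. dictionary entries) grounds the model before conversational data is introduced.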

## Intended Use

- Dhivehi language chatbots and assistants
- Dhivehi text generation
- Educational applications for Dhivehi language learning
- Research on low-resource language models

## Limitations

- Primarily trained for conversational use; may not excel at technical domains
- Performance may vary with highly specialized vocabulary
- Should not be used for generating harmful or misleading content

## License

Apache 2.0

## Citation

```bibtex
@misc{naturecode-dhivehi-270m,
  author = {Naturecode},
  title = {Naturecode Dhivehi 270m},
  year = {2025},
  publisher = {HuggingFace},
}
```