NagaNLP Project
Part of the NagaNLP collection: resources for low-resource NLP for Nagamese (Naga Pidgin), including conversational corpora, NER, and POS tagging resources.
NagaLLaMA-3.2-3B-Instruct is a Low-Rank Adaptation (LoRA) fine-tune of the Llama-3.2-3B-Instruct model, designed to understand and generate text in Nagamese (Naga Pidgin/Creole).
This model serves as a general-purpose instruction-following assistant for the Nagamese language, capable of answering queries, translating, and maintaining conversation in the local dialect used in Nagaland, India.
The model was trained on the NagaNLP Conversational Corpus, which contains 10,021 Nagamese instruction-following pairs.
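As a sketch of what instruction-following pairs look like on disk, the snippet below parses two illustrative JSON Lines records. The field names (`instruction`, `response`) and the record contents are assumptions for illustration, not the actual schema of the NagaNLP Conversational Corpus.

```python
import json

# Two illustrative instruction-following records in JSON Lines form.
# Field names ("instruction", "response") are assumptions for this sketch,
# not the actual schema of the NagaNLP Conversational Corpus.
raw = "\n".join([
    json.dumps({"instruction": "Machine Learning ki ase aru kote use hoi?",
                "response": "..."}),
    json.dumps({"instruction": "Translate 'good morning' to Nagamese.",
                "response": "..."}),
])

# Each line is one self-contained JSON record.
pairs = [json.loads(line) for line in raw.splitlines()]
print(len(pairs))  # 2
```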
Data Splitting: to ensure robust evaluation, the dataset was divided into a training split and held-out evaluation splits.
This release represents the final model from a data-scaling ablation study, trained on 100% of the available training split.
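Since the card does not restate the exact split ratios, the sketch below shows only the general shuffle-and-slice pattern; the 90/5/5 ratio and fixed seed are assumptions for illustration, not the values used for NagaLLaMA.

```python
import random

def split_dataset(examples, train_frac=0.90, val_frac=0.05, seed=42):
    """Shuffle-and-slice split. The 90/5/5 ratio here is an assumption
    for illustration, not the ratio used for NagaLLaMA."""
    rng = random.Random(seed)  # fixed seed => reproducible split
    items = list(examples)
    rng.shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 10,021 examples, matching the corpus size stated above
train, val, test = split_dataset(range(10_021))
print(len(train), len(val), len(test))  # 9018 501 502
```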
LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.

This model is intended for answering queries, translation, and general conversational use in Nagamese.
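For intuition, LoRA freezes each targeted weight matrix W and learns a low-rank update scaled by alpha/r, so the adapted weight is W + (alpha/r)·B·A. The plain-Python sketch below computes this with tiny made-up matrices; the rank, alpha, and all values are illustrative assumptions, not the released adapter's configuration.

```python
def matmul(A, B):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_delta(A, B, alpha, r):
    """Low-rank update (alpha / r) * (B @ A); r is the adapter rank."""
    scale = alpha / r
    return [[scale * x for x in row] for row in matmul(B, A)]

# Frozen base weight W (2x2) and rank-1 factors. All numbers are made
# up for illustration, not taken from the released adapter.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[0.5, 0.5]]         # r x in_features  (1 x 2)
B = [[1.0], [2.0]]       # out_features x r (2 x 1)

delta = lora_delta(A, B, alpha=2.0, r=1)
W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
print(W_adapted)  # [[2.0, 1.0], [2.0, 3.0]]
```

Only A and B are trained, so a rank-r adapter adds just r·(in + out) parameters per targeted matrix, which is why the seven projection modules above can be adapted cheaply.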
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model
base_model_id = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the NagaLLaMA LoRA adapter
adapter_id = "agnivamaiti/NagaLLaMA-3.2-3B-Instruct"
model = PeftModel.from_pretrained(model, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)

# Inference ("What is Machine Learning and where is it used?")
prompt = "Machine Learning ki ase aru kote use hoi?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.3,
    top_k=15,
    top_p=0.3,
    repetition_penalty=1.2,
    eos_token_id=tokenizer.eos_token_id,
    # Llama tokenizers ship without a pad token; fall back to EOS
    pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Base model: meta-llama/Llama-3.2-3B-Instruct